library sitemapparser

XML Sitemap parser class compliant with the Sitemaps.org protocol.


vipnytt/sitemapparser


  • Monday, June 18, 2018
  • by JanPetterMG

The README.md


XML Sitemap parser

An easy-to-use PHP library to parse XML sitemaps compliant with the Sitemaps.org protocol.

The Sitemaps.org protocol is the leading standard and is supported by Google, Bing, Yahoo, Ask and many others.


Features

  • Basic parsing
  • Recursive parsing
  • String parsing
  • Custom User-Agent string
  • Proxy support
  • URL blacklist
  • Request throttling (using https://github.com/hamburgscleanest/guzzle-advanced-throttle)
  • Automatic retry (using https://github.com/caseyamcl/guzzle_retry_middleware)
  • Advanced logging (using https://github.com/gmponos/guzzle_logger)

Formats supported

  • XML .xml
  • Compressed XML .xml.gz
  • Robots.txt rule sheet robots.txt
  • Line separated text (disabled by default)

Requirements:

  • PHP 5.6 or 7.0+, alternatively HHVM
  • PHP extensions:
  • Optional:
    • https://github.com/caseyamcl/guzzle_retry_middleware
    • https://github.com/hamburgscleanest/guzzle-advanced-throttle

Installation

The library is available via Composer. Add this to your composer.json file:

{
    "require": {
        "vipnytt/sitemapparser": "^1.0"
    }
}

Then run composer update.
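Alternatively, assuming Composer is available on your command line, the same dependency can be added directly:

```shell
composer require vipnytt/sitemapparser
```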

Getting Started

Basic example

Returns a list of URLs only.

use vipnytt\SitemapParser;
use vipnytt\SitemapParser\Exceptions\SitemapParserException;

try {
    $parser = new SitemapParser();
    $parser->parse('http://php.net/sitemap.xml');
    foreach ($parser->getURLs() as $url => $tags) {
        echo $url . '<br>';
    }
} catch (SitemapParserException $e) {
    echo $e->getMessage();
}

Advanced

Returns all available tags, for both sitemaps and URLs.

use vipnytt\SitemapParser;
use vipnytt\SitemapParser\Exceptions\SitemapParserException;

try {
    $parser = new SitemapParser('MyCustomUserAgent');
    $parser->parse('http://php.net/sitemap.xml');
    foreach ($parser->getSitemaps() as $url => $tags) {
        echo 'Sitemap<br>';
        echo 'URL: ' . $url . '<br>';
        echo 'LastMod: ' . $tags['lastmod'] . '<br>';
        echo '<hr>';
    }
    foreach ($parser->getURLs() as $url => $tags) {
        echo 'URL: ' . $url . '<br>';
        echo 'LastMod: ' . $tags['lastmod'] . '<br>';
        echo 'ChangeFreq: ' . $tags['changefreq'] . '<br>';
        echo 'Priority: ' . $tags['priority'] . '<br>';
        echo '<hr>';
    }
} catch (SitemapParserException $e) {
    echo $e->getMessage();
}

Recursive

Parses any sitemap detected while parsing, to produce a complete list of URLs.

Use url_black_list to skip sitemaps that are part of a parent sitemap. Exact match only.

use vipnytt\SitemapParser;
use vipnytt\SitemapParser\Exceptions\SitemapParserException;

try {
    $parser = new SitemapParser('MyCustomUserAgent');
    $parser->parseRecursive('http://www.google.com/robots.txt');
    echo '<h2>Sitemaps</h2>';
    foreach ($parser->getSitemaps() as $url => $tags) {
        echo 'URL: ' . $url . '<br>';
        echo 'LastMod: ' . $tags['lastmod'] . '<br>';
        echo '<hr>';
    }
    echo '<h2>URLs</h2>';
    foreach ($parser->getURLs() as $url => $tags) {
        echo 'URL: ' . $url . '<br>';
        echo 'LastMod: ' . $tags['lastmod'] . '<br>';
        echo 'ChangeFreq: ' . $tags['changefreq'] . '<br>';
        echo 'Priority: ' . $tags['priority'] . '<br>';
        echo '<hr>';
    }
} catch (SitemapParserException $e) {
    echo $e->getMessage();
}
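As a sketch of the url_black_list option described above (the blacklisted child-sitemap URL below is hypothetical, for illustration only):

```php
use vipnytt\SitemapParser;
use vipnytt\SitemapParser\Exceptions\SitemapParserException;

try {
    // Exact-match blacklist: this child sitemap is skipped during recursion.
    $parser = new SitemapParser('MyCustomUserAgent', [
        'url_black_list' => [
            'http://www.google.com/example-sub-sitemap.xml', // hypothetical URL
        ],
    ]);
    $parser->parseRecursive('http://www.google.com/robots.txt');
    foreach ($parser->getURLs() as $url => $tags) {
        echo $url . '<br>';
    }
} catch (SitemapParserException $e) {
    echo $e->getMessage();
}
```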

Parsing of line separated text strings

Note: this is disabled by default to avoid false positives when XML is expected but plain text is fetched instead.

To disable strict standards, pass the configuration ['strict' => false] as constructor parameter #2.

use vipnytt\SitemapParser;
use vipnytt\SitemapParser\Exceptions\SitemapParserException;

try {
    $parser = new SitemapParser('MyCustomUserAgent', ['strict' => false]);
    $parser->parse('https://www.xml-sitemaps.com/urllist.txt');
    foreach ($parser->getSitemaps() as $url => $tags) {
        echo $url . '<br>';
    }
    foreach ($parser->getURLs() as $url => $tags) {
        echo $url . '<br>';
    }
} catch (SitemapParserException $e) {
    echo $e->getMessage();
}

Throttling

  1. Install the middleware:

composer require hamburgscleanest/guzzle-advanced-throttle

  2. Define host rules:

$rules = new RequestLimitRuleset([
    'https://www.google.com' => [
        [
            'max_requests'     => 20,
            'request_interval' => 1
        ],
        [
            'max_requests'     => 100,
            'request_interval' => 120
        ]
    ]
]);

  3. Create the handler stack:

$stack = new HandlerStack();
$stack->setHandler(new CurlHandler());

  4. Create the middleware:

$throttle = new ThrottleMiddleware($rules);

// Invoke the middleware
$stack->push($throttle());

// OR: call the handle method directly
$stack->push($throttle->handle());

  5. Create the client manually:

$client = new \GuzzleHttp\Client(['handler' => $stack]);

  6. Pass the client as an argument or use the setClient method:

$parser = new SitemapParser();
$parser->setClient($client);
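Put together, the steps above look roughly like this; the use statements are assumptions based on the middleware's namespace and should be checked against the installed version:

```php
use GuzzleHttp\Client;
use GuzzleHttp\Handler\CurlHandler;
use GuzzleHttp\HandlerStack;
use hamburgscleanest\GuzzleAdvancedThrottle\RequestLimitRuleset;
use hamburgscleanest\GuzzleAdvancedThrottle\Middleware\ThrottleMiddleware;
use vipnytt\SitemapParser;

// At most 20 requests per second, and 100 requests per 120 seconds, to this host.
$rules = new RequestLimitRuleset([
    'https://www.google.com' => [
        ['max_requests' => 20, 'request_interval' => 1],
        ['max_requests' => 100, 'request_interval' => 120],
    ],
]);

$stack = new HandlerStack();
$stack->setHandler(new CurlHandler());
$stack->push((new ThrottleMiddleware($rules))->handle());

$parser = new SitemapParser();
$parser->setClient(new Client(['handler' => $stack]));
```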

More details about this middleware are available here.

Automatic retry

  1. Install the middleware:

composer require caseyamcl/guzzle_retry_middleware

  2. Create the handler stack:

$stack = new HandlerStack();
$stack->setHandler(new CurlHandler());

  3. Add the middleware to the stack:

$stack->push(GuzzleRetryMiddleware::factory());

  4. Create the client manually:

$client = new \GuzzleHttp\Client(['handler' => $stack]);

  5. Pass the client as an argument or use the setClient method:

$parser = new SitemapParser();
$parser->setClient($client);
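Combined, the retry setup is a short sketch; the GuzzleRetry namespace follows the middleware's documentation and should be verified against your installed version:

```php
use GuzzleHttp\Client;
use GuzzleHttp\Handler\CurlHandler;
use GuzzleHttp\HandlerStack;
use GuzzleRetry\GuzzleRetryMiddleware;
use vipnytt\SitemapParser;

$stack = new HandlerStack();
$stack->setHandler(new CurlHandler());
// By default the middleware retries 429 and 503 responses with a delay.
$stack->push(GuzzleRetryMiddleware::factory());

$parser = new SitemapParser();
$parser->setClient(new Client(['handler' => $stack]));
```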

More details about this middleware are available here.

Advanced logging

  1. Install the middleware:

composer require gmponos/guzzle_logger

  2. Create a PSR-3 style logger:

$logger = new Logger();

  3. Create the handler stack:

$stack = new HandlerStack();
$stack->setHandler(new CurlHandler());

  4. Push the logger middleware to the stack:

$stack->push(new LogMiddleware($logger));

  5. Create the client manually:

$client = new \GuzzleHttp\Client(['handler' => $stack]);

  6. Pass the client as an argument or use the setClient method:

$parser = new SitemapParser();
$parser->setClient($client);
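A combined sketch, assuming Monolog as the PSR-3 logger (any PSR-3 implementation works; the LogMiddleware namespace follows the middleware's documentation):

```php
use GuzzleHttp\Client;
use GuzzleHttp\Handler\CurlHandler;
use GuzzleHttp\HandlerStack;
use GuzzleLogMiddleware\LogMiddleware;
use Monolog\Handler\StreamHandler;
use Monolog\Logger;
use vipnytt\SitemapParser;

// Log every request/response to stdout; swap in any PSR-3 logger.
$logger = new Logger('sitemap');
$logger->pushHandler(new StreamHandler('php://stdout'));

$stack = new HandlerStack();
$stack->setHandler(new CurlHandler());
$stack->push(new LogMiddleware($logger));

$parser = new SitemapParser();
$parser->setClient(new Client(['handler' => $stack]));
```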

More details about this middleware's configuration (log levels, when to log and what to log) are available here.

Additional examples

Even more examples are available in the examples directory.

Configuration

Available configuration options, with their default values:

$config = [
    'strict' => true, // (bool) Disallow parsing of line-separated plain text
    'guzzle' => [
        // GuzzleHttp request options
        // http://docs.guzzlephp.org/en/latest/request-options.html
    ],
    // Skip these URLs when parsing sitemaps that contain multiple other sitemaps. Exact match only.
    'url_black_list' => []
];
$parser = new SitemapParser('MyCustomUserAgent', $config);

If a User-agent is also set via the GuzzleHttp request options, it takes the highest priority and replaces the User-agent set here.

The Versions

  • dev-master (9999999-dev), 18/06 2018
  • 1.0.4 (1.0.4.0), 18/06 2018
  • 1.0.3 (1.0.3.0), 30/04 2017
  • v1.0.2 (1.0.2.0), 05/05 2016
  • v1.0.1 (1.0.1.0), 05/04 2016
  • v1.0.0 (1.0.0.0), 04/04 2016

All versions are MIT-licensed, published by Jan-Petter Gundersen (VIP nytt), with sources at https://github.com/VIPnytt/SitemapParser.

Keywords: parser, xml, sitemap, robots.txt, sitemaps.org