XML Sitemap parser
An easy-to-use PHP library to parse XML Sitemaps compliant with the Sitemaps.org protocol.
The Sitemaps.org protocol is the leading standard and is supported by Google, Bing, Yahoo, Ask and many others.
Features
- Basic parsing
- Recursive parsing
- String parsing
- Custom User-Agent string
- Proxy support
- URL blacklist
- Request throttling (using https://github.com/hamburgscleanest/guzzle-advanced-throttle)
- Automatic retry (using https://github.com/caseyamcl/guzzle_retry_middleware)
- Advanced logging (using https://github.com/gmponos/guzzle_logger)
- XML (.xml)
- Compressed XML (.xml.gz)
- Robots.txt rule sheet (robots.txt)
- Line separated text (disabled by default)
Requirements:
- PHP 5.6 or 7.0+, alternatively HHVM
- PHP extensions:
- Optional:
- https://github.com/caseyamcl/guzzle_retry_middleware
- https://github.com/hamburgscleanest/guzzle-advanced-throttle
Installation
The library is available for installation via Composer. Just add this to your composer.json file:
{
    "require": {
        "vipnytt/sitemapparser": "^1.0"
    }
}
Then run composer update.
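Alternatively, the same dependency can be added from the command line; this is standard Composer usage rather than anything specific to this library:
composer require vipnytt/sitemapparser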
Getting Started
Basic example
Returns a list of URLs only.
use vipnytt\SitemapParser;
use vipnytt\SitemapParser\Exceptions\SitemapParserException;

try {
    $parser = new SitemapParser();
    $parser->parse('http://php.net/sitemap.xml');
    foreach ($parser->getURLs() as $url => $tags) {
        echo $url . '<br>';
    }
} catch (SitemapParserException $e) {
    echo $e->getMessage();
}
Advanced
Returns all available tags for both sitemaps and URLs.
use vipnytt\SitemapParser;
use vipnytt\SitemapParser\Exceptions\SitemapParserException;

try {
    $parser = new SitemapParser('MyCustomUserAgent');
    $parser->parse('http://php.net/sitemap.xml');
    foreach ($parser->getSitemaps() as $url => $tags) {
        echo 'Sitemap<br>';
        echo 'URL: ' . $url . '<br>';
        echo 'LastMod: ' . $tags['lastmod'] . '<br>';
        echo '<hr>';
    }
    foreach ($parser->getURLs() as $url => $tags) {
        echo 'URL: ' . $url . '<br>';
        echo 'LastMod: ' . $tags['lastmod'] . '<br>';
        echo 'ChangeFreq: ' . $tags['changefreq'] . '<br>';
        echo 'Priority: ' . $tags['priority'] . '<br>';
        echo '<hr>';
    }
} catch (SitemapParserException $e) {
    echo $e->getMessage();
}
Recursive
Parses any sitemap detected while parsing, to get a complete list of URLs.
Use url_black_list to skip sitemaps that are part of a parent sitemap. Exact match only; a short sketch of this option follows the example below.
use vipnytt\SitemapParser;
use vipnytt\SitemapParser\Exceptions\SitemapParserException;

try {
    $parser = new SitemapParser('MyCustomUserAgent');
    $parser->parseRecursive('http://www.google.com/robots.txt');
    echo '<h2>Sitemaps</h2>';
    foreach ($parser->getSitemaps() as $url => $tags) {
        echo 'URL: ' . $url . '<br>';
        echo 'LastMod: ' . $tags['lastmod'] . '<br>';
        echo '<hr>';
    }
    echo '<h2>URLs</h2>';
    foreach ($parser->getURLs() as $url => $tags) {
        echo 'URL: ' . $url . '<br>';
        echo 'LastMod: ' . $tags['lastmod'] . '<br>';
        echo 'ChangeFreq: ' . $tags['changefreq'] . '<br>';
        echo 'Priority: ' . $tags['priority'] . '<br>';
        echo '<hr>';
    }
} catch (SitemapParserException $e) {
    echo $e->getMessage();
}
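If some sitemaps referenced by the parent sitemap should be skipped, list them in url_black_list (documented in the Configuration section below); only exact URL matches are excluded. A minimal sketch, where the blacklisted sitemap URL is a hypothetical example:
use vipnytt\SitemapParser;
use vipnytt\SitemapParser\Exceptions\SitemapParserException;

try {
    $parser = new SitemapParser('MyCustomUserAgent', [
        // Skipped only if it exactly matches a sitemap URL discovered during parsing
        'url_black_list' => [
            'http://www.example.com/video-sitemap.xml', // hypothetical URL, for illustration only
        ],
    ]);
    $parser->parseRecursive('http://www.example.com/robots.txt');
    foreach ($parser->getURLs() as $url => $tags) {
        echo $url . '<br>';
    }
} catch (SitemapParserException $e) {
    echo $e->getMessage();
}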
Parsing of line separated text strings
Note: This is disabled by default to avoid false positives when XML is expected but plain text is fetched instead.
To disable strict standards, simply pass this configuration as constructor parameter #2: ['strict' => false].
use vipnytt\SitemapParser;
use vipnytt\SitemapParser\Exceptions\SitemapParserException;

try {
    $parser = new SitemapParser('MyCustomUserAgent', ['strict' => false]);
    $parser->parse('https://www.xml-sitemaps.com/urllist.txt');
    foreach ($parser->getSitemaps() as $url => $tags) {
        echo $url . '<br>';
    }
    foreach ($parser->getURLs() as $url => $tags) {
        echo $url . '<br>';
    }
} catch (SitemapParserException $e) {
    echo $e->getMessage();
}
Throttling
- Install middleware:
composer require hamburgscleanest/guzzle-advanced-throttle
- Define host rules:
$rules = new RequestLimitRuleset([
    'https://www.google.com' => [
        [
            'max_requests' => 20,
            'request_interval' => 1
        ],
        [
            'max_requests' => 100,
            'request_interval' => 120
        ]
    ]
]);
- Create handler stack:
$stack = new HandlerStack();
$stack->setHandler(new CurlHandler());
- Create middleware:
$throttle = new ThrottleMiddleware($rules);
// Invoke the middleware
$stack->push($throttle());
// OR: alternatively call the handle method directly
$stack->push($throttle->handle());
- Create client manually:
$client = new \GuzzleHttp\Client(['handler' => $stack]);
- Pass client as an argument or use the setClient method:
$parser = new SitemapParser();
$parser->setClient($client);
More details about this middleware are available at https://github.com/hamburgscleanest/guzzle-advanced-throttle.
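Putting the steps above together, a minimal end-to-end sketch could look like the following. The vendor namespaces in the first two use statements are assumptions based on the package name and may differ between versions, so check the middleware's README for the exact import paths.
use GuzzleHttp\Client;
use GuzzleHttp\Handler\CurlHandler;
use GuzzleHttp\HandlerStack;
use hamburgscleanest\GuzzleAdvancedThrottle\RequestLimitRuleset;           // assumed namespace
use hamburgscleanest\GuzzleAdvancedThrottle\Middleware\ThrottleMiddleware; // assumed namespace
use vipnytt\SitemapParser;

// Allow at most 20 requests per second and 100 requests per 2 minutes to this host
$rules = new RequestLimitRuleset([
    'https://www.google.com' => [
        ['max_requests' => 20, 'request_interval' => 1],
        ['max_requests' => 100, 'request_interval' => 120]
    ]
]);

$stack = new HandlerStack();
$stack->setHandler(new CurlHandler());

$throttle = new ThrottleMiddleware($rules);
$stack->push($throttle()); // invoke the middleware, as in the steps above

$client = new Client(['handler' => $stack]);

$parser = new SitemapParser('MyCustomUserAgent');
$parser->setClient($client);
$parser->parseRecursive('http://www.google.com/robots.txt');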
Automatic retry
- Install middleware:
composer require caseyamcl/guzzle_retry_middleware
- Create stack:
$stack = new HandlerStack();
$stack->setHandler(new CurlHandler());
- Add middleware to the stack:
$stack->push(GuzzleRetryMiddleware::factory());
- Create client manually:
$client = new \GuzzleHttp\Client(['handler' => $stack]);
- Pass client as an argument or use the setClient method:
$parser = new SitemapParser();
$parser->setClient($client);
More details about this middleware are available at https://github.com/caseyamcl/guzzle_retry_middleware.
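The same steps as one sketch; GuzzleRetryMiddleware is assumed to be the root-namespace class shipped by caseyamcl/guzzle_retry_middleware, so verify the import against the installed version.
use GuzzleHttp\Client;
use GuzzleHttp\Handler\CurlHandler;
use GuzzleHttp\HandlerStack;
use GuzzleRetryMiddleware; // assumed root-namespace class from caseyamcl/guzzle_retry_middleware
use vipnytt\SitemapParser;

$stack = new HandlerStack();
$stack->setHandler(new CurlHandler());

// Retry failed requests using the middleware's default settings
$stack->push(GuzzleRetryMiddleware::factory());

$client = new Client(['handler' => $stack]);

$parser = new SitemapParser('MyCustomUserAgent');
$parser->setClient($client);
$parser->parse('http://php.net/sitemap.xml');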
Advanced logging
- Install middleware:
composer require gmponos/guzzle_logger
- Create a PSR-3 compatible logger:
$logger = new Logger();
- Create handler stack:
$stack = new HandlerStack();
$stack->setHandler(new CurlHandler());
- Push the logger middleware to the stack:
$stack->push(new LogMiddleware($logger));
- Create client manually:
$client = new \GuzzleHttp\Client(['handler' => $stack]);
- Pass client as an argument or use the setClient method:
$parser = new SitemapParser();
$parser->setClient($client);
More details about this middleware's configuration (log levels, when to log, and what to log) are available at https://github.com/gmponos/guzzle_logger.
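A combined sketch, assuming Monolog as the PSR-3 logger and the Gmponos\GuzzleLogger namespace for the middleware; both of those imports are assumptions and may differ depending on the installed package version.
use Gmponos\GuzzleLogger\Middleware\LogMiddleware; // assumed namespace, check the package README
use GuzzleHttp\Client;
use GuzzleHttp\Handler\CurlHandler;
use GuzzleHttp\HandlerStack;
use Monolog\Handler\StreamHandler; // Monolog is only an example; any PSR-3 logger works
use Monolog\Logger;
use vipnytt\SitemapParser;

// Log every request and response made while fetching sitemaps
$logger = new Logger('sitemap');
$logger->pushHandler(new StreamHandler('php://stdout'));

$stack = new HandlerStack();
$stack->setHandler(new CurlHandler());
$stack->push(new LogMiddleware($logger));

$client = new Client(['handler' => $stack]);

$parser = new SitemapParser('MyCustomUserAgent');
$parser->setClient($client);
$parser->parse('http://php.net/sitemap.xml');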
Additional examples
Even more examples are available in the examples directory.
Configuration
Available configuration options, with their default values:
$config = [
    'strict' => true, // (bool) Disallow parsing of line-separated plain text
    'guzzle' => [
        // GuzzleHttp request options
        // http://docs.guzzlephp.org/en/latest/request-options.html
    ],
    // Use this to ignore URLs when parsing sitemaps that contain multiple other sitemaps. Exact match only.
    'url_black_list' => []
];

$parser = new SitemapParser('MyCustomUserAgent', $config);
If a User-Agent is also set using the GuzzleHttp request options, it receives the highest priority and replaces the User-Agent set in the constructor.
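As a minimal sketch of that behaviour: the values below (the User-Agent strings, proxy address and timeout) are placeholders, and 'headers', 'proxy' and 'timeout' are standard GuzzleHttp request options rather than options defined by this library.
$config = [
    'guzzle' => [
        'headers' => [
            // Placeholder value; this header takes priority over the constructor argument below
            'User-Agent' => 'OverridingUserAgent/1.0',
        ],
        'proxy'   => 'tcp://localhost:8125', // placeholder proxy address
        'timeout' => 30,                     // seconds
    ],
];

// 'MyCustomUserAgent' is replaced by the Guzzle-level User-Agent above
$parser = new SitemapParser('MyCustomUserAgent', $config);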