Octopus Sitemap Crawler
A small PHP tool that crawls the URLs listed in a Sitemap, using the ReactPHP library to load the URLs asynchronously. Both plain text files and XML Sitemaps are supported.
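To illustrate the two supported input formats, here is a minimal sketch of how URLs could be extracted from either an XML Sitemap or a plain text list with one URL per line. The function name extractSitemapUrls is hypothetical and this is not Octopus's own parsing code; it assumes PHP 7.4 or later with the SimpleXML extension enabled.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not part of Octopus): extract URLs from sitemap content.
 * XML content is read from its <loc> elements; anything else is treated as
 * plain text with one URL per line.
 *
 * @return string[]
 */
function extractSitemapUrls(string $content): array
{
    $content = trim($content);
    if ($content === '') {
        return [];
    }

    // Assumption: content that starts with "<" is XML, everything else is plain text.
    if ($content[0] === '<') {
        $previous = libxml_use_internal_errors(true); // Suppress parser warnings
        $xml = simplexml_load_string($content);
        libxml_clear_errors();
        libxml_use_internal_errors($previous);

        if ($xml === false) {
            return []; // Malformed XML: nothing to crawl
        }

        $urls = [];
        // Sitemaps declare a default namespace, so match on local-name().
        foreach ($xml->xpath('//*[local-name()="loc"]') ?: [] as $loc) {
            $urls[] = trim((string) $loc);
        }

        return $urls;
    }

    // Plain text: split on any line break and drop empty lines.
    $lines = array_map('trim', preg_split('/\R/', $content) ?: []);

    return array_values(array_filter($lines, static fn (string $line): bool => $line !== ''));
}
```

Calling extractSitemapUrls(file_get_contents('sitemap.txt')) on a local file would return the list of URLs to crawl. A real crawler would also have to handle sitemap index files and compressed sitemaps, which this sketch ignores.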
Usage from the Command Line Interface (CLI)
Crawl the URLs in a Sitemap with verbose logging (-vvv):
php application.php http://www.domain.ext/sitemap.xml -vvv
Use 15 concurrent connections instead of the default 5:
php application.php http://www.domain.ext/sitemap.xml --concurrency 15 -vvv
Use an HTTP GET request instead of the default HTTP HEAD. Note that HEAD requests transfer less data because the response carries no body:
php application.php http://www.domain.ext/sitemap.xml --requestType GET -vvv
Use a timeout of 3 seconds instead of the default 10 seconds:
php application.php http://www.domain.ext/sitemap.xml --timeout 3 -vvv
Use a specific User-Agent instead of the default Octopus/1.0, for example to simulate a search engine crawling a Sitemap:
php application.php http://www.domain.ext/sitemap.xml --userAgent 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)' -vvv
Use the TablePresenter to display intermediate results instead of the default EchoPresenter:
php application.php http://www.domain.ext/sitemap.xml --presenter Octopus\\Presenter\\TablePresenter -vvv
Usage from your own application
You can easily integrate Sitemap crawling into your own application. Have a look at the Config class for all available configuration options. If required, you can inject a PSR-3 logger for logging purposes.
use Octopus\Config;
use Octopus\Processor;
$config = new Config();
$config->concurrency = 2;
$config->targetFile = 'https://www.domain.ext/sitemap.xml';
$config->additionalResponseHeadersToCount = array(
'CF-Cache-Status', //Useful to check CloudFlare edge server cache status
);
$config->requestHeaders = array(
'User-Agent' => 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)', //Simulate Google's webcrawler
);
$processor = new Processor($config, $this->logger); //A PSR3 Logger can be injected if required
$processor->run();
$this->logger->info('Statistics: ' . print_r($processor->result->getStatusCodes(), true));
$this->logger->info('Applied concurrency: ' . $config->concurrency);
$this->logger->info('Total amount of processed data: ' . $processor->result->getTotalData());
$this->logger->info('Failed to load #URLs: ' . count($processor->result->getBrokenUrls()));
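The statistics logged above can also be condensed into a single summary. The following is a minimal sketch that assumes getStatusCodes() returns an array mapping each HTTP status code to the number of responses with that code; that shape is an assumption, so check the Result class before relying on it. The function name summarizeStatusCodes is hypothetical.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not part of Octopus): summarize a status code tally.
 *
 * @param array<int|string, int> $statusCodes Assumed shape: status code => response count
 * @return array{total: int, failed: int, failureRate: float}
 */
function summarizeStatusCodes(array $statusCodes): array
{
    $total = (int) array_sum($statusCodes);
    $failed = 0;

    foreach ($statusCodes as $code => $count) {
        // Treat client (4xx) and server (5xx) errors as failures.
        if ((int) $code >= 400) {
            $failed += $count;
        }
    }

    return [
        'total' => $total,
        'failed' => $failed,
        // Guard against division by zero when nothing was crawled.
        'failureRate' => $total > 0 ? $failed / $total : 0.0,
    ];
}
```

With that assumption, summarizeStatusCodes($processor->result->getStatusCodes()) would give a total, a failure count, and a failure rate that can be logged or used to fail a build when too many URLs are broken.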
Limitations
Currently, Octopus is mainly an experimental / educational tool. Advanced use cases in HTTP response handling might not be supported.
Tests
Before running the test suite, clone this repository and install all dependencies using Composer:
$ composer install
To run the test suite, go to the project root and run:
$ make test