2017 © Pedro Peláez
 

PHP Sitemap crawler

octopoda/octopus

  • Monday, June 11, 2018
  • by dpovshed
  • Repository
  • 2 Watchers
  • 3 Stars
  • 705 Installations
  • PHP
  • 0 Dependents
  • 0 Suggesters
  • 1 Forks
  • 4 Open issues
  • 8 Versions
  • 22 % Grown

The README.md

Octopus Sitemap Crawler

A small PHP tool that crawls the URLs listed in a Sitemap, using the ReactPHP library to load them asynchronously. Both plain text files and XML Sitemaps are supported.
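To illustrate what supporting both input formats can mean in practice, here is a minimal sketch that pulls URLs out of either an XML Sitemap or a plain text file with one URL per line. The function name `extractUrls` and the "starts with `<`" format check are assumptions made for this example; this is not how Octopus itself is confirmed to parse its input.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper, not part of Octopus: returns the URLs found in
 * either an XML Sitemap or a plain text file with one URL per line.
 *
 * @return string[]
 */
function extractUrls(string $content): array
{
    $content = trim($content);
    if ($content === '') {
        return [];
    }

    // Assumption: anything that starts with '<' is treated as XML.
    if ($content[0] === '<') {
        $previous = libxml_use_internal_errors(true);
        $xml = simplexml_load_string($content);
        libxml_use_internal_errors($previous);
        if ($xml === false) {
            return []; // malformed XML: nothing to crawl
        }

        $urls = [];
        // local-name() sidesteps the sitemap namespace declaration.
        // Note that it matches every <loc> element, whatever its namespace.
        foreach ($xml->xpath('//*[local-name()="loc"]') ?: [] as $loc) {
            $urls[] = trim((string) $loc);
        }
        return $urls;
    }

    // Otherwise assume plain text: one URL per line, blank lines ignored.
    $lines = array_map('trim', preg_split('/\R/', $content));
    return array_values(array_filter($lines, fn (string $line): bool => $line !== ''));
}
```

The sketch needs PHP's SimpleXML extension, which is enabled in most default builds.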

Usage from the Command Line Interface (CLI)

Crawl the URLs in a Sitemap with verbose logging (-vvv):

php application.php http://www.domain.ext/sitemap.xml -vvv

Use 15 concurrent connections instead of the default 5:

php application.php http://www.domain.ext/sitemap.xml --concurrency 15 -vvv

Use an HTTP GET request instead of the default HTTP HEAD. Note that HEAD requests transfer less data because the response has no body:

php application.php http://www.domain.ext/sitemap.xml --requestType GET -vvv

Use a timeout of 3 seconds instead of the default 10 seconds:

php application.php http://www.domain.ext/sitemap.xml --timeout 3 -vvv

Use a specific User-Agent instead of the default Octopus/1.0, for example to simulate a search engine crawling a sitemap:

php application.php http://www.domain.ext/sitemap.xml --userAgent 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)' -vvv
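The request type, timeout and User-Agent options above correspond to ordinary HTTP request settings. As a rough illustration outside Octopus, the sketch below maps the same three values onto PHP's built-in HTTP stream context. The function `buildHttpContextOptions` is hypothetical; only its default values (HEAD, 10 seconds, Octopus/1.0) are taken from this README, and Octopus itself may configure its requests quite differently.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper, not part of Octopus: builds options for PHP's
 * built-in HTTP stream wrapper from a request type, timeout and user agent.
 */
function buildHttpContextOptions(
    string $requestType = 'HEAD',
    float $timeout = 10.0,
    string $userAgent = 'Octopus/1.0'
): array {
    $requestType = strtoupper($requestType);
    if (!in_array($requestType, ['HEAD', 'GET'], true)) {
        throw new InvalidArgumentException("Unsupported request type: $requestType");
    }

    return [
        'http' => [
            'method'        => $requestType, // HEAD asks for headers only, no body
            'timeout'       => $timeout,     // read timeout in seconds
            'user_agent'    => $userAgent,
            'ignore_errors' => true,         // still return 4xx/5xx responses
        ],
    ];
}

// Example: a GET with a 3 second timeout and a search engine user agent.
$options = buildHttpContextOptions(
    'GET',
    3.0,
    'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'
);
$context = stream_context_create($options);
// The context could then be passed to a call such as:
// file_get_contents('http://www.domain.ext/', false, $context);
```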

Use the TablePresenter to display intermediate results instead of the default EchoPresenter:

php application.php http://www.domain.ext/sitemap.xml --presenter Octopus\\Presenter\\TablePresenter -vvv

Usage from your own application

You can easily integrate sitemap crawling into your own application. Have a look at the Config class for all available configuration options. If required, you can inject a PSR-3 logger for logging purposes.

use Octopus\Config;
use Octopus\Processor;

$config = new Config();
$config->concurrency = 2;
$config->targetFile = 'https://www.domain.ext/sitemap.xml';
$config->additionalResponseHeadersToCount = array(
    'CF-Cache-Status', //Useful to check CloudFlare edge server cache status
);
$config->requestHeaders = array(
    'User-Agent' => 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)', //Simulate Google's webcrawler
);
$processor = new Processor($config, $this->logger); //A PSR3 Logger can be injected if required
$processor->run();

$this->logger->info('Statistics: ' . print_r($processor->result->getStatusCodes(), true));
$this->logger->info('Applied concurrency: ' . $config->concurrency);
$this->logger->info('Total amount of processed data: ' . $processor->result->getTotalData());
$this->logger->info('Failed to load #URLs: ' . count($processor->result->getBrokenUrls()));
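To make the idea of counting status codes and selected response headers (such as `CF-Cache-Status`) more concrete, here is a small standalone sketch. The function `tallyResponses`, the shape of its input array and the sample responses are all assumptions invented for this example; they do not describe how Octopus collects its statistics internally.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper, not part of Octopus: counts status codes and the
 * values of selected response headers across a list of responses.
 *
 * @param array<int, array{status: int, headers: array<string, string>}> $responses
 * @param string[] $headersToCount
 */
function tallyResponses(array $responses, array $headersToCount): array
{
    $statusCodes = [];
    $headerValues = [];

    foreach ($responses as $response) {
        $status = $response['status'];
        $statusCodes[$status] = ($statusCodes[$status] ?? 0) + 1;

        // Header names are case-insensitive, so compare them in lower case.
        $headers = array_change_key_case($response['headers'], CASE_LOWER);
        foreach ($headersToCount as $name) {
            $value = $headers[strtolower($name)] ?? null;
            if ($value !== null) {
                $headerValues[$name][$value] = ($headerValues[$name][$value] ?? 0) + 1;
            }
        }
    }

    ksort($statusCodes);
    return ['statusCodes' => $statusCodes, 'headers' => $headerValues];
}

// Made-up sample data, for illustration only.
$summary = tallyResponses(
    [
        ['status' => 200, 'headers' => ['CF-Cache-Status' => 'HIT']],
        ['status' => 200, 'headers' => ['cf-cache-status' => 'MISS']],
        ['status' => 404, 'headers' => []],
    ],
    ['CF-Cache-Status']
);
print_r($summary);
```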

Limitations

Currently, Octopus is mainly an experimental / educational tool. Advanced use cases in HTTP response handling might not be supported.

Tests

Before running the test suite, clone this repository and install all dependencies using Composer:

$ composer install

Then go to the project root and run:

$ make test

The Versions

  • 11/06 2018
  • 07/10 2017: 0.1.2 (MIT)
  • 03/10 2017: 0.1.1 (MIT)
  • 26/07 2017: 0.1.0 (MIT)