PHP Spider
URL spider which crawls a page and all its subpages.
Installation
Make sure you have Composer installed. Then execute:
composer require baqend/spider
This package requires at least PHP 5.5.9 and has no package dependencies!
Usage
The entry point is the Spider class. For it to work, it requires the following services:
- Queue: Collects URLs to be processed. This package comes with a breadth-first and a depth-first implementation.
- URL Handler: Checks whether a URL should be processed. If no URL handler is provided, every URL is processed. See the URL Handlers section below.
- Downloader: Takes URLs and downloads them. To avoid depending on an HTTP client library like Guzzle, you have to implement this class yourself; see the sketch after this list.
- Processor: Retrieves downloaded assets and performs operations on them. See the Processors section below.
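Since the package deliberately ships no HTTP client, you supply the downloader yourself. Below is a minimal cURL-based sketch; the download() method name, its signature, the Asset class and its constructor, and the exact namespaces are assumptions for illustration, not the package's confirmed API:

<?php
use Baqend\Component\Spider\Asset;
use Baqend\Component\Spider\DownloaderInterface;

class CurlDownloader implements DownloaderInterface
{
    public function download($url)
    {
        // Fetch the URL with cURL, following redirects.
        $handle = curl_init($url);
        curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($handle, CURLOPT_FOLLOWLOCATION, true);
        $content = curl_exec($handle);
        $contentType = curl_getinfo($handle, CURLINFO_CONTENT_TYPE);
        curl_close($handle);

        // Hypothetical Asset constructor: URL, content, and content type.
        return new Asset($url, $content, $contentType);
    }
}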
You initialize the spider in the following way:
<?php
use Baqend\Component\Spider\Processor;
use Baqend\Component\Spider\Queue\BreadthQueue;
use Baqend\Component\Spider\Spider;
use Baqend\Component\Spider\UrlHandler\BlacklistUrlHandler;
// Use the breadth-first queue
$queue = new BreadthQueue();
// Use your own DownloaderInterface implementation,
// e.g. the CurlDownloader sketched above
$downloader = new CurlDownloader();
// Create a URL handler, e.g. the provided blacklist URL handler
$urlHandler = new BlacklistUrlHandler(['**.php']);
// Create some processors which will be executed one after another
// More details on the processors below!
$processor = new Processor\Processor();
$processor->addProcessor(new Processor\UrlRewriteProcessor('https://example.org', 'https://example.com/archive'));
$processor->addProcessor($cssProcessor = new Processor\CssProcessor());
$processor->addProcessor(new Processor\HtmlProcessor($cssProcessor));
$processor->addProcessor(new Processor\ReplaceProcessor('https://example.org', 'https://example.com/archive'));
$processor->addProcessor(new Processor\StoreProcessor('https://example.com/archive', '/tmp/output'));
// Create the spider instance
$spider = new Spider($queue, $downloader, $urlHandler, $processor);
// Enqueue some URLs
$spider->queue('https://example.org/index.html');
$spider->queue('https://example.org/news/other-landingpage.html');
// Execute the crawling
$spider->crawl();
Processors
This package comes with the following built-in processors.
Processor
This is an aggregate processor which allows adding and removing other processors and executes them one after the other.
<?php
use Baqend\Component\Spider\Processor\Processor;
$processor = new Processor();
$processor->addProcessor($firstProcessor);
$processor->addProcessor($secondProcessor);
$processor->addProcessor($thirdProcessor);
// This will call `process` on $firstProcessor, $secondProcessor, and finally on $thirdProcessor:
$processor->process($asset, $queue);
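You can also plug your own logic into the chain by implementing a processor yourself. A minimal sketch, assuming the package exposes a ProcessorInterface with the same process($asset, $queue) method called above; the interface name, its namespace, and the getUrl() accessor are assumptions for illustration:

<?php
use Baqend\Component\Spider\Processor\ProcessorInterface;

class LoggingProcessor implements ProcessorInterface
{
    public function process($asset, $queue)
    {
        // Hypothetical accessor: record each asset's URL as it is processed.
        error_log('Processed: '.$asset->getUrl());
    }
}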
HtmlProcessor
This processor can process HTML assets and enqueue the URLs they contain.
It will also rewrite all relative URLs to absolute ones.
Additionally, if you provide a CssProcessor, style attributes are detected and URLs within their CSS are resolved.
CssProcessor
This processor can process CSS assets and enqueue the URLs contained in @import and url(...) statements.
ReplaceProcessor
Performs simple `str_replace` operations on asset contents:
<?php
use Baqend\Component\Spider\Processor\ReplaceProcessor;
$processor = new ReplaceProcessor('Hello World', 'Hallo Welt');
// This will replace all occurrences of
// "Hello World" in the asset with "Hallo Welt":
$processor->process($asset, $queue);
The ReplaceProcessor does not enqueue other URLs.
StoreProcessor
Takes a URL prefix and a directory, and stores every asset relative to the prefix in the corresponding file structure within that directory.
The StoreProcessor does not enqueue other URLs.
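For illustration, a sketch of the resulting mapping, reusing the constructor arguments from the usage example above (the concrete paths are assumptions):

<?php
use Baqend\Component\Spider\Processor\StoreProcessor;

// Assets whose URL starts with the prefix are written below the directory.
$processor = new StoreProcessor('https://example.com/archive', '/tmp/output');

// E.g. an asset with the URL https://example.com/archive/css/style.css
// would end up in /tmp/output/css/style.css.
$processor->process($asset, $queue);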
UrlRewriteProcessor
Changes the URL of an asset to another prefix.
Use this to let HtmlProcessor and CssProcessor resolve relative URLs against a different origin.
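A sketch of the effect, using the same arguments as in the usage example above:

<?php
use Baqend\Component\Spider\Processor\UrlRewriteProcessor;

$processor = new UrlRewriteProcessor('https://example.org', 'https://example.com/archive');

// An asset downloaded from https://example.org/css/style.css will afterwards
// report https://example.com/archive/css/style.css as its URL.
$processor->process($asset, $queue);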
The UrlRewriteProcessor does not enqueue other URLs.
Also, it does not modify the asset's content, only its URL.
URL Handlers
URL handlers tell the spider whether to download and process a URL.
There are the following built-in URL handlers:
OriginUrlHandler
Handles only URLs coming from a given origin, e.g. "https://example.org".
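A minimal sketch; that the constructor takes the origin as a single string is an assumption:

<?php
use Baqend\Component\Spider\UrlHandler\OriginUrlHandler;

// Only URLs on https://example.org will be downloaded and processed.
$urlHandler = new OriginUrlHandler('https://example.org');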
BlacklistUrlHandler
Does not handle URLs that are part of a given blacklist.
You can use glob patterns to define the blacklist:
<?php
use Baqend\Component\Spider\UrlHandler\BlacklistUrlHandler;
$blacklist = [
'https://other.org/**', // Don't handle anything from other.org over HTTPS
'http{,s}://other.org/**', // Don't handle anything from other.org over HTTP or HTTPS
'**.{png,gif,jpg,jpeg}', // Don't handle any image files
];
$urlHandler = new BlacklistUrlHandler($blacklist);
Alternatives
If this project does not match your needs, check out the following projects: