library spider

Light weight spider for the web.

ddliu/spider

  • Monday, April 6, 2015
  • by ddliu
  • Repository
  • 4 Watchers
  • 19 Stars
  • 34 Installations
  • PHP
  • 0 Dependents
  • 0 Suggesters
  • 4 Forks
  • 0 Open issues
  • 21 Versions
  • 0 % Grown

The README.md

Spider

A flexible spider in PHP.

Concepts

A spider contains a chain of processors called pipes. You can pass as many tasks as you like to the spider; each task goes through these pipes and gets processed.
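To make the idea concrete, here is a minimal sketch in plain PHP that does not use the library at all: a queue of tasks, each passed through a list of callables in order. The helper name `runPipes` and the plain-array tasks are hypothetical and only illustrate the concept; they are not how ddliu/spider is implemented.

```php
<?php
// Conceptual sketch only: NOT the library's implementation.
// runPipes() is a hypothetical helper; tasks are plain arrays here.
function runPipes(array $pipes, array $tasks)
{
    $done = array();
    while ($tasks) {
        $task = array_shift($tasks);      // take the next queued task
        foreach ($pipes as $pipe) {
            $task = $pipe($task);         // each pipe transforms the task in turn
        }
        $done[] = $task;
    }
    return $done;
}

$pipes = array(
    // first pipe: lower-case the URL
    function ($task) { $task['url'] = strtolower($task['url']); return $task; },
    // second pipe: record the URL length
    function ($task) { $task['length'] = strlen($task['url']); return $task; },
);

$results = runPipes($pipes, array(array('url' => 'HTTP://Example.com')));
```

The order of the pipes matters in this sketch: the second pipe sees whatever the first one produced.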

Installation

composer require ddliu/spider

Requirements

  • PHP 5.3+
  • curl (required by RequestPipe)

Dependencies

See composer.json.

Usage

use ddliu\spider\Spider;
use ddliu\spider\Pipe\NormalizeUrlPipe;
use ddliu\spider\Pipe\RequestPipe;
use ddliu\spider\Pipe\DomCrawlerPipe;

(new Spider())
    ->pipe(new NormalizeUrlPipe())
    ->pipe(new RequestPipe())
    ->pipe(new DomCrawlerPipe())
    // a custom pipe: follow every link found on the page
    ->pipe(function($spider, $task) {
        $task['$dom']->filter('a')->each(function($a) use ($task) {
            $href = $a->attr('href');
            $task->fork($href);
        });
    })
    // the entry task
    ->addTask('http://example.com')
    ->run()
    ->report();

Find more examples in the examples folder.

Spider

The Spider class.

Options

  • limit: maximum number of tasks to run

Methods

  • pipe($pipe): add a pipe
  • addTask($task): add a task
  • run(): run the spider
  • report(): write report to log

Task

A task holds a data array and some helper methods.

The Task class implements the ArrayAccess interface, so you can access its data like an array.

Methods

  • fork($task): add a sub task to the spider
  • ignore(): ignore the task
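As a sketch of how array access and the helper methods might be combined inside a pipe, the closure below reads `$task['url']` and calls `ignore()` on tasks that do not point at an http(s) URL. Only `$task['url']` and `ignore()` come from this README; the name `$httpOnlyPipe` and the filtering rule are assumptions made for illustration.

```php
<?php
// Hypothetical pipe: ignore tasks whose URL is not http or https.
// The filtering rule is an assumption; it is not part of the library.
$httpOnlyPipe = function ($spider, $task) {
    // parse_url() returns null when the URL has no scheme (e.g. "/about")
    $scheme = parse_url($task['url'], PHP_URL_SCHEME);
    if (!in_array(strtolower((string) $scheme), array('http', 'https'), true)) {
        $task->ignore();   // ask the spider to ignore this task
    }
};
```

It would be registered like any other pipe, for example with `->pipe($httpOnlyPipe)` after NormalizeUrlPipe.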

Pipes

Pipes define how each task is processed.

A pipe can be a function:

function($spider, $task) {}

Or a class that extends BasePipe:

use ddliu\spider\Pipe\BasePipe;

class MyPipe extends BasePipe {
    public function run($spider, $task) {
        // process the task...
    }
}
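As an illustration of the function form, the following sketch extracts the page title from the fetched HTML. It assumes it runs after RequestPipe, so that `$task['content']` is set; the `title` key and the name `$titlePipe` are hypothetical, and the regular expression is a simplification rather than a real HTML parser.

```php
<?php
// Hypothetical pipe: store the page title in $task['title'].
// $task['content'] is documented for RequestPipe; 'title' is an assumed key.
$titlePipe = function ($spider, $task) {
    $html = (string) $task['content'];
    // Simplified match; use DomCrawlerPipe for anything beyond a quick sketch
    if (preg_match('/<title[^>]*>(.*?)<\/title>/is', $html, $m)) {
        $task['title'] = trim(html_entity_decode($m[1]));
    }
};
```

The same logic could live in the `run($spider, $task)` method of a BasePipe subclass when it needs configuration or state.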

Useful Pipes

NormalizeUrlPipe

Normalizes $task['url'].

new NormalizeUrlPipe()

RequestPipe

Starts an HTTP request for $task['url'] and saves the result in $task['content'].

new RequestPipe(array(
    'useragent' => 'myspider',
    'timeout' => 10
));

FileCachePipe

Caches the result of another pipe (e.g. RequestPipe).

$requestPipe = new RequestPipe();
$cacheForReqPipe = new FileCachePipe($requestPipe, [
    'input' => 'url',
    'output' => 'content',
    'root' => '/path/to/cache/root',
]);
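The options above suggest that the pipe maps an input field (`url`) to an output field (`content`) and keeps results under `root`. The sketch below shows that general idea in plain PHP. It is not the library's FileCachePipe: `cachedCall` is a hypothetical helper, and naming cache files by the md5 hash of the input is an assumption.

```php
<?php
// Conceptual sketch only: NOT the library's FileCachePipe.
// cachedCall() is hypothetical; hashing the input with md5 is an assumption.
function cachedCall($producer, $input, $root)
{
    $file = rtrim($root, '/') . '/' . md5($input) . '.cache';
    if (is_file($file)) {
        return file_get_contents($file);      // cache hit: skip the expensive call
    }
    $output = call_user_func($producer, $input);
    if (!is_dir($root)) {
        mkdir($root, 0777, true);             // create the cache root on first use
    }
    file_put_contents($file, $output);        // cache miss: store for next time
    return $output;
}

$calls = 0;
$fetch = function ($url) use (&$calls) {
    $calls++;                                 // count the "real" requests
    return "content of $url";
};

$root   = sys_get_temp_dir() . '/spider-cache-demo-' . uniqid();
$first  = cachedCall($fetch, 'http://example.com', $root);
$second = cachedCall($fetch, 'http://example.com', $root);
// $calls is 1 here: the second call was served from the cache file
```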

RetryPipe

Retries the wrapped pipe on failure.

$requestPipe = new RequestPipe();
$retryForReqPipe = new RetryPipe($requestPipe, [
    'count' => 10,
]);
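The retry idea can be sketched in plain PHP as follows. This is not the library's RetryPipe: `retry` is a hypothetical helper, it treats a thrown exception as the failure signal, and it reads `count` as the total number of attempts. The README does not say how RetryPipe defines either of these.

```php
<?php
// Conceptual sketch only: NOT the library's RetryPipe.
// Assumptions: failure means "throws an exception"; $count is total attempts.
function retry($operation, $count)
{
    $lastError = new RuntimeException('no attempts were made');
    for ($attempt = 1; $attempt <= $count; $attempt++) {
        try {
            return call_user_func($operation);
        } catch (Exception $e) {
            $lastError = $e;                  // remember the failure and try again
        }
    }
    throw $lastError;                         // every attempt failed
}

$attempts = 0;
$flaky = function () use (&$attempts) {
    $attempts++;
    if ($attempts < 3) {
        throw new RuntimeException('temporary failure');
    }
    return 'ok';                              // succeeds on the third attempt
};

$result = retry($flaky, 10);
```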

DomCrawlerPipe

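Creates a DomCrawler from $task['content']. Access it as $task['$dom'] in the pipes that follow. The key is the literal string $dom, so keep the single quotes; in double quotes PHP would try to interpolate a variable.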

ReportPipe

Writes a report at a fixed interval, for example every 10 minutes (600 seconds):

new ReportPipe(array(
    'seconds' => 600
))
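The interval check behind a periodic report can be sketched in plain PHP like this. It is not the library's ReportPipe: `makeThrottle` is a hypothetical helper, and the clock is passed in as a callable only so that the example is deterministic.

```php
<?php
// Conceptual sketch only: NOT the library's ReportPipe.
// makeThrottle() is hypothetical; $clock defaults to PHP's time().
function makeThrottle($seconds, $action, $clock = 'time')
{
    $last = call_user_func($clock);           // start of the first interval
    return function () use (&$last, $seconds, $action, $clock) {
        $now = call_user_func($clock);
        if ($now - $last >= $seconds) {
            $last = $now;                     // start a new interval
            call_user_func($action);
        }
    };
}

$now = 0;                                     // fake clock, in seconds
$reports = 0;
$tick = makeThrottle(
    600,
    function () use (&$reports) { $reports++; },
    function () use (&$now) { return $now; }
);

$tick();              // 0 s elapsed: no report
$now = 599; $tick();  // still too early
$now = 600; $tick();  // 600 s elapsed: report
$now = 700; $tick();  // only 100 s since the last report
```

Calling the returned closure once per processed task gives a report at most once per interval, however many tasks pass through.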

Logging

$spider->logger is an instance of Monolog\Logger. You can add logging handlers to it before starting the spider:

use Monolog\Handler\StreamHandler;
use Monolog\Logger;

$spider->logger->pushHandler(new StreamHandler('path/to/your.log', Logger::WARNING));

TODO/Ideas

  • Real world examples.
  • Running tasks concurrently (with pthreads?).

Alternative

Use the golang version for better performance!

The Versions

13/11 2014
13/11 2014
Releases, newest first (each under the MIT license, by dong):

  • 12/11 2014: v0.1.9
  • 12/11 2014: v0.1.8
  • 12/11 2014: v0.1.7
  • 07/11 2014: v0.1.6
  • 07/11 2014: v0.1.5
  • 07/11 2014: v0.1.4
  • 06/11 2014: v0.1.3
  • 06/11 2014: v0.1.2
  • 06/11 2014: v0.1.1
  • 06/11 2014: v0.1.0