Crawly is a simple web crawler that extracts and follows links based on the discovers you attach.
Simple Example
require_once("vendor/autoload.php");
// Create a new Crawly object
$crawler = Crawly\Factory::generic();
// Discovers allow you to extract links to follow
$crawler->attachDiscover(
new Crawly\Discovers\CssSelector('nav.pagination > ul > li > a')
);
// After pages are scraped and links discovered, you can add your own closures to handle the data
$crawler->attachExtractor(
function($response) {
// here we have the response, work with it!
}
);
// set seed page
$crawler->setSeed("http://www.webpage.com/test/");
// start the crawler
$crawler->run();
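Continuing the example above, the extractor below stores each crawled page on disk. It assumes the generic factory's Guzzle client hands the extractor either the raw HTML string or a PSR-7 response object (this is an assumption; adjust the body handling to whatever your client actually returns).
$crawler->attachExtractor(
    function($response) {
        // Assumption: $response is either the raw HTML string or a PSR-7
        // response object (Guzzle); normalize it to a string either way.
        $html = is_string($response) ? $response : (string) $response->getBody();
        // Store the page body for later processing.
        file_put_contents(sys_get_temp_dir() . '/' . md5($html) . '.html', $html);
    }
);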
Crawler object
You can create a simple crawler with the Crawler Factory; it generates a Crawly object using Guzzle as the HTTP client.
$crawler = Crawly\Factory::generic();
You can also create a customized crawler, specifying which HTTP client, URL queue and visited-link collection to use.
$crawler = Crawly\Factory::create(new MyHttpClass(), new MyUrlQueue(), new MyVisitedCollection());
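The exact interfaces these components must implement are not spelled out here, but the discover example below (see "Create your own discover") relies on the URL queue exposing push() and the visited collection exposing seen(). A minimal in-memory sketch built on that assumption follows; the pop() and add() methods are further assumptions, and the HTTP client is omitted because its contract is not shown.
class MyUrlQueue
{
    private $queue = [];

    // Add a discovered URI to the end of the queue (used by discovers).
    public function push($uri)
    {
        $this->queue[] = $uri;
    }

    // Assumed: take the next URI to crawl, or null when the queue is empty.
    public function pop()
    {
        return array_shift($this->queue);
    }
}

class MyVisitedCollection
{
    private $visited = [];

    // Assumed: record a URL as visited.
    public function add($url)
    {
        $this->visited[$url] = true;
    }

    // True when the URL has already been crawled (used by discovers).
    public function seen($url)
    {
        return isset($this->visited[$url]);
    }
}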
Discovers
Discovers are used to extract a set of links from the HTML and add them to the queue. You can attach as many discovers as you want, and you can create your own discover classes too.
At the moment Crawly only includes a CSS selector discover.
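For example, you can attach several CSS selector discovers so that both pagination links and article links are followed (the selectors below are illustrative):
// Follow pagination links and article links on every crawled page.
$crawler->attachDiscover(new Crawly\Discovers\CssSelector('nav.pagination > ul > li > a'));
$crawler->attachDiscover(new Crawly\Discovers\CssSelector('article h2 > a'));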
Create your own discover
Just create a new class that implements the Discoverable interface. The new class should look something like this:
class MyOwnDiscover implements Discoverable
{
private $configuration;
public function __construct($configuration)
{
$this->configuration = $configuration;
}
public function find(Crawly &$crawler, $response)
{
// $response holds the content of the crawled URL
// parse the response here and build $links, a collection of link nodes
foreach($links as $node) {
$uri = new Uri($node->getAttribute('href'), $crawler->getHost());
// if the URL has not been visited yet, push it onto the URL queue
if(!$crawler->getVisitedUrl()->seen($uri->toString())) {
$crawler->getUrlQueue()->push($uri);
}
}
}
}
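Once the class exists, attach it to the crawler exactly like the built-in discover ($myConfiguration here stands for whatever settings your discover needs):
$crawler->attachDiscover(new MyOwnDiscover($myConfiguration));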
Limiters
Limiters are used to restrict the crawler's actions. For instance, you can limit how many links are crawled or set a maximum amount of bandwidth to use.
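The limiter API is not shown in this document, so the sketch below is an assumption: it imagines limiters being attached much like discovers, via a hypothetical attachLimiter() method, with the limiter exposing a single check that tells the crawler whether it may continue. Check the library source for the real contract before relying on it.
// A hypothetical limiter that stops the crawl after a fixed number of pages.
// NOTE: attachLimiter() and the allows() contract are assumptions,
// not the library's documented API.
class MaxLinksLimiter
{
    private $max;
    private $count = 0;

    public function __construct($max)
    {
        $this->max = $max;
    }

    // Returns false once $max pages have been crawled.
    public function allows()
    {
        return ++$this->count <= $this->max;
    }
}

$crawler->attachLimiter(new MaxLinksLimiter(100)); // assumed attach method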