application walker

A basic wrapper around Goutte to crawl a website

c2is/walker

  • Wednesday, September 25, 2013
  • by korby
  • Repository
  • 2 Watchers
  • 0 Stars
  • 28 Installations
  • PHP
  • 0 Dependents
  • 0 Suggesters
  • 0 Forks
  • 0 Open issues
  • 1 Versions
  • 0 % Grown

The README.md

Walker

A simple wrapper around Goutte that crawls an entire website and collects some stats about each page:

  • its status,
  • the pages referring to it,
  • any other information you collect yourself by passing a callback to the run() method.

Walker reads every "a href" value to build its crawl, so it provides an extension-based exclusion mechanism to skip irrelevant targets such as images. See the Parameters section below for more information.

By default, crawling is restricted to the given subdomain, but the second constructor parameter lets you define which other subdomains may be crawled. It takes a regexp describing the allowed subdomains. For example, this allows any subdomain:

$walker = new \Walker\Walker("http://www.somewebsite.fr", ".*");
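
How Walker applies this regexp internally is not shown in this README. As a rough sketch, assuming the pattern is matched against the part of the host name that precedes the site's domain, such a check might look like the following. The function isHostAllowed() and its parameters are hypothetical and not part of Walker's API.

```php
<?php
/**
 * Hypothetical helper (not part of Walker): decide whether $url belongs to
 * $baseDomain and whether its subdomain matches $subdomainPattern.
 */
function isHostAllowed(string $url, string $baseDomain, string $subdomainPattern): bool
{
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host)) {
        return false; // relative or malformed URL: no host to check
    }
    $host = strtolower($host);
    $suffix = '.' . strtolower($baseDomain);

    // The host must end with ".<baseDomain>" to have a subdomain at all.
    if (substr($host, -strlen($suffix)) !== $suffix) {
        return false;
    }
    $subdomain = substr($host, 0, -strlen($suffix));

    // Anchor the pattern so it has to match the whole subdomain.
    return preg_match('`^(?:' . $subdomainPattern . ')$`i', $subdomain) === 1;
}

var_dump(isHostAllowed('http://blog.somewebsite.fr/post', 'somewebsite.fr', '.*'));  // bool(true)
var_dump(isHostAllowed('http://blog.somewebsite.fr/post', 'somewebsite.fr', 'www')); // bool(false)
var_dump(isHostAllowed('http://www.example.org/', 'somewebsite.fr', '.*'));          // bool(false)
```

Anchoring the pattern with ^ and $ is a design choice of this sketch: it keeps a pattern such as "www" from also matching "www2" or "oldwww".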

Usage :

In your composer.json, add Walker to the "require" block:

{
    "require": {
        "c2is/walker" : "dev-master"
    },
    "minimum-stability": "dev",
    "autoload": {
        "psr-0": {
            "": "src/"
        }
    }
}

Run composer update:

php ./composer.phar update

Instantiate the crawler, start the crawl, and output the stats once the process has finished:

// Crawl the site, starting from this URL.
$walker = new \Walker\Walker("http://www.somewebsite.fr");
$walker->start();

// Print the column names, then one line per crawled page.
echo "<pre>" . implode(" | ", $walker->storage->getColumns("stats"));
foreach ($walker->storage->get("stats") as $stats) {
    printf("\n%s | %s | %s", $stats["URL"], $stats["STATUS"], $stats["CALLED IN"]);
}
echo "</pre>";
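
Once the crawl has finished, the same rows can be filtered. The sketch below lists pages whose status is 400 or above, together with where they were linked from. It runs on a hand-written sample array rather than a live crawl: the sample rows are purely illustrative, findBrokenPages() is a hypothetical helper, and the assumptions that STATUS holds a numeric HTTP code and that the stats come back as a plain array are not confirmed by this README.

```php
<?php
/**
 * Hypothetical helper (not part of Walker): keep only the rows whose
 * STATUS is an HTTP error code (400 or above).
 */
function findBrokenPages(array $rows): array
{
    return array_values(array_filter($rows, function (array $row): bool {
        return (int) $row['STATUS'] >= 400;
    }));
}

// Illustrative rows, shaped like the columns printed above.
// With a real crawl, the rows would come from $walker->storage->get("stats"),
// assuming that call returns a plain array.
$rows = [
    ['URL' => 'http://www.somewebsite.fr/',        'STATUS' => 200, 'CALLED IN' => ''],
    ['URL' => 'http://www.somewebsite.fr/missing', 'STATUS' => 404, 'CALLED IN' => 'http://www.somewebsite.fr/'],
];

foreach (findBrokenPages($rows) as $row) {
    printf("%s | %s | %s\n", $row['URL'], $row['STATUS'], $row['CALLED IN']);
}
```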

If you need more information, or want operations to run in real time during the crawl, pass an anonymous function to the run() method:

echo "<pre>" . implode(" | ", $walker->storage->getColumns("stats")) . " | LAST MODIF";
// The callback runs while the crawl is in progress; flush() pushes each line out immediately.
$walker->run(function ($crawler, $client) {
    $lastMod = $client->getResponse()->getHeader("last-modified");
    $stats = $client->getStats();
    printf("\n%s | %s | %s | %s", $stats["URL"], $stats["STATUS"], $stats["CALLED IN"], $lastMod);
    flush();
});
echo "</pre>";

Parameters :

You can override configuration values with the setConfiguration() method, for example:

$walker->setConfiguration("httpClientOptions", [
    'curl.options' => [
        CURLOPT_TIMEOUT => 150,
    ],
]);
$walker->setConfiguration("excludedFileExt", "`\.(jpg|jpeg|gif|png)$`i");
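
To see what an excludedFileExt pattern such as the one above would filter out, it can be tried against a few URLs with preg_match(). This only demonstrates the regular expression itself: how Walker applies it internally is not shown in this README, and isExcluded() is a hypothetical helper.

```php
<?php
// The pattern from the configuration example above: backticks are the regex
// delimiters, "$" anchors the extension to the end, "i" ignores case.
$excludedFileExt = '`\.(jpg|jpeg|gif|png)$`i';

/** Hypothetical helper (not part of Walker): true when $url ends with an excluded extension. */
function isExcluded(string $url, string $pattern): bool
{
    return preg_match($pattern, $url) === 1;
}

var_dump(isExcluded('http://www.somewebsite.fr/logo.PNG', $excludedFileExt));     // bool(true)
var_dump(isExcluded('http://www.somewebsite.fr/about.html', $excludedFileExt));   // bool(false)
// A query string after the extension defeats the "$" anchor.
var_dump(isExcluded('http://www.somewebsite.fr/logo.png?v=2', $excludedFileExt)); // bool(false)
```

The last case suggests that a pattern meant to skip images served with query strings would need to allow an optional "?..." suffix before the end anchor.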

The Versions

25/09 2013

dev-master

9999999-dev https://github.com/c2is/Walker

GPL

by André Cianfarani

crawler