
thewinterwind/arachnid

A crawler to find all unique internal pages on a given website

  • Thursday, June 15, 2017
  • by AnthonyVipond
  • Repository
  • 1 Watchers
  • 0 Stars
  • 2 Installations
  • PHP
  • 0 Dependents
  • 0 Suggesters
  • 60 Forks
  • 0 Open issues
  • 7 Versions
  • 0% Grown

The README.md

Arachnid Web Crawler

This library will crawl all unique internal links found on a given website up to a specified maximum page depth.

This library is based on the original blog post by Zeid Rashwani here:

http://zrashwani.com/simple-web-spider-php-goutte

Josh Lockhart adapted the original blog post's code (with permission) for Composer and Packagist and updated the syntax to conform with the PSR-2 coding standard.


How to Install

You can install this library with Composer. Drop this into your composer.json manifest file:

{
    "require": {
        "zrashwani/arachnid": "dev-master"
    }
}

Then run composer install.
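Alternatively, assuming the package is available on Packagist under the same name, running composer require zrashwani/arachnid:dev-master adds the entry to composer.json and installs it in one step.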

Getting Started

Here's a quick demo to crawl a website:

<?php
require 'vendor/autoload.php';

$url = 'http://www.example.com';
$linkDepth = 3;
// Initiate crawl    
$crawler = new \Arachnid\Crawler($url, $linkDepth);
$crawler->traverse();

// Get link data
$links = $crawler->getLinks();
print_r($links);
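The links come back as a plain PHP array, so you can post-process them however you like. As a minimal sketch (the exact shape of each entry is an assumption here and may vary by library version), you could list each discovered URI alongside whatever metadata the crawler recorded for it:

<?php
// Minimal sketch: iterate over the crawl results.
// getLinks() is assumed to return an array keyed by URI; the metadata
// stored per entry (depth, status, etc.) depends on the library version.
foreach ($crawler->getLinks() as $uri => $info) {
    echo $uri, PHP_EOL;
    print_r($info);
}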

Advanced Usage

There are other options you can set on the crawler:

Set additional options on the underlying Guzzle client by specifying an array of options in the constructor, or by passing them to setCrawlerOptions:

<?php
// The third constructor parameter is an array of options used to configure the Guzzle client
$crawler = new \Arachnid\Crawler('http://github.com', 2,
                         ['auth' => array('username', 'password')]);

// ...or use the separate method `setCrawlerOptions`
$options = array(
    'curl' => array(
        CURLOPT_SSL_VERIFYHOST => false,
        CURLOPT_SSL_VERIFYPEER => false,
    ),
    'timeout' => 30,
    'connect_timeout' => 30,
);

$crawler->setCrawlerOptions($options);
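Because these options are handed to the Guzzle client, other standard Guzzle request options should work as well. As a sketch (the header and redirect values below are purely illustrative, not part of the library), you could send a custom User-Agent and cap redirects:

<?php
// Sketch: further Guzzle request options; the values are illustrative only
$crawler->setCrawlerOptions([
    'headers'         => ['User-Agent' => 'MyCrawlerBot/1.0'],
    'allow_redirects' => ['max' => 5],
]);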

You can inject a PSR-3 compliant logger object (such as Monolog) to monitor crawler activity:

<?php
$crawler = new \Arachnid\Crawler($url, $linkDepth); // ... initialize crawler

// set a logger for crawler activity (compatible with PSR-3)
$logger = new \Monolog\Logger('crawler logger');
$logger->pushHandler(new \Monolog\Handler\StreamHandler(sys_get_temp_dir().'/crawler.log'));
$crawler->setLogger($logger);

You can set the crawler to visit only pages that match specific criteria by passing a callback closure to the filterLinks method:

<?php
//filter links according to specific callback as closure
$links = $crawler->filterLinks(function($link){
                    //crawling only blog links
                    return (bool)preg_match('/.*\/blog.*$/u',$link); 
                })
                ->traverse()
                ->getLinks();

How to Contribute

  1. Fork this repository
  2. Create a new branch for each feature or improvement
  3. Apply your code changes along with the corresponding unit tests
  4. Send a pull request from each feature branch

It is very important to separate new features or improvements into separate feature branches, and to send a pull request for each branch. This allows me to review and pull in new features or improvements individually.

All pull requests must adhere to the PSR-2 standard.

System Requirements

  • PHP 5.6.0+

Authors

License

MIT Public License

The Versions

  • 15/06 2017: dev-master (9999999-dev), http://github.com/thewinterwind/arachnid, MIT
  • 25/12 2016: 1.1 (1.1.0.0), http://github.com/codeguy/arachnid, MIT
  • 02/11 2015: 1.0.4 (1.0.4.0), http://github.com/codeguy/arachnid, MIT
  • 12/09 2015: 1.0.3 (1.0.3.0), http://github.com/codeguy/arachnid, MIT
  • 10/01 2014: v1.0.2 (1.0.2.0), http://github.com/codeguy/arachnid, MIT
  • 06/01 2014: 1.0.1 (1.0.1.0), http://github.com/codeguy/arachnid, MIT
  • 06/01 2014: 1.0.0 (1.0.0.0), http://github.com/codeguy/arachnid, MIT

Keywords: search, spider, scrape, crawl