library tarantula

Another PHP crawler based on Guzzle.

mihaeu/tarantula

  • Monday, November 30, 2015
  • by mihaeu
  • Repository
  • 2 Watchers
  • 11 Stars
  • 43 Installations
  • HTML
  • 0 Dependents
  • 0 Suggesters
  • 0 Forks
  • 1 Open issues
  • 9 Versions
  • 0 % Grown

The README.md

Tarantula is a web crawler written in PHP. It builds on the amazing work of the people behind Guzzle and Symfony's DomCrawler.

Installation

Global tool

Make sure ~/.composer/bin is in your $PATH and then simply execute:

composer global require mihaeu/tarantula:1.*
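
If that directory is not yet on your $PATH, a snippet along the following lines could go into your shell profile. This is only a sketch: it assumes a POSIX shell and the ~/.composer/bin location named above. The `COMPOSER_BIN` variable is introduced here for illustration, and on many installations Composer's global bin directory is ~/.composer/vendor/bin or ~/.config/composer/vendor/bin instead, so adjust the path to match your setup.

```shell
# Assumed location of Composer's global bin directory (taken from the text above;
# adjust if your Composer installation uses a different directory).
COMPOSER_BIN="$HOME/.composer/bin"

# Prepend the directory to PATH only if it is not already listed.
case ":$PATH:" in
    *":$COMPOSER_BIN:"*) ;;                 # already present, nothing to do
    *) PATH="$COMPOSER_BIN:$PATH" ;;        # not present, add it in front
esac
export PATH
```

The `case` guard keeps the entry from being added twice when the profile is sourced repeatedly.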

Library

Assuming you are using Composer, add the following to your composer.json file:

{
    "require": {
        "mihaeu/tarantula": "1.*"
    }
}

or use Composer's CLI: composer require mihaeu/tarantula:1.*

Usage

Global tool

Right now the only available command is crawl. Some usage examples:

# most basic use case
tarantula crawl "http://google.com"

# go deeper
tarantula crawl "http://products.com/categories" --depth=4

# mirror
tarantula crawl "http://myblog.com" --mirror=/tmp/blog-backup

# filters
tarantula crawl "http://myblog.com" --contains=yolo
tarantula crawl "http://myblog.com" --regex="(post)\|(\d+)"

# dump crawled pages into hashed files (with minified HTML)
tarantula crawl "http://myblog.com" --save-hashed=/tmp/blog-backup --minify-html

# HTTP basic auth
tarantula crawl "http://secure.com" --user=admin --password=admin

# search for "Avatar" on imdb
bin/tarantula crawl "http://www.imdb.com/find?q=avatar&s=all" --depth=0 --quiet --css=".findSection td.result_text"

# today's weather in seattle
bin/tarantula crawl --depth=0 "http://www.weather.com/weather/today/Seattle+WA+USWA0395:1:US" --css=".wx-first" | head -n 2

For all arguments and options use the help command:

tarantula help                    # displays all available commands
tarantula help crawl              # all arguments and options for the crawler
tarantula crawl "..." --verbose   # switch on debugging output

Library

Have a look at the tests to see what's possible, or just try the following in your code:

use Mihaeu\Tarantula\Crawler;
use Mihaeu\Tarantula\HttpClient;

$crawler = new Crawler(new HttpClient('http://google.com'));
$links = $crawler->go(1);

All HTTP requests go through Guzzle, and any configuration accepted by Guzzle's request object can also be passed to Tarantula's HttpClient.

Tests

Test coverage is not at 100%: this was an afternoon project, and testing a crawler takes a lot of time because of the test setup it requires.

If you want to get a quick overview of the project, I recommend running the test suite with the --testdox flag:

vendor/bin/phpunit --testdox

To Do

  • [ ] filters (url, filetype, etc.)
  • [ ] allow for Guzzle to be configured via command line
  • [ ] more actions (save plain result, crawl via DOM/XPath, ...)

Troubleshooting

Composer global install fails

This is most likely due to a conflict with the requirements of other global installs. Unfortunately, Composer's architecture doesn't offer a solution for this yet. I tried to keep Tarantula's requirements loose to avoid this problem.

If you want to have Tarantula available throughout your system, just install it to another directory (e.g. using composer create-project) and symlink bin/tarantula into a folder in your $PATH.
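
The workaround could be scripted roughly as follows. This is a sketch under assumptions, not part of Tarantula: `link_tarantula` is a hypothetical helper, and the paths in the usage comment are placeholders. It only covers the symlink step and expects the project to have been created already.

```shell
# Hypothetical helper (not shipped with Tarantula): symlink bin/tarantula from
# a standalone install into a directory that is on your $PATH.
# Pass an absolute install path so the symlink resolves from anywhere.
link_tarantula() {
    install_dir=$1   # directory the project was installed into
    bin_dir=$2       # a directory listed in your $PATH

    # Refuse to create a dangling link if the executable is missing.
    if [ ! -f "$install_dir/bin/tarantula" ]; then
        echo "bin/tarantula not found in $install_dir" >&2
        return 1
    fi

    mkdir -p "$bin_dir"
    # -f replaces an existing link, so re-running after an update is safe.
    ln -sf "$install_dir/bin/tarantula" "$bin_dir/tarantula"
}

# Example usage (placeholder paths, run manually):
#   composer create-project mihaeu/tarantula "$HOME/tools/tarantula"
#   link_tarantula "$HOME/tools/tarantula" "$HOME/bin"
```

Because the standalone install has its own vendor directory, its dependencies cannot clash with other globally installed Composer packages.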

Thanks to

License

MIT, see LICENSE file.

The Versions

  • v1.2.0 (1.2.0.0), 05/07 2014
  • v1.1.3 (1.1.3.0), 29/06 2014
  • v1.1.2 (1.1.2.0), 29/06 2014
  • v1.1.1 (1.1.1.0), 28/06 2014
  • v1.1 (1.1.0.0), 28/06 2014
  • v1.0.0 (1.0.0.0), 27/06 2014

All versions: https://github.com/mihaeu/tarantula, MIT license, by Michael Haeuslmann. Keywords: crawler, spider.