dev-master
9999999-dev
A configuration based web scraper.
GPL-2.0
The Requires
The Development Requires
- mockery/mockery ^0.9.8
- phpunit/phpunit ^6.0
- psy/psysh ^0.8
- squizlabs/php_codesniffer ^2.8
- symfony/console ^3.2
A configuration based web scraper.
This repo is the first (only?) pass at an idea that has been floating around in my head for a while.
It is fully functional, but it is neither complete nor particularly well written.
I DO NOT RECOMMEND THAT YOU USE THIS FOR ANYTHING.
With that out of the way...
Requires Composer and PHP 7.0 or greater.
This package is not on Packagist. To use it, you will have to add this repo to the repositories section of your composer.json.
This package aims to be a framework for creating simple and maintainable scrapers with limited PHP experience.
The general idea is to allow scrapers to be written as config files in any arbitrary format (e.g. PHP, JSON, YAML).
- Scrapers should have a method for declaring what sites they support.
- Scrapers should be able to extract data given nothing more than a CSS selector.
- Scrapers should provide distinct steps for extraction, normalization and transformation of data.
- Scrapers should be able to extend other scrapers, overriding individual properties as necessary.
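To make the last goal concrete, here is a minimal sketch of how one array-based config could extend another. It is not taken from this package: the extendConfig function and the merge-by-name rule are assumptions made purely for illustration, and the selectors in the child config are hypothetical.

```php
<?php
// Hypothetical helper, NOT part of this package: merges a child scraper
// config over a parent config, matching schema entries by their 'name' key.
function extendConfig(array $parent, array $child): array
{
    // Index the parent's schema entries by name so they can be overridden.
    $byName = [];
    foreach ($parent['schema'] ?? [] as $entry) {
        $byName[$entry['name']] = $entry;
    }

    // Child entries replace individual properties of same-named parent
    // entries; entries with new names are appended.
    foreach ($child['schema'] ?? [] as $entry) {
        $byName[$entry['name']] = array_replace($byName[$entry['name']] ?? [], $entry);
    }

    // For every other top-level key, the child simply wins.
    $merged = array_replace($parent, $child);
    $merged['schema'] = array_values($byName);

    return $merged;
}

$parent = [
    'schema' => [
        ['name' => 'title', 'selector' => '.result__title'],
        ['name' => 'description', 'selector' => '.result__snippet'],
    ],
];

// The child only overrides the selector of the 'description' entry.
$child = [
    'schema' => [
        ['name' => 'description', 'selector' => '.snippet'],
    ],
];

$merged = extendConfig($parent, $child);
print_r($merged);
```

Merging by name rather than by position keeps a child config short: it lists only the entries it changes.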
As mentioned above, this package is far from complete and therefore subject to change at any time.
Only the extraction portion of scraping is covered: you will need to handle the HTTP/crawler portion on your own. Alternatively, just use Goutte.
As a simple example, create a file named duckduckgo.com.php with the following contents:
<?php

return [
    'schema' => [
        [
            'name' => 'results',
            'selector' => '.web-result',
            'schema' => [
                [
                    'name' => 'title',
                    'selector' => '.result__title',
                ],
                [
                    'name' => 'description',
                    'selector' => '.result__snippet',
                ],
            ],
        ],
    ],
];
Create a scraper instance using the ScraperFactory class:
$scraper = SSNepenthe\Hermes\Scraper\ScraperFactory::fromConfigFile('/path/to/duckduckgo.com.php');
The scraper works against a Symfony DOM Crawler instance. Create this however you see fit; the example below uses Goutte:
$client = new Goutte\Client;
$crawler = $client->request('GET', 'https://duckduckgo.com/html?q=firefox');
And lastly, pass the crawler to the scrape method on the scraper instance:
$result = $scraper->scrape($crawler);
You will wind up with an array that looks like the following:
[
    'results' => [
        [
            'title' => 'Download Firefox — Free Web Browser — Mozilla',
            'description' => 'Download Mozilla Firefox, a free Web browser. Firefox is created by a global non-profit dedicated to putting individuals in control online. Get Firefox for Windows ...',
        ],
        [
            'title' => 'Firefox - Home | Facebook',
            'description' => 'Firefox. 18,714,317 likes · 14,556 talking about this. The only browser built for freedom, not for profit. Get Firefox: https://mzl.la/292SfT5.',
        ],
        [
            'title' => 'Firefox 🦊🌍 (@firefox) | Twitter',
            'description' => 'The latest Tweets from Firefox (@firefox). go forth and internet freely. All over the world',
        ],
        // ...
    ],
]
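To show how a nested schema maps onto a nested result array, here is a self-contained sketch that uses only PHP's bundled DOM extension. It is not this package's implementation: classSelectorToXPath and extractSchema are hypothetical helpers, the sample HTML is made up, and only single-class selectors such as .result__title are supported (the real scraper works against a Symfony DOM Crawler).

```php
<?php
// Illustrative only: NOT this package's implementation.

/** Convert a single-class CSS selector (".foo") into a relative XPath query. */
function classSelectorToXPath(string $selector): string
{
    $class = ltrim($selector, '.');

    return sprintf(
        './/*[contains(concat(" ", normalize-space(@class), " "), " %s ")]',
        $class
    );
}

/** Apply a list of schema entries to a context node, returning name => value. */
function extractSchema(DOMXPath $xpath, DOMNode $context, array $schema): array
{
    $result = [];

    foreach ($schema as $entry) {
        $nodes = $xpath->query(classSelectorToXPath($entry['selector']), $context);

        if (isset($entry['schema'])) {
            // Nested schema: produce one sub-array per matched node.
            $result[$entry['name']] = [];
            foreach ($nodes as $node) {
                $result[$entry['name']][] = extractSchema($xpath, $node, $entry['schema']);
            }
        } else {
            // Leaf entry: trimmed text of the first match, or null if none.
            $first = $nodes->item(0);
            $result[$entry['name']] = $first === null ? null : trim($first->textContent);
        }
    }

    return $result;
}

// Made-up markup that mimics the structure the example config expects.
$html = <<<'HTML'
<div class="web-result">
  <h2 class="result__title">First title</h2>
  <p class="result__snippet">First description</p>
</div>
<div class="web-result">
  <h2 class="result__title">Second title</h2>
  <p class="result__snippet">Second description</p>
</div>
HTML;

$config = [
    'schema' => [
        [
            'name' => 'results',
            'selector' => '.web-result',
            'schema' => [
                ['name' => 'title', 'selector' => '.result__title'],
                ['name' => 'description', 'selector' => '.result__snippet'],
            ],
        ],
    ],
];

$document = new DOMDocument();
$document->loadHTML($html);

$result = extractSchema(new DOMXPath($document), $document, $config['schema']);
print_r($result);
```

Each entry with its own schema key becomes a list of sub-arrays, one per matched element, and each leaf entry becomes a string. That matches the shape of the result array shown above.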
For more examples, check out the various files in tests/fixtures/scrapers.