hermes

A configuration-based web scraper.

This repo is the first (only?) pass at an idea that has been floating around in my head for a while.

It is fully functional, but it is neither complete nor particularly well written.

I DO NOT RECOMMEND THAT YOU USE THIS FOR ANYTHING

With that out of the way...

Requirements

Composer and PHP 7.0 or greater.

Installation

This package is not on Packagist. To use it, you will have to add this repo to the repositories section of your composer.json.

Goals

This package aims to be a framework that lets people with limited PHP experience create simple, maintainable scrapers.

The general idea is to allow scrapers to be written as config files in an arbitrary format (e.g. PHP, JSON, YAML).

Scrapers should have a way of declaring which sites they support.

Scrapers should be able to extract data given nothing more than a CSS selector.

Scrapers should provide distinct steps for extraction, normalization and transformation of data.

Scrapers should be able to extend other scrapers, overriding individual properties as necessary.
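To make the last two goals a bit more concrete, here is a minimal standalone sketch in plain PHP. It does not use this package: the normalize and transform keys, the process function and the .likes / .like-count selectors are all hypothetical names chosen for illustration, and the package's actual config keys may well differ.

```php
<?php

// Hypothetical field definition. 'normalize' and 'transform' are illustrative
// keys only; they are not options documented by this package.
$base = [
    'name' => 'likes',
    'selector' => '.likes',
    'normalize' => 'trim',
    'transform' => 'intval',
];

// A child definition extends its parent by overriding individual properties.
$child = array_replace($base, ['selector' => '.like-count']);

/**
 * Run the normalization and transformation steps on a value that has
 * already been extracted from the page.
 */
function process(array $field, string $raw)
{
    $normalized = ($field['normalize'])($raw); // e.g. strip stray whitespace

    return ($field['transform'])($normalized); // e.g. cast to a useful type
}

$value = process($child, "  1842 \n");
```

Keeping the steps separate means a child definition can swap out just the selector (as above) or just one of the later steps without restating the rest.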

Usage

As mentioned above, this package is far from complete and therefore subject to change at any time.

Only the extraction portion of scraping is covered: you will need to handle the HTTP/crawler portion on your own. Alternatively, just use Goutte.

As a simple example, create a file named duckduckgo.com.php with the following contents:

<?php

return [
    'schema' => [
        [
            'name' => 'results',
            'selector' => '.web-result',
            'schema' => [
                [
                    'name' => 'title',
                    'selector' => '.result__title',
                ],
                [
                    'name' => 'description',
                    'selector' => '.result__snippet',
                ],
            ],
        ],
    ],
];

Create a scraper instance using the ScraperFactory class:

$scraper = SSNepenthe\Hermes\Scraper\ScraperFactory::fromConfigFile('/path/to/duckduckgo.com.php');

The scraper works against a Symfony DOM Crawler instance. Create this however you see fit; the example below uses Goutte:

$client = new Goutte\Client;
$crawler = $client->request('GET', 'https://duckduckgo.com/html?q=firefox');

Lastly, pass the crawler to the scrape method on the scraper instance:

$result = $scraper->scrape($crawler);

You will wind up with an array that looks like the following:

[
    'results' => [
        [
            'title' => 'Download Firefox — Free Web Browser — Mozilla',
            'description' => 'Download Mozilla Firefox, a free Web browser. Firefox is created by a global non-profit dedicated to putting individuals in control online. Get Firefox for Windows ...',
        ],
        [
            'title' => 'Firefox - Home | Facebook',
            'description' => 'Firefox. 18,714,317 likes · 14,556 talking about this. The only browser built for freedom, not for profit. Get Firefox: https://mzl.la/292SfT5.',
        ],
        [
            'title' => 'Firefox 🦊🌍 (@firefox) | Twitter',
            'description' => 'The latest Tweets from Firefox (@firefox). go forth and internet freely. All over the world',
        ],
        // ...
    ],
]
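The result mirrors the shape of the schema: an entry with a nested schema becomes a list with one item per matched element, and each leaf entry becomes a value keyed by its name. The sketch below illustrates that mapping, but it is not how this package is implemented. It is a standalone approximation built on DOMDocument and DOMXPath, it only understands single-class selectors such as .web-result, and it assumes a leaf takes the trimmed text of the first match. The function names and the sample HTML are made up for the example.

```php
<?php

/**
 * Translate a single class selector such as ".result__title" into XPath.
 * Real CSS selectors are far richer; this covers only the ".class" form.
 */
function classSelectorToXPath(string $selector): string
{
    $class = ltrim($selector, '.');

    return sprintf(
        "descendant::*[contains(concat(' ', normalize-space(@class), ' '), ' %s ')]",
        $class
    );
}

/**
 * Walk a schema against a context node. Entries with a nested 'schema'
 * yield a list of sub-arrays; leaf entries yield the first match's text.
 */
function extractWithSchema(DOMXPath $xpath, DOMNode $context, array $schema): array
{
    $result = [];

    foreach ($schema as $entry) {
        $nodes = $xpath->query(classSelectorToXPath($entry['selector']), $context);

        if (isset($entry['schema'])) {
            // One sub-array per matched element, scoped to that element.
            $result[$entry['name']] = [];
            foreach ($nodes as $node) {
                $result[$entry['name']][] = extractWithSchema($xpath, $node, $entry['schema']);
            }
        } else {
            $result[$entry['name']] = $nodes->length > 0
                ? trim($nodes->item(0)->textContent)
                : null;
        }
    }

    return $result;
}

// Made-up markup that uses the same class names as the config above.
$html = <<<'HTML'
<div class="web-result">
  <h2 class="result__title">First title</h2>
  <p class="result__snippet">First snippet</p>
</div>
<div class="web-result">
  <h2 class="result__title">Second title</h2>
  <p class="result__snippet">Second snippet</p>
</div>
HTML;

$schema = [
    [
        'name' => 'results',
        'selector' => '.web-result',
        'schema' => [
            ['name' => 'title', 'selector' => '.result__title'],
            ['name' => 'description', 'selector' => '.result__snippet'],
        ],
    ],
];

$document = new DOMDocument();
$document->loadHTML($html);

$result = extractWithSchema(new DOMXPath($document), $document, $schema);
```

Here $result has a single results key holding two items, each with a title and a description, which is the same shape as the array shown above.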

For more examples, check out the various files in tests/fixtures/scrapers.

17/03 2017

dev-master

9999999-dev

GPL-2.0

ssnepenthe/hermes
