2017 © Pedro Peláez
 

daa/web-scraping-sdk

Composer package that simplifies web scraping

  • Monday, December 22, 2014
  • by danielanteloagra
  • 1 Watchers
  • 2 Stars
  • 43 Installations
  • PHP
  • 0 Dependents
  • 0 Suggesters
  • 0 Forks
  • 0 Open issues
  • 3 Versions
  • 231% growth

README

Web Scraping PHP SDK

This Composer package simplifies web content scraping by providing a lightweight and easy-to-use code base.

Simply extend the provided Scraper class and implement the gather() method to extract the desired content using XPath expressions. You can then write that content to a file, store it in a database, return it as a JSON string, and so on.

Highlights:

  • XPath-driven content extraction
  • Just one method to implement
  • Easy file writing, database storage, or formatted string/object return
  • PSR-2 coding standards
  • Uses cURL to retrieve content from the specified source
  • Configurable retry count and pause time for failed attempts
  • Easily follow links to gather additional content

Packagist link: https://packagist.org/packages/daa/web-scraping-sdk

Usage

Add the following requirement to your composer.json and run a composer install/update:

  "require": {
        ...
        "daa/web-scraping-sdk": "1.*"
  },
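For reference, a minimal complete composer.json using this package might look like the following (the project name here is a placeholder, not something the SDK requires):

```json
{
    "name": "your-vendor/your-scraper-project",
    "require": {
        "daa/web-scraping-sdk": "1.*"
    }
}
```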

Write your own scraper class that extends Scraper\Sdk\WebScraper and implements the gather() method:

namespace Your\Package\Scraper;

use Scraper\Sdk\WebScraper;

class YourScraper extends WebScraper 
{
    /**
     * {@inheritdoc}
     */
    protected function gather(\DOMXPath $dom)
    {
        $nodes = $dom->query(".//article[@class='product']");
        foreach ($nodes as $node) {
            ...
            // follow a url and extract more data
            $linkDom = $this->getLinkContent($node->getElementsByTagName('a')->item(0));
            $linkDom->query...

        }
    }
}
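The XPath extraction inside gather() relies only on PHP's built-in DOM extension, so the query pattern can be tried standalone. The snippet below is a sketch of that pattern with a made-up sample page; the `<h2>` title markup and the `$titles` array are illustrative assumptions, not part of the SDK:

```php
<?php
// Standalone illustration of the XPath pattern used in gather().
// The HTML here is a fabricated sample page for demonstration only.
$html = '<html><body>
  <article class="product"><h2>Widget A</h2></article>
  <article class="product"><h2>Widget B</h2></article>
</body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$titles = array();
foreach ($xpath->query(".//article[@class='product']") as $node) {
    // Pass $node as the context so the query stays inside this article
    $h2 = $xpath->query(".//h2", $node)->item(0);
    $titles[] = trim($h2->nodeValue);
}

// One way to hand the data on, e.g. for a JSON API or log file
echo json_encode($titles);
```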

Now call your class, for example from a script executed by a cron job:

require __DIR__.'/../vendor/autoload.php';

$scraper = new Your\Package\Scraper\YourScraper('http://www.someurl.com/with/content/');
$scraper->execute();

With troublesome sources you can specify the retry configuration (the default is 3 retries with a 3-second pause in between):

$scraper = new Your\Package\Scraper\YourScraper('http://www.someurl.com/with/content/', $retryAttempts, $pauseSeconds);
$scraper->execute();

You can reuse the same instance to scrape several URLs with the same structure:

$pages = array(
    'http://www.someurl.com/section-one/',
    'http://www.someurl.com/section-two/page1',
    'http://www.someurl.com/section-one/page2'
); 

$scraper = new Your\Package\Scraper\YourScraper();

foreach ($pages as $url) {
    $scraper->setSource($url);
    $scraper->execute();
}

Check out the examples folder for more details and fully working examples.

The Versions

  • dev-master (9999999-dev), 22/12 2014, MIT
  • v1.1 (1.1.0.0), 22/12 2014, MIT
  • v1.0 (1.0.0.0), 21/12 2014, MIT
    Requires: php >=5.3.3, ext-curl *