Asynchronous PHP scraper that uses multi cURL and ReactPHP, with proxy support, inspired by the Python Grab framework

grab/spider

  • Wednesday, May 10, 2017
  • by strelov1
  • Repository
  • 4 Watchers
  • 9 Stars
  • 59 Installations
  • PHP
  • 0 Dependents
  • 0 Suggesters
  • 2 Forks
  • 0 Open issues
  • 1 Versions
  • 5 % Grown

The README.md

grab-spider

Asynchronous PHP scraper that uses multi cURL and ReactPHP, inspired by the Python Grab framework.

Installation

To install grab-spider, run the command:


composer require grab/spider "dev-master"

Quick start

<?php

require __DIR__ . '/../vendor/autoload.php';

class HackerNewCrawler extends \Grab\Spider
{
    public function taskGenerator()
    {
        $range = array_map(function($item) {
            return sprintf('https://news.ycombinator.com/news?p=%d', $item);
        }, range(1, 4)) ;

        foreach ($range as $url) {
            $this->task('page', [
                'url' => $url,
                'max_request' => 10,
            ]);
        }
    }

    public function taskPage($parser, $task)
    {
        $links = $parser->find('.storylink');
        foreach ($links as $link) {
            $this->task('topic', [
                'url' => $link->getAttribute('href'),
                'curl_config' => [
                    CURLOPT_TIMEOUT => 60,
                ],
                'max_request' => 10,
            ]);
        }
    }

    public function taskTopic($parser, $task)
    {
        // Print the page title, skipping pages where no <title> was found.
        $titles = $parser->find('title');
        if (isset($titles[0])) {
            echo trim($titles[0]->text()) . PHP_EOL;
        }
    }
}

$bot = new HackerNewCrawler();
$bot->debug = true;
$bot->setCurlSetting([
    CURLOPT_USERAGENT => 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36',
]);
//$bot->loadProxy(__DIR__ . '/proxy_list.txt');
$bot->run();
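
In the quick start, `$this->task('page', ...)` queues work that ends up in `taskPage()`, and `'topic'` ends up in `taskTopic()`, which suggests a "task name to `task<Name>` method" convention. The standalone sketch below illustrates that convention only. `MiniDispatcher` and everything inside it are hypothetical names, not the library's implementation, and the sketch runs synchronously without any HTTP requests.

```php
<?php
// Hypothetical illustration of the "task name -> taskName() method" convention.
// This is NOT grab/spider's code: no HTTP, no async, only the dispatch idea.
class MiniDispatcher
{
    /** @var array queued [name, options] pairs */
    private $queue = [];

    /** @var string[] log of handled tasks, kept for inspection */
    public $handled = [];

    public function task(string $name, array $options): void
    {
        $this->queue[] = [$name, $options];
    }

    public function run(): void
    {
        // Handlers may enqueue more tasks, so keep draining until the queue is empty.
        while ($this->queue) {
            [$name, $options] = array_shift($this->queue);
            $method = 'task' . ucfirst($name); // 'page' -> 'taskPage'
            if (!method_exists($this, $method)) {
                throw new RuntimeException("No handler for task '$name'");
            }
            $this->$method($options);
        }
    }

    public function taskPage(array $options): void
    {
        $this->handled[] = 'page:' . $options['url'];
        // A page handler can schedule follow-up tasks, as taskPage() does in the quick start.
        $this->task('topic', ['url' => $options['url'] . '#first-story']);
    }

    public function taskTopic(array $options): void
    {
        $this->handled[] = 'topic:' . $options['url'];
    }
}

$dispatcher = new MiniDispatcher();
$dispatcher->task('page', ['url' => 'https://example.com/news?p=1']);
$dispatcher->run();
print_r($dispatcher->handled);
```

The real spider presumably also fetches each `url` and passes a parser object plus the task to the handler, as the `taskPage($parser, $task)` signature in the quick start indicates. The sketch leaves that part out on purpose.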

Simple DI for changing the parser


$parser = new \DiDom\Document();
$bot = new HackerNewCrawler([$parser, 'load']);

$bot = new HackerNewCrawler(function ($content) {
    $parser = new \DiDom\Document();
    return $parser->load($content);
});

$bot = new HackerNewCrawler(function ($content) {
    return simplexml_load_string($content);
});

$bot = new HackerNewCrawler(function ($content) {
    return new \SoapClient($content);
});
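
The constructor calls above all pass a callable that receives the response body and returns a parser object. As a minimal sketch of the same idea with no third-party dependency, the callable below uses PHP's bundled DOM extension. The variable name `$domParser` is an assumption made for illustration, and the snippet exercises the callable on its own rather than through the spider.

```php
<?php
// Hypothetical parser callable, not part of grab/spider:
// turns an HTML string into a DOMXPath object using PHP's bundled DOM extension.
$domParser = function (string $content): DOMXPath {
    $document = new DOMDocument();
    // Collect libxml warnings internally, since real-world HTML is rarely well-formed.
    $previous = libxml_use_internal_errors(true);
    $document->loadHTML($content);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return new DOMXPath($document);
};

// Standalone check of the callable, without the spider.
$xpath = $domParser('<html><head><title> Example </title></head><body></body></html>');
$title = trim($xpath->query('//title')->item(0)->textContent);
echo $title . PHP_EOL;
```

If such a callable were passed to the crawler constructor, the task methods would presumably receive a `DOMXPath` instead of a DiDom document, so calls such as `$parser->find('.storylink')` would need to be rewritten as XPath queries.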

The Versions

10/05 2017

dev-master

9999999-dev https://github.com/strelov1/Spider

MIT

by Ilya Strelov

curl parser html async crawler react spider multi scraper reactphp grab