dan/aicrawler

A web scraping pattern combining heuristics with the Symfony DOMCrawler.

  • Thursday, June 9, 2016
  • by dan
  • PHP

The README.md

AiCrawler

Leverage AI design patterns by using heuristics with the Symfony DOMCrawler.

Please crawl on over to the docs, which are also available as a GitBook.

Quickstart

The AiCrawler package is responsible for making boolean assertions about a node in the HTML DOM. It comes with a straightforward data point trait that records the results of your heuristics (rules) for a given "item" or context.

Install with Composer

$ composer require dan/aicrawler dev-master

Trivial example

$crawler = new AiCrawler('<html>...</html>');

$node = $crawler->filter('div[id="content-start"]');
$args = ['words' => 15];

// Does the content have at least 15 words?
$assertion = Heuristics::words($node, $args); // true / false
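
To make the idea of a boolean assertion on a node concrete, here is a minimal sketch of what a word-count rule could look like. It uses PHP's built-in DOM extension instead of AiCrawler, and `hasAtLeastWords` is a hypothetical helper invented for illustration, not the library's `Heuristics::words` implementation.

```php
<?php
/**
 * Hypothetical stand-in for a word-count heuristic (not AiCrawler code).
 *
 * @param DOMNode $node The node whose text is counted.
 * @param int     $min  The minimum number of words required.
 *
 * @return bool
 */
function hasAtLeastWords(DOMNode $node, int $min): bool
{
    // textContent includes the text of every descendant node.
    $words = preg_split('/\s+/', trim($node->textContent), -1, PREG_SPLIT_NO_EMPTY);

    return count($words) >= $min;
}

$doc = new DOMDocument();
$doc->loadHTML('<html><body><div id="content-start">One two three four five</div></body></html>');

// Select the same element the example above filters for.
$node = (new DOMXPath($doc))->query('//div[@id="content-start"]')->item(0);

var_dump(hasAtLeastWords($node, 3));  // bool(true)
var_dump(hasAtLeastWords($node, 15)); // bool(false)
```

The rule only answers yes or no; what to do with the answer (score it, store it, ignore it) is left to the caller.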

A more expressive example

$crawler = new AiCrawler("<html>...</html>");

$args = [
    'elements' => [
        "elements" => "/p/ /blockquote/ /(u|o)l/ /h[1-6]/",
        "regex" => true,
        'words' => [
            'words' => 15,
            'descendants' => true,
            'words2' => [
                'words' => "/(cod(ing|ed|e)|program|language|php)/",
                'regex' => true,
                'descendants' => true
            ]
        ]
    ],
    'matches' => 3
];


/**
 * Do at least 3 of this div's children that are p, blockquote, ul, ol or any
 * h element contain at least 15 words (including text from the child's
 * descendants) AND words such as coding, coded, code, program, language, php
 * (including text from the child's descendants)?
 */
$crawler->filter("div")->each(function(&$node) use ($args) {
    if (Heuristics::children($node, $args)) {
        $node->setDataPoint("example", "words", 1);
    }
});
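
The nested rule above can be read as "count the children that pass every inner test, then compare the count with a threshold". The sketch below spells that out with PHP's built-in DOM extension. `countMatchingChildren` is a hypothetical function, and it simplifies the example by taking a single anchored tag pattern rather than the space-separated list of patterns shown above; it is not how AiCrawler implements `Heuristics::children`.

```php
<?php
/**
 * Hypothetical sketch: count the child elements whose tag name matches
 * $tagPattern, whose text has at least $minWords words, and whose text
 * matches $wordPattern. Text from descendants is included.
 */
function countMatchingChildren(DOMElement $parent, string $tagPattern, int $minWords, string $wordPattern): int
{
    $matches = 0;

    foreach ($parent->childNodes as $child) {
        if (!$child instanceof DOMElement) {
            continue; // skip text nodes and comments
        }

        $text = trim($child->textContent);
        $wordCount = count(preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY));

        if (preg_match($tagPattern, $child->tagName) === 1
            && $wordCount >= $minWords
            && preg_match($wordPattern, $text) === 1
        ) {
            $matches++;
        }
    }

    return $matches;
}

$html = '<div id="post">'
    . '<p>PHP is a language that many people use for web programming today.</p>'
    . '<h2>Short heading</h2>'
    . '<ul><li>Writing code in a small program teaches more than reading about it does.</li></ul>'
    . '<span>code</span>'
    . '</div>';

$doc = new DOMDocument();
$doc->loadHTML('<html><body>' . $html . '</body></html>');
$div = (new DOMXPath($doc))->query('//div[@id="post"]')->item(0);

$count = countMatchingChildren(
    $div,
    '/^(p|blockquote|[uo]l|h[1-6])$/i',
    5,
    '/(cod(ing|ed|e)|program|language|php)/i'
);

// The p and ul children pass all three tests; h2 is too short, span has the wrong tag.
var_dump($count >= 2); // bool(true)
```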

Sound interesting? Read on about the Heuristics class or go right to a similar example with complete notes.

Version 0.0.1

  • A Heuristics class with some cool rules to get you started.
  • A Scorable trait is on our AiCrawler class so there is a pattern for data points.
  • An Extra trait is on our AiCrawler class so there is a pattern for storing extra data.
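
As an illustration of the data point pattern a trait can provide, here is a minimal sketch. The `setDataPoint` signature is inferred from the call in the example above (`item`, `name`, `value`); the trait name `RecordsDataPoints`, the `getScore` method and the `ExampleNode` class are assumptions made up for this sketch and may not match the real Scorable trait.

```php
<?php
// Hypothetical trait: stores named values per item (context).
trait RecordsDataPoints
{
    /** @var array Data points keyed by item, then by name. */
    private $dataPoints = [];

    /**
     * Record the result of one rule for an item.
     *
     * @param string    $item  The context the data point belongs to.
     * @param string    $name  The name of the rule or measurement.
     * @param int|float $value The value to store.
     */
    public function setDataPoint(string $item, string $name, $value)
    {
        $this->dataPoints[$item][$name] = $value;
    }

    /**
     * Sum every data point recorded for an item.
     *
     * @param string $item The context to total.
     *
     * @return int|float
     */
    public function getScore(string $item)
    {
        return array_sum($this->dataPoints[$item] ?? []);
    }
}

// Any class can opt in to the pattern by using the trait.
class ExampleNode
{
    use RecordsDataPoints;
}

$node = new ExampleNode();
$node->setDataPoint('example', 'words', 1);
$node->setDataPoint('example', 'headings', 2);

echo $node->getScore('example'), PHP_EOL; // 3
```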

Todo

Contributing

  1. Fork this project on GitHub.
  2. Existing unit tests must pass.
  3. Contributions must be unit tested.
  4. New heuristics should be portable (have few or no dependencies).
  5. New heuristics should have helpful doc blocks.
  6. Submit a pull request.
  7. See guide on extending Heuristics for special heuristics.

Documentation

  • Follow PSR-2.
  • Add PHPDoc blocks for all classes, methods, and functions.
  • Omit the @return tag if the method does not return anything.
  • Add a blank line before @param, @return or @throws.
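
The rules above might look like this when applied to a small, dependency-free heuristic. `ExampleHeuristics`, `characters` and `remember` are hypothetical names used only to show the doc block conventions; they are not part of AiCrawler.

```php
<?php
/**
 * Example heuristics written to the documentation rules above.
 *
 * Hypothetical class used for illustration only.
 */
class ExampleHeuristics
{
    /** @var string[] Names of the rules that have been run. */
    private static $log = [];

    /**
     * Check whether the trimmed text has at least a minimum number of bytes.
     *
     * @param string $text The text to measure.
     * @param int    $min  The minimum length.
     *
     * @return bool
     */
    public static function characters(string $text, int $min): bool
    {
        self::remember('characters');

        return strlen(trim($text)) >= $min;
    }

    /**
     * Record that a rule ran. There is no @return tag because nothing is returned.
     *
     * @param string $rule The rule name.
     */
    private static function remember(string $rule)
    {
        self::$log[] = $rule;
    }
}

var_dump(ExampleHeuristics::characters('hello world', 5)); // bool(true)
```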

Please report any issues here.

License

AiCrawler is free software distributed under the terms of the MIT license.

The Versions

  • dev-master (9999999-dev), June 9, 2016. A web scraping pattern combining heuristics with the Symfony DOMCrawler. License: MIT. Keywords: crawler, ai, scraper.
  • v0.1.0 (0.1.0.0), November 21, 2015. A web scraping pattern combining heuristics with the Symfony DOMCrawler. License: MIT. Keywords: crawler, ai, scraper.
  • v0.0.1 (0.0.1.0), May 5, 2015. A web scraping pattern using heuristics with Symfony Components. License: MIT. Keywords: crawler, ai, scraper.