2017 © Pedro Peláez
 

library diggin-robotrules

parser/handler for Robots Exclusion Protocol (robots.txt and more)

diggin/diggin-robotrules

  • PHP
  • 1 Dependents
  • 0 Suggesters
  • 3 Forks
  • 0 Open issues
  • 8 Versions
  • 2 % Grown

The README.md

Diggin_RobotRules

PHP parser/handler for the Robots Exclusion Protocol (robots.txt and more).

Features

  • implements http://www.robotstxt.org/norobots-rfc.txt

    • [DONE] "3.2.2 The Allow and Disallow lines" as test cases
    • [DONE] "4. Examples" as test cases
  • passes Nutch's test code

    • [DONE] see tests/Diggin/RobotRules/Imported/NutchTest.php
  • parses and handles HTML meta tags
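
To make the "Allow and Disallow lines" rule from the draft concrete, here is a minimal sketch of first-match prefix evaluation. It is an illustration only and not the Diggin_RobotRules implementation: the function name `isPathAllowedSketch` and the `array(type, prefix)` rule format are hypothetical, %xx decoding is skipped, and an empty value is assumed to restrict nothing.

```php
<?php
// Illustrative sketch only: NOT how Diggin_RobotRules is implemented.
// $rules is a hypothetical list of array(type, pathPrefix) pairs taken
// from the record that applies to the current user agent.
function isPathAllowedSketch(array $rules, $path)
{
    foreach ($rules as $rule) {
        list($type, $prefix) = $rule;
        // Assumption: an empty value (e.g. "Disallow:") restricts nothing.
        if ($prefix === '') {
            continue;
        }
        // Lines are evaluated in order; the first prefix match decides.
        if (strncmp($path, $prefix, strlen($prefix)) === 0) {
            return $type === 'allow';
        }
    }
    // No line matched, so access is allowed by default.
    return true;
}

$rules = array(
    array('allow', '/aaa/public/'),
    array('disallow', '/aaa/'),
);
var_dump(isPathAllowedSketch($rules, '/aaa/public/x.html')); // bool(true)
var_dump(isPathAllowedSketch($rules, '/aaa/private.html'));  // bool(false)
var_dump(isPathAllowedSketch($rules, '/b.html'));            // bool(true)
```

Because the first match wins, the order of the lines matters: placing the broader `/aaa/` rule first would block `/aaa/public/` as well.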

ToDos

  • handle Crawl-Delay
  • sync with, or test a few patterns against, Google's robots.txt testing tool
    • https://www.google.com/webmasters/tools/robots-analysis-ac?hl=en&siteUrl=http://yourdomain
  • rewrite with PHPPEG (the current preg*-based parser is difficult to work with)
  • more tests and ongoing refactoring

USAGE

```php
<?php
use Diggin\RobotRules\Accepter\TxtAccepter;
use Diggin\RobotRules\Parser\TxtStringParser;

$robotstxt = <<<'ROBOTS'
# sample robots.txt
User-agent: YourCrawlerName
Disallow:

User-agent: *
Disallow: /aaa/ #comment
ROBOTS;

$accepter = new TxtAccepter;
$accepter->setRules(TxtStringParser::parse($robotstxt));

$accepter->setUserAgent('foo');
var_dump($accepter->isAllow('/aaa/')); // false
var_dump($accepter->isAllow('/b.html')); // true

$accepter->setUserAgent('YourCrawlerName');
var_dump($accepter->isAllow('/aaa/')); // true
```

INSTALL

Diggin_RobotRules follows PSR-0, so register the namespace Diggin\RobotRules with your class loader.

To install via Composer: $ php composer.phar require diggin/diggin-robotrules "dev-master"
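
If you are not using Composer's generated autoloader, a PSR-0 class loader can be sketched as below. This is a minimal illustration under stated assumptions: the helper name `psr0PathSketch` and the base directory `path/to/diggin-robotrules/src` are hypothetical placeholders, so adjust them to wherever the library source actually lives.

```php
<?php
// Sketch of the PSR-0 class-name-to-path mapping. The function name and
// the base directory passed to it are hypothetical placeholders.
function psr0PathSketch($className, $baseDir)
{
    $className = ltrim($className, '\\');
    $path = '';
    $lastNsPos = strrpos($className, '\\');
    if ($lastNsPos !== false) {
        // Namespace separators become directory separators.
        $namespace = substr($className, 0, $lastNsPos);
        $className = substr($className, $lastNsPos + 1);
        $path = str_replace('\\', '/', $namespace) . '/';
    }
    // Underscores in the class name itself (not the namespace) also
    // become directory separators under PSR-0.
    $path .= str_replace('_', '/', $className) . '.php';
    return rtrim($baseDir, '/') . '/' . $path;
}

spl_autoload_register(function ($className) {
    // Handle only this library's namespace; leave the rest to other loaders.
    if (strpos(ltrim($className, '\\'), 'Diggin\\RobotRules\\') !== 0) {
        return;
    }
    // Hypothetical location of the library source.
    $file = psr0PathSketch($className, __DIR__ . '/path/to/diggin-robotrules/src');
    if (is_file($file)) {
        require $file;
    }
});

echo psr0PathSketch('Diggin\\RobotRules\\Parser\\TxtStringParser', '/lib'), "\n";
// prints /lib/Diggin/RobotRules/Parser/TxtStringParser.php
```

When the package is installed through Composer, including `vendor/autoload.php` in your script normally makes a hand-written loader like this unnecessary.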

License

Diggin_RobotRules is licensed under the New BSD License.

References and alternatives in other languages

  • Perl
    • http://search.cpan.org/~dmaki/Gungho-0.09008/docs/ja/Gungho/Component/RobotRules.pod
    • http://homepage3.nifty.com/hippo2000/perltips/WWW/RobotRules.html
  • Python
    • http://docs.python.org/library/robotparser.html
    • http://svn.python.org/projects/python/trunk/Lib/robotparser.py
  • Ruby
    • https://github.com/knu/webrobots

The Versions

Every version carries the description "parser/handler for Robots Exclusion Protocol (robots.txt and more)", is licensed under BSD-3-Clause, and requires php >=5.3.4. No development requirements are listed.

  • dev-master (9999999-dev), 13/01 2018. Keywords: crawler, spider, robots, robots.txt, robotstxt
  • dev-php71 (dev-php71), 18/11 2016. Keywords: crawler, spider, robots, robots.txt, robotstxt
  • dev-develop (dev-develop), 26/02 2016. Keywords: crawler, spider, robots, robots.txt, robotstxt
  • 0.10.0 (0.10.0.0), 26/02 2016. Keywords: crawler, spider, robots, robots.txt, robotstxt
  • v0.9.0 (0.9.0.0), 23/02 2016. Keywords: crawler, spider, robots, robots.txt, robotstxt
  • v0.8.1 (0.8.1.0), 21/06 2014. Keywords: crawler, spider, robots, robotstxt
  • v0.8.0 (0.8.0.0), 19/10 2013. Keywords: crawler, spider, robots, robotstxt
  • v0.1.0 (0.1.0.0), 15/06 2012. Keywords: crawler, spider, robots, robots.txt