library jmajors/robotstxt

A small package for parsing websites' robots.txt files

  • Friday, March 31, 2017
  • by jasonrmajors
  • Repository
  • 1 Watchers
  • 2 Stars
  • 14 Installations
  • PHP
  • 0 Dependents
  • 0 Suggesters
  • 0 Forks
  • 0 Open issues
  • 9 Versions
  • 0% Grown

The README.md

Robotstxt Parser

This is a small package that makes parsing robots.txt rules easier. URL matching follows the rules outlined in Google's Webmasters guide.
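For instance, take a (made-up) robots.txt at https://www.example.com/robots.txt, where, per Google's rules, * acts as a wildcard and $ anchors the end of the URL:

User-agent: *
Disallow: /private/
Disallow: /*.pdf$

Assuming the package matches exactly as Google describes, checks against those hypothetical rules should come out like this:

$robots = new Robots\RobotsTxt();
$robots->isAllowed("https://www.example.com/private/notes"); // false: /private/ prefix match
$robots->isAllowed("https://www.example.com/report.pdf");    // false: matches /*.pdf$
$robots->isAllowed("https://www.example.com/public/page");   // true: no Disallow rule matches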

Quick example:

// basic usage
$robots  = new Robots\RobotsTxt();
$allowed = $robots->isAllowed("https://www.example.com/some/path"); // true
$allowed = $robots->isAllowed("https://www.another.com/example");   // false

Setup

Install via Composer:

$ composer require jmajors/robotstxt

Make sure Composer's autoloader is included in your project:

require __DIR__ . '/vendor/autoload.php';

That's it.

Usage

This package is built mainly for checking whether a crawler is allowed to visit a particular URL. The isAllowed(string $url) method returns true if the URL's path is not covered by the robots.txt Disallow rules (i.e. you're free to crawl), and false if the path is disallowed (no crawling!). Here's an example:

<?php
use Robots\RobotsTxt;

$robotsTxt = new RobotsTxt();
$allowed = $robotsTxt->isAllowed("https://www.example.com/this/is/fine"); // returns true

Additionally, setUserAgent($userAgent) lets you specify a User-Agent in the request header:

$robotsTxt = new RobotsTxt();
$userAgent = 'RobotsTxtBot/1.0; (+https://github.com/jasonmajors/robotstxt)';
// set a user agent
$robotsTxt->setUserAgent($userAgent);
$allowed = $robotsTxt->isAllowed("https://www.example.com/not/sure/if/allowed");

// Alternatively...
$allowed = $robotsTxt->setUserAgent($userAgent)->isAllowed("https://www.someplace.com/a/path");

If for some reason there's no robots.txt file at the root of the domain, a MissingRobotsTxtException will be thrown.

<?php
// Typical usage
use Robots\RobotsTxt;
use Robots\Exceptions\MissingRobotsTxtException;
...

$robotsTxt = new RobotsTxt();
$userAgent = 'RobotsTxtBot/1.0; (+https://github.com/jasonmajors/robotstxt)';

try {
    $allowed = $robotsTxt->setUserAgent($userAgent)->isAllowed("https://www.example.com/some/path");
} catch (MissingRobotsTxtException $e) {
    $error = $e->getMessage();
    // Handle the error
}

Further, getDisallowed will return an array of the disallowed paths for User-Agent: *:

$robots     = new RobotsTxt();
$disallowed = $robots->getDisallowed("https://www.example.com");
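The returned paths can then be inspected directly. A small usage sketch, assuming each entry is the path string as written after Disallow::

$robots     = new Robots\RobotsTxt();
$disallowed = $robots->getDisallowed("https://www.example.com");

// Print each rule, e.g. "/private/", "/tmp/"
foreach ($disallowed as $path) {
    echo "Disallowed: {$path}\n";
}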

TODOs

  • Add ability to check disallowed paths based on user agent (see the sketch after this list)
  • Return a list of user agents in the file
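
Until the first item lands, here's a rough, standalone sketch (not part of this package; the function name is hypothetical) of pulling Disallow rules for a specific user agent out of raw robots.txt text. It ignores groups with multiple consecutive User-agent lines, Allow rules, and other edge cases:

<?php
// Naive helper: collect Disallow values from the groups whose
// User-agent line matches $agent (or is the * wildcard).
function disallowedFor(string $contents, string $agent): array
{
    $rules   = [];
    $applies = false;
    foreach (preg_split('/\R/', $contents) as $line) {
        $line = trim(preg_replace('/#.*/', '', $line)); // strip comments
        if ($line === '') {
            continue;
        }
        if (stripos($line, 'User-agent:') === 0) {
            $value   = trim(substr($line, strlen('User-agent:')));
            $applies = ($value === '*' || stripos($agent, $value) !== false);
        } elseif ($applies && stripos($line, 'Disallow:') === 0) {
            $rules[] = trim(substr($line, strlen('Disallow:')));
        }
    }
    return $rules;
}

// Usage: $rules = disallowedFor(file_get_contents('https://www.example.com/robots.txt'), 'RobotsTxtBot');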

The Versions

Every release carries the description "A small package for parsing websites' robots.txt files", the MIT license, and requires php >=5.4.0 (v1.0 lists no license or requirements). Keywords: php, robots.txt.

  Version      Normalized     Released
  dev-master   9999999-dev    31/03 2017
  v1.7.1       1.7.1.0        31/03 2017
  v1.6         1.6.0.0        22/02 2017
  v1.5         1.5.0.0        05/02 2017
  v1.4         1.4.0.0        03/02 2017
  v1.3         1.3.0.0        02/02 2017
  v1.2         1.2.0.0        02/02 2017
  v1.1         1.1.0.0        30/01 2017
  v1.0         1.0.0.0        30/01 2017