
library robots-txt-processor

robots.txt filter and tester for untrusted source.


ranvis/robots-txt-processor


  • Monday, July 2, 2018
  • by ranvis
  • Repository
  • 1 Watchers
  • 0 Stars
  • 11 Installations
  • PHP
  • 1 Dependents
  • 0 Suggesters
  • 0 Forks
  • 0 Open issues
  • 3 Versions
  • 0 % Grown

The README.md

robots.txt filter and tester for untrusted source

Introduction

robots-txt-processor is a tester with a filter for robots.txt data found in the wild on the Internet. The filter can remove:

  • Rules for other User-agents
  • Rules that are too long
  • Paths that contain too many wildcards
  • Comments (inline or whole-line)

It can also, for example:

  • Parse line continuations (LWS), although they are not widely used
  • Recognize a misspelled Useragent directive
  • Prepend a missing leading slash to a path

The Tester module can process Allow/Disallow directives containing the */$ meta characters. Alternatively, you can use the Filter module alone and feed its output to another tester module as a single User-agent: * record, along with any non-group records (e.g. Sitemap).

License

BSD 2-Clause License

Installation

composer require "ranvis/robots-txt-processor:^1.0"

Example Usage

require_once __DIR__ . '/vendor/autoload.php';

$source = "User-agent: *\nDisallow: /path";
$userAgents = 'MyBotIdentifier';
$tester = new \Ranvis\RobotsTxt\Tester();
$tester->setSource($source, $userAgents);
var_dump($tester->isAllowed('/path.html')); // false

Tester->setSource(string) is actually shorthand for Tester->setSource(RecordSet):

use Ranvis\RobotsTxt;

$source = "User-agent: *\nDisallow: /path";
$userAgents = 'MyBotIdentifier';
$filter = new RobotsTxt\Filter();
$filter->setUserAgents($userAgents);
$recordSet = $filter->getRecordSet($source);
$tester = new RobotsTxt\Tester();
$tester->setSource($recordSet);
var_dump($tester->isAllowed('/path.php')); // false

See EXAMPLES.md for more examples, including filter-only usage.

Implementation Notes

Setting user-agents

When setting the source, you can optionally pass user-agents, as in the examples above. If you pass a user-agent string or an array of strings, the subsequent Filter pass filters out records for unspecified user-agents (aside from *). While Tester->isAllowed() also accepts user-agents, it is faster to filter once (with Filter->setUserAgents() or Tester->setSource($source, $userAgents)) and then call Tester->isAllowed() multiple times without specifying user-agents. When an array of user-agent strings is passed, a user-agent specified earlier takes precedence when testing.
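As a sketch of the faster pattern described above (assuming the package is installed via Composer; the bot names and rules are hypothetical):

```php
<?php
require_once __DIR__ . '/vendor/autoload.php';

use Ranvis\RobotsTxt;

// Two records; 'MyBot-Image' is listed first, so it takes precedence
// over 'MyBot' when testing.
$source = "User-agent: MyBot-Image\nDisallow: /images/\n\n"
    . "User-agent: MyBot\nDisallow: /private/";

$filter = new RobotsTxt\Filter();
$filter->setUserAgents(['MyBot-Image', 'MyBot']);
$recordSet = $filter->getRecordSet($source);

$tester = new RobotsTxt\Tester();
$tester->setSource($recordSet);

// Repeated calls without user-agents reuse the already-filtered record set.
var_dump($tester->isAllowed('/images/logo.png'));
var_dump($tester->isAllowed('/articles/index.html'));
```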

Record separator

This parser ignores blank lines. A new record starts at a User-agent line that follows group-member lines (i.e. Disallow/Allow).

Case sensitivity

The User-agent value and directive names like Disallow are case-insensitive. The Filter class normalizes directive names to First-character-uppercase form.
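For illustration, a minimal sketch (bot name hypothetical; assumes the package is installed) relying on this case-insensitivity:

```php
<?php
require_once __DIR__ . '/vendor/autoload.php';

// Lowercase directive names and a lowercase User-agent value are still
// matched, per the case-insensitivity described above.
$source = "user-agent: mybot\ndisallow: /secret";

$tester = new \Ranvis\RobotsTxt\Tester();
$tester->setSource($source, 'MyBot');
var_dump($tester->isAllowed('/secret/file')); // false: the lowercase rule still applies
```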

Encoding conversion

The filter/tester themselves don't handle encoding conversion, because it isn't needed: if a remote robots.txt uses a non-Unicode (specifically, non-UTF-8) encoding, the URL paths should be in that encoding too. The filter/tester work safely with any character or percent-encoded sequence, including ones that would decode to invalid UTF-8. The exception is a remote robots.txt in a Unicode encoding with a BOM; if that ever happens, you will need to convert it to UTF-8 (without BOM) beforehand.

Features

See the features/behaviors table of the robots-txt-processor-test project.

Options

Options can be specified in the first argument of the constructors. Normally the default values should suffice to filter potentially offensive input while preserving the requested rules.

Tester class options

  • 'respectOrder' => false,

    If true, path rules are processed in their specified order. If false, longer paths are processed first, as Googlebot does.

  • 'ignoreForbidden' => false,

    If true, setResponseCode() with 401 Unauthorized or 403 Forbidden is treated as if no robots.txt existed, as Googlebot does, as opposed to the robotstxt.org spec.

  • 'escapedWildcard' => false,

    If true, %2A in a path line is treated as the wildcard *. Normally you don't want to set this to true for this class; see the Filter class options for more information.

Tester->setSource(string) internally instantiates a Filter with the initially passed options and calls Filter->getRecordSet(string).
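Options go in the constructor's first argument; a sketch using the two behavior flags above (values are illustrative, not recommendations):

```php
<?php
require_once __DIR__ . '/vendor/autoload.php';

$tester = new \Ranvis\RobotsTxt\Tester([
    'respectOrder' => true,    // rules in file order, not longest-path-first
    'ignoreForbidden' => true, // treat 401/403 as "no robots.txt", like Googlebot
]);
$tester->setResponseCode(403); // HTTP status of the robots.txt fetch
$tester->setSource("User-agent: *\nDisallow: /private/");
var_dump($tester->isAllowed('/private/page', '*'));
```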

Filter class options

  • 'maxRecords' => 1000,

    Maximum number of records (grouped rules) to parse. Any records thereafter are not kept. Don't set this too low, or the filter will give up before reaching your user-agents. This limit applies only to parsing; calling setUserAgents() limits which user-agents are kept.

Filter->getRecordSet(string) internally instantiates a FilterParser with the initially passed options.

FilterParser class options

  • 'maxLines' => 1000,

    Maximum number of lines to parse for each record (grouped or non-grouped). Any lines thereafter for the current record are not kept.

  • 'keepTrailingSpaces' => false,

    If false, trailing spaces (including tabs) on lines without a comment are trimmed. On lines with a comment, spaces before # are always trimmed. Retaining spaces is required by both the robotstxt.org and Google specs.

  • 'maxWildcards' => 10,

    Maximum number of non-repeated * characters to accept in a path. If a path contains more than this, the rule itself is ignored.

  • 'escapedWildcard' => true,

    If true, %2A in a path line is treated as the wildcard * and is subject to the maxWildcards limit. When using an external tester, don't set this to false unless you are sure your tester doesn't treat %2A as a wildcard (this package's tester does not), so that rules cannot circumvent the maxWildcards limit. (Testers listed as PeDecodeWildcard=yes in the feature test table should not change this flag.)

  • 'complementLeadingSlash' => true,

    If true and a path doesn't start with / or * (which must be a mistake), / is prepended.

  • 'pathMemberRegEx' => '/^(?:Dis)?Allow$/i',

    The value of a directive matching this regex is treated as a path, and settings like maxWildcards are applied to it.

FilterParser extends the Parser class.
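Since Filter forwards its options to FilterParser, parsing limits can be set where the Filter is constructed; a sketch with illustrative values:

```php
<?php
require_once __DIR__ . '/vendor/autoload.php';

use Ranvis\RobotsTxt;

$filter = new RobotsTxt\Filter([
    'maxRecords' => 100,              // Filter option
    'maxLines' => 200,                // FilterParser option
    'maxWildcards' => 5,              // drop rules with too many wildcards
    'complementLeadingSlash' => true, // "path" is stored as "/path"
]);
$recordSet = $filter->getRecordSet("User-agent: *\nDisallow: path");
echo (string)$recordSet; // RecordSet is string-convertible (see Interface)
```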

Parser class options

  • 'maxUserAgents' => 1000,

    Maximum number of user-agents to parse. Any user-agents thereafter are ignored, and any new grouped records thereafter are skipped.

  • 'maxDirectiveLength' => 32,

    Maximum number of characters for a directive name. Any directives longer than this are skipped. This must be at least 10 to parse the User-agent directive. Increase it if you need to keep custom directives with long names.

  • 'maxNameLength' => 200,

    Maximum number of characters for a User-agent value. Any user-agent names longer than this are truncated.

  • 'maxValueLength' => 2000,

    Maximum number of characters for a directive value. Any values longer than this are replaced with an -ignored- directive whose value contains the original value's length.

  • 'userAgentRegEx' => '/^User-?agent$/i',

    A directive matching this regex is treated as a User-agent directive.

Interface

  • new Tester(array $options = [])
  • Tester->setSource($source, $userAgents = null)
  • Tester->setResponseCode(int $code)
  • Tester->isAllowed(string $targetPath, $userAgents = null)
  • new Filter(array $options = [])
  • Filter->setUserAgents($userAgents, bool $fallback = true) : RecordSet
  • Filter->getRecordSet($source) : RecordSet
  • new Parser(array $options = [])
  • Parser->registerGroupDirective(string $directive)
  • Parser->getRecordIterator($it) : \Traversable
  • (string)RecordSet
  • RecordSet->extract($userAgents = null)
  • RecordSet->getRecord($userAgents = null, bool $dummy = true) : ?RecordSet
  • RecordSet->getNonGroupRecord(bool $dummy = true) : ?RecordSet
  • (string)Record
  • Record->getValue(string $directive) : ?string
  • Record->getValueIterator(string $directive) : \Traversable
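As a sketch of the record accessors above (assuming the package is installed; the listing gives getNonGroupRecord() a nullable return, so the result is checked before use):

```php
<?php
require_once __DIR__ . '/vendor/autoload.php';

use Ranvis\RobotsTxt;

$source = "User-agent: *\nDisallow: /tmp/\nSitemap: https://example.com/sitemap.xml";
$recordSet = (new RobotsTxt\Filter())->getRecordSet($source);

// Non-group directives such as Sitemap live outside the User-agent groups.
$nonGroup = $recordSet->getNonGroupRecord();
if ($nonGroup !== null) {
    echo (string)$nonGroup; // records are string-convertible per the Interface list
}
```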

The Versions

  • dev-master (02/07 2018): BSD-2-Clause, requires php >=7.0.0
  • 1.0.1 (11/06 2017): BSD-2-Clause, requires php >=7.0.0
  • 1.0.0 (29/09 2016): BSD-2-Clause, requires php >=7.0.0

by SATO Kentaro

parser filter tester robots.txt