itgalaxy/webcrawler-verifier

PHP library providing functionality to verify that user-agents are who they claim to be.

Wednesday, November 23, 2016
by evilebottnawi

The README.md

webcrawler-verifier

Webcrawler-Verifier is a PHP library that verifies robots really come from the operator they claim to be from, e.g. that Googlebot is actually coming from Google and not from some spoofer.

Installation

Install with Composer

If you're using Composer to manage dependencies, you can add the library with it:

composer require itgalaxy/webcrawler-verifier

or add it to the require section of your composer.json:

{
    "require": {
        "itgalaxy/webcrawler-verifier": ">=1.0.0"
    }
}

Usage

<?php
require_once 'vendor/autoload.php';

$userAgent = 'Some user agent';
$ip = '192.168.0.1';

$webcrawlerVerifier = new \WebcrawlerVerifier\WebcrawlerVerifier();
$verifiedStatus = $webcrawlerVerifier->verify(
    $userAgent, 
    $ip
);

if ($verifiedStatus === $webcrawlerVerifier::VERIFIED) {
    echo 'Good webcrawler';
} elseif ($verifiedStatus === $webcrawlerVerifier::UNVERIFIED) {
    echo 'Bad webcrawler';
} else {
    // Equivalent to `$verifiedStatus === $webcrawlerVerifier::UNKNOWN`
    echo 'Unknown: could not tell whether this is a good or bad webcrawler';
}

Or, using the request data provided by the web server:

<?php
// This file is generated by Composer
require_once 'vendor/autoload.php';

if (!empty($_SERVER['HTTP_USER_AGENT']) && !empty($_SERVER['REMOTE_ADDR'])) {
    $webcrawlerVerifier = new \WebcrawlerVerifier\WebcrawlerVerifier();
    $verifiedStatus = $webcrawlerVerifier->verify(
        $_SERVER['HTTP_USER_AGENT'], 
        $_SERVER['REMOTE_ADDR']
    );

    if ($verifiedStatus === $webcrawlerVerifier::VERIFIED) {
        echo 'Good webcrawler';
    } elseif ($verifiedStatus === $webcrawlerVerifier::UNVERIFIED) {
        echo 'Bad webcrawler';
    } else {
        // Equivalent to `$verifiedStatus === $webcrawlerVerifier::UNKNOWN`
        echo 'Unknown: could not tell whether this is a good or bad webcrawler';
    }
}

Built in crawler detection

By company

By webcrawler name

Coming soon

Contributions are welcome.

How it works

Step one is identification.

If the user-agent identifies as one of the bots you are checking for, it goes on to step two for verification. If not, no bot is reported.
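The identification step can be pictured as a case-insensitive substring match against a table of known crawler names. The following is only a minimal sketch: the function `identifyCrawler()` and the pattern table are hypothetical and are not taken from the library's code.

```php
<?php
// Hypothetical pattern table: substring to look for => crawler name.
// The table the library really uses may look different.
$crawlerPatterns = [
    'googlebot' => 'Googlebot',
    'bingbot'   => 'Bingbot',
];

/**
 * Returns the name of the first crawler whose pattern occurs in the
 * user-agent string (case-insensitive), or null if none matches.
 */
function identifyCrawler($userAgent, array $patterns)
{
    foreach ($patterns as $needle => $name) {
        if (stripos($userAgent, $needle) !== false) {
            return $name;
        }
    }

    // No known crawler claimed: nothing to verify in step two.
    return null;
}
```

A `null` result here corresponds to the case where verification is skipped, because the client never claimed to be a known crawler in the first place.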

Step two is verification.

The robot that was reported in the user-agent is verified by looking at the client's network address. The big operators work with a combination of reverse and forward DNS lookups. That's not a hack; it's the officially recommended way. A reverse DNS lookup resolves the IP to a hostname in the operator's domain, and a forward DNS lookup of that hostname points back to the same IP. This gives the crawler operators the freedom to change and add networks without risking being locked out of websites.
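The reverse-plus-forward DNS check can be sketched with PHP's built-in `gethostbyaddr()` and `gethostbynamel()`. This is an illustration under stated assumptions, not the library's implementation: the function name `isIpVerifiedByDns()` and its parameters are hypothetical, the lookups are injectable only so the logic can be exercised without network access, and `gethostbynamel()` returns IPv4 addresses only.

```php
<?php
/**
 * Hypothetical helper (not part of the library's API).
 *
 * $allowedSuffixes lists the operator's domains, e.g. ['googlebot.com'].
 * $reverse and $forward default to PHP's DNS functions and can be
 * replaced by stubs for testing.
 */
function isIpVerifiedByDns($ip, array $allowedSuffixes, $reverse = 'gethostbyaddr', $forward = 'gethostbynamel')
{
    // Reverse lookup. gethostbyaddr() returns the unmodified IP when the
    // lookup fails and false on malformed input.
    $hostname = call_user_func($reverse, $ip);
    if ($hostname === false || $hostname === $ip) {
        return false;
    }
    $hostname = strtolower(rtrim($hostname, '.'));

    // The hostname must end in one of the operator's domains.
    $inOperatorDomain = false;
    foreach ($allowedSuffixes as $suffix) {
        $suffix = '.' . ltrim(strtolower($suffix), '.');
        if (substr($hostname, -strlen($suffix)) === $suffix) {
            $inOperatorDomain = true;
            break;
        }
    }
    if (!$inOperatorDomain) {
        return false;
    }

    // Forward lookup. The hostname must resolve back to the original IP,
    // otherwise anyone controlling their own reverse DNS zone could spoof it.
    $addresses = call_user_func($forward, $hostname);

    return is_array($addresses) && in_array($ip, $addresses, true);
}
```

With the default arguments, a call such as `isIpVerifiedByDns($_SERVER['REMOTE_ADDR'], ['googlebot.com', 'google.com'])` performs two real DNS queries, so caching the result per IP is worth considering.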

The other method is to maintain lists of IP addresses. This is used for operators that don't officially endorse the first method. It can also be used in combination with the first method to avoid the one-time cost of the DNS verification.
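The list-based method boils down to checking whether the client IP falls inside one of the operator's published ranges. Below is a minimal IPv4-only sketch in plain PHP, assuming a 64-bit PHP build; the function names are hypothetical, the ranges are documentation-only example networks, and the library may well do this differently (for instance through its `s1lentium/iptools` dependency).

```php
<?php
/**
 * Hypothetical helper: true if $ip lies inside $cidr.
 * $cidr is either "a.b.c.d/n" or a single address "a.b.c.d".
 */
function ipv4InCidr($ip, $cidr)
{
    $parts = explode('/', $cidr, 2);
    $network = $parts[0];
    $bits = isset($parts[1]) ? $parts[1] : '32';

    // Reject prefix lengths that are not plain numbers between 0 and 32.
    if (!preg_match('/^\d{1,2}$/', $bits) || (int) $bits > 32) {
        return false;
    }

    $ipLong = ip2long($ip);
    $networkLong = ip2long($network);
    if ($ipLong === false || $networkLong === false) {
        return false;
    }

    // Build the netmask, e.g. /24 => 0xFFFFFF00.
    $mask = (~0 << (32 - (int) $bits)) & 0xFFFFFFFF;

    return ($ipLong & $mask) === ($networkLong & $mask);
}

/** Hypothetical helper: true if $ip lies inside any of the given ranges. */
function ipv4InAnyRange($ip, array $ranges)
{
    foreach ($ranges as $cidr) {
        if (ipv4InCidr($ip, $cidr)) {
            return true;
        }
    }

    return false;
}

// Example ranges only (reserved documentation networks), not real crawler ranges.
$exampleRanges = ['192.0.2.0/24', '198.51.100.42'];
```

Because such a check needs no network round trip, it can also run before the DNS verification as a cheap first pass.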

Except where it's required (for the second method), this project does not maintain IP lists. The ones that can currently be found on the internet all seem outdated. And that's exactly the problem: they will always lag behind the IP ranges that the operators use.

Contribution

Don't hesitate to create a pull request. Every contribution is appreciated.

Changelog

License

The Versions

23/11 2016

dev-master

9999999-dev

PHP library providing functionality to verify that user-agents are who they claim to be.

Sources Download

MIT

The Requires

php ^5.6 || ^7.0
s1lentium/iptools ~1.1.0

The Development Requires

by Itgalaxy

validation crawler bots webcrawler verifier

31/10 2016

2.1.0

2.1.0.0

PHP library providing functionality to verify that user-agents are who they claim to be.

Sources Download

MIT

The Requires

php ^5.6 || ^7.0
s1lentium/iptools ~1.1.0

The Development Requires

by Itgalaxy

validation crawler bots webcrawler verifier

28/10 2016

2.0.0

2.0.0.0

PHP library providing functionality to verify that user-agents are who they claim to be.

Sources Download

MIT

The Requires

php ^5.6 || ^7.0
s1lentium/iptools ~1.1.0

The Development Requires

by Itgalaxy

validation crawler bots webcrawler verifier

27/10 2016

1.2.0

1.2.0.0

PHP library providing functionality to verify that user-agents are who they claim to be.

Sources Download

MIT

The Requires

php ^5.6 || ^7.0
s1lentium/iptools ~1.1.0

The Development Requires

by Itgalaxy

validation crawler bots webcrawler verifier

24/10 2016

1.1.0

1.1.0.0

PHP library providing functionality to verify that user-agents are who they claim to be.

Sources Download

MIT

The Requires

php ^5.6 || ^7.0
s1lentium/iptools ~1.1.0

The Development Requires

by Itgalaxy

validation crawler bots webcrawler verifier

20/10 2016

1.0.0

1.0.0.0

PHP library providing functionality to verify that user-agents are who they claim to be.

Sources Download

MIT

The Requires

php ^5.6 || ^7.0

The Development Requires

by Itgalaxy

validation crawler bots webcrawler verifier
