2017 © Pedro Peláez
 

library hquery

An extremely fast web scraper that parses megabytes of HTML in a blink of an eye. No dependencies. PHP5+

duzun/hquery

  • Wednesday, July 18, 2018
  • by duzun
  • Repository
  • 15 Watchers
  • 159 Stars
  • 9,942 Installations
  • PHP
  • 0 Dependents
  • 0 Suggesters
  • 33 Forks
  • 8 Open issues
  • 32 Versions
  • 20 % Grown

The README.md

hQuery.php

An extremely fast and efficient web scraper that can parse megabytes of invalid HTML in a blink of an eye.

You can use the familiar jQuery/CSS selector syntax to easily find the data you need.

In my unit tests, I require it to be at least 10 times faster than Symfony's DOMCrawler on a 3 MB HTML document. In practice, according to my own modest tests, it is two to three orders of magnitude faster than DOMCrawler in some cases, especially when selecting thousands of elements, and on average it uses half as much RAM.

See tests/README.md.

API Documentation

💡 Features

  • Very fast parsing and lookup
  • Parses broken HTML
  • jQuery-like style of DOM traversal
  • Low memory usage
  • Can handle big HTML documents (I have tested up to 20 MB, but the limit is the amount of RAM you have)
  • Doesn't require cURL to be installed and automatically handles redirects (see hQuery::fromUrl())
  • Caches response for multiple processing tasks
  • PSR-7 friendly (see hQuery::fromHTML($message))
  • PHP 5.3+
  • No dependencies

🛠 Install

Just add this folder to your project, include_once 'hquery.php';, and you are ready to hQuery.

Alternatively, run composer require duzun/hquery,

or use npm install hquery.php and then require_once 'node_modules/hquery.php/hquery.php';.

⚙ Usage

Basic setup:

// Optionally use namespaces
use duzun\hQuery;

// Either use composer, or include this file:
include_once '/path/to/libs/hquery.php';

// Set the cache path - must be a writable folder
// If not set, hQuery::fromUrl() would make a new request on each call
hQuery::$cache_path = "/path/to/cache";

// Time to keep request data in cache, seconds
// A value of 0 disables cache
hQuery::$cache_expires = 3600; // default one hour
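
Since the cache path must be a writable folder, it can help to verify that before assigning it. The following is only a minimal sketch and not part of hQuery: the helper name ensure_cache_dir() and the 'hquery-cache' folder under the system temp directory are assumptions made for illustration.

```php
<?php
// Hypothetical helper (not part of hQuery): make sure a cache folder
// exists and is writable before handing it to hQuery::$cache_path.
function ensure_cache_dir($dir)
{
    // Create the folder (and any missing parents) if it does not exist yet.
    if (!is_dir($dir) && !@mkdir($dir, 0775, true) && !is_dir($dir)) {
        return false;
    }
    return is_writable($dir) ? $dir : false;
}

// Assumed location: a subfolder of the system temp directory.
$cacheDir = ensure_cache_dir(sys_get_temp_dir() . '/hquery-cache');

// Only touch hQuery if the library has been loaded.
if ($cacheDir !== false && class_exists('duzun\hQuery')) {
    \duzun\hQuery::$cache_path    = $cacheDir;
    \duzun\hQuery::$cache_expires = 3600; // seconds
}
```

If the helper returns false, leaving the cache path unset simply means every call makes a fresh request, as described in the comments above.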

I would recommend using php-http/cache-plugin with a PSR-7 client for better flexibility.

Load HTML from a file

hQuery::fromFile( string $filename, boolean $use_include_path = false, resource $context = NULL )
// Local
$doc = hQuery::fromFile('/path/to/filesystem/doc.html');

// Remote
$doc = hQuery::fromFile('https://example.com/', false, $context);

Where $context is created with stream_context_create().

For an example of using $context to make an HTTP request through a proxy, see #26.
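
As a rough illustration of the proxy case, the sketch below builds a context with PHP's standard HTTP context options. The proxy address, timeout, and header are placeholder values chosen for the example, not values from the project or from issue #26.

```php
<?php
// Sketch: build a stream context that routes HTTP requests through a proxy.
// The proxy address below is a placeholder, not a real server.
$context = stream_context_create(array(
    'http' => array(
        'proxy'           => 'tcp://127.0.0.1:8080', // assumed proxy address
        'request_fulluri' => true, // send the full URI in the request line, as many proxies expect
        'timeout'         => 7,    // read timeout, in seconds
        'header'          => "Accept: text/html\r\n",
    ),
));

// With hQuery loaded, the context is passed as the third argument:
// $doc = hQuery::fromFile('https://example.com/', false, $context);
```

The hQuery call is left commented out so the snippet runs without the library; uncomment it once hquery.php is included.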

Load HTML from a string

hQuery::fromHTML( string $html, string $url = NULL )
$doc = hQuery::fromHTML('<html><head><title>Sample HTML Doc</title><body>Contents...</body></html>');

// Set base_url, in case the document is loaded from local source.
// Note: The base_url property is used to retrieve absolute URLs from relative ones.
$doc->base_url = 'http://desired-host.net/path';

Load a remote HTML document

hQuery::fromUrl( string $url, array $headers = NULL, array|string $body = NULL, array $options = NULL )
use duzun\hQuery;

// GET the document
$doc = hQuery::fromUrl('http://example.com/someDoc.html', ['Accept' => 'text/html,application/xhtml+xml;q=0.9,*/*;q=0.8']);

var_dump($doc->headers); // See response headers
var_dump(hQuery::$last_http_result); // See response details of last request

// with POST
$doc = hQuery::fromUrl(
    'http://example.com/someDoc.html', // url
    ['Accept' => 'text/html,application/xhtml+xml;q=0.9,*/*;q=0.8'], // headers
    ['username' => 'Me', 'fullname' => 'Just Me'], // request body - could be a string as well
    ['method' => 'POST', 'timeout' => 7, 'redirect' => 7, 'decode' => 'gzip'] // options
);

For building advanced requests (POST, parameters, etc.) see hQuery::http_wr(), though I recommend using a specialized (PSR-7?) library for making requests and hQuery::fromHTML($html, $url=NULL) for processing the results. See Guzzle, for example.

PSR-7 example:

composer require php-http/message php-http/discovery php-http/curl-client

If you don't have the cURL PHP extension, just replace php-http/curl-client with php-http/socket-client in the above command.

use duzun\hQuery;

use Http\Discovery\HttpClientDiscovery;
use Http\Discovery\MessageFactoryDiscovery;

$client = HttpClientDiscovery::find();
$messageFactory = MessageFactoryDiscovery::find();

$request = $messageFactory->createRequest(
  'GET',
  'http://example.com/someDoc.html',
  ['Accept' => 'text/html,application/xhtml+xml;q=0.9,*/*;q=0.8']
);

$response = $client->sendRequest($request);

$doc = hQuery::fromHTML($response, $request->getUri());

Another option is to use stream_context_create() to create a $context, then call hQuery::fromFile($url, false, $context).

Processing the results

hQuery::find( string $sel, array|string $attr = NULL, hQuery\Node $ctx = NULL )
// Find all banners (images inside anchors)
$banners = $doc->find('a[href] > img[src]:parent');

// Extract links and images
$links  = array();
$images = array();
$titles = array();

// If the result of find() is not empty
// $banners is a collection of elements (hQuery\Element)
if ( $banners ) {

    // Iterate over the result
    foreach($banners as $pos => $a) {
        // $a->href property is the resolved $a->attr('href') relative to the
        // document's <base href=...>, if present, or $doc->baseURL.
        $links[$pos] = $a->href; // get absolute URL from href property
        $titles[$pos] = trim($a->text()); // strip all HTML tags and leave just text

        // Filter the result
        if ( !$a->hasClass('logo') ) {
            // $a->style property is the parsed $a->attr('style'), same as $a->attr('style', true)
            if ( strtolower($a->style['position']) == 'fixed' ) continue;

            $img = $a->find('img')[0]; // ArrayAccess
            if ( $img ) $images[$pos] = $img->src; // short for $img->attr('src', true)
        }
    }

    // If at least one element has the class .home
    if ( $banners->hasClass('home') ) {
        echo 'There is .home button!', PHP_EOL;

        // ArrayAccess for elements and properties.
        if ( $banners[0]['href'] == '/' ) {
            echo 'And it is the first one!';
        }
    }
}

// Read charset of the original document (internally it is converted to UTF-8)
$charset = $doc->charset;

// Get the size of the document ( strlen($html) )
$size = $doc->size;

// The URL at which the document was requested
$requestUri = $doc->href;

// <base href=...>, if present, or the origin + dir path part from $doc->href.
// The .href and .src props are resolved using this value.
$baseURL = $doc->baseURL;

Note: If the charset meta attribute has a wrong value, or the internal conversion fails for any other reason, hQuery ignores the error and continues processing with the original HTML, but registers an error message in $doc->html_errors['convert_encoding'].
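
One way to act on that note is to check for the recorded message after loading a document. This is a sketch only: encoding_error() is a hypothetical helper, not part of hQuery, and it assumes html_errors can be read as an array-like property, as the note suggests. A plain stdClass object stands in for a real document so the snippet runs on its own.

```php
<?php
// Hypothetical helper (not part of hQuery): return the charset-conversion
// error recorded on a document, or null when there is none.
function encoding_error($doc)
{
    $errors = $doc->html_errors;
    return isset($errors['convert_encoding']) ? $errors['convert_encoding'] : null;
}

// Stand-in object for demonstration; with hQuery, pass the real $doc instead.
$fake = new stdClass();
$fake->html_errors = array('convert_encoding' => 'example message');

$message = encoding_error($fake);
if ($message !== null) {
    echo 'Charset conversion failed: ', $message, PHP_EOL;
}
```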

🖧 Live Demo

On DUzun.Me

Many people ask for the source of my Live Demo page. Here it is:

view-source:https://duzun.me/playground/hquery

๐Ÿƒ Run the playground

You can easily run any of the examples/ on your local machine. All you need is PHP installed in your system. After you clone the repo with git clone https://github.com/duzun/hQuery.php.git, you have several options to start a web-server., (*19)

Option 1:
cd hQuery.php/examples
php -S localhost:8000

# open browser http://localhost:8000/

Option 2 (browser-sync):

This option starts a live-reload server and is good for playing with the code.

npm install
gulp

# open browser http://localhost:8080/

Option 3 (VSCode):

If you are using VSCode, simply open the project and run the debugger (F5).

🔧 TODO

  • Unit test everything
  • Document everything
  • ~~Cookie support~~ (implemented in mem for redirects)
  • ~~Improve selectors to be able to select by attributes~~
  • Add more selectors
  • Use HTTPlug internally

💖 Support my projects

I love Open Source. Whenever possible I share cool things with the world (check out NPM and GitHub).

If you like what I'm doing and this project helps you reduce development time, please consider:

  • ★ Star and Share the projects you like (and use)
  • ☕ Give me a cup of coffee - PayPal.me/duzuns (contact at duzun.me)
  • ₿ Send me some Bitcoin at this address: bitcoin:3MVaNQocuyRUzUNsTbmzQC8rPUQMC9qafa

The Versions

Every version is MIT-licensed and lists https://duzun.me/playground/hquery as its homepage. Dates are DD/MM YYYY.

  • dev-master (9999999-dev), 18/07 2018, requires php >=5.3
  • 2.1.0, 18/07 2018, requires php >=5.3
  • 2.0.3, 17/07 2018, requires php >=5.3
  • 2.0.2, 05/07 2018, requires php >=5.3
  • 2.0.1, 03/07 2018, requires php >=5.3
  • 1.7.4, 19/06 2018, requires php >=5.0.0
  • 1.7.3, 03/02 2018, requires php >=5.0.0
  • 1.7.2, 31/01 2018, requires php >=5.0.0
  • 1.7.1, 27/11 2017, requires php >=5.0.0
  • 1.7.0, 20/10 2017, requires php >=5.0.0
  • 1.6.2, 06/10 2017, requires php >=5.0.0
  • 1.6.1, 21/04 2017, requires php >=5.0.0
  • 1.6.0, 13/03 2017, requires php >=5.0.0
  • 1.5.3, 03/01 2017, requires php >=5.0.0
  • 1.5.2, 14/09 2016, requires php >=5.0.0
  • 1.5.1, 13/04 2016, requires php >=5.0.0
  • 1.5.0, 26/01 2016, requires php >=5.0.0
  • 1.4.3, 26/01 2016, requires php >=5.0.0
  • 1.4.2, 26/01 2016, requires php >=5.0.0
  • 1.4.1, 18/11 2015, requires php >=5.0.0
  • 1.4.0, 03/11 2015, requires php >=5.0.0
  • 1.3.0, 24/10 2015, requires php >=5.0.0
  • 1.2.5, 28/08 2015, requires php >=5.0.0
  • 1.2.4, 30/07 2015, requires php >=5.0.0
  • 1.2.3, 26/06 2015, requires php >=5.0.0
  • 1.2.2, 19/06 2015, requires php >=5.3.0
  • 1.2.1, 12/06 2015, requires php >=5.3.0
  • 1.2.0, 11/06 2015, requires php >=5.3.0
  • 1.1.3, 11/06 2015, requires php >=5.3.0
  • 1.1.2, 11/06 2015, requires php >=5.3.0
  • 1.1.1, 11/06 2015, requires php >=5.3.0
  • 1.1.0, 04/06 2015, requires php >=5.3.0

Description: "An extremely fast web scraper that parses megabytes of HTML in a blink of an eye. No dependencies. PHP5+" from 1.2.2 onward; "An extremely fast and efficient web scraper that parses megabytes of HTML in a blink of an eye" for 1.1.0 through 1.2.1.

Keywords: php xml html web scraping scraper crawling xhtml from 1.1.2 onward; php html web scraping scraper crawling for 1.1.0 and 1.1.1.