Requires
- php >=5.3.0
- guzzle/guzzle *
- illuminate/support 4.2.*
by Boy Hagemann
With this awesome Laravel 4 package you can:
- scrape any contents from a page
- follow specific link paths to get to that content
- use closures for optimum usability
Use Composer to install the package with all its dependencies:
composer require "boyhagemann/scrape:*"
Then add the service provider to the application config:
'Boyhagemann\Scrape\ScrapeServiceProvider'
Optionally, add the alias:
'Scraper' => 'Boyhagemann\Scrape\Facades\Container'
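For orientation, here is a sketch of where these two entries usually go. It assumes the stock Laravel 4 layout, where providers and aliases are registered in app/config/app.php; the surrounding entries are omitted and your file may differ.

```php
<?php
// app/config/app.php (fragment, assuming the default Laravel 4 layout)
return array(

    'providers' => array(
        // ... existing providers ...
        'Boyhagemann\Scrape\ScrapeServiceProvider',
    ),

    'aliases' => array(
        // ... existing aliases ...
        'Scraper' => 'Boyhagemann\Scrape\Facades\Container',
    ),

);
```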
Scrape uses two components for scraping pages: the container and the page.
The container is the single class you will use in most cases. It registers a name together with a closure that describes how to scrape the page.
A page is a template that is used for many URLs. An example would be a news item page: it can have many URLs, but it has one page template. With Scrape you define how to deal with the content on that page.
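To make the relationship between the two concrete, here is a minimal plain-PHP sketch of a name-to-closure registry. It is not the package's actual implementation: the class name SketchContainer is hypothetical, and the sketch hands the raw URL string to the closure instead of fetching the page and passing a crawler.

```php
<?php
// Hypothetical sketch, not the package's real Container class.
class SketchContainer
{
    // Page name => closure that knows how to handle that page template
    private $pages = array();

    // Register a closure under a page name
    public function add($name, $closure)
    {
        $this->pages[$name] = $closure;
    }

    // Run the closure registered for $name against one url.
    // Assumption: the url string is passed straight through here,
    // whereas the real package passes a crawler for the fetched page.
    public function scrape($name, $url)
    {
        if (!isset($this->pages[$name])) {
            throw new InvalidArgumentException("Unknown page: $name");
        }

        return call_user_func($this->pages[$name], $url);
    }
}

$container = new SketchContainer();

// One page template...
$container->add('news-item', function ($url) {
    return 'scraping ' . $url;
});

// ...reused for many urls
echo $container->scrape('news-item', 'http://example.com/news/1'), "\n";
echo $container->scrape('news-item', 'http://example.com/news/2'), "\n";
```

The point of the design is that the closure is written once per page template and then applied to every URL that uses that template.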
The first step is to add a page to the container.
Scraper::add('my-first-page', function($crawler) {
    // Start scraping...
});
If you don't use the facade, you can resolve the container yourself:
$container = App::make('Boyhagemann\Scrape\Container');
$container->add('my-second-page', function($crawler) {
    // Your magic scraping starts here...
});
Once all pages are defined, you are ready to scrape some content:
Scraper::scrape('my-first-page', 'http://theurl.toscrape.com');
Scrape uses the Symfony DomCrawler component to crawl the HTML from a URL. You can check out their documentation for full details. To get autocompletion in your IDE, it is useful to type hint the $crawler variable:
use Symfony\Component\DomCrawler\Crawler;
Scraper::add('page-name', function(Crawler $crawler) {
    // You have autocompletion on the $crawler instance...
});
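As a sketch of what the scraping itself can look like, the snippet below uses DomCrawler directly on a hard-coded HTML string. It assumes Composer's autoloader is available and that symfony/dom-crawler and symfony/css-selector are installed (filter() relies on the latter). The markup and selectors are made up for illustration, and the closure type hint assumes a DomCrawler version in which each() passes a Crawler per node.

```php
<?php
// Assumes a Composer install with symfony/dom-crawler and symfony/css-selector
require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Hypothetical markup standing in for a fetched page
$html = '<html><body>'
      . '<h1>Example title</h1>'
      . '<a class="your-link" href="/news/1">One</a>'
      . '<a class="your-link" href="/news/2">Two</a>'
      . '</body></html>';

$crawler = new Crawler($html);

// text() returns the text of the first matched node
$title = $crawler->filter('h1')->text();

// each() runs the closure once per matched node and collects the return values
$links = $crawler->filter('.your-link')->each(function (Crawler $node) {
    return $node->attr('href');
});

echo $title, "\n";
echo implode(', ', $links), "\n";
```

Inside a Scraper::add closure the same filter(), text(), attr() and each() calls apply to the $crawler argument.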
Most of the time, you don't know all the URLs to the desired content in advance. If you have thousands of URLs to crawl, it is impossible to manage them manually. You can use Scrape to follow links to get to the desired content.
You can crawl multiple pages one after another with great ease:
// Add a page that has links to your content
Scraper::add('page-1', function($crawler) {
    $crawler->filter('.your-link')->each(function($node) {
        Scraper::scrape('page-2', $node->attr('href'));
    });
});

// Add the page with all the content
Scraper::add('page-2', function($crawler) {
    $crawler->filter('.your-content')->each(function($node) {
        // Get the content and do a little dance!
    });
});
Chained processes can consume lots of time and resources, so don't go mental on chaining everything. You can use the Laravel Queue or a database in conjunction with cron jobs to manage all page crawls. This will save you from the nasty request timeouts!
Scraper::add('page-1', function($crawler) {
    $crawler->filter('.link')->each(function($node) {
        // Capture the url as a plain string; DOM nodes do not survive
        // the serialization a queued closure goes through
        $url = $node->attr('href');

        // Put the next crawl on a queue
        Queue::push(function($job) use ($url) {
            // Scrape this page!
            Scraper::scrape('page-2', $url);

            // Delete the queue job once finished
            $job->delete();
        });
    });
});