library php-goose

Readability / HTML Content / Article Extractor & Web Scraping library written in PHP

scotteh/php-goose

  • Sunday, July 1, 2018
  • by scotteh
  • Repository
  • 17 Watchers
  • 270 Stars
  • 18,918 Installations
  • PHP
  • 2 Dependents
  • 0 Suggesters
  • 78 Forks
  • 4 Open issues
  • 21 Versions
  • 8 % Grown

The README.md

PHP Goose - Article Extractor

Note

This repository has been archived as of 2023-09-05.

Intro

PHP Goose is a port of Goose, which was originally developed in Java and later converted to Scala by GravityLabs. Portions have also been ported from the Python port, python-goose. Its mission is to take any news article or article-type web page and extract not only the main body of the article but also all metadata and the most probable image candidate.

The extraction goal is to get the purest extraction from the beginning of the article, for servicing Flipboard/Pulse-type applications that need to show the first snippet of a web article along with an image.

Goose will try to extract the following information:

  • Main text of an article
  • Main image of article
  • Any YouTube/Vimeo movies embedded in article
  • Meta Description
  • Meta tags
  • Publish Date

The PHP version was rewritten by:

  • Andrew Scott

Requirements

  • PHP 7.1 or later
  • PSR-4 compatible autoloader

The older 0.x versions with PHP 5.5+ support are still available under releases.

Install

This library is designed to be installed via Composer.

Add the dependency to your project's composer.json.

{
  "require": {
    "scotteh/php-goose": "^1.0"
  }
}

Download the composer.phar

``` bash
curl -sS https://getcomposer.org/installer | php
```

Install the library.

``` bash
php composer.phar install
```

Autoloading

This library requires an autoloader. If you aren't already using one, you can include Composer's autoloader.

``` php
require('vendor/autoload.php');
```
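For readers curious what a PSR-4 compatible autoloader does, here is a minimal sketch. The function name `psr4Path` is hypothetical, and the `Goose\` prefix and `vendor/scotteh/php-goose/src` directory are assumptions that should be checked against the package's composer.json. The library's own dependencies (Guzzle, for example) would need the same treatment, which is why Composer's autoloader remains the practical choice.

```php
<?php
declare(strict_types=1);

/**
 * Map a fully qualified class name to a file path under PSR-4 rules.
 * Returns null when the class is outside the given namespace prefix.
 */
function psr4Path(string $prefix, string $baseDir, string $class): ?string
{
    // Normalise the prefix so it always ends with exactly one backslash.
    $prefix = trim($prefix, '\\') . '\\';
    if (strncmp($class, $prefix, strlen($prefix)) !== 0) {
        return null;
    }

    // The part after the prefix maps directly onto directories and a file name.
    $relative = substr($class, strlen($prefix));

    return rtrim($baseDir, '/') . '/' . str_replace('\\', '/', $relative) . '.php';
}

// Assumption: library sources live under this directory with the "Goose\" prefix.
spl_autoload_register(function (string $class): void {
    $file = psr4Path('Goose\\', __DIR__ . '/vendor/scotteh/php-goose/src', $class);
    if ($file !== null && is_file($file)) {
        require $file;
    }
});
```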


Usage

``` php
use \Goose\Client as GooseClient;

$goose = new GooseClient();
$article = $goose->extractContent('http://url.to/article');

$title = $article->getTitle();
$metaDescription = $article->getMetaDescription();
$metaKeywords = $article->getMetaKeywords();
$canonicalLink = $article->getCanonicalLink();
$domain = $article->getDomain();
$tags = $article->getTags();
$links = $article->getLinks();
$videos = $article->getVideos();
$articleText = $article->getCleanedArticleText();
$entities = $article->getPopularWords();
$image = $article->getTopImage();
$allImages = $article->getAllImages();
```
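Since the stated goal is to serve applications that show the first snippet of an article, the following sketch shows one way to shorten the extracted text. `firstSnippet` is a hypothetical helper, not part of PHP Goose, and it assumes the mbstring extension is available.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not part of PHP Goose): shorten extracted article
 * text to at most $maxLength characters plus an ellipsis, cutting at a
 * word boundary.
 */
function firstSnippet(string $text, int $maxLength = 200): string
{
    // Collapse runs of whitespace so line breaks do not count against the limit.
    $text = trim(preg_replace('/\s+/u', ' ', $text) ?? '');

    if (mb_strlen($text) <= $maxLength) {
        return $text;
    }

    // Cut to the limit, then back up to the last space to avoid splitting a word.
    $cut = mb_substr($text, 0, $maxLength);
    $lastSpace = mb_strrpos($cut, ' ');
    if ($lastSpace !== false && $lastSpace > 0) {
        $cut = mb_substr($cut, 0, $lastSpace);
    }

    // Drop trailing punctuation left over from the cut before adding the ellipsis.
    return rtrim($cut, ' ,;:') . '...';
}
```

It could be called as `firstSnippet((string) $article->getCleanedArticleText())`; the cast is a precaution in case no article text was extracted.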

Configuration

All configuration options are optional. The default (fallback) values are shown below.

``` php
use \Goose\Client as GooseClient;

$goose = new GooseClient([
    // Language - Selects common word dictionary
    //   Supported languages (ISO 639-1):
    //     ar, cs, da, de, en, es, fi, fr, hu, id, it, ja,
    //     ko, nb, nl, no, pl, pt, ru, sv, vi, zh
    'language' => 'en',
    // Minimum image size (bytes)
    'image_min_bytes' => 4500,
    // Maximum image size (bytes)
    'image_max_bytes' => 5242880,
    // Minimum image width (pixels)
    'image_min_width' => 120,
    // Minimum image height (pixels)
    'image_min_height' => 120,
    // Fetch best image
    'image_fetch_best' => true,
    // Fetch all images
    'image_fetch_all' => false,
    // Guzzle configuration - All values are passed directly to Guzzle
    //   See: http://guzzle.readthedocs.io/en/stable/request-options.html
    'browser' => [
        'timeout' => 60,
        'connect_timeout' => 30
    ]
]);
```
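When only a few options need changing, it can help to build the full array explicitly. The sketch below overlays overrides on the default values listed above. `buildGooseConfig` is a hypothetical helper, not part of PHP Goose, and whether the library itself merges a partial `browser` array with its defaults is not stated here, so the merge is done up front.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not part of PHP Goose): overlay caller-supplied
 * options on the documented default values.
 */
function buildGooseConfig(array $overrides = []): array
{
    $defaults = [
        'language' => 'en',
        'image_min_bytes' => 4500,
        'image_max_bytes' => 5242880,
        'image_min_width' => 120,
        'image_min_height' => 120,
        'image_fetch_best' => true,
        'image_fetch_all' => false,
        'browser' => [
            'timeout' => 60,
            'connect_timeout' => 30,
        ],
    ];

    // array_replace_recursive keeps nested defaults (e.g. 'connect_timeout')
    // when only one key of the 'browser' array is overridden.
    return array_replace_recursive($defaults, $overrides);
}

// Example: German dictionary and a shorter request timeout.
$config = buildGooseConfig([
    'language' => 'de',
    'browser' => ['timeout' => 10],
]);
```

The resulting array can then be passed to the client constructor in the same way as the array shown above, for example `new GooseClient($config)`.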

Licensing

PHP Goose is licensed by Gravity.com under the Apache 2.0 license, see the LICENSE file for more details.

The Versions

All versions are published from https://github.com/scotteh/php-goose under the Apache-2.0 license. Dates are given as DD/MM YYYY.

  • 01/07 2018: dev-master (9999999-dev)
  • 01/07 2018: 1.0.7
  • 24/04 2018: 1.0.6
  • 19/03 2018: 1.0.5
  • 04/03 2018: 1.0.4
  • 26/02 2018: dev-doc
  • 19/02 2018: 1.0.3
  • 14/02 2018: 1.0.2
  • 14/02 2018: 1.0.1
  • 14/02 2018: 1.0.0
  • 31/12 2017: 0.6.4
  • 19/12 2017: 0.6.3
  • 15/11 2017: dev-php7.1
  • 15/11 2017: 0.6.2
  • 14/10 2017: 0.6.1
  • 22/08 2017: 0.6.0
  • 12/01 2017: 0.5.0
  • 01/11 2016: 0.4.0
  • 06/05 2015: 0.3.0
  • 21/10 2014: 0.2.0
  • 03/10 2014: 0.1.0

Keywords: http, text, content, extractor, scraping, website, readability, scraper. Version 0.2.0 omits "readability", and version 0.1.0 lists only http, extractor, scraper. The descriptions of versions 0.1.0 and 0.2.0 also omit the leading "Readability /".