library page-meta

Get preview data on any URL from the internet!

layered/page-meta

  • Friday, July 20, 2018
  • by AndreiHere
  • Repository
  • 0 Watchers
  • 0 Stars
  • 6 Installations
  • PHP
  • 0 Dependents
  • 0 Suggesters
  • 1 Forks
  • 0 Open issues
  • 7 Versions
  • 200 % Grown

The README.md

Page Meta 🕵

Page Meta is a PHP library that can retrieve detailed info on any URL from the internet! It uses data from HTML meta tags and OpenGraph, with a fallback to detailed HTML scraping.

Highlights

  • Works for any valid URL on the internet!
  • Follows page redirects
  • Uses all scraping methods available: HTML tags, OpenGraph, Schema data

Potential use cases

  • Display Info Cards for links in an article
  • Rich preview for links in messaging apps
  • Extract info from a user-submitted URL

How to use

Installation

Add `layered/page-meta` as a dependency in your project's `composer.json` file:

``` bash
$ composer require layered/page-meta
```

#### Usage
Create a `UrlPreview` instance, then call the `loadUrl($url)` method with your URL as the first argument. Preview data is retrieved with the `get($section)` or `getAll()` methods:

``` php
require 'vendor/autoload.php';

$preview = new Layered\PageMeta\UrlPreview([
    'HTTP_USER_AGENT' => 'Mozilla/5.0 (compatible; YourApp/1.0; +https://example.com)'
]);
$preview->loadUrl('https://www.instagram.com/p/BbRyo_Kjqt1/');

$allPageData = $preview->getAll(); // contains all scraped data
$siteInfo = $preview->get('site'); // get general info about the website
```


#### Behind the scenes
The library downloads the HTML source of the URL you provided, then uses specialized scrapers to extract pieces of information. Core scrapers can be seen in `src/scrapers/`, and they extract general info for a page: title, author, description, page type, main image, etc.
If you would like to extract a new field, see the [Extending the library](#extending-the-library) section.

The User Agent or extra headers can make a big difference when downloading HTML from a website. Some websites forbid scraping and hide their content when they detect a tool like this one. Make sure to read their dev docs & TOS.

The default User Agent is blocked on sites like Twitter, Instagram, Facebook and others. A workaround is to use this one (thanks for the tip [PVGrad](https://github.com/LayeredStudio/page-meta/issues/2)):
`'HTTP_USER_AGENT' => 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'`

#### Returned data
Returned data will be an `Array` with the following format:

``` json
{
    "site": {
        "secure": true,
        "url": "https:\/\/www.instagram.com",
        "icon": "https:\/\/www.instagram.com\/static\/images\/ico\/favicon-192.png\/b407fa101800.png",
        "language": "en",
        "responsive": true,
        "name": "Instagram"
    },
    "page": {
        "type": "photo",
        "url": "https:\/\/www.instagram.com\/p\/BbRyo_Kjqt1\/",
        "title": "GitHub on Instagram",
        "description": "There\u2019s still time to join the #GitHubGameOff and build a game inspired by throwbacks. Get started\u2026",
        "image": {
            "url": "https:\/\/scontent-mad1-1.cdninstagram.com\/vp\/73b1790d77548031327e64ee83196706\/5B4AD567\/t51.2885-15\/e35\/23421974_1768724519826754_3855913942043852800_n.jpg"
        }
    },
    "author": {
        "name": "GitHub",
        "handle": "@github",
        "url": "https:\/\/www.instagram.com\/github\/"
    },
    "app_links": {
        "ios": {
            "url": "nflx:\/\/www.netflix.com\/title\/80014749",
            "app_store_id": "363590051",
            "app_name": "Netflix",
            "store_url": "https:\/\/itunes.apple.com\/us\/app\/Netflix\/id363590051"
        },
        "android": {
            "url": "nflx:\/\/www.netflix.com\/title\/80014749",
            "package": "com.netflix.mediaclient",
            "app_name": "Netflix",
            "store_url": "https:\/\/play.google.com\/store\/apps\/details?id=com.netflix.mediaclient"
        }
    }
}
```

See [`UrlPreview::getAll()`](#getall-array) for info on each returned field.
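
One of the use cases listed earlier is displaying info cards for links. The sketch below shows one way an array in the format above could be turned into a small HTML card. `renderLinkCard()` and the `link-card` class name are hypothetical and not part of the library; the sample array is trimmed from the example above and stands in for the result of `$preview->getAll()`.

``` php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not part of page-meta): renders a minimal HTML link
 * card from an array shaped like the documented preview data.
 */
function renderLinkCard(array $data): string
{
    $page = $data['page'] ?? [];
    $site = $data['site'] ?? [];

    // Escape every scraped value: it comes from an untrusted remote page.
    $esc = static function ($value): string {
        return htmlspecialchars((string) $value, ENT_QUOTES, 'UTF-8');
    };

    // Fall back to site-level data when page-level fields are missing.
    $url   = $page['url'] ?? $site['url'] ?? '#';
    $title = $page['title'] ?? $site['name'] ?? $url;

    $html = '<a class="link-card" href="' . $esc($url) . '">';
    if (isset($page['image']['url'])) {
        $html .= '<img src="' . $esc($page['image']['url']) . '" alt="">';
    }
    $html .= '<strong>' . $esc($title) . '</strong>';
    if (!empty($page['description'])) {
        $html .= '<p>' . $esc($page['description']) . '</p>';
    }
    if (!empty($site['name'])) {
        $html .= '<small>' . $esc($site['name']) . '</small>';
    }

    return $html . '</a>';
}

// Sample data; in real use this would come from $preview->getAll().
$sample = [
    'site' => ['name' => 'Instagram', 'url' => 'https://www.instagram.com'],
    'page' => [
        'url'   => 'https://www.instagram.com/p/BbRyo_Kjqt1/',
        'title' => 'GitHub on Instagram',
    ],
];

echo renderLinkCard($sample), PHP_EOL;
```

Escaping alone does not stop a `javascript:` URL from ending up in `href`, so a production version would probably also want to restrict links to `http` and `https`.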

## Public API
The `UrlPreview` class provides the following public methods:

#### `__construct(array $headers): UrlPreview`
Create the `UrlPreview` instance. Optionally pass extra headers to send when requesting the page URL.

#### `loadUrl(string $url): UrlPreview`
Load and start the scrape process for any valid URL

#### `getAll(): array`
Get all data scraped from page

**Return:** `Array` with scraped data in the following format:
- `site` - info about the website
  - `url` - main site URL
  - `name` - site name, ex: 'Instagram' or 'Medium'
  - `secure` - Boolean true|false. `True` if the site is served over HTTPS
  - `responsive` - Boolean true|false. `True` if site has `viewport` meta tag present. Basic check for responsiveness
  - `icon` - site icon
  - `language` - ISO 639-1 language code, ex: `en`, `es`
- `page` - info about the page at current URL
  - `type` - page type, ex: `website`, `article`, `profile`, `video`, etc
  - `url` - canonical URL for the page
  - `title` - page title
  - `description` - page description
  - `image` - `Array` containing image info, if present:
    - `url` - image URL
    - `width` - image width
    - `height` - image height
  - `video` - `Array` containing video info, if found on page:
    - `url` - video URL
    - `width` - video width
    - `height` - video height
- `author` - info about the content author, ex:
  - `name` - Author's name on a blog, person's name on social network sites
  - `handle` - Social media site username
  - `url` - Author URL for more articles or Profile URL on social network sites
- `app_links` - `Array` containing apps linked to page, like:
  - `ios` - iOS app
    - `url` - link for in-app action, ex: 'nflx://www.netflix.com/title/80014749'
    - `app_store_id` - Apple AppStore app ID
    - `app_name` - name of the app
    - `store_url` - link to installable app
  - `android` - Android app
    - `url` - link for in-app action, ex: 'nflx://www.netflix.com/title/80014749'
    - `package` - Android PlayStore app ID
    - `app_name` - name of the app
    - `store_url` - link to installable app
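
Several of the fields above are only present when the page provides them (`image`, `video`, `app_links`), so reading nested keys directly can trigger undefined-index notices. A minimal sketch of a defensive accessor follows; `previewValue()` and its dot-separated path syntax are hypothetical conveniences, not something the library ships.

``` php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not part of page-meta): reads a nested value such as
 * "page.image.url" from preview data, returning $default when any level of
 * the path is missing.
 */
function previewValue(array $data, string $path, $default = null)
{
    $current = $data;
    foreach (explode('.', $path) as $key) {
        if (!is_array($current) || !array_key_exists($key, $current)) {
            return $default;
        }
        $current = $current[$key];
    }

    return $current;
}

// Sample data in the documented shape, without video or app links.
$data = [
    'site' => ['name' => 'Instagram', 'secure' => true],
    'page' => ['type' => 'photo', 'image' => ['url' => 'https://example.com/photo.jpg']],
];

var_dump(previewValue($data, 'page.image.url'));        // the image URL
var_dump(previewValue($data, 'page.video.url'));        // null, no video in this data
var_dump(previewValue($data, 'app_links.ios.url', '')); // falls back to the '' default
```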

#### `get(string $section): array`
Get data from one scraped section: `site`, `page`, `author` or `app_links`

**Return:** `Array` with section scraped data. See [`UrlPreview::getAll()`](#getall-array) for data format

#### `addListener(string $eventName, callable $listener, int $priority = 0): UrlPreview`
Attach an event listener to `UrlPreview` for the data processing or scrape process. Arguments:
- `$eventName` - on which event to listen. Available:
  - `page.scrape` - fired when the scraping process starts
  - `data.filter` - fired when data is requested by the `get()` or `getAll()` methods
- `$listener` - a callable reference, which will get the `$event` parameter with available data
- `$priority` - order in which the callable should be executed
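
To illustrate how `$priority` can order listeners, here is a self-contained sketch. `TinyDispatcher` is a hypothetical stand-in written for this example and is not the library's code. It assumes the convention documented for Symfony's EventDispatcher, where a higher priority runs earlier; whether `UrlPreview::addListener()` behaves the same way is an assumption based on the Symfony `Event` class used in the extension example.

``` php
<?php
declare(strict_types=1);

/**
 * Hypothetical stand-in for illustration only. Listeners registered with a
 * higher priority are called before those with a lower one.
 */
final class TinyDispatcher
{
    /** @var array<string, array<int, callable[]>> listeners grouped by event, then priority */
    private $listeners = [];

    public function addListener(string $eventName, callable $listener, int $priority = 0): self
    {
        $this->listeners[$eventName][$priority][] = $listener;

        return $this;
    }

    public function dispatch(string $eventName, array $data): array
    {
        $byPriority = $this->listeners[$eventName] ?? [];
        krsort($byPriority); // highest priority first

        foreach ($byPriority as $listeners) {
            foreach ($listeners as $listener) {
                // Each listener receives the data and returns the updated copy.
                $data = $listener($data);
            }
        }

        return $data;
    }
}

$dispatcher = new TinyDispatcher();
$dispatcher
    ->addListener('data.filter', function (array $data): array {
        $data['log'][] = 'default priority';
        return $data;
    })
    ->addListener('data.filter', function (array $data): array {
        $data['log'][] = 'priority 10';
        return $data;
    }, 10);

$result = $dispatcher->dispatch('data.filter', ['log' => []]);
echo implode(', ', $result['log']), PHP_EOL; // prints "priority 10, default priority"
```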


### Extending the library
If you need more scraped data for a URL, more functionality can be attached to the **PageMeta** library. Example for returning the 'Terms and Conditions' link from pages:

``` php
use Symfony\Component\EventDispatcher\Event;

$previewer = new \Layered\PageMeta\UrlPreview;
$previewer->addListener('page.scrape', function(Event $event) {
    $currentScrapedData = $event->getData(); // check data from other scrapers
    $crawler = $event->getCrawler(); // instance of DomCrawler Symfony Component
    $termsLink = '';

    $crawler->filter('a[href*=terms]')->each(function($node) use(&$termsLink) {
        $termsLink = $node->attr('href');
    });

    // forwards the scraped data
    $event->addData('site', [
        'termsLink' =>  $termsLink
    ]);
});
$previewer->loadUrl('http://github.com');
```
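
When developing a listener like the one above, it can help to try the selection logic on a static HTML string before wiring it into a scrape. The sketch below does that with PHP's bundled DOM extension instead of DomCrawler, so it runs without any dependencies. `findTermsLink()` is a hypothetical helper, and it returns the first matching link, whereas the `each()` loop above keeps the last one.

``` php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: returns the href of the first link whose URL contains
 * "terms", or null if there is none.
 */
function findTermsLink(string $html): ?string
{
    $document = new DOMDocument();

    // Real-world markup is rarely valid; collect libxml warnings instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $document->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($document);
    // XPath counterpart of the CSS selector a[href*=terms]
    $nodes = $xpath->query("//a[contains(@href, 'terms')]");

    if ($nodes === false || $nodes->length === 0) {
        return null;
    }

    $link = $nodes->item(0);

    return $link instanceof DOMElement ? $link->getAttribute('href') : null;
}

$html = '<html><body><a href="/about">About</a><a href="/site/terms">Terms</a></body></html>';

echo findTermsLink($html), PHP_EOL; // prints "/site/terms"
```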

More

Please report any issues here on GitHub.

Any contributions are welcome.

The Versions

  • dev-master (9999999-dev), 20/07 2018
  • 1.1.1 (1.1.1.0), 20/07 2018
  • 1.1 (1.1.0.0), 20/07 2018
  • 1.0.1 (1.0.1.0), 21/04 2018
  • v1.0 (1.0.0.0), 17/03 2018
  • v0.2 (0.2.0.0), 05/09 2017
  • v0.1 (0.1.0.0), 04/09 2017

All versions are MIT licensed and by Andrei Igna. Keywords: oembed, opengraph, scraper, embed, url-preview, link-preview (v0.1 lists only opengraph, scraper, url-preview).