
library textractor

An efficient class library for extracting text from HTML.


shiba/textractor


  • Wednesday, April 12, 2017
  • by shibahuasheng
  • Repository
  • 1 Watchers
  • 0 Stars
  • 6 Installations
  • PHP
  • 0 Dependents
  • 0 Suggesters
  • 5 Forks
  • 0 Open issues
  • 1 Versions
  • 0 % Grown

The README.md

Textractor

An efficient class library for extracting the main text from HTML.

Extraction uses a text-density-based algorithm and also works on compressed HTML documents. The average extraction time is about 30 ms per page, with an accuracy above 95%.
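The README does not describe the algorithm beyond "text density", so the following is only a minimal sketch of one common way to do it, not the library's actual implementation. The function name `extractDensestText` and the parameters `$window` and `$minBlock` are hypothetical. The idea: split the document on every tag regardless of its name, measure how much text falls inside a sliding window of chunks, and keep the contiguous run with the most text. Because it splits on tags rather than on line breaks, it also handles HTML that has been squeezed onto a single line. It assumes PHP 7.4 or later.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper for illustration; not part of shiba/textractor.
 * Returns the densest contiguous run of text found in $html.
 */
function extractDensestText(string $html, int $window = 3, int $minBlock = 60): string
{
    // Drop script/style blocks and comments, which hold no readable text.
    $html = preg_replace('~<(script|style)\b[^>]*>.*?</\1>|<!--.*?-->~is', ' ', $html) ?? $html;

    // Split on every tag, whatever its name, so one-line HTML works too.
    $chunks = [];
    foreach (preg_split('~<[^>]*>~', $html) ?: [] as $part) {
        $chunks[] = trim(html_entity_decode($part, ENT_QUOTES | ENT_HTML5, 'UTF-8'));
    }
    // Byte lengths are enough for a rough density heuristic.
    $lengths = array_map('strlen', $chunks);
    $n = count($chunks);

    $best = '';
    $bestScore = 0;
    $runStart = null;
    for ($i = 0; $i <= $n; $i++) {
        // Density at $i: bytes of text in the next $window chunks (0 past the end).
        $density = $i < $n ? array_sum(array_slice($lengths, $i, $window)) : 0;
        if ($density >= $minBlock) {
            $runStart ??= $i; // open a dense run, or keep extending it
            continue;
        }
        if ($runStart !== null) {
            // The run covers every chunk seen by one of its dense windows.
            $len = min($n, $i - 1 + $window) - $runStart;
            $score = array_sum(array_slice($lengths, $runStart, $len));
            if ($score > $bestScore) {
                $bestScore = $score;
                $best = implode("\n", array_filter(array_slice($chunks, $runStart, $len), 'strlen'));
            }
            $runStart = null;
        }
    }
    return $best;
}

// Sample input on a single line: navigation, two paragraphs, a footer link.
$html = '<html><head><title>T</title><style>p{color:red}</style></head><body>'
      . '<ul><li><a href="/">Home</a></li><li><a href="/about">About</a></li></ul>'
      . '<div><p>' . str_repeat('Body text sentence one. ', 5) . '</p>'
      . '<p>' . str_repeat('Body text sentence two. ', 5) . '</p></div>'
      . '<div><span><a href="/c">Footer</a></span></div></body></html>';
$text = extractDensestText($html);
echo $text, PHP_EOL;
```

Short chunks that sit within `$window` positions of a long paragraph can leak into the result, so a real extractor would need tighter boundary handling than this sketch has.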

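Features

  • Tag-independent: extraction does not rely on specific tags;
  • Supports extracting the main text from compressed HTML documents;
  • Can output the original main text with its tags preserved;
  • The core algorithm is simple and efficient, averaging about 30 ms per extraction.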
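The 30 ms figure is an average, so it is worth measuring it on your own pages. Below is a minimal timing sketch. `averageExtractionMs` is a hypothetical helper, and `strip_tags()` stands in for the real extractor so that the example runs without the package installed. It assumes PHP 7.3 or later for `hrtime()`.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical benchmark helper; not part of shiba/textractor.
 * Returns the mean wall-clock time of $extract per page, in milliseconds.
 */
function averageExtractionMs(callable $extract, array $pages): float
{
    if ($pages === []) {
        return 0.0;
    }
    $start = hrtime(true); // monotonic clock, in nanoseconds
    foreach ($pages as $html) {
        $extract($html);
    }
    // Nanoseconds -> milliseconds, then divide by the number of pages.
    return (hrtime(true) - $start) / 1e6 / count($pages);
}

// strip_tags() is only a stand-in for the extractor under test.
$pages = array_fill(0, 50, '<html><body><p>' . str_repeat('text ', 200) . '</p></body></html>');
$ms = averageExtractionMs('strip_tags', $pages);
printf("average: %.3f ms per page\n", $ms);
```

To time the library itself, pass a closure that runs the extractor on already-downloaded HTML, so that network latency is not counted.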

Installation

  1. Install the package:

    composer require "shiba/textractor:dev-master"
    
  2. Add the ServiceProvider to the providers section of your project's config/app.php:

    shiba\Textractor\TextractorServiceProvider::class,
    
  3. Publish the configuration file:

    php artisan vendor:publish --provider="shiba\Textractor\TextractorServiceProvider"
    

    Then edit the relevant entries in config/textractor.php.

Usage

<?php
$url = 'http://news.163.com/17/0204/08/CCDTBQ9E000189FH.html';
// Create an extractor instance
$textractor = new \Lukin\Textractor();
// Download and parse the article
$article = $textractor->download($url)->parse();

printf('URL: %s' . PHP_EOL, $url);
printf('Title: %s' . PHP_EOL, $article->getTitle());
printf('Publish: %s' . PHP_EOL, $article->getPublishDate());
printf('Text:' . PHP_EOL . '%s' . PHP_EOL, $article->getText());
printf('Content: %s' . PHP_EOL, $article->getHTML());

License

MIT

The Versions

12/04 2017

dev-master

9999999-dev


MIT

The Requires

 

The Development Requires

by shiba

extractor article html2article