library textractor
An efficient class library for extracting text from HTML.
crlwinner/textractor
- Monday, June 5, 2017
- by alex.love.77
- PHP
An efficient class library for extracting the main text from HTML.
Extraction uses a text-density-based algorithm and also works on compressed HTML documents. The average extraction time is about 30 ms per page, with an accuracy above 95%.
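Features
- Tag-independent: extracting the main text does not rely on specific tags;
- Supports extracting the main text from compressed HTML documents;
- Supports outputting the original main text with its tags preserved;
- The core algorithm is simple and efficient, with an average extraction time of about 30 ms.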
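The text-density idea behind these features can be illustrated with a small sketch. This is not the library's actual code: the function name `extractMainText`, the thresholds `$minBytes` and `$minDensity`, and the list of block-level tags are all assumptions chosen for illustration. The sketch splits the document at closing block-level tags (so that single-line HTML is chunked as well), measures how much of each chunk is text rather than markup, and returns the longest run of consecutive dense chunks.

```php
<?php
declare(strict_types=1);

/**
 * Illustrative text-density extraction (hypothetical, not the library's implementation).
 *
 * @param string $html       Raw HTML, possibly on a single line.
 * @param int    $minBytes   Assumed minimum text length (in bytes) for a chunk to count.
 * @param float  $minDensity Assumed minimum ratio of text bytes to chunk bytes.
 */
function extractMainText(string $html, int $minBytes = 30, float $minDensity = 0.5): string
{
    // Remove script/style blocks and comments, which never hold body text.
    $html = preg_replace('~<(script|style)\b[^>]*>.*?</\1>|<!--.*?-->~is', '', $html) ?? $html;

    // Insert a newline after closing block-level tags so the chunking
    // does not depend on line breaks in the source.
    $html = preg_replace('~</(?:p|div|li|ul|h[1-6]|tr|section|article)>|<br\s*/?>~i', "$0\n", $html) ?? $html;

    $best = [];
    $bestLen = 0;
    $run = [];
    $runLen = 0;

    foreach (explode("\n", $html) as $chunk) {
        $chunk = trim($chunk);
        if ($chunk === '') {
            continue; // blank chunks neither extend nor break a run
        }

        // Text that remains once the markup is stripped, with whitespace collapsed.
        $text = trim(preg_replace('~\s+~', ' ', strip_tags($chunk)) ?? '');
        $density = strlen($text) / strlen($chunk);

        if (strlen($text) >= $minBytes && $density >= $minDensity) {
            $run[] = $text;
            $runLen += strlen($text);
            if ($runLen > $bestLen) {
                $best = $run;
                $bestLen = $runLen;
            }
        } else {
            // A sparse chunk (navigation, footer, wrappers) ends the current run.
            $run = [];
            $runLen = 0;
        }
    }

    return implode("\n", $best);
}
```

Calling `extractMainText($html)` on a page made of a short navigation bar, a few paragraphs, and a footer should return only the paragraphs, because link-heavy chunks have little text relative to their markup. Production implementations typically add refinements such as smoothing over neighbouring chunks and title or date detection.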
Installation
- Install the package:
composer require "mylukin/textractor:dev-master"
- Add the ServiceProvider to the providers section of your project's config/app.php:
Lukin\Textractor\TextractorServiceProvider::class,
- Create the configuration file:
php artisan vendor:publish --provider="Lukin\Textractor\TextractorServiceProvider"
Then edit the relevant entries in config/textractor.php.
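For orientation, the provider registration in a typical Laravel config/app.php might look like the excerpt below. Only the `TextractorServiceProvider` line comes from the instructions above; the surrounding array structure is the usual Laravel layout and the other entries are placeholders.

```php
<?php
// config/app.php (excerpt; other keys and providers omitted)
return [
    'providers' => [
        // ... the framework and application providers already listed here ...

        // Registers the Textractor package with the application.
        Lukin\Textractor\TextractorServiceProvider::class,
    ],
];
```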
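Usage
<?php
// Assumes Composer's autoloader is already loaded (as it is inside a Laravel application).
$url = 'http://news.163.com/17/0204/08/CCDTBQ9E000189FH.html';
// Create an extractor instance
$textractor = new \Lukin\Textractor\Textractor();
// Download and parse the article
$article = $textractor->download($url)->parse();
// Print the extracted fields, one per line
printf('URL: %s' . PHP_EOL, $url);
printf('Title: %s' . PHP_EOL, $article->getTitle());
printf('Publish: %s' . PHP_EOL, $article->getPublishDate());
// Main text without tags
printf('Text: %s' . PHP_EOL, $article->getText());
// Main text with its original HTML tags
printf('Content: %s' . PHP_EOL, $article->getHTML());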
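When the extracted fields need to be stored or returned from an API rather than printed, they can be gathered into an array first. The helper below is a hypothetical sketch and not part of the library: `articleToArray` is an invented name, and it only assumes that the article object exposes the four getters used in the usage example and that each returns a value `json_encode` can handle.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not provided by the library): collects the fields
 * of a parsed article into a plain array.
 *
 * @param object $article Any object exposing getTitle(), getPublishDate(),
 *                        getText() and getHTML(), as in the usage example.
 * @param string $url     The URL the article was downloaded from.
 */
function articleToArray(object $article, string $url): array
{
    return [
        'url'     => $url,
        'title'   => $article->getTitle(),
        'publish' => $article->getPublishDate(),
        'text'    => $article->getText(), // main text without tags
        'html'    => $article->getHTML(), // main text with its original tags
    ];
}
```

With the `$article` and `$url` variables from the usage example, `json_encode(articleToArray($article, $url), JSON_UNESCAPED_UNICODE)` would produce a JSON document that keeps Chinese text readable instead of escaping it.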
License
MIT
dev-master
9999999-dev
by Lukin
extractor
article
html2article