dev-master
9999999-dev
MIT
by shiba
extractor article html2article
An efficient class library for extracting the main body text from HTML.
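Body-text extraction uses a text-density-based algorithm and supports extracting body text from compressed HTML documents. The average extraction time is 30 ms per page, with an accuracy above 95%.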
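To make the idea of text-density extraction concrete, here is a minimal sketch of one common variant: split the page into text lines, find the run of consecutive lines holding the most text, and grow that run while neighbouring lines stay long. The function name `extractDensestBlock`, the `$window` and `$minLen` parameters, and the list of block-level tags are assumptions made for illustration. This is not the library's actual implementation, which may score density differently.

```php
<?php
// Illustrative only: extractDensestBlock() is a hypothetical helper,
// not part of the Textractor API.
function extractDensestBlock(string $html, int $window = 3, int $minLen = 20): string
{
    $window = max(1, $window);

    // Script and style bodies carry no readable text, so drop them first.
    $html = (string) preg_replace('#<(script|style)\b[^>]*>.*?</\1>#is', '', $html);

    // Turn block-level tag boundaries into newlines so that HTML without
    // any line breaks still splits into separate text lines.
    $html = (string) preg_replace(
        '#</?(p|div|br|li|h[1-6]|tr|td|section|article)\b[^>]*>#i',
        "\n",
        $html
    );

    // Strip the remaining tags, decode entities, and keep non-empty lines.
    $lines = array_map(function ($line) {
        return trim(html_entity_decode($line));
    }, explode("\n", strip_tags($html)));
    $lines = array_values(array_filter($lines, function ($line) {
        return $line !== '';
    }));
    $n = count($lines);
    if ($n === 0) {
        return '';
    }

    // Byte length serves as a rough proxy for the amount of text per line.
    $lengths = array_map('strlen', $lines);

    // Find the window of consecutive lines that holds the most text.
    $bestStart = 0;
    $bestScore = -1;
    for ($i = 0; $i + $window <= $n; $i++) {
        $score = array_sum(array_slice($lengths, $i, $window));
        if ($score > $bestScore) {
            $bestScore = $score;
            $bestStart = $i;
        }
    }
    $start = $bestStart;
    $end = min($n, $bestStart + $window) - 1;

    // Trim short lines off the window edges.
    while ($start < $end && $lengths[$start] < $minLen) {
        $start++;
    }
    while ($end > $start && $lengths[$end] < $minLen) {
        $end--;
    }
    // Grow the block while adjacent lines are long enough to look like body text.
    while ($start > 0 && $lengths[$start - 1] >= $minLen) {
        $start--;
    }
    while ($end < $n - 1 && $lengths[$end + 1] >= $minLen) {
        $end++;
    }

    return implode("\n", array_slice($lines, $start, $end - $start + 1));
}
```

Calling `extractDensestBlock($html)` on a page whose navigation and footer consist of short labels should return only the long paragraphs between them. The thresholds are arbitrary and would need tuning on real pages.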
Install the package:
composer require "shiba/textractor:dev-master"
Add the ServiceProvider to the providers section of your project's config/app.php:
shiba\Textractor\TextractorServiceProvider::class,
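For context, the registration might look like the following excerpt. It assumes the conventional Laravel layout in which config/app.php returns an array with a `providers` key. The placeholder comment stands in for whatever providers your project already lists.

```php
<?php
// config/app.php (excerpt; assumes the standard Laravel config layout)
return [
    // ... other application settings ...

    'providers' => [
        // ... existing providers ...
        shiba\Textractor\TextractorServiceProvider::class,
    ],
];
```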
Publish the configuration file:
php artisan vendor:publish --provider="shiba\Textractor\TextractorServiceProvider"
Then edit the corresponding entries in config/textractor.php.
<?php
$url = 'http://news.163.com/17/0204/08/CCDTBQ9E000189FH.html';

// Create an extractor instance
$textractor = new \Lukin\Textractor();

// Download and parse the article
$article = $textractor->download($url)->parse();

printf('URL: %s' . PHP_EOL, $url);
printf('Title: %s' . PHP_EOL, $article->getTitle());
printf('Publish: %s' . PHP_EOL, $article->getPublishDate());
printf('Text: %s' . PHP_EOL, $article->getText());
printf('Content: %s' . PHP_EOL, $article->getHTML());
License: MIT