skuola/pdf-text-parser

Library to parse XML resulting from pdftotext

  • Sunday, July 8, 2018
  • by skuola

The README.md

PDF text parser

This library is a parser for XML text files obtained via pdftotext.

You can install it with Composer: composer require skuola/pdf-text-parser

Suppose you've just converted a PDF file and obtained text like the following:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<body>
<doc>
  <page width="595.200000" height="841.800000">
    <word xMin="56.640000" yMin="59.770680" xMax="118.022880" yMax="72.406680">Lorem</word>
    <word xMin="121.209960" yMin="59.770680" xMax="176.485440" yMax="72.406680">ipsum</word>
  </page>
</doc>
</body>
</html>

The text above is the result of a command like pdftotext -htmlmeta -bbox-layout yourfile.pdf - (the trailing - sends the output to standard output).
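To make the format more concrete, here is a minimal sketch that builds the command quoted above and reads the word elements with PHP's built-in DOM extension. It does not use this library and is not how the library works internally; the function names pdftotextCommand and extractWords are hypothetical helpers introduced only for illustration.

```php
<?php

/**
 * Build the pdftotext command quoted in the text for a given PDF path.
 * Hypothetical helper; the trailing "-" writes the result to standard output.
 */
function pdftotextCommand(string $pdfPath): string
{
    return 'pdftotext -htmlmeta -bbox-layout ' . escapeshellarg($pdfPath) . ' -';
}

/**
 * Collect every <word> element together with its bounding box.
 * Hypothetical helper, independent of skuola/pdf-text-parser.
 *
 * @return array<int, array<string, mixed>>
 */
function extractWords(string $xml): array
{
    $doc = new DOMDocument();
    if (!$doc->loadXML($xml)) {
        throw new InvalidArgumentException('Input is not well-formed XML.');
    }

    $words = [];
    foreach ($doc->getElementsByTagName('word') as $node) {
        $words[] = [
            'text' => $node->textContent,
            'xMin' => (float) $node->getAttribute('xMin'),
            'yMin' => (float) $node->getAttribute('yMin'),
            'xMax' => (float) $node->getAttribute('xMax'),
            'yMax' => (float) $node->getAttribute('yMax'),
        ];
    }

    return $words;
}

// Shortened copy of the sample shown earlier (DOCTYPE omitted for brevity).
$sample = <<<'XML'
<html xmlns="http://www.w3.org/1999/xhtml">
<body>
<doc>
  <page width="595.200000" height="841.800000">
    <word xMin="56.640000" yMin="59.770680" xMax="118.022880" yMax="72.406680">Lorem</word>
    <word xMin="121.209960" yMin="59.770680" xMax="176.485440" yMax="72.406680">ipsum</word>
  </page>
</doc>
</body>
</html>
XML;

$words = extractWords($sample);

foreach ($words as $word) {
    printf("%s starts at x=%.2f\n", $word['text'], $word['xMin']);
}
```

In practice the string returned by pdftotextCommand could be passed to shell_exec, and the captured output handed either to a helper like the one above or to the library's Converter.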

You can use this library as follows:

<?php

require_once 'vendor/autoload.php';

$data = '...';  // the text above

$converter = new \Skuola\PdfTextParser\Converter($data);
// get as plain text...
$txt = $converter->getAsText();
// ...or get as HTML
$html = $converter->getAsHtml();

Alternatively, you can save the output to a file and pass its path to the library:

<?php

require_once 'vendor/autoload.php';

$path = '...';  // path to a file containing the same text as in the previous example

$converter = new \Skuola\PdfTextParser\Converter(null, $path);
$html = $converter->getAsHtml();
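As a rough sketch of the file-based mode, the snippet below writes some text to a temporary file and passes that path to the converter. The temporary-file handling and the stand-in value of $data are assumptions made for illustration; only the Converter constructor call and getAsHtml() come from the examples above, and the call is guarded so the snippet still runs when the library is not installed.

```php
<?php

// Stand-in for real pdftotext output (assumption); replace with your own data.
$data = '<doc><page width="595.2" height="841.8"></page></doc>';

// Write the text to a temporary file; the location and prefix are arbitrary.
$path = tempnam(sys_get_temp_dir(), 'pdftext_');
if ($path === false || file_put_contents($path, $data) === false) {
    throw new RuntimeException('Could not write the temporary file.');
}

// Only call the library when it has been installed and autoloaded.
if (class_exists(\Skuola\PdfTextParser\Converter::class)) {
    $converter = new \Skuola\PdfTextParser\Converter(null, $path);
    $html = $converter->getAsHtml();
}
```

Remember to delete the temporary file with unlink($path) once the conversion is done.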

The generated HTML consists of an <h2> tag or a <p> tag for each document line, depending on whether the line is a title.
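The shape of that output can be illustrated with a small sketch. It is not the library's implementation: the renderLines function and the isTitle flag are hypothetical, and how the library itself decides whether a line is a title is not described here, so the flag is simply supplied by the caller.

```php
<?php

/**
 * Render lines as HTML: <h2> for titles, <p> for everything else.
 * Hypothetical helper; each line is ['text' => string, 'isTitle' => bool].
 */
function renderLines(array $lines): string
{
    $html = '';
    foreach ($lines as $line) {
        $tag = $line['isTitle'] ? 'h2' : 'p';
        // Escape the text so characters such as & and < stay valid in HTML.
        $text = htmlspecialchars($line['text'], ENT_QUOTES, 'UTF-8');
        $html .= sprintf("<%s>%s</%s>\n", $tag, $text, $tag);
    }

    return $html;
}

$html = renderLines([
    ['text' => 'Lorem ipsum', 'isTitle' => true],
    ['text' => 'Dolor sit amet & consectetur', 'isTitle' => false],
]);

echo $html;
```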

More information to come...

The Versions

  • dev-master (9999999-dev), July 8, 2018, MIT
  • v0.3.1 (0.3.1.0), July 8, 2018, MIT
  • dev-php7.0, June 7, 2018, MIT
  • v0.2.0 (0.2.0.0), June 7, 2018, MIT
  • v0.3.0 (0.3.0.0), June 7, 2018, MIT
  • v0.1.0 (0.1.0.0), May 31, 2018, MIT

Keywords: text, pdf, pdftotext