2017 © Pedro Peláez
 

library php-apache-tika

Apache Tika bindings for PHP: extracts text from documents and images (with OCR), metadata and more...

nekulin/php-apache-tika

  • Tuesday, January 26, 2016
  • by nekulin
  • Repository
  • 1 Watchers
  • 0 Stars
  • 14 Installations
  • PHP
  • 0 Dependents
  • 0 Suggesters
  • 9 Forks
  • 0 Open issues
  • 4 Versions
  • 0 % Grown

The README.md

PHP Apache Tika

This tool provides Apache Tika bindings for PHP, allowing you to extract text and metadata from documents, images and other formats.

Two modes are supported:

  • App mode: runs the Tika app JAR through the command line interface
  • Server mode: makes HTTP requests to the JSR 311 network server

Server mode is recommended because it is 5 times faster, but some shared hosts don't allow running processes in the background.
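One way to act on this recommendation is to prefer server mode when something answers on the expected port and to fall back to app mode otherwise. The following is a minimal sketch, not part of the library: `tikaPortOpen` and `tikaClientArguments` are hypothetical helper names, and the host, port and JAR path are placeholders. It only builds the argument list that would be handed to `Client::make()`.

```php
<?php
// Hypothetical helper: true if something accepts TCP connections on $host:$port.
// An open port does not prove that the listener is really a Tika server.
function tikaPortOpen($host, $port, $timeout = 1.0)
{
    $socket = @fsockopen($host, $port, $errno, $errstr, $timeout);
    if ($socket === false) {
        return false;
    }
    fclose($socket);
    return true;
}

// Hypothetical helper: picks the arguments for Client::make().
// Server mode takes host and port, app mode takes the path to the JAR.
function tikaClientArguments($serverUp, $host, $port, $jarPath)
{
    return $serverUp ? array($host, $port) : array($jarPath);
}

$serverUp  = tikaPortOpen('localhost', 9998);
$arguments = tikaClientArguments($serverUp, 'localhost', 9998, '/path/to/tika-app.jar');
echo ($serverUp ? 'server' : 'app') . " mode\n";

// With the library installed, the client could then be created like this:
// $client = call_user_func_array(array('\Vaites\ApacheTika\Client', 'make'), $arguments);
```

The probe runs once at startup in this sketch. A long-running process might want to repeat it, since the server can stop after the client has been created.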

Features

  • Simple class interface to Apache Tika features:
    • Text and HTML extraction
    • Metadata extraction
    • OCR recognition
  • Standardized metadata for documents
  • Support for local and remote resources
  • No heavyweight library dependencies

Requirements

  • PHP 5.4 or greater
  • Apache Tika 1.7 or greater
  • Oracle Java or OpenJDK
    • Java 6 for Tika up to 1.9
    • Java 7 for Tika 1.10 or greater
  • Tesseract (optional for OCR recognition)
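The Java rule in the list above depends on the Tika version, which is easy to get wrong with plain string comparison ("1.10" sorts before "1.9" as text). The sketch below encodes the rule with `version_compare()` and also checks the PHP version and the curl extension, which the package lists among its requirements. `minimumJavaForTika` is a hypothetical helper name, not something the library provides.

```php
<?php
// Hypothetical helper: minimum Java major version for a given Tika version,
// following the requirements list. Returns null for unsupported Tika versions.
function minimumJavaForTika($tikaVersion)
{
    if (version_compare($tikaVersion, '1.7', '<')) {
        return null; // older than the minimum supported Tika release
    }
    // version_compare() compares each segment numerically, so 1.10 > 1.9
    return version_compare($tikaVersion, '1.10', '>=') ? 7 : 6;
}

// Checks that can be done from PHP itself
$phpOk  = version_compare(PHP_VERSION, '5.4.0', '>=');
$curlOk = extension_loaded('curl'); // listed as ext-curl in the package requirements

echo 'PHP >= 5.4: ' . ($phpOk ? 'yes' : 'no') . "\n";
echo 'ext-curl:   ' . ($curlOk ? 'yes' : 'no') . "\n";
echo 'Java for Tika 1.9:  ' . minimumJavaForTika('1.9') . "\n";
echo 'Java for Tika 1.10: ' . minimumJavaForTika('1.10') . "\n";
```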

Installation

Install using Composer:

composer require vaites/php-apache-tika

If you want to use OCR, you must install Tesseract:

  • Fedora/CentOS: sudo yum install tesseract (use dnf instead of yum on Fedora 22 or greater)
  • Debian/Ubuntu: sudo apt-get install tesseract-ocr
  • Mac OS X: brew install tesseract (using Homebrew)
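After installing, it can be useful to confirm from PHP that the `tesseract` binary is actually on the path of the user running your script. This is a minimal sketch under that assumption; `tesseractAvailable` is a hypothetical helper name, and it only checks that `tesseract --version` exits successfully.

```php
<?php
// Hypothetical helper: true if the `tesseract` binary can be run from PHP.
function tesseractAvailable()
{
    if (!function_exists('exec')) {
        return false; // exec() may be unavailable on some hosts
    }
    $output   = array();
    $exitCode = 1; // stays non-zero if exec() cannot run at all
    // Some Tesseract releases print the version to stderr, hence 2>&1
    @exec('tesseract --version 2>&1', $output, $exitCode);
    return $exitCode === 0;
}

echo tesseractAvailable()
    ? "Tesseract found\n"
    : "Tesseract not found, OCR is unlikely to work\n";
```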

Usage

Start Apache Tika server with caution:

java -jar tika-server-1.10.jar
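Once the server is running, a quick way to confirm it is reachable is to request its `/version` endpoint over HTTP. The following is a minimal sketch, assuming the default port 9998 and using plain PHP streams instead of the library; `tikaServerVersion` is a hypothetical helper name.

```php
<?php
// Hypothetical helper: returns the version string reported by the server,
// or null if the request fails for any reason.
function tikaServerVersion($host = 'localhost', $port = 9998, $timeout = 2.0)
{
    $context = stream_context_create(array(
        'http' => array('method' => 'GET', 'timeout' => $timeout),
    ));
    $body = @file_get_contents("http://{$host}:{$port}/version", false, $context);
    return $body === false ? null : trim($body);
}

$version = tikaServerVersion();
echo $version === null
    ? "No Tika server answered\n"
    : "Server reports: {$version}\n";
```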

Instantiate the class:

$client = \Vaites\ApacheTika\Client::make('localhost', 9998);           // server mode (default)
$client = \Vaites\ApacheTika\Client::make('/path/to/tika-app.jar');     // app mode 

Use the class to extract the language, metadata, HTML and text from documents:

$language = $client->getLanguage('/path/to/your/document');
$metadata = $client->getMetadata('/path/to/your/document');

$html = $client->getHTML('/path/to/your/document');
$text = $client->getText('/path/to/your/document');
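The calls above can be grouped into one helper that fails early when a local file is missing. This is a sketch only: `describeDocument` and `FakeTikaClient` are hypothetical names, the fake client exists so the example runs without a Tika server, and its return values are made up. With the real library you would pass the object returned by `Client::make()` instead.

```php
<?php
// Hypothetical helper: collects language, metadata and text for one file.
// $client must offer getLanguage(), getMetadata() and getText().
// The is_readable() check only suits local paths, not remote resources.
function describeDocument($client, $path)
{
    if (!is_readable($path)) {
        throw new InvalidArgumentException("Cannot read file: {$path}");
    }
    return array(
        'language' => $client->getLanguage($path),
        'metadata' => $client->getMetadata($path),
        'text'     => $client->getText($path),
    );
}

// Stand-in for the real client so the sketch runs without a Tika server.
// Its return values are invented for illustration only.
class FakeTikaClient
{
    public function getLanguage($path) { return 'en'; }
    public function getMetadata($path) { return array('file' => basename($path)); }
    public function getText($path)     { return 'example text'; }
}

$result = describeDocument(new FakeTikaClient(), __FILE__);
echo $result['language'] . ': ' . $result['text'] . "\n";
```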

Or use it to extract text from images:

$client = \Vaites\ApacheTika\Client::make($host, $port);
$metadata = $client->getMetadata('/path/to/your/image');

$text = $client->getText('/path/to/your/image');

The Versions

26/01 2016

dev-master

9999999-dev

Apache Tika bindings for PHP: extracts text from documents and images (with OCR), metadata and more...

MIT

The Requires

  • php >=5.4.0
  • ext-curl *

 

The Development Requires

by David Martinez

pdf apache documents tika doc office ocr odt docx pptx ppt

13/12 2015

0.3.0

0.3.0.0

Apache Tika bindings for PHP: extracts text from documents and images (with OCR), metadata and more...

MIT

The Requires

  • php >=5.4.0
  • ext-curl *

 

The Development Requires

by David Martinez

pdf apache documents tika doc office ocr odt docx pptx ppt

13/09 2015

0.2.0

0.2.0.0

Apache Tika bindings for PHP: extracts text from documents and images (with OCR), metadata and more...

MIT

The Requires

  • php >=5.4.0
  • ext-curl *

 

The Development Requires

by David Martinez

pdf apache documents tika doc office ocr odt docx pptx ppt

30/08 2015

0.1.0

0.1.0.0

Apache Tika bindings for PHP: extracts metadata, text, HTML and more

MIT

The Requires

  • php >=5.4.0
  • ext-curl *

 

The Development Requires

by David Martinez

pdf apache documents tika doc office odt docx pptx ppt