2017 © Pedro Peláez
 

library language-detector

PHP library to detect the language of any free text.

andywer/language-detector

  • Wednesday, September 30, 2015
  • by andywer
  • PHP

The README.md

LanguageDetector

PHP library to detect languages from any free text.

It follows the approach described in the paper: a given text is tokenized into N-grams (after whitespace is cleaned up), then the tokens are sorted and compared against a language model.
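
The tokenization step can be sketched as follows. This is a minimal illustration, not the library's implementation: the function name `extractNgrams`, the N-gram size range, and the ranking by raw frequency are assumptions made for the example (the library sorts with PageRank by default).

```php
<?php
// Hypothetical helper for illustration only; not part of the library's API.
function extractNgrams(string $text, int $minSize = 1, int $maxSize = 3): array
{
    // Collapse every run of whitespace into a single space and trim the ends.
    $clean = trim(preg_replace('/\s+/u', ' ', $text));

    // Split into UTF-8 characters so multibyte text is not cut mid-character.
    $chars = preg_split('//u', $clean, -1, PREG_SPLIT_NO_EMPTY);
    $length = count($chars);

    $counts = [];
    for ($n = $minSize; $n <= $maxSize; $n++) {
        // Slide a window of $n characters over the text.
        for ($i = 0; $i + $n <= $length; $i++) {
            $gram = implode('', array_slice($chars, $i, $n));
            $counts[$gram] = ($counts[$gram] ?? 0) + 1;
        }
    }

    // Assumption: rank by frequency, most frequent N-gram first.
    arsort($counts);
    return $counts;
}

// Bigrams only, for a short input with a doubled space.
$ngrams = extractNgrams("to be  or not to be", 2, 2);
```

The resulting array maps each N-gram to its number of occurrences; the order of its keys is the ranking that a comparison step would work with.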

Fork of crodas/languagedetector, since the original package seems abandoned.

How it works

The first thing we need is a language model (which looks like this file) that texts are compared against at classification time. The model must be generated before anything else, which can be done with a script similar to this file.

// register the autoloader
require 'lib/LanguageDetector/autoload.php';

// it could use a little bit of memory, but it's fine
// because this process runs once.
ini_set('memory_limit', '1G');

// we load the configuration (which will be serialized
// later into our language model file)
$config = new LanguageDetector\Config;

$c = new LanguageDetector\Learn($config);
foreach (glob(__DIR__ . '/samples/*') as $file) { 
    // feed with examples ('language', 'text');
    $c->addSample(basename($file), file_get_contents($file));
}

// some callback so we know where the process is 
$c->addStepCallback(function($lang, $status) {
    echo "Learning {$lang}: $status\n";
});

// save it in `datafile`.
// we currently support the `php` serialization, but it's trivial
// to add other formats: just extend `\LanguageDetector\Format\AbstractFormat`.
// For an example, see https://github.com/crodas/LanguageDetector/blob/master/lib/LanguageDetector/Format/PHP.php
$c->save(AbstractFormat::initFormatByPath('language.php'));

Once we have our language model file (in this case language.php) we're ready to classify texts by their language.

// register the autoloader
require 'lib/LanguageDetector/autoload.php';

// we load the language model; this also creates
// the $config object for us.
$detect = LanguageDetector\Detect::initByPath('language.php');

$lang = $detect->detect("Agricultura (-ae, f.), sensu latissimo, 
est summa omnium artium et scientiarum et technologiarum quae de 
terris colendis et animalibus creandis curant, ut poma, frumenta, 
charas, carnes, textilia, et aliae res e terra bene producantur. 
Specialius, agronomia est ars et scientia quae terris colendis student, 
agricultio autem animalibus creandis.");

var_dump($lang);

And that's it.

Algorithms

The project is modular, which means you can provide your own algorithms for sorting and comparing the N-grams. By default the library uses PageRank as the sorting algorithm and the out-of-place measure (described in the paper) as the comparison algorithm.
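
To make the comparison step concrete, here is a minimal sketch of an out-of-place distance as it is commonly described for N-gram text categorization. It is not the library's own code: the function name `outOfPlaceDistance`, the toy profiles, and the choice of the profile length as the penalty for missing N-grams are assumptions for illustration.

```php
<?php
// Hypothetical helper for illustration only; not part of the library's API.
// Both arguments are lists of N-grams ordered from highest to lowest rank.
function outOfPlaceDistance(array $documentRanking, array $languageRanking): int
{
    // Map each N-gram of the language profile to its rank (0 = top).
    $languageRank = array_flip($languageRanking);

    // Assumption: an N-gram missing from the profile costs the profile length.
    $maxPenalty = count($languageRanking);

    $distance = 0;
    foreach ($documentRanking as $rank => $gram) {
        $distance += isset($languageRank[$gram])
            ? abs($rank - $languageRank[$gram]) // how far out of place it is
            : $maxPenalty;
    }
    return $distance;
}

// Toy data: the ranked N-grams of a text and of two language profiles.
$document = ['th', 'he', 'an', 'zz'];
$profiles = [
    'english' => ['th', 'he', 'in', 'an'],
    'spanish' => ['de', 'en', 'es', 'an'],
];

$scores = [];
foreach ($profiles as $lang => $ranking) {
    $scores[$lang] = outOfPlaceDistance($document, $ranking);
}

// The smallest distance is the best match.
asort($scores);
$best = array_keys($scores)[0];
```

A lower score means the text's N-gram ranking is closer to the language profile, so the candidate with the smallest distance is picked.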

To supply your own algorithms, change the $config at the learning stage so that it loads your own classes (which, by the way, should implement the library's interfaces).

Language Detection Training Files

Have a look at the example/samples directory. For more advanced training data, visit the Leipzig Corpora Download Page.

Languages with non-latin characters

Remember to set the Config's mb property before creating the language model if you train for languages based on non-Latin characters. Use UTF-8 encoded texts.
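
The reason multibyte handling matters can be seen with plain PHP string functions. The sketch below does not use the library at all; it only shows, under the assumption of a UTF-8 encoded source file, how byte-based functions split a non-Latin text in the wrong places while character-based splitting does not. PHP's mb_strlen and mb_substr give the same character-based behaviour when the mbstring extension is installed.

```php
<?php
// Six Cyrillic characters; each one takes two bytes in UTF-8.
$text = "привет";

// Byte-based functions see 12 units, not 6.
$byteLength = strlen($text);

// Taking 3 bytes cuts the second character in half.
$byteSlice = substr($text, 0, 3);

// The /u modifier makes PCRE reject invalid UTF-8, so this is false.
$isValid = preg_match('//u', $byteSlice) === 1;

// Character-based splitting keeps every character intact.
$chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
$charLength = count($chars);
$firstBigram = $chars[0] . $chars[1];
```

N-grams built from byte slices like `$byteSlice` would contain broken characters, which is why character-aware processing is needed for such languages.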

The Versions

  • 30/09 2015: dev-master (9999999-dev). PHP library to detect the language of any free text. License: BSD-4-Clause. By César D. Rodas.
  • 30/09 2015: 0.2.0 (0.2.0.0). PHP library to detect the language of any free text. License: BSD-4-Clause. By César D. Rodas.
  • 21/06 2015: dev-cli. simple library to classify texts. License: BSD-4-Clause. By César D. Rodas.
  • 11/10 2014: dev-develop. simple library to classify texts. License: BSD-4-Clause. By César D. Rodas.
  • 06/11 2013: v0.1.1 (0.1.1.0). simple library to classify texts. License: BSD-4-Clause. By César D. Rodas.
  • 05/11 2013: v0.1.0 (0.1.0.0). simple library to classify texts. License: BSD-4-Clause. By César D. Rodas.
  • 20/07 2013: dev-redaktor-patch-2. simple library to classify texts. By César D. Rodas.
  • 23/05 2013: dev-adam-lynch-patch-1. simple library to classify texts. By César D. Rodas.
  • 11/04 2013: dev-better-learning. simple library to classify texts. By César D. Rodas.