Monachus
Monachus is a library that helps you work with text in any language. Monachus means "monk" in Latin, which seems a fitting name for this library: monks used to work a lot with books (strings) in a wide range of languages.
This library was written with these PHP versions in mind: 5.5, 5.4, 5.3.
Install
The simplest way is with Composer. Just add these lines to your composer.json:
"repositories": [
    {
        "type": "git",
        "url": "https://github.com/ssola/monachus.git"
    }
]
How it works
String
The first thing to know is how to use the String class. It wraps a given text in an object and keeps that text in the UTF-8 charset throughout.
include_once("./vendor/autoload.php");
use Monachus\String as String;
$text = new String("Hello World!");
echo $text;
This code creates a new String object holding a value and then prints it.
Then you can do things like:
include_once("./vendor/autoload.php");
use Monachus\String as String;
$text = new String("Hello World!");
echo $text->length();
echo $text->find("World");
echo $text->toUppercase();
if ($text->equals("Hello World!")) {
    echo $text->toLowercase();
}
Objects of this kind are used extensively throughout the library so that every operation runs with the proper charset.
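To see why the charset matters, compare PHP's byte-oriented string functions with their multibyte counterparts. The sketch below uses only the standard mbstring extension, not Monachus itself, and it makes no claim about how the String class is implemented internally.

```php
<?php
// Standalone illustration using PHP's mbstring extension (not Monachus code).
$word = "Ñandú"; // 5 characters, but 7 bytes when encoded as UTF-8

echo strlen($word), PHP_EOL;             // counts bytes
echo mb_strlen($word, 'UTF-8'), PHP_EOL; // counts characters

// The multibyte-aware function also uppercases the accented letters.
echo mb_strtoupper($word, 'UTF-8'), PHP_EOL;
```

Wrapping text in an object that always goes through charset-aware operations is one way to avoid mixing up the two families of functions.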
Tokenizer
Do you need to tokenize a string? Monachus can do it for you! Many languages are supported, Japanese included. And if your language is not supported, relax: you can create your own adapters to tokenize other languages.
Let's look at a simple example:
include_once("./vendor/autoload.php");
use Monachus\String as String;
use Monachus\Tokenizer as Tokenizer;
$text = new String("This is a text");
$tokenizer = new Tokenizer();
var_dump($tokenizer->tokenize($text));
// Now imagine you need to tokenize a Japanese text
$textJp = new String("は太平洋側を中心に晴れた所が多いが");
$tokenizerJp = new Tokenizer(new Monachus\Tokenizers\Japanase());
var_dump($tokenizerJp->tokenize($textJp));
As you have seen, adapters let you tokenize complex languages like Japanese or Chinese. Now it's time to explain how to create these adapters.
class MyTokenizer implements Monachus\Interfaces\TokenizerInterface
{
    public function tokenize(Monachus\String $string)
    {
        // your awesome code!
    }
}
$tokenizer = new Monachus\Tokenizer(new MyTokenizer());
var_dump($tokenizer->tokenize(new Monachus\String("Поиск информации в интернете")));
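The body of tokenize() is left open above. As one possible starting point, assuming an adapter only needs to split on whitespace and punctuation, a Unicode-aware split can be written with preg_split and the /u modifier. The function name splitWords is hypothetical. This README does not show how a Monachus\String exposes its raw text, so the sketch works on a plain PHP string.

```php
<?php
// Hypothetical helper, not part of Monachus: splits a UTF-8 string into words.
function splitWords($text)
{
    // \p{L} = letters, \p{M} = combining marks, \p{N} = numbers.
    // The /u modifier makes PCRE treat pattern and subject as UTF-8.
    // Any run of other characters (spaces, punctuation) acts as a separator.
    return preg_split('/[^\p{L}\p{M}\p{N}]+/u', $text, -1, PREG_SPLIT_NO_EMPTY);
}

print_r(splitWords("Поиск информации в интернете"));
```

A split like this is not enough for languages written without spaces between words, such as Japanese, which is why a dedicated adapter is needed there.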
N-Gram
Monachus can generate N-gram sequences of different sizes, for example bigrams or trigrams. Let's see how it works.
include_once("./vendor/autoload.php");
use Monachus\String as String;
use Monachus\Ngram as Ngram;
use Monachus\Config as Config;
$text = new String("This is an awesome text");
$config = new Config();
$config->max = 3; // we're creating trigrams.
$ngram = new Ngram($config);
var_dump($ngram->parse($text));
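To make the term concrete, here is a minimal standalone sketch of word-level n-grams built with a sliding window. The function wordNgrams is hypothetical and is not Monachus's parser. Whether the library works on words or on characters is not stated here, so word-level grouping is an assumption made for the example.

```php
<?php
// Hypothetical helper, not part of Monachus: builds word-level n-grams.
function wordNgrams(array $tokens, $n)
{
    $ngrams = array();
    // Slide a window of $n tokens over the list, one token at a time.
    for ($i = 0; $i + $n <= count($tokens); $i++) {
        $ngrams[] = implode(' ', array_slice($tokens, $i, $n));
    }
    return $ngrams;
}

$tokens = explode(' ', "This is an awesome text");
print_r(wordNgrams($tokens, 3)); // "This is an", "is an awesome", "an awesome text"
```

With five tokens and a window of three there are exactly three trigrams. A window larger than the token list yields an empty array.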
Do you need your own N-gram parser? No problem! You can create your own parsers as well.
class MyParser implements Monachus\Interfaces\NgramParserInterface
{
    public function parse(Monachus\String $string, $level)
    {
        // your awesome code!
    }
}
And then pass it to the Ngram object:
include_once("./vendor/autoload.php");
use Monachus\String as String;
use Monachus\Ngram as Ngram;
use Monachus\Config as Config;
$text = new String("This is an awesome text");
$config = new Config();
$config->max = 3; // we're creating trigrams.
$ngram = new Ngram($config);
$ngram->setParser(new MyParser());
var_dump($ngram->parse($text));