PHPNgrams
PHP N-Grams library.
Introduction
In the fields of computational linguistics, machine learning and probability, an n-gram is a contiguous sequence of n items from a given sample of text or speech. Depending on the application, the items can be phonemes, syllables, letters, words or base pairs. N-grams are typically collected from a text or speech corpus. When the items are words, n-grams may also be called shingles.
An n-gram of size 1 is referred to as a "unigram"; size 2 is a "bigram" (or, less commonly, a "digram"); size 3 is a "trigram". Larger sizes are usually named by spelling out the value of n, e.g., "four-gram", "five-gram", and so on. (More on Wikipedia)
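To make the definition concrete, here is a minimal sketch in plain PHP that does not use this library. The helper name `slidingNgrams()` is hypothetical and exists only for illustration; it slides a window of size n over a list of words and collects each contiguous run.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not part of PHPNgrams): returns every contiguous
 * run of $n items from $items, in their original order.
 */
function slidingNgrams(array $items, int $n): array
{
    $items = array_values($items);
    $ngrams = [];

    // A list of k items has k - n + 1 windows of size n (none if n > k).
    for ($i = 0; $n > 0 && $i + $n <= count($items); $i++) {
        $ngrams[] = array_slice($items, $i, $n);
    }

    return $ngrams;
}

$words = explode(' ', 'to be or not to be');

// Five bigrams: [to, be], [be, or], [or, not], [not, to], [to, be]
$bigrams = slidingNgrams($words, 2);

// Four trigrams, the first being [to, be, or]
$trigrams = slidingNgrams($words, 3);
```

The same function works on letters, syllables or any other items, since it only depends on the order of the list it receives.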
Requirements
Installation
Include this library in your project by running:
composer require drupol/phpngrams
The library provides two classes, NGrams and NGramsCyclic (both used in the examples below), and one trait.
Usage
<?php
declare(strict_types = 1);
namespace drupol\phpngrams\tests;
use drupol\phpngrams\NGrams;
use drupol\phpngrams\NGramsCyclic;
include 'vendor/autoload.php';
$string = 'hello world';
// Prefer preg_split() with the /u modifier over str_split() for UTF-8 strings:
// str_split() cuts on bytes and would break multibyte characters apart.
$chars = preg_split('/(?!^)(?=.)/u', $string);
$ngrams = (new NGrams())->ngrams($chars, 3);
print_r(iterator_to_array($ngrams));
/*
[
    0 => ['h', 'e', 'l'],
    1 => ['e', 'l', 'l'],
    2 => ['l', 'l', 'o'],
    3 => ['l', 'o', ' '],
    4 => ['o', ' ', 'w'],
    5 => [' ', 'w', 'o'],
    6 => ['w', 'o', 'r'],
    7 => ['o', 'r', 'l'],
    8 => ['r', 'l', 'd'],
];
*/
$string = 'hello world';
// Prefer preg_split() with the /u modifier over str_split() for UTF-8 strings:
// str_split() cuts on bytes and would break multibyte characters apart.
$chars = preg_split('/(?!^)(?=.)/u', $string);
$ngrams = (new NGramsCyclic())->ngrams($chars, 3);
print_r(iterator_to_array($ngrams));
/*
[
    0 => ['h', 'e', 'l'],
    1 => ['e', 'l', 'l'],
    2 => ['l', 'l', 'o'],
    3 => ['l', 'o', ' '],
    4 => ['o', ' ', 'w'],
    5 => [' ', 'w', 'o'],
    6 => ['w', 'o', 'r'],
    7 => ['o', 'r', 'l'],
    8 => ['r', 'l', 'd'],
    9 => ['l', 'd', 'h'],
    10 => ['d', 'h', 'e'],
];
*/
To keep the memory footprint as small as possible, the library returns Generators. If you want the complete resulting array, use iterator_to_array().
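The difference between lazy and eager consumption can be sketched without the library. The function `lazyCyclicNgrams()` below is a hypothetical stand-in and not the library's implementation; it is assumed to behave like the cyclic example above, yielding one window per start position and wrapping around the end of the list.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical stand-in for a generator-based n-gram method (not the
 * library's code). Yields one window of $n items per start position,
 * wrapping around the end of the list.
 */
function lazyCyclicNgrams(array $items, int $n): Generator
{
    $items = array_values($items);
    $count = count($items);

    if ($count === 0 || $n < 1) {
        return;
    }

    for ($i = 0; $i < $count; $i++) {
        $window = [];
        for ($j = 0; $j < $n; $j++) {
            // The modulo makes indexes past the end restart at 0.
            $window[] = $items[($i + $j) % $count];
        }
        yield $window;
    }
}

$chars = preg_split('//u', 'hello world', -1, PREG_SPLIT_NO_EMPTY);

// Lazy consumption: stop early, so later windows are never built.
$firstTwo = [];
foreach (lazyCyclicNgrams($chars, 3) as $index => $ngram) {
    if ($index === 2) {
        break;
    }
    $firstTwo[] = implode('', $ngram);
}

// Eager consumption: materialise all eleven windows in one array.
$all = iterator_to_array(lazyCyclicNgrams($chars, 3));
```

Looping with foreach holds only one window in memory at a time, which matters for long inputs; iterator_to_array() is convenient when the whole result is small enough to keep around.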
API
Find the complete API documentation at https://not-a-number.io/phpngrams.
Code quality and tests
Every time changes are introduced into the library, Travis CI runs the tests.
The library has tests written with PHPSpec.
Feel free to check them out in the spec directory. Run composer phpspec to trigger the tests.
PHPInfection is used to check that the code is properly covered by those tests. Run composer infection to test your own changes.
Contributing
Feel free to contribute to this library by sending GitHub pull requests. I'm quite responsive :-)