sentence-breaker
A sentence boundary disambiguation (SBD), or sentence breaking, library written in PHP.
Installation
composer require bigwhoop/sentence-breaker
Usage
<?php
use Bigwhoop\SentenceBreaker\SentenceBreaker;
$breaker = new SentenceBreaker();
$breaker->addAbbreviations(['Dr', 'Prof']);
// returns a generator, the text is parsed lazily
$sentences = $breaker->split("Hello Dr. Jones! How are you? I'm fine, thanks!");
// get the first sentence
$sentences->current(); // 'Hello Dr. Jones!'
// get all sentences as an array
iterator_to_array($sentences); // ['Hello Dr. Jones!', 'How are you?', "I'm fine, thanks!"]
Rules
By default the rules/rules.ini file is loaded. Its format is a list of patterns:
TOKEN [... TOKEN] = PROBABILITY
T_CAPITALIZED_WORD <T_PERIOD> T_WHITESPACE T_CAPITALIZED_WORD = 75
The token enclosed in < and > is the one the pattern is applied to. The example pattern above would be applied to each T_PERIOD token found in the input data. The probability (in percent) defines how likely a sentence boundary is after this token.
So for this pattern to match, the input text would need to contain something along the lines of "This is Waldo. He likes dogs."
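To make the pattern format concrete, here is a minimal sketch of how one rule line could be split into its tokens, its target token, and its probability. The function parseRuleLine() and the returned array keys are hypothetical names chosen for illustration; they are not part of the library, whose own parser may work differently.

```php
<?php
// Hypothetical helper for illustration only; not part of the library.
// Parses one rule line of the form "TOKEN [... TOKEN] = PROBABILITY".
function parseRuleLine(string $line): array
{
    $parts = explode('=', $line, 2);
    if (count($parts) !== 2) {
        throw new InvalidArgumentException('Missing "=" in rule: ' . $line);
    }

    $tokens = preg_split('/\s+/', trim($parts[0]), -1, PREG_SPLIT_NO_EMPTY);
    $target = null;
    foreach ($tokens as $i => $token) {
        if (preg_match('/^<(\w+)>$/', $token, $m) === 1) {
            $tokens[$i] = $m[1]; // strip the angle brackets
            $target = $i;        // index of the token the rule applies to
        }
    }
    if ($target === null) {
        throw new InvalidArgumentException('No <TOKEN> marker in rule: ' . $line);
    }

    return [
        'tokens' => $tokens,
        'target' => $target,
        'probability' => (int) trim($parts[1]),
    ];
}

$rule = parseRuleLine('T_CAPITALIZED_WORD <T_PERIOD> T_WHITESPACE T_CAPITALIZED_WORD = 75');
// $rule['tokens'][$rule['target']] is 'T_PERIOD' and $rule['probability'] is 75.
```

Storing the index of the marked token, rather than its name, keeps the rule unambiguous when the same token type appears more than once in a pattern.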
The available tokens are:
| Token | Description | Example |
| --- | --- | --- |
| T_WORD | A non-capitalized word. | hello, world |
| T_CAPITALIZED_WORD | A capitalized word. | Hello, World |
| T_EOF | The end of the input. | - |
| T_PERIOD | A period. | . |
| T_EXCLAMATION_POINT | An exclamation point. | ! |
| T_QUESTION_MARK | A question mark. | ? |
| T_QUOTED_STR | A string enclosed in single or double quotes. | "Hello world!", 'Hello world...' |
| T_WHITESPACE | Whitespace characters like spaces, LF, CR. | - |
| T_ABBREVIATION | An abbreviation without the trailing period. | Dr, Prof |
TIP: You can add your own rules via $breaker->addRules().
Abbreviation Providers
Inside the data directory are flat files containing abbreviations (in English), collected from various sources. They can be loaded like this:
use Bigwhoop\SentenceBreaker\Abbreviations\FlatFileProvider;
// Load legal.txt and biz.txt
$breaker->addAbbreviations(new FlatFileProvider('/path/to/data/directory', ['legal', 'biz']));
// Load all files
$breaker->addAbbreviations(new FlatFileProvider('/path/to/data/directory', ['*']));
To make it fast and easy, all abbreviations are also available in the all.txt file. You can load it like this:
$breaker->addAbbreviations(new FlatFileProvider('/path/to/data/directory', ['all']));
How does it work?
The input text is run through a lexer.
In computer science, lexical analysis is the process of converting a sequence of characters into a sequence of tokens, i.e. meaningful character strings.
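As a rough illustration of lexical analysis for this kind of input, the sketch below turns a text into (type, value) pairs using the token names from the table above. It is a simplified stand-in and not the library's lexer: tokenize() is a hypothetical function, only double-quoted strings are recognised, and T_ABBREVIATION and T_EOF are left out.

```php
<?php
// Simplified lexer sketch; the library's real lexer may differ.
// Assumptions: only double-quoted strings are recognised, and
// T_ABBREVIATION and T_EOF are not handled.
function tokenize(string $text): array
{
    $punctuation = [
        '.' => 'T_PERIOD',
        '!' => 'T_EXCLAMATION_POINT',
        '?' => 'T_QUESTION_MARK',
    ];

    // Split into quoted strings, whitespace runs, and other chunks.
    // The trailing '"' alternative keeps an unbalanced quote as its own chunk.
    preg_match_all('/"[^"]*"|\s+|[^\s"]+|"/', $text, $matches);

    $tokens = [];
    foreach ($matches[0] as $chunk) {
        if (strlen($chunk) > 1 && $chunk[0] === '"') {
            $tokens[] = ['T_QUOTED_STR', $chunk];
            continue;
        }
        if (preg_match('/^\s/', $chunk) === 1) {
            $tokens[] = ['T_WHITESPACE', $chunk];
            continue;
        }

        // Peel sentence punctuation off the end: "T.V.?" becomes "T.V", ".", "?".
        $word = rtrim($chunk, '.!?');
        if ($word !== '') {
            $isCapitalized = preg_match('/^\p{Lu}/u', $word) === 1;
            $tokens[] = [$isCapitalized ? 'T_CAPITALIZED_WORD' : 'T_WORD', $word];
        }
        for ($i = strlen($word); $i < strlen($chunk); $i++) {
            $tokens[] = [$punctuation[$chunk[$i]], $chunk[$i]];
        }
    }

    return $tokens;
}

// Keep only the token types to make the result easy to read.
$types = array_column(tokenize('He said "Stop!" loudly. Really?'), 0);
```

Punctuation inside a chunk (such as the first period in T.V.) stays part of the word; only trailing periods, exclamation points, and question marks become their own tokens.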
So, for example, the text He asked: "What's on TV?" On T.V.? I have no clue. Really! would result in the following sequence of tokens:
"He" "asked:" T_QUOTED_STR "On" "T.V" T_PERIOD T_QUESTION_MARK
"I" "have" "no" "clue" T_PERIOD "Really" T_EXCLAMATION_POINT
This sequence of tokens is then run through a probability calculator that calculates, for each token, the probability of it being the boundary of a sentence. The calculator uses rules that are matched against each token. For example, if a T_EXCLAMATION_POINT is followed by a capitalized string, the chance of it being a sentence boundary is 100%.
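The matching step can be sketched as follows. This is an illustration under assumptions, not the library's calculator: ruleMatchesAt() and boundaryProbability() are hypothetical names, rules are plain arrays, the second rule is made up to mirror the exclamation point example, and taking the highest probability when several rules match is a guess at one reasonable policy.

```php
<?php
// Illustrative sketch; the library's calculator may combine rules differently.
// A rule is ['tokens' => [...], 'target' => index of the <TOKEN>, 'probability' => int].
function ruleMatchesAt(array $types, int $position, array $rule): bool
{
    // Align the rule so that its target token sits on $position.
    $start = $position - $rule['target'];
    foreach ($rule['tokens'] as $offset => $expected) {
        if (($types[$start + $offset] ?? null) !== $expected) {
            return false;
        }
    }
    return true;
}

// Assumption: when several rules match, the highest probability wins.
function boundaryProbability(array $types, int $position, array $rules): int
{
    $best = 0;
    foreach ($rules as $rule) {
        if (ruleMatchesAt($types, $position, $rule)) {
            $best = max($best, $rule['probability']);
        }
    }
    return $best;
}

$rules = [
    // T_CAPITALIZED_WORD <T_PERIOD> T_WHITESPACE T_CAPITALIZED_WORD = 75
    [
        'tokens' => ['T_CAPITALIZED_WORD', 'T_PERIOD', 'T_WHITESPACE', 'T_CAPITALIZED_WORD'],
        'target' => 1,
        'probability' => 75,
    ],
    // Made-up rule: <T_EXCLAMATION_POINT> T_WHITESPACE T_CAPITALIZED_WORD = 100
    [
        'tokens' => ['T_EXCLAMATION_POINT', 'T_WHITESPACE', 'T_CAPITALIZED_WORD'],
        'target' => 0,
        'probability' => 100,
    ],
];

// Token types for the text: Hi Waldo. He waved! Bye
$types = [
    'T_CAPITALIZED_WORD', 'T_WHITESPACE', 'T_CAPITALIZED_WORD', 'T_PERIOD',
    'T_WHITESPACE', 'T_CAPITALIZED_WORD', 'T_WHITESPACE', 'T_WORD',
    'T_EXCLAMATION_POINT', 'T_WHITESPACE', 'T_CAPITALIZED_WORD',
];

$atPeriod = boundaryProbability($types, 3, $rules);      // 75
$atExclamation = boundaryProbability($types, 8, $rules); // 100
$atWord = boundaryProbability($types, 7, $rules);        // 0, no rule matches
```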
In the end the tokens are re-assembled into sentences. The user can choose which threshold to apply when starting new sentences, for example that the probability of a detected boundary must be greater than or equal to 50%.
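The re-assembly step might look like the sketch below. It assumes each token has already been paired with a boundary probability; assembleSentences() is a hypothetical function and the probabilities in the example are invented for illustration.

```php
<?php
// Illustrative sketch; not the library's implementation.
// Each token is a [text, probability] pair, where probability is the chance
// (in percent) that a sentence ends after this token.
function assembleSentences(array $tokens, int $threshold = 50): array
{
    $sentences = [];
    $current = '';
    foreach ($tokens as [$text, $probability]) {
        $current .= $text;
        if ($probability >= $threshold) {
            // Boundary detected: close the current sentence.
            $sentences[] = trim($current);
            $current = '';
        }
    }
    // Keep any trailing text that did not end on a boundary.
    if (trim($current) !== '') {
        $sentences[] = trim($current);
    }
    return $sentences;
}

// Invented probabilities: the period after "Dr" is an unlikely boundary.
$tokens = [
    ['Hello', 0], [' ', 0], ['Dr', 0], ['.', 20], [' ', 0], ['Jones', 0], ['!', 100],
    [' ', 0], ['How', 0], [' ', 0], ['are', 0], [' ', 0], ['you', 0], ['?', 100],
];

$strict = assembleSentences($tokens, 50); // ['Hello Dr. Jones!', 'How are you?']
$loose = assembleSentences($tokens, 20);  // ['Hello Dr.', 'Jones!', 'How are you?']
```

Lowering the threshold splits more aggressively, which is why the period after the abbreviation starts a new sentence in the second call.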
TODO
- [X] calculateCurrentTokenProbability is a big mess. Let's split it up into multiple Rule classes. Maybe use a rules engine.
- [ ] Add abbreviations support for different languages.
Contributing
# Check code style
vendor/bin/php-cs-fixer fix --diff --dry-run
# Fix code style
vendor/bin/php-cs-fixer fix --diff
# Run tests
vendor/bin/phpunit
# Run static analysis
vendor/bin/phpstan
Contributors
Made with contrib.rocks.
License
MIT. See LICENSE file.