bigwhoop/sentence-breaker

Sentence boundary disambiguation (SBD) - or sentence breaking - library written in PHP.

Sunday, May 31, 2015
by bigwhoop

The README.md

sentence-breaker

Sentence boundary disambiguation (SBD) - or sentence breaking - library written in PHP.

Installation

composer require bigwhoop/sentence-breaker

Usage

<?php
use Bigwhoop\SentenceBreaker\SentenceBreaker;

$breaker = new SentenceBreaker();
$breaker->addAbbreviations(['Dr', 'Prof']);

// returns a generator, the text is parsed lazily
$sentences = $breaker->split("Hello Dr. Jones! How are you? I'm fine, thanks!");

// get first
$sentences->current(); // 'Hello Dr. Jones!'

// get all as array
iterator_to_array($sentences); // ['Hello Dr. Jones!', 'How are you?', "I'm fine, thanks!"]

Rules

By default the `rules/rules.ini` file is loaded. Its format is a list of patterns of the following form:

TOKEN [... TOKEN] = PROBABILITY
T_CAPITALIZED_WORD <T_PERIOD> T_WHITESPACE T_CAPITALIZED_WORD = 75

The token enclosed in `<` and `>` is the one the pattern is applied to. The example pattern above is applied to each `T_PERIOD` token found in the input data. The probability defines how likely a sentence boundary is after this token.

So for this pattern to match, the input text would need to contain something along the lines of `This is Waldo. He likes dogs.`
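
To make the pattern format concrete, here is a minimal sketch of how one such line could be split into its parts. `parseRuleLine()` is a hypothetical helper written for this illustration and is not the library's own parser; it also assumes that every pattern contains exactly one token in angle brackets.

```php
<?php
// Illustrative parser for one "TOKEN [... TOKEN] = PROBABILITY" line.
// parseRuleLine() is a hypothetical helper, not part of the library.
function parseRuleLine(string $line): array
{
    $parts = explode('=', $line, 2);
    if (count($parts) !== 2) {
        throw new InvalidArgumentException("Missing '=' in rule: $line");
    }

    $tokens = preg_split('/\s+/', trim($parts[0]), -1, PREG_SPLIT_NO_EMPTY);
    $probability = (int) trim($parts[1]);

    // Find the token wrapped in < >; it marks where the pattern applies.
    // Assumption: exactly one such token per pattern.
    $targetIndex = null;
    foreach ($tokens as $i => $token) {
        if (preg_match('/^<(\w+)>$/', $token, $m)) {
            if ($targetIndex !== null) {
                throw new InvalidArgumentException("More than one <TOKEN> in rule: $line");
            }
            $targetIndex = $i;
            $tokens[$i] = $m[1]; // strip the angle brackets
        }
    }
    if ($targetIndex === null) {
        throw new InvalidArgumentException("No <TOKEN> in rule: $line");
    }

    return [
        'tokens' => $tokens,
        'targetIndex' => $targetIndex,
        'probability' => $probability,
    ];
}

$rule = parseRuleLine('T_CAPITALIZED_WORD <T_PERIOD> T_WHITESPACE T_CAPITALIZED_WORD = 75');
```

For the example pattern, `$rule['targetIndex']` is `1` (the `T_PERIOD` token) and `$rule['probability']` is `75`.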

The available tokens are:

Token	Description	Example
`T_WORD`	A non-capitalized word.	`hello`, `world`
`T_CAPITALIZED_WORD`	A capitalized word.	`Hello`, `World`
`T_EOF`	The end of the input.	-
`T_PERIOD`	A period.	`.`
`T_EXCLAMATION_POINT`	An exclamation point.	`!`
`T_QUESTION_MARK`	A question mark.	`?`
`T_QUOTED_STR`	A string enclosed in single or double quotes.	`"Hello world!"`, `'Hello world...'`
`T_WHITESPACE`	Whitespace characters like spaces, LF, CR.	-
`T_ABBREVIATION`	An abbreviation without the trailing period.	`Dr`, `Prof`

TIP: You can add your own rules via `$breaker->addRules()`.

Abbreviation Providers

The `data` directory contains flat files of English abbreviations, collected from various sources. They can be loaded like this:

use Bigwhoop\SentenceBreaker\Abbreviations\FlatFileProvider;

// Load legal.txt and biz.txt
$breaker->addAbbreviations(new FlatFileProvider('/path/to/data/directory', ['legal', 'biz']));

// Load all files
$breaker->addAbbreviations(new FlatFileProvider('/path/to/data/directory', ['*']));

To make it fast and easy, all abbreviations are also available in the `all.txt` file. You can load it like this:

$breaker->addAbbreviations(new FlatFileProvider('/path/to/data/directory', ['all']));

How does it work?

The input text is run through a lexer.

In computer science, lexical analysis is the process of converting a sequence of characters into a sequence of tokens, i.e. meaningful character strings.

For example, the text `He asked: "What's on TV?" On T.V.? I have no clue. Really!` would result in the following sequence of tokens:

"He" "asked:" T_QUOTED_STR "On" "T.V" T_PERIOD T_QUESTION_MARK
"I" "have" "no" "clue" T_PERIOD "Really" T_EXCLAMATION_POINT
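
The lexing step can be sketched with a deliberately small toy lexer. This is an illustration under simplifying assumptions, not the library's lexer: `toyLex()` is a hypothetical function that only knows words, whitespace and the three terminators, so it does not recognize quoted strings or abbreviations and would split `T.V` at the period.

```php
<?php
// Toy lexer: handles only words, whitespace and the three terminators.
// Quoted strings and abbreviations are intentionally ignored here.
function toyLex(string $text): array
{
    $punctuation = [
        '.' => 'T_PERIOD',
        '!' => 'T_EXCLAMATION_POINT',
        '?' => 'T_QUESTION_MARK',
    ];

    // Split into runs of whitespace, single terminators, and everything else.
    preg_match_all('/\s+|[.!?]|[^\s.!?]+/u', $text, $matches);

    $tokens = [];
    foreach ($matches[0] as $chunk) {
        if (isset($punctuation[$chunk])) {
            $type = $punctuation[$chunk];
        } elseif (preg_match('/^\s+$/u', $chunk)) {
            $type = 'T_WHITESPACE';
        } elseif (preg_match('/^\p{Lu}/u', $chunk)) {
            $type = 'T_CAPITALIZED_WORD';
        } else {
            $type = 'T_WORD';
        }
        $tokens[] = ['type' => $type, 'value' => $chunk];
    }
    $tokens[] = ['type' => 'T_EOF', 'value' => ''];

    return $tokens;
}

$tokens = toyLex('This is Waldo. He likes dogs.');
$types = array_column($tokens, 'type');
```

In `$types`, the tokens for `Waldo. He` appear as `T_CAPITALIZED_WORD`, `T_PERIOD`, `T_WHITESPACE`, `T_CAPITALIZED_WORD`, which is exactly the shape the example rule in the Rules section looks for.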

This sequence of tokens is then run through a probability calculator, which computes for each token the probability that it marks a sentence boundary. The calculator matches its rules against each token. For example, if a `T_EXCLAMATION_POINT` is followed by a capitalized string, the chance of it being a sentence boundary is 100%.
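
A rough sketch of such rule matching follows. It is not the library's calculator: `boundaryProbability()` and the rule array shape are hypothetical, the second rule (including its whitespace token) is made up to mirror the exclamation point example, and taking the highest probability when several rules match is an assumption.

```php
<?php
// Hypothetical rule shape: token types, index of the <TARGET> token, probability.
$rules = [
    [
        'tokens' => ['T_CAPITALIZED_WORD', 'T_PERIOD', 'T_WHITESPACE', 'T_CAPITALIZED_WORD'],
        'targetIndex' => 1,
        'probability' => 75,
    ],
    [
        // Made-up rule modeled on the exclamation point example.
        'tokens' => ['T_EXCLAMATION_POINT', 'T_WHITESPACE', 'T_CAPITALIZED_WORD'],
        'targetIndex' => 0,
        'probability' => 100,
    ],
];

// Returns the boundary probability for the token at $index.
// Assumption: when several rules match, the highest probability wins;
// when none match, the probability is 0.
function boundaryProbability(array $types, int $index, array $rules): int
{
    $best = 0;
    foreach ($rules as $rule) {
        // Align the rule so that its target token sits on $index.
        $start = $index - $rule['targetIndex'];
        if ($start < 0) {
            continue;
        }
        $window = array_slice($types, $start, count($rule['tokens']));
        if ($window === $rule['tokens']) {
            $best = max($best, $rule['probability']);
        }
    }

    return $best;
}

// Token types for the text "Waldo. He barks! Loud. ok"
$types = [
    'T_CAPITALIZED_WORD', 'T_PERIOD', 'T_WHITESPACE',
    'T_CAPITALIZED_WORD', 'T_WHITESPACE', 'T_WORD', 'T_EXCLAMATION_POINT', 'T_WHITESPACE',
    'T_CAPITALIZED_WORD', 'T_PERIOD', 'T_WHITESPACE',
    'T_WORD', 'T_EOF',
];

$atFirstPeriod = boundaryProbability($types, 1, $rules);
$atExclamation = boundaryProbability($types, 6, $rules);
$atLastPeriod  = boundaryProbability($types, 9, $rules);
```

The first period is followed by a capitalized word and scores 75, the exclamation point scores 100, and the last period scores 0 because no rule in this small set covers a period followed by a lowercase word.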

In the end the tokens are re-assembled into sentences. You can choose the threshold that applies when starting a new sentence; for example, a boundary may be accepted only when its probability is greater than or equal to 50%.
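
The effect of the threshold can be sketched as follows. `assemble()` is a hypothetical function for illustration only, the probabilities in `$scored` are made-up values, and trimming whitespace around each sentence is an assumption about the desired output.

```php
<?php
// Each entry pairs a token's text with the boundary probability computed for it.
// The probabilities are made up for illustration.
$scored = [
    ['Hello', 0], [' ', 0], ['Dr', 0], ['.', 10], [' ', 0], ['Jones', 0], ['!', 100],
    [' ', 0], ['How', 0], [' ', 0], ['are', 0], [' ', 0], ['you', 0], ['?', 100],
];

// Lazily yields sentences, cutting after every token whose probability
// reaches the threshold.
function assemble(array $scored, int $threshold = 50): Generator
{
    $current = '';
    foreach ($scored as [$value, $probability]) {
        $current .= $value;
        if ($probability >= $threshold) {
            yield trim($current);
            $current = '';
        }
    }
    // Emit whatever is left after the last accepted boundary.
    if (trim($current) !== '') {
        yield trim($current);
    }
}

$default = iterator_to_array(assemble($scored), false);
$lenient = iterator_to_array(assemble($scored, 10), false);
```

With the default threshold of 50 the period after `Dr` is not treated as a boundary, so the first sentence stays `Hello Dr. Jones!`. Lowering the threshold to 10 accepts that period as well and splits the text into three pieces.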

TODO

[X] calculateCurrentTokenProbability is a big mess. Let's split it up into multiple Rule classes. Maybe use a rules engine.
[ ] Add abbreviations support for different languages.

Contributing

# Check code style
vendor/bin/php-cs-fixer fix --diff --dry-run

# Fix code style
vendor/bin/php-cs-fixer fix --diff

# Run tests
vendor/bin/phpunit

# Run static analysis
vendor/bin/phpstan

Contributors

Made with contrib.rocks.

License

MIT. See LICENSE file.

The Versions

Date	Version	Normalized version	License	Requires	Author
31/05 2015	dev-master	9999999-dev	MIT	php >=5.6	Philippe Gerber
31/05 2015	2.0.1	2.0.1.0	MIT	php >=5.6	Philippe Gerber
26/05 2015	2.0.0	2.0.0.0	MIT	php >=5.6	Philippe Gerber
19/05 2015	1.0.1	1.0.1.0	MIT	php >=5.6	Philippe Gerber
19/05 2015	0.1.2	0.1.2.0	MIT	php >=5.6	Philippe Gerber
19/05 2015	1.0.0	1.0.0.0	MIT	php >=5.6	Philippe Gerber
19/05 2015	0.1.1	0.1.1.0	MIT	php >=5.6	Philippe Gerber
19/05 2015	0.1.0	0.1.0.0	MIT	php >=5.6	Philippe Gerber
