
library tokenizer

Provides a way of tokenizing strings

org_heigl/tokenizer

  • Wednesday, January 8, 2014
  • by heiglandreas
  • Repository
  • 1 Watchers
  • 1 Stars
  • 1,706 Installations
  • 0 Dependents
  • 0 Suggesters
  • 0 Forks
  • 0 Open issues
  • 2 Versions
  • 0 % Grown

The README.md

Tokenizer

Provides ways to split strings into smaller units, depending on the tokenizers used.

You can chain different tokenizers in a TokenizerQueue to get the results you want.

Currently this library provides these tokenizers:

  • WhitespaceTokenizer splits strings at whitespace. It can be used to split a sentence into single words.
  • CamelCaseTokenizer splits CamelCased strings into separate tokens.
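
To make the two splitting strategies concrete, here is a minimal sketch in plain PHP. It does not use this library. The helper names `splitOnWhitespace` and `splitCamelCase` are hypothetical, and the regular expressions are an assumption about reasonable behaviour, not the library's actual implementation.

```php
<?php
// Hypothetical helpers for illustration only; not part of org_heigl/tokenizer.

/**
 * Split a string at runs of whitespace, keeping each run as its own piece.
 */
function splitOnWhitespace(string $input): array
{
    return preg_split('/(\s+)/', $input, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
}

/**
 * Split a CamelCased word at every lowercase-to-uppercase boundary.
 */
function splitCamelCase(string $word): array
{
    return preg_split('/(?<=[a-z])(?=[A-Z])/', $word, -1, PREG_SPLIT_NO_EMPTY);
}

// Chain the two steps: whitespace first, then CamelCase on each piece.
$pieces = [];
foreach (splitOnWhitespace('A String with WhiteSpace') as $part) {
    foreach (splitCamelCase($part) as $piece) {
        $pieces[] = $piece;
    }
}

print_r($pieces);
```

Applied to 'A String with WhiteSpace', the sketch yields eight pieces: "A", " ", "String", " ", "with", " ", "White" and "Space". Unlike the library's tokens, these plain strings carry no offset or type information.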

Installation

Install using Composer by adding the following line to the require section of your composer.json file:

"org_heigl/tokenizer" : "dev-master"

Usage

Usage is rather simple:

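// Load Composer's autoloader (assumed location; adjust the path to your project)
require __DIR__ . '/vendor/autoload.php';

use Org_Heigl\Tokenizer\TokenizerQueue;
use Org_Heigl\Tokenizer\Tokenizers;

// Create a new Tokenizer-Queue
$tokenizer = new TokenizerQueue();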

// Add single tokenizers to the queue
// First a Whitespace tokenizer
$tokenizer->addTokenizer(new Tokenizers\WhitespaceTokenizer());
// Then a CamelCase-Tokenizer
$tokenizer->addTokenizer(new Tokenizers\CamelCaseTokenizer());

// Finally tokenize a given string
$tokenList = $tokenizer->tokenize('A String with WhiteSpace');

var_dump((array) $tokenList);

// This prints output similar to the following
// (the exact format and object IDs depend on your PHP setup):
/*
array(8) {
  [0] =>
  class Org_Heigl\Tokenizer\Token#216 (3) {
    protected $token =>
    string(1) "A"
    protected $offset =>
    int(0)
    protected $type =>
    string(6) "string"
  }
  [1] =>
  class Org_Heigl\Tokenizer\Token#215 (3) {
    protected $token =>
    string(1) " "
    protected $offset =>
    int(1)
    protected $type =>
    string(10) "whitespace"
  }
  [2] =>
  class Org_Heigl\Tokenizer\Token#214 (3) {
    protected $token =>
    string(6) "String"
    protected $offset =>
    int(2)
    protected $type =>
    string(6) "string"
  }
  [3] =>
  class Org_Heigl\Tokenizer\Token#213 (3) {
    protected $token =>
    string(1) " "
    protected $offset =>
    int(8)
    protected $type =>
    string(10) "whitespace"
  }
  [4] =>
  class Org_Heigl\Tokenizer\Token#212 (3) {
    protected $token =>
    string(4) "with"
    protected $offset =>
    int(9)
    protected $type =>
    string(6) "string"
  }
  [5] =>
  class Org_Heigl\Tokenizer\Token#211 (3) {
    protected $token =>
    string(1) " "
    protected $offset =>
    int(13)
    protected $type =>
    string(10) "whitespace"
  }
  [6] =>
  class Org_Heigl\Tokenizer\Token#209 (3) {
    protected $token =>
    string(5) "White"
    protected $offset =>
    int(14)
    protected $type =>
    string(6) "string"
  }
  [7] =>
  class Org_Heigl\Tokenizer\Token#208 (3) {
    protected $token =>
    string(5) "Space"
    protected $offset =>
    int(19)
    protected $type =>
    string(6) "string"
  }
}
*/

The Versions

  • dev-master (9999999-dev), January 8, 2014. License: MIT. Keywords: token, string, tokenize.
  • dev-test, January 7, 2014. License: MIT. Keywords: token, string, tokenize.