remorhaz/php-unilex

Unilex: lexical analyzer generator with Unicode support written in PHP

  • Wednesday, June 6, 2018
  • by remorhaz

The README.md

UniLex

UniLex is a lexical analyzer generator (similar to lex and flex) with Unicode support. It is written in PHP and generates PHP code.

[WIP] Work in progress

Requirements

  • PHP 8

License

The UniLex library is licensed under the MIT license.

Installation

Install it like any other Composer library:

composer require remorhaz/php-unilex

Usage

Quick start by example

Suppose we want to write a simple calculator and need a lexer (lexical analyzer) that provides a stream of identifiers, numbers and operators. Create a new Composer project and run the following command from the project directory:

composer require --dev remorhaz/php-unilex

The next step is to create a lexer specification in the LexerSpec.php file. We use the @lexToken tag in DocBlock comments to specify the regular expression for each token:

<?php
/**
 * @var \Remorhaz\UniLex\Lexer\TokenMatcherContextInterface $context
 * @lexTargetClass TokenMatcher
 * @lexHeader
 */

const TOKEN_ID = 1;
const TOKEN_OPERATOR = 2;
const TOKEN_NUMBER = 3;

/** @lexToken /[a-zA-Z][0-9a-zA-Z]*()/ */
$context->setNewToken(TOKEN_ID);

/** @lexToken /[+\-*\/]/ */
$context->setNewToken(TOKEN_OPERATOR);

/** @lexToken /[0-9]+/ */
$context->setNewToken(TOKEN_NUMBER);

The next step is to build a token matcher from the specification:

vendor/bin/unilex LexerSpec.php > TokenMatcher.php

Now we have a compiled token matcher in the TokenMatcher.php file. Let's use it to read all tokens from the buffer:

<?php

use Remorhaz\UniLex\Lexer\TokenFactory;
use Remorhaz\UniLex\Lexer\TokenReader;
use Remorhaz\UniLex\Unicode\CharBufferFactory;

require_once "vendor/autoload.php";
require_once "TokenMatcher.php";

$buffer = CharBufferFactory::createFromString("x+2*3");
$tokenReader = new TokenReader($buffer, new TokenMatcher, new TokenFactory(0xFF));

do {
    $token = $tokenReader->read();
    echo "Token ID: {$token->getType()}\n";
} while (!$token->isEoi());

When executed, this script outputs:

Token ID: 1
Token ID: 2
Token ID: 3
Token ID: 2
Token ID: 3
Token ID: 255
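
To see why the input x+2*3 yields exactly this sequence of IDs, here is a plain-PHP sketch that applies the same three regular expressions by hand with preg_match. It is only an illustration and makes no claim about how the generated TokenMatcher is implemented. The function name sketchTokenize is hypothetical, the first matching pattern wins, and unlike UniLex the sketch works on bytes rather than Unicode symbols.

```php
<?php

/**
 * Hypothetical helper (not part of UniLex): splits $input into token type IDs
 * by trying each pattern at the current offset.
 *
 * @return int[] token type IDs, terminated by the EOI type
 */
function sketchTokenize(string $input, int $eoiType = 0xFF): array
{
    // Same patterns and IDs as in the specification; \G anchors at $offset.
    $patterns = [
        1 => '/\G[a-zA-Z][0-9a-zA-Z]*/', // TOKEN_ID
        2 => '/\G[+\-*\/]/',             // TOKEN_OPERATOR
        3 => '/\G[0-9]+/',               // TOKEN_NUMBER
    ];
    $types = [];
    $offset = 0;
    $length = strlen($input);
    while ($offset < $length) {
        $matched = false;
        foreach ($patterns as $type => $pattern) {
            if (preg_match($pattern, $input, $m, 0, $offset) === 1) {
                $types[] = $type;
                $offset += strlen($m[0]); // every pattern consumes at least one byte
                $matched = true;
                break;
            }
        }
        if (!$matched) {
            throw new RuntimeException("Unexpected symbol at offset {$offset}");
        }
    }
    // The end of input is reported as one more token, like the EOI token above.
    $types[] = $eoiType;

    return $types;
}

echo implode(' ', sketchTokenize('x+2*3')), "\n";
```

The sketch prints 1 2 3 2 3 255, the same IDs as the listing above, with 255 (0xFF) standing in for the end-of-input token type passed to TokenFactory.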

Let's go a bit further and make it possible to retrieve the text representation of every token from the input buffer. We need to modify the lexer specification so that it attaches the matched text to each non-EOI token as an attribute:

<?php
/**
 * @var \Remorhaz\UniLex\Lexer\TokenMatcherContextInterface $context
 * @lexTargetClass TokenMatcher
 * @lexHeader
 */

const TOKEN_ID = 1;
const TOKEN_OPERATOR = 2;
const TOKEN_NUMBER = 3;

/** @lexToken /[a-zA-Z][0-9a-zA-Z]*()/ */
$context
    ->setNewToken(TOKEN_ID)
    ->setTokenAttribute('text', $context->getSymbolString());

/** @lexToken /[+\-*\/]/ */
$context
    ->setNewToken(TOKEN_OPERATOR)
    ->setTokenAttribute('text', $context->getSymbolString());

/** @lexToken /[0-9]+/ */
$context
    ->setNewToken(TOKEN_NUMBER)
    ->setTokenAttribute('text', $context->getSymbolString());

After rebuilding the token matcher with the CLI utility, we need to modify the output loop of our example program so that it prints the text alongside each token ID:

do {
    $token = $tokenReader->read();
    echo
        "Token ID: {$token->getType()}",
        $token->isEoi() ? "\n" : " / '{$token->getAttribute('text')}'\n";
} while (!$token->isEoi());

Now the program prints:

Token ID: 1 / 'x'
Token ID: 2 / '+'
Token ID: 3 / '2'
Token ID: 2 / '*'
Token ID: 3 / '3'
Token ID: 255
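
If the tokens are needed as data rather than as printed lines, the same loop can be wrapped in a small helper. This is a sketch under assumptions: collectTokens is a hypothetical function name, not part of UniLex, and it relies only on the calls already used in this example (read(), isEoi(), getType() and getAttribute('text')).

```php
<?php

/**
 * Hypothetical helper: drains a token reader into a list of [type, text]
 * pairs and stops at the EOI token, which carries no 'text' attribute.
 *
 * @param object $tokenReader anything with a read() method returning tokens
 * @return array<int, array{int, string}>
 */
function collectTokens(object $tokenReader): array
{
    $result = [];
    while (true) {
        $token = $tokenReader->read();
        if ($token->isEoi()) {
            break; // end of input: nothing more to collect
        }
        $result[] = [$token->getType(), $token->getAttribute('text')];
    }

    return $result;
}
```

Called with a freshly created $tokenReader for the buffer "x+2*3", this would be expected to return pairs such as [1, 'x'] and [2, '+'], assuming the rebuilt matcher sets the text attribute on every non-EOI token.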

CLI

You can use the command-line utility to build a token matcher from a specification:

vendor/bin/unilex path/to/spec/LexerSpec.php path/to/target/TokenMatcher.php --desc="My example matcher."

Specification

A specification is a PHP file that is split into parts by DocBlock comments with special tags. A special variable $context holds a context object that implements \Remorhaz\UniLex\Lexer\TokenMatcherContextInterface. The current implementation also uses an int variable $char that contains the current symbol (TODO: should be moved into the context object).

@lexHeader

This block can contain namespace and use statements that will be used during matcher generation.

@lexBeforeMatch

This block is executed before the matching procedure begins and can be used to initialize additional variables.

@lexOnTransition

This block is executed on each symbol matched by a token's regular expression.

@lexToken /regexp/

This block is executed when the given regular expression matches the input buffer. Most commonly it just sets up a new token in the context object.

@lexMode 'mode_name'

This tag tells the parser that the accompanying @lexToken expression matches only if the current lexical mode is mode_name. The lexical mode can be switched with the $context->setMode('mode_name') method. Lexical modes allow several "sub-grammars" in one specification (e.g. some tokens can be recognized only inside comments or strings).
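
The idea of lexical modes can be illustrated without UniLex at all. The following plain-PHP sketch keeps one rule table per mode and lets a rule switch the current mode, so a space is accepted inside a quoted string but rejected outside of it. Everything here is hypothetical (the function name sketchTokenizeWithModes, the mode names and the token names); it shows the concept only and is not the specification syntax or the generated matcher.

```php
<?php

/**
 * Hypothetical illustration of lexical modes (not UniLex code): each mode has
 * its own rule table, and a rule may switch the current mode.
 *
 * @return array<int, array{string, string}> list of [token name, text]
 */
function sketchTokenizeWithModes(string $input): array
{
    // mode => list of [pattern, token name, mode to switch to or null]
    $rules = [
        'default' => [
            ['/\G[a-z]+/', 'word', null],
            ['/\G"/', 'quote', 'string'],
        ],
        'string' => [
            ['/\G[^"]+/', 'string_body', null],
            ['/\G"/', 'quote', 'default'],
        ],
    ];
    $mode = 'default';
    $tokens = [];
    $offset = 0;
    while ($offset < strlen($input)) {
        $matched = false;
        // Only the rules of the current mode are tried.
        foreach ($rules[$mode] as [$pattern, $name, $nextMode]) {
            if (preg_match($pattern, $input, $m, 0, $offset) === 1) {
                $tokens[] = [$name, $m[0]];
                $offset += strlen($m[0]);
                $mode = $nextMode ?? $mode; // switch mode if the rule asks for it
                $matched = true;
                break;
            }
        }
        if (!$matched) {
            throw new RuntimeException("No rule matches at offset {$offset} in mode '{$mode}'");
        }
    }

    return $tokens;
}

foreach (sketchTokenizeWithModes('say"a b"end') as [$name, $text]) {
    echo "{$name}: '{$text}'\n";
}
```

Here the opening quote switches to the string mode, where the text a b (including the space) becomes a single string_body token, and the closing quote switches back. The same space outside the quotes would raise an exception, because the default mode has no rule for it.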

@lexOnError

This block is executed if the matcher fails to match any of the tokens' regular expressions. By default it just returns false.

The Versions

06/06 2018

dev-master

9999999-dev https://github.com/remorhaz/php-unilex

Unilex: lexical analyzer generator with Unicode support written in PHP

MIT

by Edward Surov

tokenizer lex lexical analyzer lexical analyzer generator tokenizer generator
