PEG (parsing expression grammar) compiler for PHP
by Hamish Friedlander
https://github.com/hafriedlander/php-peg
Version: dev-master (9999999-dev)
License: MIT, GPLv2, BSD-3
Requires:
- php >=5.3.0
Keywords: parser, dsl, compiler, peg
This is a Parsing Expression Grammar compiler for PHP. PEG parsers are an alternative to CFG-based parsers that combine tokenization and parsing in a single top-down grammar. For a basic overview of the subject, see http://en.wikipedia.org/wiki/Parsing_expression_grammar
To compile a parser:

    php ./cli.php ExampleParser.peg.inc > ExampleParser.php

To use the compiled parser:

    $x = new ExampleParser( 'string to parse' ) ;
    $res = $x->match_Expr() ;

Parsers are contained within a PHP file, in one or more special comment blocks that start with /*!* [name | !pragma] (like a docblock, but with an exclamation mark in the middle of the stars).
You can have multiple comment blocks, all of which are treated as contiguous for the purpose of compiling. During compilation these blocks will be replaced with a set of "matching" functions (functions which match a string against their rules), one for each rule in the block.
The optional name marks the start of a new set of parser rules. This is currently unused, but might be used in future for optimization & debugging purposes. If unspecified, it defaults to the same name as the previous parser comment block, or 'Anonymous Parser' if no name has ever been set.
If the name starts with an '!' symbol, that comment block is a pragma, and is treated not as part of the parser, but as a special block of metadata.
Lexically, these blocks are a set of rules and comments. A rule can be a base rule or an extension rule.
Base rules consist of a name for the rule, some optional arguments, the matching rule itself, and an optional set of attached functions.

    NAME ( "(" ARGUMENT, ... ")" )? ":" MATCHING_RULE
      ATTACHED_FUNCTIONS?

Names may contain only the characters a-z, A-Z, 0-9, _ and -, and must not start with a number.
Base rules can be split over multiple lines as long as subsequent lines are indented.
Extension rules take one of two forms. The first is the same as a base rule, with the additional name of the rule to extend. The second, a replacing extension, consists of a name for the rule, the name of the rule to extend, and optionally some arguments, some replacements, and a set of attached functions.

    NAME extends BASENAME ( "(" ARGUMENT, ... ")" )? ":" MATCHING_RULE
      ATTACHED_FUNCTIONS?

    NAME extends BASENAME ( "(" ARGUMENT, ... ")" )? ( ";" REPLACE "=>" REPLACE_WITH, ... )?
      ATTACHED_FUNCTIONS?

We allow indenting a parser block, but only in a consistent manner: whatever the indent of the /*!* marker is becomes the "base" indent, and must be used for all lines. You can mix tabs and spaces, but the indent must always be an exact match. If the "base" indent is a tab then two spaces, every line within the block also needs indenting with a tab then two spaces, not two tabs (even if that gives the same indent in your editor).
Any line with more than the "base" indent is considered a continuation of the previous rule.
Any line with less than the "base" indent is an error.
This might get looser if I get around to re-writing the internal "parser parser" in php-peg, bootstrapping the whole thing.
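The indent rules above can be sketched as a small classifier. This is a hypothetical helper written for illustration, not code taken from php-peg; the function name and return values are assumptions, and blank lines are not handled.

```php
<?php
// Hypothetical classifier for one line inside a parser block.
// $baseIndent is the exact whitespace found before the /*!* marker.
function classifyLine($baseIndent, $line)
{
    // The line must begin with exactly the same characters as the base indent.
    if ($baseIndent !== '' && strpos($line, $baseIndent) !== 0) {
        return 'error';
    }
    $rest = (string) substr($line, strlen($baseIndent));
    // Extra leading whitespace beyond the base indent marks a continuation.
    if ($rest !== '' && ($rest[0] === ' ' || $rest[0] === "\t")) {
        return 'continuation';
    }
    return 'rule';
}
```

With a base indent of a tab then two spaces, a line indented with two tabs is reported as an error even though an editor may render it at the same column.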
PEG matching rules try to follow standard PEG format, summarised as follows:

    token*           - Token is optionally repeated
    token+           - Token is repeated at least once
    token?           - Token is optionally present
    tokena tokenb    - Token tokenb follows tokena, both of which are present
    tokena | tokenb  - One of tokena or tokenb is present, preferring tokena
    &token           - Token is present next (but not consumed by parse)
    !token           - Token is not present next (but not consumed by parse)
    ( expression )   - Grouping for priority

But with these extensions:

    < or >  - Optionally match whitespace
    [ or ]  - Require some whitespace

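Tokens may be:
- bare words, which are recursive matchers: references to rules defined elsewhere in the grammar
- literals, surrounded by " or ' quote pairs. No escaping support is provided in literals.
- regular expressions, surrounded by / pairs
- expressions, either a single word starting with $ or a more complex form surrounded by ${ }, which call a user defined function to perform the match
Regular expression tokens are automatically anchored to the current string start, so do not include a string start anchor (^) anywhere. They always act as if the 'x' flag is enabled in PHP: whitespace is ignored unless escaped, and '#' starts a comment.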
Be careful when ending a regular expression token: the '*/' pattern (as in /foo\s*/) will end a PHP comment. Since the 'x' flag is always active, just split with a space (as in / foo \s* /).
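To see what anchoring plus the 'x' flag means in practice, here is a minimal sketch using plain PCRE. The helper name is hypothetical, and using the 'A' modifier with an offset is only one way to get this behaviour; it is not claimed to be how php-peg does it internally.

```php
<?php
// Hypothetical helper: match a grammar-style regex body at a given offset.
function matchRegexAt($body, $subject, $pos)
{
    // 'A' anchors the match at the offset; 'x' ignores unescaped whitespace
    // in the pattern and treats '#' as the start of a comment.
    $pattern = '/' . $body . '/Ax';
    if (preg_match($pattern, $subject, $m, 0, $pos) === 1) {
        return $m[0];
    }
    return null; // no match at exactly this position
}

// The spaced-out body avoids writing "*/" inside a PHP comment.
$matched = matchRegexAt(' foo \s* ', 'xxfoo  bar', 2);
```

Because the pattern is anchored, the same call with an offset of 0 returns null rather than scanning ahead to find "foo".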
Expressions allow run-time calculated matching. You can embed an expression within a literal or regex token to match against a calculated value, or simply specify the expression as a token to match against a dynamic rule.
When getting a value to use for an expression, the parser will travel up the stack looking for a set value. The expression stack is a list of all the rules passed through to get to this point. For example, given the parser:

    A: $a
    B: A
    C: B

The expression stack for finding $a will be C, B, A. In other words, the A rule will be checked first, followed by B, followed by C.
When an expression is embedded in a literal or regex token (a terminal expression), the token will be replaced by the looked-up value. To find the value for the token, the expression stack will be travelled up checking for one of the following:
- a key / value pair in the result array node
- a rule-attached method including the $ (i.e. function $foo())
If no value is found, it will then check whether a method or a property excluding the $ exists on the parser. If neither of those is found, the expression will be replaced with an empty string.
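The lookup order can be illustrated with a simplified model. The function below is hypothetical and not part of php-peg: it assumes each stack entry is a result array, models the parser's own properties as a plain array, and leaves out rule-attached methods.

```php
<?php
// Hypothetical helper: resolve a terminal expression such as "$q" by
// walking the expression stack from the innermost rule outwards.
function lookupExpression(array $stack, $name, array $parserProps = array())
{
    // $stack is ordered outermost first (C, B, A), so walk it in reverse.
    foreach (array_reverse($stack) as $resultNode) {
        if (array_key_exists($name, $resultNode)) {
            return (string) $resultNode[$name];
        }
    }
    // Fall back to a value held on the parser itself.
    if (array_key_exists($name, $parserProps)) {
        return (string) $parserProps[$name];
    }
    // Nothing found: a terminal expression becomes the empty string.
    return '';
}

$stack = array(
    array('text' => ''),             // result node for C
    array('text' => '', 'q' => '"'), // result node for B
    array('text' => '', 'q' => "'"), // result node for A (checked first)
);
$value = lookupExpression($stack, 'q');
```

Here the value set in A's node wins over the one in B's node, because A is the innermost rule.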
When an expression is used as a token in its own right (a token expression), the token will be looked up to find a value, which must be the name of a matching rule. That rule will then be matched against as if the token were a recursive match for that rule.
To find the name of the rule to match against, the expression stack will be travelled up checking for one of the following:
- a key / value pair in the result array node
- a rule-attached method including the $ (i.e. function $foo())
If no value is found, it will then check whether a method or a property excluding the $ exists on the parser. If neither of those is found, the rule will fail to match.
Be careful not to use a token expression when you meant to use a terminal expression, for example:

    quoted_good: q:/['"]/ string "$q"
    quoted_bad:  q:/['"]/ string $q

"$q" matches against the value of q again. $q tries to match against a rule named " or ' (both of which are illegal rule names, and will therefore fail).
Tokens and groups can be given names by prepending name and :, e.g.

    rulea: "'" name:( tokena tokenb )* "'"

There must be no space between the name and the :, so this rule is invalid:

    badrule: "'" name : ( tokena tokenb )* "'"

Recursive matchers can be given a name the same as their rule name by prepending with just a :. These next two rules are equivalent:

    rulea: tokena tokenb:tokenb
    rulea: tokena :tokenb

Each rule can have a set of functions attached to it. These functions can be defined:
- in-grammar, by indenting the function body after the rule
- in-class, after the close of the grammar comment, by defining a regular method whose name is {$rulename}_{$functionname}, or {$rulename}{$functionname} if the function name starts with _
- in a subclass
All functions that are not in-grammar must have PHP compatible names (see the name mapping described below). In-grammar functions will have their names converted if needed.
All these definitions define the same rule-attached function:
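As a minimal sketch of those locations: the class names and the ParserStub base class below are stand-ins invented for illustration, and a real parser would extend the base class that ships with php-peg. The in-grammar and in-class forms are assumed to be alternatives; defining both in one class may clash once the grammar is compiled.

```php
<?php
// Hypothetical stand-in for the parser base class, so this file runs on its own.
class ParserStub {}

class A extends ParserStub
{
    /*!* Parser
    foo: bar baz
      function bar() {}
    */

    // In-class form: named {$rulename}_{$functionname}.
    function foo_bar() { return 'A'; }
}

class B extends A
{
    // Subclass form: the same rule-attached function, redefined.
    function foo_bar() { return 'B'; }
}
```

Until the file is run through the compiler, the grammar block is just a comment, so only the two regular methods exist at this point.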
Rules in the grammar map to PHP functions named match_{$rulename}. However, rule names can contain characters that PHP function names can't. These characters are remapped:

    '-' => '_'
    '$' => 'DLR'
    '*' => 'STR'

Other disallowed characters are removed.
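The mapping can be expressed in a few lines. This is a sketch with a hypothetical function name, not the compiler's own code, and it assumes "disallowed" means anything outside letters, digits and underscore.

```php
<?php
// Hypothetical helper: convert a grammar name into a PHP-compatible name.
function phpNameFor($ruleName)
{
    // Apply the three documented substitutions first.
    $mapped = strtr($ruleName, array('-' => '_', '$' => 'DLR', '*' => 'STR'));
    // Assumption: every remaining character outside [A-Za-z0-9_] is dropped.
    return preg_replace('/[^A-Za-z0-9_]/', '', $mapped);
}

$functionName = 'match_' . phpNameFor('quoted-string');
```

Under these assumptions a rule called quoted-string would be reached through a function named match_quoted_string.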
Results are a tree of nested arrays.
Without any specific control, each rule's result will just be the text it matched against in a ['text'] member. This member must always exist.
Marking a subexpression, literal, regex or recursive match with a name (by prepending name and :) will insert a member into the result array with that name. If there is only one match it will be a single result array. If there is more than one match it will be an array of arrays.
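The single-versus-multiple behaviour can be modelled as follows. The function is a hypothetical illustration of the described result shape, not php-peg's actual storage code; it relies on every result array having a 'text' member to tell a single result from a list of results.

```php
<?php
// Hypothetical model of default storage for a named sub-match.
function storeNamed(array &$res, $name, array $sub)
{
    // Every match contributes to the parent's 'text' member.
    $res['text'] .= $sub['text'];

    if (!isset($res[$name])) {
        $res[$name] = $sub;                     // first match: a single result array
    } elseif (isset($res[$name]['text'])) {
        $res[$name] = array($res[$name], $sub); // second match: promote to a list
    } else {
        $res[$name][] = $sub;                   // later matches: append to the list
    }
}

$res = array('text' => '');
storeNamed($res, 'item', array('text' => 'a'));
$afterOne = $res;
storeNamed($res, 'item', array('text' => 'b'));
```

After the first call, $res['item'] is one result array; after the second, it is an array holding two result arrays.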
You can override result storing by specifying a rule-attached function with the given name. It will be called with a reference to the current result array and the sub-match; in this case the default storage action will not occur.
If you specify a rule-attached function for a recursive match, you do not need to name that token at all. It will be called automatically, e.g.

    rulea: tokena tokenb
      function tokenb ( &$res, $sub ) { print 'Will be called, even though tokenb is not named or marked with a :' ; }

You can also specify a rule-attached function called *, which will be called with every recursive match made:

    rulea: tokena tokenb
      function * ( &$res, $sub ) { print 'Will be called for both tokena and tokenb' ; }

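The same kind of handler can be written as a regular method. The sketch below uses a hypothetical class and a hand-written call to stand in for what a generated matcher might do; only the ( &$res, $sub ) signature and the {$rulename}_{$functionname} naming come from the text above.

```php
<?php
class HandlerSketch
{
    // In-class handler for tokenb inside rulea.
    function rulea_tokenb(&$res, $sub)
    {
        // Because a handler exists, nothing is stored by default;
        // this one keeps only the length of the sub-match.
        $res['tokenb_length'] = strlen($sub['text']);
    }
}

$res = array('text' => '');
$handler = new HandlerSketch();
// Simulate the call made after tokenb has matched the text "hello".
$handler->rulea_tokenb($res, array('text' => 'hello'));
```

Taking $res by reference is what lets the handler decide exactly what ends up in the result array.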
By default all matches are added to the 'text' property of a result. By prepending a member with . that match will not be added to the ['text'] member. This doesn't affect the other result properties that named rules add.
Rules can inherit from other rules using the keyword extends. There are several ways to change the matching of the rule, but they all share a common feature: when building a result set, the rule will also check the inherited-from rule's rule-attached functions for storage handlers. This lets you do something like:

    A: Foo Bar Baz
      function *(){
        // Generic store handler
      }

    B extends A
      function Bar(){
        // Custom handling for Bar - Foo and Baz will still fall through to the A#* function defined above
      }

The actual matching rule can be specified in three ways:
Duplication: if you don't specify a new rule or a replacement set, the matching rule is copied as is. This is useful when you want to override some storage logic but not the rule itself.
Text replacement: you can replace some parts of the inherited rule by using a ';' instead of a ':' after the name of the extended rule. You can then put replacements in a comma-separated list. An example might help:

    A: Foo | Bar | Baz

    # Makes B the equivalent of Foo | Bar | (Baz | Qux)
    B extends A; Baz => (Baz | Qux)

Note that the replacements are not quoted. The exception is when you want to replace with the empty string, e.g.

    A: Foo | Bar | Baz

    # Makes B the equivalent of Foo | Bar
    B extends A; | Baz => ""

Currently there is no escaping supported. If you want to replace "," or "=>" characters you'll have to use full replacement.
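To make the text replacement behaviour concrete, here is a small model of it. Both functions are hypothetical and assume plain substring substitution with a naive split on "," and "=>"; the real compiler may differ in detail.

```php
<?php
// Hypothetical: turn 'Baz => (Baz | Qux), X => Y' into search/replace pairs.
function parseReplacements($spec)
{
    $pairs = array();
    // A naive split like this is why "," and "=>" cannot themselves be replaced.
    foreach (explode(',', $spec) as $item) {
        list($search, $replaceWith) = explode('=>', $item, 2);
        $replaceWith = trim($replaceWith);
        // A quoted empty string is the one quoted form: it means "remove".
        $pairs[trim($search)] = ($replaceWith === '""') ? '' : $replaceWith;
    }
    return $pairs;
}

// Hypothetical: apply the pairs to the text of the inherited rule.
function applyReplacements($baseRule, array $pairs)
{
    foreach ($pairs as $search => $replaceWith) {
        $baseRule = str_replace($search, $replaceWith, $baseRule);
    }
    return trim($baseRule);
}

$extended = applyReplacements('Foo | Bar | Baz', parseReplacements('Baz => (Baz | Qux)'));
$reduced  = applyReplacements('Foo | Bar | Baz', parseReplacements('| Baz => ""'));
```

The first call yields the rule text for the (Baz | Qux) example, and the second drops the Baz alternative entirely.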
Full replacement: you can specify an entirely new rule in the same format as a non-inheriting rule, e.g.

    A: Foo | Bar | Baz

    B extends A: Foo | Bar | (Baz Qux)

This is useful if the rule changes too much for text replacement to be readable, but you want to keep the storage logic.
When opening a parser comment block, if instead of a name (or no name) you put a word starting with '!', that comment block is treated as a pragma: not part of the parser language itself, but some other instruction to the compiler. These pragmas are currently understood:
- !silent: This is a comment that should only appear in the source code. Don't output it in the generated code.
- !insert_autogen_warning: Insert a warning comment into the generated code at this point, warning that the file is autogenerated and not to edit it.