Unstructured Text Parser [PHP]
About Unstructured Text Parser
This is a small PHP library that helps extract text from documents that are not structured in a processing-friendly format.
For example, to parse text out of form-generated emails, you create a template that matches the expected incoming mail format and marks the variable text elements.
The class then uses that template to extract those variables from the body text of incoming mails.
Useful when you want to parse data out of:
* Emails generated from web forms
* Documents with definable templates / expressions
Installation
PHP Unstructured Text Parser is available on Packagist (using semantic versioning), and installation via Composer is recommended.
Add the following line to the require section of your composer.json file:
"aymanrb/php-unstructured-text-parser": "~2.0"
or run:
composer require aymanrb/php-unstructured-text-parser
<?php
include_once __DIR__ . '/../vendor/autoload.php';
$parser = new aymanrb\UnstructuredTextParser\TextParser('/path/to/templatesDirectory');
$textToParse = 'Text to be parsed fetched from a file, mail, web service, or even added directly to a string variable like this';
//performs brute-force parsing against all available templates, returns the first successful match
$parseResults = $parser->parseText($textToParse);
print_r($parseResults->getParsedRawData());
//slower, performs a similarity check on the available templates to select the best-matching one before parsing
print_r(
$parser
->parseText($textToParse, true)
->getParsedRawData()
);
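The second call asks the parser to pick the closest template before parsing. As a rough illustration of what such a similarity check can look like, the sketch below ranks candidate templates with PHP's built-in similar_text(). This is an assumption made for illustration only: the library's real selection logic may differ, and pickMostSimilarTemplate() and the inline templates are hypothetical.

```php
<?php
// Hypothetical helper (not part of the library): returns the key of the
// template whose text is most similar to $text, or null if none are given.
function pickMostSimilarTemplate(array $templates, string $text): ?string
{
    $bestName = null;
    $bestScore = -1.0;

    foreach ($templates as $name => $template) {
        // similar_text() writes the similarity percentage into $percent.
        similar_text($template, $text, $percent);
        if ($percent > $bestScore) {
            $bestScore = $percent;
            $bestName = $name;
        }
    }

    return $bestName;
}

// Two made-up templates, keyed by an arbitrary name.
$templates = [
    'contact' => "Name: {%senderName%}\nE-Mail: {%senderEmail%}",
    'order'   => "Order number: {%orderId:[0-9]+%}\nTotal: {%total%}",
];

$chosen = pickMostSimilarTemplate(
    $templates,
    "Name: Pet Cat\nE-Mail: email@example.com"
);
echo $chosen, PHP_EOL;
```

Scoring every template costs extra time per parse, which is consistent with the "slower" note in the usage comments above.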
Parsing Procedure
1- Grab a single copy of the text you want to parse.
2- Replace every varying piece of text within it with a named variable in the form {%VariableName%} if you want to match everything in that part of the text, or {%VariableName:Pattern%} if you want to match a specific set of characters or use a more precise pattern.
3- Add the template file to the templates directory (defined in the parsing code) with a .txt extension, e.g. fileName.txt.
4- Pass the text you wish to parse to the parseText() method of the class and let it do the magic for you.
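To make the placeholder syntax from step 2 concrete, here is a minimal sketch of how a template could be turned into a regular expression with named groups. It is an illustration only, not the library's actual implementation. The functions templateToRegex() and parseWithTemplate() are hypothetical, and falling back to a lazy .*? for placeholders without a pattern is an assumption.

```php
<?php
// Hypothetical helper: converts a template containing {%name%} and
// {%name:pattern%} placeholders into a PCRE pattern with named groups.
// Assumes placeholder patterns contain neither "%" nor the "/" delimiter.
function templateToRegex(string $template): string
{
    // Split into literal text (even indexes) and placeholders (odd indexes).
    $parts = preg_split('/(\{%[^%]+%\})/', $template, -1, PREG_SPLIT_DELIM_CAPTURE);

    $regex = '';
    foreach ($parts as $index => $part) {
        if ($index % 2 === 0) {
            // Literal template text must match verbatim.
            $regex .= preg_quote($part, '/');
            continue;
        }
        // Strip the surrounding "{%" and "%}".
        $inner = substr($part, 2, -2);
        // Assumption: placeholders without a pattern match anything, lazily.
        [$name, $pattern] = array_pad(explode(':', $inner, 2), 2, '.*?');
        $regex .= '(?<' . $name . '>' . $pattern . ')';
    }

    // The "s" flag lets "." span line breaks inside a captured value.
    return '/^' . $regex . '$/s';
}

// Hypothetical helper: returns the named captures, or null when the text
// does not match the template.
function parseWithTemplate(string $template, string $text): ?array
{
    if (preg_match(templateToRegex($template), $text, $matches) !== 1) {
        return null;
    }
    // preg_match() returns numeric and named keys; keep only the named ones.
    return array_filter($matches, 'is_string', ARRAY_FILTER_USE_KEY);
}

$template = "ID & Source: {%id:[0-9]+%} {%source%}\nName: {%senderName%}";
$text = "ID & Source: 12234432 Website Form\nName: Pet Cat";

$result = parseWithTemplate($template, $text);
print_r($result);
```

Here the id placeholder only accepts digits because of its [0-9]+ pattern, while source and senderName capture whatever sits between the surrounding literal text.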
Template Example
If the text documents you want to parse look like this:
Hello,
If you wish to parse message coming from a website that states info like:
ID & Source: 12234432 Website Form
Name: Pet Cat
E-Mail: email@example.com
Comment: Some text goes here
Thank You,
Best Regards
Admin
Your template file (example_template.txt) could look something like this:
Hello,
If you wish to parse message coming from a website that states info like:
ID & Source: {%id:[0-9]+%} {%source%}
Name: {%senderName%}
E-Mail: {%senderEmail%}
Comment: {%comment%}
Thank You,
Best Regards
Admin
The output of a successful parsing job would be:
Array(
'id' => '12234432',
'source' => 'Website Form',
'senderName' => 'Pet Cat',
'senderEmail' => 'email@example.com',
'comment' => 'Some text goes here'
)
Upgrading from v1.x to v2.x
Version 2.0 is largely a refactored version of 1.x and provides the exact same functionality.
The one difference is in the returned results: parseText() now returns a parsed data object instead of an array.
To get the results as an array, as in v1.x, simply call getParsedRawData() on the returned object.
<?php
//parseText() used to return an array in 1.x
$extractedArray = $parser->parseText($textToParse);
//In 2.x you need to do the following if you want an array
$extractedArray = $parser->parseText($textToParse)->getParsedRawData();