This is a collection of PHP 7 classes useful for
extracting, transforming, and loading data between sources., (*1)
Hierarchical XML and JSON can be automatically converted to relational SQL.
Support includes extracting data documents from a file system or REST API,
and then loading the data into a DBMS like Microsoft SQL Server., (*2)
Values are surveyed for datatypes, numeric cardinality, and unique natural key
candidates. Then this information is used to create a normalized multi-table
database structure suited to insert the data., (*3)
Requirements
PHP >= 7.0, (*4)
Installation
This library is registered on packagist and can be easily installed into
your project using composer., (*5)
composer require jpuck/etl
Getting Started
There are 3 basic groups of interrelated classes:
Sources provide Data which have Schemata., (*6)
-
Sources, (*7)
Sources extend the abstract Source
class and transport Datum
objects.
In particular, the abstract DB
class has concrete class implementations
such as MicrosoftSQLServer
., (*8)
-
Data, (*9)
Data classes extend Datum
and must implement a valid parser, satisfied by
ParseValidator
. It uses the Schematizer
to construct the object from raw
data, which can be overridden by passing an existing Schema
., (*10)
-
Schemata, (*11)
A Schema
is a concrete class with a Validator
to enforce structure.
The Merger
class is for combining Schemas to create super-set Schemas.
The DDL
trait is used by the DB
class to generate
SQL Data Definition Language which contains abstract methods to be
implemented by a specific database management system., (*12)
Schematizer
The Schematizer
class is for surveying the structure of the data.
It includes node names, the count of distinct element groupings,
numeric cardinality for relationships between subnodes, and descriptive
statistics about the values including uniqueness. Categorically, it recognizes
datetime, integer, and decimal datatypes. Decimals will include scale and
precision measurements suitable for SQL., (*13)
Schematizer::getPrecision
returns the scale and precision of
numeric values suitable for the SQL DECIMAL(scale,precision)
datatype.
This function has notable behavior in that trailing zeros are discarded
when passed as raw PHP float types. However, when passed as a string, then the
trailing zeros are preserved in the precision.
See SchematizerUtilitiesTest::precisionDataProvider
for examples.
Note that in the XML
class, parsed values are represented as strings
in PHP, so trailing zeros should be represented in the precision values., (*14)
node name
โโโ count
โ โโโ max
โ โ โโโ measure
โ โ โโโ value
โ โโโ min
โ โโโ measure
โ โโโ value
โโโ unique (all values)
โโโ primaryKey
โโโ varchar โโโโโ
โ โโโ max โ
โ โ โโโ measure โ
โ โ โโโ value โ
โ โโโ min โ
โ โโโ measure โ
โ โโโ value โ
โโโ datetime โ datatypes
โ โโโ max โ
โ โ โโโ value โ
โ โโโ min โ
โ โโโ value โ
โโโ int/decimal โ
โ โโโ max โ
โ โ โโโ value โ
โ โโโ min โ
โ โโโ value โโโโโ
โโโ scale โโโโโ
โ โโโ max โ
โ โ โโโ measure โ
โ โ โโโ value โ
โ โโโ min โ
โ โโโ measure โ
โ โโโ value โ
โโโ precision โ if decimal
โ โโโ max โ
โ โ โโโ measure โ
โ โ โโโ value โ
โ โโโ min โ
โ โโโ measure โ
โ โโโ value โโโโโ
โโโ children
โ โโโ distinct (count of children)
โ โโโ count
โ โโโ max
โ โ โโโ measure
โ โโโ min
โ โโโ measure
โโโ attributes
โ โโโ ... (excluding count, which must be 1)
โโโ elements
โโโ ...
Database Connections
The DB
class requires an instance of PDO
in the constructor to connect,
but it is possible to pass a null
value if only utilizing the class for DDL., (*15)
SQL Data Definition Language
When one-to-many XML nodes are used to represent one-to-one relationships, then
the Schematizer
recognizes this and a DDL
class flattens them as columns
on a table. If a node has more than one of its name or grandchildren, then the
one-to-many relationship is preserved in a separate normalized table. Surrogate
keys are created to maintain the Primary/Foreign Key referential integrity., (*16)
If the Schema
has a primaryKey
set, then that field will be used for DDL
generation instead of the surrogate. However, this Schema
must also be passed
to the Datum
constructor prior to being used with DB::insert
, otherwise the
surrogate keys will be used by default and will result in a failed insertion if
the surrogate columns don't exist., (*17)
Saving Schemas
Generating Schemas can take a long time and may require customization, such
as adding primaryKey
flags or removing unwanted fields to be ignored. Here are
some examples for exporting and importing:, (*18)
$xml = file_get_contents("sample.xml");
$xml = new XML($xml);
$schema = $xml->schema();
// normal JSON_PRETTY_PRINT
echo $schema;
By simply echoing the object, output can be redirected to a file from console:, (*19)
php script.php > myschema.json
Use [file_put_contents
][19] to write to disk.
Schema::toJSON
accepts all the [json_encode
][18] options., (*20)
$string = $schema->toJSON(JSON_UNESCAPED_UNICODE);
file_put_contents('myschema.json', $string);
// native php array
$array = var_export($schema->toArray(), true);
$array = "<?php return $array;";
file_put_contents('myschema.php', $array);
Import any of those formats the same way by passing the filename, an
array, or a JSON string to the constructor., (*21)
$schemas []= new Schema('myschema.php');
$schemas []= new Schema('myschema.json');
$schemas []= new Schema($schema->toArray());
$schemas []= new Schema($schema->toJSON());
$schemas []= new Schema($schema->toJSON(JSON_PRETTY_PRINT));
foreach($schemas as $s){
var_dump($schema == $s);
}
Override the internal Schematizer
by passing a Schema
to the Datum
constructor., (*22)
$schema = new Schema('myschema.json');
$xml = file_get_contents("sample.xml");
$xml = new XML($xml, $schema);
You can also pass the Schema
override through Source::fetch
, (*23)
$credentials = [
'url' => 'https://api.example.com',
'username' => 'user',
'password' => 'P@55w0rd',
];
$source = new REST($credentials);
$xml = $source->fetch('endpoint', XML::class, new Schema('myschema.json'));
Development
The development dependencies can be installed by running composer with or
without the --dev
option (enabled by default)., (*24)
composer install --dev
Testing
Tests are written for PHPUnit which is included as a composer
dev-dependency. To run the whole test suite, then execute this command from the
shell console:, (*25)
php vendor/bin/phpunit
You might also be interested in an easy to read checklist output:, (*26)
php vendor/bin/phpunit --testdox
When stepping through breakpoints in an IDE, like Netbeans, it's helpful to see
the current test name output by setting the run configuration to debug:, (*27)
php vendor/bin/phpunit --debug
A code coverage report is available if you have the xdebug
extension
installed. In addition to the console text summary report, a full HTML GUI is
generated to explore in the coverage
folder. The easiest way to view this is
to boot up a dev server:, (*28)
php -S localhost:8080 -t coverage/
Database Testing
You must create the file (or symbolic link) tests/data/pdos/pdo.php
in order
to run the database tests. This should simply return a PDO
instance,
for example:, (*29)
<?php
return (function(){
$hostname = 'sql.example.com';
$database = 'mydb';
$username = 'user';
$password = 'P@55w0rd';
// https://www.microsoft.com/en-us/download/details.aspx?id=50419
$driver = 'ODBC Driver 13 for SQL Server';
$pdo = new PDO("odbc:Driver=$driver;
Server=$hostname;
Database=$database",
$username,
$password
);
$pdo->setAttribute( PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION );
return $pdo;
})();