Diavazo PHP7 HTML Parser
Diavazo is a wrapper around \DOMDocument and \DOMElement. It adds convenience methods for
searching within descendants and querying by class. The HTMLDocument class can load HTML from a string, a
file, or a URL, and provides some basic search methods.
For example, the method getElement("p .spanClass b.bClass") searches for tag names, class names,
and combinations of both. This call finds all <p> elements, all elements
with the class spanClass, and all <b class="bClass"> elements.
The result of such a search is an array of HTMLElement objects. These can be queried in the same way, with the difference
that the search is restricted to the element's own descendants.
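To make the selector semantics concrete, here is a minimal sketch that reproduces the same union of matches with PHP's built-in DOMXPath. It does not use Diavazo and is not taken from its implementation; the helper name hasClass and the sample markup are assumptions made for illustration.

```php
<?php
// Sketch only: plain DOMDocument/DOMXPath, not Diavazo's implementation.

// Hypothetical helper: XPath test for "the class attribute contains the token $class".
function hasClass(string $class): string
{
    return "contains(concat(' ', normalize-space(@class), ' '), ' $class ')";
}

$html = '<html><body><p>one</p><span class="spanClass x">two</span>'
      . '<b class="bClass">three</b><b>four</b></body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Union of the three selectors: <p>, .spanClass and b.bClass
$query = '//p | //*[' . hasClass('spanClass') . '] | //b[' . hasClass('bClass') . ']';

$names = [];
foreach ($xpath->query($query) as $node) {
    $names[] = $node->nodeName . ':' . $node->textContent;
}

print_r($names); // p:one, span:two, b:three (the plain <b> is not matched)
```

The space-separated parts of the selector are alternatives, not a descendant chain, which is why the XPath uses a union (|).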
Installation
composer require gm314/diavazo
Usage
use Diavazo\HTMLDocument;
$document = new HTMLDocument();
// load file
$document->loadFile("local.html");
$document->loadFile("http://mypage.com/test.html");
// load from string
$document->loadString("<html></html>");
HTMLDocument methods
$document = new HTMLDocument();
$document->loadFile(__DIR__ . "/assets/TableToArrayTest.html");
// get element by id
$table = $document->getElementById("associateArrayTest");
// get element by tag name
$elementList = $document->getElementByTagName("div");
// find all <p> elements, all elements with the class 'spanClass' and all <b class="bClass">
$elementList = $document->getElement("p .spanClass b.bClass");
// xpath query
$title = $document->query("/html/head/title");
// get root (<html>)
$root = $document->getRootElement();
HTMLElement descendants methods
An HTMLElement is the result of queries such as getElementById. Further search methods can
be applied to the element; they search within all of its descendants.
The method getDescendantByName("td th") searches for several tag names at once.
$document = new HTMLDocument();
$document->loadFile(__DIR__ . "/assets/TableToArrayTest.html");
$table = $document->getElementById("table");
// will return the first tr (Breadth-first search)
$table->getFirstDescendantByName("tr");
// will return all td and th elements
$tdList = $table->getDescendantByName("td th");
// will find all elements that have the class 'active'
$root = $document->getRootElement();
$elementsWithClass = $root->getDescendantWithClassName("active");
// will find all elements that have the class 'myClass' and are td or th elements
$elementsWithClass = $root->getDescendantWithClassName("myClass", "td th");
// will find all elements having only the class 'testClass'
$elementsWithExactClass = $root->getDescendantWithClassNameStrict("testClass");
// will find all elements having only the class 'testClass' and are td or th elements
$elementsWithExactClass = $root->getDescendantWithClassNameStrict("testClass", "td th");
// find all <p> elements, all elements with the class 'spanClass' and all <b class="bClass"> that are descendants of #myId
$anyElement = $document->getElementById("myId");
$elementList = $anyElement->getElement("p .spanClass b.bClass");
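The breadth-first behaviour of getFirstDescendantByName can be illustrated with a small sketch on a plain DOMElement tree. This is an assumption about what "breadth-first" means here, not Diavazo's code; the function name firstDescendantByName and the sample XML are hypothetical.

```php
<?php
// Sketch only: breadth-first search over a plain DOMElement tree.
function firstDescendantByName(DOMElement $root, string $name): ?DOMElement
{
    // Start with the direct children, then work outwards level by level.
    $queue = [];
    foreach ($root->childNodes as $child) {
        $queue[] = $child;
    }

    while ($queue) {
        $node = array_shift($queue);
        if (!$node instanceof DOMElement) {
            continue; // skip text nodes and comments
        }
        if ($node->nodeName === $name) {
            return $node;
        }
        foreach ($node->childNodes as $child) {
            $queue[] = $child;
        }
    }

    return null; // no matching descendant
}

$dom = new DOMDocument();
// Well-formed XML keeps the tree exactly as written.
$dom->loadXML('<table><thead><tr id="head"/></thead><tr id="shallow"/></table>');

$first = firstDescendantByName($dom->documentElement, 'tr');
echo $first->getAttribute('id'), "\n"; // prints "shallow"
```

A depth-first search in document order would return the tr inside thead first; the breadth-first variant prefers the match closest to the starting element.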
HTMLElement attribute methods
$document = new HTMLDocument();
$document->loadFile("myFile.html");
$table = $document->getElementById("myTable");
// returns the attribute value as a string, or null if the attribute does not exist
$table->getAttributeValue("align");
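For comparison, PHP's native DOMElement::getAttribute returns an empty string for a missing attribute, so a null-returning accessor has to check hasAttribute first. The sketch below shows one way this could look; the function name attributeOrNull is hypothetical and this is not Diavazo's implementation.

```php
<?php
// Sketch only: a null-returning attribute accessor on top of plain DOMElement.
function attributeOrNull(DOMElement $element, string $name): ?string
{
    return $element->hasAttribute($name) ? $element->getAttribute($name) : null;
}

$dom = new DOMDocument();
$dom->loadXML('<table align="center"/>');
$table = $dom->documentElement;

$align = attributeOrNull($table, 'align'); // "center"
$width = attributeOrNull($table, 'width'); // null, the attribute is absent

var_dump($align, $width);
```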
Table to Array Converter
Diavazo can convert a table to an associative or an index-based array. The associative variant
uses the cells of the first row as keys.
$document = new HTMLDocument();
$document->loadFile("tabletest.html");
$table = $document->getElementById("myTableID");
$arrayConverter = new TableToArrayConverter($table);
$array = $arrayConverter->getAsAssociativeArray();
| Key1 | Key2 |
| Value 1 | Value 2 |
...
will result in:
$array = [
    [
        "Key1" => "Value 1",
        "Key2" => "Value 2"
    ],
    ...
]
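The "first row becomes the keys" idea can be sketched with the built-in DOM classes alone. This is an illustration under stated assumptions (no nested tables, rows with a mismatched cell count are skipped), not the TableToArrayConverter implementation; the function name tableToAssociativeArray is hypothetical.

```php
<?php
// Sketch only: the first row supplies the keys, each later row becomes one entry.
function tableToAssociativeArray(DOMElement $table): array
{
    $rows = [];
    // Assumption: the table contains no nested tables.
    foreach ($table->getElementsByTagName('tr') as $tr) {
        $cells = [];
        foreach ($tr->childNodes as $cell) {
            if ($cell instanceof DOMElement && in_array($cell->nodeName, ['td', 'th'], true)) {
                $cells[] = trim($cell->textContent);
            }
        }
        $rows[] = $cells;
    }

    $keys = array_shift($rows) ?? [];

    $result = [];
    foreach ($rows as $cells) {
        // Assumption: rows whose cell count differs from the header are skipped.
        if (count($cells) === count($keys)) {
            $result[] = array_combine($keys, $cells);
        }
    }

    return $result;
}

$dom = new DOMDocument();
$dom->loadXML(
    '<table><tr><th>Key1</th><th>Key2</th></tr>'
    . '<tr><td>Value 1</td><td>Value 2</td></tr></table>'
);

$array = tableToAssociativeArray($dom->documentElement);
print_r($array);
```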
The following example shows how to register an extractor. The closure is invoked
with the table data cell (<td>) and is expected to return the value that will be added to the array.
Here, the extractor gets the first <a> element in the cell and returns its href attribute.
$document = $this->getDocument();
$table = $document->getElementById("extractorTest");
$arrayConverter = new TableToArrayConverter($table);
$arrayConverter->registerExtractor("columnName", function (HTMLElement $td) {
    $a = $td->getFirstDescendantByName("a");
    return $a->getAttributeValue("href");
});
$array = $arrayConverter->getAsAssociativeArray();
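The extractor mechanism can be thought of as a map from column name to a callable that replaces the default "use the cell text" behaviour. The sketch below illustrates that idea with plain DOM classes; the $extractors array, the column names, and the sample markup are assumptions, and the code is not taken from Diavazo.

```php
<?php
// Sketch only: per-column extractors applied while building the associative array.

// Hypothetical extractor map: column name => callable receiving the <td> element.
$extractors = [
    'Link' => function (DOMElement $td): ?string {
        $a = $td->getElementsByTagName('a')->item(0);
        return $a instanceof DOMElement ? $a->getAttribute('href') : null;
    },
];

$dom = new DOMDocument();
$dom->loadXML(
    '<table><tr><th>Name</th><th>Link</th></tr>'
    . '<tr><td>Docs</td><td><a href="/docs">open</a></td></tr></table>'
);

$rows = iterator_to_array($dom->getElementsByTagName('tr'), false);

// The first row supplies the column names.
$header = array_shift($rows);
$keys = [];
foreach ($header->getElementsByTagName('th') as $th) {
    $keys[] = trim($th->textContent);
}

$result = [];
foreach ($rows as $tr) {
    $entry = [];
    $i = 0;
    foreach ($tr->getElementsByTagName('td') as $td) {
        $key = $keys[$i++];
        // Use the registered extractor if there is one, otherwise the cell text.
        $entry[$key] = isset($extractors[$key])
            ? $extractors[$key]($td)
            : trim($td->textContent);
    }
    $result[] = $entry;
}

print_r($result);
```

Returning null when the cell contains no <a> element keeps the extractor from failing on rows that lack a link.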