Unicode
Miscellaneous Unicode utility functions.
Functions
Namespace pcrov\Unicode.
surrogate_pair_to_code_point(int $high, int $low): int
Translates a UTF-16 surrogate pair into a single code point. [Wikipedia's UTF-16 article][0]
explains surrogate pairs fairly well.
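
The arithmetic behind the translation is short enough to sketch. The helper below, `example_surrogate_pair_to_code_point()`, is a hypothetical stand-in written for illustration and is not the library's implementation; the range checks and the exception type are assumptions.

```php
<?php
// Hypothetical helper for illustration only; not part of pcrov\Unicode.
function example_surrogate_pair_to_code_point(int $high, int $low): int
{
    // High surrogates occupy U+D800..U+DBFF, low surrogates U+DC00..U+DFFF.
    if ($high < 0xD800 || $high > 0xDBFF || $low < 0xDC00 || $low > 0xDFFF) {
        throw new InvalidArgumentException("Not a valid surrogate pair.");
    }

    // Each surrogate carries 10 bits; the combined value is offset by 0x10000.
    return 0x10000 + (($high - 0xD800) << 10) + ($low - 0xDC00);
}

printf("U+%X\n", example_surrogate_pair_to_code_point(0xD83D, 0xDE00)); // U+1F600
```

The pair D83D DE00 is the UTF-16 encoding of U+1F600, which makes it a convenient sanity check.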
utf8_find_invalid_byte_sequence(string $string): ?int
Returns the position of the first invalid byte sequence or null if the input is valid.
utf8_get_invalid_byte_sequence(string $string): ?string
Returns the first invalid byte sequence or null if the input is valid.
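
To give a feel for what locating an invalid sequence involves, here is a minimal sketch that finds where the well-formed prefix of a string ends. It is not how the library does it: `example_find_invalid_offset()` is a hypothetical helper built on a regular expression, and it assumes that "position" means a byte offset. It does not try to decide how long the invalid sequence is.

```php
<?php
// Hypothetical helper for illustration only; not part of pcrov\Unicode.
// Returns the byte offset where the well-formed UTF-8 prefix ends, or null
// if the whole string is well-formed.
function example_find_invalid_offset(string $string): ?int
{
    // One alternative per well-formed byte pattern in the Unicode standard.
    $wellFormed = '/\A(?:
          [\x00-\x7F]
        | [\xC2-\xDF][\x80-\xBF]
        | \xE0[\xA0-\xBF][\x80-\xBF]
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}
        | \xED[\x80-\x9F][\x80-\xBF]
        | \xF0[\x90-\xBF][\x80-\xBF]{2}
        | [\xF1-\xF3][\x80-\xBF]{3}
        | \xF4[\x80-\x8F][\x80-\xBF]{2}
    )*+/x';

    // The pattern can match the empty string, so anything but 1 is an engine error.
    if (preg_match($wellFormed, $string, $match) !== 1) {
        throw new RuntimeException("Pattern matching failed.");
    }
    $validLength = strlen($match[0]);

    return $validLength === strlen($string) ? null : $validLength;
}

var_dump(example_find_invalid_offset("abc\xFFdef")); // int(3)
var_dump(example_find_invalid_offset("h\xC3\xA9llo")); // NULL
```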
utf8_get_state_machine(): array
Provides a state machine letting you walk a (potentially endless) UTF-8
sequence byte by byte.
It is in the form of [byte => [valid next byte => ...], ...]
Example use:

    function utf8_generate_all_code_points(): string
    {
        // Depth-first walk: each level of recursion appends one byte to $buffer.
        $generator = function (array $machine, string $buffer = "") use (&$generator) {
            // Completed a UTF-8 encoded code point: only a state that begins
            // a new code point accepts the NUL byte.
            if ($buffer !== "" && isset($machine["\x0"])) {
                return $buffer;
            }

            $out = "";
            foreach ($machine as $byte => $next) {
                $out .= $generator($next, $buffer . $byte);
            }
            return $out;
        };

        return $generator(utf8_get_state_machine());
    }

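
The same nested-array shape can also be walked over an input string instead of enumerated. The sketch below is illustrative only and rests on assumptions: `example_build_toy_machine()` builds a much smaller hypothetical machine (ASCII and two-byte sequences only) so the block runs without the library, and `example_walk()` reuses the NUL-byte boundary check from the example above. Both names are invented here.

```php
<?php
// Hypothetical toy machine for illustration only: it accepts ASCII and
// two-byte sequences, and is NOT the array returned by the library.
function example_build_toy_machine(): array
{
    $start = [];
    $continuation = [];

    // A final continuation byte (80..BF) leads back to the start state.
    for ($byte = 0x80; $byte <= 0xBF; $byte++) {
        $continuation[chr($byte)] = &$start;
    }
    // ASCII bytes (00..7F) are complete on their own.
    for ($byte = 0x00; $byte <= 0x7F; $byte++) {
        $start[chr($byte)] = &$start;
    }
    // Lead bytes C2..DF expect exactly one continuation byte.
    for ($byte = 0xC2; $byte <= 0xDF; $byte++) {
        $start[chr($byte)] = $continuation;
    }

    return $start;
}

// Walks $bytes through $machine one byte at a time. Returns "complete",
// "incomplete" (input ended mid-sequence), or "invalid".
function example_walk(array $machine, string $bytes): string
{
    $state = $machine;
    for ($i = 0, $length = strlen($bytes); $i < $length; $i++) {
        if (!isset($state[$bytes[$i]])) {
            return "invalid";
        }
        $state = $state[$bytes[$i]];
    }

    // Only a state that begins a new code point accepts the NUL byte.
    return isset($state["\x0"]) ? "complete" : "incomplete";
}

$toy = example_build_toy_machine();
echo example_walk($toy, "a\xC3\xA9b"), "\n"; // complete
echo example_walk($toy, "\xC3"), "\n";       // incomplete
echo example_walk($toy, "\x80"), "\n";       // invalid
```

If the real machine follows the convention its own example relies on, passing `utf8_get_state_machine()` in place of the toy machine should behave the same way for the full range of UTF-8, but that is an assumption rather than documented behavior.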
utf8_validate(string $string): bool
Does what it says on the box: returns true if the string is valid UTF-8, false otherwise.
Data
The test/data directory holds two files containing all possible UTF-8 encoded characters.
All 1,112,064 of them. One as plain text, the other as JSON. These are not included in
packaged stable releases but can be generated with the example utf8_generate_all_code_points()
function above, which returns the plain text string.
Excerpts from the [Unicode 10.0.0 standard][1]:
Recreated here for ease of reference. Nobody likes PDFs.
Table 3-6. UTF-8 Bit Distribution

| Scalar Value | First Byte | Second Byte | Third Byte | Fourth Byte |
| --- | --- | --- | --- | --- |
| 00000000 0xxxxxxx | 0xxxxxxx | | | |
| 00000yyy yyxxxxxx | 110yyyyy | 10xxxxxx | | |
| zzzzyyyy yyxxxxxx | 1110zzzz | 10yyyyyy | 10xxxxxx | |
| 000uuuuu zzzzyyyy yyxxxxxx | 11110uuu | 10uuzzzz | 10yyyyyy | 10xxxxxx |

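
The bit distribution in Table 3-6 maps directly onto shifts and masks. As a rough sketch, `example_encode_code_point()` below (a hypothetical helper, not part of this library) encodes one scalar value, with each byte annotated with the bit pattern it comes from.

```php
<?php
// Hypothetical helper for illustration only; not part of pcrov\Unicode.
function example_encode_code_point(int $cp): string
{
    // Surrogates (U+D800..U+DFFF) are code points but not scalar values.
    if ($cp < 0 || $cp > 0x10FFFF || ($cp >= 0xD800 && $cp <= 0xDFFF)) {
        throw new InvalidArgumentException("Not a Unicode scalar value.");
    }
    if ($cp <= 0x7F) {
        return chr($cp);                            // 0xxxxxxx
    }
    if ($cp <= 0x7FF) {
        return chr(0xC0 | ($cp >> 6))               // 110yyyyy
            . chr(0x80 | ($cp & 0x3F));             // 10xxxxxx
    }
    if ($cp <= 0xFFFF) {
        return chr(0xE0 | ($cp >> 12))              // 1110zzzz
            . chr(0x80 | (($cp >> 6) & 0x3F))       // 10yyyyyy
            . chr(0x80 | ($cp & 0x3F));             // 10xxxxxx
    }
    return chr(0xF0 | ($cp >> 18))                  // 11110uuu
        . chr(0x80 | (($cp >> 12) & 0x3F))          // 10uuzzzz
        . chr(0x80 | (($cp >> 6) & 0x3F))           // 10yyyyyy
        . chr(0x80 | ($cp & 0x3F));                 // 10xxxxxx
}

echo bin2hex(example_encode_code_point(0x20AC)), "\n"; // e282ac
```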
Table 3-7. Well-Formed UTF-8 Byte Sequences

| Code Points | First Byte | Second Byte | Third Byte | Fourth Byte |
| --- | --- | --- | --- | --- |
| U+0000..U+007F | 00..7F | | | |
| U+0080..U+07FF | C2..DF | 80..BF | | |
| U+0800..U+0FFF | E0 | A0..BF | 80..BF | |
| U+1000..U+CFFF | E1..EC | 80..BF | 80..BF | |
| U+D000..U+D7FF | ED | 80..9F | 80..BF | |
| U+E000..U+FFFF | EE..EF | 80..BF | 80..BF | |
| U+10000..U+3FFFF | F0 | 90..BF | 80..BF | 80..BF |
| U+40000..U+FFFFF | F1..F3 | 80..BF | 80..BF | 80..BF |
| U+100000..U+10FFFF | F4 | 80..8F | 80..BF | 80..BF |
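
Multiplying out the byte ranges in the table of well-formed sequences above is a quick way to confirm the 1,112,064 figure quoted in the Data section. The snippet is only a sketch; the `$rows` layout is an ad hoc transcription of the table, not a structure the library provides.

```php
<?php
// Each table row as a list of [low, high] byte ranges, one per byte
// position; positions a row does not use are simply omitted.
$rows = [
    [[0x00, 0x7F]],
    [[0xC2, 0xDF], [0x80, 0xBF]],
    [[0xE0, 0xE0], [0xA0, 0xBF], [0x80, 0xBF]],
    [[0xE1, 0xEC], [0x80, 0xBF], [0x80, 0xBF]],
    [[0xED, 0xED], [0x80, 0x9F], [0x80, 0xBF]],
    [[0xEE, 0xEF], [0x80, 0xBF], [0x80, 0xBF]],
    [[0xF0, 0xF0], [0x90, 0xBF], [0x80, 0xBF], [0x80, 0xBF]],
    [[0xF1, 0xF3], [0x80, 0xBF], [0x80, 0xBF], [0x80, 0xBF]],
    [[0xF4, 0xF4], [0x80, 0x8F], [0x80, 0xBF], [0x80, 0xBF]],
];

$total = 0;
foreach ($rows as $row) {
    // The number of sequences in a row is the product of its range sizes.
    $sequences = 1;
    foreach ($row as [$low, $high]) {
        $sequences *= $high - $low + 1;
    }
    $total += $sequences;
}

echo number_format($total), "\n"; // 1,112,064
```

The same number falls out of the code point space directly: 0x110000 code points minus the 0x800 surrogates.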