pcrov/unicode

Miscellaneous Unicode utility functions

Friday, March 23, 2018
by pcrov

Unicode

Miscellaneous Unicode utility functions.

Functions

Namespace `pcrov\Unicode`.

`surrogate_pair_to_code_point(int $high, int $low): int`

Translates a UTF-16 surrogate pair into a single code point. Wikipedia's UTF-16 article explains what a surrogate pair is fairly well.
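The arithmetic behind a surrogate pair can be sketched as follows. This is a stand-in written for illustration, not the library's own code: the name `sketch_surrogate_pair_to_code_point` and its range checks are assumptions, and the real function may handle out-of-range input differently.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical stand-in, not the library's code: combines a UTF-16
 * surrogate pair into the code point it represents.
 */
function sketch_surrogate_pair_to_code_point(int $high, int $low): int
{
    if ($high < 0xD800 || $high > 0xDBFF) {
        throw new InvalidArgumentException("High surrogate must be in D800..DBFF.");
    }
    if ($low < 0xDC00 || $low > 0xDFFF) {
        throw new InvalidArgumentException("Low surrogate must be in DC00..DFFF.");
    }

    // Each surrogate carries 10 bits; together they address U+10000 and above.
    return 0x10000 + (($high - 0xD800) << 10) + ($low - 0xDC00);
}

// U+1F600 is encoded in UTF-16 as the pair D83D DE00.
printf("U+%X\n", sketch_surrogate_pair_to_code_point(0xD83D, 0xDE00)); // prints "U+1F600"
```

The library function has the same `(int $high, int $low): int` shape, so a call to it should look the same as the last line.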

`utf8_find_invalid_byte_sequence(string $string): ?int`

Returns the position of the first invalid byte sequence or null if the input is valid.

`utf8_get_invalid_byte_sequence(string $string): ?string`

Returns the first invalid byte sequence or null if the input is valid.

`utf8_get_state_machine(): array`

Provides a state machine letting you walk a (potentially endless) UTF-8 sequence byte by byte.

It is in the form of `[byte => [valid next byte => ...,], ...]`.

Example use:

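```php
use function pcrov\Unicode\utf8_get_state_machine;

function utf8_generate_all_code_points(): string
{
    // Depth-first walk over the state machine, adding one byte per level.
    $generator = function (array $machine, string $buffer = "") use (&$generator) {
        // Completed a UTF-8 encoded code point: the walk is back at a state
        // that accepts a NUL byte ("\x0"), which only the start state does.
        if ($buffer !== "" && isset($machine["\x0"])) {
            return $buffer;
        }

        $out = "";
        foreach ($machine as $byte => $next) {
            // PHP stores digit keys ("0".."9") as ints; concatenation restores the byte.
            $out .= $generator($next, $buffer . $byte);
        }

        return $out;
    };

    return $generator(utf8_get_state_machine());
}
```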
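To make the shape of the machine more concrete, here is a toy version walked byte by byte. Everything in it is an assumption made for illustration: `sketch_toy_machine` and `sketch_walk` are hypothetical names, the toy accepts only four sequences, and the real machine may be built differently. It only mirrors the idea suggested by the example above, where finishing a sequence leads back to a state that accepts `"\x0"`.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical toy machine, far smaller than the real one: it accepts only
 * NUL, "A", "B", and the two-byte sequence C3 A9 ("é").
 */
function sketch_toy_machine(): array
{
    $root = [];
    foreach (["\x00", "A", "B"] as $byte) {
        $root[$byte] = &$root; // single-byte sequence: straight back to the start
    }
    $root["\xC3"] = ["\xA9" => &$root]; // lead byte, then one continuation byte

    return $root;
}

/** Walks $bytes through $machine; true if every byte was accepted and the walk ended between sequences. */
function sketch_walk(array $machine, string $bytes): bool
{
    $state = $machine;
    for ($i = 0, $length = strlen($bytes); $i < $length; $i++) {
        $byte = $bytes[$i];
        if (!isset($state[$byte])) {
            return false; // no transition for this byte: ill-formed here
        }
        $state = $state[$byte];
    }

    // Only the start state accepts NUL, so this detects a truncated sequence.
    return isset($state["\x00"]);
}

$machine = sketch_toy_machine();
var_dump(sketch_walk($machine, "A\xC3\xA9B")); // bool(true)
var_dump(sketch_walk($machine, "A\xC3"));      // bool(false), sequence cut short
```

Because each step only needs the current state and the next byte, the same loop works on a stream of unknown length, which is presumably what "potentially endless" refers to.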

`utf8_validate(string $string): bool`

Does what it says on the box: returns true if the string is valid UTF-8 and false otherwise.

Data

The `test/data` directory holds two files containing all possible UTF-8 encoded characters, all 1,112,064 of them: one as plain text, the other as JSON. These are not included in packaged stable releases but can be generated with the example `utf8_generate_all_code_points()` function above, which returns the plain text string.
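The figure of 1,112,064 can be checked with a little arithmetic: it is the full code point range minus the surrogates, which UTF-8 cannot encode. The variable names below are only illustrative.

```php
<?php
declare(strict_types=1);

$allCodePoints = 0x10FFFF + 1;        // U+0000..U+10FFFF
$surrogates    = 0xDFFF - 0xD800 + 1; // U+D800..U+DFFF, not encodable in UTF-8
$scalarValues  = $allCodePoints - $surrogates;

echo number_format($scalarValues), "\n"; // prints "1,112,064"
```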

Excerpts from the [Unicode 10.0.0 standard][1]:

Recreated here for ease of reference. Nobody likes PDFs.

Table 3-6. UTF-8 Bit Distribution

| Scalar Value | First Byte | Second Byte | Third Byte | Fourth Byte |
|---|---|---|---|---|
| `00000000 0xxxxxxx` | `0xxxxxxx` | | | |
| `00000yyy yyxxxxxx` | `110yyyyy` | `10xxxxxx` | | |
| `zzzzyyyy yyxxxxxx` | `1110zzzz` | `10yyyyyy` | `10xxxxxx` | |
| `000uuuuu zzzzyyyy yyxxxxxx` | `11110uuu` | `10uuzzzz` | `10yyyyyy` | `10xxxxxx` |
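The bit distribution in Table 3-6 translates fairly directly into shifts and masks. The function below is a sketch for illustration only; `sketch_encode_code_point` is a hypothetical name and is not part of this package.

```php
<?php
declare(strict_types=1);

/** Hypothetical helper: encodes one Unicode scalar value as UTF-8 per Table 3-6. */
function sketch_encode_code_point(int $cp): string
{
    if ($cp < 0 || $cp > 0x10FFFF || ($cp >= 0xD800 && $cp <= 0xDFFF)) {
        throw new InvalidArgumentException("Not a Unicode scalar value.");
    }
    if ($cp < 0x80) {
        return chr($cp); // 0xxxxxxx
    }
    if ($cp < 0x800) {
        return chr(0xC0 | ($cp >> 6))   // 110yyyyy
            . chr(0x80 | ($cp & 0x3F)); // 10xxxxxx
    }
    if ($cp < 0x10000) {
        return chr(0xE0 | ($cp >> 12))         // 1110zzzz
            . chr(0x80 | (($cp >> 6) & 0x3F))  // 10yyyyyy
            . chr(0x80 | ($cp & 0x3F));        // 10xxxxxx
    }

    return chr(0xF0 | ($cp >> 18))         // 11110uuu
        . chr(0x80 | (($cp >> 12) & 0x3F)) // 10uuzzzz
        . chr(0x80 | (($cp >> 6) & 0x3F))  // 10yyyyyy
        . chr(0x80 | ($cp & 0x3F));        // 10xxxxxx
}

echo bin2hex(sketch_encode_code_point(0x20AC)), "\n"; // prints "e282ac" (the euro sign)
```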

Table 3-7. Well-Formed UTF-8 Byte Sequences

| Code Points | First Byte | Second Byte | Third Byte | Fourth Byte |
|---|---|---|---|---|
| U+0000..U+007F | 00..7F | | | |
| U+0080..U+07FF | C2..DF | 80..BF | | |
| U+0800..U+0FFF | E0 | A0..BF | 80..BF | |
| U+1000..U+CFFF | E1..EC | 80..BF | 80..BF | |
| U+D000..U+D7FF | ED | 80..9F | 80..BF | |
| U+E000..U+FFFF | EE..EF | 80..BF | 80..BF | |
| U+10000..U+3FFFF | F0 | 90..BF | 80..BF | 80..BF |
| U+40000..U+FFFFF | F1..F3 | 80..BF | 80..BF | 80..BF |
| U+100000..U+10FFFF | F4 | 80..8F | 80..BF | 80..BF |
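Each row of Table 3-7 maps onto one alternative of a byte-oriented regular expression, which gives a compact way to see how the table is used for validation. This is a sketch under stated assumptions, not how the package is implemented: `sketch_first_ill_formed_offset` is a hypothetical name, and its notion of "position" (a byte offset) is an assumption that may not match `utf8_find_invalid_byte_sequence()`.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: returns the byte offset of the first ill-formed
 * sequence according to Table 3-7, or null if the whole string is well formed.
 */
function sketch_first_ill_formed_offset(string $string): ?int
{
    // One alternative per table row; no /u flag, so PCRE matches raw bytes.
    $wellFormedPrefix = '/^(?:'
        . '[\x00-\x7F]'
        . '|[\xC2-\xDF][\x80-\xBF]'
        . '|\xE0[\xA0-\xBF][\x80-\xBF]'
        . '|[\xE1-\xEC][\x80-\xBF]{2}'
        . '|\xED[\x80-\x9F][\x80-\xBF]'
        . '|[\xEE-\xEF][\x80-\xBF]{2}'
        . '|\xF0[\x90-\xBF][\x80-\xBF]{2}'
        . '|[\xF1-\xF3][\x80-\xBF]{3}'
        . '|\xF4[\x80-\x8F][\x80-\xBF]{2}'
        . ')*+/';

    if (preg_match($wellFormedPrefix, $string, $match) !== 1) {
        throw new RuntimeException("Pattern failed to run: " . preg_last_error());
    }

    // The match is the longest well-formed prefix; anything after it is ill formed.
    $offset = strlen($match[0]);

    return $offset === strlen($string) ? null : $offset;
}

var_dump(sketch_first_ill_formed_offset("h\xC3\xA9llo"));  // NULL
var_dump(sketch_first_ill_formed_offset("ab\xED\xA0\x80")); // int(2), an encoded surrogate
```

A single regular expression is convenient for short strings held in memory; a byte-by-byte state machine is the better fit for streams or very long inputs.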

dev-master (9999999-dev), March 23, 2018, by Paul Crovella

https://github.com/pcrov/unicode

License: MIT. Requires: php ^7.0. Development requires: phpunit/phpunit ^6.0.8. Keywords: utf-8, unicode.

0.1.0 (0.1.0.0), March 1, 2018, by Paul Crovella

https://github.com/pcrov/unicode

License: MIT. Requires: php ^7.0. Development requires: phpunit/phpunit ^6.0.8. Keywords: utf-8, unicode.
