
library unicode

Miscellaneous Unicode utility functions

pcrov/unicode

  • Friday, March 23, 2018
  • by pcrov
  • PHP

The README.md

Unicode

Miscellaneous Unicode utility functions.

Functions

Namespace pcrov\Unicode.

surrogate_pair_to_code_point(int $high, int $low): int

Translates a UTF-16 surrogate pair into a single code point. Wikipedia's UTF-16 article explains what this is fairly well.
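For reference, the arithmetic behind a surrogate pair can be sketched as below. This is a minimal illustration of the standard UTF-16 formula, not the library's own code; the name sketch_surrogate_pair_to_code_point() and the choice to throw on out-of-range input are assumptions made for the example.

```php
<?php
// Hypothetical helper for illustration; the library's implementation may differ.
function sketch_surrogate_pair_to_code_point(int $high, int $low): int
{
    // High surrogates are U+D800..U+DBFF, low surrogates are U+DC00..U+DFFF.
    if ($high < 0xD800 || $high > 0xDBFF || $low < 0xDC00 || $low > 0xDFFF) {
        throw new InvalidArgumentException("Not a valid surrogate pair.");
    }

    // Each surrogate carries 10 bits; the result is offset by 0x10000.
    return 0x10000 + (($high - 0xD800) << 10) + ($low - 0xDC00);
}

// U+1F600 is encoded in UTF-16 as the pair D83D DE00.
printf("U+%X\n", sketch_surrogate_pair_to_code_point(0xD83D, 0xDE00));
```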

utf8_find_invalid_byte_sequence(string $string): ?int

Returns the position of the first invalid byte sequence or null if the input is valid.

utf8_get_invalid_byte_sequence(string $string): ?string

Returns the first invalid byte sequence or null if the input is valid.

utf8_get_state_machine(): array

Provides a state machine letting you walk a (potentially endless) UTF-8 sequence byte by byte.

It is in the form of [byte => [valid next byte => ...], ...]

Example use:

function utf8_generate_all_code_points(): string
{
    $generator = function (array $machine, string $buffer = "") use (&$generator) {
        // Completed a UTF-8 encoded code point.
        if ($buffer !== "" && isset($machine["\x0"])) {
            return $buffer;
        }

        $out = "";
        foreach ($machine as $byte => $next) {
            $out .= $generator($next, $buffer . $byte);
        }

        return $out;
    };

    return $generator(utf8_get_state_machine());
}
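To make the [byte => [valid next byte => ...]] shape concrete, here is a minimal sketch that walks a hand-built toy machine byte by byte. Everything in it is an assumption for illustration: toy_state_machine() and toy_walk() are hypothetical names, the toy accepts only NUL, "a", "b" and the two-byte sequence C3 A9, and it assumes (as the example above suggests) that finishing a code point leads back to the starting state, recognisable by its "\x00" key. The real machine returned by utf8_get_state_machine() may be built differently.

```php
<?php
// Toy machine in the same nested-array shape. Completed code points point
// back to the start state by reference, which makes the structure endless.
function toy_state_machine(): array
{
    $root = [];
    $root["\x00"] = &$root;               // single-byte code points
    $root["a"] = &$root;
    $root["b"] = &$root;
    $root["\xC3"] = ["\xA9" => &$root];   // lead byte, then one continuation byte

    return $root;
}

// Follows one transition per input byte; fails on the first byte that is
// not a valid key of the current state.
function toy_walk(array $machine, string $bytes): bool
{
    $state = $machine;
    for ($i = 0, $n = strlen($bytes); $i < $n; $i++) {
        if (!isset($state[$bytes[$i]])) {
            return false;
        }
        $state = $state[$bytes[$i]];
    }

    // Being back at the start state means no code point was left unfinished.
    return isset($state["\x00"]);
}

var_dump(toy_walk(toy_state_machine(), "ab\xC3\xA9")); // complete sequence
var_dump(toy_walk(toy_state_machine(), "ab\xC3"));     // truncated sequence
```

The same loop should work against a larger machine of this shape, since it relies only on isset() and array indexing.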

utf8_validate(string $string): bool

Does what it says on the box: returns true if the string is well-formed UTF-8, false otherwise.

Data

The test/data directory holds two files containing all possible UTF-8 encoded characters, all 1,112,064 of them: one as plain text, the other as JSON. These are not included in packaged stable releases but can be generated with the example utf8_generate_all_code_points() function above, which returns the plain-text string.
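The 1,112,064 figure is the number of Unicode scalar values: every code point from U+0000 to U+10FFFF minus the 2,048 surrogates (U+D800..U+DFFF), which have no well-formed UTF-8 encoding. A quick check of the arithmetic, with variable names chosen for this example:

```php
<?php
$codePoints   = 0x10FFFF + 1;         // U+0000..U+10FFFF
$surrogates   = 0xDFFF - 0xD800 + 1;  // U+D800..U+DFFF
$scalarValues = $codePoints - $surrogates;

echo number_format($scalarValues), "\n";
```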

Excerpts from the Unicode 10.0.0 standard:

Recreated here for ease of reference. Nobody likes PDFs.

Table 3-6. UTF-8 Bit Distribution

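| Scalar Value               | First Byte | Second Byte | Third Byte | Fourth Byte |
|----------------------------|------------|-------------|------------|-------------|
| 00000000 0xxxxxxx          | 0xxxxxxx   |             |            |             |
| 00000yyy yyxxxxxx          | 110yyyyy   | 10xxxxxx    |            |             |
| zzzzyyyy yyxxxxxx          | 1110zzzz   | 10yyyyyy    | 10xxxxxx   |             |
| 000uuuuu zzzzyyyy yyxxxxxx | 11110uuu   | 10uuzzzz    | 10yyyyyy   | 10xxxxxx    |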
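The bit distribution in Table 3-6 translates fairly directly into shifts and masks. The following is a minimal sketch of an encoder built from the table; sketch_utf8_encode() is a hypothetical name and is not part of this library.

```php
<?php
// Hypothetical helper: encodes one Unicode scalar value as UTF-8
// following the bit layout of Table 3-6.
function sketch_utf8_encode(int $cp): string
{
    // Surrogates and values above U+10FFFF are not scalar values.
    if ($cp < 0 || $cp > 0x10FFFF || ($cp >= 0xD800 && $cp <= 0xDFFF)) {
        throw new InvalidArgumentException("Not a Unicode scalar value.");
    }

    if ($cp < 0x80) {       // 0xxxxxxx
        return chr($cp);
    }
    if ($cp < 0x800) {      // 110yyyyy 10xxxxxx
        return chr(0xC0 | ($cp >> 6))
             . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {    // 1110zzzz 10yyyyyy 10xxxxxx
        return chr(0xE0 | ($cp >> 12))
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }

    // 11110uuu 10uuzzzz 10yyyyyy 10xxxxxx
    return chr(0xF0 | ($cp >> 18))
         . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

echo bin2hex(sketch_utf8_encode(0x20AC)), "\n"; // U+20AC EURO SIGN
```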

Table 3-7. Well-Formed UTF-8 Byte Sequences

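| Code Points        | First Byte | Second Byte | Third Byte | Fourth Byte |
|--------------------|------------|-------------|------------|-------------|
| U+0000..U+007F     | 00..7F     |             |            |             |
| U+0080..U+07FF     | C2..DF     | 80..BF      |            |             |
| U+0800..U+0FFF     | E0         | A0..BF      | 80..BF     |             |
| U+1000..U+CFFF     | E1..EC     | 80..BF      | 80..BF     |             |
| U+D000..U+D7FF     | ED         | 80..9F      | 80..BF     |             |
| U+E000..U+FFFF     | EE..EF     | 80..BF      | 80..BF     |             |
| U+10000..U+3FFFF   | F0         | 90..BF      | 80..BF     | 80..BF      |
| U+40000..U+FFFFF   | F1..F3     | 80..BF      | 80..BF     | 80..BF      |
| U+100000..U+10FFFF | F4         | 80..8F      | 80..BF     | 80..BF      |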
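Each row of Table 3-7 can be written as one alternative of a byte-oriented regular expression. The sketch below uses that to locate the first byte that does not start a well-formed sequence. It is an illustration of the table, not how this library works internally: sketch_utf8_first_invalid_offset() is a hypothetical name, and its behaviour on edge cases may differ from utf8_find_invalid_byte_sequence().

```php
<?php
// Hypothetical helper: returns the byte offset where well-formed UTF-8
// stops, or null if the whole string is well-formed.
function sketch_utf8_first_invalid_offset(string $string): ?int
{
    // One alternative per row of Table 3-7. No /u flag: match raw bytes.
    $wellFormed = '/\G(?:'
        . '[\x00-\x7F]'
        . '|[\xC2-\xDF][\x80-\xBF]'
        . '|\xE0[\xA0-\xBF][\x80-\xBF]'
        . '|[\xE1-\xEC][\x80-\xBF]{2}'
        . '|\xED[\x80-\x9F][\x80-\xBF]'
        . '|[\xEE\xEF][\x80-\xBF]{2}'
        . '|\xF0[\x90-\xBF][\x80-\xBF]{2}'
        . '|[\xF1-\xF3][\x80-\xBF]{3}'
        . '|\xF4[\x80-\x8F][\x80-\xBF]{2}'
        . ')*+/';

    // The pattern can match the empty string, so failure here means a
    // PCRE error (for example a resource limit on very long input).
    if (preg_match($wellFormed, $string, $match) !== 1) {
        throw new RuntimeException("PCRE error: " . preg_last_error());
    }

    $length = strlen($match[0]);

    return $length === strlen($string) ? null : $length;
}

var_dump(sketch_utf8_first_invalid_offset("h\xC3\xA9llo")); // well-formed
var_dump(sketch_utf8_first_invalid_offset("a\xFFb"));       // FF is never valid
```

For very long inputs a regular expression may run into PCRE limits, which is one reason a byte-by-byte state machine can be the more robust choice.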

The Versions

23/03 2018

dev-master

9999999-dev https://github.com/pcrov/unicode

MIT

The Requires

  • php ^7.0

by Paul Crovella

utf-8 unicode

01/03 2018

0.1.0

0.1.0.0 https://github.com/pcrov/unicode

MIT

The Requires

  • php ^7.0

by Paul Crovella

utf-8 unicode