A C Unicode and UTF-8 library

This is a Unicode library in the programming language C which deals with conversions to and from the UTF-8 format.

Author	Ben Bullock
Commit	588df88b1bb174d7beed173d11e4968ddbfee6e9
Date	Wed May 19 08:53:30 2021 +0900
Email	benkasminbullock@gmail.com, bkb_at_cpan.org
Licence	BSD 3 Clause, GNU GPL, Perl Artistic
Repository	https://github.com/benkasminbullock/unicode-c

Functions

surrogate_to_utf8

int32_t surrogate_to_utf8 (int32_t hi, int32_t lo, uint8_t * utf8);

Convert the surrogate pair in hi and lo to UTF-8 in utf8. This calls "surrogates_to_unicode" and then "ucs2_to_utf8", so it can return any of their errors, and it has the same restriction on the size of the buffer utf8 as "ucs2_to_utf8".

surrogates_to_unicode

int32_t surrogates_to_unicode (int32_t hi, int32_t lo);

Convert a surrogate pair in hi and lo to a single Unicode value. The return value is the Unicode value. If the return value is negative, an error has occurred. If hi and lo do not form a surrogate pair, the error value UNICODE_NOT_SURROGATE_PAIR (-3) is returned.

https://android.googlesource.com/platform/external/id3lib/+/master/unicode.org/ConvertUTF.c
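
The arithmetic behind a surrogate pair conversion can be sketched as follows. This is a minimal illustration of the standard UTF-16 formula, not the library's source; the name sketch_surrogates_to_unicode is hypothetical, and the literal -3 mirrors the documented value of UNICODE_NOT_SURROGATE_PAIR.

```c
#include <stdint.h>

/* Hypothetical helper for illustration only; not part of the library. */
static int32_t
sketch_surrogates_to_unicode (int32_t hi, int32_t lo)
{
    /* High surrogates are 0xD800-0xDBFF, low surrogates 0xDC00-0xDFFF. */
    if (hi < 0xD800 || hi > 0xDBFF || lo < 0xDC00 || lo > 0xDFFF) {
        return -3; /* documented value of UNICODE_NOT_SURROGATE_PAIR */
    }
    /* Each surrogate carries ten bits; the result is offset by 0x10000. */
    return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
}
```

For example, the pair 0xD83D, 0xDE00 combines to the code point 0x1F600.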

trim_to_utf8_start

int32_t trim_to_utf8_start (const uint8_t ** ptr);

Make *ptr point to the start of the first UTF-8 character after its initial value. This assumes that there are at least four bytes which can be read, and that *ptr points to valid UTF-8.

If **ptr does not have its top bit set, 0xxx_xxxx, this does not change the value of *ptr, and it returns UNICODE_OK (0). If **ptr has its top two bits set, 11xx_xxxx, this does not change the value of *ptr and it returns UNICODE_OK (0). If **ptr has its top bit set but its second-to-top bit unset, 10xx_xxxx, so it is the second, third, or fourth byte of a multibyte sequence, *ptr is incremented until either **ptr is a valid first byte of a UTF-8 sequence, or too many bytes have passed for it to be valid UTF-8. If too many bytes have passed, UTF8_BAD_CONTINUATION_BYTE (-4) is returned and *ptr is left unchanged.

If a valid UTF-8 first byte was found, either 11xx_xxxx or 0xxx_xxxx, UNICODE_OK (0) is returned, and *ptr is set to the address of the valid byte. Nul bytes (bytes containing zero) are considered valid.

If any of the bytes read is 0xFE or 0xFF, neither of which is ever valid in UTF-8, the error code UNICODE_NOT_CHARACTER (-8) is returned and *ptr is left unchanged.
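
The behaviour described above can be sketched as follows. This is an illustration written from the description only, not the library's source. The name sketch_trim_to_utf8_start is hypothetical, and the limit of four bytes examined is an assumption based on the statement that at least four bytes must be readable.

```c
#include <stdint.h>

/* Hypothetical helper for illustration only; not part of the library. */
static int32_t
sketch_trim_to_utf8_start (const uint8_t ** ptr)
{
    const uint8_t * p = *ptr;
    int i;
    /* Assumption: at most three continuation bytes can precede the
       next leading byte, so four bytes are examined in total. */
    for (i = 0; i < 4; i++) {
        uint8_t c = p[i];
        if (c == 0xFE || c == 0xFF) {
            return -8; /* documented value of UNICODE_NOT_CHARACTER */
        }
        if ((c & 0xC0) != 0x80) {
            /* 0xxx_xxxx or 11xx_xxxx: the start of a character. */
            *ptr = p + i;
            return 0; /* documented value of UNICODE_OK */
        }
    }
    /* Four continuation bytes in a row; *ptr is left unchanged. */
    return -4; /* documented value of UTF8_BAD_CONTINUATION_BYTE */
}
```

Starting from the second byte of the three-byte sequence E3 81 82 followed by 'b', the sketch moves the pointer forward two bytes so that it points at 'b'.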

ucs2_to_utf8

int32_t ucs2_to_utf8 (int32_t ucs2, uint8_t * utf8);

Input: a Unicode code point, ucs2.

Output: UTF-8 characters in buffer utf8.

Return value: the number of bytes written into utf8, or a negative number if there was an error.

If the value of ucs2 is invalid because of being in the surrogate pair range from 0xD800 to 0xDFFF, the return value is UNICODE_SURROGATE_PAIR (-2).

If the value of ucs2 is in the range 0xFDD0 to 0xFDEF inclusive, the return value is UNICODE_NOT_CHARACTER (-8).

If the lower two bytes of ucs2 are either 0xFFFE or 0xFFFF, the return value is UNICODE_NOT_CHARACTER (-8).

If the value of ucs2 is greater than UNICODE_UTF8_4 (0x1fffff), the largest value which fits into four bytes of UTF-8, the return value is UNICODE_TOO_BIG (-7).

However, it does not insist on ucs2 being no greater than UNICODE_MAXIMUM (0x10ffff), so the user needs to check that ucs2 is a valid code point.

This adds a zero byte to the end of the string. It assumes that the buffer utf8 has at least UTF8_MAX_LENGTH (5) bytes of space to write to, without checking.
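
The bit layout which such a conversion produces can be sketched as follows. This is a minimal illustration of standard UTF-8 encoding, not the library's source. The name sketch_encode_utf8 is hypothetical, and the sketch returns 0 for input it does not handle instead of the library's error codes.

```c
#include <stdint.h>

/* Hypothetical helper for illustration only; not part of the library.
   Writes the UTF-8 form of cp plus a trailing zero byte into utf8,
   which must have room for five bytes. Returns the number of bytes
   written, not counting the zero byte, or 0 for unsupported input. */
static int32_t
sketch_encode_utf8 (int32_t cp, uint8_t * utf8)
{
    int32_t n;
    if (cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
        return 0; /* sketch-only value, not one of the library's codes */
    }
    if (cp < 0x80) {
        utf8[0] = (uint8_t) cp;                          /* 0xxxxxxx */
        n = 1;
    }
    else if (cp < 0x800) {
        utf8[0] = (uint8_t) (0xC0 | (cp >> 6));          /* 110xxxxx */
        utf8[1] = (uint8_t) (0x80 | (cp & 0x3F));        /* 10xxxxxx */
        n = 2;
    }
    else if (cp < 0x10000) {
        utf8[0] = (uint8_t) (0xE0 | (cp >> 12));         /* 1110xxxx */
        utf8[1] = (uint8_t) (0x80 | ((cp >> 6) & 0x3F));
        utf8[2] = (uint8_t) (0x80 | (cp & 0x3F));
        n = 3;
    }
    else {
        utf8[0] = (uint8_t) (0xF0 | (cp >> 18));         /* 11110xxx */
        utf8[1] = (uint8_t) (0x80 | ((cp >> 12) & 0x3F));
        utf8[2] = (uint8_t) (0x80 | ((cp >> 6) & 0x3F));
        utf8[3] = (uint8_t) (0x80 | (cp & 0x3F));
        n = 4;
    }
    utf8[n] = 0; /* trailing zero byte, hence the five-byte buffer */
    return n;
}
```

Four bytes of UTF-8 plus the trailing zero byte is why a buffer of five bytes is sufficient for any single code point.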

unicode_chars_to_bytes

int32_t unicode_chars_to_bytes (const uint8_t * utf8, int32_t n_chars);

Given a nul-terminated string utf8 and a number of Unicode characters n_chars, return the number of bytes into utf8 at which the end of the characters occurs. A negative value indicates some kind of error. If utf8 contains a zero byte, the return value is UNICODE_EMPTY_INPUT (-5). This may also return any of the error values of "utf8_to_ucs2".

unicode_code_to_error

const char * unicode_code_to_error (int32_t code);

Given a return value code which is negative or zero, return a string which describes what the return value means. Positive non-zero return values never indicate errors or statuses in this library. Unknown error codes result in a default string being returned.

unicode_count_chars

int32_t unicode_count_chars (const uint8_t * utf8);

Given a nul-terminated string utf8, return the total number of Unicode characters it contains.

Return value

If an error occurs, this may return UTF8_BAD_LEADING_BYTE (-1) or any of the errors of "utf8_to_ucs2".

unicode_count_chars_fast

int32_t unicode_count_chars_fast (const uint8_t * utf8);

Like unicode_count_chars, but without error checks or validation of the input. This only checks the first byte of each UTF-8 sequence, then jumps over the succeeding bytes. It may return UTF8_BAD_LEADING_BYTE (-1) if the first byte is invalid.
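
Counting by leading byte alone can be sketched as follows. This is an illustration of the approach described, not the library's source, and the name sketch_count_chars_fast is hypothetical.

```c
#include <stdint.h>

/* Hypothetical helper for illustration only; not part of the library. */
static int32_t
sketch_count_chars_fast (const uint8_t * utf8)
{
    int32_t count = 0;
    while (*utf8) {
        uint8_t c = *utf8;
        int32_t len;
        if (c < 0x80) {
            len = 1;
        }
        else if ((c & 0xE0) == 0xC0) {
            len = 2;
        }
        else if ((c & 0xF0) == 0xE0) {
            len = 3;
        }
        else if ((c & 0xF8) == 0xF0) {
            len = 4;
        }
        else {
            return -1; /* documented value of UTF8_BAD_LEADING_BYTE */
        }
        /* Jump over the continuation bytes without looking at them.
           On truncated input this can step past the terminating nul,
           which is the price of skipping validation. */
        utf8 += len;
        count++;
    }
    return count;
}
```

Because the continuation bytes are never examined, an approach like this is only safe on input which is already known to be well-formed UTF-8.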

unicode_to_surrogates

int32_t unicode_to_surrogates (int32_t unicode, int32_t * hi_ptr, int32_t * lo_ptr);

This converts the Unicode code point in unicode into a surrogate pair, and returns the two parts in *hi_ptr and *lo_ptr.

Return value

If unicode does not need to be a surrogate pair, the error UNICODE_NOT_SURROGATE_PAIR (-3) is returned, and the values of *hi_ptr and *lo_ptr are undefined. If the conversion is successful, UNICODE_OK (0) is returned.
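
The reverse arithmetic can be sketched as follows. This is a minimal illustration of the standard UTF-16 formula, not the library's source. The name sketch_unicode_to_surrogates is hypothetical, and rejecting values above 0x10FFFF with the same error is an assumption of the sketch.

```c
#include <stdint.h>

/* Hypothetical helper for illustration only; not part of the library. */
static int32_t
sketch_unicode_to_surrogates (int32_t unicode, int32_t * hi_ptr,
                              int32_t * lo_ptr)
{
    int32_t v;
    /* Only code points from 0x10000 upwards need a surrogate pair.
       The upper bound check is an assumption of this sketch. */
    if (unicode < 0x10000 || unicode > 0x10FFFF) {
        return -3; /* documented value of UNICODE_NOT_SURROGATE_PAIR */
    }
    v = unicode - 0x10000;           /* twenty significant bits remain */
    *hi_ptr = 0xD800 + (v >> 10);    /* top ten bits */
    *lo_ptr = 0xDC00 + (v & 0x3FF);  /* bottom ten bits */
    return 0; /* documented value of UNICODE_OK */
}
```

For example, the code point 0x1F600 splits into the pair 0xD83D, 0xDE00.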

utf8_bytes

int32_t utf8_bytes (uint8_t c);

This function returns the number of bytes in a UTF-8 sequence which starts with the byte c, either 1 (c = 0xxxxxxx), 2 (c = 110xxxxx), 3 (c = 1110xxxx), or 4 (c = 111100xx or c = 11110100). If c is not a valid UTF-8 first byte, the value UTF8_BAD_LEADING_BYTE (-1) is returned.
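
The bit patterns above translate directly into mask tests, as in the following sketch. This is written from the patterns listed in the description, not taken from the library's source, and the name sketch_utf8_bytes is hypothetical.

```c
#include <stdint.h>

/* Hypothetical helper for illustration only; not part of the library. */
static int32_t
sketch_utf8_bytes (uint8_t c)
{
    if ((c & 0x80) == 0x00) {
        return 1;               /* 0xxxxxxx */
    }
    if ((c & 0xE0) == 0xC0) {
        return 2;               /* 110xxxxx */
    }
    if ((c & 0xF0) == 0xE0) {
        return 3;               /* 1110xxxx */
    }
    if ((c & 0xFC) == 0xF0 || c == 0xF4) {
        return 4;               /* 111100xx or 11110100 */
    }
    /* Continuation bytes (10xxxxxx) and 0xF5 to 0xFF end up here. */
    return -1; /* documented value of UTF8_BAD_LEADING_BYTE */
}
```

Leading bytes 0xF5 and above are rejected because they could only start sequences which encode values beyond 0x10ffff.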

utf8_no_checks

int32_t utf8_no_checks (const uint8_t * input, const uint8_t ** end_ptr);

Try to convert input from UTF-8 to UCS-2, and return a value even if the input is partly broken. This checks the first byte of the input, but it doesn't check the subsequent bytes.

utf8_to_ucs2

int32_t utf8_to_ucs2 (const uint8_t * input, const uint8_t ** end_ptr);

This function converts UTF-8 encoded bytes in input into the equivalent Unicode code point. The return value is the Unicode code point corresponding to the UTF-8 character in input if successful, and a negative number if not successful. Nul bytes are rejected.

On success, *end_ptr is set to point to the byte immediately after the character which was read. On all failures, *end_ptr is set to the start of input. end_ptr may not be NULL.

If the first byte of input is zero, in other words a NUL or '\0', UNICODE_EMPTY_INPUT (-5) is returned.

If the first byte of input is not valid UTF-8, UTF8_BAD_LEADING_BYTE (-1) is returned.

If the second or later bytes of input are not valid UTF-8, including NUL, UTF8_BAD_CONTINUATION_BYTE (-4) is returned.

If the value extrapolated from input is greater than UNICODE_MAXIMUM (0x10ffff), UNICODE_TOO_BIG (-7) is returned.

If the value extrapolated from input ends in 0xFFFF or 0xFFFE, UNICODE_NOT_CHARACTER (-8) is returned.

If the value extrapolated from input is between 0xFDD0 and 0xFDEF, UNICODE_NOT_CHARACTER (-8) is returned.

If the value is within the range of surrogate pairs, the error UNICODE_SURROGATE_PAIR (-2) is returned.
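
The checks listed above can be sketched as a single decoding routine. This is an illustration written from the description, not the library's source. The name sketch_utf8_to_ucs2 is hypothetical, the order of the checks is an assumption, and the literal return values mirror the documented error codes.

```c
#include <stdint.h>

/* Hypothetical helper for illustration only; not part of the library. */
static int32_t
sketch_utf8_to_ucs2 (const uint8_t * input, const uint8_t ** end_ptr)
{
    /* Smallest code point which needs a sequence of each length. */
    static const int32_t min_value[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    uint8_t c = input[0];
    int32_t len, value, i;

    *end_ptr = input; /* stays at the start of input on every failure */
    if (c == 0) {
        return -5; /* UNICODE_EMPTY_INPUT */
    }
    if (c < 0x80) {
        len = 1; value = c;
    }
    else if ((c & 0xE0) == 0xC0) {
        len = 2; value = c & 0x1F;
    }
    else if ((c & 0xF0) == 0xE0) {
        len = 3; value = c & 0x0F;
    }
    else if ((c & 0xF8) == 0xF0) {
        len = 4; value = c & 0x07;
    }
    else {
        return -1; /* UTF8_BAD_LEADING_BYTE */
    }
    for (i = 1; i < len; i++) {
        /* A nul byte fails this test too, so reading stops there. */
        if ((input[i] & 0xC0) != 0x80) {
            return -4; /* UTF8_BAD_CONTINUATION_BYTE */
        }
        value = (value << 6) | (input[i] & 0x3F);
    }
    if (value < min_value[len]) {
        return -4; /* non-shortest form, reported as a bad continuation */
    }
    if (value > 0x10FFFF) {
        return -7; /* UNICODE_TOO_BIG */
    }
    if (value >= 0xD800 && value <= 0xDFFF) {
        return -2; /* UNICODE_SURROGATE_PAIR */
    }
    if ((value & 0xFFFF) >= 0xFFFE || (value >= 0xFDD0 && value <= 0xFDEF)) {
        return -8; /* UNICODE_NOT_CHARACTER */
    }
    *end_ptr = input + len;
    return value;
}
```

A caller can step through a string by passing the pointer left in *end_ptr back in as input, stopping when a negative value is returned.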

valid_utf8

int32_t valid_utf8 (const uint8_t * input, int32_t input_length);

Given input and input_length, validate input byte by byte up to input_length. The return value may be UTF8_VALID (1) or UTF8_INVALID (0).

validate_utf8

int32_t validate_utf8 (const uint8_t * input, int32_t len, utf8_info_t * info);

Given input and len, validate input byte by byte up to len. The return value is UNICODE_OK (0) on success, or the error found (a negative number) on failure.

utf8_info_t is defined in "unicode.h".

The value of "info->len_read" is the number of bytes processed. The value of "info->runes_read" is the number of Unicode code points in the input.

Return values

UNICODE_EMPTY_INPUT (-5)

This return value indicates a zero byte was found in a string which was supposed to contain UTF-8 bytes. It is returned only by the functions which are documented as not allowing zero bytes.

UNICODE_NOT_CHARACTER (-8)

This return value indicates that the Unicode code-point ended with either 0xFFFF or 0xFFFE, meaning it cannot be used as a character code point, or it was in the disallowed range FDD0 to FDEF.

UNICODE_NOT_SURROGATE_PAIR (-3)

This return value means that there was an attempt to combine two values which do not form a surrogate pair into a single code point, or to split a code point which does not need a surrogate pair into one.

UNICODE_OK (0)

This return value indicates the successful completion of a routine which doesn't use the return value to communicate data back to the caller.

UNICODE_SURROGATE_PAIR (-2)

This return value means the caller attempted to turn a code point for a surrogate pair to or from UTF-8.

UNICODE_TOO_BIG (-7)

This return value indicates that there was an attempt to convert a code point which was greater than UNICODE_MAXIMUM or UNICODE_UTF8_4 into UTF-8 bytes.

UTF8_BAD_CONTINUATION_BYTE (-4)

This return value means that input which was supposed to be UTF-8 encoded contained an invalid continuation byte. If the leading byte of a UTF-8 sequence is not valid, UTF8_BAD_LEADING_BYTE is returned instead of this.

UTF8_BAD_LEADING_BYTE (-1)

This return value means that the leading byte of a UTF-8 sequence was not valid.

UTF8_INVALID (0)

This return value indicates that the UTF-8 is not valid. It is only used by "valid_utf8".

UTF8_NON_SHORTEST (-6)

This return value indicates that UTF-8 bytes were not in the shortest possible form. See http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8. This return value is currently unused. If a character is not in the shortest form, the error UTF8_BAD_CONTINUATION_BYTE is returned.

UTF8_VALID (1)

This return value indicates that the UTF-8 is valid. It is only used by "valid_utf8".
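
As a summary of the values listed above, the following sketch maps each negative code, and zero, back to its name. It is only an illustration of how the codes relate. The function sketch_code_name is hypothetical and is not the library's "unicode_code_to_error", whose strings are not reproduced here.

```c
#include <stdint.h>

/* Hypothetical helper for illustration only; not part of the library.
   Zero is shared by UNICODE_OK and UTF8_INVALID, and this sketch
   reports it as UNICODE_OK. */
static const char *
sketch_code_name (int32_t code)
{
    switch (code) {
    case  0: return "UNICODE_OK";
    case -1: return "UTF8_BAD_LEADING_BYTE";
    case -2: return "UNICODE_SURROGATE_PAIR";
    case -3: return "UNICODE_NOT_SURROGATE_PAIR";
    case -4: return "UTF8_BAD_CONTINUATION_BYTE";
    case -5: return "UNICODE_EMPTY_INPUT";
    case -6: return "UTF8_NON_SHORTEST";
    case -7: return "UNICODE_TOO_BIG";
    case -8: return "UNICODE_NOT_CHARACTER";
    default: return "unknown code";
    }
}
```

Since positive return values carry data such as byte counts or code points, a lookup like this is only meaningful for return values which are negative or zero.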

Constants

UNICODE_MAXIMUM (0x10ffff)

The maximum possible value of a Unicode code point. See http://www.cl.cam.ac.uk/~mgk25/unicode.html#ucs.

UNICODE_UTF8_4 (0x1fffff)

The maximum possible value which will fit into four bytes of UTF-8. This is larger than UNICODE_MAXIMUM.

UTF8_MAX_LENGTH (5)

The maximum number of bytes we need to contain any Unicode code point as UTF-8 as a C string. This length includes one trailing nul byte.