A C Unicode and UTF-8 library
This is a Unicode library in the programming language C which deals with conversions to and from the UTF-8 format.
| Field | Value |
|---|---|
| Author | Ben Bullock |
| Commit | 588df88b1bb174d7beed173d11e4968ddbfee6e9 |
| Date | Wed May 19 08:53:30 2021 +0900 |
| Email | benkasminbullock@gmail.com, bkb_at_cpan.org |
| Licence | BSD 3 Clause, GNU GPL, Perl Artistic |
| Repository | https://github.com/benkasminbullock/unicode-c |
Functions
surrogate_to_utf8
int32_t surrogate_to_utf8 (int32_t hi, int32_t lo, uint8_t * utf8);
Convert the surrogate pair in hi and lo to UTF-8 in utf8. This calls "surrogates_to_unicode" and "ucs2_to_utf8", so it can return the same errors as they do, and it has the same restriction on utf8 as "ucs2_to_utf8".
surrogates_to_unicode
int32_t surrogates_to_unicode (int32_t hi, int32_t lo);
Convert a surrogate pair in hi and lo to a single Unicode value. The return value is the Unicode value. If the return value is negative, an error has occurred. If hi and lo do not form a surrogate pair, the error value UNICODE_NOT_SURROGATE_PAIR (-3) is returned.
https://android.googlesource.com/platform/external/id3lib/+/master/unicode.org/ConvertUTF.c
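The surrogate-pair arithmetic can be sketched as follows. This is a minimal illustration and not the library's own code: the names sketch_surrogates_to_unicode and SKETCH_NOT_SURROGATE_PAIR are hypothetical, and only the value -3 is taken from the documentation above.

```c
#include <stdint.h>

/* Hypothetical name for this sketch; -3 is the documented value of
   UNICODE_NOT_SURROGATE_PAIR. */
#define SKETCH_NOT_SURROGATE_PAIR (-3)

/* Combine a high surrogate (0xD800 to 0xDBFF) and a low surrogate
   (0xDC00 to 0xDFFF) into one code point, or return a negative error. */
int32_t sketch_surrogates_to_unicode(int32_t hi, int32_t lo)
{
    if (hi < 0xD800 || hi > 0xDBFF || lo < 0xDC00 || lo > 0xDFFF) {
        return SKETCH_NOT_SURROGATE_PAIR;
    }
    /* Each surrogate carries 10 bits; the result is offset by 0x10000. */
    return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
}
```

For example, the pair 0xD83D, 0xDE00 combines to 0x1F600, and any pair outside the two surrogate ranges gives the negative error value.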
trim_to_utf8_start
int32_t trim_to_utf8_start (const uint8_t ** ptr);
Make *ptr point to the start of the first UTF-8 character at or after its initial value. This assumes that at least four bytes can be read, and that *ptr points to valid UTF-8.
If **ptr does not have its top bit set, 0xxx_xxxx, this does not change the value of *ptr, and it returns UNICODE_OK (0). If **ptr has its top two bits set, 11xx_xxxx, this does not change the value of *ptr and it returns UNICODE_OK (0). If **ptr has its top bit set but its second-to-top bit unset, 10xx_xxxx, so that it is the second, third, or fourth byte of a multibyte sequence, *ptr is incremented until either **ptr is a valid first byte of a UTF-8 sequence, or too many bytes have passed for it to be valid UTF-8. If too many bytes have passed, UTF8_BAD_CONTINUATION_BYTE (-4) is returned and *ptr is left unchanged.
If a valid UTF-8 first byte was found, either 11xx_xxxx or 0xxx_xxxx, UNICODE_OK (0) is returned, and *ptr is set to the address of the valid byte. Nul bytes (bytes containing zero) are considered valid.
If any of the bytes read is 0xFE or 0xFF, which are never valid in UTF-8, the error code UNICODE_NOT_CHARACTER (-8) is returned and *ptr is left unchanged.
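The behaviour described above can be sketched like this. It is an illustration under stated assumptions and not the library's implementation: the function and macro names are hypothetical, the numeric codes are the documented ones, and the four-byte scan limit is taken from the stated requirement that four bytes be readable.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names; the numeric values are the documented codes. */
#define SKETCH_OK 0
#define SKETCH_BAD_CONTINUATION_BYTE (-4)
#define SKETCH_NOT_CHARACTER (-8)

/* Advance *ptr past continuation bytes (10xx_xxxx) to the next leading
   byte. On any error *ptr is left unchanged. The caller must guarantee
   at least four readable bytes at *ptr. */
int32_t sketch_trim_to_start(const uint8_t **ptr)
{
    assert(ptr != NULL && *ptr != NULL);
    const uint8_t *p = *ptr;
    for (int i = 0; i < 4; i++) {
        uint8_t c = p[i];
        if (c >= 0xFE) {
            return SKETCH_NOT_CHARACTER;  /* 0xFE and 0xFF never occur in UTF-8 */
        }
        if ((c & 0xC0) != 0x80) {
            *ptr = p + i;                 /* ASCII, nul, or a leading byte */
            return SKETCH_OK;
        }
    }
    return SKETCH_BAD_CONTINUATION_BYTE;  /* four continuation bytes in a row */
}
```

A typical use is resuming a scan from an arbitrary byte offset, such as after seeking into the middle of a buffer, where the offset may fall inside a multibyte character.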
ucs2_to_utf8
int32_t ucs2_to_utf8 (int32_t ucs2, uint8_t * utf8);
Input: a Unicode code point, ucs2.
Output: UTF-8 characters in buffer utf8.
Return value: the number of bytes written into utf8, or a negative number if there was an error.
If the value of ucs2 is invalid because it is in the surrogate pair range from 0xD800 to 0xDFFF, the return value is UNICODE_SURROGATE_PAIR (-2).
If the value of ucs2 is in the range 0xFDD0 to 0xFDEF inclusive, the return value is UNICODE_NOT_CHARACTER (-8).
If the lower two bytes of ucs2 are either 0xFFFE or 0xFFFF, the return value is UNICODE_NOT_CHARACTER (-8).
If the value is greater than UNICODE_UTF8_4 (0x1fffff), and so too big to fit into four bytes of UTF-8, the return value is UNICODE_TOO_BIG (-7).
However, it does not insist on ucs2 being no greater than UNICODE_MAXIMUM (0x10ffff), so the user needs to check that ucs2 is a valid code point.
This adds a zero byte to the end of the string. It assumes that the buffer utf8 has at least UTF8_MAX_LENGTH (5) bytes of space to write to, without checking.
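To make the checks and the byte layout concrete, here is a minimal encoder sketch. It is not the library's code: all names are hypothetical, the order of the checks is an assumption, treating negative input as too big is an assumption, and the returned count is assumed to exclude the trailing zero byte.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names; the numeric values are the documented codes. */
#define SKETCH_SURROGATE_PAIR (-2)
#define SKETCH_TOO_BIG (-7)
#define SKETCH_NOT_CHARACTER (-8)

/* Write code point cp as nul-terminated UTF-8 into out, which must have
   room for five bytes. Returns the byte count, not counting the nul
   (an assumption of this sketch), or a negative error. */
int32_t sketch_encode_utf8(int32_t cp, uint8_t *out)
{
    assert(out != NULL);
    /* Rejecting negative input here is an assumption of this sketch. */
    if (cp < 0 || cp > 0x1FFFFF) return SKETCH_TOO_BIG;
    if (cp >= 0xD800 && cp <= 0xDFFF) return SKETCH_SURROGATE_PAIR;
    if (cp >= 0xFDD0 && cp <= 0xFDEF) return SKETCH_NOT_CHARACTER;
    if ((cp & 0xFFFF) >= 0xFFFE) return SKETCH_NOT_CHARACTER;

    int32_t n;
    if (cp < 0x80) {                      /* 0xxxxxxx */
        out[0] = (uint8_t) cp;
        n = 1;
    } else if (cp < 0x800) {              /* 110xxxxx 10xxxxxx */
        out[0] = (uint8_t) (0xC0 | (cp >> 6));
        out[1] = (uint8_t) (0x80 | (cp & 0x3F));
        n = 2;
    } else if (cp < 0x10000) {            /* 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (uint8_t) (0xE0 | (cp >> 12));
        out[1] = (uint8_t) (0x80 | ((cp >> 6) & 0x3F));
        out[2] = (uint8_t) (0x80 | (cp & 0x3F));
        n = 3;
    } else {                              /* 11110xxx and three 10xxxxxx */
        out[0] = (uint8_t) (0xF0 | (cp >> 18));
        out[1] = (uint8_t) (0x80 | ((cp >> 12) & 0x3F));
        out[2] = (uint8_t) (0x80 | ((cp >> 6) & 0x3F));
        out[3] = (uint8_t) (0x80 | (cp & 0x3F));
        n = 4;
    }
    out[n] = 0;                           /* trailing zero byte */
    return n;
}
```

Like the function documented above, this sketch accepts values between 0x110000 and 0x1fffff, so a caller that needs strictly valid code points has to check the upper limit itself.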
unicode_chars_to_bytes
int32_t unicode_chars_to_bytes (const uint8_t * utf8, int32_t n_chars);
Given a nul-terminated string utf8 and a number of Unicode characters n_chars, return the number of bytes into utf8 at which the end of the characters occurs. A negative value indicates some kind of error. If utf8 contains a zero byte, the return value is UNICODE_EMPTY_INPUT (-5). This may also return any of the error values of "utf8_to_ucs2".
unicode_code_to_error
const char * unicode_code_to_error (int32_t code);
Given a return value code which is negative or zero, return a string which describes what the return value means. Positive non-zero return values never indicate errors or statuses in this library. Unknown error codes result in a default string being returned.
unicode_count_chars
int32_t unicode_count_chars (const uint8_t * utf8);
Given a nul-terminated string utf8, return the total number of Unicode characters it contains.
Return value
If an error occurs, this may return UTF8_BAD_LEADING_BYTE (-1) or any of the errors of "utf8_to_ucs2".
unicode_count_chars_fast
int32_t unicode_count_chars_fast (const uint8_t * utf8);
Like unicode_count_chars, but without error checks or validation of the input. This only checks the first byte of each UTF-8 sequence, then jumps over the succeeding bytes. It may return UTF8_BAD_LEADING_BYTE (-1) if the first byte is invalid.
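The "check the first byte, then jump" approach can be sketched as below. This is an illustration and not the library's code: the names are hypothetical, and unlike the description above, this sketch stops skipping when it meets a nul byte so that a truncated final sequence cannot carry it past the terminator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical name; -1 is the documented UTF8_BAD_LEADING_BYTE value. */
#define SKETCH_BAD_LEADING_BYTE (-1)

/* Count characters in a nul-terminated UTF-8 string by looking only at
   the leading byte of each sequence. */
int32_t sketch_count_chars_fast(const uint8_t *utf8)
{
    assert(utf8 != NULL);
    int32_t count = 0;
    while (*utf8 != 0) {
        uint8_t c = *utf8;
        int len;
        if (c < 0x80) len = 1;
        else if ((c & 0xE0) == 0xC0) len = 2;
        else if ((c & 0xF0) == 0xE0) len = 3;
        else if ((c & 0xF8) == 0xF0) len = 4;
        else return SKETCH_BAD_LEADING_BYTE;
        /* Skip the sequence without validating its continuation bytes.
           Stopping at a nul is an extra safety measure of this sketch. */
        for (int i = 0; i < len && *utf8 != 0; i++) {
            utf8++;
        }
        count++;
    }
    return count;
}
```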
unicode_to_surrogates
int32_t unicode_to_surrogates (int32_t unicode, int32_t * hi_ptr, int32_t * lo_ptr);
This converts the Unicode code point in unicode into a surrogate pair, and returns the two parts in *hi_ptr and *lo_ptr.
Return value:
If unicode does not need to be represented as a surrogate pair, the error UNICODE_NOT_SURROGATE_PAIR (-3) is returned, and the values of *hi_ptr and *lo_ptr are undefined. If the conversion is successful, UNICODE_OK (0) is returned.
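The split into surrogates can be sketched as follows. This is a minimal illustration, not the library's code: the names are hypothetical, and rejecting values above 0x10ffff with the same error code is an assumption of the sketch, since the text above does not say how such values are handled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names; the numeric values are the documented codes. */
#define SKETCH_OK 0
#define SKETCH_NOT_SURROGATE_PAIR (-3)

/* Split a code point in the range 0x10000 to 0x10FFFF into a high and
   a low surrogate. */
int32_t sketch_unicode_to_surrogates(int32_t cp, int32_t *hi, int32_t *lo)
{
    assert(hi != NULL && lo != NULL);
    /* The upper bound check is an assumption of this sketch. */
    if (cp < 0x10000 || cp > 0x10FFFF) {
        return SKETCH_NOT_SURROGATE_PAIR;
    }
    int32_t v = cp - 0x10000;      /* 20 bits remain */
    *hi = 0xD800 + (v >> 10);      /* top 10 bits */
    *lo = 0xDC00 + (v & 0x3FF);    /* bottom 10 bits */
    return SKETCH_OK;
}
```

For 0x1F600 this produces the pair 0xD83D, 0xDE00, while a code point such as 0x41 yields the error because it fits in a single 16-bit unit.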
utf8_bytes
int32_t utf8_bytes (uint8_t c);
This function returns the number of bytes in a UTF-8 sequence starting with byte c, either 1 (c = 0xxxxxxx), 2 (c = 110xxxxx), 3 (c = 1110xxxx), or 4 (c = 111100xx or c = 11110100). If c is not a valid UTF-8 first byte, the value UTF8_BAD_LEADING_BYTE (-1) is returned.
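The classification by leading byte can be sketched with a few mask tests. This is an illustration following the bit patterns listed above, not the library's code, and the names are hypothetical. Note that the two-byte pattern as written also admits 0xC0 and 0xC1, which can only begin overlong sequences; whether the library accepts them is not stated here.

```c
#include <stdint.h>

/* Hypothetical name; -1 is the documented UTF8_BAD_LEADING_BYTE value. */
#define SKETCH_BAD_LEADING_BYTE (-1)

/* Return the length of the UTF-8 sequence which starts with byte c. */
int32_t sketch_utf8_bytes(uint8_t c)
{
    if (c < 0x80) return 1;                /* 0xxxxxxx */
    if ((c & 0xE0) == 0xC0) return 2;      /* 110xxxxx */
    if ((c & 0xF0) == 0xE0) return 3;      /* 1110xxxx */
    if (c >= 0xF0 && c <= 0xF4) return 4;  /* 111100xx or 11110100 */
    return SKETCH_BAD_LEADING_BYTE;        /* continuation byte or 0xF5 to 0xFF */
}
```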
utf8_no_checks
int32_t utf8_no_checks (const uint8_t * input, const uint8_t ** end_ptr);
Try to convert input from UTF-8 to UCS-2, and return a value even if the input is partly broken. This checks the first byte of the input, but it doesn't check the subsequent bytes.
utf8_to_ucs2
int32_t utf8_to_ucs2 (const uint8_t * input, const uint8_t ** end_ptr);
This function converts UTF-8 encoded bytes in input into the equivalent Unicode code point. The return value is the Unicode code point corresponding to the UTF-8 character in input if successful, and a negative number if not successful. Nul bytes are rejected.
*end_ptr is set to the byte following the character read on success. *end_ptr is set to the start of input on all failures. end_ptr must not be NULL.
If the first byte of input is zero, in other words a NUL or '\0', UNICODE_EMPTY_INPUT (-5) is returned.
If the first byte of input is not valid UTF-8, UTF8_BAD_LEADING_BYTE (-1) is returned.
If the second or later bytes of input are not valid UTF-8, including NUL, UTF8_BAD_CONTINUATION_BYTE (-4) is returned.
If the value decoded from input is greater than UNICODE_MAXIMUM (0x10ffff), UNICODE_TOO_BIG (-7) is returned.
If the value decoded from input ends in 0xFFFF or 0xFFFE, UNICODE_NOT_CHARACTER (-8) is returned.
If the value decoded from input is between 0xFDD0 and 0xFDEF, UNICODE_NOT_CHARACTER (-8) is returned.
If the value is within the range of surrogate pairs, the error UNICODE_SURROGATE_PAIR (-2) is returned.
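The list of checks above can be sketched as a small decoder. This is an illustration under stated assumptions, not the library's implementation: all names are hypothetical, the order of the checks is an assumption, and overlong forms are reported as a bad continuation byte because this document says that is what currently happens.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names; the numeric values are the documented codes. */
#define SKETCH_BAD_LEADING_BYTE (-1)
#define SKETCH_SURROGATE_PAIR (-2)
#define SKETCH_BAD_CONTINUATION_BYTE (-4)
#define SKETCH_EMPTY_INPUT (-5)
#define SKETCH_TOO_BIG (-7)
#define SKETCH_NOT_CHARACTER (-8)

/* Decode one UTF-8 character from input. On success *end_ptr points
   just past it; on failure *end_ptr points at input. */
int32_t sketch_decode_utf8(const uint8_t *input, const uint8_t **end_ptr)
{
    assert(input != NULL && end_ptr != NULL);
    *end_ptr = input;                   /* stays here on every failure */
    uint8_t c = input[0];
    int32_t cp;
    int32_t min_cp;                     /* smallest value needing this length */
    int len;
    if (c == 0) return SKETCH_EMPTY_INPUT;
    if (c < 0x80)                { cp = c;        len = 1; min_cp = 0; }
    else if ((c & 0xE0) == 0xC0) { cp = c & 0x1F; len = 2; min_cp = 0x80; }
    else if ((c & 0xF0) == 0xE0) { cp = c & 0x0F; len = 3; min_cp = 0x800; }
    else if ((c & 0xF8) == 0xF0) { cp = c & 0x07; len = 4; min_cp = 0x10000; }
    else return SKETCH_BAD_LEADING_BYTE;

    for (int i = 1; i < len; i++) {
        /* A nul fails this test too, so the loop never reads past the
           terminator of a C string. */
        if ((input[i] & 0xC0) != 0x80) return SKETCH_BAD_CONTINUATION_BYTE;
        cp = (cp << 6) | (input[i] & 0x3F);
    }
    if (cp < min_cp) return SKETCH_BAD_CONTINUATION_BYTE;  /* overlong form */
    if (cp > 0x10FFFF) return SKETCH_TOO_BIG;
    if (cp >= 0xD800 && cp <= 0xDFFF) return SKETCH_SURROGATE_PAIR;
    if (cp >= 0xFDD0 && cp <= 0xFDEF) return SKETCH_NOT_CHARACTER;
    if ((cp & 0xFFFF) >= 0xFFFE) return SKETCH_NOT_CHARACTER;
    *end_ptr = input + len;
    return cp;
}
```

Because the end pointer only advances on success, a caller can loop over a string by passing the previous end pointer back in as input and stopping at the first negative return value.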
valid_utf8
int32_t valid_utf8 (const uint8_t * input, int32_t input_length);
Given input and input_length, validate input byte by byte up to input_length. The return value may be UTF8_VALID (1) or UTF8_INVALID (0).
validate_utf8
int32_t validate_utf8 (const uint8_t * input, int32_t len, utf8_info_t * info);
Given input and len, validate input byte by byte up to len. The return value is UNICODE_OK (0) on success or the error found (a negative number) on failure.
utf8_info_t is defined in "unicode.h".
The value of "info.len_read" is the number of bytes processed. The value of "info.runes_read" is the number of Unicode code points in the input.
Return values
UNICODE_EMPTY_INPUT (-5)
This return value indicates a zero byte was found in a string which was supposed to contain UTF-8 bytes. It is returned only by the functions which are documented as not allowing zero bytes.
UNICODE_NOT_CHARACTER (-8)
This return value indicates that the Unicode code point ended with either 0xFFFF or 0xFFFE, meaning it cannot be used as a character code point, or that it was in the disallowed range 0xFDD0 to 0xFDEF.
UNICODE_NOT_SURROGATE_PAIR (-3)
This return value means that an attempt was made to convert values which do not form a surrogate pair into a code point as if they were a surrogate pair.
UNICODE_OK (0)
This return value indicates the successful completion of a routine which doesn't use the return value to communicate data back to the caller.
UNICODE_SURROGATE_PAIR (-2)
This return value means the caller attempted to turn a code point for a surrogate pair to or from UTF-8.
UNICODE_TOO_BIG (-7)
This return value indicates that there was an attempt to convert a code point which was greater than UNICODE_MAXIMUM or UNICODE_UTF8_4 into UTF-8 bytes.
UTF8_BAD_CONTINUATION_BYTE (-4)
This return value means that input which was supposed to be UTF-8 encoded contained an invalid continuation byte. If the leading byte of a UTF-8 sequence is not valid, UTF8_BAD_LEADING_BYTE is returned instead of this.
UTF8_BAD_LEADING_BYTE (-1)
This return value means that the leading byte of a UTF-8 sequence was not valid.
UTF8_INVALID (0)
This return value indicates that the UTF-8 is not valid. It is only used by "valid_utf8".
UTF8_NON_SHORTEST (-6)
This return value indicates that UTF-8 bytes were not in the shortest possible form. See http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8. This return value is currently unused. If a character is not in the shortest form, the error UTF8_BAD_CONTINUATION_BYTE is returned.
UTF8_VALID (1)
This return value indicates that the UTF-8 is valid. It is only used by "valid_utf8".
Constants
UNICODE_MAXIMUM (0x10ffff)
The maximum possible value of a Unicode code point. See http://www.cl.cam.ac.uk/~mgk25/unicode.html#ucs.
UNICODE_UTF8_4 (0x1fffff)
The maximum possible value which will fit into four bytes of UTF-8. This is larger than UNICODE_MAXIMUM.
UTF8_MAX_LENGTH (5)
The maximum number of bytes needed to hold any Unicode code point encoded as UTF-8 in a C string. This length includes one trailing nul byte.
See also
- Online version: https://www.lemoda.net/tools/uniconvert/index.html
- UTF-8 and Unicode FAQ for Unix/Linux by Markus Kuhn at the University of Cambridge (archive)