Convert hexadecimal Unicode representations, with repairs

The converters on this page convert between UTF-8 and Unicode code points. They also attempt to repair faulty inputs, and point out where the errors are.

Get the Unicode and UTF-8 for a character

This gets the Unicode code point and UTF-8 bytes of the first character you enter. Be sure to remove any spaces before the character.

Convert UTF-8 to Unicode code point

Enter bytes of UTF-8, represented in hexadecimal, to get the corresponding Unicode code point.
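As a rough illustration of this conversion, here is a minimal Python sketch. It is not the page's own code. The function name `decode_utf8_lenient` is hypothetical, and the sketch assumes the lenient behaviour described in the notes below: surrogates and overlong forms are decoded rather than rejected.

```python
def decode_utf8_lenient(data: bytes) -> int:
    """Decode the first UTF-8 sequence in `data` to a code point.

    Hypothetical helper: it checks only the bit patterns of the bytes,
    so surrogates and overlong forms are decoded, not rejected.
    """
    if not data:
        raise ValueError("no bytes given")
    lead = data[0]
    if lead < 0x80:                      # 0xxxxxxx: one byte (ASCII)
        return lead
    if 0xC0 <= lead <= 0xDF:             # 110xxxxx: two bytes
        length, value = 2, lead & 0x1F
    elif 0xE0 <= lead <= 0xEF:           # 1110xxxx: three bytes
        length, value = 3, lead & 0x0F
    elif 0xF0 <= lead <= 0xF7:           # 11110xxx: four bytes
        length, value = 4, lead & 0x07
    else:                                # 10xxxxxx or 11111xxx
        raise ValueError(f"invalid initial byte 0x{lead:02X}")
    if len(data) < length:
        raise ValueError(f"expected {length} bytes, got {len(data)}")
    for byte in data[1:length]:
        if byte & 0xC0 != 0x80:          # continuation bytes are 10xxxxxx
            raise ValueError(f"invalid continuation byte 0x{byte:02X}")
        value = (value << 6) | (byte & 0x3F)
    return value


print(f"U+{decode_utf8_lenient(bytes.fromhex('E58B89')):04X}")  # U+52C9
```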

Convert Unicode code point to UTF-8

Enter a hexadecimal Unicode code point, in free format, and it will be converted into the corresponding UTF-8 bytes.
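The reverse direction can be sketched in Python as follows. This is an illustration under stated assumptions, not the page's own code: the name `encode_utf8` is hypothetical, and the sketch follows the notes below in passing surrogates through and accepting values up to the four-byte limit of 0x1FFFFF.

```python
def encode_utf8(code_point: int) -> bytes:
    """Encode a code point as UTF-8 bytes (hypothetical helper).

    Surrogates are passed through, and values up to 0x1FFFFF are
    accepted because they still fit in four bytes.
    """
    if code_point < 0:
        raise ValueError("code point must not be negative")
    if code_point < 0x80:                # 7 bits: one byte
        return bytes([code_point])
    if code_point < 0x800:               # 11 bits: two bytes
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:             # 16 bits: three bytes
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0x1FFFFF:           # 21 bits: four bytes
        return bytes([0xF0 | (code_point >> 18),
                      0x80 | ((code_point >> 12) & 0x3F),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    raise ValueError("value needs more than four bytes of UTF-8")


print(encode_utf8(0x1F30B).hex().upper())  # F09F8C8B
```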


Notes

Surrogates pass through

For users who need to decipher errors, this converter passes through surrogates (Unicode code points from U+D800 to U+DFFF). However, surrogates are not valid in UTF-8: they are reserved for UTF-16, and a conforming UTF-8 encoder never emits them. See the UTF-8 and Unicode FAQ for Unix/Linux by Markus Kuhn.
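For comparison, Python's built-in codec shows both behaviours: the strict UTF-8 decoder rejects an encoded surrogate, while the standard `surrogatepass` error handler lets it through. The helper names below are hypothetical and the sketch is independent of how this page is implemented.

```python
def is_surrogate(code_point: int) -> bool:
    """True for code points in the surrogate range U+D800 to U+DFFF."""
    return 0xD800 <= code_point <= 0xDFFF


def strict_decode_fails(data: bytes) -> bool:
    """True if Python's strict UTF-8 decoder rejects `data`."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False


def decode_passing_surrogates(data: bytes) -> str:
    """Decode UTF-8, letting encoded surrogates through."""
    return data.decode("utf-8", errors="surrogatepass")


raw = bytes.fromhex("EDA080")            # the bytes an encoder would produce for U+D800
print(strict_decode_fails(raw))          # True
print(f"U+{ord(decode_passing_surrogates(raw)):04X}")  # U+D800
```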

Non-hexadecimal characters are ignored

By design, the converters above ignore every character except 0 to 9, a to f, and A to F, and treat any other character as the end of a hexadecimal number. A leading "0x" prefix, as in 0xd0, is therefore ignored as well. The UTF-8 to code point converter also ignores isolated hexadecimal digits, such as the first e in "hexadecimal", and parses that word as "ad".

Because of this design, it can cope with almost any form of input, such as %E5%8B%89, \xF0\x9F\x8C\x8B, or any other format, in upper or lower case, separated by spaces or not.
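One way to get this kind of tolerant parsing is to scan the input for pairs of adjacent hexadecimal digits and skip everything else. The Python sketch below is only an approximation of the behaviour described above. The name `extract_hex_bytes` is hypothetical, and dropping a trailing odd digit is an assumption of the sketch rather than a documented rule of this page.

```python
import re

# Two adjacent hexadecimal digits form one byte. Anything else, including
# a lone hex digit next to a non-hex character, is skipped.
HEX_PAIR = re.compile(r"[0-9A-Fa-f]{2}")


def extract_hex_bytes(text: str) -> bytes:
    """Collect every pair of adjacent hex digits in `text` as bytes."""
    return bytes(int(pair, 16) for pair in HEX_PAIR.findall(text))


print(extract_hex_bytes("%E5%8B%89").hex())            # e58b89
print(extract_hex_bytes(r"\xF0\x9F\x8C\x8B").hex())    # f09f8c8b
print(extract_hex_bytes("0xd0").hex())                 # d0
```

With this approach the word "hexadecimal" yields "ad" as its first byte, because the first e stands alone between h and x.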

Only one character is processed

By design, this converter processes only one character at a time.

Errors

Errors in processing are reported in pink underneath the output.

Overlong UTF-8 sequences are processed and corrected

This converter accepts overlong UTF-8 sequences such as F0838080 (a four-byte encoding of U+3000) and processes them into the equivalent Unicode code points. It also prints an error giving the correct UTF-8 sequence for that code point (E38080 in this case). However, overlong UTF-8 sequences should be regarded as errors. See the UTF-8 and Unicode FAQ for Unix/Linux by Markus Kuhn for an explanation.
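A sequence is overlong when its code point could have been encoded in fewer bytes. The Python sketch below shows one way to detect this: extract the payload bits, then compare the input length with the shortest form produced by Python's own encoder. The name `report_overlong` and the wording of the message are hypothetical, and the function assumes its input is exactly one structurally valid sequence of one to four bytes.

```python
from typing import Optional


def report_overlong(data: bytes) -> Optional[str]:
    """Return a message if `data` is an overlong sequence, else None.

    Assumes `data` is exactly one structurally valid UTF-8 sequence
    (a lead byte followed by the right number of continuation bytes).
    """
    # Payload bits carried by the lead byte, by sequence length.
    lead_mask = {1: 0x7F, 2: 0x1F, 3: 0x0F, 4: 0x07}[len(data)]
    value = data[0] & lead_mask
    for byte in data[1:]:
        value = (value << 6) | (byte & 0x3F)
    if value > 0x10FFFF:
        return None                      # out of range, but not overlong
    # Python's encoder always produces the shortest form.
    shortest = chr(value).encode("utf-8", errors="surrogatepass")
    if len(shortest) < len(data):
        return f"U+{value:04X} should be encoded as {shortest.hex().upper()}"
    return None


print(report_overlong(bytes.fromhex("F0838080")))
# U+3000 should be encoded as E38080
```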

Invalid UTF-8 bytes

Invalid bytes in the input are flagged. If the initial byte is invalid, the length of the sequence cannot be calculated and the remaining bytes are discarded. If the initial byte is valid but the following bytes are invalid, the allowed range of each byte is shown. See "C routine to convert UTF-8 to UCS2" for the allowed ranges.
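For reference, the Unicode Standard's definition of well-formed UTF-8 restricts the second byte according to the initial byte. Third and fourth bytes are always 80 to BF. The Python sketch below encodes those strict ranges. The name `second_byte_range` is hypothetical, and these ranges are stricter than this converter, which deliberately lets surrogates and overlong forms through.

```python
from typing import Tuple


def second_byte_range(lead: int) -> Tuple[int, int]:
    """Allowed (lowest, highest) second byte for a strict UTF-8 lead byte.

    Raises ValueError for bytes that cannot start a multi-byte sequence
    in well-formed UTF-8 (ASCII, continuation bytes, C0, C1, F5 to FF).
    """
    if 0xC2 <= lead <= 0xDF:
        return (0x80, 0xBF)
    if lead == 0xE0:
        return (0xA0, 0xBF)              # lower values would be overlong
    if lead == 0xED:
        return (0x80, 0x9F)              # higher values would be surrogates
    if 0xE1 <= lead <= 0xEF:
        return (0x80, 0xBF)
    if lead == 0xF0:
        return (0x90, 0xBF)              # lower values would be overlong
    if lead == 0xF4:
        return (0x80, 0x8F)              # higher values would exceed U+10FFFF
    if 0xF1 <= lead <= 0xF3:
        return (0x80, 0xBF)
    raise ValueError(f"0x{lead:02X} cannot start a multi-byte sequence")


low, high = second_byte_range(0xF0)
print(f"{low:02X} to {high:02X}")        # 90 to BF
```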

Maximum allowed Unicode code point

The maximum allowed Unicode code point is U+10FFFF. This converter prints an error if the Unicode code point is beyond that maximum.

Values resulting in up to four bytes of UTF-8 are processed

Four UTF-8 bytes carry 21 bits of payload, so they can hold values up to 0x1FFFFF. The code point to UTF-8 converter displays the UTF-8 bytes for values up to this four-byte maximum, but it does not allow values that would need more than four bytes of UTF-8 output. It processes at most six hexadecimal digits.

Regardless of this, values larger than U+10FFFF do not result in valid Unicode code points.
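The two limits come from simple arithmetic: a four-byte sequence has 3 payload bits in its first byte and 6 in each of the other three. The Python sketch below works this out and classifies a value against both limits. The names and the returned strings are hypothetical.

```python
# 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx: 3 + 6 + 6 + 6 payload bits
FOUR_BYTE_BITS = 3 + 3 * 6
FOUR_BYTE_MAX = (1 << FOUR_BYTE_BITS) - 1    # 0x1FFFFF
UNICODE_MAX = 0x10FFFF


def classify_value(value: int) -> str:
    """Say where `value` falls relative to the two limits."""
    if value > FOUR_BYTE_MAX:
        return "needs more than four bytes"
    if value > UNICODE_MAX:
        return "fits in four bytes but is beyond U+10FFFF"
    return "within the Unicode range"


print(hex(FOUR_BYTE_MAX))                    # 0x1fffff
print(classify_value(0x110000))              # fits in four bytes but is beyond U+10FFFF
```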

Code points ending in FFFE and FFFF are not characters

Code points whose last four hexadecimal digits are FFFE or FFFF, such as U+FFFE, U+1FFFF, and U+10FFFF, are noncharacters: they are not valid Unicode characters. See the UTF-8 and Unicode FAQ for Unix/Linux by Markus Kuhn.
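This check reduces to a bit mask on the low 16 bits of the code point. The Python sketch below uses the hypothetical name `ends_in_fffe_or_ffff` and covers only the code points described here. Unicode also defines the noncharacters U+FDD0 to U+FDEF, which this sketch does not test for.

```python
def ends_in_fffe_or_ffff(code_point: int) -> bool:
    """True for U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, ... up to U+10FFFF."""
    if not 0 <= code_point <= 0x10FFFF:
        return False                     # not a Unicode code point at all
    # Clearing the lowest bit maps xxFFFF onto xxFFFE.
    return (code_point & 0xFFFE) == 0xFFFE


print(ends_in_fffe_or_ffff(0x1FFFF))     # True
print(ends_in_fffe_or_ffff(0xFFFD))      # False
```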


Copyright © Ben Bullock 2009-2023. All rights reserved. For comments, questions, and corrections, please email Ben Bullock (benkasminbullock@gmail.com) or use the discussion group at Google Groups.