Convert hexadecimal Unicode representations, with repairs
The converters on this page convert between UTF-8 and Unicode code points. They also attempt to repair faulty inputs, and point out where the errors are.
Get the Unicode and UTF-8 for a character
This gets the Unicode code point and the UTF-8 bytes of the first character you enter. Be sure to remove any spaces before the character.
Convert UTF-8 to Unicode code point
Enter bytes of UTF-8, represented in hexadecimal, to get the corresponding Unicode code point.
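As a rough illustration of this conversion, the following Python sketch turns hexadecimal UTF-8 bytes into a code point with Python's built-in strict decoder. The function name utf8_hex_to_code_point is hypothetical, and unlike the converter on this page, the sketch rejects faulty input instead of repairing it.

```python
def utf8_hex_to_code_point(hex_bytes: str) -> int:
    """Decode hexadecimal UTF-8 bytes for one character into its code point."""
    data = bytes.fromhex(hex_bytes)  # spaces between bytes are allowed
    text = data.decode("utf-8")      # strict: raises UnicodeDecodeError on bad input
    if len(text) != 1:
        raise ValueError("expected exactly one character")
    return ord(text)


print(f"U+{utf8_hex_to_code_point('E5 8B 89'):04X}")  # prints U+52C9
```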
Convert Unicode code point to UTF-8
Enter a hexadecimal Unicode code point, in free format, and it will be converted into the corresponding UTF-8 bytes.
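The reverse direction can be sketched in Python as follows. The function name code_point_to_utf8_hex and the handling of the "U+" and "0x" prefixes are assumptions made for illustration, not a description of how this page's converter is written. Python's own encoder is strict, so this sketch raises an exception for surrogates and for values above U+10FFFF.

```python
def code_point_to_utf8_hex(code_point_hex: str) -> str:
    """Encode a hexadecimal code point such as 'U+1F30B' as UTF-8 bytes in hex."""
    text = code_point_hex.strip()
    # Assumption: accept an optional "U+" or "0x" prefix.
    if text.upper().startswith(("U+", "0X")):
        text = text[2:]
    code_point = int(text, 16)
    data = chr(code_point).encode("utf-8")
    return " ".join(f"{byte:02X}" for byte in data)


print(code_point_to_utf8_hex("U+1F30B"))  # prints F0 9F 8C 8B
```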
Notes
Surrogates pass through
For users who need to decipher errors, this converter passes through surrogates (Unicode code points from U+D800 to U+DFFF). However, surrogates are not valid in UTF-8. See the UTF-8 and Unicode FAQ for Unix/Linux by Markus Kuhn.
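For comparison, Python behaves in a similar way when asked to: its strict UTF-8 codec refuses surrogates, while the "surrogatepass" error handler lets them through. The short sketch below only demonstrates that Python behaviour; it is not taken from this page's converter.

```python
surrogate = chr(0xD800)

# The strict codec rejects a lone surrogate.
try:
    surrogate.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# The "surrogatepass" error handler encodes it anyway.
passed_through = surrogate.encode("utf-8", errors="surrogatepass")

print(strict_ok)             # prints False
print(passed_through.hex())  # prints eda080
```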
Non-hexadecimal characters are ignored
By design, the above converter ignores every character except 0 to 9, a to f and A to F. It assumes that any other character marks the end of a hexadecimal number. It also ignores a leading zero, such as the one in 0xd0. The UTF-8 to code point converter ignores isolated characters such as the first e in the word "hexadecimal", so that word is parsed as "ad".
Because of this design, it can cope with almost any form of input, such as %E5%8B%89, \xF0\x9F\x8C\x8B, or any other type, in upper or lower case, space separated or with no spaces.
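One way such lenient parsing might be done is sketched below in Python. The function name extract_hex_bytes is hypothetical, and two details are assumptions rather than documented behaviour of this page: a run of hexadecimal digits is split into two-digit bytes, and a single isolated digit (or a trailing odd digit) is dropped.

```python
import re


def extract_hex_bytes(text: str) -> list:
    """Pull byte values out of free-format text, ignoring non-hex characters."""
    byte_values = []
    # Any character other than 0-9, a-f, A-F ends the current run of digits.
    for run in re.findall(r"[0-9A-Fa-f]+", text):
        if len(run) < 2:
            continue  # assumption: an isolated hex digit is ignored
        # Assumption: split the run into pairs; a trailing odd digit is dropped.
        for i in range(0, len(run) - 1, 2):
            byte_values.append(int(run[i:i + 2], 16))
    return byte_values


print([f"{b:02X}" for b in extract_hex_bytes("%E5%8B%89")])  # prints ['E5', '8B', '89']
```

Under these assumptions, the input 0xd0 yields the single byte D0, because the leading 0 is an isolated digit cut off by the x.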
Only one character is processed
By design, this converter processes only one character at a time.
Errors
Errors in processing are reported in pink underneath the output.
Overlong UTF-8 sequences are processed and corrected
For your information, this converter accepts excessively long UTF-8 sequences such as F0838080 and processes them into the equivalent Unicode code points. It also prints an error pointing out the correct UTF-8 sequence for the particular code point. However, these overlong UTF-8 sequences should be regarded as errors. See the UTF-8 and Unicode FAQ for Unix/Linux by Markus Kuhn for an explanation.
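To make the idea of an overlong sequence concrete, here is a minimal Python sketch that decodes a sequence by hand and reports whether a shorter encoding exists. The function name decode_allowing_overlong is hypothetical and the code is an illustration, not this page's implementation. For F0838080 it yields U+3000, whose shortest encoding is the three bytes E3 80 80.

```python
def decode_allowing_overlong(data: bytes):
    """Return (code_point, is_overlong) for one UTF-8 sequence of up to four bytes."""
    lead = data[0]
    # The initial byte gives the sequence length and the first payload bits.
    if lead < 0x80:
        length, code_point = 1, lead
    elif 0xC0 <= lead <= 0xDF:
        length, code_point = 2, lead & 0x1F
    elif 0xE0 <= lead <= 0xEF:
        length, code_point = 3, lead & 0x0F
    elif 0xF0 <= lead <= 0xF7:
        length, code_point = 4, lead & 0x07
    else:
        raise ValueError(f"invalid initial byte {lead:02X}")
    if len(data) < length:
        raise ValueError("sequence is too short")
    # Each continuation byte has the form 10xxxxxx and adds six payload bits.
    for byte in data[1:length]:
        if (byte & 0xC0) != 0x80:
            raise ValueError(f"invalid continuation byte {byte:02X}")
        code_point = (code_point << 6) | (byte & 0x3F)
    # Work out the shortest length that could hold this code point.
    if code_point < 0x80:
        shortest = 1
    elif code_point < 0x800:
        shortest = 2
    elif code_point < 0x10000:
        shortest = 3
    else:
        shortest = 4
    return code_point, shortest < length


code_point, overlong = decode_allowing_overlong(bytes.fromhex("F0838080"))
print(f"U+{code_point:04X}", overlong)  # prints U+3000 True
```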
Invalid UTF-8 bytes
Invalid bytes in the input are flagged. If the initial byte is invalid, the length of the sequence cannot be calculated and the remaining bytes are discarded. If the initial byte is valid, but the following bytes are invalid, the allowed range of each byte is shown. See C routine to convert UTF-8 to UCS2 for the ranges which are allowed.
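The following Python sketch shows one way such checks could be organised. The ranges are those of the Unicode Standard's table of well-formed UTF-8 byte sequences, which is stricter than this page's converter (that table excludes surrogates and overlong forms), so the ranges this page reports may differ. The names allowed_ranges and find_errors are hypothetical.

```python
def allowed_ranges(lead: int):
    """Return the allowed (low, high) range of each following byte, or None."""
    tail = (0x80, 0xBF)
    if lead <= 0x7F:
        return []
    if 0xC2 <= lead <= 0xDF:
        return [tail]
    if lead == 0xE0:
        return [(0xA0, 0xBF), tail]
    if 0xE1 <= lead <= 0xEC or 0xEE <= lead <= 0xEF:
        return [tail, tail]
    if lead == 0xED:
        return [(0x80, 0x9F), tail]
    if lead == 0xF0:
        return [(0x90, 0xBF), tail, tail]
    if 0xF1 <= lead <= 0xF3:
        return [tail, tail, tail]
    if lead == 0xF4:
        return [(0x80, 0x8F), tail, tail]
    return None  # not a valid initial byte


def find_errors(data: bytes) -> list:
    """List problems in one UTF-8 sequence; an empty list means none were found."""
    ranges = allowed_ranges(data[0])
    if ranges is None:
        # Without a valid initial byte the sequence length is unknown.
        return [f"invalid initial byte {data[0]:02X}; remaining bytes discarded"]
    errors = []
    for position, (low, high) in enumerate(ranges, start=1):
        if position >= len(data):
            errors.append(f"byte {position + 1} is missing")
        elif not low <= data[position] <= high:
            errors.append(
                f"byte {position + 1} is {data[position]:02X}, "
                f"allowed range is {low:02X} to {high:02X}"
            )
    return errors


print(find_errors(bytes.fromhex("E5418B")))
# prints ['byte 2 is 41, allowed range is 80 to BF']
```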
Maximum allowed Unicode code point
The maximum allowed Unicode code point is U+10FFFF. This converter prints an error if the Unicode code point is beyond that maximum.
Values resulting in up to four bytes of UTF-8 are processed
Values up to 0x1FFFFF may be contained in four UTF-8 bytes. The code point to UTF-8 converter displays the UTF-8 bytes corresponding to values up to this maximum value for four bytes, but it does not allow values resulting in more than four bytes of UTF-8 output. It will only process up to six hexadecimal digits. Regardless of this, values larger than U+10FFFF do not result in valid Unicode code points.
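The arithmetic behind the four-byte limit can be sketched in Python as follows. A four-byte sequence carries 3 + 6 + 6 + 6 = 21 payload bits, which is why 0x1FFFFF is the largest value that fits. The function name encode_up_to_four_bytes and its return shape are assumptions for illustration only.

```python
MAX_CODE_POINT = 0x10FFFF  # largest valid Unicode code point
MAX_FOUR_BYTE = 0x1FFFFF   # largest value that fits in 21 payload bits


def encode_up_to_four_bytes(value: int):
    """Return (utf8_bytes, is_valid_code_point) for values up to 0x1FFFFF."""
    if value < 0 or value > MAX_FOUR_BYTE:
        raise ValueError("value needs more than four bytes of UTF-8")
    if value < 0x80:
        data = bytes([value])
    elif value < 0x800:
        data = bytes([0xC0 | (value >> 6), 0x80 | (value & 0x3F)])
    elif value < 0x10000:
        data = bytes([
            0xE0 | (value >> 12),
            0x80 | ((value >> 6) & 0x3F),
            0x80 | (value & 0x3F),
        ])
    else:
        data = bytes([
            0xF0 | (value >> 18),
            0x80 | ((value >> 12) & 0x3F),
            0x80 | ((value >> 6) & 0x3F),
            0x80 | (value & 0x3F),
        ])
    return data, value <= MAX_CODE_POINT


data, valid = encode_up_to_four_bytes(0x1FFFFF)
print(data.hex().upper(), valid)  # prints F7BFBFBF False
```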
Code points ending in FFFE and FFFF are not characters
Code points ending in hexadecimal FFFE and FFFF are not valid Unicode characters. See the UTF-8 and Unicode FAQ for Unix/Linux by Markus Kuhn.
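A check for these values fits in one line, as in this Python sketch. The function name ends_in_fffe_or_ffff is hypothetical, and the sketch covers only the two endings described here.

```python
def ends_in_fffe_or_ffff(code_point: int) -> bool:
    """True if the low 16 bits of the code point are FFFE or FFFF."""
    # Masking with 0xFFFE clears the lowest bit, so FFFE and FFFF both match.
    return (code_point & 0xFFFE) == 0xFFFE


print(ends_in_fffe_or_ffff(0x1FFFF))  # prints True
print(ends_in_fffe_or_ffff(0xFFFD))   # prints False
```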