UTF-8 (UCS transformation format)

The Open Group has developed a transformation format for UCS designed for use in existing file systems. The intent is that UCS will be the process code for the transformation format, which is usable as a file code.

UTF-8 has the following properties:
  • It is a superset of ASCII, in which the ASCII characters are encoded as single-byte characters with the same numeric value.
  • No ASCII code values occur in multibyte characters, other than those that represent the ASCII characters.
  • The first byte of a character indicates the number of bytes to follow in the multibyte character sequence and cannot occur anywhere else in the sequence.

The UTF-8 encodes UCS values in the 0 through 0x7FFFFFFF range using multibyte characters with lengths of 1, 2, 3, 4, 5, and 6 bytes. Single-byte characters are reserved for the ASCII characters in the 0 through 0x7f range. These characters all have the high order bit set to 0. For all character encodings of more than one byte, the initial byte determines the number of bytes used, and the high-order bit in each byte is set. Every byte that does not start with the bit combination of 10xxxxxx, where x represents a bit that may be 0 or 1, is the start of a UCS character sequence. The following table provides UTF-8 multibyte codes:

Bytes Bits Hex Minimum Hex Maximum Byte Sequence in Binary
1 7 00000000 0000007F 0xxxxxxx
2 11 00000080 000007FF 110xxxxx 10xxxxxx
3 16 00000800 0000FFFF 1110xxxx 10xxxxxx 10xxxxxx
4 21 00010000 001FFFFF 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
5 26 00200000 03FFFFFF 111110xx 10xxxxxx 10xxxxxx 10xxxxx 10xxxxxx
6 31 04000000 7FFFFFFF 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

The UCS value is just the concatenation of the x bits in the multibyte encoding. When there are multiple ways to encode a value (for example, UCS 0), only the shortest encoding is permitted.

The following subset of UTF-8 is used to encode UCS-2:

Bytes Bits Hex Minimum Hex Maximum Byte Sequence in Binary
1 7 00000000 0000007F 0xxxxxxx
2 11 00000080 000007FF 110xxxxx 10xxxxxx
3 16 00000800 0000FFFF 1110xxxx 10xxxxxx 10xxxxxx

This subset of UTF-8 requires a maximum of three (3) bytes.