UTF-8 (UCS transformation format)
The Open Group has developed a transformation format for UCS designed for use in existing file systems. The intent is that UCS serves as the process code, while the transformation format is used as the file code. UTF-8 has the following properties:
- It is a superset of ASCII, in which the ASCII characters are encoded as single-byte characters with the same numeric value.
- No ASCII code values occur in multibyte characters, other than those that represent the ASCII characters.
- The first byte of a character indicates the number of bytes to follow in the multibyte character sequence and cannot occur anywhere else in the sequence.
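The second property is what makes UTF-8 safe for byte-oriented file system code: a byte such as the slash (0x2F) can only ever mean the ASCII character. The following is a minimal sketch of that idea, not a real path parser. It relies on Python's built-in `utf-8` codec, which implements the current, shorter definition of UTF-8 but agrees with this section for the characters used here; the helper name `split_path_bytes` and the sample path are made up for illustration.

```python
def split_path_bytes(encoded_path: bytes) -> list:
    """Split a UTF-8 encoded path on the ASCII slash byte (0x2F).

    No decoding is needed first, because 0x2F never appears inside
    a multibyte character.
    """
    return encoded_path.split(b"/")


# Hypothetical sample path containing 2-byte and 3-byte characters.
path = "d\u00e9p\u00f4t/\u20ac.txt"
encoded = path.encode("utf-8")

# Each component decodes cleanly, so no character was cut in half.
components = [part.decode("utf-8") for part in split_path_bytes(encoded)]

# Every byte below 0x80 in the encoded form is one of the ASCII characters.
ascii_bytes = bytes(b for b in encoded if b < 0x80)
```

Here `components` holds the two path elements intact, and `ascii_bytes` contains only the ASCII characters of the original string.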
UTF-8 encodes UCS values in the 0 through 0x7FFFFFFF range using multibyte characters with lengths of 1, 2, 3, 4, 5, and 6 bytes. Single-byte characters are reserved for the ASCII characters in the 0 through 0x7F range. These characters all have the high-order bit set to 0. For all character encodings of more than one byte, the initial byte determines the number of bytes used, and the high-order bit in each byte is set to 1. Every byte that does not start with the bit combination 10xxxxxx, where x represents a bit that may be 0 or 1, is the start of a UCS character sequence. The following table lists the UTF-8 multibyte codes:
Bytes | Bits | Hex Minimum | Hex Maximum | Byte Sequence in Binary |
---|---|---|---|---|
1 | 7 | 00000000 | 0000007F | 0xxxxxxx |
2 | 11 | 00000080 | 000007FF | 110xxxxx 10xxxxxx |
3 | 16 | 00000800 | 0000FFFF | 1110xxxx 10xxxxxx 10xxxxxx |
4 | 21 | 00010000 | 001FFFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx |
5 | 26 | 00200000 | 03FFFFFF | 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx |
6 | 31 | 04000000 | 7FFFFFFF | 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx |
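The lead-byte patterns in the table can be turned into a simple classifier. The sketch below is one possible reading of the table, not a reference implementation. The function names `sequence_length`, `is_continuation`, and `character_starts` are illustrative assumptions.

```python
def sequence_length(first_byte: int) -> int:
    """Return the total number of bytes in the sequence begun by first_byte."""
    if first_byte < 0x80:   # 0xxxxxxx
        return 1
    if first_byte < 0xC0:   # 10xxxxxx is a continuation byte, never a lead byte
        raise ValueError("continuation byte cannot start a character")
    if first_byte < 0xE0:   # 110xxxxx
        return 2
    if first_byte < 0xF0:   # 1110xxxx
        return 3
    if first_byte < 0xF8:   # 11110xxx
        return 4
    if first_byte < 0xFC:   # 111110xx
        return 5
    if first_byte < 0xFE:   # 1111110x
        return 6
    raise ValueError("0xFE and 0xFF are not defined as lead bytes in the table")


def is_continuation(byte: int) -> bool:
    """True if the byte has the form 10xxxxxx."""
    return byte & 0xC0 == 0x80


def character_starts(data: bytes) -> list:
    """Return the offsets of all bytes that begin a character sequence."""
    return [i for i, b in enumerate(data) if not is_continuation(b)]
```

For example, `character_starts(b"a\xc3\xa9\xe2\x82\xac")` reports offsets 0, 1, and 3, the starts of a 1-byte, a 2-byte, and a 3-byte character. Because continuation bytes are recognizable on their own, a reader can resynchronize from any position in a byte stream.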
The UCS value is the concatenation of the x bits in the multibyte encoding, taken in order from the first byte to the last. When a value could be encoded in more than one way (for example, UCS 0), only the shortest encoding is permitted.
The following subset of UTF-8 is used to encode UCS-2:
Bytes | Bits | Hex Minimum | Hex Maximum | Byte Sequence in Binary |
---|---|---|---|---|
1 | 7 | 00000000 | 0000007F | 0xxxxxxx |
2 | 11 | 00000080 | 000007FF | 110xxxxx 10xxxxxx |
3 | 16 | 00000800 | 0000FFFF | 1110xxxx 10xxxxxx 10xxxxxx |
This subset of UTF-8 requires at most three bytes per character.