Unicode character encoding

The Unicode character encoding standard is a character encoding scheme that includes characters from almost all of the living languages of the world.

Information about Unicode can be found in The Unicode Standard and on the Unicode Consortium website at www.unicode.org.

Unicode uses two encoding forms: 8-bit and 16-bit, based on the data type of the data that is being encoded. The default encoding form is 16-bit, in which each character is 16 bits (2 bytes) wide. The 16-bit encoding form is usually shown as U+hhhh, where hhhh is the hexadecimal code point of the character. This encoding form produces more than 65,000 code elements, which is sufficient for encoding most of the characters of the major languages of the world. The Unicode standard also provides an extension mechanism that allows the encoding of as many as 1,000,000 additional characters. The extension mechanism uses a pair of high and low surrogate characters to encode one extended, or supplementary, character. The first (or high) surrogate character has a code value between U+D800 and U+DBFF, and the second (or low) surrogate character has a code value between U+DC00 and U+DFFF.
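
As a minimal Python sketch of this mechanism (the helper name combine_surrogates is hypothetical, not a library function), a surrogate pair can be combined into a supplementary code point as follows:

    # Minimal sketch: combine a high/low surrogate pair into a
    # supplementary code point. 0xD800 and 0xDC00 are the starts of the
    # high- and low-surrogate ranges described above.
    def combine_surrogates(high, low):
        assert 0xD800 <= high <= 0xDBFF, "not a high surrogate"
        assert 0xDC00 <= low <= 0xDFFF, "not a low surrogate"
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

    print(hex(combine_surrogates(0xD83D, 0xDE00)))  # 0x1f600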

UCS-2

The International Organization for Standardization (ISO) and the International Electrotechnical Commission (IEC) standard 10646 (ISO/IEC 10646) specifies the Universal Multiple-Octet Coded Character Set (UCS). This character set has a 16-bit (two-byte) version (UCS-2) and a 32-bit (four-byte) version (UCS-4). UCS-2 is identical to the Unicode 16-bit form without surrogates and can encode all the (16-bit) characters that are defined in the Unicode version 3.0 repertoire. Two UCS-2 characters, a high surrogate followed by a low surrogate, are needed to encode each of the supplementary characters introduced in Unicode version 3.1. These supplementary characters are defined outside the original 16-bit Basic Multilingual Plane (BMP, or Plane 0).

UTF-16

ISO/IEC 10646 also defines an extension technique for encoding some UCS-4 characters by using two UCS-2 characters. This extension, called UTF-16, is identical to the Unicode 16-bit encoding form with surrogates. In summary, the UTF-16 character repertoire consists of all the UCS-2 characters plus the approximately 1,000,000 supplementary characters that are accessible through surrogate pairs.
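
Going the other direction, this minimal Python sketch (the helper name to_surrogate_pair is hypothetical) splits a supplementary code point into the surrogate pair that UTF-16 uses to represent it:

    # Minimal sketch: split a supplementary code point into the UTF-16
    # high/low surrogate pair that represents it.
    def to_surrogate_pair(cp):
        assert 0x10000 <= cp <= 0x10FFFF, "not a supplementary character"
        offset = cp - 0x10000            # 20-bit offset into the supplementary range
        high = 0xD800 + (offset >> 10)   # top 10 bits
        low = 0xDC00 + (offset & 0x3FF)  # bottom 10 bits
        return high, low

    print([hex(u) for u in to_surrogate_pair(0x1F600)])  # ['0xd83d', '0xde00']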

When serializing 16-bit Unicode characters into bytes, the order in which the bytes appear depends on the processor that is being used. Some processors place the most significant byte in the initial position (known as big-endian order), while others place the least significant byte first (known as little-endian order). The default byte ordering for Unicode is big-endian.
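
A minimal Python sketch of the two byte orders for a single 16-bit code unit:

    # Serialize the 16-bit code unit U+0041 ('A') in both byte orders.
    unit = 0x0041
    print(unit.to_bytes(2, "big").hex())     # '0041' - big-endian, the Unicode default
    print(unit.to_bytes(2, "little").hex())  # '4100' - little-endian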

UTF-8

Sixteen-bit Unicode characters pose a major problem for byte-oriented, ASCII-based applications and file systems. For example, a non-Unicode-aware application might misinterpret the leading eight zero bits of the uppercase letter 'A' (U+0041) as the single-byte ASCII NULL character.
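
A minimal Python sketch of the problem: the big-endian UTF-16 bytes of plain ASCII text contain zero bytes, which a byte-oriented routine treats as string terminators.

    # The 16-bit encoding of ASCII text contains zero bytes.
    data = "ABC".encode("utf-16-be")
    print(data)                         # b'\x00A\x00B\x00C'
    # A C-style strlen() stops at the first zero byte and reports length 0:
    print(len(data.split(b"\x00")[0]))  # 0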

UTF-8 (UCS Transformation Format 8) is an algorithmic transformation that converts fixed-length Unicode characters into variable-length, ASCII-safe byte strings. In UTF-8, ASCII and control characters are represented by their usual single-byte codes, and all other characters become two or more bytes long. UTF-8 can encode both nonsupplementary and supplementary characters.

UTF-8 characters can be up to 4 bytes long. Nonsupplementary characters are up to 3 bytes long, and supplementary characters are 4 bytes long.
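
These lengths can be confirmed with the Python standard library's UTF-8 codec:

    # Byte lengths of sample characters in UTF-8.
    for cp in (0x0041, 0x0080, 0x0800, 0x10000):
        print(f"U+{cp:04X} -> {len(chr(cp).encode('utf-8'))} byte(s)")
    # U+0041 -> 1, U+0080 -> 2, U+0800 -> 3, U+10000 -> 4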

The number of bytes for each UTF-16 character in UTF-8 format can be determined from Table 1.

Table 1. UTF-8 Bit Distribution (all values are binary)

Code value                UTF-16                               First byte  Second byte  Third byte  Fourth byte
00000000 0xxxxxxx         00000000 0xxxxxxx                    0xxxxxxx
00000yyy yyxxxxxx         00000yyy yyxxxxxx                    110yyyyy    10xxxxxx
zzzzyyyy yyxxxxxx         zzzzyyyy yyxxxxxx                    1110zzzz    10yyyyyy     10xxxxxx
uuuuu zzzzyyyy yyxxxxxx   110110ww wwzzzzyy 110111yy yyxxxxxx  11110uuu    10uuzzzz     10yyyyyy    10xxxxxx
                          (where uuuuu = wwww+1)

In each of the code values listed in the previous table, the series of u's, w's, x's, y's, and z's is the bit representation of the character. For example, U+0080 transforms into 11000010 10000000 in binary format, and the surrogate character pair U+D800 U+DC00 becomes 11110000 10010000 10000000 10000000 in binary format.
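
The bit distribution in Table 1 can be exercised directly. The following minimal Python sketch (an illustrative transformation, not a production codec; it assumes well-formed input and omits error handling for unpaired surrogates) converts UTF-16 code units to UTF-8 bytes and reproduces both examples:

    # Sketch of the Table 1 transformation: UTF-16 code units -> UTF-8 bytes.
    def utf16_to_utf8(units):
        out = bytearray()
        it = iter(units)
        for u in it:
            if 0xD800 <= u <= 0xDBFF:          # high surrogate: 4-byte form
                low = next(it)                 # assumes a valid low surrogate follows
                cp = 0x10000 + ((u - 0xD800) << 10) + (low - 0xDC00)
                out += bytes([0xF0 | (cp >> 18),
                              0x80 | ((cp >> 12) & 0x3F),
                              0x80 | ((cp >> 6) & 0x3F),
                              0x80 | (cp & 0x3F)])
            elif u < 0x80:                     # 1-byte form (ASCII)
                out.append(u)
            elif u < 0x800:                    # 2-byte form
                out += bytes([0xC0 | (u >> 6),
                              0x80 | (u & 0x3F)])
            else:                              # 3-byte form
                out += bytes([0xE0 | (u >> 12),
                              0x80 | ((u >> 6) & 0x3F),
                              0x80 | (u & 0x3F)])
        return bytes(out)

    print(utf16_to_utf8([0x0080]).hex())          # 'c280'     (11000010 10000000)
    print(utf16_to_utf8([0xD800, 0xDC00]).hex())  # 'f0908080' (11110000 10010000 10000000 10000000)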