UTF-16

UTF-16 is an encoding of Unicode in which each character is represented by either one or two 16-bit code units.

Unicode was originally designed as a pure 16-bit encoding, aimed at representing all modern scripts. Over time, and especially after the addition of over 14 500 composite characters for compatibility with established character sets, it became clear that 16 bits were not sufficient to cover everything users needed. UTF-16 arose out of this.

UTF-16 allows access to about 63 000 characters (63 488 code values) as single 16-bit units. It can access an additional 1 048 576 characters, just over 1 000 000, by a mechanism known as surrogate pairs.
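The figures above can be checked with a little arithmetic. The following is a minimal sketch in Python, assuming the two reserved surrogate ranges of 1 024 values each (0xD800 to 0xDBFF and 0xDC00 to 0xDFFF); the constant and variable names are chosen here for illustration only.

```python
# Sizes of the ranges involved, written as plain arithmetic.
BMP_SIZE = 0x10000                      # code values 0x0000..0xFFFF
HIGH_SURROGATES = 0xDBFF - 0xD800 + 1   # values reserved for the first unit of a pair
LOW_SURROGATES = 0xDFFF - 0xDC00 + 1    # values reserved for the second unit of a pair

# Code values that can stand alone as a single 16-bit unit
# (everything below 0x10000 except the reserved surrogate values).
single_unit_values = BMP_SIZE - HIGH_SURROGATES - LOW_SURROGATES

# Every (high, low) combination addresses one additional character.
surrogate_pair_values = HIGH_SURROGATES * LOW_SURROGATES

print(single_unit_values, surrogate_pair_values)
```

Not every one of these code values has a character assigned to it; the numbers describe the addressable space rather than the count of defined characters.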

Two ranges of Unicode code values are reserved for the high (first) and low (second) values of these pairs. High surrogates run from 0xD800 to 0xDBFF, and low surrogates from 0xDC00 to 0xDFFF. Because the most common characters are already encoded in the code values below 0x10000, characters that require surrogate pairs are relatively rare.
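The mapping between a character above 0xFFFF and its surrogate pair can be sketched as follows. This is an illustrative Python example, not part of any IBM i interface; the function names `to_surrogate_pair` and `from_surrogate_pair` are hypothetical. It assumes the standard UTF-16 rule: subtract 0x10000, then place the upper 10 bits in the high surrogate and the lower 10 bits in the low surrogate.

```python
def to_surrogate_pair(code_point):
    """Split a code point in 0x10000..0x10FFFF into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point does not need a surrogate pair")
    offset = code_point - 0x10000        # 20-bit value
    high = 0xD800 + (offset >> 10)       # upper 10 bits -> 0xD800..0xDBFF
    low = 0xDC00 + (offset & 0x3FF)      # lower 10 bits -> 0xDC00..0xDFFF
    return high, low


def from_surrogate_pair(high, low):
    """Combine a (high, low) surrogate pair back into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# U+1F600 lies outside the 16-bit range, so it needs two code units.
high, low = to_surrogate_pair(0x1F600)
print(hex(high), hex(low))  # 0xd83d 0xde00
```

Because the high and low ranges do not overlap with each other or with ordinary single-unit values, a program looking at any one 16-bit unit can tell whether it is a complete character, the first half of a pair, or the second half.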

UTF-16 is designed as a compromise between ease of handling and storage space: all commonly used characters can be stored with one code unit per code point, and only the rarer characters need two. Many platforms and programming interfaces use it as their default in-memory encoding for Unicode.
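As a rough illustration of the space trade-off, the sketch below counts how many 16-bit code units a string occupies in UTF-16. It uses Python's built-in `utf-16-le` codec, which writes no byte order mark; the helper name `utf16_code_units` is made up for this example.

```python
def utf16_code_units(text):
    """Return the number of 16-bit code units needed to store text in UTF-16."""
    # Each code unit is 2 bytes; "utf-16-le" adds no byte order mark.
    return len(text.encode("utf-16-le")) // 2


print(utf16_code_units("A"))           # Latin letter: 1 code unit
print(utf16_code_units("\u20ac"))      # euro sign U+20AC: 1 code unit
print(utf16_code_units("\U0001F600"))  # U+1F600: 2 code units (surrogate pair)
```

Common characters cost exactly one unit each, so the length in code units equals the number of characters unless the text contains characters that need surrogate pairs.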

The IBM® i operating system supports UTF-16 encoding with CCSID 1200 (and CCSID 13488). Beginning with IBM i V5R3, CCSID 1200 is supported in the database. CCSID 13488 has been supported in the database for several releases.