Unicode implementation in Db2

Db2 supports UTF-8 and UCS-2 encoding. When a Unicode database is created, CHAR, VARCHAR, LONG VARCHAR, and CLOB data are stored in UTF-8 form, and GRAPHIC, VARGRAPHIC, LONG VARGRAPHIC, and DBCLOB data are stored in UCS-2 big-endian form.
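The difference between the two storage forms can be seen at the byte level. The following is a minimal sketch in Python, not Db2 code; it assumes that Python's `utf-16-be` codec is an acceptable stand-in for UCS-2 big-endian, which holds for characters in the Basic Multilingual Plane. The variable names are illustrative only.

```python
# Illustration only: compare the byte layout of UTF-8 and UCS-2 big-endian.
# Assumption: for BMP characters, "utf-16-be" yields the same bytes as UCS-2 BE.
text = "Db2 é"

utf8_bytes = text.encode("utf-8")      # form described for CHAR/VARCHAR/CLOB data
ucs2_bytes = text.encode("utf-16-be")  # form described for GRAPHIC/VARGRAPHIC/DBCLOB data

print(utf8_bytes.hex(" "))   # 44 62 32 20 c3 a9
print(ucs2_bytes.hex(" "))   # 00 44 00 62 00 32 00 20 00 e9
```

In UTF-8, ASCII characters take one byte and the accented letter takes two; in UCS-2 every character in this string takes exactly two bytes, most significant byte first.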

A surrogate pair is a coded representation for a single character that consists of a sequence of two 16-bit code units, where the first unit of the pair is a high-surrogate in the range U+D800 through U+DBFF, and the second unit is a low-surrogate in the range U+DC00 through U+DFFF. Surrogate pairs make it possible to encode an additional 1,048,576 code points (1,024 high-surrogates times 1,024 low-surrogates) without using 32-bit code units.
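The mapping between a supplementary code point and its surrogate pair is simple arithmetic defined by UTF-16. The following Python sketch shows it; the function names `to_surrogate_pair` and `from_surrogate_pair` are hypothetical helpers written for this illustration and are not part of any Db2 interface.

```python
from typing import Tuple

def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a supplementary code point (U+10000..U+10FFFF) into (high, low)."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000        # 20-bit value
    high = 0xD800 + (offset >> 10)       # upper 10 bits -> U+D800..U+DBFF
    low = 0xDC00 + (offset & 0x3FF)      # lower 10 bits -> U+DC00..U+DFFF
    return high, low

def from_surrogate_pair(high: int, low: int) -> int:
    """Combine a high and a low surrogate back into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

high, low = to_surrogate_pair(0x1F600)
print(f"U+{high:04X} U+{low:04X}")   # U+D83D U+DE00
```

Because each half of the pair carries 10 bits, the scheme covers 1,024 × 1,024 = 1,048,576 code points, from U+10000 through U+10FFFF.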

In versions of Db2 products before Version 7.2 FixPak 4, Db2 treats the two characters in a surrogate pair as two independent Unicode characters. Therefore, transforming the pair from UTF-16/UCS-2 to UTF-8 results in two three-byte sequences. Starting in Db2 Universal Database Version 7.2 FixPak 4, Db2 recognizes surrogate pairs when transforming between UTF-16/UCS-2 and UTF-8, so a pair of UTF-16 surrogates becomes a single four-byte UTF-8 sequence. In other usages, Db2 continues to treat a surrogate pair as two independent UCS-2 characters. You can safely store supplementary characters in Db2 Unicode databases, provided that you know how to distinguish them from the non-supplementary characters.
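The two conversion results can be reproduced outside the database to see the difference in bytes. The following Python sketch is an illustration of the two outcomes, not a description of how Db2 implements the conversion. It uses Python's `surrogatepass` error handler to encode each surrogate on its own, and the variable names are assumptions made for this example.

```python
# One supplementary character (U+1F600) in UTF-16 big-endian: two 16-bit code units.
utf16_be = "\U0001F600".encode("utf-16-be")
units = [int.from_bytes(utf16_be[i:i + 2], "big") for i in range(0, len(utf16_be), 2)]

# Each surrogate converted as an independent character:
# two three-byte sequences, six bytes in total.
independent = b"".join(chr(u).encode("utf-8", "surrogatepass") for u in units)

# The pair recognized as one character: a single four-byte UTF-8 sequence.
paired = utf16_be.decode("utf-16-be").encode("utf-8")

print(independent.hex(" "))  # ed a0 bd ed b8 80
print(paired.hex(" "))       # f0 9f 98 80
```

One way to distinguish supplementary characters in stored UTF-8 data is to look for lead bytes in the range 0xF0 through 0xF4, which only begin four-byte sequences.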

Db2 treats each Unicode character, including non-spacing characters such as the COMBINING ACUTE ACCENT character (U+0301), as an individual character. It does not perform normalization, that is, the transformation of equivalent characters or sequences of characters into a consistent underlying representation. Therefore, Db2 does not recognize that the character LATIN SMALL LETTER A WITH ACUTE (U+00E1) is canonically equivalent to the character LATIN SMALL LETTER A (U+0061) followed by the character COMBINING ACUTE ACCENT (U+0301).
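Because the database does not normalize, an application that needs canonically equivalent strings to compare as equal has to normalize them itself before storing or comparing them. The following Python sketch shows the effect with the standard `unicodedata` module; choosing NFC as the target form is an assumption made for this example, not a Db2 requirement.

```python
import unicodedata

precomposed = "\u00E1"        # LATIN SMALL LETTER A WITH ACUTE
decomposed = "\u0061\u0301"   # LATIN SMALL LETTER A + COMBINING ACUTE ACCENT

# Without normalization the two strings differ in length and in bytes.
print(precomposed == decomposed)                 # False
print(len(precomposed), len(decomposed))         # 1 2
print(precomposed.encode("utf-8").hex(" "))      # c3 a1
print(decomposed.encode("utf-8").hex(" "))       # 61 cc 81

# After normalizing both to the same form (NFC here), they compare equal.
nfc_a = unicodedata.normalize("NFC", precomposed)
nfc_b = unicodedata.normalize("NFC", decomposed)
print(nfc_a == nfc_b)                            # True
```

Applying the same normalization form consistently on every write path keeps equality predicates and unique constraints from treating the two spellings as different values.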

All culturally sensitive parameters, such as date or time format, decimal separator, and others, are based on the current territory of the client.

A Unicode database allows connection from every code page supported by the Db2 database system. The database manager automatically performs code page conversion for character and graphic strings between the client's code page and Unicode.

Every client is limited by the character repertoire, the input method, and the fonts supported by its environment, but the Unicode database itself accepts and stores all Unicode characters. Therefore, every client usually works with a subset of Unicode characters, but the database manager allows the entire repertoire of Unicode characters.

When characters are converted from a local code page to Unicode, the number of bytes might increase. Prior to Version 8, based on the semantics of SQL statements, character data may have been marked as being encoded in the client's code page, and the database server would have manipulated the entire statement in the client's code page. This manipulation could have caused the data to expand. Starting in Version 8, once an SQL statement enters the database server, it operates only on the database server's code page. In this case, the size of the data does not change. However, specifying string units for some string functions might result in internal code page conversions. If this occurs, the size of the data string might change.
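The expansion that occurs when data moves from a local code page to Unicode can be estimated before columns are sized. The following Python sketch compares byte lengths using Python's own codecs; the helper name `byte_expansion` is hypothetical, and the Python codec names `cp1252` and `cp932` are used here only as examples of client code pages.

```python
from typing import Tuple

def byte_expansion(text: str, code_page: str) -> Tuple[int, int]:
    """Return (bytes in the local code page, bytes in UTF-8) for the same text."""
    return len(text.encode(code_page)), len(text.encode("utf-8"))

print(byte_expansion("Db2", "cp1252"))      # (3, 3)  ASCII does not expand
print(byte_expansion("café €", "cp1252"))   # (6, 9)  accented letter and euro sign expand
print(byte_expansion("日本語", "cp932"))     # (6, 9)  two-byte characters become three bytes
```

A column defined with a byte length that just fits the data in the client's code page can therefore be too short for the same data in UTF-8.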