Data types in Unicode databases

All data types supported by Db2® are also supported in a Unicode database. In particular, graphic string data is supported for a Unicode database, and is stored in UCS-2 encoding. Every client, including SBCS clients, can work with graphic string data types in UCS-2 encoding when connected to a Unicode database.

Like any MBCS database, a Unicode database measures character string data in bytes, not characters. When working with character string data in UTF-8 encoding, do not assume that each character is one byte. In UTF-8, each ASCII character is one byte, but each non-ASCII character takes two to four bytes. Take this into account when defining CHAR fields: depending on the ratio of ASCII to non-ASCII characters, a CHAR field of size n bytes can contain anywhere from n/4 to n characters.
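The byte counts above can be checked with a short Python sketch. It uses only the standard UTF-8 codec. The helper names utf8_len and char_capacity are hypothetical, chosen for illustration, and are not part of any Db2 interface.

```python
def utf8_len(text):
    """Return the number of bytes the text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


def char_capacity(n_bytes):
    """Return (fewest, most) characters that fit in n_bytes of UTF-8 data.

    Fewest: every character needs four bytes. Most: every character is ASCII.
    """
    return n_bytes // 4, n_bytes


# One sample character for each UTF-8 sequence length.
samples = {
    "A": "ASCII letter",
    "\u00e9": "Latin small e with acute",
    "\u65e5": "CJK ideograph",
    "\U0001F600": "supplementary-plane character",
}

for ch, label in samples.items():
    print(f"U+{ord(ch):04X} ({label}): {utf8_len(ch)} byte(s)")

print(char_capacity(10))  # (2, 10)
```

For example, a 10-byte CHAR field holds ten ASCII characters but only two characters that each need four bytes.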

The choice between UTF-8 character strings and UCS-2 graphic strings also affects total storage requirements. Where the majority of characters are ASCII, with some non-ASCII characters in between, storing UTF-8 data may be the better alternative, because the storage requirements are closer to one byte per character. However, where the majority of characters are non-ASCII characters that expand to three- or four-byte UTF-8 sequences (for example, ideographic characters), the UCS-2 graphic string format may be the better alternative, because every three-byte UTF-8 sequence becomes one 16-bit UCS-2 character, while each four-byte UTF-8 sequence becomes two 16-bit UCS-2 characters.
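The trade-off can be made concrete with a rough Python comparison. This is a sketch under one assumption: Python's utf-16-be codec stands in for the graphic string format, which matches the description above (one 16-bit unit for a three-byte UTF-8 sequence, two units for a four-byte sequence). The function name storage_bytes is hypothetical, and the figures cover encoded data only, not column or row overhead.

```python
def storage_bytes(text):
    """Return the encoded size of text in bytes for both representations."""
    return {
        "utf8": len(text.encode("utf-8")),
        # Stand-in for the 16-bit graphic format: two bytes per code unit.
        "graphic": len(text.encode("utf-16-be")),
    }


mostly_ascii = "Db2 database"          # 12 ASCII characters
ideographic = "\u65e5\u672c\u8a9e"     # 3 ideographs, 3 UTF-8 bytes each
supplementary = "\U00020000"           # 1 character, 4 UTF-8 bytes

print(storage_bytes(mostly_ascii))     # {'utf8': 12, 'graphic': 24}
print(storage_bytes(ideographic))      # {'utf8': 9, 'graphic': 6}
print(storage_bytes(supplementary))    # {'utf8': 4, 'graphic': 4}
```

The ASCII sample doubles in size in the 16-bit format, the ideographic sample shrinks by a third, and a four-byte UTF-8 sequence occupies four bytes either way.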

In C user programs, SQL CHAR data types are supported by the char data type, and SQL GRAPHIC data types are supported by sqldbchar. Note that, for a Unicode database, sqldbchar data is always in big-endian (high byte first) format. When an application program is connected to a Unicode database, Db2 converts character string data between the application code page and UTF-8, and graphic string data between the application graphic code page and UCS-2.
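The big-endian layout can be illustrated outside of Db2 with a few lines of Python. This sketch shows byte order only. It does not use sqldbchar or any Db2 API, the function name ucs2_be_bytes is hypothetical, and it assumes every character fits in a single 16-bit unit (struct.pack raises struct.error otherwise).

```python
import struct


def ucs2_be_bytes(text):
    """Encode text as 16-bit code units, high byte first.

    Assumes each character is at most U+FFFF; larger code points
    make struct.pack raise struct.error.
    """
    return b"".join(struct.pack(">H", ord(ch)) for ch in text)


encoded = ucs2_be_bytes("A\u65e5")

# U+0041 is stored as 00 41, U+65E5 as 65 E5.
print(encoded.hex(" "))  # 00 41 65 e5

# The little-endian form of the same text swaps each byte pair.
print("A\u65e5".encode("utf-16-le").hex(" "))  # 41 00 e5 65
```

An application on a little-endian platform that inspects such a buffer byte by byte therefore sees the high byte of each character first.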

When data is retrieved from a Unicode database by an application that does not use an SBCS, EUC, or Unicode code page, the defined substitution character is returned for each blank used to pad a graphic column. This happens because Db2 pads fixed-length Unicode graphic columns with ASCII blanks (U+0020), a character that has no equivalent in pure DBCS code pages, so each padding blank is converted to the substitution character on retrieval. Similarly, in a DATE, TIME, or TIMESTAMP string, any SBCS character that does not have a pure DBCS equivalent is also converted to the substitution character when retrieved by such an application.

Note: Before Version 8, graphic string data and character string data were not compatible. Since Version 8, graphic and character data can be used interchangeably. To provide compatibility with applications that depend on the previous behavior of Db2, the registry variable DB2GRAPHICUNICODESERVER was introduced. Its default value is OFF. Setting this variable to ON causes Db2 to use its earlier behavior. Additionally, the Db2 server checks the version of Db2 running on the client, and simulates Db2 Universal Database Version 7 behavior if the client is running Db2 UDB Version 7.