Data types in Unicode databases
All data types supported by Db2® are also supported in a Unicode database. In particular, graphic string data is supported for a Unicode database, and is stored in UCS-2 encoding. Every client, including SBCS clients, can work with graphic string data types in UCS-2 encoding when connected to a Unicode database.
As in any MBCS database, character string data in a Unicode database is measured in bytes. When working with character string data in UTF-8 encoding, do not assume that each character is one byte. In multibyte UTF-8 encoding, each ASCII character is one byte, but non-ASCII characters take two to four bytes each. Take this into account when defining CHAR fields. Depending on the ratio of ASCII to non-ASCII characters, a CHAR field of size n bytes can contain anywhere from n/4 to n characters.
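The byte counts described above can be checked with a short sketch. It uses Python's built-in UTF-8 codec rather than Db2 itself, and the helper names `utf8_byte_length` and `char_capacity_bounds` are hypothetical, introduced only for illustration.

```python
def utf8_byte_length(text):
    """Return the number of bytes text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


def char_capacity_bounds(n_bytes):
    """Return (fewest, most) whole characters that can fill n_bytes of UTF-8.

    The lower bound assumes every character needs four bytes; the upper
    bound assumes every character is ASCII (one byte).
    """
    return n_bytes // 4, n_bytes


# One sample per UTF-8 length class: 1, 2, 3 and 4 bytes.
samples = ["A", "\u00e9", "\u65e5", "\U0001F600"]
for ch in samples:
    print("U+%04X: %d byte(s)" % (ord(ch), utf8_byte_length(ch)))

# Bounds for a 12-byte field (the size is an arbitrary example).
print(char_capacity_bounds(12))
```

For a 12-byte field the bounds are 3 and 12 characters, which is why a column sized by character count alone can turn out too small for non-ASCII data.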
Using character string UTF-8 encoding versus the graphic string UCS-2 data type also has an impact on the total storage requirements. In a situation where the majority of characters are ASCII, with some non-ASCII characters in between, storing UTF-8 data may be a better alternative, because the storage requirements are closer to one byte per character. However, in situations where the majority of characters are non-ASCII characters that expand to three- or four-byte UTF-8 sequences (for example ideographic characters), the UCS-2 graphic-string format may be a better alternative, because every three-byte UTF-8 sequence becomes a 16-bit UCS-2 character, while each four-byte UTF-8 sequence becomes two 16-bit UCS-2 characters.
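The trade-off can be made concrete with a rough comparison. This sketch assumes that Python's `utf-16-be` codec is a fair stand-in for the 16-bit graphic storage described above (two bytes per UCS-2 character, two 16-bit characters for each four-byte UTF-8 sequence). It measures encoded data only and ignores any column or row overhead in the database. The function name `storage_bytes` is hypothetical.

```python
def storage_bytes(text):
    """Return (utf8_bytes, ucs2_bytes) needed to hold text.

    Assumption: graphic storage is modeled as big-endian 16-bit units,
    using Python's utf-16-be codec as an approximation.
    """
    utf8_bytes = len(text.encode("utf-8"))
    ucs2_bytes = len(text.encode("utf-16-be"))
    return utf8_bytes, ucs2_bytes


examples = {
    "mostly ASCII": "Hello, world",           # 12 ASCII characters
    "ideographic": "\u65e5\u672c\u8a9e",      # 3 three-byte characters
    "supplementary": "\U0001F600",            # 1 four-byte character
}

for label, text in examples.items():
    utf8_bytes, ucs2_bytes = storage_bytes(text)
    print("%-14s UTF-8: %2d bytes, UCS-2: %2d bytes"
          % (label, utf8_bytes, ucs2_bytes))
```

Under these assumptions the ASCII sample doubles in size as graphic data (12 bytes versus 24), the ideographic sample shrinks by a third (9 bytes versus 6), and the four-byte character costs the same in both forms.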
SQL CHAR data types are supported (in the C language) by the char data type in user programs. SQL GRAPHIC data types are supported by sqldbchar in user programs. Note that, for a Unicode database, sqldbchar data is always in big-endian (high byte first) format. When an application program is connected to a Unicode database, character string data is converted between the application code page and UTF-8, and graphic string data is converted between the application graphic code page and UCS-2 by Db2.
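To illustrate what big-endian (high byte first) means for 16-bit graphic data, the following sketch lays out the bytes of a short string. It does not use the Db2 headers or the real sqldbchar type. It models each graphic character as an unsigned 16-bit value using Python's `struct` module, and the function name `big_endian_units` is hypothetical.

```python
import struct


def big_endian_units(text):
    """Return (raw_bytes, code_units) for text as big-endian 16-bit units.

    Assumption: each graphic character is modeled as one unsigned
    16-bit value, which is only an approximation of sqldbchar.
    """
    raw = text.encode("utf-16-be")
    # ">" selects big-endian byte order, "H" is an unsigned 16-bit integer.
    units = list(struct.unpack(">%dH" % (len(raw) // 2), raw))
    return raw, units


raw, units = big_endian_units("A\u65e5")
print(raw.hex())                        # high byte of each unit comes first
print(["0x%04X" % u for u in units])
```

For U+0041 the high byte 0x00 precedes the low byte 0x41. An application on a little-endian machine that reads such a buffer as native 16-bit integers would see the bytes swapped, so the byte order matters when graphic data is inspected directly.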
When data is retrieved from a Unicode database by an application that does not use an SBCS, EUC, or Unicode code page, the defined substitution character is returned for each blank used to pad a graphic column. Db2 pads fixed-length Unicode graphic columns with ASCII blanks (U+0020), a character that has no equivalent in pure DBCS code pages. As a result, each ASCII blank used in the padding of the graphic column is converted to the substitution character on retrieval. Similarly, in a DATE, TIME or TIMESTAMP string, any SBCS character that does not have a pure DBCS equivalent is also converted to the substitution character when retrieved from a Unicode database to an application that does not use an SBCS, EUC, or Unicode code page.
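The padding effect can be sketched with a toy model. Everything here is an assumption made for illustration: the function names are hypothetical, a character is treated as having a pure DBCS equivalent when Python's `shift_jis` codec encodes it as exactly two bytes, and U+FFFD stands in as a visible marker for the substitution character. Db2's actual conversion tables and substitution characters depend on the code page and are not reproduced here.

```python
SUBSTITUTION_MARKER = "\ufffd"  # hypothetical visible stand-in


def pad_graphic(value, length):
    """Pad value to a fixed length with ASCII blanks (U+0020)."""
    return value + "\u0020" * (length - len(value))


def has_pure_dbcs_equivalent(ch):
    """Toy test: does ch encode as exactly two bytes in Shift-JIS?"""
    try:
        return len(ch.encode("shift_jis")) == 2
    except UnicodeEncodeError:
        return False


def retrieve_as_pure_dbcs(value):
    """Replace characters without a double-byte form by the marker."""
    return "".join(
        ch if has_pure_dbcs_equivalent(ch) else SUBSTITUTION_MARKER
        for ch in value
    )


stored = pad_graphic("\u65e5\u672c", 4)    # two characters padded to four
retrieved = retrieve_as_pure_dbcs(stored)
print([hex(ord(ch)) for ch in retrieved])  # the two blanks become the marker
```

In this model the two ideographs survive and the two padding blanks are replaced, which mirrors the behavior described above for fixed-length graphic columns.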