Estimating the column size for Unicode data
When you create a table to store Unicode data, allocate columns for storage length, not for display length.
Procedure
To estimate the column size for Unicode data, perform one of the following actions:
- For UTF-8 data, allocate three times the column size that you would allocate for a non-Unicode table (see the first sketch after this list).
For example, if you use CHAR(10) for a name column in an EBCDIC table, use VARCHAR(30) for the same column in a Unicode table. This column can hold 30 bytes, or ten 3-byte characters. Use VARCHAR instead of CHAR in this case because the length (30) exceeds 18, the traditional threshold beyond which VARCHAR is preferred over CHAR.
This estimate allows for the worst-case expansion of UTF-8 data. The worst case for SBCS data is that 1 byte in ASCII or EBCDIC expands to 3 bytes in UTF-8. The same bound applies to mixed data, such as Chinese, Japanese, or Korean characters. Depending on the encoding, those characters might occupy 2, 3, or 4 bytes and expand to a 4-byte UTF-8 character in the worst case. However, because each such character occupies more than one byte in ASCII or EBCDIC, its worst-case expansion in UTF-8 is still no more than three times the original size.
- For UTF-16 data, allocate two times the column size that you would allocate for a non-Unicode table, and use the GRAPHIC or VARGRAPHIC data types (see the second sketch after this list).
For example, if you use CHAR(10) for a name column in an EBCDIC table, use VARGRAPHIC(10) for the same column in a Unicode table. CHAR(10) is 10 bytes long. VARGRAPHIC(10) is 20 bytes long, the equivalent of 10 two-byte characters.
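The following is a minimal sketch of the UTF-8 rule as CREATE TABLE statements. The table and column names (EMP_EBCDIC, EMP_UNICODE, NAME) are hypothetical, and the table-level CCSID clause assumes Db2 for z/OS syntax; on other platforms, the encoding might be set at the table space or database level instead.

    -- EBCDIC table: NAME holds 10 single-byte characters
    CREATE TABLE EMP_EBCDIC
      (NAME CHAR(10))
      CCSID EBCDIC;

    -- UTF-8 table: allocate 3x the byte length, and use VARCHAR
    -- because the length (30) exceeds 18
    CREATE TABLE EMP_UNICODE
      (NAME VARCHAR(30))
      CCSID UNICODE;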
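A corresponding sketch for the UTF-16 rule, under the same assumptions. The length in VARGRAPHIC(10) counts 2-byte characters, so the column stores 10 characters in 20 bytes:

    -- UTF-16 table: double the byte size and switch to a graphic type;
    -- VARGRAPHIC(10) = 10 two-byte characters = 20 bytes
    CREATE TABLE EMP_UTF16
      (NAME VARGRAPHIC(10))
      CCSID UNICODE;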
Recommendation: If your application is written in COBOL or PL/I, store your data in UTF-16 and use the GRAPHIC and VARGRAPHIC data types. That way, the Unicode format in your application matches the format in your database, which avoids conversion costs.