Creating a Unicode table

If you plan to store Unicode data, create Unicode tables. If you try to insert Unicode data into an ASCII or EBCDIC table, data might be lost, unless you use escaped data.

About this task

Recommendation: When you create objects, use standard characters for the object names and column names. Unique characters, such as ü or é, can complicate your applications if conversions are needed.

Procedure

To create a Unicode table:

In the CREATE DATABASE, CREATE TABLESPACE, or CREATE TABLE statement, specify the CCSID UNICODE clause.
By default, the encoding scheme of a table is the same as the encoding scheme of its table space. Also by default, the encoding scheme of the table space is the same as the encoding scheme of its database. You can override the encoding scheme with the CCSID clause in the CREATE TABLESPACE or CREATE TABLE statement. However, all tables within a table space must have the same CCSID.

In the CREATE TABLE statement, for each column definition, specify the appropriate data type, subtype, and length value.

data type

Use one of the following data types:

For UTF-8 data, create columns of type CHAR, VARCHAR, or CLOB.
For UTF-16 data, create columns of type GRAPHIC, VARGRAPHIC, or DBCLOB.
For binary data, create columns of type BINARY, VARBINARY, and BLOB.

Recommendation: In general, use varying-length columns for Unicode tables because the number of bytes in a Unicode column usually is two to three times that of an EBCDIC column.

The general guideline is to use variable length for columns that are greater than 18 bytes unless you know that the entire column is to always be filled. For example, if you store a timestamp in character form (not as the Db2 TIMESTAMP datatype), you need a column with some number of characters. In DB2® 9, that number would be 26 characters. (In ASCII, EBCDIC, or UTF-8, that column is 26 bytes. In UTF-16, that column is 52 bytes.) Because the timestamp is always the same size, using a varying-length column does not save storage. However, suppose that you have a name field that is in ASCII or EBCDIC and allows for names of 26 characters. (In ASCII SBCS or EBCDIC SBCS, you use 26 bytes. In UTF-8, you need 78 bytes. In UTF-16, you need 52 bytes.) In this case, you want to use a varying-length column, because the name field is likely to have many blanks and you do not want to store them.

subtype

For character columns, optionally specify one the following subtypes for the column by adding the FOR subtype DATA clause to the column definition:

SBCS: Specify this subtype if the column is to contain only those UTF-8 characters that are stored as 1 byte. Those characters are the first 128 characters in the Unicode code page. Data that is stored in a SBCS character column in a Unicode table has a CCSID of 367.
MIXED: Specify this subtype if the column is to contain any UTF-8 data that is more than 1 byte. MIXED is the default value. Character data in a Unicode table is stored as mixed data by default, even if your subsystem is defined with a MIXED DECP value of NO. Data that is stored in a MIXED character column in a Unicode table has a CCSID of 1208.
BIT: This subtype specifies that the column contains BIT data. CCSID 66534 is associated with FOR BIT DATA columns.
Recommendation: Although you can also specify the subtype BIT for CHAR and VARCHAR columns that contain BIT data, use the BINARY or VARBINARY data types instead.
Do not use FOR BIT DATA columns for the sole purpose of handling international data. Only use FOR BIT DATA columns if you have a specific reason, such as encryption. Otherwise, this data type can cause problems. For example, if you have a string of length 10 and put it in a FOR BIT DATA column of length 12, Db2 pads the string with two blanks. The hexadecimal value that is used for those blanks is system specific. For example, X'40' is used for EBCDIC and X'20' is used for Unicode. These different hexadecimal values can potentially cause problems when you convert this data.

length

To determine the appropriate the length value, follow the instructions in Estimating the column size for Unicode data.

Db2 associates a certain CCSID with the column depending on the data type that you specify. The following table summarizes the possible column data types in a Unicode table and the CCSIDs that are associated with the data in those columns.

Table 1. CCSIDs that are associated with columns in a Unicode table
Column data type	Associated CCSID	Format in which the data is stored
CHAR^3.a	1208	UTF-8
CHAR FOR SBCS DATA	367	7-bit ASCII
CHAR FOR MIXED DATA	1208	UTF-8
CHAR FOR BIT DATA	66534	NA
VARCHAR^3.a	1208	UTF-8
VARCHAR FOR SBCS DATA	367	7-bit ASCII
VARCHAR FOR MIXED DATA	1208	UTF-8
VARCHAR FOR BIT DATA	66534	NA
CLOB^3.a	1208	UTF-8
CLOB FOR SBCS DATA	367	7-bit ASCII
CLOB FOR MIXED DATA	1208	UTF-8
GRAPHIC	1200	UTF-16
VARGRAPHIC	1200	UTF-16
DBCLOB	1200	UTF-16
Note: If you do not specify a subtype, Db2 assumes FOR MIXED DATA.

Example

Begin general-use programming interface information.

The following CREATE TABLE statement creates a Unicode table.

CREATE TABLE UNITAB
 (C1 CHAR(4)FOR SBCS DATA,
  C2 CHAR(4),
  C3 GRAPHIC(4),
  C4 VARCHAR(4) FOR SBCS DATA,
  C5 VARCHAR(4),
  C6 VARGRAPHIC(4))
CCSID Unicode

End general-use programming interface information.

Columns C1 and C4 can contain only 1-byte UTF-8 data. (This data has CCSID 367 and is stored in 7-bit ASCII format.) Columns C2 and C5 can contain any UTF-8 data. Columns C3® and C6 can contain UTF-16 data.

The CHAR and VARCHAR columns each have a length of 4 bytes. That length means that each of these columns can contain one of the following characters or sets of characters:

one UTF-8 character that is 4 bytes
two UTF-8 characters that are each 2 bytes
one 3-byte UTF-8 characters and one one-byte UTF-8 character
four one-byte UTF-8 characters

The GRAPHIC and VARGRAPHIC columns each have a length of 4 UTF-16 code units. (A UTF-16 code unit is 16 bits or 2 bytes.) For UTF-16 characters that are 2 bytes, this length means 4 characters. However, this length does not always correlate to 4 characters. Consider supplementary UTF-16 characters, which are each 2 UTF-16 code units or 4 bytes. If you include any supplementary characters in the column, the column cannot include 4 characters. Thus, the length of this column can contain 2, 3, or 4 characters, depending on the size of the character. For example, each of these GRAPHIC and VARGRAPHIC columns can contain one of the following characters or sets of characters:

four 2-byte UTF-16 characters
two 4-byte UTF-16 characters
one 4-byte UTF-16 character and two 2-byte UTF-16 characters