If
you plan to store Unicode data, create Unicode tables. If you try
to insert Unicode data into an ASCII or EBCDIC table, data might be
lost, unless you use escaped data.
About this task
Recommendation: When
you create objects, use standard characters for the object names and
column names. Unique characters, such as ü or é,
can complicate your applications if conversions are needed.
Procedure
To create a Unicode table:
- In the CREATE DATABASE, CREATE TABLESPACE, or CREATE TABLE
statement, specify the CCSID UNICODE clause.
By default,
the encoding scheme of a table is the same as the encoding scheme
of its table space. Also by default, the encoding scheme of the table
space is the same as the encoding scheme of its database. You can
override the encoding scheme with the CCSID clause in the CREATE TABLESPACE
or CREATE TABLE statement. However, all tables within a table space
must have the same CCSID.
- In the CREATE TABLE statement, for each column definition,
specify the appropriate data type, subtype, and length value.
- data type
Use one of the following data types:
- For UTF-8 data, create columns of type CHAR, VARCHAR, or CLOB.
- For UTF-16 data, create columns of type GRAPHIC, VARGRAPHIC, or
DBCLOB.
- For
binary data, create columns of type BINARY, VARBINARY, and BLOB.
Recommendation: In general, use varying-length
columns for Unicode tables because the number of bytes in a Unicode
column usually is two to three times that of an EBCDIC column.
The
general guideline is to use variable length for columns that are greater
than 18 bytes unless you know that the entire column is to always
be filled. For example, if you store a timestamp in character form
(not as the Db2 TIMESTAMP datatype),
you need a column with some number of characters. In DB2® 9, that
number would be 26 characters. (In ASCII, EBCDIC, or UTF-8, that column
is 26 bytes. In UTF-16, that column is 52 bytes.) Because the timestamp
is always the same size, using a varying-length column does not save
storage. However, suppose that you have a name field that is in ASCII
or EBCDIC and allows for names of 26 characters. (In ASCII SBCS or
EBCDIC SBCS, you use 26 bytes. In UTF-8, you need 78 bytes. In UTF-16,
you need 52 bytes.) In this case, you want to use a varying-length
column, because the name field is likely to have many blanks and you
do not want to store them.
- subtype
- For character columns, optionally specify one the following subtypes
for the column by adding the FOR subtype DATA clause
to the column definition:
- SBCS
- Specify this subtype if the column is to contain only those UTF-8
characters that are stored as 1 byte. Those characters are the first
128 characters in the Unicode code page. Data that is stored in a
SBCS character column in a Unicode table has a CCSID of 367.
- MIXED
- Specify this subtype if the column is to contain any UTF-8 data
that is more than 1 byte. MIXED is the default value. Character data
in a Unicode table is stored as mixed data by default, even if your
subsystem is defined with a MIXED DECP value of NO. Data that is stored
in a MIXED character column in a Unicode table has a CCSID of 1208.
- BIT
- This subtype specifies that the column contains BIT data. CCSID
66534 is associated with FOR BIT DATA columns.
Recommendation: Although
you can also specify the subtype BIT for CHAR and VARCHAR columns
that contain BIT data, use the BINARY or VARBINARY data types instead.
Do
not use FOR BIT DATA columns for the sole purpose of handling international
data. Only use FOR BIT DATA columns if you have a specific reason,
such as encryption. Otherwise, this data type can cause problems.
For example, if you have a string of length 10 and put it in a FOR
BIT DATA column of length 12, Db2 pads
the string with two blanks. The hexadecimal value that is used for
those blanks is system specific. For example, X'40' is used
for EBCDIC and X'20' is used for Unicode. These different
hexadecimal values can potentially cause problems when you convert
this data.
- length
- To determine the appropriate the length value, follow the instructions
in Estimating the column size for Unicode data.
Db2 associates
a certain CCSID with the column depending on the data type that you
specify. The following table summarizes the possible column data types
in a Unicode table and the CCSIDs that are associated with the data
in those columns.
Table 1. CCSIDs that are associated with columns in a Unicode
table
Column data type |
Associated CCSID |
Format in which the data is stored |
CHAR3.a |
1208 |
UTF-8 |
CHAR FOR SBCS DATA |
367 |
7-bit ASCII |
CHAR FOR MIXED DATA |
1208 |
UTF-8 |
CHAR FOR BIT DATA |
66534 |
NA |
VARCHAR3.a |
1208 |
UTF-8 |
VARCHAR FOR SBCS DATA |
367 |
7-bit ASCII |
VARCHAR FOR MIXED DATA |
1208 |
UTF-8 |
VARCHAR FOR BIT DATA |
66534 |
NA |
CLOB3.a |
1208 |
UTF-8 |
CLOB FOR SBCS DATA |
367 |
7-bit ASCII |
CLOB FOR MIXED DATA |
1208 |
UTF-8 |
GRAPHIC |
1200 |
UTF-16 |
VARGRAPHIC |
1200 |
UTF-16 |
DBCLOB |
1200 |
UTF-16 |
Note:
- If you do not specify a subtype, Db2 assumes FOR MIXED DATA.
|
Example
The following CREATE TABLE
statement creates a Unicode table.CREATE TABLE UNITAB
(C1 CHAR(4)FOR SBCS DATA,
C2 CHAR(4),
C3 GRAPHIC(4),
C4 VARCHAR(4) FOR SBCS DATA,
C5 VARCHAR(4),
C6 VARGRAPHIC(4))
CCSID Unicode

Columns C1 and C4 can contain only 1-byte UTF-8 data. (This data has CCSID 367 and is stored in 7-bit ASCII format.) Columns C2 and C5 can contain any UTF-8 data. Columns C3® and C6 can contain UTF-16 data.
The
CHAR and VARCHAR columns each have a length of 4 bytes. That length
means that each of these columns can contain one of the following
characters or sets of characters:
- one UTF-8 character that is 4 bytes
- two UTF-8 characters that are each 2 bytes
- one 3-byte UTF-8 characters and one one-byte UTF-8 character
- four one-byte UTF-8 characters
The
GRAPHIC and VARGRAPHIC columns each have a length of 4 UTF-16 code
units. (A UTF-16 code unit is 16 bits or 2 bytes.) For UTF-16 characters
that are 2 bytes, this length means 4 characters. However, this length
does not always correlate to 4 characters. Consider supplementary
UTF-16 characters, which are each 2 UTF-16 code units or 4 bytes.
If you include any supplementary characters in the column, the column
cannot include 4 characters. Thus, the length of this column can contain
2, 3, or 4 characters, depending on the size of the character. For
example, each of these GRAPHIC and VARGRAPHIC columns can contain
one of the following characters or sets of characters:
- four 2-byte UTF-16 characters
- two 4-byte UTF-16 characters
- one 4-byte UTF-16 character and two 2-byte UTF-16 characters