Japanese and Traditional Chinese EUC and UCS-2 code set considerations
Extended UNIX Code (EUC) denotes a set of general encoding rules that can support from one to four character sets in Linux® and UNIX operating environments. The encoding rules are based on the ISO 2022 definition for encoding 7-bit and 8-bit data, in which control characters are used to separate some of the character sets. A code set based on EUC conforms to the EUC encoding rules and also identifies the specific character sets that it encodes. For example, the IBM-eucJP code set for Japanese refers to the encoding of the Japanese Industrial Standard characters according to the EUC encoding rules.
Database and client application support for graphic (pure double-byte character) data is limited when running under EUC code pages that encode some characters in more than two bytes. The Db2® products implement strict rules for graphic data that require all characters to be exactly two bytes wide. These rules exclude many characters from the Japanese and Traditional Chinese EUC code pages. To overcome this limitation, support is provided at both the application level and the database level to represent Japanese and Traditional Chinese EUC graphic data using another encoding scheme.
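The width problem described above can be illustrated with a minimal sketch. It assumes that Python's built-in "euc_jp" codec is an acceptable stand-in for the IBM-eucJP code set and that "utf-16-be" is equivalent to UCS-2 for the sample characters, which all lie in the Basic Multilingual Plane. The sample characters and the helper name encoded_lengths are chosen for illustration only and are not part of any Db2 interface.

```python
# Sketch: compare encoded lengths in EUC-JP and UCS-2 for sample characters.
# Assumption: Python's "euc_jp" codec stands in for IBM-eucJP, and
# "utf-16-be" stands in for UCS-2 (the two agree for these characters).

SAMPLES = {
    "ASCII letter": "A",
    "JIS X 0208 hiragana": "\u3042",
    "JIS X 0201 half-width katakana": "\uff71",
    "JIS X 0212 kanji": "\u4e02",
}


def encoded_lengths(ch):
    """Return (EUC-JP byte length, UCS-2 byte length) for one character."""
    return len(ch.encode("euc_jp")), len(ch.encode("utf-16-be"))


for label, ch in SAMPLES.items():
    euc_len, ucs2_len = encoded_lengths(ch)
    print(f"{label}: EUC-JP {euc_len} byte(s), UCS-2 {ucs2_len} byte(s)")
```

In this sketch the EUC-JP length varies from character to character, while the UCS-2 length is always two bytes, which is the property the graphic data rules require.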
A database created under either the Japanese or the Traditional Chinese EUC code page stores and manipulates graphic data using the Unicode UCS-2 code set, a double-byte encoding scheme that covers a proper subset of the full Unicode character repertoire. Similarly, an application running under those code pages sends graphic data to the database server as UCS-2 encoded data. With this support, applications running under EUC code pages can access the same types of data as those running under DBCS code pages. The IBM-defined code page identifier associated with UCS-2 is 1200, and the CCSID number for the same code page is 13488. Graphic data in an eucJP or eucTW database uses the CCSID number 13488. In a Unicode database, use CCSID 1200 for GRAPHIC data.
The Db2 database system supports all the Unicode characters that can be encoded using UCS-2, but does not perform any composition, decomposition, or normalization of characters. More information about the Unicode standard can be found at the Unicode Consortium website, www.unicode.org, and from the latest edition of the Unicode Standard book published by Addison Wesley Longman, Inc.
If you are working with applications or databases that use these character sets, you might need to handle UCS-2 encoded data. Converting UCS-2 graphic data to the application's EUC code page can increase the length of the data, because a two-byte UCS-2 character can map to an EUC character that is longer than two bytes. When large amounts of data are being displayed, it might be necessary to allocate buffers, convert, and display the data in a series of fragments.
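The fragment-by-fragment approach can be sketched as follows. This is an illustration under stated assumptions, not a Db2 API: the function name ucs2_to_euc_fragments and the constant MAX_EUC_JP_BYTES are hypothetical, Python's "euc_jp" codec stands in for the application's EUC code page, and the input is assumed to be big-endian UCS-2 containing only characters that the target code page can represent.

```python
# Sketch: convert UCS-2 graphic data to EUC-JP in fixed-size fragments.
# Assumptions (hypothetical, for illustration): input is big-endian UCS-2,
# "euc_jp" stands in for the application's EUC code page, and every
# character can be represented in that code page.

MAX_EUC_JP_BYTES = 3  # longest EUC-JP character (JIS X 0212 via single shift 3)


def ucs2_to_euc_fragments(ucs2_bytes, chars_per_fragment=1024, codec="euc_jp"):
    """Yield EUC-encoded fragments, each covering at most chars_per_fragment
    UCS-2 characters."""
    if len(ucs2_bytes) % 2 != 0:
        raise ValueError("UCS-2 data must contain an even number of bytes")
    step = 2 * chars_per_fragment  # every UCS-2 character is exactly 2 bytes
    for start in range(0, len(ucs2_bytes), step):
        # Any 2-byte boundary is a safe split point, because UCS-2 has no
        # multi-unit characters.
        text = ucs2_bytes[start:start + step].decode("utf-16-be")
        yield text.encode(codec)


# Usage: two JIS X 0212 kanji occupy 4 bytes in UCS-2 but more in EUC-JP.
graphic_data = "\u4e02\u4e02".encode("utf-16-be")
fragments = list(ucs2_to_euc_fragments(graphic_data, chars_per_fragment=1))
for fragment in fragments:
    # A buffer of chars_per_fragment * MAX_EUC_JP_BYTES bytes always suffices.
    assert len(fragment) <= 1 * MAX_EUC_JP_BYTES
print(len(graphic_data), sum(len(f) for f in fragments))
```

Sizing each output buffer for the worst case (here three bytes per character for EUC-JP; a Traditional Chinese EUC code page would need a larger bound) avoids overflow at the cost of some unused space.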
The following sections discuss how to handle data in this environment. For these sections, the term EUC is used to refer only to Japanese and Traditional Chinese EUC character sets. Note that the discussions do not apply to Db2 Korean or Simplified-Chinese EUC support, because graphic data in these character sets is represented using the EUC encoding.