Access your database from everywhere

A practical approach to DB2 UDB for Linux, UNIX, Windows Unicode support

Design your IBM DB2® Universal Database™ for Linux®, UNIX®, and Windows® (DB2 UDB) database for international use with Unicode. This article explains what encodings are supported, what you need to consider in planning your database, how to store Unicode data, how to migrate an existing non-Unicode database to a Unicode database, and how to enter data. In addition, you'll learn how a Java™ application accesses DB2 Unicode data, and examine Java examples.


Dirk Fechner (fechner@de.ibm.com), IT Services Specialist, IBM Software Group, Information Management

Dirk Fechner works as an IT Specialist for IBM Software Group Services. His area of expertise is design, implementation, and administration of DB2 databases on distributed platforms as well as database application development. He has ten years of experience with DB2 for Linux, UNIX, and Windows and is an IBM Certified Advanced Database Administrator and an IBM Certified Application Developer.



Samuel Poon (poons@ca.ibm.com), DB2 Consultant, IBM

Samuel Poon is a DB2 Consultant from the IBM Toronto Laboratory and works primarily with IBM Business Partners. Sam has worked on several articles published on developerWorks. You can reach him at: poons@ca.ibm.com.



Shan Chen (shanchen@ca.ibm.com), DB2 Technical Support Specialist, IBM

Shan Chen is a DB2 Technical Support Specialist at the IBM Toronto Laboratory. She is a certified DB2 database administrator and a certified DB2 developer.



26 January 2006

Introduction

Today applications are often designed for international use. These applications may be required to cope with character strings in different languages. Unicode has been established as a language-independent standard for the representation of characters.

With Java as a programming language that already represents characters in Unicode internally, the development of international applications has been significantly simplified. However, we cannot consider only the application side. The back-end database has to be able to handle Unicode characters too. This article discusses several topics that will help developers who are implementing their first DB2 UDB application for international use.


What Unicode standard is supported in DB2?

There is only one Unicode standard, but nevertheless different encodings for Unicode characters exist. The most common encodings for Unicode characters are UTF-8 and UCS-2:

  • UTF-8: 1 to 4 bytes are used for the encoding of each character. This encoding scheme makes it possible to encode an ASCII character with one byte, and a non-ASCII character with multiple (up to 4) bytes.
  • UCS-2: Each Unicode character is encoded with exactly 2 bytes. This encoding scheme can represent 65000+ characters which cover most of the characters of the most important languages of the world. UCS-2 is also used internally by Java.

As previously mentioned, the complete Unicode standard consists of more than the 65000+ characters that UCS-2 can represent. The additional characters primarily belong to scripts that are rarely used or no longer in use; for example, some Asian characters appear in names only. These additional characters, also called supplementary characters, can be represented in UTF-8 using 4 bytes. In addition, there is another encoding scheme called UTF-16 that can also be used to represent supplementary characters. To do so, UTF-16 uses 2 x 2 bytes (a surrogate pair).

DB2 UDB supports only the encodings UTF-8 and UCS-2. Even though DB2 does not have explicit support for supplementary characters, they can still be stored in DB2 UDB. Note that the 65000+ UCS-2 characters are more than sufficient for most applications. Special mechanisms are required for handling supplementary characters in Java applications, so support for supplementary characters is a requirement not only on the database, but also on the application.
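To illustrate the special handling needed in Java, the following sketch (standard Java APIs only; U+20000 is just an arbitrary supplementary code point) shows that a supplementary character occupies two Java chars (a surrogate pair) and four bytes in UTF-8:

```java
import java.nio.charset.StandardCharsets;

// Sketch: how a supplementary character behaves in Java.
public class SupplementaryDemo {
    public static void main(String[] args) {
        int codePoint = 0x20000; // an arbitrary supplementary code point
        String s = new String(Character.toChars(codePoint)); // surrogate pair

        System.out.println(s.length());                      // 2 Java chars
        System.out.println(s.codePointCount(0, s.length())); // 1 character
        System.out.println(s.getBytes(StandardCharsets.UTF_8).length); // 4 bytes in UTF-8
    }
}
```

Because String.length() counts UTF-16 code units, not characters, code that assumes one char per character breaks for supplementary characters.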

How to store data using a Unicode encoding scheme in DB2

You can store Unicode data in a DB2 Unicode database. In a Unicode database, all the tables store character data using the Unicode encoding scheme. DB2 also allows you to store character data using the Unicode format in a Unicode table within a non-Unicode database.

Define a Unicode database

A Unicode database stores character data in Unicode format only; displaying the data correctly is the responsibility of the client. DB2 determines the database code page when a database is created. The database code page is determined either implicitly or explicitly.

DB2 UDB for Linux, UNIX, and Windows can implicitly determine the code page from the active environment. When you create a database with the CREATE DATABASE command, the code page for the database is derived from the operating system locale setting. On the Windows operating system, the database code page is derived from the ANSI code page setting in the registry. On UNIX operating systems, it is derived from the active locale setting, including language, territory, and code set. Note that the registry variable DB2CODEPAGE can be used to override the code page. However, if the DB2CODEPAGE registry variable is set to an incorrect value, it may cause unpredictable results and potential data corruption.

You can also explicitly specify the code page in the CREATE DATABASE statement with the USING CODESET clause. A CREATE DATABASE statement with the USING CODESET UTF-8 clause implies that the database can contain character data in either UTF-8 or UCS-2 encoding. For clarity, it is recommended that you define a Unicode database explicitly:

CREATE DATABASE db_name USING CODESET codeset
TERRITORY territory_name COLLATE USING collating_sequence

where codeset should be specified in uppercase, such as UTF-8; territory_name is the code that provides region-specific support; and collating_sequence indicates the method for comparing character data.

For example,

CREATE DATABASE UCSAMPLE USING CODESET UTF-8 TERRITORY US

For Unicode databases, DB2 for Linux, UNIX, and Windows V8.2 offers the following collators:

  • IDENTITY collator implements the character comparison based on the code point.
  • IDENTITY_16BIT collator implements CESU-8 (An 8-bit compatibility encoding scheme for UTF-16). This collator ensures all characters, supplementary and non-supplementary, have the same binary collating sequence as UTF-8.
  • UCA400_NO collator implements the Unicode Collation Algorithm (UCA) based on the Unicode Standard version 4.00 with normalization implicitly set to on. For details, refer to Unicode Technical Report #10 from the Unicode Consortium.
  • UCA400_LTH collator implements UCA version 4.00, but sorts all Thai characters according to the Royal Thai Dictionary order.

Prior to DB2 UDB V8.2, a Unicode database could only be defined with the collating sequence IDENTITY, which means that characters are compared by their byte encoding. In general, this collator does not produce the sort order you would naturally expect in a language. Taking the letters A to C in their capital and small forms, the collating sequence IDENTITY has the following sort order:
IDENTITY: A B C a b c

You get a correct sort order with the collating sequence UCA400_NO in V8.2. The collator UCA400_NO has the following sort order:
UCA400_NO: a A b B c C
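These two sort orders can be reproduced in Java. The sketch below uses java.text.Collator as a stand-in for a linguistic collator; the assumption that its default en_US rules order upper and lower case the same way as UCA400_NO in this simple example is ours, not the article's:

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

public class SortOrderDemo {
    public static void main(String[] args) {
        String[] byCodePoint = {"a", "B", "c", "A", "b", "C"};
        Arrays.sort(byCodePoint); // plain code-point order, like the IDENTITY collator
        System.out.println(String.join(" ", byCodePoint)); // A B C a b c

        String[] linguistic = {"a", "B", "c", "A", "b", "C"};
        // locale-aware comparison, analogous to a UCA-based collator
        Arrays.sort(linguistic, Collator.getInstance(Locale.US));
        System.out.println(String.join(" ", linguistic)); // a A b B c C
    }
}
```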

When choosing a collating sequence, you should also consider the performance implications. For details, refer to the performance section of Unicode Technical Report #10 from the Unicode Consortium.

Once the database is created, the only way to change a database code page is to drop and re-create the database with a new code page.

Create a Unicode table in a Unicode database

When you create a table in a Unicode database, the table is created with the Character Code Set ID (CCSID) UNICODE. DB2 can store Unicode data using either UTF-8 or UCS-2 format. It should be noted that UCS-2 format is identical to UTF-16 format without the supplementary characters support. The characteristics of Unicode formats for DB2 are as follows:

UTF-8 format:

  • In a Unicode database, data stored in CHAR, VARCHAR, LONG VARCHAR, or CLOB will be in UTF-8 format.
  • A character in UTF-8 can take up from 1 to 4 bytes. An ASCII character will take 1 byte, and a non-ASCII character can take from 2 to 4 bytes. Therefore, the maximum number of characters that can be stored in the column might be reduced if non-ASCII characters are involved. It should be noted that an insufficient character size of a column will cause character truncation.

UCS-2 format:

  • In a Unicode database, data stored in GRAPHIC, VARGRAPHIC, LONG VARGRAPHIC, or DBCLOB will be stored in UCS-2 format.
  • A graphic character has a fixed length of 2 bytes per character.
  • DB2 always stores graphic Unicode in the big-endian order (i.e., UCS-2BE), even though UCS-2 can be in Big Endian (BE) or Little Endian (LE) order.
  • DB2 UDB for Linux, UNIX, and Windows is supplementary character safe, but unaware.

The character lengths for UTF-8 and UCS-2 are summarized as follows:

Table 1: Character lengths for UTF-8 and UCS-2

Data Type        Unicode Format  Max. Length of Data Type  Bytes per Character  Characters at Max. Length
CHAR             UTF-8           254                       1 to 4               63 to 254
VARCHAR          UTF-8           32,672                    1 to 4               8,168 to 32,672
LONG VARCHAR     UTF-8           32,700                    1 to 4               8,175 to 32,700
CLOB             UTF-8           2,147,483,647             1 to 4               536,870,911 to 2,147,483,647
GRAPHIC          UCS-2           127                       2                    127
VARGRAPHIC       UCS-2           16,336                    2                    16,336
LONG VARGRAPHIC  UCS-2           16,350                    2                    16,350
DBCLOB           UCS-2           1,073,741,823             2                    1,073,741,823

For example, if you store the following five Chinese characters in UTF-8, you need to define a table column at minimum with CHAR(15):

Figure 1. UTF-8 Chinese characters
UTF-8 Chinese String

When the same text is stored in UCS-2 format, it takes only 10 bytes:

Figure 2. UCS-2 Chinese characters
UCS-2 Chinese String

You can define a table column with GRAPHIC(5) to store these five Chinese characters. It should be noted that all characters including ASCII will also take 2 bytes per character in UCS-2.
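You can verify this arithmetic with a short Java sketch (the five characters below are an arbitrary sample, not necessarily those in the figures):

```java
import java.nio.charset.StandardCharsets;

// Sketch: byte requirements of the same text in UTF-8 vs. UCS-2.
public class ColumnSizingDemo {
    public static void main(String[] args) {
        // Five Chinese characters (an arbitrary sample string)
        String s = "\u4E2D\u6587\u6570\u636E\u5E93";

        // UTF-8: 3 bytes per character here, so at least CHAR(15) is needed
        System.out.println(s.getBytes(StandardCharsets.UTF_8).length); // 15

        // UCS-2 (UTF-16BE without supplementary characters): 2 bytes each, so GRAPHIC(5) fits
        System.out.println(s.getBytes(StandardCharsets.UTF_16BE).length); // 10
    }
}
```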

Create a Unicode table in a non-Unicode database

When you create tables in a non-Unicode database, they are created with the default CCSID, which is ASCII. To create a Unicode table in a non-Unicode database, append the CCSID UNICODE clause to the CREATE TABLE statement, for example:

CREATE TABLE table_name (col1 VARCHAR(50)) CCSID UNICODE

You cannot specify UNICODE for a specific column, and you cannot create a non-Unicode (ASCII) table in a Unicode database in DB2 for Linux, UNIX, and Windows. Tables that are created with CCSID UNICODE must not contain graphic data types, which means Unicode tables in non-Unicode databases can store strings only in UTF-8 encoding.

There are a few restrictions that are documented in the DB2 UDB SQL Reference (see the notes for the command CREATE TABLE). For details, refer to the DB2 for Linux, UNIX, and Windows Information Center.

Before you can create a Unicode table in a non-Unicode database, you must first set the database configuration parameter ALT_COLLATE (alternate collating sequence). The ALT_COLLATE parameter specifies the collating sequence for Unicode tables, which currently can only be IDENTITY_16BIT. Without setting the ALT_COLLATE parameter, you cannot create a Unicode table in a non-Unicode database, and you will receive the following error message:
SQL0622N The clause clause is invalid for this database.

To set up the ALT_COLLATE, you can use the following command:
DB2 UPDATE DB CFG FOR TEST2 USING ALT_COLLATE IDENTITY_16BIT

With Unicode tables in a non-Unicode database, DB2 supports a section code page under which an SQL statement executes. Even though Unicode tables and non-Unicode tables can co-exist in a database, they cannot interact with each other within a single SQL statement, because only one encoding scheme, either ASCII or Unicode, can be referenced in the same SQL statement. If an SQL statement references both ASCII and Unicode, you will receive the following DB2 SQL error:

SQLCODE: -873, 
SQLSTATE: 53090, 
SQLERRMC: Different encoding schemes in one statement

Create indexes on Unicode tables

When you create indexes on VARCHAR or VARGRAPHIC columns in your Unicode tables, you need to pay attention to the index key limit. The maximum indexable length might be reduced when you use the VARCHAR or VARGRAPHIC data type. Note that the DB2 index key size limit is 1024 bytes, and this limit includes all overhead bytes. For example, if you create a nullable VARCHAR(1024) column in a Unicode database and then create an index on that column, you will encounter the following error:

SQL0614N The index or index extension VARCHAR1024 cannot be 
created or altered because the combined length of the specified columns is too long. 
			SQLSTATE=54008 
			Explanation . . . . 
			sqlcode : -614 
			sqlstate : 54008

The error is caused by the fact that there is a two-byte overhead for the varying length and a one-byte overhead for nullability. With this overhead, you can only define the index on a column defined as VARCHAR(1021) or shorter.

If you use a nullable VARCHAR column and only ASCII characters (1 byte per character in UTF-8) are involved, the maximum character length that can be indexed is 1021 characters. The maximum character length is reduced when non-ASCII characters are indexed.

If you use a nullable VARGRAPHIC column instead, the same 1021-byte limit applies. Since a graphic character is always 2 bytes long, the maximum graphic length that can be indexed is 510 graphic characters, so the column can be defined as VARGRAPHIC(510). The index key size limit also applies to constraints that are enforced through indexes.

Considerations for using different Unicode formats

There are advantages and disadvantages of using UTF-8:

Advantages:

  • When migrating a non-Unicode database to a Unicode database, the data types CHAR, VARCHAR, LONG VARCHAR, and CLOB can still be used with UTF-8 encoding.
  • UTF-8 encodes ASCII characters with only 1 byte. Therefore, if you use UTF-8 and you mainly store ASCII characters in the database, storage can be saved.

Disadvantages:

  • For database design, you have to consider that the column sizes have to be defined in bytes, not in characters. Hence, if you want to define a VARCHAR column that can store up to 10 characters, you have to define the column as VARCHAR(40) because a single character requires from 1 to 4 bytes. Therefore, when you migrate a non-Unicode database to a Unicode database, you should keep in mind that the lengths of the CHAR, VARCHAR, LONG VARCHAR, and CLOB data types will change.
  • In combination with the data types that use UTF-8 encoding, functions like SUBSTR and LENGTH still work with bytes and not with characters. It is difficult to work with those functions because characters might require various numbers of bytes in UTF-8 encoding (1 to 4 bytes). A User Defined Function (UDF) may be used instead. For example, one can define a SUBSTRING UDF, which can accept a CHAR value as parameter, cast it into a GRAPHIC value, call the SUBSTR function for the GRAPHIC value, and cast the GRAPHIC result back to a CHAR value. An implementation like that makes use of the fact that functions like SUBSTR and LENGTH still work correctly on the GRAPHIC data type. But please note: Such an implementation might not perform well when a large amount of data is involved.
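To see why byte-oriented SUBSTR behavior is risky with UTF-8, here is a small Java sketch (illustrative only, not DB2 code): cutting a UTF-8 byte sequence at an arbitrary offset can split a multi-byte character, leaving a malformed fragment.

```java
import java.nio.charset.StandardCharsets;

// Sketch: why byte-oriented substring operations are risky with UTF-8.
public class ByteSubstrPitfall {
    public static void main(String[] args) {
        String s = "a\u03B1b"; // 'a' (1 byte) + Greek alpha (2 bytes) + 'b' (1 byte) in UTF-8
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        System.out.println(utf8.length); // 4

        // Taking the "first 2 bytes", as a byte-based SUBSTR would, splits alpha
        // in half; the malformed fragment decodes to the replacement character.
        String broken = new String(utf8, 0, 2, StandardCharsets.UTF_8);
        System.out.println(broken.charAt(1) == '\uFFFD'); // true
    }
}
```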

There are advantages and disadvantages of using UCS-2:

Advantages:

  • The sizes of the data types that use UCS-2 (GRAPHIC, VARGRAPHIC, LONG VARGRAPHIC, and DBCLOB) are defined in characters rather than in bytes, in contrast to the UTF-8 data types. This simplifies storage estimates and eases the physical database design.
  • Functions like SUBSTR and LENGTH work with characters when called in conjunction with UCS-2 data types. They can, therefore, be used without problems.
  • There will be no code page conversion between a Java application and a Unicode database using UCS-2 encoding. However, there may be endian conversions in Java applications, depending on the platform.

Disadvantages:

  • If mainly ASCII characters are stored in the Unicode database, there will be an increase of storage requirements when using UCS-2. This is because each character is encoded with 2 bytes.
  • When migrating a non-Unicode database to a Unicode database, string data types have to be changed in the case of UCS-2 encoding. This might have an impact on the definition of UDFs and stored procedures. If the application is written in a programming language other than Java, an adjustment of the string data types in the source code is generally required.

How to migrate a non-Unicode database to a Unicode database

Since the code set and territory can only be defined at the database creation time and those settings cannot be changed afterwards, there is no easy way to migrate a non-Unicode database to a Unicode database. You need to create the Unicode database as a new database using code set UTF-8 first. Then, you export the data from the non-Unicode database and import the data to the new Unicode database. To simplify things, the DB2 data movement utilities, namely EXPORT, IMPORT, and LOAD, can perform an automatic code page conversion. Nevertheless, there are a few potential pitfalls you should know about. If you are not aware of them, you may encounter problems or even lose data.

For the data transfer, you can choose the following file formats:

  • Integrated Exchange Format (IXF)
  • Delimited ASCII Format (DEL)

where IXF is a binary format that contains not only the data, but also information about the corresponding table structure. DEL is a delimited text format where columns are separated by a delimiter and rows are separated by line breaks.

The advantages and disadvantages for IXF format are:

Advantages:

  • Since IXF format does not require delimiters to separate columns and rows, any characters can be used in data. Therefore, there is no need to verify if the chosen delimiter appears in the data itself. This is even more important when the data contain line breaks.
  • In general, working with IXF files is easier for the EXPORT, IMPORT, and LOAD utilities because IXF is a binary format, and its metadata also simplifies the IMPORT/LOAD syntax.

Disadvantages:

  • IXF files contain the information whether a column is a single-byte character column (CHAR/VARCHAR) or a multi-byte character column (GRAPHIC/VARGRAPHIC). Importing CHAR/VARCHAR columns into GRAPHIC/VARGRAPHIC columns is not supported when using IXF files. This is problematic, particularly when the migration is to a Unicode database that is designed to store character data in UCS-2 format that uses the GRAPHIC/VARGRAPHIC data types.

The advantages and disadvantages for DEL are:

Advantages:

  • DEL files do not contain information whether data is originally stored as CHAR/VARCHAR or GRAPHIC/VARGRAPHIC. So, there is no restriction for loading or importing data in CHAR/VARCHAR or GRAPHIC/VARGRAPHIC columns.

Disadvantages:

  • A column delimiter has to be chosen that must not appear in the data itself. The row delimiter is always the line break and cannot be chosen freely. Because of this restriction, the data string must not contain any line breaks.
  • Working with DEL files in EXPORT, IMPORT, or LOAD is usually a bit more complex compared with IXF files because DEL is a text format without metadata (table structure). This leads to a more complex IMPORT/LOAD syntax.

If data in a DEL file is imported or loaded into a DB2 database, there is an important difference between the IMPORT and the LOAD utility. The IMPORT utility always performs a conversion between the code page of the data file and the code page of the database, whereas the LOAD utility by default expects that the data file is in the same code page as the database. If this is not the case, the wrong data is loaded into the database. Therefore, in case of different code pages between data file and database, it is necessary to explicitly tell the LOAD utility the code page of the data file by using the option MODIFIED BY CODEPAGE=codepage. This guarantees that the code page conversion can be performed correctly.

If you already have a Unicode database and you want to transfer data in another Unicode database for reporting or test purposes, you have to ensure that no information is lost by mistake during the data transfer. As mentioned before, the data movement utilities EXPORT, IMPORT, and LOAD perform an automatic code page conversion between data file and database. If you export data from a Unicode database to a client that does not use Unicode as its standard code page, you may lose data when the data file is written.

For example, if the EXPORT utility is called from a Windows command line, the command line uses the Windows code page 1252 by default. Therefore, the data file is also created with the code page 1252. The EXPORT utility performs an automatic code page conversion from Unicode (database code page) to 1252 (client code page). Since the code page 1252 contains only a subset of all Unicode characters, those Unicode characters that have no equivalent in code page 1252 are replaced by a so-called replacement character.

For example, suppose a VARCHAR column contains the small Greek letter alpha, α (UTF-8 = 0xCEB1). The content of this column is exported to a DEL file in code page 1252. Looking at the DEL file with a hex editor shows the following values:

 	22 1A 22 0D 0A
	0x22 = double quotes
	0x1A = replacement character
	0x0D = carriage return
	0x0A = line feed

The character alpha is replaced by the replacement character (0x1A); that is, information (data) has been lost during the export. To avoid this, you can override the default client code page by using the DB2 registry variable DB2CODEPAGE. To prevent data loss, you should set DB2CODEPAGE to a value of 1208 (Unicode). It is important to set DB2CODEPAGE before connecting to the database because the data movement utilities determine the code page when the connection to the database is established. The same export is done again, but now the client code page is explicitly set to Unicode before the export:

db2set DB2CODEPAGE=1208

The result is:

	22 CE B1 22 0D 0A
	0x22 = double quotes
	0xCEB1 = small Greek alpha (in UTF-8 encoding)
	0x0D = carriage return
	0x0A = line feed

In this case, the character alpha is properly exported. In the example, a DEL file is used as target for the export, but all things said are true for IXF files too.
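The same loss can be reproduced in Java (a sketch; note that Java's converters substitute '?', 0x3F, for unmappable characters, whereas the DB2 CLP writes the 0x1A substitution character; the mechanism, not the byte value, is the point):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Sketch: losing a character during code page conversion, as in the export example.
public class CodePageLossDemo {
    public static void main(String[] args) {
        String alpha = "\u03B1"; // small Greek alpha

        // Encoding to windows-1252 loses the character. Java substitutes '?' (0x3F)
        // where the DB2 CLP writes 0x1A, but the information loss is the same.
        byte[] cp1252 = alpha.getBytes(Charset.forName("windows-1252"));
        System.out.println(Integer.toHexString(cp1252[0] & 0xFF)); // 3f

        // Encoding to UTF-8 (client code page 1208) preserves the character.
        byte[] utf8 = alpha.getBytes(StandardCharsets.UTF_8);
        System.out.println(Integer.toHexString(utf8[0] & 0xFF)
                + " " + Integer.toHexString(utf8[1] & 0xFF)); // ce b1
    }
}
```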


How to manually enter Unicode characters

Unlike ASCII characters, entering Unicode characters could be a problem because many of those characters cannot be entered directly through the keyboard. In Windows, you can use the DB2 Command Center in conjunction with the Windows Character Map to enter Unicode characters manually in a Unicode database.

Given that the necessary fonts are installed, the DB2 Command Center is able to represent Unicode characters without any problems. The DB2 Technote #1200211 - How to display Unicode characters (including DBCS) in DB2 Control Center on an English Windows environment explains the method for installing other fonts that can be used by the DB2 Command Center.

The result set tab of the DB2 Command Center allows you to make changes to the result set of a query and to add rows. To enter Unicode characters via the result set tab of the DB2 Command Center, you can use the Windows Character Map, which can be started by executing the command charmap.

Figure 3. Windows Character Map
Windows Character Map

In the Character Map, if you choose the option Advanced View, you can select Unicode as Character Set and then you can Search for certain characters. Alternatively, you can simply look for characters in the list. However, not every font includes a representation for each Unicode character. So, you may have to select another font to find a certain Unicode character. When you have found the desired character, you can put it into the DB2 Command Center by using Copy and Paste.

Figure 4. DB2 Command Center query results tab
Windows Character Map

If you cannot access the Unicode database through the DB2 Command Center, you can still use the Windows Character Map to find the hex codes of Unicode characters and enter them via the DB2 CLP. This only works for UCS-2 encoding (that is, GRAPHIC/VARGRAPHIC columns), because the Windows Character Map shows hex codes for UCS-2 only. You can use the following syntax to enter characters by their hex codes in an SQL statement:

GX'hex-code'
where G indicates a graphic string and X'. . .' allows the insertion of
characters in the form of the corresponding hex values.

For example, the following INSERT statement stores the string that consists of alpha, beta, and gamma into the GRAPHIC column UCS2COL of the table UCTABLE_1:
INSERT INTO UCTABLE_1 (UCS2COL) VALUES (GX'03B103B203B3')

If you know the UTF-8 encoding of a character, you can also use this method to enter Unicode characters in a CHAR or VARCHAR column. In this case the syntax X'hex-code' is sufficient, as we do not deal with graphic columns. So, the INSERT statement for UTF-8 now looks like this:
INSERT INTO UCTABLE_2 (UTF8COL) VALUES (X'CEB1CEB2CEB3')

Given that you know only the UCS-2 hex code, there is another method for entering Unicode characters in a UTF-8 column. You can first store the character with its UCS-2 hex code in a UCS-2 column, and then populate the UTF-8 column with an INSERT with SELECT statement. DB2 then automatically performs the conversion from UCS-2 to UTF-8:
INSERT INTO UCTABLE_2 (UTF8COL) SELECT UCS2COL FROM UCTABLE_1
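The conversion that DB2 performs in this INSERT with SELECT can be mimicked in Java to check the byte values: decode the UCS-2BE bytes of the GX literal, then re-encode them as UTF-8 (a sketch using only standard APIs):

```java
import java.nio.charset.StandardCharsets;

// Sketch: the UCS-2 to UTF-8 conversion that DB2 performs in the INSERT with SELECT.
public class Ucs2ToUtf8Demo {
    public static void main(String[] args) {
        // UCS-2BE bytes of alpha, beta, gamma (the GX'03B103B203B3' literal)
        byte[] ucs2 = {0x03, (byte) 0xB1, 0x03, (byte) 0xB2, 0x03, (byte) 0xB3};

        String s = new String(ucs2, StandardCharsets.UTF_16BE); // decode UCS-2BE
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);       // re-encode as UTF-8

        StringBuilder hex = new StringBuilder();
        for (byte b : utf8) hex.append(String.format("%02X", b & 0xFF));
        System.out.println(hex); // CEB1CEB2CEB3
    }
}
```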

The Java programming language also allows you to enter characters in the form of their corresponding UCS-2 hex codes. The equivalent of the DB2 CLP INSERT statement looks like this in Java:

java.sql.PreparedStatement pstmt =
    con.prepareStatement("INSERT INTO UCTABLE (UCS2COL) VALUES (?)");
pstmt.setString(1, Character.toString('\u03B1') +
    Character.toString('\u03B2') + Character.toString('\u03B3'));
pstmt.executeUpdate();

In contrast to the DB2 CLP version, the Java version works independently of the encoding of the target column, which means it does not matter whether the INSERT command is executed for a GRAPHIC or VARGRAPHIC column (UCS-2 encoding) or a CHAR or VARCHAR column (UTF-8 encoding). The reason is that the UCS-2 hex-codes are interpreted by Java before they are sent to the Unicode database where the code page conversion occurs.


Understanding how a Java application accesses DB2 Unicode data

Before demonstrating how a Java application accesses a Unicode database, we need to ensure the data are displayed correctly at the application. Hence, setting up the client environment is important.

Set up the client environment

By default, the application code page is derived from the settings of the operating system where the application runs. For example, you need to set the Windows locale and language to display Chinese characters on Windows, as shown in Figure 5:

Figure 5. Setting System locale
System Locale

You can verify the active code page by typing chcp from the DOS prompt.

	C:\Program Files\IBM\SQLLIB\java\jdk\jre\lib>chcp
	Active code page: 936

The client application code page is determined by the active environment. It can be overridden with the DB2CODEPAGE registry variable, but if the client runs with the proper locale and language settings, using DB2CODEPAGE should not be necessary. If the DB2CODEPAGE registry variable is set to an incorrect value, it may cause unpredictable results and potential data corruption. When the application code page differs from the Unicode database code page, the database manager converts character strings from the application code page to the database Unicode code page. Java stores strings internally in UCS-2 format. For details of DB2 character conversion, refer to the developerWorks article Understanding DB2 UDB database character conversion.

For languages that are inherently mixed-directional, such as Hebrew and Arabic, the DB2BIDI registry variable should also be set.

Java database connectivity

In a client-server architecture, a Java application accesses the database through Java Database Connectivity™ (JDBC). JDBC is an application programming interface (API) that Java applications use to access relational databases locally or remotely.

The DB2 Universal JDBC Driver is a single driver that includes JDBC Type 2 connectivity, JDBC Type 4 connectivity, and SQLJ support. When an application loads the DB2 Universal JDBC Driver, a single driver instance is loaded for both the Type 2 and Type 4 implementations, and an application can make Type 2 and Type 4 connections using this single driver instance.

JDBC is installed at the time that DB2 UDB for Linux, UNIX, and Windows is installed. The db2jcc.jar and sqlj.zip files for the Type 4 driver are installed in the sqllib/java directory and should be added to the CLASSPATH. The db2jcct2.dll, which is required for Universal Type 2 connectivity, is installed in the sqllib/bin directory.

Java Development Kit (JDK)

When you install DB2 UDB Version 8.2, the latest supported IBM Java Development Kit (JDK) is also installed if it is not already present. If the DB2 UDB installation is an update of a previous DB2 version, you need to install the JDK manually. IBM supplies a single version of the Java Runtime Environment (JRE), which contains all the character conversion codes. As of IBM JDK 1.4, the libraries for code page conversion are stored in several JAR files, such as core.jar, graphics.jar, xml.jar, and server.jar, instead of the single rt.jar used in other JDK implementations. The IBM JDK is installed under the sqllib/java directory by default.

If you consider using another JDK, you can install it in a different directory and specify that directory in the PATH and CLASSPATH variables. Note that Sun JDK V1.3.1 for all platforms, and JRE V1.3.1 Standard Edition for Solaris and Linux, support most IBM encodings. The Sun JDK V1.3.1 for Windows comes in two different versions: US-only and international. The basic encoding set is stored in rt.jar, which covers the encoding schemes in Table 2.

Table 2: Basic encoding set in rt.jar

Canonical Name         Description
ASCII                  American Standard Code for Information Interchange
Cp1252                 Windows Latin-1
ISO8859_1              ISO 8859-1, Latin alphabet No. 1
UnicodeBig             Sixteen-bit Unicode Transformation Format, big-endian byte order, with byte-order mark
UnicodeBigUnmarked     Sixteen-bit Unicode Transformation Format, big-endian byte order
UnicodeLittle          Sixteen-bit Unicode Transformation Format, little-endian byte order, with byte-order mark
UnicodeLittleUnmarked  Sixteen-bit Unicode Transformation Format, little-endian byte order
UTF8                   Eight-bit Unicode Transformation Format
UTF-16                 Sixteen-bit Unicode Transformation Format, byte order specified by a mandatory initial byte-order mark

The extended encoding set is stored in charsets.jar. If you use the US-only JDK and need the extended encoding set, you will need to either install the international version or include the charsets.jar file under the $JAVAHOME/jre/lib/ directory. For the encoding sets supported in charsets.jar, check Supported Encodings on the Java site.

For JDK 1.3 and earlier, the character converters, such as ByteToCharCp1250, are built into sun.io. These converters are called internally by java.lang.String, java.io.InputStreamReader, and java.io.OutputStreamWriter. In JDK 1.4, new public classes are available for converting 16-bit Unicode (Java char) to other code pages. These classes are:

  • java.nio.charset.Charset: This class defines methods for creating decoders and encoders and for retrieving the various names associated with a charset.
  • java.nio.charset.CharsetDecoder: This class defines an engine that can transform a sequence of bytes in a specific charset into a sequence of sixteen-bit Unicode characters.
  • java.nio.charset.CharsetEncoder: This class defines an engine that can transform a sequence of sixteen-bit Unicode characters into a sequence of bytes in a specific charset.

Type 2 and Type 4 connectivity

A Java application that uses the DB2 Universal JDBC Type 4 driver does not need a DB2 client on the client machine. With the DB2 Universal JDBC Type 2 driver, the application requires the DB2 client: db2jcct2.dll (Windows), libdb2jcct2.so (UNIX), or libdb2jcct2.sl (HP-UX) must be installed in sqllib\bin or sqllib/lib, respectively. When a Java application on a client machine accesses a Unicode database on a DB2 server, the DB2 Universal JDBC Driver (Type 2 or Type 4) sends data to the database server in UCS-2 format; thus, no code page conversion is needed for a Unicode database. Character data that is sent from the database server to the client is converted using Java's built-in character converters. The conversions that the DB2 Universal JDBC Driver supports are limited by the support of the underlying JRE implementation.


Examples of Java applications accessing DB2 Unicode data

Java application accessing Unicode Table in Unicode Database

In this example, we define a Unicode database and table using the following CREATE DATABASE and CREATE TABLE statements:

Create database test2 using codeset UTF-8 territory US
Create table test_table(record_no integer, description varchar(50), comment CLOB)

With the above definition, you can find the code page information for database TEST2 in the database configuration by using the following command:

	DB2 GET DB CFG FOR TEST2

The database code set, database code page, and territory values are shown as follows:

	Database Configuration for Database
	Database configuration release level            = 0x0a00
	Database release level                          = 0x0a00
	Database territory                              = US
	Database code page                              = 1208
	Database code set                               = UTF-8
	Database country/region code                    = 1

In our example, we demonstrate how a Java application can insert Chinese (non-ASCII) characters into a Unicode database. In order to display and input Chinese characters, the system language locale is set to Chinese Simplified (GB2312, code page 936), as shown in Figure 6. If you want to input Chinese characters, you also need to set the Input Locales. You might need to insert the Windows installation CD and restart the computer.

Figure 6. Set up Windows locale and language
Set up Windows locale and language
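Before running the example, you can confirm which code page the JVM picked up from the Windows locale. The following is a small sketch; Charset.defaultCharset() requires JDK 5.0, while the file.encoding system property gives the same information on earlier JDKs:

```java
import java.nio.charset.Charset;

public class DefaultEncoding {
    // Name of the charset the JVM derived from the OS locale / active code page
    static String defaultName() {
        return Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println("file.encoding   = " + System.getProperty("file.encoding"));
        System.out.println("default charset = " + defaultName());
    }
}
```

On a Windows system set to Chinese Simplified as described above, you would expect this to report a GBK/code page 936 charset rather than Cp1252.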

The following Java program inserts a mixed character string that contains both English and Chinese characters into the DESCRIPTION column of TEST_TABLE and displays the result on screen:


You can download the above source code: UnicodeExample.java. Note that you will see a series of question marks instead of Chinese characters when running the sample program if the Windows environment is not set up for the Chinese code page.

When you execute the program, the result is:

Figure 7. Program result 1
Program result 1

Let us change the active code page to 1252 by typing chcp 1252 and run the program again. The result is:

Figure 8. Program result 2
Program result 2

You cannot read the Chinese characters anymore, because active code page 1252 does not contain code points for the non-ASCII Chinese characters. Therefore, if the client code page is not set correctly, data will not display correctly.

Now reset the active code page to 936 and add 4 more Chinese characters at the end of the string. This makes the string an additional 12 bytes longer. Re-compile and run your Java program, and you encounter SQLCODE -443 as follows:

Figure 9. Program result 3
Program result 3

This is because each Chinese character takes 3 bytes in UTF-8. The original string takes 48 bytes; the modified string takes a total of 60 bytes, which exceeds the 50-byte definition of the DESCRIPTION column. Hence, it is important to ensure that the length of a column definition is sufficient for your Unicode data.
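One way to guard against this error is to check the UTF-8 byte length of a string on the client before inserting it; in UTF-8, ASCII characters take 1 byte and common Chinese characters take 3. A minimal sketch, where the 50-byte limit matches the VARCHAR(50) DESCRIPTION column:

```java
public class Utf8LengthCheck {
    // Number of bytes the string occupies when stored in a UTF-8 column
    static int utf8Length(String s) throws java.io.UnsupportedEncodingException {
        return s.getBytes("UTF-8").length;
    }

    public static void main(String[] args) throws Exception {
        String mixed = "test \u4e2d\u6587";          // "test " + two Chinese characters
        System.out.println(utf8Length(mixed));       // 5 ASCII bytes + 2*3 bytes = 11
        System.out.println(utf8Length(mixed) <= 50); // true: fits a VARCHAR(50) column
    }
}
```

Note that String.length() would report 7 characters here; it is the byte length, not the character count, that must fit the column definition in a Unicode database.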

Java application accessing Unicode table in non-Unicode database

The DB2 SAMPLE database is defined using the default OS code page 1252. In order to create a Unicode table in the SAMPLE database, you need to set the ALT_COLLATE database configuration parameter to IDENTITY_16BIT by running the following command:

	Update database configuration using alt_collate identity_16bit

After the update, the database configuration shows:

	Database territory                             = US
	Database code page                             = 1252
	Database code set                              = IBM-1252
	Database country/region code                   = 1
	Database collating sequence                    = UNIQUE
	Alternate collating sequence     (ALT_COLLATE) = IDENTITY_16BIT

You can create a Unicode table as follows:

Create table Unicode_table (record_no integer, description varchar(50)) CCSID UNICODE

The following program simply reads the Unicode table in a non-Unicode database.

public class UnicodeExample_T4
{
    public static void main(String[] args) {
        try {
            // create data source
            com.ibm.db2.jcc.DB2SimpleDataSource ds =
                new com.ibm.db2.jcc.DB2SimpleDataSource();

            // set connection properties
            ds.setServerName("localhost");
            ds.setPortNumber(50000);
            ds.setDatabaseName("sample");
            ds.setDriverType(4);

            // get connection
            java.sql.Connection con = ds.getConnection("userid", "passw");

            java.sql.Statement stmt = con.createStatement();

            String query = "select record_no, description from unicode_table";
            java.sql.ResultSet rs = stmt.executeQuery(query);
            while (rs.next()) {
                System.out.println("\n" + rs.getString(1)+"     "+rs.getString(2));
            }

            rs.close();
            stmt.close();
            con.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

The result of the above program is:

	C:\Program Files\IBM\SQLLIB\java\jdk>chcp
	Active code page: 1252
	
	C:\Program Files\IBM\SQLLIB\java\jdk>bin\java UnicodeExample_T4
	
	1      Test data for unicode table
	C:\Program Files\IBM\SQLLIB\java\jdk>

Modify the above code to access both Unicode and non-Unicode table as follows:

 // execute a query
java.sql.Statement stmt = con.createStatement();            
    	   
String query = "select employee.lastname, Unicode_table.description from employee,
 Unicode_table";
            java.sql.ResultSet rs = stmt.executeQuery(query);
	. . .

As discussed previously, you will receive the following error because you cannot access both Unicode and non-Unicode tables in the same SQL statement:

	C:\Program Files\IBM\SQLLIB\java\jdk>chcp
	Active code page: 1252
	
	C:\Program Files\IBM\SQLLIB\java\jdk>bin\java UnicodeExample_T4
	com.ibm.db2.jcc.c.SqlException: DB2 SQL error: SQLCODE: -873, SQLSTATE: 53090, S
	QLERRMC: Different encoding schemes in one statement

Summary

This article explains how DB2 UDB for Linux, UNIX, and Windows supports Unicode. Based on practical questions, we've discussed diverse aspects of DB2 UDB Unicode support. Use the information we've provided to get a jump start in developing international applications that access DB2 UDB Unicode databases.


Downloads

Description | Name | Size
Unicode Java example | UnicodeExample.java | 3 KB
Unicode Java example | UnicodeExample_T4.java | 3 KB

Resources


Get products and technologies

  • Download a free trial version of DB2 Universal Database Enterprise Server Edition.
  • Build your next development project with IBM trial software, available for download directly from developerWorks.
  • Now you can use DB2 for free. Download DB2 Express-C, a no-charge version of DB2 Express Edition for the community that offers the same core data features as DB2 Express Edition and provides a solid base to build and deploy applications.

