Application programming with Unicode data and multiple CCSIDs

If your application handles Unicode data or data that is in different encoding schemes, you should be aware of several programming techniques and recommendations in DB2®.

DB2 always returns data to your application in the CCSID that your application uses for data. This CCSID is called the application encoding scheme.

Recommendations: Use the following general recommendations to guide you in writing and preparing your application programs:

If possible, use either Unicode or EBCDIC data, but not both. If you do choose to use multiple encoding schemes, consider the following possible implications for data loss and performance:
- Managing multiple CCSIDs in your application can be difficult. To ensure that data is not lost, you have to control where the data goes, a path that potentially includes many modules.
- Many environments, such as CICS® Transaction Gateway and WebSphere® MQ, are message-based. In these cases, the entire message must be in a single encoding scheme, so flowing some data through the application in EBCDIC and some in Unicode makes little sense. You still have to convert all of it to a single encoding, such as Unicode, right before putting the message on the wire.
- All columns of a DB2 table must use the same encoding scheme. You cannot make some columns Unicode and some EBCDIC. If your application processes some columns in Unicode and others in EBCDIC, character conversion occurs, which is likely to add performance overhead.
If you are using Unicode data in COBOL or PL/I applications, use the coprocessor.
If your COBOL, PL/I, C/C++, or Assembler application handles Unicode data, do not place literals in the source code of the application. Because these language compilers do not support Unicode source code, they could misinterpret these literal values. Instead, place these literal values in a file or DB2 table that the program can read at startup to load the values. (Data in files and host variables is not precompiled and compiled as application source code.)
If an expanding or contracting conversion occurs on your data, the length of the data might change. Be aware of these length changes when you use the LENGTH function, CHARACTER_LENGTH function, SUBSTRING function, and SUBSTR function on the converted string. For CHARACTER_LENGTH and SUBSTRING, use the CODEUNITS16 and CODEUNITS32 options to specify how you want DB2 to calculate the length.
If you need to represent characters from multiple Latin-based character sets, such as Latin-1 and Latin-4, consider using Unicode for your application encoding scheme. An SBCS CCSID does not have enough code points to represent all of the characters that the combination of the two character sets requires. For example, assume that your application uses an EBCDIC CCSID, such as 277 or 1069. You might have some data that is represented in the database in Unicode but that the application cannot retrieve without character substitution. If your application needs to handle only one language at a time, you can set up your infrastructure in one of the following ways:
- Have one version of your application that uses CCSID 277 and another version that uses CCSID 1069. Also have two corresponding subsystems, one that uses CCSID 277 and another that uses CCSID 1069. (You cannot have multiple EBCDIC CCSIDs in one DB2 subsystem.)
- Store the data in Unicode and have one version of your application that uses CCSID 277 and another version that uses CCSID 1069. Then bind these applications with different values for the ENCODING bind option.
- Store the data in Unicode and have one version of your application that uses an EBCDIC CCSID and another version that uses Unicode.
However, if you require that a single version of the application handle both Latin-1 and Latin-4 character sets, your application needs to process data in Unicode.
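The following Python sketch illustrates why a single SBCS code page cannot hold both Latin-1 and Latin-4 characters. It is only an illustration, not DB2 behavior: Python has no codecs for EBCDIC CCSIDs 277 or 1069, so the ISO 8859-1 and ISO 8859-4 code pages are assumed as stand-ins, and the helper name `can_encode` is hypothetical. The substitution character that Python uses ("?") is also not necessarily the one that DB2 uses.

```python
# Stand-in code pages: Python ships no codecs for EBCDIC CCSIDs 277 or 1069,
# so ISO 8859-1 (Latin-1) and ISO 8859-4 (Latin-4) are assumed for illustration.
LATIN_1 = "iso8859_1"
LATIN_4 = "iso8859_4"


def can_encode(text: str, codec: str) -> bool:
    """Return True if every character in text exists in the given code page."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


latin1_only = "\u00f1"  # n with tilde: in Latin-1, not in Latin-4
latin4_only = "\u0101"  # a with macron: in Latin-4, not in Latin-1
mixed = latin1_only + latin4_only

# Only the Unicode encoding can represent the mixed string.
for codec in (LATIN_1, LATIN_4, "utf-8"):
    print(codec, can_encode(mixed, codec))

# Forcing the conversion replaces the missing character with a substitute.
substituted = mixed.encode(LATIN_1, errors="replace")
print(substituted)
```

A single version of an application that must handle both strings therefore has to process them in a Unicode encoding, as the recommendation above states.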

Application encoding scheme
The application encoding scheme is the CCSID that your application uses to interpret data in host variables. For DB2 for z/OS® applications, typically the application encoding scheme is the value of the ENCODING bind option. (By default this value is the subsystem default application encoding scheme.)
Specifying a CCSID for your application
In DB2 for z/OS applications, one CCSID is associated with the source code and one or more CCSIDs can be associated with the data that your application manipulates. The CCSID that DB2 associates with the data is called the application encoding scheme.
Determining the CCSID of DB2 data
DB2 can store EBCDIC, ASCII, and Unicode data.
Determining the CCSID of a string value in an SQL statement
Knowing the CCSID of a particular string value in an SQL statement helps you determine how DB2 evaluates the statement. This knowledge also helps you plan for character conversions. You can determine whether character conversion is necessary and what character conversions you need to define.
Objects with different CCSIDs in the same SQL statement
You can reference data with different CCSIDs from the same SQL statement. This ability is useful if you use table objects such as tables, views, temporary tables, query tables, and user-defined functions with different CCSIDs. However, you should understand how DB2 for z/OS processes these queries so that you can code them correctly.
Differences between Unicode and EBCDIC sorting sequences
In Unicode, numeric characters are sorted before alphabetic characters. In EBCDIC, alphabetic characters are sorted before numeric characters.
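The difference can be reproduced outside DB2 by sorting strings on their encoded byte values. This Python sketch assumes the built-in `cp037` codec as a stand-in for an EBCDIC CCSID and UTF-8 for Unicode; the helper name `sort_by_encoding` is hypothetical, and the sketch shows binary ordering only, not DB2 collation.

```python
def sort_by_encoding(values, codec):
    """Sort strings by the byte values of their encoded form (binary order)."""
    return sorted(values, key=lambda v: v.encode(codec))


values = ["a1", "A1", "1a"]

# EBCDIC: lowercase letters, then uppercase letters, then digits.
ebcdic_order = sort_by_encoding(values, "cp037")

# Unicode (UTF-8): digits, then uppercase letters, then lowercase letters.
unicode_order = sort_by_encoding(values, "utf-8")

print("EBCDIC :", ebcdic_order)
print("Unicode:", unicode_order)
```

An application that depends on the order of result rows might therefore behave differently after its data is converted from EBCDIC to Unicode.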
Specifying how DB2 calculates the length of a string
If you use certain length functions, you can specify whether you want DB2 to calculate the length by bytes or characters. This distinction is important for multibyte characters. If you convert DB2 data to Unicode and the data expands, consider updating some of these function calls to specify the appropriate unit of measurement.
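As a rough illustration of why the unit of measurement matters, the following Python sketch measures the same four-character string in several ways. The `cp037` codec is assumed as a stand-in for a single-byte EBCDIC CCSID; the variable names are hypothetical and the code does not call any DB2 function.

```python
text = "c\u00f4t\u00e9"  # "cote" with a circumflex on the o and an acute on the e

characters = len(text)                            # length in characters
ebcdic_bytes = len(text.encode("cp037"))          # one byte per character
utf8_bytes = len(text.encode("utf-8"))            # accented letters take two bytes
utf16_units = len(text.encode("utf-16-be")) // 2  # 16-bit code units

print(characters, ebcdic_bytes, utf8_bytes, utf16_units)
```

The string occupies four bytes in the EBCDIC stand-in but six bytes in UTF-8, which is an example of an expanding conversion. A byte-based length or substring offset that was correct for the EBCDIC data is no longer correct for the UTF-8 data.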
Specifying the sorting sequence for a language
If your application sorts non-English data, you should specify the sorting sequence to ensure that DB2 sorts the data in a culturally correct manner. For example, suppose your data contains the following strings: cote, coté, côte, côté. You need to specify how you want these strings sorted.
Performing culturally correct case conversions
Rules for uppercase and lowercase conversion vary according to language and country. If you plan to use the UPPER or LOWER function, you need to ensure that DB2 uses the culturally correct casing rules. For example, you need to tell DB2 how to convert characters such as ß and ó to uppercase.
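The following Python sketch shows what these two characters look like under the default, locale-independent Unicode case mappings that Python's `str.upper` applies. This is an assumption chosen for illustration; it is not a description of how DB2 converts case, and language-specific rules can differ from these defaults.

```python
word = "stra\u00dfe"      # German word containing the sharp s
upper_word = word.upper()  # the sharp s expands to two letters

accented = "\u00f3"        # o with acute accent
upper_accented = accented.upper()

print(word, "->", upper_word)
print(len(word), "->", len(upper_word))
print(accented, "->", upper_accented)
```

Note that the uppercase form of the first string is one character longer than the original, so case conversion can also change the length of the data.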
Generating escaped Unicode data
If you pass Unicode characters to an application or object that is not intended to handle Unicode data, data might be lost unless you escape certain characters. For example, you might need to pass Unicode data through an application that has EBCDIC host variables. Or you might want to store Unicode data in a non-Unicode table.
Normalization of Unicode strings
Your application should treat as equal those characters that are functionally and visually equivalent but have different code point representations. This behavior is important when you search, sort, or compare Unicode strings. To accomplish this goal, you might need to normalize the strings. However, normalization can degrade performance.
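A minimal sketch of the problem, using Python's standard `unicodedata` module rather than any DB2 facility: the same visible character can be stored as one precomposed code point or as a base letter followed by a combining mark, and the two forms compare as unequal until both are normalized to the same form (NFC is assumed here).

```python
import unicodedata

precomposed = "\u00e9"  # e with acute accent as a single code point
decomposed = "e\u0301"  # letter e followed by a combining acute accent

# The two strings look identical but are different code point sequences.
equal_before = precomposed == decomposed

# Normalizing both to the same form (NFC) makes them comparable.
nfc_a = unicodedata.normalize("NFC", precomposed)
nfc_b = unicodedata.normalize("NFC", decomposed)
equal_after = nfc_a == nfc_b

print(equal_before, equal_after)
```

Normalizing every string on every comparison adds processing cost, which is why normalizing once, when the data is stored, is often preferable.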
How DB2 handles Unicode supplementary characters
Unicode supplementary characters are those characters that have a code point between U+10000 and U+10FFFF. These characters include certain math symbols and certain characters from Chinese, Japanese, and some historic scripts.
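The following Python sketch shows how one supplementary character is counted in different units. The character U+1D11E (a musical symbol) is chosen only as an example, and the variable names are hypothetical.

```python
clef = "\U0001D11E"  # a supplementary character (code point above U+FFFF)

code_point = ord(clef)
utf8_bytes = len(clef.encode("utf-8"))            # bytes in UTF-8
utf16_units = len(clef.encode("utf-16-be")) // 2  # 16-bit code units (a surrogate pair)
utf32_units = len(clef.encode("utf-32-be")) // 4  # 32-bit code units

print(hex(code_point), utf8_bytes, utf16_units, utf32_units)
```

Because the character occupies two UTF-16 code units but only one UTF-32 code unit, a length or substring position that is expressed in 16-bit units differs from one that is expressed in 32-bit units for any string that contains supplementary characters.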
Processing Unicode data in COBOL applications
COBOL supports UTF-16 data. COBOL has no support for UTF-8 data.
Processing Unicode data in PL/I applications
PL/I supports UTF-16 data. PL/I has no support for UTF-8 data.
Processing Unicode data in C/C++ applications
C/C++ supports UTF-16 data. C/C++ also supports UTF-32 data, but DB2 for z/OS does not.
Java applications and Unicode data
Java is Unicode-based, and all character processing inside a Java application occurs in Unicode. Character data that is not already in Unicode must be converted before being passed to a Java application. These conversions are handled by DB2 or by the JDBC driver and are transparent to the application.
Green screen applications and Unicode data
Green screen applications are applications that run on 3270 terminal emulators. These applications do not support Unicode data.
Variant characters
Variant characters are characters that correspond to different code points across a given set of code pages. For example, the character # is variant. It corresponds to code point X'7B' in CCSIDs 37, 273, 500, and 1047. However, this character corresponds to code point X'4A' in CCSID 277.
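The idea can be checked with a short Python sketch. Python has no codec for CCSID 277, so the sketch compares only the built-in `cp037` and `cp500` codecs, which are assumed here to correspond to CCSIDs 37 and 500; the helper names `code_point` and `is_variant` are hypothetical. Within this pair, # maps to the same code point, whereas the exclamation mark does not.

```python
CODECS = ("cp037", "cp500")  # assumed stand-ins for CCSIDs 37 and 500


def code_point(char: str, codec: str) -> int:
    """Return the single-byte code point of char in the given code page."""
    return char.encode(codec)[0]


def is_variant(char: str, codecs) -> bool:
    """Return True if char maps to different code points across the code pages."""
    return len({code_point(char, c) for c in codecs}) > 1


for char in ("#", "!"):
    points = {c: hex(code_point(char, c)) for c in CODECS}
    print(char, points, "variant" if is_variant(char, CODECS) else "same")
```

Whether a character is variant depends on the set of code pages that you compare: # is the same across CCSIDs 37 and 500 but differs once CCSID 277 is included, as described above.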
DRDA character type parameters in Unicode
Remote DB2 applications can send and receive DRDA command and reply message parameters that contain character type data encoded in Unicode CCSID 1208 (UTF-8). Using Unicode instead of EBCDIC for these DRDA parameters can improve performance and avoid potential character conversion errors.

