Data encoding basics

Most Db2 for z/OS® subsystems process data from multiple encoding schemes or operate in a distributed environment where different applications must communicate with each other. In this environment, a basic understanding of code pages and CCSIDs can help you avoid certain types of performance problems or, worse yet, data corruption.

What is a code page and a CCSID?

A code page is a map that tells a computer which hexadecimal numbers represent which characters. Another similar term is CCSID. A CCSID is a number that identifies an implementation of a code page at a particular point in time. Although these terms are sometimes used interchangeably, they are different. If you are not familiar with these terms and the difference between them, read Code pages and CCSIDs.

Db2 for z/OS uses CCSID values, not code page numbers. To indicate the encoding value of some data, specify the CCSID. Db2 uses this CCSID value to interpret the data. If you do not specify the correct CCSID, the data can be misinterpreted or corrupted by the receiving program.

Each CCSID belongs to an encoding scheme. The major encoding schemes are EBCDIC, ASCII, and Unicode. To learn more about encoding schemes and the relationship between CCSIDs and encoding schemes, read Encoding schemes.

When data is passed between different systems or applications, it might need to be converted from one CCSID to another CCSID. That process is called character conversion. In some cases, character conversion can happen within the same system. Ideally, you want to avoid character conversions whenever possible because they can degrade performance.

Which CCSIDs are my Db2 subsystems using?

Each Db2 for z/OS subsystem has three default CCSID sets, one for each encoding scheme: ASCII, EBCDIC, and Unicode. Your system administrator can set the value for the ASCII CCSID set and the EBCDIC CCSID set. The Unicode CCSID set is predefined and cannot be changed.

Each subsystem also has two default encoding schemes, one for the subsystem and one for applications. The subsystem value determines the default encoding scheme of Db2 objects. The application value determines the default encoding scheme of the data that is passed to Db2 by applications.

For more information about the default CCSID and encoding scheme values, including where they are specified, see Subsystem CCSIDs and encoding schemes.

To determine the values that are defined for your subsystem, either ask your system programmer or follow the procedure in Determining current subsystem CCSID and encoding scheme values.

The default CCSID and encoding scheme values affect you in a couple of ways:

They potentially affect how you create database objects. Consider the following examples:
- You create a table and specify CCSID EBCDIC in the CREATE TABLE statement. Db2 defines that table with the default EBCDIC CCSID value. Suppose that value on your subsystem is 500. Db2 stores all data in that table as encoded in CCSID 500.
- You create a table and do not define a CCSID. In this case, Db2 uses the default subsystem encoding scheme. If that value is EBCDIC and the default EBCDIC CCSID is 500, the table is again created with CCSID 500.
If your data is encoded in CCSID 500, both of these examples work fine. However, if your data is encoded in another CCSID, such as CCSID 857, both of these examples result in data corruption; Db2 is using the wrong code page to interpret the data.
They affect how you write applications that pass variables to Db2. Suppose that your application passes data to Db2 but does not tell Db2 the CCSID of the data that it is passing. Db2 then assumes that the data is in the subsystem default application encoding scheme. Again, suppose that value is EBCDIC and suppose that this time, your default EBCDIC CCSID is 37. Db2 interprets the data that is being passed as being encoded in CCSID 37. If the data is encoded in CCSID 37, this scenario works fine. However, if the data is encoded in another CCSID, such as CCSID 860, Db2 misinterprets this data from the application, and it is now corrupted.

Which CCSIDs are my other data sources using?

Knowing which CCSIDs all of your data sources are using is critical to making sure that the data is interpreted correctly and with minimal conversions.

Consider the example in which you created a table with CCSID 500. Now suppose that you are loading data into that table, but the data that you are loading is encoded in CCSID 37. Assume that you use the LOAD utility and specify CCSID 37 for the input data. The data must be converted from CCSID 37 to CCSID 500. Although these CCSIDs are similar, some characters will need to be converted. The greater the number of characters that must be converted, the bigger the impact is on performance.

Now consider the example in which your application does not specify the CCSID of the data that it is passing to Db2. Again, suppose that the default application encoding scheme is EBCDIC, and the default EBCDIC CCSID is 37. Db2 assumes that the data it receives is in CCSID 37. However, suppose that the data that you are passing to Db2 is actually in another CCSID. For example, suppose that you are passing Japanese characters, encoded in CCSID 281. Db2 is reading the data as if it is encoded in CCSID 37; therefore, the data is corrupted.

Consider a more subtle example. Suppose that the data that you are passing is in CCSID 1140. Only one code point is different between CCSID 37 and CCSID 1140: the character at code point X'9F'. If the data that is being passed to Db2 does not contain this character, you might not even notice a problem. Suppose, however, that one day this character is passed to Db2. The application passes € but Db2 interprets it using CCSID 37, which maps to ¤. Your data now has one corrupted character, and it might take some time to discover the error.

These examples illustrate the importance of knowing the CCSID of all of the data that you are working with. The following resources can help you find the CCSID values of some of your data sources:

To find the CCSID for some of the common data sources, use the table in Finding the CCSID values of your data sources.
To find the CCSID of Db2 objects, use the instructions in Determining the CCSID of Db2 data.

Where do I specify CCSIDs?

You can specify CCSIDs for objects and applications:

For objects, you specify CCSIDs when you create them. See Specifying object CCSIDs for more information.
For Db2 for z/OS applications, you can specify a CCSID for the program source code and a CCSID for the variables. The way that you specify the CCSID for the source code depends on whether you use the Db2 precompiler or the Db2 coprocessor and the language that you are using. For the application data (the host variables and parameter markers), generally you use the ENCODING bind option. For more information about the process and all of the options available, see Specifying a CCSID for your application.

What is Unicode?

You can think of Unicode as one giant code page that includes all characters. For a detailed explanation, see Unicode and Why the Unicode standard?.

Unicode has a couple of formats that Db2 for z/OS supports: UTF-8 and UTF-16. These formats are explained in UTFs and Deciding whether to store data as UTF-8 or UTF-16.

Other learning resources

See International data and character conversion in Db2 for z/OS for a comprehensive explanation of CCSIDs, Unicode, and international data, including how to specify the sort sequence of a language and how to perform culturally correct case conversions.

For more detailed training about Unicode, use the following resources:

Course: "How to Deal with Unicode and Multiple CCSIDs." You can find links to this course information from the Db2 for z/OS Database Administrator training path.
Unicode Tutorials and Overviews (on Unicode Consortium website)