Character conversion terminology

To understand the concept of character conversion, you should know the meaning of some basic related terms.

The following terms are related to character conversion:

application encoding scheme

The CCSID that your application uses to interpret data in host variables. For DB2® for z/OS® applications, the application encoding scheme is typically the value of the ENCODING bind option. (By default, the value of the ENCODING bind option is the subsystem default application encoding scheme, which is the APPENSCH DECP value.) However, you can also set the CCSID of application data by using the DECLARE VARIABLE statement with the CCSID option or by using the CURRENT APPLICATION ENCODING SCHEME special register.

If you are using the DB2 coprocessor, you can use various language compiler options to override the DB2 application encoding scheme for an application.

For more information about application encoding schemes, see Specifying a CCSID for your application.

ASCII

Acronym for American Standard Code for Information Interchange, an encoding scheme that is used to represent characters. In this information, the term ASCII is used to refer to IBM®-PC data or ISO 8-bit data.

For more information about ASCII, see ASCII. For more information about encoding schemes in general, see Encoding schemes.

big endian

A data format in which the most significant byte is stored first, at the memory location with the lowest address.

For more information about big endian, see Endianness.

character conversion

The process of converting characters from one CCSID to another.

For more information about how DB2 performs character conversions, see How DB2 performs character conversions.
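
As a rough illustration of the idea (not of how DB2 itself performs the conversion), the following Python sketch converts the bytes for "HELLO" from an EBCDIC code page to an IBM-PC code page. It assumes that Python's cp037 and cp437 codecs are acceptable stand-ins for CCSID 37 and CCSID 437; the convert function is a hypothetical helper written for this example.

```python
def convert(data: bytes, source_codec: str, target_codec: str) -> bytes:
    """Convert bytes from one code page to another (illustrative helper)."""
    text = data.decode(source_codec)    # bytes -> abstract characters
    return text.encode(target_codec)    # abstract characters -> bytes


# "HELLO" as stored in EBCDIC code page 37 (assumed stand-in for CCSID 37).
ebcdic_hello = bytes.fromhex("C8C5D3D3D6")

# The same characters in IBM-PC code page 437 (assumed stand-in for CCSID 437).
pc_hello = convert(ebcdic_hello, "cp037", "cp437")
print(pc_hello.hex().upper())  # 48454C4C4F
```

The characters are unchanged; only the byte values that represent them differ between the two code pages.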

character data representation architecture (CDRA)

An IBM architecture that aims to achieve consistent representation, processing, and interchange of graphic character data in data processing environments. CDRA defines a set of identifiers, services, supporting resources, and conventions. The identifiers that CDRA defines are CCSIDs.

For more information about CDRA, see Code pages and CCSIDs.

character repertoire

A set of characters.

character set

A defined set of characters, in which a character is the smallest component of written language that has semantic value.

code page

A specification of code points from a defined encoding scheme for each character in a set or in a collection of character sets. Within a code page, a code point can have only one specific meaning. Code pages are defined by the IBM Globalization Center of Competency.

For more information about code pages, see Code pages and CCSIDs.

code point

A unique bit pattern that represents a character.

For more information about code points, see Code pages and CCSIDs.
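
The following Python sketch shows that the same character can have different code points in different code pages. It assumes that the Python codecs cp037, cp437, and latin-1 approximate an EBCDIC code page, an IBM-PC code page, and an ISO 8-bit code page; code_point_hex is a hypothetical helper for this example.

```python
def code_point_hex(char: str, codec: str) -> str:
    """Return the byte value(s) that represent char in the given code page."""
    return char.encode(codec).hex().upper()


# Print the code points for "A" and for e-acute (U+00E9) in each code page.
for codec in ("cp037", "cp437", "latin-1"):
    print(codec, code_point_hex("A", codec), code_point_hex("\u00e9", codec))
```

In this sketch, "A" is X'C1' in the EBCDIC code page but X'41' in the other two, which is why data must be converted when it moves between encoding schemes.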

coded character set

A set of unambiguous rules that establishes a character set and the one-to-one relationships between the characters of the set and their coded representations. A coded character set is the assignment of each character in a character set to a unique numeric code value.

coded character set identifier (CCSID)

A 16-bit number that identifies a specific set of encoding scheme identifiers, character set identifiers, code page identifiers, and additional coding-related information. A CCSID is a number that identifies an implementation of a code page at a particular point in time. A CCSID is an attribute of strings, just as length is an attribute of strings. All values of the same string column have the same CCSID.

For more information about CCSIDs, see Code pages and CCSIDs.

coded character set identifier (CCSID) set

The single-byte CCSID value (SBCS), mixed CCSID value, and double-byte CCSID value (DBCS) that are associated with a particular encoding scheme.

For more information about CCSID sets, see Specifying subsystem CCSIDs.

collation name

A string value that specifies how DB2 is to sort data. The collation name specifies attributes such as the language of the data, whether case should be considered, and how punctuation characters should be treated.

For more information about collation names, see Specifying the sorting sequence for a language.

contracting conversion

A character conversion in which the length of the converted string is smaller than that of the source string.

For more information about contracting conversions, see Contracting conversion.
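
A minimal Python sketch of a conversion that shrinks the data, assuming that byte length is the measure of interest and using UTF-16 and UTF-8 purely as convenient examples:

```python
# "DB2" in UTF-16 (big endian) uses 2 bytes for each of these characters.
utf16_bytes = "DB2".encode("utf-16-be")

# Converting the same characters to UTF-8 needs only 1 byte each.
utf8_bytes = utf16_bytes.decode("utf-16-be").encode("utf-8")

print(len(utf16_bytes), len(utf8_bytes))  # 6 3
```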

conversion image

A data set that contains the information that z/OS Unicode Services needs to perform character and case conversions.

For more information about conversion images, see Conversion image.

EBCDIC

Acronym for Extended Binary-Coded Decimal Interchange Code, a group of coded character sets that consists of 8-bit coded characters. EBCDIC coded character sets assign characters to code points. Each code point consists of 8 bits.

For more information about EBCDIC, see EBCDIC. For more information about encoding schemes in general, see Encoding schemes.

encoding scheme

A set of rules that is used to represent character data. All string data that is stored in a table must use the same encoding scheme. All tables within a table space must use the same encoding scheme, except for global temporary tables, declared temporary tables, and work file table spaces. An encoding scheme describes only the type of encoding; it does not specify code points or a code page. Examples of encoding schemes include ASCII, EBCDIC, and Unicode.

For more information about encoding schemes, see Encoding schemes.

endianness

A data attribute that describes byte order.

For more information about endianness, see Endianness.
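
The two byte orders can be compared with a short Python sketch. It stores the 16-bit value X'0041' (the Unicode code point for "A") in both orders and checks that the result matches the big endian and little endian forms of UTF-16; the variable names are illustrative only.

```python
value = 0x0041  # 16-bit value: the Unicode code point for "A"

big = value.to_bytes(2, "big")        # most significant byte first
little = value.to_bytes(2, "little")  # least significant byte first

print(big.hex().upper(), little.hex().upper())  # 0041 4100

# The same ordering appears in the two byte-order variants of UTF-16.
print("A".encode("utf-16-be") == big)     # True
print("A".encode("utf-16-le") == little)  # True
```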

enforced subset conversion

A character conversion in which characters that do not have a code point in the target CCSID are converted to a single substitution character.

For more information about enforced subset conversions, see Enforced subset conversion.
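
Python's "replace" error handler behaves in a similar way and can serve as a rough analogy. In this sketch, the target is 7-bit ASCII and Python substitutes a question mark; the actual substitution character in a DB2 conversion depends on the target CCSID, so treat the "?" as an artifact of the example.

```python
# "naive" with i-diaeresis (U+00EF), followed by the euro sign (U+20AC).
source_text = "na\u00efve \u20ac"

# Characters with no code point in the target are each replaced by one "?".
converted = source_text.encode("ascii", errors="replace")
print(converted)  # b'na?ve ?'
```

After the conversion, the two original characters cannot be told apart or recovered, which is why this kind of conversion loses data.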

escaped data

One or more characters that cannot be represented in the target CCSID and that have been identified as such by some extra syntax.

For more information about escaped data, see Generating escaped Unicode data.
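
As an analogy only, Python's "backslashreplace" error handler produces escaped data when a character cannot be represented in the target encoding. The escape syntax shown here is Python's own and is not necessarily the syntax that DB2 generates.

```python
source_text = "Hi \u20ac"  # the euro sign (U+20AC) has no ASCII code point

# The unrepresentable character is written as an escape sequence instead.
escaped = source_text.encode("ascii", errors="backslashreplace")
print(escaped)  # b'Hi \\u20ac'

# Because the escape identifies the original character, it can be restored.
restored = escaped.decode("unicode_escape")
print(restored == source_text)  # True
```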

expanding conversion

A character conversion in which the length of the converted string is greater than that of the source string.

For more information about expanding conversions, see Expanding conversion.
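
A minimal Python sketch of a conversion that grows the data, assuming an ISO 8-bit source (Python's latin-1 codec) and a UTF-8 target; byte_lengths is a hypothetical helper for this example.

```python
def byte_lengths(text: str, source_codec: str, target_codec: str) -> tuple:
    """Return (source length, target length) in bytes for the same text."""
    source_bytes = text.encode(source_codec)
    target_bytes = source_bytes.decode(source_codec).encode(target_codec)
    return len(source_bytes), len(target_bytes)


# "Strasse" spelled with sharp s (U+00DF): 1 byte in latin-1, 2 bytes in UTF-8.
print(byte_lengths("Stra\u00dfe", "latin-1", "utf-8"))  # (6, 7)
```

An application that receives converted data in a fixed-length buffer needs to allow for this growth.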

International Components for Unicode (ICU)

A set of C/C++ and Java libraries for Unicode support and software internationalization.

For more information about ICU, see The International Components for Unicode.

little endian

A data format in which the least significant byte is stored first, at the memory location with the lowest address.

For more information about little endian, see Endianness.

locale

An attribute that defines the user's cultural environment.

For more information about locales, see Locale.

lossless conversion

A character conversion in which all characters in the source CCSID exist in the target CCSID, so no character is lost.

For more information about lossless conversions, see Possible consequences of character conversion.
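
The following Python sketch checks whether a particular string can be converted without loss. Note the assumption: it tests one string rather than the full repertoire of a CCSID, and it uses Python codec names in place of CCSIDs. The function name converts_without_loss is hypothetical.

```python
def converts_without_loss(text: str, target_codec: str) -> bool:
    """Return True if every character in text exists in the target encoding."""
    try:
        text.encode(target_codec)  # strict mode raises on a missing character
    except UnicodeEncodeError:
        return False
    return True


print(converts_without_loss("caf\u00e9", "latin-1"))  # True
print(converts_without_loss("\u20ac", "latin-1"))     # False (no euro sign)
print(converts_without_loss("\u20ac", "utf-8"))       # True
```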

normalization

A process that produces a unique code point sequence for all sequences that are equivalent, either canonically or compatibly.

For more information about normalization, see Normalization of Unicode strings.
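
The effect of normalization can be seen with Python's standard unicodedata module. This sketch is independent of DB2: it compares two canonically equivalent spellings of e-acute and then applies a compatibility normalization to a ligature.

```python
import unicodedata

precomposed = "\u00e9"   # e-acute as a single code point
decomposed = "e\u0301"   # "e" followed by a combining acute accent

# The two sequences are canonically equivalent but not identical.
print(precomposed == decomposed)  # False

# Canonical normalization (NFC) maps both to the same code point sequence.
nfc_a = unicodedata.normalize("NFC", precomposed)
nfc_b = unicodedata.normalize("NFC", decomposed)
print(nfc_a == nfc_b)  # True

# Compatibility normalization (NFKC) also replaces the "fi" ligature (U+FB01).
print(unicodedata.normalize("NFKC", "\ufb01"))  # fi
```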

round-trip conversion

A character conversion that ensures the integrity of all character data from the source CCSID to the target CCSID and back to the source. Even if the target CCSID does not support a given character, the character regains its original hexadecimal value after the conversion back to the original CCSID.

For more information about round-trip conversions, see Round-trip conversion.
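
A round trip can be demonstrated with a small Python sketch. It assumes that Python's cp037 and latin-1 codecs are reasonable stand-ins for an EBCDIC CCSID and an ISO 8-bit CCSID, and it checks that every possible byte value survives the trip there and back.

```python
all_bytes = bytes(range(256))  # every possible single-byte value

# Source (EBCDIC code page 37) -> target (ISO 8-bit).
there = all_bytes.decode("cp037").encode("latin-1")

# Target -> back to the source code page.
back = there.decode("latin-1").encode("cp037")

print(back == all_bytes)  # True
```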

substitution character

A unique character that is substituted during character conversion for any characters in the source CCSID that do not have a match in the target CCSID.

For more information about substitution characters, see Enforced subset conversion.

supplementary characters

Characters that have a code point in the range U+10000 through U+10FFFF.

For more information about supplementary characters, see How DB2 handles Unicode supplementary characters.
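
The following Python sketch tests whether a character is a supplementary character and shows how many bytes one such character needs in UTF-16 and UTF-8. The helper is_supplementary and the sample character U+1F600 are choices made for this example.

```python
def is_supplementary(char: str) -> bool:
    """Return True if char has a code point from U+10000 through U+10FFFF."""
    return 0x10000 <= ord(char) <= 0x10FFFF


sample = "\U0001F600"  # a character outside the first 65536 code points

print(is_supplementary("A"), is_supplementary(sample))  # False True

# In UTF-16 the character needs two 16-bit units (a surrogate pair).
print(sample.encode("utf-16-be").hex().upper())  # D83DDE00

# In UTF-8 the character needs four bytes.
print(sample.encode("utf-8").hex().upper())      # F09F9880
```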

Unicode

An international character code for information processing that is designed to encode all characters that are used for written communication in a simple and consistent manner. The Unicode character encoding was established to provide enough code points for all the scripts and technical symbols in common usage around the world, plus some ancient scripts.

For more information about Unicode, see Unicode. For more information about encoding schemes in general, see Encoding schemes.

Unicode Consortium

A non-profit organization that develops and maintains international standards, including the Unicode Standard.

For more information about the Unicode Consortium, see Unicode Consortium.

Unicode transformation formats (UTFs)

Forms of Unicode encoding that were devised by the Unicode Consortium to ensure that systems can communicate efficiently. UTF-8, UTF-16, and UTF-32 were each designed for different processing objectives.

For more information about the UTFs, see UTFs.
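
To make the difference between the formats concrete, this Python sketch reports how many bytes the same text occupies in each UTF. The function utf_lengths is a hypothetical helper; the big endian codec variants are used so that no byte order mark is added to the counts.

```python
def utf_lengths(text: str) -> dict:
    """Return the encoded length in bytes of text in UTF-8, UTF-16, and UTF-32."""
    return {codec: len(text.encode(codec))
            for codec in ("utf-8", "utf-16-be", "utf-32-be")}


# "A" (U+0041) followed by the euro sign (U+20AC).
print(utf_lengths("A\u20ac"))
# {'utf-8': 4, 'utf-16-be': 4, 'utf-32-be': 8}
```

UTF-8 uses 1 byte for "A" and 3 bytes for the euro sign, UTF-16 uses 2 bytes for each, and UTF-32 uses 4 bytes for each.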

z/OS Unicode Services

A set of functions that are provided by z/OS. The services include a case conversion service and a character conversion service.

For more information about the z/OS Unicode Services, see Setting up z/OS Unicode Services for DB2 for z/OS.