Character sets and code pages

A character set is a set of letters, numbers, special characters, and other elements used to represent information. The term code page refers to a coded character set.

A character set is independent of a coded representation. A coded character set is the coded representation of a set of characters, where each character is assigned a numerical position, called a code point, in the encoding scheme. ASCII and EBCDIC are examples of types of coded character sets. Each variation of ASCII or EBCDIC is a specific coded character set.
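
The difference between a character and its code point can be sketched as follows. This is an illustration only, not part of the COBOL product: it assumes that Python's "ascii" and "cp037" codecs are acceptable stand-ins for one ASCII and one EBCDIC coded character set, and the helper name code_point is hypothetical.

```python
# Illustration only: Python's "ascii" and "cp037" codecs stand in for one
# ASCII and one EBCDIC coded character set. code_point is a hypothetical helper.
def code_point(char: str, codec: str) -> int:
    """Return the single-byte code point that the codec assigns to char."""
    encoded = char.encode(codec)
    if len(encoded) != 1:
        raise ValueError(f"{char!r} is not a single-byte character in {codec}")
    return encoded[0]


# The same character occupies a different position in each coded character set.
print(f"ASCII  'A' = 0x{code_point('A', 'ascii'):02X}")
print(f"EBCDIC 'A' = 0x{code_point('A', 'cp037'):02X}")
```

The character is the same in both calls; only the numerical position assigned to it differs.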

Each code page that IBM® defines is identified by a code page name, for example IBM-1252, and a coded character set identifier (CCSID), for example 1252.

Compile-time code page

The compile-time code page can be an ASCII single-byte or ASCII double-byte code page, an EUC code page, or UTF-8. The specific code page is indicated by the compile-time locale or the environment variable in effect.

The source program (including user-defined words and the content of alphanumeric, DBCS, and national literals) is encoded in the code page indicated by the locale or environment variable in effect at compile time.
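
Why the compile-time code page matters for literals can be shown with a small sketch. It assumes that Python's "utf-8" and "cp1252" codecs approximate two possible compile-time code pages; the function name source_bytes and the sample literal are hypothetical.

```python
# Illustration only: the same literal text is stored as different bytes
# depending on the code page of the source file. source_bytes is hypothetical.
def source_bytes(literal: str, compile_time_codec: str) -> bytes:
    """Return the bytes a source file would hold for the literal."""
    return literal.encode(compile_time_codec)


as_utf8 = source_bytes("café", "utf-8")     # 'é' takes two bytes
as_cp1252 = source_bytes("café", "cp1252")  # 'é' takes one byte

print(as_utf8.hex(" "))
print(as_cp1252.hex(" "))

# Reading UTF-8 bytes under the wrong code page yields different characters.
misread = as_utf8.decode("cp1252")
print(misread)
```

If the locale in effect at compile time does not match the encoding in which the source file was actually saved, the literal content is interpreted as the misread form rather than the intended text.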

Runtime code page

The code page used at run time is determined by a combination of a data item's USAGE clause, the compiler options in effect, and the locale (or environment variable value) in effect.

When the CHAR(NATIVE) compiler option is in effect, data items described with USAGE DISPLAY or USAGE DISPLAY-1 are encoded in an ASCII, EUC, or UTF-8 code page as indicated by the runtime locale.

When the CHAR(EBCDIC) compiler option is in effect, data items described with USAGE DISPLAY or USAGE DISPLAY-1 are encoded in an EBCDIC code page, except when the NATIVE phrase is specified in the item's USAGE clause. If the NATIVE phrase is specified, the code page used is the ASCII, EUC, or UTF-8 code page indicated by the runtime locale.
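
The two rules above for USAGE DISPLAY and USAGE DISPLAY-1 items can be summarized as a small decision function. This is a sketch of the rules as stated here, not of the compiler's implementation; the function name, the constants, and the treatment of the NATIVE phrase under CHAR(NATIVE) as having no additional effect are assumptions.

```python
# Sketch of the selection rules for USAGE DISPLAY and USAGE DISPLAY-1 items.
# The names below are hypothetical and exist only for this illustration.
NATIVE_FAMILY = "ASCII, EUC, or UTF-8 code page from the runtime locale"
EBCDIC_FAMILY = "EBCDIC code page"


def display_code_page_family(char_option: str, native_phrase: bool = False) -> str:
    """Return which family of code page applies to a DISPLAY or DISPLAY-1 item.

    char_option   -- "NATIVE" or "EBCDIC", the suboption of the CHAR option
    native_phrase -- True if the item's USAGE clause specifies NATIVE
    """
    option = char_option.upper()
    if option not in ("NATIVE", "EBCDIC"):
        raise ValueError(f"unknown CHAR suboption: {char_option!r}")
    # Assumption: under CHAR(NATIVE) the NATIVE phrase changes nothing.
    if option == "NATIVE" or native_phrase:
        return NATIVE_FAMILY
    return EBCDIC_FAMILY


print(display_code_page_family("EBCDIC"))
print(display_code_page_family("EBCDIC", native_phrase=True))
```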

For EBCDIC, the code page is determined from the EBCDIC_CODEPAGE environment variable, if set. If the EBCDIC_CODEPAGE environment variable is not set, the default EBCDIC code page associated with the current runtime locale is used. The default EBCDIC code page associated with each supported locale is identified in "Locales and code pages that are supported" in the COBOL for Linux® on x86 Programming Guide.

For DBCS, the code page is determined by the DBCS_CODEPAGE environment variable, if set. If the DBCS_CODEPAGE environment variable is not set, the default DBCS code page associated with the current runtime locale is used.
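
The lookup order for the EBCDIC and DBCS code pages (environment variable first, locale default second) can be sketched as follows. The function name resolve_code_page is hypothetical, the locale default is passed in by the caller rather than looked up, the code page names in the usage lines are placeholders and not the documented defaults for any locale, and treating an empty value as "not set" is an assumption.

```python
import os
from typing import Mapping, Optional


def resolve_code_page(
    variable: str,
    locale_default: str,
    environ: Optional[Mapping[str, str]] = None,
) -> str:
    """Return the code page named by the environment variable, if set;
    otherwise return the default associated with the runtime locale.

    Hypothetical helper. Assumption: an empty value counts as "not set".
    """
    env = os.environ if environ is None else environ
    value = env.get(variable, "")
    return value if value else locale_default


# Placeholder names only; consult the Programming Guide for real defaults.
print(resolve_code_page("EBCDIC_CODEPAGE", "locale-default-ebcdic",
                        environ={"EBCDIC_CODEPAGE": "override-ebcdic"}))
print(resolve_code_page("DBCS_CODEPAGE", "locale-default-dbcs", environ={}))
```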

The default code page for data items described with USAGE NATIONAL and national literals is UTF-16LE (little endian), CCSID 1200. The source text representation of national literals is converted at run time from the compile-time code page to UTF-16LE.
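
The conversion of a national literal can be pictured as a decode from the compile-time code page followed by an encode to UTF-16LE. The sketch below assumes that Python's "cp1252", "utf-8", "utf-16-le", and "utf-16-be" codecs approximate the code pages involved; the function name to_national is hypothetical, and the big-endian form is included only to contrast the byte order.

```python
# Illustration only: to_national is a hypothetical helper, not a product API.
def to_national(source: bytes, compile_time_codec: str,
                big_endian: bool = False) -> bytes:
    """Convert literal bytes from the compile-time code page to UTF-16."""
    text = source.decode(compile_time_codec)
    return text.encode("utf-16-be" if big_endian else "utf-16-le")


# 'é' stored in two different compile-time code pages converts to the
# same UTF-16LE bytes.
from_cp1252 = to_national(b"\xe9", "cp1252")
from_utf8 = to_national(b"\xc3\xa9", "utf-8")
print(from_cp1252.hex(" "), from_utf8.hex(" "))

# Little endian stores the low-order byte first; big endian reverses it.
print(to_national(b"A", "cp1252").hex(" "))
print(to_national(b"A", "cp1252", big_endian=True).hex(" "))
```

Whatever the compile-time code page, the resulting national data has one representation, which is why the literal's run-time content does not depend on how the source file was encoded.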

To change the endianness of data items described with USAGE NATIONAL and of national literals, refer to the UTF16 compiler option.

A reference to UTF-16 in this document is a reference to UTF-16LE.