Character sets and code pages

A character set is a set of letters, numbers, special characters, and other elements used to represent information. The term code page refers to a coded character set.

A character set is independent of any coded representation. A coded character set is the coded representation of a set of characters, where each character is assigned a numerical position, called a code point, in the encoding scheme. ASCII and EBCDIC are families of coded character sets; each variation of ASCII or EBCDIC is a specific coded character set.
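The difference between a character and its code point can be illustrated with a short sketch. It is written in Python rather than COBOL, and it assumes that Python's "cp1140" and "cp1252" codecs are acceptable stand-ins for the IBM-1140 (EBCDIC) and IBM-1252 (ASCII-based) code pages. The helper name code_points is hypothetical.

```python
def code_points(text: str, codec: str) -> list[int]:
    """Return the code point assigned to each character of text in a single-byte codec."""
    return list(text.encode(codec))

# The same characters occupy different positions in the two coded character sets.
ebcdic = code_points("A1", "cp1140")   # EBCDIC stand-in (assumption)
ascii_ = code_points("A1", "cp1252")   # ASCII-based stand-in (assumption)

print([hex(p) for p in ebcdic])  # ['0xc1', '0xf1']
print([hex(p) for p in ascii_])  # ['0x41', '0x31']
```

The characters are the same in both cases. Only the numerical positions differ, which is why data must be interpreted with the code page it was encoded in.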

Each code page that IBM® defines is identified by a code page name, for example IBM-1252, and a coded character set identifier (CCSID), for example 1252.

Enterprise COBOL provides the CODEPAGE compiler option for specifying a coded character set for use at compile time and run time for code-page-sensitive elements, such as:

  • The encoding of literals in the source program
  • The default encoding for data items described with USAGE DISPLAY or DISPLAY-1
  • The default encoding for XML parsing and XML generation

Some COBOL operations can override the encoding established by the CODEPAGE compiler option, for example:

  • The DISPLAY-OF and NATIONAL-OF intrinsic functions can specify a CCSID as argument-2.
  • The XML PARSE and XML GENERATE statements can specify a code page in the ENCODING phrase.
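
As a rough analogy for how an explicit CCSID overrides the default, the following Python sketch mimics converting between code-page data and national (UTF-16BE) data. It is not COBOL and not the implementation of the intrinsic functions: the function names national_of and display_of, the CCSID_TO_CODEC table, and the use of Python codecs as stand-ins for IBM code pages are all assumptions made for illustration.

```python
# Hypothetical mapping from a few CCSIDs to Python codec names (illustration only).
CCSID_TO_CODEC = {1140: "cp1140", 1252: "cp1252", 1208: "utf-8"}

def national_of(data: bytes, ccsid: int = 1140) -> bytes:
    """Rough analog of NATIONAL-OF: code-page bytes to UTF-16BE bytes."""
    return data.decode(CCSID_TO_CODEC[ccsid]).encode("utf-16-be")

def display_of(national: bytes, ccsid: int = 1140) -> bytes:
    """Rough analog of DISPLAY-OF: UTF-16BE bytes to code-page bytes."""
    return national.decode("utf-16-be").encode(CCSID_TO_CODEC[ccsid])

ebcdic = "COBOL".encode("cp1140")          # data in the assumed default code page
national = national_of(ebcdic)             # no CCSID given, so 1140 is assumed
ascii_based = display_of(national, 1252)   # explicit CCSID overrides the default

print(ascii_based)  # b'COBOL'
```

In this sketch the second argument plays the role of argument-2: when it is omitted the default code page applies, and when it is supplied it takes precedence.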

For further details about the CODEPAGE compiler option, see CODEPAGE in the Enterprise COBOL Programming Guide.

If you do not specify a code page with the CODEPAGE compiler option, the default is code page IBM-1140, CCSID 1140.

The encoding of national and UTF-8 data is not affected by the CODEPAGE compiler option. The encoding for national literals and data items described with USAGE NATIONAL is UTF-16BE (big endian), CCSID 1200. A reference to UTF-16 in this document is a reference to UTF-16BE. The encoding for UTF-8 literals and data items described with USAGE UTF-8 is UTF-8, CCSID 1208.
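
To make the two Unicode encodings concrete, the following Python sketch (not COBOL) shows the bytes produced for the same characters in UTF-16BE and UTF-8. The helper name hex_bytes is hypothetical, and Python's "utf-16-be" and "utf-8" codecs are assumed to correspond to the encodings described above.

```python
def hex_bytes(text: str, codec: str) -> str:
    """Encode text with the given codec and return the bytes as a hex string."""
    return text.encode(codec).hex()

letter_a = "A"
euro = "\u20ac"  # EURO SIGN

# UTF-16BE: two bytes per character here, most significant byte first.
print(hex_bytes(letter_a, "utf-16-be"))  # 0041
print(hex_bytes(euro, "utf-16-be"))      # 20ac

# UTF-8: one to four bytes per character.
print(hex_bytes(letter_a, "utf-8"))      # 41
print(hex_bytes(euro, "utf-8"))          # e282ac

# Little-endian byte order, shown only for contrast with big endian.
print(hex_bytes(letter_a, "utf-16-le"))  # 4100
```

None of these results depend on a code page setting, which matches the statement that national and UTF-8 data are unaffected by the CODEPAGE compiler option.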