Unicode and the encoding of language characters

COBOL for Linux® provides basic runtime support for Unicode, a character encoding standard that covers the characters and symbols in common use worldwide.

A character set is a defined set of characters that is not associated with any coded representation. A coded character set (also referred to in this documentation as a code page) is a set of unambiguous rules that relate the characters of the set to their coded representation. Each code page has a name and works like a table that maps each character of the set to a unique bit pattern, or code point. Each code page also has a coded character set identifier (CCSID), which is a value from 1 to 65,536.

Unicode has several encoding schemes, called Unicode Transformation Formats (UTFs), such as UTF-8, UTF-16, and UTF-32. COBOL for Linux uses UTF-16 (CCSID 1200) in little-endian format as the representation for national literals and for data items that have USAGE NATIONAL.

UTF-8 represents the ASCII invariant characters a-z, A-Z, 0-9, and certain special characters such as ' @ , . + - = / * ( ) the same way that they are represented in ASCII. UTF-16 in little-endian format represents these characters as NX'nn00', where X'nn' is the representation of the character in ASCII.

For example, the string 'ABC' is represented in UTF-16 as NX'410042004300'. In UTF-8, 'ABC' is represented as X'414243'.
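The byte patterns in this example can be checked outside COBOL. The following is a minimal sketch in Python, used here only as a convenient way to display encoded bytes; the helper name hex_bytes is hypothetical and is not part of any COBOL for Linux interface.

```python
def hex_bytes(text: str, encoding: str) -> str:
    """Return the encoded bytes of text as an uppercase hex string."""
    return text.encode(encoding).hex().upper()


# UTF-16 little-endian: each ASCII invariant character becomes X'nn00'.
utf16le_hex = hex_bytes("ABC", "utf-16-le")

# UTF-8: ASCII invariant characters keep their one-byte ASCII values.
utf8_hex = hex_bytes("ABC", "utf-8")
ascii_hex = hex_bytes("ABC", "ascii")

print(utf16le_hex)  # 410042004300
print(utf8_hex)     # 414243
```

The "utf-16-le" codec name is chosen deliberately: it emits no byte order mark, so the output corresponds byte for byte to the NX'410042004300' value shown above.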

One or more encoding units are used to represent a character from a coded character set. For UTF-16, an encoding unit takes 2 bytes of storage. Any character defined in any EBCDIC, ASCII, or EUC code page is represented in one UTF-16 encoding unit when the character is converted to the national data representation.
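As an illustration of the 2-byte encoding unit, the following Python sketch counts the UTF-16 encoding units needed for a few sample characters. The function name utf16_code_units and the sample characters are assumptions chosen for this example; the sketch does not reproduce how the COBOL runtime performs conversions.

```python
def utf16_code_units(text: str) -> int:
    """Return the number of 2-byte UTF-16 encoding units needed for text."""
    return len(text.encode("utf-16-le")) // 2


# Sample characters of the kind found in common ASCII, EBCDIC, and EUC
# code pages: Latin letter, accented letter, euro sign, CJK ideograph.
samples = ["A", "\u00e9", "\u20ac", "\u65e5"]
units = {ch: utf16_code_units(ch) for ch in samples}

# Storage for the three-character string 'ABC' in UTF-16: 3 units * 2 bytes.
storage_bytes = utf16_code_units("ABC") * 2

for ch, count in units.items():
    print(f"U+{ord(ch):04X} needs {count} encoding unit(s)")
print(storage_bytes)  # 6
```

Each sample character needs exactly one encoding unit, which is consistent with a three-character national item occupying 6 bytes of storage.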

Cross-platform considerations: Enterprise COBOL for z/OS® and COBOL for AIX® support UTF-16 in big-endian format in national data. By default, COBOL for Linux supports UTF-16 in little-endian format in national data. If you are porting Unicode data that is encoded in UTF-16BE representation to COBOL for Linux from another platform, you must either convert that data to UTF-16 in little-endian format to process the data as national data, or use the UTF16 compiler option to change the way the compiler treats UTF-16 endianness. With COBOL for Linux, you can perform such conversions by using the NATIONAL-OF intrinsic function.
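To make the endianness difference concrete, the following Python sketch converts UTF-16BE bytes to UTF-16 in little-endian format. It only illustrates the byte-level effect of such a conversion and is not a substitute for the NATIONAL-OF intrinsic function or the UTF16 compiler option; the function name utf16be_to_utf16le and the sample data are assumptions made for this example.

```python
def utf16be_to_utf16le(data: bytes) -> bytes:
    """Re-encode UTF-16 big-endian bytes as UTF-16 little-endian bytes."""
    # Decoding and re-encoding validates the input as UTF-16BE,
    # rather than blindly swapping every pair of bytes.
    return data.decode("utf-16-be").encode("utf-16-le")


# 'ABC' as it would be stored in national data on a big-endian platform.
ported = bytes.fromhex("004100420043")

converted = utf16be_to_utf16le(ported)
converted_hex = converted.hex().upper()

print(converted_hex)  # 410042004300
```

The converted bytes match the little-endian representation NX'410042004300' that COBOL for Linux uses by default for national data.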

Related references  
Storage of character data  
Locales and code pages that are supported  
Character sets and code pages (COBOL for Linux on x86 Language Reference)