Character sets

A character set is an element of internationalization that maps the characters used in a particular language (its alphabet) to numeric values. A character set is made up of a series of code points; a code point is the numeric representation of a character. For example, the code point for the letter A in International EBCDIC is 0xC1. A character set can also be called a coded character set, a code set, a code page, or an encoding. Examples of character sets include International EBCDIC, Latin 1, and Unicode. Character sets are chosen on the basis of the letters and symbols required.
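The following is a small illustration of code points, not z/TPF code. It assumes Python and its built-in codecs, where "cp500" is Python's name for International EBCDIC (IBM500) and "latin-1" is its name for ISO8859-1.

```python
# Look up the numeric value of the letter A in several character sets.
# The codec names are Python's, used only for illustration.
LETTER = "A"

ebcdic_byte = LETTER.encode("cp500")     # International EBCDIC
latin1_byte = LETTER.encode("latin-1")   # Latin 1
unicode_code_point = ord(LETTER)         # Unicode code point

print(f"EBCDIC:  0x{ebcdic_byte[0]:02X}")
print(f"Latin 1: 0x{latin1_byte[0]:02X}")
print(f"Unicode: U+{unicode_code_point:04X}")
```

The same letter has a different numeric value in EBCDIC than in Latin 1 or Unicode, which is why data must be translated when it moves between systems that use different character sets.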

Character sets are referred to by a name or by an integer identifier called the coded character set identifier (CCSID). For example, Latin 1 might be called ISO-8859-1 or CCSID 819. The CCSID determines the character set name that is used with the iconv functions. A CCSID table associates the CCSID with the character set name. The entries in the CCSID table must conform to the standards outlined in the Character Data Representation Architecture Reference and Registry. See Add coded character set identifiers for more information about the CCSIDs, and The iconv functions for more information about the iconv functions.
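The relationship between a CCSID and a character set name can be sketched as a simple lookup table. This is a hypothetical illustration in Python, not the format of the z/TPF CCSID table: the names CCSID_TABLE, charset_name, and convert are invented for this example, and Python's built-in codecs stand in for the iconv functions.

```python
# Hypothetical CCSID table: CCSID -> (character set name, Python codec name).
# Only a few well-known CCSIDs are listed.
CCSID_TABLE = {
    37: ("IBM037", "cp037"),
    500: ("IBM500", "cp500"),
    819: ("ISO8859-1", "iso8859-1"),
    1208: ("UTF-8", "utf-8"),
}


def charset_name(ccsid: int) -> str:
    """Return the character set name that is associated with a CCSID."""
    try:
        return CCSID_TABLE[ccsid][0]
    except KeyError:
        raise LookupError(f"CCSID {ccsid} is not in the table") from None


def convert(data: bytes, from_ccsid: int, to_ccsid: int) -> bytes:
    """Translate bytes from one CCSID to another by way of Unicode."""
    from_codec = CCSID_TABLE[from_ccsid][1]
    to_codec = CCSID_TABLE[to_ccsid][1]
    return data.decode(from_codec).encode(to_codec)


print(charset_name(819))
print(convert(b"\xc1", 500, 819))
```

Here the CCSID selects the name, and the name selects the conversion, which mirrors the role the text gives to the CCSID table.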

The default character set definitions on the z/TPF system support all aliases for the character sets that are supported by the GNU C library (glibc). However, not all glibc translations are included on the z/TPF system.
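To show what an alias means in practice, the sketch below checks whether several names resolve to the same character set. It uses Python's codec registry as a stand-in; Python's alias list is not the glibc or z/TPF alias list, and the helper name same_charset is invented for this example.

```python
import codecs


def same_charset(*names: str) -> bool:
    """Return True if every name resolves to the same canonical codec."""
    canonical = {codecs.lookup(name).name for name in names}
    return len(canonical) == 1


# Different names for Latin 1 resolve to one character set.
print(same_charset("ISO-8859-1", "latin1", "IBM819"))

# IBM037 and IBM500 are distinct character sets, not aliases.
print(same_charset("IBM037", "IBM500"))
```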

The z/TPF system supports multibyte character sets that use shift-out and shift-in control characters (shift out of the regular character set mode; shift back in to the regular character set mode). Wide characters are 4 bytes long and are encoded using the UCS-4 character set, which is a Unicode-based character set and ASCII compatible. For more information about multibyte character sets and wide characters, see the GNU website.
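The shift-out and shift-in mechanism can be sketched as a small scanner that splits a byte stream into single-byte and double-byte runs. This is a simplified illustration in Python, not the z/TPF or glibc implementation: the function names are invented, the double-byte values in the sample data are placeholders rather than real characters, and Python's "utf-32-be" codec is assumed as an equivalent of UCS-4 for valid code points.

```python
SO = 0x0E  # shift-out: leave regular (single-byte) mode
SI = 0x0F  # shift-in: return to regular (single-byte) mode


def split_shift_segments(data: bytes) -> list:
    """Split data into ("single", bytes) and ("double", bytes) runs."""
    segments = []
    mode, run = "single", bytearray()
    i = 0
    while i < len(data):
        byte = data[i]
        if (mode == "single" and byte == SO) or (mode == "double" and byte == SI):
            # A shift control ends the current run and switches mode.
            if run:
                segments.append((mode, bytes(run)))
            mode = "double" if byte == SO else "single"
            run = bytearray()
            i += 1
        elif mode == "double":
            # In double-byte mode, each character occupies two bytes.
            run += data[i:i + 2]
            i += 2
        else:
            run.append(byte)
            i += 1
    if run:
        segments.append((mode, bytes(run)))
    return segments


def to_ucs4(text: str) -> bytes:
    """Encode text as 4-byte wide characters (big-endian)."""
    return text.encode("utf-32-be")


sample = b"\xc1\xc2" + bytes([SO]) + b"\x45\x46\x45\x47" + bytes([SI]) + b"\xc3"
print(split_shift_segments(sample))
print(to_ucs4("A"))
```

Because the meaning of a byte depends on the current shift state, such a stream has to be read from the start; a wide-character representation avoids this by giving every character the same 4-byte size.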

The z/TPF system supports translations, using the iconv functions, among the character sets listed in Table 1. These character sets are supported by glibc and are single-byte character sets (SBCSs) unless otherwise noted.

Table 1. z/TPF-supported character sets
Character set   Description                                                               Encoding
ANSI_X3.4-1968  Standard 7-bit ASCII                                                      ASCII (X'00'-X'7F')
CP1250          MS Windows Latin 2                                                        ASCII
CP1252          MS Windows Latin 1                                                        ASCII
EUC-JP          Japanese characters                                                       ASCII
GB18030         Chinese multibyte                                                         ASCII
IBM037          US/Canada Latin 1                                                         EBCDIC
IBM290          Japanese Katakana                                                         EBCDIC
IBM500          Multinational                                                             EBCDIC
IBM819          Alias for ISO8859-1                                                       ASCII
IBM850          Latin 1 PC Data                                                           ASCII
IBM924          IBM500/IBM1047 with euro                                                  EBCDIC
IBM930          Japanese Katakana/Kanji multibyte character set                           EBCDIC
IBM932          Japanese PC Data                                                          ASCII
IBM939          Japanese Latin/Kanji multibyte character set                              EBCDIC
IBM1026         Turkey Latin 5                                                            EBCDIC
IBM1047         Open Systems Latin 1                                                      EBCDIC
IBM1140         Latin 1; IBM037 with euro for US                                          EBCDIC
IBM1141         Latin 1; IBM273 with euro for Austria/Germany                             EBCDIC
IBM1142         Latin 1; IBM277 with euro for Denmark/Norway                              EBCDIC
IBM1143         Latin 1; IBM278 with euro for Finland/Sweden                              EBCDIC
IBM1144         Latin 1; IBM280 with euro for Italy                                       EBCDIC
IBM1145         Latin 1; IBM284 with euro for Spain                                       EBCDIC
IBM1146         Latin 1; IBM285 with euro for UK                                          EBCDIC
IBM1147         Latin 1; IBM297 with euro for France                                      EBCDIC
IBM1148         Latin 1; IBM500 with euro for Belgium/Canada/Switzerland (Multinational)  EBCDIC
IBM1149         Latin 1; IBM871 with euro for Iceland                                     EBCDIC
ISO8859-1       Latin 1, Standard 8-bit                                                   ASCII
ISO8859-2       Latin 2                                                                   ASCII
ISO8859-3       Latin 3                                                                   ASCII
ISO8859-4       Latin 4                                                                   ASCII
ISO8859-9       Latin 5, Turkey/Western Europe                                            ASCII
ISO8859-10      Latin 6, Baltic/Scandinavian                                              ASCII
ISO8859-15      Latin 9, ISO8859-1 with euro                                              ASCII
UCS-2           2-byte normalized Unicode                                                 Unicode
UCS-4           4-byte normalized Unicode                                                 Unicode
UTF-8           Multibyte Unicode (a range of 1-6 bytes per character)                    Unicode
UTF-16          Multibyte Unicode (2 or 4 bytes per character)                            Unicode
UTF-32          Fixed-width Unicode (4 bytes per character)                               Unicode
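A translation between two of the character sets in Table 1 can be sketched as follows. This is an illustration only: Python's built-in codecs stand in for the iconv functions, the codec names ("cp037", "cp1140", "iso8859-15", and so on) are Python's names for the corresponding character sets, and the helper name translate is invented for this example.

```python
def translate(data: bytes, from_charset: str, to_charset: str) -> bytes:
    """Translate bytes between two character sets by way of Unicode."""
    return data.decode(from_charset).encode(to_charset)


# "Hello" in IBM037 (EBCDIC) translated to ISO8859-1 (ASCII based).
hello_ebcdic = b"\xc8\x85\x93\x93\x96"
hello_latin1 = translate(hello_ebcdic, "cp037", "iso8859-1")

# The euro sign exists in the "with euro" variants but not in the
# character sets they are derived from.
EURO = "\u20ac"
euro_ibm1140 = EURO.encode("cp1140")
euro_latin9 = EURO.encode("iso8859-15")
try:
    EURO.encode("cp037")
    euro_in_ibm037 = True
except UnicodeEncodeError:
    euro_in_ibm037 = False

# Bytes needed for the euro sign in each Unicode encoding form.
euro_sizes = {
    name: len(EURO.encode(name))
    for name in ("utf-8", "utf-16-be", "utf-32-be")
}

print(hello_latin1)
print(euro_ibm1140, euro_latin9, euro_in_ibm037)
print(euro_sizes)
```

A translation can fail when the target character set has no code point for a character, as with the euro sign and IBM037 here, so the choice of character set depends on the letters and symbols the data requires.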