Code sets for multicultural support

The globalization of AIX® is based on the assumption that all code sets can be divided into any number of character sets.

To understand code sets, it is necessary to first understand character sets. A character set is a collection of predefined characters based on the specific needs of one or more languages without regard to the encoding values used to represent the characters. For example, the ASCII character set defines the set of characters found in the English language. The Japanese Industrial Standard (JIS) character set defines the set of characters used in the Japanese language. A code set assigns encoding values to the characters of one or more character sets, so a particular character set can be encoded using different encoding schemes. Both the English and Japanese character sets can be encoded using different code sets. The choice of which code set to use depends on the user's data processing requirements.
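The distinction between a character and its encoding can be illustrated with a minimal sketch. It assumes Python's built-in codecs `euc_jp`, `shift_jis`, and `utf-8` as stand-ins for the corresponding code sets; these are Python codec names, not AIX code set names, and the helper `encode_in_code_sets` is hypothetical.

```python
# One character from the JIS character set, encoded by three different code sets.
# The codec names below are Python's, used here only for illustration.

def encode_in_code_sets(char, code_sets):
    """Return a mapping of code set name to the bytes that encode char."""
    return {name: char.encode(name) for name in code_sets}

HIRAGANA_A = "\u3042"  # HIRAGANA LETTER A

encodings = encode_in_code_sets(HIRAGANA_A, ["euc_jp", "shift_jis", "utf-8"])

for name, raw in encodings.items():
    # The character is the same each time; only the byte values differ.
    print(f"{name:10} {raw.hex()}")
```

The same character yields a different byte sequence under each code set, which is why the encoding cannot be inferred from the character set alone.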

A code page is similar to a code set with the limitation that a code page specification is based on a 16-column by 16-row matrix. The intersection of each column and row defines a coded character.
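The matrix layout can be sketched as follows. This example assumes the convention used in most published code page charts, where the column supplies the high-order hexadecimal digit and the row supplies the low-order digit; the text above does not state which axis is which, and the function names are hypothetical.

```python
# A 16-column by 16-row matrix has 256 cells, so a code page can hold
# at most 256 coded characters (one byte each).
# Assumption: column = high-order hex digit, row = low-order hex digit.

def coded_value(column, row):
    """Return the one-byte value at the intersection of column and row."""
    if not (0 <= column <= 15 and 0 <= row <= 15):
        raise ValueError("column and row must each be in the range 0-15")
    return (column << 4) | row

def matrix_cell(value):
    """Return the (column, row) pair for a one-byte value."""
    if not 0 <= value <= 0xFF:
        raise ValueError("a code page value must fit in one byte")
    return divmod(value, 16)

value = coded_value(4, 1)                      # column 4, row 1
print(hex(value), bytes([value]).decode("iso8859_1"))
```

Under this convention, column 4, row 1 is the value 0x41, which ISO8859-1 assigns to the letter A.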

Consider the following when working with code sets:
  • Do not assume that every character is 8 bits, or 1 byte, in size. A character may occupy 1, 2, 3, 4, or more bytes.
  • Do not assume the encoding of any code set.
  • Do not hard code names of code sets, locales, or fonts because it can impact portability.
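These considerations can be sketched in a short example. It assumes a Python environment in which the standard `locale` module is available; `locale.nl_langinfo` exists only on platforms that provide it, so the sketch falls back to `locale.getpreferredencoding`. The helper names are hypothetical.

```python
import locale

def bytes_per_character(text, code_set):
    """Return the encoded size in bytes of each character in text."""
    return [len(ch.encode(code_set)) for ch in text]

def current_code_set():
    """Ask the locale for its code set instead of hard coding a name."""
    try:
        locale.setlocale(locale.LC_CTYPE, "")  # adopt the user's environment
    except locale.Error:
        pass  # unsupported locale setting: stay in the default locale
    if hasattr(locale, "nl_langinfo"):
        return locale.nl_langinfo(locale.CODESET)
    return locale.getpreferredencoding(False)

# Latin letter, accented letter, euro sign, and a musical symbol.
sample = "a\u00e9\u20ac\U0001d11e"
print(bytes_per_character(sample, "utf-8"))
print(current_code_set())
```

In UTF-8 the four sample characters occupy 1, 2, 3, and 4 bytes respectively, so code that steps through a buffer one byte at a time would split characters. The value printed by `current_code_set` depends on the environment, which is the point: the program discovers the code set at run time.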
The following code sets are supported:
  • Support for industry-standard code sets is provided. The ISO8859 family of code sets provides a range of single-byte code set support that includes:
    • Latin-1
    • Latin-2
    • Latin-4
    • Cyrillic
    • Arabic
    • Greek
    • Hebrew
    • Turkish
    The following industry-standard code sets are available:
    • The IBM-eucJP code set is the industry-standard code set used to support the Japanese locale.
    • The IBM-eucKR code set is the industry-standard code set used to support Korean.
    • The IBM-eucTW code set is the industry-standard code set used to support countries using Traditional Chinese.
    • The IBM-eucCN code set is the industry-standard code set used to support countries using Simplified Chinese.
    • The UTF-8 code set is a Universal Transformation Format of Unicode/ISO10646 used to support multiple languages at once (including Simplified Chinese, Traditional Chinese, and Chinese characters used in Japanese and Korean).
  • The ISO8859-15 standard code set is a replacement for the existing ISO8859-1 code set that is currently in use by the western European locales, the United States, and Canada. The need for another code set resulted from the introduction of the euro currency unit and the need for European countries to be able to do business transactions using the euro. In addition, ISO8859-15 contains 7 additional characters for the French and Finnish languages.
  • Support is also provided for the personal computer (PC) based code sets IBM-856, IBM-943, and IBM-1046. IBM-856 is a single-byte code set used to support Hebrew. IBM-943 is a multibyte code set used to support the Japanese locale. IBM-1046 is a single-byte code set used to support Arabic.
  • IBM-1129 is a single-byte code set used to support Vietnamese.
  • TIS-620 is a single-byte code set used to support Thai.
  • IBM-1124 is a single-byte code set used to support Ukrainian.
  • Full Unicode support is provided by the UTF-8 code set for all languages and territories supported by AIX. The UTF-8 code set is a Universal Transformation Format of Unicode/ISO10646 used to support multiple languages at once. The UTF-8 code set provides the most complete solution for use in environments where multiple languages and alphabets must be processed. The Unicode/UTF-8 code set also provides full support for the common European currency (euro).
  • IBM-1252 code set support is provided as a compatibility option for users who require a single-byte code set environment containing the euro currency symbol. The structure of the IBM-1252 code set is identical to the industry-standard code set ISO8859-1, except that additional graphic characters are added in the ISO control character range from 0x80 through 0x9F. The euro currency symbol is located at hexadecimal value 0x80 in the IBM-1252 code set.
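The placement of the euro symbol in the code sets listed above can be checked with a minimal sketch. It assumes that Python's `cp1252` codec matches IBM-1252 for the values shown, and it uses Python's codec names `iso8859_1` and `iso8859_15` for the ISO8859 code sets; the helper `euro_bytes` is hypothetical.

```python
EURO = "\u20ac"  # EURO SIGN

def euro_bytes(code_set):
    """Return the bytes that encode the euro sign, or None if it is absent."""
    try:
        return EURO.encode(code_set)
    except UnicodeEncodeError:
        return None

for code_set in ("iso8859_1", "iso8859_15", "cp1252", "utf-8"):
    raw = euro_bytes(code_set)
    print(f"{code_set:11}", "not available" if raw is None else raw.hex())

# Outside the range 0x80 through 0x9F, cp1252 and ISO8859-1 agree.
upper_range = bytes(range(0xA0, 0x100))
same_upper_range = upper_range.decode("cp1252") == upper_range.decode("iso8859_1")
```

ISO8859-1 has no euro symbol, while the 1252 code set places it at the single byte 0x80 and UTF-8 encodes it as a multibyte sequence.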