Unicode considerations for database files

Unicode is a universal encoding scheme for written characters and text that enables the exchange of data internationally.

This topic discusses how to specify DDS positions 30 through 37 and positions 45 through 80 for describing database files. Positions not mentioned have no special considerations for Unicode.

A Unicode field can contain all types of characters used on the IBM® i platform, including double-byte character set (DBCS) characters. Unicode data is composed of code units. A code unit is the minimal byte combination that can represent a unit of text.

These transformation formats (encoding forms) of Unicode are supported with physical and logical file DDS:

UTF-8 is an 8-bit encoding form designed for ease of use with existing ASCII-based systems. UTF-8 data is stored in character data types. The CCSID value for data in UTF-8 format is 1208.
A UTF-8 code unit is 1 byte in length. A UTF-8 character can be 1, 2, 3, or 4 code units in length. A UTF-8 data string can contain any character, including combining characters and the supplementary characters that UTF-16 represents with surrogates.
UTF-16 is a 16-bit encoding form designed to provide code values for over a million characters. It is a superset of UCS-2. UTF-16 data is stored in graphic data types. The CCSID value for data in UTF-16 format is 1200.
A UTF-16 code unit is 2 bytes in length. A UTF-16 character can be 1 or 2 code units (2 or 4 bytes) in length. A UTF-16 data string can contain any character, including UTF-16 surrogates and combining characters.
UCS-2 is the Universal Character Set coded in 2 octets, which means that each character is represented in 16 bits. UCS-2 data is stored in graphic data types. The CCSID value for data in UCS-2 format is 13488.
UCS-2 is a subset of UTF-16, and can no longer support all of the characters defined by Unicode. UCS-2 is identical to UTF-16, except that UTF-16 also supports combining characters and surrogates. If you do not need support for combining characters and surrogates, then you can choose to use the UCS-2 type, because there is more database functionality available for it.
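
The code unit lengths described above can be checked outside the database. The following Python sketch is illustrative only and is not part of DDS or of any IBM i interface. The helper names code_units and fits_in_ucs2 are hypothetical, and the UCS-2 check assumes that a string fits in UCS-2 when every character can be stored in a single 16-bit code unit.

```python
def code_units(text: str, encoding: str) -> int:
    """Count the code units needed to store text in the given encoding form."""
    # Code unit size in bytes: 1 for UTF-8, 2 for UTF-16.
    # "utf-16-be" is used so that Python does not prepend a byte order mark.
    unit_size = {"utf-8": 1, "utf-16-be": 2}[encoding]
    return len(text.encode(encoding)) // unit_size


def fits_in_ucs2(text: str) -> bool:
    """Assumption: text fits in UCS-2 if no character needs a surrogate pair."""
    return all(ord(ch) <= 0xFFFF for ch in text)


samples = {
    "LATIN CAPITAL LETTER A": "A",
    "LATIN SMALL LETTER E WITH ACUTE": "\u00e9",
    "EURO SIGN": "\u20ac",
    "GRINNING FACE": "\U0001F600",  # supplementary character
}

for name, ch in samples.items():
    print(
        f"{name}: "
        f"UTF-8 code units = {code_units(ch, 'utf-8')}, "
        f"UTF-16 code units = {code_units(ch, 'utf-16-be')}, "
        f"fits in UCS-2 = {fits_in_ucs2(ch)}"
    )
```

In this sketch, the first three sample characters each take one UTF-16 code unit, while the last one takes two (a surrogate pair) and therefore does not fit in UCS-2. Their UTF-8 lengths are 1, 2, 3, and 4 code units respectively.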

Note: In this topic, references to UTF-16 imply UCS-2 as well.