Unicode considerations for database files: Length (positions 30 through 34)

The length of a field containing UTF-16 data can range from 1 through 16,383 code units. The length of a field containing UTF-8 data can range from 1 through 32,766 code units.

When determining the length of a field containing Unicode data, consider the following rules:
  • Each UTF-16 code unit is 2 bytes long.
  • The length of the field is specified in the number of UTF-16 code units. For example, a field containing 3 UTF-16 code units has 6 bytes of data.
  • Each UTF-8 code unit is 1 byte long. A UTF-8 character can be 1, 2, 3, or 4 code units in length.
  • After conversion between Unicode data and EBCDIC, the resulting data can be equal to, longer than, or shorter than the original length of the data. For example, one UTF-16 code unit is composed of 2 bytes of data. That character might convert to one single-byte character set (SBCS) character composed of 1 byte of data, one graphic double-byte character set (DBCS) character composed of 2 bytes of data, or one bracketed DBCS character composed of 4 bytes of data. Therefore, when a Unicode field in the physical file is converted to a field with a different type in the logical file, it is suggested that the field in the logical file be defined with the VARLEN keyword. The length of the logical file field should be large enough to hold the maximum size that the Unicode field can convert to, which accounts for the expansion that can occur.
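The code-unit and byte-length rules above can be sketched with a short example. This is an illustration only, using Python's utf-16-be, utf-8, and cp037 (single-byte EBCDIC code page 37) codecs to stand in for the database conversions; the function names are hypothetical, not part of any system API.

```python
def utf16_code_units(s: str) -> int:
    # Each UTF-16 code unit is 2 bytes, so halve the encoded byte length.
    return len(s.encode("utf-16-be")) // 2

def utf8_code_units(s: str) -> int:
    # Each UTF-8 code unit is 1 byte, so the byte length is the unit count.
    return len(s.encode("utf-8"))

# "A" is one code unit in both encodings: 2 bytes in UTF-16, 1 byte in UTF-8.
print(utf16_code_units("A"), utf8_code_units("A"))            # 1 1

# U+20AC EURO SIGN: one UTF-16 code unit, but three UTF-8 code units.
print(utf16_code_units("\u20ac"), utf8_code_units("\u20ac"))  # 1 3

# Conversion to EBCDIC can shrink the data: the 2 bytes of UTF-16 "A"
# become 1 SBCS byte in EBCDIC code page 37.
print(len("A".encode("utf-16-be")), len("A".encode("cp037")))  # 2 1
```

A DBCS or bracketed-DBCS target would instead expand that same code unit to 2 or 4 bytes, which is why the VARLEN suggestion above sizes the logical field for the worst case.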
On a logical file, if the length is not specified and a UTF-16-to-EBCDIC conversion takes place, the length of the corresponding physical file field is used, except in the following case:
  • If the physical file field is UTF-16 capable and the logical file field has a data type of O, the length of the logical file field is 2 times the field size of the physical file field.
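The default-length rule can be expressed as a small helper. This is a sketch of the rule as stated above, not a system API; the function and parameter names are invented for illustration.

```python
def logical_field_length(physical_field_size: int, logical_type: str,
                         physical_is_utf16: bool) -> int:
    """Default length of a logical file field when none is specified,
    for a UTF-16-to-EBCDIC conversion (rule stated in the text)."""
    if physical_is_utf16 and logical_type == "O":
        # Data type O on a UTF-16-capable physical field: 2x the field size.
        return 2 * physical_field_size
    # Otherwise the physical file field's length is taken as-is.
    return physical_field_size

# A 10-code-unit UTF-16 physical field mapped to a type-O logical field.
print(logical_field_length(10, "O", True))   # 20
# Any other mapping simply inherits the physical field's length.
print(logical_field_length(10, "A", True))   # 10
```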