UTF-8 support

UTF stands for "UCS (Unicode) Transformation Format". The UTF-8 encoding can be used to represent any Unicode character. Depending on a Unicode character’s numeric value, the corresponding UTF-8 character is a sequence of 1 to 4 bytes; characters in the range 0000-FFFF, which Table 1 covers, use 1, 2, or 3 bytes. Table 1 shows the mapping between Unicode and UTF-8 for that range. For more information about UTF-8, see RFC 2279: UTF-8, a transformation format of ISO 10646, and RFC 2253: Lightweight Directory Access Protocol (v3): UTF-8 String Representation of Distinguished Names.

Table 1. Mapping between Unicode and UTF-8
Unicode range (hexadecimal)	UTF-8 octet sequence (binary)
0000-007F	0xxxxxxx
0080-07FF	110xxxxx 10xxxxxx
0800-FFFF	1110xxxx 10xxxxxx 10xxxxxx
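
The mapping in Table 1 can be illustrated with a short Python sketch. It is not part of the LDAP server or its client API; the function name encode_utf8 is hypothetical, and the sketch handles only the ranges listed in the table (0000-FFFF). The x positions in each octet pattern are filled with the bits of the Unicode value, most significant bits first.

```python
def encode_utf8(code_point: int) -> bytes:
    """Encode one Unicode value in 0000-FFFF using the patterns in Table 1.

    Hypothetical helper for illustration only.
    """
    if not 0 <= code_point <= 0xFFFF:
        raise ValueError("only the ranges in Table 1 (0000-FFFF) are handled")
    if 0xD800 <= code_point <= 0xDFFF:
        # Surrogate values are not characters and have no UTF-8 form.
        raise ValueError("surrogate values cannot be encoded")

    if code_point <= 0x7F:
        # 0000-007F: 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:
        # 0080-07FF: 110xxxxx 10xxxxxx
        return bytes([
            0xC0 | (code_point >> 6),
            0x80 | (code_point & 0x3F),
        ])
    # 0800-FFFF: 1110xxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xE0 | (code_point >> 12),
        0x80 | ((code_point >> 6) & 0x3F),
        0x80 | (code_point & 0x3F),
    ])


if __name__ == "__main__":
    # One sample value from each row of Table 1.
    for cp in (0x0041, 0x00E9, 0x20AC):
        print(f"{cp:04X} -> {encode_utf8(cp).hex(' ')}")
```

In practice an application would rely on its language's built-in UTF-8 support (in Python, str.encode("utf-8")) rather than a hand-written encoder; the sketch only makes the bit layout in the table concrete.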

The LDAP Version 3 protocol specifies that all data exchanged between LDAP clients and servers be UTF-8. The LDAP server supports UTF-8 data exchange as part of its Version 3 protocol support.

Note: For UTF-8 data stored in an LDAP server’s TDBM and GDBM (when DB2-based) backends, collation for single-byte UTF-8 characters is relative to the server’s locale. For multi-byte UTF-8 characters, collation is relative to the numeric value of the equivalent Unicode character.