DB2 Version 9.7 for Linux, UNIX, and Windows

Character comparisons based on collating sequences

Once a collating sequence is established for a database with the SYSTEM, NLSCHAR, COMPATIBILITY, or a user-defined collation option, characters are compared by their weights instead of directly by their code point values.

If the weights are not unique, characters that are not identical can compare as equal. String comparison can therefore become a two-phase process:
  1. Compare the characters in each string based on their weights.
  2. If step 1 yields equality, compare the characters of each string based on their code point values.

If the collating sequence contains 256 unique weights, only the first step is performed. If the collating sequence is the identity sequence, only the second step is performed. In either case, performance benefits because only one comparison pass is needed. For Unicode databases, if the collation option is SYSTEM or IDENTITY, the collating sequence is IDENTITY and only the second step is performed.
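The two-phase comparison described above can be sketched as follows. This is a minimal illustration for single-byte data, not the database's implementation: the function `compare_two_phase` and the weight tables `CASE_BLIND` and `IDENTITY` are hypothetical names invented for the example, and the sketch does not model any padding rules the database may apply to strings of unequal length.

```python
from typing import Sequence


def compare_two_phase(a: bytes, b: bytes, weights: Sequence[int]) -> int:
    """Return -1, 0, or 1 for a < b, a == b, a > b (illustrative only)."""
    # Phase 1: compare the strings by the weight of each character.
    wa = [weights[x] for x in a]
    wb = [weights[x] for x in b]
    if wa != wb:
        return -1 if wa < wb else 1
    # Phase 2: the weights are equal, so fall back to code point values.
    if a != b:
        return -1 if a < b else 1
    return 0


# Hypothetical weight table with non-unique weights:
# each lowercase ASCII letter shares the weight of its uppercase form.
CASE_BLIND = list(range(256))
for c in range(ord("a"), ord("z") + 1):
    CASE_BLIND[c] = c - 32

# Identity sequence: every weight equals its code point, so phase 1
# already gives the code point order and phase 2 adds nothing.
IDENTITY = list(range(256))

print(compare_two_phase(b"abc", b"ABC", CASE_BLIND))  # tie in phase 1, decided in phase 2
print(compare_two_phase(b"abc", b"ABD", CASE_BLIND))  # decided in phase 1
print(compare_two_phase(b"abc", b"ABD", IDENTITY))    # pure code point order
```

With the case-blind table, `abc` and `ABC` tie on weights and are separated only by their code points. With the identity table, a single pass over the code points gives the same result as running both phases.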

A Unicode database with the IDENTITY_16BIT collation option collates the CHAR and VARCHAR data in the database according to CESU-8 binary order instead of UTF-8 binary order. The two orders are identical for non-supplementary characters. A supplementary character, however, is represented by one 4-byte sequence in UTF-8 but by two 3-byte sequences in CESU-8, which results in different collation orders.
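The difference between the two binary orders can be seen with a small sketch. Python has no built-in CESU-8 codec, so the helper `cesu8_encode` below is a hypothetical function written for this illustration: it splits each supplementary character into a UTF-16 surrogate pair and encodes each surrogate as a 3-byte sequence.

```python
def cesu8_encode(s: str) -> bytes:
    """Encode a string as CESU-8 (illustrative helper, not a DB2 API)."""
    out = bytearray()
    for ch in s:
        cp = ord(ch)
        if cp >= 0x10000:
            # Supplementary character: split into a UTF-16 surrogate pair,
            # then encode each surrogate as its own 3-byte sequence.
            cp -= 0x10000
            high = 0xD800 + (cp >> 10)
            low = 0xDC00 + (cp & 0x3FF)
            out += chr(high).encode("utf-8", "surrogatepass")
            out += chr(low).encode("utf-8", "surrogatepass")
        else:
            # Non-supplementary characters are encoded the same as in UTF-8.
            out += ch.encode("utf-8", "surrogatepass")
    return bytes(out)


bmp_char = "\ufffd"            # non-supplementary: EF BF BD in both encodings
supplementary = "\U00010000"   # UTF-8: F0 90 80 80, CESU-8: ED A0 80 ED B0 80

values = [supplementary, bmp_char]
by_utf8 = sorted(values, key=lambda s: s.encode("utf-8"))
by_cesu8 = sorted(values, key=cesu8_encode)

print([hex(ord(s)) for s in by_utf8])   # ['0xfffd', '0x10000']
print([hex(ord(s)) for s in by_cesu8])  # ['0x10000', '0xfffd']
```

In UTF-8 binary order the supplementary character sorts after U+FFFD because its lead byte is F0. In CESU-8 binary order it sorts before U+FFFD because its first surrogate begins with the byte ED.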

For Unicode databases with locale-sensitive UCA-based collations, semantically equal characters are considered equal even if their binary representations differ. String comparison can therefore become a two-phase process:
  1. Compare the characters in each string according to the algorithm specified in Unicode Technical Standard #10, Unicode Collation Algorithm, available at the Unicode Consortium web site (http://www.unicode.org/).
  2. If step 1 yields equality, compare the characters of each string based on their code point values.
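The shape of this two-phase process can be sketched with the Python standard library. The standard library does not implement the Unicode Collation Algorithm, so this sketch substitutes NFD normalization for step 1. That stand-in detects only canonically equivalent strings and does not reproduce UCA weights or locale-specific ordering; the function name `compare_uca_like` is hypothetical.

```python
import unicodedata


def compare_uca_like(a: str, b: str) -> int:
    """Return -1, 0, or 1; step 1 uses NFD as a stand-in for UCA keys."""
    # Step 1 (stand-in): canonically equivalent strings normalize to the
    # same NFD form, so they compare as equal here. The ordering of
    # unequal strings in this step is NOT the UCA ordering.
    ka = unicodedata.normalize("NFD", a)
    kb = unicodedata.normalize("NFD", b)
    if ka != kb:
        return -1 if ka < kb else 1
    # Step 2: step 1 found the strings equal, so compare code point values.
    if a != b:
        return -1 if a < b else 1
    return 0


precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT

print(compare_uca_like(precomposed, decomposed))
print(compare_uca_like(precomposed, precomposed))
```

The two spellings of the accented letter are equal in step 1 and are ordered only by the code point comparison in step 2, which makes the overall result deterministic.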