Collating sequences

The database manager compares character data using a collating sequence. This is an ordering for a set of characters that determines whether a particular character sorts higher, lower, or the same as another.

The Unicode Collation Algorithm (UCA) uses weight tables to determine the collating sequence.

Note: Character string data defined with the FOR BIT DATA attribute, and BLOB data, is sorted using the binary sort sequence.

For example, a collating sequence can be used to indicate that lowercase and uppercase versions of a particular character are to be sorted equally.

The database manager allows databases to be created with custom collating sequences. The following sections help you determine and implement a particular collating sequence for a database.

Each single-byte character in a database is represented internally as a unique number between 0 and 255 (in hexadecimal notation, between X'00' and X'FF'). This number is referred to as the code point of the character; the assignment of numbers to characters in a set is collectively called a code page. A collating sequence is a mapping between the code point and the requried position of each character in a sorted sequence. The numeric value of the position is called the weight of the character in the collating sequence. In the simplest collating sequence, the weights are identical to the code points. This is called the identity sequence.

For example, suppose the characters B and b have the code points X'42' and X'62'. If (according to the collating sequence table) they both have a sort weight of X'42' (B), they collate the same. If the sort weight for B is X'9E', and the sort weight for b is X'9D', b will be sorted before B. The collating sequence table specifies the weight of each character. The table is different from a code page, which specifies the code point of each character.

Consider the following example. The ASCII characters A through Z are represented by X'41' through X'5A'. To describe a collating sequence in which these characters are sorted consecutively (no intervening characters), you can write: X'41', X'42', ... X'59', X'5A'.

The hexadecimal value of a multibyte character is also used as the weight. For example, suppose the code points for the double-byte characters A and B are X'8260' and X'8261' respectively, then the collation weights for X'82', X'60', and X'61' are used to sort these two characters according to their code points.

The weights in a collating sequence need not be unique. For example, you could give uppercase letters and their lowercase equivalents the same weight.

Specifying a collating sequence can be simplified if the collating sequence provides weights for all 256 code points. The weight of each character can be determined using the code point of the character.

In all cases, the Db2® database uses the collation table that was specified at database creation time. If you want the multibyte characters to be sorted the way that they appear in their code point table, you must specify IDENTITY as the collating sequence when you create the database.

Once a collating sequence is defined, all future character comparisons for that database will be performed with that collating sequence. Except for character data defined as FOR BIT DATA or BLOB data, the collating sequence will be used for all SQL comparisons and ORDER BY clauses, and also in setting up indexes and statistics.

Potential problems can occur in the following cases:
  • An application merges sorted data from a database with application data that was sorted using a different collating sequence.
  • An application merges sorted data from one database with sorted data from another, but the databases have different collating sequences.
  • An application makes assumptions about sorted data that are not true for the relevant collating sequence. For example, numbers collating lower than alphabetics might or might not be true for a particular collating sequence.

A final point to remember is that the results of any sort based on a direct comparison of character code points will only match query results that are ordered using an identity collating sequence.

In a Unicode database, graphic data is sorted using the database collation mechanism. In a non-Unicode database with SYSTEM collation, the graphic data is collated based on the weight of each byte.