Language-aware collation

Language-aware collations are based on the weight tables that the Db2® database uses for non-Unicode databases. The weight table is translated to Unicode and applied to Unicode data.

Strings are ordered by providing a single weight for each character from a subset of the Unicode characters. The weight tables are derived from the Db2 databases's non-Unicode collation tables. Characters which are not in the weight table are sorted in binary order. As a result, words are sorted in a manner that comes close to linguistic correctness.

Substring matching is done using the internal representation of the string. This means that two substrings will only be considered a match if they are byte-for-byte identical and linguistic rules will not be considered.

For more information about character weights used in language-aware collations, see System and language-aware collation tables.

Advantages

Provides a language based ordering for the code points in the weight table.
Very fast collation (comparable to non-Unicode and IDENTITY collations).
Same order for character and graphic types. ¹
Collation doesn't change for a migration from a non-Unicode database to a Unicode database.

¹ In character strings, the range of characters from x'EE8080' to x'EFBFBF' inclusive (U+E000 - U+FFFF) sort after the supplementary characters from x'F0908080' to x'F48FBFBF' (U+10000 - U+10FFFF). However, on the certain platforms, the supplementary characters sort before the range of characters from x'EE8080' to x'EFBFBF'. The affected platforms are: Linux® AMD 64, HP-UX IA64, SUN SPARC64, SUN AMD x64.

Application can cast character strings as graphic strings before order by clause to get the sort order as graphic strings on all affected platforms.

Example:

db2 "create db utf using codeset UTF-8 collate using  SYSTEM_819_US"
db2 "connect to utf"
db2 "create table t2 ( c1 int, c2 varchar(200),c3 vargraphic(200))"
db2 "insert into t2 values (1, U&'\E000', U&'\E000' )"
db2 "insert into t2 values (2, U&'\FFFF', U&'\FFFF' )"
db2 "insert into t2 values (3, U&'\+10FFFE', U&'\+10FFFE' )"
db2 "insert into t2 values (4, U&'\+10FFFF', U&'\+10FFFF' )"

db2
"select hex(c2) from t2 order by c2"

Result for the previous query with correct sort order for character column c2:

F48FBFBE
F48FBFBF
EE8080
EFBFBF

Reversed sort order on affected platforms:

EE8080
EFBFBF
F48FBFBE
F48FBFBF

Workaround to get sort as graphic:

db2 'select
hex(c2) from t2 order by graphic (c2)'

F48FBFBE
F48FBFBF
EE8080
EFBFBF

Disadvantages

Only covers 256 code points in the weight table. Other code points are sorted in binary order.
The original non-Unicode weight tables did not handle MBCS characters properly. When the tables were moved to Unicode, only the SBCS portion of each table was used.
These collations do not handle combining accents.
Substring matching is not linguistic.

Language-aware collations are suitable when a reasonable ordering is needed, but the additional processing required of a locale-sensitive UCA-based collation is not acceptable.

Examples

To demonstrate the behavior of this collation, the following list of Czech words is used.

chleb¹
Čech
C◌̌̌ech ²
Jana
hlava
Jaroslav
holub
cena
jaro
čas
c◌̌as ³

The database with the language-aware collation was created using the following command: CREATE DATABASE TESTDB COLLATE USING SYSTEM_912_CZ.

Sorting:

SELECT WORD FROM TESTDATA ORDER BY WORD

WORD
----------

cena
chleb
c◌̌as
C◌̌ech
čas
Čech
hlava
holub
jaro
Jana
Jaroslav

In the results of the ORDER BY command, notice:

Upper case, lower case, and accented letters are grouped together.
Characters with combining accents are grouped with the unaccented characters.
Case and accent differences are treated as significant. For example, Jana is ordered between jaro and Jaroslav.
The word chleb is incorrectly grouped with words starting with the letter c.

Substring matching:

SELECT WORD FROM TESTDATA WHERE WORD LIKE 'c%'

WORD 
----------

cena 
chleb 
c◌̌as

In the results of the LIKE command, notice:

The word c◌̌as is selected, even though it does not start with c. It starts with the character č .
The word chleb is selected, even though the digraph ch does not linguistically match the letter c.

¹ In Czech, the digraph ch is sorted separately from the letter c and is ordered between the letters h and i.

² In Unicode, the accented character Č can be entered as a single Unicode code point, U+010C (Latin capital letter C with caron) or as two code points, U+0043 U+030C (Latin capital letter C, combining caron). The two representations appear the same on a computer screen or a printout, but they have different internal representations. For the purposes of the examples, however, the characters will be drawn differently; U+010C will be drawn as Č and U+0043 U+030C will be drawn as C◌̌ . To demonstrate combining accents, both forms are included in the word list.

³ In Unicode, the accented character č can be entered as a single Unicode code point, U+010D (Latin small letter c with caron) or as two code points, U+0063 U+030C (Latin small letter c, combining caron). The two representations appear the same on a computer screen or a printout, but they have different internal representations. For the purposes of the examples, however, the characters will be drawn differently; U+010D will be drawn as č and U+0063 U+030C will be drawn as c◌̌ . To demonstrate combining accents, both forms are included in the word list.