DB2 Version 9.7 for Linux, UNIX, and Windows

Language-aware collation

Language-aware collations are based on the weight tables that the DB2® database uses for non-Unicode databases. The weight table is translated to Unicode and applied to Unicode data.

Strings are ordered by providing a single weight for each character from a subset of the Unicode characters. The weight tables are derived from the DB2 databases's non-Unicode collation tables. Characters which are not in the weight table are sorted in binary order. As a result, words are sorted in a manner that comes close to linguistic correctness.

Substring matching is done using the internal representation of the string. This means that two substrings will only be considered a match if they are byte-for-byte identical and linguistic rules will not be considered.

For more information about character weights used in language-aware collations, see System and language-aware collation tables.

Advantages
  • Provides a language based ordering for the code points in the weight table.
  • Very fast collation (comparable to non-Unicode and IDENTITY collations).
  • Same order for character and graphic types. 1
  • Collation doesn't change for a migration from a non-Unicode database to a Unicode database.

1 In character strings, the range of characters from x'EE8080' to x'EFBFBF' inclusive (U+E000 - U+FFFF) sort after the supplementary characters from x'F0908080' to x'F48FBFBF' (U+10000 - U+10FFFF). However, on the certain platforms, the supplementary characters sort before the range of characters from x'EE8080' to x'EFBFBF'. The affected platforms are: Linux AMD 64, HP-UX IA64, SUN SPARC64, SUN AMD x64.

Application can cast character strings as graphic strings before order by clause to get the sort order as graphic strings on all affected platforms.

Example:

db2 "create db utf using codeset UTF-8 collate using  SYSTEM_819_US"
db2 "connect to utf"
db2 "create table t2 ( c1 int, c2 varchar(200),c3 vargraphic(200))"
db2 "insert into t2 values (1, U&'\E000', U&'\E000' )"
db2 "insert into t2 values (2, U&'\FFFF', U&'\FFFF' )"
db2 "insert into t2 values (3, U&'\+10FFFE', U&'\+10FFFE' )"
db2 "insert into t2 values (4, U&'\+10FFFF', U&'\+10FFFF' )"
db2 "select hex(c2) from t2 order by c2"

Result for the above query with correct sort order for character column c2:

  • F48FBFBE
  • F48FBFBF
  • EE8080
  • EFBFBF

Reversed sort order on affected platforms:

  • EE8080
  • EFBFBF
  • F48FBFBE
  • F48FBFBF

Workaround to get sort as graphic:

db2 'select hex(c2) from t2 order by graphic (c2)'
  • F48FBFBE
  • F48FBFBF
  • EE8080
  • EFBFBF
Disadvantages
  • Only covers 256 code points in the weight table. Other code points are sorted in binary order.
  • The original non-Unicode weight tables did not handle MBCS characters properly. When the tables were moved to Unicode, only the SBCS portion of each table was used.
  • These collations do not handle combining accents.
  • Substring matching is not linguistic.

Language-aware collations are suitable when a reasonable ordering is needed, but the overhead of a locale-sensitive UCA-based collation is not acceptable.

Example

To demonstrate the behavior of this collation, the following list of Czech words is used.
  • chleb1
  • Čech
  • C◌̌̌ech2
  • Jana
  • hlava
  • Jaroslav
  • holub
  • cena
  • jaro
  • čas
  • c◌̌as3

The database with the language-aware collation was created using the following command: CREATE DATABASE TESTDB COLLATE USING SYSTEM_912_CZ.

Sorting:
SELECT WORD FROM TESTDATA ORDER BY WORD

WORD
----------
cena
chleb
c◌̌as
C◌̌ech
čas
Čech
hlava
holub
jaro
Jana
Jaroslav
In the results of the ORDER BY command, notice:
  • Upper case, lower case, and accented letters are grouped together.
  • Characters with combining accents are grouped with the unaccented characters.
  • Case and accent differences are treated as significant. For example, Jana is ordered between jaro and Jaroslav.
  • The word chleb is incorrectly grouped with words starting with the letter c.
Substring matching:
SELECT WORD FROM TESTDATA WHERE WORD LIKE 'c%'

WORD 
---------- 
cena 
chleb 
c◌̌as
In the results of the LIKE command, notice:
  • The word c◌̌as is selected, even though it does not start with c. It starts with the character č.
  • The word chleb is selected, even though the digraph ch does not linguistically match the letter c.
1 In Czech, the digraph ch is sorted separately from the letter c and is ordered between the letters h and i.
2 In Unicode, the accented character Č can be entered as a single Unicode code point, U+010C (Latin capital letter C with caron) or as two code points, U+0043 U+030C (Latin capital letter C, combining caron). The two representations appear the same on a computer screen or a printout, but they have different internal representations. For the purposes of the examples, however, the characters will be drawn differently; U+010C will be drawn as Č and U+0043 U+030C will be drawn as C◌̌. To demonstrate combining accents, both forms are included in the word list.
3 In Unicode, the accented character č can be entered as a single Unicode code point, U+010D (Latin small letter c with caron) or as two code points, U+0063 U+030C (Latin small letter c, combining caron). The two representations appear the same on a computer screen or a printout, but they have different internal representations. For the purposes of the examples, however, the characters will be drawn differently; U+010D will be drawn as č and U+0063 U+030C will be drawn as c◌̌. To demonstrate combining accents, both forms are included in the word list.