Language-aware collations are based on the weight tables
that the DB2® database uses for
non-Unicode databases. The weight table is translated to Unicode and
applied to Unicode data.
Strings are ordered by providing a single weight for each
character from a subset of the Unicode characters. The weight tables
are derived from the DB2 databases's
non-Unicode collation tables. Characters which are not in the weight
table are sorted in binary order. As a result, words are sorted in
a manner that comes close to linguistic correctness.
Substring
matching is done using the internal representation of the string.
This means that two substrings will only be considered a match if
they are byte-for-byte identical and linguistic rules will not be
considered.
For more information about character weights used
in language-aware collations, see System and language-aware collation tables.
- Advantages
- Provides a language based ordering for the code points in the
weight table.
- Very fast collation (comparable to non-Unicode and IDENTITY collations).
- Same order for character and graphic types. 1
- Collation doesn't change for a migration from a non-Unicode database
to a Unicode database.
1 In character strings, the range of characters
from x'EE8080' to x'EFBFBF' inclusive (U+E000 - U+FFFF) sort after
the supplementary characters from x'F0908080' to x'F48FBFBF' (U+10000
- U+10FFFF). However, on the certain platforms, the supplementary
characters sort before the range of characters from x'EE8080' to x'EFBFBF'.
The affected platforms are: Linux AMD 64, HP-UX IA64, SUN SPARC64, SUN AMD x64.
Application
can cast character strings as graphic strings before order by clause
to get the sort order as graphic strings on all affected platforms.
Example:
db2 "create db utf using codeset UTF-8 collate using SYSTEM_819_US"
db2 "connect to utf"
db2 "create table t2 ( c1 int, c2 varchar(200),c3 vargraphic(200))"
db2 "insert into t2 values (1, U&'\E000', U&'\E000' )"
db2 "insert into t2 values (2, U&'\FFFF', U&'\FFFF' )"
db2 "insert into t2 values (3, U&'\+10FFFE', U&'\+10FFFE' )"
db2 "insert into t2 values (4, U&'\+10FFFF', U&'\+10FFFF' )"
db2 "select hex(c2) from t2 order by c2"Result
for the above query with correct sort order for character column c2:
- F48FBFBE
- F48FBFBF
- EE8080
- EFBFBF
Reversed sort order on affected platforms:
- EE8080
- EFBFBF
- F48FBFBE
- F48FBFBF
Workaround to get sort as graphic:
db2 'select
hex(c2) from t2 order by graphic (c2)'- F48FBFBE
- F48FBFBF
- EE8080
- EFBFBF
- Disadvantages
- Only covers 256 code points in the weight table. Other code points
are sorted in binary order.
- The original non-Unicode weight tables did not handle MBCS characters
properly. When the tables were moved to Unicode, only the SBCS portion
of each table was used.
- These collations do not handle combining accents.
- Substring matching is not linguistic.
Language-aware collations are suitable when a reasonable
ordering is needed, but the overhead of a locale-sensitive UCA-based
collation is not acceptable.
Example
To demonstrate the behavior of this collation,
the following list of Czech words is used.
- chleb1
- Čech
- C◌̌̌ech2
- Jana
- hlava
- Jaroslav
- holub
- cena
- jaro
- čas
- c◌̌as3
The database with the language-aware collation was created
using the following command: CREATE DATABASE TESTDB COLLATE USING SYSTEM_912_CZ.
Sorting:
SELECT WORD FROM TESTDATA ORDER BY WORD
WORD
----------
cena
chleb
c◌̌as
C◌̌ech
čas
Čech
hlava
holub
jaro
Jana
Jaroslav
In the results of the ORDER BY command, notice:
- Upper case, lower case, and accented letters are grouped together.
- Characters with combining accents are grouped with the unaccented
characters.
- Case and accent differences are treated as significant. For example, Jana is ordered between jaro and Jaroslav.
- The word chleb is incorrectly grouped with words
starting with the letter c.
Substring matching:
SELECT WORD FROM TESTDATA WHERE WORD LIKE 'c%'
WORD
----------
cena
chleb
c◌̌as
In the results of the
LIKE command, notice:
- The word c◌̌as is selected, even though it does not start with c. It starts with the character č.
- The word chleb is selected, even though the digraph ch does not linguistically match the letter c.
2 In Unicode, the accented character
Č can be entered as a single Unicode code point, U+010C (Latin
capital letter C with caron) or as two code points, U+0043 U+030C
(Latin capital letter C, combining caron). The two representations
appear the same on a computer screen or a printout, but they have
different internal representations. For the purposes of the examples,
however, the characters will be drawn differently; U+010C will be
drawn as
Č and U+0043 U+030C will be drawn as
C◌̌. To demonstrate combining accents, both forms are included
in the word list.
3 In Unicode, the accented character
č can be entered as a single Unicode code point, U+010D (Latin
small letter c with caron) or as two code points, U+0063 U+030C (Latin
small letter c, combining caron). The two representations appear the
same on a computer screen or a printout, but they have different internal
representations. For the purposes of the examples, however, the characters
will be drawn differently; U+010D will be drawn as
č and U+0063 U+030C will be drawn as
c◌̌. To demonstrate combining accents, both forms are included
in the word list.