Language-aware collation
Language-aware collations are based on the weight tables that the Db2® database uses for non-Unicode databases. The weight table is translated to Unicode and applied to Unicode data.
Strings are ordered by providing a single weight for each character from a subset of the Unicode characters. The weight tables are derived from the Db2 databases's non-Unicode collation tables. Characters which are not in the weight table are sorted in binary order. As a result, words are sorted in a manner that comes close to linguistic correctness.
Substring matching is done using the internal representation of the string. This means that two substrings will only be considered a match if they are byte-for-byte identical and linguistic rules will not be considered.
For more information about character weights used in language-aware collations, see System and language-aware collation tables.
- Advantages
- Provides a language based ordering for the code points in the weight table.
- Very fast collation (comparable to non-Unicode and IDENTITY collations).
- Same order for character and graphic types. 1
- Collation doesn't change for a migration from a non-Unicode database to a Unicode database.
1 In character strings, the range of characters from x'EE8080' to x'EFBFBF' inclusive (U+E000 - U+FFFF) sort after the supplementary characters from x'F0908080' to x'F48FBFBF' (U+10000 - U+10FFFF). However, on the certain platforms, the supplementary characters sort before the range of characters from x'EE8080' to x'EFBFBF'. The affected platforms are: Linux® AMD 64, HP-UX IA64, SUN SPARC64, SUN AMD x64.
Application can cast character strings as graphic strings before order by clause to get the sort order as graphic strings on all affected platforms.
Example:
db2 "create db utf using codeset UTF-8 collate using SYSTEM_819_US" db2 "connect to utf" db2 "create table t2 ( c1 int, c2 varchar(200),c3 vargraphic(200))" db2 "insert into t2 values (1, U&'\E000', U&'\E000' )" db2 "insert into t2 values (2, U&'\FFFF', U&'\FFFF' )" db2 "insert into t2 values (3, U&'\+10FFFE', U&'\+10FFFE' )" db2 "insert into t2 values (4, U&'\+10FFFF', U&'\+10FFFF' )"
db2 "select hex(c2) from t2 order by c2"
Result for the previous query with correct sort order for character column c2:
- F48FBFBE
- F48FBFBF
- EE8080
- EFBFBF
Reversed sort order on affected platforms:
- EE8080
- EFBFBF
- F48FBFBE
- F48FBFBF
Workaround to get sort as graphic:
db2 'select hex(c2) from t2 order by graphic (c2)'
- F48FBFBE
- F48FBFBF
- EE8080
- EFBFBF
- Disadvantages
- Only covers 256 code points in the weight table. Other code points are sorted in binary order.
- The original non-Unicode weight tables did not handle MBCS characters properly. When the tables were moved to Unicode, only the SBCS portion of each table was used.
- These collations do not handle combining accents.
- Substring matching is not linguistic.
Language-aware collations are suitable when a reasonable ordering is needed, but the additional processing required of a locale-sensitive UCA-based collation is not acceptable.
Examples
The
database with the language-aware collation was created using the following
command: CREATE DATABASE TESTDB COLLATE USING SYSTEM_912_CZ
.
SELECT WORD FROM TESTDATA ORDER BY WORD
WORD
----------
cena
chleb
c◌̌as
C◌̌ech
čas
Čech
hlava
holub
jaro
Jana
Jaroslav
- Upper case, lower case, and accented letters are grouped together.
- Characters with combining accents are grouped with the unaccented characters.
- Case and accent differences are treated as significant. For example, Jana is ordered between jaro and Jaroslav.
- The word chleb is incorrectly grouped with words starting with the letter c.
SELECT WORD FROM TESTDATA WHERE WORD LIKE 'c%'
WORD
----------
cena
chleb
c◌̌as
- The word c◌̌as is selected, even though it does not start with c. It starts with the character č .
- The word chleb is selected, even though the digraph ch does not linguistically match the letter c.