Locale-sensitive UCA-based collation

Locale-sensitive collations are based on the full Unicode Collation Algorithm (UCA) specification and provide full cultural correctness.

Strings are ordered according to the Unicode Collation Algorithm. The collation can be tailored for a specific language and for case or accent insensitivity. For more information about UCA, see Unicode Collation Algorithm-based collations.

This algorithm uses multiple weights per character as well as extra processing to handle special cases such as contractions and combining accents. This complexity adds significant processing time.
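As a rough illustration of why combining accents need extra processing, the following Python sketch compares two encodings of the letter č. It does not show how the database implements collation; it only uses the standard unicodedata module to show that the two strings differ as code points yet become identical once both are decomposed (the UCA specifies normalization to form NFD as its first step). The variable and function names are illustrative assumptions.

```python
import unicodedata

# Two encodings of the same letter (names are illustrative):
precomposed = "\u010D"   # LATIN SMALL LETTER C WITH CARON, one code point
combining = "c\u030C"    # LATIN SMALL LETTER C followed by COMBINING CARON


def canonical(text):
    """Return the canonical decomposition (NFD) of text."""
    return unicodedata.normalize("NFD", text)


def code_points(text):
    """List the code points of text in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]


# A plain code point comparison treats the two encodings as different strings.
print(precomposed == combining)
# After decomposition both encodings are the same code point sequence.
print(canonical(precomposed) == canonical(combining))
print(code_points(canonical(precomposed)))
```

A comparison that skips this step would treat the two spellings of the same word as different strings, which is one reason a UCA-based collation does more work per comparison than a code point comparison.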

Substring matching is done using the collation. Substrings are matched in a linguistically meaningful manner.

Advantages
  • Full support of the UCA, including contractions and combining accents.
  • Supports case-insensitive and accent-insensitive collations.
  • Handles all Unicode code points.
  • Allows collations to be tailored to suit different languages.
  • Produces the same order for character and graphic data types.
  • Substring matching is done using the collation.

Disadvantages
  • Substantial performance penalty.

Locale-sensitive UCA-based collations are suitable when fully linguistic ordering is needed and the additional processing time can be tolerated.

Examples

To demonstrate the behavior of this collation, the following list of Czech words is used.
  • chleb 1
  • Čech
  • C◌̌ech 2
  • Jana
  • hlava
  • Jaroslav
  • holub
  • cena
  • jaro
  • čas
  • c◌̌as 3

The database with the locale-sensitive collation was created using the following command:

CREATE DATABASE TESTDB COLLATE USING CLDR181_LCS

Sorting:

SELECT WORD FROM TESTDATA ORDER BY WORD  

WORD 
---------- 
cena 
čas 
c◌̌as 
Čech 
C◌̌ech 
hlava 
holub 
chleb 
Jana 
jaro 
Jaroslav

In the results of the ORDER BY query, notice:
  • The result is linguistically correct.
  • Case and accent differences are treated as less significant than the base character.
  • A letter followed by a combining accent is treated as equal to the equivalent precomposed accented character.
  • The word chleb is correctly ordered after the word holub.
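
The ordering above can be approximated with a small Python sketch. It is a toy model under stated assumptions, not the CLDR tailoring that the database uses: the PRIMARY alphabet, the elements and sort_key helpers, and the two-level key (letter rank first, then case) are all made up for illustration, and they cover only unaccented letters plus č, ř, š, and ž.

```python
import unicodedata

# Hypothetical primary alphabet for this sketch: "č" and the digraph "ch"
# are letters in their own right, and "ch" sits between "h" and "i".
PRIMARY = ["a", "b", "c", "č", "d", "e", "f", "g", "h", "ch", "i", "j", "k",
           "l", "m", "n", "o", "p", "q", "r", "ř", "s", "š", "t", "u", "v",
           "w", "x", "y", "z", "ž"]
RANK = {letter: index for index, letter in enumerate(PRIMARY)}


def elements(word):
    """Split a word into collation elements, treating 'ch' as one element."""
    text = unicodedata.normalize("NFC", word)  # unify combining and precomposed forms
    result = []
    i = 0
    while i < len(text):
        if text[i:i + 2].lower() == "ch":      # contraction: two characters, one element
            result.append(text[i:i + 2])
            i += 2
        else:
            result.append(text[i])
            i += 1
    return result


def sort_key(word):
    """Two-level key: letter ranks first, case differences only as a tie-breaker."""
    parts = elements(word)
    # Raises KeyError for letters outside the toy alphabet.
    primary = tuple(RANK[part.lower()] for part in parts)
    case = tuple(0 if part == part.lower() else 1 for part in parts)  # lowercase first
    return (primary, case)


WORDS = ["chleb", "\u010Cech", "C\u030Cech", "Jana", "hlava", "Jaroslav",
         "holub", "cena", "jaro", "\u010Das", "c\u030Cas"]

for word in sorted(WORDS, key=sort_key):
    print(word)
```

Under these assumptions the sketch prints the words in the same order as the query result. The two encodings of čas (and of Čech) receive identical keys, so their relative order comes only from the stable sort preserving input order.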

Substring matching:

SELECT WORD FROM TESTDATA WHERE WORD LIKE 'c%'

WORD
----------
cena

In the results of the LIKE query, notice:
  • Neither form of čas (precomposed or with a combining caron) nor chleb is selected, since linguistically they do not start with the letter c.
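
The LIKE result can be mimicked with a short Python sketch that compares collation elements instead of individual characters. This is only an assumption about the general idea, not the database's matching logic: the elements and like_prefix helpers are hypothetical, the only contraction modeled is the Czech digraph ch, and elements are compared exactly (case and accents are significant).

```python
import unicodedata


def elements(text):
    """Split text into collation elements, treating 'ch' as one element."""
    text = unicodedata.normalize("NFC", text)  # unify combining and precomposed forms
    result = []
    i = 0
    while i < len(text):
        if text[i:i + 2].lower() == "ch":      # contraction: two characters, one element
            result.append(text[i:i + 2])
            i += 2
        else:
            result.append(text[i])
            i += 1
    return result


def like_prefix(word, prefix):
    """Rough stand-in for WORD LIKE 'prefix%': compare whole elements, not characters."""
    wanted = elements(prefix)
    return elements(word)[:len(wanted)] == wanted


WORDS = ["chleb", "\u010Cech", "C\u030Cech", "Jana", "hlava", "Jaroslav",
         "holub", "cena", "jaro", "\u010Das", "c\u030Cas"]

matches = [word for word in WORDS if like_prefix(word, "c")]
print(matches)
```

In this sketch only cena matches the prefix c, because the first element of chleb is the digraph ch and the first element of both encodings of čas is č. A plain character-based prefix test such as str.startswith would also accept chleb and the combining-caron form of čas.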

1 In Czech, the digraph ch is sorted separately from the letter c and is ordered between the letters h and i.
2 In Unicode, the accented character Č can be entered as a single Unicode code point, U+010C (Latin capital letter C with caron), or as two code points, U+0043 U+030C (Latin capital letter C, combining caron). The two representations appear the same on a computer screen or a printout, but they have different internal representations. For the purposes of the examples, however, the characters are drawn differently: U+010C is drawn as Č and U+0043 U+030C is drawn as C◌̌. To demonstrate combining accents, both forms are included in the word list.
3 In Unicode, the accented character č can be entered as a single Unicode code point, U+010D (Latin small letter c with caron), or as two code points, U+0063 U+030C (Latin small letter c, combining caron). The two representations appear the same on a computer screen or a printout, but they have different internal representations. For the purposes of the examples, however, the characters are drawn differently: U+010D is drawn as č and U+0063 U+030C is drawn as c◌̌. To demonstrate combining accents, both forms are included in the word list.