Unicode Collation Algorithm based collations

The CREATE DATABASE command and the COLLATION_KEY_BIT scalar function support the CLDR181 collation keyword. This keyword implements the UCA (Unicode Collation Algorithm) specified in the Unicode Standard and based on the Common Locale Data Repository (CLDR).

The default Unicode Collation Algorithm is implemented by the CLDR181 keyword without any attributes. Since the default UCA cannot simultaneously encompass the collating sequence of every language supported by Unicode, optional attributes can be specified to customize the UCA ordering. The attributes are separated by the underscore (_) character. The CLDR181 keyword and any attributes form a UCA collation name.

The following table describes the collation attributes, their values, and typical usage examples.

Table 1. UCA collation attributes
Attribute name Attribute short form Valid values Description
Locale:
  1. Language
  2. Region
  3. Script
  4. Keyword
Locale:
  1. L[ISO 639-1 language code]
  2. R[ISO 3166 country/region code]
  3. Z[ISO 15924 script code]
  4. K[name]
See Table 2 for a list of all the valid Locale names.

The Locale attribute is probably the most important attribute to obtain ordering that conforms to the user expectations in different countries and regions. You need to explicitly specify the Locale attribute to properly collate text for a specific language.

The Locale attribute consists of the following parts: language, region/country, script, and keyword. Not all the parts are mandatory. See Table 2 for a complete list of the valid combinations. The specification of a locale automatically presets all the other collation attributes to values that are suitable for that locale. Typically there is no need to specify additional collation attribute.

Examples:
  • CLDR181 or CLDR181_LROOT for the default UCA ordering
  • CLDR181_LDE for German, where "Köpfe" < "Kypper"
  • CLDR181_LSV for Swedish, where "Köpfe" > "Kypper"
  • CLDR181_LDE_KPHONEBOOK, which specifies the German telephone ordering
Strength S 1, 2, 3, 4, or I

The Strength attribute determines whether accent or case is taken into account when collating or comparing text strings. In writing systems without case or accent, the Strength attribute controls similarly important features.

The possible values are: primary (1), secondary (2), tertiary (3), quaternary (4), and identity (I). To ignore:
  • accent and case, use the primary strength level
  • case only, use the secondary strength level
  • neither accent nor case, use the tertiary strength level

Almost all characters can be distinguished by the first three strength levels, therefore in most locales the default Strength attribute is set at the tertiary level. However if the Alternate attribute (described in a following row) is set to shifted, then the quaternary strength level can be used to break ties among white space characters, punctuation marks, and symbols that would otherwise be ignored. The identity strength level is used to distinguish among similar characters, such as the MATHEMATICAL BOLD SMALL A character (U+1D41A) and the MATHEMATICAL ITALIC SMALL A character (U+1D44E).

Setting the Strength attribute to higher level will slow down text string comparisons and increase the length of the sort keys.

Examples:
  • CLDR181_S1 will collate "role" = "Role" = "rôle"
  • CLDR181_S2 will collate "role" = "Role" < "rôle"
  • CLDR181_S3 will collate "role" < "Role" < "rôle"
Case Level E
  • X (Off)
  • O (On)

Setting the Case Level attribute to on and the Strength attribute to primary level will ignore accent but not case. The Case Level attribute is set to X by default in most locales. When this attributes is set to O, it will slightly affect text string comparisons performance and lengthen the sort keys.

Examples:
  • CLDR181_EX_S1 will collate "role" = "Role" = "rôle"
  • CLDR181_EO_S1 will collate "role" = "rôle" < "Role"
Case First C X, L, or U

The Case First attribute controls whether upper case characters collate before or after lower case characters, in the absence of other differences in the two text strings.

The possible values are upper case first (U), lower case first (L), and off (X). There is almost no difference between the lower case first setting and the off setting, therefore typically there is no need to use the lower case first setting.

Specifying a Case First attribute of U or L can increase the length of the sort keys.

Examples:
  • CLDR181_CX or CLDR181_CL will collate "china" < "China" < "denmark" < "Denmark"
  • CLDR181_CU will collate "China" < "china" < "Denmark" < "denmark"
Alternate A N or S

The Alternate attribute controls the handling of variable characters in the UCA: white space, punctuation marks, and symbols.

If the Alternate attribute is set to non-ignorable (N), then differences among these variable characters are of the same importance as differences among non-variable characters such as the English alphabet. If the Alternate attribute is set to shifted (S), then these variable characters are of only minor importance. If the Alternate attribute is set to shifted and the Strength attribute is set to the quaternary level, then variable characters are considered in a comparison when all other aspects of the strings - base letters, accents, and case - are identical.

The default for most locales is non-ignorable.

If shifted is selected, performance will be slower if there are many strings that are identical except for punctuation marks. Sort key length will not be affected unless the strength level is also increased.

Examples:
  • CLDR181_AN_S3 will collate "di Silva" < "Di Silva" < "diSilva" < "U.S.A." < "USA"
  • CLDR181_AS_S3 will collate "di Silva" = "diSilva" < "Di Silva" < "U.S.A." = "USA"
  • CLDR181_AS_S4 will collate "di Silva" < "diSilva" < "Di Silva" < "U.S.A." < "USA"
Variable Top T [4 or 8 UTF-16BE hexadecimal digits]

The Variable Top attribute controls which characters to ignore, and is only meaningful if the Alternate attribute is set to Shifted. All characters whose primary weight is equal or lower than the specified character are considered ignorable.

The character is specified as one or two UTF-16BE code units in hexadecimal notation. A Unicode supplementary character is specified using a surrogate pair. For example, if you want to ignore white space characters and not visible characters, then set the Alternate attribute to Shifted and this attribute to U+0020 (space) or U+3000 (ideographic space). Since all characters having the same primary weight are equivalent, so setting this attribute to U+0020 is equivalent to setting it to U+3000.

This attribute alone has little impact on text string comparison performance, but setting it higher makes sort keys longer.

Example:
  • CLDR181_AS_S3 will collate "di Silva" = "diSilva" < "U.S.A." = "USA"
  • CLDR181_AS_S3_T0020 will collate "di Silva" = "diSilva" < "U.S.A." = "USA"
Normalization Checking N
  • X (Off)
  • O (On)
The Normalization Checking attribute, if set to O, will normalize the input text if necessary. Even if this attribute is set to X, as is the default for many locales, text as represented in common usage will collate correctly. You should, however, set this attribute to O in two cases:
  • if the text contains accent marks in non-canonical order
  • if the text is in a script that uses multiple combining characters, such as Arabic, ancient Greek, Hebrew, Hindi, Thai, or Vietnamese
There is a medium string comparison performance cost if this attribute is set to O, depending on the frequency of sequences that require normalization. There is no significant effect on length of the sort keys. If the text is already in normalized form NFD or NFKD, then you can set this attribute to X to improve performance. Examples:
  • CLDR181_NX will collate ä = a + ◌̈ < ä + ◌ ̣ < ạ + ◌̈
  • CLDR181_NO will collate ä = a + ◌̈ < ä + ◌ ̣ = ạ + ◌̈
French F
  • X (Off)
  • O (On)

The French sorts strings by examining the accents starting from the end of the string. This attribute is automatically set to O for the French locales, and has a minor performance cost for text string comparisons, but no change in the length of the sort keys.

Examples:
  • CLDR181_LFR_FX will collate "cote" < "coté" < "côte" < "côté"
  • CLDR181_LFR will collate "cote" < "côte" < "coté" < "côté"
Hiragana H
  • X (Off)
  • O (On)

The Hiragana attribute determines whether to distinguish between Japanese Hiragana and Katakana characters. To conform with the Japanese JIS X 4061 standard, you need to set this attribute to O and the Strength attribute to the quaternary level. This will, however, slow down text string comparisons and increase the length of the sort keys.

Examples:
  • CLDR181_LJA_HX_S4 will collate "きゅう" = "キュウ" < "きゆう" = " "キユウ"
  • CLDR181_LJA_HO_S4 will collate "きゅう" < "キュウ" < "きゆう" < "キユウ"

Valid locale names for the collations are shown in Table 2. The Default collation attributes column shows the full name of the collation for the specific locale. For example, CLDR181_LAR is equivalent to CLDR181_LAR_AN_CX_EX_FX_HX_NX_S3.

Tip: If a locale name is not listed in Table 2, try the LROOT locale instead. While the LROOT locale does not always yield the correct collation for all unlisted locales, it might result in the expected order for some locales.
Table 2. Valid collation locale names
CLDR181_Locale name Language (Region) Default collation attributes Remarks
CLDR181_LAF Afrikaans CLDR181_LAF_AN_CX_EX_FX_HX_NX_S3  
CLDR181_LAR Arabic CLDR181_LAR_AN_CX_EX_FX_HX_NX_S3  
CLDR181_LAS Assamese CLDR181_LAS_AN_CX_EX_FX_HX_NO_S3  
CLDR181_LAZ Azerbaijani CLDR181_LAZ_A N_CX_EX_FX_HX _NX_S3  
CLDR181_LBE Belarusian CLDR181_LBE_AN_CX_EX_FX_HX_NX_S3  
CLDR181_LBG Bulgarian CLDR181_LBG_AN_CX_EX_FX_HX_NX_S3  
CLDR181_LBN Bengali CLDR181_LBN_A N_CX_EX_FX_HX _NO_S3  
CLDR181_LBN_KTRADITIONAL Bengali CLDR181_LBN_KTRADITIONAL_A N_CX_EX_FX_HX _NO_S3  
CLDR181_LCA Catalan CLDR181_LCA_AN_CX_EX_FO_HX_NX_S3  
CLDR181_LCS Czech CLDR181_LCS_AN_CX_EX_FX_HX_NX_S3  
CLDR181_LCY Welsh CLDR181_LCY_A N_CX_EX_FX_HX _NX_S3  
CLDR181_LDA Danish CLDR181_LDA_AN_CU_EX_FX_HX_NX_S3  
CLDR181_LDE German CLDR181_LDE_AN_CX_EX_FX_HX_NX_S3  
CLDR181_LDE_KPHONEBOOK German CLDR181_LDE_KPHONEBOOK_AN_CX_EX_FX_HX_NX_S3  
CLDR181_LEL Greek CLDR181_LEL_AN_CX_EX_FX_HX_NO_S3  
CLDR181_LEN English CLDR181_LEN_AN_CX_EX_FX_HX_NX_S3  
CLDR181_LEN_RBE English (Belgium) CLDR181_LEN_RBE_AN_CX_EX_FO_HX_NX_S3  
CLDR181_LEO Esperanto CLDR181_LEO_AN_CX_EX_FX_HX_NX_S3  
CLDR181_LES Spanish CLDR181_LES_AN_CX_EX_FX_HX_NX_S3  
CLDR181_LES_KTRADITIONAL Spanish CLDR181_LES_KTRADITIONAL_AN_CX_EX_FX_HX_NX_S3  
CLDR181_LET Estonian CLDR181_LET_AN_CX_EX_FX_HX_NX_S3  
CLDR181_LFA Persian CLDR181_LFA_AN_CX_EX_FX_HX_NO_S3  
CLDR181_LFA_RAF Persian (Afghanistan) CLDR181_LFA_RAF_AN_CX_EX_FX_HX_NO_S3  
CLDR181_LFI Finnish CLDR181_LFI_AN_CX_EX_FX_HX_NX_S3  
CLDR181_LFO Faroese CLDR181_LFO_AN_CX_EX_FX_HX_NX_S3  
CLDR181_LFR French CLDR181_LFR_AN_CX_EX_FO_HX_NX_S3  
CLDR181_LGU Gujarati CLDR181_LGU_AN_CX_EX_FX_HX_NO_S3  
CLDR181_LHA Hausa CLDR181_LHA_A N_CX_EX_FX_HX _NX_S3  
CLDR181_LHAW Hawaiian CLDR181_LHAW_AN_CX_EX_FX_HX_NX_S3  
CLDR181_LHE Hebrew CLDR181_LHE_AN_CX_EX_FX_HX_NO_S3  
CLDR181_LHI Hindi CLDR181_LHI_AN_CX_EX_FX_HX_NO_S3  
CLDR181_LHI_KDIRECT Hindi CLDR181_LHI_KDIRECT_AN_CX_EX_FX_HX_NX_S3  
CLDR181_LHR Croatian CLDR181_LHR_AN_CX_EX_FX_HX_NX_S3  
CLDR181_LHU Hungarian CLDR181_LHU_AN_CX_EX_FX_HX_NX_S3  
CLDR181_LHY Armenian CLDR181_LHY_A N_CX_EX_FX_HX _NX_S3  
CLDR181_LIG Igbo CLDR181_LIG_A N_CX_EX_FX_HX _NO_S3  
CLDR181_LIS Icelandic CLDR181_LIS_AN_CX_EX_FX_HX_NX_S3  
CLDR181_LIT Italian CLDR181_LIT_AN_CX_EX_FX_HX_NX_S3  
CLDR181_LJA Japanese CLDR181_LJA_AN_CX_EX_FX_HO_NX_S3 Treat Hiragana as equal to their Katakana equivalents. To sort Hiragana before Katakana, set the strength level to 4.
CLDR181_LJA_KUNIHAN Japanese CLDR181_LJA_KUNIHAN_AN_CX_EX_FX_HX_NX_S3  
CLDR181_LKK Kazakh CLDR181_LKK_AN_CX_EX_FX_HX_NO_S3  
CLDR181_LKL Kalaallisut CLDR181_LKL_AN_CX_EX_FX_HX_NX_S3  
CLDR181_LKM Khmer CLDR181_LKM_AN_CX_EX_FX_HX_NO_S3  
CLDR181_LKN Kannada CLDR181_LKN_AN_CX_EX_FX_HX_NO_S3  
CLDR181_LKN_KTRADITIONAL Kannada CLDR181_LKN_KTRADITIONAL_A N_CX_EX_FX_HX _NO_S3  
CLDR181_LKO Korean CLDR181_LKO_AN_CX_EX_FX_HX_NX_S3  
CLDR181_LKO_KUNIHAN Korean CLDR181_LKO_KUNIHAN_AN_CX_EX_FX_HX_NX_S3  
CLDR181_LKOK Konkani CLDR181_LKOK_ AN_CX_EX_FX_H X_NO_S3  
CLDR181_LLT Lithuanian CLDR181_LLT_AN_CX_EX_FX_HX_NX_S3  
CLDR181_LLV Latvian CLDR181_LLV_AN_CX_EX_FX_HX_NX_S3  
CLDR181_LMK Macedonian CLDR181_LMK_AN_CX_EX_FX_HX_NX_S3  
CLDR181_LML Malayalam CLDR181_LML_AN_CX_EX_FX_HX_NO_S3  
CLDR181_LMR Marathi CLDR181_LMR_AN_CX_EX_FX_HX_NO_S3  
CLDR181_LMT Maltese CLDR181_LMT_AN_CU_EX_FX_HX_NX_S3  
CLDR181_LNB Norwegian Bokmål CLDR181_LNB_AN_CX_EX_FX_HX_NX_S3  
CLDR181_LNN Norwegian Nynorsk CLDR181_LNN_AN_CX_EX_FX_HX_NX_S3  
CLDR181_LOM Oromo CLDR181_LOM_AN_CX_EX_FX_HX_NX_S3  
CLDR181_LOR Oriya CLDR181_LOR_AN_CX_EX_FX_HX_NO_S3  
CLDR181_LPA Punjabi CLDR181_LPA_AN_CX_EX_FX_HX_NO_S3  
CLDR181_LPL Polish CLDR181_LPL_AN_CX_EX_FX_HX_NX_S3  
CLDR181_LPS Pashto CLDR181_LPS_AN_CX_EX_FX_HX_NO_S3  
CLDR181_LRO Romanian CLDR181_LRO_AN_CX_EX_FX_HX_NX_S3  
CLDR181_LROOT Root CLDR181_LROOT_AN_CX_EX_FX_HX_NX_S3 Default UCA
CLDR181_LRU Russian CLDR181_LRU_AN_CX_EX_FX_HX_NX_S3  
CLDR181_LSI Sinhala CLDR181_LSI_A N_CX_EX_FX_HX _NO_S3  
CLDR181_LSI_KDICTIONARY Sinhala CLDR181_LSI_KDICTIONARY_AN_CX_EX_FX_HX_ NO_S3  
CLDR181_LSK Slovak CLDR181_LSK_AN_CX_EX_FX_HX_NX_S3  
CLDR181_LSL Slovenian CLDR181_LSL_AN_CX_EX_FX_HX_NX_S3  
CLDR181_LSQ Albanian CLDR181_LSQ_AN_CX_EX_FX_HX_NX_S3  
CLDR181_LSR Serbian CLDR181_LSR_AN_CX_EX_FX_HX_NX_S3  
CLDR181_LSR_ZLATN Serbian CLDR181_LSR_ZLATN_AN_CX_EX_FX_HX_NX_S3  
CLDR181_LSV Swedish CLDR181_LSV_AN_CX_EX_FX_HX_NX_S3  
CLDR181_LTA Tamil CLDR181_LTA_AN_CX_EX_FX_HX_NO_S3  
CLDR181_LTE Telugu CLDR181_LTE_AN_CX_EX_FX_HX_NO_S3  
CLDR181_LTH Thai CLDR181_LTH_AN_CX_EX_FX_HX_NO_S3  
CLDR181_LTR Turkish CLDR181_LTR_AN_CX_EX_FX_HX_NX_S3  
CLDR181_LUK Ukrainian CLDR181_LUK_AN_CX_EX_FX_HX_NX_S3  
CLDR181_LUR Urdu CLDR181_LUR_AN_CX_EX_FX_HX_NX_S3  
CLDR181_LVI Vietnamese CLDR181_LVI_AN_CX_EX_FX_HX_NO_S3  
CLDR181_LYO Yoruba CLDR181_LYO_A N_CX_EX_FX_HX _NO_S3  
CLDR181_LZH Chinese CLDR181_LZH_AN_CX_EX_FX_HX_NX_S3 Pinyin ordering
CLDR181_LZH_KUNIHAN Chinese CLDR181_LZH_KUNIHAN_AN_CX_EX_FX_HX_NX_S3 Default UCA ordering
CLDR181_LZH_KBIG5HAN Chinese CLDR181_LZH_KBIG5HAN_AN_CX_EX_FX_HX_NX_S3 Big5 ordering
CLDR181_LZH_KGB2312HAN Chinese CLDR181_LZH_KGB2312HAN_AN_CX_EX_FX_HX_NX_S3 GB2312 ordering
CLDR181_LZH_KSTROKE Chinese CLDR181_LZH_KSTROKE_AN_CX_EX_FX_HX_NX_S3 Stroke ordering

Details of the Unicode Collation Algorithm can be found in the Unicode Technical Standard #10, available at the Unicode Consortium website at http://www.unicode.org.

Collations introduced before version 10.1

Important: Collations based on the Unicode Collation Algorithm of the Unicode Standard version 4.0.0 have been deprecated in version 10.1 and might be removed in a future release. For more information, see Collations based on the Unicode Collation Algorithm of the Unicode Standard version 4.0.0 have been deprecated.

The UCA400R1 collations introduced in Db2® Version 9.1 Fix Pack 3 are only supported in the COLLATION_KEY_BIT function. These collations are not supported when creating databases.

The UCA500R1 collations introduced in Db2 Version 9.5 Fix Pack 1 are supported when creating databases and in the COLLATION_KEY_BIT function.

For specific details about collations introduced before the current release, refer to the documentation for the previous release.

Table 3. Collation keywords with corresponding versions of the Unicode Standard and CLDR
Collation keyword Complies with Unicode Standard version Complies with CLDR version
UCA400R1 4.0 1.2
UCA500R1 5.0 1.5.1
CLDR181 5.2 1.8.1