IDENTITY collation

The IDENTITY collation compares values using a simple binary comparison.

Strings are ordered by the internal (binary) representation of the data. The resulting order does not follow the alphabetical rules of any language.

Substring matching also uses the internal representation of the string. Two substrings are considered a match only if they are byte-for-byte identical. Linguistic and cultural rules are not considered.
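The byte-for-byte behavior described above can be sketched outside the database. The following Python fragment is only an illustration: the helper names identity_key and identity_contains are hypothetical, and UTF-8 is assumed as the internal representation.

```python
# Illustrative sketch only. The helper names are hypothetical, and UTF-8 is
# assumed as the internal representation of the strings.

def identity_key(s: str) -> bytes:
    """Return the bytes that a binary comparison would look at."""
    return s.encode("utf-8")

def identity_contains(haystack: str, needle: str) -> bool:
    """Byte-for-byte substring test; no linguistic rules are applied."""
    return identity_key(needle) in identity_key(haystack)

# The same visible word, stored in two different ways.
precomposed = "\u010d" + "as"   # c with caron as the single code point U+010D
decomposed = "c\u030c" + "as"   # letter c followed by combining caron U+030C

print(identity_contains(precomposed, "\u010d"))   # True
print(identity_contains(decomposed, "\u010d"))    # False
print(identity_key("Z") < identity_key("a"))      # True: uppercase sorts first
```

Both strings look identical when displayed, but only the first contains the bytes of the precomposed character, so only the first matches.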

Advantages

Fastest collation available.

Disadvantages

The order is not linguistic.
Substring matching is not linguistic.
Character and graphic types are ordered differently.
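One way the last disadvantage can arise is sketched below. The sketch assumes that character data is stored as UTF-8 and graphic data as big-endian UTF-16; neither encoding is stated here, so treat both as assumptions. Under those assumptions, a supplementary character and a character near the end of the Basic Multilingual Plane change places depending on the type.

```python
# Illustrative sketch only. Assumption: character data is UTF-8 and graphic
# data is big-endian UTF-16. The function names are hypothetical.

bmp_char = "\uff5e"              # U+FF5E, near the end of the Basic Multilingual Plane
supplementary = "\U00010000"     # U+10000, the first supplementary code point

def less_as_character(x: str, y: str) -> bool:
    """Binary comparison of the assumed UTF-8 form."""
    return x.encode("utf-8") < y.encode("utf-8")

def less_as_graphic(x: str, y: str) -> bool:
    """Binary comparison of the assumed big-endian UTF-16 form."""
    return x.encode("utf-16-be") < y.encode("utf-16-be")

print(less_as_character(bmp_char, supplementary))   # True
print(less_as_graphic(bmp_char, supplementary))     # False
```

In UTF-16 the supplementary character is stored as a surrogate pair that begins with the byte D8, which compares lower than the leading byte FF of U+FF5E, so the two values swap positions.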

IDENTITY collation is suitable when linguistic correctness is not important for the database and applications, or when the absolute best performance is vital.

Example

To demonstrate the behavior of this collation, the following list of Czech words is used.

chleb¹
Čech
C◌̌ech²
Jana
hlava
Jaroslav
holub
cena
jaro
čas
c◌̌as³

The database with IDENTITY collation was created using the following command:

CREATE DATABASE TESTDB COLLATE USING IDENTITY

Sorting:

SELECT WORD FROM TESTDATA ORDER BY WORD

WORD
----------

C◌̌ech
Jana
Jaroslav
cena
chleb
c◌̌as
hlava
holub
jaro
Čech
čas

In the results of the ORDER BY query, notice:

Uppercase and lowercase letters are not grouped together.
Accented characters are grouped separately from unaccented characters.
Characters with combining accents are grouped with the unaccented characters.
The word chleb is incorrectly grouped with words starting with c.
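The ordering shown above can be reproduced outside the database by sorting the word list on its encoded bytes. This is a sketch under the assumption that the strings are stored as UTF-8; the variable names are hypothetical.

```python
# Illustrative sketch only. UTF-8 storage is assumed; names are hypothetical.

words = [
    "chleb",
    "\u010c" + "ech",    # Cech with caron, precomposed (U+010C)
    "C\u030c" + "ech",   # Cech with caron, decomposed (U+0043 U+030C)
    "Jana",
    "hlava",
    "Jaroslav",
    "holub",
    "cena",
    "jaro",
    "\u010d" + "as",     # cas with caron, precomposed (U+010D)
    "c\u030c" + "as",    # cas with caron, decomposed (U+0063 U+030C)
]

# Sort on the raw bytes, which is what a binary comparison sees.
identity_order = sorted(words, key=lambda w: w.encode("utf-8"))

for word in identity_order:
    print(word)
```

The decomposed forms sort next to plain C and c because their first byte is the unaccented letter, while the precomposed forms sort last because their first byte (C4) is higher than any ASCII letter.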

Substring matching:

SELECT WORD FROM TESTDATA WHERE WORD LIKE 'c%'

WORD
----------

cena
chleb
c◌̌as

In the results of the LIKE query, notice:

The word c◌̌as is selected, even though it starts with the character č and not the character c.
The word chleb is selected, even though the digraph ch does not linguistically match the letter c.
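The LIKE 'c%' result can be approximated with a byte prefix test. The sketch below assumes UTF-8 storage, uses the hypothetical helper like_prefix, and sorts the matches by their bytes only to make the output easy to compare with the listing above.

```python
# Illustrative sketch only. UTF-8 storage is assumed; like_prefix is a
# hypothetical helper that mimics a pattern of the form 'prefix%'.

words = [
    "chleb",
    "\u010c" + "ech",    # precomposed capital C with caron
    "C\u030c" + "ech",   # capital C followed by combining caron
    "Jana", "hlava", "Jaroslav", "holub", "cena", "jaro",
    "\u010d" + "as",     # precomposed small c with caron
    "c\u030c" + "as",    # small c followed by combining caron
]

def like_prefix(word: str, prefix: str) -> bool:
    """Return True when the bytes of word start with the bytes of prefix."""
    return word.encode("utf-8").startswith(prefix.encode("utf-8"))

matches = sorted(
    (w for w in words if like_prefix(w, "c")),
    key=lambda w: w.encode("utf-8"),
)

for word in matches:
    print(word)
```

The precomposed form of the word is not matched, because its first bytes are C4 8D rather than the single byte 63 of the letter c.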

¹ In Czech, the digraph ch is sorted separately from the letter c and is ordered between the letters h and i.

² In Unicode, the accented character Č can be entered as a single Unicode code point, U+010C (Latin capital letter C with caron) or as two code points, U+0043 U+030C (Latin capital letter C, combining caron). The two representations appear the same on a computer screen or a printout, but they have different internal representations. For the purposes of the examples, however, the characters will be drawn differently; U+010C will be drawn as Č and U+0043 U+030C will be drawn as C◌̌. To demonstrate combining accents, both forms are included in the word list.

³ In Unicode, the accented character č can be entered as a single Unicode code point, U+010D (Latin small letter c with caron) or as two code points, U+0063 U+030C (Latin small letter c, combining caron). The two representations appear the same on a computer screen or a printout, but they have different internal representations. For the purposes of the examples, however, the characters will be drawn differently; U+010D will be drawn as č and U+0063 U+030C will be drawn as c◌̌. To demonstrate combining accents, both forms are included in the word list.
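The two representations described in the footnotes can be inspected with Python's standard unicodedata module. The sketch below prints the assumed UTF-8 bytes and the code point names of each form. The final lines use Unicode normalization (NFC), which is not part of the IDENTITY collation and is shown only as one possible way for an application to make the two forms compare equal before storing them.

```python
# Illustrative sketch only. UTF-8 is assumed as the stored encoding, and the
# normalization step is an application-side option, not collation behavior.
import unicodedata

precomposed_upper = "\u010c"    # single code point
decomposed_upper = "C\u030c"    # base letter + combining caron
precomposed_lower = "\u010d"
decomposed_lower = "c\u030c"

for text in (precomposed_upper, decomposed_upper,
             precomposed_lower, decomposed_lower):
    names = [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]
    print(text.encode("utf-8").hex(" "), names)

# The raw forms differ, so a binary comparison treats them as different values.
print(precomposed_upper == decomposed_upper)                                 # False

# After NFC normalization the decomposed form becomes the precomposed one.
print(unicodedata.normalize("NFC", decomposed_upper) == precomposed_upper)   # True
print(unicodedata.normalize("NFC", decomposed_lower) == precomposed_lower)   # True
```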