Normalization of Unicode strings
Your application should treat as equal those characters that are functionally and visually equivalent but have different code point representations. This behavior is important when you search, sort, or compare Unicode strings. To accomplish this goal, you might need to normalize the strings. However, normalization can degrade performance.
Unicode strings can be canonically equivalent or compatibly equivalent. If they are canonically equivalent, they are also compatibly equivalent.
Canonically equivalent characters are those characters that are equivalent both functionally and visually, but might have different code point representations. To users, these characters are indistinguishable because they are displayed identically. For example, the precomposed character ü (U+00FC) is canonically equivalent to the sequence u (U+0075) followed by the combining diaeresis (U+0308).
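The ü example can be checked with a short sketch. It uses Python's standard `unicodedata` module purely as an illustration of canonical equivalence; it is not the NORMALIZE_STRING function, and the variable names are hypothetical.

```python
import unicodedata

# Two canonically equivalent spellings of the same user-visible character.
precomposed = "\u00fc"   # ü as a single code point
decomposed = "u\u0308"   # u followed by the combining diaeresis

# A raw code point comparison treats them as different strings.
raw_equal = precomposed == decomposed

# After both are normalized to the same form, they compare as equal.
nfc_equal = (
    unicodedata.normalize("NFC", precomposed)
    == unicodedata.normalize("NFC", decomposed)
)

print(raw_equal)  # False
print(nfc_equal)  # True
```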
Compatibly equivalent characters are characters whose plain-text content is equivalent, even though they might differ in appearance or semantic meaning. These characters might also have different code point representations. For example, superscript and subscript numerals, such as ² (U+00B2), are compatibly equivalent to their decimal-digit counterparts, such as 2 (U+0032).
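The difference between the two kinds of equivalence can be sketched as follows, again with Python's standard `unicodedata` module as a stand-in (an assumption made for illustration, not the product's own function). A superscript digit stays distinct under a canonical form and matches its plain digit only under a compatibility form.

```python
import unicodedata

superscript = "x\u00b2"  # x followed by SUPERSCRIPT TWO
plain = "x2"             # x followed by DIGIT TWO

# Canonical normalization (NFC) leaves the superscript untouched,
# so the two strings still differ.
canonical_equal = (
    unicodedata.normalize("NFC", superscript)
    == unicodedata.normalize("NFC", plain)
)

# Compatibility normalization (NFKC) maps the superscript to the digit.
compatibility_equal = (
    unicodedata.normalize("NFKC", superscript)
    == unicodedata.normalize("NFKC", plain)
)

print(canonical_equal)      # False
print(compatibility_equal)  # True
```

Because compatibility normalization discards the distinction between ² and 2, it suits searching and matching better than storing text whose formatting must be preserved.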
Normalization converts all sequences that are equivalent, either canonically or compatibly depending on the form that you choose, to a single code point sequence. Therefore, after they are normalized to the same form, all canonically equivalent strings have the same binary representation. You can normalize a Unicode string into one of the following normalized forms that are defined by the Unicode Standard:
- Normalization Form Canonical Decomposition (NFD)
  - Characters are decomposed by canonical equivalence.
- Normalization Form Canonical Composition (NFC)
  - Characters are decomposed and then recomposed by canonical equivalence.
- Normalization Form Compatibility Decomposition (NFKD)
  - Characters are decomposed by compatibility equivalence.
- Normalization Form Compatibility Composition (NFKC)
  - Characters are decomposed by compatibility equivalence and then recomposed by canonical equivalence.
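As a rough illustration of how the four forms differ, the following sketch applies each one to the same input and lists the resulting code points. It relies on Python's standard `unicodedata` module rather than NORMALIZE_STRING, and the sample string and the `code_points` helper are hypothetical choices made for this example.

```python
import unicodedata

def code_points(text):
    """Return the code points of text as U+XXXX labels."""
    return [f"U+{ord(ch):04X}" for ch in text]

# Sample input: the ligature ﬁ (U+FB01), which has only a compatibility
# decomposition, followed by ü (U+00FC), which has a canonical one.
sample = "\ufb01\u00fc"

forms = {
    form: code_points(unicodedata.normalize(form, sample))
    for form in ("NFD", "NFC", "NFKD", "NFKC")
}

for form, points in forms.items():
    print(form, points)
# NFD  ['U+FB01', 'U+0075', 'U+0308']
# NFC  ['U+FB01', 'U+00FC']
# NFKD ['U+0066', 'U+0069', 'U+0075', 'U+0308']
# NFKC ['U+0066', 'U+0069', 'U+00FC']
```

The canonical forms keep the ligature intact and change only ü, whereas the compatibility forms also replace the ligature with the letters f and i.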
To normalize Unicode strings, use the NORMALIZE_STRING built-in function.