NORMALIZE_STRING scalar function

The NORMALIZE_STRING function takes a Unicode string argument and returns a normalized string that can be used for comparison.

Two strings can look the same without being encoded with the same Unicode code points. For example, Å can be encoded in UTF-16 as X'00C5' or as X'0041030A'. The NORMALIZE_STRING function converts such strings to a normalized form so that they can be compared.
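The comparison problem can be reproduced outside the database. The following is a minimal sketch that uses Python's standard unicodedata module, not Db2 itself; the variable names are illustrative. It shows the two encodings of Å comparing unequal until both are brought to the same normalization form.

```python
import unicodedata

# Two encodings of the same visible character, Å.
precomposed = "\u00C5"        # UTF-16: X'00C5'
decomposed = "\u0041\u030A"   # UTF-16: X'0041030A' (A + combining ring above)

# A plain comparison looks at code points, so the two strings differ.
raw_equal = precomposed == decomposed

# After both sides are normalized to the same form (NFC here), they match.
nfc_equal = (unicodedata.normalize("NFC", precomposed)
             == unicodedata.normalize("NFC", decomposed))

print(raw_equal, nfc_equal)  # False True
```

Any of the four forms would work for the comparison, as long as the same form is applied to both operands.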

The schema is SYSIBM.

unicode-string

An expression that returns a value of a built-in character string or graphic string data type that is either Unicode UTF-8 or Unicode UTF-16, and is not a LOB. The CAST specification can be used to convert ASCII or EBCDIC data to Unicode for use with this function.

NFC, NFD, NFKC, or NFKD

Specifies the normalized form:

NFC: Canonical Decomposition followed by Canonical Composition
NFD: Canonical Decomposition
NFKC: Compatibility Decomposition followed by Canonical Composition
NFKD: Compatibility Decomposition
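To see how the four forms differ, the following sketch applies each one to the same input using Python's standard unicodedata module. This illustrates the Unicode normalization forms themselves and is not Db2 code; the helper utf16_hex and the sample string are assumptions made for the illustration. The sample combines a precomposed 'á' (X'00E1') with the 'ﬁ' ligature (X'FB01'), which has only a compatibility decomposition.

```python
import unicodedata

def utf16_hex(s: str) -> str:
    """Return the UTF-16 (big-endian) encoding of s as a hex string."""
    return s.encode("utf-16-be").hex().upper()

# Precomposed 'á' (U+00E1) followed by the 'ﬁ' ligature (U+FB01).
sample = "\u00E1\uFB01"

forms = {
    form: utf16_hex(unicodedata.normalize(form, sample))
    for form in ("NFC", "NFD", "NFKC", "NFKD")
}

for form, encoded in forms.items():
    print(form, encoded)
# NFC  00E1FB01          (nothing to compose, ligature kept)
# NFD  00610301FB01      ('á' decomposed, ligature kept)
# NFKC 00E100660069      (ligature replaced by 'f' 'i', 'á' composed)
# NFKD 0061030100660069  (ligature replaced, 'á' decomposed)
```

The canonical forms (NFC, NFD) leave the ligature alone, while the compatibility forms (NFKC, NFKD) replace it with the plain letters.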

integer

The length attribute for the resulting variable-length string, in bytes if the string is a character string, or in double-byte code points if the string is a graphic string. The value must be an integer in the range 1 to 32704 if the source string is character, or in the range 1 to 16352 if the source string is graphic.

The result of the function is a varying length string with a data type that depends on the data type of unicode-string:

VARCHAR if unicode-string is CHAR or VARCHAR
VARGRAPHIC if unicode-string is GRAPHIC or VARGRAPHIC

The CCSID of the result is the same as the CCSID of unicode-string.

The length attribute of the result depends on whether integer is specified. If integer is specified, the length attribute of the result is integer bytes for character strings, or integer double-byte code points for graphic strings. If integer is not specified, the length attribute of the result is MIN(3*n,32704) for character strings, or MIN(3*n,16352) for graphic strings, where n is the length attribute of the source.
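The length rule can be worked through with a small calculation. The following Python sketch only models the arithmetic described above; the function name result_length_attribute and its parameters are hypothetical and are not part of any Db2 interface.

```python
MAX_CHARACTER = 32704  # upper limit in bytes for character strings
MAX_GRAPHIC = 16352    # upper limit in double-byte code points for graphic strings

def result_length_attribute(n, graphic=False, integer=None):
    """Model the length attribute of the result (hypothetical helper).

    n       -- length attribute of the source string
    graphic -- True for GRAPHIC/VARGRAPHIC sources, False for CHAR/VARCHAR
    integer -- the optional third argument, or None if it is not specified
    """
    limit = MAX_GRAPHIC if graphic else MAX_CHARACTER
    if integer is not None:
        # An explicit length must fall within the documented range.
        if not 1 <= integer <= limit:
            raise ValueError(f"integer must be in the range 1 to {limit}")
        return integer
    # Default: three times the source length attribute, capped at the limit.
    return min(3 * n, limit)

print(result_length_attribute(10))                  # 30
print(result_length_attribute(20000))               # 32704
print(result_length_attribute(6000, graphic=True))  # 16352
print(result_length_attribute(10, integer=50))      # 50
```

For example, a VARCHAR(10) source with no integer argument gives a result length attribute of 30, while a VARCHAR(20000) source is capped at 32704.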

The result can be null; if the first argument is null, the result is the null value.

Example 1: In the following example, "ábc" is normalized to normalization form NFC:

    SET :hv1 = NORMALIZE_STRING('ábc',NFC) -- input is X'0061030100620063'

hv1 is set to 'ábc' (X'00E100620063'). Using normalization form NFC, the two-code-point sequence X'00610301', which represents the character 'á', is normalized to X'00E1', its precomposed equivalent.

Example 2: In the following example, "ábc" is normalized to normalization form NFD:

    SET :hv1 = NORMALIZE_STRING('ábc',NFD) -- input is X'00E100620063'

hv1 is set to 'ábc' (X'0061030100620063'). Using normalization form NFD, the code point X'00E1' is decomposed into the two-code-point sequence X'00610301', which consists of the Latin lowercase letter a and the combining acute accent.
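The hexadecimal values in both examples can be cross-checked outside the database. The following sketch uses Python's standard unicodedata module rather than Db2, and it assumes big-endian UTF-16 to match the X'...' notation used here; utf16_hex is an illustrative helper.

```python
import unicodedata

def utf16_hex(s: str) -> str:
    """Return the UTF-16 (big-endian) encoding of s as a hex string."""
    return s.encode("utf-16-be").hex().upper()

# Example 1 input: 'a' + combining acute accent, then 'b' and 'c'.
decomposed_input = "\u0061\u0301bc"   # X'0061030100620063'
# Example 2 input: precomposed 'á', then 'b' and 'c'.
precomposed_input = "\u00E1bc"        # X'00E100620063'

nfc_result = utf16_hex(unicodedata.normalize("NFC", decomposed_input))
nfd_result = utf16_hex(unicodedata.normalize("NFD", precomposed_input))

print(nfc_result)  # 00E100620063
print(nfd_result)  # 0061030100620063
```

The output of each normalization is the input of the other, which is why the two forms can be used interchangeably for comparison as long as both operands use the same one.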