NORMALIZE_STRING scalar function
The NORMALIZE_STRING function takes a Unicode string argument and returns a normalized string that can be used for comparison.
Two strings can look the same but be encoded with different Unicode code points. For example, Å can be encoded in UTF-16 as X'00C5' or as X'0041030A'. The NORMALIZE_STRING function converts such strings to a common normalized form so that they can be compared.
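The comparison problem described above can be reproduced outside Db2. The following is a minimal sketch that assumes Python's standard unicodedata module as a stand-in for the Unicode normalization step; it does not call Db2, and the variable names are illustrative only.

```python
import unicodedata

# Two encodings of the same visible character (assumed sample data).
precomposed = "\u00C5"        # Å as the single code point U+00C5
decomposed = "\u0041\u030A"   # A followed by COMBINING RING ABOVE

# A direct comparison sees different code points.
raw_equal = precomposed == decomposed

# After both sides are normalized to the same form, they compare equal.
nfc_equal = (unicodedata.normalize("NFC", precomposed)
             == unicodedata.normalize("NFC", decomposed))

print(raw_equal, nfc_equal)
```

Normalizing both operands to the same form before comparing is the point of the sketch; which form is chosen matters less than using the same one on both sides.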
The schema is SYSIBM.
- unicode-string
- An expression that returns a value of a built-in character string or graphic string data type that is either Unicode UTF-8 or Unicode UTF-16, and is not a LOB. The CAST specification can be used to convert ASCII or EBCDIC data to Unicode for use with this function.
- NFC, NFD, NFKC, or NFKD
- Specifies the normalized form:
- NFC
- Canonical Decomposition followed by Canonical Composition
- NFD
- Canonical Decomposition
- NFKC
- Compatibility Decomposition followed by Canonical Composition
- NFKD
- Compatibility Decomposition
- integer
- The length attribute of the resulting varying-length string, in bytes if the string is a character string, or in double-byte code points if the string is a graphic string. The value must be an integer in the range 1 to 32704 if the source string is a character string, or 1 to 16352 if the source string is a graphic string.
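The difference between the four normalization forms listed above is easiest to see on a string that contains both a compatibility character and an accented letter. The sketch below is an illustration under stated assumptions: it uses Python's standard unicodedata module rather than Db2, and the sample string and the code_points helper are hypothetical.

```python
import unicodedata

def code_points(text):
    """Render a string as a space-separated list of U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)

# Assumed sample: LATIN SMALL LIGATURE FI (U+FB01) followed by é (U+00E9).
sample = "\uFB01\u00E9"

# Apply each form and record the resulting code points.
results = {
    form: code_points(unicodedata.normalize(form, sample))
    for form in ("NFC", "NFD", "NFKC", "NFKD")
}

for form, cps in results.items():
    print(form, cps)
```

In this sample the canonical forms (NFC, NFD) leave the ligature intact and only compose or decompose the accented letter, while the compatibility forms (NFKC, NFKD) also replace the ligature with the separate letters f and i.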
The result of the function is a varying length string with a data type that depends on the data type of unicode-string:
- VARCHAR if unicode-string is CHAR or VARCHAR
- VARGRAPHIC if unicode-string is GRAPHIC or VARGRAPHIC
The CCSID of the result is the same as the CCSID of unicode-string.
The length attribute of the result depends on whether integer is specified. If integer is specified, the length attribute of the result is integer bytes or double byte code points. If integer is not specified, the length attribute of the result is MIN(3*n,32704) for character strings, or MIN(3*n,16352) for graphic strings, where n is the length attribute of the source.
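The default length rule can be worked through with a few values. The sketch below only restates the MIN(3*n,limit) arithmetic from the paragraph above; the function name and constants are hypothetical and are not part of Db2.

```python
# Limits taken from the text: 32704 bytes for character strings,
# 16352 double-byte code points for graphic strings.
CHAR_MAX = 32704
GRAPHIC_MAX = 16352

def default_length_attribute(n, graphic=False):
    """Length attribute of the result when integer is not specified.

    n is the length attribute of the source string.
    """
    limit = GRAPHIC_MAX if graphic else CHAR_MAX
    return min(3 * n, limit)

print(default_length_attribute(10))                   # 3*10 is below the limit
print(default_length_attribute(20000))                # capped at 32704
print(default_length_attribute(6000, graphic=True))   # capped at 16352
```

For short sources the result is three times the source length attribute; the cap applies only once 3*n exceeds the maximum for the string type.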
The result can be null; if the first argument is null, the result is the null value.
SET :hv1 = NORMALIZE_STRING('ábc',NFC) -- argument is X'0061030100620063'
hv1 is set to 'ábc' (X'00E100620063'). Using normalization form NFC, the two-code-point sequence X'00610301', which represents the character 'á', is normalized to X'00E1', its pre-composed equivalent.
SET :hv1 = NORMALIZE_STRING('ábc',NFD) -- argument is X'00E100620063'
hv1 is set to 'ábc' (X'0061030100620063'). Using normalization form NFD, the code point X'00E1' is decomposed into the two-code-point sequence X'00610301', which consists of the Latin lowercase letter A and the combining acute accent character.
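The hexadecimal values in the two examples can be cross-checked outside Db2. The sketch below assumes Python's standard unicodedata module and big-endian UTF-16 encoding as a stand-in for the Db2 representation; the utf16_hex helper is hypothetical.

```python
import unicodedata

def utf16_hex(text):
    """Hexadecimal digits of the big-endian UTF-16 encoding of text."""
    return text.encode("utf-16-be").hex().upper()

decomposed = "a\u0301bc"    # X'0061030100620063'
precomposed = "\u00E1bc"    # X'00E100620063'

# NFC composes a + combining acute accent into the single code point U+00E1.
nfc_hex = utf16_hex(unicodedata.normalize("NFC", decomposed))

# NFD decomposes U+00E1 back into a + combining acute accent.
nfd_hex = utf16_hex(unicodedata.normalize("NFD", precomposed))

print(nfc_hex)
print(nfd_hex)
```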