String comparisons in a Unicode database
Pattern matching is one area where the behavior of existing MBCS databases is slightly different from the behavior of a Unicode database.
For MBCS databases in Db2®, the current behavior is as follows: If the match-expression contains MBCS data, the pattern can include both SBCS and non-SBCS characters. The special characters in the pattern are interpreted as follows:
- An SBCS halfwidth underscore refers to one SBCS character.
- A non-SBCS fullwidth underscore refers to one non-SBCS character.
- A percent (either SBCS halfwidth or non-SBCS fullwidth) refers to zero or more SBCS or non-SBCS characters.
In a Unicode database, there is really no distinction between "single-byte"
and "non-single-byte" characters. Although the UTF-8 format is a
"mixed-byte" encoding of Unicode characters, there is no real distinction
between SBCS and non-SBCS characters in UTF-8. Every character is
a Unicode character, regardless of the number of bytes in UTF-8 format.
In a Unicode graphic string, every non-supplementary character, including
the halfwidth underscore (U+005F) and halfwidth percent (U+0025),
is two bytes in width. For Unicode databases, the special characters
in the pattern are interpreted as follows:
- For character strings, a halfwidth underscore (X'5F') or a fullwidth underscore (X'EFBCBF') refers to one Unicode character. A halfwidth percent (X'25') or a fullwidth percent (X'EFBC85') refers to zero or more Unicode characters.
- For graphic strings, a halfwidth underscore (U+005F) or a fullwidth underscore (U+FF3F) refers to one Unicode character. A halfwidth percent (U+0025) or a fullwidth percent (U+FF05) refers to zero or more Unicode characters.
Note: Two underscores are needed to match a Unicode supplementary
graphic character because such a character is represented by two
UCS-2 characters in a graphic string. Only one underscore is needed
to match a Unicode supplementary character in a character string.
For the optional "escape expression", which specifies a character
to be used to modify the special meaning of the underscore and percent
sign characters, the expression can be specified by any one of:
- A constant
- A special register
- A host variable
- A scalar function whose operands are any of the previously mentioned operands
- An expression concatenating any of the previously mentioned operands or functions
- No element in the expression can be of type CLOB or DBCLOB. In addition, it cannot be a BLOB file reference variable.
- For character strings, the result of the expression must be one character or a FOR BIT DATA string containing exactly one (1) byte (SQLSTATE 22019). For graphic strings, the result of the expression must be one character (SQLSTATE 22019).