String unit specifications
The ability to specify string units for certain built-in functions and on the CAST specification allows you to process string data in a more "character-based manner" than a "byte-based manner". The string unit determines the length in which the operation is to occur. You can specify CODEUNITS32, CODEUNITS16, or OCTETS as the units for the operation.
- CODEUNITS32
- Specifies that Unicode UTF-32 is the units for the operation. CODEUNITS32 is useful when an application wants to process data in a simple fixed-length format and needs the same answer regardless of the storage format of the data (ASCII, EBCDIC, UTF-8, or UTF-16). Although the answers are in terms of CODEUNITS32, the data is not converted to UTF-32 to perform the function.
- CODEUNITS16
- Specifies that Unicode UTF-16 is the units for the operation. CODEUNITS16 is useful when an application wants to know how many double-byte characters are in a string.
- OCTETS
- Specifies that bytes are the units for the operation. OCTETS is often used when an application is interested in allocation buffer space or when operations need to use simple byte processing.
Determining the length of a string by counting in string units (CODEUNITS16 or CODEUNITS32) or bytes (OCTETS) can result in different answers. When OCTETS is specified, the length of a string is determined by simply counting the number of bytes in the string. Counting by CODEUNITS16 or CODEUNITS32 gives the same answer unless the data contains supplementary characters. For information about the difference between CODEUNITS16 and CODEUNITS32 when the data contains supplementary characters, see Difference between CODEUNITS16 and CODEUNITS32.
SELECT CHARACTER_LENGTH(NAME,CODEUNITS32)
FROM T1 WHERE NAME = 'Jürgen';
SELECT CHARACTER_LENGTH(NAME,CODEUNITS16)
FROM T1 WHERE NAME = 'Jürgen';
SELECT CHARACTER_LENGTH(NAME,OCTETS)
FROM T1 WHERE NAME = 'Jürgen';
Format | Representation of the name 'Jürgen' |
---|---|
UTF-8 | x'4AC3BC7267656E' |
UTF-16 | x'004A00FC007200670065006E' |
UTF-32 | x'0000004A000000FC0000007200000067000000650000006E' |
- The UTF-8 representation of the character ü is X'C3BC'. In UTF-8, characters that are not in the Latin-1 subset (essentially a through z, A through Z, and 0 through 9), such as accented characters or Japanese characters, are represented by multiple bytes.
- The UTF-16 representation of the character ü is X'00FC'. In UTF-16, each character is represented in 2 bytes. UTF-16 supplementary characters take two 2-byte code points.
- The UTF-32 representation of the character ü is X'000000FC'. In UTF-32, each character is represented in 4 bytes.
Specifying the string units on a built-in function does not affect the data type or the CCSID of the result of the function. If necessary, Db2 converts the data to Unicode for evaluation when CODEUNITS32 or CODEUNITS16 is specified. Db2 always evaluates the data in the encoding scheme of the output data when OCTETS is specified. For more information about the data types and CCSIDs of the results of functions, see the description of each function.
Differences between the way that characters are represented in ASCII, EBCDIC, and Unicode can affect the results of your queries.
SELECT SUBSTRING(C1,1,21,CODEUNITS16)
FROM T1 WHERE C1 = 'Mit freundlichen Grüßen, Jürgen';
Format | Representation of 'Mit freundlichen Grüß' |
---|---|
EBCDIC |
|
UTF-8 |
|
UTF-16 |
|