A character string is a sequence of bytes. The length of the string is the number of bytes in the sequence. If the length is zero, the value is called the empty string. This value should not be confused with the null value.
All values in a fixed-length string column have the same length, which is determined by the length attribute of the column. The length attribute must be between 1 and 254 inclusive.
The functions in the SYSFUN schema taking a VARCHAR as an argument will not accept VARCHARs greater than 4 000 bytes long as an argument. However, many of these functions also have an alternative signature accepting a CLOB(1M). For these functions, the user may explicitly cast the greater than 4 000 VARCHAR strings into CLOBs and then recast the result back into VARCHARs of desired length.
NUL-terminated character strings found in C are handled differently, depending on the standards level of the precompile option.
The ability to specify string units for certain built-in functions allows you to process string data in a more "character-based manner" than a "byte-based manner". The string unit determines the length in which an operation is to occur. You can specify CODEUNITS16, CODEUNITS32, or OCTETS as the string unit for an operation.
Since CHAR is a fixed-length type, DB2 pads the string with blanks to complete the defined length. The defined length here is 5 * CODEUNITS32, which is 20 bytes and is stored on the lower level / page - disk storage. VARCHAR(x CODEUNITS16) would mean x 2-byte characters, VARCHAR(x CODEUNITS32) would mean x 4-byte characters, and VARCHAR(x OCTETS) would mean x bytes.
For example, CHAR(5 CODEUNITS32), storing 'abc' is handled in memory as 'abc ' (two spaces) to meet the logical definition of the column. However, on disk it is stored as 'abc ' (17 spaces) since the DB2 Data Management Services layer requires the column to meet the physical definition.
The calculated length of a string computed using OCTETS (bytes) might differ from that computed using CODEUNITS16 or CODEUNITS32. When using OCTETS, the length of the string is determined by simply counting the number of bytes in the string. When using CODEUNITS16 or CODEUNITS32, the length of the string is determined by counting the number of 16-bit or 32-bit code units necessary to represent the string in UTF-16 or UTF-32, respectively. The length determined using CODEUNITS16 and CODEUNITS32 will be identical unless the data contains supplementary characters (see Difference between CODEUNITS16 and CODEUNITS32).
SELECT CHARACTER_LENGTH(NAME,CODEUNITS16) FROM T1
WHERE NAME = 'Jürgen'
SELECT CHARACTER_LENGTH(NAME,CODEUNITS32) FROM T1
WHERE NAME = 'Jürgen'
The next query,
which counts the length of the string in OCTETS, returns the value
7. SELECT CHARACTER_LENGTH(NAME,OCTETS) FROM T1
WHERE NAME = 'Jürgen'
These values represent
the length of the string expressed in the specified string unit.Format Representation of the name 'Jürgen'
-------- --------------------------------------
UTF-8 X'4AC3BC7267656E'
UTF-16BE X'004A00FC007200670065006E'
UTF-32BE X'0000004A000000FC0000007200000067000000650000006E'
The
representation of the character 'ü' differs among the three string
units: Specifying string units for a built-in function does not affect the data type or the code page of the result of the function. If necessary, DB2® converts the data to Unicode for evaluation when CODEUNITS16 or CODEUNITS32 is specified.
When OCTETS is specified for the LOCATE or POSITION function, and the code pages of the string arguments differ, DB2 converts the data to the code page of the source-string argument. In this case, the result of the function is in the code page of the source-string argument. When OCTETS is specified for functions that take a single string argument, the data is evaluated in the code page of the string argument, and the result of the function is in the code page of the string argument.
When CODEUNITS16 or CODEUNITS32 is specified, the result is the same except when the data contains Unicode supplementary characters. This is because a supplementary character is represented by two UTF-16 code units or one UTF-32 code unit. In UTF-8, a non-supplementary character is represented by 1 to 3 bytes, and a supplementary character is represented by 4 bytes. In UTF-16, a non-supplementary character is represented by one CODEUNITS16 code unit or 2 bytes, and a supplementary character is represented by two CODEUNITS16 code units or 4 bytes. In UTF-32, a character is represented by one CODEUNITS32 code unit or 4 bytes.
Character | UTF-8 representation | UTF-16BE representation | UTF-32BE representation |
---|---|---|---|
Unicode value X'1D400' - 'A'; mathematical bold capital A | X'F09D9080' | X'D835DC00' | X'0001D400' |
Unicode value X'0041' - 'A'; latin capital letter A | X'41' | X'0041' | X'00000041' |
Query Returns
----- -------
SELECT CHARACTER_LENGTH(C1,CODEUNITS16) FROM T1 2
SELECT CHARACTER_LENGTH(C1,CODEUNITS32) FROM T1 1
SELECT CHARACTER_LENGTH(C1,OCTETS) FROM T1 4