A character string is a sequence of code units. The length of the string is the number of code units in the sequence. If the length is zero, the value is called the empty string. This value should not be confused with the null value.
All values in a fixed-length string column have the same length, which is determined by the length attribute of the column. The length attribute must be 1 - 254, inclusive, unless the string unit is CODEUNITS32 which has a range of 1 - 63, inclusive.
The functions in the SYSFUN schema taking a VARCHAR as an argument will not accept VARCHARs greater than 4 000 bytes long as an argument. However, many of these functions also have an alternative signature accepting a CLOB(1M). For these functions, the user can explicitly cast the greater than 4 000 VARCHAR strings into CLOBs and then recast the result back into VARCHARs of the required length.
NUL-terminated character strings that are found in C are handled differently, depending on the standards level of the precompile option.
In a non-Unicode database, the string unit is always OCTETS and cannot be changed. In a Unicode database, the string units can be explicitly specified with the length attribute of a character string data type, or it can default based on an environment setting. If FOR BIT DATA is also specified for the character string data type, CODEUNITS32 cannot be specified and an environment setting of CODEUNITS32 does not apply.
The environment setting for string units is based on the value for the NLS_STRING_UNITS global variable, or the string_units database configuration parameter. The database configuration parameter can be set to either SYSTEM or CODEUNITS32. The global variable can also be set to either SYSTEM or CODEUNITS32, but also can be set to NULL. The NULL value indicates that the SQL session should use the string_units database configuration parameter setting. If the value for the environment setting is SYSTEM, then OCTETS is used as the default string units setting.
The ability to specify string units for certain built-in string functions allows you to process string data in a more "character-based manner" than a "byte-based manner". The string unit determines the unit that is used for length or position when you execute the function. You can specify CODEUNITS16, CODEUNITS32, or OCTETS as the string unit for some string functions. When no string unit is specified, the default string unit is usually determined by the string units of the source string argument (refer to the description of the function that you are using for details). The string units argument can be specified for string functions that support the parameter in Unicode or non-Unicode databases.
Since CHAR is a fixed-length type, DB2 pads the string with blanks to complete the defined length. The defined length here is 5 * CODEUNITS32, which is 20 bytes and is stored on the lower level / page - disk storage. VARCHAR(x CODEUNITS16) would mean x 2-byte characters, VARCHAR(x CODEUNITS32) would mean x 4-byte characters, and VARCHAR(x OCTETS) would mean x bytes.
For example, CHAR(5 CODEUNITS32), storing 'abc' is handled in memory as 'abc ' (two spaces) to meet the logical definition of the column. However, on disk it is stored as 'abc ' (17 spaces) since the DB2 Data Management Services layer requires the column to meet the physical definition.
The calculated length of a string computed using OCTETS (bytes) might differ from that computed using CODEUNITS16 or CODEUNITS32. When you are using OCTETS, the length of the string is determined by simply counting the number of bytes in the string. When you are using CODEUNITS16 or CODEUNITS32, the length of the string is determined by counting the number of 16-bit or 32-bit code units necessary to represent the string in UTF-16 or UTF-32. A length that is determined using CODEUNITS16 and CODEUNITS32 is identical unless the data contains supplementary characters (see Difference between CODEUNITS16 and CODEUNITS32).
SELECT CHARACTER_LENGTH(NAME,CODEUNITS16) FROM T1
WHERE NAME = 'Jürgen'
SELECT CHARACTER_LENGTH(NAME,CODEUNITS32) FROM T1
WHERE NAME = 'Jürgen'
The next query,
which counts the length of the string in OCTETS, returns the value
7. SELECT CHARACTER_LENGTH(NAME,OCTETS) FROM T1
WHERE NAME = 'Jürgen'
These values represent
the length of the string that is expressed in the specified string
unit.Format Representation of the name 'Jürgen'
-------- --------------------------------------
UTF-8 X'4AC3BC7267656E'
UTF-16BE X'004A00FC007200670065006E'
UTF-32BE X'0000004A000000FC0000007200000067000000650000006E'
The
representation of the character 'ü' differs among the three string
units: Specifying string units for a built-in function does not affect the data type, the string units, or the code page of the result of the function. If necessary, the data is converted to Unicode for evaluation when CODEUNITS16 or CODEUNITS32 is specified.
When OCTETS is specified for the LOCATE or POSITION function, and the code pages of the string arguments differ, the data is converted to the code page of the source-string argument. In this case, the result of the function is in the code page of the source-string argument. When OCTETS is specified for functions that take a single string argument, the data is evaluated in the code page of the string argument, and the result of the function is in the code page of the string argument.
When CODEUNITS16 or CODEUNITS32 is specified, the result is the same except when the data contains Unicode supplementary characters. This is because a supplementary character is represented by two UTF-16 code units or one UTF-32 code unit. In UTF-8, a non-supplementary character is represented by 1 to 3 bytes, and a supplementary character is represented by 4 bytes. In UTF-16, a non-supplementary character is represented by one CODEUNITS16 code unit or 2 bytes, and a supplementary character is represented by two CODEUNITS16 code units or 4 bytes. In UTF-32, a character is represented by one CODEUNITS32 code unit or 4 bytes.
Character | UTF-8 representation | UTF-16BE representation | UTF-32BE representation |
---|---|---|---|
Unicode value X'1D400' - 'A'; mathematical bold capital A | X'F09D9080' | X'D835DC00' | X'0001D400' |
Unicode value X'0041' - 'A'; latin capital letter A | X'41' | X'0041' | X'00000041' |
Query Returns
----- -------
SELECT CHARACTER_LENGTH(C1,CODEUNITS16) FROM T1 2
SELECT CHARACTER_LENGTH(C1,CODEUNITS32) FROM T1 1
SELECT CHARACTER_LENGTH(C1,OCTETS) FROM T1 4