In the current age of the Internet and global economy, more and more applications are required to handle data that presents itself in different national languages. To a developer, this means that national language requirements are to be taken into consideration during every phase of application development — database design, application design, and application programming. DB2 9 supports a variety of languages with a wide range of attributes like accent marks (French), bidirectional (Arabic), and large character set (Chinese). These languages pose different challenges in storing, processing, accessing, and presenting the data. The data that is affected by national languages is not limited to string data. It also includes numeric, date, and monetary data.
Character vs byte semantics of string data
Prior to DB2 9, DB2 had some string functions that worked on character and graphic data from a mixed perspective of bytes and double byte units. As explained earlier, increasingly users think of their data in terms of characters of various national languages. The subject of what constitutes a character, and how you can count them, is addressed by the new DB2 9 functionality discussed in this article.
In the case of a single-byte character encoding scheme, a single byte constitutes a character and the length of a single byte string is the same as the byte length of the string. In the case of a graphic strings, two bytes constitute a character and you use the number of double bytes to represent the length of the string. However, in the case of a multi-byte encoding, the length of the character in bytes varies according to the encoding used, and each character can be one or more bytes in length. The counting of string length using a byte is referred to as byte semantics in this article, and the counting of string length using the number of characters is referred to as character semantics.
Consider the following string in the Chinese language:
Figure 1. String in Chinese
The length of the string is two, if character semantics is used to calculate the length of string. But if byte semantics is used, and characters are encoded using UTF-8, then the length of the string is 6 bytes.
Need for character-based functions
The character-based data in SQL is associated with numeric values in many contexts, as mentioned below:
- Length of a string variable: Input argument for the SUBSTR function, which determines the desired length of the resulting string or output of the LENGTH function.
- Offset within a string: The second argument of the LOCATE function, which specifies the starting position within a string to begin the search.
These numeric values represent the number of bytes in single-byte data and the number of double bytes in graphic or double-byte data. However, these numeric values do not adhere to character semantics in the case of multi-byte character encodings, like UTF-8. The following condition helps us to understand the need for character-based functions.
What constitutes a character?
Recognizing the character as a single, unit as opposed to a sequence of bytes, is a requirement in the case of string manipulations involving multi-byte characters. It is necessary for an application programmer who is allocating a buffer to know how much memory to allocate for each character. Therefore, it is important to understand what constitutes a character for writing applications that involve multi-byte character data. A character can be defined as a unit of information that corresponds to an atomic unit in a written language. Each character is represented using a sequence of bits using a character encoding. An individual character is usually encoded using a byte or more, depending upon the encoding used. Consider the characters "A" and "Latin capital letter A with ring above." The hexadecimal representation of the character "A" is x'41' and the "Latin capital letter A with ring above" is x'C385'. You can get this representation using the SQL function hex().
Figure 2. Hexadecimal representation of characters
You can see from the above representation, that during display there is only one character. However, the length of "A" is one byte and that of "Latin capital letter A with ring above" is two bytes.
Length of a string in terms of code units
The length of the character string is dependent on the character encoding (ASCII, EBCDIC, and Unicode) that is used to encode the character. A character can be represented using one or more of the code units of a respective encoding. Therefore, if you have the same set of characters in a string, its length may defer according to the encoding used. Consider an example of a character named "musical symbol G clef." Now consider the different encodings for this character from Table 1, and you can see that the hexadecimal representation of different encodings and their length in different code units are diffrent.
Table 1. Hexadecimal representation of the same character in different encodings
|Length in respective code units||4||2||1|
You can see from Figure 3 how you get the length in bytes for the "musical symbol G clef" character in UTF-8.
Figure 3. Length of "musical symbol G clef" in bytes
Search for characters
When you search a string for the occurrence of a specified substring, the search is first performed and then the result (position within the string) is returned as the number of byte positions, not the correct character or code units position. Figure 4 shows a search for "a", the actual character position of "a" is 2 but the output is 3 because there is a multi-byte character in the string.
Figure 4. Result of search inside a string
Splitting of characters
Treating multi-byte character data as a sequence of bytes could lead to the accidental splitting of characters by string functions. In Figure 5, the substring of length 1 from the first byte of the string has been specified. Since the first character is multi-byte, it results in splitting the character and leads to dirty output.
Figure 5. Character being split by SUBSTR function
Specify the start
You may need to provide input to functions like LOCATE to specify the starting position of the search. In the case of multi-byte data you may have problems, and the results may not be what you expected. In Figure 6, shows a search for the character after the third byte, which would have been the second occurrence of the character "a" if all were single byte characters. But since you have a multi-byte character as the first character, you get the result as 3, which is the first occurrence of the search string.
Figure 6. Use of LOCATE to specify start position
Character based functions
In addition to the string functions that were available with earlier versions of DB2 that handle character data using byte semantics, DB2 9 introduces a set of character-based string functions that understand character semantics. If a character in a particular encoding spans multiple bytes, the character-based string functions can process each character as a single unit as opposed to a sequence of bytes.
Introducing string length units
The character-based string functions of DB2 introduces the concept of string length units to understand the character encoding, according to which the input string is to be considered for string operations. The string units available with DB2 9 for Linux, UNIX, and Windows are OCTETS, CODEUNITS16, and CODEUNITS32.
The string function has the specification of numeric value, or the result is a numeric value related to the input data. The string length unit pertains to the numeric values. The string operation to be performed may result in different outputs, depending on the string length unit that is used for counting the characters. For some functions, a numeric value is input, for example start, length, and offset parameters of string functions. With other functions, a numeric value is returned as the result, for example, to search a string for the occurrence of a specified substring, the search is first performed and then the result is returned as a number in the string length units implicitly or explicitly specified.
When OCTETS is used as the string length unit, the length of a string is determined by simply counting the bytes of the string. The CODEUNITS16 specifies that Unicode UTF-16 is used for character semantics. Also, CODEUNITS32 specifies that Unicode UTF-32 is used to understand the character boundaries of multi-byte characters.
Counting code units using either CODEUNITS16 or CODEUNITS32 gives the same answer, unless supplementary characters or a surrogate pair is involved. When supplementary characters are involved, a supplementary is counted as two UTF-16 code units using CODEUNITS16, or one UTF-32 code unit using CODEUNITS32.
If you take the length of a character in CODEUNITS, the output differs according to the CODEUNITS used as inputs to the string function.
Listing 1. Length of a string in different CODEUNITS
VALUES CHARACTER_LENGTH(X'F09D849E', OCTETS) 1 ----------- 4 1 record(s) selected. VALUES CHARACTER_LENGTH(X'F09D849E', CODEUNITS16) 1 ----------- 2 1 record(s) selected. VALUES CHARACTER_LENGTH(X'F09D849E', CODEUNITS32) 1 ----------- 1 1 record(s) selected.
Character-based string functions in DB2 9
This function, as mentioned in SQL standards, is used to find the length of a character string in character semantics. This function is similar to the LENGTH function in DB2 and has an optional string length unit, in which the result can be expressed. Unlike the LENGTH function, the CHARACTER_LENGTH does not accept input data that is not string based. The function includes two arguments, the first one being the string and second one the code units. In many situations, you need the string length in terms of code units, the character-based functions could be used to find the length of the string in terms of string units.
Consider the example of the character "musical symbol G clef" discussed before.
Listing 2. Use of CHARACTER_LENGTH to get the length of string in CODEUNITS
VALUES CHAR_LENGTH(X'F09D849E',CODEUNITS16) 1 ----------- 2 1 record(s) selected. VALUES CHAR_LENGTH(X'F09D849E',CODEUNITS32) 1 ----------- 1 1 record(s) selected.
The character-based string functions can be used to solve the problem of getting the length of string in terms of CODEUNITS.
This function, as mentioned in SQL standard, returns the length of the input string in octets or bytes. It is similar to the LENGTH function when used against a single-byte data type. It gives double the LENGTH function value if double-byte data type is used as input. The same functionality could be derived by using CHARACTER_LENGTH and using OCTETS as string length units.
Listing 3. Use of OCTECT_LENGTH to get the length of string in bytes
VALUES OCTET_LENGTH(X'F09D849E') 1 ----------- 4 1 record(s) selected.
The LOCATE function returns the starting position of the first occurrence of one string within another string. If the search-string is not found, and neither argument is null, the result is zero. If the search-string is found, the result is a number from 1 to the actual length of the source-string. If the optional start is specified, it indicates the character position in the source-string at which the search is to begin. An optional string length unit can be specified to indicate in what units the start and result of the function are expressed.
The problem with specifying the start in the LOCATE function can be solved using the character-based functions, as shown in Figure 7:
Figure 7. Use of LOCATE with CODEUNITS
The POSITION function returns the starting position of the first occurrence of one string within another string. If the string to be searched is not found and neither argument is null, the result is 0. If string to be searched is found, the result is a number from 1 to the actual length of input string, expressed in the code units that are explicitly specified. The POSITION function is defined in the SQL Standard. It is similar to, but not the same as, the POSSTR function that is implemented across the DB2 family.
The problem of byte positions being returned for the character positions can be solved using character-based functions. Figure 8 shows how you can do so using a LOCATE function.
Figure 8. Use of POSITION with CODEUNITS
The SUBSTRING function returns a substring of a string. A substring is zero or more contiguous string length units of input string. Along with the input string, the SUBSTRING function has three other arguments, which are start position, length, and code unit specification. The start position specifies the position within input string that is to be the first string length unit of the result. The length argument specifies the length of the desired substring. The splitting of CODEUNITS used for making the character does not happen when using character-based functions. Figure 9 shows how to prevent spitting of multi-byte characters.
Figure 9. Use of SUBSTRING with CODEUNITS
Handling of incorrect or incomplete data
The string operations involving multi-byte characters may involve conditions with characters being incorrect (combination of bytes not defined in the encoding) or incomplete (having a partial byte of a multi-byte character). Consider some common conditions that can cause such situations while you do string manipulation using the new character-based string functions. The examples are based on the character "musical symbol G clef" (UTF-8 hex format is X'F09D849E'), which has a length of two in CODEUNITS16.
Problems with input string
Incomplete string data
A string data that has a partial character can be called an incomplete string data. Consider that you have a character in UTF-8 encoding has length of 3 bytes, and the string has only the first two byte of the encoding. If you find the length of the first two bytes in CODEUNITS16, the function results in a warning.
Listing 4. Use of incomplete input string data
VALUES CHARACTER_LENGTH(X'849E',CODEUNITS16) 1 ----------- 2 SQL1289W During conversion of an argument to "SYSIBM.CHARACTER_LENGTH" from code page "1208" to code page "1200", one or more invalid characters were replaced with a substitute character, or a trailing partial multi-byte character was omitted from the result. SQLSTATE=01517 1 record(s) selected with 1 warning messages printed.
Incorrect string data
Each character encoding has its set of permitted byte or byte combinations for a particular character. The input string data to string functions may have wrong or invalid characters in the string that it is supplied with. If DB2 comes across an invalid character while doing CODEUNITS16 or CODEUNITS32 calculation, it replaces any such byte sequences by the substitution character when the byte sequence forms part of the result of applying the function. The hex format of X'80' in UTF-8 is invalid and a warning is thrown when it is encountered.
Listing 5. Use of incomplete character data
VALUES CHARACTER_LENGTH(X'80',CODEUNITS16) 1 ----------- 1 SQL1289W During conversion of an argument to "SYSIBM.CHARACTER_LENGTH" from code page "1208" to code page "1200", one or more invalid characters were replaced with a substitute character, or a trailing partial multi-byte character was omitted from the result. SQLSTATE=01517 1 record(s) selected with 1 warning messages printed.
OCTETS and graphic string input
In SUBSTRING FUNCTION, when OCTETS is specified and the input to the function is graphic data, the <start> parameter is not odd or the <length> parameter is not even, it results in an error as it is splitting a graphic character into two bytes.
Listing 6. Splitting of characters
VALUES SUBSTRING(GRAPHIC('K'),2,1,OCTETS) 1 -- SQL20289N Invalid string length unit "OCTETS" in effect for function "SYSIBM.SUBSTRING". SQLSTATE=428GC
Problems with output string
Stand-alone surrogates or incomplete string data
When a character is represented using a sequence of two 16-bit code units, it's called a surrogate pair. A pair can be distinguished into high and low surrogates. When using CODEUNITS16 in the string functions, DB2 distinguishes the standalone or isolated code units. That is, if you have a surrogate pair, the length of the character is two in CODEUNITS16 and is one in CODEUNITS32. So the functions, like SUBSTRING, could split the surrogate pair depending up on the arguments you give.
Buffer overflow on substitution character insertion
When a substitution character is inserted, the byte length of the string may increase. If the length increases beyond the amount of buffer space available for the output, the tail part of the string is truncated and you receive a warning that the value of a string was truncated when assigned to another string data type with shorter length.
Note that the new string function is under the SYSIBM function path, compared to the older functions under the SYSFUN version. You are expected to use the new SYSIBM function path, even when you are not using the string unit argument. By default, the SYSIBM function path precedes over SYSFUN in the default CURRENT PATH. All the older functions continue to be supported.
The character-based functions may need to convert the input data string to an intermediate UNICODE code page, like UTF-16 or UTF-32, before its processing can be done. In cases where the result data is a string, the intermediate result is also converted back to the input code page. OCTETS as a string unit specification does not require any conversion and is more efficient. The CODEUNITS16 and CODEUNIST32 as string units may cause code page conversions. Though DB2 does its own optimizations, code page conversions may or may not be necessary. The conversion costs are all the more important for LOB input because of the potentially large size of the input string.
This article provided you with an overview of the new character-based string functions in DB2 Data Server. It first explained key concepts, such as character and byte semantics with respect to string data. Next, it discussed why you need these functions, and provided examples of some generic scenarios. You also learned about the concept of code units and character-based functions. Then, it explained how these functions helped you to solve the problems discussed before and gave an example for each scenario. Finally, it discussed the common problems and the performance considerations while using these functions. Ideally, you should be using these functions to do string manipulations better and pushing more application logic into the SQL layer, rather than implementing the same logic in your application.
- "Globalize your On Demand Business": Get a basic understanding of coded character sets or code pages, essential to deal with multiple languages in information processing systems.
- "Access your database from everywhere" (developerWorks, Jan 2006): Read about a practical approach to DB2 UDB for Linux, UNIX, and Windows Unicode support.
- "DB2 UDB National Language Support for the Command Line Processor and Utilities" (developerWorks, Oct 2002): Use the CLP in the national language of your choice on both Windows and AIX environments.
- "Setting Up a Mixed-Byte Character Set (MBCS) Database on an English OS in DB2 UDB Version 8" (developerWorks, Sept 2002) Follow step-by-step instructions for those who need to set up an environment and create a mixed-byte character set (MBCS) database on DB2 Universal Database Version 8 in an English operating system environment.
- "A brief introduction to code pages and Unicode" (developerWorks, March 2000): Understand how the Unicode standard works and why you need it. Explore the code behind the letters you see on your screen and in your printouts.
- Visit the developerWorks resource page for DB2 for Linux, UNIX, and Windows to read articles and tutorials and connect to other resources to expand your DB2 skills.
- Learn about DB2 Express-C, the no-charge version of DB2 Express Edition for the community.
Get products and technologies
- Download a free trial version of DB2 Enterprise 9.
- Now you can use DB2 for free. Download DB2 Express-C, a no-charge version of DB2 Express Edition for the community that offers the same core data features as DB2 Express Edition and provides a solid base to build and deploy applications.
- Download IBM product evaluation versions and get your hands on application development tools and middleware products from DB2®, Lotus®, Rational®, Tivoli®, and WebSphere®.
- Participate in the discussion forum.
- Check out developerWorks blogs and get involved in the developerWorks community.