Unicode string functions in DB2

When you use string functions in ESQL expressions, certain parameters refer to character positions or counts.

For example, with the SUBSTRING function, coding:
  SUBSTRING('Hello World!' FROM 7 FOR 4)
results in the string Worl because 7 refers to the seventh character position, and 4 refers to four characters.

If the string Hello World! is represented in an ASCII code page, the seventh byte of that representation is the seventh character. However, in other code pages or encoding schemes (for example, some Unicode representations) the seventh character could start at byte 13, and 4 characters can occupy 8 bytes.

IBM® App Connect Enterprise correctly handles this situation; that is, the numeric parameters and results of these string functions always refer to characters and not the bytes used to represent them.

In some situations, IBM App Connect Enterprise delegates expressions to a database engine for processing. For example, if there is a WHERE clause, in a SELECT function, applied to a database data source, the database is passed the WHERE clause if it can interpret all the functions in the expression.

If there are functions that are not supported by the database, IBM App Connect Enterprise passes only those parts of the expression that can be interpreted, retrieves an unfiltered record set, and performs the remaining filtering itself.

DB2® string functions use byte indexing and not character indexing. Therefore, for Unicode data, the meaning of certain functions differs from the IBM App Connect Enterprise functions, even though they can be ‘interpreted'.

Characters in Unicode UTF8 representation, for example, can occupy from 1-4 bytes, so that the seventh character can start anywhere from byte 7 to byte 25.

The following string functions are affected:
  • INSERT function
  • LEFT function
  • LENGTH function
  • OVERLAY function
  • POSITION function
  • RIGHT function
  • SPACE function
  • SUBSTRING function

These functions either take numeric parameters, or return numeric results that refer to indexes, or counts, of characters in a string.

When expressions involving these functions are passed to the DB2 database, and Unicode string data is manipulated in the database, the results can be unexpected, or an error might occur.

This error might also occur if, for example, an expression of this type is passed directly to the database by using the PASSTHRU function. In this situation, you could modify each expression yourself, as necessary, for the target database.

It is not possible to systematically modify expressions to avoid this problem and IBM App Connect Enterprise does not attempt to do so.

If the Unicode strings do not use any characters that, in UTF8 representation, occupy more than 1 byte each, the functions perform correctly.