How Db2 handles Unicode supplementary characters
Unicode supplementary characters are those characters that have a code point between U+10000 and U+10FFFF. These characters include certain math symbols and certain characters from Chinese, Japanese, and some historic scripts.
Supplementary characters are sometimes called surrogate characters, because UTF-16 encodes each one as a pair of surrogate code units. Each supplementary character takes up 4 bytes in both UTF-8 and UTF-16: four 8-bit code units in UTF-8 and two 16-bit code units in UTF-16.
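The sizes described above can be checked outside Db2. The following is a minimal Python sketch, not Db2 code. The function name code_unit_counts and the sample string are hypothetical and chosen only for illustration. The three counts it returns loosely mirror what the CODEUNITS32, CODEUNITS16, and OCTETS units measure.

```python
def code_unit_counts(text: str) -> dict:
    """Return the length of text in code points, UTF-16 code units, and UTF-8 bytes."""
    return {
        # One entry per Unicode code point (Python strings are sequences of code points).
        "code_points": len(text),
        # UTF-16 code units are 16 bits each, so divide the byte length by 2.
        # The "-be" codec variant is used so that no byte order mark is added.
        "utf16_units": len(text.encode("utf-16-be")) // 2,
        # UTF-8 code units are 8 bits each, so the byte length is the unit count.
        "utf8_bytes": len(text.encode("utf-8")),
    }


# 'a' (U+0061, a BMP character) followed by U+1D11E MUSICAL SYMBOL G CLEF
# (a supplementary character).
sample = "a\U0001D11E"
counts = code_unit_counts(sample)
print(counts)
```

For the sample string, the supplementary character contributes one code point, two UTF-16 code units, and four UTF-8 bytes, while the letter contributes one of each.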
Db2 detects supplementary data that is not well formed only if Db2 has to manipulate the data in some way. For example, if Db2 converts the data or processes it as part of a built-in function, Db2 can detect whether the data is well formed. Any built-in function that accepts the CODEUNITS32, CODEUNITS16, and OCTETS options, such as CHARACTER_LENGTH and LOCATE_IN_STRING, can detect whether supplementary characters are well formed. Other operations are also "character aware." For example, LIKE predicates, the truncation of host variables, and character conversion operations must interpret the content of any character data that they process.
However, suppose that you insert data into a column and Db2 does not need to manipulate it in any way. In this case, Db2 does not detect data that is not well formed. For example, if your COBOL application, which uses UTF-16 data, inserts garbage data into a GRAPHIC column, Db2 does not report any problems. You can use the NORMALIZE_STRING function to process data and ensure that it is well formed according to one of the Unicode normalization forms. However, using this function might degrade performance.
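Because Db2 does not validate data that it stores without manipulating it, an application might check its own UTF-16 data before the insert. The following is a minimal Python sketch of such a check, not a Db2 feature. The function name find_ill_formed_utf16 is hypothetical, and the sketch assumes that the data is big-endian UTF-16 with no byte order mark. It looks only for unpaired surrogate code units and an incomplete trailing code unit.

```python
import struct


def find_ill_formed_utf16(data: bytes) -> int:
    """Return the byte offset of the first ill-formed UTF-16BE sequence, or -1 if none.

    Assumption: data is big-endian UTF-16 without a byte order mark.
    """
    if len(data) % 2:
        # An odd byte count means the last 16-bit code unit is incomplete.
        return len(data) - 1

    # Unpack the bytes into a tuple of unsigned 16-bit big-endian code units.
    units = struct.unpack(f">{len(data) // 2}H", data)

    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:
            # A high surrogate must be followed immediately by a low surrogate.
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2
                continue
            return i * 2
        if 0xDC00 <= unit <= 0xDFFF:
            # A low surrogate that does not follow a high surrogate is unpaired.
            return i * 2
        i += 1
    return -1


well_formed = "a\U0001D11E".encode("utf-16-be")  # 00 61 D8 34 DD 1E
truncated = well_formed[:4]                       # high surrogate with no partner
print(find_ill_formed_utf16(well_formed), find_ill_formed_utf16(truncated))
```

A check like this runs in the application, so it adds no work inside Db2, but it does not replace normalization: data can be well formed and still not be in a normalization form.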