Guidelines for using XML in a non-Unicode database

The pureXML® feature enables you to store and retrieve XML data from a database even when that database does not use a Unicode code page. This topic provides guidance on how to store XML data safely in this situation, in order to minimize any possibility of data corruption or loss.

When storing an XML document in a relational database table, in order to ensure data integrity you need to be mindful of both the encoding of the XML document to be stored, and the encoding of the target database. Where these two encodings match, data can always be safely inserted. Where these two encodings do not match, character conversion may be required. Specifically, any characters in an XML document that do not have a matching code point in a target database code page will be substituted on insertion. Further, a second character substitution occurs during any XML processing, such as during casting or comparison operations involving XML data. These operations are always managed in a Unicode format, and therefore characters in a non-Unicode database must be converted to UTF-8 format during the operation. In any event where character substitution is required, operations on that data may have an unexpected outcome, and XML data retrieved from the database may contain incorrect values.

The following discussion describes ways of inserting and querying XML data under various encoding scenarios, providing methods of safely avoiding code page conversion in order to ensure the integrity of stored XML data and the operations that involve it.

Contents

XML document insertion and code page conversion

Whenever XML data is inserted into a Db2® database server through a host variable or parameter marker that has a character data type (a data type of CHAR, VARCHAR or CLOB that is not a FOR BIT DATA type), code page conversion occurs if the database code page differs from that of the client or application issuing the request. A second conversion occurs as the inserted character data is converted from the database code page to Unicode, the format in which XML data is managed internally.

The following table shows the various possible encoding combinations between a database and an XML document string inserted from a client or application. The XML document encoding is the same as the client code page because the client inserts the XML data via a character data type. For each combination, the implications for code page conversion and the possibility of resultant character substitution during XML document insertion are described.
Table 1. Encoding scenarios between a database and an inserted XML document string
Scenario XML document encoding Database encoding Code pages match?
1. Unicode (UTF-8) Unicode (UTF-8) yes
2. Non-Unicode Unicode (UTF-8) no
3. Non-Unicode Non-Unicode yes
4. Unicode (UTF-8) Non-Unicode no
5. Non-Unicode Non-Unicode no
  1. In Scenario 1, an XML document and a database share a Unicode encoding. No character conversion occurs as the XML document is inserted. It is always safe to insert XML data in this way.
  2. In Scenario 2, a non-Unicode XML document is converted to UTF-8 for insertion into a Unicode database. No character substitution occurs through this process. It is always safe to insert XML data in this way
  3. In Scenario 3, an XML document and a database share the same non-Unicode encoding. In this case, the XML document can contain only code points that are part of the database code page, so no character substitution occurs during code page conversion. It is always safe to insert XML data in this way
  4. In Scenario 4, a Unicode XML document is inserted into a non-Unicode database. Code page conversion occurs if the XML document is inserted from a UTF-8 client or application through either a host variable or a parameter marker that has a character data type. Any characters in the XML document that do not have matching code points in the database code page will be substituted.
  5. In Scenario 5, an XML document is inserted into a database server where the two have different encodings, neither of which is UTF-8. In this case, like in Scenario 4, if the XML document is inserted using a character data type, character substitution occurs in any case where the XML document contains characters that are not valid in the database code page.

Safely inserting XML data into a non-Unicode database

The safest way to ensure the integrity of XML data is to use a Unicode database. However, if this is not possible, there are other ways to prevent character substitution from occurring. The following list describes various methods of inserting XML data safely, whether or not a Unicode database is used:

Use a Unicode database, or ensure that the database and client use the same encoding
As illustrated in Table 1, code page conversion issues for XML data are always avoided when:
  • The database is Unicode
  • The database and the client share the same encoding, whether or not this is Unicode
Avoid using a host variable or parameter marker with a character data type

When a Unicode database cannot be used, code page conversion of XML data can also be avoided by binding the XML data using a host variable or parameter marker of type XML or any binary data type. That is, specifying a data type other than CHAR, VARCHAR, or CLOB for the XML data allows it to be passed directly from the client or application code page to Unicode, bypassing conversion to the database code page.

The ENABLE_XMLCHAR configuration parameter allows you to control whether or not inserting is allowed via character data types. Setting ENABLE_XMLCHAR to "NO" will block the usage of character data types during XML document insertion, preventing possible character substitution and ensuring the integrity of stored XML data. The BLOB and FOR BIT DATA types are still allowed, since these data types are safe from code page conversion. By default, ENABLE_XMLCHAR is set to "YES" so that insertion of character data types is allowed.

When a Unicode database is used code page conversion is never a problem, so in this case the ENABLE_XMLCHAR configuration parameter has no effect; character data types can be used for XML document insertion regardless of the setting for ENABLE_XMLCHAR.

Use character entity references for characters that are not in the database code page
In cases where code page conversion cannot be avoided and a character data type must be used for the XML data stream, it is best to ensure that all characters in the XML document have matching code points in the database code page. For any characters in the XML data that do not have matching code points in the target database you can use a character entity reference to specify the Unicode code point of the character. Code page conversion is always bypassed for character entity references, so that the correct character is preserved in the XML data. For example, the character entity references > and > are, respectively, the hexadecimal and decimal equivalents of the greater than sign (">").

Querying XML data in a non-Unicode database

As with inserting XML data into a database, the safest way to ensure data integrity during a query involving XML data is to use a Unicode database. If this is not possible, character substitution is avoided by making sure that all XML data is representable in the database code page, or by using character entity references for characters that are not in the database code page.

In the event that a query contains XML content that includes characters not representable in the database code page, the following two types of character substitution may occur, potentially causing unexpected results for the query:
Replacement by default substitution character
The default substitution character for the code page is introduced in place of the unmatchable character in the XML data. For example, if a Chinese character is passed into an ASCII encoded database (ISO-8859-1), the original character is substituted with the ASCII code point 0x1A, a control character which typically displays as a question mark ('?') when displayed on a client. When the XML data is converted from the database code page to Unicode, the substitution character is preserved.
Replacement by nearest character equivalent ("folding")
The original input character is replaced by a character in the target code page that is similar, but not necessarily identical, to the original character. It is sometimes the case that two or more characters with distinct Unicode code points map onto a single code point in a database code page (the nearest character equivalent in the target code page), so that after insertion into a database the distinction between the values is lost. This scenario is illustrated in Example 2.

Examples

The following examples demonstrate the possible effects of code page conversion when a client or application with a UTF-8 encoding is used to query XML data in a non-Unicode database. In these examples, assume that the database is created using code page ISO8859-7 (Greek). XQuery expressions are used to match XML data stored in table T1, where the stored XML data consists of the Unicode Greek sigma character (ΣG) and the Unicode mathematical sigma character (ΣM). The code point 0xD3 identifies the sigma character in the ISO8859-7 database.

The table T1 is created and populated using the following commands:
CREATE TABLE T1 (DOCID BIGINT NOT NULL, XMLCOL XML);
INSERT INTO T1 VALUES (1, XMLPARSE(
       document '<?xml version="1.0" encoding="utf-8" ?> <Specialchars>
	<sigma>ΣG</sigma>
	<summation>ΣM</summation>
	</Specialchars>' 
       preserve whitespace));
Example 1: Successful code page conversion (character is representable in database code page)
XQUERY for $test in db2-fn:xmlcolumn("T1.XMLCOL")//*[. = "ΣG"] return $test
This expression produces the desired result:
<sigma>ΣG</sigma>
In this case, the expression ΣG begins at the client as the Unicode code point for the Greek sigma character (U+03A3), is converted to the sigma character in the Greek database code page (0xD3), and is then converted back to the correct Unicode character for XML processing. Because the Greek sigma character is representable in the database code page, the expression matches correctly. This character conversion is shown in the following table:
Table 2. Character data conversion (Example 1)
  Client (UTF-8)   Database (ISO8859-7)   XML parser (UTF-8)
Character U+03A3 (Greek sigma) 0xD3 (Greek sigma) U+03A3 (Greek sigma)
Example 2: Unsuccessful code page conversion (character is not representable in database code page)
XQUERY for $test in db2-fn:xmlcolumn("T1.XMLCOL")//*[. = "ΣM"] return $test
This expression does not produce the desired result:
<sigma>ΣG</sigma>
In this case, the expression ΣM begins at the client as the Unicode code point for the mathematical symbol sigma (U+2211), is converted to the sigma character in the Greek database code page (0xD3) and then matches the ΣG character when the XML comparison occurs. For the return expression, the process is identical to that in Example 1. The Unicode XML character ΣG converts first to the sigma character in the Greek database code page (ΣA), and then back to the Greek sigma character in the client UTF-8 code page (ΣG). This character conversion is shown in the following table:
Table 3. Character data conversion (Example 2)
  Client (UTF-8)   Database (ISO8859-7)   XML parser (UTF-8)
Character U+2211 (Mathematical sigma) 0xD3 (Greek sigma) U+03A3 (Greek sigma)
Example 3: Bypassing code page conversion using a character entity reference
XQUERY for $test in db2-fn:xmlcolumn("T1.XMLCOL")//*[. = "&#2211;"] 
   return $test
This expression produces the desired result:
<summation>ΣM</summation>
In this case, the expression ΣM begins at the client as the Unicode code point for the mathematical symbol sigma (U+2211), and because it is escaped as character reference &#2211, the Unicode code point is preserved when it is passed to the XML parser, allowing for successful comparison against the stored XML value ΣM. The bypassing of character conversion is shown in the following table:
Table 4. Character data conversion (Example 3)
  Client (UTF-8)   Database (ISO8859-7)   XML parser (UTF-8)
Character U+2211 (character reference for mathematical sigma) "&#2211" (character reference for mathematical sigma) U+2211 (mathematical sigma)
Example 4: Unsuccessful code page conversion (character is not representable in database code page)
This example is like Example 1, except that here an ASCII encoded database is used and the default substitution character for the code page is introduced into the XML expression.
XQUERY for $test in db2-fn:xmlcolumn("T1.XMLCOL")//*[. = "ΣG"] return $test
This query fails to match the correct value in table T1. In this case, the Unicode character U+2211 (Greek sigma) does not have a matching code point in the ASCII code page so a default substitution character is introduced, in this case the question mark character ('?'). This character conversion is shown in the following table:
Table 5. Character data conversion (Example 4)
  Client (UTF-8)   Database (ISO8859-1)   XML parser (UTF-8)
Character U+2211 (mathematical sigma) 0x003F ('?') 0x003F ('?')