XML in a non-Unicode database

Handle code page conversion in XML data in DB2 for Linux, UNIX, and Windows

This article explains how to use the XML native data type in a non-Unicode database in DB2® Viper 2. Find tips on how to avoid substitution characters, pointers for query structuring and character reference usage, and explanations of the new utility user-defined functions (UDFs) shipping with the product.

Randall P. Spalten (rspalten@us.ibm.com), Advisory Software Engineer, IBM

Photo: Randy SpaltenRandall Spalten is an advisory software engineer in the Silicon Valley Laboratory in San Jose, Calif. He joined IBM in 1997 working on Information Integration. Since then, he has worked on many areas of DB2, including Federation, DDL semantics, two-phase commit processing, and runtime, before moving to pureXML support. He currently works on the pureXML Runtime team, focusing on non-Unicode support and runtime performance enhancement.



Preethi Vishwanath (pvishwa@us.ibm.com), Software Engineer, IBM

Photo: Preethi VishwanathPreethi Vishwanath is a software developer in the DB2 XML Runtime group. She has been working on enabling XML support for non-Unicode databases since she joined IBM as an intern in 2006. Her professional career interests include distributed database design, runtime and language support, and replication issues. She holds a Master's degree in Computer Science from San Jose State University.



19 July 2007

XML support in V9

In DB2 9, the only way to include XML data in a database was to have a UTF-8 code set for the database. This restriction was true even if the XML data included only ASCII code points. XML is stored in the database in a binary format, and the text nodes are stored as UTF-8 code points. In DB2 9, no conversion was done to the XML document, so it had to be inserted into the database as a series of UTF-8 bytes. The DB2 Viper 2 release lifts the restriction that all databases for use with the pureXML feature have to be created in Unicode. This document explains the support for XML in non-Unicode databases, and how to ensure that XML documents can be inserted and queried safely, without data integrity loss.

Example scenario: Migrating to pureXML without migrating to Unicode code set

Suppose you are a database administrator (DBA) for an existing non-Unicode DB2 database that contains a table that catalogs books in a library. Currently, the table for books has the following form:

Listing 1. Sample relational schema
	CREATE TABLE BOOKCAT
		(TITLE           VARCHAR(400) NOT NULL,
		AUTHOR_LAST      VARCHAR(50) NOT NULL,
		AUTHOR_FIRST     VARCHAR(50),
		PUBLISHER        VARCHAR(100),
		DATE_PUB         DATE NOT NULL,
		ISBN             BIGINT NOT NULL PRIMARY KEY,
		KEYWORD1         CHAR(20),
		KEYWORD2         CHAR(20),
		FIRST_PARA       CLOB(2K));

However, there are some restrictions on this data in a non-Unicode relational table. What about books whose titles or text contain characters outside the current database's character set? Why restrict the table to only two keywords per book? What if the size of the keyword is too large for a new entry? What if a book has an unknown publication date, or more than one author?

This schema is too restrictive, so you decide to explore how to store this data in XML, but do not want to migrate the existing database to UTF-8. Suppose that the database of books also gets an influx of new books from all over the world, with books written in Chinese, Russian, and other languages. You would like the database to be able to store the metadata about the books (title, author, first paragraph) in English as well as the native language. Each book can be modeled as an XML document that is encoded differently, depending on the language of the book. How can these documents be inserted, stored, and queried in a non-Unicode pureXML database?

New support for XML in non-Unicode databases

In DB2 Viper 2, customers can now create XML columns in any new or existing table, inside of any new or existing database. Use of a Unicode code set is no longer required. To create a column of type XML, the format is:
CREATE TABLE <table_name> (<column_name1> XML);

Please note the syntax for CREATE TABLE remains the same as before.

Example:

Listing 2. Create a column of type XML
CREATE TABLE  xmlbookcat (
	ISBN BIGINT NOT NULL PRIMARY KEY,
	BOOK XML
);

In order to alter a table to add an XML column, it must be legal to add a column to the table (meaning the table cannot have NOT NULL columns with no DEFAULT clause, and the table's row width must be able to accommodate the new XML column, which is 80 bytes wide). DB2 allows users to create a table that is stored in a different character code set identifier (CCSID) than the rest of the database, using the CCSID UNICODE or CCSID ASCII clause of the column declaration of CREATE TABLE or ALTER TABLE. In a database that is not Unicode, it is illegal to create or add an XML column to a table that has the CCSID UNICODE attribute. Such tables are not stored in the database code page, and XML is not supported in these tables. Attempting these actions results in a -873 (code page mismatch) error. Create a new table to hold the XML data in the non-Unicode database.

Example:

CREATE TABLE untable(a VARCHAR(20)) CCSID UNICODE;
ALTER TABLE untable ADD xmlcol XML;

This would result in the error SQL0873N Objects encoded with different encoding schemes cannot be referenced in the same SQL statement.

Storage of XML data

The structure of the XML documents might be done like this:

Listing 3. Structure of XML documents
	<book>
		<title> . . . </title>
		<native_title> . . . </native_title>
		<author>
			<firstname> . . . </firstname>
			<lastname> . . . </lastname>
			<native_name> . . . </native_name>
		</author>
		<publisher> . . . </publisher>
		<date> . . . </date>
		<keyword> . . . </keyword>
		<first_para> . . . </first_para>
		<native_first_para> . . . </native_first_para>
		<ISBN> . . . </ISBN>
	</book>

Because different books may be written in different languages, you may need to store XML documents containing different character sets in the same DB2 column. The example below shows that the names of the elements are in ASCII for readability, but the values inside the elements may consist of any valid Unicode code point. The sample data for these XML documents are:

Listing 4. Sample data for XML documents
	<book>
		<title> A Tale of Two Cities </title>
		<author>
			<firstname> Charles </firstname>
			<lastname> Dickens </lastname>
		</author>
		<publisher> Signet Classics </publisher>
		<date> 7/1/1997 </date>
		<keyword> London </keyword>
		<keyword> Sydney Carton </keyword>
		<keyword> Paris </keyword>
		<first_para> It was the best of times, 
                                it was the worst of times </first_para>
		<ISBN> 0451526562 </ISBN>
	</book>
Listing 5. Sample data for XML documents
	<book>
		<title> Romance of the Three Kingdoms </title>
		<native_title> 三國演義 </native_title>
		<author>
			<firstname> Lou </firstname>
			<lastname> Guanzhong </lastname>
			<native_name> 羅貫中 </native_name>
		</author>
		<publisher> Tuttle Publishing </publisher>
		<date> 8/1/2002 </date>
		<keyword> China </keyword>
		<keyword> Han Dynasty </keyword>
		<keyword> Empire </keyword>
		<keyword> History </keyword>
		<keyword> 周末 </keyword>
		<first_para> Here begins our tale: The empire, long divided, must unite;
                           long united, must divide </first_para>
		<native_first_para> 天下大势, 分久必合 ,合久必分 :周末七国分争
                    ,并入于秦 ;及秦灭之后, 汉分争, 又并入于汉 </native_first_para>
		<ISBN> 0804834679 </ISBN>
	</book>
Listing 6. Sample data for XML documents
	<book>
		<title> War and Peace </title>
		<native_title> Война и мир </native_title>
		<author>
			<firstname> Leo </firstname>
			<lastname> Tolstoy </lastname>
		</author>
		<publisher> Modern Library </publisher>
		<date> 7/9/2002 </date>
		<keyword> Russia </keyword>
		<keyword> History </keyword>
		<keyword> Paris </keyword>
		<keyword> Napoleon </keyword>
		<keyword> Война </keyword>
		<first_para> Well, Prince, Genoa and Lucca are now no more 
                than private estates of the Bonaparte family </first_para>
		<native_first_para> Eh bien, mon prince. Gênes et Lucques 
                ne sont plus que  des apanages, des поместья, de la 
                famille Buonaparte </native_first_para>
		<ISBN> 0375760644 </ISBN>
	</book>

Before you look at how this data can be safely stored in a non-Unicode database, you need to examine how XML data flows through DB2, from client to server, and from server to the pureXML feature for parsing and query execution.

Data flow for XML

The XML component of DB2 operates in a Unicode code page. Therefore, all XML parsing, serialization, query execution, and comparisons must take place in a Unicode code page. When the database is UTF-8, the data can be manipulated in the database code page. If the database is a non-Unicode code page (such as ISO-8859-1), all XML documents, XQUERY strings, and fragments must undergo a code page conversion from the database code page into Unicode (specifically, to UTF-8, code page 1208).

The SQL component of DB2 operates in the database code page. All SQL comparisons, SQL query execution, and SQL casting takes place in the database code page. If a query consists of some XML activities and some SQL activities, there are conversions between the two "worlds", the "SQL world" and the "XML world", as the data for each world must be acted upon in the correct code page.

An action in which data is converted from the database code page to UTF-8 will be referred to as converting to the "XML world", and an action in which data is converted from UTF-8 to the database code page as converting to the "SQL world." These are the only two code page conversion states that data goes through during SQL/XML or XQUERY execution on the server.

It is recommended for XML support to a non-Unicode database to use a client that is Unicode since the client must be able to render all characters in the input XML and result XML documents. If the client code page and database code page are the same, then there is no conversion between client and server, and the client would be able to render all characters in input XML and result XML documents.

The client can temporarily override its code page by use of the DB2CODEPAGE environment variable. Setting this variable to 1208 (UTF-8) allows DB2 to communicate with the client as if it was operating in code page 1208. Overriding the client code page in this way may result in incorrect output or unprintable characters (such as sending Kanji data from a 943 database to an ASCII client that has overridden its DB2CODEPAGE). In the case of Windows, the code page of the client environment can be overridden by performing desired changes to the Languages and Environment settings under the Windows Control Panel.

What undergoes conversion?

Figure 1. Example of conversion process
Example of conversion process

Whenever any request is sent from a DB2 client to the DB2 server, all character data (such as queries or CHAR host variables) is converted from the client code page to the database code page. This happens whether or not XML is involved in the request.

The XML portion of a query is converted to the XML world for parsing by the XML parser, and for output (serialization) of the XML value. The result is converted to the SQL world. Both of these conversions have the possibility of introducing substitution characters, which are replacement code points inserted into a character stream when the target code page cannot render the code point. Substitution characters introduce data integrity loss at the server. This article looks at ways of avoiding substitution characters.

The results of SQL queries and serialized XML are converted to the SQL world before being passed back to the client. Substitution characters can be introduced at this point as well, if the serialized XML contains characters not representable in the database code page, or not representable in the client code page. Ways to avoid both of these substitution character pitfalls are discussed below.

Insertion of XML documents

Now that you have gone over how data is converted to the correct code page, how does this affect how XML documents are inserted? To insert an XML document into the table, you can use a character string containing the XML data:

Listing 7. Character string containing XML data
INSERT INTO xmlbookcat VALUES 
(0451526562, XMLPARSE (DOCUMENT '<?xml version="1.0" encoding="ISO-8859-1" ?>
	<book>
		<title> A Tale of Two Cities </title>
		<author>
			<firstname> Charles </firstname>
			<lastname> Dickens </lastname>
		</author>
		<publisher> Signet Classics </publisher>
		<date> 7/1/1997 </date>
		<keyword> London </keyword>
		<keyword> Sydney Carton </keyword>
		<keyword> Paris </keyword>
		<first_para> It was the best of times, 
                                 it was the worst of times </first_para>
		<ISBN> 0451526562 </ISBN>
	</book>'));

The whole INSERT statement is converted from the client code page to the database code page, and then the XMLPARSE document string is converted to UTF-8 for parsing into XML. These conversions do not introduce substitution characters and are reversible, so this scenario is safe for inserting XML data. It is always safe to insert XML documents that are encoded in the database code page, and only include characters that can be represented in the database code page. This works even if the database is non-Unicode. However, what if you wanted to insert the Chinese XML document in the same environment?

Listing 8. Insert Chinese XML document
INSERT INTO xmlbookcat VALUES 
(0804834679, XMLPARSE (DOCUMENT '<?xml version="1.0" encoding="ISO-8859-1" ?>
	<book>
		<title> Romance of the Three Kingdoms </title>
		<native_title> 三國演義 </native_title>
		<author>
			<firstname> Lou </firstname>
			<lastname> Guanzhong </lastname>
			<native_name> 羅貫中 </native_name>
		</author>
		<publisher> Tuttle Publishing </publisher>
		<date> 8/1/2002 </date>
		<keyword> China </keyword>
		<keyword> Han Dynasty </keyword>
		<keyword> Empire </keyword>
		<keyword> History </keyword>
		<keyword> 周末 </keyword>
		<first_para> Here begins our tale: The empire, long divided, must unite;
                           long united, must divide </first_para>
		<native_first_para> 天下大势, 分久必合 ,合久必分 :周末七国分争
                    ,并入于秦 ;及秦灭之后, 汉分争, 又并入于汉 </native_first_para>
		<ISBN> 0804834679 </ISBN>
	</book>'));

Here there is a problem when the database server receives this SQL statement and converts it into the database code page, none of the Chinese characters can be represented in the database code page (which only consists of the ASCII characters in ISO-8859-1). At this point, DB2 replaces these missing characters with a substitution character. The code point for this substitution character differs depending on the target code page, but for code page 819 (ISO-8859-1) it is the hex character 0x1A. Displayed on a client, this is usually represented as a question mark ('?').

The presence of a substitution character means that data integrity has been lost. The original document's Chinese character is missing and cannot be retrieved by simply sending the data stream back to the client. A substitution character is always converted into another substitution character in any other code page.

In the above XML document, the XML parser receives the native title of the book as "????" (with each of the ? characters being 0x1A code points). This is a legal XML character stream and the parser does not throw an error, but the data inside the XML column is now corrupted. Selecting out this document results in substitution characters being passed back to the client.

Note that if the database is created in code page 943, which supports Shift-JIS encoding code points, the above INSERT statement is safe, and the document is inserted into the database. For the Russian example, the XML document is only safe to insert on a database that supports the ISO-8859-5 (Cyrillic) encoding.

If users frequently are using XML documents with extensive amounts of code points that cannot be represented in the database code page, the most complete solution is to use a Unicode database. Substitution characters are not introduced during parsing or serialization on the server. Note that returning data to a non-Unicode client may still introduce substitution characters. However, the XML documents in the database are not affected.

Use the ENABLE_XMLCHAR parameter

In the Viper 2 release, there is a new DBA configuration parameter that prevents any parsing of XML from a CHAR-type host variable or character string. This parameter is database-scoped and is called ENABLE_XMLCHAR. By default, all databases are created with ENABLE_XMLCHAR set to TRUE, but it can be set to FALSE to cause an error to be thrown (-20429) if a user executes a query that performs an implicit or explicit XMLPARSE operation on a host variable of type CHAR, VARCHAR, or CLOB that is not a BIT DATA type. CHAR FOR BIT DATA, VARCHAR FOR BIT DATA, and BLOB are accepted in this scenario. The command to change the value of this database parameter is:

db2 update db cfg for database name using ENABLE_XMLCHAR FALSE

The DBA then needs to recycle the instance for this to take effect. For example, if a user performs:

INSERT INTO XTAB VALUES (XMLPARSE(DOCUMENT 'I just 
    joined the ΔΨΠ fraternity!' PRESERVE WHITESPACE));

on an ISO-8859-1 code set database, the resulting XML document inserted into XTAB contains three substitution characters in place of the Greek letters. If ENABLE_XMLCHAR was set to FALSE, this INSERT statement would result in an error:

SQL20429N  The XML operation "XMLPARSE" is not allowed on strings 
      that are not FOR BIT DATA on this database.

This error tells the user that XMLPARSE is not allowed for CHAR types. In order to properly insert this document into the database, the user would need to use an external API to bind the string "I just joined the ΔΨΠ fraternity!" to a BLOB, CHAR FOR BIT DATA, or VARCHAR FOR BIT DATA host variable. This inserts the text into the XML column as is, with the code points identical to the client code page.

Insertion of unsafe XML documents

The only safe way to insert documents that contain code points that cannot be represented in the database code page is to use a BLOB or BIT DATA host variable. This cannot be done from the CLP.

Binding XML into a BLOB host variable can help to avoid substitution characters as DB2 does not perform code page conversion on BLOB data. In order to determine the code page of data inside the BLOB, there are certain rules.

If the BLOB XML stream contains a byte-order mark (BOM), it is used to determine the assumed code page of the data inside the BLOB. This is known as "internal encoding" because there is no "external encoding" on the host variable (meaning that there is no code page associated with the host variable).

If the BLOB XML stream does not contain a BOM, the stream is parsed to find the XML attribute encoding in the form "<?xml version="1.0" encoding = "...encoding string">" in the header of the XML data. This encoding, if it exists, is mapped to a CCSID and the CCSID is used as the internal encoding. For a list of valid encoding attributes, see the Resources section.

If there is no BOM and there is no given encoding attribute, the BLOB data is assumed to consist of UTF-8 code points. Note that no checking is done before XML parsing to ensure that the BLOB contains only valid UTF-8 code points. If a BLOB containing the lone code point 0xDB is sent in to be parsed as XML, this results in an error since this is not a legal UTF-8 code point. There are further restrictions on the valid range of UTF-8 code points that XML can encompass. For more information on this, see the Resources section.

UDFs to 'clean' XML documents

The last alternative is to convert the problem characters into hexadecimal character references (of the form "&#xhhhh;", where hhhh is the hexadecimal Unicode UTF16 code point of the character). Decimal character references can be used in any XML fragment and is replaced by the actual code point during XML parsing. The character string "I just joined the &#x394;&#x3A8;&#x380; fraternity!" is equivalent to "I just joined the ΔΨΠ fraternity!" in UTF-8.

To assist the user in converting XML documents, DB2 provides two UDFs, that test and clean up XML documents before they are inserted into the database.

The first UDF is called TEST_XML and takes in a BLOB (the preferred method is to use a BLOB_FILE referencing an XML text file encoded in UTF-8 code set) containing the XML document, and outputs a Boolean value. When TEST_XML is called, DB2 attempts to convert the BLOB from UTF-8 into the database code page, and return TRUE if no substitution characters were encountered, or FALSE if some substitution character was encountered during the conversion. This does not insert the document or modify the BLOB in any way. It is simply a test to see whether this XML document can be safely inserted as a CHAR, VARCHAR, or CLOB into the database without resulting in data integrity loss.

If the BLOB is encoded in another code page other than UTF-8, users can specify the code page of the BLOB input as an optional second parameter to TEST_XML.

The second UDF performs conversion with substitution escape character replacement. It is called CLEAN_XML and takes in a BLOB or BLOB file containing an XML document (assumed to be in UTF-8 code page), and outputs a CLOB containing the XML document in the database code page, with each code point that cannot be converted safely into the database code page replaced with the escape character form "&#xhhhh", where hhhh is the hexadecimal UTF16 code point of the character. This does not insert the document or modify the BLOB in any way, but it can be used in conjunction with INSERT/XMLPARSE to safely insert any given BLOB XML document into any database, for example:

INSERT INTO XTAB VALUES (XMLPARSE(CLEAN_XML(:BLOB_HV)));

If the BLOB is encoded in another code page other than UTF-8, users can specify the code page of the BLOB input as an optional second parameter to CLEAN_XML.

Example:

Assume the following XML document is the sample being provided as input BLOB to a database with the code page ISO-8859-7.(Note symbols "é" and "í" used in Sterling below do not belong to ISO-8859-7).

Sample input:

<?xml version="1.0" encoding="utf-8" ?>
      <a> Stérlíng
      </a>

Running TEST_XML on the above input would result in an output value of 1, since the symbols "é" and "í" used in Sterling above do not belong to ISO-8859-7 and therefore need to be substituted.

Similarly, running CLEAN_XML would result in Stérlíng being converted to St&#xE9;rl&#xED;ng , the word é having been replaced by its hexadecimal value &#xE9; and í replaced with its hexadecimal equivalent &#xED;

Sample output:

<?xml version="1.0" encoding = "utf-8" ?>
        <a>
 St&#xE9;rl&#xED;ng 
        </a>

This XML document can now be inserted into a table with an XML column irrespective of the database code page. One caveat of these UDFs is that they do not parse the BLOB as an XML document. There is no check for well-formedness, and there is the possibility that element tag names or other non-text portions of XML documents will be changed into hexadecimal character references. If your XML documents contain code points that cannot be rendered in the database code page within element tag names or the XML header information, the resulting document from CLEAN_XML may no longer be a well-formed XML document.

Query of XML data in non-Unicode databases

XML serialization is the process of binding out the Unicode XML values into host variables, for eventual transportation back to the client. There is a code page conversion to the SQL world if the value is serialized into a CHAR, VARCHAR, or CLOB type that is not FOR BIT DATA. BLOB serializations do not undergo any code page conversion, but they do get assigned an "encoding" attribute in the serialized header, which defaults to an encoding of UTF-8.

Substitution characters may be introduced during serialization if code points cannot be rendered in the database code page. If these characters are introduced, the user gets an error -20412, which corresponds to the XML white paper rule that serialization of an XML value must not inject a substitution character.

Now take a look at how SELECT works in this scenario.

Syntax :

	SELECT XMLSERIALIZE(<XML column name> AS <datatype>) FROM <table_name>;

Take a simple example, fetching the result of the INSERT: SELECT XMLSERIALIZE(book AS CLOB(1M)) FROM xmlbookcat; or XQUERY db2-fn:xmlcolumn('XMLBOOKCAT.BOOK')/book;.

Note that XQUERY, by default, serializes its result set into a CLOB. Selecting an XML column from a table also serializes its result into a CLOB.

CLOB data types have an inherent code page equal to the database code page, so there is an implicit code page conversion of the reply to the SQL world during the serialization. The result is then sent back to the Unicode client from the database, and the reply gets converted accordingly. You end up with the same data in the client code page for ASCII data, with no data loss.

For the Chinese and Russian characters, though, the XML Serializer replaces them with substitution characters for the output. The client throws a -20412 error if it finds that XML data being serialized contains substitution characters. Even if the Chinese or Russian documents were inserted as BLOBs, they cannot be serialized as CLOBs or CHAR-types without giving this error.

The only safe way to select this data is to bind it into a BLOB host variable, or a BLOB FILE host variable, to write it into an XML file. If it is possible to return the data to the client without serialization, by binding directly to an XML output host variable, this is the safest way to avoid any kind of substitution character.

Use character references

Clients can embed character references into the XML stream manually, to pass unique code points into the database without binding the XML data to a BLOB. For example, with the Greek Delta character ("Δ", Unicode code point U+0394, UTF-8 code point 0xCE94) :

	INSERT INTO XMLBOOKCAT VALUES (1, xmlparse(document 
    '<?xml version="1.0"?><book>H&#x0394;llo</book>'));

This statement inserts the delta character into the XML document on any database.

	SELECT XMLSERIALIZE(XMLCOL AS BLOB(250)) FROM xmlbookcat;

Gives the following output:

	x'3C626F6F6B3E48CE946C6C6F3C2F626F6F6B3E'

These are the UTF-8 code points for "<book>HΔllo</book>." Note that the Delta character was passed in through the hex character reference (hex 394), and is stored in the ISO8859-1 database correctly. Serializing the Delta character gives the UTF-8 representation of that character (hex CE94).

Substitution characters

Insert Kanji characters to codeset 8859-1

INSERT INTO xmlbookcat VALUES( 11, 
    XMLPARSE(document'<?xml version="1.0" encoding="utf-8" ?> 
       <book><pr番duct> </pr番duct></book>'));

The Kanji character "番" present in product is not a part of database code page 8859-1. The character is replaced by a substitution character before parsing takes place, and results in the element tag name throwing a SQL error code "16110 XML Syntax error. Expected to find Attribute name."

XQUERY on Kanji characters in codeset 8859-1

XQUERY (番);

The above statement results in generation of error -16002N "An XQuery expression has an unexpected token 0x1A following XQUERY."

Select a Kanji XML element

SELECT XMLELEMENT(NAME "pr番duct") FROM xmlbookcat;

The above statement results in error code "SQL 20275N The XML name "pr duct" is not valid" where the Kanji symbol 番 has been replaced by a blank space.

XQUERY examples of mixed code page situations

	XQUERY for $x in db2-fn:xmlcolumn('XMLBOOKCAT.BOOK')/book[keyword = 'Paris']	
		return $x/author;

This example scans the book documents and returns the authors of books that have a keyword of "Paris." It would return the elements for "Leo Tolstoy" and "Charles Dickens." Note that even though the Russian book contained Cyrillic Unicode characters, this XQUERY is valid even on an ISO-8859-1 database because it serializes only those elements that match the result predicate.

	XQUERY for $x in db2-fn:xmlcolumn('XMLBOOKCAT.BOOK')/book[native_title = '三國演義']
		return $x/date;

This example is legal if the database supports code set Shift_JIS code points that are used in the title predicate, but fails to execute correctly on other databases. Even though only the date is returned as a result set, the query itself is converted into the database code page, so the title predicate here must be safely transcoded. For databases that do not support these code points, this query can be run against the XML documents in this fashion, by replacing the code points with the equivalent character references for each symbol:

	XQUERY for $x in db2-fn:xmlcolumn('XMLBOOKCAT.BOOK')/book
          [native_title='&#x4E09;&#x570B;&#x6F14;&#x7FA9;']
		return $x/date;

Although unwieldy, this does correctly match the predicate on the server and return the expected value. Such character references can be used in any character string passed to DB2 to replace code points that cannot be represented in the database code page.

	XQUERY for $x in db2-fn:xmlcolumn('XMLBOOKCAT.BOOK')/book
		return $x/native_first_para;

This query gives a -20412 substitution character error on the client if run against any database except a UTF-8 code set one, as it returns Russian, Chinese, and English code points in the same result set. There is no non-Unicode code page that contains all of these values. The client can also bind out the result set to a BLOB, which allows the bytes to be passed safely back to the client without transcoding.

	XQUERY for $x in db2-fn:xmlcolumn('XMLBOOKCAT.BOOK')/book
		return $x/first_para;

This query returns the English version of the first paragraph of each book, even if that book has non-ASCII characters in the "native" first paragraph. It is safe to run on any database.

Now examine some more ramifications of the code page conversion process. The SQL parser and the XML parser interact within the DB2 compiler, but they expect their input in different code pages. But what happens in the case of XQUERY and SQL together in the same request?

	XQUERY db2-fn:sqlquery("SELECT BOOK FROM XMLBOOKCAT")//book[date < '01/01/1999'];

Here you have an XQUERY with an embedded SQL request. First, the entire XQUERY (with the SQL request) is converted to UTF-8. The SQL portion of the query ("SELECT BOOK FROM XMLBOOKCAT") is converted to the SQL world. The resulting final query is a hybrid, partially in one code page, and partially in another.

The Listing 9 is attempting to compare the Chinese Lou character ("羅") in the ISO-8859-1 database.

Listing 9. Comparing the Chinese Lou character
SELECT XMLCAST(
XMLQUERY('$doc/book/author[native_name = "羅"]' 
PASSING xmlbookcat.book AS "doc") AS CLOB)
 	FROM xmlproductnuc;

This results in error code 16002N, with substitution 0x1A and an error with SQL code 16002N, as seen in listing 10.

Listing 10. SQL code 16002N
SQL16002N  An XQuery expression has an unexpected token "0x1A" following " = ". 
Expected tokens may include: "#x9|#xA|#xD|[#x20-#xD7FF]|[#xE000-#xFFFD]|
[#x10000-#x10FFFF]". Error QName=err:XPST0003.  SQLSTATE=10505

In order to compare this value exactly to the XML data, you make use of character references (such as &#x7F85).

For more complex scenarios such as XMLTABLE:

Listing 11. XMLTABLE comparison
SELECT x.* from XMLBOOKCAT xmlb, 
      XMLTABLE('$p/book' PASSING xmlb.book AS "p" COLUMNS 
"Title" 		VARCHAR(50)    PATH '//title',
"Publisher"		VARCHAR(32)    PATH '//publisher',
"Author_First"	VARCHAR(32)    PATH '//author/firstname',
"Author_Last"	        VARCHAR(32)    PATH '//author/lastname') AS X;

The XML document is shredded, and each bound-out column (title, publisher, firstname, lastname) is converted to the database code page. However, since all of these elements contain ASCII characters, this is safe. The following XMLTABLE call is not safe:

Listing 12. Unsafe XMLTABLE call
SELECT x.* from XMLBOOKCAT xmlb, 
      XMLTABLE('$p/book' PASSING xmlb.book AS "p" COLUMNS 
"Title" 		VARCHAR(50)    PATH '//native_title',
"Publisher"		VARCHAR(32)    PATH '//publisher',
"Author"	      VARCHAR(32)    PATH '//author/native_name') AS X;

This results in substitution characters in a non-Unicode database, since it attempts to convert Unicode characters inside "native_name" and "native_title" elements into the database code page for storage within a VARCHAR column in a relational table. This is not recommended for mixed-code page XML documents in a non-Unicode database.

Summary

If you want to add XML to a non-Unicode database, and there are many characters in the XML document that cannot be rendered in the database code page, it is recommended to migrate the database to Unicode and insert the XML data in that form.

If the XML documents contain only characters that can be rendered in the database code page, users can insert and query the XML to the database without restriction or substitution characters, but be careful not to insert new XML data that contains characters outside the database code page.

If there are a few characters in the documents that lie outside the range of the database code page, this can be checked using the UDF TEST_XML, which returns TRUE if the XML document can be safely inserted into the database, and FALSE if it requires cleaning by passing it to the CLEAN_XML UDF prior to insertion into a non-Unicode database.

DBAs can restrict users from inserting XML documents through character-types by using the ENABLE_XMLCHAR database configuration parameter.

The safest way to fetch XML from a database is into an XML native type host variable. BLOBs are also a safe way to fetch XML data without code page conversion.

When issuing queries against XML data that contains characters outside the database code page, attempt to use decimal or hexadecimal character references.

Resources

Learn

Get products and technologies

  • Download the Viper 2 code and participate in the open beta.
  • Download IBM product evaluation versions and get your hands on application development tools and middleware products from DB2, Lotus®, Rational®, Tivoli®, and WebSphere®.

Discuss

Comments

developerWorks: Sign in

Required fields are indicated with an asterisk (*).


Need an IBM ID?
Forgot your IBM ID?


Forgot your password?
Change your password

By clicking Submit, you agree to the developerWorks terms of use.

 


The first time you sign into developerWorks, a profile is created for you. Information in your profile (your name, country/region, and company name) is displayed to the public and will accompany any content you post, unless you opt to hide your company name. You may update your IBM account at any time.

All information submitted is secure.

Choose your display name



The first time you sign in to developerWorks, a profile is created for you, so you need to choose a display name. Your display name accompanies the content you post on developerWorks.

Please choose a display name between 3-31 characters. Your display name must be unique in the developerWorks community and should not be your email address for privacy reasons.

Required fields are indicated with an asterisk (*).

(Must be between 3 – 31 characters.)

By clicking Submit, you agree to the developerWorks terms of use.

 


All information submitted is secure.

Dig deeper into Information management on developerWorks


static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=Information Management, XML
ArticleID=240965
ArticleTitle=XML in a non-Unicode database
publish-date=07192007