The pureXML™ technology in DB2® 9 is designed to provide the highest level of performance for XML data management. This article compares its performance with that of character large object (CLOB) and shredded XML storage. Many database systems allow you to store XML data as CLOBs or "shred" the data into relational tables. These two options are also supported in DB2® V8 through the XML Extender, which is still available unchanged in DB2 9 for backwards compatibility. However, they are superseded by the pureXML features.
The DB2 XML Extender consists of a set of stored procedures, user-defined functions (UDFs), and user-defined data types (UDTs) which add XML capabilities on top of the core DB2 engine. Equipped with an XML parser and XML-specific logic in its procedures and UDFs, the XML Extender performs XML storage and retrieval (supported by traditional DB2 engine features). You can use the XML Extender with either an XML Extender Column or an XML Extender Collection.
An XML Extender Column allows you to store XML documents intact as plain unparsed text. This process is quite simple, but ignores the internal structure of the XML documents. You can choose to make the underlying storage a CLOB column, a VARCHAR column, or files within the file system. In a VARCHAR column, the XML Extender only stores documents up to 3KB in size -- a limit which is hard to guarantee in many applications. Knowledgeable database administrators (DBAs) can change the limit to 32k, but often even this maximum cannot be guaranteed. External file storage is more flexible but does not benefit from database-managed persistency and integrity. This is why CLOBs, which can store documents up to 2GB in size, have emerged as the most common choice for XML Extender Columns. Later, this article examines the performance of XML Extender CLOB columns.
An XML Extender Collection allows you to convert XML data to a relational format. This requires a fixed mapping from the expected XML structure to a collection of relational tables in your database schema. Based on that mapping, a stored procedure extracts atomic data values from XML documents and inserts them into traditional relational rows and columns. This process is known as "shredding" or decomposition. It involves XML parsing and translates a single logical XML document insert into a series of SQL row inserts. In real-world applications, it can easily take several dozen relational tables to represent all one-to-many relationships in the original XML structure. Thus, the mapping becomes complex very quickly, and XML insert performance suffers correspondingly. Once the data is available in relational format, plain SQL can be used for data access and manipulation. However, reconstruction of the original XML documents is also expensive. It requires multi-way joins and the generation of proper XML tagging. The tagging can be defined by the standardized SQL/XML publishing functions, to reconstruct the original or new and different documents. However, the XML Extender Collection won't preserve any digital signatures associated with the original XML document.
Making XML data available in relational format continues to be an important requirement. The most common reason is the need to feed legacy SQL applications, packaged business applications and business intelligence (BI) tools that consume only relational data. Therefore DB2 9 offers a new "decomp" solution, also known as "annotated schema shred" or "new shred", which can be 7 to 8 times faster than XML Extender Collection shredding. Later in this article, the performance of this new high-speed shredder is compared to the IBM pureXML™ support in IBM DB2 9.
The new pureXML technology in DB2 9 is significantly different from CLOB or shredded XML storage. It does not store documents as plain text and does not map XML to relational or object-relational tables. Instead, it stores XML in its inherent hierarchical format, which matches the XML data model. Any XML document is a well-defined tree of elements and attributes, and XML queries are expressed in terms of tree traversal. Therefore, it seems intuitive that a corresponding hierarchical storage and processing format allows for efficient XML data management. To get to the bottom of this claim, the performance of pureXML in DB2 9 is compared with CLOB-based and shredded XML processing.
Table 1 summarizes the comparisons made in this article. Basically, key XML operations on CLOB and shredded storage are compared to corresponding pureXML operations.
Table 1: Comparing CLOB to shredded XML processing and pureXML
|XML in CLOBs||DB2 9 pureXML|
Insert XML into an XML Extender CLOB column
Insert XML into an XML column
Full document retrieval from CLOBs
Full document retrieval from an XML column
Query XML in CLOBs using XML Extender "extract" functions (XML parsing at query time)
XQuery over an XML column
|XML shredded to relational tables||DB2 9 pureXML|
Shred XML to relational tables, using the DB2 9 new shred
Insert XML into an XML column
SQL/XML publishing to construct XML documents from relational
XQuery over an XML column
All tests were performed with the following data and settings:
- A 4-CPU pSeries system with IBM® AIX® 5.2 (64bit) and a single DB2 9 instance
- Between 1,000 and 100,000 CustAcc documents (4kb to 20kb in size) were used, taken from the financial scenario in the "DB2 9 XML performance characteristics" article
- Database managed (DMS) table spaces with a page size of 32kb
- All table spaces were defined with the
no file system cachingoption, except where otherwise noted (for some of the CLOB storage tests)
- All tables spaces were distributed over 10 physical disks, with the database log on a separate striped volume
- Identical database configuration and tuning was used in all tests to ensure fair and valid comparisons
Comparing CLOB to pureXML columns
This comparison is interesting because for a certain class of today's XML applications, CLOB columns are a commonly choice for XML storage. Before DB2 9, there wasn't a better alternative. A fundamental difference between CLOB storage and pureXML processing lies in XML parsing and its significant impact on insert and query performance.
If XML documents are inserted into CLOB columns, they are inserted as unparsed text objects. Avoiding XML parsing at insert time can be a performance benefit, especially in CPU-bound systems. However, without XML parsing, the structure of the XML documents is entirely ignored. This precludes the database from performing intelligent and efficient search and extract operations on the stored text objects. The only remedy is to invoke the XML parser at query execution time to "look into" the XML documents such that search conditions can be evaluated. The significant CPU consumption of XML parsing often leads to low search and extract performance. Only a blind full document retrieval, which again ignores the internal XML structure, can quickly read XML documents from CLOB columns.
The pureXML technology in DB2 9 parses XML documents at insert time and never at query time. The XML documents are stored and queried in a parsed format, denoted in DB2 9 by the use of a new data type, "XML". This parsed format is a tree structure of nodes, which is different from the textual representation of XML documents. Search and extract operations can be performed without XML parsing, a significant performance benefit, because the XML parsing overhead is incurred at insert time instead. Similarly, document retrieval from an XML column requires serialization, that is, converting the parsed XML format back into its original textual representation. This overhead does not exist when reading full XML documents from a CLOB where the XML is already stored in text form.
In summary, CLOB storage offers good performance for insert and full document retrieval, typically at the cost of poor search and extract performance. The XML data type in DB2 9 sacrifices some of the insert and retrieval performance for much higher search and extract performance. This is a reasonable trade-off because business data is more often searched and analyzed than inserted. Typically, there is one insert and multiple searches. Additionally, the potential overhead of XML columns is often outweighed by the fact that XML columns are buffered in the buffer pool while CLOB columns are not.
The next section reviews performance measurements to quantify these trade-offs.
Comparing CLOB inserts to XML inserts
In the first test, we sequentially inserted 100,000 documents with and without indexes, committing after every document as many OLTP applications would. The result in Figure 1 shows the relative elapsed time between XML and CLOB column inserts (lower is better). Setting the elapsed time for XML inserts as the comparison baseline (100%), maintaining six XML indexes incurs only little overhead (5% in our scenario).
Figure 1: Comparing the single-user insert performance of XML to CLOB columns
Inserting the same XML data into a CLOB column takes approximately half the time (53%). This is because XML parsing and deeper processing of the XML data is avoided. Loosely speaking, XML Extender "side tables" are for CLOB columns what real XML indexes are for XML columns. Maintaining side tables at insert time requires XML parsing and additional relational inserts because selected XML element and attribute values are extracted and stored separately. Consequently, the insert elapsed time is at least as high as for pureXML inserts, in our scenario even 23% higher.
The experiments in Figure 1 used the
no file system caching option for the DB2 table
space. If you allow file system caching for the DMS table space containers
(Figure 2), the XML insert performance improves
slightly while the CLOB inserts suffer. Since CLOB inserts use direct
writes, the file system cache is not needed and is pure overhead. However,
file system caching can help read operations on CLOB columns if no XML
parsing is involved.
Figure 2: The impact of file system caching on XML and CLOB column inserts (no indexes or side tables)
Single-user database applications are very rare, so it is important to look at the performance behavior of a concurrent workload. The insert test was modified to use 10 concurrent threads with 10 separate connections to the database, each inserting 10,000 documents without think time. Figure 3 shows that the increased workload intensity hurts CLOB performance drastically. While CLOB inserts were about twice as fast as XML inserts in our single-user tests (Figure 1), they are more than twice as slow in our multi-user test (207%). This is because CLOB inserts do not benefit as much from parallelism as XML column inserts. With increasing degrees of concurrency, buffered inserts for XML columns are far more scalable than concurrent direct writes for CLOBs.
Figure 3: Multi-user insert performance of XML compared to CLOB columns
Figure 4 clarifies that the parallelism speed-up is much higher for XML columns than for CLOB columns. Taking single-user XML inserts as the 100% baseline, our system and configuration allowed 10 concurrent insert streams to insert the same 100,000 documents in 18% of the time, which is more than five times faster. CLOB inserts didn't benefit nearly as much from the parallelism and still took 37% of the baseline, which is only 1.4 times faster than non-parallel CLOB inserts.
Figure 4: Parallelism speed-up for XML and CLOB column inserts
All of these insert tests were also conducted with 1000, 5000, 10,000, and 50,000 documents. As expected, the elapsed insert time for both CLOB and XML columns is approximately linear in the number of documents inserted. Therefore, the corresponding charts are omitted for brevity.
XML queries on CLOB compared to XML columns
To assess the difference in query performance between XML and CLOB columns, five queries were designed to cover the following common search and retrieval cases:
- Full document retrieval of all documents, no predicate
- Full document retrieval of one document matching certain criteria (one predicate)
- Full document retrieval of documents matching certain criteria (multiple predicates)
- Partial retrieval of all documents
- Partial retrieval of all documents matching certain criteria
These operations were implemented in the five queries as follows:
Q1 (Select*): Selects all XML documents (select * from
Q2 (1Pred1Doc): Returns the Customer document for a given account number
Q3 (5PredSome): Returns Customer documents for all female customers with a primary address in California who have an account in USD and have not yet reached Premium customer status
Q4 (PartialAll): For each Customer, returns the name, and the sum of the balances of all his/her accounts
Q5 (PartialSome): Gets the primary email address for all customers who are holding IBM stock in any of their accounts.
For CLOBs, these queries were expressed in SQL with XML Extender extract functions. For XML columns, they were expressed in XQuery notation. Whether the XQueries were embedded in SQL or executed stand-alone did not make a performance difference in our tests. All queries and some sample data are available in pureXMLvsCLOBvsShredded.zip in the Downloads section.
Figure 5 shows the query performance (elapsed time) of all five test queries for both pureXML and CLOBs. You can see that pureXML queries can easily be 20, 30 or 40 times faster than queries over XML in CLOB columns. These results are for our default setup with no file system caching for table spaces. Figure 6 contains the same results with file system caching. File system caching has a significant impact only on query Q1 which retrieves all documents without any predicate evaluation. Without file system caching, Q1 performs similarly on XML and CLOB columns, with a slight disadvantage (10%) for the CLOB column which is not buffered in the buffer pool. File system caching significantly improves CLOB retrieval performance (see Figure 6) such that query Q1 can be more than twice as fast as on the XML column. This is because reading from the XML column requires serialization, or the conversion of parsed XML back to text format. Without file system caching, this overhead is mitigated by the fact that XML columns are buffered in the DB2 buffer pool and CLOB columns are not.
Figure 5: Query performance, no indexes, no file system caching
For queries 2 through 5, file system caching does not play a significant role. These queries always require sub-document level access to evaluate predicates and to extract document fragments. This is where the real benefit of pureXML kicks in: XML is stored in a parsed format, so no parsing is required at query execution time. In our tests, this yields a performance speed-up between 7 times and 44 times.
Figure 6: Query performance, no indexes, with file system caching
It is important to note that the difference in query response time between XML columns and CLOB columns increases drastically with the amount of data that needs to parsed (for CLOBs) or traversed (for pureXML). Figure 7 shows the query response times as a function of the number of documents in the table, ranging from 1,000 to 100,000 documents (when no indexes or side tables are used).
Figure 7: Query 2 performance as a function of the data volume
The XML Extender offers the concept of side tables to speed up the search for XML documents in order to avoid XML parsing for predicate evaluation. At insert time, specific elements and attributes are extracted into relational tables. You already know that this adds significant overhead to CLOB inserts, but the side tables can be efficiently searched and joined with the main table containing the CLOBs. Three of our five test queries (q2, q3, and q5) contain filtering predicates that can benefit from side table look-ups. Side tables can avoid much of the XML parsing for CLOBs and can often make CLOB queries 100 times faster or more.
In Figure 8, let's compare this to pureXML with real XML indexes that provide similar benefits. All of the six bars in Figure 8 represent elapsed times of about one second or less. And yet pureXML with indexes is 6 to 35 times faster than CLOB columns with side tables. There are multiple reasons for this. A pureXML index points directly to the row with the corresponding document. With side tables, DB2 first performs an index lookup on the side table and then joins the corresponding row with the main table that contains the CLOBs. Query Q3 (5PredSome) has multiple predicates and uses three side tables plus the main table, so it needs to compute a four-way join. Query Q5 uses side tables for the predicates, but requires an extract function (with XML parsing) to retrieve the customers email addresses.
Figure 8: Predicate evaluation with indexes (pureXML) and side tables (CLOBs)
Figure 8 clarifies that although side tables can reduce the number of CLOBs that must be parsed, XML indexes in DB2 9 similarly reduce the number of documents that must be traversed for predicate evaluation. Thus, the use of indexes and side tables typically reduces the absolute elapsed times for both XML and CLOBs, but does not erase the large relative performance advantage of XML over CLOB columns.
Multi-user query tests were executed with 10 users on both CLOB and XML columns. The results are omitted because the relative performance difference between CLOB and pureXML storage was similar to the single-user tests shown above.
Shredded storage compared to XML columns
Shredding XML data to relational tables is still a common requirement. A typical reason is that existing applications may not yet understand XML and require a relational format. This includes legacy SQL applications and packaged business applications as well as BI and reporting tools. The annotated XML Schema decomposition in DB2 9 ("new shred") resolves the functional and performance limitations of the older XML Extender shredding. The following section examines how the new DB2 9 shredding performance stacks up against the pureXML technology in DB2 9.
For these tests, the same Customer data was used as before. Due to various repeating elements, a total of 87 columns in 12 tables were required to represent the XML data in a traditional relational schema. The schema is shown in Figure 9, indicating the number of columns and the number of rows in each table after shredding 100,000 of our customer documents. In total, the tables contain more than 3.5 million relational rows to represent the 100,000 XML documents. The arrows in Figure 9 indicate one-to-many relationships. Indexes are defined on all primary and foreign key columns to support efficient joins between the 12 tables.
Figure 9: Relational schema to hold shredded XML data
Comparing shredding to relational tables with pureXML insert
Shredding the 100,000 documents in this relational schema with the new annotated schema shred in DB2 9 is about 1.75 times slower than inserting the same XML documents into an XML column (see Figure 10). This is with commits after every document, as is typical of OLTP applications.
Figure 10: pureXML inserts compared to DB2 9 shredding, commit after every document
If you commit less frequently, you can increase the performance advantage of an XML column over shredding. Committing after every other document makes XML inserts 2.56 times faster than shredding (in our scenario and configuration). For bulk insert or import operations, you would only commit every 50 or 100 documents and find XML column inserts 4 to 5 times faster than shredding (see Figure 11). XML column inserts benefit from larger commit intervals more than shredding does, with a larger number of log pages per log I/O request.
Figure 11: pureXML insert compared to DB2 9 shredding, varying commit interval
For a commit interval of 50, Figure 12 compares the new annotated schema shred in DB2 9 to the XML Extender shred in V8 and pureXML inserts. In our tests, the new shred is 7 times faster than the XML Extender shred, and pureXML inserts can be even 30 times faster than V8 shredding.
Figure 12: DB2 V8 XML Extender shredding compared to DB2 9 technology (commit interval = 50)
Comparing XML queries on shredded data to pureXML columns
Assume that you need to serve an XML-oriented application or web service that requires query results in XML format. If the XML data is shredded to relational tables, relational query results must be converted back to XML. For this purpose, SQL/XML publishing functions can be used in the SELECT clause of SQL queries to construct the required XML tagging for the result set.
Five SQL queries were defined over the relational schema in Figure 9 that are logically equivalent to our five XQueries over XML columns. These SQL queries use traditional relational predicates, join some or all of the tables, and use SQL/XML publishing functions to return the same XML results as the XQueries. The relative performance results are shown in Figure 13.
Query Q2 (1Pred1Doc) returns only one document, based on an account number lookup. In both cases (pureXML and publishing), there are indexes on the account number to improve the performance of this query. Although the SQL/XML query is quite fast, it's still 75 times slower than an XQuery on an XML column. This is because an XQuery only needs to locate the document through the index and serialize it, while SQL/XML publishing must join all 12 relational tables and construct the XML document.
Queries Q1 (Select*) and Q3 (5PredSome) return many documents, so the cost for SQL/XML document construction from relational data is multiplied. Query Q4 (PartialAll) reads a few values from each XML document and constructs a whole new document out of them. With XQuery over pureXML storage the values are read from the XML column and the construction is expressed in XQuery. With SQL/XML publishing over shredded data, the values are read from relational columns, and the construction is expressed in SQL/XML. In both cases, the construction is the bottleneck. This is why both versions of Q4 perform approximately identically.
Figure 13: pureXML XQuery compared to SQL/XML publishing over shredded data
Q5 (PartialSome) uses multiple predicates to return very few XML documents with only one element. In this case, SQL/XML is not very expensive because it constructs just one element per result row and joins only 3 of the 12 tables. Since searching over relational data can be somewhat faster than searching over XML data, you see better performance for this query over shredded data (20 times faster than XQuery in our setup).
Summary and conclusions
While a single-user workload may find CLOB inserts faster than pureXML inserts, the intensity of concurrent inserts in typical business scenarios can make CLOB inserts 2 or 2.5 times slower than inserts into an XML column. If inserts are committed after every document, pureXML inserts are about 60% to 70% faster than the new shredding solution in DB2 9. For bulk inserts or import with less frequent commits, pureXML can ingest XML data even 4 to 5 times faster than shredding. These tests used the new shredding solution in DB2 9, which is 7 to 8 times faster than XML Extender shredding in V8.
XQuery over XML data in XML type columns can be 40 times faster than corresponding queries over CLOB columns which require XML parsing at query time. The absolute performance difference between "pureXML queries" and "CLOB queries" increases (linearly) with the amount of data that is queried.
The relative performance difference between pureXML XQueries and SQL/XML Publishing over shredded data depends heavily on the amount of relational data that needs to be tagged as XML, but also on the number of joins required. XQuery clearly outperforms publishing queries when complex data must be retrieved. In some of our tests, retrieving XML data from DB2 pureXML storage can be 50 to 100 times faster than constructing XML data from relational tables. However, simple search queries without complex joins that return results with little or no XML tags can be much faster in SQL. Still, an application always needs to weigh query performance against insert performance and evaluate the mix of both. XML insert performance is summarized in Figure 14.
Figure 14: Summary of XML insert performance (commitcount =1)
Note that all performance results in this article were obtained in an isolated lab environment with specific hardware, operating, and database configurations. Different tests in different environments may show bigger or smaller performance differences than the results described in this article.
Although all of our measurements used DB2 only, many of the performance issues with CLOB and shredded XML storage are inherent in the general concepts of storing XML as text or converting it to the relational data model, respectively. Therefore, you may see similar performance characteristics in other DBMS environments that support these concepts. Readers interested in such matters are encouraged to explore such issues in their own environments, perhaps using the materials in this article as a basis for developing their own tests.
For their reviews of this paper, we'd like to thank Henrik Loeser, Cindy Saracco and Nikolaj Richers.
|Sample scripts for this article||pureXMLvsCLOBvsShredded.zip||14KB|
- "What's new in DB2 Viper - XML to the core" (developerWorks, February 2006), Cynthia M. Saracco.
- "DB2 9 XML Performance Characteristics" (developerWorks, June 2006), Irina Kogan, Matthias Nicola, and Berni Schiefer.
- "15 Best Practices for pureXML Performance in DB2 9" (developerWorks, October 2006), Matthias Nicola.
- "pureXML in DB2 9: Which way to query your XML data?" (developerWorks, June 2006), Matthias Nicola and Fatma Ozcan.
- "An introduction to the SQL/XML publishing functions" (developerWorks, November 2005), Roman Melnyk.
- "From DAD to annotated XML schema decomposition" (developerWorks, April 2006), Mayank Pradhan.
- DB2 XML Extender, WORF, and WebSphere Application Developer (developerWorks, September 2004), Owen Cline.
- "Exploit XML indexes for XML query performance in DB2 9" (developerWorks, November 2006), Matthias Nicola
- Visit the DB2 XML (pureXML) wiki to stay up-to-speed on DB2 XML technology.
- Browse the technology bookstore for books on these and other technical topics.
- developerWorks Information Management zone: Learn more about DB2. Find technical documentation, how-to articles, education, downloads, product information, and more.
- Stay current with developerWorks technical events and webcasts.
Get products and technologies
- DB2 v9: Get DB2 9 software.
- Find useful information about the DB2 family of products and features as well as related WebSphere(R) Information Integration products and features at the DB2 Information Center.
- DB2 Express-C: Free to develop, deploy, and distribute.
- Build your next development project with IBM trial software, available for download directly from developerWorks.
- Participate in the discussion forum.
- Check out developerWorks blogs and get involved in the developerWorks community.
Dig deeper into Information management on developerWorks
Get samples, articles, product docs, and community resources to help build, deploy, and manage your cloud apps.
Experiment with new directions in software development.
Software development in the cloud. Register today to create a project.
Evaluate IBM software and solutions, and transform challenges into opportunities.