A performance comparison of DB2 9 pureXML and CLOB or shredded XML storage

Like other databases, the DB2® V8 XML Extender offers two storage and access models for XML: XML documents can be stored intact as unparsed text in CLOB columns, or they can be mapped and shredded to a set of relational tables. Both options have known performance limitations. The new pureXML™ technology in DB2® 9 seeks to overcome these limitations by storing and querying XML in its inherent hierarchical format. This article describes a series of measurements to characterize the cases in which pureXML does or doesn't provide a performance benefit, and to quantify the performance difference to CLOB or shredded storage.

Share:

Matthias Nicola (mnicola@us.ibm.com), DB2/XML Performance, IBM Silicon Valley Laboratory

Author photo: Matthias NicolaDr. Nicola is the technical lead for XML database performance at IBM's Silicon Valley Lab. His work focuses on all aspects of XML performance in DB2, including XQuery, SQL/XML, and all native XML features in DB2. Dr. Nicola works closely with the DB2 XML development teams as well as with customers and business partners who are using XML, assisting them in the design, implementation, and optimization of XML solutions. Prior to joining IBM, Dr. Nicola worked on data warehousing performance for Informix Software. He also worked for four years in research and industry projects on distributed and replicated databases. He received his doctorate in computer science in 1999 from the Technical University of Aachen, Germany.


developerWorks Contributing author
        level

Vitor Rodrigues (vrodrig@us.ibm.com), DB2 pureXML Technical Enablement, IBM

photo2Vitor Rodrigues is a software developer at the IBM Silicon Valley Lab. He graduated from University of Minho, Portugal, in Computer Science and Systems Engineering. He joined IBM in 2005 as an intern working on DB2 Everyplace and DB2 9 pureXML. Vitor was a part of the DB2 9 QA team for pureXML, where he acquired deep knowledge of the XML features in DB2 9. After his internship, Vitor became a regular employee, now working for the DB2 9 XML Enablement team.



07 December 2006

Also available in Chinese

Introduction

The pureXML™ technology in DB2® 9 is designed to provide the highest level of performance for XML data management. This article compares its performance with that of character large object (CLOB) and shredded XML storage. Many database systems allow you to store XML data as CLOBs or "shred" the data into relational tables. These two options are also supported in DB2® V8 through the XML Extender, which is still available unchanged in DB2 9 for backwards compatibility. However, they are superseded by the pureXML features.

The DB2 XML Extender consists of a set of stored procedures, user-defined functions (UDFs), and user-defined data types (UDTs) which add XML capabilities on top of the core DB2 engine. Equipped with an XML parser and XML-specific logic in its procedures and UDFs, the XML Extender performs XML storage and retrieval (supported by traditional DB2 engine features). You can use the XML Extender with either an XML Extender Column or an XML Extender Collection.

An XML Extender Column allows you to store XML documents intact as plain unparsed text. This process is quite simple, but ignores the internal structure of the XML documents. You can choose to make the underlying storage a CLOB column, a VARCHAR column, or files within the file system. In a VARCHAR column, the XML Extender only stores documents up to 3KB in size -- a limit which is hard to guarantee in many applications. Knowledgeable database administrators (DBAs) can change the limit to 32k, but often even this maximum cannot be guaranteed. External file storage is more flexible but does not benefit from database-managed persistency and integrity. This is why CLOBs, which can store documents up to 2GB in size, have emerged as the most common choice for XML Extender Columns. Later, this article examines the performance of XML Extender CLOB columns.

An XML Extender Collection allows you to convert XML data to a relational format. This requires a fixed mapping from the expected XML structure to a collection of relational tables in your database schema. Based on that mapping, a stored procedure extracts atomic data values from XML documents and inserts them into traditional relational rows and columns. This process is known as "shredding" or decomposition. It involves XML parsing and translates a single logical XML document insert into a series of SQL row inserts. In real-world applications, it can easily take several dozen relational tables to represent all one-to-many relationships in the original XML structure. Thus, the mapping becomes complex very quickly, and XML insert performance suffers correspondingly. Once the data is available in relational format, plain SQL can be used for data access and manipulation. However, reconstruction of the original XML documents is also expensive. It requires multi-way joins and the generation of proper XML tagging. The tagging can be defined by the standardized SQL/XML publishing functions, to reconstruct the original or new and different documents. However, the XML Extender Collection won't preserve any digital signatures associated with the original XML document.

Making XML data available in relational format continues to be an important requirement. The most common reason is the need to feed legacy SQL applications, packaged business applications and business intelligence (BI) tools that consume only relational data. Therefore DB2 9 offers a new "decomp" solution, also known as "annotated schema shred" or "new shred", which can be 7 to 8 times faster than XML Extender Collection shredding. Later in this article, the performance of this new high-speed shredder is compared to the IBM pureXML™ support in IBM DB2 9.

The new pureXML technology in DB2 9 is significantly different from CLOB or shredded XML storage. It does not store documents as plain text and does not map XML to relational or object-relational tables. Instead, it stores XML in its inherent hierarchical format, which matches the XML data model. Any XML document is a well-defined tree of elements and attributes, and XML queries are expressed in terms of tree traversal. Therefore, it seems intuitive that a corresponding hierarchical storage and processing format allows for efficient XML data management. To get to the bottom of this claim, the performance of pureXML in DB2 9 is compared with CLOB-based and shredded XML processing.


Test setup

Table 1 summarizes the comparisons made in this article. Basically, key XML operations on CLOB and shredded storage are compared to corresponding pureXML operations.

Table 1: Comparing CLOB to shredded XML processing and pureXML

XML in CLOBsDB2 9 pureXML

Insert XML into an XML Extender CLOB column

Insert XML into an XML column

Full document retrieval from CLOBs

Full document retrieval from an XML column

Query XML in CLOBs using XML Extender "extract" functions (XML parsing at query time)

XQuery over an XML column

XML shredded to relational tablesDB2 9 pureXML

Shred XML to relational tables, using the DB2 9 new shred

Insert XML into an XML column

SQL/XML publishing to construct XML documents from relational data
(such as previously shredded XML)

XQuery over an XML column

All tests were performed with the following data and settings:

  • A 4-CPU pSeries system with IBM® AIX® 5.2 (64bit) and a single DB2 9 instance
  • Between 1,000 and 100,000 CustAcc documents (4kb to 20kb in size) were used, taken from the financial scenario in the "DB2 9 XML performance characteristics" article
  • Database managed (DMS) table spaces with a page size of 32kb
  • All table spaces were defined with the no file system caching option, except where otherwise noted (for some of the CLOB storage tests)
  • All tables spaces were distributed over 10 physical disks, with the database log on a separate striped volume
  • Identical database configuration and tuning was used in all tests to ensure fair and valid comparisons

Comparing CLOB to pureXML columns

This comparison is interesting because for a certain class of today's XML applications, CLOB columns are a commonly choice for XML storage. Before DB2 9, there wasn't a better alternative. A fundamental difference between CLOB storage and pureXML processing lies in XML parsing and its significant impact on insert and query performance.

If XML documents are inserted into CLOB columns, they are inserted as unparsed text objects. Avoiding XML parsing at insert time can be a performance benefit, especially in CPU-bound systems. However, without XML parsing, the structure of the XML documents is entirely ignored. This precludes the database from performing intelligent and efficient search and extract operations on the stored text objects. The only remedy is to invoke the XML parser at query execution time to "look into" the XML documents such that search conditions can be evaluated. The significant CPU consumption of XML parsing often leads to low search and extract performance. Only a blind full document retrieval, which again ignores the internal XML structure, can quickly read XML documents from CLOB columns.

The pureXML technology in DB2 9 parses XML documents at insert time and never at query time. The XML documents are stored and queried in a parsed format, denoted in DB2 9 by the use of a new data type, "XML". This parsed format is a tree structure of nodes, which is different from the textual representation of XML documents. Search and extract operations can be performed without XML parsing, a significant performance benefit, because the XML parsing overhead is incurred at insert time instead. Similarly, document retrieval from an XML column requires serialization, that is, converting the parsed XML format back into its original textual representation. This overhead does not exist when reading full XML documents from a CLOB where the XML is already stored in text form.

In summary, CLOB storage offers good performance for insert and full document retrieval, typically at the cost of poor search and extract performance. The XML data type in DB2 9 sacrifices some of the insert and retrieval performance for much higher search and extract performance. This is a reasonable trade-off because business data is more often searched and analyzed than inserted. Typically, there is one insert and multiple searches. Additionally, the potential overhead of XML columns is often outweighed by the fact that XML columns are buffered in the buffer pool while CLOB columns are not.

The next section reviews performance measurements to quantify these trade-offs.


Comparing CLOB inserts to XML inserts

In the first test, we sequentially inserted 100,000 documents with and without indexes, committing after every document as many OLTP applications would. The result in Figure 1 shows the relative elapsed time between XML and CLOB column inserts (lower is better). Setting the elapsed time for XML inserts as the comparison baseline (100%), maintaining six XML indexes incurs only little overhead (5% in our scenario).

Figure 1: Comparing the single-user insert performance of XML to CLOB columns
Compare insert performance of XML to CLOB columns

Inserting the same XML data into a CLOB column takes approximately half the time (53%). This is because XML parsing and deeper processing of the XML data is avoided. Loosely speaking, XML Extender "side tables" are for CLOB columns what real XML indexes are for XML columns. Maintaining side tables at insert time requires XML parsing and additional relational inserts because selected XML element and attribute values are extracted and stored separately. Consequently, the insert elapsed time is at least as high as for pureXML inserts, in our scenario even 23% higher.

The experiments in Figure 1 used the no file system caching option for the DB2 table space. If you allow file system caching for the DMS table space containers (Figure 2), the XML insert performance improves slightly while the CLOB inserts suffer. Since CLOB inserts use direct writes, the file system cache is not needed and is pure overhead. However, file system caching can help read operations on CLOB columns if no XML parsing is involved.

Figure 2: The impact of file system caching on XML and CLOB column inserts (no indexes or side tables)
Impact of file system caching on XML and CLOB column inserts

Single-user database applications are very rare, so it is important to look at the performance behavior of a concurrent workload. The insert test was modified to use 10 concurrent threads with 10 separate connections to the database, each inserting 10,000 documents without think time. Figure 3 shows that the increased workload intensity hurts CLOB performance drastically. While CLOB inserts were about twice as fast as XML inserts in our single-user tests (Figure 1), they are more than twice as slow in our multi-user test (207%). This is because CLOB inserts do not benefit as much from parallelism as XML column inserts. With increasing degrees of concurrency, buffered inserts for XML columns are far more scalable than concurrent direct writes for CLOBs.

Figure 3: Multi-user insert performance of XML compared to CLOB columns
Multi-user insert performance of XML compared to CLOB columns

Figure 4 clarifies that the parallelism speed-up is much higher for XML columns than for CLOB columns. Taking single-user XML inserts as the 100% baseline, our system and configuration allowed 10 concurrent insert streams to insert the same 100,000 documents in 18% of the time, which is more than five times faster. CLOB inserts didn't benefit nearly as much from the parallelism and still took 37% of the baseline, which is only 1.4 times faster than non-parallel CLOB inserts.

Figure 4: Parallelism speed-up for XML and CLOB column inserts
Parallelism speed-up for XML and CLOB column inserts

All of these insert tests were also conducted with 1000, 5000, 10,000, and 50,000 documents. As expected, the elapsed insert time for both CLOB and XML columns is approximately linear in the number of documents inserted. Therefore, the corresponding charts are omitted for brevity.


XML queries on CLOB compared to XML columns

To assess the difference in query performance between XML and CLOB columns, five queries were designed to cover the following common search and retrieval cases:

  1. Full document retrieval of all documents, no predicate
  2. Full document retrieval of one document matching certain criteria (one predicate)
  3. Full document retrieval of documents matching certain criteria (multiple predicates)
  4. Partial retrieval of all documents
  5. Partial retrieval of all documents matching certain criteria

These operations were implemented in the five queries as follows:

Q1 (Select*): Selects all XML documents (select * from <table>)

Q2 (1Pred1Doc): Returns the Customer document for a given account number

Q3 (5PredSome): Returns Customer documents for all female customers with a primary address in California who have an account in USD and have not yet reached Premium customer status

Q4 (PartialAll): For each Customer, returns the name, and the sum of the balances of all his/her accounts

Q5 (PartialSome): Gets the primary email address for all customers who are holding IBM stock in any of their accounts.

For CLOBs, these queries were expressed in SQL with XML Extender extract functions. For XML columns, they were expressed in XQuery notation. Whether the XQueries were embedded in SQL or executed stand-alone did not make a performance difference in our tests. All queries and some sample data are available in pureXMLvsCLOBvsShredded.zip in the Downloads section.

Figure 5 shows the query performance (elapsed time) of all five test queries for both pureXML and CLOBs. You can see that pureXML queries can easily be 20, 30 or 40 times faster than queries over XML in CLOB columns. These results are for our default setup with no file system caching for table spaces. Figure 6 contains the same results with file system caching. File system caching has a significant impact only on query Q1 which retrieves all documents without any predicate evaluation. Without file system caching, Q1 performs similarly on XML and CLOB columns, with a slight disadvantage (10%) for the CLOB column which is not buffered in the buffer pool. File system caching significantly improves CLOB retrieval performance (see Figure 6) such that query Q1 can be more than twice as fast as on the XML column. This is because reading from the XML column requires serialization, or the conversion of parsed XML back to text format. Without file system caching, this overhead is mitigated by the fact that XML columns are buffered in the DB2 buffer pool and CLOB columns are not.

Figure 5: Query performance, no indexes, no file system caching
Query performance, no indexes, no file system caching

For queries 2 through 5, file system caching does not play a significant role. These queries always require sub-document level access to evaluate predicates and to extract document fragments. This is where the real benefit of pureXML kicks in: XML is stored in a parsed format, so no parsing is required at query execution time. In our tests, this yields a performance speed-up between 7 times and 44 times.

Figure 6: Query performance, no indexes, with file system caching
Query performance, no indexes, with file system caching

It is important to note that the difference in query response time between XML columns and CLOB columns increases drastically with the amount of data that needs to parsed (for CLOBs) or traversed (for pureXML). Figure 7 shows the query response times as a function of the number of documents in the table, ranging from 1,000 to 100,000 documents (when no indexes or side tables are used).

Figure 7: Query 2 performance as a function of the data volume
Query 2 performance as a function of the data volume

The XML Extender offers the concept of side tables to speed up the search for XML documents in order to avoid XML parsing for predicate evaluation. At insert time, specific elements and attributes are extracted into relational tables. You already know that this adds significant overhead to CLOB inserts, but the side tables can be efficiently searched and joined with the main table containing the CLOBs. Three of our five test queries (q2, q3, and q5) contain filtering predicates that can benefit from side table look-ups. Side tables can avoid much of the XML parsing for CLOBs and can often make CLOB queries 100 times faster or more.

In Figure 8, let's compare this to pureXML with real XML indexes that provide similar benefits. All of the six bars in Figure 8 represent elapsed times of about one second or less. And yet pureXML with indexes is 6 to 35 times faster than CLOB columns with side tables. There are multiple reasons for this. A pureXML index points directly to the row with the corresponding document. With side tables, DB2 first performs an index lookup on the side table and then joins the corresponding row with the main table that contains the CLOBs. Query Q3 (5PredSome) has multiple predicates and uses three side tables plus the main table, so it needs to compute a four-way join. Query Q5 uses side tables for the predicates, but requires an extract function (with XML parsing) to retrieve the customers email addresses.

Figure 8: Predicate evaluation with indexes (pureXML) and side tables (CLOBs)
Predicate evaluation with indexes (pureXML) and side tables (CLOBs)

Figure 8 clarifies that although side tables can reduce the number of CLOBs that must be parsed, XML indexes in DB2 9 similarly reduce the number of documents that must be traversed for predicate evaluation. Thus, the use of indexes and side tables typically reduces the absolute elapsed times for both XML and CLOBs, but does not erase the large relative performance advantage of XML over CLOB columns.

Multi-user query tests were executed with 10 users on both CLOB and XML columns. The results are omitted because the relative performance difference between CLOB and pureXML storage was similar to the single-user tests shown above.


Shredded storage compared to XML columns

Shredding XML data to relational tables is still a common requirement. A typical reason is that existing applications may not yet understand XML and require a relational format. This includes legacy SQL applications and packaged business applications as well as BI and reporting tools. The annotated XML Schema decomposition in DB2 9 ("new shred") resolves the functional and performance limitations of the older XML Extender shredding. The following section examines how the new DB2 9 shredding performance stacks up against the pureXML technology in DB2 9.

For these tests, the same Customer data was used as before. Due to various repeating elements, a total of 87 columns in 12 tables were required to represent the XML data in a traditional relational schema. The schema is shown in Figure 9, indicating the number of columns and the number of rows in each table after shredding 100,000 of our customer documents. In total, the tables contain more than 3.5 million relational rows to represent the 100,000 XML documents. The arrows in Figure 9 indicate one-to-many relationships. Indexes are defined on all primary and foreign key columns to support efficient joins between the 12 tables.

Figure 9: Relational schema to hold shredded XML data
Relational schema to hold shredded XML data

Comparing shredding to relational tables with pureXML insert

Shredding the 100,000 documents in this relational schema with the new annotated schema shred in DB2 9 is about 1.75 times slower than inserting the same XML documents into an XML column (see Figure 10). This is with commits after every document, as is typical of OLTP applications.

Figure 10: pureXML inserts compared to DB2 9 shredding, commit after every document
pureXML inserts compared to DB2 9 shredding, commit after every document

If you commit less frequently, you can increase the performance advantage of an XML column over shredding. Committing after every other document makes XML inserts 2.56 times faster than shredding (in our scenario and configuration). For bulk insert or import operations, you would only commit every 50 or 100 documents and find XML column inserts 4 to 5 times faster than shredding (see Figure 11). XML column inserts benefit from larger commit intervals more than shredding does, with a larger number of log pages per log I/O request.

Figure 11: pureXML insert compared to DB2 9 shredding, varying commit interval
pureXML insert compared to DB2 9 shredding, varying commit interval

For a commit interval of 50, Figure 12 compares the new annotated schema shred in DB2 9 to the XML Extender shred in V8 and pureXML inserts. In our tests, the new shred is 7 times faster than the XML Extender shred, and pureXML inserts can be even 30 times faster than V8 shredding.

Figure 12: DB2 V8 XML Extender shredding compared to DB2 9 technology (commit interval = 50)
DB2 V8 XML Extender shredding compared to DB2 9 technology (commit interval = 50)

Comparing XML queries on shredded data to pureXML columns

Assume that you need to serve an XML-oriented application or web service that requires query results in XML format. If the XML data is shredded to relational tables, relational query results must be converted back to XML. For this purpose, SQL/XML publishing functions can be used in the SELECT clause of SQL queries to construct the required XML tagging for the result set.

Five SQL queries were defined over the relational schema in Figure 9 that are logically equivalent to our five XQueries over XML columns. These SQL queries use traditional relational predicates, join some or all of the tables, and use SQL/XML publishing functions to return the same XML results as the XQueries. The relative performance results are shown in Figure 13.

Query Q2 (1Pred1Doc) returns only one document, based on an account number lookup. In both cases (pureXML and publishing), there are indexes on the account number to improve the performance of this query. Although the SQL/XML query is quite fast, it's still 75 times slower than an XQuery on an XML column. This is because an XQuery only needs to locate the document through the index and serialize it, while SQL/XML publishing must join all 12 relational tables and construct the XML document.

Queries Q1 (Select*) and Q3 (5PredSome) return many documents, so the cost for SQL/XML document construction from relational data is multiplied. Query Q4 (PartialAll) reads a few values from each XML document and constructs a whole new document out of them. With XQuery over pureXML storage the values are read from the XML column and the construction is expressed in XQuery. With SQL/XML publishing over shredded data, the values are read from relational columns, and the construction is expressed in SQL/XML. In both cases, the construction is the bottleneck. This is why both versions of Q4 perform approximately identically.

Figure 13: pureXML XQuery compared to SQL/XML publishing over shredded data
pureXML XQuery compared to SQL/XML publishing over shredded data

Q5 (PartialSome) uses multiple predicates to return very few XML documents with only one element. In this case, SQL/XML is not very expensive because it constructs just one element per result row and joins only 3 of the 12 tables. Since searching over relational data can be somewhat faster than searching over XML data, you see better performance for this query over shredded data (20 times faster than XQuery in our setup).


Summary and conclusions

While a single-user workload may find CLOB inserts faster than pureXML inserts, the intensity of concurrent inserts in typical business scenarios can make CLOB inserts 2 or 2.5 times slower than inserts into an XML column. If inserts are committed after every document, pureXML inserts are about 60% to 70% faster than the new shredding solution in DB2 9. For bulk inserts or import with less frequent commits, pureXML can ingest XML data even 4 to 5 times faster than shredding. These tests used the new shredding solution in DB2 9, which is 7 to 8 times faster than XML Extender shredding in V8.

XQuery over XML data in XML type columns can be 40 times faster than corresponding queries over CLOB columns which require XML parsing at query time. The absolute performance difference between "pureXML queries" and "CLOB queries" increases (linearly) with the amount of data that is queried.

The relative performance difference between pureXML XQueries and SQL/XML Publishing over shredded data depends heavily on the amount of relational data that needs to be tagged as XML, but also on the number of joins required. XQuery clearly outperforms publishing queries when complex data must be retrieved. In some of our tests, retrieving XML data from DB2 pureXML storage can be 50 to 100 times faster than constructing XML data from relational tables. However, simple search queries without complex joins that return results with little or no XML tags can be much faster in SQL. Still, an application always needs to weigh query performance against insert performance and evaluate the mix of both. XML insert performance is summarized in Figure 14.

Figure 14: Summary of XML insert performance (commitcount =1)
Summary of XML insert performance (commitcount =1)

Note that all performance results in this article were obtained in an isolated lab environment with specific hardware, operating, and database configurations. Different tests in different environments may show bigger or smaller performance differences than the results described in this article.

Although all of our measurements used DB2 only, many of the performance issues with CLOB and shredded XML storage are inherent in the general concepts of storing XML as text or converting it to the relational data model, respectively. Therefore, you may see similar performance characteristics in other DBMS environments that support these concepts. Readers interested in such matters are encouraged to explore such issues in their own environments, perhaps using the materials in this article as a basis for developing their own tests.

Acknowledgement

For their reviews of this paper, we'd like to thank Henrik Loeser, Cindy Saracco and Nikolaj Richers.


Download

DescriptionNameSize
Sample scripts for this articlepureXMLvsCLOBvsShredded.zip14KB

Resources

Learn

Get products and technologies

  • DB2 v9: Get DB2 9 software.
  • Find useful information about the DB2 family of products and features as well as related WebSphere(R) Information Integration products and features at the DB2 Information Center.
  • DB2 Express-C: Free to develop, deploy, and distribute.
  • Build your next development project with IBM trial software, available for download directly from developerWorks.

Discuss

Comments

developerWorks: Sign in

Required fields are indicated with an asterisk (*).


Need an IBM ID?
Forgot your IBM ID?


Forgot your password?
Change your password

By clicking Submit, you agree to the developerWorks terms of use.

 


The first time you sign into developerWorks, a profile is created for you. Information in your profile (your name, country/region, and company name) is displayed to the public and will accompany any content you post, unless you opt to hide your company name. You may update your IBM account at any time.

All information submitted is secure.

Choose your display name



The first time you sign in to developerWorks, a profile is created for you, so you need to choose a display name. Your display name accompanies the content you post on developerWorks.

Please choose a display name between 3-31 characters. Your display name must be unique in the developerWorks community and should not be your email address for privacy reasons.

Required fields are indicated with an asterisk (*).

(Must be between 3 – 31 characters.)

By clicking Submit, you agree to the developerWorks terms of use.

 


All information submitted is secure.

Dig deeper into Information management on developerWorks


static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=Information Management, XML
ArticleID=182334
ArticleTitle=A performance comparison of DB2 9 pureXML and CLOB or shredded XML storage
publish-date=12072006