15 best practices for pureXML performance in DB2

DB2 9 introduces pureXML support, which means that XML data is stored and queried in its inherent hierarchical format. To query XML data, DB2 offers two languages, SQL/XML and XQuery. Additionally, DB2 9 has sophisticated XML indexing capabilities and support for XML Schema validation. While most existing performance guidelines for DB2 also apply to XML data, this article provides additional XML-specific performance tips. This article has been updated for DB2 9.5. [2009 May 26: Corrected code in Listings 12 and 13.--Ed.]

Matthias Nicola (mnicola@us.ibm.com), DB2/XML Performance, IBM Silicon Valley Laboratory

Author photo: Matthias NicolaDr. Nicola is the technical lead for XML database performance at IBM's Silicon Valley Lab. His work focuses on all aspects of XML performance in DB2, including XQuery, SQL/XML, and all native XML features in DB2. Dr. Nicola works closely with the DB2 XML development teams as well as with customers and business partners who are using XML, assisting them in the design, implementation, and optimization of XML solutions. Prior to joining IBM, Dr. Nicola worked on data warehousing performance for Informix Software. He also worked for four years in research and industry projects on distributed and replicated databases. He received his doctorate in computer science in 1999 from the Technical University of Aachen, Germany.


developerWorks Contributing author
        level

26 May 2009 (First published 05 October 2006)

Also available in Chinese Japanese Vietnamese Spanish

Introduction

The pureXML support in DB2 9 offers efficient and versatile capabilities for managing your XML data. Performance is a high priority for many XML applications; and DBAs, as well as application designers, can do their fair share to ensure good performance. To begin with, there are all the traditional DB2 performance guidelines for balanced CPU/memory/disk configurations, table space and buffer pool tuning, locking, logging, query execution plans, and so forth. All of these topics are covered in previous DB2 articles (see Resources) and are still relevant when you manage XML data in DB2.

Fortunately, a number of these issues are handled by DB2's autonomic capabilities such as automatic storage and self-tuning memory management. They provide high levels of performance for many applications and require very little manual intervention. But XML applications with aggressive performance requirements may benefit from additional performance considerations. This article focuses on such situations, offering tips and guidelines for achieving maximum performance of XML-related applications in DB2 9.

Here are 15 XML performance tips (in no particular order) which we discuss and illustrate in this article. These 15 tips cover a variety of areas, but experience shows that applications with performance issues often need to apply only one or two of these tips to achieve the desired performance.

In the discussion of these performance tips, we assume that you are familiar with basic DB2 administration and performance practices as well as with the basics of DB2s pureXML support. For example, you should know about XML columns, XML indexes, and how to query XML data with SQL/XML and XQuery. All these prerequisites are covered in articles previously published on developerWorks (see Resources).


DB2 XML Performance Tips

Tip 1: Choose your XML document granularity wisely

When you design your XML application and your XML document structure, in particular, you may have a choice to define which business data is kept together in a single XML document. For example, in our department table below, we use one XML document per department (medium granularity). This is a reasonable choice if a department is the predominant granularity at which our application accesses and processes the data. Alternatively, we could have decided to combine multiple or many departments into a single XML document, e.g. all those that belong to one unit (coarse granularity). This, however, is sub-optimal if we typically process only one department at a time.

Table 1. create table dept( unitID char(8), deptdoc xml)
unitIDdeptdoc
WWPR
<dept deptID='PR27'>
   <employee id='901'>
      <name>Jim Qu</name>
      <phone>408 555 1212</phone>
   </employee>
   <employee id='902'>
      <name>Peter Pan</name>
      <office>216</office>
   </employee>
</dept>
WWPR
<dept deptID='V15'>
   <employee id='673'>
      <name>Matt Foreman</name>
      <phone>416 891 7301</phone>
      <office>216</office>
   </employee>
   <description>This dept supports sales world wide</description>
</dept>
S-USE...
......

We could also have decided to have one XML document per employee (fine granularity), with an additional "dept" attribute for each employee to indicate which department he or she belongs to. This would be a very good choice if employees in themselves use business objects of interest which are often accessed and processed independently from the other employees in the same department. But, if the application typically processes all employees in one department together, one XML document per department can be better.

In particular, batching up many independent business objects in a single document is not recommended. DB2 uses indexes over XML data to filter on a per-document level. Therefore, the finer your XML document granularity, the higher your potential benefit from index-based access. Also, if your application uses a DOM parser to ingest the XML retrieved from DB2, small documents will allow better performance.

Related to XML document design, a common question is when to use attributes vs. elements and how that choice affects performance. This is much more a data modeling question than a performance question. As such, this question is as old as SGML, the precursor of XML, and has been hotly debated with no universally accepted consensus. However, an important fact to adhere to is that XML elements are more flexible than attributes because they can be repeated and nested. For example, in our department documents, we use an element "phone" which allows us to have multiple occurrences of "phone" if an employee has multiple numbers. It is also extensible in case we later need to break phone numbers into fragments, i.e. the "phone" element could have child elements for country code, area code, extension, etc.. If "phone" was an attribute of the employee element, then it can exist only once per employee, and we could not add child elements too it, which may hinder schema evolution over time. Although you could model all your data without attributes, they can be a very intuitive choice for data items which are known in advance to never repeat (per element) nor have any sub-fields. Attributes contribute to somewhat shorter XML because they have only a single tag as opposed to elements with a start tag and an end tag. In DB2, attributes can be used in queries, predicates, and index definitions just as easily as elements. Since attributes are less extensible than elements, DB2 can apply certain storage and access optimizations. This should be considered an extra performance bonus rather than an incentive to convert attributes to elements, especially when data modeling considerations actually call for elements.

In a nutshell, choose your XML document granularity with respect to the anticipated predominant granularity of access. When in doubt, it is usually better to lean towards finer granularity and smaller XML documents.

Tip 2: Use DMS and larger pages for better for XML performance

Database managed table spaces (DMS) provides higher performance than operating system managed table spaces (SMS). This is true for relational data, and even more so for XML read and write access. In DB2 9, newly created table spaces are DMS by default. It is also recommended to use DMS table spaces with automatic storage so that DMS containers grow as needed without manual intervention. If an XML document is too large to fit on a single page in a table space, DB2 splits the document into multiple regions which are then stored on multiple pages. This is transparent to your application and allows DB2 to handle XML documents up to the bind-in limit of 2GB per document.

Generally, the lower the number of regions (splits) per document the better the performance, especially for insert and full-document retrieval. If a document does not fit on a page, the number of splits per document depends on the page size (4KB, 8KB, 16KB, or 32KB). The larger the page size of your table space the lower the number of potential splits per document. For example, let's say a given document gets split across forty 4KB pages. Then the same document may get stored on only twenty 8KB pages, or ten 16KB or five 32KB pages, respectively. If the XML documents are significantly smaller than the selected page size, no space will be wasted since multiple small documents can be stored on a single page.

As a rule of thumb, choose a page size for XML data which is not smaller than two times your average expected document size, subject to the maximum of 32KB. If you use a single page size for relational and XML data, or for data and indexes, a 32KB page size may be beneficial for XML data but somewhat detrimental for relational data and index access. In such cases, 16KB or 8KB pages may be a better choice that works well for both.

Tip 3: Exploit storage options for XML: inlined, compressed, or a separate table space

Let's consider the following sample table to discuss the storage options for XML data. The table contains relational data and XML data:

Listing 1. Sample table with XML and relational data
create table product(pid bigint, name varchar(20), brand varchar(35), 
                     category integer, price decimal, description XML);

With this table definition, the XML data and the relational data of the table are by default stored in the same table space. This means they use the same page size and are buffered in the same buffer pool. Within the table space, the relational data is stored in the DAT object, while XML data resides in the XDA object. This is because XML documents, like LOBs, can be too large to fit within a single row on a data page of the table. This default layout provides good performance for most application scenarios.

If you have done a performance analysis and find that you need a large page size for XML data but a small page size for relational data or indexes, you can use separate table spaces to achieve this. When you define a table, you can direct "long" data into a separate table space with a different page size. Long data includes LOB and XML data.
The following example defines two buffer pools and two table spaces, one each with 4KB and 32KB pages. (Note that a table space always requires a buffer pool with a matching page size.) The table "product" is assigned to table space "relData" with 4KB pages. All of its columns are stored in that table space, except the XML column "description", which is stored on 32KB pages in table space "xmldata".

Listing 2. XML and relational data in separate table spaces and buffer pools
create bufferpool bp4k pagesize 4k;
create bufferpool bp32k pagesize 32k;

create tablespace relData
pagesize 4K
managed by automatic storage
bufferpool bp4k;

create tablespace xmlData
pagesize 32K
managed by automatic storage
bufferpool bp32k;

create table product(pid bigint, name varchar(20), brand varchar(35), 
                     category integer, price decimal, description XML) 
     in relData
     long in xmlData;

Table space defaults in DB2 9 are different than in DB2 V8. Unless explicitly specified, new table spaces are created as DMS with large row IDs. This means that a table space with 4KB pages can grow to 2TB instead of 64GB in V8, and with 32KB pages to 16TB instead of 512GB. Also, the limit of 255 rows per page is removed, allowing up to 2335 rows on 32KB pages. Thus, the row-per-page limit in itself is no longer a reason to use small pages for relational data.

DB2 9.5 also allows you to store XML data "inlined" and compressed. If some or all of your XML documents are small enough to fit into their corresponding row on the base table page in the DAT object, they can be inlined into the relational row. This provides more direct access to the XML data and avoids the redirected access to the XDA object. If some documents in the XML column are still too large to be inlined, they are stored "outlined" in the XDA object as usual. Inlining can reduce the size of the regions index dramatically since inlined documents do not require any regions index entries. They always consist of a single, inlined region. Documents that are inlined can also be compressed using the regular DB2 row compression, as shown in Listing 3:

Listing 3. Inlined and compressed XML storage
create table product(pid bigint, name varchar(20), brand varchar(35), 
                     category integer, price decimal, 
                     description XML inline length 3000) compress yes;

In this example, the XML column is defined with the option "inline length 3000". This means that any document that can be stored in 3000 bytes or less will be inlined. Relevant for inlining is the size of the document after XML parsing in DB2, not the size of the textual XML document in your file system. The inline length must be smaller than the page size minus the size of the other columns in the table. Inlined XML data always resides in the same table space as the relational columns of the table and cannot be stored on a different page size or in a separate table space.

Since the sample table is defined with the option "compress yes", the relational data and the inlined XML documents get compressed. It's not uncommon to compress inlined XML data by 70 to 85 percent. The following statement can be used to check the compression ratio of the "product" table:

Listing 4. Admin function to check the compression ratio
select tabname,pages_saved_percent,bytes_saved_percent 
from table(sysproc.admin_get_tab_compress_info('MYSCHEMA','PRODUCT','ESTIMATE')) as t

Compressing XML data can provide a tremendous performance boost if your system is rather I/O bound than CPU bound. Note, however, that inlining significantly increases the row size on your data pages. This in turn decreases the number of rows per page. Queries that only access the relational columns of the table now need to read a much larger number of pages than without inlining. This can lead to more I/O and lower performance for these queries. If your queries typically always touch the XML column, then this does not affect you.

In a nutshell, use common sense when considering separate table spaces for XML data. Fewer buffer pools and table spaces and fewer distinct page sizes lead to a simpler physical database design which is easier to manage, maintain, and tune. Avoid introducing multiple page sizes unless you know that it really provides a worthwhile performance benefit. Use inlining and compression to reduce storage consumption and increase I/O performance.

Tip 4: How to configure DB2 for fast bulk inserting of XML data

DB2 9 supports two options for moving XML data from a file system into a DB2 table: insert and import. Insert and import have similar characteristics from a performance and tuning point of view because the import utility actually executes a series of inserts.

Since DB2 9.5, you can also use the DB2 LOAD utility to move XML data into a table. The key advantages of the LOAD utility are the same for XML as for relational data, such as the data does not get logged and parallelism is automatically used to increase performance. DB2 determines a default degree of parallelism based on the number of CPUs and table space containers. You can also set the CPU and I/O parallelism with the parameters CPU_PARALLELISM and DISK_PARALLELISM in the syntax of the LOAD command.

Whether your application performs bulk inserts, possibly through concurrent insert threads, or if you use import or load, the following performance guidelines apply:

  • As a key prerequisite, be sure to use DMS table spaces with a large page size (see Tip 2).
  • Even if you have not defined any indexes on the target table , DB2's pureXML storage mechanism transparently maintains so-called regions and path indexes for efficient XML storage access. Thus, provide sufficient buffer pool space to support index reads.
  • If you need multiple user-defined XML indexes on your table, it is typically better to define them before the bulk insert rather than creating them afterwards. During insert, each XML document will be processed only once to generate index entries for all XML indexes. However, if you issue multiple "create index" statements, all documents in the XML column will be traversed multiple times.
  • If you need to move a very large number of small XML files from a file system into a DB2 table, it can help performance to have them in a dedicated file system for which file system caching is disabled. Since each file is read only once, caching is not needed. On AIX, mounting this file system with the -o cio option has been found beneficial.

Consider the following guidelines for insert and import operations:

  • "ALTER TABLE <tablename> APPEND ON" enables append mode for the table. New data is appended to the end of the table instead of searching for free space on existing pages. This provides for improved runtime performance of bulk inserts.
  • Increasing the log buffer size (LOGBUFSZ) and the log file size (LOGFILSIZ) helps insert performance. This is particularly important for XML inserts since the data volume per row tends to be a lot bigger than for relational data. A fast I/O device for the log is recommended.
  • You can avoid logging if you use "ALTER TABLE <tablename> ACTIVATE NOT LOGGED INITIALLY" (NLI). However, be warned that if there is a statement failure, the table will be marked as inaccessible and must be dropped. This often prohibits NLI for incremental bulk inserts in production systems, but can be useful for the initial population of an empty table. Beware that NLI prevents concurrent inserts/import into a target table and that parallelism may yield higher performance than NLI.
  • If you use import, a small value for the COMMITCOUNT parameter tends to hurt performance. Committing every 100 rows or more will perform better than committing every row. You can also omit the COMMITCOUNT parameter and let DB2 commit as often as appropriate.
  • To better utilize multiple CPUs and disks, you can run multiple import commands concurrently. Make sure that each import runs in its own connection to the database and uses the "ALLOW WRITE ACCESS" clause to avoid table locks. You do not need separate input files (DEL files) to run concurrent imports. Each import can read a different section of the same input file because the import command allows you to specify "SKIPCOUNT m ROWCOUNT n" to read rows m+1 to m+n from the input file.

For additional insert performance guidelines, see the article "Tips for improving INSERT performance in DB2 Universal Database" [see Resources].

In a nutshell, traditional insert and logging performance tuning is good for XML insert and import. You can run parallel import sessions, if you add the ALLOW WRITE ACCESS clause to each import command. In DB2 9.5, use load instead of import.

Tip 5: Use the new snapshot monitor elements to examine XML performance

Whether you are investigating the benefit of different page sizes or other aspects of XML performance, chances are you want to use the DB2 snapshot monitor as you would for relational data. You will find that DB2 9 offers new buffer pool snapshot monitor elements for XML data that match the existing counters for relational data and indexes. Since relational data and indexes are stored in separate storage objects within a table space, they have separate read and write counters. The pureXML storage in DB2 9 introduces a new storage object for XML data, called XDA, and it too has its own buffer pool counters.

The example below is a snippet from a snapshot monitor output. You see the various snapshot monitor elements for the three different storage objects: data, index, and XDA. This allows you to monitor and analyze buffering and I/O activity for XML separately from relational data. Any activity pertaining to XML indexes is included in the existing index counters. The interpretation of the new XDA counters is the same as their corresponding relational counters. For example, a low ratio of XDA physical reads to XDA logical reads indicates a high buffer pool hit ratio for XML data, which is desirable. For more details on the buffer pool snapshot monitor elements, see the DB2 documentation.

Listing 5. Monitor output for data, index, and XDA storage objects
Buffer pool data logical reads             = 221759
Buffer pool data physical reads            = 48580
Buffer pool temporary data logical reads   = 10730
Buffer pool temporary data physical reads  = 0
Buffer pool data writes                    = 6
Asynchronous pool data page reads          = 0
Asynchronous pool data page writes         = 6

Buffer pool index logical reads            = 8340915
Buffer pool index physical reads           = 54517
Buffer pool temporary index logical reads  = 0
Buffer pool temporary index physical reads = 0
Buffer pool index writes                   = 0
Asynchronous pool index page reads         = 0
Asynchronous pool index page writes        = 0

Buffer pool xda logical reads              = 2533633
Buffer pool xda physical reads             = 189056
Buffer pool temporary xda logical reads    = 374243
Buffer pool temporary xda physical reads   = 0
Buffer pool xda writes  
                   = 0
Asynchronous pool xda page reads           = 97728
Asynchronous pool xda page writes          = 0
Asynchronous data read requests            = 0
Asynchronous index read requests           = 0
Asynchronous xda read requests             = 83528

In a nutshell, new XDA counters in the snapshot monitor output reflect XML activity. They are useful to understand buffer pool, I/O and temp space usage of your XML data.

Tip 6: Be aware of XML schema validation overhead

An XML schema can define the structure, the element and attributes, their data types, value ranges, etc. that are allowed in a set of XML documents. DB2 allows you to (optionally) validate XML documents against XML schemas. If you choose to validate documents, you typically validate at insert time. This serves two purposes. First, validation ensures that data inserted into the database is compliant with the schema definition, i.e. you prevent "junk data" from entering your tables. Secondly, schema validation adds type annotations from the schema to each XML element and attribute, and these types get persisted in DB2s XML storage. For example, if an XML schema defines that the employee IDs in our dept table (shown in Tip 1) are integers, and the documents are validated against that schema, then DB2 remembers in each document that the employee IDs are of the type xs:integer. Any attempt to perform a string comparison on an employee ID will then fail with a type error at query run time.

XML schema validation is an optional activity during XML parsing. Performance studies have shown that XML parsing in general is significantly more CPU-intensive if schema validation is enabled [link]. This overhead can vary drastically depending on the structure and size of your XML documents and particularly on the size and complexity of the XML Schema used. For example, you may find 50% higher CPU consumption due to schema validation with moderately complex schemas. Unless your XML inserts are heavily I/O bound, the increased CPU consumption typically translates to reduced insert throughput.

Determine if your application needs the stricter type checking for XML queries and XML schema compliance. For example, if you are using an application server which receives, validates and processes XML documents before they are stored in the database, then the documents probably do not need to be validated again in DB2. At that point you already know they are valid. Or, maybe the database receives XML documents from a trusted application, maybe even one that you control, and you know that the XML data is always valid. In that case, avoid schema validation for the benefit of higher insert performance. If, however, your DB2 database receives XML data from untrusted sources and you need to ensure schema compliance at the DB2 level, then you need to spend some extra CPU cycles on that.

In a nutshell, for high performance inserts avoid performing schema validation in DB2 if it is not really needed.

Tip 7: In XPath expressions, use fully specified paths as much as possible

Assume we have a table with an XML column

create table customer(info XML);

to manage "customerinfo" documents which are of the following structure:

Listing 6. Sample XML document
<customerinfo Cid="1004">
    <name>Matt Foreman</name>
    <addr country="Canada">
          <street>1596 Baseline</street>
          <city>Toronto</city>
          <state>Ontario</state>
          <pcode>M3Z-5H9</pcode>
    </addr>
    <phone type="work">905-555-4789</phone>
    <phone type="home">416-555-3376</phone>
</customerinfo>

If you want to retrieve customers' phone numbers or the city they live in, there are multiple possible path expressions to get that data, no matter whether you use XQuery or SQL/XML. Both /customerinfo/phone as well as //phone would get you the phone numbers. Likewise, /customerinfo/addr/city as well as /customerinfo/*/city return the city. For best performance, the fully specified path is preferred over using * or // because it enables DB2 to navigate directly to the desired elements, skipping over non-relevant parts of the document.

In other words, if you know where in the document the desired element is located, it helps to provide that information in form of a fully specified path. If you ask for //phone instead of /customerinfo/phone, you ask for phone elements anywhere in the document. This requires DB2 to navigate down into the "addr" subtree of the document to look for phone elements at any level of the document. This is avoidable overhead.

Note that * and // can also lead to undesired or unexpected query results. For example, if some of the "customerinfo" documents also contain "assistant" information, like the one below. The path //phone would return the customer phones and the assistant phone numbers, without distinguishing them. From the query result you would not even know and mistakenly process the assistant's phone as a customer phone number.

Listing 7. Phone and name elements at multiple levels in the document
<customerinfo Cid="1004">
    <name>Matt Foreman</name>
    <addr country="Canada">
          <street>1596 Baseline</street>
          <city>Toronto</city>
          <state>Ontario</state>
          <pcode>M3Z-5H9</pcode>
    </addr>
    <phone type="work">905-555-4789</phone>
    <phone type="home">416-555-3376</phone>
    <assistant>
          <name>Peter Smith</name>
          <phone type="home">416-555-3426</phone>
     </assistant>
</customerinfo>

In a nutshell, avoid * and // in your path expressions and use fully specified paths instead, if possible.

Tip 8: Define lean XML indexes, and avoid indexing everything

Assume our queries often search for "customerinfo" documents by customer name. An index on the customer name element can greatly improve the performance of such queries. Let's look at the following example:

Listing 8. Indexes to support search by customer name
create table customer(info XML);

create index custname1 on customer(info) 
generate key using xmlpattern '/customerinfo/name' as sql varchar(20);

create index custname2 on customer(info) 
generate key using xmlpattern '//name' as sql varchar(20);

select * from customer
where xmlexists('$i/customerinfo[name = "Matt Foreman"]' passing info as $i);

Both indexes defined above are eligible to evaluate the XMLEXISTS predicate on the customer name. But, index custname2 can be substantially larger than index custname1 because it contains index entries not only for customer names but also for assistant names. This is because the XML pattern //name matches name elements anywhere in the document. But, if we never search by assistant name then we don't need them indexed.

For read operations, index custname1 is smaller and therefore potentially better performing. For insert, update and delete operations, index custname1 incurs index maintenance overhead only for customer names, while index custname2 requires index maintenance for customer and assistant names. You certainly don't want to pay that extra price if you require maximum insert/update/delete performance and don't need indexed access based on assistant names.

Also consider the following heavyIndex which "indexes everything". It contains index entries for every text node, i.e. every leaf element value in every XML document in the XML column. Such an index is very costly to maintain during insert/update/delete operations, consumes a lot of storage space, and is usually not recommended. The only exception could be applications with low write activity and unpredictable query workload such that more specific indexes are hard to define.

create index heavyIndex on customer(info) 
generate key using xmlpattern '//text()' as sql varchar(20);

In a nutshell, be as precise as possible when defining XML indexes and avoid * and // if you can.

Tip 9: Put document filtering predicates in XMLEXISTS instead of XMLQUERY

Let's consider the following table and data:

create table customer(info XML);
Table 2. Three rows of data in the customer table
<customerinfo>
    <name>Matt Foreman</name>
    <phone>905-555-4789</phone>
</customerinfo>
<customerinfo>
    <name>Peter Jones</name>
    <phone>905-123-9065</phone>
</customerinfo>
<customerinfo>
    <name>Mary Poppins</name>
    <phone>905-890-0763</phone>
</customerinfo>

Given this table, assume that you want to return the names of customers which have the phone number "905-555-4789". You could be tempted to write the following query

select xmlquery('$i/customerinfo[phone = "905-555-4789"]/name' passing info as "i") 
from customer;

But, this query is not what you want, for multiple reasons:

  1. It returns the following result set which has as many rows as there are rows in the table. This is because the SQL statement has no where clause and therefore cannot eliminate any rows.

    <name>Matt Foreman</name>

    3 record(s) selected

  2. For each row in the table which doesn't match the predicate, a row containing an empty XML sequence is returned. This is because the XQuery expression in the XMLQUERY function is applied to one row (document) at a time and never removes a row from the result set, only modifies its value. The value produced by that XQuery is either the customer's name element if the predicate is true, or the empty sequence otherwise. These empty rows are semantically correct (according to the SQL/XML standard) and must be returned if the query is written the way it is.
  3. Performance of this query will not be good. First, an index which may exist on /customerinfo/phone cannot be used because this query is not allowed to eliminate any rows. Secondly, returning many empty rows makes this query needlessly slow.

To resolve the performance issues and get the desired output, you should use the XMLQUERY function in the select clause only to extract the customer names, and move the search condition, which should eliminate rows, into an XMLEXISTS predicate in the where clause. This will allow index usage, row filtering, and avoids the overhead of empty results rows. Write the query in the following way:

select xmlquery('$i/customerinfo/name' passing info as "i") 
from customer
where xmlexists('$i/customerinfo[phone = "905-555-4789"]' passing info as "i")
<name>Matt Foreman</name>

1 record(s) selected

In a nutshell, predicates in the XMLQUERY function are only applied within each XML value, so they never eliminate any rows. Document- and row-filtering predicates should go into the XMLEXISTS function.

Tip 10: Use square brackets [ ] to avoid Boolean predicates in XMLEXISTS

A common error is to write the previous query without the square brackets in the XMLEXISTS function:

select xmlquery('$i/customerinfo/name' passing info as "i") 
from customer
where xmlexists('$i/customerinfo/phone = "905-555-4789"' passing info as "i")

This will produce the following result:

<name>Matt Foreman</name>
<name>Peter Jones</name>
<name>Mary Poppins</name>

3 record(s) selected

The expression in the XMLEXISTS predicate is written such that XMLEXISTS always evaluates to true. Hence, no rows are eliminated. This is because, for a given row, the XMLEXISTS predicate evaluates to false only if the XQuery expression inside returns the empty sequence. However, without the square brackets the XQuery expression is a Boolean expression which always returns a Boolean value and never the empty sequence. Note that XMLEXISTS truly checks for the existence of a value and evaluates to true if a value exists, even if that value happens to be the Boolean value "false". This is the correct behavior according to the SQL/XML standard, although it's probably not what you intended to express.

The impact is again that an index on phone cannot be used because no rows will be eliminated, and you receive a lot more rows than you actually wanted. Also, beware not to make this same mistake when using two or more predicates, as in this query:

Listing 9. Incorrect use of two predicates in XMLEXISTS
select xmlquery('$i/customerinfo/name' passing info as "i") 
from customer
where xmlexists('$i/customerinfo[phone = "905-555-4789"] and 
		 $i/customerinfo[name = "Matt Foreman"]' 
      passing info as "i")

This query uses square brackets, so what's wrong with it? The XQuery expression is still a Boolean expression because it's of the form "exp1 and exp2". Here is the proper way of writing this query to filter rows and allow for index usage:

Listing 10. Correct query to filter rows and allow for index usage
select xmlquery('$i/customerinfo/name' passing info as "i") 
from customer
where xmlexists('$i/customerinfo[phone = "905-555-4789" and name = "Matt Foreman"]' 
      passing info as "i")

In a nutshell, don't use Boolean predicates in XMLEXISTS. Put predicates in square brackets, including any "and" and "or".

Tip 11: Use RUNSTATS to collects statistics for XML data and indexes

The RUNSTATS utility has been extended to collect statistics on XML data and XML indexes. DB2's cost-based optimizer uses these statistics to generate efficient execution plans for XQuery and SQL/XML queries. Thus, continue to use RUNSTATS as you would for relational data. If your table contains relational and XML data and you want to refresh the relational statistics only, you can execute RUNSTATS with the new clause "EXCLUDING XML COLUMNS". Without this clause, the default and preferred behavior is to always collect statistics for relational and XML data.

For relational data as well as XML data you can enable sampling to reduce the time for executing runstats. On a large data set, the statistics from 10% of the data (or even less) are often still very representative of the total population. Whatever sampling percentage you choose, runstats allows you to sample rows (Bernoulli sampling) or pages (system sampling). Row level sampling reads all data pages but considers only a percentage of the rows on each page, and therefore only a subset of the corresponding XDA pages. Page level sampling reduces I/O significantly since it reads only a percentage of the data pages. Thus, page sampling can significantly improve performance if your table contains not just XML but also a fair amount of relational data. But, row level sampling may produce more accurate statistics if relational data values are highly clustered.

Here are some examples. The first runstats commands collects the most comprehensive and detailed statistics for the table customer and all its indexes without sampling. This is ideal if execution time allows. The seconds command collects the same statistics but only for 10% of the pages. In many cases, this will provide the optimizer with nearly as accurate statistics as the first command, but will return results much faster. The third command samples 15% of all rows, does not collect distribution statistics, and also applies sampling to indexes which the first and second commands didn't.

Listing 11. Use RUNSTATS to collect statistics
runstats on table myschema.customer 
with distribution on all columns and detailed indexes all;

runstats on table myschema.customer 
with distribution on all columns and detailed indexes all tablesample system (10);

runstats on table myschema.customer 
on all columns and sample detailed indexes all tablesample bernoulli (15);

In a nutshell, the DB2 does generate better execution plans if XML statistics are available. Use runstats as you normally would, or use runstats with sampling to reduce its execution time.

Tip 12: How to use SQL/XML publishing views to expose relational data as XML

SQL/XML publishing functions allow you to convert relational data into XML format. It can be beneficial to hide the SQL/XML publishing functions in a view definition, so that applications or other queries can simply select the constructed XML documents from the view instead of dealing with the publishing functions themselves.

Listing 12. SQL/XML publishing functions hidden in a view
create table unit( unitID varchar(8), name varchar(20), manager varchar(20));

create view UnitView(unitID, name, unitdoc) as
   select unitID, name, 
       XMLDOCUMENT(
          XMLELEMENT(NAME "Unit",
          XMLELEMENT(NAME "ID", u.unitID),
          XMLELEMENT(NAME "UnitName", u.name),
          XMLELEMENT(NAME "Mgr", u.manager)
             )   )
   from unit u;

Note that we included some of the relational columns in the view definition. This does not create any physical redundancy because it is just a view, not a materialized view. Exposing the relational columns helps to query this view efficiently. Let's say we need to fetch an XML document for one particular unit. All of the following three queries can do that, but the third one tends to perform better than the first two.

In the first two queries, the filtering predicate is expressed on the constructed XML. But, XML predicates can not be applied to the underlying relational column or its indexes. Thus, these queries require the view to construct XML for all units and then pick out the one for unit "WWPR". This is not optimal.

May perform sub-optimally:

Listing 13. Queries that perform sub-optimally
select unitdoc
from UnitView
where xmlexists('$i/Unit[ID = "WWPR"]' passing unitdoc as "i");

for $u in db2-fn:xmlcolumn('UNITVIEW.UNITDOC')/Unit
where $u/ID = "WWPR"
return $u;

The third query uses a relational predicate to ensure that only the XML document for "WWPR" is constructed, resulting in a shorter runtime, especially on a large data set. This query will perform well:

Listing 14. Query that performs well
select unitdoc
from UnitView
where UnitID = "WWPR";

In a nutshell, include relational columns in a SQL/XML publishing view, and when querying the view express any predicates on those columns rather than on the constructed XML.

Tip 13: How to use XMLTABLE views to expose XML data in relational format

Just like it can be useful to create a view to expose relational data in XML format, you may want to use a view to expose XML data in relational format. Similar caution needs to be applied as in Tip 12, but in the reverse way. Let's look at the following example where the SQL/XML function XMLTABLE is used to return values from XML documents in tabular format:

Listing 15. Values returned from XML documents in tabular format
create table customer(info XML);

create view myview(CustomerID, Name, Zip, Info) as 
SELECT T.*, info 
FROM customer, XMLTABLE ('$c/customerinfo' passing info as "c" 
   COLUMNS 
   "CID"     INTEGER      PATH './@Cid',
   "Name"    VARCHAR(30)  PATH './name',
   "Zip"     CHAR(12)     PATH './addr/pcode' ) as T;

Note that we included the XML column info in the view definition to help query this view efficiently. Let's say you want to retrieve a tabular list of customer IDs and names for a given ZIP code. Both of the following queries can do that, but the second one tends to perform better than the first. In the first query, the filtering predicate is expressed on the CHAR column "Zip" generated by the XMLTABLE function. But, relational predicates cannot be applied to the underlying XML column or its indexes. Thus, this query requires the view to generate rows for all customers and then picks out the one for zip code "95141". This is not optimal. The second query uses an XML predicate to ensure that only the rows for "95141" get generated, resulting in a shorter runtime, especially on a large data set.

Listing 16. Query with an XML predicate
-- may perform suboptimal:

select CustomerID, Name 
from myview
where Zip = "95141";


-- will perform well:

select CustomerID, Name
from myView
where xmlexists('$i/customerinfo[addr/pcode = "95141"]' passing info as "i");

If the base table on which the view is defined contains not just an XML column but also relational columns with indexes, you should include those relational columns in the view definition. If the queries against the view contain highly restrictive predicates on the relational columns, DB2 uses the relational indexes to filter qualifying rows to a small number, and then applies XMLTABLE and any remaining predicates to this interim result before returning the final result set.

In a nutshell, be careful with XMLTABLE views which expose XML data in relational form. When possible, include additional columns in the view definition so that filtering predicates can be expressed on those columns instead of the XMLTABLE columns.

Tip 14: For short queries or OLTP applications, use SQL/XML statements with parameter markers

Very short database queries often execute so fast that the time to compile and optimize them is a substantial portion of their total response time. Thus, it's useful to compile ("prepare") them just once and only pass predicate literal values for each execution. While DB2 9 XQuery does not support external parameters, the SQL/XML functions XMLQUERY, XMLTABLE and XMLEXISTS do. They allow you to pass SQL parameter markers as a variable into the embedded XQuery expressions. This is recommended for applications with short and repetitive queries.

Listing 17. Hardcoded predicate literal values
for $c in db2-fn:xmlcolumn('CUSTOMER.INFO')/customer
where $c/phone = "905-555-4789"
return $c;

select info 
from customer
where xmlexists('$i/customerinfo[phone = "905-555-4789"]' 
                passing info as "i")
Listing 18. With parameter marker
select info 
from customer
where xmlexists('$i/customerinfo[phone = $p]' 
                passing info as "i", cast(? as varchar(12)) as "p")

In a nutshell, short queries and OLTP transactions are faster as prepared statements with parameter markers. For XML, this requires SQL/XML to pass SQL-style parameters to XQuery expressions.

Tip 15: Avoid code page conversion during XML insert and retrieval

XML is different from other types of data in DB2 because it can be internally and externally encoded. Internally encoded means that the encoding of your XML data can be derived from the data itself. Externally encoded means that the encoding is derived from external information. The data type of the application variables which you use to exchange XML data with DB2 determines how the encoding is derived. If your application uses character type variables for XML, then it is externally encoded, i.e. in the application code page. If you use binary application data types, then the XML data is considered internally encoded. Internally encoded means that the encoding is determined by either a Unicode Byte-Order mark (BOM) or an encoding declaration in the XML document itself, such as

<?xml version="1.0" encoding="UTF-8" ?>

From a performance point of view, the goal is to avoid code page conversions as much as possible since they consume extra CPU cycles. Internally encoded XML data is preferred over externally encoded data because it can prevent unnecessary code page conversion. This means that in your application you should prefer binary data types over character types. For example, in CLI when you use SQLBindParameter() to bind parameter markers to input data buffers, you should use SQL_C_BINARY data buffers rather than SQL_C_CHAR, SQL_C_DBCHAR, or SQL_C_WCHAR. When inserting XML data from Java applications, reading in the XML data as a binary stream (setBinaryStream) is better than as a string (setString). Similarly, if your Java application receives XML from DB2 and writes it to a file, code page conversion may occur if the XML is written as non-binary data.

When you retrieve XML data from DB2 into your application, it is serialized. Serialization is the inverse operation of XML parsing. It is the process of converting DB2's internal XML format (a parsed, tree-like representation) into the textual XML format that your application can understand. In most cases it is best to let DB2 perform implicit serialization. This means your SQL/XML statements simply select XML-type values as in the following example, and DB2 performs the serialization into your application variables as efficiently as possible:

Listing 19. Query with implicit serialization
create table customer(info XML);

select info from customer where...;

select xmlquery('$i/customerinfo/name' passing info as "i") 
from customer
where...;

If your application deals with very large XML documents, it can be beneficial to use LOB locators for data retrieval. This requires explicit serialization to a LOB type, preferably BLOB, because explicit serialization into a character type such as CLOB can introduce encoding issues and unnecessary code page conversion. Explicit serialization uses the XMLSERIALIZE function:

select XMLSERIALIZE(info as BLOB(1M)) from customer where...;

In a nutshell, use binary data types in your application for exchanging XML with DB2 as this avoids unnecessary code page conversion. Be aware of encoding issues and when in doubt, follow the detailed guidelines in the DB2 9 documentation


Summary

To achieve maximum XML performance in DB2, a good start is to use DB2's autonomic features such as automatic storage and self-tuning memory management. This provides decent, out-of-the-box performance for many applications. It also frees up valuable DBA time for more dedicated performance tuning, when needed. All the conventional DB2 performance wisdom still applies to XML and is covered in a variety of developerWorks articles which are listed below.

On top of that, the 15 tips in this article can help you with common XML-specific performance aspects. If you need to improve the performance of your XML application, you don't need to apply all 15 tips but only the 1 or 2 that really matter in your situation. For example, reducing unnecessary code page conversion is not goign to help if your system is heavily I/O bound due to an unfortunate table space configuration. Similarly, using SQL/XML parameter markers may not help with the query performance if you actually need to execute runstats to enable better query execution plans. In short, the tips in this paper can help you to avoid performance issues, but fixing observed performance problems first requires identification of the root cause and bottlenecks. The standard diagnostic tools in DB2 such as visual explain, db2exfmt, and the snapshot monitor can be used for XML performance investigations just like for relational data.


Acknowledgements

Thanks to Cindy Saracco, Irina Kogan, Henrik Loeser, Nikolaj Richers and Marcus Roy for their reviews and help with this article.

Resources

Learn

Get products and technologies

  • Visit The DB2 Snapshot Monitor to learn the latest about this tool and get the product.
  • Download a free trial version of DB2 Enterprise 9.
  • Now you can use DB2 for free. Download DB2 Express-C, a no-charge version of DB2 Express Edition for the community that offers the same core data features as DB2 Express Edtion and provides a solid base to build and deploy applications.
  • Download the DB2 Developer Workbench for an integrated development environment for DB2.

Discuss

Comments

developerWorks: Sign in

Required fields are indicated with an asterisk (*).


Need an IBM ID?
Forgot your IBM ID?


Forgot your password?
Change your password

By clicking Submit, you agree to the developerWorks terms of use.

 


The first time you sign into developerWorks, a profile is created for you. Information in your profile (your name, country/region, and company name) is displayed to the public and will accompany any content you post, unless you opt to hide your company name. You may update your IBM account at any time.

All information submitted is secure.

Choose your display name



The first time you sign in to developerWorks, a profile is created for you, so you need to choose a display name. Your display name accompanies the content you post on developerWorks.

Please choose a display name between 3-31 characters. Your display name must be unique in the developerWorks community and should not be your email address for privacy reasons.

Required fields are indicated with an asterisk (*).

(Must be between 3 – 31 characters.)

By clicking Submit, you agree to the developerWorks terms of use.

 


All information submitted is secure.

Dig deeper into Information management on developerWorks


static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=Information Management, XML
ArticleID=165299
ArticleTitle=15 best practices for pureXML performance in DB2
publish-date=05262009