Undeniably, XML has emerged as the de-facto standard for data exchange, service-oriented architectures (SOAs), and message-based transaction processing. As companies accumulate increasing amounts of XML data, they require more than message processing technology that handles one XML document at a time. Companies have started to persist large volumes of XML documents, sometimes due to regulatory requirements and sometimes because of XML's flexibility or the well-known difficulties associated with converting XML data to relational format and back. Accustomed to the benefits of mature relational databases, companies expect the same capabilities for XML data (the ability to persist, query, index, update, and validate XML data with full ACID compliance, recoverability, high availability, and high performance).
XML is sometimes considered a verbose and slow data format, especially in terms of processing large numbers of documents. This view of XML is often based on past experience with insufficient technology. For example, storing many XML documents in a file system and writing application code to parse and analyze them easily leads to poor performance and disappointment. This scenario no longer applies with state-of-the-art database and processor technology.
Using DB2's pureXML capabilities and Intel multi-core CPUs, Intel and IBM executed the industry's first terabyte XML database benchmark to demonstrate that high-end transaction processing over a terabyte of XML data is no longer wishful thinking. This article describes these performance tests, the hardware used, the DB2 configuration, and the results and lessons learned. Various DB2 technologies proved to be of critical importance, including deep compression, automatic storage, self-tuning memory, and, of course, pureXML. The results quantify DB2’s multi-user scalability with Linux on Intel quad-core and six-core processors (Intel Xeon Processor 7300 and 7400 Series). All system configurations and tests reported in this article were performed by Intel at Intel Labs.
DB2 pureXML provides support for XML data management such as XML storage, XML indexing, XML queries and updates, and optional document validation with XML Schemas. Users can define columns of type XML in which they can store one XML document per row. Tables can contain a mix of XML and relational columns which makes the integration of XML and relational data easy. When XML documents are inserted or loaded into an XML column, they are parsed and stored in a parsed tree format. This allows queries and updates to operate on XML data without XML parsing — a key performance benefit. XML indexes can be defined on specific elements or attribute to ensure high query performance. Queries and updates are based on the SQL/XML and XQuery standards and can access both XML and relational in a single statement if needed.
To prove high-end XML performance we have chosen to execute the TPoX benchmark. TPoX (Transaction Processing over XML) is an open-source and application-level XML database benchmark based on a financial application scenario. It evaluates the performance of XML database systems, focusing on XQuery, SQL/XML, XML storage and indexing, XML Schema support, XML inserts, updates and deletes, logging, concurrency and other database aspects. TPoX simulates a security trading scenario and uses a real-world XML Schema (FIXML) to model some of its data. TPoX is designed to exercise a realistic and representative set of XML operations.
The main logical data entities in TPoX are:
- Customer: A single customer can have one or multiple accounts
- Account: Each account contains one or multiple holdings
- Holding: A stock, bond, or mutual fund
- Security: A single customer can have one or multiple accounts
- Order: Each order buys or sells shares of exactly one security for exactly one account
For each customer, there is an XML document that contains all personal information, account information, and holding information for that customer (Figure 1). Each order is represented by an XML message that complies with the FIXML 4.4 schema. FIXML is a complex industry-standard XML schema for trade-related messages such as buy or sell orders. Each security is described by a single XML document. The collection of 20,833 security documents represents the majority of US-traded stocks, bonds, and mutual funds. While the number of security documents is fixed, the benchmark scenario scales in the number of custacc and order document. The 1TB TPoX database uses 300,000,000 order and 60,000,000 custacc documents.
Figure 1. TPoX data entities.
The TPoX workload consists of 17 transactions, listed in Table 1. Their relative weight in the transaction mix is shown in the rightmost column. Insert, update, and delete operations amount to 30 percent of the workload; queries to 70 percent of the workload. XML Schema validation is performed in transaction I2, U2, and U4.
Table 1. Business descriptions of TPoX transactions.
The workload is executed by a Java workload driver that spawns a configurable number of concurrent threads to simulate concurrent users. Each thread connects to the database and executes a stream of transactions without think times. When a transaction commits, the thread that submitted the transaction immediately picks another transaction from Table 1, randomly but with skewed probabilities based on the transaction weights. At run time, the workload driver replaces parameter markers in the transactions with concrete values drawn from random distributions. The Java code of the workload driver is available as open source and can be used for many types of database tests – not just the TPoX benchmark.
The test system (Figure 2) consists of the following hardware and software components:
- Database server
- Intel Xeon Processor 7400 Server
- 64GB of main memory
- Client machine: Intel Xeon Processor 5400 Server
- Operating System for client and server: Linux SLES 10, 64bit
- Database software: DB2 9.5 for Linux, UNIX, and Windows, Fixpack 2
- Client software: TPoX open source workload driver, Java 1.5
- EMC CX3-80
- 15 disks per LUN (RAID 0)
- 120 disks (8 LUNs) for the database
- 15 disks (1 LUN) for the log
- 30 disks (2 LUNs) for the raw data
- 2 RAID controller cards (one for the flat files, one for database and log)
- 2 fiber channel connections (4GB/s) per controller card
Figure 2. System under test.
To analyze how DB2 performance scales with an increasing number of cores per CPU, we ran the benchmark two times with different processors. The first set of tests used four Intel Xeon 7400 Series CPUs, which have six cores each. Then we repeated the benchmark using four Intel Xeon 7300 CPUs, which have four cores each. The comparison of CPUs in Table 2 shows that they differ in more than just the number of cores. While the Intel Xeon Processor 7400 Series has 50 percent more cores than the 7300 Series, its clock speed is 10 percent lower, but it has a 16MB L3 cache which the Intel Xeon 7300 Series does not have.
Table 2. Intel Xeon processors used in this benchmark.
When we switched the CPUs, all other details of the hardware and software remained identical. Both the Xeon 7400 and 7300 processor series use the same chipset so that replacing one with the other is just a “drop in” processor replacement with no other changes required.
The DB2 database was created with DB2's automatic storage feature and a page size of 16KB, using eight logical volumes plus a separate volume for the log. The database schema that we chose to implement the TPoX scenario is very simple. It consists of three XML columns in three tables, one for each of the three XML document types in TPoX (order, custacc, security):
create table custacc (cadoc xml inline length 16288) in custacc_tbs index in custacc_idx_tbs compress yes;
create table order (odoc xml inline length 16288) in orders_tbs index in orders_idx_tbs compress yes;
create table security (sdoc xml inline length 16288) in security_tbs index in security_tbs compress yes;
XML inlining and compression was used to reduce the storage footprint for the 1TB of raw XML data. We created five table spaces (one table space for each of the three tables plus one table space for custacc indexes and one for order indexes). All table spaces were configured with NO FILE SYSTEM CACHING and AUTOMATIC STORAGE. Each table space had its own buffer pool, plus one buffer pool for the temporary table space. We used the different table spaces and buffer pools mainly for ease of monitoring. We later confirmed that combining all tables and all indexes into a single table space and a single large buffer pool produced almost the same performance (only 6 percent lower throughput than with the manual configuration).
For the configuration of buffer pool sizes, sort heap, lock list, package cache, num_iocleaners, num_ioservers, and so on, we took the following approach. To avoid lengthy and repetitive tuning iterations we simply guessed reasonable values for all these parameters. The numbers we chose were not intended or known to be optimal for this workload. They were only meant to be a starting point for DB2's self-tuning behavior. For example, we knew that we needed large buffer pools for the custacc and order tables, but we didn't know what sizes would be optimal. We decided to let DB2's self-tuning memory manager (STMM) figure out the optimal size. To help the STMM converge quickly towards the optimal buffer pool sizes we didn’t want to start with the default of 1,000 pages, which we knew was way too small. For example, first we set the buffer pool for the custacc table to 770,000 pages and then to automatic, so that fewer iterative STMM adjustments would be needed to reach the optimal size than if we had started with 1,000 pages. The parameters INSTANCE_MEMORY and DATABASE_MEMORY were also set to automatic.
- You can see our complete DDL script in the TPoX open-source repository here.
- Performance Results
Now let's look at following types of results:
- Storage consumption and compression
- Transaction throughput of the mixed workload (70 percent queries, 30 percent insert/update/delete)
- Buffer pool hit ratios
- Out-of-the-box performance with minimal manual configuration or tuning
- Scalability from 4-core to 6-core CPUs
- Incremental insert performance
Since the security table is very small (20,833 documents) we examined the space consumption and compression ratio mainly for the two large tables, custacc and order (see Table 3). The 60 million custacc documents are compressed by 64 percent and require 121.4GB in a DB2 table space. The 300 million order documents are compressed by 57 percent and occupy 269.2GB in DB2. Including all data and indexes, the final database size was about 440GB. XML inlining and compression were critical to avoid I/O bottlenecks.
Table 3. Space consumption and compression.
Figure 3 shows the transaction throughput result for the mixed workload on the 24 core, 2.66Ghz, Intel Xeon 7400 platform. The horizontal axis shows the different number of concurrent users that the TPoX workloads driver simulated. Each user issues a stream of transactions without think time between transactions. The blue curve represents the transactions per seconds and belongs to the vertical axis on the left. The pink curve indicates the CPU utilization and belongs to the vertical axis on the right. The throughput and CPU utilization grow as the number of concurrent users is increased. When the number of users is increased from 100 to 150 and 200, the CPU utilization approaches the maximum capacity of the system and consequently the throughput flattens out. At 200 users the maximum throughput is 6763 TPoX transactions per second.
Figure 3 . Transactions per seconds and CPU utilization.
Increasing the number of users beyond 200 did not lead to higher throughput, only to longer transaction response times. The flattening throughput curve and the exhaustion of the system capacity at 200 users is directly related to the fact that all simulated users submit transactions without think time between one transaction and the next. If each user submitted, for example, one transaction per second, then the system could support thousands of users.
Figure 4 shows the output of the workload driver for the mixed workload with 200 users and a test duration of 2 hours. The detailed statistics for all 17 transactions in the workload mix include their maximum and average response times as well as their "count," which is the number of times each transaction was executed across all 200 users. 48.5 million transactions were executed in the two-hour test. All average transaction response times are less than 0.1 seconds. Since the workload driver was run on a separate client machine, the response times include the network round trip time.
Figure 4. Detailed transaction results at 200 users.
Optionally, the workload driver can print such a transaction summary every n minutes during the test period. This allowed us to confirm that the throughput is stable the entire time. The workload driver can also calculate the 90th, 95th, or 99th percentile of the transaction response times. Percentiles are useful if you want to confirm that 90 percent, 95 percent, or 99 percent of the transaction response times are below a certain threshold.
Remember that the workload driver is freely available as open source and can be used to run any kind of SQL, SQL/XML, or XQuery workload that you define. It's a very versatile tool for all sorts of database performance testing.
Adjusted by DB2's self-tuning memory manager, the combined size of all buffer pools reached 46 GB (out of 64GB physically memory). Since the database size after compression was about 440GB, the ratio between buffer pools and database size is 10.5 percent (46GB/440GB). Figure 5 shows that the buffer pools for custacc and order indexes had a hit ratios between 95 and 100 percent. The hit ratio for the custacc and order tables was between 60 percent and 70 percent. Without DB2's compression, this hit ratio would have been lower and performance would have been worse.
Figure 5. Buffer pool hit ratios.
How difficult is it to tune DB2 for the performance that we achieved in this test? For example, is it really necessary to have five separate table spaces and buffer pools for the different tables and indexes? Can we achieve similar performance with a database set up that is much simpler than the DDL script that we used initially?
In an attempt to answer these questions, we repeated the benchmark and configured the DB2 database with just four simple steps:
- Create the database with automatic storage
- Change the log location to a separate storage path
- Create a separate buffer pool for the temporary table space
- Use DB2's AUTOCONFIGURE command to let DB2 tune itself.
These steps are shown in Figure 6. Note that this database uses just a single default table space and a single default buffer pool for all tables and indexes. With this setup, the mixed workload with 200 users reached 6368 transactions per second, which is only 6 percent lower than the 6763 TPS that we achieved with the more detailed database configuration. This result shows that high-end performance does not always require expert database tuning and that DB2's autonomic capabilities work remarkably well.
Figure 6. Configuring the database in five commands.
Figure 7 compares the throughput measured in three different cases. From left to right they are
- 150 concurrent users, four Intel Xeon 7300 CPUs (16 cores total, 2.93 GHz)
- 150 concurrent users, four Intel Xeon 7400 CPUs (24 cores total, 2.66 GHz)
- 200 concurrent users, four Intel Xeon 7400 CPUs (24 cores total, 2.66 GHz)
In the first test, the Intel Xeon 7300 quad-core CPUs are saturated with 150 concurrent users. The workload achieves a maximum throughput of 4558 transactions per second at 99.3 CPU utilization. In test 2, moving from the quad-core (Xeon 7300) to the six-core (Xeon 7400) CPUs increases the transaction rate for 150 users by 42 percent at only 84.7 percent CPU utilization. Since the machine is not saturated, test 3 increases the number of users to 200. This leads to 95.2 percent CPU usage and 6763 transactions per second, a 48 percent performance gain of the six-core over the quad-core processors.
The performance gain of 1.42x and 1.48x in Figure 7 is remarkable because considering the number of cores and the clock speed only, the Intel Xeon 7400 CPUs are expected to provide 1.36x higher performance than Intel Xeon 7300 CPUs. The additional speed-up is mainly due to the 16MB L3 cache which is new in the Intel Xeon 7400 Series processors. Equally important is the fact that the Intel Xeon 7400 CPUs provide higher performance while maintaining the same power consumption as the Intel Xeon 7300 CPUs. Increased performance per watt is important to make computing more economical and cost-effective.
Figure 7. Scalability from Intel quad-core to six-core CPUs.
Inserting rows into an empty table with empty indexes can be faster than inserting into a table that already contains a large volume of data. To get a meaningful assessment of XML insert performance, we measured an insert-only workload on top of the populated 1TB database. The insert test added 2 million XML documents to the custacc table, and 3 million documents to the order table (see Figure 8). Both tables have two XML indexes. XML Schema validation was not performed. The custacc documents were inserted at a rate of about 4,900 documents per second, which amounts to ~100GB/hour. The smaller order documents were inserted at 11,900 documents per second, or 69 GB/hour. For both types of documents the insert tests used 600 concurrent users that issued insert statements without think time. A commit was performed after every single insert. Less frequent commits or using DB2's load utility can provide even higher XML ingestion rates.
Figure 8. Incremental XML insert performance.
What did we learn from the 1TB XML performance study? Apart from the actual performance and scalability results, several observations are valuable.
One of the lessons learned is that tuning DB2 for XML-based transaction processing is not very hard. The strategy to use DB2’s autonomic and self-tuning features as much as possible proved to be very successful. Within a reasonable amount of time we were not able to achieve higher performance with manual than with automatic tuning.
A prerequisite for good performance is well-balanced hardware, that is, using "the right" ratio between number of CPU cores, main memory, and number of disks. With 24 cores, 64GB of memory, and 120 data disks our test system has 5 disks per core and 2.66GB memory per core. The optimal ratio is workload dependent. In the TPoX mixed workload we observed an average of 1.7 physical I/O requests per transaction. Hence, at a peak transaction rate of 6763 TPS, the storage system had to sustain about 11,500 I/O operations per second (IOPS). Following the rule of thumb that a modern SCSI disk can support about 100 IOPS with reasonably low latencies, about 115 disks are needed to avoid I/O bottlenecks and allow high CPU utilization.
DB2 compression was critical. Without compression more disks and more memory would have been required to achieve the same performance. Compression reduced the required I/O, which is a benefit that far outweighed the extra CPU cycles to compress and decompress data.
To understand the database performance behavior it proved very useful to use the DB2 snapshot monitor and takes snapshots at regular intervals, such as every 5 minutes. For example, the collected data allows you to analyze I/O and page cleaning behavior over time.
For Linux and UNIX systems, DB2 9.5 has a fundamentally different process model than DB2 9.1. While DB2 9.1 spawns a separate process for each agent, DB2 9.5 runs as a single process with one thread per agent. Our results confirm that DB2's threaded engine exploits multi-core CPUs very well and achieves good speed-up from 4-core to 6-core Intel Xeon Processors.
Matthias Nicola is a Senior Technical Staff Member at IBM's Silicon Valley Lab in San Jose, CA. He focuses on DB2 performance and benchmarking, XML, temporal data management, in-database analytics, and other emerging technologies. He also works closely with customers and business partners to help them design, optimize, and implement DB2 solutions. Previously, Matthias worked on data warehouse performance at Informix Software. He received his Ph.D. in computer science from the Technical University of Aachen, Germany.
Agustin Gonzalez works at Intel Corp. as a senior staff software engineer in the Software and Solutions Group, where he works in performance enablement for Intel Xeon platforms. Previously he has worked in several startup and public companies, amassing more than 15 years of experience in large scale data management systems, performance optimization, and commercial software development.