Comparing XML Processing Performance in Middleware and Database: A Case Study
This page is the Web companion to the paper "Comparing XML Processing Performance in Middleware and Database".
Marcus Paradies, Susan Malaika, Matthias Nicola, Kevin Xie
The paper is available from the ACM
Your feedback is most welcome
Abstract
XML processing is at the core of many middleware systems. In recent years XML databases have become widely available. This article identifies three common XML processing use cases, and compares their performance when XML manipulation is performed in an XML database and when XML manipulation is part of middleware application code. The article concludes with guidelines for XML processing placement and identifies areas for further study.
The Three Use Cases - Supporting Materials
The use cases and measurements investigated in the companion paper reuse parts of the TPoX (Transaction Processing over XML) benchmark. TPoX is an application-level XML benchmark based on a financial application scenario. TPoX defines several XML document collections, XML queries and updates, and provides an XML data generator as well as a workload driver. The workload driver is a Java application that simulates and measures one or multiple concurrent users performing XML processing.
The use cases operate on 50,000 "custacc" documents from TPoX, each document describing a customer with his personal information and all of his one or more investments accounts. Each use case is implemented in two ways, (1) using SQL/XML to push the XML manipulation to a DB2 9.7 database instance, and (2) in Java with a DOM or SAX parser in the middle tier. Both options read or write XML to/from the database, only the location and implementation of the XML manipulation differs.
The three use cases investigated in this paper are summarized in the following table.
| Use Cases |
Names |
Descriptions |
| Use Case 1.1 |
Extract5 |
Extract a small number of values from an XML document, often used when presenting information |
| Use Case 1.2 |
Extract50 |
As for Extract5 above, except a medium number of values (50%) |
| Use Case 1.3 |
Extract100 |
As for Extract5 above, except a large number of values (nearly 100%) |
| Use Case 2.1 |
Split500 |
Split a large XML document containing 500 concatenated XML fragments (commonly used for FTP or download) into individual XML documents |
| Use Case 2.2 |
Split2500 |
As for Split500 above, except containing 2500 XML fragments |
| Use Case 2.3 |
Split5000 |
As for Split500 above, except containing 50000 XML fragments |
| Use Case 3 |
Modify |
Insert, change, and delete XML elements in an XML document |
The following sub-section describes the use cases in more detail. Each use case description will consist of two parts:
- The use case as implemented in middle-tier, e.g., with DOM in a stand-alone XML parser
- The use case as implemented in database, e.g., with SQL/XML inside a database
Use Case 1 - Extract
The first use case extracts information from XML documents and populates them into Java objects. The use case contains three extractions: Extract5, Extract50, and Extract100 as described in the table above.
Figure 1: Overview for Extracting XML elements with SAX in the Middletier
Figure 2: Overview for Extracting XML elements SQL/XML functions in the Database
For the middle-tier part, the code reads a BLOB from the database and parses the BLOB with a SAX parser in the application. During the parsing process the node values are extracted and stored in Java objects. For the database part, the XMLTABLE function from the SQL/XML standard is used to shred XML data to temporary relational tables. The result set is fetched and populated into Java objects. The SQL statement for Extract5 is illustrated below.
Note: The SAX pseudo code is very high level whereas the SQL portions include the complete code
1: read BLOB from database;
2: while (!endOfDocument)
3: element = readElement();
4: populateIntoJava(element);
5: endOfDocument = isEndofDocument();
SELECT T.*
FROM custacc_xml , XMLTABLE(XMLNAMESPACES(DEFAULT 'http://tpox-benchmark.com/custacc'),'$DOC/Customer'
COLUMNS
firstname VARCHAR(20) PATH 'Name/FirstName',
lastname VARCHAR(20) PATH 'Name/LastName',
dayOfBirth DATE PATH 'DateOfBirth') AS T
WHERE CUSTID = ?
SELECT T.*
FROM custacc_xml , XMLTABLE(XMLNAMESPACES(DEFAULT 'http://tpox-benchmark.com/custacc'),'$DOC/Customer'
COLUMNS
id INTEGER PATH '@id',
mnemonic VARCHAR(45) PATH 'Mnemonic',
shortname VARCHAR(100) PATH 'string-join(ShortNames/ShortName,",")',
middlename VARCHAR(100) PATH 'string-join(Name/MiddleName,",")',
title VARCHAR(10) PATH 'Name/Title',
firstname VARCHAR(20) PATH 'Name/FirstName',
lastname VARCHAR(20) PATH 'Name/LastName',
suffix VARCHAR(20) PATH 'Name/Suffix',
dayOfBirth DATE PATH 'DateOfBirth',
gender VARCHAR(10) PATH 'Gender',
nationality VARCHAR(30) PATH 'Nationality',
CountryOfResidence VARCHAR(45) PATH 'CountryOfResidence',
language VARCHAR(100) PATH 'string-join(Languages/Language,",")',
CustomerSince DATE PATH 'BankingInfo/CustomerSince',
PremiumCustomer VARCHAR(5) PATH 'BankingInfo/PremiumCustomer',
CustomerStatus VARCHAR(10) PATH 'BankingInfo/CustomerStatus',
LastContactDate DATE PATH 'BankingInfo/LastContactDate',
ReviewFrequency VARCHAR(20) PATH 'BankingInfo/ReviewFrequency',
OnlineLogin VARCHAR(20) PATH 'BankingInfo/Online/Login',
OnlinePin VARCHAR(20) PATH 'BankingInfo/Online/Pin',
OnlinePwd VARCHAR(20) PATH 'BankingInfo/Online/Trading-password',
TaxID VARCHAR(20) PATH 'BankingInfo/Tax/TaxID',
SSN VARCHAR(20) PATH 'BankingInfo/Tax/SSN',
TaxRate DECIMAL(2,2) PATH 'BankingInfo/Tax/TaxRate',
Currency VARCHAR(20) PATH 'BankingInfo/Currency'
) AS T
WHERE CUSTID = ?
SELECT T.*
FROM custacc_xml , XMLTABLE(XMLNAMESPACES(DEFAULT 'http://tpox-benchmark.com/custacc'),'$DOC/Customer'
COLUMNS
id INTEGER PATH '@id',
mnemonic VARCHAR(45) PATH 'Mnemonic',
shortname VARCHAR(100) PATH 'string-join(ShortNames/ShortName,",")',
middlename VARCHAR(100) PATH 'string-join(Name/MiddleName,",")',
title VARCHAR(10) PATH 'Name/Title',
firstname VARCHAR(20) PATH 'Name/FirstName',
lastname VARCHAR(20) PATH 'Name/LastName',
suffix VARCHAR(20) PATH 'Name/Suffix',
dayOfBirth DATE PATH 'DateOfBirth',
gender VARCHAR(10) PATH 'Gender',
nationality VARCHAR(30) PATH 'Nationality',
CountryOfResidence VARCHAR(45) PATH 'CountryOfResidence',
language VARCHAR(100) PATH 'string-join(Languages/Language,",")',
CustomerSince DATE PATH 'BankingInfo/CustomerSince',
PremiumCustomer VARCHAR(5) PATH 'BankingInfo/PremiumCustomer',
CustomerStatus VARCHAR(10) PATH 'BankingInfo/CustomerStatus',
LastContactDate DATE PATH 'BankingInfo/LastContactDate',
ReviewFrequency VARCHAR(20) PATH 'BankingInfo/ReviewFrequency',
OnlineLogin VARCHAR(20) PATH 'BankingInfo/Online/Login',
OnlinePin VARCHAR(20) PATH 'BankingInfo/Online/Pin',
OnlinePwd VARCHAR(20) PATH 'BankingInfo/Online/Trading-password',
TaxID VARCHAR(20) PATH 'BankingInfo/Tax/TaxID',
SSN VARCHAR(20) PATH 'BankingInfo/Tax/SSN',
TaxRate DECIMAL(2,2) PATH 'BankingInfo/Tax/TaxRate',
Currency VARCHAR(20) PATH 'BankingInfo/Currency'
) AS T
WHERE CUSTID = ?
SELECT T.*
FROM custacc_xml , XMLTABLE(XMLNAMESPACES(DEFAULT 'http://tpox-benchmark.com/custacc'),
'$DOC/Customer/Addresses/Address'
COLUMNS
id INTEGER PATH '../../@id',
street VARCHAR(200) PATH 'string-join(gStreet/Street," ")',
pobox VARCHAR(20) PATH 'POBox',
city VARCHAR(20) PATH 'City',
postalcode VARCHAR(20) PATH 'PostalCode',
state VARCHAR(20) PATH 'State',
country VARCHAR(45) PATH 'Country',
citycountry VARCHAR(45) PATH 'CityCountry',
phone VARCHAR(45) PATH 'string-join(Phones/Phone,",")',
emails VARCHAR(150) PATH 'string-join(../EmailAddresses/Email,",")'
) AS T
WHERE CUSTID = ?
select T.*
FROM custacc_xml , XMLTABLE(XMLNAMESPACES(DEFAULT 'http://tpox-benchmark.com/custacc'),
'$DOC/Customer/Accounts/Account/Holdings/Position'
COLUMNS
id INTEGER PATH '../../../../@id',
AccountID VARCHAR(20) PATH '../../@id',
Category VARCHAR(20) PATH '../../Category',
AccountTitle VARCHAR(45) PATH '../../AccountTitle',
ShortTitle VARCHAR(20) PATH '../../ShortTitle',
OpeningDate VARCHAR(25) PATH '../../OpeningDate',
AccountOfficer VARCHAR(45) PATH '../../AccountOfficer',
LastUpdate VARCHAR(25) PATH '../../LastUpdate',
OnlineActualBal INTEGER PATH '../../Balance/OnlineActualBal',
OnlineClearedBal INTEGER PATH '../../Balance/OnlineClearedBal',
WorkingBalance INTEGER PATH '../../Balance/WorkingBalance',
Passbook VARCHAR(10) PATH '../../Passbook',
ChargeCcy VARCHAR(10) PATH '../../ChargeCcy',
InterestCcy VARCHAR(10) PATH '../../InterestCcy',
AllowNetting VARCHAR(10) PATH '../../AllowNetting',
InPutter VARCHAR(500) PATH 'string-join(../../gInputter/Inputter,",")',
Position VARCHAR(500) PATH 'string-join((Symbol,Name,Type,Quantity),", ")'
) AS T
WHERE CUSTID = ?
Use Case 2 - Split
The second use case splits a large XML document into smaller documents (individual customer documents). The use case contains three splits: Split500, Split2500, and Split5000 as described in the table above. Splitting large XML batch files is a common scenario in business applications.
Figure 3: Overview for XML batch file splitting
Figure 4: Overview for XML batch file splitting with DOM in the Middletier
Figure 5: Overview for XML batch file splitting with SQL/XML functions in the database
Note: The DOM pseudo code is very high level whereas the SQL portions include the complete code
1: file = FileInput('batch500.xml')
2: dom = parse(file)
3: vectorOfDocs = splitDocument(dom)
4: FOR i in vectorOfDocs:
5: domString = serializeDOM(vectorOfDocs[i])
6: executeUpdate(domString)
INSERT into custacc_xml(DOC)
SELECT X.FRAG from XMLTABLE(
XMLNAMESPACES(DEFAULT 'http://tpox-benchmark.com/custacc'),
'$i/Customers/Customer' PASSING cast(? as xml) as "i"
COLUMNS FRAG XML PATH 'document{.}'
) as X
To split a XML batch file in the middle-tier, the document is parsed (e.g. using DOM) and then split into smaller pieces. In use case 2, the XML document is read from the file system and parsed using DOM into a DOM instance. The DOM tree is split into smaller chunks of XML data and new DOM instances are created. Each DOM tree contains w one Customer element. After that, the DOM is serialized into a Java string and stored as XML document into the database. For the database part, the XML batch file is sent as a binary stream to the database. The binary stream is parsed and then split into smaller documents using XMLTABLE function. The split XML documents are inserted into an XML column.
Use Case 3 - Modify
- Updating XML - modifying , deleting and inserting nodes
- Middle tier: DOM
- Database: XQuery Update
Figure 6: Overview for Modifying XML files with Xquery Updates in the Database
Figure 7: Overview for Modifying XML files with DOM in the Middletier
Note: The DOM pseudo code is very high level whereas the SQL portions include the complete code
1: read BLOB from database
2: dom = parse(BLOB)
3: dom = modify(dom) 4: domString = serializeDOM(dom)
5: executeUpdate(domString)
UPDATE custacc_xml
SET doc = XMLQUERY('declare default element namespace "http://tpox-benchmark.com/custacc";
copy $cust := $DOC
modify do insert
<Email primary="No">
{fn:concat(fn:string-join(($cust/Customer/Name/FirstName,$cust/Customer/Name/LastName),"."),"@domain.com")}
</Email>
as last into $cust/Customer/Addresses/EmailAddresses
return $cust
')
WHERE CUSTID = ? %
UPDATE custacc_xml
SET doc = XMLQUERY('declare default element namespace "http://tpox-benchmark.com/custacc";
copy $cust := $DOC
modify do replace
value of $cust/Customer/Addresses/EmailAddresses/Email[last()]
with fn:concat($cust/Customer/Name/FirstName,".",$cust/Customer/Name/LastName),"@newdomain.com")
return $cust
')
WHERE CUSTID = ?
UPDATE custacc_xml
SET doc = XMLQUERY('declare default element namespace "http://tpox-benchmark.com/custacc";
copy $cust := $DOC
modify for $i in $cust/Customer/Addresses/EmailAddresses/Email
where (fn:ends-with($i,"newdomain.com"))
return do delete
$i
return $cust
')
WHERE CUSTID = ? %
|
|