Shred XML documents using DB2 pureXML

Two methods for XML decomposition in DB2 for Linux, UNIX, and Windows

Since version 9.1, DB2 features significant new support for storing, managing, and searching XML data. One of these new features is annotated XML schema decomposition. Annotated XML schema decomposition allows you to decompose, or "shred", XML documents into relational tables. Another way to decompose XML documents is through the SQL or XML function: XMLTABLE. This article reviews the two methods of decomposing XML data including, how to use the XMLTABLE function for decomposition. It also compares annotated XML schema decomposition and XMLTABLE Decomposition and provides recommendations for the use of each.

Share:

Salvador Ledezma, Staff Software Engineer, IBM

Salvador has been working at IBM since 2002 at the Silicon Valley Labs, in San Jose, CA, initially developing J2EE applications for the DB2 z/OS database product. He currently spends his time working on runtime and tooling technologies for IBM's data server products, including DB2 pureXML.



Bert Van Der Linden (robbert@us.ibm.com), DB2 pureXML Architect, IBM

Bert joined IBM in 2001 to work on the XML project in DB2 as one of the early architects. Bert came to IBM from a startup company, Propel, where he led the design and implementation of the distributed and fault-tolerant middleware, which hosted a scalable e-commerce application. Before that, Bert worked for many years at Tandem Computers on the NonStop SQL, a database that runs many critical applications in the financial industry.



03 January 2008

Also available in Chinese Russian

Introduction

As XML data continues to proliferate in the enterprise, it is not always possible to store the XML as XML. Perhaps you are working with a legacy data architecture or other requirements restrict your storage to be relational. In fact, it is not uncommon to send and receive messages as XML data, whereas the messages are composed from and decomposed to relational data. Two features of DB2 help in this situation, where the data cannot be stored as XML: SQL/XML publishing functions and XML decomposition. First, the SQL/XML publishing functions help compose XML data from relational data. These are not discussed in this article. For an introduction to these functions, as well as general information on how to query XML data in DB2, refer to the developerWorks article "Query DB2 XML Data with SQL" (developerWorks, March 2006).

This article focuses on ways to "shred" XML data in DB2. Shredding is the process of mapping XML elements and attributes into relational tables and columns. One way to shred in DB2 is through the use of an annotated XML schema. If the XML data contains an XML schema, it is the easiest and fastest way to perform decomposition. If the mapping is significantly complex and involves multiple tables, existing tools automate both the mapping and decomposition steps.

Another, perhaps less-known, method for shredding is through the use of the SQL/XML function XMLTABLE. It is useful when an XML schema does not exist. Using the XMLTABLE function can be more complex since the decomposition steps must be manually coded. This means that the developer explicitly must state, using XQuery expressions, how a particular XML element is mapped to a table and column. Still, it is this flexibility that makes XMLTABLE decomposition more powerful than annotated XML schema decomposition, giving it the ability to perform some types of mappings that annotated XML schema decomposition cannot.

This article shows some decomposition examples using both annotated XML schema and the XMLTABLE function. It also shows some examples that annotated XML schema decomposition does not support, yet XMLTABLE does. Finally, it concludes with a comparison of best practices for each and provide some recommendations for their use.


Annotated XML schema decomposition

The annotated XML schema decomposition feature in DB2 can be used to decompose XML documents into relational tables. As the name implies, it uses annotations in the XML schema as the mapping language to map information in an XML document to relational tables. Since it requires an XML schema, the XML schema document must be stored in the DB2 XML Schema Repository (XSR) and designated for decomposition. Decomposing or "shredding" an XML document into the mapped relational columns is done through a DB2 stored procedure call or through a command line processor (CLP) command.

One way to annotate an XML schema is to use IBM Data Studio. Data Studio, a free download, is a comprehensive, integrated development environment for creating, editing, debugging, deploying, and testing DB2 database applications, including developing stored procedures and user-defined functions. For download information, refer to the Resources section of this article. One of the components of Data Studio is the annotated XML schema decomposition Mapping Editor. Using a simple and intuitive graphical interface, it allows you to map relationships between an XML schema and a relational schema. Graphically mapping XML elements or attributes to relational columns in DB2 automatically annotates the XML schema document. Once the XML schema is saved and then registered in the XSR, you are ready to decompose XML documents into DB2.

xdbDecompXML is a DB2 stored procedure used for annotated XML schema decomposition. Several xdbDecompXML versions exist and each is optimized for different size documents. They are called:

  • XdbDecompXML
  • XdbDecompXML10MB
  • XdbDecompXML25MB
  • XdbDecompXML50MB
  • XdbDecompXML75MB
  • XdbDecompXML100MB

As the names imply, each stored procedure differs only in the size of the XML document that is to be decomposed. For example, XML documents up to 1 MB, use xdbDecompXML and for XML documents up to 10MB, use xdbDecompXML10MB. All the stored procedures take eight parameters, three of which are reserved and must be set to NULL. The parameters are:

  • rschema
    The SQL schema part of the two-part XSR object name registered in the XML schema repository.
  • xmlschemaname
    The XML schema name of the two-part XSR object name
  • xmldoc
    The XML document to be decomposed passed in as a BLOB object. The size of the document determines which stored procedure should be called.
  • documentid
    An identifier for the XML document to be decomposed. It may be used in specific annotations in the XML schema, specifically db2-xdb:expression and db2-xdb:condition.
  • validation
    An integer indicating whether the XML document to be decomposed should also be validated against the XML schema. 0 means no validation, while 1 indicates that the document should be validated.
  • reserved
    The last three arguments are reserved and must be set to NULL.

Based on the annotations in the XML schema stored in the XSR, a call to xdbDecompXML allows DB2 to decompose and insert into the appropriate relational tables.

A full discussion of the XML schema annotations and the xdbDecompXML set of stored procedures is beyond the scope of this article. To learn more about annotated XML schema decomposition, including such advanced features as conditional decomposition based on content or specifying content transformation to be applied before insertion, refer to the "DB2 Version 9 XML Guide" publication (Found in the Resources section of this article). For those familiar with XML Extender and its method for decomposition, refer to "From DAD to Annotated XML Schema Decomposition" (developerWorks, April 2006) for more information.


XMLTABLE decomposition

While annotated XML schema decomposition requires the existence of an XML schema in the XSR, it is possible to decompose a document without an XML schema using the XMLTABLE function. XMLTABLE is a SQL table function that returns a table from the evaluation of XQuery expressions. The table that is returned can contain columns of any SQL data type, including XML. You can pass variables to the XQuery expression specified in XMLTABLE.

The general syntax for XMLTABLE looks like this:

Listing 1. General Syntax for XMLTABLE
XMLTABLE( xquery-expression  PASSING xml-source 
COLUMNS 
column-name	column-(sql)data-type PATH path-xquery-expression
,...)

xml-source is the XML document that provides the data for the tabular result. It can be stored in DB2 in an XML column or it can be a document outside of DB2. The XQuery expression specified in the xquery-expression argument is the row-producing XQuery expression. A row is generated for each item in the output sequence. The structure of the resulting table is defined by the COLUMNS clause of the function. In this clause, you define characteristics of the column by specifying column name (column-name), data type (column-(sql)data-type), and how the column value is generated. The column value of the resulting table can be generated by specifying an XQuery expression (path-xquery-expression) in the PATH clause of XMLTABLE. path-xquery-expression specifies the column value with respect to an item selected by xquery-expression. Refer to the examples listed below for more information.

Using XMLTABLE, values retrieved from within XML documents, when combined with an INSERT statement, can then be inserted into a relational table, allowing for the same functionality as annotated XML schema decomposition. This is often referred to as an "Insert-from-XMLTABLE" statement. This article uses the term XMLTABLE decomposition.

The data passed to the INSERT statement is not limited to what is returned from the XMLTABLE function. It can be sourced from other places, such as host variables, other tables, other XMLTABLE statements, or even other table functions. Using the power of both XQuery and SQL makes XMLTABLE decomposition very flexible. Still, an INSERT statement only inserts into one table. If a mapping requires insertion into multiple tables, then multiple INSERT statements with XMLTABLE invocations must be defined. For this reason, it may be helpful to place the XMLTABLE decomposition statements in a SQL stored procedure. Defining the statements in a SQL stored procedure allows the application to only make one call, no matter how many tables are populated by a particular mapping. There is also the added performance benefit that in the SQL stored procedure, the XML document to be decomposed is only parsed once, even if used in multiple XMLTABLE function calls.


XML decomposition: An example

Now look at an example of how to decompose XML documents using both annotated XML schema decomposition and XMLTABLE decomposition.

Sample XML data

Assume that you have XML data that represents mail messages. A typical document might look like this:

Listing 2. Sample XML data representing mail messages
<email:mails xmlns:email="http://mymail.com/mails">
   <mail>
	<envelope>
         <from></from>
	   <to></to>
	   <email:Date></email:Date>
	   <subject></subject>
	</envelope>
	<body></body>
	<attachment></attachment>
   </mail>
   <mail>
	...
   </mail>
	...
</email:mails>

Every <mail> message contains an envelope, a body, and an attachment and the XML document can have an unlimited number of <mail> elements. For simplicity, assume that attachments are textual, not binary data

Relational schema

Often times, when performing decomposition, you have no control over the existing relational schema. Assume you have three relational tables into which you would like to decompose the documents. They are defined as follows:

Listing 3. DDL for relational schema
create table envelopext (
docID 		integer not null generated always as identity primary key, 
mailfrom 	varchar(100), 
mailto		varchar(100), 
maildate 	varchar(30), 
subject 	varchar(100)
);

create table bodyxt (
bodyId		integer not null generated always as identity primary key,
body		varchar(30000)
);

create table attachxt (
attachId	integer not null generated always as identity primary key,
attachment	varchar(100)
);

Decompose and insert XML data into relational tables with annotated XML schema decomposition

Finally, assume that you have an XML schema that has already been annotated with the desired mapping. As mentioned previously, existing tooling, such as the Data Studio Mapping Editor, can help with mappings by automatically generating the annotations. (For a copy of the fully annotated XML schema, refer to the Download section.).

If you are using a Java application, simply call the DB2 decomposition stored procedure xdbDecompXML, using JDBC and using a Java InputStream, to directly pass the XML file's contents to the stored procedure. This is shown in Listing 4:

Listing 4. Calling the xdbDecompXML stored procedure
String sql = "{CALL xdbDecompXML(?,?,?,?,?,NULL,NULL,NULL)}";
CallableStatement cstmt = con.prepareCall( sql );

String filename = "mail.xml";
File currFile = new File( filename );
long length = currFile.length();

// SQL Schema Name
// assume NULL
cstmt.setNull( 1, Types.VARCHAR );

// XML Schema Name
cstmt.setString( 2, "\"TEST001\"" );

// XML document to be decomposed, represented as a BLOB
BufferedInputStream bis = 
new BufferedInputStream( new FileInputStream( filename ) );
cstmt.setBinaryStream( 3, bis, ( int ) length );

// XML Document identifier
cstmt.setString( 4, "TEST001" );

// Options to validate the instance document against the XML Schema
int validate = 0;
cstmt.setInt( 5, validate );

/*
The last 3 parameters are reserved for future use.  
Currently NULL values are expected and must be passed in
*/

cstmt.execute();
con.commit();

cstmt.close();

As far as the application is concerned, it is just making a standard stored procedure call. Since the XML schema was previously annotated and registered in the XSR, the stored procedure does the work of taking the XML document and decomposing it into the three mapped tables.


Decompose and insert XML data into relational tables with XMLTABLE decomposition

To show that XMLTABLE decomposition can perform the same operation, use XMLTABLE in combination with an SQL INSERT statement for the same XML document and set of relational tables. Though not necessary, place the XMLTABLE decomposition functionality into a SQL stored procedure. From the application's point of view, using XMLTABLE decomposition is a matter of calling a stored procedure, just like annotated XML schema decomposition.

Before you look at the stored procedure, write the Java code for calling it. This is shown in Listing 5:

Listing 5. Calling the stored procedure that uses XMLTABLE
// sql stored procedure
String sql = "{CALL XMLTINSERT(?)}";
		
CallableStatement cstmt = con.prepareCall( sql );
String filename = "mail.xml";

File currFile = new File( filename );
long length = currFile.length();
		
BufferedInputStream bis = 
new BufferedInputStream( new FileInputStream( filename ) );

cstmt.setBinaryStream( 1, bis, ( int ) length );
cstmt.execute();
con.commit();
cstmt.close();

The stored procedure is defined to have one parameter of type XML. Otherwise, it is very similar to the xdbDecompXML call. Again, a Java InputStream is defined to directly pass an XML file's contents to the stored procedure.

The SQL stored procedure with the XMLTABLE statement for the above example is shown in Listing 6:

Listing 6. Stored procedure using XMLTABLE decomposition
CREATE PROCEDURE XMLTINSERT (IN db2xml XML)
	SPECIFIC XMLTINSERT
------------------------------------------------------------------------
-- SQL Stored Procedure 
-- db2xml is an IN parameter of type XML 
------------------------------------------------------------------------
P1: BEGIN

-- TABLE ENVELOPEXT
INSERT INTO ENVELOPEXT (MAILFROM, MAILTO, MAILDATE, SUBJECT)
SELECT MAILFROM, MAILTO, MAILDATE, SUBJECT
FROM
XMLTABLE(XMLNAMESPACES(
'http://www.sal.com/mails' AS "email"),
'$doc/email:mails/mail'
PASSING  DB2XML AS "doc" 
	COLUMNS 
MAILFROM 	VARCHAR (100)	PATH 'envelope/from',
MAILTO 		VARCHAR (100) 	PATH 'envelope/to',
MAILDATE 	VARCHAR (30) 	PATH 'envelope/email:Date',
SUBJECT 	VARCHAR (100) 	PATH 'envelope/Subject') AS T;
	    
-- TABLE BODYXT
INSERT INTO BODYXT (BODY)
SELECT BODY
FROM
XMLTABLE(XMLNAMESPACES(
'http://www.sal.com/mails' AS "email"),
'$doc/email:mails/mail'
PASSING  DB2XML AS "doc" 
	COLUMNS 
BODY 		VARCHAR (30000)	PATH 'body') AS T;
	
-- TABLE ATTACHXT
INSERT INTO ATTACHXT (ATTACHMENT)
SELECT ATTACHMENT
FROM
XMLTABLE(XMLNAMESPACES(
'http://www.sal.com/mails' AS "email"),
'$doc/email:mails/mail'
PASSING  DB2XML AS "doc" 
	COLUMNS 
ATTACHMENT 	VARCHAR (100) PATH 'attachment/@email:name') AS T;

END P1%

Notice that there are three separate INSERT statements. This is because you have three separate tables and you have to invoke separate XMLTABLE statements for each table. Look at the first INSERT statement piece by piece:

INSERT INTO ENVELOPEXT (MAILFROM, MAILTO, MAILDATE, SUBJECT)

This is standard INSERT syntax. You will be inserting into columns MAILFROM, MAILTO, MAILDATE, and SUBJECT of the table ENVELOPEXT. Next is the source for the INSERT:

SELECT MAILFROM, MAILTO, MAILDATE, SUBJECT
FROM

The insert data is coming from a SELECT statement. You will be grabbing the MAILFROM, MAILTO, MAILDATE, and SUBJECT columns. Now, this is where things get interesting. The data in the SELECT comes from the result of the XMLTABLE function:

XMLTABLE(XMLNAMESPACES(
'http://www.sal.com/mails' AS "email"),
'$doc/email:mails/mail'
PASSING  DB2XML AS "doc"

XMLTABLE expects an XQuery expression as a parameter. In this example the XQUERY expression is:

'$doc/email:mails/mail'

More on that in a second, but first see where the XML data is coming from:

PASSING  DB2XML AS "doc"

DB2XML provides the XML source. In this case, DB2XML is the IN parameter of the stored procedure, an XML value. When the XQuery expression is evaluated, an XQuery variable is presented with a value equal to what is passed (DB2XML) and a name specified by the AS clause (doc). Therefore, '$doc/email:mails/mail' refers to the sequence of elements defined by this path in the XML document.

One last thing that needs to be mention. Notice that the path has a namespace in it. You must tell DB2 what the namespace is referring to. The XMLNAMESPACES function helps you do that:

XMLNAMESPACES('http://www.sal.com/mails' AS "email")

The XMLNAMESPACES function provides namespaces to the static context of the XQuery expressions used in XMLTABLE so the various XQuery expressions do not have to include or repeat namespace declarations in their prologs. While namespaces can be declared in the XQuery prolog, the beauty of using XMLNAMESPACES, compared to using a namespace declaration through the XQuery prolog, is that the namespace can be used in all the XQuery expressions in arguments of XMLTABLE, eliminating repetition of code.

Table functions return columns of a table. Define what those columns are with the COLUMNS clause:

	COLUMNS 
MAILFROM 	VARCHAR (100)	PATH 'envelope/from',
MAILTO 		VARCHAR (100) 	PATH 'envelope/to',
MAILDATE 	VARCHAR (30) 	PATH 'envelope/email:Date',
SUBJECT 	VARCHAR (100) 	PATH 'envelope/Subject') AS T;

Each PATH clause is an XQuery expression as well, but it operates not on the entire XML document, but rather on the context defined in our XML source clause ($doc/email:mails/mail'. So the path envelope/From is a path from the <mail> element. The MAILFROM column is defined to be a VARCHAR(100) and the contents are taken from the full XML path $doc/email:mails/mail/envelope/from. The other columns are populated in a similar fashion.

The XMLTABLE results are inserted into the ENVELOPEXT table. The other two XMLTABLE statements similarly insert into the BODYXT, and ATTACHXT tables, respectively. Also, though there are three separate XMLTABLE statements executed, the XML document is a variable in the stored procedure, so it is only parsed once.

Finally, recall what your original XML document looked like. The XMLTABLE statement returns one row per <mail> element. Because there may be multiple <mail> elements, many rows may be created in the tables with one call to the stored procedure. Therefore, if there are 20 <mail> elements, the stored procedure will insert 20 rows.

If the stored procedure is saved in a file called xtinsert.db2, to create the stored procedure in DB2, you or your DB2 administrator can use this command on the CLP and it is ready to be used by your applications:

db2 -td% -vf xtinsert.db2

Some XML decomposition special cases

The following discusses some XML documents that cannot be decomposed readily using an annotated XML schema. Because XMLTABLE decomposition is driven by XQuery expressions, it can be used to programmatically map these special cases.

Map many-to-many child element relationships using XBRL data

From a technical standpoint, eXtensible Business Reporting Language (XBRL) data is interesting in the way it defines relationships between elements. These relationships represent real-world business processes.

Consider the following simple XBRL document shown in listing 7:

Listing 7. Sample XBRL data
<?xml version="1.0" encoding="utf-8"?>
<linkbase 
   xmlns="http://www.xbrl.org/2003/linkbase" 
   xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
   xmlns:xlink="http://www.w3.org/1999/xlink" 
   xsi:schemaLocation="http://www.xbrl.org/2003/linkbase xbrl-linkbase-2003-12-31.xsd">
                                                     
  <calculationLink 
	xlink:type="extended" 
	xlink:role=http://www.xbrl.org/2003/role/link 
	xlink:title="Calculations, All">

    <loc 
	xlink:type="locator" 
	xlink:href="BasicCalculation.xsd#ci_TotalPropertyPlantEquipment" 
	xlink:label="ci_TotalPropertyPlantEquipment" />

    <loc 
	xlink:type="locator" 
	xlink:href="BasicCalculation.xsd#ci_Land" 
	xlink:label="ci_Land" />

    <loc 
	xlink:type="locator" 
	xlink:href="BasicCalculation.xsd#ci_Building" 
	xlink:label="ci_Building" />

    <loc 
	xlink:type="locator" 
	xlink:href="BasicCalculation.xsd#ci_FurnitureFixtures" 
	xlink:label="ci_FurnitureFixtures" />

    <loc 
	xlink:type="locator" 
	xlink:href="BasicCalculation.xsd#ci_ComputerEquipment" 
	xlink:label="ci_ComputerEquipment" />

    <loc 
	xlink:type="locator" 
	xlink:href="BasicCalculation.xsd#ci_Other" 
	xlink:label="ci_Other" />

    <calculationArc 
	xlink:type="arc" 
	xlink:arcrole="http://www.xbrl.org/2003/arcrole/summation-item" 
	xlink:from="ci_TotalPropertyPlantEquipment" 
	xlink:to="ci_Land" order="1" weight="1" use="optional" />

    <calculationArc 
	xlink:type="arc" 
	xlink:arcrole="http://www.xbrl.org/2003/arcrole/summation-item" 
	xlink:from="ci_TotalPropertyPlantEquipment" 
	xlink:to="ci_Building" order="2" weight="1" use="optional" />

    <calculationArc 
	xlink:type="arc" 
	xlink:arcrole="http://www.xbrl.org/2003/arcrole/summation-item" 
	xlink:from="ci_TotalPropertyPlantEquipment" 
	xlink:to="ci_FurnitureFixtures" order="3" weight="1" use="optional" />

    <calculationArc 
	xlink:type="arc" 
	xlink:arcrole="http://www.xbrl.org/2003/arcrole/summation-item" 
	xlink:from="ci_TotalPropertyPlantEquipment" 
	xlink:to="ci_ComputerEquipment" order="4" weight="1" use="optional" />

    <calculationArc 
	xlink:type="arc" 
	xlink:arcrole="http://www.xbrl.org/2003/arcrole/summation-item" 
	xlink:from="ci_TotalPropertyPlantEquipment" 
	xlink:to="ci_Other" order="5" weight="1" use="optional" />

  </calculationLink>
</linkbase>

XBRL: eXtensible Business Reporting Language

XBRL is an XML language used for electronically communicating business and financial data. As an emerging, open standard it is mainly used for collecting and reporting financial information. XBRL can show how items are related to one another, and thus can represent how item financials are calculated and whether they fall into particular groupings for organizational or presentational purposes. For more information, refer to the Resources section of this article.

XBRL enhances XML by adding more semantics. One key enhancement is connecting two items using "arcs." In this document, the interest is in <calculationArc> elements. They define a relationship between two assets. The "arcrole" attribute, based on the URI, tells you that this particular arc is a summation relationship between the "to" and "from" attributes. Each "to" attribute identifies the individual items, such as "ci_Land" and "ci_Building" that are summed. These items summed up together compose the cost of the "from" attribute, in this case "ci_TotalPropertyPlantEquipment."

The attributes (such as "ci_Land" and "ci_Building") are actually labels. The <loc> element helps you identify what each label represents. Look at the <loc> element, whose "label" attribute = "ci_Land":

<loc 
	xlink:type="locator" 
	xlink:href="BasicCalculation.xsd#ci_Land" 
	xlink:label="ci_Land" />

You see that there is an "href" attribute that points to the "BasicCalculation.xsd" XML schema document. If you were to look in this document, you would find the actual value of "ci_land".

A full analysis of XBRL data is beyond the scope of this article, but you can imagine that you might want to decompose this XML document into an "arc" table that stores the "from" and "to" attributes. But instead of the labels, you want the name of the reference that give you the actual values.

The table contains three columns: an automatically generated ARCID, an ARCFROM, and an ARCTO column, as shown in Listing 8:

Listing 8. DDL for the ARC table
CREATE TABLE ARC (
  ARCID		integer not null generated always as identity primary key,
  ARCFROM	varchar(100),
  ARCTO		varchar(100)
);

For decomposition, "from" and "to" attribute values in the <calculationArc> element are replaced with the appropriate "href" attribute in the <loc> element. For example, for the following element:

<calculationArc 
	xlink:type="arc" 
	xlink:arcrole="http://www.xbrl.org/2003/arcrole/summation-item" 
	xlink:from="ci_TotalPropertyPlantEquipment" 
	xlink:to="ci_Land" order="1" weight="1" use="optional" />

The values BasicCalculation.xsd#ci_TotalPropertyPlantEquipment and BasicCalculation.xsd#ci_Land from the <loc> element are inserted in the ARCFROM and ARCTO columns of the ARC table, respectively. These values correspond to the from="ci_TotalPropertyPlantEquipment" and to="ci_Land" labels in the <calculationArc> element. The other rows of the table are populated in a similar fashion.

Annotated XML schema decomposition does not support this type of decomposition of a many-to-many multiple child element mapping that forms rows conditionally based on dynamic values. Here, XMLTABLE can help. As mentioned before, XQuery is the engine at work behind the XMLTABLE function, making it a very powerful tool.

Listing 9 is a stored procedure that performs the decomposition as required:

Listing 9. SQL Stored procedure for XBRL decomposition using XMLTABLE
CREATE PROCEDURE XMLTINSERT2 ( IN db2xml XML )
	SPECIFIC XMLTINSERT2
------------------------------------------------------------------------
-- SQL Stored Procedure 
-- db2xml 
------------------------------------------------------------------------
P1: BEGIN
INSERT INTO ARC
	(ARCFROM,
	ARCTO) 
SELECT	
	ARCFROM,
	ARCTO 
	FROM XMLTABLE(
	'$doc/*:linkbase/*:calculationLink/*:calculationArc' 
		PASSING DB2XML AS "doc"
		COLUMNS
ARCFROM VARCHAR(100) PATH 'let $x := . return $x/../*:loc[@*:label=$x/@*:from]/@*:href',
ARCTO 	VARCHAR(100) PATH 'let $x := . return $x/../*:loc[@*:label=$x/@*:to]/@*:href'
	)AS T;	
END P1%

First, select the context to be the <calculationArc> element by using the following XQuery expression:

'$doc/*:linkbase/*:calculationLink/*:calculationArc'

Note that the original XBRL document contains several different namespaces. In your XQuery expression, you are not concerned with them so wild card the namespace with an asterisk (*). If you had multiple <calculationArc> elements with different namespaces, this might pose a problem, but this is not the case here.

You can obtain all you need from there. Look at the PATHs defined by the COLUMNS clause. To get ARCFROM, you want the element whose "label" attribute matches the current calculation's "from" attribute. To perform this test, you assign the current context node to a variable $x using the let XQuery keyword. Finally, you step up from the current context, navigate to the <loc> element and return the 'href' attribute whose attribute "label" matches the "from" attribute:

'let $x := . return $x/../*:loc[@*:label=$x/@*:from]/@*:href'

You obtain the value for ARCTO in the same way. The decomposed data in the table after calling the stored procedure is shown in Table 1:

Table 1. Decomposed data in the ARC table
ARCIDARCFROMARCTO
1BasicCalculation.xsd#ci_TotalPropertyPlantEquipmentBasicCalculation.xsd#ci_Land
2BasicCalculation.xsd#ci_TotalPropertyPlantEquipmentBasicCalculation.xsd#ci_Building
3BasicCalculation.xsd#ci_TotalPropertyPlantEquipmentBasicCalculation.xsd#ci_FurnitureFixtures
4BasicCalculation.xsd#ci_TotalPropertyPlantEquipmentBasicCalculation.xsd#ci_ComputerEquipment
5BasicCalculation.xsd#ci_TotalPropertyPlantEquipmentBasicCalculation.xsd#ci_Other

XML documents with recursive elements

Now take a look, in Listing 10, at another common type of XML document that is common in industry:

Listing 10. XML document representing bill of materials
<?xml version="1.0" encoding="UTF-8"?>
<items>
   <item desc="computersystem" model="L1234123">
	  <part desc="computer" partnum="5423452345">
		<part desc="motherboard" partnum="5423452345">
			<part desc="CPU" partnum="6109486697">
				<part desc="register" partnum="6109486697"/>
			</part>
			<part desc="memory" partnum="545454232">
				<part desc="transistor" partnum="6109486697"/>
			</part>
		</part>
		
		<part desc="diskdrive" partnum="6345634563456">
			<part desc="spindlemotor" partnum="191986123"/>
		</part>
		<part desc="powersupply" partnum="098765343">
			<part desc="powercord" partnum="191986123"/>
		</part>
	  </part>
	
	  <part desc="monitor" partnum="898234234">
		<part desc="cathoderaytube" partnum="191986123"/>
	  </part>
	
	  <part desc="keyboard" partnum="191986123">
		<part desc="keycaps" partnum="191986123"/>
	  </part>
	
	  <part desc="mouse" partnum="98798734">
		<part desc="mouseball" partnum="98798734"/>
	  </part>
   </item>
</items>

This type of XML document is commonly called a "bill of materials." The sample document contains only one item, but potentially, it can contain an unlimited number of items. What makes the document interesting is that the <part> element is recursive. An <item> element may contain <part> child elements, which in turn may also contain more <part> child elements, and so on, indefinitely.

Decompose this document into an ITEM table with ITEMNAME, PART, and PARENT columns.

Listing 11. DDL for the ITEM table
CREATE TABLE ITEM (
  PID		integer not null generated always as identity primary key,
  ITEMNAME	varchar(25),
  PART		varchar(25),
  PARENT	varchar(25)
);

Annotated XML schema decomposition does not allow the use of recursive element definitions in the XML schema. Use XMLTABLE decomposition to help out again. The stored procedure is given in Listing 12:

Listing 12. SQL Stored procedure for decomposing bill of materials
CREATE PROCEDURE XMLTINSERT3 ( IN db2xml XML )
	SPECIFIC XMLTINSERT3
------------------------------------------------------------------------
-- SQL Stored Procedure 
-- db2xml 
------------------------------------------------------------------------
P1: BEGIN
INSERT INTO ITEM
	(ITEMNAME,
	PART,
	PARENT) 
SELECT 	
	A.ITEMNAME,
	B.PART,
	B.PARENT
	FROM 
	XMLTABLE('$doc/items/item' PASSING DB2XML AS "doc"
	COLUMNS
		ITEMNAME 	VARCHAR(25)	PATH './@desc',
		ITEM		XML			PATH '.'
	)AS A,
	XMLTABLE('$doc//part' PASSING A.ITEM AS "doc"
	COLUMNS
		PART 		VARCHAR(30) PATH './@desc',
		PARENT		VARCHAR(35) PATH '../@desc'
	)AS B;	
END P1%

In this example, the INSERT statement has two sources: two XMLTABLE statements. The first statement gives you the name of the item using the context node of $doc/items/item. It also grabs the item subtree for use in the second statement.

The second statement contains an XQuery expression that selects all <part> elements no matter where they are ($doc//part) but only from the current <item> subtree (A.ITEM).

Now that you have all the <part> elements, you get the current node description and navigate up to get the parent description. Table 2 contains the contents of the ITEM table after invoking the stored procedure with the sample data:

Table 2. Decomposed data in the ITEM table
ITEMIDITEMNAMEPARTPARENT
1computersystemcomputercomputersystem
2computersystemmotherboardcomputer
3computersystemCPUmotherboard
4computersystemregisterCPU
5computersystemmemorymotherboard
6computersystemtransistormemory
7computersystemdiskdrivecomputer
8computersystemspindlemotordiskdrive
9computersystempowersupplycomputer
10computersystempowercordpowersupply
11computersystemmonitorcomputersystem
12computersystemcathoderaytubemonitor
13computersystemkeyboardcomputersystem
14computersystemkeycapskeyboard
15computersystemmousecomputersystem
16computersystemmouseballmouse

You can easily follow the hierarchy of parts by looking at the PARENT column. For example, row 4 is for the "register" part. If you follow the parent you get "CPU." "CPU's" parent is the "motherboard." The "motherboard's" parent is the "computer." And the "computer's" parent is the "computersystem", the top-level item.


Performance

FpML: Financial products markup language

Financial products Markup Language (FpML) is an industry-standard protocol for complex financial products. FpML represents over-the-counter (OTC) financial derivative transactions whose price is derived from other assets and that are traded bilaterally between parties (as opposed to through a public exchange). These transactions form a significant component of today's financial markets and are a major source of technology investment for the firms involved because of the major risks that they carry.

The example used in this performance comparison is based on the FpML 4.1 Recommendation Specification release on July 14, 2005. For more information, refer to the Resources section.

Though intensive performance studies have not been conducted, informal studies show that both annotated XML schema decomposition and XMLTABLE decomposition perform well relative to each other. Here, we share some of our informal results.

Simple mapping

Decomposition was performed using an enhanced, proprietary FpML XML schema that defines the content model for a message. The message wraps an FpML trade. This XML schema consists of 23 documents with a total size of approximately 871KB. The 12 most common items were mapped to 12 relational columns in one table. In this situation, when mapping to only a small number of relational columns, the performance difference between annotated XML schema decomposition and XMLTABLE decomposition was negligible.

Complex mapping

Some tests were also performed using a proprietary annotated XML schema provided by a customer. The XML schema was defined in one document that was approximately 142 KB. One XML instance document was decomposed into 21 tables and hundreds of relational columns across those tables. Some elements produced multiple row insertions, while others only generated one row. In this case, annotated XML schema decomposition performed approximately 40 percent faster than XMLTABLE decomposition.

Considerations

Recall that XMLTABLE must be invoked for each table that is populated. This may be costly if the number of tables is large. Also, annotated XML schema decomposition traverses the XML document only once, while each invocation of XMLTABLE requires a document traversal. Using a stored procedure to perform the XMLTABLE decomposition can help because even with multiple XMLTABLE invocations, the XML value is only parsed once and it is reused for each traversal for the life of the stored procedure call.

Finally, what has been presented are not official performance results. It is recommended that further performance studies be conducted in a more controlled environment for a more rigorous benchmark. Further studies can also define at a more fine-grained level what the threshold is for an XML schema or relational schema to go from simple to complex. Suggested variables to control are the size of the XML schema (both breadth and depth), the number of tables used for decomposition, and the number of columns per table.


Best practices for annotated XML schema decomposition and XMLTABLE decomposition

This article has shown that annotated XML schema decomposition and XMLTABLE decomposition are different ways to decompose XML documents into relational tables. There are a few considerations to keep in mind when deciding which approach to use for your applications.

Instance document vs. XML schema

The first and most obvious thing to consider when performing decomposition is that XMLTABLE decomposition works with an instance document and does not require an XML schema. Annotated XML schema decomposition requires the mappings to be defined as part of the XML schema.

Mapping changes

Annotated XML schema decomposition works on annotated XML schema stored in the XSR. If the mapping changes, the XML schema must be changed and re-registered in the XSR. This is easily accomplished using Data Studio.

XMLTABLE decomposition creates mappings using XQuery expressions. A mapping change requires changing all the XQuery expressions by hand. If defined in a stored procedure, just like the examples above, changes to the mapping require that the stored procedure be dropped and created again.

Mapping definition flexibility

Annotated XML schema decomposition mappings are defined by annotations. Since they are part of the XML schema document, mappings must conform to the extent and limitations of the XML schema language. If the XML schema language does not allow a particular construct, then this relationship cannot be mapped using an annotated XML schema. Still, for the wide range of allowable mappings, tooling makes creating the mapping annotations easy and automated. Furthermore, it is easy to map complex XML schemas to multiple tables.

Using XMLTABLE decomposition, data to populate columns in the INSERT statement can come from various sources and is not limited to one XMLTABLE call. Also, the power of XQuery can allow the developer to create sophisticated expressions adding much flexibility to how a mapping is defined. It is only limited by whatever the developer can dream up. At the same time, this flexibility does add complexity and if the developer is not familiar with the XQuery language, there may be a learning curve. Finally, XMLTABLE statements can become tedious if there are many tables and columns involved in a mapping.


Conclusion

Ultimately, annotated XML schema decomposition and XMLTABLE decomposition are both very powerful tools and it may come down to user preference as to which one is used. But in certain cases, one can be recommended over the other .

Use annotated XML schema decomposition when you have a complex schema and a mapping requirement over a large number of tables. Available tooling and performance considerations make it a good choice. Annotated XML schema decomposition also allows some advanced features, such as conditional decomposition based on content or specifying content transformation to be applied before insertion, which this article did not discuss. Though XMLTABLE decomposition functionality supports this as well, annotations allow it to be done more easily in the XML schema.

Use XMLTABLE decomposition when an XML schema does not exist or if a special purpose requires flexibility. For example, if you would like to perform a many-to-many multiple child element mapping that forms rows conditionally based on dynamic values, if you would like to map recursive elements, or if you want to control exactly how your INSERT statement is composed, then XMLTABLE decomposition is the way to go.

This article has demonstrated the use of XMLTABLE decomposition for decomposing XML documents into relational tables as an alternative to annotated XML schema decomposition. It also provided some recommendations and best practices for the use of each. Together, they are part of the pureXML technology in DB2 and they help developers and administrators manage their data in the way that works best for them.

Acknowledgements

Thanks is extended to members of the DB2 pureXML Development team at the IBM Silicon Valley Lab (SVL) who provided suggestions and comments for this article: Seeling Cheung, Dung Nguyen, Mayank Pradhan, David Salinero, and Cynthia Saracco.


Downloads

DescriptionNameSize
Scripts to create and register objectsmail.zip3KB
Scripts to create the XBRL examplemany-to-many.zip2KB
Scripts to create BOM examplerecursive.zip2KB

Resources

Learn

Get products and technologies

  • Download DB2 9.5 and get started on using XML technologies.
  • Download IBM Data Studio to easily and quickly build database applications.
  • Download IBM product evaluation versions and get your hands on application development tools and middleware products from DB2®, Lotus ®, Rational®, Tivoli®, and WebSphere®.

Discuss

Comments

developerWorks: Sign in

Required fields are indicated with an asterisk (*).


Need an IBM ID?
Forgot your IBM ID?


Forgot your password?
Change your password

By clicking Submit, you agree to the developerWorks terms of use.

 


The first time you sign into developerWorks, a profile is created for you. Information in your profile (your name, country/region, and company name) is displayed to the public and will accompany any content you post, unless you opt to hide your company name. You may update your IBM account at any time.

All information submitted is secure.

Choose your display name



The first time you sign in to developerWorks, a profile is created for you, so you need to choose a display name. Your display name accompanies the content you post on developerWorks.

Please choose a display name between 3-31 characters. Your display name must be unique in the developerWorks community and should not be your email address for privacy reasons.

Required fields are indicated with an asterisk (*).

(Must be between 3 – 31 characters.)

By clicking Submit, you agree to the developerWorks terms of use.

 


All information submitted is secure.

Dig deeper into Information management on developerWorks


static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=Information Management, XML
ArticleID=280030
ArticleTitle=Shred XML documents using DB2 pureXML
publish-date=01032008