Managing new forms of data often presents new challenges. Many IT leaders have discovered that's precisely the case when it comes to data in Extensible Markup Language (XML) format.
All too often, the obvious choices for managing and sharing XML data just don't cut it. File systems are fine for simple tasks, but they don't scale well when you need to cope with a large number of documents. Concurrency, recovery, security, and usability issues become unmanageable. Commercial relational database management systems (DBMSs) address those issues but fall short in other areas. They offer two fundamental database design options -- storing each XML document intact as a single large object or "shredding" it into multiple columns often across multiple tables. In many situations, these options introduce performance problems, administrative challenges, increased query complexity, and other issues. Finally, XML-only DBMSs introduce a new, largely unproven environment into an IT infrastructure, raising concerns about integration, staff skills, and long-range viability.
The Version 9 release of DB2 for Linux, Unix, and Windows platforms introduces another option. This release supports XML data as a first-class type. To do so, IBM extended DB2 to include:
- New storage techniques for efficient management of hierarchical structures inherent in XML documents
- New indexing technology to speed searches across and within XML documents
- New query language support (for XQuery), a new graphical query builder (for XQuery), and new query optimization techniques
- New support for validating XML data based on user-supplied schemas
- New administrative capabilities, including extensions to key database utilities
- Integration with popular application programming interfaces (APIs)
- Supporf of XML in a data warehouse environment by adding XML support for the data partitioning feature (DPF), range partitioning, and multidimensional clustering (MDC) in V9.7
It's important to note that DB2's "native" support for XML is in addition to its existing support for other technologies, including SQL, tabular data structures, and various DBMS features. As a result, users can create a single database object that manages both "traditional" SQL data and XML documents. Furthermore, they can write a single query that searches and integrates both forms of data.
This article explores these features as we delve into DB2's native XML support. First, though, let's consider why proper management of XML data is important.
With an increasing number of firms turning to XML to help them implement service-oriented architectures (SOA), exchange data among disparate systems and applications, and adapt to rapidly changing business conditions, many savvy IT leaders are looking for ways to effectively share, search, and manage the wealth of XML documents and messages that their firms are generating. DB2's new XML support is designed to help firms minimize the time and effort it takes to persist and use their XML data. This, in turn, can reduce development costs and improve business agility.
For example, the article "Use DB2 native XML with PHP" (developerWorks, October 2005) illustrates how DB2's new XML support reduces the complexity of the database design and application code required to support an e-commerce Web site. Similarly, "Managing XML for Maximum Return" (IBM, November 2005) discusses early customer experiences involving comparative test scenarios that also indicate potential labor savings and improved cycle time.
What's behind these benefits? DB2 enables users to store XML documents intact with full DBMS knowledge of the document's internal structure. This eliminates or minimizes administrative and programming tasks associated with other alternatives. Furthermore, it can speed searches across and within documents, and it enables customers to more readily accommodate changes to business requirements reflected in XML schemas.
DB2 allows client applications to work with both tabular and XML data structures through the query language of their choice -- SQL (including SQL with XML extensions, often referred to as "SQL/XML") or XQuery. As shown in Figure 1, engine-level components within DB2 support queries specified in either language.
Figure 1. The architecture of the new DB2 9 release
To efficiently manage traditional SQL data types and XML data, DB2 includes two distinct storage mechanisms. We'll discuss the pureXML® storage technology shortly. However, it's important to note that the underlying storage mechanism used for a given data type is transparent to the application. In other words, the application doesn't need to explicitly specify which storage mechanism to use or manage physical aspects of storage, such as splitting portions of XML documents across multiple database pages. It simply enjoys the runtime performance benefits of storing and querying data in a format that's efficient for the target data.
Let's delve into the DB2 XML features from a user's point of view.
Collections of XML documents are stored in DB2 tables containing one or more columns of the new XML data type. This enables administrators to use familiar SQL data definition language (DDL) statements to create database objects for persisting their XML data. However, this familiar interface masks the fact that DB2 stores the XML data differently, using new technology to preserve the XML data's hierarchical structure and support efficient searches spanning all or part of the original XML data.
To make it easy for users to integrate traditional forms of business data with XML data, DB2 administrators can create tables that contain columns of both traditional SQL data types and the new XML data type. Here's an example of one such table:
Listing 1. Creating a table with an XML column
create table items (
id int primary key not null,
brandname varchar(30),
itemname varchar(30),
sku int,
srp decimal(7,2),
comments xml
)
|
The first five columns of this table use traditional SQL data types to track information about each item for sale, including its ID number, brand name, item name, stock keeping unit (SKU), and suggested retail price (SRP). A "comments" column contains XML data with feedback customers have shared regarding the item.
Note that the internal structure of the XML data isn't specified when creating a table with an XML column. This is by design. XML documents are self-describing, and their internal structure can vary considerably. DB2's only requirement for storing XML data is that it must be "well formed" -- that is, it must adhere to certain syntax rules specified in the "W3C standard for XML" (see "Resources"). DB2's liberal approach provides users with considerable flexibility and makes it easy to store collections of XML documents that contain different attributes and internal structures due to evolving business requirements or situations where certain information is missing or irrelevant.
However, users who want to ensure that XML data conforms to their own structural rules can instruct DB2 to validate their data prior to storage. This is discussed in greater detail in "XML schemas and validation." Doing so essentially involves creating XML schemas (which are also part of the W3C XML standard) and registering these schemas with DB2. (See "Resources" for more information on XML schemas.)
At this point, you may be wondering how users populate a DB2 table with
XML data. The answer is simple -- they use one of three familiar DB2
mechanisms to do so. SQL INSERT statements as
well as the DB2 IMPORT facility accommodate XML
data in addition to other data types. (DB2
IMPORT issues INSERT
statements behind the scenes.) Load also supports XML inserts, which was added in DB2 9.5. If you're wondering why DB2 only supports
data inserts through SQL and not XQuery, that answer is pretty simple, too
-- the first version of the emerging XQuery standard focuses on database
read activities, not write activities. In the absence of a clearly
accepted standard, IBM opted to offer its users two familiar means for
persisting new XML data. (See "Resources" for
more information on the XQuery standard.)
As a practical matter, most users won't need to concern themselves with DB2's new physical storage management architecture for XML data. However, to help you understand what DB2 is doing behind the scenes, let's briefly discuss its internal approach to storing XML data.
DB2 stores and manipulates XML data in a parsed format that reflects the hierarchical nature of the original XML document. As such, it uses trees and nodes as its model for storing and processing XML data. If users instruct DB2 to validate their XML data against a registered XML schema prior to storage, DB2 will annotate all nodes in the XML hierarchy with information about the schema types.
Given the earlier definition of an "items" table, let's review a sample
XML document to be stored in that table. As shown in Listing 2 and illustrated in Figure 2, this
XML document contains multiple elements represented in a hierarchy,
including a root Comments element and one or more individual Comment
elements pertaining to a given item. Associated with each comment are a
comment identifier, customer information that may include sub-elements for
the customer's name and email address, the text of the customer's message
or comment, and an indicator of whether or not the customer would like a
reply.
Listing 2. Sample XML document
<Comments>
<Comment>
<CommentID>133</CommentID>
<Customer>
<Name>John Doe</Name>
<Email>johndoe@any_isp.com</Email>
</Customer>
<Message>Heels on shoes wear out too quickly.</Message>
<ResponseRequested>No</ResponseRequested>
</Comment>
. . .
</Comments>
|
Figure 2. Hierarchical representation of XML document in Listing 2
Upon storage, DB2 will preserve the internal structure of this document, converting its tag names and other information into integer values. Doing so helps conserve disk space and also improves the performance of queries that use navigational expressions. For example, DB2 might convert the "Comments" tag in Figure 2 to a "0" upon storage. However, users aren't aware of this internal representation.
Finally, DB2 will automatically split portions of a document—that is, nodes of the document tree—across multiple database pages as needed. Indeed, DB2 can split a collection (or sub-tree) of nodes at any level of the document hierarchy as needed. In such cases, DB2 automatically generates and maintains a "regions" index to provide an efficient means of tracking the physical representation of the entire document. The space where these XML docuements are stored is called XML Data Object Area (XDA).
If the "inline" option is used for the create table and the entire XML document is small enough to fit into the page, then it would be stored along with the relational data and not in the XDA area. If, however, the document is too large, then DB2 will automatically split the document as described above. The inline option was introduced in DB2 9.5. Also introduced in 9.5 is the ability to say "compress yes" during table creation or alteration and have not only the relational data compressed, but also the XML inlined data as well. In fact, DB2 went a step further in 9.7—setting compression on a table in 9.7 also compresses the XML documents in the XDA area.
Along with new hierarchical storage management support for XML, DB2
features new indexing technology to speed searches involving XML data.
Like their relational counterparts, these new XML indexes are created with
a familiar SQL DDL statement: CREATE INDEX.
However, in addition to specifying the target column to index, users also
specify an "xmlpattern" -- essentially, an XPATH expression without
predicates -- to identify the subset of the XML document of interest.
For example, using the earlier "items" table definition and the
corresponding sample XML document shown in Listing
2, an administrator might issue the following statement (see Listing 3) to index
all comment identifiers ("CommentID" values) contained in the "comments"
column. Recall that the CommentID element in our sample document is a
child of the Comment element, which itself is a child of the root
Comments element.
Listing 3. Creating an index for an XML column
create index myindex on items(comments) generate key
using xmlpattern '/Comments/Comment/CommentID' as sql double
|
A few details are worth noting. The path specified in the "xmlpattern"
clause is case specific. Thus, "/Comments/Comment/CommentID" will not
index the same XML element values as "/comments/comment/commentid."
Furthermore, because DB2 doesn't require a single XML schema for a given
XML column, DB2 may not know what data type to associate with the
specified pattern. Users must specify the data type explicitly using one
of the supported SQL types (VARCHAR,
VARCHAR HASHED,
DOUBLE, DATE, and
TIMESTAMP).
Finally, although an SQL DDL statement is used to create an XML index, an index over XML data isn't the same as an index over columns of traditional SQL data types. While details of DB2's XML indexing technology are beyond the scope of this article, you may have noticed two significant differences:
- Indexes on XML data typically involve only a subset of the document's (column's) contents. By contrast, indexes on traditional SQL data always involve the entire column's content.
- A single row in a table may result in multiple XML index entries because a single XML document may contain zero, one, or many "nodes" that match the specified xmlpattern. By contrast, a non-XML index contains one entry for each row in the table.
For certain applications, full-text search can be critical. IBM has
extended DB2's previous text search capabilities to include data stored in
XML columns. Extensions to the CREATE INDEX
statement enable administrators to create full text indexes to help
improve the performance of such searches.
Query language and optimization
DB2's pureXML support includes new query language capabilities. Programmers can now search for data using SQL or XQuery, a new query language that supports navigational (or path-based) expressions. Indeed, applications can freely employ statements from both query languages, and a single query statement can actually incorporate both SQL and XQuery.
We don't have time to explore the breadth and depth of these capabilities in this article, so let's just discuss a few highlights. If you're an SQL programmer with no prior XML experience, you'll be relieved to learn that a simple SQL statement will enable you to retrieve the contents of data stored in XML columns. For example, these two familiar queries will return all the data in the 'items' table related to a specific stock keeping unit (SKU), including XML documents with customer comments:
Listing 4. Querying XML data with SQL
select * from items where sku = 112233
select id, brandname, itemname, sku, srp, comments from items
where sku = 112233
|
Now let's consider a slightly different situation, in which you want to retrieve only the messages contained within customer comments of the "items" table, and you want to do so using XQuery. Here's perhaps the simplest way to formulate the statement:
Listing 5. Querying XML data with XQuery
xquery db2-fn:xmlcolumn('ITEMS.COMMENTS')/Comments/Comment/Message
|
Because DB2 supports two query languages, users must prefix XQuery statements with the keyword "xquery." The "db2-fn:xmlcolumn" function is one way to specify the target data to be queried. It requires a parameter specifying the XML column of the desired table -- in this case, the COMMENTS column of the ITEMS table. You've further restricted your target data to a specific subset of XML data -- namely, values of the "'Message" element, which is a child of the "Comment" element, which itself is a child of the root "Comments" element. (See Figure 2.)
The same query can be formulated using FLWOR
expressions commonly associated with XQuery statements.
FLWOR expressions -- an informal way of
referring to for,
let, where,
order by, and return
clauses -- enable query programmers to iterate over groups of nodes within
XML documents and to bind variables to intermediate results. For this
sample query, you can use for and
return expressions to retrieve messages from
customer comments, as shown here:
Listing 6. Using
FOR and RETURN clauses of XQuery
xquery for $y in db2-fn:xmlcolumn('ITEMS.COMMENTS')/Comments/Comment
return ($y/Message)
|
It's worth noting that IBM Data Studio V2.2.0.1 (the no-charge replacement for Control Center) and Optim Development Studio V2.2, both Eclipse-based development tools, include a graphical XQuery builder to help users generate and test queries.
Both SQL and XQuery can be combined in a single statement to restrict searches for both XML and non-XML columns. For example, consider the following XQuery statement:
Listing 7. Combining SQL and XQuery in one statement
xquery db2-fn:sqlquery('select comments from items
where srp > 100')/Comments/Comment/Message
|
The db2-fn:sqlquery clause restricts the input
to the broader XQuery statement; specifically, only customer comments
associated with items carrying a suggested retail price
(srp) of more than $100 are passed as input.
Further XQuery information indicates that DB2 should only return the
"Message" portions of such comments.
A number of papers and Web sites can help you get up to speed on DB2's SQL/XML extensions, DB2's support for XQuery, and the emerging XQuery standard. (See "Resources" for pointers.)
Finally, this discussion of query languages may leave you wondering about a related topic: query optimization. DB2 has two query language parsers: One for XQuery and one for SQL. Both generate a common, language-neutral, internal representation of queries. This means that queries written in either language are afforded the full benefits of DB2's cost-based query optimization techniques, which include efficient rewriting of query operators and selection of a low-cost data access plan. In addition, DB2 can leverage new query and join operators, as well as new index processing techniques, to provide strong runtime performance for queries involving XML documents.
The flexible nature of XML sometimes concerns database professionals who worry about data quality. As we've already noted, DB2 enables users to store any well-formed XML document in any column defined on the new XML data type. Thus, a single column can contain documents with different structures (or schemas) as well as different content. When the nature of the data to be captured is unclear or difficult to predict, such flexibility can be an absolute necessity. But in other cases, it can be a liability. That's why DB2 gives users the option of registering their XML schemas and instructing DB2 to validate XML documents against these schemas prior to storage.
If you're not familiar with XML schemas, they're simply well-formed XML documents that dictate the structure and content of other documents. For example, XML schemas specify what elements are valid, the order in which these elements should appear in a document, what XML data types are associated with each element, and so on. Various tools can help you create XML schemas from existing XML documents, including Rational® products.
Users can elect to store different XML documents that correspond to different registered schemas within a single column. This is significant because evolving business needs can impact the structure and content of XML data that needs to be captured. Considering our earlier "items" table, imagine that several months after this table was deployed you decided to capture additional information in the XML column, such as more customer contact information, a record of the actions taken in response to certain comments, and so on. DB2 can accommodate these new enhancements without forcing users to change the table's structure or any applications that rely on it. Indeed, existing data (based on an "old" XML schema) can remain in place, and new data can be added that complies with one or more new schemas. In this way, administrators can support new business requirements with minimal deployment time and cost. Furthermore, they don't need to compromise the integrity of their XML to do so -- they can simply supply DB2 with new information about what's "valid" for their XML data.
Registering an XML schema in DB2's internal repository is simple. DB2 provides stored procedures to automate the process, or administrators can manually issue corresponding commands. A single schema can be used to validate multiple XML columns in multiple tables, if desired.
DB2's new support for XML includes extensions to familiar tools and
utilities to help administrators manage and tune their databases. For
example, backup and restore facilities -- including high availability data
replication for failover situations -- all support documents stored in XML
columns. Similarly, extensions to the IMPORT
and EXPORT facilities now operate on both
traditional SQL and XML data. Thus, you could issue a single
IMPORT command to populate the entire "items"
table (reading XML data from native files) and issue a single
EXPORT command to write all the table's data to
external files. In DB2 9.5, the extensions were added to the
LOAD facility so that you can LOAD large
quantities of XML documents.
In addition, DB2's graphical administrative tool, the DB2 Control Center, as well as its no-charge replacement, IBM Data Studio, enables administrators to browse tables containing XML data, create and manage XML-based indexes, issue SQL/XML and XQuery statements, and perform a number of other administrative tasks. Since performance is often a key concern, appropriate DB2 facilities have also been extended to accommodate XML data. These include the DB2 Snapshot Monitor, which provides a point-in-time summary or "snapshot" of DB2 activities; RUNSTATS, which collects statistics about the nature of the data stored in a DB2 database; and EXPLAIN, which reports on which access path the query optimizer selected to satisfy a given request. Examining EXPLAIN output can enable an administrator to determine which XML indexes are being used.
Programming language extensions
DB2's new XML support wouldn't be very useful if the XML stored in its databases wasn't readily accessible to programmers. Recognizing this, IBM implemented enhancements to its various programming language interfaces to support easy access to its XML data. These enhancements span Java™ (JDBC), C (embedded SQL and call-level interface), COBOL (embedded SQL), PHP, and Microsoft®'s .NET environment (through the DB2.NET provider).
Because the application programming interface (API) varies according to the programming language in use, this article does not review each of these extensions here. However, you can read a summary of these extensions in a conference paper entitled "Native XML Support in DB2 Universal Database" (see "Resources") or read the article "Use DB2 native XML with PHP."
With the release of DB2 9.7, XML support has been extended to a data warehouse environment. DPF (data partitioning feature) will now fully support XML columns. This allows the XML operations of parsing, navigation, transformation, serialization, and construction now to be parallelized for performance benefits. You can now create or alter any table to add an XML column from any partition, create indexes from any partition, use the XML validation commands, use SQL/XML or XQuery, or load large volumes of XML data in parallel. Starting with DB2 9.7, partitioned tables can include XML data and can benefit from the easy roll-in and roll-out of data provided by the table partitioning functionality. In addition to queries over relational data, queries over XML data can also benefit from the performance advantage of partition elimination.
Multidimensional clustering (MDC) tables now support XML data. MDC provides a method for clustering data in tables along multiple dimensions. MDC tables can significantly improve query performance and reduce the overhead of data maintenance operations such as reorganizing data, inserting data, and deleting data. You can now create MDC tables that contain one or more XML columns, add XML columns to MDC tables, and create indexes over XML data in MDC tables. Queries can use both indexes over XML data and MDC indexes to enhance performance.
DB2 9 is the first IBM implementation of a "hybrid" or multi-structured database management system. In addition to supporting a tabular data model, DB2 also supports the native hierarchical data model found in XML documents and messages. Users can freely mix and match storage of traditional SQL data and XML in a single table. They can also query and integrate both forms of data using SQL (with XML extensions, if desired) and XQuery, the emerging standard for querying XML data. By building on a proven database management infrastructure, IBM is providing DB2 users with sophisticated support for both relational and native XML DBMS technologies.
Thanks to Grant Hutchison, Matthias Nicola, and Gary Robinson for reviewing this article.
Learn
-
IBM DB2 e-kit for Database Professionals: Grow your skills, and quickly and easily become certified for DB2 for Linux, UNIX, and Windows.
-
DB2 pureXML Web
site: Learn more about DB2's pureXML support.
-
DB2
pureXML enablement wiki: This is the home page for all things
pureXML. Here you will find categories for more articles, success stories,
pureXML partners, our pureXML forum, and our devotees site for those who
wish to stay abreast of the latest pureXML happenings. Stay up to date
with the latest success stories, articles, education, and more on DB2
pureXML.
-
DB2 pureXML devotee site: You can sign up to be on our pureXML
devotee sites, and attend sessions given by experts and customers that
delve deeper into pureXML functions and solutions.
-
W3C standard for XML: XML data
must be "well formed" -- it must adhere to certain syntax rules specified
in the W3C standard for XML -- to be stored in DB2.
- XML schemas: Also part of the W3C XML standard, XML schemas provide a means for defining the structure, content, and semantics of XML documents.
- XQuery standard: XQuery is a query language that uses the structure of XML intelligently to express queries across many types of XML data sources.
-
System RX:
One Part Relational, One Part XML (SIGMOD 2005 Conference Web
site): Written by 14 IBM researchers and developers, this paper describes
the overall architecture and design aspects of a hybrid relational and XML
database system called System RX..
- "Firing up the Hybrid Engine"(IBM Database Magazine,
Quarter 3, 2005): IBM's new hybrid DB2 puts the full power of a relational
engine to work on a truly native XML store that sits side by side with
DB2's relational data repository.
- "The IBM approach to unified XML/relational databases" (IBM, March
2005): This white paper discusses how IBM provides a truly native unified
XML/relational database, supporting the XML data model from the client
through the database down to the disk and back again.
- "DB2
Basics: An introduction to the SQL/XML publishing functions"
(developerWorks, November 2005): This article introduces you to the
SQL/XML publishing functions, and shows you, by way of working examples,
how to use these functions.
- "Native XML Support in DB2 Universal Database" (Proceedings of the
31st VLDB Conference, 2005) provides a summary of the programming language
extensions.
- "Managing XML for Maximum Return" (IBM, November 2005): Learn
about the business benefits of DB2's XML support in this white paper.
- "Use
DB2 native XML with PHP" (developerWorks, October 2005):
Learn about the effectiveness of using the native XML capabilities of DB2
UDB for Linux, UNIX, and Windows to simplify application code and the
relational schemas..
-
XQuery Tutorial:
Learn the fundamentals of the XQuery language.
- "XQuery from the Experts: Influences on the design of XQuery"
(developerWorks, September 2003): IBM's own XQuery pioneer Don Chamberlin
discusses the emergence of the XQuery language -- specifically, the need
for a query language for XML data, and the basic principles behind it.
This excerpt is from Chapter 2 of the released Addison-Wesley book
XQuery from the Experts.
- "An
Introduction to XQuery" (developerWorks, November 2005): Learn
about the origins of the language and related XML standards. Howard Katz
introduces the W3C's XQuery specification, currently winding its way
toward Recommendation status.
- Stay current with developerWorks technical events and webcasts.
-
developerWorks
Information Management zone: Learn more about DB2. Find technical
documentation, how-to articles, education, downloads, product information,
and more.
- Learn about DB2
Express-C, the no-charge version of DB2 Express Edition for the
community.
Get products and technologies
- Build your next
development project with IBM trial software, available for download directly from
developerWorks.
- Now you can use
DB2 for free. Download DB2
Express-C, a no-charge version of DB2 Express Edition for the
community that offers the same core data features as DB2 Express Edtion
and provides a solid base to build and deploy applications.
Discuss
- Participate in the discussion forum.
- Participate in developerWorks
blogs and get involved in the developerWorks community.




