Relational databases are, by their nature, more flexible than hierarchical data storage structures such as XML. Many relationships that are simple to model in relational databases (such as the relationship between invoices and parts in a shipping system) turn out to be fairly difficult to model in XML. In this column, I'll take a look at a typical many-to-many modeling challenge, and go through some options you have when creating an XML model for that information.
A typical modeling puzzle
If you have some experience modeling data for relational databases, you know that many-to-many relationships between different relational entities appear all the time. This column uses a common example as its starting point: invoices and parts in a shipping system. This is the classic example of a many-to-many relationship: Invoices may include many parts, and each part may appear on many invoices. Additionally, you may have other information associated with the relationship itself that you need to model. For example, when a part appears on an invoice, it will typically have a quantity and price associated with it. To model this information in a relational database, you might create the following design:
Invoice InvoiceID CustomerName InvoiceDate ShippingMethod InvoiceAmount Part PartID PartCode PartDescription InvoicePart InvoiceID PartID Price Quantity
This proves to be very flexible: If you want to discover what parts appear on a particular invoice, or how many invoices contain a particular part, you can simply write join queries (using the primary and foreign keys) to return the information that interests you. However, you'll start to run into a snag when trying to model this information in XML. You might have an
Invoice element that contains
InvoicePart elements, but then what? If you include the part information in the
InvoicePart element, it will be repeated for every invoice containing that part! Similarly, if you choose to have a
Part element that contains
InvoicePart elements, where is the
Invoice data represented?
The key to modeling this type of relational data in XML is to understand all the tools you have at your disposal, and apply common sense to determine what will work best in a particular case. Let's take a look at some options for modeling this information in XML.
The modeling toolset
XML provides a couple of different mechanisms for representing relationships between elements. The most commonly used mechanism is the parent-child relationship. This can be used to represent a one-to-one or one-to-many relationship between elements. However, when you try to represent a many-to-many relationship, this mechanism is insufficient, as each element may only have a single parent element.
Relationships in XML can also be represented with
ID-IDREF(S) attributes. Using these attributes, an element may refer to one or more other elements (by including the value of those elements' ID fields in the pointing element's own
IDREFS field). While this may seem to be directly analogous to a relational database's key mechanisms, there's one important difference: Most parsers treat these pointers as unidirectional. In other words, given an
IDREFS field, it's possible to quickly find the element or elements with the associated ID or IDs, but not the other way around. As you'll see when I discuss modeling solutions, this turns out to be a real impediment to design.
Armed with these two relational modeling tools, let's now take a look at how you can model your invoice and part information in XML.
Some possible solutions
There are several ways you can attack this modeling problem in your XML document. As with any data modeling, it pays to think about not only the most efficient way to represent the information in the document, but also who the audience for the document is and how the document will be used.
Discard the information
The easiest solution to the problem is to simply discard some of the information that would normally need to be linked through a many-to-many relationship. This call is typically made based on the expected audience for the document. If you are sending a summary of invoices to a customer for billing purposes, for example, you may not need to include the part information. Similarly, if you are creating a document that is going to be used as a parts database, the specifics of how the part has been ordered on invoices may not be important.
Eliminate the many-to-many relationship
Another possibility is to limit the scope of the document so that the many-to-many relationship disappears. In this scenario, rather than trying to have one document that encompasses all the information in the database, the document is scoped to describe one of the elements that participates in the many-to-many relationship. For example, we might decide to create one XML document per invoice. In this case, the following XML document suffices:
<invoice customerName="John Q. Anybody" invoiceDate="1/7/2002" shippingMethod="UPS" invoiceAmount="29.55"> <part partCode="X1Y23" partDescription="Grommet, steel, 3-inch" price="0.25" quantity="72" /> <part partCode="Y2Z29" partDescription="Sprocket, brass, 2-inch" price="0.35" quantity="33" /> </invoice>
Note that you've combined the information from the
InvoicePart tables together into one element --
part -- since the relationship between
InvoicePart is now is now one-to-one for the purposes of the document.
This approach makes it easy to create XML documents that describe a subset of our data, however, it comes at the price of flexibility. A programmer attempting to summarize part orders for a particular invoice date would have to parse many documents and aggregate the data manually -- an unenviable task. However, this approach can often produce the best results for specific targeted purposes.
Use a single IDREF relationship
So what happens if you need to retain all the nuances of the information without losing the relationship
between the data points? The obvious solution is to use a single
IDREF relationship to point the relating element back to the element it needs to reference. For example, you might create a structure that looks
something like this:
<shippingData> <invoice customerName="John Q. Anybody" invoiceDate="1/7/2002" shippingMethod="UPS" invoiceAmount="29.55"> <invoicePart partIDREF="X1Y23" price="0.25" quantity="72" /> <invoicePart partIDREF="Y2Z29" price="0.35" quantity="33" /> </invoice> <invoice customerName="Michael X. Somebody" invoiceDate="1/8/2002" shippingMethod="FedEx" invoiceAmount="22.00"> <invoicePart partIDREF="X1Y23" price="0.25" quantity="88" /> </invoice> <part partID="X1Y23" partDescription="Grommet, steel, 3-inch" /> <part partID="Y2Z29" partDescription="Sprocket, brass, 2-inch" /> </shippingData>
Note that even though you have two invoices that include grommets, the specifics about grommets are included
only once in the actual document. This allows you to decrease document size while still retaining all of the
data richness you need. However, this document size reduction comes at a price: Parsing the document becomes
significantly more difficult, especially if a streaming parser such as SAX is being used. Also, the one-way
IDREF pointers becomes a problem as well: This document works fine if you want to summarize the parts for a particular invoice, but what if you want to summarize the invoices for a particular part? Because you
can't easily navigate from an
ID to an
IDREF, this document is not well suited to that application. This design represents the classic solution to the many-to-many problem. And, if you're willing to live with a little more document size overhead, you can actually improve upon it (from a flexibility perspective).
Use a double IDREF relationship
In this design, there are two different
ID relationships: one pointing from the relating element to the non-parent participant in the relationship, and the other pointing from the non-parent participant in the relationship to the relating element (this will need to be an
IDREFS attribute). Here's our same example, designed using the double pointers:
<shippingData> <invoice customerName="John Q. Anybody" invoiceDate="1/7/2002" shippingMethod="UPS" invoiceAmount="29.55"> <invoicePart invoicePartID="IP1" partIDREF="X1Y23" price="0.25" quantity="72" /> <invoicePart invoicePartID="IP2" partIDREF="Y2Z29" price="0.35" quantity="33" /> </invoice> <invoice customerName="Michael X. Somebody" invoiceDate="1/8/2002" shippingMethod="FedEx" invoiceAmount="22.00"> <invoicePart invoicePartID="IP3" partIDREF="X1Y23" price="0.25" quantity="88" /> </invoice> <part invoicePartIDREFS="IP1 IP3" partID="X1Y23" partDescription="Grommet, steel, 3-inch" /> <part invoicePartIDREFS="IP2" partID="Y2Z29" partDescription="Sprocket, brass, 2-inch" /> </shippingData>
Here, you have the ability to navigate the many-to-many relationship in either direction. If you wanted to summarize the invoices for a particular part, you would simply navigate the
invoicePartIDREFS attribute back to the corresponding
invoicePart elements, and from there to the invoice parent element. While this method still makes parsing difficult, most parsers will be able to more easily traverse the new pointing relationship than the old one. This makes this document slightly larger, but the most flexible of any of the examples in this column.
So which solution should you use? It all depends on the intended audience of the document. If the document has a known purpose (for example, it's generated on the fly for styling), then you can take any of the shortcuts I have described to make the document easier to manipulate for its intended purpose. If the document is intended to be a transitional document only (i.e., the information will be pulled into a relational model before it is used), then you can get away with less flexible structures. However, if the document is intended to be used as the data of record for a particular subset of information, you'll want to make it as easy as possible for those accessing the document to extract the information -- and the relationships between the information.
- For the nitty gritty of relational design, take a look at Professional XML Databases, by Wrox Press.
- Learn about
IDREFSelements in the W3C spec.
- Download XML parsers for the DOM (document-at-once) and SAX (streamed) from IBM alphaWorks.
- IBM's DB2 Extender page gives a basic overview of how DB2 works with XML, with links to a detailed white paper on querying with XML, viewable as a PDF file, and to DB2 Extender downloads.
- Need some detail about working with XML and IBM's DB2 and WebSphere Application Server? The IBM Redbook Integrating XML with DB2 XML Extender and DB2 Text Extender shows how to use XML technology efficiently in business applications, and explains how to integrate it with DB2 Universal Database, DB2 XML Extender and Text Extender, and WebSphere Application Server. This book will help developers to set up the environment and to create and process XML documents that can be stored and recovered using SQL.
- To find out more about IBM's data management platform for e-business, visit Information Management, here on developerWorks.
- Find other articles in Kevin William's XML for Data column.