XML for Data: Modeling many-to-many relationships

Tips and techniques to help you create more flexible XML models

In this column, Kevin Williams takes a look at some options for modeling many-to-many relationships in XML. Several different techniques, and the advantages and disadvantages of each, are discussed. Examples are provided in XML.

Kevin Williams (kevin@blueoxide.com), CEO, Blue Oxide Technologies, LLC

Kevin Williams is the CEO of Blue Oxide Technologies, LLC, a company that designs software tools that help companies take advantage of the service-oriented Internet. Visit their Web site at http://www.blueoxide.com. Kevin can be reached for comment at kevin@blueoxide.com.



01 January 2002

Also available in Japanese

Relational databases are, by their nature, more flexible than hierarchical data storage structures such as XML. Many relationships that are simple to model in relational databases (such as the relationship between invoices and parts in a shipping system) turn out to be fairly difficult to model in XML. In this column, I'll take a look at a typical many-to-many modeling challenge, and go through some options you have when creating an XML model for that information.

A typical modeling puzzle

If you have some experience modeling data for relational databases, you know that many-to-many relationships between different relational entities appear all the time. This column uses a common example as its starting point: invoices and parts in a shipping system. This is the classic example of a many-to-many relationship: Invoices may include many parts, and each part may appear on many invoices. Additionally, you may have other information associated with the relationship itself that you need to model. For example, when a part appears on an invoice, it will typically have a quantity and price associated with it. To model this information in a relational database, you might create the following design:

Invoice
  InvoiceID
  CustomerName
  InvoiceDate
  ShippingMethod
  InvoiceAmount

Part
  PartID
  PartCode
  PartDescription

InvoicePart
  InvoiceID
  PartID
  Price
  Quantity

This proves to be very flexible: If you want to discover what parts appear on a particular invoice, or how many invoices contain a particular part, you can simply write join queries (using the primary and foreign keys) to return the information that interests you. However, you'll start to run into a snag when trying to model this information in XML. You might have an Invoice element that contains InvoicePart elements, but then what? If you include the part information in the InvoicePart element, it will be repeated for every invoice containing that part! Similarly, if you choose to have a Part element that contains InvoicePart elements, where is the Invoice data represented?

The key to modeling this type of relational data in XML is to understand all the tools you have at your disposal, and apply common sense to determine what will work best in a particular case. Let's take a look at some options for modeling this information in XML.


The modeling toolset

XML provides a couple of different mechanisms for representing relationships between elements. The most commonly used mechanism is the parent-child relationship. This can be used to represent a one-to-one or one-to-many relationship between elements. However, when you try to represent a many-to-many relationship, this mechanism is insufficient, as each element may only have a single parent element.

Relationships in XML can also be represented with ID-IDREF(S) attributes. Using these attributes, an element may refer to one or more other elements (by including the value of those elements' ID fields in the pointing element's own IDREF or IDREFS field). While this may seem to be directly analogous to a relational database's key mechanisms, there's one important difference: Most parsers treat these pointers as unidirectional. In other words, given an IDREF or IDREFS field, it's possible to quickly find the element or elements with the associated ID or IDs, but not the other way around. As you'll see when I discuss modeling solutions, this turns out to be a real impediment to design.

Armed with these two relational modeling tools, let's now take a look at how you can model your invoice and part information in XML.


Some possible solutions

There are several ways you can attack this modeling problem in your XML document. As with any data modeling, it pays to think about not only the most efficient way to represent the information in the document, but also who the audience for the document is and how the document will be used.

Discard the information

The easiest solution to the problem is to simply discard some of the information that would normally need to be linked through a many-to-many relationship. This call is typically made based on the expected audience for the document. If you are sending a summary of invoices to a customer for billing purposes, for example, you may not need to include the part information. Similarly, if you are creating a document that is going to be used as a parts database, the specifics of how the part has been ordered on invoices may not be important.

Eliminate the many-to-many relationship

Another possibility is to limit the scope of the document so that the many-to-many relationship disappears. In this scenario, rather than trying to have one document that encompasses all the information in the database, the document is scoped to describe one of the elements that participates in the many-to-many relationship. For example, we might decide to create one XML document per invoice. In this case, the following XML document suffices:

<invoice
  customerName="John Q. Anybody"
  invoiceDate="1/7/2002"  
  shippingMethod="UPS"
  invoiceAmount="29.55">
  <part
    partCode="X1Y23"
    partDescription="Grommet, steel, 3-inch"
    price="0.25"
    quantity="72" />
  <part        
    partCode="Y2Z29"
    partDescription="Sprocket, brass, 2-inch"
    price="0.35"
    quantity="33" />
</invoice>

Note that you've combined the information from the Part and InvoicePart tables together into one element -- part -- since the relationship between Part and InvoicePart is now is now one-to-one for the purposes of the document.

This approach makes it easy to create XML documents that describe a subset of our data, however, it comes at the price of flexibility. A programmer attempting to summarize part orders for a particular invoice date would have to parse many documents and aggregate the data manually -- an unenviable task. However, this approach can often produce the best results for specific targeted purposes.

Use a single IDREF relationship

So what happens if you need to retain all the nuances of the information without losing the relationship between the data points? The obvious solution is to use a single IDREF relationship to point the relating element back to the element it needs to reference. For example, you might create a structure that looks something like this:

<shippingData>
  <invoice
    customerName="John Q. Anybody"
    invoiceDate="1/7/2002"  
    shippingMethod="UPS"
    invoiceAmount="29.55">
    <invoicePart
      partIDREF="X1Y23"
      price="0.25"
      quantity="72" />
    <invoicePart        
      partIDREF="Y2Z29"
      price="0.35"
      quantity="33" />
  </invoice>
  <invoice
    customerName="Michael X. Somebody"
    invoiceDate="1/8/2002"  
    shippingMethod="FedEx"
    invoiceAmount="22.00">
    <invoicePart
      partIDREF="X1Y23"
      price="0.25"
      quantity="88" />
  </invoice>
  <part
    partID="X1Y23"
    partDescription="Grommet, steel, 3-inch" />
  <part
    partID="Y2Z29"
    partDescription="Sprocket, brass, 2-inch" />
</shippingData>

Note that even though you have two invoices that include grommets, the specifics about grommets are included only once in the actual document. This allows you to decrease document size while still retaining all of the data richness you need. However, this document size reduction comes at a price: Parsing the document becomes significantly more difficult, especially if a streaming parser such as SAX is being used. Also, the one-way nature of IDREF pointers becomes a problem as well: This document works fine if you want to summarize the parts for a particular invoice, but what if you want to summarize the invoices for a particular part? Because you can't easily navigate from an ID to an IDREF, this document is not well suited to that application. This design represents the classic solution to the many-to-many problem. And, if you're willing to live with a little more document size overhead, you can actually improve upon it (from a flexibility perspective).

Use a double IDREF relationship

In this design, there are two different IDREF-to-ID relationships: one pointing from the relating element to the non-parent participant in the relationship, and the other pointing from the non-parent participant in the relationship to the relating element (this will need to be an IDREFS attribute). Here's our same example, designed using the double pointers:

<shippingData>
  <invoice
    customerName="John Q. Anybody"
    invoiceDate="1/7/2002"  
    shippingMethod="UPS"
    invoiceAmount="29.55">
    <invoicePart
      invoicePartID="IP1"
      partIDREF="X1Y23"
      price="0.25"
      quantity="72" />
    <invoicePart        
      invoicePartID="IP2"
      partIDREF="Y2Z29"
      price="0.35"
      quantity="33" />
  </invoice>
  <invoice
    customerName="Michael X. Somebody"
    invoiceDate="1/8/2002"  
    shippingMethod="FedEx"
    invoiceAmount="22.00">
    <invoicePart
      invoicePartID="IP3"
      partIDREF="X1Y23"
      price="0.25"
      quantity="88" />
  </invoice>
  <part
    invoicePartIDREFS="IP1 IP3"
    partID="X1Y23"
    partDescription="Grommet, steel, 3-inch" />
  <part
    invoicePartIDREFS="IP2"
    partID="Y2Z29"
    partDescription="Sprocket, brass, 2-inch" />
</shippingData>

Here, you have the ability to navigate the many-to-many relationship in either direction. If you wanted to summarize the invoices for a particular part, you would simply navigate the invoicePartIDREFS attribute back to the corresponding invoicePart elements, and from there to the invoice parent element. While this method still makes parsing difficult, most parsers will be able to more easily traverse the new pointing relationship than the old one. This makes this document slightly larger, but the most flexible of any of the examples in this column.


Conclusion

So which solution should you use? It all depends on the intended audience of the document. If the document has a known purpose (for example, it's generated on the fly for styling), then you can take any of the shortcuts I have described to make the document easier to manipulate for its intended purpose. If the document is intended to be a transitional document only (i.e., the information will be pulled into a relational model before it is used), then you can get away with less flexible structures. However, if the document is intended to be used as the data of record for a particular subset of information, you'll want to make it as easy as possible for those accessing the document to extract the information -- and the relationships between the information.

Resources

  • For the nitty gritty of relational design, take a look at Professional XML Databases, by Wrox Press.
  • Learn about IDREF and IDREFS elements in the W3C spec.
  • Download XML parsers for the DOM (document-at-once) and SAX (streamed) from IBM alphaWorks.
  • IBM's DB2 Extender page gives a basic overview of how DB2 works with XML, with links to a detailed white paper on querying with XML, viewable as a PDF file, and to DB2 Extender downloads.
  • Need some detail about working with XML and IBM's DB2 and WebSphere Application Server? The IBM Redbook Integrating XML with DB2 XML Extender and DB2 Text Extender shows how to use XML technology efficiently in business applications, and explains how to integrate it with DB2 Universal Database, DB2 XML Extender and Text Extender, and WebSphere Application Server. This book will help developers to set up the environment and to create and process XML documents that can be stored and recovered using SQL.
  • To find out more about IBM's data management platform for e-business, visit Information Management, here on developerWorks.
  • Find other articles in Kevin William's XML for Data column.

Comments

developerWorks: Sign in

Required fields are indicated with an asterisk (*).


Need an IBM ID?
Forgot your IBM ID?


Forgot your password?
Change your password

By clicking Submit, you agree to the developerWorks terms of use.

 


The first time you sign into developerWorks, a profile is created for you. Information in your profile (your name, country/region, and company name) is displayed to the public and will accompany any content you post, unless you opt to hide your company name. You may update your IBM account at any time.

All information submitted is secure.

Choose your display name



The first time you sign in to developerWorks, a profile is created for you, so you need to choose a display name. Your display name accompanies the content you post on developerWorks.

Please choose a display name between 3-31 characters. Your display name must be unique in the developerWorks community and should not be your email address for privacy reasons.

Required fields are indicated with an asterisk (*).

(Must be between 3 – 31 characters.)

By clicking Submit, you agree to the developerWorks terms of use.

 


All information submitted is secure.

Dig deeper into XML on developerWorks


static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=XML
ArticleID=12064
ArticleTitle=XML for Data: Modeling many-to-many relationships
publish-date=01012002