One of the biggest criticisms that traditional database experts level at XML is that its hierarchical nature encourages the sort of repetitiveness that relational normalization would banish. This is certainly a valid complaint, and the key to XML's success is that its flexibility and convenience outweigh this failing. (Of course, database purists say that XML's advantages only appear to outweigh its problems to the less rigorous.) In this tip, I offer a couple of techniques that can reduce such repetition in certain cases. However, they are not a general solution to the problem of XML's hierarchical limitations.
Sometimes repetitive data occurs when data can be reused, but reuse is not required. A good example of this is the billing and shipping addresses of a business partner. Listing 1 is a sample customer record that includes such addresses.
Listing 1. A sample customer record
<customer>
  <name>Bards, Inc.</name>
  <billing-address>1000 Lay Way, Burgh, UK</billing-address>
  <shipping-address>1000 Lay Way, Burgh, UK</shipping-address>
  <phone>606-217-8899</phone>
  <email>firstname.lastname@example.org</email>
</customer>
In this case, billing-address and shipping-address are the same string. Imagine that this file is the result of a form that was filled out. The data may be entered into two separate fields even though it's the same value, which is a well-known recipe for data inconsistency errors. For this reason, many such forms offer a check box so that one can enter just the billing address and have the shipping address match it automatically. One can do the same sort of thing in the XML data if the vocabulary allows it. XML 1.0 makes this possible through the use of ID types. Listing 2 offers an example of this.
Listing 2. A customer record format that uses ID types to avoid repetition
<!DOCTYPE customer [
  <!ELEMENT customer (name, billing-address, shipping-address, phone, email)>
  <!ELEMENT billing-address (#PCDATA)>
  <!ATTLIST billing-address id ID #IMPLIED>
  <!ELEMENT shipping-address (#PCDATA)>
  <!ATTLIST shipping-address ref IDREF #IMPLIED>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT phone (#PCDATA)>
  <!ELEMENT email (#PCDATA)>
]>
<customer>
  <name>Bards, Inc.</name>
  <billing-address id="x">1000 Lay Way, Burgh, UK</billing-address>
  <shipping-address ref="x"/>
  <phone>606-217-8899</phone>
  <email>email@example.com</email>
</customer>
In this case, the vocabulary is augmented by allowing the billing-address element to have an optional attribute id, which is defined as a unique ID type. The shipping-address element also gets an optional attribute ref, which is defined as a reference to a unique ID type. For the purposes of this example, I placed the DTD that's needed to assert these attribute types into the internal subset. The processing code then needs to know how to handle the special attributes and properly infer the value of the shipping address.
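As a rough illustration of what such processing code might look like, here is a minimal Python sketch that resolves the shipping address for a record shaped like Listing 2. It rests on assumptions that are not part of the tip: the function name resolve_ref is hypothetical, and because Python's xml.etree.ElementTree does not expose DTD attribute types, the sketch simply treats any attribute named id as the ID and any attribute named ref as the reference.

```python
import xml.etree.ElementTree as ET

# Record shaped like Listing 2 (DTD omitted; see the assumption above).
CUSTOMER_XML = """<customer>
  <name>Bards, Inc.</name>
  <billing-address id="x">1000 Lay Way, Burgh, UK</billing-address>
  <shipping-address ref="x"/>
  <phone>606-217-8899</phone>
</customer>"""


def resolve_ref(root, element):
    """Return the element's own text, or the text of the element its ref names."""
    ref = element.get("ref")
    if ref is None:
        # No reference: the element carries its own value.
        return element.text
    # Assumption: the attribute named "id" plays the role of the DTD ID type.
    for candidate in root.iter():
        if candidate.get("id") == ref:
            return candidate.text
    raise ValueError(f"dangling reference: {ref!r}")


root = ET.fromstring(CUSTOMER_XML)
shipping = resolve_ref(root, root.find("shipping-address"))
print(shipping)  # 1000 Lay Way, Burgh, UK
```

A processor built on a DTD-aware toolkit could instead look up the target by its declared ID type rather than by attribute name.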
Another approach is to use XPath to reference the target value, as in Listing 3.
Listing 3. A customer record format that uses XPath to avoid repetition
<customer>
  <name>Bards, Inc.</name>
  <billing-address>1000 Lay Way, Burgh, UK</billing-address>
  <shipping-address>
    <xpath-ref select="../billing-address"/>
  </shipping-address>
  <phone>606-217-8899</phone>
  <email>firstname.lastname@example.org</email>
</customer>
This time I have added a special element to the vocabulary, xpath-ref, which contains an XPath expression to be evaluated with its parent element as the context node. In this example, it selects the document's billing-address element node, which is presumably then converted to a string. Again, a processor would have to implement this reference, but this XPath method offers more flexibility; for one thing, XPath functions and other expression facilities can be used to select more complex values.
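One way a processor might implement this reference is sketched below in Python. This is an assumption-laden illustration, not a full XPath implementation: the function name resolve_xpath_refs is hypothetical, and the sketch handles only a tiny subset of XPath (the steps "..", ".", and child element names separated by "/"). Expressions that use XPath functions would need a complete XPath engine.

```python
import xml.etree.ElementTree as ET

# Record shaped like Listing 3.
CUSTOMER_XML = """<customer>
  <name>Bards, Inc.</name>
  <billing-address>1000 Lay Way, Burgh, UK</billing-address>
  <shipping-address><xpath-ref select="../billing-address"/></shipping-address>
  <phone>606-217-8899</phone>
</customer>"""


def resolve_xpath_refs(root):
    """Replace each xpath-ref element with the text of the node it selects."""
    # ElementTree keeps no parent pointers, so build a child-to-parent map.
    parents = {child: parent for parent in root.iter() for child in parent}
    # Copy to a list first because the tree is modified inside the loop.
    for ref in list(root.iter("xpath-ref")):
        context = parents[ref]  # the parent element is the context node
        node = context
        for step in ref.get("select").split("/"):
            if step == "..":
                node = parents.get(node)
            elif step != ".":
                node = node.find(step)
            if node is None:
                raise ValueError(f"nothing selected by {ref.get('select')!r}")
        # Convert the selected element to a string by taking its text.
        context.remove(ref)
        context.text = node.text


root = ET.fromstring(CUSTOMER_XML)
resolve_xpath_refs(root)
print(root.find("shipping-address").text)  # 1000 Lay Way, Burgh, UK
```

The sketch replaces the reference in place, so downstream code sees the same shape as Listing 1; a processor could just as well resolve the value lazily and leave the document untouched.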
You should use internal references like this with some care. With the ID method, be sure to maintain the validity of the document, and with the XPath method, watch out for situations where a modification causes the XPath to fail to select the expected result.
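For the ID method, a validating parser checks these constraints against the DTD. Where only a non-validating parser is available (Python's standard library parsers do not validate), a small consistency check along the following lines might help catch broken references after an edit. The function name check_references is hypothetical, and the sketch again assumes that the attributes are literally named id and ref.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def check_references(root):
    """Return a list of problems: duplicate id values and dangling ref values."""
    ids = Counter(e.get("id") for e in root.iter() if e.get("id") is not None)
    problems = [f"duplicate id: {value}" for value, count in ids.items() if count > 1]
    for element in root.iter():
        ref = element.get("ref")
        if ref is not None and ref not in ids:
            problems.append(f"dangling ref: {ref}")
    return problems


good = ET.fromstring(
    '<customer><billing-address id="x">1000 Lay Way, Burgh, UK</billing-address>'
    '<shipping-address ref="x"/></customer>'
)
# Simulates an edit that renamed the id without updating the reference.
bad = ET.fromstring(
    '<customer><billing-address id="y">1000 Lay Way, Burgh, UK</billing-address>'
    '<shipping-address ref="x"/></customer>'
)
print(check_references(good))  # []
print(check_references(bad))   # ['dangling ref: x']
```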
When designing XML vocabularies, try to minimize repetition wherever possible. You can do this many ways, and internal references can be a handy tool in that effort.
- Review and bookmark Tim Bray's excellent Annotated XML 1.0 Specification. Section 3.3.1 covers ID type attributes. Also see the W3Schools DTD tutorial and the Zvon DTD tutorial.
- Learn about XPath in this developerWorks tutorial (October 2001). Also see the W3Schools XPath tutorial and the Zvon XPath tutorial.
Uche Ogbuji is a consultant and co-founder of Fourthought Inc., a software vendor and consultancy specializing in XML solutions for enterprise knowledge management. Fourthought develops 4Suite, an open source platform for XML, RDF, and knowledge-management applications. Mr. Ogbuji is a computer engineer and writer born in Nigeria, living and working in Boulder, Colorado, USA. You can contact Mr. Ogbuji at email@example.com.