
Soapbox: Why XML Schema beats DTDs hands-down for data

A look at some data features of XML Schema

Kevin Williams (kevin@blueoxide.com), Chief XML architect, Equient (a division of Veridian)
Kevin Williams is the chief XML architect for Equient, a division of Veridian specializing in XML design for information management systems. He has also co-authored several books on XML from Wrox Press. He can be reached for comment at kevin@blueoxide.com. Random XML musings, tips, tricks, and opinionated rants may be found at his Web site www.blueoxide.com, which Kevin insists will be up and running "any day now, I swear."

Summary:  In his turn on the Soapbox, info-management developer and author Kevin Williams tells why he's sold on XML Schema for the structural definition of XML documents for data. He looks at four features of XML Schema that are particularly suited to data representation, and he shows some examples of each. Code samples include XSD schemas and schema fragments.


Date:  01 Jun 2001
Level:  Intermediate

As you're no doubt aware, the W3C recently promoted the XML Schema specification to Recommendation status, making that spec the XML structural definition language of choice. While most people find the specifications a little hard to read, the jargon conceals a very strong set of features, especially for those of us who are designing XML structures for data. I'd like to take a look at a few of those features.

Strong typing

Strong typing is probably the biggest advantage XML Schema has over DTDs, and it is the aspect of XML Schema you've heard the most about. In a DTD, you don't have a whole lot of choices for constraining the allowable content of your elements and attributes. For example, an element's content may be described in exactly one of four ways: EMPTY, ANY, element content, or mixed element-and-text content. There's no way to specify that an element's text content must be a valid representation of an integer, or even that the content may not exceed a certain number of characters. The designers of XML Schema remedied this situation with the strong typing available in the specification. The provided simple data types are analogous to pretty much any built-in type you might encounter in a relational or object-oriented database environment. Additionally, XML Schema provides mechanisms to further constrain the allowable content of an element or attribute -- even to the point of setting a valid range of values or defining a regular expression to which the content must conform! One issue I still have with XML schemas is that the specs do not provide a way to strictly govern the order of elements and text appearing in a mixed-content element, but that's a minor gripe at best, and certainly not important when designing structures for data.
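
To make the typing features above concrete, here is a minimal sketch of a schema that constrains a range, a length, and a pattern. The type and element names (quantityType, shortNameType, zipCodeType, and the elements that use them) and the specific limits are hypothetical, chosen only for illustration; the xsd prefix is assumed to be bound to the XML Schema namespace.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <!-- Hypothetical type: an integer restricted to the range 1 through 999 -->
  <xsd:simpleType name="quantityType">
    <xsd:restriction base="xsd:integer">
      <xsd:minInclusive value="1"/>
      <xsd:maxInclusive value="999"/>
    </xsd:restriction>
  </xsd:simpleType>

  <!-- Hypothetical type: a string of at most 30 characters -->
  <xsd:simpleType name="shortNameType">
    <xsd:restriction base="xsd:string">
      <xsd:maxLength value="30"/>
    </xsd:restriction>
  </xsd:simpleType>

  <!-- Hypothetical type: five digits, optionally followed by a hyphen and four more digits -->
  <xsd:simpleType name="zipCodeType">
    <xsd:restriction base="xsd:string">
      <xsd:pattern value="[0-9]{5}(-[0-9]{4})?"/>
    </xsd:restriction>
  </xsd:simpleType>

  <!-- Elements that use the constrained types -->
  <xsd:element name="quantity" type="quantityType"/>
  <xsd:element name="shortName" type="shortNameType"/>
  <xsd:element name="zipCode" type="zipCodeType"/>

</xsd:schema>
```

Under this sketch, a validating parser should reject content such as a quantity of 0 or a zipCode of 1234, neither of which a DTD could rule out.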

Why use XML Schema for data?

  • Strong typing for elements and attributes
  • Standardized way to represent null values for elements
  • Key mechanism that is directly analogous to relational database foreign keys
  • Defined as XML documents, making them programmatically accessible

Standardized null representation

When designing XML structures for data, the question always arises: How should nulls be represented? Many database implementations have specific meanings for nulls -- such as "not provided" or "unknown" -- that often differ from the meaning of a field that contains a zero-length string. When using DTDs to design documents, there's no built-in way to state that an element or attribute may be null. Instead, the document designer must decide on some default way to indicate that a particular column contains null in the source data (such as the omission of an attribute, or setting it to some distinct value that is not expected to appear in the source data, such as NULL_VALUE). That technique presents a problem, however, when a document consumer is written by someone who is not familiar with the design methodology that underlies the DTD; without documentation, the consumer is liable to have trouble identifying the null value properly. Again, XML Schema comes to the rescue, providing a standardized way to declare that an element may contain a null in an instance document. For example, an element defined as shown in Listing 1 would allow an element called recipient to be specified as null.

<xsd:element name="recipient" type="xsd:string" nillable="true"/>

Listing 2 shows an instance document fragment in which recipient is null.

<recipient xsi:nil="true" />

Unfortunately, this mechanism does not allow attributes to be explicitly stated as allowing nulls -- after all, you can't have attributes on attributes -- but this at least allows us to distinguish between an element with no content and a null element in a standardized way.
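
The distinction between a null element and an empty one can be sketched in a small instance document. The message and subject element names are hypothetical, and the sketch assumes that both recipient and subject are declared in the schema as nillable xsd:string elements. The xsi prefix must be bound to the XML Schema instance namespace, as shown on the root element.

```xml
<!-- Hypothetical instance document; "message" and "subject" are illustrative names.
     Assumes recipient and subject are both declared as nillable xsd:string elements. -->
<message xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <!-- Null: no value was provided for the recipient -->
  <recipient xsi:nil="true"/>
  <!-- Not null: a value was provided, and it is the zero-length string -->
  <subject/>
</message>
```

A consumer that maps this document back to a database could then store a null for recipient and an empty string for subject without relying on any document-specific convention.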


True key representation

If you have ever attempted to describe a relational database with a complex relationship map using a DTD, you've likely had to use the ID-IDREF pointing mechanism. In a structure where two entities are related in a many-to-many way through a relating table (borrowers and assets on a loan application, say), the simple XML parent-child relationship is insufficient. However, IDs and IDREFs have their own weaknesses: IDs must be unique across an entire document, and IDREF declarations do not specify the type of element an instance of the IDREF attribute must reference. XML Schema provides a way to specify these pointing relationships in much the same way that foreign-key relationships are declared in a relational database. For example, say you have a foreign-key relationship that you can't express using a simple parent-child relationship in your XML. You can declare the two related elements as in Listing 3:

<xsd:element name="rootElement">
  <xsd:complexType>
    <xsd:sequence>
      <xsd:element name="elementOne" maxOccurs="unbounded">
        <xsd:complexType>
          <xsd:attribute name="elementOneKey" type="xsd:integer" />
          <xsd:attribute name="elementOneDesc" type="xsd:string" />
        </xsd:complexType>
      </xsd:element>
      <xsd:element name="elementTwo" maxOccurs="unbounded">
        <xsd:complexType>
          <xsd:attribute name="elementTwoKey" type="xsd:integer" />
          <xsd:attribute name="elementOneKey" type="xsd:integer" />
          <xsd:attribute name="elementTwoDesc" type="xsd:string" />
        </xsd:complexType>
      </xsd:element>
    </xsd:sequence>
  </xsd:complexType>
  <xsd:key name="elementOnePK">
    <xsd:selector xpath="elementOne"/>
    <xsd:field xpath="@elementOneKey"/>
  </xsd:key>
  <xsd:keyref name="elementOneFK" refer="elementOnePK">
    <xsd:selector xpath="elementTwo"/>
    <xsd:field xpath="@elementOneKey"/>
  </xsd:keyref>
</xsd:element>

In Listing 3, the key definition on rootElement declares that the elementOneKey attribute must be present on all elementOne elements, and that its value must be unique across all elementOne elements (note that this differs from IDs, which must be unique regardless of the element with which they are associated). The keyref definition then states that the elementOneKey attribute on each elementTwo element must match the elementOneKey of one of the elementOne elements within the same rootElement. Both constraints are declared on rootElement, the element that contains both sets of related elements, so that the key and the keyref are evaluated over the same scope; the selector paths are relative to the element on which the constraint is declared. Another nice feature of this key mechanism is that the keys may be strongly typed -- as opposed to IDs and IDREFs, which must be XML names and so cannot begin with a digit -- so you can use that automatically incremented primary key in your table without modification. It's also possible to define composite keys so that you can create primary keys (using the key element) and foreign keys (using the keyref element) that map directly to the keys found in your existing relational database.
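
As a minimal sketch of the composite-key case, the schema below uses two field elements per constraint. The names here (orders, orderLine, shipment, and their attributes) are hypothetical and not taken from Listing 3. The constraints are declared on the element that contains both related elements, so that the keyref and the key it refers to share one scope.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="orders">
    <xsd:complexType>
      <xsd:sequence>
        <!-- Hypothetical table with a two-column primary key -->
        <xsd:element name="orderLine" maxOccurs="unbounded">
          <xsd:complexType>
            <xsd:attribute name="orderId" type="xsd:integer"/>
            <xsd:attribute name="lineNumber" type="xsd:integer"/>
            <xsd:attribute name="product" type="xsd:string"/>
          </xsd:complexType>
        </xsd:element>
        <!-- Hypothetical table that points back at orderLine -->
        <xsd:element name="shipment" minOccurs="0" maxOccurs="unbounded">
          <xsd:complexType>
            <xsd:attribute name="orderId" type="xsd:integer"/>
            <xsd:attribute name="lineNumber" type="xsd:integer"/>
            <xsd:attribute name="shipDate" type="xsd:date"/>
          </xsd:complexType>
        </xsd:element>
      </xsd:sequence>
    </xsd:complexType>

    <!-- Composite primary key: the pair (orderId, lineNumber) must be present and unique -->
    <xsd:key name="orderLinePK">
      <xsd:selector xpath="orderLine"/>
      <xsd:field xpath="@orderId"/>
      <xsd:field xpath="@lineNumber"/>
    </xsd:key>

    <!-- Composite foreign key: its fields are matched to the key's fields by position -->
    <xsd:keyref name="shipmentOrderLineFK" refer="orderLinePK">
      <xsd:selector xpath="shipment"/>
      <xsd:field xpath="@orderId"/>
      <xsd:field xpath="@lineNumber"/>
    </xsd:keyref>
  </xsd:element>
</xsd:schema>
```

A keyref must list the same number of fields as the key it refers to, in the same order, much as a multi-column foreign key lists its columns in the order of the primary key it references.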


Schemas are XML

One of the things about XML schemas that often gets overlooked is that schema definitions are themselves valid XML documents. The clever programmer can take advantage of this to create some very flexible system architectures. For example, say that you always use a complex type called "Address" in all of the documents in your system. It would be easy to write an XSLT style sheet that consumes the schema document for each document type you are using, finds elements that are defined as having the Address complex type, and creates document-specific style sheets that render the address information in a consistent way (with all of the appropriate punctuation, hard breaks, and so on). You could also write code that probes the internal structure of an XML schema, then creates a relational table structure from that information, or HTML documentation of the structure itself.
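
As a starting point for that kind of tooling, here is a minimal XSLT 1.0 sketch that takes a schema document as its input and lists the name of every element declared with the Address type. It assumes the schema refers to the type with the unprefixed name Address; a schema that uses a prefixed name (for example, one with a target namespace) would need a different test. A real style sheet generator would emit templates rather than plain text.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <!-- Emit a plain-text report, one element name per line -->
  <xsl:output method="text"/>

  <xsl:template match="/">
    <!-- Visit every element declaration, at any depth, whose type attribute is
         exactly "Address" (assumption: the type name is written without a prefix) -->
    <xsl:for-each select="//xsd:element[@type = 'Address']">
      <xsl:value-of select="@name"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
```

Because the schema is ordinary XML, nothing beyond a standard XSLT processor is needed to run this against it; the same approach extends to generating table definitions or HTML documentation.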


Conclusion

I've taken a brief look at some aspects of XML Schema that make schemas much better than DTDs for the definition of XML structures for data. While DTDs are likely to be around for a while yet (there are plenty of legacy documents that still rely on them for their structural definition), support for XML Schema is quickly being implemented for all the major XML software offerings. In the following months, I'll take a look at some of the ideas I've laid out here in greater depth in my forthcoming column.


