Tip: How not to design an XML format

Minor changes can help make code more robust and easier to use

Elliotte Rusty Harold (elharo@metalab.unc.edu), Adjunct professor, Polytechnic University
Elliotte Rusty Harold is originally from New Orleans, to which he returns periodically in search of a decent bowl of gumbo. However, he resides in the Prospect Heights neighborhood of Brooklyn with his wife Beth and cats Charm (named after the quark) and Marjorie (named after his mother-in-law). He's an adjunct professor of computer science at Polytechnic University, where he teaches Java technology and object-oriented programming. His Cafe au Lait Web site has become one of the most popular independent Java sites on the Internet, and his spin-off site, Cafe con Leche, has become one of the most popular XML sites. His books include Effective XML , Processing XML with Java , Java Network Programming , and The XML 1.1 Bible . He's currently working on the XOM API for processing XML and the Jaxen XPath engine.

Summary:  This tip investigates an XML format that demonstrates a number of common mistakes and design flaws, and explains how you can correct these issues and improve the format.

Date:  04 Nov 2005
Level:  Introductory

I was playing with a popular XML command-line utility when I noticed it had an option to generate a list of directory content as an XML document. Curious, I tried it. Listing 1 shows what I saw.


Listing 1. The original format
                
<xml>
<d p="rwxrwxrwx" a="2005.07.26 15:23:13" 
    m="2005.07.26 15:21:23" s="170" n="."/>
<d p="rwxrwxrwx" a="2005.07.26 15:21:30" 
    m="2005.06.10 13:49:59" s="2448" n=".."/>
<f p="rw-r--r--" a="2005.07.26 15:20:46" 
    m="2005.06.07 14:00:55" s="6148" n=".DS_Store"/>
<f p="rw-r--r--" a="2005.07.26 15:21:30" 
    m="2005.07.26 15:21:24" s="800" n="canonthunderbirdplist.xml"/>
<f p="rw-r--r--" a="2005.07.26 15:21:30" 
    m="2005.07.26 15:20:46" s="945" n="thunderbirdplist.xml"/>
</xml>

This format makes at least three separate mistakes -- none of them critical but all of them serious. This is a classic example of how not to design an XML format; but with a little tweaking, it is possible to make this much easier to use and much more robust.

XML used as name

The first problem is the name of the root element. The three-letter string "xml" is reserved by the XML specification and shouldn't be used in any format promulgated by anyone other than the World Wide Web Consortium (W3C). The W3C might redefine this name (and all others that begin with the three letters x, m, and l, in any combination of case) in future specs. This isn't a well-formedness error, but some parsers may nonetheless report a warning when reading this document.

A second problem with this name is that it's nondescriptive. Of course it's XML. You could make that even more obvious by including an XML declaration. The name of the root element should reflect the type of the document. For instance, DirectoryListing would be a good name for this element.
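
The reservation rule is simple enough to check mechanically. The following is a minimal Python sketch (Python is used here only for illustration, and the function name is hypothetical) that flags any candidate element or attribute name beginning with x, m, l in any mix of case:

```python
def is_reserved_xml_name(name):
    """Return True if the name starts with the letters x, m, l in any case."""
    return name[:3].lower() == "xml"


# Check a few candidate root element names.
for candidate in ("xml", "XmlData", "DirectoryListing"):
    status = "reserved" if is_reserved_xml_name(candidate) else "ok"
    print(f"{candidate}: {status}")
```

A check like this could run when a format is being designed, long before a parser gets the chance to warn about it.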


Abbreviated names

The second major problem is the abbreviated names used for the elements and attributes. What's an f? A p? An a? An m? An s? I can puzzle these out because I know where this data came from, but it's not so obvious to anyone else. Much better to spell out the names fully, as shown in Listing 2.


Listing 2. Full names
<DirectoryListing>
  <Directory permissions="rwxrwxrwx" size="170" name="."
        lastAccessed="2005.07.26 15:23:13" 
        lastModified="2005.07.26 15:21:23"/>
  <Directory permissions="rwxrwxrwx"  size="2448" name=".."
        lastAccessed="2005.07.26 15:21:30" 
        lastModified="2005.06.10 13:49:59"/>
  <File permissions="rw-r--r--" size="6148" name=".DS_Store"
        lastAccessed="2005.07.26 15:20:46" 
        lastModified="2005.06.07 14:00:55"/>
  <File permissions="rw-r--r--" size="800" name="canonthunderbirdplist.xml"
        lastAccessed="2005.07.26 15:21:30" 
        lastModified="2005.07.26 15:21:24"/>
  <File permissions="rw-r--r--"   size="945" name="thunderbirdplist.xml"
        lastAccessed="2005.07.26 15:21:30" 
        lastModified="2005.07.26 15:20:46"/>
</DirectoryListing> 

Isn't that clearer? It's also a little larger, but not problematically so. If size is a real problem (although it rarely is), there are ways to reduce disk and memory footprints that don't result in more opaque documents.
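
The article does not say which size-reduction techniques it has in mind. One candidate, assumed here, is general-purpose compression: repeated long names compress well, so most of the size penalty tends to disappear on disk or on the wire. The sketch below rebuilds both formats from the entries in Listing 1 and prints raw and gzipped sizes; the helper names are hypothetical.

```python
import gzip

# (kind, permissions, accessed, modified, size, name), copied from Listing 1.
ENTRIES = [
    ("d", "rwxrwxrwx", "2005.07.26 15:23:13", "2005.07.26 15:21:23", "170", "."),
    ("d", "rwxrwxrwx", "2005.07.26 15:21:30", "2005.06.10 13:49:59", "2448", ".."),
    ("f", "rw-r--r--", "2005.07.26 15:20:46", "2005.06.07 14:00:55", "6148", ".DS_Store"),
    ("f", "rw-r--r--", "2005.07.26 15:21:30", "2005.07.26 15:21:24", "800",
     "canonthunderbirdplist.xml"),
    ("f", "rw-r--r--", "2005.07.26 15:21:30", "2005.07.26 15:20:46", "945",
     "thunderbirdplist.xml"),
]


def terse_document():
    """Build the abbreviated format of Listing 1."""
    rows = [
        f'<{kind} p="{p}" a="{a}" m="{m}" s="{s}" n="{n}"/>'
        for kind, p, a, m, s, n in ENTRIES
    ]
    return "<xml>\n" + "\n".join(rows) + "\n</xml>\n"


def verbose_document():
    """Build the spelled-out format of Listing 2."""
    tag = {"d": "Directory", "f": "File"}
    rows = [
        f'  <{tag[kind]} permissions="{p}" size="{s}" name="{n}" '
        f'lastAccessed="{a}" lastModified="{m}"/>'
        for kind, p, a, m, s, n in ENTRIES
    ]
    return "<DirectoryListing>\n" + "\n".join(rows) + "\n</DirectoryListing>\n"


def sizes(document):
    """Return (raw byte count, gzipped byte count) for a document string."""
    raw = document.encode("utf-8")
    return len(raw), len(gzip.compress(raw))


for label, doc in (("terse", terse_document()), ("verbose", verbose_document())):
    raw_size, gzip_size = sizes(doc)
    print(f"{label}: {raw_size} bytes raw, {gzip_size} bytes gzipped")
```

Exact numbers depend on the compressor and its settings, so none are quoted here; the comparison worth making is how much smaller the gap between the two formats is after compression than before.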


Non-XML structures

The final problem is perhaps the most significant. The attributes contain a lot of substructure that isn't available through the XML parser. Each program that parses this document must include its own custom parser for the permissions format and the time format. Why not let the XML parser do the hard work? These attributes should be moved to child elements that contain the necessary substructure. For example, a permissions child element might look like this:

<Permissions>
  <world>
    <read>true</read>
    <write>false</write>
    <execute>false</execute>
  </world>
  <group>
    <read>true</read>
    <write>false</write>
    <execute>false</execute>
  </group>
  <owner>
    <read>true</read>
    <write>true</write>
    <execute>false</execute>
  </owner>
</Permissions>
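
With the substructure expressed as markup, a client needs no format-specific parsing at all. As a rough illustration, the Python sketch below reads an element shaped like the one above using the standard library's ElementTree; the function name and the sample values (rw-r--r--) are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Sample input: the permissions rw-r--r-- expressed as child elements.
SAMPLE = """\
<Permissions>
  <world><read>true</read><write>false</write><execute>false</execute></world>
  <group><read>true</read><write>false</write><execute>false</execute></group>
  <owner><read>true</read><write>true</write><execute>false</execute></owner>
</Permissions>
"""


def read_permissions(element):
    """Map each class (world, group, owner) to a dict of boolean flags."""
    return {
        who.tag: {flag.tag: flag.text.strip() == "true" for flag in who}
        for who in element
    }


permissions = read_permissions(ET.fromstring(SAMPLE))
print(permissions["owner"]["write"])  # True
print(permissions["world"]["write"])  # False
```

The XML parser has already done the splitting; the client only converts the text "true" or "false" to a boolean.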

You can still use attributes if you prefer, but you should separate out each permission as an individual attribute:

<File size="945" name="thunderbirdplist.xml"
  lastAccessed="2005.07.26 15:21:30" lastModified="2005.07.26 15:20:46" 
  worldread="true" worldwrite="false" worldexecute="false"
  groupread="true" groupwrite="false" groupexecute="false"
  ownerread="true" ownerwrite="true"  ownerexecute="false"
/>

You can structure this in at least a dozen different ways, but whichever approach you pick should make the structure explicit through XML markup, and not leave it implicit in the text strings. The goal is for the XML parser to offer the individual pieces of information to the client application rather than force the client program to further subdivide the content.
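
For contrast, here is a sketch of the kind of hand-written parser that the original permissions attribute forces on every client. It assumes the conventional ls-style ordering (owner, then group, then world) and ignores special bits such as setuid or sticky, which do not appear in the listings; the function name is hypothetical.

```python
def parse_permission_string(text):
    """Split a nine-character string such as 'rw-r--r--' into named flags."""
    if len(text) != 9:
        raise ValueError(f"expected 9 characters, got {len(text)}: {text!r}")
    result = {}
    for index, who in enumerate(("owner", "group", "world")):
        chunk = text[index * 3:index * 3 + 3]
        flags = {}
        # Each position must hold either its letter (r, w, x) or a dash.
        for letter, name, actual in zip("rwx", ("read", "write", "execute"), chunk):
            if actual not in (letter, "-"):
                raise ValueError(f"unexpected {actual!r} in {text!r}")
            flags[name] = actual == letter
        result[who] = flags
    return result


print(parse_permission_string("rw-r--r--")["group"])
# {'read': True, 'write': False, 'execute': False}
```

Every consumer of the format would need its own copy of this logic, including the validation and the knowledge of which triplet comes first, none of which the document itself states.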

Similarly, you can divide the time information as follows:

<lastModified>
  <year>2005</year>
  <month>06</month>
  <day>10</day> 
  <hour>13</hour>
  <minute>49</minute>
  <second>59</second>
</lastModified>

To some extent, the times are unitary, so perhaps it's OK to keep them as undifferentiated strings. However, the string format should be something more standard that can easily be verified by schema languages and used as input to various date and time libraries such as java.util.Date. The ISO 8601 format for times, which is also the basis of the W3C XML Schema dateTime type, works well here:

lastModified="2005-06-10T13:49:59"

ISO 8601 also defines syntax for representing time zones and fractions of a second, should that become important in the future.
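
The practical difference shows up as soon as a client has to read the value. In the Python sketch below (an illustration only; the function name is hypothetical), the original dotted format needs an explicit format string that every client must know, while the ISO 8601 form is accepted directly by the standard library in Python 3.7 and later:

```python
from datetime import datetime

# Format string each client would have to hard-code for the original attribute.
LEGACY_FORMAT = "%Y.%m.%d %H:%M:%S"


def legacy_to_iso(text):
    """Convert a value such as '2005.06.10 13:49:59' to ISO 8601."""
    return datetime.strptime(text, LEGACY_FORMAT).isoformat()


legacy = datetime.strptime("2005.06.10 13:49:59", LEGACY_FORMAT)
iso = datetime.fromisoformat("2005-06-10T13:49:59")  # no format string needed

print(legacy == iso)                         # True
print(legacy_to_iso("2005.06.10 13:49:59"))  # 2005-06-10T13:49:59
```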


Summary

Listing 3 shows the final version of the document.


Listing 3. The improved format
<?xml version="1.0" encoding="UTF-8"?>
<DirectoryListing>
  <Directory size="170" name="."
        lastAccessed="2005-07-26T15:23:13" lastModified="2005-07-26T15:21:23">
    <Permissions>
      <world>
        <read>true</read>
        <write>true</write>
        <execute>true</execute>
      </world>
      <group>
        <read>true</read>
        <write>true</write>
        <execute>true</execute>
      </group>
      <owner>
        <read>true</read>
        <write>true</write>
        <execute>true</execute>
      </owner>
    </Permissions>
  </Directory>
  <Directory size="2448" name=".."
        lastAccessed="2005-07-26T15:21:30" lastModified="2005-06-10T13:49:59">
    <Permissions>
      <world>
        <read>true</read>
        <write>true</write>
        <execute>true</execute>
      </world>
      <group>
        <read>true</read>
        <write>true</write>
        <execute>true</execute>
      </group>
      <owner>
        <read>true</read>
        <write>true</write>
        <execute>true</execute>
      </owner>
    </Permissions>
  </Directory>
  <File size="6148" name=".DS_Store"
        lastAccessed="2005-07-26T15:20:46" lastModified="2005-06-07T14:00:55">
    <Permissions>
      <world>
        <read>true</read>
        <write>false</write>
        <execute>false</execute>
      </world>
      <group>
        <read>true</read>
        <write>false</write>
        <execute>false</execute>
      </group>
      <owner>
        <read>true</read>
        <write>true</write>
        <execute>false</execute>
      </owner>
    </Permissions>
  </File>
  <File size="800" name="canonthunderbirdplist.xml"
        lastAccessed="2005-07-26T15:21:30" lastModified="2005-07-26T15:21:24">
    <Permissions>
      <world>
        <read>true</read>
        <write>false</write>
        <execute>false</execute>
      </world>
      <group>
        <read>true</read>
        <write>false</write>
        <execute>false</execute>
      </group>
      <owner>
        <read>true</read>
        <write>true</write>
        <execute>false</execute>
      </owner>
    </Permissions>
  </File>
  <File size="945" name="thunderbirdplist.xml"
        lastAccessed="2005-07-26T15:21:30" lastModified="2005-07-26T15:20:46">
    <Permissions>
      <world>
        <read>true</read>
        <write>false</write>
        <execute>false</execute>
      </world>
      <group>
        <read>true</read>
        <write>false</write>
        <execute>false</execute>
      </group>
      <owner>
        <read>true</read>
        <write>true</write>
        <execute>false</execute>
      </owner>
    </Permissions>
  </File>
</DirectoryListing>

It's longer but much clearer and much easier to process. Don't be afraid of verbosity. If you need a shorter, simpler format, you can transform it with an XSLT stylesheet. However, it's much easier to take the extra structure out than it is to put it back in.
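
Taking the structure out really is a few lines of work. The sketch below does it in plain Python with ElementTree rather than with the XSLT stylesheet mentioned above, purely to stay within the standard library; the function names are hypothetical, and the sample is a single File entry in the structured style. It collapses each Permissions child back into one compact attribute:

```python
import xml.etree.ElementTree as ET

# Sample input: one File entry with structured permissions (rw-r--r--).
SAMPLE = """\
<DirectoryListing>
  <File size="945" name="thunderbirdplist.xml"
        lastAccessed="2005-07-26T15:21:30" lastModified="2005-07-26T15:20:46">
    <Permissions>
      <world><read>true</read><write>false</write><execute>false</execute></world>
      <group><read>true</read><write>false</write><execute>false</execute></group>
      <owner><read>true</read><write>true</write><execute>false</execute></owner>
    </Permissions>
  </File>
</DirectoryListing>
"""


def compact_permissions(permissions):
    """Rebuild an ls-style string (owner, group, world) from the element."""
    text = ""
    for who in ("owner", "group", "world"):
        for name, letter in (("read", "r"), ("write", "w"), ("execute", "x")):
            flag = permissions.findtext(f"{who}/{name}", default="false").strip()
            text += letter if flag == "true" else "-"
    return text


def flatten(listing):
    """Replace each Permissions child with a permissions attribute, in place."""
    for entry in listing:
        permissions = entry.find("Permissions")
        if permissions is not None:
            entry.set("permissions", compact_permissions(permissions))
            entry.remove(permissions)
    return listing


root = flatten(ET.fromstring(SAMPLE))
print(root[0].get("permissions"))  # rw-r--r--
```

Going the other way requires the validating string parser that each client of the terse format would otherwise have to write, which is the asymmetry the paragraph above describes.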

Abbreviated names and opaque string formats are a relic of the era when supercomputers had 32K of memory and speeds measured in kilohertz. This hasn't been true for decades. When designing XML formats, you should prefer clarity and precision to compactness. Design for comprehensibility and maintainability rather than trying to squeeze out every last byte. You'll make life a lot easier for everyone who has to process your documents -- including yourself.


Resources

Get products and technologies

  • The author didn't mention the name of the tool that produced the output in Listing 1 for two reasons: A lot of other formats make these same mistakes; and it's a nice tool overall despite the minor warts that he focuses on in this article. However, if you're curious, you can find it here.
