I was playing with a popular XML command-line utility when I noticed it had an option to generate a list of directory content as an XML document. Curious, I tried it. Listing 1 shows what I saw.
Listing 1. The original format
<xml>
<d p="rwxrwxrwx" a="2005.07.26 15:23:13"
m="2005.07.26 15:21:23" s="170" n="."/>
<d p="rwxrwxrwx" a="2005.07.26 15:21:30"
m="2005.06.10 13:49:59" s="2448" n=".."/>
<f p="rw-r--r--" a="2005.07.26 15:20:46"
m="2005.06.07 14:00:55" s="6148" n=".DS_Store"/>
<f p="rw-r--r--" a="2005.07.26 15:21:30"
m="2005.07.26 15:21:24" s="800" n="canonthunderbirdplist.xml"/>
<f p="rw-r--r--" a="2005.07.26 15:21:30"
m="2005.07.26 15:20:46" s="945" n="thunderbirdplist.xml"/>
</xml>
This format makes at least three separate mistakes -- none of them critical but all of them serious. This is a classic example of how not to design an XML format; but with a little tweaking, it is possible to make this much easier to use and much more robust.
The first problem is the name of the root element. The three-letter string "xml" is reserved by the XML specification and shouldn't be used in any format promulgated by anyone other than the World Wide Web Consortium (W3C). The W3C might redefine this name (and all others that begin with the three letters x, m, and l, in any combination of case) in future specs. This isn't a well-formedness error, but some parsers may nonetheless report a warning when reading this document.
A second problem with this name is that it's nondescriptive. Of course, it's XML. You could make that even more obvious if you included an XML declaration. The name of the root element should reflect the type of the document. For instance, DirectoryListing would be a good name for this element.
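If you want to catch reserved names before a format ships, a small check is enough. What follows is only a sketch in Python using the standard library's ElementTree; the function name reserved_names is my own invention for illustration, not part of any tool discussed here.

```python
import xml.etree.ElementTree as ET

def reserved_names(xml_text):
    """Return element and attribute names that begin with 'xml' in any case.

    Hypothetical helper for illustration. Names in the built-in xml:
    namespace (xml:lang, xml:space) are reported by ElementTree in
    '{namespace-uri}local' form, so they are not flagged here.
    """
    found = []
    for element in ET.fromstring(xml_text).iter():
        names = [element.tag] + list(element.attrib)
        found.extend(name for name in names if name.lower().startswith("xml"))
    return found
```

Run against a document shaped like Listing 1, this reports the root element name and nothing else; a root named DirectoryListing produces an empty list.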
The second major problem is the abbreviated names used for the elements and attributes.
What's an f? A p? An a? An m? An s? I can puzzle these out because I know where this data came from, but it's not so obvious to anyone else. Much better to spell out the names fully, as shown in Listing 2.
Listing 2. Full names
<DirectoryListing>
<Directory permissions="rwxrwxrwx" size="170" name="."
lastAccessed="2005.07.26 15:23:13"
lastModified="2005.07.26 15:21:23"/>
<Directory permissions="rwxrwxrwx" size="2448" name=".."
lastAccessed="2005.07.26 15:21:30"
lastModified="2005.06.10 13:49:59"/>
<File permissions="rw-r--r--" size="6148" name=".DS_Store"
lastAccessed="2005.07.26 15:20:46"
lastModified="2005.06.07 14:00:55"/>
<File permissions="rw-r--r--" size="800" name="canonthunderbirdplist.xml"
lastAccessed="2005.07.26 15:21:30"
lastModified="2005.07.26 15:21:24"/>
<File permissions="rw-r--r--" size="945" name="thunderbirdplist.xml"
lastAccessed="2005.07.26 15:21:30"
lastModified="2005.07.26 15:20:46"/>
</DirectoryListing>
Isn't that clearer? It's also a little larger, but not problematically so. If size is a real problem (although it rarely is), there are ways to reduce disk and memory footprints that don't result in more opaque documents.
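I haven't said which ways I have in mind, so take this as one assumption: general-purpose compression such as gzip. Long, repetitive element and attribute names compress very well, so the readable names cost little once the file is compressed. A minimal Python sketch (the sizes function is a hypothetical helper):

```python
import gzip

def sizes(xml_text):
    """Return (raw_bytes, gzipped_bytes) for a document held as a string."""
    raw = xml_text.encode("utf-8")
    # gzip.compress is lossless; gzip.decompress restores the same bytes.
    return len(raw), len(gzip.compress(raw))
```

The actual savings depend on the document, so measure your own data rather than trusting a rule of thumb.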
The final problem is perhaps the most significant. The attributes contain a lot of substructure that isn't available through the XML parser. Each program that parses this document must include its own custom parser for the permissions format and the time format. Why not let the XML parser do the hard work? These attributes should be moved to child elements that contain the necessary substructure. For example, a permissions child element might look like this:
<permissions>
<world>
<read>true</read>
<write>false</write>
<execute>false</execute>
</world>
<group>
<read>true</read>
<write>true</write>
<execute>false</execute>
</group>
<owner>
<read>true</read>
<write>true</write>
<execute>false</execute>
</owner>
</permissions>
You can still use attributes if you prefer, but you should separate out each permission as an individual attribute:
<File size="945" name="thunderbirdplist.xml"
lastAccessed="2005.07.26 15:21:30" lastModified="2005.07.26 15:20:46"
worldread="true" worldwrite="false" worldexecute="false"
groupread="true" groupwrite="false" groupexecute="false"
ownerread="true" ownerwrite="true" ownerexecute="false"/>
You can structure this in at least a dozen different ways, but whichever approach you pick should make the structure explicit through XML markup, and not leave it implicit in the text strings. The goal is for the XML parser to offer the individual pieces of information to the client application rather than force the client program to further subdivide the content.
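To make the difference concrete, here is a rough Python sketch of both situations. The first function reads a flag from child elements, assuming a Permissions element laid out as world, group, and owner children with read, write, and execute flags. The second is the kind of hand-written string slicing that a "rw-r--r--" attribute forces on every client. Both function names and the sample document are mine, for illustration only.

```python
import xml.etree.ElementTree as ET

# Assumed sample: one file entry with structured permissions.
SAMPLE = """<File name="thunderbirdplist.xml">
  <Permissions>
    <world><read>true</read><write>false</write><execute>false</execute></world>
    <group><read>true</read><write>false</write><execute>false</execute></group>
    <owner><read>true</read><write>true</write><execute>false</execute></owner>
  </Permissions>
</File>"""

def can(file_element, who, action):
    """Look up one flag by path; the XML parser has already done the splitting."""
    return file_element.findtext(f"Permissions/{who}/{action}") == "true"

def can_from_string(mode, who, action):
    """Micro-parser for a mode string such as 'rw-r--r--' (owner, group, world)."""
    offset = {"owner": 0, "group": 3, "world": 6}[who]
    index = {"read": 0, "write": 1, "execute": 2}[action]
    return mode[offset + index] != "-"
```

With the structured form, can(ET.fromstring(SAMPLE), "owner", "write") is a plain lookup. The string version works too, but every consumer has to know the column order and rewrite that logic for itself.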
Similarly, you can divide the time information as follows:
<lastModified>
<year>2005</year>
<month>06</month>
<day>10</day>
<hour>13</hour>
<minute>49</minute>
<second>59</second>
</lastModified>
To some extent, the times are unitary, so perhaps it's OK to keep them as undifferentiated strings. However,
the string format should be something more standard that can easily be verified by schema languages and used as
input to various date and time libraries such as java.util.Date.
The ISO 8601 format for times works well here:
lastModified="2005-06-10T13:49:59"
ISO 8601 also defines syntax for representing time zones and fractions of a second, should that become important in the future.
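The payoff shows up as soon as you hand the string to a date library. I mentioned java.util.Date above; the sketch below uses Python's standard datetime module instead, simply because it keeps the example short. The two function names are hypothetical.

```python
from datetime import datetime

def parse_original(stamp):
    """Parse the tool's 'YYYY.MM.DD HH:MM:SS' strings; needs a hand-written pattern."""
    return datetime.strptime(stamp, "%Y.%m.%d %H:%M:%S")

def parse_iso(stamp):
    """Parse an ISO 8601 string such as '2005-06-10T13:49:59' with no pattern at all."""
    return datetime.fromisoformat(stamp)
```

Both calls produce the same datetime value for the same instant, but only the first requires each client to know and reproduce the format string.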
Listing 3 shows the final version of the document.
Listing 3. The improved format
<?xml version="1.0" encoding="UTF-8"?>
<DirectoryListing>
<Directory size="170" name="."
lastAccessed="2005-07-26T15:23:13" lastModified="2005-07-26T15:21:23">
<Permissions>
<world>
<read>true</read>
<write>true</write>
<execute>true</execute>
</world>
<group>
<read>true</read>
<write>true</write>
<execute>true</execute>
</group>
<owner>
<read>true</read>
<write>true</write>
<execute>true</execute>
</owner>
</Permissions>
</Directory>
<Directory size="2448" name=".."
lastAccessed="2005-07-26T15:21:30" lastModified="2005-06-10T13:49:59">
<Permissions>
<world>
<read>true</read>
<write>true</write>
<execute>true</execute>
</world>
<group>
<read>true</read>
<write>true</write>
<execute>true</execute>
</group>
<owner>
<read>true</read>
<write>true</write>
<execute>true</execute>
</owner>
</Permissions>
</Directory>
<File size="6148" name=".DS_Store"
lastAccessed="2005-07-26T15:20:46" lastModified="2005-06-07T14:00:55">
<Permissions>
<world>
<read>true</read>
<write>false</write>
<execute>false</execute>
</world>
<group>
<read>true</read>
<write>false</write>
<execute>false</execute>
</group>
<owner>
<read>true</read>
<write>true</write>
<execute>false</execute>
</owner>
</Permissions>
</File>
<File size="800" name="canonthunderbirdplist.xml"
lastAccessed="2005-07-26T15:21:30" lastModified="2005-07-26T15:21:24">
<Permissions>
<world>
<read>true</read>
<write>false</write>
<execute>false</execute>
</world>
<group>
<read>true</read>
<write>false</write>
<execute>false</execute>
</group>
<owner>
<read>true</read>
<write>true</write>
<execute>false</execute>
</owner>
</Permissions>
</File>
<File size="945" name="thunderbirdplist.xml"
lastAccessed="2005-07-26T15:21:30" lastModified="2005-07-26T15:20:46">
<Permissions>
<world>
<read>true</read>
<write>false</write>
<execute>false</execute>
</world>
<group>
<read>true</read>
<write>false</write>
<execute>false</execute>
</group>
<owner>
<read>true</read>
<write>true</write>
<execute>false</execute>
</owner>
</Permissions>
</File>
</DirectoryListing>
It's longer but much clearer and much easier to process. Don't be afraid of verbosity. If you need a shorter, simpler format, you can transform it with an XSLT stylesheet. However, it's much easier to take the extra structure out than it is to put it back in.
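Taking structure out really is the easy direction. An XSLT stylesheet is the natural tool; the sketch below swaps in a few lines of Python with ElementTree instead, so it shows the idea rather than the stylesheet itself. It assumes entries shaped like those in Listing 3, and the names mode_string and compact are hypothetical.

```python
import xml.etree.ElementTree as ET

def mode_string(permissions):
    """Collapse a <Permissions> element into a string such as 'rw-r--r--'."""
    mode = ""
    for who in ("owner", "group", "world"):
        for action, letter in (("read", "r"), ("write", "w"), ("execute", "x")):
            allowed = permissions.findtext(f"{who}/{action}") == "true"
            mode += letter if allowed else "-"
    return mode

def compact(listing_xml):
    """Replace each entry's <Permissions> child with one permissions attribute."""
    root = ET.fromstring(listing_xml)
    for entry in root:
        permissions = entry.find("Permissions")
        if permissions is not None:
            entry.set("permissions", mode_string(permissions))
            entry.remove(permissions)
    return root
```

Going the other way means guessing at the column order and the meaning of each character, which is exactly the work the verbose format spares its readers.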
Abbreviated names and opaque string formats are a relic of the era when supercomputers had 32K of memory and speeds measured in kilohertz. This hasn't been true for decades. When designing XML formats, you should prefer clarity and precision to compactness. Design for comprehensibility and maintainability rather than trying to squeeze out every last byte. You'll make life a lot easier for everyone who has to process your documents -- including yourself.
Learn
- The points made here are discussed in more depth in Effective XML.
- ISO 8601 is the official time format of the W3C XML Schema Language and is suggested by the W3C for other uses as well.
- Make sure your XML design does not get in the way of normal content processing. See Uche Ogbuji's developerWorks series "Principles of XML design."
- Find out how you can become an IBM Certified Developer in XML and related technologies.
- Find more XML resources on the developerWorks XML zone. For a complete list of XML tips to date, check out the tips summary page.
Get products and technologies
- The author didn't mention the name of the tool that produced the output in Listing 1 for two reasons: A lot of other formats make these same mistakes; and it's a nice tool overall despite the minor warts that he focuses on in this article. However, if you're curious, you can find it here.
Elliotte Rusty Harold is originally from New Orleans, to which he returns periodically in search of a decent bowl of gumbo. However, he resides in the Prospect Heights neighborhood of Brooklyn with his wife Beth and cats Charm (named after the quark) and Marjorie (named after his mother-in-law). He's an adjunct professor of computer science at Polytechnic University, where he teaches Java technology and object-oriented programming. His Cafe au Lait Web site has become one of the most popular independent Java sites on the Internet, and his spin-off site, Cafe con Leche, has become one of the most popular XML sites. His books include Effective XML, Processing XML with Java, Java Network Programming, and The XML 1.1 Bible. He's currently working on the XOM API for processing XML and the Jaxen XPath engine.