Nowadays you can easily take XML for granted. It's everywhere! But when you stand back and look at it, you can see that it's a powerful technology. IDEs help build XML trees. Several validation technologies make sure that the XML code is right. XSLT is a dedicated XML translation language. XML support is even built directly into the syntax of some languages, as with the E4X extension in ActionScript.
XML has a dark side, though. It can be misused. It can be lousy. It can be overly complex. It can be under-defined. It can be just plain tough to work with. So what can you do to make better use of this powerful technology? In this article, I give you 10 specific dos and don'ts that help you do the right thing to build XML that is easy to use.
I can't tell you how many times I've seen XML stored in files that have the .xml extension. It's worthless. It's not telling me anything I don't already know if I just "cat" the file. The moment I see tags I know it's XML. Instead, use an extension that is meaningful to the customer, and one that is sufficiently unique that when it eventually goes into a Google search, which it will, the search returns links to the documentation or some examples of your XML file format.
Another issue I see in some XML is that the root tag is <xml>. Again, you aren't telling me anything. What's in the file? If it's a contact list, then the root node should be <contacts>. XML is meant to be human readable, so use tag names and attribute names that are relevant to the business problem at hand. If the root node is <contacts> I expect to see <contact> tags within that, then <name> tags, with <last> and so on.
I get that XML is a persistence format. And most languages have a way to persist data structures in XML. That's fine if you know, for sure, that the only processes that will ever write or read the XML are written in the same language. That, however, is hardly ever the case. If your application is writing something to a file, it's likely that at some point either the user will read it, or some application in another language will read it.
What I'm getting to is this: keep language-specific constructs out of the XML. How often have you seen <data type="NSDate">07-18-2010</data>? What's NSDate? Oh, that's the class name for the date in the application's platform. So what happens when you switch platforms, or languages? You'll need a translation layer to go between the NSDate tags and whatever your new platform expects.
Keep the language specifics out of the XML and use something simple, like <date>…</date>. It's easy to understand, human readable, and not dependent on any particular language or framework.
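As a rough illustration of a language-neutral date element, here is a minimal Python sketch. The ISO 8601 text format and the helper names date_element and read_date are assumptions made for the example; the article itself does not prescribe a date format.

```python
import xml.etree.ElementTree as ET
from datetime import date


def date_element(value: date) -> ET.Element:
    """Build a plain <date> element (hypothetical helper)."""
    elem = ET.Element("date")
    # ISO 8601 (YYYY-MM-DD) is an assumed choice: it names no class
    # or framework, so any language can read it back.
    elem.text = value.isoformat()
    return elem


def read_date(elem: ET.Element) -> date:
    """Parse the text of a <date> element written by date_element."""
    return date.fromisoformat(elem.text)


xml_text = ET.tostring(date_element(date(2010, 7, 18)), encoding="unicode")
print(xml_text)
print(read_date(ET.fromstring(xml_text)))
```

Nothing in the element says which platform wrote it, so a reader in another language only needs to know the date format.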
Along the same lines, another important lesson is to keep your XML from being too generic. Take the example in Listing 1.
Listing 1. A generic node tree
<nodes>
  <node type="user">
    <node type="first">jack</node>
  </node>
</nodes>
What does this mean? I understand that it's a user list. But it's not easy for humans to read, and it's not easy to edit. What's almost worse is that it makes using the XML in tools like XSLT, or validating it with a schema, really difficult. What this XML really means is something like Listing 2.
Listing 2. A better node tree
<users>
  <user>
    <first>jack</first>
  </user>
</users>
Isn't this better? It says what it means and means what it says. It's easy to read and parse. It's easy to validate and to translate with XSLT. It's even smaller.
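To make the difference concrete, here is a small Python sketch that pulls the first names out of both listings with the standard library's ElementTree module. The function names are invented for the example.

```python
import xml.etree.ElementTree as ET

# Listing 1 and Listing 2, compacted onto one line each.
GENERIC = '<nodes><node type="user"><node type="first">jack</node></node></nodes>'
SPECIFIC = "<users><user><first>jack</first></user></users>"


def first_names_generic(xml_text: str) -> list:
    """Every step of the path needs an attribute test."""
    root = ET.fromstring(xml_text)
    path = "./node[@type='user']/node[@type='first']"
    return [node.text for node in root.findall(path)]


def first_names_specific(xml_text: str) -> list:
    """The tag names alone describe the data."""
    root = ET.fromstring(xml_text)
    return [node.text for node in root.findall("./user/first")]


print(first_names_generic(GENERIC))
print(first_names_specific(SPECIFIC))
```

Both calls find the same name, but the path for the generic tree has to repeat the type test at every level, and that overhead grows with each new kind of node.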
Now I know what you are going to say: "Disk space is cheap. For ten cents I'll take another terabyte." True enough. And certainly you can make XML files that are gigabytes. But programming is all about trade-offs. You trade space for time, or memory for time. But when you have a huge XML file you are getting the worst of both worlds. The file is big on the drive, and takes a long time to parse through and to validate. Plus, a large file effectively rules out a DOM-based parser, since building the tree takes a long time and chews up a lot of memory.
So what's the alternative? One possibility is to make multiple files. One that acts as an index and others that have the large resources that might not be used by all of the clients of the XML. Another possibility is to move any big chunks of CDATA that are in the file out of XML altogether and into their own files with their own formats. If you want to keep all of the data together then zip up all of the files into a new file with a new extension. Every popular language has modules that make it easy to zip and unzip files quickly.
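The index-plus-resources idea might look something like the following Python sketch, which uses the standard zipfile module. The entry name index.xml, the <photos> format, and the helper names are all hypothetical; a real format would define its own.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def write_bundle(target, index_xml: str, resources: dict) -> None:
    """Zip a small XML index together with its large resources."""
    with zipfile.ZipFile(target, "w", zipfile.ZIP_DEFLATED) as bundle:
        bundle.writestr("index.xml", index_xml)  # assumed entry name
        for name, data in resources.items():
            bundle.writestr(name, data)


def read_index(source) -> ET.Element:
    """Parse only the index; the big resources stay compressed."""
    with zipfile.ZipFile(source) as bundle:
        return ET.fromstring(bundle.read("index.xml"))


def read_resource(source, name: str) -> bytes:
    """Load one resource on demand."""
    with zipfile.ZipFile(source) as bundle:
        return bundle.read(name)


# An in-memory buffer stands in for a file with your own extension.
buffer = io.BytesIO()
index = '<photos><photo id="1" src="images/1.bin"/></photos>'
write_bundle(buffer, index, {"images/1.bin": b"\x00" * 1024})

root = read_index(buffer)
src = root.find("./photo").get("src")
print(src, len(read_resource(buffer, src)))
```

A client that only needs the list of photos parses a few bytes of XML and never touches the binary payloads.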
Namespaces are a powerful part of the XML lexicon. They make it easy to provide an extensible file format. You can define a base set of tags for whatever your application needs, and then allow customers to add their own data into the file, in their own namespace, without disturbing your tree.
That said, namespaces make it a lot tougher to parse and manage the data. Namespaces confuse language extensions like E4X. They make it tougher to use the XML in XSLT. And they make the XML much harder to read.
So, use XML namespaces only when you must. Don't just use them because it's the ‘XML thing to do'. XML works just fine without namespaces.
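The extra friction shows up quickly in code. In this Python sketch the namespace URI http://example.com/users is a made-up placeholder; the point is only that the same query stops matching once a default namespace is declared.

```python
import xml.etree.ElementTree as ET

PLAIN = "<users><user><first>jack</first></user></users>"
NAMESPACED = (
    '<users xmlns="http://example.com/users">'
    "<user><first>jack</first></user></users>"
)

plain_root = ET.fromstring(PLAIN)
ns_root = ET.fromstring(NAMESPACED)

# Without namespaces the obvious path just works.
plain_hits = plain_root.findall("./user/first")

# With a default namespace the same path silently matches nothing,
# because every tag is now qualified by the namespace URI.
naive_hits = ns_root.findall("./user/first")

# The query has to carry a prefix mapping for every step.
ns = {"u": "http://example.com/users"}
qualified_hits = ns_root.findall("./u:user/u:first", ns)

print(len(plain_hits), len(naive_hits), len(qualified_hits))
print(ns_root.tag)
```

The silent empty result in the middle case is the kind of surprise that makes namespaced documents harder to work with.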
All of these dos and don'ts come down to keeping your XML clean, simple, and easy to understand. In that spirit, remember that the XML spec allows many things that you don't necessarily have to use. For example, you can use dashes in element and attribute names. But that makes the XML much harder to use in a language extension like E4X. The question is, is it worth it?
My recommendation is to stay away from any special characters in the element or attribute names.
Parsing XML is tough. Parsing it safely, so that your code guards against tags or attributes that it might not find and fails gracefully, is a lot of work. It means extra code and extra complexity, and it obscures the real business logic that is your true focus. So how do you avoid that? You validate the XML before you use it. You can use several standards for this. You can specify a Document Type Definition (DTD) or an XML Schema (see Resources for more on DTDs and XML Schemas). I personally find XML Schema a lot easier to work with, but if this is new to you I recommend trying out several different validation systems.
The big advantage here is that you can depend on the XML once you validate it. It might not be worth doing for anything that your application both reads and writes internally. But it is very handy if the XML is generated by another application or written by hand.
It's easy to overlook the fact that XML stored in files amounts to a file format. With any format, one of the very first things it should contain is a file version number. It's easy enough to add: <customers version="1">...</customers>. And the code that reads the file should check to make sure that the version number is less than or equal to its current version and throw an exception if it's not. That will ensure that any future versions of the code can't confuse the older versions with new tags. Of course, you'll have to support any older versions of the files as you continue development on your application.
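A version check along these lines might look like the following Python sketch. The constant SUPPORTED_VERSION, the exception class, and the function name are assumptions for the example, and treating a missing or non-numeric version as an error is a design choice rather than something the article requires.

```python
import xml.etree.ElementTree as ET

SUPPORTED_VERSION = 1  # highest file version this reader understands


class UnsupportedVersionError(Exception):
    """Raised when a file is newer than this reader, or unversioned."""


def load_customers(xml_text: str) -> ET.Element:
    root = ET.fromstring(xml_text)
    raw = root.get("version")
    if raw is None:
        raise UnsupportedVersionError("missing version attribute")
    try:
        version = int(raw)
    except ValueError:
        raise UnsupportedVersionError(f"bad version: {raw!r}") from None
    # Accept this version and older ones; refuse anything newer.
    if version > SUPPORTED_VERSION:
        raise UnsupportedVersionError(
            f"file version {version} is newer than {SUPPORTED_VERSION}"
        )
    return root


print(load_customers('<customers version="1"></customers>').tag)
try:
    load_customers('<customers version="2"></customers>')
except UnsupportedVersionError as err:
    print("rejected:", err)
```

Failing up front keeps an older reader from half-understanding a newer file and quietly dropping data it doesn't recognize.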
Engineers are pretty lazy. I can say that because I am one. But come on, we all are. If a framework says that it will export XML for us, we are likely to say "that's good enough." But framework-built XML is usually pretty bad. For example, you are likely to get something like Listing 3:
Listing 3. A user list
<users>
  <user>
    <id>1</id>
    <first>jack</first>
  </user>
</users>
Should <id> really be a tag? I'd argue that it should be an attribute. It's short, and it makes sense to be able to look up a user by id with some simple XPath.
If this is going to be a human readable file then it should properly use attributes as in Listing 4.
Listing 4. A better user list
<users>
  <user id="1">
    <first>jack</first>
  </user>
</users>
I can see why a framework would generate Listing 3: it's safer just to always use nodes. But attributes allow you to identify important elements in the DOM tree, and you should use them.
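As a quick illustration of looking up a user by id, here is a Python sketch against a tree shaped like Listing 4. The second user, jill, and the helper name find_user are additions for the example. ElementTree supports only a subset of XPath, but attribute predicates are part of it.

```python
import xml.etree.ElementTree as ET

USERS = (
    '<users>'
    '<user id="1"><first>jack</first></user>'
    '<user id="2"><first>jill</first></user>'
    '</users>'
)


def find_user(root: ET.Element, user_id: str):
    """Return the <user> with the given id attribute, or None."""
    # user_id is assumed to be a trusted, quote-free value.
    return root.find(f"./user[@id='{user_id}']")


root = ET.fromstring(USERS)
user = find_user(root, "2")
print(user.find("first").text)
print(find_user(root, "99"))
```

With the id stored as a child element, the same lookup would need a predicate on child text, or a loop over every user.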
XML puts a bunch of constraints on certain characters: quotes, ampersands, less than, greater than, and other characters. In the real world, however, you use a lot of these characters. So either you need to convert everything into XML-safe encodings, or you need to put large areas of text, code or whatever, into CDATA blocks. I think developers avoid CDATA because they think it will make it tougher to parse. But CDATA sections are no harder to parse than anything else, and most DOM parsers will simply flatten them for you so that you don't have to think about it at all.
Another important reason to use CDATA is to preserve the exact formatting of data. For example, if you export Wiki pages, you will want to retain the exact positions of characters like return and line-feed because those are given special attention in the Wiki format.
So why not use CDATA all the time? Because it makes the document that much harder to read. And it's particularly frustrating when it's not necessary. So use it, and encourage people that write to your XML format to use it, for data that you think will have special characters and where you want to retain the formatting. But don't use it beyond those places.
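To see that CDATA costs nothing on the reading side, consider this small Python sketch. The <page> element and its wiki-style content are invented for the example. It parses the same text once with entity escapes and once inside a CDATA section.

```python
import xml.etree.ElementTree as ET

# The same wiki-style text, escaped by hand...
ESCAPED = "<page>== Title ==\n* a &lt; b\n* c &amp;&amp; d\n</page>"
# ...and wrapped in a CDATA section, readable as written.
CDATA = "<page><![CDATA[== Title ==\n* a < b\n* c && d\n]]></page>"

escaped_text = ET.fromstring(ESCAPED).text
cdata_text = ET.fromstring(CDATA).text

# The parser flattens the CDATA section into ordinary text, so the
# reading code is identical for both forms.
print(escaped_text == cdata_text)
print(repr(cdata_text))
```

The line breaks and special characters come back exactly as written, and the code that reads the element never has to know which form the author chose.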
So far I've talked about XML documents that have a rigid format. I've even gone so far as to recommend using a validator, like XML Schema, that will enforce a rigid structure. There is good reason for that: It's easy to parse structured data. But what if you need some flexibility? I recommend putting optional data into an optional block within its own node. For example, look at Listing 5.
Listing 5. A cluttered user record
<users>
  <user id="1">
    <first>jack</first>
    <middle>d</middle>
    <last>herrington</last>
    <runningpace>8:00</runningpace>
  </user>
</users>
It contains all of the data that you might expect about the user, and then some. So first, middle, last, I get that, but why ‘runningpace'? Is it required? Will you have lots of these fields? Will it be extensible? If the answer was yes to all of that then I would recommend something like Listing 6.
Listing 6. A well structured user record
<users>
  <user id="1">
    <first>jack</first>
    <middle>d</middle>
    <last>herrington</last>
    <userdata>
      <field name="runningpace">8:00</field>
    </userdata>
  </user>
</users>
This way you can have as many fields as you want, but they don't clutter the namespace of the host <user> element. You can even validate that document, and also refer to a given field using XPath (//user/userdata/field[@name='runningpace']).
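Reading the optional block takes very little code. The following Python sketch runs against the document from Listing 6; the helper name user_fields is an assumption. Note that ElementTree does not accept an absolute path on an element, so the XPath above is written with a leading dot here.

```python
import xml.etree.ElementTree as ET

USERS = (
    '<users><user id="1">'
    "<first>jack</first><middle>d</middle><last>herrington</last>"
    '<userdata><field name="runningpace">8:00</field></userdata>'
    "</user></users>"
)


def user_fields(user: ET.Element) -> dict:
    """Collect the optional fields of one <user> into a dictionary."""
    return {f.get("name"): f.text for f in user.findall("./userdata/field")}


root = ET.fromstring(USERS)

# Look up one named field directly.
pace = root.find(".//user/userdata/field[@name='runningpace']")
print(pace.text)

# Or gather whatever optional fields happen to be present.
print(user_fields(root.find("./user")))
```

The fixed part of the record keeps its strict structure, while the reader handles any number of extra fields with the same few lines.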
I've given you a lot to think about here. Five things not to do, and five more things that I recommend doing. Not all of them will apply in all circumstances. Sometimes XML is just a persistence format that is thrown across a wire where the lifespan is but a few milliseconds. In that case, no one really cares. But if you use XML like a file format then you need to treat it as such and use many of the best practices outlined here.
Resources
- W3C XML Specification: Become a "language lawyer" and dig into the details of XML, a simple, very flexible text format designed for large-scale electronic publishing and an important player in data exchange on the Web and elsewhere.
- Document Type Definition (DTD) (Wikipedia): Read more about DTDs, a set of markup declarations that define a document type for SGML-family markup languages (SGML, XML, HTML).
- XML Schema (Wikipedia): Read a brief description of a type of XML document that constrains the structure and content of documents of that type.
- W3C XSLT specification: Learn more about a fantastic way to transform XML into a variety of formats.
- W3C XPath specification: Explore an extremely valuable XML tool that you can use to find nodes quickly and easily within even the most complicated XML document.
- The E4X extension for ActionScript (ECMAScript) (Wikipedia): Look further at a very cool way to integrate XML directly into your application logic. It makes it so easy that it almost becomes a de facto open storage format in the language.
- More articles by this author (Jack Herrington, developerWorks, March 2005-current): Read articles about Ajax, JSON, PHP, XML, and other technologies.
- XML area on developerWorks: Get the resources you need to advance your skills in the XML arena.
- My developerWorks: Personalize your developerWorks experience.
- IBM XML certification: Find out how you can become an IBM-Certified Developer in XML and related technologies.
- XML technical library: See the developerWorks XML Zone for a wide range of technical articles and tips, tutorials, standards, and IBM Redbooks. Also, read more XML tips.
- developerWorks technical events and webcasts: Stay current with technology in these sessions.
- developerWorks on Twitter: Join today to follow developerWorks tweets.
- developerWorks podcasts: Listen to interesting interviews and discussions for software developers.
- developerWorks on-demand demos: Watch demos ranging from product installation and setup for beginners to advanced functionality for experienced developers.
Get products and technologies
- XML development with Eclipse: Harness the power of XML with Eclipse (Pawel Leszek, developerWorks, April 2003): Check out Eclipse and its XML editing extensions documented in this excellent article.
- IBM product evaluation versions: Download or explore the online trials in the IBM SOA Sandbox and get your hands on application development tools and middleware products from DB2®, Lotus®, Rational®, Tivoli®, and WebSphere®.
- XML zone discussion forums: Participate in any of several XML-related discussions.
- developerWorks blogs: Check out these blogs and get involved.