I've written numerous articles and books on XML and XML-related topics by now, and as I spent a little time looking back over them, I was surprised at what I found. Even though I'm a programmer, and have always preferred to get into the bits and bytes (I took assembly in college and liked it) and like to have complete control, the general trend of my articles and tips and books has been less and less about XML itself and more and more about APIs that manipulate XML, and the APIs that wrap those APIs, and the other APIs that allow you to never touch XML at all.
So the end result of all that is a general concern: have we forgotten how to actually write good XML? Do we even know what good XML is anymore? In this short article, I want to review a few XML 101 tips, and make sure that, with all the tools we have, we don't forget basic principles along the way.
First steps: Using the right parts of XML
One of the biggest issues that's become apparent is that XML authors (the term is being loosely applied here!) have begun to stuff everything into elements. Attributes, processing instructions—they've become things of the past.
Elements are probably the easiest part of XML authoring to pick up, mostly because XML authors tend to use elements for everything. That, of course, is not right, and it has some pretty serious downsides. In XML, elements are almost always (there are exceptions, but I'm talking about optimal usage here) best used to represent data that has some sort of hierarchy to it, or that at least might have some hierarchy to it. To start with the negative case, a first name will never have hierarchy to it; it's a single word, and that's it. However, if you choose to have a name element, that might be a good element: it breaks down into a first name and a last name at least, and perhaps a middle name, a title, and other components as well. So this isn't really proper use of an element:
<firstName>Bob</firstName>
But this is:
<name firstName="Bob"
      lastName="Zemeckis"
      title="Mr." />
If you wonder why I didn't nest firstName, lastName, and title elements within the name element, read the last paragraph again. I'll also come back to this in more detail in the next section, focused on attributes.
As a general rule, ask whether an element could ever appear more than once in a single spot in your document. You could have two address elements with different types (home and office, for instance), or two author elements on a book. If you can't imagine such a case, try to use attributes instead. That doesn't mean the element has to appear twice every time it's used, merely that it can.
It's also not true that elements are best suited for textual data; in plenty of cases a good element has only attributes, and no text content at all (again, I'll refer you to the next section, on attributes). The most important thing to remember is that there are more constructs in XML than just elements.
It seems obvious to say, but you should use attributes in XML for single-value data. If an element is used for, say, a person, find the things about that person that are singular values; they're probably your attributes. Social security numbers, IDs, maybe birth dates—these are all excellent attribute candidates.
Of course, you'll see more exceptions to the attribute rules than to other rules. In fact, attributes are used far less than they should be, in my opinion. Developers tend to like the nice solid look of an element:
<person>
  <ssn>489098723</ssn>
</person>
But this makes absolutely no sense—a social security number is very much a single-valued datum. Worse, making the number an element in this way creates a performance hog. When you grab an element's child, as you have to in this case, you get the element node, then navigate to all its child nodes, and iterate over them. Maybe the social security number is the first node, but maybe it's the last node; it's not clear. Then, you've got to get that node's children, which in this case probably consist of a single text node, and get their values. So the process goes through several steps:
- Get the parent node (in this case, the person element).
- Get its children.
- Iterate until you find the one you want (the ssn element).
- Optionally, normalize the text nodes of this new element, to make sure things are easier later on.
- Get the children of the element.
- Grab the value of the text node child.
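The steps above can be sketched with a DOM parser. This is a minimal illustration, assuming Python's standard-library xml.dom.minidom; the function name ssn_from_child_element is made up for this example, and other DOM implementations expose the same steps under slightly different names.

```python
from xml.dom import minidom
from xml.dom.minidom import Node

DOC = "<person> <ssn>489098723</ssn> </person>"


def ssn_from_child_element(xml_text):
    """Hypothetical helper: read the ssn value stored as a child element."""
    # Step 1: get the parent node (the person element).
    person = minidom.parseString(xml_text).documentElement
    # Steps 2-3: get the children and iterate until the ssn element turns up.
    # Whitespace between tags arrives as text nodes, so filter by node type.
    for child in person.childNodes:
        if child.nodeType == Node.ELEMENT_NODE and child.tagName == "ssn":
            # Step 4 (optional): merge any adjacent text nodes.
            child.normalize()
            # Steps 5-6: get the element's children and read the text value.
            return "".join(
                n.data for n in child.childNodes if n.nodeType == Node.TEXT_NODE
            )
    return None  # no ssn child element was found


print(ssn_from_child_element(DOC))  # prints 489098723
```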
Compare that to the same process using attributes:
- Get the parent node (the person element).
- Get its attributes.
- Iterate until you find the one you want (the ssn attribute).
- Get the value of the attribute.
In the second process, you avoid any need for normalization (attributes always have a singular bit of text associated with them), and you don't have to deal with child elements. I've also found that getting an element's attribute is almost always faster than getting a specific child element. The gain on any single access is small, but it adds up to a real performance benefit over thousands of iterations.
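One way to check this on your own parser is to time both lookups against equivalent documents. The sketch below assumes Python's standard-library xml.dom.minidom and timeit; the two helper functions and the sample documents are invented for illustration, and the numbers it prints will vary by machine and parser, so treat it as a measuring harness rather than proof.

```python
import timeit
from xml.dom import minidom
from xml.dom.minidom import Node

# The same data modeled two ways (sample documents invented for this sketch).
ELEMENT_DOC = "<person><firstName>Bobby</firstName><ssn>489098723</ssn></person>"
ATTRIBUTE_DOC = '<person firstName="Bobby" ssn="489098723"/>'


def ssn_from_element(person):
    """Walk the children, find the ssn element, then read its text nodes."""
    for child in person.childNodes:
        if child.nodeType == Node.ELEMENT_NODE and child.tagName == "ssn":
            return "".join(
                n.data for n in child.childNodes if n.nodeType == Node.TEXT_NODE
            )
    return None


def ssn_from_attribute(person):
    """A single call covers "get attributes, find ssn, get value"."""
    return person.getAttribute("ssn")


# Parse once up front so that only the lookups are timed.
element_person = minidom.parseString(ELEMENT_DOC).documentElement
attribute_person = minidom.parseString(ATTRIBUTE_DOC).documentElement

runs = 10_000
element_time = timeit.timeit(lambda: ssn_from_element(element_person), number=runs)
attribute_time = timeit.timeit(lambda: ssn_from_attribute(attribute_person), number=runs)
print(f"child element: {element_time:.4f}s  attribute: {attribute_time:.4f}s")
```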
The point is that while it doesn't look quite as nice, I'd much rather see more XML like this:
<person ssn="489098723"
        firstName="Bobby"
        lastName="McKenza">
  <occupation>
    <occupation-type status="part-time"
                     job="author" />
    <occupation-type status="part-time"
                     job="programmer" />
  </occupation>
  <address type="home"
           street="112 E. Harney Way"
           city="New York"
           state="NY"
           zip="10012" />
  <!-- etc... -->
</person>
This looks odd to most of us, because almost all of the information is in attributes. But I believe that it's a better use of attributes, and that it will provide a noticeable improvement in performance when it's accessed 10,000 or more times (it takes that many accesses for the really small performance benefits to add up).
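To show how a consumer might read a document shaped like this, here is a short sketch, again assuming Python's xml.dom.minidom. The summarize function and the dictionary it returns are hypothetical; the point is only that single values come straight off attributes, while the repeatable occupation-type and address elements are collected as lists or maps.

```python
from xml.dom import minidom

PERSON_XML = """<person ssn="489098723" firstName="Bobby" lastName="McKenza">
  <occupation>
    <occupation-type status="part-time" job="author" />
    <occupation-type status="part-time" job="programmer" />
  </occupation>
  <address type="home" street="112 E. Harney Way"
           city="New York" state="NY" zip="10012" />
</person>"""


def summarize(xml_text):
    """Hypothetical helper: pull single values from attributes and
    repeatable data from elements."""
    person = minidom.parseString(xml_text).documentElement
    return {
        # Single-valued data: one attribute lookup each.
        "ssn": person.getAttribute("ssn"),
        "name": person.getAttribute("firstName") + " " + person.getAttribute("lastName"),
        # Repeatable data: one entry per matching element.
        "jobs": [
            e.getAttribute("job")
            for e in person.getElementsByTagName("occupation-type")
        ],
        "cities_by_address_type": {
            a.getAttribute("type"): a.getAttribute("city")
            for a in person.getElementsByTagName("address")
        },
    }


print(summarize(PERSON_XML))
```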
I don't use processing instructions, or PIs, often. However, I've been pretty annoyed at the huge number of XML-consuming APIs that now require you to put special codes or instructions in comments, or use a certain namespace and all sorts of special elements, or even use elements with instructions in them. If your XML needs to provide information to a processing API or tool, that is a processing instruction, and XML has a very specific means of doing that.
You've certainly seen these before, especially in the realm of XSL:
<?xml version="1.0"?>
<?xml-stylesheet href="classic.xsl" type="text/html"?>
<?xml-stylesheet href="alternate.xsl" type="text/xml" alternate="yes"?>
<article>
  <!-- etc ... -->
</article>
In this case, the PI lets an XSLT processor know about the stylesheets it should apply to this particular XML document. Without getting into all the APIs that ignore this facility (although JAXB comes immediately to mind), you should take advantage of it; it is infinitely better than something like this:
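Because PIs are real nodes in the document, a tool can find them without any special conventions. The sketch below assumes Python's xml.dom.minidom; the stylesheet_instructions function and the simple regular expression used to split the PI data into name/value pairs are illustrative assumptions (the regex handles only double-quoted values), not a full implementation of the xml-stylesheet rules.

```python
import re
from xml.dom import minidom
from xml.dom.minidom import Node

DOC = """<?xml version="1.0"?>
<?xml-stylesheet href="classic.xsl" type="text/html"?>
<?xml-stylesheet href="alternate.xsl" type="text/xml" alternate="yes"?>
<article/>"""

# Simplified pattern for name="value" pairs inside the PI data (an assumption:
# single-quoted values and character references are not handled here).
PSEUDO_ATTR = re.compile(r'([\w-]+)\s*=\s*"([^"]*)"')


def stylesheet_instructions(xml_text):
    """Hypothetical helper: return one dict per xml-stylesheet PI."""
    doc = minidom.parseString(xml_text)
    found = []
    # PIs that precede the root element are children of the document node.
    for node in doc.childNodes:
        if (node.nodeType == Node.PROCESSING_INSTRUCTION_NODE
                and node.target == "xml-stylesheet"):
            found.append(dict(PSEUDO_ATTR.findall(node.data)))
    return found


for instruction in stylesheet_instructions(DOC):
    print(instruction)
```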
<?xml version="1.0"?>
<!-- xml-stylesheet url="classic.xsl" document-type="text/html" -->
<!-- xml-stylesheet url="alternate.xsl" document-type="text/xml" -->
<article>
  <!-- etc ... -->
</article>
I see code like this all the time. Comments are not meaningful, and it's very dangerous to have an API that consumes comments and treats the information in them as instructions. Javadoc and documentation-producing APIs use comments, but at least they are using them as documentation, rather than actual meaningful instructions. Obviously, if you chose an API that requires this sort of markup, you're stuck, but you should realize that you're not really using XML the way it was intended. In this case, I believe the XML spec writers got it right.
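The difference shows up directly in parser APIs. As a small sketch, assuming Python's standard-library xml.sax: a plain content handler is told about processing instructions through its processingInstruction callback, but it never hears about comments unless you go out of your way to register a separate lexical handler. The InstructionCollector class and collect_instructions function are names invented for this example.

```python
import xml.sax

# One real PI and one "instruction" hidden in a comment.
DOC = b"""<?xml version="1.0"?>
<?xml-stylesheet href="classic.xsl" type="text/html"?>
<!-- xml-stylesheet url="alternate.xsl" document-type="text/xml" -->
<article/>"""


class InstructionCollector(xml.sax.ContentHandler):
    """Hypothetical handler that records every PI the parser reports."""

    def __init__(self):
        super().__init__()
        self.instructions = []

    def processingInstruction(self, target, data):
        # Called for each PI; ContentHandler has no equivalent for comments.
        self.instructions.append((target, data))


def collect_instructions(xml_bytes):
    handler = InstructionCollector()
    xml.sax.parseString(xml_bytes, handler)
    return handler.instructions


# Only the real PI is reported; the comment is silently skipped.
print(collect_instructions(DOC))
```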
I probably should have said this earlier, but I'm by no means an XML purist. In fact, I think many things in XML are silly, misplaced, or downright wrong. However, I do think that if you choose to use XML, some basic ideas are essential to use it correctly; otherwise, why bother to use XML in the first place? If you don't use elements for multi-valued and lengthy data, if you hardly use attributes at all, and if you misuse or don't use PIs, at some point you might just be better off to write a proprietary language or parser to handle your data. Otherwise, you have all the hassles of XML—verbosity, strange-looking documents, encoding headaches, and so on—without the benefits—speedy parsers tuned for a specific purpose.
So please consider these suggestions not because you'll use XML properly or please the semantic gods. Those are bad reasons to do anything in your programming life. But if you look at them as ways to get more out of XML, and to write better performing software, then this all makes a lot of sense. So go forth and break rules...but not without really good reasons!

Brett McLaughlin has worked in computers since the Logo days. (Remember the little triangle?) In recent years, he's become one of the most well-known authors and programmers in the Java and XML communities. He's worked for Nextel Communications, implementing complex enterprise systems; at Lutris Technologies, actually writing application servers; and most recently at O'Reilly Media, Inc., where he continues to write and edit books that matter. Brett's upcoming book, Head Rush Ajax, brings the award-winning and innovative Head First approach to Ajax. His last book, Java 1.5 Tiger: A Developer's Notebook, was the first book available on the newest version of Java technology. And his classic Java and XML remains one of the definitive works on using XML technologies in the Java language.





