
XML and Java technology: A return to basics

Dust off your XML fundamentals and master good document construction

Brett D. McLaughlin, Sr., Author and Editor, O'Reilly Media, Inc.
Brett McLaughlin has worked in computers since the Logo days. (Remember the little triangle?) In recent years, he's become one of the most well-known authors and programmers in the Java and XML communities. He's worked for Nextel Communications, implementing complex enterprise systems; at Lutris Technologies, actually writing application servers; and most recently at O'Reilly Media, Inc., where he continues to write and edit books that matter. Brett's upcoming book, Head Rush Ajax, brings the award-winning and innovative Head First approach to Ajax. His last book, Java 1.5 Tiger: A Developer's Notebook, was the first book available on the newest version of Java technology. And his classic Java and XML remains one of the definitive works on using XML technologies in the Java language.

Summary:  Brett McLaughlin revisits some XML basics, from document structure to the age-old attributes versus elements issue. You'll relearn how to optimize your XML and ensure it's in tip-top shape.

Date:  09 Oct 2007
Level:  Introductory

I've written numerous articles and books on XML and XML-related topics by now, and when I spent a little time looking back over them, I was surprised at what I found. I'm a programmer who has always preferred to get into the bits and bytes (I took assembly in college and liked it) and to have complete control. Even so, the general trend of my articles, tips, and books has been less and less about XML itself and more and more about APIs that manipulate XML, the APIs that wrap those APIs, and still other APIs that let you never touch XML at all.

So the end result of all that is a general concern: have we forgotten how to actually write good XML? Do we even know what good XML is anymore? In this short article, I want to review a few XML 101 tips, and make sure that, with all the tools we have, we don't forget basic principles along the way.

First steps: Using the right parts of XML

One of the biggest issues that's become apparent is that XML authors (the term is being loosely applied here!) have begun to stuff everything into elements. Attributes, processing instructions—they've become things of the past.

Elements

Elements are probably the easiest thing about XML authoring to get right, mostly because XML authors tend to use elements for everything. That, of course, is not right, and it has some pretty serious downsides. In XML, elements are almost always (there are exceptions, but I'm talking about optimal usage here) best used to represent data that has some sort of hierarchy to it, or that at least might have some. To start with the negative case, a first name will never have hierarchy to it; it's a single word, and that's it. A name, however, makes a good element: it breaks down into a first name and a last name at least, and perhaps a middle name, a title, and other components as well. So this isn't really a proper use of an element:

<firstName>Bob</firstName>

But this is:

<name firstName="Bob"
         lastName="Zemeckis"
         title="Mr." />

If you wonder why I didn't nest firstName, lastName, and title elements within the name element, read the last paragraph again. I'll also come back to this in more detail in the next section, focused on attributes.

As a general rule, if you can't imagine a case where a piece of data might appear more than once in a single spot in your document, try to use an attribute. Data that can repeat is a good candidate for an element: you could have two address elements with different types, like home and office, for instance, or two author elements on a book. That doesn't mean the element has to appear twice every time it's used, merely that it can.

It's also not true that elements are best suited for textual data; in plenty of cases a good element has only attributes, and no text content at all (again, I'll reference you to the next section, on attributes). The most important thing to remember is that there are more constructs in XML than just elements.

Attributes

It seems obvious to say, but you should use attributes in XML for single-value data. If an element is used for, say, a person, find the things about that person that are singular values; they're probably your attributes. Social security numbers, IDs, maybe birth dates—these are all excellent attribute candidates.

Of course, you'll see more exceptions to the attribute rules than to other rules. In fact, attributes are used far less than they should be, in my opinion. Developers tend to like the nice solid look of an element:

<person>
  <ssn>489098723</ssn>
</person>

But this makes absolutely no sense—a social security number is very much a single-valued datum. Worse, making the number an element in this way creates a performance hog. When you grab an element's child, as you have to in this case, you get the element node, then navigate to all its child nodes, and iterate over them. Maybe the social security number is the first node, but maybe it's the last node; it's not clear. Then, you've got to get that node's children, which in this case probably consist of a single text node, and get their values. So the process goes through several steps:

  1. Get the parent node (in this case, the person element).
  2. Get its children.
  3. Iterate until you find the one you want (the ssn element).
  4. Optionally, normalize the text nodes of this new element, to make sure things are easier later on.
  5. Get the children of the element.
  6. Grab the value of the text node child.
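Here's a minimal sketch of those six steps in Java, assuming the JDK's built-in DOM parser. The class and method names (ChildElementLookup, childText) are made up for illustration; they aren't part of any API.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, for illustration only.
class ChildElementLookup {

    // Parses an XML string into a DOM Document with the JDK's default parser.
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    // Reads the text of the first child element with the given name,
    // or returns null if there is no such child (or it has no content).
    static String childText(Element parent, String childName) {
        NodeList children = parent.getChildNodes();                // step 2
        for (int i = 0; i < children.getLength(); i++) {           // step 3
            Node child = children.item(i);
            if (child.getNodeType() == Node.ELEMENT_NODE
                    && child.getNodeName().equals(childName)) {
                child.normalize();                                 // step 4
                Node text = child.getFirstChild();                 // step 5
                return text == null ? null : text.getNodeValue();  // step 6
            }
        }
        return null;
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<person>\n  <ssn>489098723</ssn>\n</person>");
        Element person = doc.getDocumentElement();                 // step 1
        System.out.println(childText(person, "ssn"));              // prints 489098723
    }
}
```

Notice that the loop also has to step over the whitespace text nodes that sit between the tags. DOM Level 3's getTextContent() hides steps 4 through 6, but the work is still being done underneath.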

Compare that to the same process using attributes:

  1. Get the parent node (the person element).
  2. Get its attributes.
  3. Iterate until you find the one you want (the ssn attribute).
  4. Get the value of the attribute.
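The attribute version can be sketched the same way, again assuming the JDK's built-in DOM parser. AttributeLookup and attributeValue are hypothetical names used only for this example; the loop is spelled out to mirror the steps above.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper class, for illustration only.
class AttributeLookup {

    // Parses an XML string into a DOM Document with the JDK's default parser.
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    // Reads the value of the named attribute, or returns null if it is absent.
    static String attributeValue(Element parent, String name) {
        NamedNodeMap attributes = parent.getAttributes();      // step 2
        for (int i = 0; i < attributes.getLength(); i++) {     // step 3
            Node attribute = attributes.item(i);
            if (attribute.getNodeName().equals(name)) {
                return attribute.getNodeValue();               // step 4
            }
        }
        return null;
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<person ssn=\"489098723\" firstName=\"Bobby\"/>");
        Element person = doc.getDocumentElement();             // step 1
        System.out.println(attributeValue(person, "ssn"));     // prints 489098723
        // In practice, DOM collapses steps 2 through 4 into a single call.
        // Note that getAttribute returns "" rather than null when the
        // attribute is missing.
        System.out.println(person.getAttribute("ssn"));        // prints 489098723
    }
}
```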

In the second process, you avoid any need for normalization (an attribute always has a single text value associated with it), and you don't have to deal with child elements. I've also found that getting an element's attribute is almost always faster than getting a specific child element. The gain on any one access is small, but it adds up over thousands of iterations.

The point is that while it doesn't look quite as nice, I'd much rather see more XML like this:

<person ssn="489098723"
        firstName="Bobby"
        lastName="McKenza"
>
  <occupation>
    <occupation-type status="part-time"
                     job="author" />
    <occupation-type status="part-time"
                     job="programmer" />
  </occupation>
  <address type="home"
           street="112 E. Harney Way"
           city="New York"
           state="NY"
           zip="10012" />
  <!-- etc... -->
</person>

This looks odd to most of us, because almost all of the information is in attributes. But I believe that it's a better use of attributes, and that it will provide a noticeable improvement in performance when it's accessed 10,000 or more times (it takes that many accesses for the really small performance benefits to add up).
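To show how a document shaped like this reads from code, here's a rough sketch using the JDK's DOM parser. The PersonReader class and its methods are hypothetical, and the XML string is just the sample above with the comment dropped. Single values come straight off attributes; only the data that repeats needs a loop.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical reader class, for illustration only.
class PersonReader {

    // The sample person document from the text, without the comment.
    static final String PERSON_XML =
        "<person ssn=\"489098723\" firstName=\"Bobby\" lastName=\"McKenza\">"
      + "<occupation>"
      + "<occupation-type status=\"part-time\" job=\"author\"/>"
      + "<occupation-type status=\"part-time\" job=\"programmer\"/>"
      + "</occupation>"
      + "<address type=\"home\" street=\"112 E. Harney Way\""
      + " city=\"New York\" state=\"NY\" zip=\"10012\"/>"
      + "</person>";

    // Parses the XML and returns its root element.
    static Element parseRoot(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();
    }

    // Collects the job attribute of every occupation-type element,
    // in document order. This is the multi-valued data, so it lives
    // in repeated elements.
    static List<String> jobs(Element person) {
        List<String> jobs = new ArrayList<>();
        NodeList types = person.getElementsByTagName("occupation-type");
        for (int i = 0; i < types.getLength(); i++) {
            jobs.add(((Element) types.item(i)).getAttribute("job"));
        }
        return jobs;
    }

    public static void main(String[] args) throws Exception {
        Element person = parseRoot(PERSON_XML);
        // Single-valued data: one call per value, no child traversal.
        String name = person.getAttribute("firstName") + " "
                + person.getAttribute("lastName");
        System.out.println(name + ": " + jobs(person));
        // prints Bobby McKenza: [author, programmer]
    }
}
```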

Processing instructions

I don't use processing instructions, or PIs, often. However, I've been pretty annoyed at the huge number of XML-consuming APIs that now require you to put special codes or instructions in comments, or use a certain namespace and all sorts of special elements, or even use elements with instructions in them. If your XML needs to provide information to a processing API or tool, that is a processing instruction, and XML has a very specific means of doing that.

You've certainly seen these before, especially in the realm of XSL:

<?xml version="1.0"?>
<?xml-stylesheet href="classic.xsl" type="text/xml"?>
<?xml-stylesheet href="alternate.xsl" type="text/xml" alternate="yes"?>
<article>
  <!-- etc ... -->
</article>
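PIs like these are first-class nodes as far as a parser is concerned, so any tool can find them without special tricks. As a rough sketch (assuming the JDK's DOM parser; StylesheetInstructions and stylesheetData are names invented for this example), this pulls out the data of every xml-stylesheet PI at the top of a document:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.w3c.dom.ProcessingInstruction;
import org.xml.sax.InputSource;

// Hypothetical helper class, for illustration only.
class StylesheetInstructions {

    // Returns the data portion of each top-level xml-stylesheet PI,
    // in document order.
    static List<String> stylesheetData(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        List<String> data = new ArrayList<>();
        // PIs in the prolog are children of the Document node itself.
        NodeList topLevel = doc.getChildNodes();
        for (int i = 0; i < topLevel.getLength(); i++) {
            Node node = topLevel.item(i);
            if (node.getNodeType() == Node.PROCESSING_INSTRUCTION_NODE) {
                ProcessingInstruction pi = (ProcessingInstruction) node;
                if ("xml-stylesheet".equals(pi.getTarget())) {
                    data.add(pi.getData());
                }
            }
        }
        return data;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<?xml version=\"1.0\"?>"
            + "<?xml-stylesheet href=\"classic.xsl\" type=\"text/xml\"?>"
            + "<?xml-stylesheet href=\"alternate.xsl\" type=\"text/xml\" alternate=\"yes\"?>"
            + "<article/>";
        for (String d : stylesheetData(xml)) {
            System.out.println(d);
        }
    }
}
```

The XML declaration on the first line is not a processing instruction, so it never shows up in the results. The data comes back as one raw string; DOM doesn't parse the href and type pseudo-attributes for you.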

In this case, the PI lets an XSLT processor know about the stylesheets it should apply to this particular XML document. Plenty of APIs ignore this facility (JAXB comes immediately to mind), but you should take advantage of it anyway. It's infinitely better than something like this:

<?xml version="1.0"?>

<!-- xml-stylesheet url="classic.xsl" document-type="text/html" -->
<!-- xml-stylesheet url="alternate.xsl" document-type="text/xml" -->
<article>
  <!-- etc ... -->
</article>

I see code like this all the time. Comments are not meant to carry meaning: the XML specification doesn't even require a parser to pass them along to the application, so it's very dangerous to have an API that consumes comments and treats the information in them as instructions. Javadoc and documentation-producing APIs use comments, but at least they are using them as documentation, rather than as actual meaningful instructions. Obviously, if you chose an API that requires this sort of markup, you're stuck, but you should realize that you're not really using XML the way it was intended. In this case, I believe the XML spec writers got it right.
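SAX makes the difference easy to see. Processing instructions arrive through a callback on the core ContentHandler interface, while comments are only reported if you register the optional LexicalHandler extension. Here's a small sketch using the JDK's SAX parser; InstructionCollector is a hypothetical class name.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler class, for illustration only.
class InstructionCollector extends DefaultHandler {

    // Every PI the parser reports, stored as "target data".
    final List<String> instructions = new ArrayList<>();

    // Part of the core ContentHandler contract: called once per PI.
    @Override
    public void processingInstruction(String target, String data) {
        instructions.add(target + " " + data);
    }

    // Parses the XML and returns the PIs that were reported.
    static List<String> collect(String xml) throws Exception {
        InstructionCollector handler = new InstructionCollector();
        SAXParserFactory.newInstance()
                .newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.instructions;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<?xml version=\"1.0\"?>"
            + "<?xml-stylesheet href=\"classic.xsl\" type=\"text/xml\"?>"
            + "<!-- xml-stylesheet url=\"alternate.xsl\" -->"
            + "<article/>";
        // Only the real PI is reported; the comment never reaches the handler.
        System.out.println(collect(xml));
    }
}
```

A tool that wants the instruction hidden in the comment has to opt in to LexicalHandler and then parse the comment text by hand, which is exactly the kind of work PIs were designed to make unnecessary.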


In conclusion

I probably should have said this earlier, but I'm by no means an XML purist. In fact, I think many things in XML are silly, misplaced, or downright wrong. However, I do think that if you choose to use XML, some basic ideas are essential to using it correctly; otherwise, why bother to use XML in the first place? If you don't use elements for multi-valued and lengthy data, if you hardly use attributes at all, and if you misuse or don't use PIs, at some point you might just be better off writing a proprietary language or parser to handle your data. Otherwise, you have all the hassles of XML (verbosity, strange-looking documents, encoding headaches, and so on) without the benefits: speedy parsers tuned for a specific purpose.

So please consider these suggestions not because you'll use XML properly or please the semantic gods. Those are bad reasons to do anything in your programming life. But if you look at them as ways to get more out of XML, and to write better performing software, then this all makes a lot of sense. So go forth and break rules...but not without really good reasons!


