Level: Introductory
David Mertz (mertz@gnosis.cx), Gesticulator, Gnosis Software, Inc.
01 Nov 2001
In this tip, developerWorks columnist David Mertz advises when to use tag attributes and when to use subelement contents to represent data. Learn what considerations go into designing a DTD, Schema, or just an ad hoc XML format. You'll also learn when attributes and contents are interchangeable, and when they aren't. Code samples show the options.
An odd thing about XML is that it provides two almost, but not quite, equivalent ways of spelling "this is the data." One way to indicate a data value is to put it inside a subelement; another way is to put it in an attribute value. Because there usually isn't an obvious answer for when each of the two approaches is appropriate, XML is not entirely orthogonal (which is computer science speak for "each construct does one thing, and no other construct does the same thing"). This tip offers some guidance for when to use subelements and when to use attributes.
One time when you do not need to decide what data goes where is when you are handed an XML dialect specification to follow -- given to you as a DTD or as a W3C XML Schema, or described informally or by example. If you are not making the choices, don't worry about the suggestions in this tip. Often, though, developers need to design the exact XML dialect to use for a process. If that's your case, read on.
One thing to keep in mind is the difference between XML documents that merely need to be well formed, and those that need to be valid relative to some DTD/Schema. Validity is much more rigorous; it allows you to insist that certain data be present and be structured in a certain way. For the very same reason, it is much more work to make sure a given document production process conforms with validity requirements. Both approaches have advantages; imposing a DTD adds complexity to the element/attribute issue, but there are tradeoffs in both cases. These tradeoffs are discussed below.
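To make the distinction concrete, here is a minimal sketch of a well-formedness check in Python. It assumes the standard library's xml.etree.ElementTree, which is a non-validating parser: it never consults a DTD, so it can only tell you whether a document is well formed. The helper name is_well_formed and both sample documents are made up for illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text: str) -> bool:
    """Return True if xml_text parses as well-formed XML.

    ElementTree does not validate against a DTD or Schema,
    so a True result says nothing about validity.
    """
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True

# Well formed: tags nest properly, even though the nesting makes little sense.
odd_but_well_formed = "<contacts><age><contact/></age></contacts>"
# Not well formed: the <name> tag is never closed.
broken = "<contacts><contact><name>Jane Doe</contact></contacts>"

print(is_well_formed(odd_but_well_formed))
print(is_well_formed(broken))
```

Checking validity needs a validating parser in addition to this; the sketch only covers the weaker of the two requirements.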
Is data order important?
If you want to use a DTD, subelements are strictly ordered, while attributes are unordered. In well-formed-only XML documents, you are free to play with order; after all, in this case any tag can go inside any other tag, at any depth. In both cases, attributes are usually better for unordered data. For XML documents with a DTD, however, use of attributes is almost required for this type of data.
For example, you might have a list of contacts, each of whom must have a name, age, and telephone number. But there is no logical sense in which age precedes telephone number. These properties are thus unordered. In this case, attributes are more intuitive. Compare the brief XML documents in Listings 1 and 2:
Listing 1. Attribute data for contacts
<?xml version="1.0" ?>
<!DOCTYPE contacts SYSTEM "attrs.dtd" >
<contacts>
<contact
name="Jane Doe"
age="74"
telephone="555-3412" />
<contact name="Chieu Win" telephone="555-8888" age="44" />
</contacts>
Listing 2. Subelement data for contacts
<?xml version="1.0" ?>
<!DOCTYPE contacts SYSTEM "subelem.dtd" >
<contacts>
<contact>
<name>Jane Doe</name>
<age>74</age>
<telephone>555-3412</telephone>
</contact>
<contact>
<name>Chieu Win</name>
<telephone>555-8888</telephone>
<age>44</age>
</contact>
</contacts>
Imagine the DTD that is implied by each XML format. For the attribute-oriented form in Listing 1, it might look like Listing 3.
Listing 3. Attribute DTD for contacts document
<!ELEMENT contacts (contact*)>
<!ELEMENT contact EMPTY>
<!ATTLIST contact name CDATA #REQUIRED
age CDATA #REQUIRED
telephone CDATA #REQUIRED >
A subelement-oriented DTD to do the same thing could look like Listing 4.
Listing 4. Subelement DTD for contacts document
<!ELEMENT contacts (contact*)>
<!ELEMENT contact (name,age,telephone)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT age (#PCDATA)>
<!ELEMENT telephone (#PCDATA)>
The obvious problem with the DTD in Listing 4 is that the simple example in Listing 2 is invalid under the DTD (even though it has the data we want). The subelements are out of order. The sidebar shows how you can use unordered subelements with a DTD, but unless there is a different compelling reason, it is better to use the attribute style for unordered data.
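The same point shows up on the consuming side. The sketch below, using Python's xml.etree.ElementTree, reads the contacts from Listings 1 and 2 (with the DOCTYPE lines left out). Attributes come back as a name-to-value mapping, so the order they were written in does not matter; subelements come back in document order. The function matches_sequence is a hypothetical stand-in for the ordering rule in Listing 4, not a real DTD validator.

```python
import xml.etree.ElementTree as ET

ATTR_DOC = """
<contacts>
  <contact name="Jane Doe" age="74" telephone="555-3412" />
  <contact name="Chieu Win" telephone="555-8888" age="44" />
</contacts>
"""

SUBELEM_DOC = """
<contacts>
  <contact>
    <name>Jane Doe</name><age>74</age><telephone>555-3412</telephone>
  </contact>
  <contact>
    <name>Chieu Win</name><telephone>555-8888</telephone><age>44</age>
  </contact>
</contacts>
"""

# The sequence required by <!ELEMENT contact (name,age,telephone)> in Listing 4.
REQUIRED_ORDER = ["name", "age", "telephone"]

def attribute_keys(doc):
    """Sorted attribute names for each contact; written order is irrelevant."""
    return [sorted(contact.attrib) for contact in ET.fromstring(doc)]

def child_order(doc):
    """Child tag names for each contact, in document order."""
    return [[child.tag for child in contact] for contact in ET.fromstring(doc)]

def matches_sequence(tags):
    """Hypothetical stand-in for the Listing 4 sequence rule (not a validator)."""
    return tags == REQUIRED_ORDER

attrs = attribute_keys(ATTR_DOC)
orders = child_order(SUBELEM_DOC)

print(attrs[0] == attrs[1])                    # both contacts expose the same keys
print([matches_sequence(o) for o in orders])   # the second contact breaks the sequence
```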
Does multiple data occur at the same level?
If the same type of data occurs many times within an object, subelements win, hands down. For example, in the contact list scenario, a contacts object contains many contact objects. In this case, it is clear that each contact should be described within a child element of the contacts element.
In real life, however, developers often creep away from this design principle in the course of making revisions. Here is how it happens: First, you find that each Flazbar has a flizbam attached to it (and a flizbam is described by a datum). Good enough, it seems like an obvious choice to save the extra verbosity of a subelement and create a flizbam attribute for the Flazbar tag. A while down the road -- after you have written wonderful production code for handling Flazbars -- you discover that in some situations a Flazbar can have two flizbams. Not a problem: with very little change to your installed code, you just change the DTD to contain:
<!ATTLIST Flazbar flizbam CDATA #REQUIRED
flizbam2 CDATA #IMPLIED>
With the amended DTD and code, your old XML documents are still valid, and new ones work too. After a while you discover the third flizbam ...
It's hard to avoid being tempted into this design pitfall. Data and objects evolve over time, and singular things frequently become dual or multiple. Some XML programmers eschew attributes altogether for this reason, but I think that goes too far. My advice is to think carefully at the design stage about whether a singular datum might have siblings later on. If there is a reasonable probability of multiple siblings in the future, use subelements from the start. If you can be reasonably confident that a data object will remain unique, stick with attributes.
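Here is a rough sketch of what that pitfall costs the code that reads Flazbars, again using Python's xml.etree.ElementTree. The two sample documents and both helper functions are invented for illustration. With numbered attributes, the reader has to know every attribute name ahead of time; with subelements, a third flizbam is just one more child.

```python
import xml.etree.ElementTree as ET

# Attribute style after one revision: each new flizbam needs a new attribute name.
ATTR_STYLE = '<Flazbar flizbam="a" flizbam2="b" />'

# Subelement style: additional flizbams are simply additional children.
ELEM_STYLE = (
    "<Flazbar>"
    "<flizbam>a</flizbam><flizbam>b</flizbam><flizbam>c</flizbam>"
    "</Flazbar>"
)

def flizbams_from_attributes(xml_text, known_names=("flizbam", "flizbam2")):
    """Collect flizbams from attributes; known_names must grow with the DTD."""
    elem = ET.fromstring(xml_text)
    return [elem.get(name) for name in known_names if elem.get(name) is not None]

def flizbams_from_children(xml_text):
    """Collect flizbams from child elements, however many there are."""
    elem = ET.fromstring(xml_text)
    return [child.text for child in elem.findall("flizbam")]

print(flizbams_from_attributes(ATTR_STYLE))
print(flizbams_from_children(ELEM_STYLE))
```

When the third flizbam arrives, the attribute reader needs both a DTD change and a code change, while the subelement reader needs neither.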
Simulating unordered subelements
You could create a DTD that makes the XML document in Listing 2 valid by including the definition, as in this listing.
A DTD that defines contact list subelements very flexibly
However, the DTD above allows far too much flexibility. You could have contact elements with no name, and ones with several ages -- neither of which meets the semantic requirements. Getting what we really want would require the extremely cumbersome definition below.
A cumbersome but accurate DTD for contact list elements
This DTD is ugly, and it gets uglier at a factorial rate with more data points. It is also undesirable to make a DTD stricter for data producers than the semantics require (for example, by imposing the strictly ordered subelement DTD in Listing 4).
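The two content models this sidebar describes can be sketched with a short Python generator. The exact declarations it emits are an assumption reconstructed from the description above, and the function names are hypothetical. The loose form accepts any mix of the three subelements; the strict form accepts each one exactly once in any order. The strict form is factored by first element so that the content model stays deterministic.

```python
def loose_declaration(element, fields):
    """Any number of the given subelements, in any order (too permissive)."""
    choices = "|".join(fields)
    return f"<!ELEMENT {element} ({choices})*>"

def unordered_group(fields):
    """Content-model group accepting each field exactly once, in any order."""
    if len(fields) == 1:
        return fields[0]
    alternatives = []
    for i, first in enumerate(fields):
        rest = fields[:i] + fields[i + 1:]
        alternatives.append(f"({first},{unordered_group(rest)})")
    return "(" + "|".join(alternatives) + ")"

def strict_declaration(element, fields):
    """Exactly one of each subelement, in any order (accurate but cumbersome)."""
    group = unordered_group(list(fields))
    if not group.startswith("("):
        group = f"({group})"
    return f"<!ELEMENT {element} {group}>"

FIELDS = ["name", "age", "telephone"]
print(loose_declaration("contact", FIELDS))
print(strict_declaration("contact", FIELDS))

# The strict declaration grows quickly as data points are added.
for extra in (["email"], ["email", "city"], ["email", "city", "fax"]):
    print(len(FIELDS + extra), len(strict_declaration("contact", FIELDS + extra)))
```

The loop at the end prints the number of fields next to the length of the strict declaration, which gives a feel for how fast the definition balloons.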
Is whitespace preservation required?
After normalization of attributes, you can count on every token in an attribute being separated from its neighbors by whitespace. But that's all you can count on. For readability by developers, you can add vertical and horizontal whitespace to long attribute values without any problem (in fact, you should do this). But once those readable attributes go through an XML parser, the layout of the attribute will probably be somewhat different from that in the source XML.
If whitespace is important, subelements are a better choice. For instance, if you are representing something like source code or poetry, where exact spacing matters, stick to element contents.
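A small experiment shows the difference. The sketch below uses Python's xml.etree.ElementTree on a made-up document that stores the same two-line text once as an attribute value and once as element content. The parser normalizes the attribute value, so the line break does not survive there, while the element content keeps it.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the same text as an attribute and as element content.
SOURCE = '''<poems>
  <poem lines="Roses are red,
      violets are blue" />
  <poem>Roses are red,
      violets are blue</poem>
</poems>'''

root = ET.fromstring(SOURCE)
as_attribute = root[0].get("lines")   # attribute value, after normalization
as_content = root[1].text             # element content, as written

print(repr(as_attribute))
print(repr(as_content))

# The tokens are the same either way; only the layout differs.
print(as_attribute.split() == as_content.split())
```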
Does readability count?
Ideally, XML should be a format computers read, not one humans read. But, fortunately or unfortunately, programmers are humans too; and for the foreseeable future, we are going to spend a lot of time reading, writing, and debugging XML files. It is positively painful to read XML that is formatted with only machines in mind (no whitespace, or nonsensical whitespace).
Personally, I find it much easier to read and write attribute-oriented XML formats than subelement-oriented ones. Look again at Listing 1 and Listing 2 above to see what I mean. Neither is horrible to read, but the attribute version in Listing 1 is easier -- and better still to write, because you do not need to worry about capricious subelement ordering.
Conclusion
I have pointed to some cases where subelements or attributes are more desirable. Keeping in mind the principles addressed can lead to clearer and cleaner XML document formats. Unfortunately, sometimes the real situation falls into multiple cases (pointing in opposite directions). And a lot of times, data designs change enough to invalidate previous motivations. Use the rules given in this tip where possible, but above all "use (informed) common sense."
About the author
David Mertz uses a wholly unstructured brain to write about structured document formats. David may be reached at mertz@gnosis.cx; his life pored over at http://gnosis.cx/dW/.