For the last seven years, I have been fortunate enough to watch XML develop and mature from a special point of view thanks to my work as a consultant, trainer, and writer.
When XML was first introduced, organizations and developers were politely suspicious of this new "markup language -- whatever that is." Then, as they applied XML to more and more of their problems, they became enthusiastic. Now, a wide range of developers and organizations naturally include XML in their projects.
Unfortunately, with growing use has come growing abuse -- in this respect, XML mirrors the adoption of other technologies. The first users of any new technology are often passionate about it (they have to be if they want to convince colleagues and customers of its value), but they too may have doubts, so they will usually take time to research how best to implement the new technology.
As a technology matures, it is increasingly taken for granted. And as the technology is used with more and more applications, more mistakes are made with it. Fortunately, in parallel, experience builds up: Well-tested solutions to common problems emerge and are documented, along with common pitfalls.
For this series of four articles, I have gone through my notes and searched for XML pitfalls that appear again and again. My hope is that as I document them and provide alternatives, I will help you avoid falling victim to common problems with the technology.
I'll start with the most fundamental layer: XML itself. Adherence to a common syntax is the first step towards building reliable applications. This installment focuses on three common issues:
- Use of a parser and character escaping
- Encodings
- Namespaces
The articles that follow will review how to exploit XML documents reliably, how to validate and test XML documents, and how to interface XML with other file formats, such as images, movies, and word-processing documents.
This first section covers some basic material on XML syntax. If you are already well-versed in this, feel free to skip to the next section.
XML syntax is simple: Essentially, you must balance the opening and closing tags. Yet I wish I had a proverbial nickel for every e-mail I've received saying, "I'm trying to process the attached XML document through such and such tool and it fails -- could you recommend a better tool?" Invariably, I open the document and find an obvious syntax error such as an empty tag without the closing slash (it should look like this: <empty/>).
If the document does not completely adhere to XML syntax, then it's not an XML document; if it's not an XML document, XML tools cannot process it. XML has a very precise and formal syntax. Either a document adheres to the syntax fully, or it is not recognized. Simple as that.
Conversely, some applications may refuse perfectly valid documents. An application might not implement the syntax fully and fail to recognize, say, character entities (&#238; for î, for example).
The problem is XML's apparent simplicity. It often may seem easier and faster to hack something rather than to learn yet another component. This may work in a closed loop where an application reads the document it has produced, but it is unlikely to work in a production environment where several applications work on the document.
Fortunately, it's easy to avoid this problem entirely by using an XML parser. XML parsers are available in every programming language (even Cobol enjoys strong XML support), so you have no reason not to use them.
As a developer, you have two options: an XML parser or a marshalling component. If you want or need low-level control over the decoding of an XML document, then you should use an XML parser. For the purposes of this article, it does not matter if the parser follows the DOM, JDOM, SAX, or StAX, but a real XML parser is the only guarantee that you will read every XML document properly.
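To make the point concrete, here is a minimal sketch that lets a real parser decide whether a document is XML. It assumes the JAXP DOM parser bundled with the JDK; the class name WellFormedCheck and the method parseOrNull are hypothetical names chosen for this illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical helper for illustration; not part of any library.
final class WellFormedCheck {

    // Returns the parsed document, or null when the text is not well-formed XML.
    static Document parseOrNull(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            // Rethrow parse errors instead of letting the default handler print them.
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { /* ignored in this sketch */ }
                public void error(SAXParseException e) throws SAXException { throw e; }
                public void fatalError(SAXParseException e) throws SAXException { throw e; }
            });
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (SAXException e) {
            return null; // syntax error: this is not an XML document
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

With this helper, "<doc><empty/></doc>" is accepted, while "<doc><empty></doc>" (the missing-slash mistake described earlier) is rejected by the parser rather than by ad hoc code.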
If you don't need as much control over parsing, you may find a marshalling component -- such as JAXB, Castor, or Axis -- more convenient. Marshalling components map directly between XML tags and Java™ objects. JAXB and Castor are designed to work with documents on file, and Axis works with Web services. Marshalling components embed an XML parser, so you can be sure that they implement the syntax fully.
While I recommend the use of a parser for reading XML documents, you might get by if you implement your own routines for writing documents. Reading XML documents is a complex task because the reader must support the complete syntax, but writing XML documents is comparatively easy because you can get by with a subset of the syntax: If you don't need attributes, you don't have to support them; if you don't need multiple encodings, you don't have to support them; and so on.
The only pitfall here is that you need to escape reserved characters properly (see Table 1). Pay special attention to characters such as î, which may have to be written as character entities (&#238; in this case), because whether they need escaping depends on the encoding of the document (see "Encoding headaches" below).
Table 1. Reserved characters
| Character | Escape sequence | Notes |
| < | &lt; | |
| & | &amp; | |
| > | &gt; | |
| ' | &apos; | In attributes only, if you use ' as the separator |
| " | &quot; | In attributes only, if you use " as the separator |
| other | &#unicode; | Any character not supported in the current encoding; unicode is the character's code point in decimal |
A simple loop, similar to Listing 1, is usually sufficient. It is possible to implement the function more efficiently, but Listing 1 is syntactically valid if you write to a UTF-8 or UTF-16 stream (otherwise, you need to escape some characters to character entities as well).
Listing 1. Trivial escaping implementation
// Assumes the output is written as UTF-8 or UTF-16,
// so only the five reserved characters need escaping.
public String escape(String content)
{
    StringBuilder buffer = new StringBuilder();
    for(int i = 0; i < content.length(); i++)
    {
        char c = content.charAt(i);
        if(c == '<')
            buffer.append("&lt;");
        else if(c == '>')
            buffer.append("&gt;");
        else if(c == '&')
            buffer.append("&amp;");
        else if(c == '"')
            buffer.append("&quot;");
        else if(c == '\'')
            buffer.append("&apos;");
        else
            buffer.append(c);
    }
    return buffer.toString();
}
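When the output encoding is narrower than UTF-8 or UTF-16, the "other" row of Table 1 applies. The following is a sketch of one way to handle it, assuming the standard java.nio.charset API; the class name EncodingAwareEscaper is hypothetical, and the code assumes the input holds no unpaired surrogates or control characters that XML forbids.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

// Hypothetical helper for illustration; not part of any library.
final class EncodingAwareEscaper {

    // Escapes reserved characters, plus any character the target charset cannot encode.
    static String escape(String content, Charset charset) {
        CharsetEncoder encoder = charset.newEncoder();
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < content.length(); ) {
            int cp = content.codePointAt(i);   // whole code point, not half a surrogate pair
            i += Character.charCount(cp);
            switch (cp) {
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '&':  out.append("&amp;");  break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&apos;"); break;
                default: {
                    String s = new String(Character.toChars(cp));
                    if (encoder.canEncode(s)) {
                        out.append(s);
                    } else {
                        // Decimal character reference, as in the last row of Table 1.
                        out.append("&#").append(cp).append(';');
                    }
                }
            }
        }
        return out.toString();
    }
}
```

For example, escaping "Benoît" for US-ASCII yields "Beno&#238;t", while the same call for UTF-8 leaves the î untouched.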
Some developers prefer CDATA sections over escaping. CDATA is a mechanism for indicating that a portion of the document may contain unescaped reserved characters. An example is: <condition><![CDATA[a > 4]]></condition>. I will revisit CDATA sections in the third article in this series, but for now, suffice it to say that they are less safe than escaping: a CDATA section ends at the first ]]> it contains, so one CDATA section cannot include another CDATA section.
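If you do write CDATA sections by hand, the closing sequence ]]> inside the content has to be dealt with. A common workaround, sketched here under a hypothetical class name, is to split the content into two adjacent CDATA sections so that the sequence never appears intact.

```java
// Hypothetical helper for illustration; not part of any library.
final class CdataWriter {

    // Wraps text in a CDATA section, splitting any "]]>" across two sections.
    static String wrap(String content) {
        // "]]>" becomes "]]" + end of section + start of a new section + ">"
        return "<![CDATA[" + content.replace("]]>", "]]]]><![CDATA[>") + "]]>";
    }
}
```

A parser reading the result concatenates the adjacent sections and recovers the original text, but the extra bookkeeping illustrates why plain escaping is usually the simpler choice.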
For a more flexible solution, turn to a transformer -- see my tip "Implement XMLReader," here on developerWorks.
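As a rough illustration of the transformer approach, the sketch below builds a small DOM tree and lets an identity transformer serialize it, so that escaping is handled by the library. Note that it uses a DOMSource for brevity, not the custom XMLReader described in the tip; the class name TransformerWriter is hypothetical, and the condition element is borrowed from the CDATA example.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper for illustration; not part of any library.
final class TransformerWriter {

    // Serializes <condition>text</condition>, leaving all escaping to the transformer.
    static String writeCondition(String text) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .newDocument();
            Element condition = doc.createElement("condition");
            condition.setTextContent(text); // raw text, no manual escaping
            doc.appendChild(condition);

            // A transformer created without a stylesheet copies its input unchanged.
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            identity.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (ParserConfigurationException | TransformerException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Calling writeCondition("a < 4 & b") produces markup in which < and & are already escaped, with no escaping code of your own to maintain.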
What if you must interface with an application that deviates from the XML syntax and you cannot convince the developer to fix his or her application?
I find it easier to treat such applications as if they are not producing XML at all, and I include an additional step to convert from their deviant XML into proper XML. Why the extra step? Because it isolates the non-conformance and allows me to use any XML tool I choose for the remainder of the processing.
More serious problems can arise from the use of encodings. Developers often overlook the fact that encodings do not limit the set of characters that XML supports. Every XML document supports the full Unicode character set (16-bit or 32-bit characters in XML 1.1).
Encoding XML documents can reduce their size, but it does not limit the document to a subset of Unicode -- thanks to the magic of character entities. Indeed, through character entities, it is possible to insert any character from the Unicode table, even if the document uses the most restrictive encoding (US-ASCII, which is only good for four languages: English, Hawaiian, Latin, and Swahili).
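A short experiment shows this. The sketch below, which assumes the JAXP DOM parser bundled with the JDK and uses a hypothetical class name and sample document, parses a pure US-ASCII byte stream and still ends up with a non-ASCII character in memory.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

// Hypothetical demo class for illustration; not part of any library.
final class AsciiDocumentDemo {

    // Parses the bytes and returns the text content of the root element.
    static String readRootText(byte[] xmlBytes) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xmlBytes));
            return doc.getDocumentElement().getTextContent();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    // A document declared and stored as US-ASCII that nevertheless carries U+00EE.
    static String sample() {
        String xml = "<?xml version=\"1.0\" encoding=\"US-ASCII\"?>"
                   + "<author>Beno&#238;t</author>";
        return readRootText(xml.getBytes(StandardCharsets.US_ASCII));
    }
}
```

The string returned by sample() contains the character î (U+00EE) even though every byte of the document is plain ASCII.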
This is a problem because while a Java application or a recent version of DB2® might support Unicode, few legacy applications do. So if the XML stream feeds a legacy application, you must deal with Unicode. To avoid misunderstanding, let me state again that imposing an encoding is not a solution because, as explained above, it is always possible to escape special characters to character entities.
Because rewriting a legacy application is seldom an option, you need a conversion routine that will convert Unicode characters into a set that is acceptable to the application -- for example converting "î" into a straight "i" (removing the circumflex). Most XML parsers provide routines for manipulating Unicode characters.
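One possible shape for such a routine is sketched below. It relies on java.text.Normalizer from the Java standard library rather than on a parser-specific routine, and it assumes the legacy application accepts only ASCII; the class name LegacyText and the choice of "?" as a replacement are assumptions made for this example.

```java
import java.text.Normalizer;

// Hypothetical helper for illustration; not part of any library.
final class LegacyText {

    // Reduces text to ASCII: strips accents, then replaces whatever is left over.
    static String toAscii(String text) {
        // NFD splits "î" into a plain "i" followed by a combining circumflex.
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        // Drop the combining marks, keeping the base letters.
        String stripped = decomposed.replaceAll("\\p{M}+", "");
        // Anything still outside ASCII has no simple equivalent; mark it visibly.
        return stripped.replaceAll("[^\\x00-\\x7F]", "?");
    }
}
```

Here toAscii("Benoît") gives "Benoit". The conversion is lossy by design, so apply it only at the boundary with the legacy application and keep the original Unicode text everywhere else.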
The third and final source of problems this article covers is the use of XML namespaces.
Namespaces were introduced to manage XML vocabularies and to prevent tag name collisions. It is common for two vocabularies to use the same tag in different contexts. For example, a messaging vocabulary might have tags for subject, date, from, to, and body (see Listing 2), while a digital asset vocabulary might have tags for subject, date, description, camera, and frame number (see Listing 3).
Listing 2. A messaging vocabulary
<envelope>
   <subject>Test memo</subject>
   <date>April 26, 2005</date>
   <from>jack@writeit.com</from>
   <to>john@xmli.com</to>
   <body>memo body goes here</body>
</envelope>
Listing 3. A digital asset vocabulary
<photo>
   <subject>Westlicht Museum of Camera and Photography, Vienna</subject>
   <date>April 25, 2005</date>
   <description>Lobby of the museum</description>
   <camera>Nikon D70</camera>
   <frame>5643</frame>
</photo>
Conflicts arise when a digital asset is sent through the messaging platform because the messaging software confuses the subject and date tags in the two vocabularies. In other words, the name of a tag is not a global identifier.
XML namespaces turn local names into global ones by appending a global identifier to the tag name. To guarantee the uniqueness of global identifiers, they must be URIs (meaning they most likely contain a domain name that has been registered to guarantee uniqueness). The result looks like Listing 4.
Listing 4. Combining vocabularies
<env:envelope xmlns:env="http://psol.com/2005/env"
              xmlns:ph="http://psol.com/2005/photo">
   <env:subject>Latest photo</env:subject>
   <env:date>April 27, 2005</env:date>
   <env:from>jack@writeit.com</env:from>
   <env:to>john@xmli.com</env:to>
   <env:body>
      <ph:photo>
         <ph:subject>Westlicht Museum of Camera and Photography, Vienna</ph:subject>
         <ph:date>April 25, 2005</ph:date>
         <ph:description>Lobby of the museum</ph:description>
         <ph:camera>Nikon D70</ph:camera>
         <ph:frame>5643</ph:frame>
      </ph:photo>
   </env:body>
</env:envelope>
Let me clarify two things that are often misunderstood:
- The URI is an identifier, not an address.
- The prefix is not an identifier.
Although in practice most URIs are addresses (URLs), for XML namespaces, they are only used as identifiers. I wish that namespaces could be identified like Java packages (for example, com.psol.vocabulary instead of the more confusing http://psol.com/vocabulary), but they cannot.
Because they are identifiers, the addresses may be invalid -- meaning they may return a "404 - Resource not found" error if you try to follow them. But they still serve their purpose. And contrary to common belief, namespace URIs do not point to a W3C XML Schema.
Secondly, because in this context URIs are identifiers, your application must match the URI letter-for-letter. It would be a mistake to adapt the URI of an XML vocabulary to, for instance, point to your server. For example, the URI for XSL is http://www.w3.org/1999/XSL/Transform. You cannot adapt it into, say, http://www.ibm.com/1999/XSL/Transform if you work at IBM®. In fact, you cannot change the URI of an existing vocabulary at all.
When I teach XSLT, students frequently complain that the processor does not work when, in fact, they have not reproduced the XSLT URI exactly.
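A quick check along the following lines can catch that mistake early. It is only a sketch, assuming the JAXP DOM parser bundled with the JDK; the class name XsltNamespaceCheck and the method isXsltStylesheet are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper for illustration; not part of any library.
final class XsltNamespaceCheck {

    // The XSLT namespace URI, reproduced letter-for-letter.
    static final String XSLT_NS = "http://www.w3.org/1999/XSL/Transform";

    // True when the root element is stylesheet or transform in the XSLT namespace.
    static boolean isXsltStylesheet(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required for getNamespaceURI()
            Element root = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
            String local = root.getLocalName();
            // Plain string equality: no case folding, no URL normalization.
            return XSLT_NS.equals(root.getNamespaceURI())
                    && ("stylesheet".equals(local) || "transform".equals(local));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

A stylesheet whose namespace declaration ends in /XSL/transform, with a lowercase t, fails this check, which is exactly how an XSLT processor treats it: as a document in some other, unknown vocabulary.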
One result of all this is that you should refrain from changing namespaces. It is generally a bad idea to include a version number in a URI as it is guaranteed to break applications when you upgrade (and yes, I realize the W3C did just that with SOAP).
Another common mistake is to confuse a prefix with an identifier. A prefix is not an identifier for the same reason that a tag name cannot be an identifier: The risk of two different applications using the same prefix is high. Therefore, namespace prefixes are transparent, and you should never manipulate them explicitly in your application. However, it is perfectly reasonable for an XML writer to modify prefixes in a document (for example, to avoid conflicts).
So avoid writing code like that in Listing 5, and instead emulate that in Listing 6.
Listing 5. Incorrect testing of prefixes
public void startElement(String uri,String local,String qname,Attributes atts)
{
   // Wrong: the prefix is arbitrary and may differ from one document to the next
   if(qname.equals("env:envelope"))
      ; // do something
}
Listing 6. Correct testing of namespace URI
public void startElement(String uri,String local,String qname,Attributes atts)
{
   // Right: test the namespace URI and local name used in Listing 4
   if(uri.equals("http://psol.com/2005/env")
      && local.equals("envelope"))
      ; // do something
}
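To see why the prefix must not matter, consider this runnable sketch. It assumes the JAXP SAX parser bundled with the JDK and reuses the namespace URI and element name from Listing 4; the class name EnvelopeCounter is hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler for illustration; not part of any library.
final class EnvelopeCounter extends DefaultHandler {

    // Namespace URI taken from Listing 4.
    static final String ENV_NS = "http://psol.com/2005/env";

    private int count;

    @Override
    public void startElement(String uri, String local, String qname, Attributes atts) {
        // Match on namespace URI and local name; qname (with its prefix) is ignored.
        if (ENV_NS.equals(uri) && "envelope".equals(local)) {
            count++;
        }
    }

    // Counts envelope elements in the given document, whatever prefix it uses.
    static int countEnvelopes(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // otherwise uri and local arrive empty
            EnvelopeCounter handler = new EnvelopeCounter();
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
            return handler.count;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

The same handler recognizes <env:envelope>, <e:envelope>, and an unprefixed <envelope> with a default namespace declaration, as long as each is bound to the URI above. It ignores an <env:envelope> whose prefix is bound to a different URI.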
More problems and more solutions
If you are mindful of these pitfalls, you can improve your XML coding dramatically. More importantly, you can minimize the risks of incompatibilities and greatly simplify maintenance of XML applications. The remainder of this series reviews other common pitfalls related not to syntax but to applications of XML.
- Read Benoît Marchal's tip "Implement XMLReader" (developerWorks, November 2003), which explains how to use a transformer to write syntactically correct documents.
- Check out "SAX, the power API" (developerWorks, August 2001), which explains how to use a SAX parser and compares event-based APIs (like SAX) to object-based ones (such as DOM).
- Avoid the pitfalls of XML namespaces -- read Uche Ogbuji's article "Use XML namespaces with care" (developerWorks, April 2004), part of his "Principles of XML design" series.
- Review the other installments of Benoît Marchal's Working XML column.
- Find hundreds more XML resources on the developerWorks XML zone.

Benoît Marchal is a Belgian consultant. He is the author of XML by Example, Second Edition and other XML books. You can contact him at bmarchal@pineapplesoft.com or through his personal site at marchal.com.





