Working XML: Safe coding practices, Part 1

Avoid common XML mistakes

Benoît Marchal (bmarchal@pineapplesoft.com), Consultant, Pineapplesoft
Benoît Marchal is a Belgian consultant. He is the author of XML by Example, Second Edition and other XML books. You can contact him at bmarchal@pineapplesoft.com or through his personal site at marchal.com.

Summary:  Benoît has reviewed his project notes and compiled a list of the most common pitfalls with XML technology. Save yourself a great deal of frustration by watching out for these potential problems in your own projects. In the first article in this series of four, Benoît reviews the hazards of the XML language itself.

Date:  19 Aug 2005 (Published 06 May 2005)
Level:  Intermediate

For the last seven years, I have been fortunate enough to watch XML develop and mature from a special point of view thanks to my work as a consultant, trainer, and writer.

When XML was first introduced, organizations and developers were politely suspicious of this new "markup language -- whatever that is." Then, as they applied XML to more and more of their problems, they became enthusiastic. Now, a wide range of developers and organizations naturally include XML in their projects.

Unfortunately, with growing use has come growing abuse -- in this respect, XML mirrors the adoption of other technologies. The first users of any new technology are often passionate about it (they have to be if they want to convince colleagues and customers of its value), but they too may have doubts, so they will usually take time to research how best to implement the new technology.

As a technology matures, it is increasingly taken for granted. And as the technology is used with more and more applications, more mistakes are made with it. Fortunately, in parallel, experience builds up: Well-tested solutions to common problems emerge and are documented, along with common pitfalls.

For this series of four articles, I have gone through my notes and searched for XML pitfalls that appear again and again. My hope is that as I document them and provide alternatives, I will help you avoid falling victim to common problems with the technology.

I'll start with the most fundamental layer: XML itself. Adherence to a common syntax is the first step towards building reliable applications. This installment focuses on three common issues:

  • Use of a parser and character escaping
  • Encodings
  • Namespaces

The articles that follow will review how to exploit XML documents reliably, how to validate and test XML documents, and how to interface XML with the many other file formats in use, such as images, movies, and word-processing documents.

A gentle syntax

This first section covers some basic material on XML syntax. If you are already well-versed in this, feel free to skip to the next section.

XML syntax is simple: Essentially, you must balance the opening and closing tags. Yet I wish I had a proverbial nickel for every e-mail I've received saying, "I'm trying to process the attached XML document through such and such tool and it fails -- could you recommend a better tool?" Invariably, I open the document and find an obvious syntax error such as an empty tag without the closing slash (it should look like this: <empty/>). If the document does not completely adhere to XML syntax, then it's not an XML document; if it's not an XML document, XML tools cannot process it. XML has a very precise and formal syntax. Either a document adheres to the syntax fully, or it is not recognized. Simple as that.

Conversely, some applications may refuse perfectly valid documents. An application might not implement the syntax fully and fail to recognize, say, character entities (&#238; for î, for example).

The problem is XML's apparent simplicity. It may often seem easier and faster to hack something together than to learn yet another component. This may work in a closed loop where an application reads the document it has produced, but it is unlikely to work in a production environment where several applications work on the document.

Solution and fix

Fortunately, it's easy to avoid this problem entirely by using an XML parser. XML parsers are available in every programming language (even COBOL enjoys strong XML support), so you have no reason not to use them.

As a developer, you have two options: an XML parser or a marshalling component. If you want or need low-level control over the decoding of an XML document, then you should use an XML parser. For the purposes of this article, it does not matter whether the parser follows the DOM, JDOM, SAX, or StAX API, but a real XML parser is the only guarantee that you will read every XML document properly.
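
To make the point concrete, here is a minimal sketch that uses the JAXP DOM parser shipped with Java to decide whether a string is well-formed XML. The class name WellFormedCheck and the method isWellFormed are assumptions made for this illustration; any of the parser APIs named above would serve equally well.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper; the names are assumptions made for this sketch.
final class WellFormedCheck {

    // Returns true only if a real XML parser accepts the text as well-formed.
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // The parser rejected the text: it is not an XML document.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

With this helper, a document such as <root><empty/></root> is accepted, while <root><empty></root> (the missing closing slash mentioned earlier) is rejected by the parser rather than by hand-written string handling.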

If you don't need as much control over parsing, you may find a marshalling component -- such as JAXB, Castor, or Axis -- more convenient. Marshalling components map directly between XML tags and Java™ objects. JAXB and Castor are designed to work with documents on file, and Axis works with Web services. Marshalling components embed an XML parser, so you can be sure that they implement the syntax fully.

While I recommend the use of a parser for reading XML documents, you might get by if you implement your own routines for writing documents. Reading XML documents is a complex task because the reader must support the complete syntax, but writing XML documents is comparatively easy because you can get by with a subset of the syntax: If you don't need attributes, you don't have to support them; if you don't need multiple encodings, you don't have to support them; and so on.

The only pitfall here is that you need to escape reserved characters properly (see Table 1). Pay special attention to characters outside ASCII (for example, î): whether you must escape them to character entities such as &#238; depends on the encoding of the document (see "Encoding headaches" below).


Table 1. Reserved characters
Character   Escape sequence   Notes
<           &lt;
&           &amp;
>           &gt;
'           &apos;            In attributes only, if you use ' as the separator
"           &quot;            In attributes only, if you use " as the separator
other       &#nnn;            Any character not supported in the current encoding (nnn is the character's Unicode code point)

A simple loop, similar to Listing 1, is usually sufficient. It is possible to implement the function more efficiently, but Listing 1 is syntactically valid if you write to a UTF-8 or UTF-16 stream (otherwise, you need to escape some characters to character entities as well).


Listing 1. Trivial escaping implementation
// Assumes the output stream is encoded as UTF-8 or UTF-16,
// so only the reserved characters need escaping.
public String escape(String content)
{
    // StringBuilder: no synchronization is needed for a local buffer
    StringBuilder buffer = new StringBuilder();
    for(int i = 0;i < content.length();i++)
    {
       char c = content.charAt(i);
       if(c == '<')
          buffer.append("&lt;");
       else if(c == '>')
          buffer.append("&gt;");
       else if(c == '&')
          buffer.append("&amp;");
       else if(c == '"')
          buffer.append("&quot;");
       else if(c == '\'')
          buffer.append("&apos;");
       else
          buffer.append(c);
    }
    return buffer.toString();
}

Some developers prefer CDATA sections over escaping. CDATA is a mechanism for indicating that a portion of the document may contain unescaped reserved characters. An example is: <condition><![CDATA[a > 4]]></condition>. I will revisit CDATA sections in the third article in this series, but for now, suffice it to say that they are less safe than escaping because the content of a CDATA section cannot contain the sequence ]]>, so one CDATA section cannot include another CDATA section.

For a more flexible solution, turn to a transformer -- see my tip "Implement XMLReader," here on developerWorks.
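
As a rough illustration of the transformer approach, the sketch below builds a small DOM tree and serializes it with an identity Transformer from javax.xml.transform, so the serializer does the escaping. This is a simpler variant than the XMLReader technique described in the tip, not a reproduction of it; the class name ConditionWriter and the condition element are assumptions borrowed from the CDATA example above.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper; the names are assumptions made for this sketch.
final class ConditionWriter {

    // Serializes <condition>text</condition> and leaves escaping to the serializer.
    static String write(String text) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element condition = doc.createElement("condition");
            condition.appendChild(doc.createTextNode(text));
            doc.appendChild(condition);

            // A Transformer created without a stylesheet copies its input unchanged.
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            identity.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (ParserConfigurationException | TransformerException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Calling ConditionWriter.write("a < 4 & b > 2") yields a condition element in which the < and & have been written as &lt; and &amp; without any escaping code of your own.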

The other fix

What if you must interface with an application that deviates from the XML syntax and you cannot convince the developer to fix his or her application?

I find it easier to treat such applications as if they are not producing XML at all, and I include an additional step to convert from their deviant XML into proper XML. Why the extra step? Because it isolates the non-conformance and allows me to use any XML tool I choose for the remainder of the processing.
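
What the conversion step looks like depends entirely on how the other application deviates. As a sketch only, assume the deviation is that the application writes bare ampersands (Tom & Jerry) instead of &amp;. The class AmpersandRepair and its regular expression are assumptions made for this example, and the filter would also rewrite ampersands inside CDATA sections, so it suits only input that does not use them.

```java
import java.util.regex.Pattern;

// Hypothetical pre-filter; the deviation it repairs is an assumption of this sketch.
final class AmpersandRepair {

    // Matches an ampersand that does not start an entity or character reference.
    private static final Pattern BARE_AMPERSAND = Pattern.compile(
        "&(?!(?:[A-Za-z_][A-Za-z0-9._-]*|#[0-9]+|#x[0-9A-Fa-f]+);)");

    // Turns the deviant text into something an XML parser can accept.
    static String repair(String almostXml) {
        return BARE_AMPERSAND.matcher(almostXml).replaceAll("&amp;");
    }
}
```

The repaired text can then go to any standard parser, which keeps the non-conformance confined to this one class.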


Encoding headaches

More serious problems can arise from the use of encodings. Developers often overlook the fact that an encoding does not limit the set of characters that an XML document can contain. Whatever its encoding, every XML document can use the full Unicode character set, including characters that do not fit in 16 bits.

Encoding XML documents can reduce their size, but it does not limit the document to a subset of Unicode -- thanks to the magic of character entities. Indeed, through character entities, it is possible to insert any character from the Unicode table, even if the document uses the most restrictive encoding (US-ASCII, which is only good for four languages: English, Hawaiian, Latin, and Swahili).
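
The mechanism is easy to see in code. The following sketch rewrites every character outside US-ASCII as a decimal character entity, so the result can be stored in the most restrictive encoding without losing anything. The class name AsciiReferences is an assumption; the method does not escape reserved characters such as < and &, so it is meant to run on text that has already been escaped.

```java
// Hypothetical helper; the names are assumptions made for this sketch.
final class AsciiReferences {

    // Replaces every non-ASCII character with a decimal character entity.
    static String toAsciiSafe(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            // Work on code points so that surrogate pairs become one reference.
            int codePoint = text.codePointAt(i);
            if (codePoint < 128) {
                out.appendCodePoint(codePoint);
            } else {
                out.append("&#").append(codePoint).append(';');
            }
            i += Character.charCount(codePoint);
        }
        return out.toString();
    }
}
```

For instance, the name Benoît becomes Beno&#238;t, which a conforming parser reads back as exactly the same characters.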

This is a problem because while a Java application or a recent version of DB2® might support Unicode, few legacy applications do. So if the XML stream feeds a legacy application, you must deal with Unicode. To avoid misunderstanding, let me state again that imposing an encoding is not a solution because, as explained above, it is always possible to escape special characters to character entities.

Because rewriting a legacy application is seldom an option, you need a conversion routine that will convert Unicode characters into a set that is acceptable to the application -- for example converting "î" into a straight "i" (removing the circumflex). Most XML parsers provide routines for manipulating Unicode characters.
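
One possible conversion routine, sketched here with java.text.Normalizer from the Java standard library rather than with a parser API, decomposes each accented character and then drops the combining marks. The class name LegacyText is an assumption, and the approach only covers characters that Unicode can decompose: letters such as ø or ß pass through unchanged and would need a mapping table of your own.

```java
import java.text.Normalizer;

// Hypothetical helper; the names are assumptions made for this sketch.
final class LegacyText {

    // Removes accents by decomposing characters and deleting combining marks.
    static String stripDiacritics(String text) {
        // NFD turns a precomposed letter into a base letter plus combining marks.
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        // \p{M} matches the combining marks left behind by the decomposition.
        return decomposed.replaceAll("\\p{M}+", "");
    }
}
```

Applied to Benoît, the routine returns Benoit, which a legacy application limited to ASCII can store.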


Namespace concerns

The third and final source of problems this article covers is the use of XML namespaces.

Namespaces were introduced to manage XML vocabularies and to prevent clashes between tag names. It is common for two vocabularies to use the same tag in different contexts. For example, a messaging vocabulary might have tags for subject, date, from, to, and body (see Listing 2), while a digital asset vocabulary might have tags for subject, date, description, camera, and frame number (see Listing 3).


Listing 2. A messaging vocabulary
<envelope>
   <subject>Test memo</subject>
   <date>April 26, 2005</date>
   <from>jack@writeit.com</from>
   <to>john@xmli.com</to>
   <body>memo body goes here</body>
</envelope>




Listing 3. A digital asset vocabulary
<photo>
   <subject>Westlicht Museum of Camera and Photography, Vienna</subject>
   <date>April 25, 2005</date>
   <description>Lobby of the museum</description>
   <camera>Nikon D70</camera>
   <frame>5643</frame>
</photo>

Conflicts arise when a digital asset is sent through the messaging platform because the messaging software confuses the subject and date tags in the two vocabularies. In other words, the name of a tag is not a global identifier.

XML namespaces turn local names into global ones by appending a global identifier to the tag name. To guarantee the uniqueness of global identifiers, they must be URIs (meaning they most likely contain a domain name that has been registered to guarantee uniqueness). The result looks like Listing 4.


Listing 4. Combining vocabularies
<env:envelope xmlns:env="http://psol.com/2005/env"
              xmlns:ph="http://psol.com/2005/photo">
   <env:subject>Latest photo</env:subject>
   <env:date>April 27, 2005</env:date>
   <env:from>jack@writeit.com</env:from>
   <env:to>john@xmli.com</env:to>
   <env:body>
      <ph:photo>
         <ph:subject>Westlicht Museum
             of Camera and Photography, Vienna</ph:subject>
         <ph:date>April 25, 2005</ph:date>
         <ph:description>Lobby of the museum</ph:description>
         <ph:camera>Nikon D70</ph:camera>
         <ph:frame>5643</ph:frame>
      </ph:photo>
   </env:body>
</env:envelope>
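
To see how an application tells the two subject tags in Listing 4 apart, consider the sketch below. It uses the JAXP DOM parser with namespace support switched on and looks elements up by namespace URI and local name. The class name SubjectLookup is an assumption; the two URIs are the ones declared in Listing 4.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper; the names are assumptions made for this sketch.
final class SubjectLookup {
    static final String ENV_NS = "http://psol.com/2005/env";
    static final String PHOTO_NS = "http://psol.com/2005/photo";

    // Returns the text of the first subject element in the given namespace, or null.
    static String subjectIn(String xml, String namespaceUri) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // JAXP parsers ignore namespaces by default
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList found = doc.getElementsByTagNameNS(namespaceUri, "subject");
            return found.getLength() == 0 ? null : found.item(0).getTextContent();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Given the document in Listing 4, a lookup with ENV_NS returns the subject of the message and a lookup with PHOTO_NS returns the subject of the photo, even though both elements share the local name subject.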

Let me clarify two things that are often misunderstood:

  • The URI is an identifier, not an address.
  • The prefix is not an identifier.

URIs and addresses

Although in practice most URIs are addresses (URLs), for XML namespaces, they are only used as identifiers. I wish that namespaces could be identified like Java packages (for example, com.psol.vocabulary instead of the more confusing http://psol.com/vocabulary), but they cannot.

Because they are identifiers, the addresses may be invalid -- meaning they may return a "404 - Resource not found" error if you try to follow them. But they still serve their purpose. And contrary to common belief, namespace URIs do not point to a W3C XML Schema.

Second, because in this context URIs are identifiers, your application must match the URI letter-for-letter. It would be a mistake to adapt the URI of an XML vocabulary to, for instance, point to your server. For example, the URI for XSLT is http://www.w3.org/1999/XSL/Transform. You cannot adapt it into, say, http://www.ibm.com/1999/XSL/Transform if you work at IBM®. In fact, you cannot change the URI of an existing vocabulary at all.

When I teach XSLT, my students frequently complain that the processor does not work, when, in fact, they have not reproduced the XSLT URI exactly.

One result of all this is that you should refrain from changing namespaces. It is generally a bad idea to include a version number in a namespace URI, because changing the URI with each upgrade is guaranteed to break applications (and yes, I realize the W3C did just that with SOAP).

Prefixes

Another common mistake is to confuse a prefix with an identifier. A prefix is not an identifier for the same reason that a tag name cannot be an identifier: The risk of two different applications using the same prefix is high. Therefore, namespace prefixes are transparent, and you should never manipulate them explicitly in your application. However, it is perfectly reasonable for an XML writer to modify prefixes in a document (for example, to avoid conflicts).

So avoid writing code like that in Listing 5, and instead emulate that in Listing 6.


Listing 5. Incorrect testing of prefixes
public void startElement(String uri,String local,String qname,Attributes atts)
{
   // Wrong: the prefix "env" is an arbitrary choice made by the document
   if(qname.equals("env:envelope"))
   {
      // do something
   }
}




Listing 6. Correct testing of namespace URI
public void startElement(String uri,String local,String qname,Attributes atts)
{
   // Right: match the namespace URI and the local name, never the prefix
   if(uri.equals("http://psol.com/2005/env")
      && local.equals("envelope"))
   {
      // do something
   }
}
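
For readers who want to try this, here is a self-contained sketch of a SAX handler that tests the namespace URI and the local name, together with a small driver. The class name EnvelopeCounter and the method countEnvelopes are assumptions; the namespace URI and the envelope tag are taken from Listing 4.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler; the names are assumptions made for this sketch.
final class EnvelopeCounter extends DefaultHandler {
    private static final String ENV_NS = "http://psol.com/2005/env";
    private int count;

    @Override
    public void startElement(String uri, String local, String qname, Attributes atts) {
        // Compare the namespace URI and the local name; ignore qname and its prefix.
        if (ENV_NS.equals(uri) && "envelope".equals(local)) {
            count++;
        }
    }

    // Parses the document and returns how many envelope elements it contains.
    static int countEnvelopes(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // required for uri and local to be filled in
            EnvelopeCounter handler = new EnvelopeCounter();
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
            return handler.count;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Because the test never looks at the prefix, the handler finds the envelope whether the document writes env:envelope, e:envelope, or uses the namespace as its default with no prefix at all. An envelope element in no namespace is correctly ignored.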


More problems and more solutions

If you are mindful of these pitfalls, you can improve your XML coding dramatically. More importantly, you can minimize the risks of incompatibilities and greatly simplify maintenance of XML applications. The remainder of this series reviews other common pitfalls related not to syntax but to applications of XML.

