Skip to main content

Thinking XML: Good advice for creating XML

Principles of XML design from the community at large

Uche Ogbuji (uche@ogbuji.net), Principal Consultant, Fourthought, Inc
Uche Ogbuji photo
Uche Ogbuji is a consultant and co-founder of Fourthought, Inc., a software vendor and consultancy specializing in XML solutions for enterprise knowledge management. Fourthought develops 4Suite, an open source platform for XML, RDF, and knowledge-management applications. Mr. Ogbuji is also a lead developer of the Versa RDF query language. He is a computer engineer and writer born in Nigeria, living and working in Boulder, Colorado, USA. You can find more about Mr. Ogbuji at his Weblog Copia, or contact him at uche@ogbuji.net.

Summary:  The use of XML has become widespread, but much of it is not well formed. When it is well formed, it's often of poor design, which makes processing and maintenance very difficult. And much of the infrastructure for serving XML can compound these problems. In response, there has been some public discussion of XML best practices, such as Henri Sivonen's document, "HOWTO Avoid Being Called a Bozo When Producing XML." Uche Ogbuji frequently discusses XML best practices on IBM developerWorks, and in this column, he gives you his opinion about the main points discussed in such articles.

View more content in this series

Date:  31 Jan 2006
Level:  Introductory
Activity:  2715 views

I have been discussing XML best practices in this column and in other series for years. Others, such as fellow columnist Elliotte Rusty Harold, have covered it as well. The more XML experts that join the discussion of XML design principles, the better, so the community can converge on solid advice for developers at all levels of XML adoption. In this article, using a recent document and a classic one, you learn more details about XML best practices.

Enter the no bozo zone

Henri Sivonen wrote a useful article, "HOWTO Avoid Being Called a Bozo When Producing XML" (see Resources). Adopting the perspective of XML-based Web feed formats, such as RSS and Atom, he goes over his Dos and Don'ts for producing well-formed XML with namespaces. As he says in his introduction:

There seem to be developers who think that well-formedness is awfully hard -- if not impossible -- to get right when producing XML programmatically and developers who can get it right and wonder why the others are so incompetent. I assume no one wants to appear incompetent or to be called names. Therefore, I hope the following list of Dos and Don'ts helps developers to move from the first group to the latter.

The first bit of advice Henri gives is, "Don't think of XML as a text format." I think this is dangerous advice. Certainly his main point is valid -- you cannot be as careless in producing or editing XML as you would a simple text document, but this applies to all text formats with any structure. However, saying that XML is not text is denying one of the most important characteristics of XML, one that is enshrined in the very definition of XML in the specification. ("A textual object is a well-formed XML document [if it conforms to this specification.]") Henri's statement is also confusing because there is a technical definition of text in XML that is essentially the sequence of characters interpreted as XML. Text is not merely what goes within leaf elements or within attributes -- technically called character data. Text is the fundamental fabric of all XML entities, so to say that XML is not text is a contradiction. I think it's more useful to highlight the specific ways in which XML differs from text formats with which developers might already be familiar.

This comment is an example of how Henri's advice is colored by his interest in the problem of generating well-formed Web feeds. He is right to warn people that carelessly slapping strings together and hoping they are well formed is a dangerous course. I too have written articles advising people to use mature XML toolkits rather than simple text tools when generating XML (see Resources). My concern is that the way in which Henri couches this advice is a bit confusing and could be misconstrued in the broader context of XML processing. He reiterates his advice in the sections, "Don't use text-based templates" and "Don't print". I think this should be summarized as: "Do not use mechanisms that you're not sure will result in well-formed XML." That's very important advice indeed. One approach to safe XML generation is sending SAX events, as Henri suggests in, "Use a tree or a stack (or an XML parser)." If you do so, however, do not assume you are home free. The SAX tools you use might not do all the necessary well-formedness checking. For example, some Unicode characters are not allowed in XML. You may need an additional level of checking to account for such issues.

Henri rightly suggests that users not try to manage namespaces by hand. As I've discussed on developerWorks, XML namespaces require a great deal of care. His suggestion that developers only think in terms of universal name [namespace Uniform Resource Identifier (URI) plus local name] is generally sound, but sometimes a developer cannot avoid dealing with prefixes or XML declarations. In specifications, such as XSLT, a QName (prefix/local name combination) can be used within attribute values, and the prefix is supposed to be interpreted according to in-scope namespace declarations. This kind of pattern is called a QName in context. In this case, the developer must have control over the declared prefix or the resulting XML processing will fail. When developers do manage their own namespace declarations, the result is often messy because of the complexities of XML namespaces.

One way to clean up namespace syntax that might become messy while passing through a pipeline of XML processing is to insert a canonicalization step to the end of the pipeline. XML canonicalization eliminates the syntactic variations permitted by XML 1.0 and XML namespaces, including different namespace declaration patterns. Canonicalization will not eliminate all the issues that make namespace declarations treacherous to developers. Canonicalization does not help with QNames in context problems since it does not change the prefixes used in a document, but it does reduce the mess of namespace declarations to the point where you can easily spot problems or even write code to automatically fix them. The GenX library, which is one of the XML generation options Henri suggests, automatically generates canonical XML, and many other toolkits provide canonicalization as an option.

Henri's advice about Unicode and character handling is almost completely sound. However, in "Avoid adding pretty-printing white space in character data," I think the case is a bit overstated. Pretty-printing XML is safe in most cases between elements, rather than within elements with character data. As Henri says, if you have the XML in Listing 1, it is usually not safe to render it as in Listing 2.


Listing 1. XML sample

<foo>bar</foo>


Listing 2. XML sample with white space added to character data

<foo>
  bar
</foo>

But it is usually safe to pretty-print the XML in Listing 3, so that the output is as in Listing 4.


Listing 3. Another XML sample

<doc><foo>bar</foo></doc>


Listing 4. XML sample in Listing 3 with white space added to character data

<doc>
  <foo>bar</foo>
</doc>

Many XML serializer tools understand this distinction between relatively safe and relatively unsafe pretty-printing. It is important to understand that the form of pretty-printing shown in Listings 3 and 4 can cause distortion if white space is added to mixed content. Such problems can be avoided if the serialization is guided by a schema. In practice, though, most vocabularies that use mixed content are not so sensitive to white space normalization, so don't worry too much about pretty-printing. You should be knowledgeable of the issues, and be sure there is an option to turn pretty-printing off (preferably the default should be to not pretty-print). Henri recommends a pretty-printing practice as in Listing 5, but I disagree because I think it makes for ugly markup that's not friendly to manipulation by people.


Listing 5. Pretty-printing convention suggested by Henri Sivonen but not recommended by this author

<foo
    >bar</foo
>

From the monastery

Switching to a very different speed, the second resource I shall explore in this article is Simon St. Laurent's "Monastic XML" (see Resources). This is a collection of brief essays with advice on how to process and even think about XML for maximum effect. Simon uses the metaphors of monasticism and asceticism to suggest that it is dangerous to load XML too heavily with baggage that does not suit its simple, textual roots. In "Marking-up at the foundation," he discusses the fundamental roles of character data and markup (elements and attributes). In "Naming things and reading names," he explains why the generic identifier (also called the element type name) is an important concept and how it should be the sole primary key to the structure of the marked-up information. Realistically, if you're using XML namespaces, the primary key is the universal name (namespace URI plus local name), and this complication is one of the reasons Simon urges caution in "Namespaces as opportunity." "Accepting the discipline of trees" calls out one of XML's dirty secrets: Even though it seems that XML's hierarchical structure could be easily extended to graph structure, in practice, the modeling of graphs in XML has proven a bit difficult. But by far the most important lesson on the "Monastic XML" site is found in "Optimizing markup for processing is always premature." XML is a declarative technology, and therein lies its strengths, as well as its frustrations, for many developers. Developers who try to pull XML design too close to the details of processing generally end up making that processing more difficult in the long term. The key to success with XML is to focus on the nature of the information that needs to be represented in the abstract separately from the technical design of the systems that need to process that information.


Wrap up

There is always bound to be some difference of opinion when considering XML best practices, especially in these early stages, but it is great to have a variety of voices on the topic. There are a few other sources for discussion of the topic, and I'll continue to cover them in this column. If you have sources for advice on best practices or want to share your own opinion, please join the discussion on the Thinking XML discussion forum.


Resources

Learn

Get products and technologies

  • Build your next development project with IBM trial software, available for download directly from developerWorks.

Discuss

About the author

Uche Ogbuji photo

Uche Ogbuji is a consultant and co-founder of Fourthought, Inc., a software vendor and consultancy specializing in XML solutions for enterprise knowledge management. Fourthought develops 4Suite, an open source platform for XML, RDF, and knowledge-management applications. Mr. Ogbuji is also a lead developer of the Versa RDF query language. He is a computer engineer and writer born in Nigeria, living and working in Boulder, Colorado, USA. You can find more about Mr. Ogbuji at his Weblog Copia, or contact him at uche@ogbuji.net.

Comments (Undergoing maintenance)



Trademarks  |  My developerWorks terms and conditions

Help: Update or add to My dW interests

What's this?

This little timesaver lets you update your My developerWorks profile with just one click! The general subject of this content (AIX and UNIX, Information Management, Lotus, Rational, Tivoli, WebSphere, Java, Linux, Open source, SOA and Web services, Web development, or XML) will be added to the interests section of your profile, if it's not there already. You only need to be logged in to My developerWorks.

And what's the point of adding your interests to your profile? That's how you find other users with the same interests as yours, and see what they're reading and contributing to the community. Your interests also help us recommend relevant developerWorks content to you.

View your My developerWorks profile

Return from help

Help: Remove from My dW interests

What's this?

Removing this interest does not alter your profile, but rather removes this piece of content from a list of all content for which you've indicated interest. In a future enhancement to My developerWorks, you'll be able to see a record of that content.

View your My developerWorks profile

Return from help

static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=XML
ArticleID=102855
ArticleTitle=Thinking XML: Good advice for creating XML
publish-date=01312006
author1-email=uche@ogbuji.net
author1-email-cc=dwxed@us.ibm.com

My developerWorks community

Tags

Help
Use the search field to find all types of content in My developerWorks with that tag.

Use the slider bar to see more or fewer tags.

Popular tags shows the top tags for this particular content zone (for example, Java technology, Linux, WebSphere).

My tags shows your tags for this particular content zone (for example, Java technology, Linux, WebSphere).

Use the search field to find all types of content in My developerWorks with that tag. Popular tags shows the top tags for this particular content zone (for example, Java technology, Linux, WebSphere). My tags shows your tags for this particular content zone (for example, Java technology, Linux, WebSphere).

Rate a product. Write a review.

Special offers