This is the concluding article in a series that looks at common pitfalls with the deployment of XML. XML has been around for...seven years already! During that time, I've been involved in quite a few consulting projects and have taught many seminars on XML technology. I've noticed that some misunderstandings and problems occur again and again -- and then some more. This series is my attempt to document those common pitfalls and offer practical solutions. Part 1 looked at syntax, Part 2 at design issues, and Part 3 at validation. Part 4 wraps up the series with a look at binary data.
XML is used for many kinds of information: books, reports, database extracts, blogs, metadata, business documents (invoices, orders, general ledger, and more), e-government documents (Social Security forms, customs documents, tax papers, and more), and the like. By nature, XML is limited to textual information. Multimedia components such as logos, charts, photos, diagrams, podcasts, blueprints, and movies are stored in binary formats and cannot be included as such in XML documents. (However, vector images stored as XML files using Scalable Vector Graphics, or SVG, are an exception.)
XML developers who need to deal with mixed documents containing both text and binary data face some unique challenges. Essentially you have two options:
- Link from XML to the binary data.
- Embed the binary data in the XML document.
Each of these techniques has its own strengths and weaknesses.
When reviewing XML documents that reference binary data, you might stumble upon external unparsed entities. Although entities are part of the XML standard, I can recall only a handful of XML vocabularies that use them. I suggest that you avoid them; better alternatives are available.
To understand entities, you must be aware of the state of file systems 20 years ago when SGML -- XML's ancestor -- was introduced. Huge disparities existed among file systems. Some systems were hierarchical; others used flat or semi-flat spaces. So SGML defined a more abstract view of the file system, called entities. In SGML lingo, an entity is an abstract representation of a resource or a file in a document. A low-level entity manager links the abstract view to the actual document. When moving a document from one computer to another, you would update the entity manager but leave the documents unchanged. Entities can be SGML documents, DTDs, or so-called unparsed entities like binary data.
Entities have survived in XML (just look at the EntityResolver interface in SAX), but their use is limited:
- Entities must be declared in DTDs. This is problematic because fewer and fewer documents use DTDs.
- The raison d'être of entities was to offer a consistent view of inconsistent file systems. Today, URIs offer a better and more standard alternative.
When a document includes data from different sources (XML and others), you should design the parsing code accordingly. Except in trivial vocabularies, you want to break the XML parsing into two parts: one part consisting of specialized components that deal with the binary data; the other part an organizer that manages the relationships among the various files, and calls into the appropriate components from the first set (see Figure 1). Several patterns -- the plug-in pattern from Fowler, and the builder and prototype patterns from Gamma -- can help (see Resources).
Figure 1. The organizer manages file relationships and calls into more specialized components
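To make the split concrete, here is a minimal Java sketch of an organizer that dispatches to pluggable components. The names `Organizer`, `BinaryHandler`, `register`, and `dispatch` are hypothetical, chosen for illustration only, and keying the components by media type is an assumption rather than a requirement of the patterns cited above.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical organizer: knows which component handles which kind of binary data. */
final class Organizer {
    /** Hypothetical contract for one specialized component. */
    interface BinaryHandler {
        String handle(byte[] data);
    }

    private final Map<String, BinaryHandler> handlers = new HashMap<>();

    /** Plugs in a component for a media type such as "image/jpeg". */
    void register(String mediaType, BinaryHandler handler) {
        handlers.put(mediaType, handler);
    }

    /** Called by the XML parsing code when it meets linked or embedded binary data. */
    String dispatch(String mediaType, byte[] data) {
        BinaryHandler handler = handlers.get(mediaType);
        if (handler == null) {
            throw new IllegalArgumentException("No component registered for " + mediaType);
        }
        return handler.handle(data);
    }

    public static void main(String[] args) {
        Organizer organizer = new Organizer();
        // A stand-in component; a real one would decode or render the image.
        organizer.register("image/jpeg", data -> "JPEG image, " + data.length + " bytes");
        System.out.println(organizer.dispatch("image/jpeg", new byte[] {1, 2, 3}));
    }
}
```

Supporting a new binary format then means registering one more component; the organizer and the XML parsing code stay untouched.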
Linking and embedding are equally viable strategies, but they have different qualities and are not suitable for the same types of projects.
If you know HTML, then you're familiar with the use of linking for anchors, images, applets, plug-ins, and more. With linking, each document is stored as a separate resource, and one document contains a reference to the others. Independence is linking's main benefit:
- The resources can evolve at their own pace. For example, you can update to a higher-resolution logo without changing the main document.
- Each resource can be edited with the most appropriate software because it's stored in its own file.
- The mechanism is highly extensible; adding new data types does not change the original vocabulary.
- Reusing content is easy and highly scalable because a resource can be linked to many times, but needs to be downloaded only once.
Linking has proven instrumental to the Web's development. It's hard to imagine a successful Web without links! But linking's strength is also a weakness. Links can become obsolete and sometimes even break, resulting in a less than satisfactory user experience (the dreaded 404 error). And links are more complex for end users. Copying a document involves copying not only the main file but also the files it links to, which can result in manipulation errors.
From a markup standpoint, links are trivial. You only need to reserve an attribute (or an element, although attributes are popular for links) to store the URI of the linked document:
<ulink xlink:href="http://www.marchal.com/en/photos/humour/phbd0001.jpg">photo</ulink>
Although you can call the link attributes anything you like, the W3C has developed XLink as a standard set of attributes (see Resources).
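On the reading side, an application only has to collect those attributes. The following Java sketch does so with the standard DOM API; the class name `LinkCollector`, the enclosing `para` element, and the explicit `xmlns:xlink` declaration are assumptions added to make the fragment self-contained.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical helper that lists every xlink:href found in a document. */
final class LinkCollector {
    static final String XLINK_NS = "http://www.w3.org/1999/xlink";

    static List<String> collectLinks(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required to look attributes up by namespace
            Document doc = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            List<String> links = new ArrayList<>();
            NodeList elements = doc.getElementsByTagName("*"); // every element, in document order
            for (int i = 0; i < elements.getLength(); i++) {
                Element element = (Element) elements.item(i);
                if (element.hasAttributeNS(XLINK_NS, "href")) {
                    links.add(element.getAttributeNS(XLINK_NS, "href"));
                }
            }
            return links;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse the document", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<para xmlns:xlink=\"http://www.w3.org/1999/xlink\">See the "
                + "<ulink xlink:href=\"http://www.marchal.com/en/photos/humour/phbd0001.jpg\">photo</ulink>"
                + "</para>";
        System.out.println(collectLinks(xml));
    }
}
```

The application can then fetch each URI and hand the bytes to whichever component understands the format.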
Embedding copies the binary documents in the XML document. You must take care to respect the XML syntax. More specifically:
- XML reserves a handful of characters (most importantly < and &) that can appear in binary content.
- XML builds on Unicode encodings, which means that a given character may be represented differently depending on the current encoding.
Suppose the binary data is three bytes long: 0xea, 0x51, and 0xa9. If you map these bytes to characters in the Unicode table, their representation will depend on the underlying encoding:
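Latin-1: 0xea 0x51 0xa9 (êQ©)
ASCII: not representable directly; ê and © require character references (&#xEA;Q&#xA9;)
UTF-8: 0xc3 0xaa 0x51 0xc2 0xa9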
If, on the other hand, you store the binary data as is, then you might end up with illegal sequences in UTF-8 and other encodings. You don't know which encoding the document uses, so this is not a viable solution.
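A short Java sketch illustrates both halves of the problem with the same three bytes. The class and method names (`RawBytesDemo`, `strictDecode`, `mapToCodePoints`) are hypothetical; the charset classes are standard Java.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

/** Hypothetical demo: what happens to raw bytes under different encodings. */
final class RawBytesDemo {
    /** Decodes strictly; returns null when the bytes are illegal in the given charset. */
    static String strictDecode(byte[] data, Charset charset) {
        try {
            return charset.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data))
                    .toString();
        } catch (CharacterCodingException e) {
            return null;
        }
    }

    /** Maps each byte to the Unicode character with the same number (what Latin-1 does). */
    static String mapToCodePoints(byte[] data) {
        return new String(data, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] data = {(byte) 0xea, 0x51, (byte) 0xa9};

        // Storing the bytes as is: legal in Latin-1, illegal in ASCII and UTF-8.
        Charset[] charsets = {StandardCharsets.ISO_8859_1, StandardCharsets.US_ASCII, StandardCharsets.UTF_8};
        for (Charset charset : charsets) {
            String text = strictDecode(data, charset);
            System.out.println(charset + ": " + (text == null ? "illegal byte sequence" : "legal"));
        }

        // Mapping to characters first: the document stays legal, but three bytes become five in UTF-8.
        System.out.println(mapToCodePoints(data).getBytes(StandardCharsets.UTF_8).length);
    }
}
```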
The only solution is to encode data to a safe set of characters. The most commonly-used algorithm is base64. Base64 encodes three bytes as four characters (6lGp for the preceding example). XML Schema has standardized base64 as an XML datatype, and you can find base64 implementations for most languages (see Resources).
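As an illustration, the sketch below encodes the same three bytes with `java.util.Base64`. That class ships with Java 8 and later, so it is an assumption that your runtime provides it; on older runtimes a separate library fills the same role. The class name `Base64Demo` and the `data` element are hypothetical.

```java
import java.util.Arrays;
import java.util.Base64;

/** Hypothetical demo: round-tripping binary data through base64. */
final class Base64Demo {
    static String encode(byte[] data) {
        return Base64.getEncoder().encodeToString(data);
    }

    static byte[] decode(String text) {
        return Base64.getDecoder().decode(text);
    }

    public static void main(String[] args) {
        byte[] data = {(byte) 0xea, 0x51, (byte) 0xa9};
        String text = encode(data);
        // The encoded form uses only ASCII letters, digits, +, / and =,
        // so it is safe in any XML encoding.
        System.out.println("<data>" + text + "</data>");       // prints <data>6lGp</data>
        System.out.println(Arrays.equals(data, decode(text))); // prints true
    }
}
```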
The pros and cons of embedding binary data mirror those of linking:
- Embedding minimizes errors because the end user has only one file.
- Embedding is less efficient because it can duplicate information and requires base64 encoding of the data.
- The binary data is not directly accessible, making it harder to use best-of-breed editors.
- The application must somehow decode all the content, so it is more difficult to extend the application.
Microsoft Word 2003 is one popular application that stores images and multimedia information as embedded, base64-encoded data in XML documents.
The special case of XML documents
Probably the most frequent pitfall is embedding XML data in another document. Specifically, it is a mistake to treat the embedded document as textual content, which it is not.
The problem is not the escaping -- it's easy to replace < with &lt; or to use a CDATA section -- but the encoding. Unless the embedding and embedded documents use the same encoding, which is almost impossible to guarantee, it won't work. See Listing 1 for an example of this error:
Listing 1. Incorrect embedding
<?xml version="1.0" encoding="UTF-8"?>
<document>
<title>Encoding problem</title>
<data><![CDATA[<?xml version="1.0" encoding="ISO-8859-1"?>
<text>
<p xml:lang="fr">Les caractères accentués
(Latin-1) seront mal compris!</p>
<p xml:lang="en">Accentuated characters (Latin-1)
will be misinterpreted!</p>
</text>]]></data>
</document>
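The double encoding makes the original document unreadable: the embedded declaration announces ISO-8859-1, yet its characters are stored in the UTF-8 encoding of the enclosing document, so software that extracts the embedded document and trusts its declaration misreads the accented characters. The only safe alternatives are either to base64-encode the embedded document or to use namespaces and combine both vocabularies into one document, as shown in Listing 2: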
Listing 2. The safe alternative
<?xml version="1.0" encoding="UTF-8"?>
<d:document xmlns:d="http://psol.com/2005/doc"
xmlns:t="http://psol.com/2005/text">
<d:title>Encoding problem</d:title>
<d:data>
<t:text>
<t:p xml:lang="fr">Les caractères accentués
(Latin-1) seront mal compris !</t:p>
<t:p xml:lang="en">Accentuated characters (Latin-1)
will be misinterpreted!</t:p>
</t:text>
</d:data>
</d:document>
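The other safe alternative, base64-encoding the embedded document, can be sketched in Java as follows. The class `EmbedXmlAsBinary` and its `wrap` and `unwrap` methods are hypothetical, the element names are borrowed from Listing 1, and `java.util.Base64` assumes Java 8 or later. The point is that the embedded document travels as bytes, so its own encoding declaration remains true.

```java
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

/** Hypothetical demo: embedding an XML document as base64-encoded binary data. */
final class EmbedXmlAsBinary {
    /** Wraps the embedded document's bytes, whatever their encoding, in a UTF-8 document. */
    static String wrap(byte[] embeddedDocument) {
        String payload = Base64.getEncoder().encodeToString(embeddedDocument);
        return "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
                + "<document>\n"
                + "<title>Encoding problem</title>\n"
                + "<data>" + payload + "</data>\n"
                + "</document>\n";
    }

    /** Recovers the exact bytes of the embedded document. */
    static byte[] unwrap(String outerDocument) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(outerDocument)));
            String payload = doc.getElementsByTagName("data").item(0).getTextContent().trim();
            return Base64.getDecoder().decode(payload);
        } catch (Exception e) {
            throw new IllegalArgumentException("Not a well-formed wrapper document", e);
        }
    }

    public static void main(String[] args) {
        // The embedded document really is stored in Latin-1, as its declaration says.
        byte[] embedded = ("<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>\n"
                + "<text><p xml:lang=\"fr\">Les caract\u00e8res accentu\u00e9s</p></text>")
                .getBytes(StandardCharsets.ISO_8859_1);
        String outer = wrap(embedded);
        System.out.println(Arrays.equals(embedded, unwrap(outer))); // bytes survive unchanged
    }
}
```

The recovered bytes can be handed to any XML parser, which reads the declaration and decodes the accented characters correctly.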
Packaging is a middle ground between linking and embedding. The concept uses linking for increased flexibility, as you saw earlier, but adds an extra layer to combine the many documents into one file, as shown in Figure 2.
Figure 2. Packaging several files
Two formats are popular: Multipurpose Internet Mail Extensions, or MIME (originally developed for multimedia e-mail), and Zip. I prefer Zip because it offers a hierarchical data structure.
Open Document, the standard vocabulary behind OpenOffice.org, is an example of Zip packaging. The textual content is stored in an XML file that's zipped with all images included in the document. SOAP Message Transmission Optimization Mechanism (MTOM) is an example of how MIME is used to package multiple types of content (see Resources).
MIME is available to Java developers through the JavaMail API, and Zip is built into Java SE in the java.util.zip package.
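Here is a minimal sketch of Zip packaging with `java.util.zip`, working in memory for brevity. The class `ZipPackage`, its methods, and the entry names `content.xml` and `images/logo.png` are hypothetical; they are not taken from any particular packaging standard.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

/** Hypothetical demo: one Zip archive holding an XML file and the binary files it links to. */
final class ZipPackage {
    /** Packs every file (entry name -> content) into a single archive. */
    static byte[] pack(Map<String, byte[]> files) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(buffer)) {
            for (Map.Entry<String, byte[]> file : files.entrySet()) {
                zip.putNextEntry(new ZipEntry(file.getKey()));
                zip.write(file.getValue());
                zip.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    /** Reads the archive back into entry name -> content, in archive order. */
    static Map<String, byte[]> unpack(byte[] archive) {
        Map<String, byte[]> files = new LinkedHashMap<>();
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(archive))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                ByteArrayOutputStream content = new ByteArrayOutputStream();
                byte[] chunk = new byte[4096];
                int read;
                while ((read = zip.read(chunk)) != -1) {
                    content.write(chunk, 0, read);
                }
                files.put(entry.getName(), content.toByteArray());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return files;
    }

    public static void main(String[] args) {
        Map<String, byte[]> files = new LinkedHashMap<>();
        // The XML file links to the image with a relative URI inside the package.
        files.put("content.xml",
                "<document><image href=\"images/logo.png\"/></document>".getBytes(StandardCharsets.UTF_8));
        files.put("images/logo.png", new byte[] {(byte) 0xea, 0x51, (byte) 0xa9}); // stand-in bytes
        System.out.println(unpack(pack(files)).keySet());
    }
}
```

The binary file is stored as is, with no base64 overhead, and the slash in the entry name gives the hierarchical structure mentioned above.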
XML is no island, and many vocabularies need to include binary content. First assess your needs. If binary content is seldom used, your users will find it more convenient to embed binary data in the XML document. If binary content is a significant part of the overall content (say more than one third), you will probably find that linking or packaging is a more flexible approach. Regardless of your decision, remember to pay special attention to embedded XML content. It should be treated as binary data or, better yet, merged into the original document through XML namespaces.
Learn
- "Working XML: Safe coding practices" (developerWorks, July, August, and September 2005): Read the previous articles in this series.
- "How to use XLink with XML" (developerWorks, July 2001): Learn how to include standard links in XML documents.
- "The open office file format" (developerWorks, January 2003): Read about the XML vocabulary that served as the starting point for the Open Document standard.
- Project Xanadu: The Xanadu linking model aims at deep interconnection, intercomparison, and reuse.
- Open Document: Find out about the OASIS Open Document standard, which defines an XML schema and semantics for office applications.
- MTOM: Discover this relatively new W3C recommendation for implementing packaging of binary data in SOAP requests. It offers a fresh approach that marries linking and embedding.
- The plugin pattern: This design pattern allows for extending software at runtime.
- "Pattern Summaries: Prototype": Check out the prototype pattern, which makes for extensible systems.
- Builder pattern: Structure your code to handle different binary formats with the builder pattern.
- RFC 1421 and RFC 2045: These documents specify the base64 algorithm.
- developerWorks XML zone: Find more XML resources here, including articles, tutorials, tips, and standards.
- IBM Certified Solution Developer -- XML and related technologies: Learn how to get certified.
Get products and technologies
- Base64: Download this open source Java library for implementing base64 encoding.
Benoît Marchal is a Belgian consultant. He is the author of XML by Example, Second Edition and other XML books. You can contact him at bmarchal@pineapplesoft.com or through his personal site at www.marchal.com.





