Working XML: Safe coding practices, Part 4

The special case of binary data

Benoit Marchal (bmarchal@pineapplesoft.com), Consultant, Pineapplesoft
Benoît Marchal is a Belgian consultant. He is the author of XML by Example, Second Edition and other XML books. You can contact him at bmarchal@pineapplesoft.com or through his personal site at www.marchal.com.

Summary:  Find out how to make the best choices when you work with documents that mix XML and binary data. The concluding article in Benoît's four-part series on safe XML coding practices helps you understand the pros, cons, and pitfalls of the available solutions for mixing textual and binary content.

Date:  02 Nov 2005
Level:  Intermediate

This is the concluding article in a series that looks at common pitfalls with the deployment of XML. XML has been around for...seven years already! During that time, I've been involved in quite a few consulting projects and have taught many seminars on XML technology. I've noticed that some misunderstandings and problems occur again and again -- and then some more. This series is my attempt to document those common pitfalls and offer practical solutions. Part 1 looked at syntax, Part 2 at design issues, and Part 3 at validation. Part 4 wraps up the series with a look at binary data.

Text versus binary

XML is used for many kinds of information: books, reports, database extracts, blogs, metadata, business documents (invoices, orders, general ledger, and more), e-government documents (Social Security forms, customs documents, tax papers, and more), and the like. By nature, XML is limited to textual information. Multimedia components such as logos, charts, photos, diagrams, podcasts, blueprints, and movies are stored in binary formats and cannot be included as such in XML documents. (However, vector images stored as XML files using Scalable Vector Graphics, or SVG, are an exception.)

XML developers who need to deal with mixed documents containing both text and binary data face some unique challenges. Essentially you have two options:

  • Link from XML to the binary data.
  • Embed the binary data in the XML document.

Each of these techniques has its own strengths and weaknesses.

Some history

When researching how to reference binary data from XML, you might stumble upon external unparsed entities, a mechanism for pointing to non-XML resources such as images. Although entities are included in the XML standard, I can recall only a handful of XML vocabularies that use them. I suggest that you avoid them; better alternatives are available.

To understand entities, you must be aware of the state of file systems 20 years ago when SGML -- XML's ancestor -- was introduced. Huge disparities existed among file systems. Some systems were hierarchical; others used flat or semi-flat spaces. So SGML defined a more abstract view of the file system, called entities. In SGML lingo, an entity is an abstract representation of a resource or a file in a document. A low-level entity manager links the abstract view to the actual document. When moving a document from one computer to another, you would update the entity manager but leave the documents unchanged. Entities can be SGML documents, DTDs, or so-called unparsed entities like binary data.

Entities have survived in XML (just look at the EntityResolver interface in SAX), but their use is limited:

  • Entities must be declared in DTDs. This is problematic because fewer and fewer documents use DTDs.
  • The raison d'être of entities was to offer a consistent view of inconsistent file systems. Today, URIs offer a better and more standard alternative.

Design considerations

When a document includes data from different sources (XML and others), you should design the parsing code accordingly. Except in trivial vocabularies, you want to break the XML parsing into two parts: one part consisting of specialized components that deal with the binary data; the other part an organizer that manages the relationships among the various files, and calls into the appropriate components from the first set (see Figure 1). Several patterns -- the plug-in pattern from Fowler, and the builder and prototype patterns from Gamma -- can help (see Resources).


Figure 1. The organizer manages file relationships and calls into more specialized components
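
The split between an organizer and specialized components could be sketched in Java as follows. The names (BinaryHandler, Organizer, and the media-type keys) are hypothetical and chosen only for illustration; they do not come from any particular library or from the patterns' reference implementations.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical plug-in contract: one implementation per binary format.
interface BinaryHandler {
    String describe(byte[] data);
}

// Hypothetical organizer: it knows which component handles which media type,
// but nothing about the binary formats themselves.
final class Organizer {
    private final Map<String, BinaryHandler> handlers = new HashMap<>();

    void register(String mediaType, BinaryHandler handler) {
        handlers.put(mediaType, handler);
    }

    String process(String mediaType, byte[] data) {
        BinaryHandler handler = handlers.get(mediaType);
        if (handler == null) {
            return "unsupported media type: " + mediaType;
        }
        return handler.describe(data);
    }
}

final class OrganizerDemo {
    public static void main(String[] args) {
        Organizer organizer = new Organizer();
        // Supporting a new format means registering one more handler.
        organizer.register("image/jpeg", data -> "JPEG image, " + data.length + " bytes");
        System.out.println(organizer.process("image/jpeg", new byte[] {1, 2, 3}));
        System.out.println(organizer.process("video/mpeg", new byte[0]));
    }
}
```

With this shape, the XML parsing code only has to hand each binary resource to the organizer; it never needs to change when a format is added.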

Linking

Linking and embedding are equally viable strategies, but they have different qualities and are not suitable for the same types of projects.

More reliable links with Xanadu

Technically, links don't ever need to break. With a versioning system, you can make links highly reliable. The Xanadu project, which pioneered the use of hypertext, demonstrates more sophisticated and reliable linking than the Web (see Resources). On the other hand, the W3C argues that the simplicity of Web linking has made the Web more popular, and that few content providers can afford more robust links.

If you know HTML, then you're familiar with the use of linking for anchors, images, applets, plug-ins, and more. With linking, each document is stored as a separate resource, and one document contains a reference to the others. Independence is linking's main benefit:

  • The resources can evolve at their own pace. For example, you can update to a higher-resolution logo without changing the main document.
  • Each resource can be edited with the most appropriate software because it's stored in its own file.
  • The mechanism is highly extensible; adding new data types does not change the original vocabulary.
  • Reusing content is easy and highly scalable because a resource can be linked to many times but needs to be downloaded only once.

Linking has proven instrumental to the Web's development. It's hard to imagine a successful Web without links! But linking's strength is also a weakness. Links can become obsolete and sometimes even break, resulting in a less than satisfactory user experience (the dreaded 404 error). And links are more complex for end users. Copying a document involves copying not only the main file but also the files it links to, which can result in manipulation errors.

Technical considerations

From a markup standpoint, links are trivial. You only need to reserve an attribute (or an element, although attributes are popular for links) to store the URI of the linked document:

<ulink xmlns:xlink="http://www.w3.org/1999/xlink"
  xlink:type="simple"
  xlink:href="http://www.marchal.com/en/photos/humour/phbd0001.jpg">photo</ulink>

Although you can call the link attributes anything you like, the W3C has developed XLink as a standard set of attributes (see Resources).
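
On the reading side, an application has to find the link attributes before it can fetch the linked resources. Here is a minimal sketch using the standard JAXP DOM API. The class name XLinkCollector and the sample para element are assumptions made for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: collects the target of every XLink in a document.
final class XLinkCollector {
    static final String XLINK_NS = "http://www.w3.org/1999/xlink";

    // Returns the xlink:href value of every element that carries one,
    // in document order.
    static List<String> collectHrefs(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // needed to match attributes by namespace
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            List<String> hrefs = new ArrayList<>();
            NodeList all = doc.getElementsByTagName("*");
            for (int i = 0; i < all.getLength(); i++) {
                Element element = (Element) all.item(i);
                if (element.hasAttributeNS(XLINK_NS, "href")) {
                    hrefs.add(element.getAttributeNS(XLINK_NS, "href"));
                }
            }
            return hrefs;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse document", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<para xmlns:xlink=\"http://www.w3.org/1999/xlink\">See the "
                + "<ulink xlink:href=\"http://www.marchal.com/en/photos/humour/phbd0001.jpg\">"
                + "photo</ulink>.</para>";
        System.out.println(collectHrefs(xml));
    }
}
```

Matching on the namespace URI rather than on the xlink prefix keeps the code working when a document binds the namespace to a different prefix.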


Embedding

Embedding copies the binary documents in the XML document. You must take care to respect the XML syntax. More specifically:

  • XML reserves a handful of characters (most importantly < and &) that can appear in binary content.
  • XML builds on Unicode encodings, which means that a given character may be represented differently depending on the current encoding.

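Suppose the binary data is three bytes long: 0xea, 0x51, and 0xa9. If you map these bytes to the characters with the same code points in the Unicode table (ê, Q, and ©), the way those characters are written in the document depends on the document's encoding:

Latin-1: êQ© (three bytes: 0xea 0x51 0xa9)
ASCII: &#xea;Q&#xa9; (character references, because ê and © are outside ASCII)
UTF-8: êQ© (five bytes: 0xc3 0xaa 0x51 0xc2 0xa9)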

If, on the other hand, you store the binary data as is, then you might end up with illegal sequences in UTF-8 and other encodings. You don't know which encoding the document uses, so this is not a viable solution.
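
To see why raw bytes are unsafe, you can ask a strict decoder whether the three sample bytes form a legal sequence in each encoding. The following is a minimal sketch using the standard java.nio.charset API. The class name EncodingCheck and the method isLegal are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class EncodingCheck {
    // True if the bytes form a legal sequence in the given charset.
    static boolean isLegal(byte[] data, Charset charset) {
        try {
            charset.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] raw = {(byte) 0xea, 0x51, (byte) 0xa9};
        System.out.println("Latin-1: " + isLegal(raw, StandardCharsets.ISO_8859_1));
        System.out.println("ASCII:   " + isLegal(raw, StandardCharsets.US_ASCII));
        System.out.println("UTF-8:   " + isLegal(raw, StandardCharsets.UTF_8));
        // The same three characters take five bytes once serialized as UTF-8.
        System.out.println("\u00eaQ\u00a9".getBytes(StandardCharsets.UTF_8).length);
    }
}
```

Only the Latin-1 decoder accepts the raw bytes. In UTF-8, 0xea announces a three-byte sequence that never arrives, and in ASCII any byte above 0x7f is out of range.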

The only solution is to encode the data to a safe set of characters. The most commonly used algorithm is base64, which encodes three bytes as four characters (6lGp for the preceding example). XML Schema has standardized base64 as an XML datatype (base64Binary), and you can find base64 implementations for most languages (see Resources).
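
As an illustration, here is a minimal sketch that encodes the three sample bytes. It assumes a Java 8 or later runtime, where java.util.Base64 is part of the standard library; on older runtimes a third-party library would play the same role. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.Base64;

final class Base64Example {
    // Encodes binary data as text that is safe in any XML encoding.
    static String toXmlText(byte[] raw) {
        return Base64.getEncoder().encodeToString(raw);
    }

    // Recovers the original bytes from the element content.
    static byte[] fromXmlText(String text) {
        return Base64.getDecoder().decode(text.trim());
    }

    public static void main(String[] args) {
        byte[] raw = {(byte) 0xea, 0x51, (byte) 0xa9};
        String encoded = toXmlText(raw);
        System.out.println(encoded); // prints 6lGp
        System.out.println(Arrays.equals(raw, fromXmlText(encoded))); // prints true
    }
}
```

The encoded text uses only letters, digits, +, /, and =, so it needs no escaping and survives any change in the document's encoding.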

The pros and cons of embedding binary data mirror those of linking:

  • Embedding minimizes errors because the end user has only one file.
  • Embedding is less efficient because it can duplicate information and requires base64 encoding of the data.
  • The binary data is not directly accessible, making it harder to use best-of-breed editors.
  • The application must somehow decode all the content, so it is more difficult to extend the application.

Microsoft Word 2003 is one popular application that stores images and multimedia information as embedded, base64-encoded data in XML documents.

The special case of XML documents

Probably the most frequent pitfall is embedding XML data in another document. Specifically, it is a mistake to treat the embedded document as textual content, which it is not.

The problem is not the escaping -- it's easy to replace < with &lt; or to use a CDATA section -- but the encoding. Unless the embedding and embedded documents use the same encoding, which is almost impossible to guarantee, it won't work. See Listing 1 for an example of this error:


Listing 1. Incorrect embedding
<?xml version="1.0" encoding="UTF-8"?>
<document>
   <title>Encoding problem</title>
   <data><![CDATA[<?xml version="1.0" encoding="ISO-8859-1"?>
      <text>
         <p xml:lang="fr">Les caractères accentués
            (Latin-1) seront mal compris!</p>
         <p xml:lang="en">Accentuated characters (Latin-1)
            will be misinterpreted!</p>
      </text>]]></data>
</document>

The double encoding makes the original document unreadable. The only safe alternatives are either to base64-encode the embedded document or to use namespaces and combine both vocabularies into one document, as shown in Listing 2:


Listing 2. The safe alternative
<?xml version="1.0" encoding="UTF-8"?>
<d:document xmlns:d="http://psol.com/2005/doc"
            xmlns:t="http://psol.com/2005/text">
   <d:title>Encoding problem</d:title>
   <d:data>
      <t:text>
         <t:p xml:lang="fr">Les caractères accentués
            (Latin-1) seront mal compris !</t:p>
         <t:p xml:lang="en">Accentuated characters (Latin-1)
            will be misinterpreted!</t:p>
      </t:text>
   </d:data>
</d:document>
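
The other safe alternative, base64-encoding the embedded document, could look like the following sketch. It assumes Java 8 or later for java.util.Base64 and uses the standard JAXP DOM parser. The class name EmbeddedXml, its methods, and the simplified outer vocabulary are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

final class EmbeddedXml {
    // Wraps the embedded document's raw bytes, whatever their encoding, in base64.
    static String wrap(byte[] innerDocument) {
        return "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
                + "<document><title>Encoding problem</title><data>"
                + Base64.getEncoder().encodeToString(innerDocument)
                + "</data></document>";
    }

    // Parses the outer document, then decodes and parses the embedded one.
    static Document unwrap(String outerDocument) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            Document outer = factory.newDocumentBuilder().parse(
                    new ByteArrayInputStream(outerDocument.getBytes(StandardCharsets.UTF_8)));
            String base64 = outer.getElementsByTagName("data").item(0).getTextContent();
            byte[] inner = Base64.getDecoder().decode(base64.trim());
            // Parsing from bytes lets the parser honor the inner encoding declaration.
            return factory.newDocumentBuilder().parse(new ByteArrayInputStream(inner));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not read embedded document", e);
        }
    }

    public static void main(String[] args) {
        byte[] inner = ("<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>"
                + "<text><p>Les caract\u00e8res accentu\u00e9s</p></text>")
                .getBytes(StandardCharsets.ISO_8859_1);
        Document recovered = unwrap(wrap(inner));
        System.out.println(recovered.getDocumentElement().getTextContent());
    }
}
```

Because the inner document travels as opaque bytes, its own encoding declaration stays valid no matter how the outer document is encoded or transcoded.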


Packaging

Packaging is a middle ground between linking and embedding. The concept uses linking for increased flexibility, as you saw earlier, but adds an extra layer to combine the many documents into one file, as shown in Figure 2.


Figure 2. Packaging several files

Two formats are popular: Multipurpose Internet Mail Extensions, or MIME (originally developed for multimedia e-mail), and Zip. I prefer Zip because it offers a hierarchical data structure.

Open Document, the standard vocabulary behind OpenOffice.org, is an example of Zip packaging. The textual content is stored in an XML file that's zipped with all images included in the document. SOAP Message Transmission Optimization Mechanism (MTOM) is an example of how MIME is used to package multiple types of content (see Resources).

MIME is available to Java developers through the JavaMail API, and Zip is built into Java SE in the java.util.zip package.
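
As a rough illustration of Zip packaging with java.util.zip, the following sketch writes an XML file and a binary resource into one in-memory archive and reads them back. The class name ZipPackage and the entry names content.xml and images/logo.png are assumptions made for the example, not the layout of any particular standard.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

final class ZipPackage {
    // Writes each named resource as one entry of an in-memory Zip archive.
    static byte[] pack(Map<String, byte[]> resources) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(buffer)) {
            for (Map.Entry<String, byte[]> resource : resources.entrySet()) {
                zip.putNextEntry(new ZipEntry(resource.getKey()));
                zip.write(resource.getValue());
                zip.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray(); // complete only after the stream is closed
    }

    // Reads the archive back into a name-to-bytes map, in archive order.
    static Map<String, byte[]> unpack(byte[] archive) {
        Map<String, byte[]> resources = new LinkedHashMap<>();
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(archive))) {
            for (ZipEntry entry = zip.getNextEntry(); entry != null; entry = zip.getNextEntry()) {
                ByteArrayOutputStream content = new ByteArrayOutputStream();
                byte[] chunk = new byte[4096];
                for (int n = zip.read(chunk); n != -1; n = zip.read(chunk)) {
                    content.write(chunk, 0, n);
                }
                resources.put(entry.getName(), content.toByteArray());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return resources;
    }

    public static void main(String[] args) {
        Map<String, byte[]> resources = new LinkedHashMap<>();
        // The XML links to the image with a path relative to the package root.
        resources.put("content.xml",
                "<doc><image href=\"images/logo.png\"/></doc>".getBytes(StandardCharsets.UTF_8));
        resources.put("images/logo.png", new byte[] {(byte) 0xea, 0x51, (byte) 0xa9});
        System.out.println(unpack(pack(resources)).keySet());
    }
}
```

The binary entry is stored as is, with no base64 overhead, while the slash-separated entry names provide the hierarchical structure.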


What to do?

XML is no island, and many vocabularies need to include binary content. First assess your needs. If binary content is seldom used, your users will find it more convenient to embed binary data in the XML document. If binary content is a significant part of the overall content (say more than one third), you will probably find that linking or packaging is a more flexible approach. Regardless of your decision, remember to pay special attention to embedded XML content. Treat it as binary data (that is, base64-encode it) or, better yet, merge it into the original document through XML namespaces.


Resources

Learn

Get products and technologies

  • Base64: Download this open source Java library for implementing base64 encoding.

Discuss
