XML Matters: The XOM Java XML API

A rigorously correct tree-oriented XML model

In this installment, David looks at Elliotte Rusty Harold's XOM. Broadly speaking, this is yet another object-oriented XML API, somewhat in the style of DOM. However, a number of features set XOM apart, and Harold argues that they are important design elements. Chief among these is a rigorous insistence on maintaining invariants in in-memory objects so that an XOM instance can always be serialized to correct XML. In addition, XOM aims at greater simplicity and regularity than other Java XML APIs.


David Mertz (mertz@gnosis.cx), Formalizer, Gnosis Software, Inc.

David Mertz once led the desperate life of scholarship. David may be reached at mertz@gnosis.cx; his life pored over at http://gnosis.cx/dW/. Suggestions and recommendations on this, past, or future columns are welcomed. Check out David's new book.



17 December 2003


Developers generally have a number of different attitudes toward the XML APIs they develop. Stream-oriented APIs like SAX, libxml, SSAX, and expat focus on parsimony -- make it fast, make it work in little memory, and make few assumptions about the overall document structure. Moreover, stream-oriented APIs can be handled in a very procedural style, which is nice for C programming, where objects are not ubiquitous.

In contrast, tree- or object-oriented APIs generally create an in-memory image of an XML document, translated into some sort of object-oriented (or at least hierarchical, such as SXML) rendition. Walking, filtering, and transforming these XML proxy objects then uses the native syntax of the programming language. XSLT can also be considered an API of sorts, and its functional/declarative model is different from either, but I'll skip discussion of XSLT for this installment.

Among the tree-oriented APIs, several divergent design goals come to the fore. Libraries like gnosis.xml.objectify, ElementTree, REXML, SXML, XML::Grove, and JDOM pretty much aim at shaping XML into the most native-seeming objects possible for their respective programming languages. The goal in each of these is to avoid thinking about the fact that your data started out as XML; it is just another object to you.

At the other end of the scale, DOM almost completely eschews any concern with the particular programming language that DOM methods might be invoked in. While the designers of DOM tended to come from a Java technology background, DOM does not really feel any more unnatural in other languages than it does in Java. On the other hand, DOM suffers terribly from the weight of design-by-committee: Its methods are too numerous, inconsistently named, and have poor orthogonality. Making things still worse, a DOM object is not entirely guaranteed to be serializable into XML; in some cases you can create non-well-formed DOM objects in memory (which requires extra checks and behaviors for serialization).
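To make the serialization complaint concrete, here is a minimal sketch that uses only the DOM implementation bundled with the JDK (not XOM). It assumes the usual JDK parser, whose createComment() performs no content checks; the class and method names (DomLaxity, buildSuspectTree) are made up for illustration. XML 1.0 forbids "--" inside a comment, yet the tree is built without complaint:

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Comment;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class DomLaxity {
    // Builds a tiny DOM tree holding a comment that XML 1.0 does not allow.
    static Document buildSuspectTree() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element root = doc.createElement("root");
            doc.appendChild(root);
            // "--" may not appear inside an XML comment, but DOM does not
            // reject it when the node is created (assumption: stock JDK DOM)
            Comment bad = doc.createComment("a -- b");
            root.appendChild(bad);
            return doc;
        } catch (ParserConfigurationException ex) {
            throw new IllegalStateException(ex);
        }
    }

    public static void main(String[] args) {
        Document doc = buildSuspectTree();
        String data = doc.getDocumentElement().getFirstChild().getNodeValue();
        System.out.println("DOM accepted comment data: " + data);
    }
}
```

Whether a later serialization step rejects, repairs, or silently emits such a comment depends on the serializer in use, which is exactly the extra checking the paragraph above refers to.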

The closest analogue of XOM is DOM, and to some extent JDOM. However, XOM aims to remedy the problems of DOM by starting from a fresh design, valuing orthogonality, and centralizing control of the API in a single expert (the aforementioned Mr. Harold). XOM is basically Java-focused, even though a Python implementation also exists (but seems to have little benefit over other techniques in Python). However, the primary goal of XOM is not to be true to Java technology, but rather to be true to XML. Harold's goal is to capture and enforce the precise infoset of XML in a minimal set of relevant object methods.

The XML-focused orientation of XOM has two facets. On the one hand, it is impossible to create nodes that violate XML rules -- for example with disallowed tag names, or with null bytes in the content (constraints not checked by most APIs). On the other hand, XOM provides only methods that operate at the same conceptual level as XML itself -- for example, serialization is only as XML, CDATA sections are not retained as separate nodes, and XML attribute order is ignored.

A first application

To get the feel of XOM, I decided to write the same little application that I wrote in SSAX for the last installment. The outline utility takes an XML document as input, and produces a summary of it in an outline style, displaying the initial portion of the longer text sections for context. Moreover, namespaces are dropped from tag names, and only the local portion is displayed. This utility is not particularly complicated or compelling, but it does cover the basics of walking trees of children and attributes.

My test XML document is a highly reduced version of a recent developerWorks article:

Listing 1. An XML document with most XML features
$ cat example.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/css"
      href="http://gnosis.cx/dW/programming/dW.css" ?>
<dw-document xmlns:dw="http://www.ibm.com/developerWorks/">
  <title>The Twisted Matrix Framework</title>
  <author name="David Mertz, Ph.D.">
    <bio>David thinks it's turtles all the way down...</bio>
  </author>
  <docbody>
    <dw:heading refname="" toc="yes">Introduction</dw:heading>
    <p>
      Sorting through Twisted Matrix is reminiscent of the old story
      about blind men and elephants. Twisted Matrix has many
      capabilities within it, and it takes a bit of a gestalt switch
      to <i>get</i> a good sense of why they are all there.
    </p>
  </docbody>
</dw-document>

The output I want from my utility is:

Listing 2. An outline display of example.xml
$ java Outline example.xml
<dw-document>
  <title>
  <author name='David Mertz, Ph.D.'>
    <bio>
      |David thinks it's turtles all ...
  <docbody>
    <heading refname='' toc='yes'>
    <p>
      |
      Sorting through Twisted...
      <i>
      | a good sense of why they are ...

Simple enough -- and identical to the SSAX utility I presented.


Writing the outline utility

The main work of Outline.java is performed by the class nu.xom.Builder, which builds an in-memory XOM object based on an XML source. One surprise I found is that if you pass true as the constructor argument, XOM will insist on validation, even for XML documents that do not specify a DTD. In other words, all such documents will throw a ValidityException (but this might be system-dependent upon installed Java XML parsers). The best approach is probably to omit the validation flag, and let XOM figure out the best parser.
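The validation surprise most likely originates in the underlying parser rather than in XOM itself. As a rough sketch of that assumption, the following uses the JDK's own validating parser (not nu.xom.Builder) on a document with no DTD and counts the validity errors it reports. The names ValidationProbe and countValidityErrors are hypothetical, and the exact number of errors may differ between parser implementations:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

class ValidationProbe {
    // Counts the validity errors a validating JDK parser reports for the XML text.
    static int countValidityErrors(String xml) {
        final int[] errors = {0};
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true);  // request DTD validation
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new DefaultHandler() {
                @Override
                public void error(SAXParseException ex) {
                    errors[0]++;  // recoverable (validity) errors arrive here
                }
            });
            builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception ex) {
            throw new IllegalStateException(ex);
        }
        return errors[0];
    }

    public static void main(String[] args) {
        // Well-formed, but it declares no DTD to validate against
        int count = countValidityErrors("<greeting>hello</greeting>");
        System.out.println("validity errors reported: " + count);
    }
}
```

A library that turns every reported validity error into an exception would therefore reject any DTD-less document once validation is switched on.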

Listing 3. The Outline.java utility
import nu.xom.*;
import java.io.IOException;

public class Outline {
  public static void main(String[] args) {
    try {
      // Use 'Builder(true)' to require validation
      Builder parser = new Builder();
      Document doc = parser.build(args[0]);
      showElement(doc.getRootElement(), 0);
    }
    catch (ValidityException ex) {
      System.err.println(args[0]+" is invalid.");
    }
    catch (ParseException ex) {
      System.err.println(args[0]+" is not well-formed.");
    }
    catch (java.io.IOException ex) {
      System.err.println(args[0]+" cannot be read");
    }
  }
  private static void showElement(Element element, int level) {
    // Show the tag, along with its attributes
    indent(level, "<"+element.getLocalName());
    for (int i=0; i < element.getAttributeCount(); i++) {
      Attribute attr = element.getAttribute(i);
      System.out.print(" "+attr.getLocalName()+"='"+attr.getValue()+"'");
    }
    System.out.println(">");

    // Now loop through child nodes
    for (int i=0; i < element.getChildCount(); i++) {
      Node node = element.getChild(i);
      if (node instanceof Text) {
        String text = node.getValue();
        if (text.length() > 30) {
          indent(level+1, "|"+text.substring(0,30)+"...\n");
        }
      } else if (node instanceof Element) {
        showElement((Element)node, level+1);
      }
    }
  }
  private static void indent(int level, String string) {
    for (int i=0; i < level; i++) { System.out.print("  "); }
    System.out.print(string);
  }
}

The organization here is pretty straightforward. The .showElement() method displays the name and attributes of each element, then recurses to its children, incrementally indenting a level on each recursion.

In designing this utility, I took an illustrative misstep. The Element class has a .getChildElements() method that returns a traversable list of elements -- excluding other Node objects from the enumeration. On its face, using this enumeration would seem more straightforward; the method is, in fact, widely useful since you can optionally limit the enumeration to children with a given name. Since an Element also has a .getValue() method for retrieving the PCDATA, it appears that you could grab these content strings with each such child element.

Unfortunately, the semantics of .getValue() are slightly wrong for my intended use: .getValue() retrieves all the text inside a given tag -- not just that portion of it leading up to the next child tag. For instance, in the above example, the blurb inside the <bio> element is also thereby inside the enclosing <author> element, so author.getValue() retrieves stuff I do not want. As a result, I have to walk through all the child nodes, and decide what to do with each based on which subclass of Node I find. In particular, for purposes of this utility, I am only interested in Text and Element; not Comment, ProcessingInstruction, DocType, etc.
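The same pitfall can be reproduced without XOM. The sketch below uses the JDK's built-in DOM, whose getTextContent() also returns the text of the whole subtree, as a stand-in for the .getValue() behavior just described; DirectText, parseRoot, and directText are illustrative names, not part of any library. Collecting only the immediate Text children gives the narrower result the outline utility needs:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.w3c.dom.Text;
import org.xml.sax.InputSource;

class DirectText {
    // Parses an XML string and returns its root element.
    static Element parseRoot(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement();
        } catch (Exception ex) {
            throw new IllegalStateException(ex);
        }
    }

    // Concatenates only the text nodes that are immediate children of the element.
    static String directText(Element element) {
        StringBuilder sb = new StringBuilder();
        NodeList children = element.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child instanceof Text) {
                sb.append(child.getNodeValue());
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Element author = parseRoot("<author>by <bio>turtles</bio> et al.</author>");
        System.out.println("whole subtree: [" + author.getTextContent() + "]");
        System.out.println("direct only:   [" + directText(author) + "]");
    }
}
```

The first line printed includes the nested bio text, while the second contains only the text that belongs directly to the author element.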


Creating a new XML document

While, in my opinion, the main benefit of XML APIs is in parsing and traversing existing XML documents, sometimes you also want to create new documents within a program -- or at least modify existing ones. For the simplest tasks, basic string operations really do suffice. But it's not hard to make a programming error, and fail to close a tag or escape a special value. Using XOM for document creation guards against any such errors.
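To illustrate how easily string operations go wrong, here is a small sketch that builds XML by concatenation and then checks the result with the JDK parser. Everything in it (NaiveXml, naive, escaped, isWellFormed) is a hypothetical helper written for this example; the point is only that the escaping step is easy to forget, and that a tree API takes that burden off the programmer:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class NaiveXml {
    // Naive approach: paste the value straight between the tags.
    static String naive(String value) {
        return "<root>" + value + "</root>";
    }

    // Hand-rolled escaping of the characters that are special in element content.
    static String escaped(String value) {
        String safe = value.replace("&", "&amp;")
                           .replace("<", "&lt;")
                           .replace(">", "&gt;");
        return "<root>" + safe + "</root>";
    }

    // Returns true if the JDK parser accepts the text as well-formed XML.
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing to stderr
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException ex) {
            return false;
        } catch (Exception ex) {
            throw new IllegalStateException(ex);
        }
    }

    public static void main(String[] args) {
        String value = "AT&T <research>";
        System.out.println("naive well-formed?   " + isWellFormed(naive(value)));
        System.out.println("escaped well-formed? " + isWellFormed(escaped(value)));
    }
}
```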

Here is a brief example, mostly taken from the XOM tutorial:

Listing 4. HelloWorld.java
import nu.xom.*;
public class HelloWorld {
  public static void main(String[] args) {
    Element root = new Element("root");
    root.appendChild("Hello World!");
    Attribute foo = new Attribute("foo","bar");
    root.addAttribute(foo);
    Document doc = new Document(root);
    String result = doc.toXML();
    System.out.println(result);
  }
}

This outputs the following:

Listing 5. HelloWorld output
$ java HelloWorld
<?xml version="1.0"?>
<root foo="bar">Hello World!</root>

Beyond the basic .appendChild() and .addAttribute() methods, the .copy() and .detach() methods and the .remove*() collection are useful for rearranging XOM trees. Every tree and every node inside it has a .toXML() method, and this is the sole serialization format for XOM objects.


Comparisons

In writing my little outline utility, I became curious about how convenient XOM really is compared to other APIs. Since the same utility was written for the last installment on SSAX, that makes for an obvious comparison. As it turns out, the Scheme and Java language versions -- using SSAX and XOM respectively -- work out to pretty much the same length in lines, despite Scheme's use of macros and dynamic typing. Of course, the coding style is very different, and the Scheme actually uses fewer characters (if you ignore the larger number of comments in the SSAX version).

Regular readers of this column, however, know that I often advocate Python -- specifically my own Gnosis Utilities APIs. I decided to make a quick shot at the same utility using the latest development version of gnosis.xml.objectify:

Listing 6. outline.py utility
from sys import stdin, stdout, stderr
from gnosis.xml.objectify import XML_Objectify, \
            make_instance, tagname, content, attributes
XML_Objectify.expat_kwargs['nspace_sep'] = None

def showNode(node, level=0):
    stdout.write("  "*level+"<"+tagname(node))
    for key,val in attributes(node).items():
        stdout.write(" %s='%s'" % (key,val))
    stdout.write(">\n")
    for child in content(node):
        if isinstance(child, unicode):
            if len(child) > 30:
                stdout.write("  "*(level+1)+"|"+child[:30]+"...\n")
        else:
            showNode(child, level+1)
showNode(make_instance(stdin))

I found it interesting that the Java language version with XOM was still about 2.5 times as long (and was also very close to the same speed, once I benchmarked against a large XML version of Shakespeare's Hamlet; Python's smaller startup time biases small tests).

Much of the extra code in the Java language relates to the various exception checking in the Outline.main() method. In Python, I can let the built-in exception stacks do the work for me; of course, if I were to start doing something more meaningful with exceptions than just report them, then Python would start to look more like the Java language.

Obviously, however, programmers who want to use Java technology, for whatever reason, gain little benefit from knowing that libraries for Python or Scheme might allow more compact code. And the Java language certainly has a number of strengths that can merit the extra verboseness.


Conclusion

The real problem with DOM is that it is good enough for many purposes, even though it has far too many methods, many overlapping in purpose and not named consistently. Committees and legacies do that. Despite that, everyone already has a DOM library handy -- not just Java programmers, but also programmers of many other languages. It is too easy to just choose DOM because it is widespread and available.

Although I would not generally choose to write in the Java language if I had the option to write Python (or maybe Ruby, or even Perl), XOM really does everything better than DOM. XOM is more correct, easier to learn, and more consistent. Most of its capabilities have not been covered in this introduction, but rest assured it incorporates the usual collection of XML technologies: XPath, XSLT, XInclude, the ability to interface with SAX and DOM, and so on.

If you are doing XML development in the Java language, and you are able to include a custom LGPL library in your application, I strongly recommend that you give XOM a serious look.
