Tip: Use language-specific tools for XML processing

Alternatives to SAX and DOM

Uche Ogbuji (uche@ogbuji.net), Principal Consultant, Fourthought, Inc.
Uche Ogbuji is a consultant and co-founder of Fourthought Inc., a software vendor and consultancy specializing in XML solutions for enterprise knowledge management. Fourthought develops 4Suite, an open source platform for XML, RDF, and knowledge-management applications. Mr. Ogbuji is a computer engineer and writer born in Nigeria, living and working in Boulder, Colorado, USA. You can contact Mr. Ogbuji at uche@ogbuji.net.

Summary:  DOM and SAX are the two best-known systems for XML processing, but they are really compromises across programming languages. As such, they do not take advantage of any one language's particular strengths. Often it is better to duck conventional wisdom and use language-specific APIs that do.

Date:  30 Jan 2004
Level:  Intermediate

SAX and DOM are the ruling pair of XML processing APIs. The choice developers generally know best is between the standard model of SAX (push events from the parser to a detached handler) and that of DOM (parse the document into a tree of readily accessible objects). SAX generally offers better performance on sizeable documents, and DOM generally offers more straightforward code. SAX was designed for the Java language, although bindings to other languages have been developed. In these other languages, however, the Java heritage of SAX generally shows through, and many of the strengths of the language being used are forfeited. DOM was designed to be as language-neutral as possible, specified in ISO Interface Definition Language (IDL); standard bindings exist for the Java language and ECMAScript (JavaScript), but these still reflect the language-neutral constructs of the IDL, and all the language bindings, official or unofficial, again forfeit some strengths of the host language.
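
To make the contrast concrete, here is a minimal sketch in Python that extracts the same data with both models, using the standard library's xml.sax and xml.dom.minidom modules. The sample document and the names TitleHandler, titles_with_sax, and titles_with_dom are invented for illustration and do not come from any toolkit discussed in this tip.

```python
import xml.sax
from xml.dom import minidom

# Sample document, invented for this illustration.
DOC = b"""<catalog>
  <book id="b1"><title>XML Basics</title></book>
  <book id="b2"><title>Python and XML</title></book>
</catalog>"""


class TitleHandler(xml.sax.ContentHandler):
    """SAX: the parser pushes events here; the handler keeps its own state."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False
        self._chunks = []

    def startElement(self, name, attrs):
        if name == "title":
            self._in_title = True
            self._chunks = []

    def characters(self, content):
        # Text may arrive in several pieces, so collect it until the end tag.
        if self._in_title:
            self._chunks.append(content)

    def endElement(self, name):
        if name == "title":
            self.titles.append("".join(self._chunks))
            self._in_title = False


def titles_with_sax(data):
    handler = TitleHandler()
    xml.sax.parseString(data, handler)
    return handler.titles


def titles_with_dom(data):
    """DOM: build the whole tree first, then query it."""
    doc = minidom.parseString(data)
    titles = []
    for node in doc.getElementsByTagName("title"):
        text = "".join(
            child.data
            for child in node.childNodes
            if child.nodeType == child.TEXT_NODE
        )
        titles.append(text)
    return titles


print(titles_with_sax(DOC))
print(titles_with_dom(DOC))
```

The SAX version never holds the document in memory but has to track where it is by hand; the DOM version reads more directly but builds the full tree. In both cases the method names (startElement, getElementsByTagName) follow the cross-language specifications rather than Python conventions.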

To take better advantage of core language strengths, developers have created XML processing APIs that are native to particular languages. Almost all well-known languages have one or more toolkits offering such an API. For some time, the conventional wisdom has been that it's best to stick to SAX and DOM for maximum portability, but experience has convinced me that this consideration is more often than not overstated. For one thing, because the language bindings for SAX and DOM deviate from one another, code is rarely truly portable across languages; the work needed to adapt the code from one language to another is still considerable. Using SAX and DOM usually does improve portability between implementations in the same language, but this has to be traded off against the productivity the programmer often loses by forfeiting some language strengths.

Pull APIs

One area where developers in several languages independently made early explorations is the pull DOM, a system that wraps SAX so that you can pull events from the parser rather than having them pushed to you. This adjustment generally allows for more straightforward code, and implementations usually use native language constructs to a greater extent than pure SAX or DOM. Java Specification Request (JSR) #173, the Streaming API for XML (StAX), defines a Java API for pull-parsing XML. Other pull APIs include libxml2's xmlTextReader, available for C, C++, Python, Perl, and the many other languages that have libxml2 wrappers. Python comes with an xml.dom.pulldom module, which offers a pull API.
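
As a rough illustration of the pull style, the following sketch uses Python's xml.dom.pulldom module. The sample document and the function name book_titles are assumptions made for this example. The calling code drives an ordinary for loop over the event stream and asks for a DOM subtree only where it wants one.

```python
from xml.dom import pulldom

# Sample document, invented for this illustration.
DOC = """<catalog>
  <book id="b1"><title>XML Basics</title></book>
  <book id="b2"><title>Python and XML</title></book>
</catalog>"""


def book_titles(text):
    """Map each book's id attribute to its title, pulling events on demand."""
    titles = {}
    events = pulldom.parseString(text)
    for event, node in events:
        if event == pulldom.START_ELEMENT and node.tagName == "book":
            # Read ahead to the matching end tag and build a small DOM
            # subtree for this one element only.
            events.expandNode(node)
            title = node.getElementsByTagName("title")[0]
            titles[node.getAttribute("id")] = "".join(
                child.data
                for child in title.childNodes
                if child.nodeType == child.TEXT_NODE
            )
    return titles


print(book_titles(DOC))
```

Because the loop belongs to the caller, state lives in ordinary local variables instead of handler attributes, and processing can stop early with a plain break.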


Marshallers and XML data structures

Another early convention apart from SAX and DOM was developing tools that turn XML into generic data structures native to the language -- a process called unmarshalling -- and vice versa (marshalling). The idea is to make developers in a specific language feel at home and not have to really think about the XML behind the data. Unfortunately, many developers are hostile to XML and this is often the only way they can find it palatable. But even for those who are comfortable with XML, marshalling tools are useful for quick and dirty processing: JDOM is a DOM-like API that sticks strictly to Java-language idioms; Python users have ElementTree, which creates a specialized data structure from XML, focusing on elements; Perl users have the now rather dated XML::Grove, which interchanges parsed XML, HTML, or SGML with a tree of Perl hashes; Ruby users have XMLification for very simple translation of Ruby objects to XML; an option for PHP is class_path_parser.php, which allows you to register XPath-like expressions for an XML source and dispatches PHP handler functions accordingly; an option for Haskell is Haskell2Xml, which allows you to read and write ordinary Haskell data as XML documents.
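
The round trip between XML and native data structures can be sketched with xml.etree.ElementTree, the descendant of ElementTree that now ships in Python's standard library. The sample document, the choice of a list of dictionaries, and the function names unmarshal and marshal are assumptions made for this example, not part of any tool named above.

```python
import xml.etree.ElementTree as ET

# Sample document, invented for this illustration.
DOC = """<catalog>
  <book id="b1"><title>XML Basics</title></book>
  <book id="b2"><title>Python and XML</title></book>
</catalog>"""


def unmarshal(text):
    """Turn the XML into a plain list of dictionaries."""
    root = ET.fromstring(text)
    return [
        {"id": book.get("id"), "title": book.findtext("title")}
        for book in root.iter("book")
    ]


def marshal(books):
    """Turn the list of dictionaries back into an XML string."""
    root = ET.Element("catalog")
    for book in books:
        elem = ET.SubElement(root, "book", id=book["id"])
        ET.SubElement(elem, "title").text = book["title"]
    return ET.tostring(root, encoding="unicode")


books = unmarshal(DOC)
print(books)
print(marshal(books))
```

Once the data is in lists and dictionaries, the rest of the program can sort, filter, and update it with ordinary Python and never touch an XML API.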


XML data bindings

A twist on marshalling that is emerging as a popular option is to use XML schema languages and other such sources to create data structures in the native language that use the vocabulary expressed in the XML document. Such systems are called XML data bindings, and in many cases they lead to the most natural possible manipulation of XML. Java technology users can look to JSR #31, "XML Data Binding Specification", better known as the Java Architecture for XML Binding (JAXB). The Castor, JBind, and JiBX tools offer some features similar to those of JAXB. Python users have Anobind, gnosis.xml.objectify, and xmltramp, which operate from direct inspection of the source XML, and generateDS.py, which uses a W3C XML Schema to drive the binding. An option for Perl is XML::Smart.
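
To give a feel for binding by direct inspection of the source XML, here is a deliberately small sketch built on the standard library's xml.etree.ElementTree. The Bound class and the bind function are hypothetical. They are not the API of Anobind, gnosis.xml.objectify, xmltramp, or any other tool named above, and they skip namespaces, mixed content, and element names that are not valid Python identifiers.

```python
import xml.etree.ElementTree as ET

# Sample document, invented for this illustration.
DOC = """<catalog>
  <book id="b1"><title>XML Basics</title></book>
  <book id="b2"><title>Python and XML</title></book>
</catalog>"""


class Bound:
    """Hypothetical binding object: child elements become Python attributes."""

    def __init__(self, element):
        self._element = element

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails. Private names are
        # refused so that a missing _element cannot cause endless recursion.
        if name.startswith("_"):
            raise AttributeError(name)
        children = [Bound(child) for child in self._element.findall(name)]
        if not children:
            raise AttributeError(name)
        return children

    def __getitem__(self, key):
        # XML attributes are read with dictionary-style access.
        return self._element.attrib[key]

    def __str__(self):
        return (self._element.text or "").strip()


def bind(text):
    """Parse the XML and wrap its root element."""
    return Bound(ET.fromstring(text))


catalog = bind(DOC)
for book in catalog.book:
    print(book["id"], str(book.title[0]))
```

The vocabulary of the document (catalog, book, title) appears directly in the code. This sketch always returns child elements as a list, which keeps the code simple at the cost of the extra [0]; real binding tools make their own choices on this point.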


Wrap-up

So regardless of which language you prefer, you have many options for processing XML. Don't be afraid to put aside conventional wisdom and look for options besides the ruling pair.

