The two most common systems for binding XML parsing to applications that need it are the W3C's Document Object Model (DOM) and the open community standard Simple API for XML (SAX). DOM is an API that allows code to point directly at the various properties of XML document parts and, thus, is easy to program. However, DOM usually requires that objects representing every portion of the document be in memory. Since the sum of these objects can take up 10 times as much memory as the document itself (or more), DOM can be very inefficient when dealing with large documents. SAX walks the document tree bit by bit and sends out events corresponding to the current node. This means that SAX can discard those parts of the document that are not in scope at the time, which makes it more efficient. DOM is an API that allows code to directly read and modify the various properties of XML document parts and, thus, is easy to program.
In order to give developers the ease of DOM and the efficiency of SAX, there have been many projects focusing on varieties of DOM that only load in parts of an XML document as they are requested. These APIs are called pull DOMs.
To parse or not to parse...
I will present an example from the Python language standard library. Even of you are not familiar with Python, I expect this example will be easy to follow. Recent versions of Python bundle several XML facilities, including a small DOM implementation, minidom, and a SAX library. Python also comes with a pull DOM (in a module called xml.dom.pulldom), which I will demonstrate. Listing 1 illustrates the use of Python's pull DOM to load Jon Bosak's well-known XML representation of Shakespeare's "Hamlet." The task is to print the first line in Act IV, scene II of the play.
1 #Get the first line in Act IV, scene II 2 3 from xml.dom import pulldom 4 5 hamlet_file = open("hamlet.xml") 6 7 events = pulldom.parse(hamlet_file) 8 act_counter = 0 9 for (event, node) in events: 10 if event == pulldom.START_ELEMENT: 11 if node.tagName == "ACT": 12 act_counter += 1 13 scene_counter = 1 14 if node.tagName == "SCENE": 15 if act_counter == 4 and scene_counter == 2: 16 events.expandNode(node) 17 #Traditional DOM processing starts here 18 #Get all descendant elements named "LINE" 19 line_nodes = node.getElementsByTagName("LINE") 20 #Print the text data of the text node 21 #of the first LINE element 22 print line_nodes.firstChild.data 23 scene_counter += 1
First, a sketch of the structure of hamlet.xml. The top-level element is
PLAY, which contains, among other elements, a number of
ACT elements, which in turn contain a number of
SPEECHes, which contain a collection of
LINEs, each spoken by a single actor. It's a pretty simple hierarchy.
After importing the library in line 3, I open the XML file and initialize its parsing in line 5.
pulldom parse returns an object representing a virtual collection of all the parsing events from the file. I loop over this collection in lines 9-23. Each iteration in the loop gets back an event and a virtual node that potentially represents the entire subtree rooted at that node. You can check what sort of event it is -- and by implication, what sort of node -- as well as a few superficial things about the node, such as the node name. If you want information about its children, you can either wait for the appropriate subsequent events or expand the node to its full actuality using the
In line 10, I check whether the current event is the start of an element, which is the only event type I bother with in the pull part of the program. If it is the
ACT element (checked in line 11), I update a counter of such elements (initialized on line 8), and reset the
SCENE elements counter. If it is a
SCENE element (checked in line 14), I check whether it is the act and scene number I want and update the counter if it isn't.
If it is the scene I want, I pull the entire DOM structure for the scene into memory, using
expandNode, as mentioned above. From this point on, the node is a regular DOM node, on which you can invoke regular DOM methods. In line 19, I use the
getElementsByTagName DOM method to get all descendant elements named
LINE. It is important to understand that if I had invoked this method prior to line 16, it would have caused an error; this is because there is no real DOM tree until the node is expanded. It is also important to remember that the point at which you choose to make this expansion determines the resulting efficiency. If I had chosen to expand the entire
ACT rather than the
SCENE, all the other
SCENE elements would have wound up in memory as well.
Finally, after I've grabbed the first
LINE element, I plumb it for its text node child in line 22, and print its content. Mission accomplished.
Pull DOM available in a language near you
This example is in Python, but a growing number of languages have some sort of pull API. Perl has XML::Twig, which is unfortunately not DOM-based, and the Java community is standardizing a pull DOM API of its own. Pull DOMs lie somewhere between SAX and DOM, and make it easy to process XML documents of arbitrary size without too much fuss.
- The Java community process is trying to standardize a pull API for Java: The Streaming API for XML (StAX), in JSR 173 Streaming API for XML.
- XML::Twig is a Perl module for pull-based processing, but it isn't DOM-based.
- And of course there's Python's pulldom module, which is used in this article.
- Here's a directory of the complete plays of Shakespeare (including "Hamlet") marked up in XML by Jon Bosak.
- IBM trial software: Build your next development project with trial software available for download directly from developerWorks.
- Find out how you can become an IBM Certified Developer in XML and related technologies.
- Want us to send you useful XML tips like this every week? Sign up for the developerWorks XML Tips newsletter here.
- Find more XML resources on the developerWorks XML zone. For a complete list of XML tips to date, check out the tips summary page.