Tip: Using pull-based DOMs

Finding the balance between easy and efficient programming

XML application developers usually have to contend with the complexities of SAX or the inefficiencies of DOM. This tip shows how a pull approach to DOM can effectively bridge the gap between the two by offering simple, efficient parsing.

Uche Ogbuji, Principal Consultant, Fourthought, Inc.

Uche Ogbuji is a consultant and co-founder of Fourthought Inc., a software vendor and consultancy specializing in XML solutions for enterprise knowledge management. Fourthought develops 4Suite, an open source platform for XML, RDF, and knowledge-management applications. Mr. Ogbuji is a computer engineer and writer born in Nigeria, living and working in Boulder, Colo., USA. You can contact Mr. Ogbuji at uche@ogbuji.net.

01 May 2002

The two most common systems for binding XML parsing to applications that need it are the W3C's Document Object Model (DOM) and the open community standard Simple API for XML (SAX). DOM is an API that allows code to directly read and modify the various properties of XML document parts and, thus, is easy to program. However, DOM usually requires that objects representing every portion of the document be in memory. Since the sum of these objects can take up 10 times as much memory as the document itself (or more), DOM can be very inefficient when dealing with large documents. SAX walks the document bit by bit and sends out events corresponding to the current node. This means that SAX can discard those parts of the document that are not in scope at the time, which makes it more efficient. The price is that the application has to keep track of its own state from one event to the next, which makes SAX harder to program.
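To make the contrast concrete, here is a minimal sketch of one small task (fetching the text of the first LINE element) done both ways with the Python standard library. The SAMPLE document and the names first_line_dom, FirstLineHandler, and first_line_sax are made up for illustration; they are not part of the article's listing.

```python
# Sketch only: SAMPLE and all function/class names here are illustrative.
import xml.sax
from xml.dom import minidom

SAMPLE = b"<PLAY><ACT><SCENE><LINE>first</LINE><LINE>second</LINE></SCENE></ACT></PLAY>"


def first_line_dom(data):
    # DOM: the whole document is built in memory, then queried directly.
    doc = minidom.parseString(data)
    return doc.getElementsByTagName("LINE")[0].firstChild.data


class FirstLineHandler(xml.sax.ContentHandler):
    # SAX: nothing is kept for us, so the handler tracks its own state.
    def __init__(self):
        super().__init__()
        self.in_line = False   # currently inside the first LINE?
        self.done = False      # first LINE already finished?
        self.chunks = []       # text pieces collected so far

    def startElement(self, name, attrs):
        if name == "LINE" and not self.done:
            self.in_line = True

    def characters(self, content):
        if self.in_line:
            self.chunks.append(content)

    def endElement(self, name):
        if name == "LINE" and self.in_line:
            self.in_line = False
            self.done = True


def first_line_sax(data):
    handler = FirstLineHandler()
    xml.sax.parseString(data, handler)
    return "".join(handler.chunks)


print(first_line_dom(SAMPLE))
print(first_line_sax(SAMPLE))
```

The DOM version is a two-line function, while the SAX version needs a class with three callbacks and two flags for the same result. That gap is what a pull DOM tries to close.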

In order to give developers the ease of DOM and the efficiency of SAX, there have been many projects focusing on varieties of DOM that only load in parts of an XML document as they are requested. These APIs are called pull DOMs.

To parse or not to parse...

I will present an example from the Python language standard library. Even if you are not familiar with Python, I expect this example will be easy to follow. Recent versions of Python bundle several XML facilities, including a small DOM implementation, minidom, and a SAX library. Python also comes with a pull DOM (in a module called xml.dom.pulldom), which I will demonstrate. Listing 1 illustrates the use of Python's pull DOM to load Jon Bosak's well-known XML representation of Shakespeare's "Hamlet." The task is to print the first line in Act IV, scene II of the play.

     1  #Get the first line in Act IV, scene II
     2
     3  from xml.dom import pulldom
     4
     5  hamlet_file = open("hamlet.xml")
     6
     7  events = pulldom.parse(hamlet_file)
     8  act_counter = 0
     9  for (event, node) in events:
    10      if event == pulldom.START_ELEMENT:
    11          if node.tagName == "ACT":
    12              act_counter += 1
    13              scene_counter = 1
    14          if node.tagName == "SCENE":
    15              if act_counter == 4 and scene_counter == 2:
    16                  events.expandNode(node)
    17                  #Traditional DOM processing starts here
    18                  #Get all descendant elements named "LINE"
    19                  line_nodes = node.getElementsByTagName("LINE")
    20                  #Print the text data of the text node
    21                  #of the first LINE element
    22                  print(line_nodes[0].firstChild.data)
    23              scene_counter += 1

First, a sketch of the structure of hamlet.xml. The top-level element is PLAY, which contains, among other elements, a number of ACT elements, which in turn contain a number of SCENE elements. SCENEs contain SPEECHes, which contain a collection of LINEs, each spoken by a single actor. It's a pretty simple hierarchy.

After importing the library in line 3, I open the XML file in line 5 and initialize its parsing in line 7. A pulldom parse returns an object representing a virtual collection of all the parsing events from the file. I loop over this collection in lines 9-23. Each iteration in the loop gets back an event and a virtual node that potentially represents the entire subtree rooted at that node. You can check what sort of event it is (and, by implication, what sort of node) as well as a few superficial things about the node, such as the node name. If you want information about its children, you can either wait for the appropriate subsequent events or expand the node to its full actuality using the expandNode method.
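To see what this virtual collection of events looks like, the following sketch walks a tiny document and records each event together with the name of its node. The SAMPLE string and the list_events helper are illustrative assumptions, not part of Listing 1; pulldom.parseString is the standard library's string-based counterpart to pulldom.parse.

```python
# Sketch only: SAMPLE and list_events are illustrative names.
from xml.dom import pulldom

SAMPLE = "<ACT><SCENE><LINE>Safely stowed.</LINE></SCENE></ACT>"


def list_events(xml_text):
    """Return (event, node name) pairs in the order the pull parser yields them."""
    pairs = []
    for event, node in pulldom.parseString(xml_text):
        pairs.append((event, node.nodeName))
    return pairs


for event, name in list_events(SAMPLE):
    print(event, name)
```

The event values are plain constants from the pulldom module, such as pulldom.START_ELEMENT, pulldom.CHARACTERS, and pulldom.END_ELEMENT, so filtering the stream is an ordinary comparison.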

In line 10, I check whether the current event is the start of an element, which is the only event type I bother with in the pull part of the program. If it is the ACT element (checked in line 11), I update a counter of such elements (initialized on line 8), and reset the SCENE elements counter. If it is a SCENE element (checked in line 14), I check in line 15 whether it is the act and scene number I want, and then update the scene counter in line 23 whether or not it matched.

If it is the scene I want, I pull the entire DOM structure for the scene into memory, using expandNode, as mentioned above. From this point on, the node is a regular DOM node, on which you can invoke regular DOM methods. In line 19, I use the getElementsByTagName DOM method to get all descendant elements named LINE. It is important to understand that if I had invoked this method prior to line 16, the program would have failed: there is no real DOM tree beneath the node until it is expanded, so there would be no LINE descendants to find and nothing to print in line 22. It is also important to remember that the point at which you choose to make this expansion determines the resulting efficiency. If I had chosen to expand the entire ACT rather than the SCENE, all the other SCENE elements in that act would have wound up in memory as well.
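The effect of expansion can be checked directly. This sketch counts the LINE descendants of the first SCENE before and after calling expandNode; the SAMPLE document and the function name lines_before_and_after_expand are assumptions made for the example.

```python
# Sketch only: SAMPLE and the function name are illustrative.
from xml.dom import pulldom

SAMPLE = (
    "<ACT>"
    "<SCENE><SPEECH><LINE>one</LINE><LINE>two</LINE></SPEECH></SCENE>"
    "<SCENE><SPEECH><LINE>three</LINE></SPEECH></SCENE>"
    "</ACT>"
)


def lines_before_and_after_expand(xml_text):
    """Count LINE descendants of the first SCENE before and after expansion."""
    events = pulldom.parseString(xml_text)
    for event, node in events:
        if event == pulldom.START_ELEMENT and node.tagName == "SCENE":
            # The node exists, but nothing has been attached beneath it yet.
            before = len(node.getElementsByTagName("LINE"))
            # Pull the rest of this SCENE's subtree into memory.
            events.expandNode(node)
            after = len(node.getElementsByTagName("LINE"))
            return before, after
    return None


print(lines_before_and_after_expand(SAMPLE))
```

Only the first SCENE is expanded here, so the LINE in the second SCENE is never attached to the node being inspected.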

Finally, after I've grabbed the first LINE element, I plumb it for its text node child in line 22, and print its content. Mission accomplished.
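The same logic can be wrapped in a reusable function that stops reading as soon as the wanted scene has been handled. This is one possible variation rather than the listing itself: the function name first_line, its parameters, and the SAMPLE document are hypothetical, and the scene counter here starts from 0 and is incremented before the comparison.

```python
# Sketch only: first_line, its parameters, and SAMPLE are illustrative.
import io
from xml.dom import pulldom


def first_line(stream, act, scene):
    """Return the text of the first LINE in the given act and scene, or None."""
    events = pulldom.parse(stream)
    act_counter = 0
    scene_counter = 0
    for event, node in events:
        if event != pulldom.START_ELEMENT:
            continue
        if node.tagName == "ACT":
            act_counter += 1
            scene_counter = 0
        elif node.tagName == "SCENE":
            scene_counter += 1
            if act_counter == act and scene_counter == scene:
                # Only this one SCENE is expanded into a real DOM subtree.
                events.expandNode(node)
                lines = node.getElementsByTagName("LINE")
                if not lines:
                    return None
                # Join the text children in case the text arrived in pieces.
                return "".join(
                    child.data
                    for child in lines[0].childNodes
                    if child.nodeType == child.TEXT_NODE
                )
    return None


SAMPLE = (
    "<PLAY>"
    "<ACT><SCENE><SPEECH><LINE>a1s1</LINE></SPEECH></SCENE>"
    "<SCENE><SPEECH><LINE>a1s2</LINE></SPEECH></SCENE></ACT>"
    "<ACT><SCENE><SPEECH><LINE>a2s1</LINE></SPEECH></SCENE>"
    "<SCENE><SPEECH><LINE>a2s2</LINE><LINE>more</LINE></SPEECH></SCENE></ACT>"
    "</PLAY>"
)

print(first_line(io.StringIO(SAMPLE), 2, 2))
```

With the real file, a call such as first_line(open("hamlet.xml"), 4, 2) would correspond to the task in Listing 1. Returning from inside the loop means no further events are requested once the scene has been found.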

Pull DOM available in a language near you

This example is in Python, but a growing number of languages have some sort of pull API. Perl has XML::Twig, which is unfortunately not DOM-based, and the Java community is standardizing a pull DOM API of its own. Pull DOMs lie somewhere between SAX and DOM, and make it easy to process XML documents of arbitrary size without too much fuss.


