Tip: Using pull-based DOMs

Finding the balance between easy and efficient programming

Uche Ogbuji (uche@ogbuji.net), Principal Consultant, Fourthought, Inc.
Uche Ogbuji is a consultant and co-founder of Fourthought Inc., a software vendor and consultancy specializing in XML solutions for enterprise knowledge management. Fourthought develops 4Suite, an open source platform for XML, RDF, and knowledge-management applications. Mr. Ogbuji is a computer engineer and writer born in Nigeria, living and working in Boulder, Colo., USA. You can contact Mr. Ogbuji at uche@ogbuji.net.

Summary:  XML application developers usually have to contend with the complexities of SAX or the inefficiencies of DOM. This tip shows how a pull approach to DOM can effectively bridge the gap between the two by offering simple, efficient parsing.

Date:  01 May 2002
Level:  Introductory

The two most common systems for binding XML parsing to applications that need it are the W3C's Document Object Model (DOM) and the open community standard Simple API for XML (SAX). DOM is an API that allows code to directly read and modify the various properties of XML document parts and, thus, is easy to program. However, DOM usually requires that objects representing every portion of the document be in memory. Since the sum of these objects can take up 10 times as much memory as the document itself (or more), DOM can be very inefficient when dealing with large documents. SAX, by contrast, walks the document bit by bit and sends out events corresponding to the current node. This means that SAX can discard those parts of the document that are not in scope at the time, which makes it more efficient. The trade-off is that the application must itself track any context it needs from one event to the next, which makes SAX harder to program.
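To make the contrast concrete, here is a minimal sketch that fetches the first LINE of a tiny document both ways, using Python's standard minidom and SAX modules. The inline document and the FirstLineHandler class are illustrative assumptions, not part of any real application.

```python
import xml.sax
from xml.dom import minidom

# Tiny stand-in document; the content is illustrative only.
DOC = "<PLAY><ACT><SCENE><LINE>Who's there?</LINE></SCENE></ACT></PLAY>"

# DOM: the whole tree is built in memory first, then queried directly.
dom = minidom.parseString(DOC)
dom_line = dom.getElementsByTagName("LINE")[0].firstChild.data


# SAX: nothing is retained by the parser, so the handler must keep
# track of where it is in the document on its own.
class FirstLineHandler(xml.sax.ContentHandler):
    """Hypothetical handler that collects the text of the first LINE."""

    def __init__(self):
        super().__init__()
        self.in_line = False
        self.done = False
        self.chunks = []

    def startElement(self, name, attrs):
        if name == "LINE" and not self.done:
            self.in_line = True

    def characters(self, content):
        if self.in_line:
            self.chunks.append(content)

    def endElement(self, name):
        if name == "LINE" and self.in_line:
            self.in_line = False
            self.done = True


handler = FirstLineHandler()
xml.sax.parseString(DOC.encode("utf-8"), handler)
sax_line = "".join(handler.chunks)

print(dom_line)
print(sax_line)
```

The DOM version is two lines; the SAX version needs a class with three flags' worth of bookkeeping to answer the same question.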

In order to give developers the ease of DOM and the efficiency of SAX, there have been many projects focusing on varieties of DOM that load only those parts of an XML document that are requested, as they are requested. These APIs are called pull DOMs.

To parse or not to parse...

I will present an example from the Python language standard library. Even if you are not familiar with Python, I expect this example will be easy to follow. Recent versions of Python bundle several XML facilities, including a small DOM implementation, minidom, and a SAX library. Python also comes with a pull DOM (in a module called xml.dom.pulldom), which I will demonstrate. Listing 1 illustrates the use of Python's pull DOM to load Jon Bosak's well-known XML representation of Shakespeare's "Hamlet." The task is to print the first line in Act IV, scene II of the play.

     1	# Get the first line in Act IV, scene II
     2	
     3	from xml.dom import pulldom
     4	
     5	hamlet_file = open("hamlet.xml")
     6	
     7	events = pulldom.parse(hamlet_file)
     8	act_counter = 0
     9	for (event, node) in events:
    10	    if event == pulldom.START_ELEMENT:
    11	        if node.tagName == "ACT":
    12	            act_counter += 1
    13	            scene_counter = 1
    14	        if node.tagName == "SCENE":
    15	            if act_counter == 4 and scene_counter == 2:
    16	                events.expandNode(node)
    17	                # Traditional DOM processing starts here
    18	                # Get all descendant elements named "LINE"
    19	                line_nodes = node.getElementsByTagName("LINE")
    20	                # Print the text data of the text node
    21	                # of the first LINE element
    22	                print(line_nodes[0].firstChild.data)
    23	            scene_counter += 1


First, a sketch of the structure of hamlet.xml. The top-level element is PLAY, which contains, among other elements, a number of ACT elements, which in turn contain a number of SCENE elements. SCENEs contain SPEECHes, which contain a collection of LINEs, each spoken by a single actor. It's a pretty simple hierarchy.

After importing the library in line 3, I open the XML file in line 5 and initialize its parsing in line 7. A pulldom parse returns an object representing a virtual collection of all the parsing events from the file. I loop over this collection in lines 9-23. Each iteration in the loop gets back an event and a virtual node that potentially represents the entire subtree rooted at that node. You can check what sort of event it is -- and by implication, what sort of node -- as well as a few superficial things about the node, such as the node name. If you want information about its children, you can either wait for the appropriate subsequent events or expand the node to its full actuality using the expandNode method.
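To see what this virtual collection of events looks like, the following sketch runs pulldom over a small inline string (an assumption made for brevity; the listing reads a file) and records each element and text event as it arrives.

```python
from xml.dom import pulldom

# Miniature stand-in for one act of the play; illustrative only.
DOC = "<ACT><SCENE><LINE>Hi</LINE></SCENE></ACT>"

seen = []
for event, node in pulldom.parseString(DOC):
    if event in (pulldom.START_ELEMENT, pulldom.END_ELEMENT):
        # Element events carry a node whose name can be inspected.
        seen.append((event, node.tagName))
    elif event == pulldom.CHARACTERS:
        # Text arrives as its own event, separate from the element.
        seen.append((event, node.data))

for item in seen:
    print(item)
```

Note that the text inside LINE shows up as a later event rather than as a child of the LINE node; that is the "wait for the appropriate subsequent events" option described above.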

In line 10, I check whether the current event is the start of an element, which is the only event type I bother with in the pull part of the program. If it is the ACT element (checked in line 11), I update a counter of such elements (initialized on line 8), and reset the SCENE elements counter. If it is a SCENE element (checked in line 14), I check whether it is the act and scene number I want, and in either case increment the scene counter afterward (line 23).

If it is the scene I want, I pull the entire DOM structure for the scene into memory, using expandNode, as mentioned above. From this point on, the node is a regular DOM node, on which you can invoke regular DOM methods. In line 19, I use the getElementsByTagName DOM method to get all descendant elements named LINE. It is important to understand that if I had invoked this method prior to line 16, it would not have found the LINE elements, and line 22 would have failed; this is because there is no real DOM tree under the node until it is expanded. It is also important to remember that the point at which you choose to make this expansion determines the resulting efficiency. If I had chosen to expand the entire ACT rather than the SCENE, all the other SCENE elements in that act would have wound up in memory as well.
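The effect of expansion can be checked on a small scale. The sketch below wraps the same counting logic in a hypothetical helper, first_line, and runs it on an inline miniature document (both are assumptions for illustration). It also counts the LINE descendants visible before expandNode is called, and returns as soon as the wanted scene is handled so the rest of the document is never visited.

```python
from xml.dom import pulldom

# Two acts with one scene each; illustrative stand-in for hamlet.xml.
DOC = (
    "<PLAY>"
    "<ACT><SCENE><SPEECH><LINE>one</LINE></SPEECH></SCENE></ACT>"
    "<ACT><SCENE><SPEECH><LINE>two</LINE><LINE>three</LINE></SPEECH></SCENE></ACT>"
    "</PLAY>"
)


def first_line(xml_text, act, scene):
    """Return (LINE count before expansion, text of first LINE)."""
    events = pulldom.parseString(xml_text)
    act_counter = 0
    scene_counter = 0
    for event, node in events:
        if event != pulldom.START_ELEMENT:
            continue
        if node.tagName == "ACT":
            act_counter += 1
            scene_counter = 0
        elif node.tagName == "SCENE":
            scene_counter += 1
            if act_counter == act and scene_counter == scene:
                # The node has no subtree yet, so nothing is found here.
                before = len(node.getElementsByTagName("LINE"))
                # Pull just this SCENE into memory as a DOM subtree.
                events.expandNode(node)
                lines = node.getElementsByTagName("LINE")
                # Stop here; later events are not needed.
                return before, lines[0].firstChild.data
    return None


print(first_line(DOC, 2, 1))
```

Returning early is a small design choice that the listing does not make: the listing keeps iterating to the end of the file after printing.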

Finally, after I've grabbed the first LINE element, I plumb it for its text node child in line 22, and print its content. Mission accomplished.
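Reading firstChild.data works when the LINE begins with plain text. As a hedged aside, assuming a LINE may also contain child elements (a stage direction, say), a small recursive helper can gather all the text instead. The name text_of and the sample LINE below are illustrative assumptions.

```python
from xml.dom import minidom


def text_of(node):
    """Concatenate all text found in node and its descendants."""
    parts = []
    for child in node.childNodes:
        if child.nodeType == child.TEXT_NODE:
            parts.append(child.data)
        elif child.nodeType == child.ELEMENT_NODE:
            parts.append(text_of(child))
    return "".join(parts)


# A LINE whose first child is an element rather than a text node.
doc = minidom.parseString(
    "<LINE><STAGEDIR>Aside</STAGEDIR> A little more than kin.</LINE>"
)
line = doc.documentElement

print(text_of(line))
```

Because the helper uses only standard DOM properties, it can be applied unchanged to a node that has been expanded with expandNode.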


Pull DOM available in a language near you

This example is in Python, but a growing number of languages have some sort of pull API. Perl has XML::Twig, which is unfortunately not DOM-based, and the Java community is standardizing a pull DOM API of its own. Pull DOMs lie somewhere between SAX and DOM, and make it easy to process XML documents of arbitrary size without too much fuss.

