Skip to main content

If you don't have an IBM ID and password, register here.

By clicking Submit, you agree to the developerWorks terms of use.

The first time you sign into developerWorks, a profile is created for you. This profile includes the first name, last name, and display name you identified when you registered with developerWorks. Select information in your developerWorks profile is displayed to the public, but you may edit the information at any time. Your first name, last name (unless you choose to hide them), and display name will accompany the content that you post.

All information submitted is secure.

The first time you sign in to developerWorks, a profile is created for you, so you need to choose a display name. Your display name accompanies the content you post on developerworks.

Please choose a display name between 3-31 characters. Your display name must be unique in the developerWorks community and should not be your email address for privacy reasons.

By clicking Submit, you agree to the developerWorks terms of use.

All information submitted is secure.

Tip: Using generators

Streamlining DOM XML processing with Python

Uche Ogbuji (uche@ogbuji.net), Principal Consultant, Fourthought, Inc.
Photo of Uche Ogbuji
Uche Ogbuji is a consultant and co-founder of Fourthought Inc., a software vendor and consultancy specializing in XML solutions for enterprise knowledge management. Fourthought develops 4Suite, an open source platform for XML, RDF, and knowledge-management applications. Mr. Ogbuji is a Computer Engineer and writer born in Nigeria, living and working in Boulder, Colorado, USA. You can contact him at uche@ogbuji.net.

Summary:  Generators are a very powerful new language feature of Python 2.2. In this tip, Uche Ogbuji presents a set of techniques for using generators for fast and lucid XML processing patterns in Python.

View more content in this series

Date:  01 Oct 2002
Level:  Introductory

Comments:  


Of the numerous features recently added to Python, generators truly stand out. Fredrik Lundh, a Python insider well known for wry commentary, put it best: Comparing generators to other recent changes in Python, he said, "Generators [open] a new universe; the rest is more like moving the furniture around." In this tip, I show you how generators can be used to make XML DOM processing with Python even smoother.

Note: In his excellent series on generators in the IBM developerWorks Linux zone (see Resources), David Mertz discusses the basics of generators and some neat applications. If you are not familiar with generators, I recommend that you review at least his first article on the topic. You will need Python 2.2 to use generators.

Smart and fast iterators

Node iterators are a very common design pattern in XML, and are a fixture of DOM. A node iterator visits every node in an XML document in turn, following the standard document order. Node iterators are notorious for being slow and fiddly, but generators in Python offer a wonderfully concise and efficient implementation of node iterators. See Listing 1.


Listing 1. A simple node iterator using generators (nodeiter1.py)
                
		
#Required in Python 2.2, and must be the first import
from __future__ import generators
from xml.dom import Node

def doc_order_iter(node):
    """
    Iterates over each node in document order,
    returning each in turn
    """
    #Document order returns the current node,
    #then each of its children in turn
    yield node
    for child in node.childNodes:
        #Create a generator for each child,
        #Over which to iterate
        for cn in doc_order_iter(child):
            yield cn
    return

#Parse a document then print the Python representation
#of each node object in document order

DOC = """\
<memo xmlns='http://spam.com'>
<date form="iso-8601">2002-08-14</date>
This is just to <strong>say</strong>:
I ate the eggs you left in the fridge
And were probably saving for breakfast.
Do you know?  They were <emph>quite</emph> rotten.
</memo>
"""

from xml.dom import minidom
doc = minidom.parseString(DOC)
for node in doc_order_iter(doc):
    print node

Traversing trees is one of the classic use-cases for generators (such a traversal is the main example in the official specification for Python generators). This means that XML's natural tree structure is very well suited to generator idioms. You can copy the doc_order_iter function to your code and use it any time you need to iterate over a document's nodes (a fairly common requirement in XML processing). The function starts by yielding the given node right away. Remember that this suspends its operation and returns control to the caller. The next time it is called, it initiates a loop over its children. The function recursively sets up a generator for each of its child nodes. This has the effect of traversing down the tree such that each node is visited first, and then each of its children, from left to right. The return at the end signals the complete termination of the generator's operation. The output is shown in Listing 2:


Listing 2. The output
                
	
$ python nodeiter1.py
<xml.dom.minidom.Document instance at 0x8267a1c>
<DOM Element: memo at 136739676>
<DOM Text node "
">
<DOM Element: date at 136742132>
<DOM Text node "2002-08-14">
<DOM Text node "
">
<DOM Text node "This is ju...">
<DOM Element: strong at 136744780>
<DOM Text node "say">
<DOM Text node ":">
<DOM Text node "
">
<DOM Text node "I ate the ...">
<DOM Text node "
">
<DOM Text node "And were p...">
<DOM Text node "
">
<DOM Text node "Do you kno...">
<DOM Element: emph at 136748228>
<DOM Text node "quite">
<DOM Text node " rotten.">
<DOM Text node "
">


How about attributes?

doc_order_iter does not iterate over attribute nodes because they are not considered children of their parent nodes in the DOM. It's not likely that you'll need to do a full iteration that includes attributes, but as an illustration, Listing 3 is a function that does so.


Listing 3. full_doc_order_iter also iterates over attributes
                

from xml.dom import Node

def full_doc_order_iter(node):
    """
    Returns each node in document order,
    including attributes, which are visited
    immediately after their element
    """
    yield node
    if node.nodeType == Node.ELEMENT_NODE:
        #Attributes are stored in a dictionary and
        #have no set order.  The values() call
        #gets a list of actual attribute node objects
        #from the dictionary
        for attr in node.attributes.values():
            yield attr
    for child in node.childNodes:
        for cn in full_doc_order_iter(child):
            yield cn
    return

After the given node is yielded, if it is an element node, its attributes are yielded as well (in arbitrary order).

Warning: This function is written as if attribute nodes have no children (and therefore there is no need to spawn a new generator for each attribute node). In the DOM proper, attributes can have one or more children, which must be text nodes. This is something of an oddity and annoyance in the DOM, and in practice, most people treat attributes atomically, even in DOM processing.

Being selective

The iterators presented so far return all the nodes in the document, without discrimination. But sometimes you want to be more selective. You can easily modify these generators to apply a criteria against each node in order to decide whether to return it. Listing 4 shows a function, doc_order_iter_filter, which is a general framework for iterating over all nodes in a document (not including attribute nodes) and returning only those that pass a filter function given to the generator.


Listing 4: doc_order_iter_filter iterates over non-attribute nodes, returning those that pass a filter
                

def doc_order_iter_filter(node, filter_func):
    if filter_func(node):
        yield node
    for child in node.childNodes:
        for cn in doc_order_iter_filter(child, filter_func):
            yield cn
    return

This time a filter function is passed in. This function takes a node object and returns 0 or 1. You can make this function as simple or complex as you please in its selection of nodes. Before yielding the given node, it is checked against the filter function. This check is applied again for each of the values returned from the recursive generators that are spawned.

As an example, Listing 5 uses this framework to generate only the elements in the document.


Listing 5. Using the framework to generate only the elements in a document
                

def get_elements(node):
    elem_filter = lambda n: n.nodeType == Node.ELEMENT_NODE
    return doc_order_iter_filter(node, elem_filter)

The filter setup does a simple check of the node type. Then a generator is created and returned, prepped to iterate over all nodes with the given filter applied. It is important to remember that the last line is not an ordinary function call. It creates a generator -- a function in suspended animation, in a way, and the calling code can then treat this return result as an iterator. This iterator, in turn, returns a new, yielded value each time it is called until the generator hits the return statement. This means that the order of control is not typical of procedural languages: In effect, the doc_order_iter_filter generator outlives the function (get_elements) from which it is invoked.


Summary

Generators are a true boon to XML processing in Python. I have found myself using them so frequently in XML processing that it's hard for me to imagine how I did without them before. Use generators when you have to perform some operation repeatedly over all or selected nodes in the document. It does not have to be in document order. You can arrange the flow of control however you need to yield the results you require.


Resources

About the author

Photo of Uche Ogbuji

Uche Ogbuji is a consultant and co-founder of Fourthought Inc., a software vendor and consultancy specializing in XML solutions for enterprise knowledge management. Fourthought develops 4Suite, an open source platform for XML, RDF, and knowledge-management applications. Mr. Ogbuji is a Computer Engineer and writer born in Nigeria, living and working in Boulder, Colorado, USA. You can contact him at uche@ogbuji.net.

Report abuse help

Report abuse

Thank you. This entry has been flagged for moderator attention.


Report abuse help

Report abuse

Report abuse submission failed. Please try again later.


developerWorks: Sign in

If you don't have an IBM ID and password, register here.


Forgot your IBM ID?


Forgot your password?
Change your password


By clicking Submit, you agree to the developerWorks terms of use.

 


The first time you sign into developerWorks, a profile is created for you. This profile includes the first name, last name, and display name you identified when you registered with developerWorks. Select information in your developerWorks profile is displayed to the public, but you may edit the information at any time. Your first name, last name (unless you choose to hide them), and display name will accompany the content that you post.

Choose your display name

The first time you sign in to developerWorks, a profile is created for you, so you need to choose a display name. Your display name accompanies the content you post on developerWorks.

Please choose a display name between 3-31 characters. Your display name must be unique in the developerWorks community and should not be your email address for privacy reasons.

(Must be between 3 – 31 characters.)


By clicking Submit, you agree to the developerWorks terms of use.

 


Rate this article

Comments

Help: Update or add to My dW interests

What's this?

This little timesaver lets you update your My developerWorks profile with just one click! The general subject of this content (AIX and UNIX, Information Management, Lotus, Rational, Tivoli, WebSphere, Java, Linux, Open source, SOA and Web services, Web development, or XML) will be added to the interests section of your profile, if it's not there already. You only need to be logged in to My developerWorks.

And what's the point of adding your interests to your profile? That's how you find other users with the same interests as yours, and see what they're reading and contributing to the community. Your interests also help us recommend relevant developerWorks content to you.

View your My developerWorks profile

Return from help

Help: Remove from My dW interests

What's this?

Removing this interest does not alter your profile, but rather removes this piece of content from a list of all content for which you've indicated interest. In a future enhancement to My developerWorks, you'll be able to see a record of that content.

View your My developerWorks profile

Return from help

static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=XML
ArticleID=12169
ArticleTitle=Tip: Using generators
publish-date=10012002
author1-email=uche@ogbuji.net
author1-email-cc=

Tags

Help
Use the search field to find all types of content in My developerWorks with that tag.

Use the slider bar to see more or fewer tags.

For articles in technology zones (such as Java technology, Linux, Open source, XML), Popular tags shows the top tags for all technology zones. For articles in product zones (such as Info Mgmt, Rational, WebSphere), Popular tags shows the top tags for just that product zone.

For articles in technology zones (such as Java technology, Linux, Open source, XML), My tags shows your tags for all technology zones. For articles in product zones (such as Info Mgmt, Rational, WebSphere), My tags shows your tags for just that product zone.

Use the search field to find all types of content in My developerWorks with that tag. Popular tags shows the top tags for this particular content zone (for example, Java technology, Linux, WebSphere). My tags shows your tags for this particular content zone (for example, Java technology, Linux, WebSphere).