Python has never suffered from a scarcity of XML libraries.
Since version 2.0, it has included the familiar xml.dom.minidom and related pulldom and
Simple API for XML (SAX) models. Since 2.4, it has included the popular
ElementTree API. In addition, there have always been
third-party libraries that offer higher-level or
more pythonic interfaces.
While any XML library is sufficient for simple Document Object Model (DOM) or SAX parsing of small files, developers are increasingly faced with larger datasets and a need for real-time parsing of XML in a Web services context. Meanwhile, experienced XML developers may prefer XML-native languages such as XPath or XSLT for their compactness and expressivity. It would be ideal to have access to the declarative syntax of XPath while retaining the general-purpose functionality available in Python.
lxml is the first Python XML library that demonstrates high-performance characteristics and includes native support for
XPath 1.0, XSLT 1.0, custom element classes, and even a pythonic
data-binding interface. It is built on top of two C libraries: libxml2 and libxslt. They provide most of the
horsepower behind the core tasks of parsing, serializing, and
transforming.
Which parts of lxml you use in your code depends on your needs: Are you comfortable with XPath? Do you prefer to work with Python-like objects? How much memory do you have on the system to keep large trees available?
This article does not cover all of lxml but instead demonstrates techniques to efficiently process very large XML files, optimizing for high speed and low memory usage. Two freely available example documents are used: U.S. copyright renewal data converted into XML by Google and the Open Directory RDF content.
lxml is compared here only to cElementTree and not to the dozens of other Python libraries available. I chose cElementTree because it is a native part of Python 2.5 and, like lxml, built on C libraries.
What's so hard about very large data?
XML libraries are often designed for and tested on small sample files. Indeed, many real-world projects are begun without complete data available. Programmers work diligently for weeks or months using sample content and writing code such as that shown in Listing 1.
Listing 1. A simple parse operation
from lxml import etree
doc = etree.parse('content-sample.xml') |
The lxml parse method reads the entire document and
builds an in-memory tree. Relative to cElementTree, an
lxml tree is much more expensive because it
retains more information about a node's context, including
references to its parent.
Parsing a 2G document this way immediately puts a machine with 2G
RAM into swap, with disastrous performance implications. If the
whole application is written assuming this data will be available
in memory, a major refactor is in order.
When building an in-memory tree is not desired or practical, use an iterative parsing technique that does not rely on reading the entire source file. lxml offers two approaches:
- Supplying a target parser class
- Using the
iterparsemethod
Using the target parser method
The target parser method is familiar to developers who are comfortable with SAX event-driven code. A target parser is a class that implements the following methods:
startfires on element open. The data and children of the element are not yet available.endfires on element close. All of the element's child nodes, including text nodes, are now available.datafires on text children and has access to that text.closefires when parsing is complete.
Listing 2 demonstrates creating a target parser class (here called TitleTarget)
that implements the required methods. This parser collects the text children of the Title element
in an internal list (self.text) and, upon reaching the close() method,
returns that list.
Listing 2. A target parser that returns a list of all text children of the
Title tag
class TitleTarget(object):
def __init__(self):
self.text = []
def start(self, tag, attrib):
self.is_title = True if tag == 'Title' else False
def end(self, tag):
pass
def data(self, data):
if self.is_title:
self.text.append(data.encode('utf-8'))
def close(self):
return self.text
parser = etree.XMLParser(target = TitleTarget())
# This and most other samples read in the Google copyright data
infile = 'copyright.xml'
results = etree.parse(infile, parser)
# When iterated over, 'results' will contain the output from
# target parser's close() method
out = open('titles.txt', 'w')
out.write('\n'.join(results))
out.close()
|
This code was timed at 54 seconds when run against the copyright data. Target parsing can be reasonably fast and does not generate a memory-consuming parse tree, but all events fire for all elements in the data. For very large documents, this might not be desirable when only a few elements are of interest, such as in this example. Is it possible to limit processing to a selected tag and get better performance?
lxml's iterparse method is an extension of the ElementTree API.
iterparse returns a Python iterator for the selected element
context. It accepts two useful arguments: a tuple of events to
monitor and a tag name. In this case, I'm only interested in
the text content of <Title> (which is available
upon reaching the end event). The output from Listing 3
will be identical to that of the target parser method in Listing 2 but ought to
be much faster because lxml can optimize the event handling internally. It's also many fewer lines of code.
Listing 3. Simple iteration over a named tag and event
context = etree.iterparse(infile, events=('end,'), tag='Title')
for event, elem in context:
out.write('%s\n' % elem.text.encode('utf-8'))
|
If you run this code and monitor the output, you see it
begins by appending titles very quickly but soon slows to a
crawl. A quick check of top shows that the computer has
gone into swap:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 170 root 15 -5 0 0 0 D 3.9 0.0 0:01.32 kswapd0 |
What's going on? Although iterparse does not consume the entire
file at first, it does not free the references to nodes from each
iteration. When the whole document
will be accessed repeatedly, this is a feature. However, in this case I would much rather
reclaim that memory at the end of each loop. This includes both references
to children or text nodes that were already been processed
and preceding siblings of the current node, whose references
from the root node are also implicitly preserved, as in Listing 4:
Listing 4. Revised iteration which clears unneeded node references
for event, elem in context:
out.write('%s\n' % elem.text.encode('utf-8'))
# It's safe to call clear() here because no descendants will be accessed
elem.clear()
# Also eliminate now-empty references from the root node to <Title>
while elem.getprevious() is not None:
del elem.getparent()[0] |
For convenience, I refactor Listing 4 into a function
that takes a callable func to do an operation against the current node, as in Listing 5.
I'll use this method in the succeeding examples.
Listing 5. A function to loop through a context, calling
func each time, and then clean up unneeded references
def fast_iter(context, func):
for event, elem in context:
func(elem)
elem.clear()
while elem.getprevious() is not None:
del elem.getparent()[0]
del context |
This optimized iterparse approach in Listing 4
produces output that's identical to that produced by the target parser in Listing 2, but in half the
time. It is even faster than cElementTree when the task
is restricted to a particular event and tag name, as it is
here. (In most cases, though, cElementTree will outperform
lxml when parsing is the primary activity.)
Table 1 shows timings of various parser techniques as measured against the copyright data on the computer described in the benchmarks sidebar.
Table 1. Comparisons of iterative parsing methods: extract
text() from <Title>| XML library | Method | Average time, in seconds |
|---|---|---|
| cElementTree | Iterparse | 32 |
| lxml | Target parser | 54 |
| lxml | Optimized iterparse | 25 |
Running the same iterparse method in Listing 4 on the
Open Directory data takes 122 seconds per run, or slightly more
than five times longer than parsing the copyright data. As the Open
Directory data is also slightly more than five times as large (at 1.9
gigabytes), you should expect roughly linear time performance from this
method, even on very large files.
If all you need to do with an XML file is grab some text from within a single node, it might be possible to use a simple regular expression that will probably operate faster than any XML parser. In practice, though, this is nearly impossible to get right when the data is at all complex, and I do not recommend it. XML libraries are invaluable when true data manipulation is required.
Serializing XML to a string or file is where lxml excels because it
relies on libxml2 C code directly. If
your task requires any serialization at all, lxml is a clear choice,
but there are some tricks to get the best performance out of the library.
Use deepcopy when serializing subtrees
lxml retains references between child nodes and their parents. One effect of this is that a node in lxml can have one and only one parent. (cElementTree has no concept of parent nodes.)
Listing 6 takes each <Record> in the copyright
file and writes a simplified record containing only the title and the
copyright information.
Listing 6. Serialize an element's children
from lxml import etree
import deepcopy
def serialize(elem):
# Output a new tree like:
# <SimplerRecord>
# <Title>This title</Title>
# <Copyright><Date>date</Date><Id>id</Id></Copyright>
# </SimplerRecord>
# Create a new root node
r = etree.Element('SimplerRecord')
# Create a new child
t = etree.SubElement(r, 'Title')
# Set this child's text attribute to the original text contents of <Title>
t.text = elem.iterchildren(tag='Title').next().text
# Deep copy a descendant tree
for c in elem.iterchildren(tag='Copyright'):
r.append( deepcopy(c) )
return r
out = open('titles.xml', 'w')
context = etree.iterparse('copyright.xml', events=('end',), tag='Record')
# Iterate through each of the <Record> nodes using our fast iteration method
fast_iter(context,
# For each <Record>, serialize a simplified version and write it
# to the output file
lambda elem:
out.write(
etree.tostring(serialize(elem), encoding='utf-8'))) |
Don't use deepcopy to simply replicate
the text of a single node. It's faster to create a new node, populate its
text attribute manually, and then serialize it. In my tests, calling
deepcopy
for both <Title> and <Copyright> was 15 percent
slower than the code in Listing 6. You'll see the
greatest performance boosts from deepcopy
when serializing large descendant trees.
When benchmarked against cElementTree using the code in Listing 7, lxml's serializer was almost twice as fast (50 seconds versus 95 seconds):
Listing 7. Serializing with cElementTree
def serialize_cet(elem):
r = cet.Element('Record')
# Create a new element with the same text child
t = cet.SubElement(r, 'Title')
t.text = elem.find('Title').text
# ElementTree does not store parent references -- an element can
# exist in multiple trees. It's not necessary to use deepcopy here.
for c in elem.findall('Copyright'):
r.append(h)
return r
context = cet.iterparse('copyright.xml', events=('end','start'))
context = iter(context)
event, root = context.next()
for event, elem in context:
if elem.tag == 'Record' and event =='end':
result = serialize_cet(elem)
out.write(cet.tostring(result, encoding='utf-8'))
root.clear() |
For more information about this iteration pattern, see "Incremental Parsing" of the ElementTree documentation. (See Resources for a link.)
After parsing, the most common XML task is to locate specific data of interest inside the parsed tree. lxml offers several approaches, from a simplified search syntax to full XPath 1.0. As a user, you should be aware of the performance characteristics and optimization techniques for each approach.
The find and findall methods,
inherited from the ElementTree
API, locate one or more descendant nodes using a simplified XPath-like
expression language called ElementPath. Users migrating from
ElementTree to lxml can naturally continue to use the
find/ElementPath syntax.
lxml supplies two other options for discovering subnodes:
the iterchildren/iterdescendants
methods and true XPath.
In cases where the expression should match a node name,
it is far faster (in some cases twice as fast) to use the
iterchildren or iterdescendants
methods with their optional tag parameter when compared to their equivalent
ElementPath expressions.
For more complex patterns, use the XPath class to
precompile search patterns. Simple patterns that mimic
the behavior of iterchildren with tag arguments (for example,
etree.XPath("child::Title")) execute in effectively the same
time as their iterchildren equivalents. It's important to
precompile, though. Compiling the pattern in each execution of the loop
or using the xpath() method on an element (described in the lxml documentation, see Resources )
can be almost twice as slow as compiling once and then using that pattern repeatedly.
XPath evaluation in lxml is fast. If only a subset of
nodes needs to be serialized, it is much better to limit
with precise XPath expressions up front than to inspect all the nodes later.
For example, limiting the sample serialization
to include only titles containing the word night, as in Listing 8,
takes 60 percent of the time to serialize the full set.
Listing 8. Conditional serialization with XPath classes
def write_if_node(out, node):
if node is not None:
out.write(etree.tostring(node, encoding='utf-8'))
def serialize_with_xpath(elem, xp1, xp2):
'''Take our source <Record> element and apply two pre-compiled XPath classes.
Return a node only if the first expression matches.
'''
r = etree.Element('Record')
t = etree.SubElement(r, 'Title')
x = xp1(elem)
if x:
t.text = x[0].text
for c in xp2(elem):
r.append(deepcopy(c))
return r
xp1 = etree.XPath("child::Title[contains(text(), 'night')]")
xp2 = etree.XPath("child::Copyright")
out = open('out.xml', 'w')
context = etree.iterparse('copyright.xml', events=('end',), tag='Record')
fast_iter(context,
lambda elem: write_if_node(out, serialize_with_xpath(elem, xp1, xp2))) |
Finding nodes in other parts of the document
Note that, even when using iterparse, it is possible to
use XPath predicates based on looking ahead of the current
node. To find all <Record> nodes that are immediately followed by a record whose
title contains the word night, do this:
etree.XPath("Title[contains(../Record/following::Record[1]/Title/text(), 'night')]")
|
However, when using the memory-efficient iteration strategy described in Listing 4, this command returns nothing because preceding nodes are cleared as parsing proceeds through the document:
etree.XPath("Title[contains(../Record/preceding::Record[1]/Title/text(), 'night')]")
|
While it is possible to write an efficient algorithm to solve this particular problem, tasks involving analyses across nodes—especially those that might be randomly distributed in the document—are usually more suited for an XML database that uses XQuery, such as eXist.
Other ways to increase performance
In addition to the use of specific methods within lxml, you can use approaches outside of the library to influence execution speed. Some of these are simple code changes; others require new thinking about how to handle large data problems.
The Psyco module is an often-missed way to increase the speed of Python applications with minimal work. Typical gains for a pure Python program are between two and four times, but lxml does most of its work in C, so the difference is unusually small. When I ran Listing 4 with Psyco enabled, I reduced runtime by only three seconds (43.9 seconds versus 47.3 seconds). Psyco has a large memory overhead which might even negate any gains if the machine has to go to swap.
If your lxml-driven application has core pure Python code that's executed frequently (perhaps extensive string manipulation on text nodes), you might benefit if you enable Psyco for only those methods. For more information about Psyco, see Resources.
If, instead, your application relies mostly on internal, C-driven lxml features, it might be to your advantage to run it as a threaded application in a multiprocessor environment. There are restrictions on how to start the threads—especially with XSLT. Consult the FAQ section on threads in the lxml documentation for more information.
If it is possible to divide extremely large documents into individually analyzable subtrees, then it becomes feasible to split the document at the subtree level (using lxml's fast serialization) and distribute the work on those files among multiple computers. Employing on-demand virtual servers is an increasingly popular solution for executing central processing unit (CPU) bound offline tasks. A walkthrough for Python programmers on setting up and managing Amazon's virtual Elastic Compute Cloud (EC2) cluster is available. See Resources for more information.
General strategies for any high-volume XML task
The specific code samples presented here might not apply to your project, but consider a few principles—borne out by testing and the lxml documentation—when faced with XML data measured in gigabytes or more:
- Use an iterative parsing strategy to incrementally process large documents.
- If searching the entire document in random order is required, move to an indexed XML database.
- Be extremely conservative in the data that you select. If you are only interested in particular nodes, use methods that select by those names. If you require predicate syntax, try one of the XPath classes and methods available.
- Consider the task at hand and the comfort level of the developer. Object models such as lxml's objectify or Amara might be more natural for Python developers when speed is not a consideration. cElementTree is faster when only parsing is required.
- Take the time to do even simple benchmarking. When processing millions of records, small differences add up, and it is not always obvious which methods are the most efficient.
Many software products come with the pick-two caveat, meaning that you must choose only two: speed, flexibility, or readability. When used carefully, lxml can provide all three. XML developers who have struggled with DOM performance or with the event-driven model of SAX now have the chance to work with higher-level pythonic libraries. Programmers coming from a Python background who are new to XML have an easy way to explore the expressivity of XPath and XSLT. Both coding styles can co-exist happily in an lxml-based application.
lxml has more to offer than what was explored here. Be sure to look into
the lxml.objectify module, especially for smaller datasets or applications that are not primarily XML-based.
For content in HTML that might not be well formed, lxml provides two
useful packages: the lxml.html
module and the BeautifulSoup parser. It's also possible to extend lxml itself if you write Python modules that are callable from XSLT or create custom Python or C extensions. Find information about all of these
in the lxml documentation mentioned in Resources.
Learn
- Help getting lxml to work reliably on MacOS-X: Read this thread for invaluable help on installing lxml on MacOS X.
- ElementTree Overview: Find information about the ElementTree API and cElementTree.
- Amazon
EC2 Basics for Python programmers: Learn how this virtual machine hosting service from Amazon works.
- Incremental
Parsing: In this section of the ElementTree documentation, get more information about the iteration pattern used in Listing 6.
- XML technical library: See the developerWorks XML Zone for a wide range of technical articles and tips, tutorials, standards, and IBM Redbooks.
- developerWorks technical events and webcasts: Stay current with technology in these sessions.
- developerWorks podcasts: Listen to interesting interviews and discussions for software developers.
- The technology
bookstore: Browse for books on these and other technical topics.
- developerWorks
podcasts: Listen to interesting interviews and discussions for software developers.
Get products and technologies
- lxml: lxml's documentation is clear but almost dauntingly comprehensive. Find the real gems in the FAQ and benchmarking sections.
- Google U.S.
copyright renewal data: Download and experiment with this U.S. copyright renewal data converted into XML by Google (371MB, zipped, 426,907 individual records).
- Open Directory RDF content: Download RDF dumps of the Open Directory database (1.9GB, zipped, 5,354,663 individual records).
- eXist: Check out this open source database management system that uses XQuery.
- Psyco: Learn more about this Python extension module that can massively speed up the execution of Python code.
- Amara: Try this Python XML library with a rich feature set and pythonic API. Amara does not have the same performance properties as lxml or cElementTree but is very usable for most XML tasks..
- IBM
trial software for product evaluation: Build your next project with trial software available for download directly from developerWorks, including application development tools and middleware products from DB2®, Lotus®, Rational®, Tivoli®, and WebSphere®.
Discuss
- XML zone discussion forums: Participate in any of several XML-related discussions.
- developerWorks XML zone: Share your thoughts: After you read this article, post your comments and thoughts in this forum. The XML zone editors moderate the forum and welcome your input.
- developerWorks blogs: Check out these blogs and get involved in the developerWorks community.

Liza Daly is a software engineer who specializes in applications for the publishing industry. She has been the lead developer on major online products for Oxford University Press, O'Reilly Media, and other publishers. Currently she is an independent consultant and the founder of Threepress, an open source project developing ebook applications.
Comments (Undergoing maintenance)





