Level: Introductory
Cameron Laird (claird@phaseit.net), Vice President, Phaseit, Inc.
03 Apr 2007
XPath can dramatically simplify and speed applications with even modest XML involvement. If XPath isn't already in your toolkit, now's the time to add it. Concrete examples coded in brief Python make the appeal of query idioms apparent.
XPath deserves to be your friend.
If you're even an occasional XML programmer, and you don't
yet use XPath, you have an opportunity to multiply the performance and maintainability of your applications.
This article details a few specific examples that demonstrate what
a difference query methods can make in even simple XML processing.
Laying a foundation
To ensure your XPath progress is smooth, consider a few fundamentals:
- The following examples are coded in Python.
- You've probably been told too many times before
that Xsomething is good for you -- it's designed to solve a specific and real problem, committees
agree on it, and so on. I know I have. XPath is different from all the other alphabet soup
of the XML world, though; it really is good for you.
For working programmers, XPath ranks among the most valuable XML definitions to
learn, just behind XML itself and XHTML.
- You don't need XPath. You're a working
XML programmer whose applications already have required functionality.
The point, though, is that typical procedural programming is a poor match for common XML problems.
Although your programs already work correctly, a few pinches of query coding can make them
much easier to read, more maintainable, and, in many cases, faster, sometimes dramatically so.
The goal is not to abandon all the programming that already serves
you well, but to learn a few supplementary techniques that offer immediate payoff.
Most readers of the XML zone on developerWorks concentrate on C, C++, or Java.
Python is a good vehicle for exposition, though, because it reads
like streamlined pseudo code, and is widely available.
Even Python beginners should be able to follow the subsequent directions.
As with many other languages, Python supports an abundance of XML-capable libraries.
It has been confusing in the past to know where to start.
With the 2.5 release of Python, however, Fredrik Lundh's marvelous ElementTree module is standard;
that's the library these examples are going to use.
The library itself is tiny, and can be adjoined to any release of Python
back to 1.5.2, as Resources
explains. With ElementTree in place, you have everything
necessary to execute the examples here for yourself.
XML programming models
Most of the XML programming seen currently is based on the Document Object Model (DOM).
This views an XML instance as a tree of elements, each identified by a tag, and many
with attributes and/or subelements. XML programming is a
chore, among other reasons, because processing must
navigate to each element through the tree. Consider, for
example, the template XML that developerWorks uses in development
of content for developerWorks articles. It looks much like
XHTML. Its schema is simple and shallow, certainly compared to
XML instances common in industrial applications, which might
capture thousands of details about machinery or financial
transactions. Even with this simple model, though, if you want to report on, say, all anchor tags in the text -- all
elements tagged <a> -- you need to navigate through
essentially the entire document, to an arbitrary nesting
depth. No simple non-recursive procedural expression accesses all elements, although many libraries
include helper functions for walking the tree.
The result: a simple program to report on all anchors looks something like this:
Listing 1. DOM-based code to report on all anchors
import elementtree.ElementTree

def detail_anchor(element):
    # Report the href and/or name attribute of an <a> element.
    if element.tag == "a":
        attributes = element.attrib
        if "href" in attributes.keys():
            print "'%s' is at URL '%s'." % (element.text,
                                            attributes['href'])
        if "name" in attributes.keys():
            print "'%s' anchors '%s'." % (element.text,
                                          attributes['name'])

def report(element):
    # Visit this element, then recurse into each of its children.
    detail_anchor(element)
    for x in element.getchildren():
        report(x)

report(elementtree.ElementTree.parse("draft2.xml").getroot())
With the 5.5.1 reference template described below, this produces
output that looks like this:
Listing 2. Report from program in Listing 1
'related developerWorks content' is at URL 'http://www.ibm.com/developerworks'.
'entire series' is at URL 'http://www.ibm.com/...'
...
'IBM product evaluation versions' is at URL 'http://www.ibm.com/...'
How much simpler would it be if there were a standard way to
query a DOM instance, and extract information on exactly the
elements, attributes, and tags you specify? See for yourself:
Listing 3. XPath-based equivalent to Listing 1.
import elementtree.ElementTree

def detail_anchor(element):
    # Report the href and/or name attribute of an <a> element.
    if element.tag == "a":
        attributes = element.attrib
        if "href" in attributes.keys():
            print "'%s' is at URL '%s'." % (element.text,
                                            attributes['href'])
        if "name" in attributes.keys():
            print "'%s' anchors '%s'." % (element.text,
                                          attributes['name'])

# The "//a" query replaces the explicit recursion of Listing 1.
for element in \
        elementtree.ElementTree.parse("draft2.xml").findall("//a"):
    detail_anchor(element)
Notice that the "//a" can be pronounced in English as, "search for
all elements in the entire document tagged 'a'."
The output from Listing 3 is essentially the same as Listing 2.
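If you don't have draft2.xml at hand, you can try the same comparison on an inline document. What follows is a minimal sketch rather than the article's own code: it assumes Python 3 and the standard-library xml.etree.ElementTree (the same ElementTree API under its bundled name), and the sample markup and the helper names describe, walk_anchors, and query_anchors are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; any XML with nested <a> elements will do.
SAMPLE = """<article>
  <p>See <a href="http://www.ibm.com/developerworks">developerWorks</a>.</p>
  <section>
    <p><a name="top">Start here</a></p>
    <ul><li><a href="http://www.w3.org/TR/xpath">XPath spec</a></li></ul>
  </section>
</article>"""


def describe(element):
    """Return report lines for one <a> element, in the style of Listing 1."""
    lines = []
    if "href" in element.attrib:
        lines.append("'%s' is at URL '%s'." % (element.text, element.attrib["href"]))
    if "name" in element.attrib:
        lines.append("'%s' anchors '%s'." % (element.text, element.attrib["name"]))
    return lines


def walk_anchors(element):
    """Procedural version: recurse through every element, testing each tag."""
    lines = describe(element) if element.tag == "a" else []
    for child in element:
        lines.extend(walk_anchors(child))
    return lines


def query_anchors(root):
    """Query version: let the path expression find the <a> elements."""
    lines = []
    for anchor in root.findall(".//a"):
        lines.extend(describe(anchor))
    return lines


root = ET.fromstring(SAMPLE)
for line in query_anchors(root):
    print(line)
```

Both functions return the same report lines. Note that this sketch queries an Element rather than a whole parsed tree, so it uses the relative form ".//a"; current ElementTree releases reject an absolute path such as "//a" when it is applied to an element.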
ElementTree installation
You'll want your own Python installation with ElementTree.
While you're welcome just to read the code listings, you'll
enjoy them much more if you can execute them for yourself, and,
even better, change them to experiment according to your own ideas.
It's not hard to do so. In contrast to the clumsiness of many
XML toolkits I encounter, ElementTree should take you just a few
minutes to install and begin to use.
First, you need Python itself. Most modern UNIX distributions,
including MacOS and nearly all variations of Linux, include Python. For Windows, see ActivePython in Resources.
With Python in place:
- Retrieve the elementtree source bundle from the ElementTree home, as described in Resources.
- Unpack it.
- Navigate to the appropriate directory (probably something like elementtree-1.2.6-20050316/ -- the point is that you need to be where setup.py is).
- From a command line, invoke python setup.py install.
You're done. That's all it takes. To verify your installation, launch the interactive Python shell and request import elementtree.ElementTree.
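The import check can also be scripted. This sketch tries the add-on elementtree package first and, if that is absent, falls back to the copy bundled with Python 2.5 and later under the name xml.etree.ElementTree. The one-line test document is made up for the example.

```python
# Try the separately installed package first, then the bundled module.
try:
    import elementtree.ElementTree as ET
    source = "add-on elementtree package"
except ImportError:
    import xml.etree.ElementTree as ET
    source = "standard library (xml.etree)"

# Parse a tiny made-up document to confirm the library actually works.
root = ET.fromstring("<doc><a href='http://www.ibm.com/'>IBM</a></doc>")
anchor = root.find("a")
print("ElementTree loaded from the %s" % source)
print("Parsed <%s> with href %s" % (anchor.tag, anchor.get("href")))
```

If neither import succeeds, the script stops with an ImportError, which tells you the installation steps above still need attention.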
Compare Listings 1 and 3. The former certainly is simple enough -- but
the latter is even simpler. It eliminates the need to specify
recursion explicitly. More importantly, XPath codes in terms of
XML (logical) markup, rather than structure. Markup is closer
to human thinking, and lasts longer; in this perspective,
structure is a relatively short-lived implementation detail.
With more complicated queries, XPath's
declarative style becomes an even clearer winner over typical
structure-oriented procedural search implementations.
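As one illustration of a slightly more complicated query, the sketch below looks for anchors that carry an href and sit anywhere inside a list item. It is a hedged example, not part of the original listings: it assumes Python 3 with the standard xml.etree.ElementTree, which supports only a subset of XPath, and the sample document and the function names procedural and declarative are hypothetical.

```python
import xml.etree.ElementTree as ET

# Invented sample: one anchor outside the list, three inside it.
SAMPLE = """<doc>
  <p><a href="http://example.com/top">top</a></p>
  <ul>
    <li><a href="http://example.com/one">one</a></li>
    <li><b><a name="two">two</a></b></li>
    <li><i><a href="http://example.com/three">three</a></i></li>
  </ul>
</doc>"""


def procedural(root):
    """Structure-oriented search: nested loops plus an explicit attribute test."""
    found = []
    for item in root.iter("li"):
        for anchor in item.iter("a"):
            if "href" in anchor.attrib:
                found.append(anchor.text)
    return found


def declarative(root):
    """Markup-oriented search: one path expression states what is wanted."""
    return [anchor.text for anchor in root.findall(".//li//a[@href]")]


root = ET.fromstring(SAMPLE)
print(procedural(root))   # ['one', 'three']
print(declarative(root))  # ['one', 'three']
```

The procedural version has to spell out how to reach the anchors; the path expression only says which anchors are wanted, so it stays readable as the conditions grow.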
The point of this article is that you can incorporate XPath into
your existing XML program development and reap immediate rewards
of performance and maintainability. To me, it's much like the
use of assemblers, compilers, and higher-order languages: I
could write, and have written, entire programs in machine language, but
it's simple prudence to learn higher-productivity methods.
Moreover, XPath often improves performance over hand-coded searches.
This suggests a question, though: Are there ways to improve on XPath?
Yes, of course -- but with qualifications. XQuery and XSLT
are two more XML definitions comparable to XPath in their
acceptance (XSLT is maybe the most widely used of the three, and
XQuery has the fewest useful implementations). XSLT is like
XPath in its declarative emphasis, while XQuery adds procedural
capabilities to XPath's querying. Both make templating -- roughly
speaking, queries that embed recognizable XML fragments -- more or less idiomatic.
This is noteworthy because several XML
theoreticians recommend templating as a desirable expressive form.
On the whole, however, the profits from XQuery, XSLT, or such
language-specific query packages as Amara, XQJ, or JAXP, are
incremental, in comparison with the big return that I claim for XPath.
The paper by Uche Ogbuji cited in Resources nicely compares
sample problems solved several different ways.
Turn to these more specialized interfaces only when
your XML processing feels excessively complicated.
But start XPath today!
Other examples
This article aims to motivate XPath, rather than to teach it.
I want you to have a clear idea of how little it takes to start to learn XPath.
At the same time, you'll want to know how far you can go with XPath.
The query of Listing 3 uses //a as a simple example.
Eventually, you'll learn about more complicated queries, such
as /parent/child[@attr='value'],
which selects all "child" elements that are children of the root element "parent"
and have an attribute "attr" whose value is "value". Two more examples:
//@*, which selects all attributes
of all elements; and
//tr/td | //tr/th, which retrieves
all td or th elements within a tr.
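How much of this a given library accepts varies. The sketch below assumes Python 3 with the standard xml.etree.ElementTree, which handles the attribute predicate directly but, being a subset of XPath, returns only elements and has no union operator; the attribute query and the td-or-th query are therefore emulated in plain Python here. A fuller XPath engine would accept the expressions as written. The sample document is invented.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<parent>
  <child attr="value">first</child>
  <child attr="other">second</child>
  <table><tr><th>Name</th><td>XPath</td></tr></table>
</parent>"""

root = ET.fromstring(SAMPLE)  # root is the <parent> element

# /parent/child[@attr='value'], written relative to the root element.
matching = root.findall("./child[@attr='value']")
print([child.text for child in matching])  # ['first']

# //@* emulated: collect every attribute of every element.
attributes = [(element.tag, name, value)
              for element in root.iter()
              for name, value in element.attrib.items()]
print(attributes)  # [('child', 'attr', 'value'), ('child', 'attr', 'other')]

# td or th cells that are children of a tr, emulated with a tag test.
cells = [cell.text
         for row in root.iter("tr")
         for cell in row
         if cell.tag in ("td", "th")]
print(cells)  # ['Name', 'XPath']
```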
Performance
While any performance concern ultimately has to
reduce to a specific measurement, useful thinking about XPath
performance requires a bit of context.
In crude terms, XPath is faster than you are. In large, complicated
programs where performance might matter, XPath queries often are
quicker -- sometimes much quicker -- than the equivalent hand-coded
procedural code. In principle, you might be able to use knowledge
you have about data content or layout to code a clever look-up
that's very fast. In practice, what I see happen with XPath is
analogous to what happens when someone uses a higher-level
language (HLL) rather than assembler or C: the developer is able to
complete a correctly working program with much less effort, and
any superficial intrinsic speed disadvantage of the HLL is quickly
made up by superior application-specific algorithms.
In just the same way, XPath simplifies coding for the big problems
where speed matters enough to allow the programmer a little leisure
to engineer a solution carefully. XPath-based solutions end up
being not only easier to correct and maintain, but at least as fast.
Moreover, some XPath implementations are written carefully,
to make more efficient use of memory than a naive procedural search.
In these cases, XPath-based codings can speed up
searches by a large factor -- ten or more.
It's frustrating not to exhibit specific measurements to
illustrate these points. Any specific comparison is
misleading, though; XPath comparative performance depends
on the development language, the specific XPath implementation,
the XML image queried, the search, and, often, the layout of
run-time memory available. Learn how to do simple timings
for yourself, and test XPath on datasets typical for
your applications. If your experience is at all like
mine, you'll sometimes find XPath only half as fast as
native coding, especially for small problems, but up to
an order of magnitude faster for others.
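A simple timing of this kind might be sketched as follows. This is only a starting point under stated assumptions: Python 3, the standard timeit and xml.etree.ElementTree modules, and a synthetic document whose shape and size are arbitrary. The helper names build_tree, walk, and query are hypothetical, and the numbers you see say nothing about other libraries, languages, or datasets.

```python
import timeit
import xml.etree.ElementTree as ET


def build_tree(sections=200, anchors_per_section=5):
    """Build a synthetic document; the sizes are arbitrary assumptions."""
    root = ET.Element("article")
    for s in range(sections):
        section = ET.SubElement(root, "section")
        for a in range(anchors_per_section):
            paragraph = ET.SubElement(section, "p")
            anchor = ET.SubElement(paragraph, "a",
                                   href="http://example.com/%d/%d" % (s, a))
            anchor.text = "link"
    return root


def walk(element, found=None):
    """Hand-coded recursive search for <a> elements."""
    if found is None:
        found = []
    if element.tag == "a":
        found.append(element)
    for child in element:
        walk(child, found)
    return found


def query(root):
    """Path-based search for the same elements."""
    return root.findall(".//a")


root = build_tree()
# Both searches must agree before their timings are worth comparing.
assert len(walk(root)) == len(query(root))

walk_time = timeit.timeit(lambda: walk(root), number=20)
query_time = timeit.timeit(lambda: query(root), number=20)
print("recursive walk: %.4f s" % walk_time)
print("path query:     %.4f s" % query_time)
```

To make the comparison meaningful, replace build_tree with a parse of a document typical of your own application, and repeat the run several times before drawing conclusions.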
Summary
XPath is an XML facility that you can start to use at low cost: You
might well already have it built in to the XML library with
the development language you use. At the same time, XPath has
the potential to improve performance significantly, and its
simple expressiveness definitely simplifies programming and
maintenance. Moreover, there's a wealth of XPath tutorials
to help you learn what you'll need. The result: you'll find
that an investment in XPath quickly pays off.
Once you know XPath, you'll also be in a much better position to
judge whether your work with XML is so demanding that more
advanced tools like XQuery and XSLT will benefit you.
Be sure to read the information referenced in Resources to help you start your
own XPath programming.
Resources
Learn
- Charming Python column for developerWorks: Why use Python to explain XPath? See David Mertz' column for details on the merits of this language. Most immediately, it's widely available, and even programmers who don't know it generally find it easy to read.
- ActivePython: Explore ActivePython, a quality-assured installable distribution of Python, according to its home page. The author particularly recommends ActivePython for Windows users interested in Python, because of the reliability of its installation.
- ElementTree home page: See how the ElementTree wrapper, written and maintained by Fredrik Lundh, adds code to load XML files as trees of Element objects, and save them back again. Along with the simple installation the sidebar above discusses, ElementTree comes in a variety of implementations to exploit other libraries and improve performance.
- Of the many tutorials written for XPath, start with:
- Get started with XPath (Bertrand Portier, developerWorks, May 2004): Learn what XPath is, the syntax and semantics of the XPath language, how to use XPath location paths, how to use XPath expressions, how to use XPath functions, and how XPath relates to XSLT. This tutorial covers XPath Version 1.0.
- XPath tutorial (Miloslav Nic and Jiri Jirat, zvon): Learn about selected XPath features as demonstrated in many examples.
- XML Path Language (XPath) specification: Read the W3C-maintained spec about common syntax and semantics for functionality shared between XSL Transformations [XSLT] and XPointer.
- XPath as data binding tool, Part 2 (Brett McLaughlin, developerWorks, January 2006): Use the JAXP API to perform XPath queries and code with XPath in Java.
- "XGrep is a grep like utility for XML documents." XGrep is a marvelous tool developers should have at hand at all times. This article uses Python for its example to illustrate XML programming. If Cameron's emphasis were on XPath itself, though, rather than how to program with it, he'd teach in terms of XGrep.
- Read these two introductory developerWorks articles on:
- Look at two authors with much different perspectives, but both think and write with care; they complement each other well.
- developerWorks XML zone: Learn all about XML at the developerWorks XML zone.
- IBM XML certification: Find out how you can become an IBM-Certified Developer in XML and related technologies.
- XML technical library: See the developerWorks XML Zone for a wide range of technical articles and tips, tutorials, standards, and IBM Redbooks.
- developerWorks technical events and webcasts: Stay current with technology in these sessions.
Get products and technologies
- IBM trial software: Build your next development project with trial software available for download directly from developerWorks.
About the author
Cameron Laird is a long-time developerWorks contributor and former columnist. He often writes about the projects that accelerate development of his employer's applications, focused on reliability, security, and integration of systems not originally designed for each other.