XPath deserves to be your friend.
If you're even an occasional XML programmer, and you don't yet use XPath, you have an opportunity to multiply the performance and maintainability of your applications. This article details a few specific examples that demonstrate what a difference query methods can make in even simple XML processing.
To ensure your XPath progress is smooth, consider a few fundamentals:
- The following examples are coded in Python.
- You've probably been told too many times before that Xsomething is good for you -- it's designed to solve a specific and real problem, committees agree on it, and so on. I know I have. XPath is different from all the other alphabet soup of the XML world, though; it really is good for you. For working programmers, XPath ranks among the most valuable XML definitions to learn, just behind XML itself and XHTML.
- You don't need XPath. You're a working XML programmer whose applications already have required functionality. The point, though, is that typical procedural programming is a poor match for common XML problems. Although your programs already work correctly, a few pinches of query coding can make them much easier to read, more maintainable, and, in many cases, faster, sometimes dramatically so. The goal is not to abandon all the programming that already serves you well, but to learn a few supplementary techniques that offer immediate payoff.
Most readers of the XML zone on developerWorks concentrate on C, C++, or Java. Python is a good vehicle for exposition, though, because it reads like streamlined pseudo code, and is widely available. Even Python beginners should be able to follow the subsequent directions.
As with many other languages, Python supports an abundance of XML-capable libraries. It has been confusing in the past to know where to start. With the 2.5 release of Python, however, Fredrik Lundh's marvelous ElementTree module is standard; that's the library these examples are going to use. The library itself is tiny, and can be adjoined to any release of Python back to 1.5.2, as Resources explains. With ElementTree in place, you have everything necessary to execute the examples here for yourself.
Most of the XML programming seen currently is based on the Document Object Model (DOM).
This views an XML instance as a tree of elements, each identified by a tag, and many
with attributes and/or subelements. XML programming is a
chore, among other reasons, because processing must
navigate to each element through the tree. Consider, for
example, the template XML that developerWorks uses in development
of content for developerWorks articles. It looks much like
XHTML. Its schema is simple and shallow, certainly compared to
XML instances common in industrial applications, which might
capture thousands of details about machinery or financial
transactions. Even with this simple model, though, if you wanted to report on, say, all anchor tags in the text -- all
elements tagged <a> -- you need to navigate through
essentially the entire document, to an arbitrary nesting
depth. No simple non-recursive procedural expression accesses all elements, although many libraries
include helper functions for walking the tree.
The result: a simple program to report on all anchors looks something like this:
Listing 1. DOM-based code to report on all anchors
import elementtree.ElementTree
def detail_anchor(element):
if element.tag == "a":
attributes = element.attrib
if "href" in attributes.keys():
print "'%s' is at URL '%s'." % (element.text,
attributes['href'])
if "name" in attributes.keys():
print "'%s' anchors '%s'." % (element.text,
attributes['name'])
def report(element):
detail_anchor(element)
for x in element.getchildren():
report(x)
report(elementtree.ElementTree.parse("draft2.xml").getroot())
|
With the 5.5.1 reference template described below, this produces
output that looks like this:
Listing 2. Report from program in Listing 1
'related developerWorks content' is at 'http://www.ibm.com/developerworks'.
'entire series' is at 'http://www.ibm.com/...'
...
'IBM product evaluation versions' is at 'http://www.ibm.com/...'
|
How much simpler would it be if there were a standard way to query a DOM instance, and extract information on exactly the elements, attributes, and tags you specify? See for yourself:
Listing 3. XPath-based equivalent to Listing 1.
import elementtree.ElementTree
def detail_anchor(element):
if element.tag == "a":
attributes = element.attrib
if "href" in attributes.keys():
print "'%s' is at URL '%s'." % (element.text,
attributes['href'])
if "name" in attributes.keys():
print "'%s' anchors '%s'." % (element.text,
attributes['name'])
for element in \
elementtree.ElementTree.parse("draft2.xml").findall("//a"):
detail_anchor(element)
|
Notice that the "//a" can be pronounced in English as, "search for all elements in the entire document tagged 'a'." The output from Listing 3 is essentially the same as Listing 2.
Compare Listings 1 and 3. The former certainly is simple enough -- but the latter is even simpler. It eliminates the need to specify recursion explicitly. More importantly, XPath codes in terms of XML (logical) markup, rather than structure. Markup is closer to human thinking, and lasts longer; in this perspective, structure is a relatively short-lived implementation detail. With more complicated queries, XPath's declarative style becomes an even clearer winner over typical structure-oriented procedural search implementations.
The point of this article is that you can incorporate XPath into your existing XML program development and reap immediate rewards of performance and maintainability. To me, it's much like the use of assemblers, compilers, and higher-order languages: I could and have written entire programs in machine language, but it's simple prudence to learn higher-productivity methods. Moreover, XPath often improves performance over hand-coded searches.
This suggests a question, though: Are there ways to improve on XPath? Yes, of course -- but with qualifications. XQuery and XSLT are two more XML definitions comparable to XPath in their acceptance (XSLT is maybe the most widely used of the three, and XQuery has the fewest useful implementations). XSLT is like XPath in its declarative emphasis, while XQuery adds procedural capabilities to XPath's querying. Both make templating -- roughly speaking, queries that embed recognizable XML fragments -- more or less idiomatic. This is noteworthy because several XML theoreticians recommend templating as a desirable expressive form.
On the whole, however, the profits from XQuery, XSLT, or such language-specific query packages as Amara, XQJ, or JAXP, are incremental, in comparison with the big return that I claim for XPath. The paper by Uche Ogbuji cited in Resources nicely compares sample problems solved several different ways. Turn to these more specialized interfaces only when your XML processing feels excessively complicated. But start XPath today!
This article aims to motivate XPath, rather than to teach it. I want you to have a clear idea of how little it takes to start to learn XPath.
At the same time, you'll want to know how far you can go with XPath.
The query of Listing 3 has //a as a simple example.
Eventually, you'll learn about more complicated queries, such
as /parent/child[@attr='value'],
which specifies all nodes "child" which are children of a "parent"
and have an attribute "attr" valued "value". Two more examples:
//@*, which examines all attributes
of all tags; and
//tr/td|th, which retrieves
all td or
th elements within a
tr.
While performance concern needs always to reduce to a specific measurement, useful thinking about XPath performance requires a bit of context. In crude terms, XPath is faster than you are. In large, complicated programs where performance might matter, XPath queries often are quicker -- sometimes much quicker -- than the equivalent hand-coded procedural code. In principle, you might be able to use knowledge you have about data content or layout to code a clever look-up that's very fast. In practice, what I see happen with XPath is analogous to what happens when someone uses a higher-level language (HLL) rather than assembler or C: the developer is able to complete a correctly working program with much less effort, and any superficial intrinsic speed disadvantage of the HLL is quickly made up by superior application-specific algorithms.
In just the same way, XPath simplifies coding for the big problems where speed matters enough to allow the programmer a little leisure to engineer his solution carefully. XPath-based solutions end up being not only easier to correct and maintain, but at least as fast.
Moreover, some XPath implementations are written carefully, to make more efficient use of memory than a naive procedural search. In these cases, XPath-based codings can speed up searches by a large factor -- ten or more.
It's frustrating not to exhibit specific measurements to illustrate these points. Any specific comparison is misleading, though; XPath comparative performance depends on the development language, the specific XPath implementation, the XML image queried, the search, and, often, the layout of run-time memory available. Learn how to do simple timings for yourself, and test XPath on datasets typical for your applications. If your experience is at all like mine, you'll sometimes find XPath only half as fast as native coding, especially for small problems, but up to an order of magnitude faster for others.
XPath is an XML facility that you can start to use at low cost: You might well already have it built in to the XML library with the development language you use. At the same time, XPath has the potential to improve performance significantly, and its simple expressiveness definitely simplifies programming and maintenance. Moreover, there's a wealth of XPath tutorials to help you learn what you'll need. The result: you'll find an investment in XPath one that quickly pays off.
Once you know XPath, you'll also be in a much better position to judge whether your work with XML is so demanding that more advanced tools like XQuery and XSLT will benefit you.
Be sure to read the information referenced in Resources to help you start your own XPath programming.
Learn
- Charming Python column for developerWorks: Why use Python to explain XPath? See David Mertz' column for details on the merits of this language. Most immediately, it's widely available, and even programmers who don't know it generally find it easy to read.
- ActivePython: Explore ActivePython, a quality-assured installable distribution of Python, according to its home page. The author particularly recommends ActivePython for Windows users interested in Python, because of the reliability of its installation.
- ElementTree home page: See how ElementTree wrapper, written and maintained by Fredrik Lundh, adds code to load XML files as trees of Element objects, and save them back again. Along with the simple installation the sidebar above discusses, ElementTree comes in a variety of implementations to exploit other libraries and improve performance.
- Of the many tutorials written for XPath, start with:
- Get started with XPath (Bertrand Portier, developerWorks, May 2004): Learn what XPath is, the syntax and semantics of the XPath language, how to use XPath location paths, how to use XPath expressions, how to use XPath functions, and how XPath relates to XSLT. This tutorial covers XPath Version 1.0.
- XPath tutorial (Miloslav Nic and Jiri Jirat, zvon): Learn about selected XPath features as demonstrated in many examples.
- XML Path Language (XPath)
specification: Read the W3C-maintained spec about common syntax and semantics for functionality shared between XSL Transformations [XSLT] and XPointer.
- XPath as data binding tool, Part 2 (Brett McLaughlin, developerWorks, January 2006): Use the JAXP API to perform XPath queries and code with XPath in Java.
- "XGrep is a grep like utility for XML documents." XGrep is a marvelous tool developers should have at hand at all times. This article uses Python for its example to illustrate XML programming. If Cameron's emphasis were on XPath itself, though, rather than how to program with it, he'd teach in terms of XGrep.
- Read these two introductory developerWorks articles on:
- An introduction to XQuery (Howard Katz, updated January 2006 ): Look at the W3C's proposed standard for an XML query language.
- What kind of language is XSLT? (Michael Kay, April 2005): Put XSLT in context -- where it comes from, what it's good at, and why to use it.
- Look at two authors with much different perspectives, but both think and write with care; they complement each other well.
- A conversation with Jonathan Robie about XQuery: Jonathan Robie makes sense out of XML query languages in this very interesting interview.
- Is XQuery an omni-tool?: Uche Ogbuji focuses especially on XQuery.
- developerWorks XML zone: Learn all about XML at the developerWorks XML zone.
- IBM XML certification: Find out how you can become an IBM-Certified Developer in XML and related technologies.
- XML technical library: See the developerWorks XML Zone for a wide range of technical articles and tips, tutorials, standards, and IBM Redbooks.
- developerWorks technical events and webcasts: Stay current with technology in these sessions.
Get products and technologies
- IBM trial software: Build your next development project with trial software available for download directly from developerWorks.
Discuss
- XML zone discussion forums: Participate in any of several XML-centered forums.
