Skip to main content

If you don't have an IBM ID and password, register here.

By clicking Submit, you agree to the developerWorks terms of use.

The first time you sign into developerWorks, a profile is created for you. This profile includes the first name, last name, and display name you identified when you registered with developerWorks. Select information in your developerWorks profile is displayed to the public, but you may edit the information at any time. Your first name, last name (unless you choose to hide them), and display name will accompany the content that you post.

All information submitted is secure.

The first time you sign in to developerWorks, a profile is created for you, so you need to choose a display name. Your display name accompanies the content you post on developerworks.

Please choose a display name between 3-31 characters. Your display name must be unique in the developerWorks community and should not be your email address for privacy reasons.

By clicking Submit, you agree to the developerWorks terms of use.

All information submitted is secure.

XML Matters: The RXP parser

An extremely fast validating parser with a Python binding

David Mertz (mertz@gnosis.cx), Comparator, Gnosis Software, Inc.
Photo of David Mertz
For David Mertz, an atomic object is a combination of facts. David may be reached at mertz@gnosis.cx; his life pored over at http://gnosis.cx/dW/. Suggestions and recommendations on this, past, or future columns are welcomed. Check out David's new book, Text Processing in Python .

Summary:  RXP is a validating parser written in C that creates a non-DOM tree representation of XML documents. While RXP itself is not well documented -- and not for the faint of heart -- at least two excellent higher level APIs have been built on top of RXP: pyRXP, a Python binding; and LT XML, a collection of utilities and libraries. In this article, David introduces you to RXP, compares it with the expat parser, and briefly discusses pyRXP and LT XML as ways of taking advantage of the speed RXP has to offer without all of its complexity.

View more content in this series

Date:  04 Aug 2003
Level:  Intermediate

Comments:  

Readers of this column will have picked up on the fact that while I write here about XML generally, I have a particular fondness for Python tools. I had planned to break with this pattern for this installment, and focus on using RXP with C applications. However, once I took a closer look at the RXP library, I found that the easiest way to utilize it is through the Python module pyRXP.

While the underlying RXP GPL library is almost certainly the fastest validating XML parser you can find, the actual parser code is quite under-documented, and comes with just one simple example of a command-line tool. This tool, rxp, is similar to the utility xmlcat.py (which I presented in my tip Command-line XML processing) as well as a variety of similar utilities -- it reads XML documents, validates them, and outputs a canonical form. You can look through the source code for the file rxp.c to see the way that RXP parsing generates a compact document tree as a data structure.

On top of RXP itself, the Language Technology Group has built LT XML, which contains a variety of higher-level tools and APIs. A number of additional tools are built using LT XML, including XED (an XML editor). In this article, I will take a brief look at the tools in LT XML, but my main focus will be examining the RXP tree API as exposed through the pyRXP binding. As far as I can determine, other high-level languages that might naturally have RXP bindings -- such as Perl, TCL, and Ruby -- have not yet grown them.

Talking about speed

RXP is fast. A C application that uses the (optionally) validating RXP parser is probably not much different in speed than one that uses the non-validating expat parser (which is itself known to be very fast). RXP works by building a compact in-memory tree structure of the XML document being parsed. Failures in parsing are failures in tree building, and a successful parse gives you a data structure that is much more efficient than a DOM representation of XML.

Where you need to build a complete data structure out of an XML document, RXP probably edges out expat slightly; and if you need validation, expat is simply not an option. However, for purely sequential processing, or for extracting a small subset of the information in an XML document, expat can be superior, since it doesn't need to save any representation of already processed (or already skipped) tags. In fact, for sufficiently large documents, expat gains an overpowering advantage -- you rarely want to create an in-memory representation of a 1 GB XML document (with RXP you have no choice about this). An application built around expat is happy to pull off a few tags of interest as it reads through that much XML, likely utilizing orders of magnitude less memory than the document size.

The speed of RXP really stands out in the context of the pyRXP binding. The last installment of this column did some fairly detailed speed and memory-usage comparisons of several XML document models in Python: ElementTree, gnosis.xml.objectify, xml.minidom, and cDomlette. The tests performed simply created a minimal in-memory representation using each API, and measured the time and memory usage required for the construction. It is easy to do the same thing with pyRXP:


Listing 1. time_rxp.py
from pyRXP import Parser
import sys, time
start = time.clock()
tups = Parser().parse(sys.stdin.read())
print
				"Time: %.3f" % (time.clock()-start)

Parsing the 3 MB weblog.xml file takes only 4 seconds using pyRXP, where the best performance in prior testing was cDomlette which took an estimated 25 seconds on my test machine. In memory usage, time_rxp.py peaks around 28 MB, just about the same as the most parsimonious prior contender, gnosis.xml.objectify. In other words, pyRXP ties the best memory usage and is over six times as fast as the prior best!

pyRXP is so much faster than other Python XML document model APIs for a very specific reason: RXP builds a complete data structure in C, and all pyRXP needs to do is turn this completed structure into a very similar Python data structure. In contrast, modules like gnosis.xml.objectify and ElementTree -- while utilizing the underlying expat parser for the actual parsing -- still need to make callbacks into Python functions for each tag or piece of content encountered. Function call overhead in Python is significant, especially compared to the cheapness of C calls. In principle, someone could write an expat-based, C-coded Python extension that builds an entire data structure before handing it back to the Python interpreter (the speed would be similar to pyRXP). But creating such an extension would require more programming effort than is needed for the pyRXP wrapper, because even in C expat works by programming callbacks for each tag and content. In contrast, RXP builds the data structure right in the parser.

pyRXP's tuple tree data structure

pyRXP (and RXP itself) uses an efficient, lightweight tree representation of XML documents. Each node in a pyRXP tree is simply a tuple of the form:

(tagname, attr_dict, child_list, reserved)

No specialized Python classes are used in the representation -- just tuples, dicts, lists, and strings (and None in the reserved position). Perhaps surprisingly, this form is adequate to represent all the information in an XML document. The tagname is a straightforward string; the attribute dictionary is a dictionary that maps attributes to values, as you would expect. The child list is more subtle: Strings can be interleaved with tuples in the list, indicating a mixed content element. Moreover, an element that has no content is represented by an empty child list, but a self-closed tag is represented by None. Listing 2 shows this structure in action:


Listing 2. The pyRXP tuple tree data structure
>>> import pprint
>>> xml = '''<foo this="that" spam="eggs">
... <bar>1</bar><bar>2</bar>
... <baz></baz><baz/></foo>'''
>>> tree = Parser().parse(xml)
>>> pprint.pprint(tree)
('foo',
 {'this': 'that', 'spam': 'eggs'},
 ['\n',
  ('bar', None, ['1'], None),
  ('bar', None, ['2'], None),
  '\n',
  ('baz', None, [], None),
  ('baz', None, None, None)],
 None)

All the XML information is in there, but navigating through it can be inconvenient.


Contrasting data access styles

Recall that in the last installment, I contrasted several implementations of a simple application for filtering a test weblog.xml document, and displaying some information from it. A single <entry> element in this file might look something like:


Listing 3. A weblog.xml entry record
<entry>
  <host>64.172.22.154</host>
  <referer>-</referer>
  <userAgent>-</userAgent>
  <dateTime>19/Aug/2001:01:46:01</dateTime>
  <reqID>-0500</reqID>
  <reqType>GET</reqType>
  <resource>/</resource>
  <protocol>HTTP/1.1</protocol>
  <statusCode>200</statusCode>
  <byteCount>2131</byteCount>
</entry>

The file weblog.xml contains thousands of such entries. A filter that utilizes gnosis.xml.objectify looks like this:


Listing 4. select_hits_xo.py
from gnosis.xml.objectify import XML_Objectify, EXPAT
weblog = XML_Objectify('weblog.xml',EXPAT).make_instance()
interesting = [entry for entry in weblog.entry
              if entry.host.PCDATA=='209.202.148.31' 
and entry.statusCode.PCDATA=='200'] for e in interesting: print "%s (%s)" % (e.resource.PCDATA, e.byteCount.PCDATA)

How might you write the same application for a pyRXP tuple tree? Unfortunately, since you have to look through nested lists and numeric tuple positions, access is much less straightforward:


Listing 5. select_hits_rxp1.py
from pyRXP import Parser
TAGNAME,ATTRS,CHILDREN = range(3)
weblog = Parser().parse(open('weblog.xml').read())
interesting = []
for child in weblog[CHILDREN]:
    if child[TAGNAME]!='entry': continue
    gotHost, gotStatus = 0, 0 
for fld in child[CHILDREN]: tag = fld[TAGNAME] if tag=='host' and fld[CHILDREN]==['209.202.148.31']: gotHost = 1
elif tag=='statusCode' and fld[CHILDREN]==['200']: gotStatus = 1
if gotHost and gotStatus: interesting.append(child[CHILDREN]) for e in interesting: resource, byteCount = '', ''
for fld in e: if fld[TAGNAME]=='resource': resource = fld[CHILDREN][0] elif fld[TAGNAME]=='byteCount': byteCount = fld[CHILDREN][0] print "%s (%s)" % (resource, byteCount)

Even with some named constants to stand for tuple positions, this version is certainly harder to read (but I think it is about the best you can do directly with tuple trees). The output is identical, albeit the pyRXP version gets this output in 5 seconds rather than 25 seconds.

The pyRXP module is distributed with a few miscellaneous files, one of which is an interesting module called xmlutils. In a clever strategy, the class xmlutils.TagWrapper acts as a proxy wrapper for pyRXP tuple trees. The overall effect is that you can access tuple trees in a native Python style that is very similar to that provided by gnosis.xml.objectify or ElementTree:


Listing 6. select_hits_rxp2.py
from pyRXP import Parser
import xmlutils
tree = Parser().parse(open('weblog.xml').read())
weblog = xmlutils.TagWrapper(tree)
interesting = [child for child in weblog
              if child.tagName=='entry' 
if str(child.host)=='209.202.148.31'
if str(child.statusCode)=='200'] for e in interesting: print "%s (%s)" % (e.resource, e.byteCount)

So far, so good. The code is quite elegant. Still, proxying adds some overhead. This version of the script runs in 7.5 seconds instead of 5, which still seems quite a lot better than the 25 seconds for gnosis.xml.objectify. Those 2.5 seconds that the filter spends in proxy overhead, however, correspond to less than a tenth of a second that select_hits_xo.py spends in its filtering. The parsing step swamps this difference, but if you imagine an application that parses an XML document once, then performs hundreds of different filtering actions (for example, at user specification), the proxy wrapper starts to look a lot less appealing. However, pyRXP developers warn that xmlutils is experimental, so perhaps much more efficient wrappers could be developed.


Using LT XML

The LT XML collection is built on top of RXP and contains a variety of command-line tools for working with XML, as well as some higher-level APIs than those in RXP itself. One of the powerful tools in LT XML is called sggrep, which is a sort of grep for XML files. The syntax is a little confusing, but basically it is a way of formulating expressions that are a combination of regular expressions and XPaths.

Some other tools in LT XML include:

  • textonly, which strips out the tags and outputs PCDATA contents
  • sgsort to sort XML elements
  • sgcount to count elements
  • xmlnorm to cannonicalize XML documents

Each of these tools utilizes input and output pipes, and can therefore be combined on command-lines and in shell scripts. Moreover, the connection with non-XML versions of analogous tools can be seen by removing the "sg" prefix from many of the names.

One interesting technique is to pipe several sggrep queries together. Each sggrep command can specify both the main query and a subquery. For example, "I want <foo> elements that contain <bar> elements with the content baz." The main query asks for <foo>; the subquery specifies properties of child <bar>. The tool sggrep allows for either a more verbose form that explicitly names queries, subqueries, and patterns with -q, -s, and -t, or a compact form that omits the switches (you use the -- switch to activate the compact form). Listing 7 is a complex command-line that does almost the same thing as the filtering utilities discussed above:


Listing 7. A webhost.xml filtering compound query
% cat weblog.xml |
  sggrep '.*/entry' '.*/entry/host' '209.202.148.31' -- |
  sggrep -q '.*/entry' -s '.*/entry/statusCode' -t '200' |
  sggrep '.*/resource|byteCount' -- |
  textonly -s '\n'

This output is not quite right, it is broken on to lines like:

/publish/programming/regular_expressions.html
45674

rather than formatted per line as the Python filters do:

/publish/programming/regular_expressions.html (45674)

Probably some standard Unix shell tools like awk, sed, or tr could be used cleverly to get the precise output desired.

On the plus side, sggrep and the other LT XML tools are quite fast, as much so as pyRXP is without using the TagWrapper overhead. Furthermore, all of the capabilities exposed by the bundled utilities are also exposed to C programmers who want to use similar APIs. And perhaps best of all, LT XML itself now has a Python binding (but, interestingly, for no other script language).


Resources

  • Participate in the discussion forum.

  • Visit the home page for the RXP parser.

  • Find out more about the pyRXP binding, which is produced by ReportLab who also bring you tools for working with PDF files in Python.

  • Read David's previous tip on command-line XML processing, here on developerWorks (May 2003).

  • Read David Mertz's earlier columns on XML libraries:
    • XML Matters #2 introduced gnosis.xml.objectify, then called simply xml_objectify (developerWorks, August 2000).
    • XML Matters #11 updated readers on some early improvements to gnosis.xml.objectify. Some newer features were not covered in this column, but are in the module's HISTORY and other documentation files (June 2001).
    • XML Matters #14 discussed the HaXml module for the Haskell lazy pure-functional programming language (October 2001).
    • XML Matters #18 discussed Ruby's REXML library (March 2002).
    • XML Matters #28 discussed Fredrik Lundh's ElementTree XML API (June 2003).
    You'll find all previous installments of David's XML Matters column at the column summary page.

  • Find more XML resources on the developerWorks XML zone.

  • Check out Rational Application Developer for WebSphere Software, an easy-to-use, integrated development environment for building, testing, and deploying J2EE applications, including generating XML documents from DTDs and schemas.

  • IBM XML certification: Find out how you can become an IBM-Certified Developer in XML and related technologies.

About the author

Photo of David Mertz

For David Mertz, an atomic object is a combination of facts. David may be reached at mertz@gnosis.cx; his life pored over at http://gnosis.cx/dW/. Suggestions and recommendations on this, past, or future columns are welcomed. Check out David's new book, Text Processing in Python .

Report abuse help

Report abuse

Thank you. This entry has been flagged for moderator attention.


Report abuse help

Report abuse

Report abuse submission failed. Please try again later.


developerWorks: Sign in

If you don't have an IBM ID and password, register here.


Forgot your IBM ID?


Forgot your password?
Change your password


By clicking Submit, you agree to the developerWorks terms of use.

 


The first time you sign into developerWorks, a profile is created for you. This profile includes the first name, last name, and display name you identified when you registered with developerWorks. Select information in your developerWorks profile is displayed to the public, but you may edit the information at any time. Your first name, last name (unless you choose to hide them), and display name will accompany the content that you post.

Choose your display name

The first time you sign in to developerWorks, a profile is created for you, so you need to choose a display name. Your display name accompanies the content you post on developerWorks.

Please choose a display name between 3-31 characters. Your display name must be unique in the developerWorks community and should not be your email address for privacy reasons.

(Must be between 3 – 31 characters.)


By clicking Submit, you agree to the developerWorks terms of use.

 


Rate this article

Comments

Help: Update or add to My dW interests

What's this?

This little timesaver lets you update your My developerWorks profile with just one click! The general subject of this content (AIX and UNIX, Information Management, Lotus, Rational, Tivoli, WebSphere, Java, Linux, Open source, SOA and Web services, Web development, or XML) will be added to the interests section of your profile, if it's not there already. You only need to be logged in to My developerWorks.

And what's the point of adding your interests to your profile? That's how you find other users with the same interests as yours, and see what they're reading and contributing to the community. Your interests also help us recommend relevant developerWorks content to you.

View your My developerWorks profile

Return from help

Help: Remove from My dW interests

What's this?

Removing this interest does not alter your profile, but rather removes this piece of content from a list of all content for which you've indicated interest. In a future enhancement to My developerWorks, you'll be able to see a record of that content.

View your My developerWorks profile

Return from help

static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=XML
ArticleID=12305
ArticleTitle=XML Matters: The RXP parser
publish-date=08042003
author1-email=mertz@gnosis.cx
author1-email-cc=

Tags

Help
Use the search field to find all types of content in My developerWorks with that tag.

Use the slider bar to see more or fewer tags.

For articles in technology zones (such as Java technology, Linux, Open source, XML), Popular tags shows the top tags for all technology zones. For articles in product zones (such as Info Mgmt, Rational, WebSphere), Popular tags shows the top tags for just that product zone.

For articles in technology zones (such as Java technology, Linux, Open source, XML), My tags shows your tags for all technology zones. For articles in product zones (such as Info Mgmt, Rational, WebSphere), My tags shows your tags for just that product zone.

Use the search field to find all types of content in My developerWorks with that tag. Popular tags shows the top tags for this particular content zone (for example, Java technology, Linux, WebSphere). My tags shows your tags for this particular content zone (for example, Java technology, Linux, WebSphere).