Let me first introduce the Ruby language. I cannot say nearly enough here to get unfamiliar readers up to speed -- for that, I recommend consulting the Resources section of this article. But as a programmer learning the Ruby language myself, I can let you know why it is interesting. Ruby is a scripting language that has been described as "Perl done right." Then again, so probably has every newer scripting language, including Python. For Ruby, the description rings truer, not in the sense that Perl is done wrong (no language flames here), but in the sense that Ruby keeps much of Perl's conciseness and many of its shortcuts, while starting from a clean Smalltalk-ish OOP attitude. Moreover (at least to me), Ruby achieves conciseness while still avoiding the "executable line noise" quality found in some Perl code. At the same time, a number of Ruby constructs "feel" more direct than their Python equivalents (even if they don't really save much overall length).
REXML is a library written by Sean Russell. It is not the only
XML library for Ruby, but it is a popular one, and is written
in pure Ruby (so is NQXML, but XMLParser wraps around the
expat library, written in C). In his REXML overview, Russell comments:
I have this problem: I dislike obscifucated [sic] APIs. There are several XML parser APIs for Java. Most of them follow DOM or SAX, and are very similar in philosophy with an increasing number of Java APIs. Namely, they look like they were designed by theorists who never had to use their own APIs. The extant XML APIs, in general, suck. They take a markup language which was specifically designed to be very simple, elegant, and powerful, and wrap an obnoxious, bloated, and large API around it. I was always having to refer to the API documentation to do even the most basic XML tree manipulations; nothing was intuitive, and almost every operation was complex.
While I might not have put it quite as stridently, I agree with Russell: XML APIs are just plain too much work for most of what one does with them.
I would guess that what 80% of all the programmers who
deal with XML documents really want is just a way to grab the
data and easily manipulate it as structured data. DOM makes
this hard, and SAX makes it even harder. In several previous
articles, I have advocated the clarity and simplicity of my own
Python xml_objectify module. Let me repeat a quick example
using the file address.xml, which describes an address
book.
How to refer to nested data using xml_objectify
>>> from xml_objectify import XML_Objectify
>>> addressbook = XML_Objectify('address.xml').make_instance()
>>> print addressbook.person[1].address.city
New York
We need to know a little bit about the format of the data... but not too much
(see Resources for the sample document used throughout this
article). We need to know that the root of
the document is the address book (but not necessarily that it
is named <addressbook>). And we need to know that the
document can list multiple persons (but nothing will go wrong if
there is only one, who can be referred to as either addressbook.person or
addressbook.person[0]). The rest of what we need to know
is that, conceptually, persons have addresses and addresses
have cities. It all just works!
In contrast, DOM -- which advertises itself as OOP-ified XML -- makes us jump through hoops. The first challenge is referring to the root element; at least five different ways to do this come to mind:
Using DOM to name the XML document root
>>> from xml.dom import minidom
>>> dom = minidom.parse('address.xml')
>>> dom.firstChild
<DOM Element: addressbook at 1811436>
>>> dom._get_documentElement()
<DOM Element: addressbook at 1811436>
>>> dom._get_firstChild()
<DOM Element: addressbook at 1811436>
>>> dom.getElementsByTagName('addressbook')[0]
<DOM Element: addressbook at 1811436>
>>> dom.childNodes[0]
<DOM Element: addressbook at 1811436>
You also have to do some guessing as to exactly what is a method and
what is an attribute (or keep a manual handy). Given that we
know we want the root element, the
._get_documentElement() method is probably the best choice. Now, what
if we want to find our way down to the second person's city, as
in the xml_objectify example?
How to refer to nested data using DOM
>>> addressbook = dom._get_documentElement()
>>> print addressbook.getElementsByTagName('person')[1].\
... getElementsByTagName('address')[0].getAttribute('city')
New York
This style is quite verbose, but is probably the closest DOM
equivalent. You might use the .childNodes attribute array
directly to save a few characters, but this is fragile if, for
example, there are children of <addressbook>
other than <person>. You also have to know the nitty-gritty
detail that city is an element attribute rather than the content
of a subtag (either way might make sense for the basic data in question).
The goal of REXML is to just work. For the most part, it
succeeds pretty well. Actually, REXML supports two different
styles of XML processing -- "tree" and "stream." The first
is a simpler version of what DOM tries to do; the second is a simpler
version of what SAX tries to do. Let's look at the
tree style first. Suppose we want to work with the same address
book document as in the prior examples. The examples below
come from a modified eval.rb that I created; the standard
eval.rb (linked to in the Ruby tutorial) can display
extremely long results from expression evaluations of complex
objects -- mine remains quiet in the non-error case:
How to refer to nested data using REXML
ruby> require "rexml/document"
ruby> include REXML
ruby> addrbook = (Document.new File.new "address.xml").root
ruby> persons = addrbook.elements.to_a("//person")
ruby> puts persons[1].elements["address"].attributes["city"]
New York
This expression is rather natural. The .to_a() method creates
an array of all the <person> elements in the document, which
is handy to keep around under a name of its own. An element is something like a
DOM node, but is really much closer to the XML itself (and is also simpler to work with). The argument to
.to_a() is an XPath, in this case identifying all the
<person> elements anywhere in the document. If we only
wanted those at the first level, we might use:
Creating an array of matching elements
ruby> persons = addrbook.elements.to_a("/addressbook/person")
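To make the difference between the two XPaths concrete, here is a minimal sketch. The document below is a hypothetical stand-in for address.xml: the name attributes and the <group> wrapper are invented here only so that one <person> sits below the first level, and the helper person_names is likewise illustrative.

```ruby
require "rexml/document"

# Hypothetical stand-in for address.xml. The <group> wrapper and the
# name attributes are assumptions made for this illustration only.
SAMPLE_XML = <<~XML
  <addressbook>
    <person name="A"><address city="Sacramento" state="CA"/></person>
    <person name="B"><address city="New York" state="NY"/></person>
    <group>
      <person name="C"><address city="Los Angeles" state="CA"/></person>
    </group>
  </addressbook>
XML

# Return the name attribute of every <person> matched by the given XPath.
def person_names(xml, xpath)
  root = REXML::Document.new(xml).root
  root.elements.to_a(xpath).map { |person| person.attributes["name"] }
end

p person_names(SAMPLE_XML, "//person")             # persons at any depth
p person_names(SAMPLE_XML, "/addressbook/person")  # first-level persons only
```

With this sample, the first call should pick up all three persons, while the second should skip the one nested inside <group>.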
We can use XPaths even more directly as overloaded indexes to
the .elements attribute. For example:
Another way to refer to nested data using REXML
ruby> puts addrbook.elements["//person[2]/address"].attributes["city"]
New York
Notice that XPath uses one-based indexing, unlike the
zero-based indexing of Ruby and Python arrays. In other words,
it is still the same person whose city we are checking. We can
see more about this person by looking at the REXML elements
themselves:
Displaying the XML source of elements with REXML
ruby> puts addrbook.elements["//person[2]/address"]
<address city='New York' street='118 St.' number='344' state='NY'/>
ruby> puts addrbook.elements["//person[2]/contact-info"]
<contact-info>
  <email address='robb@iro.ibm.com'/>
  <home-phone number='03-3987873'/>
</contact-info>
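The indexing difference is easy to trip over, so a small side-by-side sketch may help. The three-person document is a hypothetical miniature of address.xml, and the two helper methods are invented for this illustration.

```ruby
require "rexml/document"

# Hypothetical miniature of address.xml (an assumption, not the real file).
INDEX_SAMPLE_XML = <<~XML
  <addressbook>
    <person><address city="Sacramento" state="CA"/></person>
    <person><address city="New York" state="NY"/></person>
    <person><address city="Los Angeles" state="CA"/></person>
  </addressbook>
XML

# XPath predicates count from 1.
def city_by_xpath_position(xml, position)
  root = REXML::Document.new(xml).root
  root.elements["//person[#{position}]/address"].attributes["city"]
end

# Ruby arrays count from 0.
def city_by_array_index(xml, index)
  root = REXML::Document.new(xml).root
  persons = root.elements.to_a("//person")
  persons[index].elements["address"].attributes["city"]
end

puts city_by_xpath_position(INDEX_SAMPLE_XML, 2)  # second person via XPath
puts city_by_array_index(INDEX_SAMPLE_XML, 1)     # same person via array index
```

Both calls are meant to reach the same person: position 2 in the XPath corresponds to index 1 in the array.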
Moreover, XPaths need not match just one element. We saw this
in defining the persons array, but another example emphasizes
this:
Matching multiple elements with XPaths
ruby> puts addrbook.elements.to_a("//person/address[@state='CA']")
<address city='Sacramento' street='Spruce Rd.' number='99' state='CA'/>
<address city='Los Angeles' street='Pine Rd.' number='1234' state='CA'/>
In contrast, the indexing of the .elements attribute only
produces the first matching element:
When XPaths match only the first occurrence
ruby> puts addrbook.elements["//person/address[@state='CA']"]
<address city='Sacramento' street='Spruce Rd.' number='99' state='CA'/>
ruby> puts addrbook.elements.to_a("//person/address[@state='CA']")[0]
<address city='Sacramento' street='Spruce Rd.' number='99' state='CA'/>
XPath addresses may also be used via the XPath class in
REXML, which has methods such as .first(), .each(), and .match().
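As a rough sketch of the XPath class in use, the following runs the same query through .first(), .match(), and .each(). The sample document is a hypothetical miniature of address.xml, and the helper ca_cities_via_xpath_class is a name invented here.

```ruby
require "rexml/document"

# Hypothetical miniature of address.xml (an assumption, not the real file).
XPATH_SAMPLE_XML = <<~XML
  <addressbook>
    <person><address city="Sacramento" state="CA"/></person>
    <person><address city="New York" state="NY"/></person>
    <person><address city="Los Angeles" state="CA"/></person>
  </addressbook>
XML

# Run one query three ways and collect the city names each way returns.
def ca_cities_via_xpath_class(xml)
  doc = REXML::Document.new(xml)
  query = "//address[@state='CA']"

  first_match = REXML::XPath.first(doc, query)   # only the first element
  all_matches = REXML::XPath.match(doc, query)   # array of every match

  visited = []
  REXML::XPath.each(doc, query) { |addr| visited << addr.attributes["city"] }

  {
    first: first_match.attributes["city"],
    match: all_matches.map { |addr| addr.attributes["city"] },
    each:  visited
  }
end

p ca_cities_via_xpath_class(XPATH_SAMPLE_XML)
```

The class methods take the starting node as an explicit first argument, which can be convenient when no .elements collection is at hand.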
One particularly idiomatic method of REXML elements is the
.each iterator. While Ruby has a "for" looping construct
that can operate over collections, Ruby programmers generally
prefer to use iterator methods that pass control to a
codeblock. The two constructs that follow are equivalent, but
the second has a more natural feel in Ruby:
Iterating through matching XPaths in REXML
ruby> for addr in addrbook.elements.to_a("//address[@state='CA']")
| puts addr.attributes["city"]
| end
Sacramento
Los Angeles
ruby> addrbook.elements.each("//address[@state='CA']") {
| |addr| puts addr.attributes["city"]
| }
Sacramento
Los Angeles
For purposes of "just working," the tree mode of REXML is
probably the easiest approach in the Ruby language. But
REXML also offers a stream mode that is like a
lighter-weight variant of SAX. As with SAX, REXML gives the
application programmer no default data structures from the XML
document. Instead, a "listener" or "handler" class is
responsible for providing a set of methods that respond to
various events in the document stream. These are the usual
collection: A tag starts, a tag ends, element text is
encountered, and so on.
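To see what these events look like in practice, here is a minimal sketch of a listener that simply records them in order. The class EventLogger, the helper log_events, and the tiny <note> document are all invented for this illustration.

```ruby
require "rexml/document"
require "rexml/streamlistener"

# Illustrative listener: records each event instead of acting on it.
class EventLogger
  include REXML::StreamListener

  attr_reader :events

  def initialize
    @events = []
  end

  # Called when an opening tag is seen.
  def tag_start(name, attrs)
    @events << [:start, name]
  end

  # Called when a closing tag is seen.
  def tag_end(name)
    @events << [:end, name]
  end

  # Called for character data; whitespace-only runs are skipped here.
  def text(content)
    stripped = content.strip
    @events << [:text, stripped] unless stripped.empty?
  end
end

def log_events(xml)
  logger = EventLogger.new
  REXML::Document.parse_stream(xml, logger)
  logger.events
end

log_events("<note><to>Ann</to></note>").each { |event| p event }
```

Any event the listener does not override falls through to the empty defaults supplied by StreamListener, so a handler only needs to define the methods it cares about.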
While stream mode is not nearly as effortless as working in
tree mode, it should generally be much faster. The REXML
tutorial claims that stream mode is 1,500
times as fast. While I have not attempted to benchmark it, I
suspect this is a limit case (my small examples were still
instantaneous in tree mode). Either way, the difference in
speed is likely to be significant, if speed matters.
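For readers who want to measure the difference themselves, the following is one possible timing sketch, not a rigorous benchmark. The generated document, its size, and every helper name are assumptions made for this illustration; real timings will depend on the machine, the Ruby and REXML versions, and the shape of the document.

```ruby
require "rexml/document"
require "rexml/streamlistener"

# Build a synthetic address book (hypothetical data) with `count` persons.
def build_addressbook(count)
  people = (1..count).map do |i|
    state = i.even? ? "CA" : "NY"
    "<person><address city='City#{i}' state='#{state}'/></person>"
  end
  "<addressbook>#{people.join}</addressbook>"
end

# Stream-mode collector for the cities of California addresses.
class CityCollector
  include REXML::StreamListener

  attr_reader :cities

  def initialize
    @cities = []
  end

  def tag_start(name, attrs)
    attrs = attrs.to_h  # works for an array of pairs or a Hash
    @cities << attrs["city"] if name == "address" && attrs["state"] == "CA"
  end
end

# Tree mode: parse everything, then query with XPath.
def tree_cities(xml)
  root = REXML::Document.new(xml).root
  root.elements.to_a("//address[@state='CA']").map { |a| a.attributes["city"] }
end

# Stream mode: collect matches while parsing.
def stream_cities(xml)
  collector = CityCollector.new
  REXML::Document.parse_stream(xml, collector)
  collector.cities
end

# Wall-clock seconds taken by the given block.
def elapsed_seconds
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  yield
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
end

xml = build_addressbook(500)
puts format("tree:   %.4f s", elapsed_seconds { tree_cities(xml) })
puts format("stream: %.4f s", elapsed_seconds { stream_cities(xml) })
```

A single run like this is noisy; repeating each measurement several times and varying the document size would give a fairer picture.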
Let's look at a very simple example that does the same thing as the "list the California cities" examples above. Extending this to complex document processing is relatively straightforward:
Stream processing XML documents in REXML
ruby> require "rexml/document"
ruby> require "rexml/streamlistener"
ruby> include REXML
ruby> class Handler
| include StreamListener
| def tag_start name, attrs
| if name=="address" and attrs.assoc("state")[1]=="CA"
| puts attrs.assoc("city")[1]
| end
| end
| end
ruby> Document.parse_stream((File.new "address.xml"), Handler.new)
Sacramento
Los Angeles
One thing to note in the stream processing example is that tag attributes are passed as an array of arrays, which is slightly more work to handle than a hash would be (but is probably faster to create within the library).
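One way to soften that extra work is to convert the attributes to a hash at the top of the handler. The sketch below does this with .to_h, which accepts an array of name/value pairs and, as an assumption about other REXML releases that may pass a Hash instead, leaves a Hash unchanged. The class name, helper, and sample document (including the address with no state attribute) are invented for illustration.

```ruby
require "rexml/document"
require "rexml/streamlistener"

# Illustrative handler: collects cities of California addresses.
class CaliforniaCities
  include REXML::StreamListener

  attr_reader :cities

  def initialize
    @cities = []
  end

  def tag_start(name, attrs)
    # Normalize to a Hash; works whether attrs is an array of
    # [name, value] pairs or already a Hash (version-dependent assumption).
    attrs = attrs.to_h
    # A missing attribute is simply nil, so no error is raised.
    return unless name == "address" && attrs["state"] == "CA"
    @cities << attrs["city"]
  end
end

# Hypothetical sample; the second address deliberately lacks a state.
STREAM_SAMPLE_XML = <<~XML
  <addressbook>
    <person><address city="Sacramento" state="CA"/></person>
    <person><address city="Nowhere"/></person>
    <person><address city="Los Angeles" state="CA"/></person>
  </addressbook>
XML

def california_cities(xml)
  handler = CaliforniaCities.new
  REXML::Document.parse_stream(xml, handler)
  handler.cities
end

puts california_cities(STREAM_SAMPLE_XML)
```

Hash lookup also sidesteps the case where .assoc() returns nil for an absent attribute, which would otherwise raise an error on the following [1].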
This installment has looked at one more lightweight
alternative to the cumbersome APIs of DOM, SAX, and XSLT. Along
with the xml_objectify, PYX, and HaXml options
examined in earlier installments, Ruby programmers also have a quick
way of processing XML without a steep learning curve.
- Visit the Ruby Web site for news and discussion of Ruby and XML. Also, check out the language reference and other documents.
- I have also looked at Ruby creator Yukihiro Matsumoto's book
for O'Reilly entitled Ruby in a Nutshell.
As a programmer learning Ruby, I admit this is probably not as well-suited to me as to more experienced Ruby users.
Still, I get the feeling this book was "not quite translated enough" (from its original Japanese text).
While this book is very well-organized as a reference, a good number of the descriptions left me
uncertain about occasional subtleties of the language.
- The REXML homepage contains a very good tutorial; it is not entirely complete, but it does a good job of getting users up to speed.
- The address book example I use in this article can be found at http://gnosis.cx/download/address.xml.

David Mertz wishes to let a thousand flowers bloom. David may be reached at mertz@gnosis.cx; his life pored over at http://gnosis.cx/dW/. Suggestions and recommendations on this, past, or future columns are welcomed.