Python XML bindings seem to pop up almost every day, not because of
anything missing in existing libraries like gnosis.xml.objectify or
ElementTree, but simply out of Not Invented Here syndrome. Though I am perhaps somewhat biased, I still feel that my own gnosis.xml.objectify -- the
first of these tools to be developed -- remains the most
versatile and Pythonic binding available (and also one of the fastest
and most memory friendly). Unfortunately, the multiplication of
just-slightly-different libraries for the same purpose is an
affliction Python suffers in several other areas as well.
In part, developers invent their own tools simply because they do not
immediately see how to accomplish their goals with the existing ones. In this article, I try to
remedy that, at least where gnosis.xml.objectify is concerned.
The gnosis.xml.objectify philosophy
My goal in creating gnosis.xml.objectify was to provide a module
that transforms data in XML documents into completely native
Python objects. In particular, it is not very Pythonic to access
data using getters and setters, or other similar methods. In Java
and some other languages you do things this way -- and largely as a
result of the Java style, this is how you do things in DOM, even in
Python.
For gnosis.xml.objectify, all the data that comes from an XML
document -- whether it's from element bodies or from XML attributes -- is
simply data in object attributes. If a given object has multiple
children with the same name, the attribute points to a list of
like-named children. But even if the object has only one child
with a given name, that one child is kind enough to act like a list
for iteration purposes. When accessing a gnosis.xml.objectify
object, the simplest thing that could possibly work almost always
does work.
Listing 1 is a very quick primer and example for readers new to the library:
Listing 1. Basic Usage of gnosis.xml.objectify
>>> from gnosis.xml.objectify import make_instance
>>> xml = "<foo><bar>Text</bar><baz a1='bat'/><baz>blip</baz></foo>"
>>> foo = make_instance(xml)
>>> foo
<foo id="48b170">
>>> foo.bar
<bar id="48b300">
>>> foo.baz
[<baz id="48b210">, <baz id="48b030">]
>>> for bar in foo.bar: print bar
...
<bar id="48b300">
>>> foo.baz[0].a1
u'bat'
>>> foo.bar.PCDATA
u'Text'
>>> foo.bar[0].PCDATA
u'Text'
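To make the attribute-style access concrete, here is a minimal sketch of how a binding of this kind could be layered on the standard library's xml.etree.ElementTree parser. It is not how gnosis.xml.objectify is implemented; the names Node and objectify are hypothetical, and the sketch is written for Python 3 rather than the Python 2 used in the listings.

```python
import xml.etree.ElementTree as ET

class Node:
    """Hypothetical stand-in for an _XO_ node: plain attributes, little else."""
    def __init__(self, tag):
        self._tag = tag
    # A lone node behaves like a one-item list, so callers can always iterate.
    def __iter__(self):
        return iter([self])
    def __getitem__(self, index):
        return [self][index]
    def __len__(self):
        return 1
    def __repr__(self):
        return "<%s id=%x>" % (self._tag, id(self))

def objectify(element):
    """Recursively copy an ElementTree element into plain Node attributes."""
    node = Node(element.tag)
    for name, value in element.attrib.items():    # XML attributes
        setattr(node, name, value)
    if element.text and element.text.strip():     # element body
        node.PCDATA = element.text.strip()
    for child in element:
        obj = objectify(child)
        existing = getattr(node, child.tag, None)
        if existing is None:
            setattr(node, child.tag, obj)               # first child of this name
        elif isinstance(existing, list):
            existing.append(obj)                        # third and later children
        else:
            setattr(node, child.tag, [existing, obj])   # second: promote to list
    return node

foo = objectify(ET.fromstring(
    "<foo><bar>Text</bar><baz a1='bat'/><baz>blip</baz></foo>"))
print(foo.bar.PCDATA)        # single child, direct access
print(foo.bar[0].PCDATA)     # same child, list-style access
print(foo.baz[0].a1, foo.baz[1].PCDATA)
```

The sketch ignores name clashes (for example, an XML attribute and a child element sharing a name), which a real binding has to handle.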
What gnosis.xml.objectify does not do
The node objects in gnosis.xml.objectify trees are, by design, quite
dumb. Yes, they print moderately nice looking representations of
themselves; and single instances also act list-like when appropriate,
but instance-like otherwise. But generally, node objects eschew any
special methods or attributes (or at least they do so unless you
decide to program your own special behavior into particular node types,
specified by their element name). For one thing, any methods I might
have added to node objects would potentially conflict with tagnames
in the generic XML documents that gnosis.xml.objectify parses. But more
importantly, I believe Python is natively a perfectly good language
(excellent, in fact), so you can and should use exactly the same
generic techniques on objects that happen to have been generated from
XML sources as you would use to work with any old object.
However, I have found -- particularly of late -- that the very flexibility
of gnosis.xml.objectify gives some users the false impression that
they cannot achieve the constrained goals that some more XML-oriented
bindings provide as default behaviors. To address this, I have added a
subpackage, gnosis.xml.objectify.utils (see Resources),
to the Gnosis Utilities
package to illustrate several of the most-requested XML-oriented
usages. However, these utilities, while genuinely useful as provided,
are still intended more as examples of what you can do than as
official APIs for gnosis.xml.objectify. The idea here is that
gnosis.xml.objectify does not have an API, except the API of
Python itself.
One of the perceived strengths of Fredrik Lundh's ElementTree and
Uche Ogbuji's Anobind (see Resources) is their use of XPath-like node-search
methods. To my mind, XPath syntax is still somewhat overly
XML-oriented, but enough users requested this that I decided to
add a utility function, gnosis.xml.objectify.utils.XPath(), to Gnosis
Utilities. In about 50 lines, I was able to implement a significant
superset of the XPath support in either ElementTree or anobind --
though not the complete XPath specification, which is large.
Specifically, I enabled the following XPath features:
- Named node search by specifying a tagname
- Recursive node search using the // delimiter
- Wildcard searches using the * symbol
- Text node search using the text() pseudo-function
- Attribute search using the @ prefix
- Wildcard attribute search using the @* symbol
- Node indexing and slicing
Moreover, since this is Python, I allow users to use not only XPath simple
numeric indexing but also a general slice notation. Since XPath is
one-based in indexing while Python is zero-based, I emphasize the
non-Python semantics by indicating slices differently in a
pseudo-XPath; for example, /tagname[2..5] indicates the inclusive
range from the second to the fifth <tagname> element in the document
root.
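The conversion from one-based, inclusive pseudo-XPath ranges to Python's zero-based, half-open slices can be sketched as follows. The function parse_indices and its regular expression are hypothetical; the article's own index-parsing helper is not shown, so this only guesses at behavior compatible with the (name, start, stop) triples that islice() expects. It is written for Python 3.

```python
import re
import sys

# Hypothetical pattern: a name followed by [N] or [N..M] at the end.
_INDEX = re.compile(r"^(?P<name>.*)\[(?P<first>\d+)(?:\.\.(?P<last>\d+))?\]$")

def parse_indices(fragment):
    """Split 'tag[2..5]' into ('tag', start, stop) suitable for islice()."""
    match = _INDEX.match(fragment)
    if match is None:                        # no index given: take everything
        return fragment, 0, sys.maxsize
    first = int(match.group("first"))
    last = int(match.group("last") or first)
    # One-based inclusive [first..last] -> zero-based half-open [start, stop)
    return match.group("name"), first - 1, last

print(parse_indices("tagname[2..5]"))   # ('tagname', 1, 5)
print(parse_indices("bar[2]"))          # ('bar', 1, 2)
print(parse_indices("text()"))          # no index: start 0, open-ended stop
```

Only the start needs shifting: an inclusive one-based end and an exclusive zero-based end happen to be the same number.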
While I was at it, I wrote the whole thing as a lazy iterator so you don't need
to instantiate a large node-list if you don't need
one. Of course, if you want an instantiated node-list, just use
list(XPath(obj,path)) to get one.
However, even though I recognize the coolness of predicate indexing,
my simple function does not bother implementing it. There is
nothing conceptually difficult about implementing the remaining bits
of full XPath; I just did not find it necessary (or concise) as an
illustration. To show what is supported, the test script test_xpath.py that I will include
in future Gnosis Utilities distributions exercises the
following test XPaths (and produces correct output for each):
Listing 2. Patterns tested in test_xpath.py
patterns = '''/bar //bar //* /baz/*/bar
/bar[2] //bar[2..4]
//@a1 //bar/@a1 /baz/@* //@*
baz//bar/text() /baz/text()[3]'''
To support this, I created a little recursive traversal function
that walks all the nodes of a gnosis.xml.objectify object. You can use
it by itself if you like. You may find it useful for performing your own
non-XPath filtering on a tree. Of course, the following calls should
be equivalent: walk_xo(obj) and XPath(obj, "//*") (the first will
perform slightly less housekeeping). The function looks like this:
Listing 3. Compact, lazy, recursive node traversal
def walk_xo(o):
    yield o
    for node in children(o):
        for child in walk_xo(node):
            yield child
Simple, huh? Another small support function simply parses out index values if they are given within a (pseudo-)XPath. I will not bother reproducing that here.
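As an illustration of non-XPath filtering with such a traversal, the following sketch walks a small tree of plain objects and selects nodes with an ordinary Python predicate. The Node class and this children() function are hypothetical stand-ins rather than the Gnosis helpers, and the code is written for Python 3, where yield from replaces the inner loop.

```python
class Node:
    """Hypothetical stand-in for a parsed node with plain attributes."""
    def __init__(self, name, **attrs):
        self.name = name
        self.kids = []
        self.__dict__.update(attrs)

def children(node):
    # Stand-in for a children() helper; here child nodes live in a list.
    return node.kids

def walk(node):
    """Lazy depth-first traversal, yielding the node before its descendants."""
    yield node
    for child in children(node):
        yield from walk(child)

root = Node("foo")
bar = Node("bar", a1="bat")
baz = Node("baz")
baz.kids.append(Node("bar", a1="blip"))
root.kids.extend([bar, baz])

# Any Python predicate can act as the filter; nothing here is XPath-specific.
with_a1 = (n for n in walk(root) if hasattr(n, "a1"))
print([n.a1 for n in with_a1])    # ['bat', 'blip']
```

Because both walk() and the filter are generators, no node is visited until the final list comprehension asks for it.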
An (almost) full XPath wrapper
The trick in making the XPath() function so concise is that it
has so little need to worry about XML per se (see Listing 4). Most of the work here
lies in just making sense of the XPath string itself. Some existing
one-line wrapper functions -- like children(), text(), and
attributes() -- make the code look a bit nicer, but are themselves
extremely simple filters. In other words, you could use something
very close to this same function against objects that never derived
from XML.
Listing 4. The gnosis.xml.objectify.utils.XPath() function
from itertools import islice        # Needed for the slicing below

def XPath(o, path):
    "Find node(s) within an _XO_ object"
    path = path.replace('//','/!!') # Placeholder hack for easy splitting
    if path.startswith('/'):        # No need for init / since node==root
        path = path[1:]
    if path.startswith('!!'):       # Recursive path fragment
        path, start, stop = indices(path)
        i = 0
        for node in walk_xo(o):
            if i >= stop: return
            for match in XPath(node, path[2:]):
                if start <= i < stop:
                    yield match
                i += 1
    elif '/' in path[1:]:           # Compound, non-recursive
        head, tail = path.split('/', 1)
        for node in XPath(o, head):
            for match in XPath(node, tail):
                yield match
    else:                           # Atomic path fragment
        path, start, stop = indices(path)
        if path=="*":               # Node wildcard
            for node in islice(children(o), start, stop):
                yield node
        elif path=="text()":        # Node text(s)
            for s in islice(text(o), start, stop):
                yield s
        elif path.startswith('@*'): # All node attributes
            for attr in attributes(o):
                yield attr
        elif path.startswith('@'):  # Specific node attribute
            for attr in attributes(o):
                if attr[0]==path[1:]:
                    yield attr
        elif hasattr(o, path):      # Named node type
            for node in islice(getattr(o, path), start, stop):
                yield node
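The remark that much the same function would work on objects that never came from XML can be illustrated with a much smaller path follower over plain Python objects. The function find() and the sample data are hypothetical, it handles only simple slash-separated names (no wildcards, indexes, or attributes), and it is written for Python 3.

```python
from types import SimpleNamespace

def find(obj, path):
    """Yield objects reached by following a '/'-separated attribute path."""
    head, _, tail = path.strip("/").partition("/")
    value = getattr(obj, head, None)
    if value is None:
        return
    # Treat a single object like a one-item list, as the binding does.
    matches = value if isinstance(value, list) else [value]
    for match in matches:
        if tail:
            yield from find(match, tail)
        else:
            yield match

# Ordinary objects, built by hand rather than parsed from XML
config = SimpleNamespace(
    server=[SimpleNamespace(host="a.example", port=80),
            SimpleNamespace(host="b.example", port=8080)],
    owner=SimpleNamespace(name="david"))

print(list(find(config, "/server/host")))   # ['a.example', 'b.example']
print(list(find(config, "owner/name")))     # ['david']
print(list(find(config, "missing/name")))   # []
```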
From time to time, users have been bothered by the fact that
gnosis.xml.objectify does not reserialize its objects to XML. In
comparison with other Python XML bindings, this is said to be a
weakness. I disagree: Those other bindings still force you to
think of their Python objects in XML terms, not Python terms.
Only blessed objects and attributes are serialized, not
everything a Python object might have.
For example, in ElementTree, you can perform steps like:
Listing 5. ElementTree example
>>> import sys
>>> from elementtree import ElementTree
>>> et = ElementTree.parse("xpath.xml")
>>> et.write(sys.stdout)
But if you change the object et (or any child nodes you might
generate with methods like .getroot(), .find(), or .findall()),
your additions are not generally serializable. For example, this does
not change the serialization at all, even though it changes the object:
Listing 6. Modified ElementTree example
>>> et.new = 'flaz'
>>> et.getroot().more = 123
>>> et.write(sys.stdout)
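The same distinction can be seen with xml.etree.ElementTree, the version of ElementTree that now ships in the Python standard library. The sketch below is written for Python 3 and uses an inline document rather than xpath.xml; it sets one ordinary Python attribute on the tree wrapper and then makes two changes through the XML-oriented API, and only the latter appear in the output.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<foo><bar>Text</bar></foo>")
tree = ET.ElementTree(root)

tree.new = "flaz"                          # ordinary Python attribute: ignored on output
root.set("more", "123")                    # blessed XML attribute: serialized
ET.SubElement(root, "baz").text = "blip"   # blessed child element: serialized

print(ET.tostring(root, encoding="unicode"))
# <foo more="123"><bar>Text</bar><baz>blip</baz></foo>
```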
Similarly, with Anobind and its .unbind() method, you can add special XML-oriented nodes using API methods like
.append(), .insert(), or .remove(). But then,
gnosis.xml.objectify can also add blessed attributes using its
gnosis.xml.objectify.addChild() utility function (and using
gnosis.xml.objectify.createPyObj() to make a special _XO_ object to add).
If you just want generic serialization of gnosis.xml.objectify
objects, perhaps with a few values changed from the original XML, you
can write a utility function to do this in 12 lines:
Listing 7. Generic XML serialization
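from sys import stdout
from types import StringTypes

# tagname(), attributes(), and content() are small helper functions in
# the same spirit as children() and text() above
def write_xml(o, out=stdout):
    "Serialize an _XO_ object back into XML"
    out.write("<%s" % tagname(o))
    for attr in attributes(o):
        out.write(' %s=%s' % attr)
    out.write('>')
    for node in content(o):
        if type(node) in StringTypes:
            out.write(node)
        else:
            write_xml(node, out=out)
    out.write("</%s>" % tagname(o))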
But to my mind, the real power of working with objects in Python comes in non-generic serialization and transformation. Rather than just dumping every attribute back into XML, you might want to filter and massage nodes before writing them. Of course, just what you manipulate depends on your application requirements.
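One way to picture non-generic serialization is a writer that accepts an application-supplied filter. The sketch below is written for Python 3 and does not use gnosis.xml.objectify at all: the Node class, the to_xml() function, and its keep parameter are hypothetical stand-ins chosen to keep the example self-contained.

```python
from xml.sax.saxutils import escape, quoteattr

class Node:
    """Hypothetical stand-in for a parsed node: tag, attributes, mixed content."""
    def __init__(self, tag, attrs=None, content=()):
        self.tag = tag
        self.attrs = dict(attrs or {})
        self.content = list(content)     # strings and child Nodes, in order

def to_xml(node, keep=lambda n: True):
    """Serialize a node, skipping any child for which keep(child) is false."""
    attrs = "".join(" %s=%s" % (k, quoteattr(v)) for k, v in node.attrs.items())
    parts = ["<%s%s>" % (node.tag, attrs)]
    for item in node.content:
        if isinstance(item, str):
            parts.append(escape(item))           # text: escape &, <, >
        elif keep(item):
            parts.append(to_xml(item, keep))     # child node that passes the filter
    parts.append("</%s>" % node.tag)
    return "".join(parts)

doc = Node("foo", content=[
    Node("bar", content=["Text"]),
    Node("baz", {"a1": "bat"}),
    Node("baz", content=["blip & more"]),
])

# Application-specific rule: leave every <bar> element out of the output.
print(to_xml(doc, keep=lambda n: n.tag != "bar"))
# <foo><baz a1="bat"></baz><baz>blip &amp; more</baz></foo>
```

The filter is just a callable, so the same writer serves any rule the application can express in Python.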
An approach to XML binding taken by Dave Kuhlman's generateDS (see Resources), as well as some other less mature bindings, is to require custom Python
classes for each XML element type in the documents that you process.
In Kuhlman's case, these custom classes are generated from
corresponding W3C XML Schemas (but only allow a subset of the full WXS
specification). In contrast, gnosis.xml.objectify -- along with
ElementTree, Anobind, and some others -- will bind any old XML document without any special programming.
However, gnosis.xml.objectify, like Anobind but unlike
ElementTree, lets you create custom node classes if you want to
use them. In fact, you can substitute the base class
for every node object, giving your whole application custom
behaviors.
I think beginning users of gnosis.xml.objectify have been
intimidated by the idea of specializing classes per-tagname. Here are a few
examples that show just how non-threatening it really is.
Redefining the _XO_ base class
Whenever you customize a base class, you need to inject the new
class back into the gnosis.xml.objectify namespace. This step
involves some magic, but is not difficult to do. I might give the step a
friendlier name in a wrapper function in the future, but the style
emphasizes that you are changing the module itself. For example,
tagnames are mangled in Gnosis Utilities 1.1.1, but not attribute
names; this makes it more difficult than necessary to access attributes
whose names contain characters disallowed in Python variables. One fix
for this is to also allow dictionary-like access to these attributes:
Listing 8. Adding dictionary-like attribute access
>>> import gnosis.xml.objectify
>>> from gnosis.xml.objectify import make_instance
>>> class newXO(gnosis.xml.objectify._XO_):
...     def __getitem__(self, key):
...         return getattr(self,key)
...
>>> gnosis.xml.objectify._XO_ = newXO
>>> o = make_instance('<o><my-doc my-name="david">Stuff</my-doc></o>')
>>> print o.my__doc['my-name']
david
>>> getattr(o.my__doc,'my-name')  # Works without custom base
u'david'
Redefining per-tagname node classes
Redefining classes is probably of greatest utility for specific per-tagname classes that you know certain things about. For example, if a certain element is always a leaf node in a particular document type (and has no XML attributes), you might want to refer to its PCDATA just by the node name itself. Of course, if the input XML is not structured in the way you assume, accessing children becomes more difficult. One way to program this behavior is:
Listing 9. An AutoPCDATA custom node class
>>> from gnosis.xml.objectify import make_instance
>>> xml = '''<group>
...   <var><description>foo</description></var>
...   <var><description>bar</description></var>
... </group>'''
>>> group = make_instance(xml)
>>> print group[0].var[0].description
<description id="23cf2c">
>>> print group[0].var[0].description.PCDATA
foo
>>> import gnosis.xml.objectify
>>> class AutoPCDATA(gnosis.xml.objectify._XO_):
...     def __repr__(self):
...         return self.PCDATA
...
>>> gnosis.xml.objectify._XO_description = AutoPCDATA
>>> group = make_instance(xml)
>>> print group[0].var[0].description
foo
Even more clever, in AutoPCDATA you can check objects for
what attributes other than .PCDATA they have, and return
different values for the different cases.
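A sketch of that idea follows, written for Python 3 with a hypothetical XO class standing in for the _XO_ base so that it runs without Gnosis Utilities. The representation returned for the non-leaf case is an arbitrary choice made for illustration.

```python
class XO:
    """Hypothetical stand-in for the _XO_ base class."""
    def __init__(self, **attrs):
        self.__dict__.update(attrs)

class AutoPCDATA(XO):
    def __repr__(self):
        # Everything the node carries apart from its text
        extras = sorted(k for k in self.__dict__ if k != "PCDATA")
        if not extras:                     # pure leaf: show just the text
            return getattr(self, "PCDATA", "")
        # Otherwise name what else is present instead of hiding it
        return "<description %r with %s>" % (
            getattr(self, "PCDATA", ""), ", ".join(extras))

print(AutoPCDATA(PCDATA="foo"))               # foo
print(AutoPCDATA(PCDATA="bar", lang="en"))    # <description 'bar' with lang>
```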
Another application-specific approach to custom classes performs
calculated access. One of the several Python bindings called
XMLObject gives an example of data about a family with multiple
members:
Listing 10. Family tree as XML
<Family>
  <Member Name="Abe" DOB="3/31/42" />
  <Member Name="Betty" DOB="2/4/49" />
  <Member Name="Edith" Father="Abe" Mother="Betty" DOB="8/30/80" />
  <Member Name="Janet" Father="Frank" Mother="Edith" DOB="1/17/03" />
</Family>
It might be handy to access family members solely by name, without
bothering with the whole XML hierarchy. One obvious approach is
with a custom Family class:
Listing 11. Dictionary-like access into a child attribute
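import gnosis.xml.objectify
from gnosis.xml.objectify import make_instance

class Family(gnosis.xml.objectify._XO_):
    def __getitem__(self, key):
        for member in self.Member:
            if member.Name == key:       # Compare, do not assign
                return member

gnosis.xml.objectify._XO_Family = Family
family = make_instance('family.xml')     # Lowercase name avoids shadowing the class
print family['Janet'].DOB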
However, if names are not unique you may want to expand upon this particular approach.
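One possible expansion returns every matching member rather than only the first. The sketch below is written for Python 3 and uses a hypothetical XO class in place of _XO_ so that it runs on its own; the duplicate "Abe" record is invented purely for illustration, and raising KeyError for an unknown name is a design choice, not something the original listing does.

```python
class XO:
    """Hypothetical stand-in for the _XO_ base class."""
    def __init__(self, **attrs):
        self.__dict__.update(attrs)

class Family(XO):
    def __getitem__(self, key):
        # Collect every member with this name rather than only the first.
        matches = [m for m in self.Member if m.Name == key]
        if not matches:
            raise KeyError(key)
        return matches

family = Family(Member=[
    XO(Name="Abe", DOB="3/31/42"),
    XO(Name="Edith", DOB="8/30/80"),
    XO(Name="Abe", DOB="6/1/05"),      # invented duplicate, for illustration only
])

print([m.DOB for m in family["Abe"]])   # ['3/31/42', '6/1/05']
```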
The general techniques for wrapping gnosis.xml.objectify shown in
this article are meant mostly as examples for more specific
customizations by users. You can achieve great flexibility and power
by keeping APIs highly open and minimally specified, leaving
customization at the application level rather than the library level.
Resources
- Read these previous installments of XML Matters that touch on the evolving gnosis.xml.objectify binding:
  - "On the 'Pythonic' treatment of XML documents as objects (II)" (August 2000)
  - "Revisiting xml_pickle and xml_objectify" (June 2001)
  - "The RXP parser" (August 2003)
  - "The XOM Java XML API" (December 2003)
  - "Practical XML data design and manipulation for voting systems" (June 2004)
- Download the latest development snapshot of Gnosis Utilities.
- Find out more about Fredrik Lundh's ElementTree library, a popular Python XML binding tool.
- Read David Mertz's discussion of ElementTree in a prior XML Matters installment: "Process XML in Python with ElementTree" (June 2003).
- Check out the homepage for Dave Kuhlman's generateDS module. Kuhlman also wrote a nice essay comparing generateDS with gnosis.xml.objectify. Gnosis Utilities has grown several useful additions since then, though (and most likely, so has generateDS).
- Learn more about Uche Ogbuji's Python XML binding, Anobind.
- While you're at it, take a look at Uche's ongoing discussions of many XML binding libraries.
- Reference the XML Path Language (XPath) Version 1.0 Recommendation on the W3C site.
- Find hundreds more XML resources on the developerWorks XML technology zone.
- Find other articles in David Mertz's XML Matters column.

To David Mertz, all the world is a stage, and his career is devoted to providing marginal staging instructions. David may be reached at mertz@gnosis.cx; his life pored over at http://gnosis.cx/dW/. Suggestions and recommendations on this, past, or future columns are welcomed. Check out David's book Text Processing in Python.