My IBM columns, tutorials, and articles have had a dual -- or maybe triple -- purpose for your humble author. First, I cherish the opportunity to share what knowledge I have with other programmers/developers, and perhaps, thereby, make a few of your own tasks a bit easier. (It is also awfully nice that I get paid money for writing these things.)
Another purpose of my writing is to release programming code to the public domain. In writing this code, my goal has been to illustrate general programming concepts, and I've tailored the code around these concepts. At the same time, however, my intention is to offer code to the programming community that individual developers can utilize directly for their own purposes.
Through the course of releasing my code, I have received a number of valuable suggestions and enhancement patches from users of these modules. Most of the improvements from users are ones that I would never have imagined on my own, and a few are almost shocking in their insight. In this article, I present some uses of xml_pickle and xml_objectify that were not possible when I wrote XML Matters #1 and XML Matters #2, the columns that initially discussed these modules.
One change, in particular, has been an ongoing struggle. My timing was probably slightly unlucky. Soon after I first created xml_pickle in August 2000, the PyXML distribution went through several incompatible versions. Not much later, Python 2.0 came out with its own not-quite-compatible XML support. Users contributed several patches to match then-current Python XML support along the way, but in their current state xml_pickle and xml_objectify both require Python 2.0+ and its included PyXML package. Given the effective requirement for Python 2.0 in terms of the XML packages, I also allowed in a few other changes that use Python 2.0 syntax. The backward incompatibility with Python 1.5 is unfortunate, but maintaining compatibility would be too unwieldy in this case.
One of the features of xml_objectify I introduced in XML Matters #2 was the special _XML attribute that kept complete element contents (including subelement markup of character data). The default behavior is still to create an _XML attribute of a nested object only when it contains character-level markup. You now, however, have a choice about changing this behavior using the function keep_containers() and values such as ALWAYS and NEVER. For example:
>>> xml_str = '''<doc><p>Spam and eggs <b>are</b> tasty</p>
... <p>The Spanish Inquisition</p>
... <foot>Our weapon is fear</foot></doc>'''
>>> open( 'test.xml', 'w' ).write(xml_str)
>>> from xml_objectify import *
>>> py_obj = XML_Objectify( 'test.xml' ).make_instance()
>>> py_obj.p[0].PCDATA
u'Spam and eggs  tasty'
>>> py_obj.p[0]._XML   # first <p> has <b> markup
u'Spam and eggs <b>are</b> tasty'
>>> py_obj.p[1].PCDATA
u'The Spanish Inquisition'
>>> py_obj.p[1]._XML   # second <p> has no markup
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
AttributeError: '_XO_p' instance has no attribute '_XML'
>>> _=keep_containers(ALWAYS)
>>> py_obj = XML_Objectify( 'test.xml' ).make_instance()
>>> py_obj.p[1]._XML   # second <p> now keeps its contents
u'The Spanish Inquisition'
>>> _=keep_containers(NEVER)
>>> py_obj = XML_Objectify( 'test.xml' ).make_instance()
>>> py_obj.p[0]._XML   # even the <p> with markup has no _XML
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
AttributeError: '_XO_p' instance has no attribute '_XML'
Probably the most powerful feature of xml_objectify is also a subtle one. Many users have probably never needed (or even noticed) class magic behavior. However, it is possible to have special classes on hand that will determine the behaviors of "objectified" XML nodes. (The original article mentioned this, but it is worth seeing in action.)
I should point out a few details prior to presenting the examples. In order to avoid a sloppy conflict in the first module version, xml_objectify now "mangles" the names of the class templates for XML nodes. The "abstract" node class, or _XO_, has a few "magic" behaviors in itself. Upon creation -- whether created dynamically or by a programmer -- concrete node classes have the form _XO_<tagname> (where <tagname> is a tag that occurs in the objectified XML document).
The "magic" that _XO_ itself provides is the __getitem__() and __len__() methods. These let you treat each node attribute as if it is a list in those contexts where it would be nice for the attribute to behave like a list. But at the same time, you can refer to an "only child" node without having to subscript. For example:
>>> print type(py_obj.p), type(py_obj.foot)
<type 'list'> <type 'instance'>
>>> print py_obj.p[1].PCDATA, '...', py_obj.foot.PCDATA
The Spanish Inquisition ... Our weapon is fear
>>> for line in py_obj.p: print line.PCDATA,
...
Spam and eggs  tasty The Spanish Inquisition
>>> for line in py_obj.foot: print line.PCDATA,
...
Our weapon is fear
>>> map(lambda line: len(line.PCDATA), py_obj.foot)
[18]
>>> map(lambda line: len(line.PCDATA), py_obj.p)
[20, 23]
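The "only child acts like a list" trick can be sketched in a few lines of modern Python. This is only an illustration of the idea, not the code in xml_objectify: the Node class and its PCDATA attribute are hypothetical stand-ins for a node class derived from _XO_.

```python
class Node:
    """Hypothetical stand-in for an objectified XML node (not the real _XO_)."""

    def __init__(self, pcdata):
        self.PCDATA = pcdata

    def __getitem__(self, index):
        # An "only child" behaves like a one-element list: item 0 is the
        # node itself, and any other index ends iteration with IndexError.
        if index == 0:
            return self
        raise IndexError(index)

    def __len__(self):
        return 1


foot = Node("Our weapon is fear")                    # a single child node
paragraphs = [Node("Spam"), Node("Inquisition")]     # repeated children: a real list

# Both attributes can now be looped over in the same way.
foot_lengths = [len(line.PCDATA) for line in foot]
paragraph_lengths = [len(line.PCDATA) for line in paragraphs]
```

Because Python falls back to __getitem__() when a class defines no __iter__(), the for-loop over a lone node simply visits that node once.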
Still more magic is possible if you want to create your very own node classes within a program. Basically, you can make an attribute node behave in any way you wish:
>>> import xml_objectify
>>> xml_str = '''<buffet>
... <plate><food>Steak</food><food>Potatoes</food></plate>
... <plate><food>Corn</food><food>Broccoli</food></plate>
... </buffet>'''
>>> open( 'buffet.xml', 'w' ).write(xml_str)
>>> class plate (xml_objectify._XO_):
...     def eat (self):
...         for food in self.food:
...             if food.PCDATA == 'Broccoli':
...                 return "If I liked Broccoli, I might have to eat it!"
...         return "Yum!"
...
>>> xml_objectify._XO_plate = plate
>>> py_obj = XML_Objectify( 'buffet.xml' ).make_instance()
>>> print py_obj.plate[1].eat()
If I liked Broccoli, I might have to eat it!
>>> print py_obj.plate[0].eat()
Yum!
Notice that the trick with the xml_objectify._XO_plate assignment is important. To get the proper magic behavior, the right magic and mangled class needs to live in that namespace.
In my opinion, it is fabulously cool to be able to grab a bunch of data from an XML file, then have a perfectly natural Python object act on that data as its own attributes, using its own methods.
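Why the mangled name has to be in place before parsing is easier to see in a toy version of the lookup. The sketch below is an assumption about the general mechanism, written for modern Python with the standard library only. The names XO, node_classes, class_for, objectify, and Plate are all hypothetical, and a plain dictionary plays the role of the module namespace.

```python
import xml.etree.ElementTree as ET


class XO:
    """Hypothetical abstract node class, standing in for _XO_."""


# Hypothetical registry standing in for the xml_objectify module namespace.
node_classes = {}


def class_for(tag):
    """Return the class registered under the mangled name, creating one if absent."""
    name = "_XO_" + tag
    if name not in node_classes:
        node_classes[name] = type(name, (XO,), {})
    return node_classes[name]


def objectify(elem):
    """Build a node object for elem; children are always kept in lists here."""
    obj = class_for(elem.tag)()
    obj.PCDATA = (elem.text or "").strip()
    for child in elem:
        obj.__dict__.setdefault(child.tag, []).append(objectify(child))
    return obj


class Plate(XO):
    def eat(self):
        if any(food.PCDATA == "Broccoli" for food in self.food):
            return "No thanks"
        return "Yum!"


# Register the custom class under the mangled name BEFORE parsing.
node_classes["_XO_plate"] = Plate

buffet = objectify(ET.fromstring(
    "<buffet>"
    "<plate><food>Steak</food><food>Potatoes</food></plate>"
    "<plate><food>Corn</food><food>Broccoli</food></plate>"
    "</buffet>"
))
results = [plate.eat() for plate in buffet.plate]
```

Tags without a registered class (buffet and food here) get a bare class generated on the fly, which is why only the plate nodes gain an eat() method.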
For working with large XML documents, Costas Malamas has contributed an invaluable enhancement. Until recently, the only way xml_objectify worked was to create a DOM tree, then recurse through that tree to generate the "Pythonic" objects. That worked fine for small XML documents, but for files of around 50k-100k it started to become absurdly slow. There appears to be a complexity order effect going on that renders the DOM-based xml_objectify unusable for large documents.
Fortunately, Malamas provided an alternative method for parsing an XML document based on the Python EXPAT bindings (EXPAT is a high-performance XML library in C). While there are still a few wrinkles to iron out in the ExpatFactory class (failure for some documents with processing instructions), in most cases, the new technique provides speedy handling of even huge XML documents.
The EXPAT technique also imposes a couple of limitations by design: You obviously lose the _dom attribute of your xml_obj (if you kept xml_obj in the first place); and you do not have an _XML attribute to play with anymore. Resolution of the latter limitation may occur in the future, however.
Choosing which parsing technique to use is straightforward:
>>> xml_obj = XML_Objectify('buffet.xml',EXPAT)
>>> xml_obj = XML_Objectify('buffet.xml',parser=DOM)
Absent a specified option, the default is the legacy DOM technique, but future code should specify explicitly in case the default changes.
EXPAT and DOM are constants within xml_objectify that simply contain matching string values.
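The difference between the two strategies can be sketched with the standard library alone. This is not the ExpatFactory code. It is a minimal illustration in modern Python, and the names Node, parse_with_expat, and foods_with_dom are hypothetical. The expat version builds objects directly from parser events, while the DOM version first materializes a whole tree and then walks it.

```python
import xml.parsers.expat
from xml.dom import minidom

XML = "<buffet><plate><food>Steak</food><food>Corn</food></plate></buffet>"


class Node:
    """Hypothetical lightweight node built straight from expat events."""

    def __init__(self, tag):
        self.tag = tag
        self.text = ""
        self.children = []


def parse_with_expat(data):
    root = None
    stack = []          # currently open elements, innermost last

    def start(tag, attrs):
        nonlocal root
        node = Node(tag)
        if stack:
            stack[-1].children.append(node)
        else:
            root = node
        stack.append(node)

    def end(tag):
        stack.pop()

    def chars(text):
        if stack:
            stack[-1].text += text

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = chars
    parser.Parse(data, True)    # True marks this as the final chunk
    return root


def foods_with_dom(data):
    # The DOM route: build the complete tree first, then query it.
    dom = minidom.parseString(data)
    return [el.firstChild.data for el in dom.getElementsByTagName("food")]


tree = parse_with_expat(XML)
expat_foods = [food.text for food in tree.children[0].children]
dom_foods = foods_with_dom(XML)
```

Both routes recover the same data. The event-driven one never holds a separate DOM tree, which is consistent with the limitation noted above that no _dom attribute is available afterward.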
In a manner similar to xml_objectify, you will need to populate the xml_pickle namespace when you want to retain the instance methods of "unpickled" objects. That sounds confusing, but some code makes it simple:
>>> import xml_pickle
>>> class MyClass:
...     def DoIt(self):
...         print "Done!"
...
>>> o1 = MyClass()
>>> o1.attr1 = 'spam'
>>> xml_str = xml_pickle.XML_Pickler(o1).dumps()
>>> o2 = xml_pickle.XML_Pickler().loads(xml_str)
>>> o2.DoIt()
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
AttributeError: 'MyClass' instance has no attribute 'DoIt'
>>> xml_pickle.MyClass = MyClass
>>> o2 = xml_pickle.XML_Pickler().loads(xml_str)
>>> o2.DoIt()
Done!
Basically, if you put the classes you want to pickle into the xml_pickle namespace before you start all the pickling/unpickling, you can restore all of your object behavior. However, notice that as with cPickle, the methods are not themselves pickled (just the attributes are). You use the class that is present at runtime for the methods (which might be more current since last pickling).
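The principle that only attributes travel, while methods come from whatever class is on hand at load time, can be shown without xml_pickle at all. The sketch below is an assumption-laden miniature in modern Python: it serializes to JSON rather than XML to stay short, and the names registry, dumps, loads, and do_it are hypothetical.

```python
import json

# Hypothetical registry standing in for the xml_pickle module namespace.
registry = {}


def dumps(obj):
    # Only the class NAME and the instance attributes are written out.
    return json.dumps({"class": type(obj).__name__, "attrs": vars(obj)})


def loads(text):
    record = json.loads(text)
    # Use the class registered right now; otherwise fall back to a bare
    # placeholder class that carries the data but has no methods.
    cls = registry.get(record["class"]) or type(record["class"], (), {})
    obj = cls.__new__(cls)
    obj.__dict__.update(record["attrs"])
    return obj


class MyClass:
    def do_it(self):
        return "Done!"


o1 = MyClass()
o1.attr1 = "spam"
text = dumps(o1)

bare = loads(text)                  # class not registered yet: data only
registry["MyClass"] = MyClass
full = loads(text)                  # same data, now with behavior
```

The same serialized text yields an object with or without do_it(), depending only on what the registry holds when loads() runs.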
Joshua Macy (with some help from Joe Kraska) has lifted a limitation of xml_pickle that I pointed out in the original article. In early versions, xml_pickle made no effort to check for cyclical references in pickled objects. Furthermore (and for the same reason), earlier versions pickled every attribute as a deep copy of its actual Python object. If you have a Python object with many substructures containing references to the same objects, the pickled size can get big quickly. Moreover, unpickled objects will contain multiple objects that, while possibly equal (a == b), are not identical (a is b) as were the prepickled originals.
Despite the gains in Macy's approach, however, it is desirable to introduce a DEEPCOPY option back into the module. The main issue with the (quite elegant) id scheme is that it is likely to be much more difficult for a generic tool to use. Maybe users of languages other than Python want to easily use xml_pickle'd objects (maybe more as hierarchical data stores than as full dynamic objects, but that's fine). Or perhaps XSLT transformations of pickled objects would be useful for certain purposes. A pickled excerpt shows the difficulty:
<?xml version="1.0"?>
<!DOCTYPE PyObject SYSTEM "PyObjects.dtd">
<PyObject class="XML_Pickler" id="1383532">
  <attr name="lst" type="list" id="1391340">
    <item type="numeric" value="1" />
    <item type="numeric" value="3.5" />
    <item type="numeric" value="2" />
    <item type="numeric" value="(4+7j)" />
  </attr>
  <attr name="lst2" type="ref" refid="1391340" />
  <attr name="num" type="numeric" value="37" />
  ...
</PyObject>
You can see that the attribute lst2 would be a bit of work to figure out in a generic way (such as with developer eyeballs). One has to pull off the refid, then search back for the corresponding id. Actually, the use of the type="ref" XML attribute may have been a bad choice. Given that it has a refid XML attribute, things might become more understandable by simply still recording type="list" as with the lst. But of course, once something is done, it is difficult to improve it without breaking backward compatibility.
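The extra step a generic tool must take can be sketched with ElementTree from the standard library. This is a minimal illustration, not part of xml_pickle. The excerpt is trimmed, its XML declaration and DOCTYPE are left out for brevity, and the helper names by_id, resolve, and summarize are hypothetical.

```python
import xml.etree.ElementTree as ET

PICKLED = """
<PyObject class="XML_Pickler" id="1383532">
  <attr name="lst" type="list" id="1391340">
    <item type="numeric" value="1" />
    <item type="numeric" value="3.5" />
  </attr>
  <attr name="lst2" type="ref" refid="1391340" />
  <attr name="num" type="numeric" value="37" />
</PyObject>
"""

root = ET.fromstring(PICKLED.strip())

# First pass: index every element that carries an id.
by_id = {el.get("id"): el for el in root.iter() if el.get("id") is not None}


def resolve(el):
    """Follow a type="ref" element to the element with the matching id."""
    if el.get("type") == "ref":
        return by_id[el.get("refid")]
    return el


def summarize(el):
    """Return raw value strings: a list for list attributes, else one string."""
    el = resolve(el)
    if el.get("type") == "list":
        return [item.get("value") for item in el]
    return el.get("value")


values = {attr.get("name"): summarize(attr) for attr in root.findall("attr")}
```

A tool that only reads type and value attributes would see nothing useful in lst2. It is the id index plus the resolve() step that recovers the shared list.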
A small caveat on references might appeal to subtle-minded hackers. refid values develop out of the Python id() of the relevant objects. The values do not mean anything inherently, but they have the nice property of being unique at any given moment of runtime. xml_pickle gives no assurance that pickling the "same" object in different runs will produce entirely identical XML files (the id values will almost certainly change). In general, the ad hoc id values will not matter to a program, but with the use of things like cryptographic hashes or CRCs as part of a process, this could be a "gotcha."
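One possible way around the hashing "gotcha" is to renumber the ids in document order before computing a digest. This is my own hedged sketch of that idea, not a feature of xml_pickle. The function name stable_digest and the two sample documents are hypothetical, and only the id and refid attributes are assumed to vary between runs.

```python
import hashlib
import xml.etree.ElementTree as ET


def stable_digest(xml_text):
    """Hash a pickled document after replacing ad hoc ids with 1, 2, 3, ..."""
    root = ET.fromstring(xml_text)
    renumbered = {}
    # First pass: assign sequential ids in document order.
    for el in root.iter():
        old = el.get("id")
        if old is not None:
            new = renumbered.setdefault(old, str(len(renumbered) + 1))
            el.set("id", new)
    # Second pass: point every refid at the renumbered id.
    for el in root.iter():
        ref = el.get("refid")
        if ref is not None:
            el.set("refid", renumbered.get(ref, ref))
    return hashlib.sha256(ET.tostring(root)).hexdigest()


# The "same" object pickled in two runs, differing only in id values.
run_a = ('<PyObject id="1383532"><attr name="lst" type="list" id="1391340" />'
         '<attr name="lst2" type="ref" refid="1391340" /></PyObject>')
run_b = ('<PyObject id="2000001"><attr name="lst" type="list" id="2000777" />'
         '<attr name="lst2" type="ref" refid="2000777" /></PyObject>')

raw_differs = (hashlib.sha256(run_a.encode()).hexdigest()
               != hashlib.sha256(run_b.encode()).hexdigest())
normalized_matches = stable_digest(run_a) == stable_digest(run_b)
```

Hashing the raw text distinguishes the two runs, while hashing the renumbered form treats them as the same document.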
This enhancement doesn't require too much description, but in response to user requests, there is the addition of Numeric arrays to the set of "picklable" types. For scientific and mathematical Python users, these types may make up important attributes of their objects. xml_pickle makes an intelligent effort to ensure that Numeric is present when supporting it. If not, it falls back to the
One lesson I have learned in developing, or maybe just shepherding the development of, these modules is the value of a Python truism: First get it right, then make it fast!
Collectively, we have reached the latter fairly well. Some optimizations to xml_pickle have brought its behavior from O(N^2) to a manageable O(N), relative to pickled object size. The trick here is that str = str + "more stuff" can be shockingly inefficient if you perform it often enough. With the EXPAT techniques, xml_objectify is similarly swift. I do not think I would have gotten something to the world quickly, nor received the amount of valuable contributions, if I had worried too much about optimization early on.
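The string-building point is easy to illustrate in isolation. The sketch below is not the actual xml_pickle change. It simply contrasts repeated concatenation with the usual collect-then-join idiom, using hypothetical function names and sample data.

```python
def build_by_concatenation(parts):
    out = ""
    for part in parts:
        # Each step may copy everything accumulated so far, so the total
        # work can grow quadratically with the size of the result.
        out = out + part
    return out


def build_by_join(parts):
    chunks = []
    for part in parts:
        chunks.append(part)     # appending to a list is cheap
    return "".join(chunks)      # one pass to assemble the final string


tags = ['<item type="numeric" value="%d" />' % n for n in range(1000)]
slow_result = build_by_concatenation(tags)
fast_result = build_by_join(tags)
```

Both functions produce identical output; only the amount of copying differs. Some interpreters optimize the concatenation case in places, so the penalty is best measured rather than assumed.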
I look forward to learning more about the practical social dynamics of open-source software development as I am able to create more tools and libraries such as the ones I've discussed in this column. It has been an interesting path, and I wonder where it will lead.
- Find author David Mertz's XML modules here: xml_objectify.py and xml_pickle.py.
- For those interested in older, or prerelease, version numbers of the modules, browse through the gnosis.cx/download directory. A variety of versions (with version numbers in the name) live here. The module that drops a version number is generally the most recent "stable" version. Plus, you can find lots of other goodies in this directory (all public domain).
- Find David Mertz's initial articles (August 2000) on xml_objectify at IBM developerWorks: XML Matters, "On the Pythonic treatment of XML documents as objects," and XML Matters, "On the Pythonic treatment of XML documents as objects (II)".
- Find other articles in David Mertz's XML Matters column.