Level: Introductory
David Mertz (mertz@gnosis.cx), Archivist, Gnosis Software, Inc.
01 Jun 2001
Since author David Mertz first introduced his handy utilities for high-level Python handling of XML documents, users and readers have contributed a number of extremely useful enhancements and suggestions. This column presents some of the changes to his module suite, as well as some tips on advanced aspects of using and customizing the modules. Code samples demonstrate py_obj._XML attributes, node attributes treated as objects and lists, py_obj magic attribute behavior, and more.
My IBM columns, tutorials, and articles have had a dual -- or maybe triple -- purpose for your humble author. First, I cherish the opportunity to share what knowledge I have with other programmers/developers, and perhaps, thereby, make a few of your own tasks a bit easier. (It is also awfully nice that I get paid money for writing these things.)
New possibilities
Another purpose of my writing is to release programming code to the public domain. In writing this code, my goal has been to illustrate general programming concepts, and I've tailored the code around these concepts. At the same time, however, my intention is to offer code to the programming community that individual developers can utilize directly for their own purposes. Through the course of releasing my code, I have received a number of valuable suggestions and enhancement patches from users of these modules. Most of the improvements from users are ones that I would have never imagined on my own, and a few are almost shocking in their insight. In this article, I present some uses of xml_pickle and xml_objectify that were not possible when I wrote XML Matters #1 and XML Matters #2, the columns that initially discussed these modules.
Enhancements to xml_objectify
One change, in particular, has been an ongoing struggle. My timing was probably slightly unlucky. Soon after I first created xml_objectify and xml_pickle in August 2000, the PyXML distribution went through several incompatible versions. Not much later, Python 2.0 came out with its own not-quite-compatible XML support. Users contributed several patches to match then-current Python XML support along the way, but in their current state xml_objectify and xml_pickle both require Python 2.0+ and its included PyXML package.
Given the effective requirement for Python 2.0 in terms of the XML packages, I also allowed a few other changes that use Python 2.0 syntax. The backward incompatibility with Python 1.5 is unfortunate, but maintaining compatibility would be too unwieldy in this case.
One of the features of xml_objectify I introduced in XML Matters #2 was the special _XML attribute that kept complete element contents (including subelement markup of character data). The default behavior is still to create an _XML attribute of a nested object only when it contains character-level markup. You can now change this behavior, however, using the function keep_containers() and the values ALWAYS, MAYBE and NEVER. For example:
>>> xml_str = '''<doc><p>Spam and eggs <b>are</b> tasty</p>
... <p>The Spanish Inquisition</p>
... <foot>Our weapon is fear</foot></doc>'''
>>> open( 'test.xml', 'w' ).write(xml_str)
>>> from xml_objectify import *
>>> py_obj = XML_Objectify( 'test.xml' ).make_instance()
>>> py_obj.p[0].PCDATA
u'Spam and eggs  tasty'
>>> py_obj.p[0]._XML # first <p> has <b> markup
u'Spam and eggs <b>are</b> tasty'
>>> py_obj.p[1].PCDATA
u'The Spanish Inquisition'
>>> py_obj.p[1]._XML # second <p> has no markup
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
AttributeError: '_XO_p' instance has no attribute '_XML'
>>> _=keep_containers(ALWAYS)
>>> py_obj = XML_Objectify( 'test.xml' ).make_instance()
>>> py_obj.p[1]._XML
u'The Spanish Inquisition'
>>> _=keep_containers(NEVER)
>>> py_obj = XML_Objectify( 'test.xml' ).make_instance()
>>> py_obj.p[0]._XML
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
AttributeError: '_XO_p' instance has no attribute '_XML'
Probably the most powerful feature of xml_objectify is also a subtle one. Many users have probably never needed (or even noticed) the class magic behavior. However, it is possible to have special classes on hand that determine the behaviors of "objectified" XML nodes. (The original article mentioned this, but it is worth seeing in action.)
I should point out a few details prior to presenting the examples. In order to avoid a sloppy conflict in the first module version, xml_objectify now "mangles" the names of the class templates for XML nodes. The "abstract" node class, _XO_, has a few "magic" behaviors in itself. Concrete node classes -- whether created dynamically or by a programmer -- have names of the form _XO_tagname (where tagname is a tag that occurs in the objectified XML document).
The "magic" that _XO_ itself provides is the __getitem__() and __len__() methods. These let you treat each node attribute as if it were a list in those contexts where list behavior is convenient. At the same time, you can refer to an "only child" node without having to subscript. For example:
>>> print type(py_obj.p), type(py_obj.foot)
<type 'list'> <type 'instance'>
>>> print py_obj.p[1].PCDATA, '...', py_obj.foot.PCDATA
The Spanish Inquisition ... Our weapon is fear
>>> for line in py_obj.p: print line.PCDATA,
...
Spam and eggs  tasty The Spanish Inquisition
>>> for line in py_obj.foot: print line.PCDATA,
...
Our weapon is fear
>>> map(lambda line: len(line.PCDATA), py_obj.foot)
[18]
>>> map(lambda line: len(line.PCDATA), py_obj.p)
[20, 23]
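As a rough illustration of how __getitem__() and __len__() let a lone node stand in for a one-element list, here is a minimal sketch. The Node class is a hypothetical stand-in, not the real xml_objectify._XO_ class, and the code is written for a current Python 3 interpreter rather than the Python 2.0 used in the sessions above.

```python
class Node:
    """Hypothetical stand-in for a node class; not the real xml_objectify._XO_."""

    def __init__(self, pcdata):
        self.PCDATA = pcdata

    def __len__(self):
        # A lone node reports itself as a sequence of length one.
        return 1

    def __getitem__(self, index):
        # Index 0 (or -1) is the node itself; any other index raises
        # IndexError, which is also what ends a for loop over the node.
        if index in (0, -1):
            return self
        raise IndexError(index)


# An "only child" is a single object; repeated children are a real list.
foot = Node("Our weapon is fear")
paragraphs = [Node("Spam and eggs  tasty"), Node("The Spanish Inquisition")]

# The same looping code works on both shapes.
foot_lengths = [len(node.PCDATA) for node in foot]
paragraph_lengths = [len(node.PCDATA) for node in paragraphs]
```

Because Python falls back to calling __getitem__() with 0, 1, 2, ... when an object defines no __iter__(), the single node can be looped over just like the list of repeated nodes.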
Still more magic is possible if you want to create your very own node classes within a program. Basically, you can make an attribute node behave in any way you wish:
>>> import xml_objectify
>>> xml_str = '''<buffet>
... <plate><food>Steak</food><food>Potatoes</food></plate>
... <plate><food>Corn</food><food>Broccoli</food></plate>
... </buffet>'''
>>> open( 'buffet.xml', 'w' ).write(xml_str)
>>> class plate(xml_objectify._XO_):
...     def eat(self):
...         for food in self.food:
...             if food.PCDATA == 'Broccoli':
...                 return "If I liked Broccoli, I might have to eat it!"
...         return "Yum!"
...
>>> xml_objectify._XO_plate = plate
>>> py_obj = XML_Objectify( 'buffet.xml' ).make_instance()
>>> print py_obj.plate[1].eat()
If I liked Broccoli, I might have to eat it!
>>> print py_obj.plate[0].eat()
Yum!
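The class substitution in the session above depends on a class being found under its mangled name. The following is only a sketch of that kind of lookup, written for Python 3; registry, BaseNode, and class_for_tag() are hypothetical names chosen for illustration, and the real module may well work differently.

```python
import types

# Hypothetical registry standing in for the xml_objectify module namespace.
registry = types.SimpleNamespace()


class BaseNode:
    """Hypothetical stand-in for the abstract _XO_ class."""


def class_for_tag(tagname):
    """Return the class registered as _XO_<tagname>, creating a plain one if absent."""
    mangled = "_XO_" + tagname
    cls = getattr(registry, mangled, None)
    if cls is None:
        # No custom class was installed, so build an empty subclass on the fly.
        cls = type(mangled, (BaseNode,), {})
        setattr(registry, mangled, cls)
    return cls


class Plate(BaseNode):
    def eat(self):
        return "Yum!"


# Installing the class under the mangled name is what gets it picked up.
registry._XO_plate = Plate

plate_node = class_for_tag("plate")()   # instance of the custom class
food_node = class_for_tag("food")()     # instance of a generated class
```

Under these assumptions, a tag with a registered class gets the custom methods, while every other tag gets a bare generated class.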
Notice that the trick with the xml_objectify._XO_plate assignment is important. To get the proper magic behavior, the class must live in that namespace under its mangled name. In my opinion, it is fabulously cool to be able to grab a bunch of data from an XML file, then have a perfectly natural Python object act on that data as its own attributes, using its own methods.
The EXPAT technique
For working with large XML documents, Costas Malamas has contributed an invaluable enhancement. Until recently, the only way xml_objectify worked was to create a DOM tree, then recurse through that tree to generate the "Pythonic" objects. That worked fine for small XML documents, but for files of around 50k-100k it started to become absurdly slow. There appears to be a complexity-order effect at work that renders the DOM approach unusable for large documents.
Fortunately, Malamas provided an alternative method for parsing an XML document based on the Python EXPAT bindings (EXPAT is a high-performance XML library in C). While there are still a few wrinkles to iron out in the ExpatFactory class (failure for some documents with processing instructions), in most cases the new technique provides speedy handling of even huge XML documents.
The EXPAT technique also imposes a couple of limitations by design: you obviously lose the _dom attribute of your xml_obj (if you kept xml_obj in the first place), and you no longer have an _XML attribute to play with. The latter limitation may be resolved in the future, however. Choosing which parsing technique to use is straightforward:
>>> xml_obj = XML_Objectify('buffet.xml',EXPAT)
>>> xml_obj = XML_Objectify('buffet.xml',parser=DOM)
Absent a specified option, the default is the legacy DOM technique, but future code should specify explicitly in case the default changes. EXPAT and DOM are constants within xml_objectify that simply contain matching string values.
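To give a feel for the difference between the two parsing styles, here is a small sketch that uses only the standard library modules xml.dom.minidom and xml.parsers.expat. It does not use xml_objectify or its ExpatFactory class; the helper names foods_via_dom() and foods_via_expat() are made up for this example, and the code targets Python 3.

```python
import xml.parsers.expat
from xml.dom import minidom

BUFFET = (
    "<buffet>"
    "<plate><food>Steak</food><food>Potatoes</food></plate>"
    "<plate><food>Corn</food><food>Broccoli</food></plate>"
    "</buffet>"
)


def foods_via_dom(xml_text):
    """Build the whole DOM tree first, then walk it."""
    dom = minidom.parseString(xml_text)
    return [
        "".join(child.data for child in node.childNodes
                if child.nodeType == child.TEXT_NODE)
        for node in dom.getElementsByTagName("food")
    ]


def foods_via_expat(xml_text):
    """Collect the same data from streaming callbacks, without building a tree."""
    foods = []
    state = {"in_food": False}

    def start(name, attrs):
        if name == "food":
            state["in_food"] = True
            foods.append("")

    def end(name):
        if name == "food":
            state["in_food"] = False

    def chars(data):
        # Character data may arrive in several chunks, so accumulate it.
        if state["in_food"]:
            foods[-1] += data

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = chars
    parser.Parse(xml_text, True)
    return foods
```

Both functions return the same list of food names. The callback version never holds a full document tree, which is the general reason a streaming parser copes better with large inputs.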
Enhancements to xml_pickle
In a manner similar to xml_objectify, you will need to populate the xml_pickle namespace when you want to retain the instance methods of "unpickled" objects. That sounds confusing, but some code makes it simple:
>>> import xml_pickle
>>> class MyClass:
...     def DoIt(self):
...         print "Done!"
...
>>> o1 = MyClass()
>>> o1.attr1 = 'spam'
>>> xml_str = xml_pickle.XML_Pickler(o1).dumps()
>>> o2 = xml_pickle.XML_Pickler().loads(xml_str)
>>> o2.DoIt()
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
AttributeError: 'MyClass' instance has no attribute 'DoIt'
>>> xml_pickle.MyClass = MyClass
>>> o2 = xml_pickle.XML_Pickler().loads(xml_str)
>>> o2.DoIt()
Done!
Basically, if you put the classes you want to pickle into the xml_pickle namespace before you start any pickling or unpickling, you can restore all of your object behavior. However, notice that as with pickle and cPickle, the methods are not themselves pickled (just the attributes are). The methods come from the class that is present at runtime (which may have been updated since the object was pickled).
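The idea that only attributes are stored, while methods come from whatever class is available at load time, can be sketched in a few lines of Python 3. The restore() helper and the (class name, attribute dictionary) storage format below are assumptions made for illustration, not xml_pickle's actual internals.

```python
def restore(class_name, attributes, namespace):
    """Rebuild an instance from stored attributes, using whatever class is
    registered under class_name at load time (hypothetical helper)."""
    cls = namespace.get(class_name)
    if cls is None:
        # Unknown class: fall back to an empty placeholder with the same name.
        cls = type(class_name, (), {})
    obj = cls.__new__(cls)            # create the instance without calling __init__
    obj.__dict__.update(attributes)   # reattach the stored state
    return obj


class MyClass:
    def do_it(self):
        return "Done!"


stored = ("MyClass", {"attr1": "spam"})  # only the state was serialized

bare = restore(*stored, namespace={})                     # class not registered
full = restore(*stored, namespace={"MyClass": MyClass})   # class registered
```

In this sketch, bare carries the attribute but no methods, while full has both, mirroring the two o2 objects in the session above.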
Cyclical references and deep copy
Joshua Macy (with some help from Joe Kraska) has lifted a limitation of xml_pickle that I pointed out in the original article. In early versions, xml_pickle made no effort to check for cyclical references in pickled objects. Furthermore (and for the same reason), earlier versions pickled every attribute as a deep copy of its actual Python object. If you have a Python object with many substructures containing references to the same objects, the pickled size can get big quickly. Moreover, unpickled objects will contain multiple objects that, while possibly equal (a == b), are not identical (a is b) as the prepickled originals were.
Despite the gains in Macy's approach, however, it is desirable to introduce a DEEPCOPY option back into the module. The main issue with the (quite elegant) refid/id scheme is that it is likely to be much more difficult for a generic tool to use. Maybe users of languages other than Python want to easily use xml_pickle'd objects (maybe more as hierarchical data stores than as full dynamic objects, but that's fine). Or perhaps XSLT transformations of pickled objects would be useful for certain purposes. A pickled excerpt shows the difficulty:
<?xml version="1.0"?>
<!DOCTYPE PyObject SYSTEM "PyObjects.dtd">
<PyObject class="XML_Pickler" id="1383532">
  <attr name="lst" type="list" id="1391340">
    <item type="numeric" value="1" />
    <item type="numeric" value="3.5" />
    <item type="numeric" value="2" />
    <item type="numeric" value="(4+7j)" />
  </attr>
  <attr name="lst2" type="ref" refid="1391340" />
  <attr name="num" type="numeric" value="37" />
  ...
</PyObject>
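To make the refid lookup in the excerpt above concrete, here is a sketch of how a generic reader might resolve it, using the standard xml.etree.ElementTree module under Python 3. The load_attrs() function is hypothetical and handles only a trimmed copy of the excerpt (no DOCTYPE, no complex number, every numeric value read as a float); it is not how xml_pickle itself unpickles.

```python
import xml.etree.ElementTree as ET

PICKLED = """\
<PyObject class="XML_Pickler" id="1383532">
  <attr name="lst" type="list" id="1391340">
    <item type="numeric" value="1" />
    <item type="numeric" value="3.5" />
  </attr>
  <attr name="lst2" type="ref" refid="1391340" />
  <attr name="num" type="numeric" value="37" />
</PyObject>
"""


def load_attrs(xml_text):
    """Return {name: value} for the top-level attrs, resolving type="ref"."""
    root = ET.fromstring(xml_text)
    by_id = {}      # id string -> already-built Python object
    result = {}
    pending = []    # (name, refid) pairs to resolve after the first pass
    for attr in root.findall("attr"):
        kind = attr.get("type")
        if kind == "ref":
            pending.append((attr.get("name"), attr.get("refid")))
            continue
        if kind == "list":
            # Simplification: every numeric item is read as a float.
            value = [float(item.get("value")) for item in attr.findall("item")]
        else:
            value = float(attr.get("value"))
        if attr.get("id") is not None:
            by_id[attr.get("id")] = value
        result[attr.get("name")] = value
    for name, refid in pending:
        result[name] = by_id[refid]   # the same object, not a copy
    return result


attrs = load_attrs(PICKLED)
```

After loading, attrs["lst2"] and attrs["lst"] are one and the same list, so identity survives the round trip. The price is the second pass and the id table, which a simple XSLT stylesheet or a tool in another language would also have to reproduce.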
You can see that the attribute lst2 would be a bit of work to figure out in a generic way (such as with developer eyeballs). One has to pull off the refid, then search back for the corresponding id. Actually, the use of the type="ref" XML attribute may have been a bad choice. Given that the element already has a refid XML attribute, things might be more understandable if lst2 simply recorded type="list", as its referent lst does. But of course, once something is done, it is difficult to improve it without breaking backward compatibility.
A small caveat on references might appeal to subtle-minded hackers. The id/refid values are derived from the Python id() of the relevant objects. The values do not mean anything inherently, but they have the nice property of being unique at any given moment of runtime. xml_pickle gives no assurance that pickling the "same" object in different runs will produce entirely identical XML files (the id values will almost certainly change). In general, the ad hoc id values will not matter to a program, but if something like a cryptographic hash or CRC of the pickled file is part of a process, this could be a "gotcha."
One further enhancement doesn't require much description: in response to user requests, Numeric arrays have been added to the set of "picklable" types. For scientific and mathematical Python users, these types may make up important attributes of their objects. xml_pickle makes an intelligent effort to ensure that Numeric is present when supporting it. If not, it falls back to the array module.
A Python truism
One lesson I have learned in developing, or maybe just shepherding the development of, these modules is the value of a Python truism: First get it right, then make it fast!
Collectively, we have reached the latter fairly well. Some optimizations to xml_pickle have brought its behavior from O(N^2) to a manageable O(N), relative to pickled object size. The trick here is that str = str + "more stuff" can be shockingly inefficient if you perform it often enough. With the EXPAT techniques, xml_objectify is similarly swift. I do not think I would have gotten something to the world quickly, nor received the amount of valuable contributions, if I had worried too much about optimization early on. I look forward to learning more about the practical social dynamics of open-source software development as I am able to create more tools and libraries such as the ones I've discussed in this column. It has been an interesting path, and I wonder where it will lead.
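The string-building point above can be illustrated with a short Python 3 sketch. The two function names are invented for this example and are not taken from xml_pickle. Whether the first version is really quadratic depends on the interpreter, since some CPython releases optimize this pattern in certain cases, so treat it as an illustration of the idiom rather than a benchmark.

```python
def build_by_concatenation(chunks):
    """Repeated str + str: each step may copy everything built so far."""
    out = ""
    for chunk in chunks:
        out = out + chunk
    return out


def build_by_join(chunks):
    """Collect the pieces in a list and join them once at the end."""
    parts = []
    for chunk in chunks:
        parts.append(chunk)
    return "".join(parts)


# A thousand small XML fragments, similar in spirit to pickled items.
chunks = ['<item value="%d" />\n' % i for i in range(1000)]
```

Both functions produce identical output; the join version does a single final copy instead of a potentially growing copy on every iteration.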
About the author
David Mertz became disenchanted with the academy and became a technical journalist: post hoc ergo propter hoc. You can contact David at mertz@gnosis.cx; his life can be pored over at http://gnosis.cx/dW/. Suggestions and recommendations on this, past, or future columns are welcomed.