XML Matters #1 introduced my project for creating more seamless and natural integration between XML and Python. The Resources section provides links to other developerWorks articles in which I discuss general Python programming techniques and other XML/Python topics.
Because of the asymmetries between XML and Python, the project -- at least initially -- contains two separate modules: xml_pickle for representing arbitrary Python objects in XML, and xml_objectify for native representation of XML documents as Python objects. This article addresses xml_objectify.
In Python, modules and packages such as xmllib, xml.sax, pyxie, and xml.dom provide ways of handling XML documents that are common in the XML community. You may be familiar with similar modules and libraries available for other programming languages. In fact, many of these modules are based on language-neutral XML standards, and they commonly implement an XML-centric way of handling documents and objects.
The Python implementations of general XML protocols give you the flexibility to program in different ways. For example, you might use portable standards such as DOM so that programmers using one language can easily work with DOM-oriented code written in another language. However, there are times when a Python programmer may prefer to code in ways that are much more like "normal" Python. In many cases, the XML conceptual framework seems like it's tacked on to Python, rather than being an integral part of it. Thus, I developed a set of "Pythonic" modules for working with XML documents.
xml_objectify is simple to use, and its usage is documented in the module's docstrings. Let's take a quick look at some sample code:
Creating a Python object from an XML document
    from xml_objectify import XML_Objectify
    xml_obj = XML_Objectify('address.xml')
    py_obj = xml_obj.make_instance()
As you can see, there are two steps in creating a native Python object from a generic XML document. First, you create an intermediate DOM-like factory object (that is, an object used to create other objects). Second, you generate one or more Python object instances from the XML_Objectify instance. Note that you should use xml_pickle to handle special PyObjects.dtd format documents. (See XML Matters #1 for information about xml_pickle.)
You could also do both steps on the same line. For example:
Creating an XML/Python object inline
    py_obj = XML_Objectify('address.xml').make_instance()
Of course, in the latter case, the factory object is not preserved to produce more native objects, and its ._dom data member, which contains a full DOM instance, is also cleared.
For comparison, the following example shows that creating a DOM object can be just as simple in Python:
Creating a DOM object from an XML document
    from xml.dom.utils import FileReader
    dom_obj = FileReader().readXml(open('address.xml'))
FileReader().readXml() requires an actual file object, while XML_Objectify() accepts either a file object or a plain filename. In either case, creating the object is a two-line action.
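For readers without the xml.dom.utils module, the same filename-or-file-object flexibility can be seen in the standard library's xml.dom.minidom, whose parse() function accepts either one. The following is only a sketch: the file contents are a hypothetical stand-in for address.xml, and minidom is a different DOM implementation from the one used in this article.

```python
import os
import tempfile
from xml.dom.minidom import parse

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical one-line stand-in for address.xml.
    path = os.path.join(tmp, "address.xml")
    with open(path, "w", encoding="utf-8") as fh:
        fh.write('<addressbook><person name="Bob"/></addressbook>')

    by_name = parse(path)        # a plain filename
    with open(path, "rb") as fh:
        by_file = parse(fh)      # an already opened file object

    root_names = (by_name.documentElement.tagName,
                  by_file.documentElement.tagName)

print(root_names)
```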
The difference between using the xml_objectify module and the xml.dom package is in the type of object you wind up with. A Python DOM object is a genuine Python object, but its attributes and methods do not correspond to the data and structure of the original XML document as closely as those of the XML_Objectify object. The Python DOM object's attributes are generally nested .children lists, which are not too helpful semantically. To access the same XML attribute in the sample document, you have a choice between using the first line with xml_objectify or the next four lines with DOM. This is illustrated below:
Using xml.dom versus xml_objectify Python objects
    print py_obj.person[1].address.city
    print dom_obj.get_childNodes().get_childNodes().\
          get_childNodes().get_attributes()['city'].value
    print dom_obj._node.children.children.children.\
          attributes['city'].children.value
A DOM tree is organized as a strictly ordered tree of nodes. It isn't hard to enumerate these nodes, but it's quite cumbersome to refer to specific ones. What makes matters worse is that some nodes are whitespace text and processing instruction nodes (which you rarely care about), so finding the subtags in the node list is mostly trial and error. In the example above, the native attributes (for example, .children) and the DOM-style methods (for example, .get_childNodes()) are used in different print statements. Either way, it isn't easy to see what data in the XML document is being referenced.
In contrast, the first print statement in the example above pretty much documents itself. The only minor caveat is that you must use Python's zero-based list indexing. Beyond that, the line simply says: "Print the city of the address of the second person in the addressbook." ("New York" is what is printed by each statement.) To help you further, the class of py_obj is named addressbook, which corresponds to the XML document's root element. And every attribute that might contain more than simple text is an instance of a class named according to the XML tag defining it.
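The whitespace problem is easy to reproduce with the standard library's xml.dom.minidom, a different DOM implementation from the xml.dom package discussed here. This is a minimal sketch that assumes a small, hypothetical address book, since the article's address.xml is not reproduced in the text:

```python
from xml.dom.minidom import parseString

# Hypothetical stand-in for address.xml; the real file's layout may differ.
DOC = """<addressbook>
  <person name="Alice">
    <address city="Boston"/>
  </person>
  <person name="Bob">
    <address city="New York"/>
  </person>
</addressbook>"""

root = parseString(DOC).documentElement

# The indentation between tags shows up as text nodes in the child list.
kinds = [node.nodeName for node in root.childNodes]
print(kinds)

# Filter by node type before relying on positional indexes.
people = [n for n in root.childNodes if n.nodeType == n.ELEMENT_NODE]
second_address = [n for n in people[1].childNodes
                  if n.nodeType == n.ELEMENT_NODE][0]
city = second_address.getAttribute("city")
print(city)
```

Which index holds which tag depends on how the document happens to be indented, which is why positional access to raw child lists is fragile.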
As you can see, xml.dom is generally hard to use and its syntax is obscure. Native Python objects are much easier to use. Note that xml_objectify does make wide use of DOM internally. In fact, every XML_Objectify instance contains a ._dom attribute that is a DOM tree for the XML document opened. However, the instance that .make_instance() creates does not contain any DOM, and its class is named after the root tag.
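To make the idea concrete, here is a rough sketch of how a tree of native objects might be built, with each class named after its tag. It is not the xml_objectify implementation: it uses the standard library's xml.etree.ElementTree rather than DOM, the make_instance() function here is a hypothetical stand-alone helper, and the sample document is an assumed stand-in for address.xml.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for address.xml.
DOC = """<addressbook>
  <person name="Alice"><address city="Boston"/></person>
  <person name="Bob"><address city="New York"/></person>
</addressbook>"""

def make_instance(elem):
    """Return a plain object whose class is named after elem's tag."""
    obj = type(elem.tag, (object,), {})()
    # XML attributes become string-valued Python attributes.
    for name, value in elem.attrib.items():
        setattr(obj, name, value)
    # A subtag becomes an attribute; repeated subtags become a list.
    for child in elem:
        child_obj = make_instance(child)
        if child.tag not in vars(obj):
            setattr(obj, child.tag, child_obj)
        elif isinstance(getattr(obj, child.tag), list):
            getattr(obj, child.tag).append(child_obj)
        else:
            setattr(obj, child.tag, [getattr(obj, child.tag), child_obj])
    return obj

py_obj = make_instance(ET.fromstring(DOC))
print(type(py_obj).__name__)
print(py_obj.person[1].address.city)
```

The resulting object holds no reference to the parsed tree; only tags and attributes survive, as plain Python attributes.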
With xml_objectify, you can take advantage of all your existing generic functions. pyobj_printer() is a sample generic function included with the xml_objectify module. This function produces a readable, recursive representation of any Python object. By representing your XML documents as native Python objects, you can reuse existing functions that handle Python objects in abstract ways. Of course, a DOM object is a Python object of sorts, but it's difficult to use generic functions with these objects in a useful way. For example, because a DOM object's attributes are nested .children lists, using a generic function like pyobj_printer() will not produce very useful output.
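As an illustration of what "generic function" means here, the sketch below walks any object's instance attributes recursively. It is not the pyobj_printer() shipped with the module; describe() is a hypothetical helper, and the sample objects are built by hand with types.SimpleNamespace instead of being read from XML.

```python
from types import SimpleNamespace

def describe(obj, indent=0):
    """Return lines describing obj's instance attributes, recursively."""
    pad = " " * indent
    lines = []
    for name, value in sorted(vars(obj).items()):
        items = value if isinstance(value, list) else [value]
        for item in items:
            if hasattr(item, "__dict__"):
                # A nested object: print its name, then recurse.
                lines.append(f"{pad}{name}:")
                lines.extend(describe(item, indent + 2))
            else:
                lines.append(f"{pad}{name} = {item!r}")
    return lines

# Hand-built objects standing in for an objectified address book.
book = SimpleNamespace(person=[
    SimpleNamespace(name="Alice", address=SimpleNamespace(city="Boston")),
    SimpleNamespace(name="Bob", address=SimpleNamespace(city="New York")),
])
report = describe(book)
print("\n".join(report))
```

Nothing in describe() knows about XML, which is the point: once a document is a tree of ordinary objects, functions written for ordinary objects apply to it.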
xml_objectify offers a subtle trick in that it only dynamically defines a class for an attribute value if that class has not already been defined. This lets you define classes with complex behavior and attributes that you can place specific XML document contents into. Say for example that the class person is predefined with various methods (including an .__init__() method, if needed). Each "person" in the XML addressbook imported in the example above will have whatever behaviors it has been given, including methods that operate on the data placed in the instance. Of course, if you have not predefined a class before running XML_Objectify() on the document, the class is just a container for the attributes defined in the actual XML.
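The "define a class only if it is missing" trick can be sketched as follows. This is an assumption-laden illustration rather than the module's code: class_for() and make_instance() are hypothetical helpers, the lookup uses an explicit dictionary rather than whatever namespace xml_objectify consults, and parsing is done with xml.etree.ElementTree.

```python
import xml.etree.ElementTree as ET

class person:
    """Predefined behavior for <person> tags (hypothetical example class)."""
    def label(self):
        return f"{self.name} ({self.address.city})"

def class_for(tag, namespace):
    """Reuse a class already registered under this name, else create a bare one."""
    cls = namespace.get(tag)
    if not isinstance(cls, type):
        cls = type(tag, (object,), {})
        namespace[tag] = cls
    return cls

def make_instance(elem, namespace):
    """Instantiate the class for elem's tag and fill in its data."""
    obj = class_for(elem.tag, namespace)()
    vars(obj).update(elem.attrib)
    for child in elem:
        setattr(obj, child.tag, make_instance(child, namespace))
    return obj

classes = {"person": person}
xml_text = '<person name="Bob"><address city="New York"/></person>'
bob = make_instance(ET.fromstring(xml_text), classes)
print(bob.label())
```

Here person keeps its label() method, while address, which was never predefined, ends up as a bare container class.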
XML tags are normally block-level, but some are character-level. In my opinion, the natural Python representation is different for each case. A block-level subtag is easily represented by an attribute of the parent tag that is named after the subtag. The value of the subtag-attribute is a new Python object, which is also of a type named after the subtag. For example, a person might have, in a hierarchical sense, an address and misc-info. With Python, you can refer to these as person.address and person.misc_info.
With a character-level tag, where the contents of a tag are a mixture of text data and markup of that data (often typographic), the subtags are not really something the parent tag has in a hierarchy. For example, a misc_info object does not really have ital attributes. So, how should the following type of XML be represented?
<misc-info>One of the <ital>most</ital> talented actresses on TV.</misc-info>
xml_objectify adds a special attribute called ._XML to objects/tags that appear to contain marked-up character data. This attribute contains the literal XML inside a tag. For example, the pyobj_printer() function displays this literal XML instead of recursive attributes if the ._XML attribute exists for a given nested object. However, the standard recursive subtag-object creation is still performed, so you can look at whatever attributes and structures are most relevant.
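One way to picture the ._XML idea is a pair of helpers that detect mixed content and recover the literal markup inside a tag. The sketch below uses xml.etree.ElementTree; inner_xml() and has_mixed_content() are hypothetical names, and the test xml_objectify actually applies to decide that data "appears to" be marked-up text may well differ.

```python
import xml.etree.ElementTree as ET

def inner_xml(elem):
    """Return the literal markup between elem's start and end tags."""
    parts = [elem.text or ""]
    for child in elem:
        # tostring() includes the text that follows the child (its tail).
        parts.append(ET.tostring(child, encoding="unicode"))
    return "".join(parts)

def has_mixed_content(elem):
    """True when elem holds subtags interleaved with non-whitespace text."""
    if len(elem) == 0:
        return False
    texts = [elem.text] + [child.tail for child in elem]
    return any(t and t.strip() for t in texts)

misc = ET.fromstring(
    "<misc-info>One of the <ital>most</ital> "
    "talented actresses on TV.</misc-info>"
)
block = ET.fromstring('<person>\n  <address city="New York"/>\n</person>')

literal = inner_xml(misc)
print(has_mixed_content(misc), has_mixed_content(block))
print(literal)
```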
Many XML documents contain processing instructions and/or comments along with their tags and character data contents. However, the native Python object created by the .make_instance() method of an XML_Objectify object contains only the contents of the document root tag. Furthermore, XML comments are ignored; only tag attributes and character data are represented.
In the "Creating a Python object from an XML document" example above, if you preserved the original XML_Objectify object (xml_obj), you could access its .processing_instruction attribute, or even its ._dom attribute, to see what was left out of the native Python object.
All XML attributes are converted to Python object attributes of string type. Currently xml_objectify does not represent XML enumerated or numeric types for attributes. Such capabilities might be added to later versions, but these would generally require a DTD, which xml_objectify does not assume.
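Because every attribute value arrives as a string, any numeric interpretation is left to the application. A small sketch of one way to do that follows; the tag, the attribute names, and the CONVERTERS table are all invented for illustration and are not part of xml_objectify.

```python
import xml.etree.ElementTree as ET

# Hypothetical tag and attributes.
elem = ET.fromstring('<item qty="3" price="9.50" status="shipped"/>')

# Without a DTD or schema, every attribute value is a plain string.
raw_types = {type(value) for value in elem.attrib.values()}

# A per-application table of converters; unlisted names stay strings.
CONVERTERS = {"qty": int, "price": float}
typed = {name: CONVERTERS.get(name, str)(value)
         for name, value in elem.attrib.items()}
print(typed)
```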
XML subtags are represented either by Python attributes of object type or by lists of such objects, depending on whether a particular tag contains one or several subtags of the same type. For example, in the first address.xml example above, one person's contact information may include one home phone, while another person's contact information may include zero or several. Correspondingly, some contact_info objects will have no .home_phone attribute, while some will have a .home_phone attribute containing a home_phone object, and some will have a .home_phone attribute containing a list of home_phone objects. Although it would be possible to impose more order if a DTD were used, in my opinion, Python applications require this kind of dynamic ability.
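Code that consumes such objects has to cope with all three shapes. One possible approach is a small normalizing helper; as_list() below is a hypothetical function, not part of the module, and the contact_info objects are built by hand for the example.

```python
from types import SimpleNamespace

def as_list(obj, name):
    """Return obj.<name> as a list, whether it is absent, single, or a list."""
    value = getattr(obj, name, None)
    if value is None:
        return []
    return value if isinstance(value, list) else [value]

# Hand-built stand-ins for three contact_info objects.
no_phone = SimpleNamespace()
one_phone = SimpleNamespace(home_phone=SimpleNamespace(number="555-0100"))
two_phones = SimpleNamespace(home_phone=[
    SimpleNamespace(number="555-0101"),
    SimpleNamespace(number="555-0102"),
])

counts = [len(as_list(c, "home_phone"))
          for c in (no_phone, one_phone, two_phones)]
print(counts)
```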
Be aware that the set of legal Python names is smaller than the set of legal XML names. Therefore, the XML names of tags or attributes are sometimes modified. xml_objectify transforms dashes, colons, and the pound/hash mark into underscores. The module does not handle any further namespace collision. For example, if your XML document has the tags <spam-eggs>, <spam_eggs>, <spam:eggs>, and <spam#eggs>, xml_objectify will create Python objects that do not correctly represent your XML document. In most cases, this is not a problem, since people are unlikely to have XML documents with these kinds of conflicting tags.
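The collision is easy to see with a stand-alone version of the described substitution. The py_name() function is a hypothetical sketch of the rule as stated above, not the module's own code.

```python
def py_name(xml_name):
    """Map an XML name to a Python name by replacing '-', ':' and '#'."""
    return xml_name.translate(str.maketrans("-:#", "___"))

names = ["spam-eggs", "spam_eggs", "spam:eggs", "spam#eggs"]
mapped = {name: py_name(name) for name in names}
print(mapped)

# All four distinct XML names collapse onto one Python name.
distinct = set(mapped.values())
print(distinct)
```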
Currently, no capabilities exist for converting native Python objects back to XML documents with the same structure as those read in. The problem occurs because xml_objectify deliberately drops information about order in XML documents to produce friendlier Python objects. Python attributes do not have any predetermined order, but XML tags and attributes may be required in a specific sequence. Even where XML tags are not required to occur in specific order, the order may be semantically important. (Note that in the case of repeated common subtags, Python lists maintain order.) In order to convert back to XML, we would either need to choose arbitrary orders, or preserve order information within the native Python object, making it seem less like Python.
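The kind of information that gets dropped can be shown with a few lines. In this sketch, which uses xml.etree.ElementTree and an invented document, grouping subtags by name keeps the order within each group but loses how the different tags were interleaved.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical document in which the order of subtags matters.
doc = ET.fromstring("<doc><para>one</para><image/><para>two</para></doc>")

# The parsed tree still knows the original sequence of subtags.
document_order = [child.tag for child in doc]

# Grouping by tag name, as attribute-style access requires, does not.
grouped = defaultdict(list)
for child in doc:
    grouped[child.tag].append(child.text)

print(document_order)
print(dict(grouped))
```

From the grouped form alone, there is no way to tell that the image sat between the two paragraphs.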
One option in reconstructing the dropped information in Python objects might be to enforce a DTD when converting back to XML. Even if I pursued this option, questions would still exist about how to handle attributes that are added, deleted, or modified at Python runtime. Modifying a Python object could produce something that does not conform to the original XML document's DTD. However, I will add capabilities to xml_objectify if users identify specific needs.
- An Introduction to XML Tools for Python: Charming Python #1
- A Closer Look at Python's
- On the Pythonic treatment of XML documents as objects
- Revisiting xml_pickle and xml_objectify
- Most current version of
- The Python Special Interest Group on XML
- The World Wide Web Consortium's DOM Level 1 Recommendation
- Files used and mentioned in this article
Find other articles in David Mertz's XML Matters column.
David Mertz wanted to call this column "Ex nihilo XML fit", if only for the alliteration; but he thinks his publisher shudders at the summoned imagery of a chthonic golem. David Mertz can be reached at firstname.lastname@example.org; his life pored over at http://gnosis.cx/dW/. Suggestions and recommendations on this, past, or future columns are welcomed. Check out David's book "Text Processing in Python" at http://gnosis.cx/TPiP/.