David Mertz, Ph.D (mertz@gnosis.cx), Data Masseur, Gnosis Software, Inc.
01 Aug 2000 In the second installment of his new "XML Matters" column -- and as part of his ongoing quest to create a more seamless integration between XML and Python -- David Mertz presents the xml_objectify module. David describes how to use xml_objectify and the advantages of using this "Pythonic" module for working with XML documents as objects.
Introduction to the Project
XML
Matters #1 introduced my project for creating more seamless and natural
integration between XML and Python. The Resources
section provides links to other developerWorks articles in which I
discuss general Python programming techniques and other XML/Python topics.
 |
Obtaining compatible XML-SIG updates
The XML-SIG distribution changes fairly frequently in beta versions. These changes are likely to affect how
xml_objectify
functions. Therefore, you can
download an XML-SIG version known to be compatible with
xml_objectify
from Resources.
When the XML-SIG
distribution is officially released and/or when the XML package is a
part of an official Python release, the current
xml_objectify
will be updated to work with the
official release. See Resources for the most current
xml_objectify.
|
|
Because of the asymmetries between XML and Python, the project -- at least
initially -- contains two separate modules:
xml_pickle
for
representing arbitrary Python objects in XML, and
xml_objectify
for native representation of XML documents
as Python objects. This article addresses
xml_objectify
.
In Python, modules and packages such as
xmllib
,
xml.sax
,
pyxie
, and
xml.dom
provide ways of handling XML documents that are
common in the XML community. You may be familiar with similar modules and
libraries available for other programming languages. In fact, many of these modules are based on language-neutral
XML standards, and they commonly
implement an XML-centric way of handling documents and objects.
The Python implementations of general XML protocols give you the flexibility to
program in different ways. For example, you might use portable standards such as
DOM so that programmers using one language can easily work with DOM-oriented
code written in another language. However, there are times when a Python
programmer may prefer to code in ways that are much more like "normal" Python.
In many cases, the XML conceptual framework seems like it's tacked on to
Python, rather than being an integral part of it. Thus, I developed a set of
"Pythonic" modules for working with XML documents.
Jumping ahead: how to use xml_objectify
Using
xml_objectify
is simple and well-documented in module
docstring comments. Let's take a quick look at some sample code:
Creating a Python object from an XML document
<FONT color="#3333cc"><B>from</B></FONT> xml_objectify <FONT color="#3333cc"><B>import</B></FONT> XML_Objectify
xml_obj = XML_Objectify(<FONT color="#115511">'address.xml'</FONT>)
py_obj = xml_obj.make_instance()
|
As you can see, there are two steps in creating a native Python object from a
generic XML document. First, you create an intermediate DOM-like factory object
(that is, an object used to create other objects). Second, you generate one or more
Python object instances from the
XML_Objectify
instance. Note that
you should use
xml_pickler
to handle special
PyObjects.dtd
format documents. (See XML Matters 1 for information about
xml_pickle
.)
You could also do both steps on the same line. For example:
Creating an XML/Python object inline
py_obj = XML_Objectify(<FONT color="#115511">'address.xml'</FONT>).make_instance()
|
Of course, in the latter case, the factory object is not preserved to produce
more native objects, and its
._dom
data member, which contains a
full DOM instance, is also cleared.
For comparison, the following example shows that creating a DOM object can be
just as simple in Python:
Creating a DOM object from an XML document
from xml.dom.utils import FileReader
dom_obj = FileReader().readXml(open('address.xml'))
|
FileReader().readXml()
requires an actual file object, while
XML_Objectify()
accepts either a file object or a plain filename.
In either case, creating the object is a two-line action.
The difference between using the
xml_objectify
module
and the
xml.dom
package is in the type of object you wind
up with. A Python DOM object is a genuine Python object, but
its attributes and methods do not correspond to the data and structure of the
original XML document as closely as those of the
XML_Objectify
object. The Python DOM object's attributes are generally nested
.children
lists, which are not too helpful semantically. To access the same XML attribute in the sample
document, you have a choice between using the first line with
xml_objectify
or the next four lines with DOM. This is illustrated below:
Using [xml.dom] versus [xml_objectify] Python objects
<FONT color="#3333cc"><B>print</B></FONT> py_obj.person[1].address.city
<FONT color="#3333cc"><B>print</B></FONT> dom_obj.get_childNodes()[1].get_childNodes()[3].\
get_childNodes()[3].get_attributes()[<FONT color="#115511">'city'</FONT>].value
<FONT color="#3333cc"><B>print</B></FONT> dom_obj._node.children[1].children[3].children[3].\
attributes[<FONT color="#115511">'city'</FONT>].children[0].value
|
A DOM tree is organized as a strictly ordered tree of nodes. It isn't hard
to enumerate over these nodes, but it's quite cumbersome to refer to specific
ones. What makes matters worse is that some nodes are whitespace text and
processing instruction nodes (which you rarely care about), so finding the
subtags in the node list is mostly trial and error. In the example above, access
to the native attributes (for example,
.children
) and the DOM-style
methods (for example,
.get_childNodes()
) are used in different
print
statements. Either way, it isn't easy to see what data in
the XML document is being referenced.
In contrast, the first
print
statement in the example above
pretty much documents itself. The only minor caveat is that you must use
Python's zero-based list indexing. Beyond that, the line simply says: "Print the
city of the address of the second person in the addressbook." ("New York" is
what is printed by each statement.) To help you further,
py_obj.__class__
is "addressbook," which corresponds to the XML
document's root element. And every attribute that might contain more than simple
text is an instance of a class named according to the XML tag defining it.
As you can see,
xml.dom
is generally hard to use and
its syntax is obscure. Native Python objects are much easier to use. Note that
xml_objectify
does make wide use of DOM internally. In
fact, every
XML_Objectify
instance contains a
._dom
attribute that is a DOM tree for the XML document opened. However, the instance
.make_instance
that is created does not contain any DOM, and is the class
type of the root tag.
Design considerations, limitations, and caveats
Code introspection
With
xml_objectify
, you can take advantage of all your existing
generic functions.
pyobj_printer()
is a sample generic function included with the
xml_objectify
module. This
function produces a readable, recursive representation of any Python
object. By representing your XML documents as native Python documents, you can
reuse existing functions that handle Python objects in abstract ways. Of
course, a DOM object is a Python object of sorts, but it's difficult to use
generic functions with these objects in a useful way. For example, because a DOM
object's attributes are nested
.children
lists, using a generic
function like the
pyobj_printer()
will not produce very useful
output.
Tricks with class behavior
xml_objectify
offers a subtle trick in that it only dynamically defines a class for an attribute value if that class has not already been defined. This lets you define classes with complex behavior and attributes that you can place specific XML document contents into. Say for example that the class
person
is
predefined with various methods (including an
.__init__()
method,
if needed). Each "person" in the XML addressbook imported in the example above
will have whatever behaviors it has been given, including methods that operate
on the data placed in the instance. Of course, if you have not predefined a
class before running
XML_Objectify()
on the document, the
class is just a container for the attributes defined in the actual XML.
Character markup handling
XML tags are normally block-level, but
some are character-level. In my opinion, the natural Python representation is
different for each case. A block-level subtag is easily represented by an
attribute of the parent tag that is named after the subtag. The value of
the subtag-attribute is a new Python object, which is also of a type named after the
subtag. For example, a person might have, in a hierarchical sense, an address
and misc-info. With Python, you can refer to these as person.address and
person.misc_info.
With a character-level tag, where the contents of a tag are a
mixture of text data and markup of that data (often typographic), the subtags
are not really something the parent tag has in a hierarchy. For example, a misc_info
object does not really have ital attributes. So, how should the following type of XML be
represented?
<misc-info>One of the <ital>most</ital> talented actresses on TV.</misc-info> |
xml_objectify adds a special attribute called ._XML
to objects/tags that appear to contain marked-up character
data. This attribute contains the literal XML inside a tag. For example, the
pyobj_printer() function displays this
literal XML instead of recursive attributes if the
._XML
attribute
exists for a given nested object. However, the standard recursive subtag-object
creation is still performed, so you can look at whatever attributes and
structures are most relevant.
Native Python objects contain root document only
Many XML
documents contain processing instructions and/or comments along with their tags
and character data contents. However, the native Python object created by the
.make_instance()
method of an
XML_Objectify
object
contains only the contents of the document root tag. Furthermore, XML comments
are ignored; only tag attributes and character data are represented.
In the Creating a Python object from an XML document
example above, if you preserved the original
XML_Objectify
object
(
xml_obj
), you could access its
.processing_instruction
attribute, or even its
._dom
attribute, to see what was left out of the native Python object.
Attribute type simplification
All XML attributes are converted to
Python object attributes of string type. Currently Python does not represent XML
enumerated or numeric types for attributes. Such capabilities might be added to
later versions, but these would generally require a DTD, which
xml_objectify
does not assume.
Subtag attributes
XML subtags are represented by either
Python attributes of object type or by lists of such objects, depending
on whether there are one or several such subtags of the same type. This is
determined by whether a particular tag contains multiple subtags
of the same type. For example, in the first
address.xml
example
above, one person's contact information may include one home phone, while
another person's contact information may include zero or several. Correspondingly, some
contact_info
objects will have no
.home_phone
attribute, while some will have a
.home_phone
attribute containing a
home_phone
object,
and some will have a
.home_phone
attribute containing a list of
home_phone
objects. Although it would be possible to impose more order if a DTD were used, in my
opinion, Python applications require this kind dynamic ability.
Python namespace restrictions
Be aware that the Python namespace
is smaller than the XML namespace. Therefore, sometimes the XML names of either
tags or attributes are modified.
xml_objectify
transforms
dashes, colons, and the pound/hash mark into underscores. The module does not
handle any further namespace collision. For example, if your XML document has
tags,
<spam-eggs>
,
<spam_eggs>
,
<spam:eggs>
and
<spam#eggs>
,
xml_objectify
will create Python objects that do not
correctly represent your XML document. In most cases, this is not a problem,
since people are unlikely to have XML documents with these kind of conflicting
tags.
What is the future of xml_objectify?
Currently, no capabilities
exist for converting native Python objects back to XML documents with the same
structure as those read in. The problem occurs because xml_objectify
deliberately drops information about order
in XML documents to produce friendlier Python objects. Python attributes do not
have any predetermined order, but XML tags and attributes may be required in a
specific sequence. Even where XML tags are not required to occur in
specific order, the order may be semantically important. (Note that in the case
of repeated common subtags, Python lists maintain order.) In order to convert
back to XML, we would either need to choose arbitrary orders, or preserve order
information within the native Python object, making it seem less like
Python.
One option in reconstructing the dropped information in Python objects might
be to enforce a DTD when converting back to XML. Even if I pursued this option,
questions would still exist about how to handle attributes that are added,
deleted, or modified at Python runtime. Modifying a Python object could produce
something that does not conform to the original XML document's DTD. However, I
will add capabilities to xml_objectify if users identify specific needs.
Resources
About the author  | 
|  | David Mertz wanted to call this column "Ex nihilo XML fit", if only for the alliteration; but he thinks his publisher shudders at the summoned imagery of a chthonic golem. David Mertz can be reached at mertz@gnosis.cx; his life pored over at http://gnosis.cx/dW/. Suggestions and recommendations on this, past, or future, columns are welcomed. Check out David's book "Text Processing in Python" at http//gnosis.cx/TPiP/. |
Rate this page
|