XML Matters: On the 'Pythonic' treatment of XML documents as objects (II)

David Mertz, Ph.D (mertz@gnosis.cx), Data Masseur, Gnosis Software, Inc.
David Mertz wanted to call this column "Ex nihilo XML fit", if only for the alliteration; but he thinks his publisher shudders at the summoned imagery of a chthonic golem. David Mertz can be reached at mertz@gnosis.cx; his life pored over at http://gnosis.cx/dW/. Suggestions and recommendations on this, past, or future columns are welcomed. Check out David's book "Text Processing in Python" at http://gnosis.cx/TPiP/.

Summary:  In the second installment of his new "XML Matters" column -- and as part of his ongoing quest to create a more seamless integration between XML and Python -- David Mertz presents the xml_objectify module. David describes how to use xml_objectify and the advantages of using this "Pythonic" module for working with XML documents as objects.


Date:  01 Aug 2000



Introduction to the Project

XML Matters #1 introduced my project for creating more seamless and natural integration between XML and Python. The Resources section provides links to other developerWorks articles in which I discuss general Python programming techniques and other XML/Python topics.

Obtaining compatible XML-SIG updates

The XML-SIG distribution changes fairly frequently in beta versions. These changes are likely to affect how xml_objectify functions. Therefore, you can download an XML-SIG version known to be compatible with xml_objectify from Resources.

When the XML-SIG distribution is officially released and/or when the XML package is a part of an official Python release, the current xml_objectify will be updated to work with the official release. See Resources for the most current xml_objectify.

Because of the asymmetries between XML and Python, the project -- at least initially -- contains two separate modules: xml_pickle for representing arbitrary Python objects in XML, and xml_objectify for native representation of XML documents as Python objects. This article addresses xml_objectify.

In Python, modules and packages such as xmllib, xml.sax, pyxie, and xml.dom provide ways of handling XML documents that are common in the XML community. You may be familiar with similar modules and libraries available for other programming languages. In fact, many of these modules are based on language-neutral XML standards, and they commonly implement an XML-centric way of handling documents and objects.

The Python implementations of general XML protocols give you the flexibility to program in different ways. For example, you might use portable standards such as DOM so that programmers using one language can easily work with DOM-oriented code written in another language. However, there are times when a Python programmer may prefer to code in ways that are much more like "normal" Python. In many cases, the XML conceptual framework seems like it's tacked on to Python, rather than being an integral part of it. Thus, I developed a set of "Pythonic" modules for working with XML documents.


Jumping ahead: how to use xml_objectify

Using xml_objectify is simple and well-documented in module docstring comments. Let's take a quick look at some sample code:


Creating a Python object from an XML document
                

from xml_objectify import XML_Objectify
xml_obj = XML_Objectify('address.xml')
py_obj = xml_obj.make_instance()


As you can see, there are two steps in creating a native Python object from a generic XML document. First, you create an intermediate DOM-like factory object (that is, an object used to create other objects). Second, you generate one or more Python object instances from the XML_Objectify instance. Note that you should use xml_pickle, not xml_objectify, to handle documents in the special PyObjects.dtd format. (See XML Matters #1 for information about xml_pickle.)

You could also do both steps on the same line. For example:


Creating an XML/Python object inline
                

py_obj = XML_Objectify('address.xml').make_instance()

Of course, in the latter case, the factory object is not preserved to produce more native objects, and its ._dom data member, which contains a full DOM instance, is also cleared.

For comparison, the following example shows that creating a DOM object can be just as simple in Python:


Creating a DOM object from an XML document
                

from xml.dom.utils import FileReader
dom_obj = FileReader().readXml(open('address.xml'))

FileReader().readXml() requires an actual file object, while XML_Objectify() accepts either a file object or a plain filename. In either case, creating the object is a two-line action.

The difference between using the xml_objectify module and the xml.dom package is in the type of object you wind up with. A Python DOM object is a genuine Python object, but its attributes and methods do not correspond to the data and structure of the original XML document as closely as those of the XML_Objectify object. The Python DOM object's attributes are generally nested .children lists, which are not too helpful semantically. To access the same XML attribute in the sample document, you have a choice between the first line, which uses xml_objectify, and either of the two DOM statements in the next four lines. This is illustrated below:


Using [xml.dom] versus [xml_objectify] Python objects
                

print py_obj.person[1].address.city
print dom_obj.get_childNodes()[1].get_childNodes()[3].\
      get_childNodes()[3].get_attributes()['city'].value
print dom_obj._node.children[1].children[3].children[3].\
      attributes['city'].children[0].value

A DOM tree is organized as a strictly ordered tree of nodes. It isn't hard to enumerate these nodes, but it's quite cumbersome to refer to specific ones. What makes matters worse is that some nodes are whitespace text and processing instruction nodes (which you rarely care about), so finding the subtags in the node list is mostly trial and error. In the example above, the two DOM print statements use different access styles: one uses the DOM-style methods (for example, .get_childNodes()), and the other uses the native attributes (for example, .children). Either way, it isn't easy to see what data in the XML document is being referenced.
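The whitespace problem is easy to reproduce. The sketch below uses xml.dom.minidom from the current Python standard library as a stand-in for the older xml.dom.utils reader shown above (an assumption made so that the example runs today); the two-person document is made up for illustration.

```python
from xml.dom import minidom, Node

# A tiny stand-in document; the indentation is what creates the extra nodes.
doc = minidom.parseString(
    "<addressbook>\n"
    "  <person name='Ann'/>\n"
    "  <person name='Bob'/>\n"
    "</addressbook>"
)
root = doc.documentElement

# Every run of whitespace between tags shows up as its own text node.
print(len(root.childNodes))            # 5 nodes for 2 <person> tags

# Filtering by node type is the usual way to find the real subtags.
people = [n for n in root.childNodes if n.nodeType == Node.ELEMENT_NODE]
print([p.getAttribute("name") for p in people])

# Without the filter, the second person sits at index 3, not index 1.
print(root.childNodes[3].getAttribute("name"))
```

This is why the DOM statements in the article use indexes such as [1] and [3] rather than [0] and [1].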

In contrast, the first print statement in the example above pretty much documents itself. The only minor caveat is that you must use Python's zero-based list indexing. Beyond that, the line simply says: "Print the city of the address of the second person in the addressbook." ("New York" is what is printed by each statement.) To help you further, py_obj.__class__ is "addressbook," which corresponds to the XML document's root element. And every attribute that might contain more than simple text is an instance of a class named according to the XML tag defining it.

As you can see, xml.dom is generally hard to use and its syntax is obscure. Native Python objects are much easier to use. Note that xml_objectify does make wide use of DOM internally. In fact, every XML_Objectify instance contains a ._dom attribute that is a DOM tree for the XML document opened. However, the object that .make_instance() creates does not contain any DOM, and it is an instance of a class named after the root tag.
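To make the factory-and-instance split concrete, here is a minimal sketch of the idea. It is not the xml_objectify source: SketchFactory, build(), and the sample document are hypothetical names and data, and the parsing is done with the standard library's xml.etree.ElementTree rather than DOM.

```python
import xml.etree.ElementTree as ET

# Made-up sample data shaped like the article's address book.
ADDRESS_XML = """<addressbook>
  <person name="Ann"><address city="Boston"/></person>
  <person name="Bob"><address city="New York"/></person>
</addressbook>"""

def build(elem):
    """Turn one element into an instance of a class named after its tag."""
    obj = type(elem.tag, (), {})()           # class created on the fly
    for name, value in elem.attrib.items():  # XML attributes become strings
        setattr(obj, name, value)
    for child in elem:
        child_obj = build(child)
        existing = vars(obj).get(child.tag)
        if existing is None:                 # first subtag of this name
            setattr(obj, child.tag, child_obj)
        elif isinstance(existing, list):     # third or later: extend the list
            existing.append(child_obj)
        else:                                # second: promote to a list
            setattr(obj, child.tag, [existing, child_obj])
    return obj

class SketchFactory:
    """Hypothetical stand-in for the factory role of XML_Objectify."""
    def __init__(self, xml_text):
        self._tree = ET.fromstring(xml_text)  # parsed tree stays on the factory

    def make_instance(self):
        return build(self._tree)              # result holds no parser objects

factory = SketchFactory(ADDRESS_XML)
py_obj = factory.make_instance()
print(type(py_obj).__name__)           # addressbook
print(py_obj.person[1].address.city)   # New York
```

The factory can be asked for further instances, while each instance is plain Python with no reference back to the parsed tree.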


Design considerations, limitations, and caveats

Code introspection

With xml_objectify, you can take advantage of all your existing generic functions. pyobj_printer() is a sample generic function included with the xml_objectify module. This function produces a readable, recursive representation of any Python object. By representing your XML documents as native Python objects, you can reuse existing functions that handle Python objects in abstract ways. Of course, a DOM object is a Python object of sorts, but it's difficult to use generic functions with these objects in a useful way. For example, because a DOM object's attributes are nested .children lists, using a generic function like pyobj_printer() will not produce very useful output.
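A generic function of this kind needs nothing XML-specific. The following is a rough sketch of the idea, not the pyobj_printer() that ships with the module: describe() is a hypothetical name, and types.SimpleNamespace objects stand in for the instances that .make_instance() would return.

```python
from types import SimpleNamespace

def describe(obj, indent=0):
    """Return indented lines describing an object's attributes, recursively."""
    pad = "  " * indent
    lines = []
    for name, value in sorted(vars(obj).items()):
        # Treat a single value and a list of values the same way.
        items = value if isinstance(value, list) else [value]
        for item in items:
            if hasattr(item, "__dict__"):      # nested object: recurse
                lines.append(f"{pad}{name}:")
                lines.extend(describe(item, indent + 1))
            else:                              # plain value: print it
                lines.append(f"{pad}{name} = {item!r}")
    return lines

# Stand-in for an objectified address book.
book = SimpleNamespace(person=[
    SimpleNamespace(name="Ann", address=SimpleNamespace(city="Boston")),
    SimpleNamespace(name="Bob", address=SimpleNamespace(city="New York")),
])
print("\n".join(describe(book)))
```

Because the function only walks ordinary attributes, it works on any object with a __dict__, whether or not it came from XML.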

Tricks with class behavior

xml_objectify offers a subtle trick in that it only dynamically defines a class for an attribute value if that class has not already been defined. This lets you define classes with complex behavior and attributes that you can place specific XML document contents into. Say for example that the class person is predefined with various methods (including an .__init__() method, if needed). Each "person" in the XML addressbook imported in the example above will have whatever behaviors it has been given, including methods that operate on the data placed in the instance. Of course, if you have not predefined a class before running XML_Objectify() on the document, the class is just a container for the attributes defined in the actual XML.
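The trick can be sketched as a lookup with a fallback. In this illustration the KNOWN_CLASSES registry, class_for(), and build() are hypothetical; how xml_objectify itself locates a predefined class is not shown here, so treat the lookup mechanism as an assumption.

```python
import xml.etree.ElementTree as ET

class person:
    """Predefined class: objects built from <person> tags get this method."""
    def label(self):
        return f"{self.name} ({self.city})"

# Hypothetical registry of classes defined before the document is read.
KNOWN_CLASSES = {"person": person}

def class_for(tag):
    """Reuse a predefined class if one exists; otherwise make a bare container."""
    if tag not in KNOWN_CLASSES:
        KNOWN_CLASSES[tag] = type(tag, (), {})
    return KNOWN_CLASSES[tag]

def build(elem):
    """Instantiate the class for this tag and copy the XML attributes onto it."""
    obj = class_for(elem.tag)()
    for name, value in elem.attrib.items():
        setattr(obj, name, value)
    return obj

root = ET.fromstring('<addressbook><person name="Ann" city="Boston"/></addressbook>')
ann = build(root[0])    # uses the predefined person class
book = build(root)      # no predefined class, so a container is created
print(ann.label())      # Ann (Boston)
print(type(book).__name__)
```

The predefined class contributes behavior, and the document contributes data; tags without a predefined class still work, just without any methods.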

Character markup handling

XML tags are normally block-level, but some are character-level. In my opinion, the natural Python representation is different for each case. A block-level subtag is easily represented by an attribute of the parent tag that is named after the subtag. The value of the subtag-attribute is a new Python object, which is also of a type named after the subtag. For example, a person might have, in a hierarchical sense, an address and misc-info. With Python, you can refer to these as person.address and person.misc_info.

With a character-level tag, where the contents of a tag are a mixture of text data and markup of that data (often typographic), the subtags are not really something the parent tag has in a hierarchy. For example, a misc_info object does not really have ital attributes. So, how should the following type of XML be represented?

<misc-info>One of the <ital>most</ital> talented actresses on TV.</misc-info>

xml_objectify adds a special attribute called ._XML to objects/tags that appear to contain marked-up character data. This attribute contains the literal XML inside a tag. For example, the pyobj_printer() function displays this literal XML instead of recursive attributes if the ._XML attribute exists for a given nested object. However, the standard recursive subtag-object creation is still performed, so you can look at whatever attributes and structures are most relevant.
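One way to picture the ._XML attribute is as the serialized inner content of a tag. The sketch below uses xml.etree.ElementTree; looks_mixed() and inner_xml() are hypothetical helpers, and the test for "appears to contain marked-up character data" is my own guess at a reasonable rule, not the module's actual heuristic.

```python
import xml.etree.ElementTree as ET

def looks_mixed(elem):
    """True if the element has subtags and non-whitespace text around them."""
    if len(elem) == 0:
        return False
    texts = [elem.text] + [child.tail for child in elem]
    return any(t and t.strip() for t in texts)

def inner_xml(elem):
    """Return the literal XML found between the element's start and end tags."""
    parts = [elem.text or ""]
    for child in elem:
        # tostring() includes the text that follows the child (its tail).
        parts.append(ET.tostring(child, encoding="unicode"))
    return "".join(parts)

elem = ET.fromstring(
    "<misc-info>One of the <ital>most</ital> talented actresses on TV.</misc-info>"
)
print(looks_mixed(elem))   # True
print(inner_xml(elem))     # One of the <ital>most</ital> talented actresses on TV.
```

A purely block-level tag, such as a person holding only subtags and indentation, would fail the looks_mixed() test and would be represented by nested objects alone.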

Native Python objects contain root document only

Many XML documents contain processing instructions and/or comments along with their tags and character data contents. However, the native Python object created by the .make_instance() method of an XML_Objectify object contains only the contents of the document root tag. Furthermore, XML comments are ignored; only tag attributes and character data are represented.

In the "Creating a Python object from an XML document" example above, if you preserved the original XML_Objectify object (xml_obj), you could access its .processing_instruction attribute, or even its ._dom attribute, to see what was left out of the native Python object.
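To see what lives outside the root tag, it helps to look at a DOM document directly. This sketch uses xml.dom.minidom from the current standard library (an assumption; it is not the DOM implementation xml_objectify was written against), with a made-up document.

```python
from xml.dom import minidom, Node

doc = minidom.parseString(
    "<?xml-stylesheet href='book.css'?>"
    "<!-- exported from a contact manager -->"
    "<addressbook><person name='Ann'/></addressbook>"
)

# The document node holds the processing instruction and the comment
# as siblings of the root element.
kinds = [node.nodeType for node in doc.childNodes]
print(kinds == [Node.PROCESSING_INSTRUCTION_NODE,
                Node.COMMENT_NODE,
                Node.ELEMENT_NODE])

# Starting from the root element, as the native object does, skips both.
root = doc.documentElement
print(root.tagName)
print([child.nodeName for child in root.childNodes])
```

Anything that starts from the root element never sees those first two nodes, which is why they remain reachable only through the factory object.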

Attribute type simplification

All XML attributes are converted to Python object attributes of string type. Currently xml_objectify does not represent XML enumerated or numeric types for attributes. Such capabilities might be added to later versions, but these would generally require a DTD, which xml_objectify does not assume.
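In practice this means that any numeric conversion is the caller's job. A small sketch, assuming a hypothetical coerce() helper and a made-up person element parsed with xml.etree.ElementTree:

```python
import xml.etree.ElementTree as ET

def coerce(text):
    """Try int, then float; fall back to the original string."""
    for convert in (int, float):
        try:
            return convert(text)
        except ValueError:
            pass
    return text

attrs = ET.fromstring('<person age="42" height="1.7" city="Boston"/>').attrib

# Without a DTD, every attribute value arrives as a string.
print(all(isinstance(value, str) for value in attrs.values()))

typed = {name: coerce(value) for name, value in attrs.items()}
print(typed)
```

Guessing types this way is only a heuristic: a postal code such as "02134" would become the integer 2134, which is one reason a DTD is the safer basis for typing.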

Subtag attributes

XML subtags are represented either by Python attributes of object type or by lists of such objects, depending on whether a particular tag contains one or several subtags of the same type. For example, in the first address.xml example above, one person's contact information may include one home phone, while another person's contact information may include zero or several. Correspondingly, some contact_info objects will have no .home_phone attribute, while some will have a .home_phone attribute containing a home_phone object, and some will have a .home_phone attribute containing a list of home_phone objects. Although it would be possible to impose more order if a DTD were used, in my opinion, Python applications require this kind of dynamic ability.
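Code that consumes such objects has to cope with all three shapes. One simple approach is a small normalizing helper; as_list() is a hypothetical name, and the SimpleNamespace objects below stand in for objectified contact_info tags.

```python
from types import SimpleNamespace

def as_list(obj, name):
    """Return the named subtag(s) as a list, whether 0, 1, or many exist."""
    value = getattr(obj, name, [])
    return value if isinstance(value, list) else [value]

# Stand-ins for three contact_info objects.
no_phone = SimpleNamespace()
one_phone = SimpleNamespace(home_phone=SimpleNamespace(number="555-0100"))
two_phones = SimpleNamespace(home_phone=[
    SimpleNamespace(number="555-0101"),
    SimpleNamespace(number="555-0102"),
])

for contact in (no_phone, one_phone, two_phones):
    numbers = [phone.number for phone in as_list(contact, "home_phone")]
    print(numbers)
```

With the helper in place, a loop over home phones can be written once, without checking which of the three cases applies.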

Python namespace restrictions

Be aware that the Python namespace is smaller than the XML namespace: Python identifiers allow fewer characters than XML names do. Therefore, the XML names of tags or attributes are sometimes modified. xml_objectify transforms dashes, colons, and the pound/hash mark into underscores. The module does not handle any resulting name collisions. For example, if your XML document has the tags <spam-eggs>, <spam_eggs>, <spam:eggs>, and <spam#eggs>, xml_objectify will create Python objects that do not correctly represent your XML document. In most cases, this is not a problem, since people are unlikely to have XML documents with these kinds of conflicting tags.
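The transformation and the collision it can cause fit in a few lines. This is a sketch of the rule as described above, with py_name() as a hypothetical function name rather than the module's own.

```python
import re

def py_name(xml_name):
    """Replace dashes, colons, and hash marks with underscores."""
    return re.sub(r"[-:#]", "_", xml_name)

names = ["spam-eggs", "spam_eggs", "spam:eggs", "spam#eggs"]
mangled = {py_name(name) for name in names}

print(py_name("misc-info"))   # misc_info
print(mangled)                # {'spam_eggs'}
```

Four distinct XML names collapse into one Python name, so whichever tag is processed last would overwrite, or be merged with, the others.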


What is the future of xml_objectify?

Currently, no capabilities exist for converting native Python objects back to XML documents with the same structure as those read in. The problem occurs because xml_objectify deliberately drops information about order in XML documents to produce friendlier Python objects. Python attributes do not have any predetermined order, but XML tags and attributes may be required in a specific sequence. Even where XML tags are not required to occur in specific order, the order may be semantically important. (Note that in the case of repeated common subtags, Python lists maintain order.) In order to convert back to XML, we would either need to choose arbitrary orders, or preserve order information within the native Python object, making it seem less like Python.
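The loss of order is easy to demonstrate on a made-up document in which two kinds of subtag are interleaved. The sketch uses xml.etree.ElementTree and a plain dictionary to mimic grouping subtags by name; it illustrates the problem and is not xml_objectify's own code.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<doc><a>1</a><b>2</b><a>3</a></doc>")

# Order as written in the document.
document_order = [child.tag for child in root]
print(document_order)   # ['a', 'b', 'a']

# Grouping by tag name, as an attribute-per-subtag object effectively does.
grouped = {}
for child in root:
    grouped.setdefault(child.tag, []).append(child.text)
print(grouped)          # {'a': ['1', '3'], 'b': ['2']}
```

The list under 'a' keeps its internal order, but nothing in the grouped form records that the b tag sat between the two a tags, so the original sequence cannot be rebuilt from it.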

One option in reconstructing the dropped information in Python objects might be to enforce a DTD when converting back to XML. Even if I pursued this option, questions would still exist about how to handle attributes that are added, deleted, or modified at Python runtime. Modifying a Python object could produce something that does not conform to the original XML document's DTD. However, I will add capabilities to xml_objectify if users identify specific needs.


Resources
