XML is a simplified dialect of the Standard Generalized Markup Language (SGML). Many of you are familiar with SGML via HTML. Both XML and HTML documents are composed of text interspersed with, and structured by, markup tags in angle brackets. But XML encompasses many systems of tags that allow XML documents to be used for many purposes, including:
- Magazine articles and user documentation
- Files of structured data (like CSV or EDI files)
- Messages for interprocess communication between programs
- Architectural diagrams (like CAD formats)
A set of tags can be created to capture any sort of structured information you might want to represent, which is why XML is growing in popularity as a common standard for representing diverse information.
Python is a freely available, very high-level, interpreted language developed by Guido van Rossum. It combines clear syntax with powerful (but optional) object-oriented semantics. Python is available for a range of computer platforms and offers strong portability between platforms.
There are a number of techniques and tools for dealing with XML documents in Python. (The Resources section provides links to two developerWorks articles in which I discuss general techniques. It also provides links to other documents on XML/Python topics.) However, one thing that most existing XML/Python tools have in common is that they are much more XML-centric than Python-centric. Certain constructs and coding techniques feel "natural" in a given programming language, and others feel much more like they are imported from other domains. But in an ideal environment all constructs fit intuitively into their domain, and domains merge seamlessly. When they do, programmers can wax poetic rather than merely make it work.
I've begun a research project of creating a more seamless and more natural integration between XML and Python. In this article, and subsequent articles in this column, I'll discuss some of the goals, decisions, and limitations of the project; and hopefully provide you with a set of useful modules and techniques that point to easier ways to meet programming goals. All tools created as part of the project will be released to the public domain.
Python is a language with a flexible object system and a rich
set of built-in types. The richness of Python is both an
advantage and a disadvantage for the project. On one hand,
having a wide range of native facilities in Python makes it
easier to represent a wide range of XML structures. On the other hand, the
range of native types and structures of Python makes for more cases to worry about in representing native Python objects in XML. As a result of
these asymmetries between XML and Python, the
project -- at least initially -- contains two separate modules: xml_pickle, for representing arbitrary Python objects in XML, and xml_objectify, for "native"
representation of XML documents as Python objects. We'll address xml_pickle in this article.
Python's standard pickle module already provides a simple and
convenient method of serializing Python objects that is useful for persistent storage or transmission over a
network. In some cases, however, it is desirable to perform
serialization to a format with several properties not possessed
by pickle. Namely, a format that:
- Is human readable
- May be parsed, manipulated, and its objects imported by languages other than Python
- Supports validation of stored serialized objects
xml_pickle provides each of these
features while maintaining interface compatibility with
pickle. However, xml_pickle is not a general purpose
replacement for pickle since pickle retains several
advantages of its own such as faster operation (especially via
cPickle) and a far more compact object representation.
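For readers who have not used pickle, the round trip that xml_pickle mirrors can be sketched with the standard library alone. This is only an illustration written for a current Python 3 interpreter; the variable names and data values are invented for the example.

```python
import pickle

# Illustrative data built only from built-in types.
record = {"num": 37, "str": "Hello World", "lst": [1, 3.5, 2, 4 + 7j]}

blob = pickle.dumps(record)     # serialize to a byte string
restored = pickle.loads(blob)   # rebuild an equal object from the bytes
```

The serialized form here is an opaque byte string, which is compact but not meant to be read or edited by hand.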
Even though the interface of xml_pickle is mostly the same as that of pickle, it is worth illustrating the (quite simple) usage of xml_pickle for those who are not familiar with Python or pickle.
Python code to demonstrate [xml_pickle]
import xml_pickle
# import the module
# declare some classes to hold some attributes
class MyClass1: pass
class MyClass2: pass
# create a class instance, and add some basic data members to it
o = MyClass1()
o.num = 37
o.str = "Hello World"
o.lst = [1, 3.5, 2, 4+7j]
# create an instance of a different class, add some members
o2 = MyClass2()
o2.tup = ("x", "y", "z")
o2.num = 2+2j
o2.dct = {"this": "that", "spam": "eggs", 3.14: "about PI"}
# add the second instance to the first instance container
o.obj = o2
# print an XML representation of the container instance
xml_string = xml_pickle.XML_Pickler(o).dumps()
print(xml_string)
Everything except the first line and the next-to-last line is generic Python for working with object instances. It might be a little contrived and a little simple, but essentially everything you do with instance data members (including nesting instances as container data, which is how most complex structures are built in Python) is contained in the example above. Python programmers only need to make one method call to encode their objects as XML.
Of course, once you have "pickled" your objects, you'll want to restore them later (or use them elsewhere). Supposing the above few lines have already run, restoring the object representation is as simple as:
new_object = xml_pickle.XML_Pickler().loads(xml_string)
Obviously, in real cases you would want to do something more
interesting with the created XML document than just hold it in
memory during runtime. For example, you might save the XML
document to disk (maybe using the XML_Pickler.dump() method),
or transmit it over a communication channel. Actually, the
example does print to paper, which might well be a good
durable storage format.
Running the sample code above will produce a pretty good
example of the features of an xml_pickle representation of a
Python object. But the following example is a hand-coded
test case I've developed that has the
advantage of containing every XML structure, tag, and attribute
allowed in the document type. The specific data is invented, but
it is not hard to imagine the application the data might belong
to.
<?xml version="1.0"?>
<!DOCTYPE PyObject SYSTEM "PyObjects.dtd">
<PyObject class="Automobile">
  <attr name="doors" type="numeric" value="4" />
  <attr name="make" type="string" value="Honda" />
  <attr name="tow_hitch" type="None" />
  <attr name="prev_owners" type="tuple">
    <item type="string" value="Jane Smith" />
    <item type="tuple">
      <item type="string" value="John Doe" />
      <item type="string" value="Betty Doe" />
    </item>
    <item type="string" value="Charles Ng" />
  </attr>
  <attr name="repairs" type="list">
    <item type="string" value="June 1, 1999: Fixed radiator" />
    <item type="PyObject" class="Swindle">
      <attr name="date" type="string" value="July 1, 1999" />
      <attr name="swindler" type="string" value="Ed's Auto" />
      <attr name="purport" type="string" value="Fix A/C" />
    </item>
  </attr>
  <attr name="options" type="dict">
    <entry>
      <key type="string" value="Cup Holders" />
      <val type="numeric" value="4" />
    </entry>
    <entry>
      <key type="string" value="Custom Wheels" />
      <val type="string" value="Chrome Spoked" />
    </entry>
  </attr>
  <attr name="engine" type="PyObject" class="Engine">
    <attr name="cylinders" type="numeric" value="4" />
    <attr name="manufacturer" type="string" value="Ford" />
  </attr>
</PyObject>
Informally, it is not difficult to see the
structure of a PyObjects.dtd XML document. (A formal document type definition (DTD) is available in Resources.) But the DTD will
disambiguate any issues that are not immediately evident.
Looking at the sample XML document, you can see that the three
stated design goals of xml_pickle have been met:
- The format is human readable
- The XML representations may be manipulated by means other than xml_pickle -- whether they are unrelated Python/XML modules, XML libraries in other programming languages, XML-enhanced editors and utilities, or just simply text-editors (as was used in creation of the sample)
- XML representations of Python objects may be validated using standard XML validators and PyObjects.dtd
All documents that conform to the DTD and only documents that conform to the DTD will be representations of valid Python objects.
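To make the second goal concrete, here is a minimal sketch of reading a PyObjects-style document without xml_pickle at all, using the standard library's xml.etree.ElementTree (written for a current Python 3 interpreter). The document is a trimmed copy of the sample, with the XML declaration and DOCTYPE left out for brevity; the variable names are invented for the illustration.

```python
import xml.etree.ElementTree as ET

# A trimmed-down document in the style of the sample above.
doc = """
<PyObject class="Automobile">
  <attr name="doors" type="numeric" value="4" />
  <attr name="make" type="string" value="Honda" />
  <attr name="tow_hitch" type="None" />
</PyObject>
"""

root = ET.fromstring(doc)

# Collect (name, type, value) for each top-level <attr> element.
# A missing "value" attribute comes back as None.
summary = [(a.get("name"), a.get("type"), a.get("value"))
           for a in root.findall("attr")]
```

Nothing in this sketch depends on Python-specific knowledge of the format, which is the point: any XML library in any language could do the same walk.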
Design features, caveats and limitations
The content models of Python and XML are simply different in certain respects. One significant difference is that XML documents are inherently linear in form. Python object attributes -- and also Python dictionaries -- have no definitional order (although implementation details create arbitrary ordering, such as of hashed keys). In this respect, the Python object model is closer to the relational model; rows of a relational table have no "natural" sequence, and primary or secondary keys may or may not provide any meaningful ordering on a table. The keys are always orderable by comparison operators, but this order may be unrelated to the semantics of the keys.
An XML document always lists its tag elements in a particular
order. The order may not be significant to a particular
application, but the XML document order is always present. The effect of the differing
significance of key order in Python and XML is that the XML
documents produced by xml_pickle are not guaranteed to
maintain element order through "pickle"/"unpickle" cycles. For
example, a hand-prepared PyObjects.dtd XML document, such as the
one above, may be "unpickled" into a Python object. If the
resultant object is then "pickled," the <attr> tags will most
likely occur in a different order than in the original
document. This is a feature, not a bug, but the fact should be
understood.
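One practical consequence is that two documents should be compared by content rather than as text. The following is a minimal sketch under that assumption, using the standard library's xml.etree.ElementTree in current Python 3; the helper name attr_map is invented here and only looks at flat, top-level <attr> elements.

```python
import xml.etree.ElementTree as ET

def attr_map(xml_text):
    """Map each top-level <attr> name to its (type, value) pair."""
    root = ET.fromstring(xml_text)
    return {a.get("name"): (a.get("type"), a.get("value"))
            for a in root.findall("attr")}

original = ('<PyObject class="Automobile">'
            '<attr name="doors" type="numeric" value="4" />'
            '<attr name="make" type="string" value="Honda" />'
            '</PyObject>')

# The same attributes, listed in the opposite order.
reordered = ('<PyObject class="Automobile">'
             '<attr name="make" type="string" value="Honda" />'
             '<attr name="doors" type="numeric" value="4" />'
             '</PyObject>')

same_text = (original == reordered)
same_content = (attr_map(original) == attr_map(reordered))
```

The two strings differ, but the name-keyed mappings built from them are equal, which is the sense in which a "pickle"/"unpickle" cycle preserves the object.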
Several known limitations occur in xml_pickle
as of the current version (0.2). One potentially serious flaw
is that no effort is made to trap cyclical references in
compound/container objects. If an object attribute refers back
to the container object (or some recursive version of this),
xml_pickle will exhaust the Python stack. Cyclical
references are likely to indicate a flaw in object design to
start with, but later versions of xml_pickle will certainly
attempt to deal with them more intelligently.
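To illustrate the problem, here is one way a caller could check for such references before serializing. This is a sketch of a general technique (tracking the identities of the containers on the current path), not the approach xml_pickle takes or will take; the names has_cycle and Node are invented for the example, which targets current Python 3.

```python
def has_cycle(obj, _path=None):
    """Return True if obj contains a reference back to one of its own containers."""
    path = set() if _path is None else _path
    if id(obj) in path:
        return True                      # we are already inside this object
    if isinstance(obj, dict):
        children = list(obj.values())
    elif isinstance(obj, (list, tuple)):
        children = list(obj)
    elif hasattr(obj, "__dict__"):
        children = list(vars(obj).values())
    else:
        return False                     # numbers, strings, None: nothing inside
    path.add(id(obj))
    try:
        return any(has_cycle(child, path) for child in children)
    finally:
        path.discard(id(obj))            # leaving this container again

class Node:
    """Hypothetical container class for the demonstration."""
    pass

shared = [1, 2, 3]
plain = Node()
plain.first = shared
plain.second = shared            # the same list twice, but no cycle

parent = Node()
parent.child = Node()
parent.child.owner = parent      # the child refers back to its container

plain_has_cycle = has_cycle(plain)
parent_has_cycle = has_cycle(parent)
```

Because only the current path is tracked, an object that is merely referenced from two places is not reported; only a reference that leads back to an enclosing container is.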
Another limitation is that the namespace of XML
attribute values (such as the "123" in <attr name="123">) is
larger than the namespace of valid Python variables and
instance members. Attributes created manually outside
the Python namespace will have the odd status of existing
in the .__dict__ magic attribute of an instance, but being
inaccessible by normal attribute syntax (e.g. "obj.123" is a
syntax error). This is only an issue where XML documents are
created or modified by means other than xml_pickle itself.
At this time, I simply haven't determined the best way of handling
this (somewhat obscure) issue.
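The odd status described above is easy to reproduce in plain Python, with no XML involved. In this small sketch for current Python 3, the class name Thing and the stored string are invented for the illustration.

```python
class Thing:
    """Hypothetical class standing in for an unpickled object."""
    pass

obj = Thing()

# Store a member under a name that is not a valid Python identifier,
# as could happen when a hand-edited document is restored.
obj.__dict__["123"] = "set from a hand-edited document"

is_identifier = "123".isidentifier()   # "123" cannot be written as obj.123
via_getattr = getattr(obj, "123")      # but the value is still reachable by name

# Confirm that normal attribute syntax does not even compile.
try:
    compile("obj.123", "<example>", "eval")
    dotted_syntax_ok = True
except SyntaxError:
    dotted_syntax_ok = False
```

So the data is not lost; it is just reachable only through getattr() or the instance dictionary.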
A third limitation is that xml_pickle does not handle all attributes of Python objects. All the "usual" data members (strings,
numbers, dictionaries, etc.) are "pickled" well. But instance
methods, and class and function objects as attributes, are not
handled. As with pickle, methods are simply ignored in "pickling." If class or function objects exist as attributes,
an XMLPicklingError is raised. This is probably the correct
ultimate behavior, but a final decision has not been made.
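The distinction between the two cases can be seen with a short sketch in current Python 3. It does not reproduce xml_pickle's own checks, which are not shown in this article; the class Car and the helper unsupported_attrs are invented, and the helper only lists the problem attributes rather than raising an error.

```python
import inspect

class Car:
    def honk(self):          # a method lives on the class, not the instance
        return "beep"

def unsupported_attrs(obj):
    """Names of instance attributes that hold classes, functions, or methods."""
    return sorted(name for name, value in vars(obj).items()
                  if inspect.isclass(value) or inspect.isroutine(value))

car = Car()
car.make = "Honda"
car.doors = 4
car.formatter = len          # a function object stored as data
car.part_type = dict         # a class object stored as data

instance_names = sorted(vars(car))   # the method "honk" is not listed here
flagged = unsupported_attrs(car)
```

Methods never show up among the instance's own data members, which is why they can simply be passed over, while a class or function stored as a data member is present and has to be dealt with explicitly.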
One genuine ambiguity in XML document design
is the choice of when to use tag attributes and when to use
subelements. Opinions on this design issue differ, and XML
programmers often feel strongly about their conflicting views.
This was probably the biggest issue in deciding the
xml_pickle document structure.
The general principle decided was that a thing that is
naturally "plural" should be represented by subelements. For
example, a Python list can contain as many items as you like,
and is therefore represented by a sequence of <item>
subelements. On the other side, a number is a singular thing
(the value might be more than 1, but there is only one thing
in it). In that case, it seemed much more logical to use an XML
attribute called "value." The really difficult case was identified with
Python strings. In a basic way, they are sequence
objects -- just like lists. But representing each character in a
string using a hypothetical tag would destroy the goal
of human readability, and make for enormous XML
representations. The decision was made to put strings in the
XML "value" attribute, just as with numbers. From an
aesthetic point of view, however, this is probably less desirable than
placing strings within a tag container, especially for multiline strings.
But this decision seemed more consistent since there was no
other "naked" #PCDATA in the specification.
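The "plural things become subelements, singular things become a value attribute" principle can be sketched in a few lines with the standard library's xml.etree.ElementTree. This is an illustration in current Python 3, not the xml_pickle source: the function name to_item is invented, and it handles only lists, strings, and numbers, borrowing the tag and type names from the sample document.

```python
import xml.etree.ElementTree as ET

def to_item(value):
    """Build an <item> element for a number, a string, or a (nested) list."""
    item = ET.Element("item")
    if isinstance(value, list):
        # Plural: one subelement per member, no "value" attribute.
        item.set("type", "list")
        for member in value:
            item.append(to_item(member))
    elif isinstance(value, str):
        # Singular: the whole string goes into the "value" attribute.
        item.set("type", "string")
        item.set("value", value)
    else:
        # Assumed to be a number for the purposes of this sketch.
        item.set("type", "numeric")
        item.set("value", repr(value))
    return item

xml_text = ET.tostring(to_item([1, "two", [3.5]]), encoding="unicode")
```

The outer list and the nested list each become a container with <item> children, while the number and the string each collapse into a single empty element carrying a value attribute.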
In part because strings are stored in XML "value" attributes -- but mostly to maintain the syntactical nature of the XML document -- Python strings needed to be stored in a "safe" form. There are a few unsafe things that could occur in Python strings. The first type is the basic markup characters like greater-than and less-than. A second type is the quote and apostrophe characters that set off attributes. The third type is questionable ASCII values, such as a null character. One possibility considered was to encode whole Python strings in something like base64 encoding. This would make strings "safe," but also completely unreadable to humans. The decision was made to use a mixed approach. The basic XML characters are escaped in the style of "&amp;", "&gt;" or "&quot;". Questionable ASCII values are escaped in Python-style, such as "\000". The combination makes for human-readable XML representations, but requires a somewhat mixed approach to decoding stored strings.
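A mixed scheme of this kind can be sketched with the standard library's xml.sax.saxutils helpers. This is not the xml_pickle implementation: the function names encode_value and decode_value are invented, the code targets current Python 3, and the doubling of backslashes is an assumption added here so that an escaped control character can never be confused with a literal backslash followed by digits.

```python
import re
from xml.sax.saxutils import escape, unescape

_QUOTES = {'"': "&quot;", "'": "&apos;"}
_UNQUOTES = {"&quot;": '"', "&apos;": "'"}

def encode_value(text):
    """Make a string safe to place inside an XML attribute value."""
    text = text.replace("\\", "\\\\")          # assumption: protect literal backslashes
    # Control characters become Python-style octal escapes such as \000.
    text = "".join("\\%03o" % ord(ch) if ord(ch) < 32 else ch for ch in text)
    # Markup and quote characters become XML entity references.
    return escape(text, _QUOTES)

def decode_value(stored):
    """Reverse encode_value: undo the XML entities, then the backslash escapes."""
    text = unescape(stored, _UNQUOTES)

    def restore(match):
        body = match.group(1)
        return "\\" if body == "\\" else chr(int(body, 8))

    return re.sub(r"\\(\\|[0-7]{3})", restore, text)

sample = 'if a<b & c>"d":\x00 path\\000'
stored = encode_value(sample)
round_trip = decode_value(stored)
```

The stored form stays readable for ordinary text, and decoding needs exactly the two passes the paragraph above describes: one for the XML entities and one for the Python-style escapes.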
There are a number of things that xml_pickle is likely to be
good for, and some user feedback has indicated that it has
entered preliminary usage. Below are a few ideas.
- XML representations of Python objects may be indexed and cataloged using existing XML-centric tools (not necessarily written in Python). This provides a ready means of indexing Python object databases (such as ZODB, PAOS, or simply shelve).
- XML representations of Python objects could be restored as objects of other OOP languages, especially ones having a similar range of basic types. This is something that has yet to be done. Much "heavier" protocols like CORBA, XML-RPC, and SOAP have an overlapping purpose, but xml_pickle is pretty "lightweight" as an object transport specification.
- Tools for printing and displaying XML documents can be used to provide convenient human-readable representations of Python objects via their XML intermediate form.
- Python objects can be manually "debugged" via their XML representation using XML-specific editors, or simply text editors. Once hand-modified objects are "unpickled," the effects of the edits on program operation can be examined. This provides an additional option to other existing Python debuggers and wrappers.
Please send me your feedback if you develop additional uses for xml_pickle or see
enhancements that would open the module to additional uses.
Resources
- On the Pythonic treatment of XML documents as objects (II)
- Revisiting xml_pickle and xml_objectify
- Enforcing validity with the gnosis.xml.validity library
- An Introduction to XML Tools for Python: Charming Python #1
- A Closer Look at Python's xml.dom Module: Charming Python #2
- The Python Special Interest Group on XML
- The World Wide Web Consortium's DOM page
- The DOM Level 1 Recommendation
- Files used and mentioned in this article
- XML Processing with Python by Sean McGrath is a friendly introduction to Python for programmers with an XML background. McGrath uses his book largely to argue the virtues of his pyxie module and associated tools and techniques as the best approach to XML processing. Whether or not pyxie is the best approach to your specific problem, McGrath's is a useful introduction to Python (but less so to XML).
- Find other articles in David Mertz's XML Matters column.

David Mertz wanted to call this column "Ex nihilo XML fit", if only for the alliteration; but he thinks his publisher shudders at the summoned imagery of a chthonic golem. David Mertz can be reached at mertz@gnosis.cx; his life pored over at http://gnosis.cx/dW/. Suggestions and recommendations on this, past, or future columns are welcomed. Check out David's book "Text Processing in Python" at http://gnosis.cx/TPiP/.



