Charming Python: Tinkering with XML and Python

An introduction to XML tools for Python

David Mertz (mertz@gnosis.cx), President, Gnosis Software, Inc.
There must be some enthymetic necessity to David Mertz writing a column on Python. Like the Monty crew, whose phonorecordings he imbibed as a teenager, he wound up with graduate degrees in philosophy. Now that he writes computer programs for a living -- and writes about writing computer programs -- a certain symmetry is served by writing such in and about Python. David would welcome comments and suggestions for this column. You can contact David at mertz@gnosis.cx and find his life pored over at http://gnosis.cx/dW/.

Summary:  A major element of getting started on working with XML in Python is sorting out the comparative capabilities of all the available modules. In this first installment of his new Python column, "Charming Python," David Mertz briefly describes the most popular and useful XML-related Python modules, and points you to resources for downloading individual modules and reading more about them. This article will help you determine which modules are most appropriate for your specific task.

Date:  01 Jun 2000
Level:  Introductory
Also available in:   Japanese


Python is in many ways an ideal language for working with XML documents. Like Perl, REBOL, REXX, and TCL, it is a flexible scripting language with powerful text manipulation capabilities. Moreover, more than most types of text files (or streams), XML documents typically encode rich and complex data structures. The familiar "read some lines and compare them to some regular expressions" style of text processing is generally not well suited to adequately parsing and processing XML. Python, fortunately (and more so than most other languages), has both straightforward ways of dealing with complex data structures (usually with classes and attributes), and a range of XML-related modules to aid in parsing, processing, and generating XML.

One general concept to keep in mind about XML is that XML documents can be processed in either a validating or non-validating fashion. In the former type of processing, it is necessary to read a "Document Type Definition" (DTD) prior to reading an XML document it applies to. The processing in this case will evaluate not just the simple syntactic rules for XML documents in general, but also the specific grammatical constraints of the DTD. In many cases, non-validating processing is adequate (and generally both faster to run, and easier to program) -- we trust the document creator to follow the rules of the document domain. Most modules discussed below are non-validating; descriptions will indicate where validation options exist.
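To make the distinction concrete, here is a minimal sketch of non-validating processing. It assumes a present-day Python with the standard-library `xml.parsers.expat` module (not the 1.5-era tools described in this article); the helper name `is_well_formed` and both sample documents are invented for illustration. A non-validating parser rejects a document that breaks XML syntax, but it accepts a well-formed document even when that document ignores the rules of its own DTD.

```python
import xml.parsers.expat

# Well-formed, but <unexpected> is not permitted by the inline DTD.
DOC_VIOLATES_DTD = """<?xml version="1.0"?>
<!DOCTYPE quotations [
  <!ELEMENT quotations (quotation+)>
  <!ELEMENT quotation (#PCDATA)>
]>
<quotations><unexpected>not allowed by the DTD</unexpected></quotations>
"""

# Not well-formed: <quotation> is never closed.
DOC_MALFORMED = "<quotations><quotation>unclosed</quotations>"


def is_well_formed(text):
    """Return True if a non-validating expat parse of `text` succeeds."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(text, True)   # True marks this as the final chunk
    except xml.parsers.expat.ExpatError:
        return False
    return True


if __name__ == "__main__":
    print(is_well_formed(DOC_VIOLATES_DTD))
    print(is_well_formed(DOC_MALFORMED))
```

A validating parser would be expected to reject the first document as well, since its content does not match the declared grammar.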

The Vaults of Parnassus (see Resources) has lately become the standard means of finding Python resources. All of the modules discussed below can be found at that site (via links to the respective module owners' sites). In particular, the PyXML distribution can be found as both a tarball and a Win32 installer in the Vaults.

Python's XML special interest group (XML-SIG)

Much -- or most -- of the effort of maintaining a range of XML tools for Python is performed by members of the XML-SIG. As with other Python Special Interest Groups, the XML-SIG maintains a mailing list, list archive, helpful references, documentation, a standard packaging, and other resources. Probably the best place to start after reading the summaries in this article is with the XML-SIG Web pages.

Of specific interest for this article, the XML-SIG maintains the PyXML distribution. This package contains many of the modules discussed in this article, some "getting started" documentation, some demonstration code, and whatever else the XML-SIG might decide to throw into the distribution. A given package may not always contain the "bleeding edge" version of each individual module or tool, but downloading the PyXML distribution is a good place to start. You can always add any modules that are not included, or any new versions of included modules, later (and many of the modules that are not included themselves rely on services provided by the PyXML distribution).


Module: XMLLIB module (standard)

"Out of the box," Python 1.5.* comes with the module [xmllib]. Python 1.6 is likely to incorporate more of the XML-SIG's efforts, but that version is still in alpha. [xmllib] is a non-validating, low-level parser. To use it, the application programmer subclasses XMLParser and provides methods to handle document elements, such as specific or generic tags or character entities.

As an example of [xmllib] in action, the PyXML distribution includes a DTD called 'quotations.dtd' and a document called 'sample.xml' that conforms to this DTD (see Resources for an archive of files mentioned in this article). The code below displays the first few lines of each quotation in 'sample.xml', and produces very simple ASCII indicators of unknown tags and entities. The parsed text is handled as a sequential stream, and any accumulators used are the programmer's responsibility (such as the string of characters (#PCDATA) within a tag, or a list/dictionary of tags encountered).


Code to try the xmllib

      #-------------------- try_xmllib.py --------------------#
      import xmllib, string

      class QuotationParser(xmllib.XMLParser):
          """Crude xmllib extractor for quotations.dtd document"""

          def __init__(self):
              xmllib.XMLParser.__init__(self)
              self.thisquote = ''             # quotation accumulator

          def handle_data(self, data):
              self.thisquote = self.thisquote + data

          def syntax_error(self, message): pass

          def start_quotations(self, attrs):  # top level tag
              print '--- Begin Document ---'

          def start_quotation(self, attrs):
              print 'QUOTATION:'

          def end_quotation(self):
              print string.join(string.split(self.thisquote[:230]))+'...',
              print '('+str(len(self.thisquote))+' bytes)\n'
              self.thisquote = ''

          def unknown_starttag(self, tag, attrs):
              self.thisquote = self.thisquote + '{'

          def unknown_endtag(self, tag):
              self.thisquote = self.thisquote + '}'

          def unknown_charref(self, ref):
              self.thisquote = self.thisquote + '?'

          def unknown_entityref(self, ref):
              self.thisquote = self.thisquote + '#'

      if __name__ == '__main__':
          parser = QuotationParser()
          for c in open("sample.xml").read():
              parser.feed(c)
          parser.close()      


Other parsing modules

Several additional parsing modules with varying capabilities are included in the PyXML distribution. These all aim to provide some improvement over the base [xmllib] module.

[pyexpat] is a wrapper for the GPL'd XML Parser Toolkit 'expat'. 'expat' in turn is a library written in C that is meant to be available from any language that wants to utilize it. 'expat' is non-validating, and should be much faster than a native Python parser. [sgmlop] is similar in purpose to [pyexpat]. It is also non-validating, and also written in C. [pyexpat] is available as a MacOS binary, and [sgmlop] is available as a Win32 binary; but if you need a different platform than these, you will need to use a C compiler to build the modules for your own platform.
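As a rough illustration of the callback style that an expat wrapper exposes, the sketch below uses `xml.parsers.expat` from a present-day Python standard library, which is an assumption on my part and not necessarily the same interface as the [pyexpat] shipped with PyXML at the time. The function name `collect_events` and the sample string are hypothetical.

```python
import xml.parsers.expat

SAMPLE = "<quotations><quotation>Hello <i>world</i></quotation></quotations>"


def collect_events(text):
    """Parse `text` and return the parse events as (kind, value) tuples."""
    events = []
    parser = xml.parsers.expat.ParserCreate()
    parser.buffer_text = True  # deliver adjacent character data in one call
    parser.StartElementHandler = lambda name, attrs: events.append(("start", name))
    parser.EndElementHandler = lambda name: events.append(("end", name))
    parser.CharacterDataHandler = lambda data: events.append(("text", data))
    parser.Parse(text, True)   # True marks this as the final chunk
    return events


if __name__ == "__main__":
    for event in collect_events(SAMPLE):
        print(event)
```

As with [xmllib], the parser only reports events in document order; building any larger structure from them is left to the caller.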

[xmlproc] is a native Python parser that performs nearly complete validation. If you need a validating parser, [xmlproc] is currently your only choice in Python. [xmlproc] also provides a variety of high-level and experimental interfaces that other parsers do not.

If you decide to use the Simple API for XML (SAX) -- which you should for anything sophisticated, since most other tools are built on top of it -- much of the work of sorting through parsers can be done for you. In the PyXML distribution, [xml.sax.drivers] contains thin wrappers for a number of parsers, including all those discussed, with names of the form 'drv_*.py'. Generally, however, you will access the drivers through a higher-level SAX facility that automatically chooses the "best" parser available on the system where it runs:


Selecting a parser

      #------------- selecting the best parser ---------------#
      from xml.sax.saxexts import *
      parser = XMLParserFactory.make_parser()
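
For readers on a present-day Python, the closest standard-library counterpart that I am aware of is `xml.sax.make_parser()`, which likewise returns whichever SAX driver is available without the caller naming it. The sketch below assumes that modern API rather than the PyXML one shown above; the `TagCounter` class and the inline document are made up for the example.

```python
import io
import xml.sax


class TagCounter(xml.sax.ContentHandler):
    """Hypothetical handler that counts start tags."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        self.count += 1


# Let the library pick a parser instead of importing a driver by name.
parser = xml.sax.make_parser()
counter = TagCounter()
parser.setContentHandler(counter)
parser.parse(io.BytesIO(b"<a><b/><b/></a>"))

if __name__ == "__main__":
    print(counter.count)
```

The application code never refers to a concrete parser, so it keeps working if a different driver is installed.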


Package: SAX

We have mentioned above that SAX can automatically choose a parser to use; but just what is SAX? A good answer is:

"SAX (Simple API for XML) is a common parser interface for XML parsers. It allows application writers to write applications that use XML parsers, but are independent of which parser is actually used. (Think of it as JDBC for XML.)" -- Lars Marius Garshol, SAX for Python (see Resources)

SAX -- like the parser modules it provides an API for -- is essentially a sequential processor of an XML document. You use it in a manner largely similar to the [xmllib] example, but with a somewhat higher level of abstraction. Instead of defining a parser class, an application programmer defines a 'handler' class that is registered with whatever parser is used. Four SAX interfaces must be defined (each with several methods): DocumentHandler, DTDHandler, EntityResolver, and ErrorHandler. Base classes of all of these are provided, but in most cases it is easiest to inherit from 'HandlerBase', which itself inherits from all four interfaces. You can override whatever you wish to. Some code will help illustrate this; the sample performs the same task as the [xmllib] example.


Sample code to try SAX

      #--------------------- try_sax.py ----------------------#
      import string
      from xml.sax import saxlib, saxexts

      class QuotationHandler(saxlib.HandlerBase):
          """Crude sax extractor for quotations.dtd document"""

          def __init__(self):
              self.in_quote = 0
              self.thisquote = ''

          def startDocument(self):
              print '--- Begin Document ---'

          def startElement(self, name, attrs):
              if name == 'quotation':
                  print 'QUOTATION:'
                  self.in_quote = 1
              elif self.in_quote:             # ignore tags outside a quotation
                  self.thisquote = self.thisquote + '{'

          def endElement(self, name):
              if name == 'quotation':
                  print string.join(string.split(self.thisquote[:230]))+'...',
                  print '('+str(len(self.thisquote))+' bytes)\n'
                  self.thisquote = ''
                  self.in_quote = 0
              elif self.in_quote:             # ignore tags outside a quotation
                  self.thisquote = self.thisquote + '}'

          def characters(self, ch, start, length):
              if self.in_quote:
                  self.thisquote = self.thisquote + ch[start:start+length]

      if __name__ == '__main__':
          parser  = saxexts.XMLParserFactory.make_parser()
          handler = QuotationHandler()
          parser.setDocumentHandler(handler)
          parser.parseFile(open("sample.xml"))
          parser.close()

Two small things to notice about the example, in contrast to [xmllib]: the 'parseFile()'/'parse()' methods handle a whole stream/string, so there is no need to create a loop to feed the parser; and 'characters()' is fed chunks of data whose size and position within the passed string are indicated by arguments. Don't make any assumptions about what the 'ch' variable will contain when it is passed to 'characters()'; use only the slice that 'start' and 'length' describe.
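
For comparison, the same handler idea can be sketched against the `xml.sax` package in a present-day Python standard library. This is an assumption about a later API, not the PyXML interface used above: there the handler base class is `ContentHandler`, and `characters()` receives only the text chunk, with no 'start' or 'length' arguments. The sample document and the `extract_quotes` helper are invented for illustration.

```python
import xml.sax


class QuotationHandler(xml.sax.ContentHandler):
    """Collect the text of each <quotation>, marking nested tags with braces."""

    def __init__(self):
        super().__init__()
        self.in_quote = False
        self.thisquote = ""
        self.quotes = []

    def startElement(self, name, attrs):
        if name == "quotation":
            self.in_quote = True
        elif self.in_quote:            # only mark tags nested in a quotation
            self.thisquote += "{"

    def endElement(self, name):
        if name == "quotation":
            # Collapse runs of whitespace, as the listing above does.
            self.quotes.append(" ".join(self.thisquote.split()))
            self.thisquote = ""
            self.in_quote = False
        elif self.in_quote:
            self.thisquote += "}"

    def characters(self, content):
        # May be called several times for one run of text, so accumulate.
        if self.in_quote:
            self.thisquote += content


SAMPLE = b"<quotations><quotation>To be, or <i>not</i> to be</quotation></quotations>"


def extract_quotes(xml_bytes):
    handler = QuotationHandler()
    xml.sax.parseString(xml_bytes, handler)
    return handler.quotes


if __name__ == "__main__":
    print(extract_quotes(SAMPLE))
```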


Package: DOM

DOM is a very high-level tree-based representation of an XML document. The model is not specific to Python, but is a common XML model (see Resources for further information). Python's DOM package is built upon SAX, and is included in the PyXML distribution. Length constraints prevent code samples in this article, but an excellent general description is given in the XML-SIG's "Python/XML HOWTO".

The Document Object Model specifies a tree-based representation for an XML document. A top-level Document instance is the root of the tree, and has a single child, which is the top-level Element instance; this Element has children nodes representing the content and any sub-elements, which may have further children, and so forth. Functions are defined which let you traverse the resulting tree any way you like, access element and attribute values, insert and delete nodes, and convert the tree back into XML.

The DOM is useful for modifying XML documents, because you can create a DOM tree, modify it by adding new nodes and moving subtrees around, and then produce a new XML document as output. You can also construct a DOM tree yourself, and convert it to XML; this is often a more flexible way of producing XML output than simply writing <tag1>...</tag1> to a file.
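
A full treatment is out of scope here, but the build, modify, and serialize cycle just described can be sketched in a few lines. The example assumes `xml.dom.minidom` from a present-day Python standard library, which may differ in detail from the DOM package in the PyXML distribution; the function name `add_quotation` and the sample document are hypothetical.

```python
from xml.dom import minidom

SAMPLE = "<quotations><quotation>First</quotation></quotations>"


def add_quotation(xml_text, new_text):
    """Parse xml_text into a DOM tree and append one <quotation> to the root."""
    doc = minidom.parseString(xml_text)
    root = doc.documentElement              # the single top-level Element
    node = doc.createElement("quotation")
    node.appendChild(doc.createTextNode(new_text))
    root.appendChild(node)
    return doc


doc = add_quotation(SAMPLE, "Second")

# Traverse the tree: read the text child of every <quotation> element.
texts = [q.firstChild.data for q in doc.getElementsByTagName("quotation")]

# Convert the modified tree back into XML.
xml_out = doc.documentElement.toxml()

if __name__ == "__main__":
    print(texts)
    print(xml_out)
```

Unlike the sequential SAX style, the whole document is held in memory, which is what makes inserting and moving nodes straightforward.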


Package: Pyxie

The [pyxie] module is built on top of the PyXML distribution from the XML-SIG, and provides additional high-level interfaces to an XML document. [pyxie] does two basic things: it transforms XML documents to a more easily parsed line-oriented format; and it provides methods to treat an XML document as a walkable tree. The line-oriented PYX format used by [pyxie] is language-independent, and tools are available for several languages. In general, a PYX representation of a document is much easier to process using familiar line-oriented text-processing tools like grep, sed, awk, bash, perl -- or standard Python modules like [string] and [re] -- than is its XML representation. Depending on what is downstream, a transformation from XML to PYX might save a lot of work.
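
To give a feel for the line-oriented idea, the sketch below emits PYX-style lines from an XML string. It is not [pyxie]'s own code: it relies on `xml.parsers.expat` from a present-day standard library, and it follows the PYX conventions as I understand them ('(' for a start tag, 'A' for an attribute, '-' for character data, ')' for an end tag, with newlines in data written as a literal backslash-n). The function name `xml_to_pyx` and the sample input are assumptions made for the example.

```python
import xml.parsers.expat


def xml_to_pyx(text):
    """Return a list of PYX-style lines describing the XML in `text`."""
    lines = []

    def start(name, attrs):
        lines.append("(" + name)
        for key, value in attrs.items():
            lines.append("A%s %s" % (key, value))

    def end(name):
        lines.append(")" + name)

    def chars(data):
        # Keep each data chunk on one output line (assumed PYX escaping).
        lines.append("-" + data.replace("\n", "\\n"))

    parser = xml.parsers.expat.ParserCreate()
    parser.buffer_text = True      # deliver adjacent character data together
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = chars
    parser.Parse(text, True)       # True marks this as the final chunk
    return lines


SAMPLE = '<quotation source="Hamlet">To be</quotation>'

if __name__ == "__main__":
    print("\n".join(xml_to_pyx(SAMPLE)))
```

Because every line starts with a one-character type marker, a tool such as grep can select, say, all character data with a pattern anchored at the start of the line.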

[pyxie]'s concept of treating an XML document like a tree is similar to the ideas in DOM. Since the DOM standard is gaining widespread support across a number of programming languages, it will probably make sense for most programmers to focus on that standard rather than on [pyxie] if tree-representation of XML documents is a requirement.


Module: XML Parser

The too generically -- and perhaps a bit wrongly -- named 'XML Parser' is a somewhat older tool to check the syntacticality and well-formedness of an XML document (but not to validate against a DTD). One extra utility class implements a bit of fuzziness in the checking to get HTML documents to pass (even without having all the closing tags XML requires). The range of applicability of this module is not as broad as those in the PyXML distribution. But it is easy to get up-and-running with XML Parser if your requirement is just to verify some XML documents. The module will check an XML document on STDIN if run from the command line without even bothering to import it into your program. You can't get much easier than that.


XML_OBJECTS 0.1

Like other high-level tools, xml_objects is built on top of SAX. The purpose of xml_objects is to transform an XML document into a two-dimensional grid representation that can be stored more easily in a relational database.


What's next

In the next "Charming Python" column, we'll take a closer look at the xml.dom module, probably the most powerful tool available to a Python programmer for working with XML documents.


Resources

