Simple API for XML (SAX) is a very efficient method of XML processing. In SAX processing, the parser passes to the application a stream of events that represents the XML content. An important aspect of SAX is the user's ability to create SAX filters, which accept a stream of SAX events and pass on a modified stream. For example, you might use a SAX filter that takes as input XML-ized HTML using such deprecated HTML practices as <center> and passes on proper XHTML forms such as <div style="text-align: center">. Such a filter could then be reused very easily in a broad array of applications because it does a single, focused task and is, by design, separate from the systems upstream and downstream of it.
If you are unfamiliar with SAX, please see some of the introductory material mentioned in Resources.
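The <center> example from the introduction can be sketched roughly as follows. This is only an illustration under stated assumptions: the class name CenterToDivFilter and the helper convert are hypothetical, the replacement uses the CSS text-align property, and any attributes on the original <center> element are simply dropped. Unlike a filter that takes its downstream handler as a constructor argument, this sketch relies on the forwarding that XMLFilterBase already provides, attaching the downstream handler with setContentHandler.

```python
import io
from xml.sax import make_parser
from xml.sax.saxutils import XMLFilterBase, XMLGenerator
from xml.sax.xmlreader import AttributesImpl


class CenterToDivFilter(XMLFilterBase):
    """Hypothetical filter: rewrite <center> as <div style="text-align: center">."""

    def startElement(self, name, attrs):
        if name == 'center':
            # Swap in a div that carries the equivalent CSS rule
            # (assumption: attributes on <center> are not preserved)
            name = 'div'
            attrs = AttributesImpl({'style': 'text-align: center'})
        # XMLFilterBase forwards the event to the registered content handler
        XMLFilterBase.startElement(self, name, attrs)

    def endElement(self, name):
        if name == 'center':
            name = 'div'
        XMLFilterBase.endElement(self, name)


def convert(xml_text):
    """Run the filter over a string and return the serialized result."""
    out = io.StringIO()
    filter_handler = CenterToDivFilter(make_parser())
    # Attach the downstream handler using the standard XMLReader method
    filter_handler.setContentHandler(XMLGenerator(out))
    filter_handler.parse(io.BytesIO(xml_text.encode('utf-8')))
    return out.getvalue()


if __name__ == "__main__":
    print(convert('<body><center>Hello</center></body>'))
```

Because startDocument is also forwarded here, XMLGenerator writes an XML declaration before the converted markup.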
A SAX filter that selects English language sections
XML 1.0 allows you to specify the language used in element content on an element-by-element basis using the xml:lang attribute (see my earlier tip, "Localization within a document format," in Resources for more information on this). Here, I shall create a SAX filter in Python that strips all content that is known to be in a language other than English; in other words, the filter preserves all content that doesn't have an xml:lang designation or has a designation starting with en.
In most object-based SAX systems, a SAX filter is implemented as a specialized handler class. All SAX handlers accept SAX events from an upstream source, which might be the XML parser directly. A SAX filter is also a SAX handler class, but it is distinguished in that it generates further SAX events by calling the appropriate methods on a given instance, which is the downstream SAX handler in the filter chain. Listing 1 shows such a SAX filter.
Listing 1. A filter that removes non-English content (en-filter.py)
import xml.sax
from xml.sax.saxutils import XMLFilterBase, XMLGenerator

#Define constants for the two states we care about
ALLOW_CONTENT = 1
SUPPRESS_CONTENT = 2

class EnglishOnlyFilter(XMLFilterBase):
    def __init__(self, upstream, downstream):
        XMLFilterBase.__init__(self, upstream)
        self._downstream = downstream
        return

    def startDocument(self):
        #Set the initial state, and set up the stack of states
        self._state = ALLOW_CONTENT
        self._state_stack = [ALLOW_CONTENT]
        return

    def startElement(self, name, attrs):
        #Check if there is any language attribute
        lang = attrs.get('xml:lang')
        if lang:
            #Set the state as appropriate
            if lang[:2] == 'en':
                self._state = ALLOW_CONTENT
            else:
                self._state = SUPPRESS_CONTENT
        #Always update the stack with the current state
        #Even if it has not changed
        self._state_stack.append(self._state)
        #Only forward the event if the state warrants it
        if self._state == ALLOW_CONTENT:
            self._downstream.startElement(name, attrs)
        return

    def endElement(self, name):
        #Pop the state of the element that is closing
        element_state = self._state_stack.pop()
        #Only forward the event if that state warrants it
        if element_state == ALLOW_CONTENT:
            self._downstream.endElement(name)
        #Restore the state of the enclosing element, so that
        #following siblings without xml:lang inherit it correctly
        self._state = self._state_stack[-1]
        return

    def characters(self, content):
        #Only forward the event if the state warrants it
        if self._state == ALLOW_CONTENT:
            self._downstream.characters(content)
        return


if __name__ == "__main__":
    parser = xml.sax.make_parser()
    #XMLGenerator is a special SAX handler that merely writes
    #SAX events back into an XML document
    downstream_handler = XMLGenerator()
    #upstream, the parser, downstream, the next handler in the chain
    filter_handler = EnglishOnlyFilter(parser, downstream_handler)
    import sys
    #The SAX filter base is designed so that the filter takes
    #on much of the interface of the parser itself, including the
    #"parse" method. The file to parse is the first argument.
    filter_handler.parse(sys.argv[1])
Python supplies a utility class from which SAX filters can be derived: XMLFilterBase. I define EnglishOnlyFilter as a filter that takes an upstream SAX event source (the parser or another filter) and a downstream SAX filter or other handler. Many SAX handlers operate as state machines, meaning they manage some variables based on the stream of events that come in, and change behavior based on these variables. EnglishOnlyFilter is set up to be in one of two states: one in which content is passed on to the downstream handler, and one in which content is suppressed. This state is marked in the self._state instance variable. The state is initially set to ALLOW_CONTENT, and changed to SUPPRESS_CONTENT if the filter encounters an xml:lang attribute that represents a language other than English (which can be done by checking the first two characters of the value, according to the rules of standard language codes).
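The two-character check is enough for the two-letter codes used in this tip. A slightly stricter test, sketched here as an assumption and not as part of Listing 1, compares the whole primary subtag of the xml:lang value. That way a three-letter code that merely begins with "en" (for example, enm) would not be treated as English, and differences in letter case are tolerated. The helper name is_english is hypothetical.

```python
def is_english(lang):
    """Return True if an xml:lang value has 'en' as its primary subtag.

    Hypothetical helper: it looks only at the text before the first
    hyphen and ignores letter case.
    """
    primary = lang.split('-')[0].lower()
    return primary == 'en'


if __name__ == "__main__":
    for tag in ('en', 'en-US', 'EN-gb', 'es', 'enm'):
        print(tag, is_english(tag))
```

In a filter such as EnglishOnlyFilter, a call like is_english(lang) could stand in for the lang[:2] == 'en' comparison.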
XML language specifications are scoped. See Listing 2 for an example of what this means.
Listing 2. A sample XML file with multiple languages (listing2.xml)
<?xml version="1.0" encoding="utf-8"?>
<menu>
  <item id="A" xml:lang="en">Orange juice</item>
  <item id="A" xml:lang="es">Jugo de naranja</item>
  <item id="B" xml:lang="en">Toast</item>
  <item id="B" xml:lang="es">Pan tostada
    <note xml:lang="en">Wheat bread only, please</note>
  </item>
</menu>
In this example, the string "Pan tostada" is within the scope of the element with the attribute xml:lang="es", and so it is marked as being in Spanish. The entire note element, however, is marked as being in English by an overriding xml:lang="en" attribute. Such scoping requires that I maintain a stack of the state in the SAX filter, the self._state_stack instance variable. To be precise, the self._state_stack variable makes self._state unnecessary (I could have just read the current state from the top of the stack), but I left it in for a bit of added clarity. Running the filter code against the sample XML gives the following output.
$ python en-filter.py listing2.xml
<menu>
  <item xml:lang="en" id="A">Orange juice</item>
  <item xml:lang="en" id="B">Toast</item>
  <note xml:lang="en">Wheat bread only, please</note>
</menu>
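The remark that self._state_stack makes self._state unnecessary can be sketched as follows. This is a hypothetical variant for illustration, not the author's listing: the names StackOnlyEnglishFilter and filter_english are assumptions, and the output is collected in a StringIO (rather than written to standard output) so that it can be inspected.

```python
import io
import xml.sax
from xml.sax.saxutils import XMLFilterBase, XMLGenerator

ALLOW_CONTENT = 1
SUPPRESS_CONTENT = 2


class StackOnlyEnglishFilter(XMLFilterBase):
    """Hypothetical variant that reads the current state from the stack top."""

    def __init__(self, upstream, downstream):
        XMLFilterBase.__init__(self, upstream)
        self._downstream = downstream

    def startDocument(self):
        # The bottom entry is the state outside any xml:lang scope
        self._state_stack = [ALLOW_CONTENT]

    def startElement(self, name, attrs):
        lang = attrs.get('xml:lang')
        if lang:
            state = ALLOW_CONTENT if lang[:2] == 'en' else SUPPRESS_CONTENT
        else:
            # No attribute: inherit the state of the enclosing element
            state = self._state_stack[-1]
        self._state_stack.append(state)
        if state == ALLOW_CONTENT:
            self._downstream.startElement(name, attrs)

    def endElement(self, name):
        # Popping yields the state of the element that is closing
        if self._state_stack.pop() == ALLOW_CONTENT:
            self._downstream.endElement(name)

    def characters(self, content):
        if self._state_stack[-1] == ALLOW_CONTENT:
            self._downstream.characters(content)


def filter_english(xml_bytes):
    """Filter a bytes document and return the surviving markup as text."""
    out = io.StringIO()
    handler = StackOnlyEnglishFilter(xml.sax.make_parser(), XMLGenerator(out))
    handler.parse(io.BytesIO(xml_bytes))
    return out.getvalue()


if __name__ == "__main__":
    sample = (b'<menu><item xml:lang="en">Toast</item>'
              b'<item xml:lang="es">Pan tostada</item>'
              b'<item>Tea</item></menu>')
    print(filter_english(sample))
```

In this sketch, the last item has no xml:lang attribute, so it inherits the state of the enclosing menu element and is passed through.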
SAX is already fast, and SAX filters add some flexibility. As you use SAX more and more, you may find yourself with an impressive library of SAX filters for all sorts of processing tasks.
Resources
- To get your feet wet with SAX, read "SAX, the power API," by Benoît Marchal, which introduces SAX in Java (developerWorks, August 2001). David Mertz's article "Revisiting XML tools for Python" briefly introduces the Python SAX API (developerWorks, June 2001).
- My earlier tip, "Localization within a document format," introduces xml:lang (developerWorks, September 2002).
- Learn the fundamentals of using SAX 2.0 -- including retrieving, manipulating, and outputting XML data -- in Nicholas Chase's tutorial "Understanding SAX" (developerWorks, September 2001).
- See the official Python documentation for a reference of xml.sax.saxutils, which includes the filter base class.
- Find information on Basic SAX processing in my Python/XML Akara site, where I maintain a lot of information about Python/SAX and other such facilities.
- Perl users -- consult "Transforming XML With SAX Filters," by Kip Hampton, to learn how to use SAX filters in Perl.
- Find more XML resources on the developerWorks XML zone. For a complete list of XML tips to date, check out the tips summary page.
- IBM trial software: Build your next development project with trial software available for download directly from developerWorks.
- Find out how you can become an IBM Certified Developer in XML and related technologies.