Level: Intermediate
Uche Ogbuji (uche@ogbuji.net), Principal Consultant, Fourthought, Inc.
01 Mar 2003
SAX filters allow you to construct complex XML processing behaviors from simple, independent modules. In this tip, Uche Ogbuji introduces this important XML processing technique.
Simple API for XML (SAX) is a very efficient method of XML processing. In SAX processing, the parser passes to the application a stream of events that represents the XML content. An important aspect of SAX is the user's ability to create SAX filters, which accept a stream of SAX events and pass on a modified stream. For example, you might use a SAX filter that takes as input XML-ized HTML using such deprecated HTML practices as <center> and passes on proper XHTML forms such as <div style="text-align: center">. Such a filter can then be reused very easily in a broad array of applications, because it performs a single, focused task and is by design separate from the systems both upstream and downstream of it.
If you are unfamiliar with SAX, please see some of the introductory material mentioned in Resources.
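The <center> example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name CenterToDivFilter and the helper recenter are hypothetical, the filter discards any attributes the original <center> element carried, and it relies on the forwarding that XMLFilterBase already provides once a content handler is registered with setContentHandler.

```python
import io
import xml.sax
from xml.sax.saxutils import XMLFilterBase, XMLGenerator
from xml.sax.xmlreader import AttributesImpl


class CenterToDivFilter(XMLFilterBase):
    """Hypothetical filter: rewrites <center> as a centered <div>."""

    def startElement(self, name, attrs):
        if name == 'center':
            name = 'div'
            # Assumption: attributes on the original <center> are dropped
            attrs = AttributesImpl({'style': 'text-align: center'})
        # The base class forwards the event to the registered content handler
        XMLFilterBase.startElement(self, name, attrs)

    def endElement(self, name):
        if name == 'center':
            name = 'div'
        XMLFilterBase.endElement(self, name)


def recenter(xml_text):
    """Run xml_text through the filter and return the serialized result."""
    out = io.StringIO()
    filt = CenterToDivFilter(xml.sax.make_parser())
    filt.setContentHandler(XMLGenerator(out))
    filt.parse(io.BytesIO(xml_text.encode('utf-8')))
    return out.getvalue()
```

Because the filter knows nothing about where its events come from or where they go, the same class could sit behind a parser, behind another filter, or in front of any other handler.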
A SAX filter that selects English language sections
XML 1.0 allows you to specify the language used in element content on an element-by-element basis using the xml:lang attribute (see my earlier tip, "Localization within a document format," in Resources for more information on this). Here, I shall create a SAX filter in Python that strips all content that is known to be in a language other than English; in other words, the filter preserves all content that doesn't have an xml:lang designation or has a designation starting with en.
In most object-based SAX systems, a SAX filter is implemented as a specialized handler class. All SAX handlers accept SAX events from an upstream source, which might be the XML parser itself. A SAX filter is also a SAX handler, but it is distinguished by what it does with those events: it generates further SAX events by calling the appropriate methods on the downstream SAX handler, the next link in the filter chain. Listing 1 shows such a SAX filter.
Listing 1. A filter that removes non-English content (en-filter.py)
import sys
import xml.sax
from xml.sax.saxutils import XMLFilterBase, XMLGenerator

# Define constants for the two states we care about
ALLOW_CONTENT = 1
SUPPRESS_CONTENT = 2


class EnglishOnlyFilter(XMLFilterBase):
    def __init__(self, upstream, downstream):
        XMLFilterBase.__init__(self, upstream)
        self._downstream = downstream

    def startDocument(self):
        # Set the initial state, and set up the stack of states
        self._state = ALLOW_CONTENT
        self._state_stack = [ALLOW_CONTENT]

    def startElement(self, name, attrs):
        # Check if there is any language attribute
        lang = attrs.get('xml:lang')
        if lang:
            # Set the state as appropriate
            if lang[:2] == 'en':
                self._state = ALLOW_CONTENT
            else:
                self._state = SUPPRESS_CONTENT
        # Always update the stack with the current state,
        # even if it has not changed
        self._state_stack.append(self._state)
        # Only forward the event if the state warrants it
        if self._state == ALLOW_CONTENT:
            self._downstream.startElement(name, attrs)

    def endElement(self, name):
        # Pop this element's own state; forward the event only
        # if the element itself was allowed
        element_state = self._state_stack.pop()
        if element_state == ALLOW_CONTENT:
            self._downstream.endElement(name)
        # Restore the state of the enclosing element
        self._state = self._state_stack[-1]

    def characters(self, content):
        # Only forward the event if the state warrants it
        if self._state == ALLOW_CONTENT:
            self._downstream.characters(content)


if __name__ == "__main__":
    parser = xml.sax.make_parser()
    # XMLGenerator is a special SAX handler that merely writes
    # SAX events back into an XML document
    downstream_handler = XMLGenerator()
    # upstream, the parser; downstream, the next handler in the chain
    filter_handler = EnglishOnlyFilter(parser, downstream_handler)
    # The SAX filter base is designed so that the filter takes
    # on much of the interface of the parser itself, including the
    # "parse" method
    filter_handler.parse(sys.argv[1])
Python supplies a utility class from which SAX filters can be derived -- XMLFilterBase. I define EnglishOnlyFilter as a filter that takes an upstream SAX event source (the parser or another filter) and a downstream SAX filter or other handler. Many SAX handlers operate as state machines, meaning they manage some variables based on the stream of events that come in, and change behavior based on these variables. EnglishOnlyFilter is set up to be in one of two states: one in which content is passed on to the downstream handler, and one in which content is suppressed. This state is marked in the self._state instance variable. The state is initially set to ALLOW_CONTENT, and changed to SUPPRESS_CONTENT if the filter encounters an xml:lang attribute that represents a language other than English (which can be done by checking the first two characters of the value, according to the rules of standard language codes).
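The two-character prefix test treats any value that begins with en as English. As a slightly stricter alternative, the check can compare the primary subtag of the language code instead. The helper below is a hypothetical sketch, not part of Listing 1; it assumes that xml:lang values use the usual hyphen-separated form (such as en-US) and that case is not significant.

```python
def is_english(lang):
    """Return True if the primary language subtag is 'en'.

    Hypothetical helper: assumes hyphen-separated language codes
    and compares the primary subtag without regard to case.
    """
    primary = lang.split('-', 1)[0]
    return primary.lower() == 'en'
```

With this helper, values such as en, en-US, and EN-gb count as English, while es and a three-letter code that merely begins with en do not.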
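XML language specifications are scoped: an xml:lang attribute applies to the element that carries it and to everything inside that element, unless a descendant element overrides it with its own xml:lang. See Listing 2 for an example of what this means.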
Listing 2. A sample XML file with multiple languages (listing2.xml)
<?xml version="1.0" encoding="utf-8"?>
<menu>
  <item id="A" xml:lang="en">Orange juice</item>
  <item id="A" xml:lang="es">Jugo de naranja</item>
  <item id="B" xml:lang="en">Toast</item>
  <item id="B" xml:lang="es">Pan tostada
    <note xml:lang="en">Wheat bread only, please</note>
  </item>
</menu>
In this example, the string "Pan tostada" is within the scope of the element with the attribute xml:lang="es", and so it is marked as being in Spanish. The entire note element, however, is marked as being in English by an overriding xml:lang="en" attribute. Such scoping requires that I maintain a stack of the state in the SAX filter, the self._state_stack instance variable. To be precise, the self._state_stack variable makes self._state unnecessary -- I could have just read the current state from the top of the stack -- but I left it in for a bit of added clarity. Running the filter code against the sample XML gives the following output.
$ python en-filter.py listing2.xml
<menu>
<item xml:lang="en" id="A">Orange juice</item>
<item xml:lang="en" id="B">Toast</item>
<note xml:lang="en">Wheat bread only, please</note>
</menu>
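The stack-only design mentioned above, in which the current state is always read from the top of the stack, might look like the following sketch. The names StackOnlyEnglishFilter and english_only are hypothetical. Unlike Listing 1, this variant registers its downstream handler with setContentHandler and lets XMLFilterBase do the forwarding, so startDocument also reaches XMLGenerator and the output begins with an XML declaration.

```python
import io
import xml.sax
from xml.sax.saxutils import XMLFilterBase, XMLGenerator


class StackOnlyEnglishFilter(XMLFilterBase):
    """Hypothetical variant of EnglishOnlyFilter that keeps only a stack."""

    def startDocument(self):
        # True means "forward content"; the bottom entry covers the root
        self._allow_stack = [True]
        XMLFilterBase.startDocument(self)

    def startElement(self, name, attrs):
        lang = attrs.get('xml:lang')
        if lang:
            allow = lang[:2] == 'en'
        else:
            # No xml:lang here, so inherit from the enclosing element
            allow = self._allow_stack[-1]
        self._allow_stack.append(allow)
        if allow:
            XMLFilterBase.startElement(self, name, attrs)

    def endElement(self, name):
        # The popped entry is this element's own state
        if self._allow_stack.pop():
            XMLFilterBase.endElement(self, name)

    def characters(self, content):
        # The top of the stack is the state of the enclosing element
        if self._allow_stack[-1]:
            XMLFilterBase.characters(self, content)


def english_only(xml_text):
    """Filter xml_text and return the serialized English-only result."""
    out = io.StringIO()
    filt = StackOnlyEnglishFilter(xml.sax.make_parser())
    filt.setContentHandler(XMLGenerator(out))
    filt.parse(io.BytesIO(xml_text.encode('utf-8')))
    return out.getvalue()
```

Because each element pushes exactly one entry and pops it again when it closes, the top of the stack always reflects the innermost open element, which is the scoping behavior that xml:lang requires.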
Wrap up
SAX is already fast, and SAX filters add some flexibility. As you use SAX more and more, you may find yourself with an impressive library of SAX filters for all sorts of processing tasks.
Resources
- To get your feet wet with SAX, read "SAX, the power API," by Benoît Marchal, which introduces SAX in Java (developerWorks, August 2001). David Mertz's article "Revisiting XML tools for Python" briefly introduces the Python SAX API (developerWorks, June 2001).
- My earlier tip, "Localization within a document format," introduces xml:lang (developerWorks, September 2002).
- Learn the fundamentals of using SAX 2.0 -- including retrieving, manipulating, and outputting XML data -- in Nicholas Chase's tutorial "Understanding SAX" (developerWorks, September 2001).
- See the official Python documentation for a reference of xml.sax.saxutils, which includes the filter base class.
- Find information on Basic SAX processing in my Python/XML Akara site, where I maintain a lot of information about Python/SAX and other such facilities.
- Perl users -- consult "Transforming XML With SAX Filters," by Kip Hampton, to learn how to use SAX filters in Perl.
- Find more XML resources on the developerWorks XML zone. For a complete list of XML tips to date, check out the tips summary page.
- IBM trial software: Build your next development project with trial software available for download directly from developerWorks.
- Find out how you can become an IBM Certified Developer in XML and related technologies.
About the author
Uche Ogbuji is a consultant and co-founder of Fourthought Inc., a software vendor and consultancy specializing in XML solutions for enterprise knowledge management. Fourthought develops 4Suite, an open source platform for XML, RDF, and knowledge-management applications. Mr. Ogbuji is a computer engineer and writer born in Nigeria, living and working in Boulder, Colorado, USA. You can contact Mr. Ogbuji at uche@ogbuji.net.