
Tip: SAX filters for flexible processing

Create a chain of XML processes

Uche Ogbuji (uche@ogbuji.net), Principal Consultant, Fourthought, Inc.
Uche Ogbuji is a consultant and co-founder of Fourthought Inc., a software vendor and consultancy specializing in XML solutions for enterprise knowledge management. Fourthought develops 4Suite, an open source platform for XML, RDF, and knowledge-management applications. Mr. Ogbuji is a computer engineer and writer born in Nigeria, living and working in Boulder, Colorado, USA. You can contact Mr. Ogbuji at uche@ogbuji.net.

Summary:  SAX filters allow you to construct complex XML processing behaviors from simple, independent modules. In this tip, Uche Ogbuji introduces this important XML processing technique.


Date:  01 Mar 2003
Level:  Intermediate

Simple API for XML (SAX) is a very efficient method of XML processing. In SAX processing, the parser passes to the application a stream of events that represents the XML content. An important aspect of SAX is the ability to create SAX filters, which accept a stream of SAX events and pass on a modified stream. For example, you might use a SAX filter that takes as input XML-ized HTML using deprecated elements such as <center> and passes on proper XHTML forms such as <div style="text-align: center">. Such a filter is easy to reuse across a broad array of applications because it performs a single, focused task and, by design, is independent of the systems both upstream and downstream of it.
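
As a rough illustration of the <center> example just described, here is a minimal sketch of such a filter in Python. The class name CenterToDivFilter and the helper convert are hypothetical names chosen for this sketch, and it relies on the forwarding that XMLFilterBase performs by default rather than on an explicit downstream argument.

```python
import io
import xml.sax
from xml.sax.saxutils import XMLFilterBase, XMLGenerator
from xml.sax.xmlreader import AttributesImpl


class CenterToDivFilter(XMLFilterBase):
    """Hypothetical filter: rewrites <center> as a styled <div>."""

    def startElement(self, name, attrs):
        if name == "center":
            # Keep any existing attributes and add the centering style
            new_attrs = dict(attrs.items())
            new_attrs["style"] = "text-align: center"
            name = "div"
            attrs = AttributesImpl(new_attrs)
        # XMLFilterBase forwards the event to the registered content handler
        XMLFilterBase.startElement(self, name, attrs)

    def endElement(self, name):
        if name == "center":
            name = "div"
        XMLFilterBase.endElement(self, name)


def convert(xml_text):
    """Run xml_text through the filter and return the serialized result."""
    out = io.StringIO()
    filt = CenterToDivFilter(xml.sax.make_parser())
    filt.setContentHandler(XMLGenerator(out))
    filt.parse(io.BytesIO(xml_text.encode("utf-8")))
    return out.getvalue()


if __name__ == "__main__":
    print(convert("<body><center>Hello</center></body>"))
```

Because the filter knows nothing about where its events come from or where they go, the same class could sit behind any parser and in front of any handler.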

If you are unfamiliar with SAX, please see some of the introductory material mentioned in Resources.

A SAX filter that selects English language sections

XML 1.0 allows you to specify the language used in element content on an element-by-element basis using the xml:lang attribute (see my earlier tip, "Localization within a document format," in Resources for more information on this). Here, I shall create a SAX filter in Python that strips all content that is known to be in a language other than English; in other words, the filter preserves all content that doesn't have an xml:lang designation or has a designation starting with en.

In most object-based SAX systems, a SAX filter is implemented as a specialized handler class. All SAX handlers accept SAX events from an upstream source, which might be the XML parser itself. A SAX filter is also a SAX handler, but what distinguishes it is that it responds by generating further SAX events: it calls the appropriate methods on another handler instance, the downstream SAX handler in the filter chain. Listing 1 shows such a SAX filter.


Listing 1. A filter that removes non-English content (en-filter.py)
                
import xml.sax
from xml.sax.saxutils import XMLFilterBase, XMLGenerator

#Define constants for the two states we care about
ALLOW_CONTENT = 1
SUPPRESS_CONTENT = 2

class EnglishOnlyFilter(XMLFilterBase):
    def __init__(self, upstream, downstream):
        XMLFilterBase.__init__(self, upstream)
        self._downstream = downstream
        return

    def startDocument(self):
        #Set the initial state, and set up the stack of states
        self._state = ALLOW_CONTENT
        self._state_stack = [ALLOW_CONTENT]
        return

    def startElement(self, name, attrs):
        #Check if there is any language attribute
        lang = attrs.get('xml:lang')
        if lang:
            #Set the state as appropriate
            if lang[:2] == 'en':
                self._state = ALLOW_CONTENT
            else:
                self._state = SUPPRESS_CONTENT
        #Always update the stack with the current state
        #Even if it has not changed
        self._state_stack.append(self._state)
        #Only forward the event if the state warrants it
        if self._state == ALLOW_CONTENT:
            self._downstream.startElement(name, attrs)
        return

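    def endElement(self, name):
        #Pop the state that was in effect inside the closing element
        closing_state = self._state_stack.pop()
        #Only forward the event if the element itself was forwarded
        if closing_state == ALLOW_CONTENT:
            self._downstream.endElement(name)
        #Restore the state of the enclosing element, so that content
        #following a suppressed element is not dropped by mistake
        self._state = self._state_stack[-1]
        return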

    def characters(self, content):
        #Only forward the event if the state warrants it
        if self._state == ALLOW_CONTENT:
            self._downstream.characters(content)
        return


if __name__ == "__main__":
    parser = xml.sax.make_parser()
    #XMLGenerator is a special SAX handler that merely writes
    #SAX events back into an XML document
    downstream_handler = XMLGenerator()
    #upstream, the parser, downstream, the next handler in the chain
    filter_handler = EnglishOnlyFilter(parser, downstream_handler)
    import sys
    #The SAX filter base is designed so that the filter takes
    #on much of the interface of the parser itself, including the
    #"parse" method
    filter_handler.parse(sys.argv[1])

Python supplies a utility class from which SAX filters can be derived -- XMLFilterBase. I define EnglishOnlyFilter as a filter that takes an upstream SAX event source (the parser or another filter) and a downstream SAX filter or other handler. Many SAX handlers operate as state machines, meaning they manage some variables based on the stream of events that come in, and change behavior based on these variables. EnglishOnlyFilter is set up to be in one of two states: one in which content is passed on to the downstream handler, and one in which content is suppressed. This state is marked in the self._state instance variable. The state is initially set to ALLOW_CONTENT, and changed to SUPPRESS_CONTENT if the filter encounters an xml:lang attribute that represents a language other than English (which can be done by checking the first two characters of the value, according to the rules of standard language codes).
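
The two-character test accepts any value that merely begins with "en". A somewhat stricter check, sketched below, compares only the primary subtag (the part before the first hyphen) and ignores case. The helper name is_english is hypothetical and is not part of Listing 1.

```python
def is_english(lang):
    """Return True if the xml:lang value has 'en' as its primary subtag.

    Hypothetical helper for illustration; not part of Listing 1.
    """
    # The primary subtag is everything before the first hyphen
    primary = lang.split("-", 1)[0]
    # Language tags are compared without regard to case
    return primary.strip().lower() == "en"


if __name__ == "__main__":
    for tag in ("en", "en-US", "EN-gb", "es", "enm"):
        print(tag, is_english(tag))
```

In the filter, a call such as is_english(lang) could take the place of the lang[:2] == 'en' comparison.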

XML language specifications are scoped. See Listing 2 for an example of what this means.


Listing 2. A sample XML file with multiple languages (listing2.xml)
                
<?xml version="1.0" encoding="utf-8"?>
<menu>
  <item id="A" xml:lang="en">Orange juice</item>
  <item id="A" xml:lang="es">Jugo de naranja</item>
  <item id="B" xml:lang="en">Toast</item>
  <item id="B" xml:lang="es">Pan tostada
    <note xml:lang="en">Wheat bread only, please</note>
  </item>
</menu>

In this example, the string "Pan tostada" is within the scope of the element with the attribute xml:lang="es", and so it is marked as being in Spanish. The entire note element, however, is marked as being in English by an overriding xml:lang="en" attribute. Because of such scoping, the filter must remember the state of every enclosing element, which is the job of the stack in the self._state_stack instance variable. To be precise, the self._state_stack variable makes self._state unnecessary -- I could have just read the current state from the top of the stack -- but I left it in for a bit of added clarity. Running the filter code against the sample XML gives output like the following (the exact whitespace and attribute order may differ). Notice that the note element survives even though its Spanish parent element is dropped.

$ python en-filter.py listing2.xml
<menu>
  <item xml:lang="en" id="A">Orange juice</item>
  <item xml:lang="en" id="B">Toast</item>
  <note xml:lang="en">Wheat bread only, please</note>
</menu>
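
The remark that the stack makes self._state unnecessary can be sketched as follows. This is a variant written for illustration, not the code of Listing 1: the names StackOnlyEnglishFilter and filter_english are hypothetical, and the output is collected in a string instead of being written to standard output.

```python
import io
import xml.sax
from xml.sax.saxutils import XMLFilterBase, XMLGenerator

ALLOW_CONTENT = 1
SUPPRESS_CONTENT = 2


class StackOnlyEnglishFilter(XMLFilterBase):
    """Variant of EnglishOnlyFilter that keeps no separate state variable."""

    def __init__(self, upstream, downstream):
        XMLFilterBase.__init__(self, upstream)
        self._downstream = downstream

    def startDocument(self):
        # The bottom entry is the state outside any element
        self._state_stack = [ALLOW_CONTENT]

    def startElement(self, name, attrs):
        # Inherit the enclosing state unless xml:lang overrides it
        state = self._state_stack[-1]
        lang = attrs.get('xml:lang')
        if lang:
            state = ALLOW_CONTENT if lang[:2] == 'en' else SUPPRESS_CONTENT
        self._state_stack.append(state)
        if state == ALLOW_CONTENT:
            self._downstream.startElement(name, attrs)

    def endElement(self, name):
        # Popping exposes the enclosing element's state again
        if self._state_stack.pop() == ALLOW_CONTENT:
            self._downstream.endElement(name)

    def characters(self, content):
        if self._state_stack[-1] == ALLOW_CONTENT:
            self._downstream.characters(content)


def filter_english(xml_bytes):
    """Filter a document given as bytes; return the result as a string."""
    out = io.StringIO()
    handler = StackOnlyEnglishFilter(xml.sax.make_parser(), XMLGenerator(out))
    handler.parse(io.BytesIO(xml_bytes))
    return out.getvalue()
```

With only the stack, the state of the enclosing element is always what sits on top after a pop, so there is no second variable to keep in step.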


Wrap up

SAX is already fast, and SAX filters add some flexibility. As you use SAX more and more, you may find yourself with an impressive library of SAX filters for all sorts of processing tasks.
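
To give a feel for how such a library might be combined, here is a small sketch of two filters chained between the parser and an XMLGenerator. Both filter classes and the run_chain helper are hypothetical examples made up for this sketch. Each filter takes the previous stage as its parent, which is the chaining mechanism XMLFilterBase itself provides.

```python
import io
import xml.sax
from xml.sax.saxutils import XMLFilterBase, XMLGenerator


class DropElementFilter(XMLFilterBase):
    """Hypothetical filter: removes every element with a given name."""

    def __init__(self, parent, name):
        XMLFilterBase.__init__(self, parent)
        self._name = name
        self._depth = 0  # greater than zero while inside a dropped element

    def startElement(self, name, attrs):
        if self._depth or name == self._name:
            self._depth += 1
        else:
            XMLFilterBase.startElement(self, name, attrs)

    def endElement(self, name):
        if self._depth:
            self._depth -= 1
        else:
            XMLFilterBase.endElement(self, name)

    def characters(self, content):
        if not self._depth:
            XMLFilterBase.characters(self, content)


class RenameElementFilter(XMLFilterBase):
    """Hypothetical filter: renames one element, leaving the rest alone."""

    def __init__(self, parent, old, new):
        XMLFilterBase.__init__(self, parent)
        self._old, self._new = old, new

    def startElement(self, name, attrs):
        if name == self._old:
            name = self._new
        XMLFilterBase.startElement(self, name, attrs)

    def endElement(self, name):
        if name == self._old:
            name = self._new
        XMLFilterBase.endElement(self, name)


def run_chain(xml_bytes):
    """parser -> drop <note> -> rename <item> to <entry> -> XMLGenerator"""
    out = io.StringIO()
    stage1 = DropElementFilter(xml.sax.make_parser(), "note")
    stage2 = RenameElementFilter(stage1, "item", "entry")
    stage2.setContentHandler(XMLGenerator(out))
    stage2.parse(io.BytesIO(xml_bytes))
    return out.getvalue()
```

Neither filter refers to the other, so the stages can be reordered, removed, or reused in another chain without changing their code.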


Resources

