
Tip: Asynchronous SAX

Use Simple API for XML as a long-running event processor

David Mertz (mertz@gnosis.cx), Daemon, Gnosis Software, Inc.
David Mertz uses a wholly unstructured brain to write about structured document formats. David may be reached at mertz@gnosis.cx; his life pored over at http://gnosis.cx/dW/. And buy his book: Text Processing in Python.

Summary:  Even though every abstract description of SAX prominently mentions that it is an event-driven interface, very few SAX applications really use SAX for event-driven programming. Instead, SAX is mostly just a way to save memory while extracting data from an XML document. However, over asynchronous channels -- such as a socket that produces data over a long duration -- SAX is a wonderfully lightweight programming technique for parsing incoming messages.


Date:  14 May 2003
Level:  Introductory

You probably think of XML mainly as a format for files. Parsing an XML file using SAX means opening the file, reading through it sequentially to find tags and contents, processing each occurrence, and closing the file when parsing is done. But the XML specification applies just as well to asynchronous streams as to disk files, and since SAX is strictly unidirectional, it works well on streams.

In principle, a stream can be many things, but in Internet programming the term usually refers to BSD sockets -- an interface implemented on all modern operating systems. Still, nothing prevents you from using the techniques in this tip for a serial-port instrument connection, for monitoring GUI events, or for similar long-running and intermittent data streams.

The basic idea this tip promotes is that XML is often an excellent choice for a wire protocol, and SAX is the most natural technique for coding client applications that use this protocol. While XML's verbosity can be a problem when monitoring large volumes of data, for moderate streams of ongoing data, SAX (and XML itself) is a good choice for a communications API.

An implementation

For this tip, I wanted a test case that would incorporate a non-finite data stream provided by a remote host that is useful to a client. Since I maintain a Web site, an obvious example was a way of remotely monitoring hits to my site -- they occur continually, at an irregular rate, and the total data bandwidth is moderate. It could be useful, or at least interesting, to let a utility on my home system keep track of hits to my Web server.

On my particular Web server, log records are appended to a file, one per line, with mostly space-separated fields. But some quoted fields have internal spaces, so parsing a line is a little bit complicated. Granted, I could send these raw log lines directly to the client as they are written. But XML has several nice features that you are probably familiar with:

  • It is somewhat self-documenting
  • It allows variations in attribute order and whitespace
  • Within limits, a schema can be enhanced over time and maintain backward compatibility
  • And specifically for my application, I could arrange to monitor several Web servers this same way, as long as each one transmitted its log data in a common XML format

My XML log server is a pretty basic socket application, written in Python (but a different language would be fine for server and/or client). Here's an abridged version:


Listing 1. Server application (weblog-xml.py)
                
from SocketServer import BaseRequestHandler, TCPServer
from time import sleep
import sys, socket
# ...Define hit_tag template and log_fields() function...

class WebLogHandler(BaseRequestHandler):
    def handle(self):
        print "Connected from", self.client_address
        self.request.sendall('<hits>')
        try:
            while True:
                for hit in LOG.readlines():
                    self.request.sendall(hit_tag % log_fields(hit))
                sleep(5)
        except socket.error:
            self.request.close()
        print "Disconnected from", self.client_address

if __name__=='__main__':
    global LOG
    LOG = open('../access-log')
    LOG.seek(0, 2)     # Start at end of current access log
    srv = TCPServer(('',8888), WebLogHandler)
    srv.serve_forever()

When a socket is opened, the document root element <hits> is sent immediately, followed by new logged hits as they occur (batched in 5-second blocks), as elements similar to the one in Listing 2:


Listing 2. Sample <hit> XML element
                
<hit
  ip="210.8.XX.XXX"
  timestamp="11/May/2003:01:47:53 -0500"
  request="GET /publish/programming/code_recognizer.gif HTTP/1.1"
  status="200"
  bytes="12718"
  referrer="http://gnosis.cx/dW/programming/neural_networks.htm"
  agent="Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)"/>
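
Listing 1 leaves out the hit_tag template and the log_fields() function, and what follows is not that code. It is only a minimal Python 3 sketch of how a log line with quoted fields might be turned into a <hit> element like the one above. It assumes the log uses the common Apache "combined" format; the names LOG_PATTERN, FIELDS, and hit_element() are hypothetical. A regular expression copes with the quoted fields that contain spaces, and quoteattr() from the standard library escapes characters such as & so that a referrer URL cannot break the XML.

```python
import re
from xml.sax.saxutils import quoteattr

# Assumed layout (Apache "combined" format):
#   ip ident user [timestamp] "request" status bytes "referrer" "agent"
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Attribute names, in the order used by Listing 2
FIELDS = ("ip", "timestamp", "request", "status", "bytes", "referrer", "agent")


def hit_element(line):
    """Return a <hit .../> element for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    # quoteattr() adds the surrounding quotes and escapes &, <, > and quotes
    attrs = " ".join(
        "%s=%s" % (name, quoteattr(match.group(name))) for name in FIELDS
    )
    return "<hit %s/>" % attrs
```

Lines that do not fit the assumed pattern yield None, so a server built on this sketch could skip them rather than send malformed XML down the socket.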


The SAX client

In the server, I have not yet used SAX at all; I just use string formatting to compose XML elements. In the client application, SAX saves some work. Here is my entire current client application:


Listing 3. SAX-based log monitoring client
                
#!/usr/bin/env python
import socket
import xml.sax
from xml.sax.handler import ContentHandler

class AsyncWebLog(ContentHandler):
    def startDocument(self):
        print "Connected to gnosis.cx server"
    def startElement(self, name, attrs):
        # Report only successful hits that arrived via a known referrer
        if (name=='hit' and attrs['status']=='200'
                        and attrs['referrer']!='-'):
            print attrs['referrer'],"->",attrs['request'].split()[1]

parser = xml.sax.make_parser()
parser.setContentHandler(AsyncWebLog())
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.connect(('gnosis.cx', 8888))
try:
    while 1:
        xml_data = sock.recv(8192)
        if not xml_data:        # Empty read: server closed the connection
            break
        parser.feed(xml_data)
finally:
    sock.close()

Since the log records are already nicely formatted as XML, parsing their elements is essentially effortless. All I need to do is define a content handler that has a .startElement() method, and do something desirable inside that method. Just to be a little friendly, I also have the client acknowledge the connection with a message, which is triggered as soon as the first data arrives, starting with the root <hits> element sent by the server.
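
The client works because the parser's feed() method accepts whatever bytes happen to arrive, even when a chunk ends in the middle of a tag. The sketch below illustrates that in Python 3 without any network: it feeds hand-made chunks to the parser and records how many hits the handler has seen after each one. The names HitCollector and feed_in_chunks() are hypothetical, and the chunks are invented test data rather than real server output.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class HitCollector(ContentHandler):
    """Remember the request attribute of every <hit> element seen so far."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def startElement(self, name, attrs):
        if name == "hit":
            self.seen.append(attrs["request"])


def feed_in_chunks(chunks):
    """Feed byte chunks one at a time; return (requests, count after each chunk)."""
    collector = HitCollector()
    parser = xml.sax.make_parser()
    parser.setContentHandler(collector)
    counts = []
    for chunk in chunks:
        parser.feed(chunk)          # chunk boundaries need not match tag boundaries
        counts.append(len(collector.seen))
    return collector.seen, counts


if __name__ == "__main__":
    # The first chunk stops in the middle of an attribute name
    chunks = [
        b'<hits><hit requ',
        b'est="GET /a HTTP/1.1"/>',
        b'<hit request="GET /b HTTP/1.1"/>',
    ]
    print(feed_in_chunks(chunks))
```

The first request is reported only once its tag is complete, so the handler never sees half an element. Note that the closing </hits> tag is never sent; the parser simply waits for more data, which is what an endless stream needs.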

My startElement() method makes a few decisions about what it wants to display. I decide only to process elements named <hit> -- perhaps an enhanced server will start sending other sorts of XML elements as messages as well; my client will happily ignore them without choking on the stream. Quite a few attributes are available, but I decided to focus just on the referrers to my pages. My test for successfully delivered pages with known referrers shows how several attribute checks can be combined in one boolean expression. After that, I print a description to my client screen. Obviously, a more elaborate client application could use a GUI to display this information, or otherwise manipulate and process the received data.
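
As one example of processing the data rather than printing it, the Python 3 sketch below keeps a running count of hits per referrer. It is an illustration under assumptions, not part of the client above: ReferrerTally and tally() are hypothetical names, and the handler uses attrs.get() so that a <hit> lacking an attribute is skipped instead of raising KeyError. So that the example can run on its own, tally() parses a complete, closed document with xml.sax.parseString(); the same handler could equally be attached to a parser that is fed from a socket.

```python
from collections import Counter
import xml.sax
from xml.sax.handler import ContentHandler


class ReferrerTally(ContentHandler):
    """Count successful hits per known referrer; ignore everything else."""

    def __init__(self):
        super().__init__()
        self.referrers = Counter()

    def startElement(self, name, attrs):
        if name != "hit":
            return                      # other element types are ignored
        referrer = attrs.get("referrer", "-")
        if attrs.get("status") == "200" and referrer != "-":
            self.referrers[referrer] += 1


def tally(xml_text):
    """Parse a complete <hits> document and return the referrer counts."""
    handler = ReferrerTally()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.referrers
```

Because unknown elements and missing attributes are tolerated, a server could add new message types or fields later without breaking this kind of client.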


Quick finish

With my client and server running, every once in a while my local terminal updates with a list of a few surfers who have followed links to get to my Web site. As long as I leave these processes running, I can continue to receive updates forever -- the underlying XML has no size limit. Any application that is similar in the minimal respect of monitoring a long-lasting data stream can usefully -- and easily -- utilize the XML and SAX libraries already available with their favorite programming tools to achieve this purpose.

