You probably think of XML as a file format. Parsing an XML file with SAX means opening the file, reading through it sequentially to find tags and content, processing each occurrence, and closing the file when parsing is done. But the XML specification applies just as well to asynchronous streams as to disk files, and because SAX is strictly unidirectional, it works well on streams.
In principle, a stream can be many things, but in Internet programming the term usually refers to BSD sockets -- an interface implemented on all modern operating systems. Nothing prevents you from using the techniques in this tip for a serial-port instrument connection, for monitoring GUI events, or for similar long-running, intermittent data streams.
The basic idea this tip promotes is that XML is often an excellent choice for a wire protocol, and SAX is the most natural technique for coding client applications that use this protocol. While XML's verbosity can be a problem when monitoring large volumes of data, for moderate streams of ongoing data, SAX (and XML itself) is a good choice for a communications API.
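To make the point about unidirectional parsing concrete, here is a minimal sketch of feeding a SAX parser in pieces. It is written in Python 3 syntax (the listings in this tip are Python 2), and the `TagLogger` class and the sample chunks are invented for illustration. The chunks are deliberately split in the middle of tags, the way a socket read might split them.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class TagLogger(ContentHandler):
    """Hypothetical handler: remember every start tag the parser reports."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def startElement(self, name, attrs):
        self.seen.append((name, dict(attrs.items())))


# Invented sample data, cut at arbitrary byte positions (even mid-tag).
chunks = [
    b'<hits><hit ip="10.0.',
    b'0.1" status="200"/><hi',
    b't ip="10.0.0.2" status="404"/>',
    b'</hits>',
]

handler = TagLogger()
parser = xml.sax.make_parser()
parser.setContentHandler(handler)

for chunk in chunks:
    # feed() accepts whatever bytes are available; the parser keeps
    # any incomplete tag buffered until the rest of it arrives.
    parser.feed(chunk)

# close() is only possible here because the sample document is finite.
parser.close()

for name, attrs in handler.seen:
    print(name, attrs)
```

The handler never needs to know where one chunk ends and the next begins. Exactly when a callback fires relative to a given `feed()` call can depend on the underlying parser build, so the sketch only inspects the results after all the data has been fed.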
For this tip, I wanted a test case that would incorporate a non-finite data stream provided by a remote host that is useful to a client. Since I maintain a Web site, an obvious example was a way of remotely monitoring hits to my site -- they occur continually, at an irregular rate, and the total data bandwidth is moderate. It could be useful, or at least interesting, to let a utility on my home system keep track of hits to my Web server.
On my particular Web server, log records are appended to a file, one per line, with mostly space-separated fields. But some quoted fields have internal spaces, so parsing a line is a little bit complicated. Granted, I could send these raw log lines directly to the client as they are written. But XML has several nice features that you are probably familiar with:
- It is somewhat self-documenting
- It allows variations in attribute order and whitespace
- Within limits, a schema can be enhanced over time and maintain backward compatibility
- And specifically for my application, I could arrange to monitor several Web servers this same way, as long as each one transmitted its log data in a common XML format
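As a rough illustration of turning a raw log line with quoted fields into an XML element, here is a sketch in Python 3. It assumes the widely used "combined" access-log layout, which may not match my server's actual format, and the names `LOG_PATTERN`, `parse_log_line()`, and `format_hit()` are hypothetical stand-ins rather than the helpers elided from the server listing.

```python
import re
from xml.sax.saxutils import quoteattr

# Assumption: "combined" log format. Quoted and bracketed fields may
# contain spaces, so a plain split() on whitespace is not enough.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_log_line(line):
    """Return a dict of named fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


def format_hit(fields):
    """Build a <hit .../> element, escaping every attribute value."""
    attrs = " ".join(
        f"{name}={quoteattr(value)}" for name, value in fields.items()
    )
    return f"<hit {attrs}/>"


# Invented sample line; note the "&" in the referrer, which must be escaped.
sample = (
    '192.0.2.10 - - [11/May/2003:01:47:53 -0500] '
    '"GET /index.html HTTP/1.1" 200 12718 '
    '"http://example.org/a?x=1&y=2" "Mozilla/4.0 (compatible)"'
)
fields = parse_log_line(sample)
hit_xml = format_hit(fields)
print(hit_xml)
```

Using `quoteattr()` instead of bare string formatting keeps the stream well-formed when a referrer or user-agent string contains characters such as `&` or `<`.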
My XML log server is a pretty basic socket application, written in Python (but a different language would be fine for server and/or client). Here's an abridged version:
Listing 1. Server application (weblog-xml.py)
from SocketServer import BaseRequestHandler, TCPServer
from time import sleep
import sys, socket
# ...Define hit_tag template and log_fields() function...
When a socket is opened, the document root element <hits> is sent immediately, followed by newly logged hits as they occur (batched in 5-second blocks), as elements similar to the one in Listing 2:
Listing 2. Sample <hit> XML element
<hit ip="210.8.XX.XXX"
     timestamp="11/May/2003:01:47:53 -0500"
     request="GET /publish/programming/code_recognizer.gif HTTP/1.1"
     status="200"
     bytes="12718"
     referrer="http://gnosis.cx/dW/programming/neural_networks.htm"
     agent="Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)"/>
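The sending side of this protocol can be sketched roughly as follows, in Python 3 syntax. This is not the abridged server from Listing 1: `send_hit_stream()` and the demo batches are hypothetical, and a local socket pair stands in for a real network listener so the sketch runs on its own. A real server would produce a new batch every few seconds and never run out of them.

```python
import socket


def send_hit_stream(conn, batches):
    """Send the root tag once, then each non-empty batch of <hit/> elements.

    `batches` is any iterable of lists of XML strings. The root element
    is intentionally never closed: the stream is meant to be open-ended.
    """
    conn.sendall(b"<hits>\n")
    for batch in batches:
        if batch:  # skip quiet intervals with no new hits
            conn.sendall("\n".join(batch).encode("utf-8") + b"\n")


# Demonstration over a connected pair of local sockets.
server_end, client_end = socket.socketpair()
demo_batches = [
    ['<hit ip="192.0.2.1" status="200"/>'],
    [],  # nothing logged during this interval
    ['<hit ip="192.0.2.2" status="404"/>',
     '<hit ip="192.0.2.3" status="200"/>'],
]
send_hit_stream(server_end, demo_batches)
server_end.close()

# Read everything the "server" wrote until it closed its end.
received = b""
while True:
    data = client_end.recv(4096)
    if not data:
        break
    received += data
client_end.close()
print(received.decode("utf-8"))
```

Because the client parses incrementally, it does not matter that the closing `</hits>` tag never arrives while the connection stays up.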
In the server, I have not yet used SAX at all; I just use string formatting to compose XML elements. In the client application, SAX saves some work. Here is my entire current client application:
Listing 3. SAX-based log monitoring client
#!/usr/bin/env python
import socket
import xml.sax
from xml.sax.handler import ContentHandler

class AsyncWebLog(ContentHandler):
    def startDocument(self):
        print "Connected to gnosis.cx server"
    def startElement(self, name, attrs):
        # Report only successful hits that arrived via a known referrer
        if (name=='hit' and attrs['status']=='200'
                        and attrs['referrer']!='-'):
            print attrs['referrer'],"->",attrs['request'].split()[1]

parser = xml.sax.make_parser()
parser.setContentHandler(AsyncWebLog())
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.connect(('gnosis.cx', 8888))
try:
    while 1:
        xml_data = sock.recv(8192)
        if not xml_data:      # empty read: the server closed the connection
            break
        parser.feed(xml_data)
finally:
    sock.close()
Since the log records are already nicely formatted as XML, parsing their elements is essentially effortless. All I need to do is define a content handler that has a .startElement() method, and do something desirable inside that method. Just to be a little friendly, I also have the client acknowledge the connection with a message, which is triggered when the first data (the root <hits> element) arrives from the server.
My startElement() method makes a few decisions about what it
wants to display. I decide only to process elements named
<hit> -- perhaps an enhanced server will start sending
other sorts of XML elements as messages as well; my client
will happily ignore them without choking on the stream.
Quite a few attributes are available, but I decided to focus
just on the referrers to my pages. My test combines several
attribute checks in one boolean expression: it accepts only
successfully delivered pages (status 200) whose referrer is
known (not "-"). For each such hit, I print the referrer and the
requested path to my client screen. Obviously, a
more elaborate client application could use a GUI to display
this information, or otherwise manipulate and process the
received data.
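The filtering logic can be tried without a network connection. The sketch below is in Python 3 syntax; the `ReferrerFilter` class and the sample document, including the `<notice>` element, are invented for illustration. Unlike Listing 3, it looks attributes up with `.get()` so that a `<hit>` lacking an expected attribute is skipped rather than raising `KeyError`; that is a design choice of this sketch, not something the original client does.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class ReferrerFilter(ContentHandler):
    """Hypothetical handler: collect (referrer, path) for qualifying hits."""

    def __init__(self):
        super().__init__()
        self.links = []

    def startElement(self, name, attrs):
        if name != "hit":
            return  # unknown element types are silently ignored
        referrer = attrs.get("referrer", "-")
        request = attrs.get("request", "").split()
        if (attrs.get("status") == "200"
                and referrer != "-"
                and len(request) > 1):
            self.links.append((referrer, request[1]))


# Invented sample stream, closed here only so it can be parsed in one call.
doc = b"""<hits>
<hit status="200" referrer="http://example.org/" request="GET /a.html HTTP/1.1"/>
<notice text="server restarting"/>
<hit status="404" referrer="http://example.org/" request="GET /missing HTTP/1.1"/>
<hit status="200" referrer="-" request="GET /b.html HTTP/1.1"/>
<hit status="200" request="GET /c.html HTTP/1.1"/>
</hits>"""

handler = ReferrerFilter()
xml.sax.parseString(doc, handler)
for referrer, path in handler.links:
    print(referrer, "->", path)
```

Only the first `<hit>` passes all three checks; the `<notice>` element, the 404, the unknown referrer, and the hit with no referrer attribute are all skipped.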
With my client and server running, every once in a while my local terminal updates with a list of a few surfers who have followed links to get to my Web site. As long as I leave these processes running, I can continue to receive updates forever -- the underlying XML has no size limit. Any application that is similar in the minimal respect of monitoring a long-lasting data stream can usefully -- and easily -- utilize the XML and SAX libraries already available with their favorite programming tools to achieve this purpose.
- Gain a solid understanding of the basics of the SAX interface with the SAX Project's SAX Overview.
- Check out the official documentation of the xml.sax module for details on using SAX inside Python.
- For an introduction to BSD sockets, read this Quick and Dirty Primer.

David Mertz uses a wholly unstructured brain to write about structured document formats. David may be reached at mertz@gnosis.cx; his life pored over at http://gnosis.cx/dW/. And buy his book: Text processing in Python.
