Last month I suspended work on the XM project, an XSLT-based Web-publishing management application, while I collect feedback on it over the next months. Already a few readers have written to share their experience with XM, report bugs, and suggest improvements. Special thanks to you for taking the time to share your thoughts.
So, on to a new project. For my second project for the Working XML column, I want to automate the creation of SAX ContentHandler. If you are not familiar with SAX, turn to SAX, the Power API, an excerpt from XML by Example, 2nd Edition.
SAX was designed as a syntax-oriented API. Through SAX events, your application listens to the flow of the document and, supposedly, generates the appropriate response. SAX is ideally suited to transform and otherwise manipulate XML documents. It is simple and efficient.
However, SAX was designed for power, not programmer's convenience, and that becomes painfully apparent when you write an event handler. SAX does very little preprocessing, so the programmer has to manage many low-level aspects. In particular, a significant portion of a SAX event handler is devoted to tracking how far the parser has gone in the document. The state tracking code is highly repetitive, error prone and, generally speaking, a chore to write and maintain.
HC, the Handler Compiler I introduce in this column, should simplify writing that content-tracking code. Specifically HC will automate writing a SAX ContentHandler.
To better understand what HC is all about, review Listing 1. The ContentHandler introduced in Listing 1 simply counts the number of occurrences of the ulink element within simpara elements. It ignores any ulink that appears elsewhere in the document, for example, in the articleinfo element.
You could hardly imagine a more trivial task: increment a counter when you find a special element. Yet SAXCountHandler is 39 lines long and, out of these 39 lines, only a dozen are directly related to the counter. The bulk of the code tracks the parser state.
Listing 1. SAXCountHandler.java
import java.util.*;
import org.xml.sax.*;
import org.xml.sax.helpers.*;
public class SAXCountHandler
extends DefaultHandler
{
private int count;
private Stack stack;
public void startDocument()
{
count = 0;
stack = new Stack();
}
public void startElement(String uri,
String lName,
String qName,
Attributes atts)
{
if(!uri.equals("http://www.psol.com/2001/docbook"))
return;
if(lName.equals("ulink") &&
"simpara".equals(stack.peek()))
count++;
stack.push(lName);
}
public void endElement(String uri,
String lName,
String qName)
{
if(!uri.equals("http://www.psol.com/2001/docbook"))
return;
stack.pop();
}
public int getCount()
{
return count;
}
} |
Compare Listing 1 with Listing 2, which illustrates what I would prefer to write. The HCCountHandler class is half the size of SAXCounterHandler, and most of the code deals with the counter. In fact HCCountHandler has no code to track the state. Instead it uses special @xpath to indicate which method should be called when.
The HC compiler will preprocess Listing 2 and, using the @xpath comments, generate the state-tracking code automatically. The operative word here is automatically: HC does not dispense with state-tracking code (it could not), but it saves the programmer the effort of writing it.
Listing 2. HCCountHandler.java
/**
* @xmlns http://www.psol.com/2001/docbook
*/
public class HCCountHandler
implements org.ananas.hc.HCHandler
{
private int count;
/**
* @xpath /
*/
public void init()
{
count = 0;
}
/**
* @xpath simpara/ulink
*/
public void increment()
{
count++;
}
public int getCount()
{
return count;
}
} |
Now that you understand the goal, let's look at how to implement it.
HC is a compiler; it accepts a Java source code file adorned with special comments as input, and produces whatever is required to turn the class into a working SAX ContentHandler. There are two aspects
to analyze: What does HC need to produce, and how will it do so? The next section discusses the answer to the first question, code. The subsequent sections discuss the answer to the second question, the compiler itself.
For simplicity, I have decided to limit HC's scope, at least in the first release, to XPaths without conditions. For example, the following is acceptable in release 1:
/ /article/articleinfo/title simpara/ulink |
But this is not acceptable yet in release 1:
simpara[ulink/@href='index.xml'] |
I appreciate that conditions offer many useful applications, but the simplicity of doing without them allows me to fire an event as soon as the XPath is recognized. With conditions, I would have to store the event until the condition is met. As always, the concept of the Working XML column is to develop code alongside the columns so you can see the project unfold. I will lift the constraint against conditions in a later version of HC.
Figure 1 shows the class model for the code being generated. HC proposes a standard XPathHandler class that implements the ContentHandler interface. Amongst other things, XPathHandler tracks those pesky states for us. In the example in Listing 2, XPathHandler will call into the application handler, HCCountHandler.
Figure 1. Class model for the generation

This architecture is a standard proxy pattern. The XPathHandler class acts as a proxy between the low-level, generic events defined by SAX and the higher-level, application-specific events we define. In the
process, HC frees the programmer from the state management.
Ultimately I want to be able to write code such as:
HCCountHandler counter = new HCCountHandler(); XPathHandler handler = new XPathHandler(counter); reader.setContentHandler(handler); reader.parse(uri); |
Obviously, to perform its magic, XPathHandler needs a description of the application handler. The compiler will generate an extra class, the table class, with the information required by XPathHandler. The table class contains several tables and a few methods to perform application-specific calls.
The name of the table class is irrelevant because it is generated automatically and is essentially invisible to the programmer. The compiler just appends a suffix long and bizarre enough to avoid collision: __org_ananas_hc_tables.
The compiler is responsible for creating the table class automatically. The standard for compiler design is to break it in two modules: the front end that reads and decodes the input file, and the back end that produces the code or, in this case, the table class.
HC does not deviate from this standard, but it uses a trick to simplify writing the front end. Recall from Listing 2 that XPaths are written as Javadoc comments. I find this convenient for two reasons. First, it's a familiar syntax which is more convenient. Second, it provides me with a free front end through the doclet mechanism.
Doclet is a mechanism introduced in JDK 1.2 to extend Javadoc. In a nutshell, a doclet is a Java class to which Javadoc passes the result of parsing the Java class. Doclets were originally designed to give you more control over the presentation of documentation. You could, for example, write a doclet that produces PDF documentation instead of the standard HTML output.
Yet Javadoc returns so much information, including the class name, the package it's in, the list of methods, their parameters -- and, obviously, the famous comments -- that it is not only possible but also easy to use Javadoc as a front end. The only thing missing is the implementation of the methods ... but we don't need those for HC.
Doclets are introduced in the com.sun.javadoc package. The main classes are RootDocs that contain the parse tree. Classes themselves are represented as ClassDoc, methods as MethodDoc and Javadoc comments are instances of Tag class. So the @xpath comment will ultimately be returned as a Tag object attached to a MethodDoc object.
Note that Javadoc does not parse the XPaths so I will have to add an XPath parser in the front end, but parsing XPath is easy compared to parsing Java classes.
This gets into the most interesting aspect of this project: the back end. The back end collects the path information and creates the table class. Although at this time I have not yet worked out all the details for this algorithm, I plan to use transition diagrams, such as the one in Figure 2. A transition diagram essentially is a flowchart that controls how to recognize an XPath. The circles represent states. The states are connected by arrows, called edges, that specify which input element causes a move from one state to the next.
Figure 2. Transition diagram for simpara/ulink

So the transition diagram specifies how to recognize a given XPath. The states hold the necessary information for tracking where the parser is in reading the document; the edges record the progress from one state to the next.
The state with a thicker border, which marks the end of the XPath, is an accepting state. Typically when the process reaches an accepting state, you will want to call the application handler for the given XPath. Obviously in a real case, you would have more complex transition diagrams with several accepting states, one for each XPath you want to recognize.
Transition diagrams are a common implementation for lexers and parsers. They are also effective for searching through a text. They are particularly attractive because it is possible to represent this diagram efficiently as a set of tables.
Deterministic finite automaton
The most common algorithm to process transition diagrams is to convert them in a deterministic finite automaton (DFA). I first encountered DFA several years ago when I worked on regular expressions, but it has been at least three years since I last used them so I spent some time researching the subject. Amongst other things, I have re-read chapters 3 and 4 of "the dragon book," the classic book on compiler construction (see Resources).
Mathematically, a transition diagram is a set of states. One of the states is the start state, and one or more states are accepting states. The transition diagram also includes input symbols (in this case XML tags) and a transition function that controls how input symbols modify the state.
This model can be represented efficiently as a transition table, similar to Table 1. The table provides all the data needed to implement the transition function. For example, it states that state 1 transitions to state 2 when it encounters the simpara tag. Such a table is used to implement nondeterministic finite automaton (NFA). NFA is simply a fancy word for a state machine that implements the transition diagram.
Table 1. The transition table for simpara/ulink
| Â | Input symbols | |
| states | simpara | ulink |
| 0 | 1 | - |
| 1 | - | 2 |
| 2 | - | - |
Yet for increased efficiency, it is best to work with deterministic finite automaton. A DFA is a special case of NFA: it has no transition on the empty string, and it guarantees that the transition table has at most one edge leaving any given state for each symbol (in other words, from each state it is possible to compute the transition deterministically). As you will see, it is possible to automatically convert an NFA into a DFA.
The states in the transition table implement the equivalent of the stack in SAXCountHandler. Given the transition table, implementing the transition function is trivial. Once the transition function is in place, you can easily implement the NFA. Remember that the NFA is a state machine that implements the transition diagram. To summarize, if you can compile the transition table, you can easily process the XPaths.
Fortunately there are well-documented algorithms on how to construct the transition table for any transition diagram, so the DFA effectively solves the problem.
The only downside to transition tables is that, in real-world applications, the number of states grow very quickly. Writing the transition table by hand would be too much work. Fortunately you do not write it manually, you compile it.
The code is divided over three packages:
org.ananas.hcis the run time. It contains those classes that must always be included with applications. This package should be as small as possible.org.ananas.hc.compileris the compiler. It creates the table class that theXPathHandlerneeds. This package also contains the doclet and the XPath parser.- The application's own package, which is created by the table class.
I devoted this month's Working XML to analyzing the HC system, researching algorithms, and designing the main classes and packages. The real fun begins next month when I start coding the compiler. As always, the code will be posted in the CVS repository as soon as it is available.
- The so-called "Dragon Book" (Compilers: Principles, Techniques and Tools, by A. Aho, R. Sethi and J. Ullman) is the authoritative document on compiler design and construction.
- SAX, the Power API is an excerpt from XML by Example, 2nd Edition that introduces SAX parsing.
- JAXP is the standard Java API for XML processing. It consists of three components: SAX, DOM and TrAX.
- SIA Parser from Robert Berlinski, a reader, is another automation engine. It takes a different approach, as it does not use a compiler.

Benoît Marchal is a consultant and writer based in Namur, Belgium. He has just released the second edition of XML by Example . He is also the author of Applied XML Solutions and XML and the Enterprise. Details on his latest projects are at marchal.com. You can contact Benoît at bmarchal@pineapplesoft.com.





