This column, the last in the current round of development around the Handler Compiler (HC), delivers the first working version of HC. As I did with XM, the first project for this column, I now intend to move to another project for the next few months while field testing HC.
I hope to use this time to gain more experience from the project and draw a list of requirements for future developments. Of course, in the meantime, I encourage you to download HC, test it in your environment, and share your thoughts on the ananas-discussion mailing list. I will post bug fixes and updates on the CVS server (see Resources).
HC evolved from my experience writing SAX code. While I appreciate the power and flexibility of SAX parsers, I have also found that they require lots of tedious coding to track where the parser is in the document. HC automatically generates the state tracking code from XPaths.
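To make the "tedious coding" concrete, here is a minimal sketch of the kind of hand-written state tracking that HC is meant to replace. It is not taken from HC: the class name ManualTrackingHandler and its count() helper are made up for illustration. The handler keeps its own stack of open elements just to recognize a ulink whose parent is simpara.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class, not part of HC.
class ManualTrackingHandler extends DefaultHandler {
    // Names of the elements that are currently open, innermost on top.
    private final Deque<String> path = new ArrayDeque<String>();
    int ulinkInSimpara = 0;

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes atts) {
        // The parent is whatever sits on top of the stack right now.
        String parent = path.peek();
        if ("ulink".equals(qName) && "simpara".equals(parent)) {
            ulinkInSimpara++;
        }
        path.push(qName);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        path.pop();
    }

    // Convenience wrapper: parse a string and return the number of matches.
    static int count(String xml) {
        try {
            ManualTrackingHandler handler = new ManualTrackingHandler();
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
            return handler.ulinkInSimpara;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<sect1><simpara><ulink/></simpara>"
                   + "<title><ulink/></title></sect1>";
        System.out.println(count(xml)); // only the first ulink is counted
    }
}
```

Every additional path to recognize means more conditions of this kind in startElement(), which is the bookkeeping that an @xpath declaration is meant to take over.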
HC is broken down into two components. The first is a compiler that accepts the application handler and creates a table class. The application handler is a Java class that implements the HCHandler interface. Special Javadoc comments indicate which method to invoke when the parser matches an XPath. The compiler is used for development only.
The second component is the run-time. Its most important class is XPathHandler. XPathHandler acts as a proxy that translates SAX events (start element, end element, and the like) into calls to the application handler. The run-time ships with applications.
Figure 1 is the class model for the HC run-time. Here, the application handler is HCCountHandler.
Figure 1. Class model for the HC run-time

In the last column, we wrote the logic to compile a set of XPaths into a so-called Deterministic Finite Automaton or DFA. Without repeating the discussion from that column, a DFA is a structure to which XPaths can be compiled efficiently.
What remains for this column is interfacing the DFA construction algorithm with Javadoc to pull XPaths from the application handler source. In this column, we also need to write the table class.
Unfortunately what was supposed to be a smooth ride to the end of the project turned into a more involved coding session when I found a problem with the DFA. More on this later.
To integrate HC smoothly into Java classes, I turned to an old friend: Javadoc comments. The new @xpath Javadoc tag binds a method to an XPath, so the method is invoked when the parser matches that XPath. For example:
/**
 * @xpath para
 */
public void startPara()
{
   writer.print("<p>");
}
A word of caution: We have two different tags here. Javadoc tags appear in Java code and have the form @name value. XML tags appear in XML code and have the form <para/>. Unfortunately, Javadoc and XML have adopted the same vocabulary, so be careful not to confuse them.
I like this solution because it does not force me to switch from the Java editor to another tool or to learn a new language. Furthermore, since JDK 1.2, Javadoc has supported Doclet extensions. Doclets let you plug any code into the Javadoc parser.
Doclets were originally introduced to let you change the format of Javadoc documentation. The Javadoc parser reads the files, compiles information about the classes, methods, and packages, and passes it to the Doclet. The default Doclet writes HTML documentation for the classes. Sun also ships Doclets for MIF (Framemaker), PDF, and RTF.
Doclets have many other applications. Since they have access to the entire parse tree (minus the actual method body), Doclets provide a handy mechanism to compile utility classes or compile reports on the code. For example, Sun has a Doclet that checks the quality and consistency of comments.
Doclets are fashioned after the main() method. A Doclet has a static start() method that takes the parse tree as an argument.
The Doclet API defines many classes for storing the parse tree (note that this is the Java parse tree, not the XML one). The most important ones for HC are RootDoc, ClassDoc, and MethodDoc; they return information on the parse tree, classes, and methods, respectively.
The HC compiler is CompilerDoclet. It collects the namespace declarations (from the @xmlns tags) and the XPaths attached to methods (from the @xpath tags) and uses the HCTablesGenerator (to be introduced shortly) to write the table class.
Complete listings for CompilerDoclet are available online (see Resources). Listing 1 is the start() method. It extracts command-line arguments and processes the parse tree. Note the use of an inner class, DocletMessenger, to report HC errors properly.
Listing 1: CompilerDoclet.start()
public static boolean start(RootDoc root)
   throws Exception
{
   try
   {
      String[][] options = root.options();
      File destdir = new File(".");
      for(int i = 0;i < options.length;i++)
         if(options[i][0].equals("-d"))
            destdir = new File(options[i][1]);
      CompilerDoclet compiler = new CompilerDoclet();
      HandlerInfo[] handlers = compiler.compile(root);
      Messenger messenger = new DocletMessenger(root);
      HCTablesGenerator generator =
         new HCTablesGenerator(getMessageStore(),messenger,destdir);
      for(int i = 0;i < handlers.length;i++)
         generator.generateHCTables(handlers[i]);
      return true;
   }
   catch(CompilerException e)
   {
      // no need to display again, it has already been shown
      // to the user
      return false;
   }
}
Listing 2 is the compile(ClassDoc) method, which extracts HC information for a class. The Doclet API is readable so you should have no problem following along. For example, ClassDoc.interfaces() returns the interfaces that the class implements. ClassDoc.tags() returns Javadoc tags for the class.
HC defines two classes, HandlerInfo and MethodInfo, to collect this information. I chose not to use the Javadoc-provided classes directly in order to buy myself some independence from Javadoc. Who knows -- I might want to switch to another Java parser in the future.
Listing 2: compile(ClassDoc)
protected HandlerInfo compile(ClassDoc clasz)
{
   ClassDoc[] interfaces = clasz.interfaces();
   if(interfaces == null)
      return null;
   boolean found = false;
   for(int i = 0;i < interfaces.length;i++)
      if(interfaces[i].qualifiedName().equals("org.ananas.hc.HCHandler"))
         found = true;
   if(!found)
      return null;
   Tag[] tags = clasz.tags("xmlns");
   NamespaceSupport namespaceSupport = new NamespaceSupport();
   if(tags != null)
      for(int i = 0;i < tags.length;i++)
      {
         String content = tags[i].text();
         int pos = content.indexOf(' ');
         if(pos == -1)
            namespaceSupport.declarePrefix("",content);
         else
            namespaceSupport.declarePrefix(content.substring(0,pos),
                                           content.substring(pos + 1));
      }
   MethodDoc[] methods = clasz.methods();
   List methodsList = new ArrayList();
   if(methods != null)
      for(int i = 0;i < methods.length;i++)
      {
         MethodInfo method = compile(methods[i]);
         if(method != null)
            methodsList.add(method);
      }
   MethodInfo[] methodsArray = new MethodInfo[methodsList.size()];
   methodsList.toArray(methodsArray);
   return new HandlerInfo(clasz.qualifiedName(),
                          namespaceSupport,
                          methodsArray);
}
There are other compile() methods for RootDoc and MethodDoc. HandlerInfo and MethodInfo are also available online.
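The @xmlns handling in Listing 2 can be tried in isolation. The sketch below applies the same rule (text before the first space is the prefix, the rest is the URI, and no space means the default namespace) to plain strings instead of Javadoc Tag objects. The class name XmlnsTagDemo and the example.org URIs are invented for the example, and the trim() calls are an addition that Listing 2 does not have.

```java
import org.xml.sax.helpers.NamespaceSupport;

// Hypothetical example class, not part of HC.
class XmlnsTagDemo {
    // Same branching as Listing 2, applied to the text of one @xmlns tag.
    static void declare(NamespaceSupport support, String tagText) {
        String content = tagText.trim();   // added here, not in Listing 2
        int pos = content.indexOf(' ');
        if (pos == -1) {
            // No prefix given: declare the default namespace.
            support.declarePrefix("", content);
        } else {
            support.declarePrefix(content.substring(0, pos),
                                  content.substring(pos + 1).trim());
        }
    }

    public static void main(String[] args) {
        NamespaceSupport support = new NamespaceSupport();
        declare(support, "db http://example.org/docbook"); // made-up URI
        declare(support, "http://example.org/default");    // made-up URI
        System.out.println(support.getURI("db"));
        System.out.println(support.getURI(""));
    }
}
```

NamespaceSupport is the standard SAX helper class, so the resulting object can be handed to any code that resolves prefixes the SAX way.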
Writing the table class is the responsibility of HCTablesGenerator. It uses the XPathParser and DFAFactory introduced in the previous column to create DFAs from HandlerInfo. Listing 3 presents the relevant methods.
You might wonder why compileDFA() in Listing 3 creates a new DFA for each XPath. In the last column, they were combined through an OR. Well, that's a consequence of the problem I mentioned before. More details on this in the section "Problems with OR".
Listing 3: Compiling the DFA
public void generateHCTables(HandlerInfo handler)
   throws CompilerException
{
   messenger.info(message.getMessage("Compiling",handler.getName()));
   try
   {
      DFATable[] tables = compileDFA(handler);
      writeHCTables(handler,tables);
   }
   catch(IOException e)
   {
      error("IOException",e.getLocalizedMessage());
   }
}

protected DFATable[] compileDFA(HandlerInfo handler)
   throws CompilerException
{
   XPathParser parser = new XPathParser(handler.getNamespaceSupport(),message);
   MethodInfo[] methods = handler.getMethods();
   ArrayList array = new ArrayList();
   for(int i = 0;i < methods.length;i++)
   {
      String[] xpaths = methods[i].getXPaths();
      for(int j = 0;j < xpaths.length;j++)
      {
         XPathNode node = parser.axpath(xpaths[j],i,methods[i]);
         if(node != null)
            array.add(factory.createDFA(node));
      }
   }
   DFATable[] tables = new DFATable[array.size()];
   return (DFATable[])array.toArray(tables);
}
The writeHCTables() method serializes the DFA tables as a Java class. Currently, it just writes Java code to a text file. In the future, I might want to compile directly to bytecode. However, Java source code is easier to debug.
writeHCTables() is too long (see the excerpt in Listing 4). It's a prime candidate for refactoring and I will certainly break it into more manageable units in a future iteration of HC.
The table class implements the HCTables interface (see Listing 5).
Listing 5: HCTables
package org.ananas.hc;

import org.xml.sax.Attributes;
import org.xml.sax.SAXException;

public interface HCTables
{
   public static final String CLASS_SUFFIX =
      "__org_ananas_hc_tables_1";

   public void setHCHandler(HCHandler handler);
   public int getCount();
   public int move(int xpath,QName qname,int state);
   public boolean isAcceptingState(int xpath,int state);
   public void acceptStartEvent(int xpath,
                                int state,
                                QName qName,
                                Attributes atts)
      throws SAXException;
   public void acceptEndEvent(int xpath,int state,QName qName)
      throws SAXException;
   public void acceptCharactersEvent(int xpath,
                                     int state,
                                     char[] ch,
                                     int start,
                                     int length)
      throws SAXException;
}
The interface specifies the contract between XPathHandler and the table class. setHCHandler() initializes the table class. getCount(), move(), and isAcceptingState() define the interface to the DFA itself.
Finally, the acceptXXXEvent() methods implement the calls into the application handler. The problem here is that XPathHandler is a generic class, so it does not know about the actual application handler and therefore does not know which method to call when the DFA matches. The compiler generates these methods so that each one calls the appropriate method in the application handler.
I intentionally chose not to use Java reflection because it is less efficient. If you step through the code you will see that the compiler is very flexible when it creates these methods. For example, it gives the user a lot of control over the parameters. Again, this would be prohibitive to implement with reflection, but it is straightforward with generated code.
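The difference between the two approaches can be sketched as follows. None of these classes comes from HC: DemoHandler, GeneratedDispatch, and ReflectiveDispatch are invented names, and the real table class has richer signatures (see Listing 5). The point is only that generated code can call the handler method directly, one case per XPath index, whereas reflection has to look the method up by name at run time.

```java
import java.lang.reflect.Method;

// Hypothetical demo, not part of HC.
class DispatchDemo {
    public static void main(String[] args) {
        DemoHandler handler = new DemoHandler();
        GeneratedDispatch tables = new GeneratedDispatch(handler);
        tables.acceptStartEvent(0);                    // direct call
        ReflectiveDispatch.call(handler, "endPara");   // lookup by name
        System.out.println(handler.out);
    }
}

// Stand-in for an application handler.
class DemoHandler {
    final StringBuilder out = new StringBuilder();
    public void startPara() { out.append("<p>"); }
    public void endPara()   { out.append("</p>"); }
}

// What compiler-generated dispatch could look like: a switch on the
// XPath index where each case is an ordinary, statically bound call.
class GeneratedDispatch {
    private final DemoHandler handler;

    GeneratedDispatch(DemoHandler handler) { this.handler = handler; }

    void acceptStartEvent(int xpath) {
        switch (xpath) {
            case 0: handler.startPara(); break;
            default: break; // no start method bound to this XPath
        }
    }

    void acceptEndEvent(int xpath) {
        switch (xpath) {
            case 0: handler.endPara(); break;
            default: break; // no end method bound to this XPath
        }
    }
}

// The reflective alternative: generic, but every call pays for a lookup.
class ReflectiveDispatch {
    static void call(Object handler, String methodName) {
        try {
            Method method = handler.getClass().getMethod(methodName);
            method.invoke(handler);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

With generated code, adapting the arguments for each handler method (a String instead of a character array, for instance) is just a matter of emitting a different call expression in the relevant case.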
The remaining class is XPathHandler, which acts as a proxy between the SAX events and the HC events. Listing 6 is the handler.
The constructor takes an HCHandler as a parameter. It attempts to load the corresponding table class dynamically. Since the name of the table class is derived from the name of the application handler, this is not too difficult. A version number in the name guarantees compatibility in the future.
XPathHandler implements selected SAX events and issues the proper calls on the table class. startDocument() and startElement() cause it to transition (move) to the next state. endElement() restores the state before reading the current element.
Listing 6: XPathHandler
package org.ananas.hc;

import java.util.Stack;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class XPathHandler
   extends DefaultHandler
{
   protected HCTables tables;
   protected int[] states;
   protected Stack stack;

   public XPathHandler(HCHandler handler)
      throws HCException
   {
      try
      {
         Class handlerClass = handler.getClass(),
               tablesClass = Class.forName(handlerClass.getName() +
                                           HCTables.CLASS_SUFFIX);
         tables = (HCTables)tablesClass.newInstance();
         tables.setHCHandler(handler);
      }
      catch(ClassNotFoundException e)
      {
         throw new HCException(e);
      }
      catch(IllegalAccessException e)
      {
         throw new HCException(e);
      }
      catch(InstantiationException e)
      {
         throw new HCException(e);
      }
   }

   public void startDocument()
      throws SAXException
   {
      stack = new Stack();
      QName qname = new QName(QName.ROOT);
      states = new int[tables.getCount()];
      for(int i = 0;i < tables.getCount();i++)
      {
         states[i] = tables.move(i,qname,-1);
         tables.acceptStartEvent(i,states[i],qname,null);
      }
   }

   public void startElement(String namespaceURI,
                            String localName,
                            String qualifiedName,
                            Attributes atts)
      throws SAXException
   {
      stack.push(states);
      QName qname = new QName(QName.ELEMENT,namespaceURI,localName);
      int[] cstates = states;
      states = new int[tables.getCount()];
      for(int i = 0;i < tables.getCount();i++)
      {
         if(tables.isAcceptingState(i,cstates[i]))
            states[i] = tables.move(i,qname,-1);
         else
            states[i] = tables.move(i,qname,cstates[i]);
         tables.acceptStartEvent(i,states[i],qname,atts);
      }
   }

   public void characters(char[] ch,
                          int start,
                          int length)
      throws SAXException
   {
      for(int i = 0;i < tables.getCount();i++)
         tables.acceptCharactersEvent(i,states[i],ch,start,length);
   }

   public void endElement(String namespaceURI,
                          String localName,
                          String qualifiedName)
      throws SAXException
   {
      QName qname = new QName(QName.ELEMENT,namespaceURI,localName);
      for(int i = 0;i < tables.getCount();i++)
         tables.acceptEndEvent(i,states[i],qname);
      states = (int[])stack.pop();
   }

   public void endDocument()
      throws SAXException
   {
      QName qname = new QName(QName.ROOT);
      for(int i = 0;i < tables.getCount();i++)
         tables.acceptEndEvent(i,states[i],qname);
   }
}
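The push in startElement() and the pop in endElement() are what let a sibling start from its parent's state rather than from wherever the previous sibling left off. The toy below isolates that idea for the single XPath simpara/ulink. It is a simplified sketch, not HC's table format: the class StateStackDemo, the state numbers, and the hard-coded move() function are all assumptions made for the example, and it ignores HC's special handling of accepting states.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical example class, not part of HC.
class StateStackDemo {
    // Toy transition function for simpara/ulink.
    // 0 = nothing matched, 1 = current element is simpara,
    // 2 = current element is a ulink whose parent is simpara (accepting).
    static int move(int parentState, String name) {
        if (name.equals("simpara")) return 1;
        if (name.equals("ulink") && parentState == 1) return 2;
        return 0;
    }

    private final Deque<Integer> stack = new ArrayDeque<Integer>();
    private int state = 0;
    int matches = 0;

    void startElement(String name) {
        stack.push(state);            // remember the parent's state
        state = move(state, name);
        if (state == 2) matches++;
    }

    void endElement() {
        state = stack.pop();          // back to the parent's state
    }

    // Simulates <sect1><simpara><ulink/><ulink/></simpara><ulink/></sect1>.
    static int run() {
        StateStackDemo demo = new StateStackDemo();
        demo.startElement("sect1");
        demo.startElement("simpara");
        demo.startElement("ulink");   // match
        demo.endElement();
        demo.startElement("ulink");   // match again: state was restored to 1
        demo.endElement();
        demo.endElement();            // closes simpara
        demo.startElement("ulink");   // no match: parent is sect1
        demo.endElement();
        demo.endElement();            // closes sect1
        return demo.matches;
    }

    public static void main(String[] args) {
        System.out.println(run());    // prints 2
    }
}
```

Without the pop, the second ulink would be evaluated from the accepting state of the first one and the match would be lost.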
You might wonder why this application creates as many DFAs as XPaths. If anything, this is less efficient than maintaining only one DFA. It is the best trade-off I could find after running into an unexpected problem.
The promise of this column is to show you work on the project as it unfolds. I promised that I would share with you how I attempted to solve the problems and what I learned in the process. If anything, I hope my false starts and problems will help you avoid the same situation.
This time the problem is that I misunderstood something. In the last column, I was linking XPaths with an OR operator to create a single DFA. In other words, I was treating the following two XPaths:
simpara/ulink
sect1info/title
as if they have been written:
simpara/ulink | sect1info/title
This looks fine, and for the most part it is. However, it breaks when one XPath ends with the step that begins another. For example, the following two XPaths:
sect1/simpara
simpara/ulink
are not equivalent to
sect1/simpara | simpara/ulink
Can you spot the difference? It took me a while. The problem is that if the DFA recognizes sect1/simpara, it will never explore the other branch (simpara/ulink). The above two XPaths really are equivalent to:
sect1/(simpara|simpara/ulink) | simpara/ulink
Although this is not proper XPath syntax, the parentheses imply a different priority.
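A small experiment shows what the correct behavior looks like on the branch sect1, simpara, ulink: sect1/simpara must fire on simpara, and simpara/ulink must still fire one level deeper. The sketch below tests each XPath independently against the list of open elements. It is a deliberately naive stand-in (a suffix comparison, not the DFA from the previous column), and the class name ParallelMatchDemo is invented for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical example class, not part of HC.
class ParallelMatchDemo {
    // Stand-in for one compiled XPath: a relative path of child steps
    // matches when the open elements end with exactly those steps.
    static boolean matches(String[] steps, List<String> open) {
        if (steps.length > open.size()) return false;
        int offset = open.size() - steps.length;
        for (int i = 0; i < steps.length; i++) {
            if (!steps[i].equals(open.get(offset + i))) return false;
        }
        return true;
    }

    // Walks down one branch of a document and records which XPaths
    // match at each element; every XPath is tested on its own.
    static List<String> walk(String[] branch, String[] xpaths) {
        List<String> open = new ArrayList<String>();
        List<String> hits = new ArrayList<String>();
        for (String element : branch) {
            open.add(element);
            for (String xpath : xpaths) {
                if (matches(xpath.split("/"), open)) {
                    hits.add(xpath);
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String[] branch = { "sect1", "simpara", "ulink" };
        String[] xpaths = { "sect1/simpara", "simpara/ulink" };
        // prints [sect1/simpara, simpara/ulink]
        System.out.println(walk(branch, xpaths));
    }
}
```

Because each XPath keeps its own view of the document, a match on sect1/simpara cannot hide the later match on simpara/ulink, which is the property the single OR-ed DFA failed to preserve.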
So what went wrong? When I set out looking for an algorithm, I looked at examples in a specific context (regular expression) and I attempted to match them to a completely different context. I saw some problems (e.g. the symbol space for XPath is unlimited) but I missed this one.
I had to choose between spending several columns revising the algorithm and releasing HC with a less satisfactory but stable technical solution: running multiple DFAs in parallel. Add to the mix a self-imposed deadline: deliver a working version in this column, pause the project, and go gain some practical experience, as I did with XM.
As a technician, I'm inclined to ignore deadlines in favour of a more elegant technical solution. Yet, as a consultant, I have learned that it's best to release a stable but slower product early. Nothing beats practical experience, and the second release is your best chance to optimize performance.
To conclude this series on HC, here's a short how-to guide. Listing 7 is an HC application handler that formats a small subset of Docbook in HTML. Docbook is a popular DTD for technical publishing. The class defines methods for selected Docbook elements. @xpath tags mark the XPath. It also implements the HCHandler interface (which is not a lot of work given HCHandler defines no method; it is essentially a flag for the compiler).
The HC compiler differentiates start, end, and character events by their names. It is quite flexible when it comes to parameters. For example, notice that a characters method accepts the SAX character array or a String.
Run the HC compiler to create the corresponding table class:
javadoc -docletpath hc.jar;xerces.jar -doclet org.ananas.hc.compiler.CompilerDoclet -classpath hc.jar;xerces.jar -sourcepath src -d autosrc org.ananas.hc.test.*
If we review the parameters one by one, -docletpath is the classpath to the doclet, -doclet selects the Doclet, -classpath is the classpath for the application handler (do not confuse them), and -d is the output directory.
The last parameter (org.ananas.hc.test.*) selects the package to use.
Compile the application (including the generated table class) with javac or jikes and run. Congratulations, you're ready to roll.
As I mentioned in the introduction, I plan to stop developing HC in order to gain practical experience with it. It's far from complete, but as I have often stated in this column, I believe in pragmatically testing software to find out what needs to be improved.
I encourage you to give it a try. Download and install HC, test it, and report your findings on the ananas-discussion mailing list.
The next column will launch a third "Working XML" project. As always, the new project will be released as open source.
- You can download the code for this project from ananas.org. Follow the links to the CVS repository on developerWorks as well as the ananas-discussion mailing list. I encourage you to join the list and contribute your thoughts to the project.
- If you'd rather have a ZIP file, that's available too.
- Jikes is an excellent Java compiler from IBM that greatly improves your build time.
- Contrast this project with the SIA Parser from Robert Berlinski.
- You might also look at the JaxMe framework that compiles an XML schema into Java classes using SAX parsers.
- IBM WebSphere Studio Application Developer is an easy-to-use, integrated development environment for building, testing, and deploying J2EE (TM) applications, including generating XML documents from DTDs and schemas.

Benoît Marchal is a consultant and writer based in Namur, Belgium. He has just released the second edition of XML by Example. He is also the author of Applied XML Solutions and XML and the Enterprise. Details on his latest projects are at marchal.com. You can contact Benoît at bmarchal@pineapplesoft.com.




