Two columns ago I launched HC, the Handler Compiler, as a new project for this column. The goal of HC is to compile a proxy content handler that matches XPaths to specific methods (the application handler) in SAX parsing.
I have found that in my own SAX programming I spend too much time on low-level, repetitive tasks such as state tracking. HC is my attempt to break free from those technicalities.
If you have not read the last two columns, I encourage you to review them now as this month I implement the algorithms introduced in the previous columns (see Related content at right for links).
Most specifically, in the last column I reviewed algorithms to compile a so-called Deterministic Finite Automaton (DFA). The DFA is a popular algorithm to construct a state machine to recognize patterns. In the case of HC, the patterns are XPaths.
Most of the development work for this column has been in implementing the DFA construction algorithm introduced last month. Before the actual algorithm, however, I first had to implement a few utility classes for message display. Most of them will be familiar to you if you are a regular reader of this column:
- Since the very early days of Java, I have missed something like
NotImplementedExceptionin the standard Java library. This is an exception to signal unexpected conditions. I have found that being able to report these exceptions with a specific error code (instead of a generic exception or, even worse, no reported error) has saved me hours tracking bugs. I had introduced a similar class for XM in theorg.ananas.utilpackage. This version is in theorg.ananas.hcpackage to indicate that it is part of the HC run time. -
Messengeris very similar to XM's messenger classes. The compiler interacts with the user through an implementation of theMessengerinterface. The interface has methods to report errors and display information messages. Because the compiler works in batch mode, it has no methods to prompt the user for data. Note that this class is specific to the compiler (and therefore resides in theorg.ananas.hc.compilerpackage). The HC run time (including the proxy) throwsSAXExceptionor descendants ofSAXExceptionto report errors. -
DefaultMessengeris a default implementation ofMessengerthat outputs messages to the console. To offer a GUI front end on the compiler, I need to provide only an alternate implementation ofMessengerthat displays messages in a window. -
MessageStoreoffers internationalization facilities. It essentially combines aResourceBundlewith the most useful methods inMessageFormat.
Although these classes are important in the overall HC operation (imagine a compiler that cannot report errors), they are not directly related to compiling XPaths and should be mostly self explanatory. Consequently I won't go over them in this column in any detail. Instead, I encourage you to download them and review the code (see Resources).
Two related elements come into play when constructing the DFA. From the discussion in the last column, you may recall that the algorithm to construct a DFA expects a parse tree as input. For HC, the parse tree represents an XPath.
In the previous column, I defined an HCNode class for the nodes in the parse tree. When implementing the algorithm, I realized I had forgotten two important things:
- There are no fields in
HCNode, as introduced in the previous column, to store what to do upon successful recognition of an XPath. I have added a generic object,data, to store such information. I have also introduced apriorityinteger to indicate the relative priority of XPaths. - The
followpos()function, which was explained in the previous column, was not implemented. I have added it.
I chose a generic object to store information when matching an XPath because I do not want to tie the parser to a given application. Deciding what to do when matching an XPath and how to assign priorities is not the responsibility of the DFA compiler.
As for followpos(), I have added an accessor and new code in the class constructor:
case PARENT_OF:
// compute lastpos() and firstpos()
Iterator iterator = left.lastpos().iterator();
while(iterator.hasNext())
{
XPathNode n = (XPathNode)iterator.next();
n.followpos.addAll(right.firstpos());
}
break; |
As written in the previous column, an HCNode encapsulates a QName. The latter represents an XML element while HCNode represents an element in an XPath (so it might be an XML element or a child-of relationship).
When working with it, I found that the distinction between QName and HCNode was not so clean-cut. Logically an XML element in an XPath should be represented by a QName, not a different class.
For readability, I chose to replace HCNode with a class that inherits from QName. To better reflects its use, I changed its name into XPathNode. Otherwise an XPathNode is identical to an HCNode as introduced previously. More specifically, it has the firstpos(), lastpos(), and followpos() methods.
Unfortunately this caused one problem: XPathNode inherits equals() and hashCode() from QName. Eventually, this led to a hard-to-track bug in followpos().
The problem is that QName represents symbols in the input vocabulary. Two symbols are identical if they have the same namespace URI and local name. In other words, simpara equals simpara.
Yet that's not true for XPathNode! They represent a certain element in an XPath so that the simpara in simpara/ulink is different from the simpara in simpara/emphasis
More specifically their followpos() should be different: ulink in the first case, emphasis in the second case.
I had to rewrite equals() for XPathNode such that two different nodes are treated differently if they appear in different XPaths. For debugging purposes, I implemented an extra comparison method, isSynonymousWith(). The new method compares two XPathNodes.
It can perform either a deep comparison that recursively compares the trees below the current node, or a quick comparison that stops at the current node. This method is intended for debugging only, as it lets me compare the result of a parse with expected results.
While working with the XPath parser, I also decided that the END_MARK, a special type of XPathNode/HCNode that marks the end of an XPath would be better represented as a QName, so I moved that constant from XPathNode to QName.
While we're on the topic of XPathNode/HCNode, I have found that java.util.Set has an addAll() method that is identical to the merge() I had written. I now use addAll() instead of merge().
Last but not least, I have adapted the JUnit tests for these classes. You can download XPathNode and its test from the CVS repository (see Resources).
The actual XPath parsing is the responsibility of XPathParser, shown in Listing 1. As discussed in previous columns, this class recognizes only a subset of XPath, which is consistent with my earlier decision not to support conditions in XPaths.
The class constructor takes up to three parameters:
-
NamespaceSupportis a SAX-defined class to process namespaces. It can decompose an XML qualified name (db:simpara) in a namespace URI and local name. -
MessageStoreis essentially a resource bundle with the messages used to report parsing errors. -
Messengeris used by the parser to report errors to the end user.
The bulk of the parsing takes place in the xpath() method. The method is plain, vanilla parsing. It uses a standard StringTokenizer to decompose the string in elementary tokens. Next it loops over the tokens looking for XML elements, attributes and parent-of separators (/).
As it recognizes tokens, it creates XPathNode with the appropriate type: ELEMENT, ATTRIBUTE, and PARENT-OF. Care must be taken for the absolute XPath indicator.
In practice, users of XPathParser will use the axpath() method more often than xpath(). The axpath() method returns what is known as an augmented parse tree -- a parser tree with one extra node for the END_MARK. The extra node holds information about what to do when matching the XPath. As discussed above, it uses the data and priority fields for that purpose.
I have written a JUnit test for the parser as well.
The DFAFactory class compiles a DFA from the parse tree. It implements the algorithm discussed last month in the createDFA() method. The method accepts a parse tree (as an XPathNode object) as parameter and, for now, returns an instance of DFATable.
DFATable is a temporary class I introduced primarily for testing. Eventually it will be replaced by the table class, as explained in the first HC column. This class provides three useful methods:
-
move()returns the state the DFA transitions to upon hitting a given XML element. Its parameters are aQName, for the XML element, and the current state. -
isAcceptingState()returns true if the current state is an accepting state. An accepting state is when an XPath has been recognized. In the parse tree, it corresponds to hitting theEND_MARK. -
getAssociatedData()returns the data associated with an XPath (or null if the current state is not an accepting state). Typically the data will tell us which methods to call when a certain XPath has been recognized.
Listing 2 is DFATable. As you can see, the class is not difficult. It has three data fields for the list of QNames the DFA recognizes: (symbols), the transition table (dtran), and the accepting states (astate).
DFAFactory, in Listing 3, is the core class in the HC compiler because it is responsible for compiling the transition table. createDFA() is a direct implementation of the DFA construction algorithm discussed in the previous column. In fact the algorithm appears in the comments of the code.
I think that the most difficult aspect of reading this code is to realize that states are represented by a set of XPathNodes. Remember the discussion of states and stacks in the first HC column: A state represents a certain stack configuration (that is, a list of XML elements and attributes as they appear on the stack).
In this algorithm, a set of XPathNodes represents a stack configuration. In fact, if you were to trace this algorithm, you would realize that it builds a list of all the possible stack configurations and the transitions between them.
The top loop in the algorithm iterates over the list of XML elements and attributes that may appear in the XML document. To compute the list, DFAFactory walks the parse tree and compiles a list of unique QNames through the uniqueSymbols() method.
createDFA() also builds a list of accepting states by checking whether a state corresponds to an END_MARK node.
e_closure() is a helper method that computes the list of nodes in followpos().
Although HC is not complete yet, its back end is nearly written now. It's missing only the proxy and the front end.
To test the DFA construction, I had to simulate the proxy and the front end. Listing 4 is DFAHandler, a class that simulates the proxy. It does not call into the application handler though; it limits itself to printing recognized XPaths.
The handler takes an instance of DFATable and uses the move() function to transition to a new state given the XML element.
Further imagine that you're interested in the db:sect1/db:simpara/db:ulink and db:sect1/db:simpara XPaths. The following code excerpt shows how to create a DFA that recognizes them:
NamespaceSupport namespace =
new NamespaceSupport();
namespace.declarePrefix("db",
"http://www.ananas.org/2001/docbook");
MessageStore message =
MessageStore.getMessageStore(
"org.ananas.hc.compiler.Message");
XPathParser parser =
new XPathParser(namespace,message);
String xpath1 = "db:sect1/db:simpara/db:ulink",
xpath2 = "db:sect1/db:simpara";
XPathNode first = parser.axpath(xpath1,1,xpath1),
second = parser.axpath(xpath2,2,xpath2),
root = new XPathNode(XPathNode.OR_XPATH,
first,second);
DFAFactory factory = new DFAFactory();
DFATable table = factory.createDFA(root);
XMLReader xparser =
XMLReaderFactory.createXMLReader(
"org.apache.xerces.parsers.SAXParser");
xparser.setContentHandler(new DFAHandler(table));
xparser.parse("file:doc.xml"); |
DFAHandler illustrates how HC will work soon. Admittedly the previous example is hardcoded for two XPaths, but that's only because HC still needs a front end (coming up in the next column).
However, you can already appreciate the power of using a compiler: the DFA construction algorithm takes care of all the state management.
Here's a quick update on the status of XM. You may recall that XM, short for XSLT Make, was the first project introduced in the Working XML column. XM is an affordable solution for publishing Web sites with XML and XSLT. I have temporarily paused XM development in order to collect user feedback.
So far I have received some feedback on the ananas-discussion mailing list, but I would love to hear more from you so feel free to contribute. Also last month I built two new Web sites with XM. The first one was a partial rewriting of the corporate Web site for Pineapplesoft (my consulting company) in XML. The second one was in the context of a short consulting mission to help an organization implement XML and XSLT.
The mailing list and the other experiences have uncovered several bugs, and I have fixed most of them. They have also given me more insight into how a third party might deploy XM. The experiences have also prompted me to add two new features:
- XM now has an option to ignore the external DTD of a document. This proves handy with SoftQuad XMetaL (and possibly other XML editors). I discovered that XMetaL inserts paths relative to one of its directories, which has the unfortunate side effect of breaking other XML applications (including XM until these corrections).
- XM now fully supports the
xsl:outputelement. Therefore you can use XM to preprocess XML documents, for example, for bulk XML conversion.
I will discuss these changes in more detail in a future column. In the meantime, I have committed the new code to the CVS repository. I encourage you to download it, test it, and report your findings on the ananas-discussion mailing list (see Resources).
- You can download the code for this project from ananas.org. Follow the links
to the CVS repository on developerWorks as well as to the ananas-discussion mailing list. I encourage to join the list and contribute your thoughts to the project.
- If you'd rather have a ZIP file, it's available too.
- The so-called "Dragon
Book" (Compilers: Principles, Techniques and Tools by A. Aho, R. Sethi, and J. Ullman) is the authoritative document on compiler design and construction.
-
JUnit is an open-source framework for running automated tests.
-
Extreme programming (XP) is a set of recommendations for achieving better quality software.
-
Pragramatic Programmers is another recent contender in the field of methodologies.
-
Pattern Hatching is an interview with Kent Beck, one of the fathers of XP.

Benoît Marchal is a consultant and writer based in Namur, Belgium. He has just released the second edition of XML by Example . He is also the author of Applied XML Solutions and XML and the Enterprise. Details on his latest projects are at marchal.com. You can contact Benoît at bmarchal@pineapplesoft.com.




