Skip to main content

Working XML: Compiling XPaths

HC kicks off with a first implementation of DFA construction

Benoit Marchal (bmarchal@pineapplesoft.com), Consultant, Pineapplesoft
Benoît Marchal is a consultant and writer based in Namur, Belgium. He has just released the second edition of XML by Example . He is also the author of Applied XML Solutions and XML and the Enterprise. Details on his latest projects are at marchal.com. You can contact Benoît at bmarchal@pineapplesoft.com.

Summary:  The Java-based Handler Compiler (HC) project for SAX parsing nears its alpha release. This month our columnist describes how he implements the DFA construction algorithm, giving the first concrete example of using the compiler to recognize XPath.

Each month in the Working XML column, Benoît Marchal discusses the progress of his open-source projects for XML developers, from design decisions to coding challenges. The current project, HC (short for Handler Compiler), will take some drudgery out of event-based XML parsing by automatically generating the SAX ContentHandler for a list of XPaths.

Date:  01 Jan 2002
Level:  Introductory
Activity:  1276 views

Two columns ago I launched HC, the Handler Compiler, as a new project for this column. The goal of HC is to compile a proxy content handler that matches XPaths to specific methods (the application handler) in SAX parsing.

I have found that in my own SAX programming I spend too much time on low-level, repetitive tasks such as state tracking. HC is my attempt to break free from those technicalities.

If you have not read the last two columns, I encourage you to review them now as this month I implement the algorithms introduced in the previous columns (see Related content at right for links).

Most specifically, in the last column I reviewed algorithms to compile a so-called Deterministic Finite Automaton (DFA). The DFA is a popular algorithm to construct a state machine to recognize patterns. In the case of HC, the patterns are XPaths.

Messages and display

Most of the development work for this column has been in implementing the DFA construction algorithm introduced last month. Before the actual algorithm, however, I first had to implement a few utility classes for message display. Most of them will be familiar to you if you are a regular reader of this column:

  • Since the very early days of Java, I have missed something like NotImplementedException in the standard Java library. This is an exception to signal unexpected conditions. I have found that being able to report these exceptions with a specific error code (instead of a generic exception or, even worse, no reported error) has saved me hours tracking bugs. I had introduced a similar class for XM in the org.ananas.util package. This version is in the org.ananas.hc package to indicate that it is part of the HC run time.
  • Messenger is very similar to XM's messenger classes. The compiler interacts with the user through an implementation of the Messenger interface. The interface has methods to report errors and display information messages. Because the compiler works in batch mode, it has no methods to prompt the user for data. Note that this class is specific to the compiler (and therefore resides in the org.ananas.hc.compiler package). The HC run time (including the proxy) throws SAXException or descendants of SAXException to report errors.
  • DefaultMessenger is a default implementation of Messenger that outputs messages to the console. To offer a GUI front end on the compiler, I need to provide only an alternate implementation of Messenger that displays messages in a window.
  • MessageStore offers internationalization facilities. It essentially combines a ResourceBundle with the most useful methods in MessageFormat.

Although these classes are important in the overall HC operation (imagine a compiler that cannot report errors), they are not directly related to compiling XPaths and should be mostly self explanatory. Consequently I won't go over them in this column in any detail. Instead, I encourage you to download them and review the code (see Resources).


XPath parsing

Two related elements come into play when constructing the DFA. From the discussion in the last column, you may recall that the algorithm to construct a DFA expects a parse tree as input. For HC, the parse tree represents an XPath.

XPathNode

In the previous column, I defined an HCNode class for the nodes in the parse tree. When implementing the algorithm, I realized I had forgotten two important things:

  • There are no fields in HCNode, as introduced in the previous column, to store what to do upon successful recognition of an XPath. I have added a generic object, data, to store such information. I have also introduced a priority integer to indicate the relative priority of XPaths.
  • The followpos() function, which was explained in the previous column, was not implemented. I have added it.

I chose a generic object to store information when matching an XPath because I do not want to tie the parser to a given application. Deciding what to do when matching an XPath and how to assign priorities is not the responsibility of the DFA compiler.

As for followpos(), I have added an accessor and new code in the class constructor:

case PARENT_OF:
   // compute lastpos() and firstpos()
   Iterator iterator = left.lastpos().iterator();
   while(iterator.hasNext())
   {
      XPathNode n = (XPathNode)iterator.next();
      n.followpos.addAll(right.firstpos());
   }
   break;

As written in the previous column, an HCNode encapsulates a QName. The latter represents an XML element while HCNode represents an element in an XPath (so it might be an XML element or a child-of relationship).

When working with it, I found that the distinction between QName and HCNode was not so clean-cut. Logically an XML element in an XPath should be represented by a QName, not a different class.

For readability, I chose to replace HCNode with a class that inherits from QName. To better reflects its use, I changed its name into XPathNode. Otherwise an XPathNode is identical to an HCNode as introduced previously. More specifically, it has the firstpos(), lastpos(), and followpos() methods.

Unfortunately this caused one problem: XPathNode inherits equals() and hashCode() from QName. Eventually, this led to a hard-to-track bug in followpos().

The problem is that QName represents symbols in the input vocabulary. Two symbols are identical if they have the same namespace URI and local name. In other words, simpara equals simpara.

Yet that's not true for XPathNode! They represent a certain element in an XPath so that the simpara in simpara/ulink is different from the simpara in simpara/emphasis

More specifically their followpos() should be different: ulink in the first case, emphasis in the second case.

I had to rewrite equals() for XPathNode such that two different nodes are treated differently if they appear in different XPaths. For debugging purposes, I implemented an extra comparison method, isSynonymousWith(). The new method compares two XPathNodes.

It can perform either a deep comparison that recursively compares the trees below the current node, or a quick comparison that stops at the current node. This method is intended for debugging only, as it lets me compare the result of a parse with expected results.

While working with the XPath parser, I also decided that the END_MARK, a special type of XPathNode/HCNode that marks the end of an XPath would be better represented as a QName, so I moved that constant from XPathNode to QName.

While we're on the topic of XPathNode/HCNode, I have found that java.util.Set has an addAll() method that is identical to the merge() I had written. I now use addAll() instead of merge().

Last but not least, I have adapted the JUnit tests for these classes. You can download XPathNode and its test from the CVS repository (see Resources).

XPathParser

The actual XPath parsing is the responsibility of XPathParser, shown in Listing 1. As discussed in previous columns, this class recognizes only a subset of XPath, which is consistent with my earlier decision not to support conditions in XPaths.

The class constructor takes up to three parameters:

  • NamespaceSupport is a SAX-defined class to process namespaces. It can decompose an XML qualified name (db:simpara) in a namespace URI and local name.
  • MessageStore is essentially a resource bundle with the messages used to report parsing errors.
  • Messenger is used by the parser to report errors to the end user.

The bulk of the parsing takes place in the xpath() method. The method is plain, vanilla parsing. It uses a standard StringTokenizer to decompose the string in elementary tokens. Next it loops over the tokens looking for XML elements, attributes and parent-of separators (/).

As it recognizes tokens, it creates XPathNode with the appropriate type: ELEMENT, ATTRIBUTE, and PARENT-OF. Care must be taken for the absolute XPath indicator.

In practice, users of XPathParser will use the axpath() method more often than xpath(). The axpath() method returns what is known as an augmented parse tree -- a parser tree with one extra node for the END_MARK. The extra node holds information about what to do when matching the XPath. As discussed above, it uses the data and priority fields for that purpose.

I have written a JUnit test for the parser as well.


DFA construction

The DFAFactory class compiles a DFA from the parse tree. It implements the algorithm discussed last month in the createDFA() method. The method accepts a parse tree (as an XPathNode object) as parameter and, for now, returns an instance of DFATable.

DFATable

DFATable is a temporary class I introduced primarily for testing. Eventually it will be replaced by the table class, as explained in the first HC column. This class provides three useful methods:

  • move() returns the state the DFA transitions to upon hitting a given XML element. Its parameters are a QName, for the XML element, and the current state.
  • isAcceptingState() returns true if the current state is an accepting state. An accepting state is when an XPath has been recognized. In the parse tree, it corresponds to hitting the END_MARK.
  • getAssociatedData() returns the data associated with an XPath (or null if the current state is not an accepting state). Typically the data will tell us which methods to call when a certain XPath has been recognized.

Listing 2 is DFATable. As you can see, the class is not difficult. It has three data fields for the list of QNames the DFA recognizes: (symbols), the transition table (dtran), and the accepting states (astate).

DFAFactory

DFAFactory, in Listing 3, is the core class in the HC compiler because it is responsible for compiling the transition table. createDFA() is a direct implementation of the DFA construction algorithm discussed in the previous column. In fact the algorithm appears in the comments of the code.

I think that the most difficult aspect of reading this code is to realize that states are represented by a set of XPathNodes. Remember the discussion of states and stacks in the first HC column: A state represents a certain stack configuration (that is, a list of XML elements and attributes as they appear on the stack).

In this algorithm, a set of XPathNodes represents a stack configuration. In fact, if you were to trace this algorithm, you would realize that it builds a list of all the possible stack configurations and the transitions between them.

The top loop in the algorithm iterates over the list of XML elements and attributes that may appear in the XML document. To compute the list, DFAFactory walks the parse tree and compiles a list of unique QNames through the uniqueSymbols() method.

createDFA() also builds a list of accepting states by checking whether a state corresponds to an END_MARK node.

e_closure() is a helper method that computes the list of nodes in followpos().


Testing the compiler

Although HC is not complete yet, its back end is nearly written now. It's missing only the proxy and the front end.

To test the DFA construction, I had to simulate the proxy and the front end. Listing 4 is DFAHandler, a class that simulates the proxy. It does not call into the application handler though; it limits itself to printing recognized XPaths.

The handler takes an instance of DFATable and uses the move() function to transition to a new state given the XML element.

Further imagine that you're interested in the db:sect1/db:simpara/db:ulink and db:sect1/db:simpara XPaths. The following code excerpt shows how to create a DFA that recognizes them:

NamespaceSupport namespace =
   new NamespaceSupport();
namespace.declarePrefix("db",
               "http://www.ananas.org/2001/docbook");
MessageStore message =
   MessageStore.getMessageStore(
	   "org.ananas.hc.compiler.Message");
XPathParser parser =
   new XPathParser(namespace,message);
String xpath1 = "db:sect1/db:simpara/db:ulink",
       xpath2 = "db:sect1/db:simpara";
XPathNode first = parser.axpath(xpath1,1,xpath1),
          second = parser.axpath(xpath2,2,xpath2),
          root = new XPathNode(XPathNode.OR_XPATH,
			                      first,second);
DFAFactory factory = new DFAFactory();
DFATable table = factory.createDFA(root);
XMLReader xparser =
   XMLReaderFactory.createXMLReader(
	   "org.apache.xerces.parsers.SAXParser");
xparser.setContentHandler(new DFAHandler(table));
xparser.parse("file:doc.xml");


What's next with HC

DFAHandler illustrates how HC will work soon. Admittedly the previous example is hardcoded for two XPaths, but that's only because HC still needs a front end (coming up in the next column).

However, you can already appreciate the power of using a compiler: the DFA construction algorithm takes care of all the state management.


Interesting feedback on XM

Here's a quick update on the status of XM. You may recall that XM, short for XSLT Make, was the first project introduced in the Working XML column. XM is an affordable solution for publishing Web sites with XML and XSLT. I have temporarily paused XM development in order to collect user feedback.

So far I have received some feedback on the ananas-discussion mailing list, but I would love to hear more from you so feel free to contribute. Also last month I built two new Web sites with XM. The first one was a partial rewriting of the corporate Web site for Pineapplesoft (my consulting company) in XML. The second one was in the context of a short consulting mission to help an organization implement XML and XSLT.

The mailing list and the other experiences have uncovered several bugs, and I have fixed most of them. They have also given me more insight into how a third party might deploy XM. The experiences have also prompted me to add two new features:

  • XM now has an option to ignore the external DTD of a document. This proves handy with SoftQuad XMetaL (and possibly other XML editors). I discovered that XMetaL inserts paths relative to one of its directories, which has the unfortunate side effect of breaking other XML applications (including XM until these corrections).
  • XM now fully supports the xsl:output element. Therefore you can use XM to preprocess XML documents, for example, for bulk XML conversion.

I will discuss these changes in more detail in a future column. In the meantime, I have committed the new code to the CVS repository. I encourage you to download it, test it, and report your findings on the ananas-discussion mailing list (see Resources).


Resources

  • You can download the code for this project from ananas.org. Follow the links to the CVS repository on developerWorks as well as to the ananas-discussion mailing list. I encourage to join the list and contribute your thoughts to the project.

  • If you'd rather have a ZIP file, it's available too.

  • The so-called "Dragon Book" (Compilers: Principles, Techniques and Tools by A. Aho, R. Sethi, and J. Ullman) is the authoritative document on compiler design and construction.

  • JUnit is an open-source framework for running automated tests.

  • Extreme programming (XP) is a set of recommendations for achieving better quality software.

  • Pragramatic Programmers is another recent contender in the field of methodologies.

  • Pattern Hatching is an interview with Kent Beck, one of the fathers of XP.

About the author

Benoit Marchal

Benoît Marchal is a consultant and writer based in Namur, Belgium. He has just released the second edition of XML by Example . He is also the author of Applied XML Solutions and XML and the Enterprise. Details on his latest projects are at marchal.com. You can contact Benoît at bmarchal@pineapplesoft.com.

Comments (Undergoing maintenance)



Trademarks  |  My developerWorks terms and conditions

Help: Update or add to My dW interests

What's this?

This little timesaver lets you update your My developerWorks profile with just one click! The general subject of this content (AIX and UNIX, Information Management, Lotus, Rational, Tivoli, WebSphere, Java, Linux, Open source, SOA and Web services, Web development, or XML) will be added to the interests section of your profile, if it's not there already. You only need to be logged in to My developerWorks.

And what's the point of adding your interests to your profile? That's how you find other users with the same interests as yours, and see what they're reading and contributing to the community. Your interests also help us recommend relevant developerWorks content to you.

View your My developerWorks profile

Return from help

Help: Remove from My dW interests

What's this?

Removing this interest does not alter your profile, but rather removes this piece of content from a list of all content for which you've indicated interest. In a future enhancement to My developerWorks, you'll be able to see a record of that content.

View your My developerWorks profile

Return from help

static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=XML, Java technology, Web development
ArticleID=11634
ArticleTitle=Working XML: Compiling XPaths
publish-date=01012002
author1-email=bmarchal@pineapplesoft.com
author1-email-cc=dwxed@us.ibm.com

My developerWorks community

Tags

Help
Use the search field to find all types of content in My developerWorks with that tag.

Use the slider bar to see more or fewer tags.

Popular tags shows the top tags for this particular content zone (for example, Java technology, Linux, WebSphere).

My tags shows your tags for this particular content zone (for example, Java technology, Linux, WebSphere).

Use the search field to find all types of content in My developerWorks with that tag. Popular tags shows the top tags for this particular content zone (for example, Java technology, Linux, WebSphere). My tags shows your tags for this particular content zone (for example, Java technology, Linux, WebSphere).

Rate a product. Write a review.

Special offers