Skip to main content

By clicking Submit, you agree to the developerWorks terms of use.

The first time you sign into developerWorks, a profile is created for you. Select information in your profile (name, country/region, and company) is displayed to the public and will accompany any content you post. You may update your IBM account at any time.

All information submitted is secure.

  • Close [x]

The first time you sign in to developerWorks, a profile is created for you, so you need to choose a display name. Your display name accompanies the content you post on developerworks.

Please choose a display name between 3-31 characters. Your display name must be unique in the developerWorks community and should not be your email address for privacy reasons.

By clicking Submit, you agree to the developerWorks terms of use.

All information submitted is secure.

  • Close [x]

Working XML: Compiling the paths and automating tests

A closer look at algorithms and JUnit

Benoit Marchal (bmarchal@pineapplesoft.com), Consultant, Pineapplesoft
Benoît Marchal is a consultant and writer based in Namur, Belgium. He has just released the second edition of XML by Example . He is also the author of Applied XML Solutions and XML and the Enterprise. Details on his latest projects are at marchal.com. You can contact Benoît at bmarchal@pineapplesoft.com.

Summary:  Work on HC, the SAX ContentHandler compiler, continues. This month, our columnist discusses the compilation algorithm. He also invests a bit of time automating tests with JUnit.

Each month in the Working XML column, Benoît Marchal discusses the progress of his open-source projects for XML developers, from design decisions to coding challenges. The new project called HC (short for Handler Compiler) will take some drudgery out of event-based XML parsing by automatically generating the SAX ContentHandler for a list of XPaths.

View more content in this series

Date:  01 Jan 2002
Level:  Introductory
Also available in:   Korean  Japanese

Activity:  9898 views
Comments:  

Work on HC, the Handler Compiler continues. Introduced last month in this column, HC aims to eliminate the need to track states in SAX ContentHandler. Tracking states is tedious and error prone. HC automates the process by compiling a proxy ContentHandler that takes care of state management and calls into an application handler where the application logic fits.

Although at first sight it may appear that HC demands more work from the programmer, it is important to understand that the proxy is compiled automatically from the application handler. No programming required.

Compiling the proxy

As discussed last month, I plan to use Deterministic Finite Automaton (DFA) to compile the proxy. DFAs are attractive because they are very efficient. Also, the algorithms to construct them are well documented. The algorithm I will use was originally designed to compile regular expressions, but I'm confident it can be easily adapted to XPaths.

I'll draw most of the algorithm material from Compilers: Principles, Techniques and Tools (see Resources), one of the definitive references when it comes to compiler construction. If you have a copy of the book handy, the DFA construction is algorithm 3.5.

Algorithm basics

Recall from our discussion last month that a DFA is a transition diagram. Figure 1 shows the transition diagram for the simpara/ulink XPath. The circles represent the states that the proxy will go through; the arrows are labelled with the elements that cause the DFA to transition. The thick circle marks a successful recognition of the XPath.


Figure 1. Transition diagram for simpara/ulink
Transition diagram

To better understand the algorithm, you might to compare this transition diagram with a stack. Essentially when the proxy encounters the simpara element, it pushes it on the stack. Next when it hits ulink, it also pushes that on the stack. The stack now has two elements: simpara and ulink, which is the configuration for simpara/ulink.

The various states in the diagram each represent one configuration of the stack. State 0 represents the empty stack, state 1 represents the stack with only one element: simpara. State 2 is a stack with two elements: simpara and ulink.

In other words, to build the DFA, you need to allocate as many states as there are possible stack configurations and compute the transition function between all these states.

As you can imagine, Figure 1 is too simplistic. A real proxy will process several transition diagrams in parallel because it will attempt to recognize not one but several XPaths. For example, a proxy might look for any of the following three XPaths:

simpara/ulink
/
/article/articleinfo/title

The number of states grows when the DFA attempts to recognize more XPaths. Indeed, thinking in terms of a stack, there would be more stack configurations to recognize those three XPaths than to recognize simpara/ulink.

QName

XML elements are identified by a combination of a namespace URI and a local name. To manipulate the pair more efficiently, I've created the QName class. To identify XML nodes completely, QName also records the nature of the node: element, attribute, or the root (/). The code for QName is in Listing 1.

QName implements equals() and hashCode() in order to be compatible with hashtables.

HCNode

The algorithm processes the set of XPaths and compiles three things: a set of states that the transition diagram might go through, the transition function to move from one state to another, and an indication of which states mark successful recognition of an XPath.

It expects the front end (reread previous column for more on the front end and back end) to return a parse tree that contains all the XPaths to recognize. Listing 2 is HCNode, the elements that appear in the tree. Note however that this listing is incomplete at this stage in HC development; I'll show a more complete version in the next column.

HCNode is in the org.ananas.hc.compiler package because it is specific to the compiler. QName was in the org.ananas.hc because I expect the proxy will use it.

A node can be one of the following (depending on its type property):

  • An XML node that is represented by a QName
  • The / separator in an XPath to indicate the "parent of" relationship between XML nodes
  • A combination of XPaths when the proxy recognizes two or more XPaths
  • A special node to mark the end of the parse tree

XML nodes hold a reference to a QName. The other types of node hold references to left and right HCNodes so as to construct the tree.

The DFA construction algorithm converts this parse tree into a set of states, as the previous column explains. For that, it computes how many XML nodes may appear after a given node. It's easier to understand this portion of the algorithm if you remember that each state represent a certain stack configuration. The algorithm essentially computes all the possible stack configurations that could lead to a given node in the parse tree.

To achieve this, HCNode provides the following methods:

  • n.firstpos(): the set of XML nodes that match the first element of an XPath rooted at node n
  • n.lastpost(): the set of XML nodes that match the last element of an XPath rooted at node n
  • n.nullable(): true if the node n is the root of an XPath that can be empty. False otherwise

Let me give a couple of examples. Let's start with the XPath simpara. This XPath would be parsed into a single node, n of type XML_NODE. Its n.firstpos() and n.lastpos() would be simpara because simpara is the only XML node that matches the XPath simpara.

The XPath simpara/ulink is parsed in a node n of type PARENT_OF with left and right nodes pointing to simpara and ulink respectively. The method n.firstpos() is simpara, while n.lastpost() is ulink because the XML node that matches the beginning of the XPath is simpara, and the one matching the end of the XPath is ulink.

Table 1 spells out the rules to compute firstpos() and nullable(). lastpos() is similar to firstpos(), but the rules for left and right are reversed. Incidentally you might wonder what nullable is all about; that will become clearer when I discuss processing relative XPaths in a later column.


Table 1. Computing firstpos() and nullable()
 firstpos()nullable()
n marks a relative XPathempty settrue
n is an XML node{ qname }false
OR_XPATHleft.firstpos() U right.firstpos()left.nullable() or right.nullable()
PARENT_OFif left.nullable() then left.firstpos() U right.firstpos() else left.firstpos()left.nullable() and right.nullable()

If you're feeling confused, just remember that states represent a stack configuration. firstpos() and lastpos() are the set of XML nodes in a given stack configuration. Eventually I'll assign each of these sets a unique number to turn it into a state.

Computing the DFA

Given the parse tree, you apply the algorithm in Listing 3 to compute the transition function. You can represent the transition function as a two dimensional matrix, dtran, that returns the next state given an XML node. Note that the algorithm in Listing 3 is pseudo-code, not actual Java code.


Listing 3. Pseudo-code for the algorithm
dstates &t;- root.firstpos()
for-each s as a state in dstates
   for-each a as an XML node in the input vocabulary
	   U <- s.followpos(a)
		if U is not empty and is not in dstates
		   dstates.append(U)
	   dtran[s,a] <- U

followpos(a) tells which XML nodes may follow a given XML node in the XPath. You compute it by applying the following rules:

  • If n is a PARENT_OF node and a is an XML node in left.lastpos(), then all nodes in right.firstpos() are in followpos(a)
  • If n indicates a relative path, and a is in n.lastpos(), then all the nodes n.firstpos() are in followpos(a)

JUnit

I was hoping to implement this algorithm completely in this column, but I have had to invest some time in learning JUnit. JUnit is an open-source framework for automated testing (see Resources). I have found out the hard way that automated testing is absolutely required when writing compilers.

Automated testing

A plain fact in any software project is that programmers introduce bugs in their software. I have read a statistic that says a typical programmer writes a bug for every 10 lines of code. Tests are designed to find and fix the bugs. Testing is a typical example of a highly repetitive, boring task. There's very little creativity involved when testing software: you feed it certain values and observe the results. If they differ from what you expect, you've found a bug.

Because they are so boring and repetitive, tests (or at least some tests) should be automated. After all, computers are more patient than programmers. Computers are therefore ideal candidates for boring and repetitive tasks.

Automated testing means that you write software to test other software. The beauty of this approach is that you can run the testing software as often as needed. All too often, we test only what has changed in the application. The problem, of course, is to decide what has changed. It's not uncommon that a change in one section of the code causes bugs in other sections. Unless you remember the two sections are related, you might not test for this condition.

Automated tests, on the other hand, take a brute-force approach. They have the potential to test the entire application on every run, which increases the chances of finding bugs in unrelated sections of the code.

Another benefit of automated tests is that bugs tend to reappear in the application. Whenever you have fixed a bug, it's a good idea to write a test to exercise the bug. Chances are that the test will fail several times in the project life as the bug resurfaces.

The extreme programing movement (see Resources) has popularized automated tests. I confess that I do not use automated tests systematically because my experience has been that they are most effective with classes whose public interface does not change often. They are less effective with those classes whose public interface changes a lot, and that includes most of the user-interface code.

The operative word in automated testing is automated, and it is useless to automate anything unless you intend to use it often. By the same token, it's more effective to write tests that you know you'll run often. Automated tests are also ideal for rather arcane code such as a tree manipulation and, in particular, compilers.

JUnit

Don't be fooled into thinking that automated testing is labor free and painless. You have to invest the time in writing the test cases and, of course, you need to run the test application regularly. In fact, I could argue that I would have finished writing a buggy, yet not-well-tested version of the DFA compiler this month if I hadn't spent so much time on automated tests. In all fairness, I spent more time on automated tests than I could have because I chose to learn JUnit.

Still, automated testing is an investment that pays off over the life of the project. It is faster to run a test than to try to automate it. For example, it might take five times more work to write test cases for a given class than to test the class once manually. Of course, writing the test case will pay off if you run the test more than six times. If you run the test, say, 50 times (which is not much for even a medium-size project), the payoff is tremendous.

Practically speaking, you can follow one of two strategies to write automated tests. Either you write the test when you write the code being tested, or you have another team of developers write the tests. For the time being, I'll be writing tests as I go along, but if you would like to volunteer to write test suites, by all means post on the ananas-discussion mailing list.

In the past, I wrote my own tests as plain Java classes. However, a friend suggested that I test JUnit. HC looked like a good project to test JUnit, so I downloaded it. JUnit is a simple and small framework for writing automated tests. JUnit does not do much (it's just a handful of classes), but it does help formalize the testing process.

Listing 4 is a test procedure I wrote to exercise QName. As you can see, the test class inherits from TestCase, a JUnit-defined class. I wrote my tests in testXXX() methods. I have three methods to exercise various aspects of the software: testing the constructor, testing the getters, and testing the equals() and hashCode() methods. The setUp() method, which is declared by JUnit, initializes the test objects.

I chose to place my tests in a separate package (org.ananas.hc.test) which goes against the recommendation from JUnit authors. I might live to regret that choice, but I prefer to separate tests from the actual code as it's easier to remove the tests from the distribution.

The beauty of using JUnit is that the framework takes care of loading the test class, running the test, and reporting the results. The framework also supports the notion of test suite, which lets me organize the tests in logical units. Finally, the framework provides graphical and console runners. The graphical console is good for interactively testing one or two classes, while the console runs all the tests in batch mode, for example, as part of a full rebuild.


What's next

I have not yet posted the code for HC at the ananas.org repository because it is not ready for prime time. I need to finalize the DFA compilation before I have anything new worth posting. However, I'm sure that the time invested in building automated tests (and learning about JUnit) was well spent; it will pay off over the life of this project. By next month, I expect to have a running version (albeit not optimized) version of the compiler and I expect to be working on the proxy.


Resources

  • The so-called "Dragon Book" (Compilers: Principles, Techniques and Tools by A. Aho, R. Sethi and J. Ullman) is the authoritative document on compiler design and construction.

  • JUnit is an open-source framework for running automated tests.

  • Extreme programming (XP) is a set of recommendations for achieving better quality software.

  • Pragmatic Programmers is another recent contender in the field of methodologies.

About the author

Benoit Marchal

Benoît Marchal is a consultant and writer based in Namur, Belgium. He has just released the second edition of XML by Example . He is also the author of Applied XML Solutions and XML and the Enterprise. Details on his latest projects are at marchal.com. You can contact Benoît at bmarchal@pineapplesoft.com.

Report abuse help

Report abuse

Thank you. This entry has been flagged for moderator attention.


Report abuse help

Report abuse

Report abuse submission failed. Please try again later.


developerWorks: Sign in


Need an IBM ID?
Forgot your IBM ID?


Forgot your password?
Change your password

By clicking Submit, you agree to the developerWorks terms of use.

 


The first time you sign into developerWorks, a profile is created for you. Select information in your profile (name, country/region, and company) is displayed to the public and will accompany any content you post. You may update your IBM account at any time.

Choose your display name

The first time you sign in to developerWorks, a profile is created for you, so you need to choose a display name. Your display name accompanies the content you post on developerWorks.

Please choose a display name between 3-31 characters. Your display name must be unique in the developerWorks community and should not be your email address for privacy reasons.

(Must be between 3 – 31 characters.)

By clicking Submit, you agree to the developerWorks terms of use.

 


Rate this article

Comments

static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=XML, Java technology
ArticleID=10620
ArticleTitle=Working XML: Compiling the paths and automating tests
publish-date=01012002
author1-email=bmarchal@pineapplesoft.com
author1-email-cc=dwxed@us.ibm.com