In the last two columns I have been working on XI (short for XML Import), a project to convert legacy files to XML (see Resources). The motivation for XI came from a need to publish an address book as part of an XML site. Because the address book is maintained in the proprietary format of an e-mail client, I needed a tool to convert the text to XML.
I took the opportunity to try the new regular-expression library built into JDK 1.4. Regular expressions make for a flexible conversion solution: Instead of hard-coding the conversion routine, I can describe how to parse the legacy document as a set of regular expressions. I will use one set of rules for the address book, but I could write different rules for calendars, chemical analysis data, Web server logs, or other formats. That makes XI a more generic tool that you or I can use and reuse in many projects.
In the previous column, "Wrestling with Java NIO" (see Resources), I spent a fair amount of time studying the regular-expression library. Some of my assumptions turned out to be completely off the mark, but I still managed to parse the address book into elements using regular expressions.
Because I'm aiming at a generic solution, I created a small data structure to hold the set of rules. It essentially associates XML tag names with regular expressions. Although I had to limit myself to a fixed data structure for testing, I organized the code so that it would be a simple matter to populate the data structure from a file, a feature I have now implemented.
This column is mostly about cleaning the code and making sure that it produces a valid XML document. I also worked on packaging the existing algorithm as an XML parser. As you will see, the XML parser interface proves handy when dealing with XSLT processors.
The best way to write XML documents
The easiest solution to finish XI would have been to revisit the code and adapt the various print statements to write XML tags. Indeed, the logic to parse the document and associate XML elements to the node is already there. For example, when it matches a regular expression, the algorithm prints the element associated with it, such as:
System.out.print(ruleset.getMatchAt(i).getQualifiedName());
It's not difficult to adapt this to produce proper XML:
System.out.print("<"+ruleset.getMatchAt(i).getQualifiedName()+">");
Of course, the above statement prints the start tag only, so I would need more print statements for the end tag and the content, but that's not difficult to do.
If I were only interested in writing an XML document, that's probably what I would do as it is the least demanding solution. Special care is needed to escape the angle brackets, ampersand, and other reserved characters, but that's trivial. I might also want to save the XML document in a file instead of printing to the console -- but again, that's trivial.
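To make the escaping step concrete, here is a minimal sketch of what such a routine could look like. The class and method names (XmlEscaper, escape) are hypothetical and are not part of XI; the sketch also assumes a more recent JDK than 1.4 because it uses StringBuilder.

```java
// Hypothetical helper, not part of XI: escapes the characters that
// XML reserves in element content and attribute values.
final class XmlEscaper {
    private XmlEscaper() {}

    static String escape(String text) {
        StringBuilder out = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '&':  out.append("&amp;");  break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&apos;"); break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }
}
```

A print statement would then wrap its content in a call such as XmlEscaper.escape(value) before writing it between the start and end tags.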
Yet I'm not happy with writing an XML document to a file. As you may recall from the previous columns, I don't plan to use the output from XI directly. Experience shows that one often needs to reorganize legacy documents. For example, with the address book, I will have to combine alias and note lines. I could add logic in XI to handle this and other similar cases, but I have found it advantageous to break the import process into two steps:
- A syntax conversion
- A data structure reorganization
The syntax conversion takes the textual information and wraps it in the simplest XML structure. Typically the resulting XML document is very close to the original document. In most cases, it's as simple as replacing delimiters with XML tags. That's what XI does.
The second step uses a transformation to turn this crude XML document into the target vocabulary. I have found XSLT is particularly suitable for that purpose because it's a powerful transformation language. And because XSLT is a standard, there's no shortage of support tools such as editors.
In a nutshell, I don't necessarily want XI to write the XML document in a file; I'd rather optimize it to interface with an XSLT processor. JDK 1.4 ships with a version of Apache Xalan that accepts input from files (streams), SAX events, and DOM trees. Of the three interfaces, my personal favorite is SAX.
SAX is attractive because it's simple to program and has a reasonably efficient interface when processing XML documents. Compared to files, it saves writing to a temporary file; compared to DOM, it requires less memory.
In the remainder of this article, I will assume that you are familiar with SAX programming. If not, you might want to turn to "SAX, the Power API," also on developerWorks (see Resources).
The two most important interfaces in SAX are XMLReader and ContentHandler. XMLReader describes how to initialize and start an XML parser, while ContentHandler lists the events that the XMLReader fires as it parses the XML document.
You've probably already used both interfaces when reading XML documents. Even if you're familiar with them, however, this application requires that you look at SAX from a slightly different angle. In this case, instead of being a user of SAX, I am writing my own parser. Strictly speaking, XI is not an XML parser; it does not read XML documents. However, it offers an XML view over text documents, so it can conform to the XMLReader interface.
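The idea of an XMLReader that reads something other than XML can be sketched in a few lines. The class below is hypothetical and is not XIReader: it presents every line of a text document as a line element, with no rules at all. As a shortcut it extends the standard helper org.xml.sax.helpers.XMLFilterImpl (which already implements the handler getters and setters) rather than implementing XMLReader directly, and it assumes the InputSource carries a character stream.

```java
import java.io.BufferedReader;
import java.io.IOException;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical class, not XIReader: exposes each line of a text
// document as a <line> element inside a <lines> root element.
class LineReader extends XMLFilterImpl {
    @Override
    public void parse(InputSource input) throws SAXException, IOException {
        ContentHandler handler = getContentHandler();
        if (handler == null)
            return; // nobody is listening, nothing to report
        if (input.getCharacterStream() == null)
            throw new SAXException("this sketch expects a character stream");
        BufferedReader reader = new BufferedReader(input.getCharacterStream());
        AttributesImpl noAttributes = new AttributesImpl();

        handler.startDocument();
        handler.startElement("", "lines", "lines", noAttributes);
        String line;
        while ((line = reader.readLine()) != null) {
            // One element per line of text, with the line as content
            handler.startElement("", "line", "line", noAttributes);
            handler.characters(line.toCharArray(), 0, line.length());
            handler.endElement("", "line", "line");
        }
        handler.endElement("", "lines", "lines");
        handler.endDocument();
    }
}
```

Any SAX consumer can register a ContentHandler on this reader and call parse(), exactly as it would with a regular XML parser.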
The SAX implementation of XI is in the class XIReader. The class is too large to reproduce here in its entirety. Before going any further, I encourage you to grab a copy from the Open Source section of developerWorks (see Resources).
XIReader deals with two issues: implementing the SAX interface and the actual text parsing and XML document generation. Listing 1 illustrates the implementation of the interface.
Listing 1: XIReader's SAX Implementation
public class XIReader
    implements XMLReader, Locator
{
    protected ContentHandler contentHandler = null;
    public ContentHandler getContentHandler()
    {
        return contentHandler;
    }
    public void setContentHandler(ContentHandler value)
        throws NullPointerException
    {
        if(value == null)
            throw new NullPointerException("ContentHandler");
        else
            contentHandler = value;
    }
    // ...
}
To support XMLReader, XIReader offers methods to register and access various SAX handlers: ContentHandler, ErrorHandler, DTDHandler, and EntityResolver.
Strictly speaking, DTDHandler and EntityResolver are useless: A legacy text has no DTDs, so XIReader will never fire DTD-related events.
Likewise, there's no need for EntityResolver; if you recall, the parser is not supposed to use it for the top-level document entity. The interface is only useful for external entities such as DTDs! Again, no use for legacy text documents. Still, SAX mandates methods to set and get the two handlers and XIReader obliges.
XIReader also implements limited support for SAX features and properties. Features and properties control various aspects of the parsing; they are identified by URLs such as http://xml.org/sax/features/namespaces. Be warned that the URLs act as identifiers only, and so do not try to resolve them. (Do not visit the Web site -- there's no Web site to visit.)
The specification states that an XMLReader must support setting the http://xml.org/sax/features/namespaces feature to true (supporting false is optional), and setting http://xml.org/sax/features/namespace-prefixes to false (true is optional).
The first feature controls whether the parser decodes XML namespaces (true) or not. XIReader always uses namespaces. The second feature controls whether to report the namespace declarations in the list of attributes (true) or not. XIReader supports both values.
As you can see, XIReader offers minimal conformance. Still, I found that I had to support setting http://xml.org/sax/features/namespace-prefixes to true (and not only false, as required by the specification) because Apache Xalan needs to set this feature to true to process namespaces properly.
The specification defines other features and their URLs, but a parser is not required to support them. Because most of these features deal with validation and XML schemas, I chose to ignore them.
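The feature rules described above can be sketched as follows. FeatureSupport is a hypothetical helper written for illustration; the real XIReader may organize this logic differently. Only the two feature URIs and the two SAX exception classes come from SAX itself.

```java
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;

// Hypothetical helper mirroring the feature rules described in the text;
// not the actual XIReader code.
class FeatureSupport {
    static final String NAMESPACES =
        "http://xml.org/sax/features/namespaces";
    static final String NAMESPACE_PREFIXES =
        "http://xml.org/sax/features/namespace-prefixes";

    private boolean namespacePrefixes = false;

    void setFeature(String name, boolean value)
            throws SAXNotRecognizedException, SAXNotSupportedException {
        if (NAMESPACES.equals(name)) {
            // Namespace processing is always on; only "true" is accepted
            if (!value)
                throw new SAXNotSupportedException(name + " cannot be turned off");
        } else if (NAMESPACE_PREFIXES.equals(name)) {
            // Both values are accepted
            namespacePrefixes = value;
        } else {
            // Validation, schema, and other optional features are ignored
            throw new SAXNotRecognizedException(name);
        }
    }

    boolean getFeature(String name) throws SAXNotRecognizedException {
        if (NAMESPACES.equals(name))
            return true;
        if (NAMESPACE_PREFIXES.equals(name))
            return namespacePrefixes;
        throw new SAXNotRecognizedException(name);
    }
}
```

Throwing SAXNotRecognizedException for unknown URIs and SAXNotSupportedException for a known feature with an unsupported value is the distinction SAX expects callers to rely on.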
I have also defined a new property, http://ananas.org/xi/features/rulesets, to give the parser its rules file. The property accepts an InputSource value that points to the rules file.
In the code discussed in the previous column, "Wrestling with Java NIO" (see Resources), the bulk of the processing took place in a method called read(). I renamed it match() for increased readability, and adapted it to call ContentHandler as it decodes the input document. Listing 2 illustrates this. If you compare this code with that in "Wrestling with Java NIO," you will find very similar structures. The only important difference is that print() statements have been replaced by various calls to ContentHandler.
Listing 2: match() and ContentHandler
public void match(Ruleset ruleset,String st,boolean firstMatch)
    throws SAXException
{
    attributes.clear();
    int i = 0;
    while(i < ruleset.getMatchCount())
    {
        if(ruleset.getMatchAt(i).matches(st))
        {
            Match match = ruleset.getMatchAt(i);
            if(firstMatch && contentHandler != null)
                contentHandler.startElement(match.getNamespaceURI(),
                                            match.getLocalName(),
                                            match.getQualifiedName(),
                                            attributes);
            for(int j = 1;j <= match.getGroupCount();j++)
            {
                // A group named after another ruleset is parsed recursively
                QName qname = match.getGroupNameAt(j);
                Ruleset nextRuleset = (Ruleset)rulesetsMap.get(qname);
                if(nextRuleset != null)
                    match(nextRuleset,match.getGroupValueAt(j),true);
                else
                {
                    Group group = match.getGroupNameAt(j);
                    if(contentHandler != null)
                    {
                        contentHandler.startElement(group.getNamespaceURI(),
                                                    group.getLocalName(),
                                                    group.getQualifiedName(),
                                                    attributes);
                        // Send the value in buffer-sized chunks
                        String value = match.getGroupValueAt(j);
                        int begin = 0,
                            end = 0;
                        while(begin < value.length())
                        {
                            if(value.length() - begin < chars.length)
                                end = value.length();
                            else
                                end = begin + chars.length;
                            value.getChars(begin,end,chars,0);
                            contentHandler.characters(chars,0,end - begin);
                            begin = end;
                        }
                        contentHandler.endElement(group.getNamespaceURI(),
                                                  group.getLocalName(),
                                                  group.getQualifiedName());
                    }
                }
            }
            // Apply the same ruleset to whatever follows the match
            String rest = match.rest();
            if(rest != null)
                match(ruleset,rest,false);
            if(firstMatch && contentHandler != null)
                contentHandler.endElement(match.getNamespaceURI(),
                                          match.getLocalName(),
                                          match.getQualifiedName());
            break;
        }
        else
            i++;
    }
    // Report an error only when no rule matched (the loop ran to the end)
    if(i >= ruleset.getMatchCount()
       && ruleset.getError() != null
       && errorHandler != null)
        errorHandler.error(new SAXParseException(ruleset.getError(),
                                                 this));
}
If you remember XM, the publishing project introduced in the very first Working XML column, you will be familiar with firing ContentHandler events. XM did so to fix dangling hyperlinks. XIReader builds on the same logic, but is more ambitious. Instead of firing one event for links, it fires enough events to describe a complete document.
I confess that I was originally anxious about writing a full implementation of XMLReader. But, as this column shows, it was surprisingly easy, which just proves the point that SAX truly lives up to its name as a Simple API for XML.
ContentHandler is particularly simple to use. Consider that you would normally have methods to print start and end tags as well as content. Those methods would take care of character escaping, indenting, and other syntax-related issues. ContentHandler essentially defines those methods for you. Use the startElement() method to print a start tag, the endElement() method to print an end tag, and the characters() method to print the content.
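To see a ContentHandler take care of the syntax, you can fire a few events at a serializing handler. The sketch below uses the standard JAXP TransformerHandler, which is a ContentHandler that writes whatever events it receives to a Result. The class name EventWriterDemo, the method writeAlias(), and the sample text are hypothetical; the exact layout of the output depends on the serializer that ships with your JDK.

```java
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical demo, not part of XI: writes one element through
// ContentHandler events and returns the serialized markup.
class EventWriterDemo {
    static String writeAlias(String text) throws Exception {
        SAXTransformerFactory factory =
            (SAXTransformerFactory) TransformerFactory.newInstance();
        TransformerHandler handler = factory.newTransformerHandler();
        handler.getTransformer()
               .setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        handler.setResult(new StreamResult(out));

        handler.startDocument();
        handler.startElement("", "alias", "alias", new AttributesImpl()); // start tag
        handler.characters(text.toCharArray(), 0, text.length());         // content
        handler.endElement("", "alias", "alias");                         // end tag
        handler.endDocument();
        return out.toString();
    }
}
```

Calling EventWriterDemo.writeAlias("Smith & Sons <sales>") returns markup in which the ampersand and the angle bracket are already escaped, without any escaping code on the caller's side.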
Having established an XMLReader, I wanted XI to be able to read a rules file. I already have applications of XI that go beyond the address book, and I want to break free from hard-coded regular expressions.
I have mostly preserved the vocabulary introduced two columns ago. A rules file would resemble Listing 3. The root element is rules; it contains one or more ruleset elements.
Each ruleset contains a list of matches that represent regular expressions. The error element details what to do if XI cannot match any of the regular expressions. Finally, group elements represent the groups in the regular expression. Attached to every element is an element name, which is the name that XI uses.
Listing 3: rules.xml
<?xml version="1.0"?>
<xi:rules version="1.0"
          xmlns:xi="http://ananas.org/2002/xi/rules"
          defaultPrefix="an"
          targetNamespace="http://ananas.org/2002/sample">
   <xi:ruleset name="address-book">
      <xi:match name="alias"
                pattern="^alias (.*):(.*)$">
         <xi:group name="id"/>
         <xi:group name="email"/>
      </xi:match>
      <xi:match name="note"
                pattern="^note .*:(.*)$">
         <xi:group name="fields"/>
      </xi:match>
      <xi:error message="unknown line type"/>
   </xi:ruleset>
   <xi:ruleset name="fields">
      <xi:match name="field"
                pattern="[\s]*&lt;([^&lt;]*)>">
         <xi:group name="field"/>
      </xi:match>
   </xi:ruleset>
</xi:rules>
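To check how a rule such as the alias match behaves, you can try its pattern directly against java.util.regex. The sketch below reuses the alias pattern from Listing 3; the class AliasRuleDemo, the method parseAlias(), and the sample line are hypothetical, and the actual address-book lines may look different.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo, not part of XI: applies the "alias" pattern
// from Listing 3 to a single line of text.
class AliasRuleDemo {
    static final Pattern ALIAS = Pattern.compile("^alias (.*):(.*)$");

    // Returns {id, email}, or null when the line is not an alias line.
    static String[] parseAlias(String line) {
        Matcher matcher = ALIAS.matcher(line);
        if (!matcher.matches())
            return null;
        // Group 1 maps to xi:group "id", group 2 to xi:group "email"
        return new String[] { matcher.group(1), matcher.group(2) };
    }
}
```

For a made-up line such as "alias jdoe:jdoe@example.org", the two groups are "jdoe" and "jdoe@example.org". Note that the first group is greedy, so a line with several colons is split at the last one.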
I made one change between the vocabulary in Listing 3 and the original one: The document now supports one global namespace that applies to the whole rules file. My original idea was to let the user specify multiple namespaces in the rules file, but that would have made XIReader needlessly complex.
As I studied the issue more, I realized that a global namespace fits 99% of all needs. But what if you really need multiple namespaces? You can still work around it because the document is post-processed in XSLT anyway. It's a simple matter to add the new namespaces in the style sheet.
One of the pleasures of writing this column is that I can reuse earlier projects as the column progresses. In this case, I use HC, the Handler Compiler introduced a few months ago, to simplify parsing the rules file.
If you missed the corresponding columns, HC is a precompiler that takes a Java class annotated with XPaths and turns it into a SAX ContentHandler. Each method in the class matches one or more XPaths. In practice, it saves writing a lot of tedious state-management code.
Listing 4 is the handler for the rules file. You can see those XPaths in Javadoc comments. The handler defines one method for each element in the rules vocabulary. As it reads through the rules file, it populates the data structure with the regular expressions.
Listing 4: RulesHandler.java
package org.ananas.xi;
import java.util.*;
import org.xml.sax.*;
/**
 * @xmlns xi http://ananas.org/2002/xi/rules
 */
public class RulesHandler
    implements org.ananas.hc.HCHandler
{
    private String namespaceURI = null;
    private String prefix = null;
    private List rulesets = null;
    private Ruleset getLastRuleset()
    {
        return (Ruleset)rulesets.get(rulesets.size() - 1);
    }
    /**
     * @xpath xi:rules
     */
    public void init(Attributes attributes)
    {
        rulesets = new ArrayList();
        namespaceURI = attributes.getValue("targetNamespace");
        prefix = attributes.getValue("defaultPrefix");
        // Treat empty or blank attribute values as missing
        if(namespaceURI != null)
        {
            namespaceURI = namespaceURI.trim();
            if(namespaceURI.equals(""))
                namespaceURI = null;
        }
        if(prefix != null)
        {
            prefix = prefix.trim();
            if(prefix.equals(""))
                prefix = null;
        }
    }
    /**
     * @xpath xi:rules/xi:ruleset
     */
    public void doRuleset(Attributes attributes)
        throws SAXException
    {
        String name = attributes.getValue("name");
        if(name != null)
            rulesets.add(new Ruleset(namespaceURI,
                                     name,
                                     prefix));
        else
            throw new SAXException("name attribute required for xi:ruleset");
    }
    /**
     * @xpath xi:rules/xi:ruleset/xi:match
     */
    public void doMatch(Attributes attributes)
        throws SAXException
    {
        String name = attributes.getValue("name"),
               pattern = attributes.getValue("pattern");
        if(name != null && pattern != null)
        {
            Ruleset ruleset = getLastRuleset();
            ruleset.addMatch(new Match(namespaceURI,
                                       name,
                                       prefix,
                                       pattern));
        }
        else
            throw new SAXException("name and pattern attributes " +
                                   "required for xi:match");
    }
    /**
     * @xpath xi:rules/xi:ruleset/xi:error
     */
    public void doError(Attributes attributes)
        throws SAXException
    {
        String message = attributes.getValue("message");
        if(message != null)
        {
            Ruleset ruleset = getLastRuleset();
            if(ruleset.getError() == null)
                ruleset.setError(message);
            else
                throw new SAXException("no more than one error per xi:ruleset");
        }
        else
            throw new SAXException("message attribute required for xi:error");
    }
    /**
     * @xpath xi:rules/xi:ruleset/xi:match/xi:group
     */
    public void doGroup(Attributes attributes)
        throws SAXException
    {
        String name = attributes.getValue("name");
        if(name != null)
        {
            Ruleset ruleset = getLastRuleset();
            Match match = ruleset.getLastMatch();
            match.addGroup(new Group(namespaceURI,
                                     name,
                                     prefix));
        }
        else
            throw new SAXException("name attribute required for xi:group");
    }
    public Ruleset[] getRulesets()
    {
        Ruleset[] array = new Ruleset[rulesets.size()];
        return (Ruleset[])rulesets.toArray(array);
    }
    public String getNamespaceURI()
    {
        return namespaceURI;
    }
    public String getPrefix()
    {
        return prefix;
    }
}
Work on XI is nearing completion. Now, you have a working processor, and it's a simple matter to interface it with an XSLT processor, as Listing 5 illustrates. In the next column, I'll wrap a simple user interface around the existing core to make XI even more useful.
Listing 5: Sample Main
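// Requires java.io.*, javax.xml.transform.*, javax.xml.transform.sax.SAXSource,
// javax.xml.transform.stream.StreamResult, org.xml.sax.*,
// org.xml.sax.helpers.XMLReaderFactory, and org.ananas.xi.XIReader.
public static void main(String[] params)
    throws TransformerException, TransformerConfigurationException,
           SAXException, IOException
{
    // params[0] names the legacy text file to import
    InputSource inputSource = new InputSource(new FileInputStream(params[0]));
    inputSource.setSystemId(params[0]);
    XMLReader xmlReader =
        XMLReaderFactory.createXMLReader("org.ananas.xi.XIReader");
    xmlReader.setProperty(XIReader.RULESETS_URI,new InputSource("rules.xml"));
    TransformerFactory factory = TransformerFactory.newInstance();
    // With no style sheet, newTransformer() returns an identity transformer
    Transformer transformer = factory.newTransformer();
    transformer.transform(new SAXSource(xmlReader,inputSource),
                          new StreamResult("result.xml"));
}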
Resources
- Review the earlier Working XML columns about XI: "Importing text as XML with XI," part 1 (developerWorks, April 2002), and "Wrestling with Java NIO," part 2 (developerWorks, June 2002).
- Read an introduction to SAX parsing, "SAX, the Power API," an excerpt from XML by Example, 2nd Edition (developerWorks, August 2001).
- Download the companion code for this column from the online source repository in the Open Source Projects section of developerWorks.
- Check out the reference on regular expressions, Mastering Regular Expressions by Jeffrey E. F. Friedl (O'Reilly, 1997; ISBN 1-56592-257-3).
- Read these Working XML columns that describe the development of the author's open-source HC, the Handler Compiler that automatically generates a SAX ContentHandler from a list of XPaths:
  - "Building a compiler for the SAX ContentHandler" introduces the HC project (developerWorks, November 2001).
  - "Compiling the paths and automating tests" discusses the compilation algorithm and shares the author's experiences with JUnit (developerWorks, January 2002).
  - "Compiling XPaths" describes the implementation of the DFA algorithm (developerWorks, January 2002).
  - "Compiling the proxy" describes the tweaks to the first release of HC, including a fix for an unexpected problem with the DFA algorithm, and outlines how to use HC (developerWorks, March 2002).
- Read all of Benoît Marchal's Working XML articles.
- Take the developerWorks tutorial "Understanding SAX" for an introduction to the topic (developerWorks, September 2001).
- Find more XML resources on the developerWorks XML technology zone.
- Get Rational Application Developer for WebSphere Software, an easy-to-use, integrated development environment for building, testing, and deploying J2EE applications, including generating XML documents from DTDs and schemas.
- Find out how you can become an IBM Certified Developer in XML and related technologies.

Benoît Marchal is a consultant and writer based in Namur, Belgium. He has just released the second edition of XML by Example. More details on this topic are at marchal.com. You can contact Benoît at bmarchal@pineapplesoft.com.