XI (which stands for XML Import) is a tool for converting text files to XML. Many applications produce text files and consequently much useful data can be imported into an XML workflow with such a tool. I originally designed XI to retrieve an address book in XML. Yet XI uses regular expressions (as introduced in JDK 1.4) to parse the input file, so it works equally well with server logs, comma-separated values (CSVs), Excel files, and many other documents. Think of XI as a generic toolkit for importing your legacy data into a modern XML flow.
Combining regular expressions with XSLT
You might remember from my previous column, "Wrapping up XI," that I chose to implement the XMLReader interface. Instead of writing XML text directly, the XI parser reports its result through the same SAX interface that an XML parser uses, so any SAX-aware tool can consume it. In many respects, XI behaves like an XML parser -- but it parses any text document.
Listing 1 (which also appeared in the first column in the XI series) is an address book. It is maintained as a text document by an e-mail application.
Listing 1. Original address book
alias "jdoe" jdoe@xmli.com
note "jdoe"
<country:US><zip:45202>
<state:OH><city:Cincinnati>
<address:34 Fountain Square Plaza>
<name:John Doe>
alias "jsmith" jsmith@worth-it.com
note "jsmith"
<first:Jack><last:Smith>
<name:Jack Smith>
alias "pdupont" pdupont@pineapples.net
note "pdupont" <name:Pierre Dupont>
The premise behind XI is that converting text documents to XML is best done in two steps:
- The XI parser turns the document into a rough XML version. The rough XML is syntactically correct, but may not use the proper vocabulary.
- This rough XML document is post-processed with one or more XSLT style sheets. Experience shows that you often need a full-fledged scripting language to produce a clean XML document, and XSLT is one of the best languages available for that purpose.
To convert the address book to XML, you need to reorganize some of the data. In the text document, for instance, the e-mail and postal addresses are on different lines; in XML, I want everything under one tag. Also, not every entry has the same type of data (such as a postal address). Finally, there are fields in the text document (first name, last name) that I do not need in XML. All this for a simple case. Other cases, such as having to compute the value of fields, can be much trickier. In a recent customer project (using a tool that is conceptually similar to XI), data had to be summed and averaged across several lines. A scripting language is definitely handy.
At the end of my previous column, I included some code that passes the XIReader to a Java Transformer. The sample used an uninitialized Transformer, which simply copies the result of the parsing to the output. In effect, this implements only the first step of the conversion. However, it's a simple matter to initialize the Transformer with an XSLT style sheet to perform the second step as well. Listing 2 shows the XSLTLink class, which encapsulates the complete conversion.
Listing 2. The XSLTLink class
package org.ananas.xi;

import java.io.*;
import org.xml.sax.*;
import org.xml.sax.helpers.*;
import javax.xml.transform.*;
import javax.xml.transform.sax.*;
import javax.xml.transform.stream.*;

public class XSLTLink
{
   private String name;
   private File stylesheet,
                output;
   private String suffix;
   private XMLReader xiReader;
   private Transformer transformer;

   public XSLTLink(File stylesheet,File output)
      throws IOException, SAXException,
             TransformerConfigurationException
   {
      this.stylesheet = stylesheet;
      this.output = output;
      // the display name is the file name without its extension
      name = stylesheet.getName();
      int pos = name.lastIndexOf('.');
      if(pos != -1)
         name = name.substring(0,pos);
      reload();
   }

   public void reload()
      throws IOException, SAXException,
             TransformerConfigurationException
   {
      // the same file provides the XI rules and the XSLT templates
      InputSource input =
         new InputSource(stylesheet.toURI().toString());
      xiReader =
         XMLReaderFactory.createXMLReader("org.ananas.xi.XIReader");
      xiReader.setProperty(XIReader.RULESETS_URI,input);
      TransformerFactory factory = TransformerFactory.newInstance();
      transformer =
         factory.newTransformer(new StreamSource(stylesheet));
      // the output method (xml, html, text) becomes the file suffix
      suffix = transformer.getOutputProperty("method");
   }

   public File applyTo(File source,boolean overwrite)
      throws IOException, SAXException, TransformerException
   {
      if(!source.isFile())
         throw new IOException(source.getPath() + " is not a file");
      InputSource input = new InputSource(source.toURI().toString());
      // base is the source name up to and including the last dot
      String name = source.getName();
      int pos = name.lastIndexOf('.');
      String base = pos != -1 ? name.substring(0,pos + 1)
                              : name + '.';
      File result = new File(output,base + suffix);
      // unless overwriting, insert a counter until the name is unused
      int index = 1;
      while(result.exists() && !overwrite)
      {
         result = new File(output,base + index + "." + suffix);
         index++;
      }
      transformer.transform(new SAXSource(xiReader,input),
                            new StreamResult(result));
      return result;
   }

   public String getDisplayName()
   {
      return name;
   }
}
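The difference between an uninitialized Transformer and one initialized with a style sheet, which reload() relies on, can be tried in isolation with the standard JAXP API. The following is a minimal sketch, not part of XI: the class name TransformerSteps, the inline rough XML, and the inline style sheet are made up for illustration, and a StreamSource stands in for the SAXSource that wraps the XI reader.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

final class TransformerSteps {
    // Hypothetical rough XML, standing in for what the first step produces.
    static final String ROUGH =
        "<alias><id>jdoe</id><email>jdoe@xmli.com</email></alias>";

    // Hypothetical style sheet for the second step: it renames the element
    // and drops the id.
    static final String SHEET =
        "<xsl:stylesheet version='1.0'"
      + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:template match='alias'>"
      + "<address><email><xsl:value-of select='email'/></email></address>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Runs the given transformer over ROUGH and returns the serialized result.
    static String run(Transformer transformer) throws TransformerException {
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(ROUGH)),
                              new StreamResult(out));
        return out.toString();
    }

    // First step only: a Transformer created without a style sheet copies
    // its input to its output.
    static String copyOnly() {
        try {
            return run(TransformerFactory.newInstance().newTransformer());
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    // Both steps: the Transformer is initialized with the style sheet.
    static String withStylesheet() {
        try {
            StreamSource sheet = new StreamSource(new StringReader(SHEET));
            return run(TransformerFactory.newInstance().newTransformer(sheet));
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(copyOnly());
        System.out.println(withStylesheet());
    }
}
```

The only change between the two calls is the argument passed to newTransformer(), which is why XSLTLink can add the second step without touching the parsing code.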
XSLTLink offers additional services. For one thing, it groups the XSLT style sheet and the regular expressions into one file. This is easy thanks to a forward-looking provision in the XSLT standard, which states that the processor must ignore elements that appear under xsl:stylesheet, but are defined in a different namespace.
Since the XI rules (regular expressions) have their own namespace, it's possible to copy them into the style sheet. The XI processor already looks for the rules anywhere in the document (it does not require them to appear at the root of the document). Listing 3 is a sample style sheet. Notice that it contains the XI rules as well.
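Picking out rules wherever they appear in a document is straightforward with a namespace-aware SAX handler. The sketch below is an illustration of the idea, not XI's actual code: the class RuleFinder and its method are hypothetical, and it only collects the names of the rulesets it encounters.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class RuleFinder extends DefaultHandler {
    // Namespace of the XI rules, as declared in the style sheets.
    static final String XI_NS = "http://ananas.org/2002/xi/rules";

    private final List<String> rulesets = new ArrayList<>();

    // Called for every element, at any depth; only the namespace and the
    // local name matter, not the position in the document.
    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes atts) {
        if (XI_NS.equals(uri) && "ruleset".equals(localName)) {
            rulesets.add(atts.getValue("", "name"));
        }
    }

    // Parses the document and returns the ruleset names in document order.
    static List<String> rulesetNames(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            RuleFinder finder = new RuleFinder();
            factory.newSAXParser()
                   .parse(new InputSource(new StringReader(xml)), finder);
            return finder.rulesets;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Because the handler never checks which element is the parent, the same code works whether the rules sit in a file of their own or under xsl:stylesheet.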
Listing 3. The Import address book style sheet
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:xi="http://ananas.org/2002/xi/rules"
                xmlns:an="http://ananas.org/2002/sample">
   <xi:rules version="1.0"
             defaultPrefix="an"
             targetNamespace="http://ananas.org/2002/sample">
      <xi:ruleset name="address-book">
         <xi:match name="alias"
                   pattern="^alias ([^\s]*) (.*)$">
            <xi:group name="id"/>
            <xi:group name="email"/>
         </xi:match>
         <xi:match name="note"
                   pattern='^note ([^\s]*) (.*)$'>
            <xi:group name="id"/>
            <xi:group name="fields"/>
         </xi:match>
         <xi:error message="unknown line type"/>
      </xi:ruleset>
      <xi:ruleset name="fields">
         <xi:match name="fields"
                   pattern="[\s]*&lt;([^&lt;]*)>">
            <xi:group name="field"/>
         </xi:match>
      </xi:ruleset>
      <xi:ruleset name="field">
         <xi:match name="field"
                   pattern="([^:]*):(.*)">
            <xi:group name="key"/>
            <xi:group name="value"/>
         </xi:match>
      </xi:ruleset>
   </xi:rules>
   <xsl:output method="xml"/>
   <xsl:template match="an:address-book">
      <sect1><xsl:apply-templates/></sect1>
   </xsl:template>
   <xsl:template match="an:alias">
      <address>
         <xsl:variable name="id" select="an:id"/>
         <xsl:for-each select="/an:address-book/an:note[an:id = $id]">
            <personname><xsl:value-of
               select="an:fields/an:field[an:key='name']/an:value"/>
            </personname>
            <xsl:if test="an:fields/an:field[an:key='country']/an:value">
               <street><xsl:value-of
                  select="an:fields/an:field[an:key='address']/an:value"/>
               </street>
               <postcode><xsl:value-of
                  select="an:fields/an:field[an:key='zip']/an:value"/>
               </postcode>
               <city><xsl:value-of
                  select="an:fields/an:field[an:key='city']/an:value"/>
               </city>
               <state><xsl:value-of
                  select="an:fields/an:field[an:key='state']/an:value"/>
               </state>
               <country><xsl:value-of
                  select="an:fields/an:field[an:key='country']/an:value"/>
               </country>
            </xsl:if>
         </xsl:for-each>
         <email><xsl:value-of select="an:email"/></email>
      </address>
   </xsl:template>
   <xsl:template match="an:note"/>
</xsl:stylesheet>
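To see what the three rulesets in Listing 3 extract from a note line, the same regular expressions can be run directly with java.util.regex. This is a sketch of the idea rather than XI's implementation: it assumes that the "fields" expression is applied repeatedly to the text captured by the group of the same name, and that each match is then split by the "field" expression. The class NoteLine and the flat map it returns are illustrative only.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NoteLine {
    // The expressions of the "address-book" (note), "fields" and "field"
    // rulesets, written as Java string literals.
    static final Pattern NOTE = Pattern.compile("^note ([^\\s]*) (.*)$");
    static final Pattern FIELDS = Pattern.compile("[\\s]*<([^<]*)>");
    static final Pattern FIELD = Pattern.compile("([^:]*):(.*)");

    // Returns the id followed by every key/value pair found on the line.
    static Map<String, String> parse(String line) {
        Matcher note = NOTE.matcher(line);
        if (!note.matches()) {
            throw new IllegalArgumentException("unknown line type");
        }
        Map<String, String> result = new LinkedHashMap<>();
        result.put("id", note.group(1));
        // Assumption: the nested expression is applied again and again
        // until the captured text is exhausted.
        Matcher fields = FIELDS.matcher(note.group(2));
        while (fields.find()) {
            Matcher field = FIELD.matcher(fields.group(1));
            if (field.matches()) {
                result.put(field.group(1), field.group(2));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(
            parse("note \"jsmith\" <first:Jack><last:Smith> <name:Jack Smith>"));
    }
}
```

Note that the id group keeps the quotation marks from the address book, because the expression captures every non-space character after the keyword.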
XSLTLink offers one more service: It computes the output file name from the input file name. Depending on the overwrite option, it may generate unique names (by inserting a counter before the suffix) so that a new file never overwrites an existing one. The reason for this will become clearer in the discussion of what the UI should do.
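The naming scheme used by applyTo() in Listing 2 can be isolated in a few lines. The following sketch mirrors that logic; the class OutputNames and its method are hypothetical and exist only to make the behavior easy to try out.

```java
import java.io.File;

final class OutputNames {
    // Picks the output file for a source file name. The extension of the
    // source is replaced by the suffix; unless overwrite is set, a counter
    // is inserted until the name does not clash with an existing file.
    static File choose(File dir, String sourceName,
                       String suffix, boolean overwrite) {
        int pos = sourceName.lastIndexOf('.');
        String base = pos != -1 ? sourceName.substring(0, pos + 1)
                                : sourceName + '.';
        File result = new File(dir, base + suffix);
        int index = 1;
        while (result.exists() && !overwrite) {
            result = new File(dir, base + index + "." + suffix);
            index++;
        }
        return result;
    }
}
```

With an output directory that already contains book.xml, converting book.txt again yields book.1.xml when files are preserved and book.xml when overwriting is allowed.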
I see one major advantage in this two-step approach, namely that XSLT is a standard with many choices. Several XSLT processors are on the market; if one is only so-so, you can pick another. Books, editors, debuggers, and more are available to support the developer. However, there's one drawback: Since everything goes through a style sheet, you never see what XI generates. If something is not quite right in the output, is it the regular expressions, the style sheet, or a combination of the two? Hard to tell.
Fortunately, there's a work-around. The trick is to use a style sheet that copies its input verbatim. This only requires one template, as shown in Listing 4.
Listing 4. Using a style sheet that copies its input verbatim
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:xi="http://ananas.org/2002/xi/rules">
   <xi:rules version="1.0"
             defaultPrefix="an"
             targetNamespace="http://ananas.org/2002/sample">
      <xi:ruleset name="address-book">
         <xi:match name="alias"
                   pattern="^alias ([^\s]*) (.*)$">
            <xi:group name="id"/>
            <xi:group name="email"/>
         </xi:match>
         <xi:match name="note"
                   pattern='^note ([^\s]*) (.*)$'>
            <xi:group name="id"/>
            <xi:group name="fields"/>
         </xi:match>
         <xi:error message="unknown line type"/>
      </xi:ruleset>
      <xi:ruleset name="fields">
         <xi:match name="fields"
                   pattern="[\s]*&lt;([^&lt;]*)>">
            <xi:group name="field"/>
         </xi:match>
      </xi:ruleset>
      <xi:ruleset name="field">
         <xi:match name="field"
                   pattern="([^:]*):(.*)">
            <xi:group name="key"/>
            <xi:group name="value"/>
         </xi:match>
      </xi:ruleset>
   </xi:rules>
   <xsl:output method="xml"/>
   <xsl:template match="@*|node()">
      <xsl:copy>
         <xsl:apply-templates select="@*|node()"/>
      </xsl:copy>
   </xsl:template>
</xsl:stylesheet>
When you start a new project, you should always copy this template and debug your regular expressions. When you are happy with the regular expressions, start writing the style sheet itself. Debugging is a two-step process.
Now, just the all-important user interface is left. When I released the first project in this column, XM (XSLT Make), I failed to include a proper user interface (it only has a command-line interface). The result? I still get e-mails from people who would like to use XM but are not comfortable with the command line. I'll fix this aspect of XM in a future column, but until then I might as well avoid falling into the same trap with this project.
The bulk of the work for this column was to provide a user interface for XI. This requires a lot of code so I will not reproduce the entire listing in this column. Besides, as you will see, it's straightforward AWT (Abstract Window Toolkit) code. Of course, you can download the entire project from the developerWorks open source section (see Resources).
First, I had to decide between a couple of options, starting with this one: Do I want a desktop or a Web-based interface? A desktop interface has a richer set of controls and it can often be made more user-friendly, but a Web-based interface can be made accessible from anywhere. Eventually, user-friendliness won and I went with the desktop interface.
The next decision was whether to use AWT or Swing. I know that for most Java technology developers, Swing is a no-brainer because it has a richer set of controls, yet I personally do not like Swing. I was enthusiastic at the first announcement of Swing, but I was very disappointed with the result. Simply put, I don't think Swing does a very good job emulating Windows. It just does not feel right.
I am intrigued by the IBM-sponsored Eclipse project and its SWT toolkit (see Resources). SWT provides a native version of the more advanced components, such as trees, toolbars, and tables. Still, it seemed like overkill to use SWT in this project, so I decided on regular AWT. In practice, I'm only missing the ability to make a button the default in a dialog box, and I can live with that.
More important is the question of how to organize the user interface. I'm not much of a user interface designer so I like to start with an existing project and tweak it. In this instance, my starting point is a simple wrapper around Xalan that I wrote for my book, XML by Example (see Resources).
At the time, I reasoned that since I had to collect three parameters (input XML document, XSLT style sheet, and output document), I only needed three input fields, one "Transform" button, and a big display area for error messages, as shown in Figure 1.
Figure 1. Not-so-good user interface

Now that I've used this interface for several months, I have a pretty good idea of what works and what doesn't. First, the good things: It's functional and it's very easy to explain. Unfortunately, the interface also has a number of problems:
- It's clumsy to use. If you've used a similar interface, you know what I mean. The interface works well if you have to process only one XML document with a single style sheet. In practice, one generally works with several documents and several style sheets. Selecting and reselecting the same files is an annoyance.
- It's very easy to confuse the fields and place the XML document in the XSL field or vice versa.
- It's even easier to overwrite a file by mistake.
Since I knew I would work on the user interface, I studied how we use XSLT processors here at Pineapplesoft. I found three distinct cases:
- Debugging style sheets
- Processing files manually
- Running batches
When debugging, we repetitively apply a style sheet to a fixed set of files. We need clear error messages and a convenient way to select the files. It's easier if the output files are in a directory of their own. In roughly half the cases, we want to preserve all our test cases. For the other half, the last test can erase the previous one.
When converting files manually, we often work with two or three well-tested style sheets. Selecting the right style sheet must be particularly convenient.
When working in a batch, we set all the parameters from the command line. We often use a special version of the software that works at the API level. In this case, however, my priority was not the batch interface.
Putting all the pieces together, I decided that the best solution was to extend the existing interface so I could drag and drop files. I reasoned that if I could drop files into the various fields, it would be easier to cycle through style sheets or select files.
Drag and drop was introduced in JDK 1.2 and is surprisingly easy to use. It is sufficient to implement the DropTargetListener interface to turn a regular component into a drop target.
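As a rough illustration, a listener that accepts files dropped from the desktop might look like the sketch below. It is not the code shipped with XI: the class FileDropListener, its helper filesIn(), and the install() method are hypothetical names, and the listener merely collects the dropped files instead of converting them.

```java
import java.awt.Component;
import java.awt.datatransfer.DataFlavor;
import java.awt.datatransfer.Transferable;
import java.awt.dnd.DnDConstants;
import java.awt.dnd.DropTarget;
import java.awt.dnd.DropTargetDragEvent;
import java.awt.dnd.DropTargetDropEvent;
import java.awt.dnd.DropTargetEvent;
import java.awt.dnd.DropTargetListener;
import java.io.File;
import java.util.ArrayList;
import java.util.List;

final class FileDropListener implements DropTargetListener {
    private final List<File> dropped = new ArrayList<>();

    // Extracts the files carried by a drop; returns an empty list when the
    // data is not a list of files or cannot be read.
    static List<File> filesIn(Transferable transferable) {
        List<File> files = new ArrayList<>();
        try {
            if (transferable.isDataFlavorSupported(DataFlavor.javaFileListFlavor)) {
                Object data =
                    transferable.getTransferData(DataFlavor.javaFileListFlavor);
                for (Object item : (List<?>) data) {
                    files.add((File) item);
                }
            }
        } catch (Exception e) {   // UnsupportedFlavorException or IOException
            files.clear();
        }
        return files;
    }

    // The only callback that matters here: accept file lists, reject the rest.
    @Override
    public void drop(DropTargetDropEvent event) {
        if (!event.isDataFlavorSupported(DataFlavor.javaFileListFlavor)) {
            event.rejectDrop();
            return;
        }
        event.acceptDrop(DnDConstants.ACTION_COPY);
        dropped.addAll(filesIn(event.getTransferable()));
        event.dropComplete(true);
    }

    // The remaining callbacks are required by the interface but unused.
    @Override public void dragEnter(DropTargetDragEvent event) { }
    @Override public void dragOver(DropTargetDragEvent event) { }
    @Override public void dropActionChanged(DropTargetDragEvent event) { }
    @Override public void dragExit(DropTargetEvent event) { }

    List<File> getDropped() {
        return dropped;
    }

    // Turns any AWT component (a text field, or a whole window) into a
    // drop target; this call needs a graphical environment.
    void install(Component component) {
        new DropTarget(component, this);
    }
}
```

Registering the listener on a window rather than on an individual field is what makes it possible to accept a drop anywhere in the interface.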
I did a quick test and found that I never drop the files at the right place. Figure 2 illustrates this: I have dropped the style sheet into two fields by mistake.
Figure 2. Does drag and drop help?

Clearly, drag and drop is a good idea because it simplifies file selection, but three fields is two fields too many. Also, drag and drop does not work so well for the output file, which leads to the final interface (for now).
I'm a great believer in simplifying user interfaces down to the bone, leaving fewer options and fewer controls to reduce confusion. As Figure 3 illustrates, I removed all the fields. Indeed, I found that a drop-down list box is a good replacement for the style sheet field. The interface lists all the files from the rules directory, which is particularly convenient for cycling through a few style sheets. The application generates the names for the output files automatically. With no text fields, I can drop input files anywhere on the window -- much easier than having to hit a small text field. I have also provided an Open button for cases where drag and drop is inconvenient.
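Filling the drop-down list from the rules directory takes little code. The sketch below is one way to do it, under assumptions of my own: the class RulesDirectory is hypothetical, every regular file in the directory is treated as a style sheet, and entries are shown without their extension, as XSLTLink.getDisplayName() does.

```java
import java.awt.Choice;
import java.io.File;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class RulesDirectory {
    // Returns the sorted display names of the files in the rules directory.
    static List<String> displayNames(File rulesDir) {
        List<String> names = new ArrayList<>();
        File[] entries = rulesDir.listFiles();   // null if not a directory
        if (entries != null) {
            for (File entry : entries) {
                if (!entry.isFile()) {
                    continue;                    // skip subdirectories
                }
                String name = entry.getName();
                int pos = name.lastIndexOf('.');
                names.add(pos != -1 ? name.substring(0, pos) : name);
            }
        }
        Collections.sort(names);
        return names;
    }

    // Replaces the content of an AWT drop-down list with the current names.
    static void fill(Choice choice, File rulesDir) {
        choice.removeAll();
        for (String name : displayNames(rulesDir)) {
            choice.add(name);
        }
    }
}
```

Calling fill() again after the directory changes keeps the list in sync without restarting the application.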
Figure 3. A better user interface

I could not test this interface as extensively as I would have liked, so please share your comments on the mailing list (see Resources). I have found that a few options are required, including:
- The option to overwrite or preserve the files in the output directory, which is useful for debugging
- The option to reload the style sheet for each invocation (again, this is useful for debugging when the style sheet is being updated)
- The option to close the software after the last transformation (primarily for batch mode)
I have also provided command-line arguments to set the input document, choose a style sheet, and change the options.
This article shows you how to build a tool to convert text documents into XML. I know from past experience that such a tool is very handy, if only for converting legacy documents to XML. As usual, you can find the source code for the application in the online source repository of developerWorks. You can either integrate XI as a library into your project (it uses the familiar XMLReader API) or turn to the user interface for occasional data conversion.
- Download the companion code from the online source code repository. The project mailing list is at the same address.
- Check out IBM-sponsored Eclipse, which develops a generic IDE. Particularly interesting is the SWT, a set of native UI components for Java.
- In the XSLT Library, find sample style sheets to parse fixed-length input. Note the clever trick to input a text document in XML.
- Read XML by Example by Benoît Marchal. This introductory book on XML covers the latest features of the ever-evolving XML standard, including coverage of the final XML Schema recommendation and the latest developments in XSL.
- Read the reference on regular expressions, Mastering Regular Expressions by Jeffrey E.F. Friedl.
- Read all of Benoît Marchal's Working XML articles.
- Find more XML resources on the developerWorks XML technology zone.
- Get Rational Application Developer for WebSphere Software, an easy-to-use, integrated development environment for building, testing, and deploying J2EE applications, including generating XML documents from DTDs and schemas.
- Find out how you can become an IBM Certified Developer in XML and related technologies.

Benoît Marchal is a consultant and writer based in Namur, Belgium. He has just released the second edition of XML by Example, which covers the latest features of the ever-evolving XML standard, including coverage of the final XML Schemas recommendation and the latest developments of XSL. More details on this topic are at marchal.com. You can contact Benoît at bmarchal@pineapplesoft.com.




