Working XML: Putting XI to good use

User interface challenges

Benoit Marchal (bmarchal@pineapplesoft.com), Consultant, Pineapplesoft
Benoît Marchal is a consultant and writer based in Namur, Belgium. He has just released the second edition of XML by Example, which covers the latest features of the ever-evolving XML standard, including coverage of the final XML Schemas recommendation and the latest developments of XSL. More details on this topic are at marchal.com. You can contact Benoît at bmarchal@pineapplesoft.com.

Summary:  When it comes to user interfaces, simplification is the key. Fewer options and fewer controls mean less confusion and less chance for error. In his latest column on XI, Benoît uses this concept of less is more to create a user interface that makes XI, the text-to-XML conversion tool, easier to use and more palatable.

Date:  01 Aug 2002
Level:  Intermediate


XI (which stands for XML Import) is a tool for converting text files to XML. Many applications produce text files, so a tool like this can bring a lot of useful data into an XML workflow. I originally designed XI to retrieve an address book in XML. But because XI uses regular expressions (as introduced in JDK 1.4) to parse the input file, it works equally well with server logs, comma-separated values (CSVs), Excel files, and many other documents. Think of XI as a generic toolkit for importing your legacy data into a modern XML flow.

Combining regular expressions with XSLT

You might remember from my previous column, "Wrapping up XI," that I chose to implement the XMLReader interface. Instead of writing XML text directly, the XI parser reports its results as SAX events. In many respects, XI behaves like an XML parser -- but it parses any text document.

Listing 1 (which also appeared in the first column in the XI series) is an address book. It is maintained as a text document by an e-mail application.


Listing 1. Original address book
alias "jdoe" jdoe@xmli.com
note "jdoe" 
    <country:US><zip:45202>
    <state:OH><city:Cincinnati>
    <address:34 Fountain Square Plaza>
    <name:John Doe>
alias "jsmith" jsmith@worth-it.com
note "jsmith" 
    <first:Jack><last:Smith>
    <name:Jack Smith>
alias "pdupont" pdupont@pineapples.net
note "pdupont" <name:Pierre Dupont>

Two-step conversion

The premise behind XI is that converting text documents to XML is best done in two steps:

  1. The XI parser turns the document into a rough XML version. The rough XML is syntactically correct, but may not use the proper vocabulary.
  2. This rough XML document is post-processed with one or more XSLT style sheets. Experience shows that you often need a full-fledged scripting language to produce a clean XML document, and XSLT is one of the best languages available for that purpose.
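Step 1 can be pictured with a few lines of Java. The following is only a sketch, not XI's code: the class name RoughXmlSketch and its methods are made up for illustration, and it builds a string where XI reports SAX events. The regular expression is the one used by the "alias" rule that appears later in Listing 3.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical illustration of step 1; not part of XI. */
class RoughXmlSketch {
    // Same expression as the "alias" rule in the address book style sheet.
    private static final Pattern ALIAS =
        Pattern.compile("^alias ([^\\s]*) (.*)$");

    /** Returns a rough XML fragment, or null if the line is not an alias line. */
    static String aliasToXml(String line) {
        Matcher m = ALIAS.matcher(line);
        if (!m.matches()) {
            return null;
        }
        // Each capturing group becomes one child element.
        return "<alias><id>" + escape(m.group(1)) + "</id><email>"
             + escape(m.group(2)) + "</email></alias>";
    }

    /** Escapes the characters that are not allowed in element content. */
    static String escape(String text) {
        return text.replace("&", "&amp;")
                   .replace("<", "&lt;")
                   .replace(">", "&gt;");
    }

    public static void main(String[] args) {
        // prints <alias><id>"jdoe"</id><email>jdoe@xmli.com</email></alias>
        System.out.println(aliasToXml("alias \"jdoe\" jdoe@xmli.com"));
    }
}
```

Note that the quotes around the identifier survive in the rough XML. Cleaning up that kind of detail is exactly what step 2 is for.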

To convert the address book to XML, you need to reorganize some of the data. In the text document, for instance, the e-mail and postal addresses are on different lines; in XML, I want everything under one tag. Also, not every entry has the same type of data (such as a postal address). Finally, there are fields in the text document (first name, last name) that I do not need in XML. All this for a simple case. Other cases, such as having to compute the value of fields, can be much trickier. In a recent customer project (using a tool that is conceptually similar to XI), data had to be summed and averaged across several lines. A scripting language is definitely handy.

XSLT linking

At the end of my previous column, I included some code that takes the XIReader and passes it to a Java Transformer to save the result. The sample used an uninitialized Transformer to save the result of the parsing. In effect, this implements only the first step of the conversion. However, it's a simple matter to initialize the Transformer with an XSLT style sheet to perform the second step as well. Listing 2 shows the XSLTLink class, which encapsulates the complete conversion.


Listing 2. The XSLTLink class
package org.ananas.xi;

import java.io.*;
import org.xml.sax.*;
import org.xml.sax.helpers.*;
import javax.xml.transform.*;
import javax.xml.transform.sax.*;
import javax.xml.transform.stream.*;

public class XSLTLink
{
   private String name;
   private File stylesheet,
                output;
   private String suffix;
   private XMLReader xiReader;
   private Transformer transformer;

   public XSLTLink(File stylesheet,File output)
      throws IOException, SAXException,
             TransformerConfigurationException
   {
      this.stylesheet = stylesheet;
      this.output = output;
      name = stylesheet.getName();
      int pos = name.lastIndexOf('.');
      if(pos != -1)
         name = name.substring(0,pos);
      reload();
   }

   public void reload()
      throws IOException, SAXException,
             TransformerConfigurationException
   {
      InputSource input =
         new InputSource(stylesheet.toURI().toString());
      xiReader =
         XMLReaderFactory.createXMLReader("org.ananas.xi.XIReader");
      xiReader.setProperty(XIReader.RULESETS_URI,input);
      TransformerFactory factory = TransformerFactory.newInstance();
      transformer =
         factory.newTransformer(new StreamSource(stylesheet));
      suffix = transformer.getOutputProperty("method");
   }

   public File applyTo(File source,boolean overwrite)
      throws IOException, SAXException, TransformerException
   {
      if(!source.isFile())
         throw new IOException(source.getPath() + " is not a file");
      InputSource input = new InputSource(source.toURI().toString());
      String name = source.getName();
      int pos = name.lastIndexOf('.');
      String base = pos != -1 ? name.substring(0,pos + 1)
                              : name + '.';
      File result = new File(output,base + suffix);
      int index = 1;
      while(result.exists() && !overwrite)
      {
         result = new File(output,base + index + "." + suffix);
         index++;
      }
      transformer.transform(new SAXSource(xiReader,input),
                            new StreamResult(result));
      return result;
   }

   public String getDisplayName()
   {
      return name;
   }
}
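The difference between an uninitialized Transformer and one loaded with a style sheet is easy to see in isolation. The sketch below is not taken from XI: the class TransformerSketch, the sample document, and the tiny style sheet are assumptions made for illustration, and the input comes from a string rather than from the XIReader. To keep it short, the sample also leaves out the namespace that the real rough XML carries.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

/** Hypothetical illustration; not part of XI. */
class TransformerSketch {
    // Stand-in for the rough XML produced by step 1.
    static final String ROUGH =
        "<alias><id>jdoe</id><email>jdoe@xmli.com</email></alias>";

    // Stand-in for a clean-up style sheet (step 2).
    static final String STYLESHEET =
        "<xsl:stylesheet version='1.0'"
      + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
      + "<xsl:template match='alias'>"
      + "<address><email><xsl:value-of select='email'/></email></address>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    /** Step 1 only: a Transformer created without a style sheet copies its input. */
    static String copy(String xml) throws TransformerException {
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        return run(t, xml);
    }

    /** Steps 1 and 2: the same call, with the Transformer built from a style sheet. */
    static String convert(String xml, String stylesheet)
            throws TransformerException {
        Transformer t = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(stylesheet)));
        return run(t, xml);
    }

    private static String run(Transformer t, String xml)
            throws TransformerException {
        StringWriter out = new StringWriter();
        t.transform(new StreamSource(new StringReader(xml)),
                    new StreamResult(out));
        return out.toString();
    }
}
```

Calling copy(ROUGH) gives back the rough document, while convert(ROUGH, STYLESHEET) returns only the address element built by the template. XSLTLink does the second, with a SAXSource wrapped around the XI reader in place of the StreamSource.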

XSLTLink offers additional services. For one thing, it groups the XSLT style sheet and the regular expressions into one file. This is easy thanks to a forward-looking provision in the XSLT standard, which states that the processor must ignore elements that appear under xsl:stylesheet, but are defined in a different namespace.

Since the XI rules (regular expressions) have their own namespace, it's possible to copy them into the style sheet. The XI processor already looks for the rules anywhere in the document (it does not require them to appear at the root of the document). Listing 3 is a sample style sheet. Notice that it contains the XI rules as well.


Listing 3. The Import address book style sheet
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:xi="http://ananas.org/2002/xi/rules"
                xmlns:an="http://ananas.org/2002/sample">

<xi:rules version="1.0"
          defaultPrefix="an"
          targetNamespace="http://ananas.org/2002/sample">

<xi:ruleset name="address-book">
   <xi:match name="alias"
             pattern="^alias ([^\s]*) (.*)$">
      <xi:group name="id"/>
      <xi:group name="email"/>
   </xi:match>
   <xi:match name="note"
             pattern='^note ([^\s]*) (.*)$'>
      <xi:group name="id"/>
      <xi:group name="fields"/>
   </xi:match>
   <xi:error message="unknown line type"/>
</xi:ruleset>

<xi:ruleset name="fields">
   <xi:match name="fields"
             pattern="[\s]*&lt;([^&lt;]*)>">
      <xi:group name="field"/>
   </xi:match>
</xi:ruleset>

<xi:ruleset name="field">
   <xi:match name="field"
             pattern="([^:]*):(.*)">
      <xi:group name="key"/>
      <xi:group name="value"/>
   </xi:match>
</xi:ruleset>

</xi:rules>

<xsl:output method="xml"/>

<xsl:template match="an:address-book">
   <sect1><xsl:apply-templates/></sect1>
</xsl:template>

<xsl:template match="an:alias">
<address>
   <xsl:variable name="id" select="an:id"/>
   <xsl:for-each select="/an:address-book/an:note[an:id = $id]">
      <personname><xsl:value-of
         select="an:fields/an:field[an:key='name']/an:value"/>
      </personname>
      <xsl:if test="an:fields/an:field[an:key='country']/an:value">
         <street><xsl:value-of
            select="an:fields/an:field[an:key='address']/an:value"/>
         </street>
         <postcode><xsl:value-of
            select="an:fields/an:field[an:key='zip']/an:value"/>
         </postcode>
         <city><xsl:value-of
            select="an:fields/an:field[an:key='city']/an:value"/>
         </city>
         <state><xsl:value-of
            select="an:fields/an:field[an:key='state']/an:value"/>
         </state>
         <country><xsl:value-of
            select="an:fields/an:field[an:key='country']/an:value"/>
         </country>
      </xsl:if>
   </xsl:for-each>
   <email><xsl:value-of select="an:email"/></email>
</address>
</xsl:template>

<xsl:template match="an:note"/>

</xsl:stylesheet>
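To see why one file can serve both readers, it helps to run a miniature version of the idea. The sketch below is an illustration only: the class SharedFileSketch is hypothetical, and it uses a DOM lookup to find the rules, which is not necessarily how XI locates them. It feeds the same string to a namespace-aware parser that collects the xi:match patterns and to an XSLT processor that skips the xi:rules element.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical illustration; not part of XI. */
class SharedFileSketch {
    static final String XI_NS = "http://ananas.org/2002/xi/rules";

    // A style sheet that also carries one XI rule at the top level.
    static final String SHEET =
        "<xsl:stylesheet version='1.0'"
      + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
      + " xmlns:xi='" + XI_NS + "'>"
      + "<xi:rules version='1.0'><xi:ruleset name='field'>"
      + "<xi:match name='field' pattern='([^:]*):(.*)'/>"
      + "</xi:ruleset></xi:rules>"
      + "<xsl:output method='text'/>"
      + "<xsl:template match='/'>ok</xsl:template>"
      + "</xsl:stylesheet>";

    /** First reader: collects the pattern of every xi:match, wherever it appears. */
    static List<String> patterns(String sheet) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder()
            .parse(new InputSource(new StringReader(sheet)));
        NodeList matches = doc.getElementsByTagNameNS(XI_NS, "match");
        List<String> result = new ArrayList<>();
        for (int i = 0; i < matches.getLength(); i++) {
            result.add(((Element) matches.item(i)).getAttribute("pattern"));
        }
        return result;
    }

    /** Second reader: the XSLT processor, which ignores the xi:rules element. */
    static String transform(String sheet, String xml) throws Exception {
        StringWriter out = new StringWriter();
        TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(sheet)))
            .transform(new StreamSource(new StringReader(xml)),
                       new StreamResult(out));
        return out.toString();
    }
}
```

Neither reader needs to know about the other: the rules are invisible to the transformation, and the templates are invisible to the rule lookup.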

XSLTLink offers one additional service: It computes the output name from the input. Depending on the overwrite option, it may create unique names so that a new file never overwrites an existing one. The reason for this will become clearer in the section "What the UI should do."
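The naming scheme can be isolated from the rest of applyTo() in Listing 2. The following sketch mirrors that logic in a stand-alone method. The class and method names are hypothetical, and the suffix is passed in as a parameter instead of being read from the style sheet's output method.

```java
import java.io.File;

/** Hypothetical illustration of the naming scheme; not part of XI. */
class OutputNameSketch {
    /**
     * Derives the output file from the source name: book.txt becomes
     * book.xml, then book.1.xml, book.2.xml, and so on while the name
     * is taken and overwriting is not allowed.
     */
    static File outputFor(File source, File outputDir,
                          String suffix, boolean overwrite) {
        String name = source.getName();
        int pos = name.lastIndexOf('.');
        // Drop the original extension, if there is one.
        String base = pos != -1 ? name.substring(0, pos) : name;
        File result = new File(outputDir, base + "." + suffix);
        int index = 1;
        while (result.exists() && !overwrite) {
            result = new File(outputDir, base + "." + index + "." + suffix);
            index++;
        }
        return result;
    }
}
```

With overwrite set to true the loop never runs, so the same name is returned on every call and the previous result is replaced.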

Debugging tip

I see one major advantage in this two-step approach, namely that XSLT is a standard with many choices. Several XSLT processors are on the market; if one is only so-so, you can pick another. Books, editors, debuggers, and more are available to support the developer. However, there's one drawback: Since everything goes through a style sheet, you never see what XI generates. If something is not quite right in the output, is it the regular expressions, the style sheet, or a combination of the two? Hard to tell.

Fortunately, there's a work-around. The trick is to use a style sheet that copies its input verbatim. This only requires one template, as shown in Listing 4.


Listing 4. Using a style sheet that copies its input verbatim
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:xi="http://ananas.org/2002/xi/rules">

<xi:rules version="1.0"
          defaultPrefix="an"
          targetNamespace="http://ananas.org/2002/sample">

<xi:ruleset name="address-book">
   <xi:match name="alias"
             pattern="^alias ([^\s]*) (.*)$">
      <xi:group name="id"/>
      <xi:group name="email"/>
   </xi:match>
   <xi:match name="note"
             pattern='^note ([^\s]*) (.*)$'>
      <xi:group name="id"/>
      <xi:group name="fields"/>
   </xi:match>
   <xi:error message="unknown line type"/>
</xi:ruleset>

<xi:ruleset name="fields">
   <xi:match name="fields"
             pattern="[\s]*&lt;([^&lt;]*)>">
      <xi:group name="field"/>
   </xi:match>
</xi:ruleset>

<xi:ruleset name="field">
   <xi:match name="field"
             pattern="([^:]*):(.*)">
      <xi:group name="key"/>
      <xi:group name="value"/>
   </xi:match>
</xi:ruleset>

</xi:rules>

<xsl:output method="xml"/>

<xsl:template match="@*|node()">
   <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
   </xsl:copy>
</xsl:template>

</xsl:stylesheet>

When you start a new project, you should always copy this template and debug your regular expressions. When you are happy with the regular expressions, start writing the style sheet itself. Debugging is a two-step process.


User interface

Now, just the all-important user interface is left. When I released the first project in this column, XM (XSLT Make), I failed to include a proper user interface (it only has a command-line interface). The result? I still get e-mails from people who would like to use XM but are not comfortable with the command line. I'll fix this aspect of XM in a future column, but until then I might as well avoid falling into the same trap with this project.

The bulk of the work for this column was to provide a user interface for XI. This requires a lot of code so I will not reproduce the entire listing in this column. Besides, as you will see, it's straightforward AWT (Abstract Window Toolkit) code. Of course, you can download the entire project from the developerWorks open source section (see Resources).

UI decisions

First, I had to decide between a couple of options, starting with this one: Do I want a desktop or a Web-based interface? A desktop interface has a richer set of controls and it can often be made more user-friendly, but a Web-based interface can be made accessible from anywhere. Eventually, user-friendliness won and I went with the desktop interface.

The next decision was whether to use AWT or Swing. I know that for most Java technology developers, Swing is a no-brainer because it has a richer set of controls, yet I personally do not like Swing. I was enthusiastic at the first announcement of Swing, but I was very disappointed with the result. Simply put, I don't think Swing does a very good job emulating Windows. It just does not feel right.

I am intrigued by the IBM-sponsored Eclipse project and its SWT toolkit (see Resources). SWT provides a native version of the more advanced components, such as trees, toolbars, and tables. Still, it seemed like overkill to use SWT in this project, so I decided on regular AWT. In practice, the only thing I miss is the ability to make a button the default in a dialog box, and I can live with that.

What the UI should do

More important is the question of how to organize the user interface. I'm not much of a user interface designer so I like to start with an existing project and tweak it. In this instance, my starting point is a simple wrapper around Xalan that I wrote for my book, XML by Example (see Resources).

At the time, I reasoned that since I had to collect three parameters (input XML document, XSLT style sheet, and output document), I only needed three input fields, one "Transform" button, and a big display area for error messages, as shown in Figure 1.


Figure 1. Not-so-good user interface

Now that I've used this interface for several months, I have a pretty good idea of what works and what doesn't. First, the good things: It's functional and it's very easy to explain. Unfortunately, the interface also has a number of problems:

  • It's clumsy to use. If you've used a similar interface, you know what I mean. The interface works well if you have to process only one XML document with a single style sheet. In practice, one generally works with several documents and several style sheets. Selecting and reselecting the same files is an annoyance.

  • It's very easy to confuse the fields and place the XML document in the XSL field or vice versa.

  • It's even easier to overwrite a file by mistake.

Since I knew I would work on the user interface, I studied how we use XSLT processors here at Pineapplesoft. I found three distinct cases:

  • Debugging style sheets

  • Processing files manually

  • Running batches

When debugging, we repeatedly apply a style sheet to a fixed set of files. We need clear error messages and a convenient way to select the files. It's easier if the output files are in a directory of their own. In roughly half the cases, we want to preserve all our test cases. For the other half, the last test can erase the previous one.

When converting files manually, we often work with two or three well-tested style sheets. Selecting the right style sheet must be particularly convenient.

When working in a batch, we set all the parameters from the command line. We often use a special version of the software that works at the API level. In this case, however, my priority was not the batch interface.

Simplify, simplify, simplify

Putting all the pieces together, I decided that the best solution was to extend the existing interface so I could drag and drop files. I reasoned that if I could drop files into the various fields, it would be easier to cycle through style sheets or select files.

Drag and drop was introduced in JDK 1.2 and is surprisingly easy to use. To turn a regular component into a drop target, implement the DropTargetListener interface and register the listener with a DropTarget attached to the component.
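A listener that accepts dropped files might look like the sketch below. It is not the code from the XI user interface: the class name FileDropSketch and the received list are assumptions made for illustration, and it extends DropTargetAdapter (a convenience class that implements DropTargetListener) so that only drop() has to be written.

```java
import java.awt.datatransfer.DataFlavor;
import java.awt.datatransfer.Transferable;
import java.awt.datatransfer.UnsupportedFlavorException;
import java.awt.dnd.DnDConstants;
import java.awt.dnd.DropTargetAdapter;
import java.awt.dnd.DropTargetDropEvent;
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration; not the listener used by XI. */
class FileDropSketch extends DropTargetAdapter {
    /** Files received so far; a real application would process them instead. */
    final List<File> received = new ArrayList<>();

    @Override
    public void drop(DropTargetDropEvent event) {
        // Refuse anything that is not a list of files.
        if (!event.isDataFlavorSupported(DataFlavor.javaFileListFlavor)) {
            event.rejectDrop();
            return;
        }
        event.acceptDrop(DnDConstants.ACTION_COPY);
        try {
            received.addAll(filesIn(event.getTransferable()));
            event.dropComplete(true);
        } catch (UnsupportedFlavorException | IOException e) {
            event.dropComplete(false);
        }
    }

    /** Extracts the dropped files, or an empty list if there are none. */
    static List<File> filesIn(Transferable data)
            throws UnsupportedFlavorException, IOException {
        List<File> files = new ArrayList<>();
        if (data.isDataFlavorSupported(DataFlavor.javaFileListFlavor)) {
            Object list = data.getTransferData(DataFlavor.javaFileListFlavor);
            for (Object item : (List<?>) list) {
                files.add((File) item);
            }
        }
        return files;
    }
}
```

To activate the listener, attach it to a component with new DropTarget(component, new FileDropSketch()). That constructor needs a graphical environment, which is why the sketch keeps the file extraction in a separate static method that can be exercised on its own.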

I did a quick test and found that I never drop the files at the right place. Figure 2 illustrates this: I have dropped the style sheet into two fields by mistake.


Figure 2. Does drag and drop help?

Clearly, drag and drop is a good idea because it simplifies file selection, but three fields is two fields too many. Also, drag and drop does not work so well for the output file, which leads to the final interface (for now).

I'm a great believer in simplifying user interfaces down to the bone, leaving fewer options and fewer controls to reduce confusion. As Figure 3 illustrates, I removed all the fields. Indeed, I found that a drop-down list box is a good replacement for the style sheet field. The interface lists all the files from the rules directory, which is particularly convenient for cycling through a few style sheets. The application generates the names for the output files automatically. With no text fields, I can drop input files anywhere on the window -- much easier than having to hit a small text field. I have also provided an Open button for cases where drag and drop is inconvenient.
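Filling the drop-down list is mostly a matter of scanning a directory. The sketch below shows one way to do it. It is an illustration under stated assumptions: the class RulesDirectorySketch is hypothetical, and it assumes that style sheets are recognized by an .xsl extension, which the article does not specify. As in XSLTLink's getDisplayName(), the display name is the file name without its extension.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical illustration; not part of XI. */
class RulesDirectorySketch {
    /** Returns the sorted display names of the style sheets in a directory. */
    static List<String> displayNames(File rulesDir) {
        // Assumption: style sheets are the files ending in .xsl.
        File[] sheets = rulesDir.listFiles(
            (dir, name) -> name.toLowerCase().endsWith(".xsl"));
        List<String> names = new ArrayList<>();
        if (sheets == null) {
            return names; // not a directory, or it cannot be read
        }
        for (File sheet : sheets) {
            String name = sheet.getName();
            names.add(name.substring(0, name.lastIndexOf('.')));
        }
        Collections.sort(names);
        return names;
    }
}
```

Each name can then be added to a java.awt.Choice, so that switching style sheets is a single selection rather than a trip through a file dialog.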


Figure 3. A better user interface

I could not test this interface as extensively as I would have liked, so please share your comments on the mailing list (see Resources). I have found that a few options are required, including:

  • The option to overwrite or preserve the files in the output directory, which is useful for debugging

  • The option to reload the style sheet for each invocation (again, this is useful for debugging when the style sheet is being updated)

  • The option to close the software after the last transformation (primarily for batch mode)

I have also provided command-line arguments to set the input document, choose a style sheet, and change the options.


Conclusion

This article shows you how to build a tool to convert text documents into XML. I know from past experience that such a tool is very handy, if only for converting legacy documents to XML. As usual, you can find the source code for the application in the online source repository of developerWorks. You can either integrate XI as a library into your project (it uses the familiar XMLReader API) or turn to the user interface for occasional data conversion.


Resources
