Working XML: Wrapping up XI

Implementing the XMLReader interface

Benoit Marchal (bmarchal@pineapplesoft.com), Consultant, Pineapplesoft
Benoît Marchal is a consultant and writer based in Namur, Belgium. He has just released the second edition of XML by Example. More details on this topic are at marchal.com. You can contact Benoît at bmarchal@pineapplesoft.com.

Summary:  Columnist Benoit Marchal continues to shape XI, an open-source project that converts legacy text to XML. For increased efficiency, XI now implements the SAX XMLReader interface, which proves handy in linking XI to an XSLT processor. Code samples demonstrate the techniques, and the complete source code is available as well. Each month the column reports on the author's open source projects designed to assist fellow XML developers, especially those working with Java technology.


Date:  01 Jul 2002
Level:  Intermediate


In the last two columns I have been working on XI (short for XML Import), a project to convert legacy files to XML (see Resources). The motivation for XI came from a need to publish an address book as part of an XML site. Because the address book is maintained in the proprietary format of an e-mail client, I needed a tool to convert the text to XML.

I took the opportunity to try the new regular-expression library built into JDK 1.4. Regular expressions make for a flexible conversion solution: Instead of hard-coding the conversion routine, I can describe how to parse the legacy document as a set of regular expressions. I will use one set of rules for the address book, but I could write different rules for other calendars or for chemical analysis data, Web server logs, or other formats. XI is a more generic tool that you or I would be able to use and reuse in many projects.

And now in XML

In the previous column, "Wrestling with Java NIO" (see Resources), I spent a fair amount of time studying the regular-expression library. Some of my assumptions turned out to be completely off the mark, but I still managed to parse the address book into elements using regular expressions.

Because I'm aiming at a generic solution, I created a small data structure to hold the set of rules. It essentially associates XML tag names with regular expressions. Although I had to limit myself to a fixed data structure for testing, I organized the code so that it would be a simple matter to populate the data structure from a file, a feature I have now implemented.

This column is mostly about cleaning the code and making sure that it produces a valid XML document. I also worked on packaging the existing algorithm as an XML parser. As you will see, the XML parser interface proves handy when dealing with XSLT processors.


The best way to write XML documents

The easiest solution to finish XI would have been to revisit the code and adapt the various print statements to write XML tags. Indeed, the logic to parse the document and associate XML elements to the node is already there. For example, when it matches a regular expression, the algorithm prints the element associated with it, such as:

System.out.print(ruleset.getMatchAt(i).getQualifiedName());

It's not difficult to adapt this to produce proper XML:

System.out.print("<"+ruleset.getMatchAt(i).getQualifiedName()+">");

Of course, the above statement prints the start tag only, so I would need more print statements for the end tag and the content, but that's not difficult to do.

If I were only interested in writing an XML document, that's probably what I would do as it is the least demanding solution. Special care is needed to escape the angle brackets, ampersand, and other reserved characters, but that's trivial. I might also want to save the XML document in a file instead of printing to the console -- but again, that's trivial.
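As an illustration of the escaping step mentioned above, here is a minimal sketch of a helper that replaces the reserved characters before text is printed between tags. The class name XmlEscape and the sample string are assumptions made for this example; they are not part of XI.

```java
public final class XmlEscape {
    private XmlEscape() {}

    /** Replaces the characters that are reserved in XML text and attribute values. */
    public static String escape(String text) {
        StringBuilder out = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&apos;"); break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Hypothetical address-book content wrapped in a hand-written tag.
        System.out.println("<note>" + escape("Tom & Jerry <tj@example.com>") + "</note>");
    }
}
```

Every print statement that emits content would have to go through such a helper, which is one more reason to prefer an API that handles escaping itself.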

Yet I'm not happy with writing an XML document to a file. As you may recall from the previous columns, I don't plan to use the output from XI directly. Experience shows that one often needs to reorganize legacy documents. For example, with the address book, I will have to combine alias and note lines. I could add logic in XI to handle this and other similar cases, but I have found it advantageous to break the import process into two steps:

  • A syntax conversion
  • A data structure reorganization

The syntax conversion takes the textual information and wraps it in the simplest XML structure. Typically the resulting XML document is very close to the original document. In most cases, it's as simple as replacing delimiters with XML tags. That's what XI does.

The second step uses a transformation to turn this crude XML document into the target vocabulary. I have found XSLT is particularly suitable for that purpose because it's a powerful transformation language. And because XSLT is a standard, there's no shortage of support tools such as editors.

In a nutshell, I don't necessarily want XI to write the XML document in a file; I'd rather optimize it to interface with an XSLT processor. JDK 1.4 ships with a version of Apache Xalan that accepts input from files (streams), SAX events, and DOM trees. Of the three interfaces, my personal favorite is SAX.

SAX is attractive because it's simple to program and has a reasonably efficient interface when processing XML documents. Compared to files, it saves writing to a temporary file; compared to DOM, it requires less memory.


Programming the SAX interface

In the remainder of this article, I will assume that you are familiar with SAX programming. If not, you might want to turn to "SAX the Power API," also on developerWorks (see Resources).

The two most important interfaces in SAX are XMLReader and ContentHandler. XMLReader describes how to initialize and start an XML parser, while ContentHandler lists the events that the XMLReader fires as it parses the XML document.
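For readers who want a reminder of how the two interfaces cooperate on the consumer side, the sketch below registers a ContentHandler with the XMLReader that ships with the JDK and records the events it receives. The class name EventLogger and the sample document are assumptions for this example only.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class EventLogger extends DefaultHandler {
    private final List<String> events = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        events.add("start:" + localName);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // A parser may deliver one run of text in several calls.
        events.add("text:" + new String(ch, start, length));
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        events.add("end:" + localName);
    }

    /** Parses the XML string and returns the list of events fired by the reader. */
    public static List<String> parse(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        XMLReader reader = factory.newSAXParser().getXMLReader();
        EventLogger logger = new EventLogger();
        reader.setContentHandler(logger);   // register the handler
        reader.parse(new InputSource(new StringReader(xml)));  // start parsing
        return logger.events;
    }

    public static void main(String[] args) throws Exception {
        for (String event : parse("<alias><id>jdoe</id></alias>")) {
            System.out.println(event);
        }
    }
}
```

XI sits on the other side of this contract: it plays the role of the reader and is responsible for calling the handler.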

You've probably already used both interfaces when reading XML documents. Even if you're familiar with them, however, this application requires that you look at SAX from a slightly different angle. In this case, instead of being a user of SAX, I am writing my own parser. Strictly speaking, XI is not an XML parser; it does not read XML documents. However, it offers an XML view over text documents, therefore it can conform to the XMLReader interface.

The SAX implementation of XI is in the class XIReader. The class is too large to reproduce here in its entirety. Before going any further, I encourage you to grab a copy from the Open Source section of developerWorks (see Resources).

XIReader deals with two issues: implementing the SAX interface and the actual text parsing and XML document generation. Listing 1 illustrates the implementation of the interface.


Listing 1: XIReader's SAX Implementation

public class XIReader
   implements XMLReader, Locator
{
   protected ContentHandler contentHandler = null;

   public ContentHandler getContentHandler()
   {
      return contentHandler;
   }

   public void setContentHandler(ContentHandler value)
      throws NullPointerException
   {
      if(value == null)
         throw new NullPointerException("ContentHandler");
      else
         contentHandler = value;
   }

   // ...

}

To support XMLReader, XIReader offers methods to register and access various SAX handlers: ContentHandler, ErrorHandler, DTDHandler, and EntityResolver.

Strictly speaking, DTDHandler and EntityResolver are useless: A legacy text has no DTDs, so XIReader will never fire DTD-related events.

Likewise, there's no need for EntityResolver; if you recall, the parser is not supposed to use it for the top-level document entity. The interface is only useful for external entities such as DTDs! Again, no use for legacy text documents. Still, SAX mandates methods to set and get the two handlers and XIReader obliges.

XIReader also implements limited support for SAX features and properties. Features and properties control various aspects of the parsing; they are identified by URLs such as http://xml.org/sax/features/namespaces. Be warned that the URLs act as identifiers only, and so do not try to resolve them. (Do not visit the Web site -- there's no Web site to visit.)

The specification states that an XMLReader must support setting the http://xml.org/sax/features/namespaces feature to true (supporting false is optional), and setting http://xml.org/sax/features/namespace-prefixes to false (true is optional).

The first feature controls whether the parser decodes XML namespaces (true) or not. XIReader always uses namespaces. The second feature controls whether to report the namespace declarations in the list of attributes (true) or not. XIReader supports both values.

As you can see, XIReader offers minimal conformance. Still, I found that I had to support setting http://xml.org/sax/features/namespace-prefixes to true (and not only false, as required by the specification) because Apache Xalan needs to set this feature to true to process namespaces properly.

The specification defines other features and their URLs, but a parser is not required to support them. Because most of these features deal with validation and XML schemas, I chose to ignore them.
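The feature rules described above can be sketched as follows. This is not the XIReader source: the class name FeatureSupport is hypothetical, and only the two feature URIs and the SAX exception classes come from the SAX API. It accepts true for namespaces, accepts both values for namespace-prefixes, and rejects every other feature.

```java
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;

public class FeatureSupport {
    public static final String NAMESPACES =
        "http://xml.org/sax/features/namespaces";
    public static final String NAMESPACE_PREFIXES =
        "http://xml.org/sax/features/namespace-prefixes";

    private boolean namespacePrefixes = false;

    public void setFeature(String name, boolean value)
            throws SAXNotRecognizedException, SAXNotSupportedException {
        if (NAMESPACES.equals(name)) {
            // Namespace processing is always on; turning it off is not supported.
            if (!value) {
                throw new SAXNotSupportedException(name + " cannot be set to false");
            }
        } else if (NAMESPACE_PREFIXES.equals(name)) {
            namespacePrefixes = value;   // both values are accepted
        } else {
            // Validation, schema, and other features are simply unknown here.
            throw new SAXNotRecognizedException(name);
        }
    }

    public boolean getFeature(String name)
            throws SAXNotRecognizedException, SAXNotSupportedException {
        if (NAMESPACES.equals(name)) {
            return true;
        }
        if (NAMESPACE_PREFIXES.equals(name)) {
            return namespacePrefixes;
        }
        throw new SAXNotRecognizedException(name);
    }
}
```

The distinction between the two exceptions matters to callers: SAXNotRecognizedException means the reader has never heard of the feature, while SAXNotSupportedException means it knows the feature but cannot honor the requested value.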

I have also defined a new property, http://ananas.org/xi/features/rulesets, to give the parser its rules file. The property accepts an InputSource value that points to the rules file.


ContentHandler and parsing

In the code discussed in the previous column, "Wrestling with Java NIO" (see Resources), the bulk of the processing took place in a method called read(). I renamed it match() for increased readability, and adapted it to call ContentHandler as it decodes the input document. Listing 2 illustrates this. If you compare this code with that in "Wrestling with Java NIO," you will find very similar structures. The only important difference is that print() statements have been replaced by various calls to ContentHandler.


Listing 2: match() and ContentHandler

public void match(Ruleset ruleset,String st,boolean firstMatch)
   throws SAXException
{
   attributes.clear();
   int i = 0;
   while(i < ruleset.getMatchCount())
   {
      if(ruleset.getMatchAt(i).matches(st))
      {
         Match match = ruleset.getMatchAt(i);
         if(firstMatch && contentHandler != null)
            contentHandler.startElement(match.getNamespaceURI(),
                                        match.getLocalName(),
                                        match.getQualifiedName(),
                                        attributes);
         for(int j = 1;j <= match.getGroupCount();j++)
         {
            QName qname = match.getGroupNameAt(j);
            Ruleset nextRuleset = (Ruleset)rulesetsMap.get(qname);
            if(nextRuleset != null)
               match(nextRuleset,match.getGroupValueAt(j),true);
            else
            {
               Group group = match.getGroupNameAt(j);
               if(contentHandler != null)
               {
                  contentHandler.startElement(group.getNamespaceURI(),
                                              group.getLocalName(),
                                              group.getQualifiedName(),
                                              attributes);
                  String value = match.getGroupValueAt(j);
                  int begin = 0,
                      end = 0;
                  while(begin < value.length())
                  {
                     if(value.length() - begin < chars.length)
                        end = value.length();
                     else
                        end = begin + chars.length;
                     value.getChars(begin,end,chars,0);
                     contentHandler.characters(chars,0,end - begin);
                     begin = end;
                  }
                  contentHandler.endElement(group.getNamespaceURI(),
                                            group.getLocalName(),
                                            group.getQualifiedName());
               }
            }
         }
         String rest = match.rest();
         if(rest != null)
            match(ruleset,rest,false);
         if(firstMatch && contentHandler != null)
            contentHandler.endElement(match.getNamespaceURI(),
                                      match.getLocalName(),
                                      match.getQualifiedName());
         break;
      }
      else
         i++;
   }
   // Report an error only when no rule matched (the loop ran to the end).
   if(i >= ruleset.getMatchCount()
      && ruleset.getError() != null
      && errorHandler != null)
      errorHandler.error(new SAXParseException(ruleset.getError(),
                                               this));
}

If you remember XM, the publishing project introduced in the very first Working XML column, you will be familiar with firing ContentHandler events. XM did so to fix dangling hyperlinks. XIReader builds on the same logic, but is more ambitious. Instead of firing one event for links, it fires enough events to describe a complete document.

I confess that I was originally anxious about writing a full implementation of XMLReader. But, as this column shows, it was surprisingly easy, which just proves the point that SAX truly lives up to its name as a Simple API for XML.

ContentHandler is particularly simple to use. Consider that you would normally have methods to print start and end tags as well as content. Those methods would take care of character escaping, indenting, and other syntax-related issues. ContentHandler essentially defines those methods for you. Use the startElement() method to print a start tag, the endElement() method to print an end tag, and the characters() method to print the content.
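To make the "ContentHandler as writer" idea concrete, here is a small sketch that fires the three events by hand into a TransformerHandler, the serializing ContentHandler available from the JAXP transformer factory. The class name SaxWriter and the sample element are assumptions; the point is only that the handler, not the caller, takes care of escaping.

```java
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.helpers.AttributesImpl;

public class SaxWriter {
    /** Writes a single element with text content by firing SAX events. */
    public static String write(String name, String text) throws Exception {
        SAXTransformerFactory factory =
            (SAXTransformerFactory) TransformerFactory.newInstance();
        TransformerHandler handler = factory.newTransformerHandler();
        handler.getTransformer()
               .setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        handler.setResult(new StreamResult(out));

        handler.startDocument();
        handler.startElement("", name, name, new AttributesImpl()); // start tag
        char[] chars = text.toCharArray();
        handler.characters(chars, 0, chars.length);                 // content
        handler.endElement("", name, name);                         // end tag
        handler.endDocument();
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // The "<" in the content is escaped by the serializer, not by this code.
        System.out.println(write("note", "a < b"));
    }
}
```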


Reading the rules file

Having established an XMLReader, I wanted XI to be able to read a rules file. I already have applications of XI that go beyond the address book, and I want to break free from hard-coded regular expressions.

I have mostly preserved the vocabulary introduced two columns ago. A rules file would resemble Listing 3. The root element is rules; it contains one or more ruleset elements.

Each ruleset contains a list of matches that represent regular expressions. The error element details what to do if XI cannot match any of the regular expressions. Finally, group elements represent the groups in the regular expression. Attached to every element is an element name, which is the name that XI uses.


Listing 3: rules.xml

<?xml version="1.0"?>
<xi:rules version="1.0"
          xmlns:xi="http://ananas.org/2002/xi/rules"
          defaultPrefix="an"
          targetNamespace="http://ananas.org/2002/sample">

<xi:ruleset name="address-book">
   <xi:match name="alias"
             pattern="^alias (.*):(.*)$">
      <xi:group name="id"/>
      <xi:group name="email"/>
   </xi:match>
   <xi:match name="note"
             pattern="^note .*:(.*)$">
      <xi:group name="fields"/>
   </xi:match>
   <xi:error message="unknown line type"/>
</xi:ruleset>

<xi:ruleset name="fields">
   <xi:match name="field"
             pattern="[\s]*&lt;([^&lt;]*)>">
      <xi:group name="field"/>
   </xi:match>
</xi:ruleset>

</xi:rules>

I made one change between the vocabulary in Listing 3 and the original one: The document now supports one global namespace that applies for the whole rules file. My original idea was to let the user specify multiple namespaces in the rules files, but it makes XIReader needlessly complex.

As I studied the issue more, I realized that a global namespace fits 99% of all needs. But what if you really need multiple namespaces? You can still work around it because the document is post-processed in XSLT anyway. It's a simple matter to add the new namespaces in the style sheet.
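To see how a match and its groups in Listing 3 translate into elements, the following sketch applies the alias pattern with java.util.regex and names each group after the corresponding xi:group element. It is a simplified stand-in for XI, not its code: the class name AliasRule and the sample line are assumptions, the output is built as a string rather than as SAX events, and neither escaping nor the namespace declaration for the an prefix is handled.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AliasRule {
    // Pattern and group names copied from the "alias" match in Listing 3.
    private static final Pattern ALIAS = Pattern.compile("^alias (.*):(.*)$");
    private static final String[] GROUP_NAMES = { "id", "email" };

    /** Returns the XML for one line, or null when the rule does not match. */
    public static String toXml(String line) {
        Matcher m = ALIAS.matcher(line);
        if (!m.matches()) {
            return null;
        }
        StringBuilder xml = new StringBuilder("<an:alias>");
        for (int j = 1; j <= m.groupCount(); j++) {   // regex groups are 1-based
            String name = GROUP_NAMES[j - 1];
            xml.append("<an:").append(name).append('>')
               .append(m.group(j))
               .append("</an:").append(name).append('>');
        }
        return xml.append("</an:alias>").toString();
    }

    public static void main(String[] args) {
        // Hypothetical address-book line.
        System.out.println(toXml("alias jdoe:jdoe@example.com"));
    }
}
```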


HC to the rescue

One of the pleasures of writing this column is that I can reuse projects as it progresses. In this case, I use HC, the Handler Compiler introduced a few months ago, to simplify parsing the rules file.

If you missed the corresponding columns, HC is a precompiler that takes a Java class annotated with XPaths and turns it into a SAX ContentHandler. Each method in the class matches one or more XPaths. In practice, it saves writing a lot of tedious state-management code.

Listing 4 is the handler for the rules file. You can see those XPaths in Javadoc comments. The handler defines one method for each element in the rules vocabulary. As it reads through the rules file, it populates the data structure with the regular expressions.


Listing 4: RulesHandler.java

package org.ananas.xi;

import java.util.*;
import org.xml.sax.*;

/**
 * @xmlns xi http://ananas.org/2002/xi/rules
 */

public class RulesHandler
   implements org.ananas.hc.HCHandler
{
   private String namespaceURI = null;
   private String prefix = null;
   private List rulesets = null;

   private Ruleset getLastRuleset()
   {
      return (Ruleset)rulesets.get(rulesets.size() - 1);
   }

   /**
    * @xpath xi:rules
    */
   public void init(Attributes attributes)
   {
      rulesets = new ArrayList();
      namespaceURI = attributes.getValue("targetNamespace");
      prefix = attributes.getValue("defaultPrefix");
      if(namespaceURI != null)
      {
         namespaceURI = namespaceURI.trim();
         if(namespaceURI.equals(""))
            namespaceURI = null;
      }
      if(prefix != null)
      {
         prefix = prefix.trim();
         if(prefix.equals(""))
            prefix = null;
      }
   }

   /**
    * @xpath xi:rules/xi:ruleset
    */
   public void doRuleset(Attributes attributes)
      throws SAXException
   {
      String name = attributes.getValue("name");
      if(name != null)
         rulesets.add(new Ruleset(namespaceURI,
                                  name,
                                  prefix));
      else
         throw new SAXException("name attribute required for xi:ruleset");
   }

   /**
    * @xpath xi:rules/xi:ruleset/xi:match
    */
   public void doMatch(Attributes attributes)
      throws SAXException
   {
      String name = attributes.getValue("name"),
             pattern = attributes.getValue("pattern");
      if(name != null && pattern != null)
      {
         Ruleset ruleset = getLastRuleset();
         ruleset.addMatch(new Match(namespaceURI,
                                    name,
                                    prefix,
                                    pattern));
      }
      else
         throw new SAXException("name and pattern attributes " +
                 "required for xi:match");
   }

   /**
    * @xpath xi:rules/xi:ruleset/xi:error
    */
   public void doError(Attributes attributes)
      throws SAXException
   {
      String message = attributes.getValue("message");
      if(message != null)
      {
         Ruleset ruleset = getLastRuleset();
         if(ruleset.getError() == null)
            ruleset.setError(message);
         else
            throw new SAXException("no more than one error per xi:ruleset");
      }
      else
         throw new SAXException("message attribute required for xi:error");
   }

   /**
    * @xpath xi:rules/xi:ruleset/xi:match/xi:group
    */
   public void doGroup(Attributes attributes)
      throws SAXException
   {
      String name = attributes.getValue("name");
      if(name != null)
      {
         Ruleset ruleset = getLastRuleset();
         Match match = ruleset.getLastMatch();
         match.addGroup(new Group(namespaceURI,
                                  name,
                                  prefix));
      }
      else
         throw new SAXException("name attribute required for xi:group");
   }

   public Ruleset[] getRulesets()
   {
      Ruleset[] array = new Ruleset[rulesets.size()];
      return (Ruleset[])rulesets.toArray(array);
   }

   public String getNamespaceURI()
   {
      return namespaceURI;
   }

   public String getPrefix()
   {
      return prefix;
   }
}


Till next time

Work on XI is nearing completion. Now, you have a working processor, and it's a simple matter to interface it with an XSLT processor, as Listing 5 illustrates. In the next column, I'll wrap a simple user interface around the existing core to make XI even more useful.


Listing 5: Sample Main

public static void main(String[] params)
   throws TransformerException, TransformerConfigurationException,
          SAXException, IOException
{
   InputSource inputSource = new InputSource(new FileInputStream(params[0]));
   inputSource.setSystemId(params[0]);
   XMLReader xmlReader = 
           XMLReaderFactory.createXMLReader("org.ananas.xi.XIReader");
   xmlReader.setProperty(XIReader.RULESETS_URI,new InputSource("rules.xml"));
   TransformerFactory factory = TransformerFactory.newInstance();
   Transformer transformer = factory.newTransformer();
   transformer.transform(new SAXSource(xmlReader,inputSource),
                         new StreamResult("result.xml"));
}
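Listing 5 depends on XIReader and a rules file. For readers who want to try the same plumbing without downloading XI, here is a self-contained sketch that uses a toy reader in its place. LineReader is hypothetical: it wraps every input line in a line element and, unlike a conforming reader, silently accepts every feature and property. The SAXSource and Transformer calls are the same ones Listing 5 uses.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.*;
import org.xml.sax.helpers.AttributesImpl;

public class LineReader implements XMLReader {
    private ContentHandler contentHandler;
    private ErrorHandler errorHandler;
    private DTDHandler dtdHandler;
    private EntityResolver entityResolver;

    // Offers an XML view over plain text: one <line> element per input line.
    public void parse(InputSource input) throws IOException, SAXException {
        if (input.getCharacterStream() == null)
            throw new SAXException("this sketch only reads character streams");
        if (contentHandler == null) return;
        BufferedReader in = new BufferedReader(input.getCharacterStream());
        Attributes none = new AttributesImpl();
        contentHandler.startDocument();
        contentHandler.startElement("", "lines", "lines", none);
        for (String line = in.readLine(); line != null; line = in.readLine()) {
            contentHandler.startElement("", "line", "line", none);
            contentHandler.characters(line.toCharArray(), 0, line.length());
            contentHandler.endElement("", "line", "line");
        }
        contentHandler.endElement("", "lines", "lines");
        contentHandler.endDocument();
    }

    public void parse(String systemId) throws IOException, SAXException {
        throw new SAXException("this sketch only reads character streams");
    }

    // Handler registration, as mandated by XMLReader.
    public void setContentHandler(ContentHandler h) { contentHandler = h; }
    public ContentHandler getContentHandler() { return contentHandler; }
    public void setErrorHandler(ErrorHandler h) { errorHandler = h; }
    public ErrorHandler getErrorHandler() { return errorHandler; }
    public void setDTDHandler(DTDHandler h) { dtdHandler = h; }
    public DTDHandler getDTDHandler() { return dtdHandler; }
    public void setEntityResolver(EntityResolver r) { entityResolver = r; }
    public EntityResolver getEntityResolver() { return entityResolver; }

    // Simplification: a real reader would reject names it does not know.
    public void setFeature(String name, boolean value) { }
    public boolean getFeature(String name) { return false; }
    public void setProperty(String name, Object value) { }
    public Object getProperty(String name) { return null; }

    /** Runs the text through LineReader and an identity transformation. */
    public static String convert(String text) throws Exception {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        InputSource source = new InputSource(new StringReader(text));
        transformer.transform(new SAXSource(new LineReader(), source),
                              new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(convert("hello\nworld"));
    }
}
```

Passing a style sheet to newTransformer() instead of calling it without arguments would turn the identity transformation into the reorganization step discussed earlier.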


Resources
