Working XML: Wrestling with Java NIO

How to get sidetracked into buffers and channels

Benoit Marchal (bmarchal@pineapplesoft.com), Consultant, Pineapplesoft
Benoît Marchal is a consultant and writer based in Namur, Belgium. He has just released the second edition of XML by Example. More details on this topic are available at marchal.com. You can contact Benoît at bmarchal@pineapplesoft.com.

Summary:  This column takes the XI project to the next step. Here, Benoît reports his findings with the new Java technology APIs -- in particular, the regular expression engine and the New I/O (also known as NIO). Although XI is not yet operational, you get a glimpse of what it will soon look like.

Date:  01 Jun 2002
Level:  Intermediate

My last installment introduced XI, a new tool project for this column. The challenge is this: My company maintains the Web site of a working group with XML and XM, a publishing solution based on XML. (XM was the first project of the "Working XML" column. See Resources.)

One of the documents for that site is the list of participants. For reasons that are outside the scope of this column, the list is maintained as an address book with an e-mail program. The file format, which has been imposed on me, is not XML. So, how do I feed it to an XML publishing solution? I need to convert this list into XML.

Of course it would not be difficult to write an ad hoc conversion routine, but this seems like a waste of time: There are other cases where one needs to feed non-XML documents into an XML solution. In addition to the address book, I may need to process agendas, spreadsheets, cataloging information, and other legacy data.

XI offers a fairly generic solution to this problem. It uses regular expressions (now built into JDK 1.4) to parse the input file and create an XML counterpart. The details of these conversions, as well as a short analysis of XI, were introduced in my last column (see Resources).

New features in JDK 1.4

One of the reasons I decided to develop XI now was to experiment with the new regular expression (regex) engine that's built into JDK 1.4. As you will see, I got more than I bargained for when I launched into an exploration of the New I/O package (commonly referred to as NIO; it lives in package java.nio).

Package java.util.regex

The regular expression engine has a simple and clean interface, which basically requires you to learn two new classes and one new interface. The new classes are Pattern and Matcher, and the new interface is CharSequence. (The latter can lead to problems, which I'll address later.) Listing 1 illustrates how to use Pattern and Matcher.


Listing 1. Using the regular expression engine

import java.util.regex.*;

public class SampleRegex
{
   public static void main(String[] params)
   {
      if(params.length < 1)
      {
         System.err.println("Usage: java SampleRegex <key:value>");
         return;
      }
      // Two groups separated by a colon
      Pattern pattern = Pattern.compile("(.*):(.*)");
      Matcher matcher = pattern.matcher(params[0]);
      // matches() succeeds only if the whole input matches
      if(matcher.matches())
      {
         System.out.print("Key: ");
         System.out.println(matcher.group(1));
         System.out.print("Value: ");
         System.out.println(matcher.group(2));
      }
      else
         System.out.println("No match");
   }
}

Pattern is a compiled regular expression. Its compile() method accepts a regular expression and compiles it into a Pattern, from which you obtain a Matcher. Matcher is used to apply the regular expression to strings or, more accurately, to CharSequences.

Pattern has no public constructor. To create a Pattern, you must call its static compile() method, passing a regular expression as an argument.
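
As a minimal sketch of that factory call, the expression from Listing 1 can be compiled once into a constant and reused for every input. The class name CompiledPatternSketch and the helper keyOf() are hypothetical; they are not part of XI.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CompiledPatternSketch
{
   // Compiled once; the same Pattern serves every call to keyOf()
   private static final Pattern KEY_VALUE = Pattern.compile("(.*):(.*)");

   // Returns the text before the last colon, or null when there is no colon
   static String keyOf(String line)
   {
      Matcher matcher = KEY_VALUE.matcher(line);
      return matcher.matches() ? matcher.group(1) : null;
   }

   public static void main(String[] args)
   {
      System.out.println(keyOf("domain:ananas.org"));   // domain
      System.out.println(keyOf("ananas.org"));          // null
   }
}
```

A Pattern is immutable, so sharing one instance is safe. A Matcher, on the other hand, holds the state of one matching operation and is best kept local to the code that uses it.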

Regular Expressions 101

A regular expression describes the format of a string: With the simplest form of regular expression, you input the text that you want to match. For example, the regular expression ABC will match the string ABC but not DEF.

Of course, regular expressions are not limited to strict string comparisons. For example, you can use a wildcard character called a joker: a dot (.) matches any character except a line terminator. So, the regular expression A.C will match the strings ABC, AAC, AKC, and many others, but it still won't match DEF.

The star (*) after a character or a joker indicates that it can repeat any number of times, including zero. Therefore the regular expression A*B will match the strings B, AB, AAB, AAAB, or any other string made of a run of As followed by a single B.

Since the dot is a joker, .* will match any string including ABC, IBM developerWorks, and DEF.

Parentheses are used as a grouping operator. They are particularly handy because, as you will see, it is possible to extract the contents of a group from a string. For example, the regular expression (.*):(.*) used in Listing 1 matches two strings separated by a colon.

There's more to regular expressions and I encourage you to turn to a reference book such as Mastering Regular Expressions (see Resources).
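
The examples in this sidebar can be checked with a few lines of Java. This is only a sketch: the class Regex101Sketch is hypothetical, and it relies on the static convenience method Pattern.matches(), which compiles the expression and tests it against the entire input.

```java
import java.util.regex.Pattern;

class Regex101Sketch
{
   // True when the whole input matches the expression
   static boolean matches(String regex, String input)
   {
      return Pattern.matches(regex, input);
   }

   public static void main(String[] args)
   {
      System.out.println(matches("ABC", "ABC"));    // true: literal text
      System.out.println(matches("ABC", "DEF"));    // false
      System.out.println(matches("A.C", "AKC"));    // true: the dot is a joker
      System.out.println(matches("A.C", "DEF"));    // false
      System.out.println(matches("A*B", "AAAB"));   // true: repeated A
      System.out.println(matches("A*B", "B"));      // true: zero A is allowed
      System.out.println(matches(".*", "IBM developerWorks"));   // true
   }
}
```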

Pattern and Matcher

In Listing 1, once the regular expression is compiled into a Pattern, the program creates a Matcher with the matcher() method. The Matcher accepts a CharSequence (more on this in the next section, NIO) and reports whether the regular expression matches or not. The Matcher offers several methods to apply a regular expression: matches(), lookingAt(), and find(). Each applies the regular expression differently: matches() requires the entire input to match, lookingAt() requires a match that starts at the beginning of the input but need not reach its end, and find() looks for the next match anywhere in the input.

Matcher also offers a group() method that retrieves the string matching a given group. Groups are numbered from 1 to n, whereas group(0) is the text matched by the complete regular expression.
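
To make the difference between matches(), lookingAt(), and find() concrete, here is a small sketch. The class MatcherMethodsSketch and the digit pattern are made up for the illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatcherMethodsSketch
{
   private static final Pattern DIGITS = Pattern.compile("([0-9]+)");

   // matches(): the entire input must match
   static boolean whole(String input)
   {
      return DIGITS.matcher(input).matches();
   }

   // lookingAt(): the match must start at the beginning but may stop early
   static boolean atStart(String input)
   {
      return DIGITS.matcher(input).lookingAt();
   }

   // find(): the match may start anywhere; returns group 1, or null
   static String firstNumber(String input)
   {
      Matcher matcher = DIGITS.matcher(input);
      return matcher.find() ? matcher.group(1) : null;
   }

   public static void main(String[] args)
   {
      System.out.println(whole("2002"));             // true
      System.out.println(whole("2002 edition"));     // false
      System.out.println(atStart("2002 edition"));   // true
      System.out.println(atStart("JDK 14"));         // false
      System.out.println(firstNumber("JDK 14"));     // 14
   }
}
```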

Listing 1 applies the regular expression to a command-line parameter. It prints the groups it has found, if any. For example, when calling the application with:

java SampleRegex "domain:ananas.org"

it will print:

Key: domain
Value: ananas.org

since the input (domain:ananas.org) matches the regular expression. However if it is called with:

java SampleRegex "ananas.org"

it will print:

No match

since the input does not match the regular expression.

NIO

I mentioned CharSequence in the previous section, Pattern and Matcher. This is a new interface defined in the java.lang package for a readable sequence of characters. String has been updated to implement CharSequence.
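
A short sketch shows why this matters for regular expressions: a method written against CharSequence works unchanged with a String, a StringBuffer, or a CharBuffer. The class CharSequenceSketch and the word pattern are assumptions made for the example.

```java
import java.nio.CharBuffer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CharSequenceSketch
{
   private static final Pattern WORD = Pattern.compile("[A-Za-z]+");

   // Counts the words in any CharSequence, whatever its concrete class
   static int countWords(CharSequence text)
   {
      Matcher matcher = WORD.matcher(text);
      int count = 0;
      while(matcher.find())
         count++;
      return count;
   }

   public static void main(String[] args)
   {
      System.out.println(countWords("XML by Example"));                  // 3
      System.out.println(countWords(new StringBuffer("Working XML")));   // 2
      System.out.println(countWords(CharBuffer.wrap("java nio")));       // 2
   }
}
```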

More importantly, by using the NIO package it is possible to access a file as a CharSequence. Therefore, since Matcher accepts CharSequence, it is possible to apply regular expressions to whole files. That's how I ended up looking at the java.nio package (see Resources).

Ultimately, I won't be using java.nio in this project, but I will discuss it nonetheless because I spent a lot of time looking for a solution. (Chasing dead ends is a favorite pastime in software development and the column would not be true to itself if I failed to report on those. Besides, I hope my experience will save you from doing the same research.)

Listing 2 shows you how to turn a file into a CharSequence. In practice, you end up with a new class called CharBuffer, which implements CharSequence over a text file.


Listing 2. Using CharBuffer

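import java.io.*;
import java.nio.*;
import java.nio.channels.*;
import java.nio.charset.*;
import java.util.regex.*;

// Map the whole file into memory (the cast limits this to files under 2 GB)
FileInputStream input = new FileInputStream(params[0]);
FileChannel channel = input.getChannel();
int fileLength = (int)channel.size();
MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_ONLY,0,fileLength);

// Decode the bytes into characters
Charset charset = Charset.forName("ISO-8859-1");
CharsetDecoder decoder = charset.newDecoder();
CharBuffer charBuffer = decoder.decode(buffer);

// CharBuffer implements CharSequence, so a Matcher accepts it directly
Matcher matcher = pattern.matcher(charBuffer);
// ...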
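
For readers who want to run the idea end to end, here is a self-contained sketch along the same lines. It is not the code of XI: the class MappedFileRegexSketch, the method countAliases(), and the sample lines written to the temporary file are all made up, and the alias pattern is borrowed from the test class shown later in this column. It also assumes a current JDK (try-with-resources and java.nio.file), which JDK 1.4 did not have.

```java
import java.io.IOException;
import java.nio.CharBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MappedFileRegexSketch
{
   // MULTILINE makes ^ and $ match at line boundaries inside the buffer
   private static final Pattern ALIAS =
      Pattern.compile("^alias (.*):(.*)$", Pattern.MULTILINE);

   // Counts the alias lines by matching against the whole file at once
   static int countAliases(Path file) throws IOException
   {
      try(FileChannel channel = FileChannel.open(file, StandardOpenOption.READ))
      {
         MappedByteBuffer bytes =
            channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
         CharBuffer chars =
            StandardCharsets.ISO_8859_1.newDecoder().decode(bytes);
         Matcher matcher = ALIAS.matcher(chars);
         int count = 0;
         while(matcher.find())
            count++;
         return count;
      }
   }

   public static void main(String[] args) throws IOException
   {
      // Made-up sample data written to a temporary file
      Path file = Files.createTempFile("address-book", ".txt");
      try
      {
         Files.write(file, ("alias bm:bmarchal@pineapplesoft.com\n"
                          + "note bm:<Benoit Marchal>\n"
                          + "alias xi:xi@ananas.org\n")
                          .getBytes(StandardCharsets.ISO_8859_1));
         System.out.println(countAliases(file));   // 2
      }
      finally
      {
         Files.deleteIfExists(file);
      }
   }
}
```

Note that decode() copies the mapped bytes into a CharBuffer on the heap, so the decoded file must fit in memory.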

Before going any further, I must confess that I'm not sure I have fully grasped the logic behind NIO. I only recently started using this API, so what follows is simply what I have found so far.

As you can see, many objects are required to turn a file into a CharBuffer. This is typical of the new API. From what I understand, the goal of NIO is to give you more control and more flexibility than regular I/O.

NIO offers a less abstract API. For example, with Java IO, you need not worry about buffer management but you have no control over it either. NIO gives you more control over buffer management -- by letting you run it! Arguably, it is more efficient but it is also more complex.
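
To give an idea of what running the buffer yourself looks like, here is a minimal sketch that counts newline bytes read from a channel. The class ExplicitBufferSketch, the method countNewlines(), and the tiny buffer size are assumptions chosen for the illustration; the sketch expects a blocking channel and a buffer size greater than zero.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;
import java.nio.charset.StandardCharsets;

class ExplicitBufferSketch
{
   // Counts '\n' bytes, filling and draining the buffer by hand
   static int countNewlines(ReadableByteChannel channel, int bufferSize)
      throws IOException
   {
      ByteBuffer buffer = ByteBuffer.allocate(bufferSize);
      int count = 0;
      while(channel.read(buffer) != -1)
      {
         buffer.flip();                 // switch from filling to draining
         while(buffer.hasRemaining())
            if(buffer.get() == '\n')
               count++;
         buffer.clear();                // ready to be filled again
      }
      return count;
   }

   public static void main(String[] args) throws IOException
   {
      byte[] data = "one\ntwo\nthree\n".getBytes(StandardCharsets.US_ASCII);
      ReadableByteChannel channel =
         Channels.newChannel(new ByteArrayInputStream(data));
      System.out.println(countNewlines(channel, 4));   // 3
   }
}
```

With java.io, a BufferedReader hides the equivalent of flip() and clear(); here forgetting either call is a bug in your own code.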

As I'm writing, I am under the impression that NIO will be particularly relevant for developers of high-performance applications such as database engines, servers, and high-performance clients. I do not see any reason to use it (and incur the additional effort) for regular programming.

Furthermore, my ultimate goal with XI is to keep it compatible with XMLReader as it is defined by SAX. XMLReader currently works with InputStream or Reader, but not with NIO. I was unable to find a completely generic solution to convert any InputStream into a CharBuffer. (I did find partial solutions.) If you have a better idea, please e-mail me and I will report on it in the next column.
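
To illustrate what a partial solution can look like, the sketch below drains the stream into memory and decodes it in one step. It is not necessarily one of the approaches I tried, and it is partial for two reasons: the whole input must fit in memory, and the caller must already know the encoding. The class StreamToCharBufferSketch and the method toCharBuffer() are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class StreamToCharBufferSketch
{
   // Reads the stream to its end, then decodes the bytes in one step
   static CharBuffer toCharBuffer(InputStream in, Charset charset)
      throws IOException
   {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      byte[] chunk = new byte[8192];
      int read;
      while((read = in.read(chunk)) != -1)
         bytes.write(chunk, 0, read);
      return charset.decode(ByteBuffer.wrap(bytes.toByteArray()));
   }

   public static void main(String[] args) throws IOException
   {
      InputStream in = new ByteArrayInputStream(
         "alias bm:bmarchal@pineapplesoft.com"
            .getBytes(StandardCharsets.ISO_8859_1));
      CharBuffer chars = toCharBuffer(in, StandardCharsets.ISO_8859_1);
      System.out.println(chars);
   }
}
```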

Eventually, I decided to work with the regular I/O, read the file line-by-line into strings, and apply regular expressions to those strings.
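
In outline, the line-by-line approach looks like the following sketch. The class LineByLineSketch, the method aliasIds(), and the sample lines are made up for the example; the alias pattern is the one used in the test class later in this column.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LineByLineSketch
{
   private static final Pattern ALIAS = Pattern.compile("^alias (.*):(.*)$");

   // Returns the first group of every line that matches the alias pattern
   static List<String> aliasIds(Reader source) throws IOException
   {
      List<String> ids = new ArrayList<>();
      BufferedReader reader = new BufferedReader(source);
      Matcher matcher = ALIAS.matcher("");   // reused for every line
      for(String line = reader.readLine();
          line != null;
          line = reader.readLine())
         if(matcher.reset(line).matches())
            ids.add(matcher.group(1));
      return ids;
   }

   public static void main(String[] args) throws IOException
   {
      String sample = "alias bm:bmarchal@pineapplesoft.com\n"
                    + "note bm:<Benoit Marchal>\n";
      System.out.println(aliasIds(new StringReader(sample)));   // [bm]
   }
}
```

Because the method takes a Reader, the same code works with a FileReader, an InputStreamReader, or any other source that SAX can hand over.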


How it works

Enough time on what I could not do; let's have a look at what I was able to accomplish! During the analysis, I defined a simple data structure (and corresponding XML vocabulary) for a file description. To process a new kind of file with XI, I only need to write another description.

I started by implementing the data structure with the following files:

  • Ruleset (see Listing 3) represents a set of regular expressions.
  • Match (see Listing 4) represents a single regular expression. The class encapsulates java.util.regex.
  • Group (see Listing 5) is a group (a parenthesized expression) from a regular expression.

Listing 3. Ruleset.java

package org.ananas.xi;

import java.util.*;

public class Ruleset
   extends QName
{
   private List matches = new ArrayList();
   private String error = null;

   public Ruleset(String namespaceURI,
                  String localName,
                  String qualifiedName)
   {
      super(namespaceURI,localName,qualifiedName);
   }

   public void setError(String error)
   {
      this.error = error;
   }

   public String getError()
   {
      return error;
   }

   public synchronized void addMatch(Match match)
   {
      matches.add(match);
   }

   public synchronized Match getMatchAt(int index)
   {
      return (Match)matches.get(index);
   }

   public synchronized int getMatchCount()
   {
      return matches.size();
   }
}

The Ruleset is essentially a container for a list of Match objects.


Listing 4. Match.java

package org.ananas.xi;

import java.util.*;
import java.util.regex.*;

public class Match
   extends QName
{
   private Pattern pattern;
   private Matcher matcher = null;
   private String input = null;
   private List groups = new ArrayList();

   public Match(String namespaceURI,
                String localName,
                String qualifiedName,
                String pattern)
   {
      super(namespaceURI,localName,qualifiedName);
      this.pattern = Pattern.compile(pattern);
   }

   public synchronized void addGroup(Group group)
   {
      groups.add(group);
   }

   public synchronized Group getGroupNameAt(int index)
   {
      if(index < 1 || index > groups.size())
         throw new IndexOutOfBoundsException("index out of bounds");
      return (Group)groups.get(index - 1);
   }

   public synchronized String getGroupValueAt(int index)
      throws IllegalStateException, IndexOutOfBoundsException
   {
      if(matcher == null)
         throw new IllegalStateException("Call matches() first");
      // getGroupNameAt() validates the index; Group (Listing 5) defines no
      // isText(), so the value is always the text captured by the group
      getGroupNameAt(index);
      return matcher.group(index);
   }

   public synchronized int getGroupCount()
   {
      return groups.size();
   }

   public boolean matches(String st)
   {
      input = st;
      if(matcher == null)
         matcher = pattern.matcher(st);
      else
         matcher.reset(st);
      return matcher.lookingAt();
   }

   public String rest()
   {
      if(matcher == null)
         throw new IllegalStateException("Call matches() first");
      int end = matcher.end(),
          length = input.length();
      if(end < length)
         return input.substring(end,length);
      else
         return null;
   }
}

Match is the most important class in this data structure. It represents a regular expression and provides the logic to match the regular expression against strings. Note that it uses lookingAt() to apply the regular expression. Because lookingAt() can match a partial string, it is possible to decompose a string into substrings.
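
The decomposition that lookingAt() makes possible can be sketched in isolation. The loop below applies the field expression from the test class to the start of the string, records the group, and continues with whatever text is left, much as rest() does in Listing 4. The class LookingAtSketch, the method fields(), and the sample input are made up for the illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookingAtSketch
{
   // Optional whitespace followed by one <...> field
   private static final Pattern FIELD = Pattern.compile("[\\s]*<([^<]*)>");

   // Splits the input into the contents of its leading <...> fields
   static List<String> fields(String input)
   {
      List<String> result = new ArrayList<>();
      Matcher matcher = FIELD.matcher(input);
      String rest = input;
      while(rest != null && matcher.reset(rest).lookingAt())
      {
         result.add(matcher.group(1));
         int end = matcher.end();       // index just past the match
         rest = end < rest.length() ? rest.substring(end) : null;
      }
      return result;
   }

   public static void main(String[] args)
   {
      // prints [Benoit, Namur, Belgium]
      System.out.println(fields(" <Benoit> <Namur><Belgium>"));
   }
}
```

With matches() the first call would fail, because the expression describes a single field rather than the whole string.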


Listing 5. Group.java

package org.ananas.xi;

public class Group
   extends QName
{
   public Group(String namespaceURI,
                String localName,
                String qualifiedName)
   {
      super(namespaceURI,localName,qualifiedName);
   }
}

I derived all the classes from QName (see Listing 6), which represents the name of an XML element as the combination of the namespace URI and the local name.


Listing 6. QName.java

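package org.ananas.xi;

// Minimal form inferred from how Listings 3, 4, 5, and 7 use QName:
// a three-argument constructor and getQualifiedName(). The other
// two accessors are assumed by analogy.
public class QName
{
   private final String namespaceURI, localName, qualifiedName;

   public QName(String namespaceURI,
                String localName,
                String qualifiedName)
   {
      this.namespaceURI = namespaceURI;
      this.localName = localName;
      this.qualifiedName = qualifiedName;
   }

   public String getNamespaceURI() { return namespaceURI; }
   public String getLocalName() { return localName; }
   public String getQualifiedName() { return qualifiedName; }
}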


Using it

I didn't have a first version of XIReader completed in time for this column. (We software developers are always full of hope and confident in our abilities, but we often run into issues, particularly when learning new libraries.) I was, however, able to write a simple test class that lets me experiment with the regular expression API; it is shown in Listing 7. Although it does not write an XML document, it already contains the logic to break a text file into its constituents through regular expressions.

The recursive algorithm can be found in the read() methods. A recursive algorithm works well with XML documents because their hierarchical structure is inherently recursive. The algorithm is as follows:

  • Given a string, it loops through the Match objects of a Ruleset to find a regular expression that matches.
  • For each Group attached to the Match, it prints the content.
  • If the Group name matches another Ruleset, a recursive call attempts to further decompose the string (this was true for the an:fields element that appeared in the examples in the last column).
  • If the string has not been entirely consumed by the regular expression, a recursive call processes the remainder.

Listing 7. Test.java

package org.ananas.xi;

import java.io.*;
import java.util.regex.*;

public class Test
{
   public static void main(String[] params)
      throws IOException
   {
      Ruleset[] rulesets = getRulesets();
      BufferedReader reader = new BufferedReader(new FileReader(params[0]));
      try
      {
         // Apply the rulesets to the file one line at a time
         String st = reader.readLine();
         while(st != null)
         {
            read(rulesets,st);
            st = reader.readLine();
         }
      }
      finally
      {
         reader.close();
      }
   }

   public static Ruleset[] getRulesets()
   {
      Ruleset[] rulesets = new Ruleset[2];
      rulesets[0] = new Ruleset("http://ananas.org/2002/sample",
                                "address-book",
                                "an:address-book");
      rulesets[1] = new Ruleset("http://ananas.org/2002/sample",
                                "fields",
                                "an:fields");
      Match match = new Match("http://ananas.org/2002/sample",
                              "alias",
                              "an:alias",
                              "^alias (.*):(.*)$");
      Group group = new Group("http://ananas.org/2002/sample",
                              "id",
                              "an:id");
      match.addGroup(group);
      group = new Group("http://ananas.org/2002/sample",
                        "email",
                        "an:email");
      match.addGroup(group);
      rulesets[0].addMatch(match);
      match = new Match("http://ananas.org/2002/sample",
                        "note",
                        "an:note",
                        "^note .*:(.*)$");
      group = new Group("http://ananas.org/2002/sample",
                        "fields",
                        "an:fields");
      match.addGroup(group);
      rulesets[0].addMatch(match);
      match = new Match("http://ananas.org/2002/sample",
                        "fields",
                        "an:fields",
                        "[\\s]*<([^<]*)>");
      group = new Group("http://ananas.org/2002/sample",
                        "field",
                        "an:field");
      match.addGroup(group);
      rulesets[1].addMatch(match);
      return rulesets;
   }

   public static void read(Ruleset[] rulesets,String st)
   {
      read(rulesets,rulesets[0],st,false);
   }

   public static void 
       read(Ruleset[] rulesets,Ruleset ruleset,String st,boolean next)
   {
      boolean found = false;
      for(int i = 0;i < ruleset.getMatchCount() && !found;i++)
      {
         if(ruleset.getMatchAt(i).matches(st))
         {
            found = true;
            Match match = ruleset.getMatchAt(i);
            if(!next)
            {
               System.out.print(ruleset.getMatchAt(i).getQualifiedName());
               System.out.print(' ');
            }
            for(int j = 1;j <= match.getGroupCount();j++)
            {
               String qname = match.getGroupNameAt(j).getQualifiedName();
               boolean deep = false;
               for(int k = 0;k < rulesets.length && !deep;k++)
                  if(rulesets[k].getQualifiedName().equals(qname))
                  {
                     System.out.print("\n >> \"");
                     System.out.print(match.getGroupValueAt(j));
                     System.out.print("\" >> ");
                     read(rulesets,rulesets[k],match.getGroupValueAt(j),false);
                     deep = true;
                  }
               if(!deep)
               {
                  System.out.print(match.getGroupNameAt(j).getQualifiedName());
                  System.out.print(' ');
                  System.out.print(match.getGroupValueAt(j));
                  System.out.print(' ');
               }
            }
            String rest = match.rest();
            if(rest != null)
               read(rulesets,ruleset,rest,true);
         }
      }
      System.out.println();
   }
}

Do not be put off by the getRulesets() method. For the time being, it creates a file description in memory. In the next iteration, it will read the file description from an XML file.


Towards XIReader

I will soon have a working version of XIReader. All that's missing is to replace the System.out.print() and System.out.println() calls in Listing 7 with the appropriate calls to ContentHandler. I also need to implement the XMLReader interface in full, but this isn't particularly difficult.
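
As a rough idea of what those calls might look like, the sketch below reports one element with text content as three SAX events. It is an assumption about the shape of the code, not the XIReader implementation: the class ContentHandlerSketch, the method emit(), and the Recorder handler are hypothetical, and a complete XMLReader would also report the start and end of the document and the namespace prefix mappings.

```java
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

class ContentHandlerSketch
{
   // Reports one element with text content as three SAX events
   static void emit(ContentHandler handler,
                    String namespaceURI,
                    String localName,
                    String qualifiedName,
                    String text)
      throws SAXException
   {
      handler.startElement(namespaceURI, localName, qualifiedName,
                           new AttributesImpl());
      handler.characters(text.toCharArray(), 0, text.length());
      handler.endElement(namespaceURI, localName, qualifiedName);
   }

   // Demonstration handler that writes the events back as markup
   // (no escaping of special characters)
   static class Recorder extends DefaultHandler
   {
      private final StringBuilder out = new StringBuilder();

      @Override
      public void startElement(String uri, String local, String qName,
                               Attributes attributes)
      { out.append('<').append(qName).append('>'); }

      @Override
      public void characters(char[] ch, int start, int length)
      { out.append(ch, start, length); }

      @Override
      public void endElement(String uri, String local, String qName)
      { out.append("</").append(qName).append('>'); }

      @Override
      public String toString()
      { return out.toString(); }
   }

   public static void main(String[] args) throws SAXException
   {
      Recorder recorder = new Recorder();
      emit(recorder, "http://ananas.org/2002/sample", "id", "an:id", "bm");
      System.out.println(recorder);   // <an:id>bm</an:id>
   }
}
```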

As this column clearly demonstrates, one can spend a lot of time learning new libraries.


Resources
