My last installment introduced XI, a new tool project for this column. The challenge is this: My company maintains the Web site of a working group with XML and XM, a publishing solution based on XML. (XM was the first project of the "Working XML" column. See Resources.)
One of the documents for that site is the list of participants. For reasons that are outside the scope of this column, the list is maintained as an address book with an e-mail program. The file format, which has been imposed on me, is not XML. So, how do I feed it to an XML publishing solution? I need to convert this list into XML.
Of course it would not be difficult to write an ad hoc conversion routine, but this seems like a waste of time: There are other cases where one needs to feed non-XML documents into an XML solution. In addition to the address book, I may need to process agendas, spreadsheets, cataloging information, and other legacy data.
XI offers a fairly generic solution to this problem. It uses regular expressions (now built into JDK 1.4) to parse the input file and create an XML counterpart. The details of these conversions, as well as a short analysis of XI, were introduced in my last column (see Resources).
One of the reasons I decided to develop XI now was to experiment with the new regular expression (regex) engine that's built into JDK 1.4. As you will see, I got more than I bargained for when I launched into an exploration of the New I/O package (commonly referred to as NIO, it's in package java.nio).
The regular expression engine has a simple and clean interface, which basically requires you to learn two new classes and one new interface. The new classes are Pattern and Matcher, and the new interface is CharSequence. (The latter can lead to problems, which I'll address later.) Listing 1 illustrates how to use Pattern and Matcher.
Listing 1. Using the regular expression engine
import java.util.regex.*;
public class SampleRegex
{
   public static void main(String[] params)
   {
      Pattern pattern = Pattern.compile("(.*):(.*)");
      Matcher matcher = pattern.matcher(params[0]);
      if(matcher.matches())
      {
         System.out.print("Key:");
         System.out.println(matcher.group(1));
         System.out.print("Value:");
         System.out.println(matcher.group(2));
      }
      else
         System.out.print("No match");
   }
}
Pattern is a regular expression compiler. It accepts a regular expression and compiles it into an internal form, from which you obtain a Matcher. Matcher is used to apply regular expressions to strings or, more accurately, to CharSequences.
Pattern has no public constructor. To create a Pattern, you must call its static compile() method, passing a regular expression as an argument.
A regular expression describes the format of a string: With the simplest form of regular expression, you input the text that you want to match. For example, the regular expression ABC will match the string ABC but not DEF.
Of course, regular expressions are not limited to strict string comparisons. For example, you can use a wildcard character called a joker -- a dot (.) that matches any character except a line terminator. So, the regular expression A.C will match the strings ABC, AAC, AKC, and many others, but it still won't match DEF.
The star (*) after a character or a joker indicates that it can repeat any number of times, including zero. Therefore the regular expression A*B will match the strings B, AB, AAB, AAAB, or any other string made of a run of As followed by a single B.
Since the dot is a joker, .* will match any string including ABC, IBM developerWorks, and DEF.
Parentheses are used as a grouping operator. They are particularly handy because, as you will see, it is possible to extract the contents of a group from a string. For example, the regular expression (.*):(.*) used in Listing 1 matches two strings separated by a colon.
There's more to regular expressions and I encourage you to turn to a reference book such as Mastering Regular Expressions (see Resources).
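The rules described above can be checked with a few lines of code. The following is a minimal sketch, not part of XI; the class name RegexBasics and the helper matchesAll() are made up for illustration. It applies each sample expression to a sample string with Matcher.matches(), which requires the whole input to match.

```java
import java.util.regex.Pattern;

class RegexBasics
{
   // Hypothetical helper: true when the whole input matches the expression
   static boolean matchesAll(String regex, String input)
   {
      return Pattern.compile(regex).matcher(input).matches();
   }

   public static void main(String[] params)
   {
      System.out.println(matchesAll("ABC", "ABC"));     // literal text
      System.out.println(matchesAll("ABC", "DEF"));     // different text
      System.out.println(matchesAll("A.C", "AKC"));     // dot stands for one character
      System.out.println(matchesAll("A*B", "AAAB"));    // star repeats the A
      System.out.println(matchesAll("A*B", "B"));       // zero repetitions are allowed
      System.out.println(matchesAll(".*", "IBM developerWorks"));
      System.out.println(matchesAll("(.*):(.*)", "domain:ananas.org"));
   }
}
```

Every line prints true except the second one, which compares ABC against DEF.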
In Listing 1, once a regular expression is compiled into a Pattern, it creates a Matcher with the matcher() method. The Matcher accepts a CharSequence (more on this in the next section, NIO) and reports whether the regular expression matches or not. The Matcher offers several methods to test a regular expression: matches(), lookingAt(), and find(). Each applies the regular expression differently.
Matcher also offers a group() method that retrieves the string matching a given group. Groups are numbered from 1 to n, whereas group(0) is the text matched by the complete regular expression.
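To make the difference between the three test methods concrete, here is a small sketch. The class name MatcherMethods and the sample strings are assumptions chosen for illustration. matches() requires the entire input to match, lookingAt() requires a match that starts at the beginning of the input, and find() accepts a match anywhere.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatcherMethods
{
   // Hypothetical helper: applies the three test methods to the same input
   static String describe(String regex, String input)
   {
      Matcher matcher = Pattern.compile(regex).matcher(input);
      boolean whole = matcher.matches();      // the entire input must match
      boolean prefix = matcher.lookingAt();   // the match must start at the beginning
      matcher.reset();                        // restart the search from position 0
      boolean anywhere = matcher.find();      // the match may start anywhere
      return "matches=" + whole + " lookingAt=" + prefix + " find=" + anywhere;
   }

   public static void main(String[] params)
   {
      System.out.println(describe("[0-9]+", "2002"));
      System.out.println(describe("[0-9]+", "2002 April"));
      System.out.println(describe("[0-9]+", "April 2002"));
   }
}
```

With these inputs, the first string satisfies all three methods, the second fails only matches(), and the third succeeds only with find().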
Listing 1 applies the regular expression to a command-line parameter. It prints the groups it has found, if any. For example, when calling the application with:
java SampleRegex "domain:ananas.org"
it will print:
Key:domain
Value:ananas.org
since the input (domain:ananas.org) matches the regular expression. However if it is called with:
java SampleRegex "ananas.org"
it will print:
No match
since the input does not match the regular expression.
I mentioned CharSequence in the previous section, Pattern and Matcher. This is a new interface defined in the java.lang package for a readable sequence of characters. String has been updated to implement CharSequence.
More importantly, by using the NIO package it is possible to access a file as a CharSequence. Therefore, since Matcher accepts CharSequence, it is possible to apply regular expressions to whole files. That's how I ended up looking at the java.nio package (see Resources).
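Because Matcher only asks for a CharSequence, the same compiled Pattern can be applied to a String, a StringBuffer, or an NIO CharBuffer. The sketch below illustrates this; the class name CharSequenceDemo and the helper isPair() are made up for the example, and the CharBuffer simply wraps a string rather than a file.

```java
import java.nio.CharBuffer;
import java.util.regex.Pattern;

class CharSequenceDemo
{
   private static final Pattern PAIR = Pattern.compile("(.*):(.*)");

   // Hypothetical helper: accepts any CharSequence implementation
   static boolean isPair(CharSequence text)
   {
      return PAIR.matcher(text).matches();
   }

   public static void main(String[] params)
   {
      System.out.println(isPair("domain:ananas.org"));
      System.out.println(isPair(new StringBuffer("domain:ananas.org")));
      System.out.println(isPair(CharBuffer.wrap("domain:ananas.org")));
   }
}
```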
Ultimately, I won't be using java.nio in this project, but I will discuss it nonetheless because I spent a lot of time looking for a solution. (Chasing dead ends is a favorite pastime in software development and the column would not be true to itself if I failed to report on those. Besides, I hope my experience will save you from doing the same research.)
Listing 2 shows you how to turn a file into a CharSequence. In practice, you end up with a new class called CharBuffer, which implements CharSequence over a text file.
Listing 2. Using CharBuffer
FileInputStream input = new FileInputStream(params[0]);
FileChannel channel = input.getChannel();
int fileLength = (int)channel.size();
MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_ONLY,0,fileLength);
Charset charset = Charset.forName("ISO-8859-1");
CharsetDecoder decoder = charset.newDecoder();
CharBuffer charBuffer = decoder.decode(buffer);
Matcher matcher = pattern.matcher(charBuffer);
// ...
Before going any further, I must confess that I'm not sure I have fully grasped the logic behind the NIO. I only recently started using this API and here's what I have found.
As you can see, many objects are required to turn a file into a CharBuffer. This is the pattern when using the new API. From what I understand, it appears the goal of the NIO is to give you more control and more flexibility over regular I/O.
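For readers who want to try the approach end to end, here is a self-contained sketch built from the same calls as Listing 2. The class name MappedFileRegex and the method countMatches() are assumptions for illustration, and the sketch assumes the file is encoded in ISO-8859-1. It maps the file, decodes it into a CharBuffer, and counts the matches of a pattern over the whole file.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.CharBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MappedFileRegex
{
   // Hypothetical helper: counts how often the pattern occurs in the file
   static int countMatches(String path, Pattern pattern)
      throws IOException
   {
      FileInputStream input = new FileInputStream(path);
      try
      {
         FileChannel channel = input.getChannel();
         MappedByteBuffer buffer =
            channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
         // assumption: the file is encoded in ISO-8859-1, as in Listing 2
         CharBuffer chars = Charset.forName("ISO-8859-1").decode(buffer);
         Matcher matcher = pattern.matcher(chars);
         int count = 0;
         while(matcher.find())
            count++;
         return count;
      }
      finally
      {
         input.close();
      }
   }

   public static void main(String[] params)
      throws IOException
   {
      // MULTILINE lets ^ and $ match at the start and end of every line
      Pattern alias = Pattern.compile("^alias (.*):(.*)$", Pattern.MULTILINE);
      System.out.println(countMatches(params[0], alias));
   }
}
```

Note the MULTILINE flag: since the pattern is applied to the whole file rather than to one line at a time, the anchors must be told to recognize line boundaries.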
NIO offers a less abstract API. For example, with Java IO, you need not worry about buffer management but you have no control over it either. NIO gives you more control over buffer management -- by letting you run it! Arguably, it is more efficient but it is also more complex.
As I'm writing, I am under the impression that NIO will be particularly relevant for developers of high-performance applications such as database engines, servers, and high-performance clients. I do not see any reason to use it (and incur the additional effort) for regular programming.
Furthermore, my ultimate goal with XI is to keep it compatible with XMLReader as it is defined by SAX. XMLReader currently works with InputStream or Reader, but not with NIO. I was unable to find a completely generic solution to convert any InputStream into a CharBuffer. (I did find partial solutions.) If you have a better idea, please e-mail me and I will report about it in the next column.
Eventually, I decided to work with the regular I/O, read the file line-by-line into strings, and apply regular expressions to those strings.
Enough time on what I could not do; let's have a look at what I was able to accomplish! During the analysis, I defined a simple data structure (and corresponding XML vocabulary) for a file description. To process a new kind of file with XI, I only need to write another description.
I started by implementing the data structure with the following files:
- Ruleset (see Listing 3) represents a set of regular expressions.
- Match (see Listing 4) represents a single regular expression. The class encapsulates java.util.regex.
- Group (see Listing 5) is a group (a parenthesized expression) from a regular expression.
Listing 3. Ruleset.java
package org.ananas.xi;
import java.util.*;
public class Ruleset
   extends QName
{
   private List matches = new ArrayList();
   private String error = null;
   public Ruleset(String namespaceURI,
                  String localName,
                  String qualifiedName)
   {
      super(namespaceURI,localName,qualifiedName);
   }
   public void setError(String error)
   {
      this.error = error;
   }
   public String getError()
   {
      return error;
   }
   public synchronized void addMatch(Match match)
   {
      matches.add(match);
   }
   public synchronized Match getMatchAt(int index)
   {
      return (Match)matches.get(index);
   }
   public synchronized int getMatchCount()
   {
      return matches.size();
   }
}
The Ruleset is essentially a container for a list of Match objects.
Listing 4. Match.java
package org.ananas.xi;
import java.util.*;
import java.util.regex.*;
public class Match
   extends QName
{
   private Pattern pattern;
   private Matcher matcher = null;
   private String input = null;
   private List groups = new ArrayList();
   public Match(String namespaceURI,
                String localName,
                String qualifiedName,
                String pattern)
   {
      super(namespaceURI,localName,qualifiedName);
      this.pattern = Pattern.compile(pattern);
   }
   public synchronized void addGroup(Group group)
   {
      groups.add(group);
   }
   // groups are numbered from 1, like the groups of a regular expression
   public synchronized Group getGroupNameAt(int index)
   {
      if(index < 1 || index > groups.size())
         throw new IndexOutOfBoundsException("index out of bounds");
      return (Group)groups.get(index - 1);
   }
   public synchronized String getGroupValueAt(int index)
      throws IllegalStateException, IllegalArgumentException
   {
      if(matcher == null)
         throw new IllegalStateException("Call matches() first");
      // return the text captured by the given group
      return matcher.group(index);
   }
   public synchronized int getGroupCount()
   {
      return groups.size();
   }
   // true if the regular expression matches the beginning of the string
   public boolean matches(String st)
   {
      input = st;
      if(matcher == null)
         matcher = pattern.matcher(st);
      else
         matcher.reset(st);
      return matcher.lookingAt();
   }
   // the part of the input that follows the last match, or null if none
   public String rest()
   {
      if(matcher == null)
         throw new IllegalStateException("Call matches() first");
      int end = matcher.end(),
          length = input.length();
      if(end < length)
         return input.substring(end,length);
      else
         return null;
   }
}
Match is the most important class in this data structure. It represents a regular expression and provides the logic to match the regular expression against strings. Note that it uses lookingAt() to apply the regular expression. Because lookingAt() can match a partial string, it is possible to decompose a string into substrings.
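The decomposition technique can be shown in isolation. The following sketch is not the Match class itself: the class name Decompose, the method fields(), and the sample input are made up for illustration. It reuses the expression that appears later in the fields ruleset of Listing 7, calls lookingAt() to match the beginning of the string, and then repeats the operation on the unmatched remainder.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class Decompose
{
   // Hypothetical helper: extracts the texts enclosed in angle brackets
   static List<String> fields(String input)
   {
      Pattern pattern = Pattern.compile("[\\s]*<([^<]*)>");
      List<String> result = new ArrayList<String>();
      String rest = input;
      while(rest != null)
      {
         Matcher matcher = pattern.matcher(rest);
         if(!matcher.lookingAt())
            break;                  // nothing recognizable at the start of the remainder
         result.add(matcher.group(1));
         int end = matcher.end();   // position just after the matched prefix
         rest = end < rest.length() ? rest.substring(end) : null;
      }
      return result;
   }

   public static void main(String[] params)
   {
      System.out.println(fields("<Benoit Marchal> <Namur> <Belgium>"));
   }
}
```

With matches() the first call would fail, because the expression describes only one bracketed field and not the whole string. lookingAt() accepts the partial match and leaves the remainder for the next pass.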
Listing 5. Group.java
package org.ananas.xi;
public class Group
   extends QName
{
   public Group(String namespaceURI,
                String localName,
                String qualifiedName)
   {
      super(namespaceURI,localName,qualifiedName);
   }
}
I derived all the classes from QName (see Listing 6), which represents the name of an XML element as the combination of the namespace URI and the local name.
Listing 6. QName.java
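package org.ananas.xi;
// Minimal form: the fields and accessors are inferred from the calls made in Listings 3 through 7
public class QName
{
   private String namespaceURI;
   private String localName;
   private String qualifiedName;
   public QName(String namespaceURI,
                String localName,
                String qualifiedName)
   {
      this.namespaceURI = namespaceURI;
      this.localName = localName;
      this.qualifiedName = qualifiedName;
   }
   public String getNamespaceURI() { return namespaceURI; }
   public String getLocalName() { return localName; }
   public String getQualifiedName() { return qualifiedName; }
}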
I didn't have a first version of XIReader completed in time for this column. (We software developers are always full of hope and confident in our abilities, but we often run into issues, particularly when learning new libraries.) I did, however, write a simple test class that lets me experiment with the regular expression API; it is shown in Listing 7. Although it does not write an XML document, it already contains the logic to break a text file into its constituents through regular expressions.
The recursive algorithm can be found in the read() methods. A recursive algorithm works well with XML documents because their hierarchical structure is inherently recursive. The algorithm is as follows:
- Given a string, it loops through the Match objects of a Ruleset to find the appropriate regular expression.
- For each Group attached to the Match, it prints the content.
- If the Group name matches another Ruleset, a recursive call attempts to further decompose the string (this was true for the an:fields element that appeared in the examples in the last column).
- If the string has not been entirely consumed by the regular expression, a recursive call processes the remainder.
Listing 7. Test.java
package org.ananas.xi;
import java.io.*;
import java.util.regex.*;
public class Test
{
   public static void main(String[] params)
      throws IOException
   {
      Ruleset[] rulesets = getRulesets();
      BufferedReader reader = new BufferedReader(new FileReader(params[0]));
      String st = reader.readLine();
      while(st != null)
      {
         read(rulesets,st);
         st = reader.readLine();
      }
   }
   public static Ruleset[] getRulesets()
   {
      Ruleset[] rulesets = new Ruleset[2];
      rulesets[0] = new Ruleset("http://ananas.org/2002/sample",
                                "address-book",
                                "an:address-book");
      rulesets[1] = new Ruleset("http://ananas.org/2002/sample",
                                "fields",
                                "an:fields");
      Match match = new Match("http://ananas.org/2002/sample",
                              "alias",
                              "an:alias",
                              "^alias (.*):(.*)$");
      Group group = new Group("http://ananas.org/2002/sample",
                              "id",
                              "an:id");
      match.addGroup(group);
      group = new Group("http://ananas.org/2002/sample",
                        "email",
                        "an:email");
      match.addGroup(group);
      rulesets[0].addMatch(match);
      match = new Match("http://ananas.org/2002/sample",
                        "note",
                        "an:note",
                        "^note .*:(.*)$");
      group = new Group("http://ananas.org/2002/sample",
                        "fields",
                        "an:fields");
      match.addGroup(group);
      rulesets[0].addMatch(match);
      match = new Match("http://ananas.org/2002/sample",
                        "fields",
                        "an:fields",
                        "[\\s]*<([^<]*)>");
      group = new Group("http://ananas.org/2002/sample",
                        "field",
                        "an:field");
      match.addGroup(group);
      rulesets[1].addMatch(match);
      return rulesets;
   }
   public static void read(Ruleset[] rulesets,String st)
   {
      read(rulesets,rulesets[0],st,false);
   }
   public static void
      read(Ruleset[] rulesets,Ruleset ruleset,String st,boolean next)
   {
      boolean found = false;
      for(int i = 0;i < ruleset.getMatchCount() && !found;i++)
      {
         if(ruleset.getMatchAt(i).matches(st))
         {
            found = true;
            Match match = ruleset.getMatchAt(i);
            if(!next)
            {
               System.out.print(ruleset.getMatchAt(i).getQualifiedName());
               System.out.print(' ');
            }
            for(int j = 1;j <= match.getGroupCount();j++)
            {
               String qname = match.getGroupNameAt(j).getQualifiedName();
               boolean deep = false;
               for(int k = 0;k < rulesets.length && !deep;k++)
                  if(rulesets[k].getQualifiedName().equals(qname))
                  {
                     System.out.print("\n >> \"");
                     System.out.print(match.getGroupValueAt(j));
                     System.out.print("\" >> ");
                     read(rulesets,rulesets[k],match.getGroupValueAt(j),false);
                     deep = true;
                  }
               if(!deep)
               {
                  System.out.print(match.getGroupNameAt(j).getQualifiedName());
                  System.out.print(' ');
                  System.out.print(match.getGroupValueAt(j));
                  System.out.print(' ');
               }
            }
            String rest = match.rest();
            if(rest != null)
               read(rulesets,ruleset,rest,true);
         }
      }
      System.out.println();
   }
}
Do not be put off by the getRulesets() method. For the time being, it creates a file description in memory. In the next iteration, it will read the file description from an XML file.
I will soon have a working version of XIReader. All that's missing is to replace the System.out.println() in Listing 7 with the appropriate calls to ContentHandler. I also need to start fully implementing the XMLReader interface, but this isn't particularly difficult.
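To give an idea of what the calls to ContentHandler could look like, here is a rough sketch. It is not the XIReader implementation: the class name SaxSketch, its methods, and the sample address are assumptions made for illustration. It handles only the alias expression from Listing 7 and omits details that a complete XMLReader must take care of, such as startPrefixMapping(). The trace() method installs a small handler that records the events as text, so the result is easy to inspect.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

class SaxSketch
{
   static final String NS = "http://ananas.org/2002/sample";
   static final Pattern ALIAS = Pattern.compile("^alias (.*):(.*)$");

   // Reports an element that contains only text
   static void element(ContentHandler handler, String name, String text)
      throws SAXException
   {
      handler.startElement(NS, name, "an:" + name, new AttributesImpl());
      handler.characters(text.toCharArray(), 0, text.length());
      handler.endElement(NS, name, "an:" + name);
   }

   // Reports one "alias id:email" line; returns false if the line does not match
   static boolean alias(ContentHandler handler, String line)
      throws SAXException
   {
      Matcher matcher = ALIAS.matcher(line);
      if(!matcher.lookingAt())
         return false;
      handler.startElement(NS, "alias", "an:alias", new AttributesImpl());
      element(handler, "id", matcher.group(1));
      element(handler, "email", matcher.group(2));
      handler.endElement(NS, "alias", "an:alias");
      return true;
   }

   // Hypothetical helper: records the SAX events as text
   static String trace(String line)
      throws SAXException
   {
      final StringBuffer out = new StringBuffer();
      ContentHandler handler = new DefaultHandler()
      {
         public void startElement(String uri, String local, String qName, Attributes atts)
         {
            out.append('<').append(qName).append('>');
         }
         public void characters(char[] ch, int start, int length)
         {
            out.append(ch, start, length);
         }
         public void endElement(String uri, String local, String qName)
         {
            out.append("</").append(qName).append('>');
         }
      };
      handler.startDocument();
      alias(handler, line);
      handler.endDocument();
      return out.toString();
   }

   public static void main(String[] params)
      throws SAXException
   {
      System.out.println(trace("alias jdoe:jdoe@example.org"));
   }
}
```

Because the events go to a ContentHandler, the same code could feed any SAX consumer instead of the recording handler used here.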
As this column clearly demonstrates, one can spend a lot of time learning new libraries.
- Check out MEC-Eagle, yet another tool for e-commerce applications that can import legacy files in XML.
- And if you have Word or other word processor documents, you'll want to turn to upCast.
- Try Mastering Regular Expressions (Jeffrey E. F. Friedl, O'Reilly, 1997) for a useful reference on regular expressions.
- Find out more about the java.nio at http://java.sun.com/j2se/1.4/docs/api/java/nio/package-summary.html.
- For more information about the conversions offered with XI, read Benoît Marchal's last column, Working XML: Importing text as XML with XI (developerWorks, April 2002).
- Review the author's first project, XM, in earlier issues of the "Working XML" column:
- Working XML: Using XSLT for content management (developerWorks, July 2001)
- Working XML: Link management and preparing the future (developerWorks, August 2001)
- Working XML: Processing instructions and parameters (developerWorks, September 2001)
- Working XML: Wrapping up XM version 1 (developerWorks, October 2001)
- Read all of Benoît Marchal's Working XML articles.
- Take a look at Rational Application Developer for WebSphere Software, an easy-to-use, integrated development environment for building, testing, and deploying J2EE applications, including generating XML documents from DTDs and schemas.
- Find out how you can become an IBM Certified Developer in XML and related technologies.
- You'll find lots more XML resources on the developerWorks XML zone.

Benoît Marchal is a consultant and writer based in Namur, Belgium. He has just released the second edition of XML by Example. More details on this topic are available at marchal.com. You can contact Benoît at bmarchal@pineapplesoft.com.





