In keeping with the tradition of two-letter names, the new project is called XI, short for "XML Import." XI provides a simple solution for importing text documents as XML documents. This is particularly handy in combination with the XSLT Maker (XM) introduced in my previous columns, as it allows me to process text documents as part of the publishing chain. More specifically, I'm looking to import an address list into a Web site maintained with XM.
Obviously there are other applications for XI. It can be useful any time you need to turn a text document into XML. That might include e-commerce and similar projects. I will touch on those cases in the column as well.
In many respects, XML has proven incredibly successful in a very short timeframe. Only four years after the W3C released XML 1.0, it's hard to find a developer who is not at least vaguely familiar with XML. Likewise, thousands of vocabularies (or XML Schemas) and even more applications have been developed.
Still, in many projects, you'll find that some or all of the data was not originally available in XML. Recent applications have adopted XML as their primary file format but older applications rely on other formats, so you often end up with a mixture of data in new and old formats.
Since I originally published XM in this column, my company has used it to publish several sites we maintain. One recent project provides the perfect illustration of the coexistence of XML and other files. The site supports a project that has been ongoing for several years. Over its lifetime, the project has accumulated many documents, most of them in Word format. Since they are part of the archive, we simply link to them with XM directory reading.
Yet there are a small number of files that we need to import into the Web site. One of them is the list of project participants that is currently maintained as a text document similar to Listing 1 (names and addresses are made up here).
For many reasons that are irrelevant to this discussion, the list will remain and be maintained in that format for the foreseeable future. Still, we have to publish an updated list of participants, preferably as an HTML document with hyperlinks to e-mail addresses. One solution would be to duplicate the list in XML, but we're reluctant to do so because the list changes frequently. This can create maintenance headaches.
Since we could not feed the original document straight to XM, we settled on converting it to XML automatically. As long as the process is automated, we are guaranteed to always offer the most up-to-date version online. As a side benefit, it gave me the inspiration for this project.
Fortunately, the problem is not novel to me. Most of my work as a consultant has been with e-commerce. I started working with EDI (Electronic Data Interchange), one of the oldest forms of e-commerce, and graduated to online shops and XML Web services.
One of the first things you learn with e-commerce is to deal with myriad files. A typical e-commerce application has to interface with just about every application in the company as it:
- Collects product descriptions and pricing information from the catalog
- May collect product availability from the warehouse
- Interfaces with the business management to process orders and prepare invoices
- Communicates with accounting for payments and other information
Working with many files is even more common with business-to-business interactions where, by definition, you try to integrate not only with your company's back office but also with your partners' back offices.
In an ideal world, all the relevant information would be kept in DB2 databases and every application would have a clear, well-defined API. Unfortunately, the reality is that the data is almost never available in the format you need most.
For example, you might find that the catalog data is locked in a proprietary database, and that the customer list has been maintained with a simple PIM that offers few opportunities to export its database. In other cases, the back-office may have been developed in-house and has largely outgrown its original design.
You might find useful applications for XI in that area too. For the purpose of this column, however, I'll stick to importing the participant list.
Over the last couple of years, I have used different tools to process those documents, from high-end solutions to more affordable ones. I have also written my share of applications and I have repeatedly found that XSLT, when augmented with a syntax converter, offers a very attractive solution.
I have found it advantageous to break the import process into two steps: syntax conversion and reorganization of the data structure. The syntax conversion takes the textual information and wraps it in the simplest XML structure. Typically the resulting XML document is very close to the original document. In most cases, it's as simple as replacing delimiters with XML tags.
The second step is to use XSLT to turn this crude XML document into the target vocabulary. This division works particularly well and offers the following benefits:
- The code is organized in two modules so it's more maintainable
- Each module is written with the best tool for the task: A system programming language, such as Java, is ideal for the syntax conversion and a scripting language, like XSLT, is best for the reorganizing
- Writing style sheets is comparatively simpler than coding in Java so you might assign that work to a more junior team member
- You can often use and reuse both the syntax converter and the style sheet for different documents
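The two-step division above can be sketched in a few lines of Java. This is only an illustration under stated assumptions, not the XI code: the class name TwoStepImport, the key=value input format, and the inline style sheet are all hypothetical. Step one wraps the text in the crudest possible XML, and step two hands that XML to an XSLT style sheet through the standard javax.xml.transform API.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class TwoStepImport {
    // Hypothetical style sheet: turns each <line> into an element named after its key.
    public static final String STYLE_SHEET =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
        + "<xsl:template match='/lines'><person>"
        + "<xsl:for-each select='line'>"
        + "<xsl:element name='{key}'><xsl:value-of select='value'/></xsl:element>"
        + "</xsl:for-each></person></xsl:template></xsl:stylesheet>";

    /** Step 1 (syntax conversion): wrap key=value lines in a very simple XML structure. */
    public static String toCrudeXml(String text) {
        StringBuilder xml = new StringBuilder("<lines>");
        for (String line : text.split("\n")) {
            int eq = line.indexOf('=');
            if (eq < 0) {
                continue; // not a key=value line (assumed format), skip it
            }
            xml.append("<line><key>").append(escape(line.substring(0, eq)))
               .append("</key><value>").append(escape(line.substring(eq + 1)))
               .append("</value></line>");
        }
        return xml.append("</lines>").toString();
    }

    /** Step 2 (reorganization): apply an XSLT style sheet to the crude XML. */
    public static String restructure(String crudeXml, String styleSheet) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(styleSheet)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(crudeXml)),
                                  new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    // Escapes the characters that are not allowed in XML character data.
    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    public static void main(String[] args) {
        String crude = toCrudeXml("name=John Doe\ncity=Cincinnati");
        System.out.println(restructure(crude, STYLE_SHEET));
    }
}
```

Because the converter knows nothing about the target vocabulary, supporting a different output is a matter of swapping the style sheet.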
For this project, though, it seems that the text document has a very specific format so it does not look like much reuse will be possible. Yet the new JDK 1.4 has added support for regular expressions (regex), and I thought that would be a good way to build a slightly more generic solution.
If you're not familiar with regular expressions, turn to Mastering Regular Expressions from O'Reilly (see Resources). In a nutshell, a regular expression describes a string pattern. Using a regular expression engine, it is easy to test if a string matches the pattern. This is useful for validating input or for splitting a document where the pattern matches.
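As a small illustration of testing a string against a pattern and extracting its groups, here is a sketch that uses the java.util.regex package introduced with JDK 1.4. The pattern is the one from the an:alias rule in Listing 2, written as a Java string literal. The class name AliasLineDemo and the sample input line are assumptions made for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AliasLineDemo {
    // The an:alias pattern: the word alias, a quoted id, then a run of non-space characters.
    private static final Pattern ALIAS =
        Pattern.compile("^alias \"([^\"]*)\" (\\S*)$");

    /** Returns {id, email} if the whole line matches the pattern, or null otherwise. */
    public static String[] parseAlias(String line) {
        Matcher m = ALIAS.matcher(line);
        if (!m.matches()) {
            return null;
        }
        // Groups are numbered from 1, in the order of their opening parentheses.
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String[] parts = parseAlias("alias \"jdoe\" jdoe@xmli.com");
        System.out.println(parts[0] + " / " + parts[1]);
    }
}
```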
To better understand what XI is all about, review Listing 2 which describes the regular expressions that interpret Listing 1. Ultimately, XI uses the expressions in this description to convert Listing 1 into Listing 3; the latter is an XML document.
Listing 2. XI description rules
<?xml version="1.0"?>
<xi:rules version="1.0"
xmlns:xi="http://ananas.org/2002/xi/rules"
xmlns:an="http://ananas.org/2002/sample">
<xi:ruleset name="an:address-book">
<xi:match name="an:alias"
pattern="^alias &quot;([^&quot;]*)&quot; (\S*)$">
<xi:group name="an:id"/>
<xi:group name="an:email"/>
</xi:match>
<xi:match name="an:note"
pattern="^note &quot;([^&quot;]*)&quot; (.*)$">
<xi:group name="an:id"/>
<xi:group name="an:fields" call="an:fields"/>
</xi:match>
<xi:error msg="unknown line type"/>
</xi:ruleset>
<xi:ruleset name="an:fields">
<xi:match name="an:field"
pattern="&lt;([^:&gt;]*):([^&gt;]*)&gt;">
<xi:group name="an:key"/>
<xi:group name="an:value"/>
</xi:match>
</xi:ruleset>
</xi:rules>
Note that I created Listings 1 and 2 to help me understand how XI will work, so take them as illustrations, not necessarily as the final documents. As I develop XI, I may find that I need to make small amendments to these documents.
Listing 3. Sample document that XI will generate
<?xml version="1.0"?>
<an:address-book xmlns:an="http://ananas.org/2002/sample">
<an:alias>
<an:id>jdoe</an:id>
<an:email>jdoe@xmli.com</an:email>
</an:alias>
<an:note>
<an:id>jdoe</an:id>
<an:fields>
<an:field>
<an:key>country</an:key>
<an:value>US</an:value>
</an:field>
<an:field>
<an:key>zip</an:key>
<an:value>45202</an:value>
</an:field>
<an:field>
<an:key>state</an:key>
<an:value>OH</an:value>
</an:field>
<an:field>
<an:key>city</an:key>
<an:value>Cincinnati</an:value>
</an:field>
<an:field>
<an:key>address</an:key>
<an:value>34 Fountain Square Plaza</an:value>
</an:field>
<an:field>
<an:key>name</an:key>
<an:value>John Doe</an:value>
</an:field>
</an:fields>
</an:note>
<an:alias>
<an:id>jsmith</an:id>
<an:email>jsmith@worth-it.com</an:email>
</an:alias>
<an:note>
<an:id>jsmith</an:id>
<an:fields>
<an:field>
<an:key>first</an:key>
<an:value>Jack</an:value>
</an:field>
<an:field>
<an:key>last</an:key>
<an:value>Smith</an:value>
</an:field>
<an:field>
<an:key>name</an:key>
<an:value>Jack Smith</an:value>
</an:field>
</an:fields>
</an:note>
<an:alias>
<an:id>pdupont</an:id>
<an:email>pdupont@pineapples.net</an:email>
</an:alias>
<an:note>
<an:id>pdupont</an:id>
<an:fields>
<an:field>
<an:key>name</an:key>
<an:value>Pierre Dupont</an:value>
</an:field>
</an:fields>
</an:note>
</an:address-book>
I have tried to keep the format in Listing 2 as simple as possible. The description consists of one or more rulesets. Each ruleset specifies a list of patterns that the input must match. XI will test the input against each of these patterns in turn and apply the description associated with the first one that matches.
Within a pattern, parentheses indicate a group (as shown in Listing 2). Each group may map to a different XML element.
When a pattern matches, XI either generates one or more XML elements or it switches to another ruleset. The idea here is that some patterns must be further broken down. The note line, which begins note "jdoe", in Listing 1 is a perfect example. A note line contains a variable number of fields and, once we've identified such a line with one regular expression, we need to decompose it into fields.
The attributes are consistent throughout Listing 2: The name attribute represents an XML element in the output and it applies to rulesets, matches, and groups; the call attribute indicates that a group must be further broken down using another ruleset.
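The matching logic described above (first matching pattern wins, each group becomes an element, and a group may call another ruleset) can be sketched as follows. This is a rough illustration, not the XI implementation: RulesetSketch and Rule are hypothetical names unrelated to the classes in Figure 1, the output is built as a string rather than as SAX events, and the note and field patterns are written on the assumption that field values may contain spaces.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RulesetSketch {
    /** One match rule: an element name, a pattern, and a name (plus optional call) per group. */
    static final class Rule {
        final String name;
        final Pattern pattern;
        final String[] groupNames;
        final String[] groupCalls; // ruleset to call for a group, or null to emit its text

        Rule(String name, String regex, String[] groupNames, String[] groupCalls) {
            this.name = name;
            this.pattern = Pattern.compile(regex);
            this.groupNames = groupNames;
            this.groupCalls = groupCalls;
        }
    }

    private final Map<String, List<Rule>> rulesets = new LinkedHashMap<>();

    public void add(String ruleset, Rule rule) {
        rulesets.computeIfAbsent(ruleset, k -> new ArrayList<>()).add(rule);
    }

    /** Consumes the text from left to right, applying the first rule that matches each time. */
    public String apply(String ruleset, String text) {
        StringBuilder out = new StringBuilder();
        int pos = 0;
        while (pos < text.length()) {
            Rule used = null;
            Matcher hit = null;
            for (Rule rule : rulesets.getOrDefault(ruleset, new ArrayList<Rule>())) {
                Matcher m = rule.pattern.matcher(text).region(pos, text.length());
                if (m.lookingAt() && m.end() > pos) { // ignore empty matches
                    used = rule;
                    hit = m;
                    break;
                }
            }
            if (hit == null) { // plays the role of xi:error
                throw new IllegalArgumentException("unknown input: " + text.substring(pos));
            }
            out.append('<').append(used.name).append('>');
            for (int g = 1; g <= hit.groupCount(); g++) {
                String value = hit.group(g) == null ? "" : hit.group(g);
                String call = used.groupCalls[g - 1];
                out.append('<').append(used.groupNames[g - 1]).append('>')
                   .append(call == null ? escape(value) : apply(call, value))
                   .append("</").append(used.groupNames[g - 1]).append('>');
            }
            out.append("</").append(used.name).append('>');
            pos = hit.end();
        }
        return out.toString();
    }

    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Rules adapted from Listing 2 (patterns are assumptions, see the lead-in). */
    public static RulesetSketch addressBook() {
        RulesetSketch xi = new RulesetSketch();
        xi.add("an:address-book", new Rule("an:alias", "^alias \"([^\"]*)\" (\\S*)$",
            new String[] { "an:id", "an:email" }, new String[] { null, null }));
        xi.add("an:address-book", new Rule("an:note", "^note \"([^\"]*)\" (.*)$",
            new String[] { "an:id", "an:fields" }, new String[] { null, "an:fields" }));
        xi.add("an:fields", new Rule("an:field", "<([^:>]*):([^>]*)>",
            new String[] { "an:key", "an:value" }, new String[] { null, null }));
        return xi;
    }
}
```

Calling addressBook().apply("an:address-book", line) once per input line yields the an:alias and an:note fragments; wrapping them in the root element is left out of the sketch.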
This format is intentionally limited. For example, XI only generates elements -- never attributes. It forces you to associate an element with every ruleset, match, and group. My goal is to keep this format (and XI) as simple as possible. It is just easier to reorganize the document, change elements into attributes, and skip and insert tags in XSLT than it is to try to come up with a smarter description language.
In this case, turning the XI output into DocBook would be the responsibility of a style sheet similar to that in Listing 4. Note that this style sheet has not been optimized; it's intended for illustration purposes only.
XI is different from the proposed addition of regular expressions to XPath 2.0 (and therefore XSLT). It is also different from Simon St Laurent's Regular Fragmentation library (see Resources). As far as I understand the preliminary drafts, it appears that XPath 2.0 will use patterns to decompose elements. Yet the assumption remains that the input document was already an XML document.
XI, on the other hand, uses regular expressions to transform text documents into XML. If anything, the two solutions are complementary: It might make sense to prepare an XML document with XI and pass it to a style sheet that will use XPath 2.0 to select specific bits of information.
This section complements the above discussion with a brief analysis of the XI design.
My earliest attempts at XML conversions created temporary files that I would pass to an XSLT converter. Unfortunately, this caused the document to be written to disk and re-read. Later, I learned to skip the intermediate parsing and generate SAX events directly from my converter. I introduced the technique with DirectoryReader, during the development of XM.
More recently, I've discovered that it's even more beneficial to implement the XMLReader interface. This is particularly true when post-processing the document with XSLT because javax.xml.transform.sax.SAXSource accepts an XMLReader as a parameter.
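To make the idea concrete, here is a minimal sketch of a text reader that implements XMLReader and is passed to a transformer through SAXSource, with no temporary file in between. It is not XIReader: the class name LineReader is hypothetical, it simply reports each input line as a line element, and it extends org.xml.sax.helpers.XMLFilterImpl only to inherit the handler bookkeeping that XMLReader requires.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.XMLFilterImpl;

/** Hypothetical reader: reports every line of a text document as a line element. */
public class LineReader extends XMLFilterImpl {
    // Silently accept whatever features and properties the transformer tries to set.
    @Override public void setFeature(String name, boolean value) { }
    @Override public boolean getFeature(String name) { return false; }
    @Override public void setProperty(String name, Object value) { }
    @Override public Object getProperty(String name) { return null; }

    /** Reads text (not XML) from the input source and fires the matching SAX events. */
    @Override
    public void parse(InputSource input) throws IOException, SAXException {
        // The sketch assumes the caller supplied a character stream.
        BufferedReader in = new BufferedReader(input.getCharacterStream());
        AttributesImpl none = new AttributesImpl();
        startDocument();
        startElement("", "lines", "lines", none);
        for (String line = in.readLine(); line != null; line = in.readLine()) {
            startElement("", "line", "line", none);
            characters(line.toCharArray(), 0, line.length());
            endElement("", "line", "line");
        }
        endElement("", "lines", "lines");
        endDocument();
    }

    /** Feeds the reader to an identity transformer and returns the serialized XML. */
    public static String toXml(String text) {
        try {
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            identity.transform(
                new SAXSource(new LineReader(), new InputSource(new StringReader(text))),
                new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Replacing the identity transformer with one built from a style sheet would post-process the generated events with XSLT in the same pass.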
Figure 1, below, is the class diagram for XI. The main class is XIReader and it implements XMLReader. This class is responsible for parsing a text document according to a definition, applying the regular expressions, and generating the appropriate SAX events.
Since the JDK includes a regular expression library, I don't anticipate any difficulties with this class. You must give the XIReader a document description (similar to Listing 2) using setProperty(). It uses RulesHandler to decode the document description. Since RulesHandler is a SAX content handler, I think HC will be handy. You'll notice the class inherits from HCHandler.
Finally, you'll find several classes that hold the document description: the Error, Group, Ruleset, and Match tags from Listing 2 all have their corresponding classes in Figure 1. I'm still debating whether I need a Rules element. If I choose to introduce one, it won't affect Figure 1 in any major way.
Figure 1. Class diagram for XI

All the classes are defined in the org.ananas.xi package.
This column has included the analysis and requirements for XI. We'll start implementing the project in the next column with a first version of XIReader. As usual, the code will be released under an open source license.
- Learn how e-commerce developers have long used conversion tools such as IBM WebSphere Data Interchange for Multiplatforms to process documents in a variety of formats. It does not require XSLT. http://www.ibm.com/software/integration/wdi/multiplatform32/
- If you're an iSeries (AS/400) developer, check out this interesting discussion about converting XML documents to transactions.
- Get the GoXML Transform tool that lets you import documents in XML. It does not require XSLT.
- Download the XML Convert application that you can use to convert text documents to XML. You would typically complement it with XSLT.
- Read all of Benoît Marchal's Working XML articles.
- Read about XPath 2.0 that should support regular expressions.
- Learn about Simon St Laurent's proposed Regular Fragmentation to split XML elements.
- Check out this reference book on regular expressions, Mastering Regular Expressions (Jeffrey E. F. Friedl, ed. O'Reilly, 1997).

Benoît Marchal is a consultant and writer based in Namur, Belgium. He has just released the second edition of XML by Example. More details on this topic are at marchal.com. You can contact Benoît at bmarchal@pineapplesoft.com.