
Working XML: Link management and preparing the future

XM matures, with features enough to publish simple Web sites

Benoit Marchal (bmarchal@pineapplesoft.com), Consultant, Pineapple Software
Benoît Marchal is a consultant and writer based in Namur, Belgium. He is the author of XML by Example, Applied XML Solutions, and XML and the Enterprise. He is a columnist for Gamelan. Details on his latest projects are at marchal.com. You can contact Benoît at bmarchal@pineapplesoft.com.

Summary:  In this installment of Working XML, Benoît Marchal uses XML filters to add new functionality to XM, his open-source Web publishing application. Thanks to two new features, XM is now powerful enough to handle simple Web sites. Code samples demonstrate the use of filters and other techniques, as well as updates to XM code. There's also a link to download the application source code.

Date:  01 Aug 2001
Level:  Introductory


The Working XML column, introduced last month, follows the evolution of Benoît Marchal's open-source applications based on XML technologies. The first of these applications is XM (XSLT Make), a content-management and Web publishing solution. As he develops the applications, Benoît shares his design decisions, techniques, code, and the lessons he learns along the way.

Last month in the first Working XML column, I introduced XM, an approachable solution for content management and Web site publishing based on XML and XSLT. The operative word here is approachable. Excellent commercial content-management solutions exist for high-end sites, but they are priced too high for many webmasters. Conversely, homegrown solutions are cheap, but while they are suitable for developers, they're not convenient for relatively nontechnical end users.

The first version of XM (introduced last month) is limited: It recursively walks through a set of directories and applies an XSLT style sheet as it goes along. Many important features are missing. This month, I address two of the most urgent limitations. The result is a usable solution for simple Web sites, but there's still room for improvement in upcoming months. (For the record, I publish the ananas.org Web site, referenced in Resources, with XM.)

Link management

The two new features this month are link management and the ability to process non-XML files. The two features are somewhat related. Let's start with link management.

Fixing links

In most cases, XM breaks links because it renames some files while applying the style sheet. Figure 1 illustrates this. Suppose the webmaster has two files (index.xml and details.xml) connected by one hyperlink. XM renames the files to index.html and details.html respectively, thereby breaking the link.

Theoretically, the style sheet could fix the link: It would suffice to replace .xml with .html. However, as I plan to add new features to XM, fixing links in the style sheet will become more difficult. It seems more logical to have XM fix the links automatically.


Figure 1. XM breaks links because the resulting files have different extensions.

Preventing broken links

Link management should also prevent and detect broken links. Users hate the "404 - File Not Found" error, and XM must reduce the number of those errors. High-end content-management solutions work with a database and assign unique identifiers to each document. With the identifiers, they can monitor insertions and deletions to prevent invalid links. XM does not have that luxury, so it limits itself to testing whether a link points to an existing file. If a link does not point to a file, XM reports an error.

As a convenience, I have also devised a simple mechanism, which I called "relative absolute path," that combines the benefits of relative and absolute paths. Relative paths are particularly convenient for testing a Web site from the file system. On the other hand, absolute links are more likely to be accurate.

Absolute paths are particularly useful for links created by the style sheet. As an illustration, consider the image for developerWorks at ananas.org. The style sheet loads the image from /images/buttons/dw.gif. Imagine if the style sheet had to use relative links. It would insert a link to images/buttons/dw.gif for XML documents in the root and a link to ../images/buttons/dw.gif for documents one level below the root. But how does the style sheet find out whether a document is in the root? Again, it is easier to let XM manage this for the style sheet.

Relative absolute paths are the answer: Replace the first character of an absolute path (the leading /) with !, and XM link management will turn the link into a relative path. This feature is useful only for testing the Web site from the file system. For example, with last month's version of XM, the style sheet had to use absolute paths, and the images would not display correctly when I tested the site locally. Of course, if you always access your site through a Web server, you won't need this feature.
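As a minimal sketch of the idea, the conversion could look like the following. The class name, the method name, and the meaning of depth (the number of directories between the document and the site root) are assumptions made for illustration; this is not XM's actual code.

```java
/** Hypothetical helper illustrating relative absolute paths; not taken from XM. */
final class RelativeAbsolutePath {
    private RelativeAbsolutePath() {}

    /**
     * Turns "!images/dw.gif" into a path relative to a document that sits
     * depth directories below the site root (assumed meaning of depth).
     * Links that do not start with '!' are returned unchanged.
     */
    static String resolve(String uri, int depth) {
        if (uri == null || !uri.startsWith("!")) {
            return uri;
        }
        StringBuilder relative = new StringBuilder();
        for (int i = 0; i < depth; i++) {
            relative.append("../"); // climb one directory per level
        }
        return relative.append(uri.substring(1)).toString();
    }
}
```

With this sketch, !images/buttons/dw.gif resolves to images/buttons/dw.gif for a document in the root and to ../images/buttons/dw.gif for a document one level below it, which matches the two cases described earlier.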

LinkFilter

Link management is implemented in LinkFilter. As the name implies, this class is an XML filter that extends SAX XMLFilterImpl. If you are not yet familiar with XML filters, read on. I believe that filtering is one of the most useful features of SAX.

Simply put, a filter is a SAX event handler that passes the events it receives to another event handler. However, in so doing, it modifies (or filters) some of the events. Filters are particularly handy when linked together to form a so-called pipeline (see Figure 2). In this configuration, the document flows through the pipe (see the sidebar Logging SAX events). At each step, a filter modifies it. Gradually, the pipe turns the input document into the output.

Filters offer a clean pattern to break down XML transformations into several distinct steps. They are particularly handy because they can be recombined in new configurations easily. Filters are also efficient because they process events as soon as they are received.
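To make the pattern concrete, here is a small illustrative filter built on the standard XMLFilterImpl class. It is not XM's LinkFilter: the class name SuffixFilter and the rule it applies (rewrite href values ending in .xml) are assumptions chosen to keep the example short. The demo() method is only there to show the filter in a one-stage pipe.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

/** Illustrative filter (hypothetical, not XM's LinkFilter). */
final class SuffixFilter extends XMLFilterImpl {
    SuffixFilter(XMLReader parent) {
        super(parent);
    }

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) throws SAXException {
        // Copy the attributes: the incoming object must not be modified.
        AttributesImpl copy = new AttributesImpl(atts);
        int i = copy.getIndex("href");
        if (i >= 0 && copy.getValue(i).endsWith(".xml")) {
            String value = copy.getValue(i);
            copy.setValue(i, value.substring(0, value.length() - 4) + ".html");
        }
        // Pass the (possibly modified) event to the next handler in the pipe.
        super.startElement(uri, localName, qName, copy);
    }

    /** Parses xml through the filter and returns the href values seen downstream. */
    static String demo(String xml) {
        StringBuilder seen = new StringBuilder();
        try {
            XMLReader parser = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            SuffixFilter filter = new SuffixFilter(parser);
            filter.setContentHandler(new DefaultHandler() {
                @Override
                public void startElement(String u, String l, String q, Attributes a) {
                    if (a.getValue("href") != null) {
                        seen.append(a.getValue("href"));
                    }
                }
            });
            filter.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return seen.toString();
    }
}
```

Every event other than startElement() is forwarded untouched by XMLFilterImpl, which is why the filter only overrides one method.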


Figure 2. A pipeline of filters

Logging SAX events

When building pipes, it is useful to see which events pass in the pipe. I wrote a class to test the pipe. It dumps as much information as possible in a file. Later I can analyze the file to better understand which events are fired when. The code for this class is in the source repository (see Resources), in the org.ananas.util package.
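The class in org.ananas.util is not reproduced here. As a rough sketch of the same idea, a logging filter might record each element event before forwarding it; the name EventLogFilter and the log format are assumptions, and this version writes to a StringBuilder rather than a file to stay short.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLFilterImpl;

/** Hypothetical logging filter; the real class lives in org.ananas.util. */
final class EventLogFilter extends XMLFilterImpl {
    private final StringBuilder log;

    EventLogFilter(XMLReader parent, StringBuilder log) {
        super(parent);
        this.log = log;
    }

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) throws SAXException {
        log.append("start:").append(qName).append('\n');
        super.startElement(uri, localName, qName, atts); // forward unchanged
    }

    @Override
    public void endElement(String uri, String localName, String qName)
            throws SAXException {
        log.append("end:").append(qName).append('\n');
        super.endElement(uri, localName, qName); // forward unchanged
    }

    /** Parses xml through the filter and returns the recorded log. */
    static String demo(String xml) {
        StringBuilder log = new StringBuilder();
        try {
            XMLReader parser = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            new EventLogFilter(parser, log).parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return log.toString();
    }
}
```

Because the filter forwards every event, it can be dropped between any two stages of a pipe without changing the result.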

One of the challenges with link management is to recognize the URIs in the XML document. There are so many XML vocabularies that it's hard to decide which to support. I originally thought I would preprocess the input XML documents, so that the links would be correct by the time they reached the XSLT processor. I had thought I would support XLink, the standard XML linking vocabulary; DocBook, the vocabulary I use for ananas.org; NewsML; and others.

I eventually decided that it would make more sense to postprocess the document. In other words, fix the links after running it through the style sheet. This alternative is particularly attractive because there are relatively few vocabularies to publish with: HTML, RSS, and XSL FO (for PDF publishing, coming soon). The current version is limited to HTML, which is reasonable given that XM targets webmasters.

Listing 1, LinkFilter.java, implements the link management. XMLFilterImpl already implements most of ContentHandler, blindly forwarding events to the next filter in the pipe. I only need to intercept startElement() to fix links. The logic is as follows:

  1. Extract the attribute that contains the URI.
  2. Process so-called relative absolute paths, replacing ! with the appropriate relative path.
  3. If the file is local, fix the URI by replacing the file name with whatever name XM will use. We'll review MoversSupervisor in a moment.
  4. If the file is not local and it does not look like an external URI, report an error. What does "look like an external URI" mean? Essentially, that the URI starts with http:, ftp:, news:, or mailto:. In theory it would be a simple matter to connect to the remote server and test for a 4xx or 5xx response code. In practice, that may slow down XM dramatically, so I decided not to do it.
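The decision made in steps 3 and 4 can be sketched as a small classification routine. The names LinkCheck, Kind, and classify are hypothetical and do not come from LinkFilter.java; the sketch also ignores fragments and query strings, which a real implementation would have to strip first.

```java
import java.io.File;
import java.util.Locale;

/** Hypothetical sketch of the local/external/broken decision; not XM's code. */
final class LinkCheck {
    enum Kind { LOCAL, EXTERNAL, BROKEN }

    // The four prefixes named in the text.
    private static final String[] EXTERNAL_PREFIXES = {"http:", "ftp:", "news:", "mailto:"};

    private LinkCheck() {}

    /** True if the URI starts with one of the external prefixes. */
    static boolean looksExternal(String uri) {
        String lower = uri.toLowerCase(Locale.ROOT);
        for (String prefix : EXTERNAL_PREFIXES) {
            if (lower.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    /** Classifies a link found in a document stored in the given directory. */
    static Kind classify(File directory, String uri) {
        if (new File(directory, uri).isFile()) {
            return Kind.LOCAL;      // step 3: the caller would now fix the file name
        }
        if (looksExternal(uri)) {
            return Kind.EXTERNAL;   // not checked against the remote server
        }
        return Kind.BROKEN;         // step 4: report an error
    }
}
```

Checking only the prefix keeps the test cheap, at the cost of never detecting a dead external link, which is the trade-off described in step 4.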

Obviously LinkFilter has to know a thing or two about the file system. The setDirectory() method provides the information.

You may wonder about comment(), endCDATA(), and similar methods. They implement the LexicalHandler interface. LexicalHandler is an optional event handler in SAX2 that copes with special lexical constructs such as comments, CDATA sections, entities, and the like. The XSLT processor uses LexicalHandler to return them.

Because LexicalHandler is an optional interface, there is no setLexicalHandler() method on the XMLFilterImpl. Instead, one uses the generic setProperty() method. The lexical handler doesn't do much: It simply forwards the events to the next filter in the pipe, but the XSLT processor requires them.
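The following sketch shows one way a filter can take part in lexical event forwarding through setProperty(). The property name http://xml.org/sax/properties/lexical-handler is the standard SAX2 identifier; the class name LexicalForwardingFilter and the choice to register with the parent inside parse() are assumptions, not a copy of LinkFilter.java.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;
import org.xml.sax.ext.LexicalHandler;
import org.xml.sax.helpers.XMLFilterImpl;

/** Hypothetical filter that forwards lexical events to the next stage. */
final class LexicalForwardingFilter extends XMLFilterImpl implements LexicalHandler {
    static final String LEXICAL_HANDLER = "http://xml.org/sax/properties/lexical-handler";
    private LexicalHandler next; // downstream handler, may be null

    LexicalForwardingFilter(XMLReader parent) {
        super(parent);
    }

    @Override
    public void setProperty(String name, Object value)
            throws SAXNotRecognizedException, SAXNotSupportedException {
        if (LEXICAL_HANDLER.equals(name)) {
            next = (LexicalHandler) value; // keep it instead of passing it up
        } else {
            super.setProperty(name, value);
        }
    }

    @Override
    public void parse(InputSource input) throws SAXException, IOException {
        // Ask the parent to send its lexical events to this filter.
        getParent().setProperty(LEXICAL_HANDLER, this);
        super.parse(input);
    }

    @Override public void startDTD(String n, String p, String s) throws SAXException { if (next != null) next.startDTD(n, p, s); }
    @Override public void endDTD() throws SAXException { if (next != null) next.endDTD(); }
    @Override public void startEntity(String n) throws SAXException { if (next != null) next.startEntity(n); }
    @Override public void endEntity(String n) throws SAXException { if (next != null) next.endEntity(n); }
    @Override public void startCDATA() throws SAXException { if (next != null) next.startCDATA(); }
    @Override public void endCDATA() throws SAXException { if (next != null) next.endCDATA(); }
    @Override public void comment(char[] ch, int start, int length) throws SAXException { if (next != null) next.comment(ch, start, length); }

    /** Parses xml through the filter and returns the comment text seen downstream. */
    static String demo(String xml) {
        StringBuilder seen = new StringBuilder();
        try {
            XMLReader parser = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            LexicalForwardingFilter filter = new LexicalForwardingFilter(parser);
            filter.setProperty(LEXICAL_HANDLER, new DefaultHandler2() {
                @Override
                public void comment(char[] ch, int start, int length) {
                    seen.append(ch, start, length);
                }
            });
            filter.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return seen.toString();
    }
}
```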


Movers and shakers

The second improvement to XM this month is the ability to handle non-XML files. There's a comment in the DirectoryWalker from last month that reads // else copy file. This month it's time to write the else.

My plan for the next few months includes a special processor for directories. With that processor, XM will create XML files dynamically at publishing time. What I'm getting at is that soon XM will recognize several categories of files: XML, of course, and also directory processor files and other types. Consequently I decided to abstract the file processing now rather than later; it will be handy soon.

Mover

I followed a classic pattern to abstract an operation and introduced a Mover interface. As the name implies, a Mover relocates a file from the source to the publishing directory. In the process it may apply a style sheet and rename the file (change the extension from .xml to .html). Because I follow a classic pattern, the coding is almost brainless. The most difficult part was to decide on a good name for the interface.

The interface is defined in Listing 2, Mover.java. It proposes two methods:

  • move(), which, as you would have guessed, copies a file from the source to the target directory
  • getTargetName(), which returns the name this mover would use if it were to move the file. This method is used by LinkFilter to fix links (more specifically, file suffixes), as we saw previously

Listing 2. Mover.java

package org.ananas.xm;

import java.io.*;

public interface Mover
{
   public File move(File sourceFile,File targetDir,int depth)
      throws IOException, XMException;
   public String getTargetName(File file)
      throws XMException;
}

To simplify coding, I have provided an abstract class DefaultMoverImpl, available through the source repository (see Resources). It offers a few useful methods for Mover implementors. Offering both an interface and an abstract class is redundant, but I find it convenient: In most cases, I only need the abstract class, but the interface enables multiple inheritance.

StylingMover

The first concrete implementation of Mover is Listing 3, StylingMover.java. It applies an XSLT style sheet as it copies the file to the target directory. The code is taken almost verbatim from last month's DirectoryWalker. The differences relate to the use of the LinkFilter.

How do you save the result in a file? SAX events flow through the pipe, but you don't want events; you want a file! The last element in the pipe must write the events to an XML or HTML file. In practice, you need a special ContentHandler to terminate the pipe, after the LinkFilter. The easiest solution is to turn to the Xalan-provided serializer. The serializer accepts SAX events and writes the corresponding XML document to a file. It does not cost anything to use the Xalan serializer because that's the class Xalan uses internally to save HTML documents. StylingMover builds the pipe, including the serializer.
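The sketch below shows the shape of such a pipe: parser, then a filter, then a ContentHandler that writes text. Note the substitution: instead of the Xalan serializer class used by StylingMover, it uses an identity TransformerHandler from the standard JAXP API, and a plain XMLFilterImpl stands in for LinkFilter. The class name PipeToText is hypothetical, and the output goes to a string rather than a file to keep the example self-contained.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLFilterImpl;

/** Hypothetical sketch of a pipe terminated by a serializing handler. */
final class PipeToText {
    private PipeToText() {}

    static String run(String xml) {
        try {
            // Head of the pipe: a namespace-aware SAX parser.
            SAXParserFactory parsers = SAXParserFactory.newInstance();
            parsers.setNamespaceAware(true);
            XMLReader parser = parsers.newSAXParser().getXMLReader();

            // Middle of the pipe: a pass-through filter (stand-in for LinkFilter).
            XMLFilterImpl filter = new XMLFilterImpl(parser);

            // End of the pipe: an identity TransformerHandler acting as serializer.
            SAXTransformerFactory factory =
                (SAXTransformerFactory) TransformerFactory.newInstance();
            TransformerHandler serializer = factory.newTransformerHandler();
            serializer.getTransformer()
                      .setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            serializer.setResult(new StreamResult(out));

            filter.setContentHandler(serializer);
            filter.parse(new InputSource(new StringReader(xml)));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Swapping the StringWriter for a FileOutputStream in the StreamResult would write the document to disk instead.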

CopyingMover

The second Mover is in Listing 4, CopyingMover.java. If the file has changed, it copies it to the target directory. For greater efficiency, it uses a small buffer. Unlike StylingMover, CopyingMover never modifies the filename.


Listing 4. CopyingMover.java

package org.ananas.xm;

import java.io.*;

public class CopyingMover
   extends DefaultMoverImpl
{
   protected byte[] bytes = new byte[1024];
   public String getTargetName(File file)
   {
      return file.getName();
   }
   public CopyingMover(Messenger messenger)
   {
      super(messenger);
   }
   public File move(File sourceFile,File targetDir,int depth)
      throws XMException
   {
      try
      {
         File targetFile = new File(targetDir,getTargetName(sourceFile));
         if(targetFile.exists())
         {
            if(targetFile.lastModified() >= sourceFile.lastModified())
               return null;
         }
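         // The buffer is shared between calls, hence the lock.
         synchronized(bytes)
         {
            // try-with-resources (Java 7 and later) closes both streams,
            // even if the copy fails halfway through.
            try(FileInputStream in = new FileInputStream(sourceFile);
                FileOutputStream out = new FileOutputStream(targetFile))
            {
               for(int len = in.read(bytes);len != -1;len = in.read(bytes))
                  out.write(bytes,0,len);
            }
         }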
         return targetFile;
      }
      catch(IOException e)
      {
         messenger.error(new XMException(e));
      }
      return null;
   }
}

MoversSupervisor

Listing 5, MoversSupervisor.java, selects the proper mover for a file. It looks only at the file suffix: .xml files are sent to a StylingMover and all the other files go through a CopyingMover.
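A stripped-down version of that selection logic might look like the sketch below. It is not Listing 5: the class SupervisorSketch and the one-method NamingMover interface are hypothetical stand-ins that model only the target-name side of a mover, and matching the suffix case-insensitively is an assumption.

```java
import java.io.File;
import java.util.Locale;

/** Hypothetical sketch of suffix-based mover selection; not MoversSupervisor.java. */
final class SupervisorSketch {
    /** Stand-in for the naming half of the Mover interface. */
    interface NamingMover {
        String getTargetName(File file);
    }

    // One instance of each mover is created once and reused.
    private final NamingMover styling = file -> {
        String name = file.getName();
        return name.substring(0, name.length() - ".xml".length()) + ".html";
    };
    private final NamingMover copying = File::getName;

    /** .xml files go to the styling mover, everything else to the copying mover. */
    NamingMover select(File file) {
        boolean isXml = file.getName().toLowerCase(Locale.ROOT).endsWith(".xml");
        return isXml ? styling : copying;
    }

    /** The name the selected mover would give the file in the target directory. */
    String getTargetName(File file) {
        return select(file).getTargetName(file);
    }
}
```

Asking the supervisor for a target name, rather than hard-coding the .html suffix, is what lets a link-fixing filter stay correct when new kinds of movers are added.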

MoversSupervisor also creates and retains an instance of each mover. Obviously this class (and the whole Mover series) is overkill for the current, relatively limited version of XM, but it will simplify things next month.

DirectoryWalker

Last but not least, for the new version of XM I had to update DirectoryWalker to use the movers. This cleanly separates the logic for walking through directories from the logic for moving files from source to target directory, as shown in Listing 6, DirectoryWalker redux.
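The separation can be sketched as a recursive walk that knows nothing about how files are processed. This is not Listing 6: WalkerSketch and FileAction are hypothetical names, FileAction stands in for the move() call on a mover, and incrementing depth by one per directory level is an assumption about what the depth parameter means. The demo() method builds a tiny temporary tree to show the walk.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical sketch of a directory walk that delegates file handling. */
final class WalkerSketch {
    /** Stand-in for Mover.move(): what to do with each file. */
    interface FileAction {
        void apply(File sourceFile, File targetDir, int depth) throws IOException;
    }

    private WalkerSketch() {}

    static void walk(File sourceDir, File targetDir, int depth, FileAction action)
            throws IOException {
        File[] entries = sourceDir.listFiles();
        if (entries == null) {
            throw new IOException("Cannot list " + sourceDir);
        }
        if (!targetDir.isDirectory() && !targetDir.mkdirs()) {
            throw new IOException("Cannot create " + targetDir);
        }
        for (File entry : entries) {
            if (entry.isDirectory()) {
                // Mirror the subdirectory in the target tree, one level deeper.
                walk(entry, new File(targetDir, entry.getName()), depth + 1, action);
            } else {
                action.apply(entry, targetDir, depth);
            }
        }
    }

    /** Walks a small temporary tree and returns "depth:name" entries, sorted. */
    static String demo() {
        try {
            Path root = Files.createTempDirectory("walker-src");
            Path sub = Files.createDirectory(root.resolve("sub"));
            Files.createFile(root.resolve("index.xml"));
            Files.createFile(sub.resolve("details.xml"));
            File target = Files.createTempDirectory("walker-out").toFile();
            List<String> seen = new ArrayList<>();
            walk(root.toFile(), target, 0,
                 (src, dir, depth) -> seen.add(depth + ":" + src.getName()));
            Collections.sort(seen);
            return String.join(",", seen);
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }
}
```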


Miscellaneous updates

There's one last change to report: I have renamed Messager (and its companion DefaultMessager) into Messenger (and DefaultMessenger). Sorry for the typo.

I know that some readers had problems downloading the source code last month. This is due to unfortunate complications as developerWorks migrated its Open Source zone to a new platform shortly after the first column was published.

To avoid further problems as developerWorks brings the new system online, I will make zip files available (see Resources). This is a temporary measure, and I encourage you to access the CVS repository if it is available. However, should the repository be temporarily unavailable, you may want to go for the zip file.

Incidentally, if you experience problems accessing the files, please report them on the ananas-discussion mailing list (see Resources).


Your turn

XM is taking shape. While the code I released last month was not very useful (I know because I had to work hard to publish ananas.org with XM), this month's version is more practical. While it still lacks important features, it should work well for simple Web sites.

Now it's your turn. Download XM and report your findings on the ananas-discussion mailing list. I need your help to identify areas that need more work. The plans for the near future are to support several XSLT files and generate a directory.


Resources

  • You can download the code for this project from ananas.org. Follow the links to the CVS repository on developerWorks, as well as the ananas-discussion mailing list. I encourage you to join the list and contribute your thoughts to the project.

  • If you experience problems with the CVS repository, please try accessing this zip file instead.

  • XM uses Xalan and Xerces-J as XSLT processor and XML parser, respectively. Xalan was originally developed by Lotus, an IBM subsidiary, and Xerces was originally developed by IBM. IBM donated the code to the Apache Software Foundation.

