Work on XM continues. This month I have added support for multiple style sheets, thereby addressing the most common suggestion from readers. I have also added an option to pass parameters to style sheets, which about rounds off the basic publishing functions of XM. Starting to work on more advanced features, I've included a directory reader. (You can download all the relevant code for this and previous columns; see the sidebar, Getting the code.)
The first two versions of XM followed a one-size-fits-all strategy, which originally seemed a good idea but proved ineffective. More specifically, so far XM recognizes only one style sheet, rules.xsl. As explained in the first article in this series, I originally thought I would use different XSLT templates to select changes in styles:
<xsl:template match="db:article">
   <!-- rules for an article here -->
</xsl:template>
<xsl:template match="xm:Directory">
   <!-- rules for a directory here -->
</xsl:template>
Yet my own experience with ananas.org, a site I maintain exclusively with XM, showed that my original plan just doesn't work well. I also heard from readers who suggested I address what they perceived as a limitation. Finally, as I started working on content generation (introduced later in this article), it became evident that it was time to add support for multiple style sheets.
You may recall that ease of use is one of my priorities with XM and, specifically, I do not want to use configuration files or "build scripts" to select which style sheet applies where (review Working XML: Using XSLT for content management for a complete discussion of this requirement).
Readers suggested clever naming conventions that would select a style sheet, but the best suggestion for a solution came from a colleague who reminded me of the xml-stylesheet processing instruction.
If you are not familiar with xml-stylesheet, it was introduced in June 1999 in a small W3C recommendation and popularized by Internet Explorer 5.0. The processing instruction associates a style sheet (either XSL or CSS) with an XML document. For example:
<?xml version="1.0"?>
<?xml-stylesheet href="classic.xsl" type="text/xml"?>
<?xml-stylesheet href="funky.xsl" type="text/xml" alternate="yes"?>
<article>
   <articleinfo>
      <title>ananas.org</title>
      <!-- rest of the document goes here -->
Generally speaking, processing instructions encode application-specific data. The syntax is already familiar because most XML documents start with the XML declaration, which is written like a special processing instruction. A processing instruction contains a target (xml-stylesheet in the above example), followed by data. It is enclosed in <? and ?> delimiters. The target identifies the application, and an application should ignore processing instructions for targets it does not recognize.
The format of the data is totally free. XML does not specify what goes in there (except, of course, for the XML declaration). In fact, historically, processing instructions would contain PostScript images or scripts ... anything but tags.
Like the declaration, xml-stylesheet has a special status because it is defined by the W3C. It must appear in the prologue of the document (before the first element, that is), and it contains several so-called pseudo-attributes. The data is called pseudo-attributes because the syntax is similar to XML attributes.
The most important pseudo-attribute is href, which contains a URI to the style sheet. Other useful pseudo-attributes are type and alternate. type is the MIME type for the style sheet, and it serves to distinguish between CSS and XSL. If there is more than one xml-stylesheet instruction, alternate indicates which are replacements for the principal style sheet. A processor should prompt the user with the list of alternate style sheets. However, because XM works in batch mode, it uses a different strategy and simply ignores alternate style sheets.
Although xml-stylesheet is a W3C standard, TrAX processors ignore it unless told otherwise. The application must explicitly call getAssociatedStylesheet() to retrieve the processing instructions, as follows:
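Source document = new StreamSource(file),
       stylesheet = factory.getAssociatedStylesheet(document,null,null,null);
if(null != stylesheet)
{
   // compile the associated style sheet, then apply it to the document
   Transformer transformer = factory.newTransformer(stylesheet);
   transformer.transform(document,new StreamResult(System.out));
}
else
   throw new XMException("Cannot find the style sheet");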
Yet getAssociatedStylesheet() brings along two problems that XM must avoid. First, getAssociatedStylesheet() makes it difficult to cache frequently used style sheets. Second, it assumes that the style sheets are stored in the same directory as the documents. I prefer to store style sheets in a different directory because I find it easier to maintain and share style sheets if they are all grouped in one directory.
Passing style sheet parameters, too
Selecting a style sheet is only half of the solution. Often I want to make small variations that do not warrant writing a new style sheet. In that case, my favorite solution is to use parameters, as shown in Listing 1:
Listing 1: Sample parameter
<xsl:stylesheet ...>
<xsl:param name="sponsor" select="'none'"/>
<xsl:template match="articleinfo">
   <xsl:if test="$sponsor='dw'">
      <center>
         <a href="http://www.ibm.com/developerWorks">
            <img align="middle" width="136" height="24" border="0"
                 alt="developerWorks" src="!images/buttons/dw.gif"/>
         </a>
      </center>
   </xsl:if>
   <xsl:apply-templates/>
</xsl:template>
Again, how do you pass the parameters? The W3C has not proposed a mechanism, but it seems logical to define new processing instructions. XM recognizes xm-xsl-param, as well as xml-stylesheet. The syntax for xm-xsl-param is similar to the other processing instruction and uses two pseudo-attributes, name and value:
<?xml version="1.0"?>
<?xm-xsl-param name="sponsor" value="dw"?>
<article>
   <articleinfo>
      <title>XM</title>
Obviously, TrAX does not support xm-xsl-param, but since I have already established that XM needs a replacement for getAssociatedStylesheet(), parsing xm-xsl-param is not much more work.
But, doesn't that mean parsing the document twice? Once for the processing instructions, and once more with the XSLT processor? In practice, parsing twice does not cause much trouble because the processing instructions must appear in the prologue of the document, so XM reparses only a small subset of the document.
ProcessingInstructionHandler and PseudoAttributeTokenizer
ProcessingInstructionHandler is a SAX ContentHandler that extracts xml-stylesheet and xm-xsl-param.
The handler intercepts four events. setDocumentLocator() and startDocument() are used for initialization. The bulk of the work happens in processingInstruction(). As for startElement(), it stops the parsing because it marks the end of the prologue. To stop parsing, startElement() throws an exception. Arguably this borders on hacking; exceptions are normally used to report errors and there are no errors in startElement(), but SAX offers no cleaner solution to stop parsing.
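The stop-at-the-first-element technique can be sketched in a few lines. This is not XM's ProcessingInstructionHandler; the class names PrologScanner, PrologHandler, and EndOfPrologException are hypothetical, and the sketch only collects raw target/data pairs instead of decoding pseudo-attributes. It assumes a JAXP SAX parser that rethrows the handler's exception unchanged, as the parser bundled with the JDK does.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class PrologScanner {
    /** Hypothetical marker exception: it means "prolog finished", not "error". */
    static class EndOfPrologException extends SAXException {
        EndOfPrologException() { super("end of prolog"); }
    }

    /** Collects "target data" strings until the first element starts. */
    static class PrologHandler extends DefaultHandler {
        final List<String> instructions = new ArrayList<>();

        @Override
        public void processingInstruction(String target, String data) {
            instructions.add(target + " " + data);
        }

        @Override
        public void startElement(String uri, String local, String qName,
                                 Attributes atts) throws SAXException {
            // the root element marks the end of the prolog: stop parsing here
            throw new EndOfPrologException();
        }
    }

    /** Returns the processing instructions that appear before the root element. */
    public static List<String> scan(String xml) throws Exception {
        PrologHandler handler = new PrologHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (EndOfPrologException e) {
            // expected: the handler interrupted the parser on purpose
        }
        return handler.instructions;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<?xml version=\"1.0\"?>"
            + "<?xml-stylesheet href=\"classic.xsl\" type=\"text/xml\"?>"
            + "<?xm-xsl-param name=\"sponsor\" value=\"dw\"?>"
            + "<article><?late ignored?></article>";
        for (String instruction : scan(xml))
            System.out.println(instruction);
    }
}
```

Using a dedicated exception class lets the caller tell the deliberate interruption apart from a genuine parsing error, which still propagates as an ordinary SAXException.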
Although the syntax of pseudo-attributes is similar to XML attributes, the SAX parser does not decode them. XM uses its own parser, PseudoAttributeTokenizer, to decode the pseudo-attributes.
PseudoAttributeTokenizer scans the buffer one character at a time, looking for pseudo-attributes. It uses a classic algorithm found in every compiler book. If you are unfamiliar with the topic, I recommend Compiler Construction by Niklaus Wirth of Pascal fame (see Resources).
To simplify the code, the getc() method returns the next character from the buffer, while putc() puts the last character read back in the buffer, where it is available for the next call to getc().
The public interface of PseudoAttributeTokenizer consists of three methods: hasMoreTokens() tests if there are more pseudo-attributes in the buffer, nextName() returns the next name, and nextValue() returns the next value.
Let's examine nextName(). It skips leading spaces with a call to eatSpaces(). Next it loops for as long as it finds digits, letters, or hyphens and accumulates the characters in a variable (token). Because a name contains only these characters, any other character signals the end. nextName() takes special care to put the last character read back in the buffer, where it will be available to nextValue().
Listing 2: nextName() example
public String nextName()
   throws SAXParseException
{
   token.setLength(0);
   int c = eatSpaces();
   for(;;)
      if(c == -1)
         throw new SAXParseException(UNEXPECTED_EOS,locator);
      // strictly speaking a name cannot start with a digit...
      else if(!Character.isLetterOrDigit((char)c) && c != '-')
      {
         putc();   // put it back for the next call
         return token.length() == 0 ? null : token.toString();
      }
      else
      {
         token.append((char)c);
         c = getc();
      }
}
nextValue() is similar, but it first recognizes the equal character (which was left in the buffer by nextName()) and the quotes. nextValue() also decodes predefined entities (&lt;, &gt;, and the like).
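To see name and value scanning work together, here is a simplified, self-contained sketch. It is not the PseudoAttributeTokenizer from the XM code base: the class PseudoAttributes and its parse() method are hypothetical, errors are reported with IllegalArgumentException rather than SAXParseException, and only the five predefined entities are decoded (no character references).

```java
import java.util.LinkedHashMap;
import java.util.Map;

class PseudoAttributes {
    private final String data;
    private int pos = 0;

    private PseudoAttributes(String data) { this.data = data; }

    // getc() always advances, even at the end, so putc() can always undo it
    private int getc() {
        int c = pos < data.length() ? data.charAt(pos) : -1;
        pos++;
        return c;
    }

    private void putc() { pos--; }

    // skips whitespace and returns the first other character (or -1)
    private int eatSpaces() {
        int c = getc();
        while (c != -1 && Character.isWhitespace((char) c))
            c = getc();
        return c;
    }

    private String name() {
        StringBuilder token = new StringBuilder();
        int c = eatSpaces();
        while (c != -1 && (Character.isLetterOrDigit((char) c) || c == '-')) {
            token.append((char) c);
            c = getc();
        }
        putc();   // leave the terminating character for value()
        return token.toString();
    }

    private String value() {
        if (eatSpaces() != '=')
            throw new IllegalArgumentException("expected '=' near offset " + pos);
        int quote = eatSpaces();
        if (quote != '"' && quote != '\'')
            throw new IllegalArgumentException("expected a quote near offset " + pos);
        StringBuilder token = new StringBuilder();
        for (int c = getc(); c != quote; c = getc()) {
            if (c == -1)
                throw new IllegalArgumentException("unterminated value");
            token.append((char) c);
        }
        return decode(token.toString());
    }

    // &amp; is replaced last so that "&amp;lt;" decodes to "&lt;", not "<"
    private static String decode(String s) {
        return s.replace("&lt;", "<").replace("&gt;", ">")
                .replace("&quot;", "\"").replace("&apos;", "'")
                .replace("&amp;", "&");
    }

    /** Decodes the data of a processing instruction into name/value pairs. */
    public static Map<String, String> parse(String data) {
        PseudoAttributes p = new PseudoAttributes(data);
        Map<String, String> result = new LinkedHashMap<>();
        for (;;) {
            String name = p.name();
            if (name.isEmpty()) {
                if (p.eatSpaces() != -1)
                    throw new IllegalArgumentException("unexpected character");
                return result;
            }
            result.put(name, p.value());
        }
    }

    public static void main(String[] args) {
        System.out.println(parse("href=\"classic.xsl\" type=\"text/xml\""));
    }
}
```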
With the tokenizer, it is easy to decode the processing instructions. The following code, excerpted from ProcessingInstructionHandler, recognizes xml-stylesheet. The code for xm-xsl-param is similar:
Listing 3: ProcessingInstructionHandler excerpt
if(target.equals("xml-stylesheet"))
{
   String href = null,
          type = null;
   boolean alternate = false;
   PseudoAttributeTokenizer tokenizer =
      new PseudoAttributeTokenizer(data,locator);
   while(tokenizer.hasMoreTokens())
   {
      String name = tokenizer.nextName(),
             value = tokenizer.nextValue();
      if(name.equals("href"))
         href = value;
      else if(name.equals("alternate"))
         alternate = value.equals("yes");
      else if(name.equals("type"))
         type = value.trim();
      // ignore the media attribute...
   }
   if(type != null && href != null && !alternate &&
      (type.equals("text/xsl") || type.equals("text/xml") ||
       type.equals("application/xml+xslt")))
   {
      this.href = href;
      params.clear();
      readParams = true;
   }
   else
      readParams = false;
}
Remember that XM ignores alternate style sheets. The W3C recommendation allows for HTTP to provide a default style sheet that takes precedence over alternate style sheets. XM uses the same rules and applies its own default style sheet instead of considering alternate style sheets.
Since XM uses more style sheets, the caching logic has been improved. This is the responsibility of TemplatesManager. When the StylingMover requests a Templates object, it is retrieved from the cache, if it's available. If it's not available, TemplatesManager loads the style sheet and caches it. TemplatesManager is essentially a wrapper around java.util.Map with additional methods to return Transformer objects.
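The caching idea can be illustrated with a minimal sketch. The class TemplatesCache and its method names are hypothetical, not the TemplatesManager shipped with XM, and the sketch assumes style sheets are identified by a file name relative to a single rules directory. It relies on the TrAX guarantee that a Templates object is a compiled, reusable style sheet, whereas each Transformer is meant for one thread at a time.

```java
import java.io.File;
import java.nio.file.Files;
import java.util.HashMap;
import java.util.Map;
import javax.xml.transform.Templates;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamSource;

class TemplatesCache {
    private final TransformerFactory factory = TransformerFactory.newInstance();
    private final Map<String, Templates> cache = new HashMap<>();
    private final File rulesDir;

    public TemplatesCache(File rulesDir) {
        this.rulesDir = rulesDir;
    }

    /** Returns the compiled style sheet, compiling it only on the first request. */
    public synchronized Templates getTemplates(String href)
            throws TransformerConfigurationException {
        Templates templates = cache.get(href);
        if (templates == null) {
            templates = factory.newTemplates(
                new StreamSource(new File(rulesDir, href)));
            cache.put(href, templates);
        }
        return templates;
    }

    /** Returns a fresh Transformer backed by the cached Templates. */
    public Transformer getTransformer(String href)
            throws TransformerConfigurationException {
        return getTemplates(href).newTransformer();
    }

    public static void main(String[] args) throws Exception {
        // build a throwaway rules directory holding one empty style sheet
        File rules = Files.createTempDirectory("rules").toFile();
        String xsl = "<xsl:stylesheet version=\"1.0\" "
            + "xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\"/>";
        Files.write(new File(rules, "default.xsl").toPath(), xsl.getBytes("UTF-8"));

        TemplatesCache cache = new TemplatesCache(rules);
        boolean reused =
            cache.getTemplates("default.xsl") == cache.getTemplates("default.xsl");
        System.out.println("second request served from the cache: " + reused);
    }
}
```

A production cache would probably also check the file's modification date before reusing an entry; the sketch leaves that out for brevity.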
As explained previously, XM does not mix documents and style sheets. Rather it uses two directories: the document directory and the rules directory. TrAX offers the URIResolver interface to control how the XSLT processor loads files. A URIResolver is to an XSLT processor what an EntityResolver is to a SAX parser; the processor calls its resolve() method when it loads imported style sheets (through xsl:import or xsl:include elements) or documents (through the document() function).
TemplatesManager uses an inner class, ReferenceResolver, that loads style sheets from the rules directory:
Listing 4: ReferenceResolver example
protected class ReferenceResolver
   implements URIResolver
{
   protected File rulesDir;
   public ReferenceResolver(File rulesDir)
   {
      this.rulesDir = rulesDir;
   }
   public Source resolve(String href,String base)
   {
      if(href.endsWith(".xsl"))
      {
         File file = new File(rulesDir,href);
         if(file.exists())
            return new StreamSource(file);
      }
      return null;
   }
}
Of course, I adapted StylingMover for the new classes. It now parses the document with a ProcessingInstructionHandler handler. It uses the result to select a style sheet and assign parameters, as illustrated in Listing 5. Pay special attention to the try/catch statement; because startElement() uses a special exception to stop parsing, the code must recognize that it is not an error.
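Once the parameters have been extracted, handing them to the XSLT processor takes one call per parameter to the standard Transformer.setParameter() method. The sketch below illustrates only that step and is not the StylingMover code: the class ParameterDemo and its style() method are hypothetical, and the style sheet is a reduced version of the sponsor example that writes plain text.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.Collections;
import java.util.Map;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class ParameterDemo {
    // reduced style sheet: prints the value of the "sponsor" parameter
    static final String XSL =
        "<xsl:stylesheet version=\"1.0\""
        + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
        + "<xsl:output method=\"text\"/>"
        + "<xsl:param name=\"sponsor\" select=\"'none'\"/>"
        + "<xsl:template match=\"/\">sponsor="
        + "<xsl:value-of select=\"$sponsor\"/></xsl:template>"
        + "</xsl:stylesheet>";

    /** Applies the style sheet to the document with the given parameters. */
    public static String style(String xml, String xsl, Map<String, String> params)
            throws TransformerException {
        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(xsl)));
        // each name/value pair overrides the matching top-level xsl:param
        for (Map.Entry<String, String> param : params.entrySet())
            transformer.setParameter(param.getKey(), param.getValue());
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)),
                              new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws TransformerException {
        String xml = "<article><articleinfo><title>XM</title></articleinfo></article>";
        System.out.println(style(xml, XSL, Collections.singletonMap("sponsor", "dw")));
        System.out.println(style(xml, XSL, Collections.<String, String>emptyMap()));
    }
}
```

When no parameter is supplied, the default declared in the style sheet ('none' here) applies, so documents without an xm-xsl-param instruction need no special handling.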
So far the work on XM has involved basic publishing features. Although they are important, I believe the real value of XM is in the automatic content generation I've had in mind from the beginning. In a nutshell, the idea is to let XM generate XML documents on your behalf.
For example, many Web sites include a download section. If the list of files changes frequently, it is difficult to maintain an XML document with a list that is always up to date. It's best to use software to generate the list automatically. The document could look like Listing 6. Likewise a document could be generated from a SQL database, a mailbox, or even a remote Web site!
Listing 6: A directory read by XM
<?xml version="1.0" encoding="UTF-8"?>
<xm:Directory xmlns:xm="http://www.ananas.org/2001/XM/Walk/Directory">
   <xm:File isDirectory="false" isFile="true" isHidden="false" canRead="true"
            isMarked="false" lastModified="2001-07-07T18:21:10" canWrite="true"
            length="749">NotImplementedException.java</xm:File>
   <xm:File isDirectory="false" isFile="true" isHidden="false" canRead="true"
            isMarked="false" lastModified="2001-07-20T11:49:42" canWrite="true"
            length="6229">ContentHandlerExtractor.java</xm:File>
   <xm:File isDirectory="false" isFile="true" isHidden="false" canRead="true"
            isMarked="false" lastModified="2001-09-05T07:10:10" canWrite="true"
            length="2351">JAXPHelper.java</xm:File>
</xm:Directory>
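To give a feel for how such a listing might be produced, here is a rough sketch built on java.io.File and the StAX XMLStreamWriter. It is not the DirectoryReader included with XM, whose design may differ entirely; the class DirectoryListing is hypothetical, and the isMarked attribute is left out because its meaning is specific to XM.

```java
import java.io.File;
import java.io.StringWriter;
import java.text.SimpleDateFormat;
import java.util.Arrays;
import java.util.Date;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

class DirectoryListing {
    static final String NS = "http://www.ananas.org/2001/XM/Walk/Directory";

    /** Describes the content of a directory as an xm:Directory document. */
    public static String list(File dir) throws XMLStreamException {
        File[] files = dir.listFiles();
        if (files == null)
            throw new IllegalArgumentException(dir + " is not a readable directory");
        Arrays.sort(files);   // stable output, whatever the file system order
        SimpleDateFormat iso = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss");

        StringWriter out = new StringWriter();
        XMLStreamWriter writer =
            XMLOutputFactory.newInstance().createXMLStreamWriter(out);
        writer.writeStartDocument();
        writer.writeStartElement("xm", "Directory", NS);
        writer.writeNamespace("xm", NS);
        for (File file : files) {
            writer.writeStartElement("xm", "File", NS);
            writer.writeAttribute("isDirectory", String.valueOf(file.isDirectory()));
            writer.writeAttribute("isFile", String.valueOf(file.isFile()));
            writer.writeAttribute("isHidden", String.valueOf(file.isHidden()));
            writer.writeAttribute("canRead", String.valueOf(file.canRead()));
            writer.writeAttribute("canWrite", String.valueOf(file.canWrite()));
            writer.writeAttribute("lastModified",
                                  iso.format(new Date(file.lastModified())));
            writer.writeAttribute("length", String.valueOf(file.length()));
            writer.writeCharacters(file.getName());   // escapes &, < and > as needed
            writer.writeEndElement();
        }
        writer.writeEndElement();
        writer.writeEndDocument();
        writer.close();
        return out.toString();
    }

    public static void main(String[] args) throws XMLStreamException {
        // lists the directory given on the command line, or the current one
        System.out.println(list(new File(args.length > 0 ? args[0] : ".")));
    }
}
```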
Last month's column introduced Mover to simplify the process of adding automatic content generation. I have included a preliminary version of the directory generation in the code this month, and I plan to revisit it next month. In the meantime, if you are curious, review DirectoryReader, WalkHandler and WalkMover.
I am currently maintaining two Web sites with XM: ananas.org and an internal one. Actual experience from using XM is very helpful in deciding how to change the software. I welcome your input, too, so download a copy of XM and try it as you build your own site. Make sure to report your findings on the ananas-discussion mailing list (see Resources).
I have added the code for the ananas.org Web site (the .xml documents and .xsl style sheets) in the CVS repository to give you a starting point in designing your own Web site.
If you installed an earlier version of XM, you need to update your software to take advantage of this month's improvements: Rename the rules.xsl file as default.xsl and move it into a rules directory. This matches the new criteria for selecting style sheets.
- Participate in the discussion forum.
- You can download the code for this project from ananas.org. Follow the links there to the CVS repository on developerWorks and to the ananas-discussion mailing list. I hope you will join the list and contribute your thoughts to the project.
- If you'd rather have a ZIP file, it's available too.
- XM uses Xalan and Xerces-J, respectively, as XSLT processor and XML parser. IBM (and Lotus) originally developed both and donated the code to the Apache Foundation.
- IBM's DB2 database provides relational database storage, plus pureXML to quickly serve data and reduce your work in the management of XML data.
- Compiler Construction by Niklaus Wirth (ISBN 0-2014-0353-6) is one of the best introductions to parsing. At 180 pages, it's quick to read too.
- Find more XML resources in the developerWorks XML zone.

Benoît Marchal is a consultant and writer based in Namur, Belgium. He is the author of XML by Example, Applied XML Solutions, and XML and the Enterprise. He is a columnist for Gamelan. Details on his latest projects are at marchal.com. You can contact Benoît at bmarchal@pineapplesoft.com.