Working XML: Processing instructions and parameters

Adding support for multiple style sheets

This month our hardworking columnist adds support for multiple style sheets to the XM content-management project. In so doing, he taps into TrAX URIResolver and writes his own parser for pseudo-attributes. As usual, the complete source code is available in the developerWorks Open source zone.

In the Working XML column, each month Benoît Marchal reports on his progress on one or more open-source XML development projects. You can follow his design decisions and coding choices as he goes along, and you can make suggestions and reuse the open-source code in your own projects.

Benoit Marchal (bmarchal@pineapplesoft.com), Consultant, Pineapple Software

Benoît Marchal is a consultant and writer based in Namur, Belgium. He is the author of several books, including XML and the Enterprise. He is a columnist for Gamelan. Details on his latest projects are at marchal.com. You can contact Benoît at bmarchal@pineapplesoft.com.



01 September 2001


Work on XM continues. This month I have added support for multiple style sheets, thereby addressing the most common suggestion from readers. I have also added an option to pass parameters to style sheets, which just about rounds out the basic publishing functions of XM. As a first step toward more advanced features, I've included a directory reader. (You can download all the relevant code for this and previous columns; see the sidebar, Getting the code.)

Multiple style sheets

The first two versions of XM followed a one-size-fits-all strategy, which originally seemed a good idea but proved ineffective. More specifically, so far XM recognizes only one style sheet, rules.xsl. As explained in the first article in this series, I originally thought I would use different XSLT templates to select changes in styles:

<xsl:template match="db:article">
   <!-- rules for an article here -->
</xsl:template>
<xsl:template match="xm:Directory">
   <!-- rules for a directory here -->
</xsl:template>

Yet my own experience with ananas.org, a site I maintain exclusively with XM, showed that my original plan just doesn't work well. I also heard from readers who suggested I address what they perceived as a limitation. Finally, as I started working on content generation (introduced later in this article), it became evident that it was time to add support for multiple style sheets.

Getting the code

As usual, you can download the code for the column from the CVS repository (see Resources). You can download a ZIP file from the same place. This month, the download includes sample .xml and .xsl files.

Processing instructions

You may recall that ease of use is one of my priorities with XM and, specifically, I do not want to use configuration files or "build scripts" to select which style sheet applies where (review Working XML: Using XSLT for content management for a complete discussion of this requirement).

Readers suggested clever naming conventions that would select a style sheet, but the best suggestion for a solution came from a colleague who reminded me of the xml-stylesheet processing instruction.

If you are not familiar with xml-stylesheet, it was introduced in June 1999 in a small W3C recommendation and popularized by Internet Explorer 5.0. The processing instruction associates a style sheet (either XSL or CSS) with an XML document. For example:

<?xml version="1.0"?>
<?xml-stylesheet href="classic.xsl" type="text/xml"?>
<?xml-stylesheet href="funky.xsl" type="text/xml" alternate="yes"?>
<article>
<articleinfo>
 <title>ananas.org</title>
<!-- rest of the document goes here -->

Generally speaking, processing instructions encode application-specific data. The syntax is already familiar because most XML documents start with the XML declaration, which looks just like a processing instruction (although, strictly speaking, the XML specification treats the declaration as a separate construct). A processing instruction contains a target (xml-stylesheet in the above example), followed by data. It is enclosed in <? and ?> delimiters. The target identifies the application, and an application should ignore processing instructions for targets it does not recognize.

The format of the data is totally free. XML does not specify what goes in there (except, of course, for the XML declaration). In fact, historically, processing instructions would contain PostScript images or scripts ... anything but tags.

Like the declaration, xml-stylesheet has a special status because it is defined by the W3C. It must appear in the prologue of the document (before the first element, that is), and it contains several so-called pseudo-attributes. The data is called pseudo-attributes because the syntax is similar to XML attributes.

The most important pseudo-attribute is href, which contains a URI to the style sheet. Other useful pseudo-attributes are type and alternate. type is the MIME type for the style sheet, and it serves to distinguish between CSS and XSL. If there is more than one xml-stylesheet instruction, alternate indicates which are replacements for the principal style sheet. A processor should prompt the user with the list of alternate style sheets. However, because XM works in batch mode, it uses a different strategy and simply ignores alternate style sheets.

Although xml-stylesheet is a W3C standard, TrAX processors ignore it unless told otherwise. The application must explicitly call getAssociatedStylesheet() to retrieve the style sheet named by the processing instruction, as follows:

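Source document = new StreamSource(file);
// null for media, title, and charset means "no preference"
Source stylesheet = factory.getAssociatedStylesheet(document,null,null,null);

if(null != stylesheet)
{
   // build the transformer from the style sheet that was just located
   Transformer transformer = factory.newTransformer(stylesheet);
   transformer.transform(document,new StreamResult(System.out));
}
else
   throw new XMException("Cannot find the style sheet");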

Yet getAssociatedStylesheet() brings along two problems that XM must avoid. First, getAssociatedStylesheet() makes it difficult to cache frequently used style sheets. Second, it assumes that the style sheets are stored in the same directory as the documents. I prefer to store style sheets in a different directory because I find it easier to maintain and share style sheets if they are all grouped in one directory.

Passing style sheet parameters, too

Selecting a style sheet is only half of the solution. Often I want to make small variations that do not warrant writing a new style sheet. In that case, my favorite solution is to use parameters, as shown in Listing 1:

Listing 1: Sample parameter
<xsl:stylesheet ...>

<xsl:param name="sponsor" select="'none'"/>

<xsl:template match="articleinfo">
   <xsl:if test="$sponsor='dw'">
      <center>
         <a href="http://www.ibm.com/developerWorks">
            <img align="middle" width="136" height="24" border="0"
                alt="developerWorks" src="!images/buttons/dw.gif"/>
         </a>
      </center>
   </xsl:if>
   <xsl:apply-templates/>
</xsl:template>

Again, how do you pass the parameters? The W3C has not proposed a mechanism, but it seems logical to define new processing instructions. XM recognizes xm-xsl-param, as well as xml-stylesheet. The syntax for xm-xsl-param is similar to the other processing instruction and uses two pseudo-attributes, name and value:

<?xml version="1.0"?>
<?xm-xsl-param name="sponsor" value="dw"?>
<article>
   <articleinfo>
      <title>XM</title>

Obviously, TrAX does not support xm-xsl-param, but since I have already established that XM needs a replacement for getAssociatedStylesheet(), parsing xm-xsl-param is not much more work.

But doesn't that mean parsing the document twice, once for the processing instructions and once more with the XSLT processor? In practice, parsing twice does not cause much trouble because the processing instructions must appear in the prologue of the document, so XM reparses only a small subset of the document.
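Once a name and a value have been extracted from xm-xsl-param, handing them to the XSLT processor takes a single TrAX call, Transformer.setParameter(). The following is a minimal sketch of that step only, not the XM code: the ParamDemo class, its inline style sheet, and the decision to wrap checked exceptions are assumptions made to keep the example self-contained.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.Map;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class ParamDemo {
   // Hypothetical style sheet: declares "sponsor" with the same default as Listing 1
   static final String STYLESHEET =
      "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='text'/>"
      + "<xsl:param name='sponsor' select=\"'none'\"/>"
      + "<xsl:template match='/'>sponsor=<xsl:value-of select='$sponsor'/></xsl:template>"
      + "</xsl:stylesheet>";

   /** Applies STYLESHEET to the document, passing each map entry as a parameter. */
   static String transform(String document, Map<String,String> params) {
      try {
         Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
         // parameters that are never set keep the default from xsl:param
         for(Map.Entry<String,String> param : params.entrySet())
            transformer.setParameter(param.getKey(),param.getValue());
         StringWriter out = new StringWriter();
         transformer.transform(new StreamSource(new StringReader(document)),
                               new StreamResult(out));
         return out.toString();
      } catch(TransformerException e) {
         // wrapped only to keep the sketch short
         throw new IllegalStateException(e);
      }
   }

   public static void main(String[] args) {
      System.out.println(transform("<article/>",Map.of("sponsor","dw")));  // sponsor=dw
      System.out.println(transform("<article/>",Map.of()));                // sponsor=none
   }
}
```

Because parameters belong to the Transformer rather than to the compiled style sheet, the same style sheet can serve documents that request different values.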

ProcessingInstructionHandler and PseudoAttributeTokenizer

ProcessingInstructionHandler is a SAX ContentHandler that extracts xml-stylesheet and xm-xsl-param.

The handler intercepts four events. setDocumentLocator() and startDocument() are used for initialization. The bulk of the work happens in processingInstruction(). As for startElement(), it stops the parsing because it marks the end of the prologue. To stop parsing, startElement() throws an exception. Arguably this borders on hacking; exceptions are normally used to report errors and there are no errors in startElement(), but SAX offers no cleaner solution to stop parsing.
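To make the stop-parsing trick concrete, here is a simplified sketch rather than the actual ProcessingInstructionHandler. The names PrologueScanner and StopParsingException are hypothetical, and the sketch merely records each instruction as a string instead of decoding it.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class PrologueScanner extends DefaultHandler {
   /** Hypothetical marker exception: it means "prologue finished", not "error". */
   static class StopParsingException extends SAXException {
      StopParsingException() { super("end of prologue"); }
   }

   final List<String> instructions = new ArrayList<>();

   @Override
   public void processingInstruction(String target,String data) {
      instructions.add(target + " " + data);
   }

   @Override
   public void startElement(String uri,String local,String qName,Attributes atts)
      throws SAXException
   {
      // the first element ends the prologue, so there is nothing left to read
      throw new StopParsingException();
   }

   /** Returns the processing instructions that appear before the root element. */
   static List<String> scan(String xml) {
      PrologueScanner handler = new PrologueScanner();
      try {
         SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)),handler);
      } catch(StopParsingException e) {
         // expected: the handler asked the parser to stop
      } catch(SAXException | IOException | ParserConfigurationException e) {
         throw new IllegalStateException(e);
      }
      return handler.instructions;
   }
}
```

The caller distinguishes the marker exception from genuine parsing errors by catching it first; everything else is still reported as a failure.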

Although the syntax of pseudo-attributes is similar to XML attributes, the SAX parser does not decode them. XM uses its own parser, PseudoAttributeTokenizer, to decode the pseudo-attributes.

PseudoAttributeTokenizer scans the buffer one character at a time, looking for pseudo-attributes. It uses a classic algorithm found in every compiler book. If you are unfamiliar with the topic, I recommend Compiler Construction by Niklaus Wirth of Pascal fame (see Resources).

To simplify the code, the getc() method returns the next character from the buffer, while putc() puts the last character read back into the buffer, where it is available for the next call to getc().

The public interface of PseudoAttributeTokenizer consists of three methods: hasMoreTokens() tests if there are more pseudo-attributes in the buffer, nextName() returns the next name, and nextValue() returns the next value.

Let's examine nextName(). It removes leading spaces with a call to eatSpaces(). Next it loops for as long as it finds digits, letters, or hyphens and accumulates the characters in a variable (token). Because the tokenizer accepts only those characters in a name, any other character signals the end. nextName() takes special care to return the last character read to the buffer, where it will be available for nextValue().

Listing 2: nextName() example
public String nextName()
   throws SAXParseException
{
   token.setLength(0);
   int c = eatSpaces();
   for(;;)
      if(c == -1)
         throw new SAXParseException(UNEXPECTED_EOS,locator);
      // strictly speaking a name cannot start with a digit...
      else if(!Character.isLetterOrDigit((char)c) && c != '-')
      {
         putc();   // put it back for the next call
         return token.length() == 0 ? null : token.toString();
      }
      else
      {
         token.append((char)c);
         c = getc();
      }
}

nextValue() is similar, but it first recognizes the equal character (which was left in the buffer by nextName()) and the quotes. nextValue() also decodes predefined entities (&lt;, &gt;, and the like).
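The getc()/putc() pushback technique can be illustrated with a compact, self-contained sketch. It is not the PseudoAttributeTokenizer source: the PseudoAttributes class and its parse() method are hypothetical, errors are reported with IllegalArgumentException instead of SAXParseException, and numeric character references are left out.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class PseudoAttributes {
   private final String data;
   private int pos = 0;

   PseudoAttributes(String data) { this.data = data; }

   // next character, or -1 at the end; pos always advances so putc() stays symmetric
   private int getc() {
      int c = pos < data.length() ? data.charAt(pos) : -1;
      pos++;
      return c;
   }

   // push the last character back so the next getc() returns it again
   private void putc() { pos--; }

   private int eatSpaces() {
      int c = getc();
      while(c != -1 && Character.isWhitespace((char)c))
         c = getc();
      return c;
   }

   private String name() {
      StringBuilder token = new StringBuilder();
      int c = eatSpaces();
      while(c != -1 && (Character.isLetterOrDigit((char)c) || c == '-')) {
         token.append((char)c);
         c = getc();
      }
      putc();   // leave the delimiter (normally '=') for value()
      return token.length() == 0 ? null : token.toString();
   }

   private String value() {
      if(eatSpaces() != '=')
         throw new IllegalArgumentException("expected '=' near position " + pos);
      int quote = eatSpaces();
      if(quote != '"' && quote != '\'')
         throw new IllegalArgumentException("expected a quote near position " + pos);
      StringBuilder token = new StringBuilder();
      for(int c = getc(); c != quote; c = getc()) {
         if(c == -1)
            throw new IllegalArgumentException("unterminated value");
         token.append((char)c);
      }
      // predefined entities only; &amp; goes last so "&amp;lt;" yields "&lt;"
      return token.toString()
         .replace("&lt;","<").replace("&gt;",">").replace("&quot;","\"")
         .replace("&apos;","'").replace("&amp;","&");
   }

   /** Parses all pseudo-attributes; stops at the end or at a character no name can start with. */
   static Map<String,String> parse(String data) {
      PseudoAttributes p = new PseudoAttributes(data);
      Map<String,String> result = new LinkedHashMap<>();
      for(String name = p.name(); name != null; name = p.name())
         result.put(name,p.value());
      return result;
   }
}
```

For the data of the first xml-stylesheet instruction shown earlier, parse() returns a map with href mapped to classic.xsl and type mapped to text/xml.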

With the tokenizer, it is easy to decode the processing instructions. The following code, excerpted from ProcessingInstructionHandler, recognizes xml-stylesheet. The code for xm-xsl-param is similar:

Listing 3: ProcessingInstructionHandler excerpt
if(target.equals("xml-stylesheet"))
{
   String href = null,
          type = null;
   boolean alternate = false;
   PseudoAttributeTokenizer tokenizer =
      new PseudoAttributeTokenizer(data,locator);
   while(tokenizer.hasMoreTokens())
   {
      String name = tokenizer.nextName(),
             value = tokenizer.nextValue();
      if(name.equals("href"))
         href = value;
      else if(name.equals("alternate"))
         alternate = value.equals("yes");
      else if(name.equals("type"))
         type = value.trim();
      // ignore the media attribute...
   }
   if(type != null && href != null && !alternate &&
      (type.equals("text/xsl") || type.equals("text/xml") ||
       type.equals("application/xml+xslt")))
   {
      this.href = href;
      params.clear();
      readParams = true;
   }
   else
      readParams = false;
}

Remember that XM ignores alternate style sheets. The W3C recommendation allows for HTTP to provide a default style sheet that takes precedence over alternate style sheets. XM uses the same rules and applies its own default style sheet instead of considering alternate style sheets.

TemplatesManager

Since XM uses more style sheets, the caching logic has been improved. This is the responsibility of TemplatesManager. When the StylingMover requests a Templates object, it is retrieved from the cache, if it's available. If it's not available, TemplatesManager loads the style sheet and caches it. TemplatesManager is essentially a wrapper around java.util.Map with additional methods to return Transformer objects.
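The caching idea can be sketched in a few lines. This is an illustration under stated assumptions, not the TemplatesManager source: the TemplatesCache class name is hypothetical, style sheets are keyed by their href relative to a rules directory, and checked exceptions are wrapped for brevity.

```java
import java.io.File;
import java.util.HashMap;
import java.util.Map;
import javax.xml.transform.Templates;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamSource;

class TemplatesCache {
   private final TransformerFactory factory = TransformerFactory.newInstance();
   private final Map<String,Templates> cache = new HashMap<>();
   private final File rulesDir;

   TemplatesCache(File rulesDir) { this.rulesDir = rulesDir; }

   /** Returns the compiled style sheet, compiling it on first use only. */
   synchronized Templates getTemplates(String href) {
      Templates templates = cache.get(href);
      if(templates == null) {
         try {
            templates = factory.newTemplates(new StreamSource(new File(rulesDir,href)));
         } catch(TransformerConfigurationException e) {
            throw new IllegalStateException("Cannot load " + href,e);
         }
         cache.put(href,templates);
      }
      return templates;
   }

   /** A Templates object is reusable; each transformation gets a fresh Transformer. */
   Transformer newTransformer(String href) {
      try {
         return getTemplates(href).newTransformer();
      } catch(TransformerConfigurationException e) {
         throw new IllegalStateException(e);
      }
   }
}
```

Caching Templates rather than Transformer objects matters because a Templates object is the compiled, shareable form of a style sheet, whereas a Transformer carries per-transformation state such as parameters.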

As explained previously, XM does not mix documents and style sheets. Rather, it uses two directories: the document directory and the rules directory. TrAX offers the URIResolver interface to control how the XSLT processor loads files. URIResolver is to an XSLT processor what EntityResolver is to a SAX parser: the processor calls its resolve() method when it loads imported style sheets (through xsl:import or xsl:include elements) or documents (through the document() function).

TemplatesManager uses an inner class, ReferenceResolver, that loads style sheets from the rules directory:

Listing 4: ReferenceResolver example
protected class ReferenceResolver
   implements URIResolver
{
   protected File rulesDir;

   public ReferenceResolver(File rulesDir)
   {
      this.rulesDir = rulesDir;
   }

   public Source resolve(String href,String base)
   {
      if(href.endsWith(".xsl"))
      {
         File file = new File(rulesDir,href);
         if(file.exists())
            return new StreamSource(file);
      }
      return null;
   }
}
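A resolver has no effect until it is registered. The sketch below shows one way to do that with TransformerFactory.setURIResolver(), so that xsl:include and xsl:import are served from a rules directory. It is an assumption-laden illustration, not the XM code: the RulesDirDemo class is hypothetical, and the resolver is written as a lambda with the same logic as Listing 4 minus the .xsl check.

```java
import java.io.File;
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.URIResolver;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class RulesDirDemo {
   /** Compiles mainXsl, loading included style sheets from rulesDir, then runs it. */
   static String transform(File rulesDir,String mainXsl,String document) {
      // returning null tells the processor to fall back on its default lookup
      URIResolver resolver = (href,base) -> {
         File file = new File(rulesDir,href);
         return file.exists() ? new StreamSource(file) : null;
      };
      try {
         TransformerFactory factory = TransformerFactory.newInstance();
         // must be set before the style sheet is compiled
         factory.setURIResolver(resolver);
         Transformer transformer =
            factory.newTransformer(new StreamSource(new StringReader(mainXsl)));
         StringWriter out = new StringWriter();
         transformer.transform(new StreamSource(new StringReader(document)),
                               new StreamResult(out));
         return out.toString();
      } catch(TransformerException e) {
         // wrapped only to keep the sketch short
         throw new IllegalStateException(e);
      }
   }
}
```

Transformer has its own setURIResolver() method, which is the place to register a resolver for document() calls made while a transformation runs.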

StylingMover

Of course, I adapted StylingMover for the new classes. It now parses the document with a ProcessingInstructionHandler handler. It uses the result to select a style sheet and assign parameters, as illustrated in Listing 5. Pay special attention to the try/catch statement; because startElement() uses a special exception to stop parsing, the code must recognize that it is not an error.


Automatic content generation

So far the work on XM has involved basic publishing features. Although they are important, I believe the real value of XM is in the automatic content generation I've had in mind from the beginning. In a nutshell, the idea is to let XM generate XML documents on your behalf.

For example, many Web sites include a download section. If the list of files changes frequently, it is difficult to maintain an XML document with a list that is always up to date. It's best to let software generate the list automatically. The document could look like Listing 6. Likewise, a document could be generated from a SQL database, a mailbox, or even a remote Web site!

Listing 6: A directory read by XM
<?xml version="1.0" encoding="UTF-8"?>
<xm:Directory xmlns:xm="http://www.ananas.org/2001/XM/Walk/Directory">
    <xm:File isDirectory="false" isFile="true" isHidden="false" canRead="true" 
             isMarked="false"  lastModified="2001-07-07T18:21:10" canWrite="true"
             length="749">NotImplementedException.java</xm:File>
    <xm:File isDirectory="false" isFile="true" isHidden="false" canRead="true" 
             isMarked="false" lastModified="2001-07-20T11:49:42" canWrite="true"
             length="6229">ContentHandlerExtractor.java</xm:File>
    <xm:File isDirectory="false" isFile="true" isHidden="false" canRead="true" 
             isMarked="false" lastModified="2001-09-05T07:10:10" canWrite="true"
             length="2351">JAXPHelper.java</xm:File>
</xm:Directory>

Last month's column introduced Mover to simplify the process of adding automatic content generation. I have included a preliminary version of the directory generation in the code this month, and I plan to revisit it next month. In the meantime, if you are curious, review DirectoryReader, WalkHandler and WalkMover.
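To give a feel for what generating such a document involves, here is a rough sketch that produces output in the shape of Listing 6 from java.io.File. It is not DirectoryReader: the DirectoryListing class is hypothetical, it writes with the StAX XMLStreamWriter API, and it omits the XM-specific isMarked attribute.

```java
import java.io.File;
import java.io.StringWriter;
import java.text.SimpleDateFormat;
import java.util.Arrays;
import java.util.Date;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

class DirectoryListing {
   static final String NS = "http://www.ananas.org/2001/XM/Walk/Directory";

   /** Describes the content of a directory as an xm:Directory document. */
   static String list(File directory) {
      File[] files = directory.listFiles();
      if(files == null)
         throw new IllegalArgumentException(directory + " is not a readable directory");
      Arrays.sort(files);   // listFiles() does not guarantee any order
      SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss");
      StringWriter out = new StringWriter();
      try {
         XMLStreamWriter xml = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
         xml.writeStartDocument();
         xml.writeStartElement("xm","Directory",NS);
         xml.writeNamespace("xm",NS);
         for(File file : files) {
            xml.writeStartElement("xm","File",NS);
            xml.writeAttribute("isDirectory",String.valueOf(file.isDirectory()));
            xml.writeAttribute("isFile",String.valueOf(file.isFile()));
            xml.writeAttribute("isHidden",String.valueOf(file.isHidden()));
            xml.writeAttribute("canRead",String.valueOf(file.canRead()));
            xml.writeAttribute("canWrite",String.valueOf(file.canWrite()));
            xml.writeAttribute("lastModified",format.format(new Date(file.lastModified())));
            xml.writeAttribute("length",String.valueOf(file.length()));
            xml.writeCharacters(file.getName());
            xml.writeEndElement();
         }
         xml.writeEndElement();
         xml.writeEndDocument();
         xml.close();
      } catch(XMLStreamException e) {
         throw new IllegalStateException(e);
      }
      return out.toString();
   }
}
```

A style sheet can then match xm:Directory and xm:File to render the download list, just like any other XML document.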


Your turn

I am currently maintaining two Web sites with XM: ananas.org and an internal one. Actual experience from using XM is very helpful in deciding how to change the software. I welcome your input, too, so download a copy of XM and try it as you build your own site. Make sure to report your findings on the ananas-discussion mailing list (see Resources).

I have added the code for the ananas.org Web site (the .xml documents and .xsl style sheets) in the CVS repository to give you a starting point in designing your own Web site.

If you installed an earlier version of XM, you need to update your software to take advantage of this month's improvements: Rename the rules.xsl file as default.xsl and move it into a rules directory. This matches the new criteria for selecting style sheets.

Resources

  • You can download the code for this project from ananas.org. Follow the links there to the CVS repository on developerWorks and to the ananas-discussion mailing list. I hope you will join the list and contribute your thoughts to the project.
  • If you'd rather have a ZIP file, it's available too.
  • XM uses Xalan and Xerces-J, respectively, as XSLT processor and XML parser. IBM (and Lotus) originally developed both and donated the code to the Apache Foundation.
  • IBM's DB2 database provides relational database storage, plus pureXML to quickly serve data and reduce your work in the management of XML data.
  • Compiler Construction by Niklaus Wirth (ISBN 0-201-40353-6) is one of the best introductions to parsing. At 180 pages, it's quick to read too.
  • Find more XML resources in the developerWorks XML zone.
