Build portable XSLT utilities

A practical guide to creating lightweight XML authoring utilities

XML documents created for authoring projects such as help systems, maintenance documentation, and wikis tend to be complex and heavily dependent on inter- and intra-document linking. In this practical guide, create lightweight utilities that help you automate repetitive XML document-creation tasks.

Lewis Marshall (lew@wurmwood.co.uk), Technical Lead, Inmedius Inc.

Lewis Marshall specializes in publishing and editing systems based on XML and SGML specifications. His principal technology interests include the XSL family, xschema, and proprietary editor technologies such as ACL and FOSI. Lewis is a technical lead at Inmedius Inc. You can reach him at lew@wurmwood.co.uk.



25 January 2011

This article focuses on some of the challenges of working with authored XML documents, a term that in this context refers to data sets originated by content creators and typically guided by a DTD or schema. Many environments have adopted guided authoring with XML for various reasons, including consistency, efficiency, and cost savings. With the DTD in place, the expectation of 100% consistency is not unreasonable. If authors are guided (and restricted) by a DTD, then surely the result will be predictable. Right?

Frequently used acronyms

  • DOM: Document Object Model
  • DTD: Document type definition
  • HTML: Hypertext Markup Language
  • URI: Uniform Resource Identifier
  • W3C: World Wide Web Consortium
  • XML: Extensible Markup Language
  • XSLT: Extensible Stylesheet Language Transformations

A close examination of most DTDs, however, reveals a fair degree of flexibility, particularly if the DTD is from a generic source or derived from a standard or specification. This is not a shortcoming of the DTD paradigm, but it can create headaches if the infrastructure surrounding the XML data set (expensive publishing pipelines, document management systems, translation routines) expects documents to be structured in a particular way.

The practical solution is often to enforce a style guide to limit the variances. The expensive way to implement the style guide is to check the data set against it manually; a more effective way is to automate the checking.

What is appropriate for automation?

Content creators have an XML data set and a series of tasks that they need to perform. Can these tasks be automated?

Specialized rules

The aerospace and defense sector has developed a couple of prominent initiatives that formalize this process:

  • Business Rules Exchange. This DTD from the S1000D specification describes a set of business rules for other S1000D documents; typically, the business rules would analyze the technical data in the S1000D data to ensure consistency across a large project.
  • Simplified English. This set of writing rules is based on an approved dictionary of words to minimize ambiguity in technical documentation.

The straightforward answer to whether a task can be automated is usually another question: "Is the task predictable, repeatable, and definable?" The easier checks are those that do not involve parsing textual content but concentrate on the document structure:

  • Does the new section have a cross-reference target? It is highly likely that someone else will link to this topic.
  • Does the list have more than one item? If not, it's not really a list.
  • Does the all-important safety information appear before the task list? It's best to warn the reader about the potential electric shock in advance.
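
As a rough illustration of what "predictable, repeatable, and definable" can mean, the three checks above might be sketched as small functions over a simplified document tree. This is plain JavaScript rather than the XSLT developed later in the article. The node shape ({ name, attrs, children }) and the element names section, a, ol, li, warning, and tasklist are assumptions made for the example, not taken from any particular DTD.

```javascript
// Hypothetical node shape: { name: string, attrs: object, children: array }.
function el(name, attrs, children) {
  return { name: name, attrs: attrs || {}, children: children || [] };
}

// Direct children of a node that have the given element name.
function childrenNamed(node, name) {
  return node.children.filter(function (c) { return c.name === name; });
}

// Check 1: the section carries an anchor (<a name="...">) to link to.
function hasCrossReferenceTarget(section) {
  return childrenNamed(section, "a").some(function (a) {
    return typeof a.attrs.name === "string" && a.attrs.name.length > 0;
  });
}

// Check 2: a list needs at least two items to be a list.
function isRealList(list) {
  return childrenNamed(list, "li").length >= 2;
}

// Check 3: the first warning appears before the first task list.
function safetyComesFirst(section) {
  var names = section.children.map(function (c) { return c.name; });
  var warning = names.indexOf("warning");
  var tasks = names.indexOf("tasklist");
  if (tasks === -1) return true;          // no task list, nothing to guard
  return warning !== -1 && warning < tasks;
}

var section = el("section", {}, [
  el("a", { name: "replace-fuse" }),
  el("warning"),
  el("tasklist"),
  el("ol", {}, [el("li")])
]);

console.log(hasCrossReferenceTarget(section));   // true
console.log(safetyComesFirst(section));          // true
console.log(isRealList(section.children[3]));    // false: only one li
```

Each check looks only at structure, never at the wording of the content, which is what makes it a reasonable candidate for automation.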

You can either identify a known document model or aim to be generic. After all, this is an issue of scalability: If the utility serves one purpose and saves time through automation, then maybe a simple, self-contained script with the business logic tied directly to the code is a reasonable approach. If there is interest in a multi-purpose utility that a user can customize, then maybe a more ambitious, configurable approach is needed. I take the latter option here.


An example utility

XSLT serves as an all-purpose XML processing language. It is not the only choice, however: much XML processing is carried out using DOM techniques, and some of what is done in this article can be replicated with DOM methods. But when you look at taking corrective action, XSLT shows itself to be the ideal tool.

The XSLT example included here is combined with an HTML wrapper to demonstrate how to easily deploy the utility as a stand-alone application. This combination implies that the XSLT should be version 1.0 (see Resources for more information) and that the embedded script is in the Microsoft® JScript® scripting language.

Process a document and return a set of error messages according to business logic

The first step is to capture the business logic. For the purpose of this exercise, you make the checks against the XML source behind this article. The rules are based on the style guide provided for the authors of these documents.

The checks are designed to enforce that written content is complete by checking the structure of the document rather than analyzing the textual content. XPath is the ideal candidate for capturing these types of checks.

Formalize and genericize the checks by encoding in an XML vocabulary

This approach means going through a design process that defines how best to capture document errors, how to categorize the errors, and then how to handle the errors. Why not just embed this process in the XSLT? The benefit of this technique is that after the error checking is encoded in an XML vocabulary, the utility becomes generic code that can handle one or more profiles. A user can select different rule sets for different document data.

XML namespaces

The use of a namespace is not essential; it is a design choice. If you choose not to use one, simply omit the err: namespace prefix in the queries shown. See Resources for a link to a discussion on working with namespaces.

The design process covers the following steps:

  • Define the means by which to test the document. For the purpose of this exercise, the design decision is to use XPath.
  • Define the pass–fail criteria. Use XPath-based document checks to query for the existence of one or more nodes that comply with the XPath expression.
  • Define the severity of the fail. Individual checks can be categorized as:
    • Enforceable. The error-checking process fails for the first instance of this type of check.
    • Advisable. There is no process failure, but the process logs instances as errors.
    • Conditional. A variant of the enforceable check, this check is more complex, because an additional context check is made based on the node returned from the XPath expression test.
  • Create and import a mapping file. The file should use these document checks:
    • Define a namespace for the document—for example:

      <err:document xmlns:err="http://error.com/mynamespace">
    • Create each error definition.
    • Note error checks made against the document at a high level:

      <err:element type="structure" name="dw-document" 
          context="/dw-document" enforce="yes">
    • Note error checks made at the element level:

      <err:element type="element" name="ol" context="./li" pass="&gt;=2"/>

For a full set of sample tests, see Resources.

When you have defined the error-checking syntax, you can define one or more rule sets to be applied to different data sets.
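
To make the idea of several rule sets more concrete, the following sketch models rule sets as plain JavaScript data that mirrors the attributes of err:element (type, name, context, enforce, pass). It is an illustration only: the article stores rules in XML files that the user selects at run time, whereas this example keys them by root element name. The names ruleSets, selectRuleSet, and findIncompleteRules, and the second "manual" profile, are hypothetical.

```javascript
// Each rule mirrors the attributes of an err:element definition.
var ruleSets = {
  // Rules taken from the examples in this article.
  "dw-document": [
    { type: "structure", name: "dw-document", context: "/dw-document", enforce: "yes" },
    { type: "element", name: "ol", context: "./li", pass: ">=2" }
  ],
  // Hypothetical second profile for a different document type.
  "manual": [
    { type: "structure", name: "manual", context: "/manual", enforce: "yes" }
  ]
};

// Pick the rule set for a document by its root element name
// (an assumption; the article lets the user choose a file instead).
function selectRuleSet(rootName) {
  return Object.prototype.hasOwnProperty.call(ruleSets, rootName)
    ? ruleSets[rootName]
    : [];
}

// Report rules that lack the attributes the checks depend on.
function findIncompleteRules(rules) {
  return rules.filter(function (r) {
    var knownType = r.type === "structure" || r.type === "element";
    return !knownType || !r.name || !r.context;
  });
}

console.log(selectRuleSet("dw-document").length);                       // 2
console.log(findIncompleteRules(selectRuleSet("dw-document")).length);  // 0
```

Validating the rule definitions before running them keeps a typo in a rule file from being mistaken for a problem in the document itself.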

Create the XSLT to process the rule set file

The XSLT can potentially have two output streams: the log messages and the refined document source (if corrective action is taken).

XSLT extensions

The implementation of XSLT extensions is vendor specific: Each processor that supports extensions uses its own namespace to define the script element. See Resources for links to information about XSLT extensions.

The design is to use the XSLT output stream to create a new, refined document and an XSLT extension to write the log messages to a separate output stream. The stand-alone example adds the log messages to an HTML logging pane.

The error checks are categorized in two distinct ways: top-level, structural checks and element-level checks. The XSLT first processes the top-level checks; then, if applicable (in other words, if all these checks pass), it processes the document's content using conventional XSLT templates.
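
The two-phase flow described above can be sketched outside XSLT as follows. This is a minimal illustration under stated assumptions, not the article's implementation: runChecks, matched, and countMatches are hypothetical names, countMatches stands in for whatever evaluates an XPath expression and returns a match count, and the "//abstract" expression is invented for the example.

```javascript
// A count passes only if it is a real number of at least 1; an error
// string returned by the evaluator counts as a failure.
function matched(count) {
  return typeof count === "number" && count >= 1;
}

function runChecks(rules, countMatches, elementNames) {
  var log = [];
  var structural = rules.filter(function (r) { return r.type === "structure"; });
  var enforced = structural.filter(function (r) { return r.enforce === "yes"; });
  var advisory = structural.filter(function (r) { return r.enforce !== "yes"; });

  // Phase 1a: an enforced rule that does not match stops the whole run.
  for (var i = 0; i < enforced.length; i++) {
    if (!matched(countMatches(enforced[i].context))) {
      log.push("ERROR: check '" + enforced[i].name + "' failed, stopping");
      return { completed: false, log: log };
    }
    log.push("Check '" + enforced[i].name + "' passed");
  }

  // Phase 1b: the remaining structural rules only produce warnings.
  advisory.forEach(function (r) {
    log.push(matched(countMatches(r.context))
      ? "Check '" + r.name + "' passed"
      : "WARNING: check '" + r.name + "' found no matches");
  });

  // Phase 2: element-level rules, reached only if phase 1 did not stop.
  elementNames.forEach(function (name) {
    rules.forEach(function (r) {
      if (r.type === "element" && r.name === name) {
        log.push("Context checking '" + name + "' (" + r.context + ")");
      }
    });
  });
  return { completed: true, log: log };
}

var rules = [
  { type: "structure", name: "dw-document", context: "/dw-document", enforce: "yes" },
  { type: "structure", name: "abstract", context: "//abstract" },
  { type: "element", name: "ol", context: "./li", pass: ">=2" }
];
// Fake match counts keyed by expression, for demonstration only.
var counts = { "/dw-document": 1, "//abstract": 0 };
var result = runChecks(rules, function (expr) { return counts[expr] || 0; }, ["ol"]);
console.log(result.completed);   // true
console.log(result.log.length);  // 3
```

Returning early from the enforced checks plays the role that a terminating xsl:message plays in the stylesheet: nothing after the failure is processed.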

To create the XSLT, perform the following steps:

  1. Define a script element in the XSLT to define embedded scripting. First, create a logging environment, then create a function to store messages, as in Listing 1.
    Listing 1. Define embedded scripting in the XSLT
    <msxsl:script language="JScript" implements-prefix="xslext">
    <![CDATA[
    
      var messages = new Array();
      var msgct = 0;
    
      function addMsg( msg ){
        messages[msgct++] = msg;
        return "";
      }
    
    ]]>
    </msxsl:script>
  2. Add a template to handle the messages. Listing 2 shows the code.
    Listing 2. Add a template
    <xsl:template name="handlemsg">
    
      <xsl:param name="msg"/>
      <xsl:param name="terminate">no</xsl:param>
      <xsl:param name="lvl">1</xsl:param>
    
      <xsl:variable name="logmsg">
        <!-- Indent the log messages to help with readability -->
        <xsl:choose>
          <xsl:when test="$lvl=2">  &#x2022; </xsl:when>
          <xsl:when test="$lvl=3">    &#x2022; </xsl:when>
          <xsl:when test="$lvl=4">      &#x2022; </xsl:when>
          <xsl:when test="$lvl=5">        &#x2022; </xsl:when>
        </xsl:choose>
        <xsl:value-of select="$msg"/>
      </xsl:variable>
    
      <xsl:variable name="log" select="xslext:addMsg( string( $logmsg ) )"/>
    
      <xsl:if test="$terminate='yes'">
        <xsl:variable name="errormsg"
                      select="xslext:addMsg( 'ERROR: Error checking caused 
                        the process to stop' )"/> 
        <!-- If the error message forces termination, the process must first
             output all existing log messages -->
        <xsl:variable name="output" select="xslext:outputMsgs( $logfileout )"/>
        <xsl:message terminate="yes"></xsl:message>
      </xsl:if>
    
     </xsl:template>

    The template is called from throughout the XSLT to handle the messages sent to the message extension functions.

  3. Use a global document variable against which XPath expressions are evaluated, and create a function to which you can pass an expression. Listing 3 shows the code.
    Listing 3. Create a global variable
    <msxsl:script language="JScript" implements-prefix="xslext">
    <![CDATA[
    
      var xpathdoc = null;
    
      function setUpXPath( ns, trialexpr ){
        var xml = ns.nextNode().xml;
        try{
          xpathdoc = new ActiveXObject( "Msxml2.DOMDocument.3.0" );
          xpathdoc.loadXML( xml );
          return trialexpr + ": " + xpathdoc.selectNodes( trialexpr ).length;
        } catch(e) {
          return "ERROR: " + e.description;
        }
      }
    
    ]]>
    </msxsl:script>

    Listing 3 shows a function that creates a DOM document to use as a context node for further XPath evaluations.

  4. Call this initialization function from within the main body of the XSLT, as in Listing 4.
    Listing 4. Add an initialization function
    <xsl:call-template name="handlemsg">
      <xsl:with-param name="msg">Setup '
        <xsl:value-of select="xslext:setUpXPath( $root, 
                                   concat( '//', name($root) ) )"/>
      '</xsl:with-param>
    </xsl:call-template>

    Note how the extension function is called using the namespace prefix (xslext in this example). This prefix distinguishes this custom function from the standard functions available through XSLT such as number(), string(), and contains().

  5. Process the top-level document tests:
    1. Define a parameter for the ruleset file:

      <xsl:param name="rulesetfile"></xsl:param>

      Supply this parameter as a file URI. The stand-alone example takes a user selection at run time.

    2. Create a template to process each test:

      <xsl:template name="process-check">

      This template works in the following way. First, you create an extension function that uses the xpathdoc as a context node and evaluates the test expression set in the rule file:


      function evalXPath( exp ){
        try{
          return xpathdoc.selectNodes( exp ).length;
        } catch(e) {
          return "Exception: " + e.description;
        }
      }

      If successful, this code returns an integer; it should be at least 1. A zero indicates that the test ran successfully but no matches were found; an error description indicates either that the function threw an exception or that the XPath expression was badly formed.

    3. Call the function, and store the return value in a variable:

      <xsl:variable name="check"
                    select="xslext:evalXPath( string( $context ) )"/>

      where $context is the expression string set for the err:element (for example, /dw-document//meta-dcsubject).

      If the value of $check is at least 1 and the test is marked enforce="yes", then the test has passed.

      If the value of $check is 0 and the test is not marked enforce="yes", then the test has passed, but the user should see a warning.

      Otherwise, the test has failed and the process should halt. You can force the termination with an xsl:message that has terminate set to yes (see Listing 2). To do so, call the handlemsg template with the log message and the terminate parameter set to yes.

    4. Define a nodeset of all enforceable tests to process:

      document($rulesetfile)//err:element[@type='structure'][@enforce='yes']
    5. Process all other top-level tests that are not enforceable:

      document($rulesetfile)//err:element[@type='structure'][not(@enforce='yes')]
  6. Process the element-level tests.

    These tests are processed at the individual templates. To keep the process generic, the XSLT has a simple template to process elements:


    <xsl:template match="node()">

    Within this generic template, you set a variable to determine whether the rule set contains an applicable test:


    <xsl:variable name="match"
                  select="document($rulesetfile)//err:element[@type='element']
                                                             [@name=$name]"/>

    where $name is defined as the name of the current element.

    If $match is not empty, the context expression of the matching test is evaluated using another extension function. This function, similar to the top-level XPath evaluation, takes the current node from the XSLT and evaluates the expression against it, as in Listing 6.

    Listing 6. Function to evaluate an expression
    function evalXPathAgainstNode( node, exp ){
      try{
        return node.nextNode().selectNodes( exp ).length;
      } catch(e) {
        return "Exception: " + e.description;
      }
    }

    If this function returns a value that parses as a number (that is, the return value isn't an error message), the integer is passed to another function that tests the number against the pass–fail criteria defined in the pass attribute:


    <err:element type="element" name="ol" context="./li" 
            pass="&gt;=2" />
  7. Test that the ol element has a number of li children greater than or equal to 2, as in Listing 7.
    Listing 7. Test the number of li elements
    function evalExpr( str, pass ){
      return eval( str + pass );
    }
    ...
    <xsl:variable name="eval" 
                  select="xslext:evalExpr( $check, $pass )"/>
  8. The XSLT returns log results similar to Listing 8.
    Listing 8. XSLT log results
    Start
    Setup '//dw-document: 1'...
     · Check (Top-level document?) '1'
     · Conditional check '(Document ID missing?) '1' (1==1) == true'
     · Conditional check '(Article missing?) '1' (1==1) == true'
     · Conditional check '(Meta field (document type) missing?) '1' (1==1) == true'
     · Conditional check '(Meta field (subject) missing?) '1' (1==1) == true'
     · Conditional check '(Article title missing?) '1' (1==1) == true'
     · Conditional check '(Document author missing?) '1' (1==1) == true'
     · Conditional check '(Published date missing?) '1' (1==1) == true'
     · Check (Missing abstract?) '1'
     · Conditional check '(Dates out of sync?) '0' (00) == 0'
     · Conditional check '(Broken internal links?) '0' (0==0) == true'
     · Context checking 'heading' (./a[@name]) '(1==1) == true'...
     · Error context checking 'heading' (./a[@name]) '(0==1) == false'...
     · Context checking 'heading' (./a[@name]) '(1==1) == true'...
     · Context checking 'ol' (./li) '(3>=2) == true'...
    / End
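
Listing 7 hands the concatenated count and criterion to eval(), which is compact but executes whatever string the rule file supplies. As a hedged alternative, the sketch below parses the criterion explicitly and supports only simple comparisons against an integer. It is plain JavaScript rather than JScript inside msxsl:script, and meetsCriterion and contextLogLine are hypothetical names that do not appear in the downloadable code.

```javascript
// Parse a pass criterion such as ">=2" or "==1" and apply it to a count,
// without handing the string to eval().
function meetsCriterion(count, pass) {
  var m = /^\s*(>=|<=|==|!=|>|<)\s*(\d+)\s*$/.exec(String(pass));
  if (m === null || typeof count !== "number" || isNaN(count)) {
    throw new Error("Unsupported check: " + count + pass);
  }
  var limit = parseInt(m[2], 10);
  switch (m[1]) {
    case ">=": return count >= limit;
    case "<=": return count <= limit;
    case "==": return count === limit;
    case "!=": return count !== limit;
    case ">":  return count > limit;
    default:   return count < limit;   // the only remaining operator is "<"
  }
}

// Build a log line in the style of the element-level lines in Listing 8.
function contextLogLine(name, context, count, pass) {
  var ok = meetsCriterion(count, pass);
  return (ok ? "Context checking '" : "Error context checking '") +
    name + "' (" + context + ") '(" + count + pass + ") == " + ok + "'...";
}

console.log(contextLogLine("ol", "./li", 3, ">=2"));
// Context checking 'ol' (./li) '(3>=2) == true'...
console.log(contextLogLine("heading", "./a[@name]", 0, "==1"));
// Error context checking 'heading' (./a[@name]) '(0==1) == false'...
```

The trade-off is flexibility: eval() accepts any expression, while this version rejects anything beyond the six comparison operators, which may or may not suit a given rule set.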

Where next?

Having built a process that makes checks, identifies errors, and rebuilds the document, you can take the next obvious step: corrective action on the failing element. This example includes basic code to insert content back into the document.

If the rule set contains an err:onfail element as a child of err:element, the code can handle any of the following insert elements:

  • <err:insertbefore></err:insertbefore>
  • <err:insertatstart></err:insertatstart>
  • <err:insertatend></err:insertatend>
  • <err:insertafter></err:insertafter>

The insert element contains XML tags to correct the document—for example:

<err:insertatstart>
        <a name="function:generate-id()" /></err:insertatstart>

The XSLT needs to process this markup and copy it into the output document.

Then, you can create a template to iterate over a nodeset:

<xsl:template name="copy-nodeset">

Pass the contents of the err:insertbefore, err:insertatstart, err:insertatend, and err:insertafter elements to this template at the relevant points in the XSLT, for example:

<!-- Add 'err:insertbefore' here -->
<xsl:element name="{name()}">

  <xsl:copy-of select="@*"/>

  <!-- Add 'err:insertatstart' here -->

  <xsl:apply-templates/>

  <!-- Add 'err:insertatend' here -->

</xsl:element>

<!-- Add 'err:insertafter' here -->

The template has special treatment for the function:generate-id() method.
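
The four insertion points can be sketched on a simplified tree as follows. This is only an illustration in plain JavaScript: the node shape ({ name, attrs, children }), the names copyNode and applyOnFail, and the counter-based identifiers are assumptions. In particular, the article does not spell out how the function:generate-id() placeholder is handled, so replacing it with a generated value here is a guess at the intent, and a real stylesheet would presumably rely on the XSLT generate-id() function instead.

```javascript
var idCounter = 0;

// Deep-copy a node, replacing the function:generate-id() placeholder in
// attribute values with a fresh identifier (a stand-in for generate-id()).
function copyNode(node) {
  var attrs = {};
  Object.keys(node.attrs).forEach(function (key) {
    attrs[key] = node.attrs[key] === "function:generate-id()"
      ? "id" + (++idCounter)
      : node.attrs[key];
  });
  return { name: node.name, attrs: attrs, children: node.children.map(copyNode) };
}

// Returns the sequence of nodes that replaces the failing element.
function applyOnFail(element, onfail) {
  var copy = function (list) { return (list || []).map(copyNode); };
  var fixed = {
    name: element.name,
    attrs: element.attrs,
    children: copy(onfail.insertatstart)
      .concat(element.children)
      .concat(copy(onfail.insertatend))
  };
  return copy(onfail.insertbefore).concat([fixed], copy(onfail.insertafter));
}

var heading = { name: "heading", attrs: {}, children: [] };
var onfail = {
  insertatstart: [
    { name: "a", attrs: { name: "function:generate-id()" }, children: [] }
  ]
};
var out = applyOnFail(heading, onfail);
console.log(out.length);                      // 1
console.log(out[0].children[0].attrs.name);   // id1
```

The original heading object is left untouched; the function builds a corrected copy, much as an XSLT transform writes a new result tree rather than editing its input.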

For completeness, add logging as the content is inserted into the document:

  ...
  · Error context checking 'heading' (./a[@name]) '(0==1) == false'...
  · Adding content at start of 'heading'
  ...

Summary

This article showed how to use XSLT to analyze document structure to determine whether a set of business rules is met. This process can perform an important function in two significant ways. First, it helps content creators meet authoring objectives: for example, users can work offline and run the tests multiple times to verify that they have completed certain tasks. Second, it can be a formal part of a documentation workflow: for example, the utility can be embedded in a document repository workflow, and the pass–fail criteria can control the movement of a managed document between edit, review, and acceptance.

Separating the business logic from the XSLT makes the utility more flexible. The code becomes generic, as multiple rule sets can be applied using a single code base. Using XSLT instead of DOM methods proves powerful, as doing so allows document refinement using the transform process to correct the document.


Download

Description        Name             Size
Example XSLT code  xslt_source.zip  9KB

Resources

Learn

Get products and technologies

Discuss
