
Use custom collations in XSLT 2.0

Create custom sort functions with XSLT extensions and the open-source Saxon processor

Doug Tidwell (dtidwell@us.ibm.com), Technology Evangelist, IBM, Software Group
Doug Tidwell is a strategist for IBM Software Group. As a technology evangelist, his job focuses on emerging technologies such as SCA, SDO and XForms, helping people use tomorrow's technologies to solve today's problems. He is the author of O'Reilly's XSLT, a second edition of which should be available in bookstores in time for Valentine's Day 2008 (ISBN 0596527217). A speaker at the first XML conference in 1997, he has worked with XSLT for about a decade, including some of the earliest approaches to XML transformations. He is currently writing a book about the inventor of the fruit smoothie, Dutch citrus merchant Julius of Orange. Doug lives in Chapel Hill, North Carolina, with his wife, food writer Sheri Castle, their daughter Lily, and their dog Domino, The Supine Canine.

Summary:  One emphasis of XSLT 2.0 is better support for internationalization, especially sorting and comparing text. This seemingly simple task is quite complicated in some languages; for example, accented characters can be considered the same or different depending on context. Are Á, À and A the same letter? Sometimes the answer needs to be yes, despite the fact that they are three different code points. The simple string comparison functions found in most languages (including XSLT 1.0) aren't up to the task. This article demonstrates how to write a custom collation function and invoke it from an XSLT 2.0 stylesheet.

Date:  27 Nov 2007
Level:  Advanced

Custom collations

In this article you'll use some of the new features of XSLT 2.0 and XPath 2.0. XSLT 2.0 has a number of functions and elements that allow you to specify a collation. A collation is the heart of any sorting algorithm: a collation function compares two items and returns an integer whose sign tells you how they are ordered. If the first item sorts before the second, the function returns a value less than zero. If the two items are equal, the function returns zero. If the first item sorts after the second, the return value is greater than zero.
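To make the three-way contract concrete, here is a minimal Java sketch. It uses the standard java.text.Collator for English rather than any custom collation, and the class name ComparisonContractDemo is invented for illustration. Only the sign of the result matters, so the helper reduces it to -1, 0 or 1.

```java
import java.text.Collator;
import java.util.Locale;

final class ComparisonContractDemo {
    // Reduce a collator result to -1, 0 or 1; callers should rely only on the sign.
    static int sign(String first, String second) {
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        return Integer.signum(collator.compare(first, second));
    }

    public static void main(String[] args) {
        System.out.println(sign("apple", "banana"));   // first sorts before second
        System.out.println(sign("banana", "banana"));  // the two are equal
        System.out.println(sign("cherry", "banana"));  // first sorts after second
    }
}
```

Any class that honors this contract can drive a sort, which is why the custom collations later in the article only have to supply a comparison.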

The examples in this article use the Java-based Saxon XSLT 2.0 processor. Saxon implements the XSLT 2.0 specification (its author, Michael Kay, was the editor of the XSLT 2.0 spec), including custom collations. To use a custom collation with Saxon, you specify the name of the Java class that implements the collation function.

We'll cover three examples that:

  • Sort a list of Spanish words
  • Compare German words
  • Sort a list of bands and musicians, and ignore the text "The " at the start of the band name

More on Spanish and German collations

Your author is in no way a speaker of Spanish or German, so please pardon any incorrect statements about the languages themselves. The point here is to illustrate how to create extensions that implement custom collations and then use those extensions to sort and compare text in your stylesheets.

The traditional Spanish collation, the one you'll implement here, treats ch, ll and ñ as separate letters that sort after c, l and n respectively. However, much of the Spanish-speaking world now uses the modern Spanish collation, defined by the Association of Spanish Language Academies (Asociación de Academias de la Lengua Española). The modern Spanish collation doesn't treat ch or ll as special characters; they sort as they would in English. The letter ñ still sorts after the letter n.

To sort German, you can choose from three collations: DIN-1, DIN-2 and Austrian. (The DIN standards are defined by the standards body Deutsches Institut für Normung.) The collation used varies from one country to the next. Typically DIN-1 is used to sort words, although in Switzerland it's also used to sort names. DIN-2, the collation algorithm you'll implement here, is used to sort names in Germany. Austria uses the Austrian collation, although it seems to be disappearing in favor of the DIN-2 rules. The main complication for the DIN-2 algorithm you'll implement here is that ä is equal to ae, ö is equal to oe, ß is equal to ss and ü is equal to ue. The code has to realize that two characters in one word are equivalent to one character in another word.


Sorting a list of Spanish words

The first custom collator is one to sort Spanish words. The Spanish alphabet contains 30 letters; in addition to the 26 basic letters of Western European languages, ch (che), ll (elle), ñ (eñe) and rr (erre) are considered separate letters as well. The traditional Spanish collation sorts words beginning with ch after anything starting with cz, words beginning with ll after anything starting with lz and words starting with ñ after any word starting with n. The letter rr doesn't sort in any special way. Our Spanish custom collation implements these rules.

Here is the list of Spanish words:


Listing 1. A list of Spanish words
<?xml version="1.0"?>
<!-- spanish-words.xml -->
<wordlist>
  <word>campo</word>
  <word>luna</word>
  <word>ciudad</word>
  <word>llaves</word>
  <word>chihuahua</word>
  <word>arroz</word>
  <word>limonada</word>
</wordlist>
      

Defining a custom collation algorithm seems like a daunting task. Fortunately, Java defines a class named RuleBasedCollator. You'll define a rule string that indicates the order in which letters should be sorted and let Java do the rest of the work. To create a Java class that extends RuleBasedCollator, the rule string and the constructor function are all you have to write. Here's what the code looks like, including the rule string:


Listing 2. Source code for the Spanish collator
package com.oreilly.xslt;

import java.text.ParseException;
import java.text.RuleBasedCollator;

public class SpanishCollation extends RuleBasedCollator
{
  public SpanishCollation() throws ParseException
  {
    super(traditionalSpanishRules);
  }
  
  private static String smallnTilde  = new String("\u00F1");
  private static String capitalNTilde = new String("\u00D1");

  private static String traditionalSpanishRules =
    ("< a,A < b,B < c,C < ch, cH, Ch, CH "  +
     "< d,D < e,E < f,F < g,G < h,H < i,I " + 
     "< j,J < k,K < l,L < ll, lL, Ll, LL "  +
     "< m,M < n,N " +
     "< " + smallnTilde + "," + capitalNTilde + " " +
     "< o,O < p,P < q,Q < r,R < s,S < t,T " + 
     "< u,U < v,V < w,W < x,X < y,Y < z,Z");
}
      

You use the rule string to define the order in which characters are sorted. In Listing 2, notice the many sets of lowercase and uppercase characters. The less-than signs between them indicate that a and A appear before b and B. The che and elle are defined along with the character groups, even though they are two characters instead of one. When the Java runtime sorts information, the rule here tells it to process ll as a separate letter between l and m.

The two other special rules define that all uppercase and lowercase combinations of ch appear between c and d and that ñ and Ñ appear between n and o.
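Before wiring the rules into a stylesheet, you can try them in plain Java. The sketch below copies the rule string from Listing 2 (with the tilde letters written as Unicode escapes) and sorts the words from Listing 1. The class name SpanishRulesDemo and its helper method are invented for this illustration and are not part of the article's code.

```java
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.Arrays;

final class SpanishRulesDemo {
    // Rule string from Listing 2; \u00F1 and \u00D1 are the small and capital n with tilde.
    static final String RULES =
        "< a,A < b,B < c,C < ch, cH, Ch, CH "
      + "< d,D < e,E < f,F < g,G < h,H < i,I "
      + "< j,J < k,K < l,L < ll, lL, Ll, LL "
      + "< m,M < n,N < \u00F1,\u00D1 "
      + "< o,O < p,P < q,Q < r,R < s,S < t,T "
      + "< u,U < v,V < w,W < x,X < y,Y < z,Z";

    // Return a sorted copy of the words, ordered by the traditional Spanish rules.
    static String[] sortTraditional(String... words) {
        try {
            String[] sorted = words.clone();
            Arrays.sort(sorted, new RuleBasedCollator(RULES));
            return sorted;
        } catch (ParseException e) {
            throw new IllegalStateException("Rule string did not parse", e);
        }
    }

    public static void main(String[] args) {
        String[] sorted = sortTraditional(
            "campo", "luna", "ciudad", "llaves", "chihuahua", "arroz", "limonada");
        System.out.println(String.join(", ", sorted));
    }
}
```

Because ch and ll are treated as single letters, chihuahua lands after ciudad and llaves lands after luna, which is the behavior the rule string is meant to produce.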

Now that you've defined the rules to sort Spanish words, it's time to use that code in your XSLT 2.0 stylesheet. XSLT 2.0 defines a number of places you can ask for a custom collation in your stylesheet. For example, the <xsl:sort> element now has a collation attribute. What the XSLT 2.0 spec does not define is how to use the contents of that attribute. The Saxon processor requires the attribute to have this format:

collation="http://saxon.sf.net/collation?class=com.oreilly.xslt.SpanishCollation;"

Saxon requires the collation URI to begin with http://saxon.sf.net/collation?, followed by the keyword class= and the fully-qualified name of the class. The class is loaded at runtime, so it must be accessible from the Java classpath. If another XSLT 2.0 processor supports custom collations (I'm not aware of any, as of October 2007), the format of the collation attribute will be different.
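Because the class is resolved by name at runtime, a classpath mistake only shows up when the transformation runs. The following sketch is one way to check a class name ahead of time. It rests on two assumptions that this article does not state: that the processor creates the class through a public no-argument constructor, and that it expects a java.util.Comparator. Both hold for the classes in this article, since RuleBasedCollator is itself a Comparator. The names CollationClassCheck and SampleCollation are hypothetical.

```java
import java.util.Comparator;

final class CollationClassCheck {
    /** Hypothetical stand-in for a real collation class on the classpath. */
    public static final class SampleCollation implements Comparator<String> {
        @Override
        public int compare(String a, String b) {
            return a.compareTo(b);
        }
    }

    // True if the class loads, has a public no-argument constructor,
    // and implements java.util.Comparator (assumed requirements, see above).
    static boolean looksUsable(String className) {
        try {
            Class<?> candidate = Class.forName(className);
            candidate.getConstructor(); // throws NoSuchMethodException if absent
            return Comparator.class.isAssignableFrom(candidate);
        } catch (ReflectiveOperationException e) {
            return false; // not on the classpath, or no suitable constructor
        }
    }

    public static void main(String[] args) {
        System.out.println(looksUsable(SampleCollation.class.getName()));
        System.out.println(looksUsable("com.example.NoSuchCollation"));
    }
}
```

Run the check with the same classpath you pass to the XSLT processor; a false result for your own class name usually points to a missing jar or directory.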

Here's the stylesheet that invokes the com.oreilly.xslt.SpanishCollation class:


Listing 3. Invoking a collation class with an XSLT 2.0 stylesheet
<?xml version="1.0"?>
<!-- custom-collation-spanish.xsl -->
<xsl:stylesheet version="2.0" 
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <xsl:output method="html"/>

  <xsl:variable name="words" as="xs:string*" 
    select="wordlist/word"/>
  
  <xsl:variable name="normally_sorted_words" as="xs:string*">
    <xsl:perform-sort select="$words">
      <xsl:sort select="."/>
    </xsl:perform-sort>
  </xsl:variable>
  
  <xsl:variable name="usefully_sorted_words" as="xs:string*">
    <xsl:perform-sort select="$words">
      <xsl:sort select="."
collation="http://saxon.sf.net/collation?class=com.oreilly.xslt.SpanishCollation;"/>
    </xsl:perform-sort>
  </xsl:variable>
  
  <xsl:template match="/">
    <html>
      <head>
        <title>Sorting with a custom collation</title>
      </head>
      <body style="font-family: sans-serif; font-size: 12pt;">
        <h1>Sorting with a custom collation</h1>
        <p>Here is a table that uses a <i>custom collation</i>
        to sort words according to the traditional rules of 
        Spanish.</p>
        <table cellpadding="5" width="50%"
          style="font-weight: bold;">
          <tr style="font-size: 120%; font-style: italic; 
                     text-align: center;">
            <td>Original words</td>
            <td>Normally sorted words</td>
            <td>Spanish sorted words</td>
          </tr>
          <xsl:for-each select="1 to count($words)">
            <tr style="background: {if (. mod 2 = 1) 
                                    then 'gray' else 'white'};
                       color: {if (. mod 2 = 1) 
                               then 'white' else 'black'};">
              <td style="border: solid white 6px;">
                <xsl:value-of 
                  select="., '. ', subsequence($words, ., 1)" 
                  separator=""/>
              </td>
              <td style="border: solid white 6px;">
                <xsl:value-of 
                  select="index-of($words, 
                          subsequence($normally_sorted_words, ., 1))"/>
                <xsl:text>. </xsl:text>
                <xsl:value-of 
                  select="subsequence($normally_sorted_words, ., 1)"/>
              </td>
              <td style="border: solid white 6px;">
                <xsl:value-of 
                  select="index-of($words, 
                          subsequence($usefully_sorted_words, ., 1))"/>
                <xsl:text>. </xsl:text>
                <xsl:value-of 
                  select="subsequence($usefully_sorted_words, ., 1)"/>
              </td>
            </tr>
          </xsl:for-each>
        </table>
      </body>
    </html>
  </xsl:template>

</xsl:stylesheet>
      

You invoke the stylesheet engine with this command:

java net.sf.saxon.Transform spanish-words.xml custom-collation-spanish.xsl

This command writes the results to standard output. If you'd like to capture those results in a file, use the -o option. The command java net.sf.saxon.Transform -o results.html spanish-words.xml custom-collation-spanish.xsl writes the results to the file results.html.

The stylesheet creates three sequences of strings. The first is the list of words from the XML input file. The second sequence is the list of words sorted with the default sorting algorithm. The third uses the custom Spanish collation function. The <xsl:for-each> element writes the word lists to an HTML table. The HTML document looks like Figure 1:


Figure 1. The list of words, sorted in two different ways

Notice several XSLT 2.0 features here. First of all, you use the <xsl:perform-sort> element to sort the items in the sequence variables. That tells the XSLT 2.0 processor to process the <xsl:sort> elements against the values in the sequence. You use <xsl:sort> to invoke the custom collation class.

The <xsl:for-each> element iterates from 1 to the number of words in the word list. For the rows of the table, every other row has a gray background with white text. To generate this code, you use attribute value templates (the code inside the curly braces { }) with the current value of the context item. Each table row is defined like this:

<tr style="background: {if (. mod 2 = 1) 
                        then 'gray' else 'white'};
           color: {if (. mod 2 = 1) 
                   then 'white' else 'black'};">
      

If dividing the context item by 2 leaves a remainder of 1, the stylesheet generates the CSS code background: gray; color: white;. Otherwise, the cells in the current row have a white background and black text. The mod operator is extremely useful for cycling through a set of values.
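The same cycling idea carries over to any language with a remainder operator. As a rough Java sketch (the class RowStriping and both helper methods are invented for illustration), the first method mirrors the stylesheet's two-color test, and the second generalizes it to any number of values using 1-based positions like those in the stylesheet.

```java
final class RowStriping {
    // Mirrors the stylesheet: odd rows (1, 3, 5, ...) are gray with white text.
    static String rowStyle(int row) {
        boolean odd = row % 2 == 1;
        return "background: " + (odd ? "gray" : "white")
             + "; color: " + (odd ? "white" : "black") + ";";
    }

    // Cycle through any number of values; position is 1-based.
    static String cycle(int position, String... values) {
        return values[(position - 1) % values.length];
    }

    public static void main(String[] args) {
        for (int row = 1; row <= 4; row++) {
            System.out.println(row + ": " + rowStyle(row));
        }
        System.out.println(cycle(4, "red", "green", "blue"));
    }
}
```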

Finally, notice that the first column numbers the words in the order they appear in the XML source document. For the other two columns, each word has the same number it had in the first column, regardless of where it appears. In the example, chihuahua is the fifth word in the original word list, so it always has the number 5 beside it. The word chihuahua appears in the third row of column 2 and the fourth row of column 3, but it is displayed as word number 5. Here's how to do that:

<td style="border: solid white 6px;">
  <xsl:value-of 
    select="index-of($words, 
            subsequence($normally_sorted_words, ., 1))"/>
  <xsl:text>. </xsl:text>
  <xsl:value-of 
    select="subsequence($normally_sorted_words, ., 1)"/>
</td>
      

Each cell contains three things: a number, a period and the value of the current word. Generating the number is the difficult part. For columns 2 and 3 (the code here handles column 2), you have to find the index of the word in the original word list (stored in the variable $words). To do that, use the index-of() and subsequence() functions. I use subsequence() to retrieve the current word from the sorted sequence. The second argument to subsequence() is the starting position; the dot represents the context item. The third argument is the number of items to return. Given that word, index-of() returns its position in the original sequence.
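If the XPath lookup feels abstract, the following Java sketch performs the same bookkeeping for column 2: it sorts a copy of the list in plain code point order, then labels each word with its 1-based position in the original list. The class OriginalPositions is invented for illustration. One difference to keep in mind is that List.indexOf() returns only the first match, while index-of() returns every matching position, so the two behave alike only when the words are unique.

```java
import java.util.Arrays;
import java.util.List;

final class OriginalPositions {
    // Label each word, in sorted order, with its 1-based position in the original list.
    static String labelled(String... original) {
        List<String> source = Arrays.asList(original);
        String[] sorted = original.clone();
        Arrays.sort(sorted); // plain code point order, like the default sort here
        StringBuilder out = new StringBuilder();
        for (String word : sorted) {
            if (out.length() > 0) {
                out.append(", ");
            }
            // indexOf is 0-based; XPath positions are 1-based.
            out.append(source.indexOf(word) + 1).append(". ").append(word);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(labelled(
            "campo", "luna", "ciudad", "llaves", "chihuahua", "arroz", "limonada"));
    }
}
```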


Comparing strings with a German collation

The next example uses a custom collator to compare German words. The complication here is that four German characters can have more than one representation (note the upper- and lowercase letters):

Single character                                 Two-character equivalent
ä (a with an umlaut) or Ä (A with an umlaut)     ae or AE
ö (o with an umlaut) or Ö (O with an umlaut)     oe or OE
ß (sharp s)                                      ss
ü (u with an umlaut) or Ü (U with an umlaut)     ue or UE

Sometimes, you want to consider these characters as equal. In other words, the word Strasse (street) is identical to the word Straße. Obviously a character-by-character comparison using the standard collation says these are not equal, so you'll need to use a custom collation.

As with the Spanish collation, you simply define rules that say how to compare characters:


Listing 4. Source code for the German custom collation
package com.oreilly.xslt; 

import java.text.ParseException;
import java.text.RuleBasedCollator;

public class GermanCollation extends RuleBasedCollator
{
  public GermanCollation() throws ParseException
  {
    super(traditionalGermanRules);
  }
  
  private static String sharpS  = new String("\u00DF");
  private static String uppercaseUmlautA = new String("\u00C4");
  private static String lowercaseUmlautA = new String("\u00E4");
  private static String uppercaseUmlautO = new String("\u00D6");
  private static String lowercaseUmlautO = new String("\u00F6");
  private static String uppercaseUmlautU = new String("\u00DC");
  private static String lowercaseUmlautU = new String("\u00FC");

  private static String traditionalGermanRules =
    ("< a,A " + 
     "<" + lowercaseUmlautA + "=ae " +
     "<" + uppercaseUmlautA + "=AE " +
     "< b,B < c,C < d,D < e,E < f,F " +
     "< g,G < h,H < i,I < j,J < k,K " +
     "< l,L < m,M < n,N < o,O " + 
     "<" + lowercaseUmlautO + "=oe " +
     "<" + uppercaseUmlautO + "=OE " + 
     "< p,P < q,Q < r,R < s,S " + 
     "< ss=" + sharpS + 
     "< t,T < u,U " + 
     "<" + lowercaseUmlautU + "=ue " + 
     "<" + uppercaseUmlautU + "=UE " + 
     "< v,V < w,W < x,X < y,Y < z,Z");
}
      

The code in Listing 4 uses the equals sign to indicate that certain characters and character groups are equivalent. (The Spanish collator used only the less-than sign.) The string of rules includes items such as ... s,S < ss=ß < t,T .... This tells the collator that the two characters ss are equal to the single character ß. (The strings that represent each of the special characters make the code easier to read.)
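You can exercise the equivalences directly in Java before involving a stylesheet. This sketch reuses the rule string from Listing 4 with the special characters written as Unicode escapes, then compares word pairs. The class GermanRulesDemo and the method sameUnderRules() are invented for illustration, and the sketch assumes the input strings use precomposed characters such as \u00FC rather than a base letter followed by a combining mark.

```java
import java.text.ParseException;
import java.text.RuleBasedCollator;

final class GermanRulesDemo {
    // Rule string from Listing 4, with the umlauts and sharp s written as escapes.
    static final String RULES =
        "< a,A < \u00E4=ae < \u00C4=AE "
      + "< b,B < c,C < d,D < e,E < f,F "
      + "< g,G < h,H < i,I < j,J < k,K "
      + "< l,L < m,M < n,N < o,O < \u00F6=oe < \u00D6=OE "
      + "< p,P < q,Q < r,R < s,S < ss=\u00DF "
      + "< t,T < u,U < \u00FC=ue < \u00DC=UE "
      + "< v,V < w,W < x,X < y,Y < z,Z";

    // True if the collator built from RULES reports the two words as equal.
    static boolean sameUnderRules(String first, String second) {
        try {
            return new RuleBasedCollator(RULES).compare(first, second) == 0;
        } catch (ParseException e) {
            throw new IllegalStateException("Rule string did not parse", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(sameUnderRules("Stra\u00DFe", "Strasse"));
        System.out.println(sameUnderRules("M\u00FCnchen", "Muenchen"));
        System.out.println(sameUnderRules("Berlin", "Stuttgart"));
    }
}
```

A plain String.equals() call on the same pairs reports them as different, which is the gap the custom collation closes.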

The stylesheet looks like this:


Listing 5. A stylesheet that uses a German collation function
<?xml version="1.0"?>
<!-- custom-collation-german.xsl -->
<xsl:stylesheet version="2.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <xsl:output method="html"/>

  <xsl:variable name="wordgroup1" as="xs:string*" 
    select="wordlist/wordgroup/word[1]"/>
  
  <xsl:variable name="wordgroup2" as="xs:string*" 
    select="wordlist/wordgroup/word[2]"/>
  
  <xsl:template match="/">
    <html>
      <head>
        <title>Comparing words with a custom collation</title>
      </head>
      <body style="font-family: sans-serif; font-size: 12pt;">
        <h1>Comparing words with a custom collation</h1>
        <p>This table illustrates what happens when you use 
        a <i>custom collation</i> to compare German words:</p>

        <table cellpadding="5" width="50%"
          style="font-weight: bold; text-align: center;">
          <tr style="font-size: 120%; font-style: italic; 
                     vertical-align: bottom;">
            <td>First word</td>
            <td>Second word</td>
            <td>Compared normally</td>
            <td>Compared with <br/>German (DIN-2) rules</td>
          </tr>
          <xsl:for-each select="1 to count($wordgroup1)">
            <xsl:variable name="word1" 
              select="subsequence($wordgroup1, ., 1)"/>
            <xsl:variable name="word2" 
              select="subsequence($wordgroup2, ., 1)"/>

            <tr style="background: {if (. mod 2 = 1) 
                                    then 'gray' else 'white'};
                       color: {if (. mod 2 = 1) 
                               then 'white' else 'black'};">
              <td style="border: solid white 6px;">
                <xsl:value-of select="$word1"/>
              </td>
              <td style="border: solid white 6px;">
                <xsl:value-of select="$word2"/>
              </td>
              <td style="border: solid white 6px;">
                <xsl:value-of 
                  select="if (compare($word1, $word2) = 0)
                          then 'Equal'
                          else 'Not equal'"/>
              </td>
              <td style="border: solid white 6px;">
                <xsl:value-of 
                  select="if (compare($word1, $word2,
'http://saxon.sf.net/collation?class=com.oreilly.xslt.GermanCollation;')
                              = 0)
                          then 'Equal'
                          else 'Not equal'"/>
              </td>
            </tr>
          </xsl:for-each>
        </table>
      </body>
    </html>
  </xsl:template>

</xsl:stylesheet>
      

This stylesheet is very similar to the Spanish collation example. The main difference here is that you're using the compare() function directly. Here is the list of German words:


Listing 6. A list of German words to compare
<?xml version="1.0"?>
<!-- german-words.xml -->
<wordlist>
  <wordgroup>
    <word>Berlin</word>
    <word>Stuttgart</word>
  </wordgroup>
  <wordgroup>
    <word>Straße</word>
    <word>Strasse</word>
  </wordgroup>
  <wordgroup>
    <word>Böblingen</word>
    <word>Boeblingen</word>
  </wordgroup>
  <wordgroup>
    <word>München</word>
    <word>Muenchen</word>
  </wordgroup>
</wordlist>
      

Each <wordgroup> contains two words. The two words in the first group are different under either comparison, while the words in each of the other groups are equal under the German collation. The stylesheet creates two sequences. The first sequence ($wordgroup1) contains all of the first <word> elements, while the second sequence ($wordgroup2) contains all of the second words. Within the <xsl:for-each> element, the stylesheet iterates through the sequences of words. For each row in the table, the two words are stored in the variables $word1 and $word2.

Here's the command to invoke the stylesheet:

java net.sf.saxon.Transform german-words.xml custom-collation-german.xsl

Figure 2 shows the results:


Figure 2. Comparing words with German rules

A custom string sorter

The last example will sort a list of bands and musicians. You need a custom collator here to sort items while ignoring the text "The ". In other words, you want The Beatles to sort as if it were the string Beatles. You can't do this with a set of rules because you can't include a space as part of the characters you want to sort in a special way. You can't just remove the characters The for a comparison because they might be the start of a band name such as Them or They Might Be Giants. Another complication is the band The The. You need to sort this item as if the first "The " weren't there.

To further complicate the requirements, you will ignore case when you make the comparisons between artist names. If a band was named the e.e. cummingses, you want to sort them with the artists who use a capital E in their name.

You must also make sure to ignore the characters only if they appear at the start of the string. In other words, you should not process Toad the Wet Sprocket by removing the characters "the ". Although this error case only happens if you mistakenly assign any artistic merit to Toad the Wet Sprocket, your code should be robust enough to ignore the characters "The " only if they occur at the start of the string.

The string you want to ignore is the four characters "The ". You can't use a RuleBasedCollator here because you want to ignore those characters, not define a place for them in a sorting order. The good news is that you can use the Java Comparator interface. Even better, you only have to implement one method, compare(). The custom collation extension looks like this:


Listing 7. Source code for the custom collator
package com.ibm.dw;

import java.util.Comparator;

public class TheTheCollation implements Comparator<String>
{
  public int compare(String stringOne, String stringTwo)
  {
    return stringOne
             .replaceFirst("^(T|t)(H|h)(E|e) ", "")
             .compareToIgnoreCase(stringTwo.replaceFirst("^(T|t)(H|h)(E|e) ", ""));
  }
}
      

The code is very simple; you simply implement the compare() method. Given two strings, you remove "The " if needed, then call the existing Java comparison method. Because you want to ignore case, use compareToIgnoreCase() instead of the more basic compareTo().

The String.replaceFirst() function removes the characters "The " at the start of the string. (As you can see, the class is named in honor of The The.) The important thing is that the first argument to the replaceFirst() function is a regular expression. The regular expression "^(T|t)(H|h)(E|e) " only matches "The " if it occurs at the start of the string; the caret anchors the regular expression to the start of the string. The parenthetical groups match any combination of uppercase and lowercase letters. This is an elegant way to get around using a combination of functions such as String.startsWith() and String.substring().
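To see how the anchored pattern treats the tricky names, you can pull the replacement out into a helper and try it on its own. In this sketch the class LeadingTheDemo, the sortKey() helper and the IGNORING_THE comparator are invented for illustration; the regular expression and the comparison call are the ones from Listing 7.

```java
import java.util.Arrays;
import java.util.Comparator;

final class LeadingTheDemo {
    // Strip one leading "The " in any mix of case; the caret anchors the match.
    static String sortKey(String name) {
        return name.replaceFirst("^(T|t)(H|h)(E|e) ", "");
    }

    // Same comparison as Listing 7, expressed through the sortKey() helper.
    static final Comparator<String> IGNORING_THE =
        (a, b) -> sortKey(a).compareToIgnoreCase(sortKey(b));

    // Return a sorted copy of the names, ignoring a leading "The ".
    static String[] sorted(String... names) {
        String[] copy = names.clone();
        Arrays.sort(copy, IGNORING_THE);
        return copy;
    }

    public static void main(String[] args) {
        System.out.println(sortKey("The The"));               // only the first "The " goes
        System.out.println(sortKey("Them"));                  // no trailing space, no match
        System.out.println(sortKey("Toad the Wet Sprocket")); // not at the start, no match
        System.out.println(String.join(" | ",
            sorted("Eminem", "The Beatles", "Them", "The The")));
    }
}
```

Note that the sort keys are used only for comparison. The sorted array still holds the original names, so The Beatles keeps its article in the output.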

Here is the list of bands and musicians:


Listing 8. A list of bands and musicians
<?xml version="1.0"?>
<!-- artists.xml -->
<artistlist>
  <artist>The Clash</artist>
  <artist>They Might Be Giants</artist>
  <artist>Eminem</artist>
  <artist>The Whigs</artist>
  <artist>X</artist>
  <artist>Talking Heads</artist>
  <artist>The Rutles</artist>
  <artist>Them</artist>
  <artist>The Yardbirds</artist>
  <artist>the e.e. cummingses</artist>
  <artist>Romeo Void</artist>
  <artist>The B-52's</artist>
  <artist>B. B. King</artist>
  <artist>The The</artist>
  <artist>Beastie Boys</artist>
  <artist>The Beatles</artist>
</artistlist>
      

As you can see, eight of the artists are bands whose names begin with "The " (one of them in lowercase). When you look for something by The Beatles in a music store, you don't look under "T" to find their music. You need a custom collation function that says The Beatles should appear before Eminem.

This stylesheet is virtually identical to the first stylesheet; the main difference here is that you use a different custom collator. Here's the fragment of the stylesheet that invokes the Java class:

<xsl:variable name="usefully_sorted_artists" as="xs:string*">
  <xsl:perform-sort select="$artists">
    <xsl:sort select="."
collation="http://saxon.sf.net/collation?class=com.ibm.dw.TheTheCollation;"/>
  </xsl:perform-sort>
</xsl:variable>
      

You invoke the stylesheet engine with this command:

java net.sf.saxon.Transform artists.xml custom-collation-thethe.xsl

The results look like Figure 3:


Figure 3. The list of artists, sorted in two different ways

A final enhancement: Adding mouse effects

Writing the original position of each term in each column is a useful way to illustrate the differences between the collations. As a final exercise, look at how to enhance the stylesheet so that moving the mouse over an artist's name in one column highlights that artist's name in the other two columns. The three steps to add this to the generated HTML page are:

  1. Give every table cell an ID based on a naming convention.
  2. Define the id, onmouseover and onmouseout attributes for each table cell.
  3. Define JavaScript functions that highlight the appropriate cells when the mouse moves into a table cell and unhighlight the same cells when the mouse moves out.

You need the IDs of the table cells to find the appropriate cells. Those IDs will be in the format col1-1, col2-1 and col3-1. The cell with the ID col1-1 is the first term in the first column. The cells with the IDs col2-1 and col3-1 are the cells in columns 2 and 3 that have the same text. That means the IDs for every cell in columns 2 and 3 end with the same number you generate for the cell itself. In this example, The Clash is the first artist in the XML file, so every occurrence of The Clash has the number 1 beside it. The IDs of the cells containing The Clash are col1-1, col2-1 and col3-1.

Now that you know how the IDs work, you'll code the onmouseover and onmouseout attributes to use the last part of the ID. You'll create two JavaScript functions, highlightCells() and unhighlightCells(). Given the number 3, highlightCells() will highlight the three elements with IDs of col1-3, col2-3 and col3-3. A table cell in the generated HTML document looks like this:

<td id="col1-1" 
   onmouseover="highlightCells('1');" 
   onmouseout="unhighlightCells('1');" 
   style="border: solid white 6px;">1. The Clash</td>
      

Finally, you need the JavaScript functions. They look like this:


Listing 9. JavaScript functions for highlighting table cells
      <title>Sorting with a custom collation</title><script language="JavaScript">
         <!--
        function highlightCells(rowNum)
        {
          el = document.getElementById('col1-' + rowNum);
          el.style.border='solid #E15119 6px';
          el = document.getElementById('col2-' + rowNum);
          el.style.border='solid #E15119 6px';
          el = document.getElementById('col3-' + rowNum);
          el.style.border='solid #E15119 6px';
        }

        function unhighlightCells(rowNum)
        {
          el = document.getElementById('col1-' + rowNum);
          el.style.border='solid white 6px';
          el = document.getElementById('col2-' + rowNum);
          el.style.border='solid white 6px';
          el = document.getElementById('col3-' + rowNum);
          el.style.border='solid white 6px';
        }
        --></script></head>
      

The JavaScript code uses the getElementById() function to find the elements with the given ID. To highlight the cells, it sets the border color of the cells to the official developerWorks shade of orange. To unhighlight the cells, it resets the border color to white. Figure 4 shows how the HTML looks when you place the mouse over any cell that contains The Clash:


Figure 4. JavaScript effects to highlight artists across columns

Here's the complete stylesheet:


Listing 10. An XSLT stylesheet that generates mouse effects
<?xml version="1.0"?>
<!-- custom-collation-advanced.xsl -->
<xsl:stylesheet version="2.0" 
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <xsl:output method="html"/>

  <xsl:variable name="artists" as="xs:string*" 
    select="artistlist/artist"/>
  
  <xsl:variable name="normally_sorted_artists" as="xs:string*">
    <xsl:perform-sort select="$artists">
      <xsl:sort select="."/>
    </xsl:perform-sort>
  </xsl:variable>
  
  <xsl:variable name="usefully_sorted_artists" as="xs:string*">
    <xsl:perform-sort select="$artists">
      <xsl:sort select="."
        collation="http://saxon.sf.net/collation?class=com.ibm.dw.TheTheCollation;"/>
    </xsl:perform-sort>
  </xsl:variable>
  
  <xsl:template match="/">
    <html>
      <head>
        <title>Sorting with a custom collation</title>
        <script language="JavaScript"><xsl:comment>
        function highlightCells(rowNum)
        {
          el = document.getElementById('col1-' + rowNum);
          el.style.border='solid #E15119 6px';
          el = document.getElementById('col2-' + rowNum);
          el.style.border='solid #E15119 6px';
          el = document.getElementById('col3-' + rowNum);
          el.style.border='solid #E15119 6px';
        }

        function unhighlightCells(rowNum)
        {
          el = document.getElementById('col1-' + rowNum);
          el.style.border='solid white 6px';
          el = document.getElementById('col2-' + rowNum);
          el.style.border='solid white 6px';
          el = document.getElementById('col3-' + rowNum);
          el.style.border='solid white 6px';
        }
        </xsl:comment></script>
      </head>
      <body style="font-family: sans-serif; font-size: 12pt;">
        <h1>Sorting with a custom collation</h1>
        <p>Here is a table that uses a <i>custom collation</i>
        to sort data ignoring the characters 
        <span style="font-family: monospace;">The </span>
        at the start of the data:</p>
        <table cellpadding="5" width="50%"
          style="font-weight: bold;">
          <tr style="font-size: 120%; font-style: italic; 
                     text-align: center;">
            <td>Original data</td>
            <td>Normally-sorted data</td>
            <td>Usefully-sorted data</td>
          </tr>
          <xsl:for-each select="1 to count($artists)">
            <tr style="background: {if (. mod 2 = 1) 
                                    then 'gray' else 'white'};
                       color: {if (. mod 2 = 1) 
                               then 'white' else 'black'};">
              <xsl:variable name="col2Index"
                select="index-of($artists, 
                        subsequence($normally_sorted_artists, ., 1))"/>
              <xsl:variable name="col3Index"
                select="index-of($artists, 
                        subsequence($usefully_sorted_artists, ., 1))"/>
              <td id="col1-{.}"
                onmouseover="highlightCells('{.}');"
                onmouseout="unhighlightCells('{.}');"
                style="border: solid white 6px;">
                <xsl:value-of 
                  select="., '. ', subsequence($artists, ., 1)" 
                  separator=""/>
              </td>
              <td id="col2-{$col2Index}"
                onmouseover="highlightCells('{$col2Index}');"
                onmouseout="unhighlightCells('{$col2Index}');"
                style="border: solid white 6px;">
                <xsl:value-of select="$col2Index"/>
                <xsl:text>. </xsl:text>
                <xsl:value-of 
                  select="subsequence($normally_sorted_artists, ., 1)"/>
              </td>
              <td id="col3-{$col3Index}"
                onmouseover="highlightCells('{$col3Index}');"
                onmouseout="unhighlightCells('{$col3Index}');"
                style="border: solid white 6px;">
                <xsl:value-of select="$col3Index"/>
                <xsl:text>. </xsl:text>
                <xsl:value-of 
                  select="subsequence($usefully_sorted_artists, ., 1)"/>
              </td>
            </tr>
          </xsl:for-each>
        </table>
      </body>
    </html>
  </xsl:template>

</xsl:stylesheet>      
      

You invoke the stylesheet engine with this command:

java net.sf.saxon.Transform artists.xml custom-collation-advanced.xsl

Notice that the stylesheet generates the JavaScript code inside an <xsl:comment> element. The output contains the <script> start tag followed immediately by the start of the comment, and the end of the comment is followed immediately by the </script> end tag. In the past, some browsers failed intermittently because they interpreted the end-of-comment marker (-->) as the JavaScript decrement operator. Structuring the stylesheet this way avoids that error.


Summary

Java has very powerful classes that make it easy to change the way sorting works. In this article you looked at three Java classes that support Spanish, German and a domain-specific type of sorting. The three classes are arguably one line of code each. The ability to invoke these classes from an XSLT 2.0 stylesheet makes it easy to add custom sorting functions for different languages or other requirements. This simple technique can be a great addition to your XML processing toolbox.
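
As a reminder of how little code the domain-specific case needs, here is a minimal sketch of a comparator that ignores a leading "The " when sorting names. It is an illustration only, not the class from the article's download: the class name LeadingArticleComparator and the helper methods sortKey and sorted are hypothetical, and the sketch assumes English-locale comparison for the rest of the string. To use a class like this from a stylesheet, you would still register it with your XSLT processor in the way that processor requires.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

// Hypothetical class name; not the class shipped with the article's samples.
class LeadingArticleComparator implements Comparator<String> {

    // Locale-aware comparison for whatever remains after the ignored prefix.
    // Assumption: English rules are good enough for this data.
    private final Collator collator = Collator.getInstance(Locale.ENGLISH);

    // Drops a leading "The " (matched case-insensitively) so that
    // "The Beatles" is compared as "Beatles".
    static String sortKey(String name) {
        String trimmed = name.trim();
        if (trimmed.regionMatches(true, 0, "The ", 0, 4)) {
            return trimmed.substring(4).trim();
        }
        return trimmed;
    }

    // Negative if first sorts before second, zero if they are equal
    // under this collation, positive if first sorts after second.
    @Override
    public int compare(String first, String second) {
        return collator.compare(sortKey(first), sortKey(second));
    }

    // Returns a sorted copy; the input list is left untouched.
    static List<String> sorted(List<String> names) {
        List<String> copy = new ArrayList<>(names);
        copy.sort(new LeadingArticleComparator());
        return copy;
    }

    public static void main(String[] args) {
        List<String> artists = Arrays.asList(
            "The Rolling Stones", "Bob Dylan", "The Beatles", "Aretha Franklin");
        System.out.println(sorted(artists));
    }
}
```

Because the comparator only changes the key that is compared, the original strings are displayed unchanged. That is the same effect the "Usefully-sorted data" column shows in the generated table.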

Acknowledgements

The author would like to thank Simon St. Laurent of O'Reilly and Associates for permission to use the first two code examples and the explanatory text for the different sorting algorithms. They are taken from the second edition of XSLT (ISBN 0596527217). The second edition of the book includes several extensions to XSLT, including code written in Java, C#, Python, Ruby and JavaScript. You can preorder a copy of the book today at amazon.com.



Download

Description: XML, XSLT, HTML and Java samples from this article
Name: x-xsltsort-samples.zip
Size: 16KB
Download method: HTTP


Resources

Get products and technologies

  • Saxon XSLT 2.0 processor: Get Michael Kay's excellent tool, available on SourceForge.

  • The AltovaXML page on Altova's Web site: Find more information on the XSLT processor that Altova (the maker of XML Spy and other popular products) makes available free of charge. The processor supports XSLT 1.0, XSLT 2.0 and XQuery 1.0. Although it's not open source, the license does allow you to embed the XSLT engine in your products.

  • IBM trial software: Build your next development project with trial software available for download directly from developerWorks.
