Here at developerWorks, we're always trying to answer your questions and meet your needs. Recently I received the following letter from Tommy Jones of Des Moines, Iowa:
Is there any way to do case-insensitive enumerations in XML Schemas? If the valid values for an element are "red," "blue," and "green," we'd like to let our users use any combination of upper- and lowercase letters for those values. We can't find any way in the XML Schema spec that we can define an enumeration that is case-insensitive. Can you help us?
Tommy Jones of Des Moines, Iowa
Well, Tommy, I've got good news and bad news. The bad news is that you can't do what you want with XML Schema; the good news is that we have an automated solution that's standards-compliant, fairly simple, and shouldn't require any work on your part.
First of all, you can't do what you want directly: XML Schema has no facet that makes an enumeration case-insensitive. The way around this problem is to convert the enumerations into a regular expression. Let's say that your schema defines the following datatype:
<xsd:element name="favoriteColor">
  <xsd:simpleType>
    <xsd:restriction base="xsd:string">
      <xsd:enumeration value="red"/>
      <xsd:enumeration value="blue"/>
      <xsd:enumeration value="green"/>
    </xsd:restriction>
  </xsd:simpleType>
</xsd:element>
To do a case-insensitive comparison, you need to convert this into a regular expression that combines all of the valid values. For the value
"blue," for example, you'll create a regular expression that says, "This is an upper- or lowercase B, followed by an upper- or lowercase L, followed by an upper- or lowercase U, followed by an upper- or lowercase E." That means the enumerated datatype above should look like this:
<xsd:element name="favoriteColor">
  <xsd:simpleType>
    <xsd:restriction base="xsd:string">
      <xsd:pattern value="((B|b)(L|l)(U|u)(E|e))|((G|g)(R|r)(E|e)(E|e)(N|n))|((R|r)(E|e)(D|d))"/>
    </xsd:restriction>
  </xsd:simpleType>
</xsd:element>
This regular expression matches
"bLUe" and any other combination of upper- and lowercase letters that spells the word blue, and likewise for green and red. (You could also solve this problem by generating a set of
<xsd:enumeration> elements that define all the combinations of upper- and lowercase letters, but that would be much larger than the regular expression, especially if the valid values were long strings.)
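To see how such a pattern behaves, here is a minimal Java sketch. It assumes java.util.regex as a stand-in for the XML Schema regular-expression dialect (the two differ in general, but should agree on a pattern this simple), and it uses whole-string matching because a schema pattern has to match the entire value. The class and method names are illustrative only and are not part of the article's download.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, used only to illustrate the pattern.
final class ColorPatternCheck {

    // The schema pattern, written without whitespace around the vertical
    // bars (a space inside the pattern would be matched literally).
    static final Pattern COLORS = Pattern.compile(
        "((B|b)(L|l)(U|u)(E|e))|((G|g)(R|r)(E|e)(E|e)(N|n))|((R|r)(E|e)(D|d))");

    // True if the whole value is one of the colors, in any letter case.
    static boolean isValidColor(String value) {
        return COLORS.matcher(value).matches();
    }

    // Number of <xsd:enumeration> elements needed to list every case
    // variant of a word made only of letters: two choices per letter.
    static long caseVariants(String word) {
        return 1L << word.length();
    }

    public static void main(String[] args) {
        System.out.println(isValidColor("bLUe"));    // true
        System.out.println(isValidColor("purple"));  // false
        // Enumerations needed to spell out every variant of the three colors:
        System.out.println(caseVariants("red") + caseVariants("blue")
            + caseVariants("green"));                // 56
    }
}
```

The last line shows why the enumeration approach scales badly: three short words already need 56 enumeration elements, against a single pattern.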
Even better news
Because an XML Schema is itself an XML document, you can write a style sheet that converts the enumeration markup into the regular expression you just looked at. To do this, you need to find all of the
<xsd:restriction> elements that are based on the
xsd:string datatype and contain
<xsd:enumeration> elements. What you want is a style sheet that copies all of the existing schema except the
<xsd:restriction> elements you're looking for. You'll then add a rule that defines how to transform the <xsd:restriction> elements.
Here's a style sheet that defines the basic rule for copying an XML document. This will be the default rule used for everything in the source document; in a minute, you'll add the rule for transforming the <xsd:restriction> elements.
<?xml version="1.0" ?>
<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsl:template match="*|@*|text()|comment()|processing-instruction()">
    <xsl:copy>
      <xsl:apply-templates select="*|@*|text()|comment()|processing-instruction()" />
    </xsl:copy>
  </xsl:template>
  <!-- Add the stuff that handles the enumerations here. -->
</xsl:stylesheet>
Now you just have to write the template that transforms the
<xsd:restriction> elements. Here's the XPath expression that selects the elements:
<xsl:template match="xsd:restriction[@base='xsd:string'] [count(xsd:enumeration) > 0]">
If you're not familiar with XPath syntax, this tells the style sheet processor to select all of the
<xsd:restriction> elements that have a
base='xsd:string' attribute and contain at least one
<xsd:enumeration> element. The algorithm you'll follow for each
<xsd:enumeration> element inside the
<xsd:restriction> element is:
- Write a left parenthesis.
- Write the upper- and lowercase values of each letter.
- Write a right parenthesis.
- If this isn't the last
<xsd:enumeration>, add a vertical bar.
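The four steps can be sketched in Java as follows. This is only an illustration of the algorithm, not the article's implementation: the class and method names are hypothetical, and Character.toUpperCase() and Character.toLowerCase() stand in for the letter conversion, which is assumed here to involve plain ASCII letters.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical class illustrating the four steps of the algorithm.
final class EnumerationPatternSketch {

    // Step 2 for a single value: each character becomes "(X|x)".
    static String letters(String value) {
        StringBuilder sb = new StringBuilder();
        for (char c : value.toCharArray()) {
            sb.append('(').append(Character.toUpperCase(c))
              .append('|').append(Character.toLowerCase(c)).append(')');
        }
        return sb.toString();
    }

    // Builds one pattern out of all enumeration values.
    static String pattern(List<String> values) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < values.size(); i++) {
            sb.append('(');                     // Step 1: left parenthesis
            sb.append(letters(values.get(i)));  // Step 2: upper- and lowercase letters
            sb.append(')');                     // Step 3: right parenthesis
            if (i < values.size() - 1) {
                sb.append('|');                 // Step 4: bar, unless this is the last value
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Prints ((R|r)(E|e)(D|d))|((B|b)(L|l)(U|u)(E|e))|((G|g)(R|r)(E|e)(E|e)(N|n))
        System.out.println(pattern(Arrays.asList("red", "blue", "green")));
    }
}
```

Note that neither this sketch nor the algorithm as described escapes regular-expression metacharacters. A value containing a period, for instance, would yield (.|.), which matches any character.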
Here's how that part of the style sheet looks:
<xsl:template match="xsd:restriction[@base='xsd:string'][count(xsd:enumeration) > 0]">
  <xsd:restriction base="xsd:string">
    <xsd:pattern>
      <xsl:attribute name="value">
        <xsl:for-each select="xsd:enumeration">
          <!-- Step 1. Write a left parenthesis -->
          <xsl:text>(</xsl:text>
          <!-- Step 2. Write the upper- and lowercase letters -->
          <!-- Step 3. Write a right parenthesis -->
          <xsl:text>)</xsl:text>
          <!-- Step 4. If this isn't the last enumeration, write -->
          <!-- a vertical bar -->
          <xsl:if test="not(position()=last())">
            <xsl:text>|</xsl:text>
          </xsl:if>
        </xsl:for-each>
      </xsl:attribute>
    </xsd:pattern>
  </xsd:restriction>
</xsl:template>
You might have noticed that this template skips over the difficult part: writing out the upper- and lowercase values of each letter. You'll use recursion and the XSLT
translate() function to do this.
Recursion is a common technique in XSLT style sheets. You'll use a named template to handle this; the named template will invoke itself until all of the letters in the string have been processed. The template (named
case-insensitive-pattern in the example) receives two parameters: the string you're converting to a regular expression, and the position in the string where you should start. Here's how your named template begins:
<xsl:template name="case-insensitive-pattern">
  <xsl:param name="string"/>
  <xsl:param name="index"/>
For any given string, the correct value is the concatenation of:
- The value of the current letter, written in the (A|a) format.
- The value of the remaining letters written in the (
A|a) format. (If there are no letters left, the value is empty; otherwise, you call the template recursively. To do that, you pass the original string and increment the starting position by one.)
You'll create two variables representing the two values above, then you'll use the
<xsl:value-of> element to output their combined values. For the current letter, you output a left parenthesis, the uppercase value of the letter, a vertical bar, the lowercase value of the letter, and a right parenthesis. Here's the markup that calculates the first variable:
<xsl:variable name="current-letter">
  <!-- Write a left parenthesis -->
  <xsl:text>(</xsl:text>
  <!-- Convert the current letter to uppercase -->
  <xsl:value-of select="translate(substring($string, $index, 1),
                        'abcdefghijklmnopqrstuvwxyz',
                        'ABCDEFGHIJKLMNOPQRSTUVWXYZ')"/>
  <!-- Write a vertical bar -->
  <xsl:text>|</xsl:text>
  <!-- Convert the current letter to lowercase -->
  <xsl:value-of select="translate(substring($string, $index, 1),
                        'ABCDEFGHIJKLMNOPQRSTUVWXYZ',
                        'abcdefghijklmnopqrstuvwxyz')"/>
  <!-- Write a right parenthesis -->
  <xsl:text>)</xsl:text>
</xsl:variable>
A word about the XSLT translate() function
Before we go on, it's worth noting that the XSLT
translate() function takes three strings. Any character in the first string that also appears in the second string (
'abcde...') is replaced by the corresponding letter from the third string (
'ABCDE...'). So if the first string is
bed, the function call
translate('bed', 'abcde...', 'ABCDE...') returns
BED. If a character in the first string doesn't appear in the second string at all, it isn't changed. That means
translate('bed7', 'abcde...', 'ABCDE...') returns
BED7. You could extend the strings in the function call to include accented characters used in Western European languages if you wanted. (The XPath spec, which defines
translate(), warns that the function isn't sufficient to do case conversion in all the world's languages, so be aware of that.)
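The behavior of translate() described here can be imitated in a few lines of Java. This is a rough sketch under the XPath 1.0 rules as I understand them, not a replacement for a real XPath engine: it works on Java char values, so characters outside the Basic Multilingual Plane are not handled, and the class name is hypothetical.

```java
// Hypothetical class imitating XPath's translate() on Java char values.
final class XPathTranslateSketch {

    static final String LOWER = "abcdefghijklmnopqrstuvwxyz";
    static final String UPPER = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";

    static String translate(String input, String from, String to) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            int pos = from.indexOf(c);       // the first occurrence in 'from' decides
            if (pos < 0) {
                out.append(c);               // not listed in 'from': left unchanged
            } else if (pos < to.length()) {
                out.append(to.charAt(pos));  // replaced by its counterpart in 'to'
            }
            // Otherwise 'from' is longer than 'to' at this position,
            // and the character is removed from the result.
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(translate("bed", LOWER, UPPER));   // BED
        System.out.println(translate("bed7", LOWER, UPPER));  // BED7
    }
}
```

The removal branch covers a case the text above doesn't mention: when the second string is longer than the third, characters with no counterpart are dropped rather than kept.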
Now you calculate the value of all the remaining letters, each of them converted to the (
A|a) format. If the index of the current letter is less than the length of the string, you invoke your named template again, passing the original string and incrementing the index by 1. If the index of the current letter is equal to the length of the string, this variable is an empty string.
<xsl:variable name="remaining-letters">
  <!-- If $index is less than the length of the string, -->
  <!-- call the template again. -->
  <xsl:if test="$index &lt; string-length($string)">
    <xsl:call-template name="case-insensitive-pattern">
      <!-- The string parameter doesn't change -->
      <xsl:with-param name="string" select="$string"/>
      <!-- Increment the index of the current letter by 1 -->
      <xsl:with-param name="index" select="$index + 1"/>
    </xsl:call-template>
  </xsl:if>
</xsl:variable>
Finally, you output the value of the two variables with the
<xsl:value-of> element and the
concat() function. This is equivalent to a
return statement in other programming languages.
  <xsl:value-of select="concat($current-letter, $remaining-letters)"/>
</xsl:template>
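For comparison, the recursion in the named template can be mirrored in Java. This sketch keeps the template's two parameters and its 1-based index; the class name is hypothetical, and toUpperCase() and toLowerCase() are assumed to be adequate substitutes for the two translate() calls when the values contain only ASCII letters.

```java
import java.util.Locale;

// Hypothetical class mirroring the case-insensitive-pattern template.
final class CaseInsensitivePatternSketch {

    // 'index' is 1-based, like the position argument of XPath's substring().
    static String caseInsensitivePattern(String string, int index) {
        // XPath's substring() yields an empty string past the end of the
        // input; mirror that instead of throwing an exception.
        String letter = index <= string.length()
            ? string.substring(index - 1, index) : "";

        // First variable: the current letter in (A|a) format.
        String currentLetter = "(" + letter.toUpperCase(Locale.ROOT)
            + "|" + letter.toLowerCase(Locale.ROOT) + ")";

        // Second variable: the remaining letters, or empty at the end.
        String remainingLetters = "";
        if (index < string.length()) {
            remainingLetters = caseInsensitivePattern(string, index + 1);
        }

        // Equivalent of the concat() call that "returns" the result.
        return currentLetter + remainingLetters;
    }

    public static void main(String[] args) {
        System.out.println(caseInsensitivePattern("blue", 1)); // (B|b)(L|l)(U|u)(E|e)
    }
}
```

As in the style sheet, the first call passes an index of 1, and each recursive call handles exactly one more letter.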
So, if the values
blue, red, and
green are valid, you can transform your schema with our style sheet to generate a new schema. Using that new schema, the values
BLUE, Blue, bLuE, and
blUE are all valid.
Here's an example that illustrates how your style sheet works. You'll use a schema that defines enumerations for gender, marital status, and favorite color. Here's a sample instance document:
<?xml version="1.0"?>
<f:friend xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.ibm.com/developerWorks friend.xsd"
  xmlns:f="http://www.ibm.com/developerWorks">
  <f:name>
    <f:firstName>Jane</f:firstName>
    <f:lastName>Doe</f:lastName>
  </f:name>
  <f:gender>f</f:gender>
  <f:maritalStatus>married</f:maritalStatus>
  <f:favoriteColor>orange</f:favoriteColor>
</f:friend>
As part of this example, you're including a short piece of Java code,
XMLValidator.java, that validates an XML document against an XML schema. If you enter
java XMLValidator friend.xml, you'll see something like this:
> java XMLValidator friend.xml
Your document contains no errors!
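If you don't have the sample code at hand, a validator along these lines can be sketched with the JAXP validation API (javax.xml.validation) that ships with the JDK. This is not the XMLValidator.java from the download, whose contents aren't shown here. The class name, the validate() method, and the inline schema are assumptions made for illustration, and the schema and document are passed as strings to keep the example self-contained.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical stand-in for a schema validator.
final class SchemaValidationSketch {

    // Returns one message per validation problem; an empty list means valid.
    static List<String> validate(String schemaXml, String documentXml) {
        final List<String> errors = new ArrayList<String>();
        try {
            SchemaFactory factory =
                SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Schema schema =
                factory.newSchema(new StreamSource(new StringReader(schemaXml)));
            Validator validator = schema.newValidator();
            validator.setErrorHandler(new ErrorHandler() {
                @Override public void warning(SAXParseException e) {
                    // Warnings are ignored in this sketch.
                }
                @Override public void error(SAXParseException e) {
                    errors.add("line " + e.getLineNumber() + ": " + e.getMessage());
                }
                @Override public void fatalError(SAXParseException e) throws SAXException {
                    throw e;
                }
            });
            validator.validate(new StreamSource(new StringReader(documentXml)));
        } catch (SAXException | IOException e) {
            // A broken schema or a malformed document ends up here.
            errors.add("fatal: " + e.getMessage());
        }
        return errors;
    }

    public static void main(String[] args) {
        String schema =
            "<xsd:schema xmlns:xsd='http://www.w3.org/2001/XMLSchema'>"
            + "<xsd:element name='favoriteColor'><xsd:simpleType>"
            + "<xsd:restriction base='xsd:string'>"
            + "<xsd:enumeration value='red'/><xsd:enumeration value='blue'/>"
            + "<xsd:enumeration value='green'/>"
            + "</xsd:restriction></xsd:simpleType></xsd:element></xsd:schema>";
        // Lowercase value: no errors reported.
        System.out.println(validate(schema, "<favoriteColor>blue</favoriteColor>"));
        // Uppercase value: rejected by the case-sensitive enumeration.
        System.out.println(validate(schema, "<favoriteColor>BLUE</favoriteColor>"));
    }
}
```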
In our sample document, the values
f, married, and orange are all case-sensitive; entering
F, Married, or OrAnGE will cause errors. If you put those illegal values into
friend.xml, you'll get messages like this:
Error in friend.xml at line 10, column 25: cvc-type.3.1.3: The value 'F' of element 'f:gender' is not valid.
Error in friend.xml at line 11, column 45: cvc-type.3.1.3: The value 'Married' of element 'f:maritalStatus' is not valid.
Error in friend.xml at line 12, column 44: cvc-type.3.1.3: The value 'OrAnGE' of element 'f:favoriteColor' is not valid.
You can use our XSLT style sheet to convert the original schema into a new schema document.
> java org.apache.xalan.xslt.Process -in friend.xsd -xsl convert-enumerations.xsl -out insensitive-friend.xsd
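If you'd rather run the conversion from Java code than from the Xalan command line, the standard JAXP transformation API (javax.xml.transform) can do the same job. The following is a minimal sketch: the class and method names are hypothetical, the file names are the ones used in the command above, and the files are assumed to be encoded in UTF-8.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical wrapper around the JAXP transformation API.
final class SchemaTransformSketch {

    // Applies the style sheet to the input document and returns the result.
    static String transform(String stylesheetXml, String inputXml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(stylesheetXml)));
            StringWriter out = new StringWriter();
            transformer.transform(
                new StreamSource(new StringReader(inputXml)),
                new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Transformation failed", e);
        }
    }

    public static void main(String[] args) throws IOException {
        // Same inputs and output as the command line shown above.
        String xsl = new String(
            Files.readAllBytes(Paths.get("convert-enumerations.xsl")),
            StandardCharsets.UTF_8);
        String xsd = new String(
            Files.readAllBytes(Paths.get("friend.xsd")),
            StandardCharsets.UTF_8);
        Files.write(Paths.get("insensitive-friend.xsd"),
            transform(xsl, xsd).getBytes(StandardCharsets.UTF_8));
    }
}
```

Which XSLT processor TransformerFactory.newInstance() returns depends on your classpath and JDK, so the exact formatting of the generated schema may differ from what the Xalan command produces.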
Next, change the root element of the XML document to refer to this new schema file:
<f:friend xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.ibm.com/developerWorks insensitive-friend.xsd"
  xmlns:f="http://www.ibm.com/developerWorks">
If you run your validation program against the XML document now, you'll once again get the message that your document contains no errors. The file
case-insensitive.zip has all the code and samples you need to try it yourself.
Well, Tommy, I hope this answers your question. Our solution is relatively simple, works automatically, and is based on XML standards.
Have questions of your own? Feel free to send 'em to us, and we'll try to answer them in our vast spare time.
- The file case-insensitive.zip contains all the code and samples you need to try this solution yourself.
- Find out more about what you can and cannot do with XML Schema at the W3C site.
- Doug Tidwell's tutorial "Introduction to XML" is a perennial favorite on developerWorks, and provides a solid foundation for understanding the complexities of XML.
- You'll find plenty more XML resources on the developerWorks XML zone.
- Take a look at IBM WebSphere Studio Application Developer, an easy-to-use, integrated development environment for building, testing, and deploying J2EE applications, including generating XML documents from DTDs and schemas.
- Find out how you can become an IBM Certified Developer in XML and related technologies.