The RELAX NG XML schema language has achieved huge success over the past three years; this is due in large part to its incredibly clean and straightforward syntax, especially compared to the W3C XML Schema language. Numerous groups, including OpenOffice, DocBook, and the Text Encoding Initiative, have adopted the RELAX NG schema language. RELAX NG has even begun to replace W3C schemas within the W3C, where both the SVG and XHTML working groups are writing their schemas in RELAX NG, then translating them to DTDs and W3C XML Schemas. While RELAX NG doesn't mandate support for XML schema datatypes, in practice, major implementations such as Jing and Sun's Multischema Validator do support them.
However, in all the excitement over how much better RELAX NG does the same things as the W3C XML Schema language, the fact that it can actually do quite a bit more has been overlooked. In particular, unlike the W3C XML Schema language, RELAX NG is not limited to one preordained collection of primitive data types with a limited set of facets for extension. RELAX NG enables developers to define custom type libraries that can assert any constraints a program can verify. For example, W3C schemas cannot validate these constraints:
- A number is prime.
- Every left parenthesis in a string is matched by a right parenthesis.
- The value of the SKU attribute matches a record in the products database.
- The content of an element is correctly spelled, as determined by consulting a dictionary file.
You cannot verify any of these constraints in pure RELAX NG code, either. However, unlike the W3C XML Schema language, RELAX NG can be extended with user-defined type libraries written in Java, C#, Python, or other languages. Because these languages are Turing complete (see Resources), they can verify essentially any condition that you might impose on the strings. They can even go outside the bounds of the string itself to check its consistency with external conditions, such as the inventory of a warehouse or the current stock price of a company. In essence, you can extend RELAX NG's built-in validation rules with arbitrarily complex validation conditions.
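As a concrete instance of a condition that a general-purpose language can check but a fixed set of facets cannot, the balanced-parentheses constraint from the list above comes down to a few lines of Java. This is only an illustrative sketch: the names `ParenCheck` and `isBalanced` are hypothetical and are not part of any RELAX NG API.

```java
// Illustrative sketch only; ParenCheck and isBalanced are hypothetical names.
final class ParenCheck {

    // Returns true if every '(' is closed by a later ')' and no ')'
    // appears before its matching '('.
    static boolean isBalanced(String text) {
        int depth = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '(') {
                depth++;
            } else if (c == ')') {
                depth--;
                if (depth < 0) {
                    return false; // a right parenthesis with no partner
                }
            }
        }
        return depth == 0; // anything still open is unmatched
    }

    public static void main(String[] args) {
        System.out.println(isBalanced("(a (b) c)")); // true
        System.out.println(isBalanced("(a (b c)"));  // false
    }
}
```

A check like this is what would eventually sit behind a datatype's validity test, in the same place where the prime example calls its primality routine.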
This article explores the Java interface for such extensions, which many RELAX NG processors — including Jing and Sun's Multischema Validator — support. The specific example that I employ verifies that a number is prime. The code to test primality is fairly independent of the supporting code for hooking up the custom library to RELAX NG, so it's easy to see where to plug in a more complicated algorithm or validation condition. This extension requires three classes:
- A class that implements the org.relaxng.datatype.Datatype interface to represent the prime datatype: This class is responsible for verifying that a given string is, in fact, a prime number.
- A class that implements the org.relaxng.datatype.DatatypeLibrary interface: This class is responsible for loading the right datatype class given the type's local name.
- A factory class that implements the org.relaxng.datatype.DatatypeLibraryFactory interface: This class is responsible for loading the right datatype library given the library's namespace URI.
Now I'll show you each of these classes.
The prime DatatypeLibraryFactory
The factory class is straightforward. It has a single method — createDatatypeLibrary — that takes as an argument the namespace URI of the type library and returns an instance of that library. If the namespace does not match this library's, the method simply returns null, and the RELAX NG validator looks elsewhere for the correct library, as the code in Listing 1 shows.
Listing 1. The implementation of DatatypeLibraryFactory
package com.elharo.xml.relaxng;
import org.relaxng.datatype.*;
public class PrimeDatatypeLibraryFactory
    implements DatatypeLibraryFactory {
    public final static String namespace
        = "http://ns.cafeconleche.org/relaxng/primes";
    public DatatypeLibrary createDatatypeLibrary(String namespace) {
        if (PrimeDatatypeLibraryFactory.namespace.equals(namespace)) {
            return new PrimeDatatypeLibrary();
        }
        return null;
    }
}
RELAX NG uses the Java services API to find factories. Add a plain text file called org.relaxng.datatype.DatatypeLibraryFactory in the META-INF/services/ directory in some JAR file or directory in the classpath. This file contains the fully package-qualified names of the factory classes bundled in the JAR, one per line. In this case, that's just com.elharo.xml.relaxng.PrimeDatatypeLibraryFactory.
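To make the layout concrete, the following sketch writes that provider-configuration file into a directory that could then be packaged into the JAR. In practice you would normally create this one-line file by hand or in your build. The `ServicesFileWriter` class and its `write` method are hypothetical helpers, and only the file name and the factory class name are taken from the text above.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Collections;

// Hypothetical helper; not part of RELAX NG or of any validator.
final class ServicesFileWriter {

    static final String SERVICE_INTERFACE =
        "org.relaxng.datatype.DatatypeLibraryFactory";
    static final String FACTORY_CLASS =
        "com.elharo.xml.relaxng.PrimeDatatypeLibraryFactory";

    // Creates <root>/META-INF/services/<interface name> containing the
    // fully qualified name of the factory class on a single line.
    static Path write(Path root) throws IOException {
        Path dir = root.resolve("META-INF").resolve("services");
        Files.createDirectories(dir);
        Path file = dir.resolve(SERVICE_INTERFACE);
        Files.write(file, Collections.singletonList(FACTORY_CLASS),
                    StandardCharsets.UTF_8);
        return file;
    }

    public static void main(String[] args) throws IOException {
        Path root = Files.createTempDirectory("primetype");
        System.out.println("Wrote " + write(root));
    }
}
```

The important details are the ones the sketch hard-codes: the file is named after the interface, not after the implementation, and its content is the implementation's fully qualified class name.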
The DatatypeLibrary implementation class is only a little more complex (see Listing 2). There is one DatatypeLibrary per namespace, but it needs two methods. The createDatatype method takes a local name, such as "prime," and returns the appropriate datatype object. This simple example only provides one type, but most libraries provide multiple simple types in the same namespace. The createDatatypeBuilder method returns an org.relaxng.datatype.DatatypeBuilder object. More complex type libraries can use this method to make types dependent on context and various parameters, such as base URIs and namespace prefixes in scope. However, this example doesn't need any context, so createDatatypeBuilder simply returns an instance of org.relaxng.datatype.helpers.ParameterlessDatatypeBuilder that's configured with a prime datatype.
Listing 2. The implementation of DatatypeLibrary
package com.elharo.xml.relaxng;
import org.relaxng.datatype.*;
import org.relaxng.datatype.helpers.*;
public class PrimeDatatypeLibrary implements DatatypeLibrary {
    public Datatype createDatatype(String typeLocalName)
        throws DatatypeException {
        if ("prime".equals(typeLocalName)) {
            return new PrimeDatatype();
        }
        throw new DatatypeException("Unsupported type: " + typeLocalName);
    }
    public DatatypeBuilder createDatatypeBuilder(String baseTypeLocalName)
        throws DatatypeException {
        // Look up the requested base type so that unknown names are rejected
        // rather than silently treated as "prime".
        return new ParameterlessDatatypeBuilder(
            createDatatype(baseTypeLocalName)
        );
    }
}
The final part of the public API is the datatype itself — an instance of the org.relaxng.datatype.Datatype interface. This datatype is the most complicated class you have to write. It has several responsibilities:
- Determine whether a given string is a valid instance of the datatype.
- Create objects that represent instances of the datatype.
- Compare two objects for equality according to the semantics of the datatype.
- Calculate hash codes for these objects.
- Provide streaming validators for the type.
- Determine whether the type is an ID type.
- Determine whether the type is context dependent.
I'll take a look at each of these responsibilities in turn.
Two methods validate strings, as the code in Listing 3 shows. isValid returns true if the string is a valid instance of the datatype, false if it isn't. checkValid throws a DatatypeException if the string is not valid.
Listing 3. Methods to check the validity of a given string
public boolean isValid(String literal, ValidationContext context) {
    return isPrime(literal);
}
public void checkValid(String literal, ValidationContext context)
    throws DatatypeException {
    if (!isValid(literal, context)) {
        throw new DatatypeException(literal + " is not a prime number");
    }
}
These two methods depend on a private isPrime method that implements a simple (and inefficient) algorithm for testing primality, shown in Listing 4. It takes the remainder of the input when divided by every integer from 2 (the smallest prime) up to the square root of the input. If any remainder is zero, the number is not prime. Of course, much more efficient algorithms for testing primality exist, but this one is the easiest to understand.
Listing 4. Algorithm for testing whether a number is prime
private boolean isPrime(String literal) {
    try {
        int candidate = Integer.parseInt(literal);
        if (candidate < 2) return false;
        double max = Math.sqrt(candidate);
        for (int i = 2; i <= max; i++) {
            if (candidate % i == 0) return false;
        }
        return true;
    }
    catch (NumberFormatException ex) {
        return false;
    }
}
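This method also returns false if the input string is not a number at all, or if it is an integer too large to fit in a Java int, because Integer.parseInt throws a NumberFormatException in both cases.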
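Where larger inputs or a somewhat faster test are wanted, one option is the variant sketched below. It is not the code from the sample download: it parses the literal as a long and, after handling 2 and 3, tries only divisors of the form 6k ± 1, which skips two thirds of the candidates. The class name `PrimeCheck` and the decision to trim surrounding whitespace are assumptions made for illustration.

```java
// Illustrative alternative to Listing 4; PrimeCheck is a hypothetical name.
final class PrimeCheck {

    static boolean isPrime(String literal) {
        final long candidate;
        try {
            // Assumption: surrounding whitespace is not significant.
            candidate = Long.parseLong(literal.trim());
        } catch (NumberFormatException ex) {
            return false; // not an integer, or outside the range of long
        }
        if (candidate < 2) return false;
        if (candidate < 4) return true; // 2 and 3
        if (candidate % 2 == 0 || candidate % 3 == 0) return false;
        // Every prime above 3 has the form 6k - 1 or 6k + 1, so test only
        // those divisors. Comparing i with candidate / i avoids computing
        // i * i, which could overflow near Long.MAX_VALUE.
        for (long i = 5; i <= candidate / i; i += 6) {
            if (candidate % i == 0 || candidate % (i + 2) == 0) {
                return false;
            }
        }
        return true;
    }
}
```

This is still trial division, so very large primes remain slow to confirm. The point is only that the primality routine can be replaced without touching the rest of the datatype.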
Some data types may contain far more raw text than the short numbers expected here. For example, imagine a Base64-encoded MPEG that's tested against an embedded checksum. In an extreme case like this, the size of the data might even exceed the maximum size of a Java string. The validator can then ask the Datatype for a DatatypeStreamingValidator that knows how to validate the input a piece at a time without having it all in memory at once. In this example, though, the strings aren't so large, so the Datatype just returns an instance of the org.relaxng.datatype.helpers.StreamingValidatorImpl class.
public DatatypeStreamingValidator createStreamingValidator(
    ValidationContext context) {
    return new StreamingValidatorImpl(this, context);
}
This helper class simply accumulates all the character data it is passed into one string, then validates the entire string. Implementations that need to deal with larger data more efficiently can implement the DatatypeStreamingValidator interface directly.
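To illustrate what validating a piece at a time can look like, here is a self-contained sketch that checks the balanced-parentheses constraint incrementally, keeping only a counter instead of the whole text. Its method shapes are modeled on the addCharacters and isValid methods of DatatypeStreamingValidator, but the hypothetical `StreamingParenChecker` class deliberately does not implement that interface so that it compiles without the RELAX NG datatype JAR. Treat the signatures as assumptions and check them against the version of the API you use.

```java
// Sketch only: StreamingParenChecker is a hypothetical class. It mirrors the
// shape of a streaming validator but does not implement the real interface.
final class StreamingParenChecker {

    private int depth = 0;          // parentheses currently open
    private boolean broken = false; // set once a ')' arrives with nothing open

    // Consumes one chunk of character data; nothing is buffered.
    void addCharacters(char[] buf, int start, int len) {
        for (int i = start; i < start + len && !broken; i++) {
            if (buf[i] == '(') {
                depth++;
            } else if (buf[i] == ')') {
                if (depth == 0) {
                    broken = true;
                } else {
                    depth--;
                }
            }
        }
    }

    // Meant to be called after the last chunk has been delivered.
    boolean isValid() {
        return !broken && depth == 0;
    }

    public static void main(String[] args) {
        StreamingParenChecker checker = new StreamingParenChecker();
        for (String chunk : new String[] {"(a (b", ") c", ")"}) {
            checker.addCharacters(chunk.toCharArray(), 0, chunk.length());
        }
        System.out.println(checker.isValid()); // true
    }
}
```

Memory use here is constant no matter how much text arrives, which is the property a direct DatatypeStreamingValidator implementation is after.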
The Datatype class must provide some object representation of the type that the validator can use for equality comparison and hash code calculation. In this example, the java.lang.Integer class serves this purpose nicely. In other cases, you might want to use java.lang.String or a custom class written just for this purpose. Whatever type you choose, the createValue method in PrimeDatatype converts the literal string into this kind of object, as Listing 5 shows.
Listing 5. Code to convert a literal string into an object
public Object createValue(String literal, ValidationContext context) {
    if (isPrime(literal)) {
        return Integer.valueOf(literal);
    }
    return null;
}
One optimization you might make in some circumstances is to use the flyweight design pattern, so that each distinct value of the type is represented by a single shared object rather than by a new object every time that value occurs.
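A minimal sketch of that idea, assuming a thread-safe map keyed by the parsed value, might look like the following. The `ValueCache` class and its `intern` method are hypothetical, and a real createValue implementation would call something like it only after the literal has passed validation.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical flyweight cache; not part of the RELAX NG datatype API.
final class ValueCache {

    private final ConcurrentMap<Integer, Integer> shared =
        new ConcurrentHashMap<>();

    // Returns the one shared Integer for this numeric value, creating it on
    // first use. Literals such as "7" and "007" map to the same object.
    // Throws NumberFormatException if the literal is not an int.
    Integer intern(String literal) {
        Integer parsed = Integer.valueOf(literal);
        Integer existing = shared.putIfAbsent(parsed, parsed);
        return existing != null ? existing : parsed;
    }
}
```

The trade-off is that the map grows with the number of distinct values seen, so this only pays off when the same values recur often.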
The only way you can use objects of this type is by passing them to the sameValue and valueHashCode methods in Datatype. Make these methods consistent with each other in the same way that equals and hashCode are normally consistent with each other. For this example, I've simply made these methods depend on the equals and hashCode methods of the Integer class, as Listing 6 shows.
Listing 6. The sameValue and valueHashCode methods
public boolean sameValue(Object value1, Object value2) {
    if (value1 == null) return value2 == null;
    else return value1.equals(value2);
}
public int valueHashCode(Object value) {
    return value.hashCode();
}
The validator is not supposed to pass any objects to these methods that were not created by the same object's createValue method. If it does, the behavior of these methods is undefined in the general case, although this specific implementation attempts to do something reasonable.
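One way to make "something reasonable" more uniform is to lean on java.util.Objects, which treats null consistently in both methods. As written, valueHashCode in Listing 6 would throw a NullPointerException if it were ever handed null. The sketch below shows the two methods as static helpers in a hypothetical `ValueSemantics` class so that it compiles on its own. In the datatype they would remain instance methods.

```java
import java.util.Objects;

// Hypothetical stand-alone version of the two comparison methods.
final class ValueSemantics {

    // Two values are the same if both are null or equals() says so.
    static boolean sameValue(Object value1, Object value2) {
        return Objects.equals(value1, value2);
    }

    // Objects.hashCode returns 0 for null, so equal values (including two
    // nulls) always produce equal hash codes.
    static int valueHashCode(Object value) {
        return Objects.hashCode(value);
    }
}
```

Whether a datatype should tolerate null at all is a design choice. The only requirement the text places on these methods is that they agree with each other.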
The getIdType method determines whether the type manifests some sort of ID constraint. Four possibilities exist here, each identified by a named constant in the Datatype class:
- Datatype.ID_TYPE_ID
- Datatype.ID_TYPE_IDREF
- Datatype.ID_TYPE_IDREFS
- Datatype.ID_TYPE_NULL
The prime datatype is not any kind of ID or ID reference, so its getIdType method returns Datatype.ID_TYPE_NULL:
public int getIdType() {
    return ID_TYPE_NULL;
}
The final method is isContextDependent. The validity of prime numbers does not depend on context, so this method simply returns false:
public boolean isContextDependent() {
    return false;
}
Package, install, and use the type library
After the type library is written, package it in a JAR file for use by the validator. Don't forget to include the META-INF/services/org.relaxng.datatype.DatatypeLibraryFactory file that contains the name of the DatatypeLibraryFactory class. Add this JAR file to the validator's classpath, then run the validator as you normally would. For example, suppose the XML document shown in Listing 7 is in the file integers.xml.
Listing 7. The integers.xml file
<?xml version="1.0"?>
<numbers>
  <number>2</number>
  <number>3</number>
  <number>4</number>
  <number>5</number>
  <number>6</number>
</numbers>
Now, suppose the schema shown in Listing 8, which references the primes type library, is found in the file primes.rng.
Listing 8. A RELAX NG schema that requires all integers to be prime
<?xml version="1.0"?>
<element name="numbers" xmlns="http://relaxng.org/ns/structure/1.0">
  <oneOrMore>
    <element name="number">
      <data type="prime"
            datatypeLibrary="http://ns.cafeconleche.org/relaxng/primes"/>
    </element>
  </oneOrMore>
</element>
You can validate the document against the schema by running the validator with the java launcher, as Listing 9 demonstrates.
Listing 9. Validating with the custom type library
$ java -cp primetype.jar:msv.jar com.sun.msv.driver.textui.Driver primes.rng integers.xml
start parsing a grammar.
validating integers.xml
Error at line:5, column:21 of file:///Users/elharo/integers.xml
  4 is not a prime number
Error at line:7, column:21 of file:///Users/elharo/integers.xml
  6 is not a prime number
the document is NOT valid.
Success! The validator correctly flags 4 and 6 as composite (that is, non-prime) numbers.
That's all there is to it. You can write libraries that contain more than one simple type, and you can define types that have more complex validation rules; but all any type library requires is a few classes to define the type and set up the factories to load it. Here I've demonstrated validation with a command-line user interface (UI), but you can also validate with a graphical user interface (GUI) tool or integrate validation into your own programs using the Java API for RELAX Verifiers (JARV) or the Java API for XML Processing (JAXP) 1.3 validation package. Because type libraries are loaded dynamically using the services API, you don't need to change your Java code at all. Simply place the type library JAR in the classpath, then reference the types in your schemas. You are no longer limited to the W3C simple data types. You can validate absolutely any string that conforms to any decidable set of rules. You can mold the type library to fit your business rules instead of trimming the business rules to fit the schema language.
| Description | Name | Size | Download method |
|---|---|---|---|
| Pluggable prime number datatype for RELAX NG | x-custyp_primedatatype.zip | 3 KB | HTTP |
Resources
- Download the sample code used in this article.
- Visit the RELAX NG home page for RELAX NG software, libraries, tutorials, and specifications.
- Download Kohsuke Kawaguchi's Multischema validator.
- Read the RELAX NG Pluggable Datatype Libraries specification.
- James Clark's Jing includes a sample datatype implementation that verifies that the parentheses in a string are balanced.
- Integrate RELAX NG validation into your code using the validation interfaces in JARV and JAXP 1.3.
- Find out more about Turing completeness in this Wikipedia entry.
- Browse for books on these and other technical topics.
- Find more XML resources on the developerWorks XML zone.
- Learn how you can become an IBM Certified Developer in XML and related technologies.

Elliotte Harold is originally from New Orleans, to which he returns periodically in search of a decent bowl of gumbo. However, he currently resides in the Prospect Heights neighborhood of Brooklyn, New York, with his wife Beth and cats Charm (named after the quark) and Marjorie (named after his mother-in-law). He's an adjunct professor of computer science at Polytechnic University, where he teaches Java technology and object-oriented programming. His Cafe au Lait Web site is one of the most popular independent Java sites on the Internet, and his spin-off site, Cafe con Leche, is one of the most popular XML sites. His books include Effective XML, Processing XML with Java, Java Network Programming, and The XML 1.1 Bible. He's currently working on the XOM API for processing XML and the XQuisitor GUI query tool. You can contact him at elharo@metalab.unc.edu.