
RELAX NG with custom datatype libraries

Define new types with Java technology

Elliotte Rusty Harold (elharo@metalab.unc.edu), Adjunct Professor, Polytechnic University
Elliotte Harold is originally from New Orleans, to which he returns periodically in search of a decent bowl of gumbo. However, he currently resides in the Prospect Heights neighborhood of Brooklyn, New York, with his wife Beth and cats Charm (named after the quark) and Marjorie (named after his mother-in-law). He's an adjunct professor of computer science at Polytechnic University, where he teaches Java technology and object-oriented programming. His Cafe au Lait Web site is one of the most popular independent Java sites on the Internet, and his spin-off site, Cafe con Leche, is one of the most popular XML sites. His books include Effective XML, Processing XML with Java, Java Network Programming, and The XML 1.1 Bible. He's currently working on the XOM API for processing XML and the XQuisitor GUI query tool. You can contact him at elharo@metalab.unc.edu.

Summary:  RELAX NG can do almost everything the W3C XML Schema language can do, including verifying constraints on text content and attribute values specified with the W3C XML Schema simple types. However, some constraints can be stated only in a Turing-complete language, and RELAX NG is not such a language. Fortunately, you can extend RELAX NG dynamically with custom validation code, written in the Java™ programming language, that checks constraints that RELAX NG itself can't specify. This requires implementing three interfaces: Datatype, DatatypeLibrary, and DatatypeLibraryFactory. This article demonstrates these interfaces by verifying that a number is prime.

Date:  23 Nov 2004
Level:  Advanced

The RELAX NG XML schema language has achieved huge success over the past three years; this is due in large part to its incredibly clean and straightforward syntax, especially compared to the W3C XML Schema language. Numerous groups, including OpenOffice, DocBook, and the Text Encoding Initiative, have adopted the RELAX NG schema language. RELAX NG has even begun to replace W3C schemas within the W3C, where both the SVG and XHTML working groups are writing their schemas in RELAX NG, then translating them to DTDs and W3C XML Schemas. While RELAX NG doesn't mandate support for XML schema datatypes, in practice, major implementations such as Jing and Sun's Multischema Validator do support them.

However, in all the excitement over how much better RELAX NG does the same things as the W3C XML Schema language, the fact that it can actually do quite a bit more has been overlooked. In particular, unlike the W3C XML Schema language, RELAX NG is not limited to one preordained collection of primitive data types with a limited set of facets for extension. RELAX NG enables developers to define custom type libraries that can assert any constraints a program can verify. For example, W3C schemas cannot validate these constraints:

  • A number is prime.
  • Every left parenthesis in a string is matched by a right parenthesis.
  • The value of the SKU attribute matches a record in the products database.
  • The content of an element is correctly spelled, as determined by consulting a dictionary file.

You cannot verify any of these constraints in pure RELAX NG, either. However, unlike the W3C XML Schema language, RELAX NG can be extended with user-defined type libraries written in Java, C#, Python, or other languages. Because these languages are Turing complete (see Resources), they can verify essentially any decidable condition that you might impose on a string. They can even look beyond the string itself and check it against external conditions, such as the inventory of a warehouse or the current stock price of a company. In essence, you can extend RELAX NG's built-in validation rules with arbitrarily complex validation conditions.

This article explores the Java interface for such extensions, which many RELAX NG processors — including Jing and Sun's Multischema Validator — support. The specific example that I employ verifies that a number is prime. The code to test primality is fairly independent of the supporting code for hooking up the custom library to RELAX NG, so it's easy to see where to plug in a more complicated algorithm or validation condition. This extension requires three classes:

  • A class that implements the org.relaxng.datatype.Datatype interface to represent the prime datatype: This class is responsible for verifying that a given string is, in fact, a prime number.
  • A class that implements the org.relaxng.datatype.DatatypeLibrary interface: This class is responsible for loading the right datatype class given the type's local name.
  • A factory class that implements the org.relaxng.datatype.DatatypeLibraryFactory interface: This class is responsible for loading the right datatype library given the library's namespace URI.

Now I'll show you each of these classes.

The prime DatatypeLibraryFactory

Debugging hint

If you have trouble getting the library to work, place a call to System.out.println at the beginning of the factory's createDatatypeLibrary method (see Listing 1) to make sure the library is found and loaded.

The factory class is straightforward. It has a single method — createDatatypeLibrary — that takes as an argument the namespace URI of the type library and returns an instance of that library. If the namespace does not match this library's, the method simply returns null, and the RELAX NG validator looks elsewhere for the correct library, as the code in Listing 1 shows.


Listing 1. The implementation of DatatypeLibraryFactory
package com.elharo.xml.relaxng;

import org.relaxng.datatype.*;

public class PrimeDatatypeLibraryFactory 
 implements DatatypeLibraryFactory {

  public final static String namespace 
    = "http://ns.cafeconleche.org/relaxng/primes";

  public DatatypeLibrary createDatatypeLibrary(String namespace) {
    if (PrimeDatatypeLibraryFactory.namespace.equals(namespace)) {
      return new PrimeDatatypeLibrary();
    }
    return null;
  }

}

RELAX NG uses the Java services API to find factories. Add a plain text file called org.relaxng.datatype.DatatypeLibraryFactory in the META-INF/services/ directory in some JAR file or directory in the classpath. This file contains the fully package-qualified names of the factory classes bundled in the JAR, one per line. In this case, that's just com.elharo.xml.relaxng.PrimeDatatypeLibraryFactory.
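As a sketch of what that provider-configuration file might look like for this example (the first line is an optional comment; the services file format treats lines that start with # as comments):

```
# META-INF/services/org.relaxng.datatype.DatatypeLibraryFactory
com.elharo.xml.relaxng.PrimeDatatypeLibraryFactory
```

If a JAR bundled several factories, each fully qualified class name would go on its own line.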


The prime DatatypeLibrary

The DatatypeLibrary implementation class is only a little more complex (see Listing 2). Each namespace has one DatatypeLibrary, and the interface requires two methods. The createDatatype method takes a local name, such as "prime", and returns the appropriate datatype object. This simple example provides only one type, but most libraries provide multiple simple types in the same namespace. The createDatatypeBuilder method returns an org.relaxng.datatype.DatatypeBuilder object. More complex type libraries can use this method to make types dependent on context and various parameters, such as base URIs and namespace prefixes in scope. However, this example doesn't need any context or parameters, so createDatatypeBuilder simply returns an instance of org.relaxng.datatype.helpers.ParameterlessDatatypeBuilder that's configured with a prime datatype.


Listing 2. The implementation of DatatypeLibrary
package com.elharo.xml.relaxng;

import org.relaxng.datatype.*;
import org.relaxng.datatype.helpers.*;

public class PrimeDatatypeLibrary implements DatatypeLibrary {

  public Datatype createDatatype(String typeLocalName)
   throws DatatypeException {
    if ("prime".equals(typeLocalName)) {
      return new PrimeDatatype();
    }
    throw new DatatypeException("Unsupported type: " + typeLocalName);
  }

  public DatatypeBuilder createDatatypeBuilder(String baseTypeLocalName) 
   throws DatatypeException {
    // Delegate to createDatatype so that unknown type names are rejected
    return new ParameterlessDatatypeBuilder(
     createDatatype(baseTypeLocalName)
    );
  }

}


The prime datatype

The final part of the public API is the datatype itself, a class that implements the org.relaxng.datatype.Datatype interface. This is the most complicated class you have to write. It has several responsibilities:

  • Determine whether a given string is a valid instance of the datatype.
  • Create objects that represent instances of the datatype.
  • Compare two objects for equality according to the semantics of the datatype.
  • Calculate hash codes for these objects.
  • Provide streaming validators for the type.
  • Determine whether the type is an ID type.
  • Determine whether the type is context dependent.

I'll take a look at each of these responsibilities in turn.

Checking validity

Two methods validate strings, as the code in Listing 3 shows. isValid returns true if the string is a valid instance of the datatype, false if it isn't. checkValid throws a DatatypeException if the string is not valid.


Listing 3. Methods to check the validity of a given string
  public boolean isValid(String literal, ValidationContext context) {
    return isPrime(literal);
  }

  public void checkValid(String literal, ValidationContext context) 
   throws DatatypeException {
    if (!isValid(literal, context)) {
      throw new DatatypeException(literal + " is not a prime number");
    }
  }

These two methods depend on a private isPrime method that implements a simple (and inefficient) trial-division algorithm for testing primality, shown in Listing 4. Divide the input by every integer from 2 (the smallest prime) up to the square root of the input and take the remainder. If any remainder is zero, the number is not prime. For example, for 91 the loop tries the divisors 2 through 9 and stops at 7, because 91 = 7 × 13. Of course, much more efficient algorithms for testing primality exist, but this one is the easiest to understand.


Listing 4. Algorithm for testing whether a number is prime
  private boolean isPrime(String literal) {
      
    try {
      int candidate = Integer.parseInt(literal); 
      if (candidate < 2) return false;
      double max = Math.sqrt(candidate);
      for (int i = 2; i <= max; i++) {
        if (candidate % i == 0) return false;
      }
      return true;
    }
    catch (NumberFormatException ex) {
      return false;      
    }
    
  }

This method also returns false if the input string cannot be parsed as an int, either because it is not a number at all or because it falls outside the int range.
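Where larger inputs matter, one possible replacement for the trial division in Listing 4 is the probabilistic test built into java.math.BigInteger. The following is a minimal sketch, not the article's code: the class name ProbablePrimeCheck and the certainty value of 100 are assumptions chosen for illustration, and isProbablePrime can, with very small probability, accept a composite number.

```java
import java.math.BigInteger;

// Hypothetical alternative to the isPrime method in Listing 4.
final class ProbablePrimeCheck {

  private static final BigInteger TWO = BigInteger.valueOf(2);

  // Assumed certainty: the chance of wrongly accepting a composite
  // is at most 2^-100.
  private static final int CERTAINTY = 100;

  private ProbablePrimeCheck() {}

  static boolean isPrime(String literal) {
    try {
      BigInteger candidate = new BigInteger(literal);
      // Reject negatives, 0, and 1 explicitly, because
      // isProbablePrime does not rule out negative inputs.
      if (candidate.compareTo(TWO) < 0) return false;
      return candidate.isProbablePrime(CERTAINTY);
    } catch (NumberFormatException ex) {
      return false; // not a decimal integer at all
    }
  }

  public static void main(String[] args) {
    System.out.println(isPrime("2147483647")); // 2^31 - 1
    System.out.println(isPrime("2147483649")); // beyond the int range
  }
}
```

Because this version accepts values outside the int range, a datatype built on it would also need createValue to return a BigInteger rather than an Integer.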

Streaming validity

Some data types may contain much more raw text than the numbers expected here. For example, imagine a Base64-encoded MPEG that's tested against an embedded checksum. In an extreme case like this, the size of the data might even exceed the maximum size of a Java string. The validator can then ask the Datatype for a DatatypeStreamingValidator that knows how to validate the input a piece at a time without having it all in memory at once. In this example, though, the strings aren't so large, so the Datatype just returns an instance of the org.relaxng.datatype.helpers.StreamingValidatorImpl class.

  public DatatypeStreamingValidator createStreamingValidator(
   ValidationContext context) {
    return new StreamingValidatorImpl(this, context);
  }

This class just accumulates all the character data passed to it in memory, then validates the entire string at the end. Implementations that need to deal with larger data can implement the DatatypeStreamingValidator interface directly.
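For a sense of what implementing the interface directly involves, the sketch below validates a different constraint, balanced parentheses, one chunk at a time while keeping only a counter in memory. It is an illustration under assumptions: the class name is hypothetical, and it only mirrors the addCharacters and isValid method shapes of DatatypeStreamingValidator rather than implementing the interface (which also declares checkValid), so that it compiles without the datatype library JAR.

```java
// Hypothetical class: mirrors the addCharacters/isValid method shapes of
// DatatypeStreamingValidator without implementing the interface.
final class StreamingParenthesisCheck {

  private int depth = 0;          // number of '(' characters still open
  private boolean unmatchedClose; // set once a ')' arrives with nothing open

  // Consumes one chunk of text; only the two fields survive between calls.
  void addCharacters(char[] buf, int start, int len) {
    for (int i = start; i < start + len; i++) {
      if (buf[i] == '(') {
        depth++;
      } else if (buf[i] == ')') {
        if (depth == 0) {
          unmatchedClose = true;
        } else {
          depth--;
        }
      }
    }
  }

  // Valid if every '(' seen so far is closed and no ')' was unmatched.
  boolean isValid() {
    return !unmatchedClose && depth == 0;
  }

  public static void main(String[] args) {
    StreamingParenthesisCheck check = new StreamingParenthesisCheck();
    char[] first = "(a (b".toCharArray();
    char[] second = ") c)".toCharArray();
    check.addCharacters(first, 0, first.length);
    check.addCharacters(second, 0, second.length);
    System.out.println(check.isValid());
  }
}
```

The validator never needs the whole text at once: each chunk can be discarded as soon as addCharacters returns.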

Object representation

The Datatype class must provide some object representation of the type that the validator can use for equality comparison and hash code calculation. In this example, the java.lang.Integer class serves this purpose nicely. In other cases, you might want to use java.lang.String or a custom class written just for this purpose. Whatever type you choose, the createValue method in PrimeDatatype converts the literal string into this kind of object, as Listing 5 shows.


Listing 5. Code to convert a literal string into an object
  public Object createValue(String literal, ValidationContext context) {

    if (isPrime(literal)) {
        return Integer.valueOf(literal);
    }
    return null;
    
  }

One optimization you might make in some circumstances is to use the flyweight design pattern: create only one object for each distinct value of a type rather than a new object each time that value appears.
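A minimal sketch of that flyweight idea follows. The class name PrimeValueCache and its canonical method are hypothetical, and the cache is unbounded, which may be unsuitable for documents with many distinct values.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical flyweight cache; not part of the article's download.
final class PrimeValueCache {

  // Maps each distinct value to the single Integer object that represents it.
  private final ConcurrentMap<Integer, Integer> cache = new ConcurrentHashMap<>();

  // Returns the shared Integer for this value, storing it on first use.
  Integer canonical(int value) {
    return cache.computeIfAbsent(value, key -> key);
  }

  public static void main(String[] args) {
    PrimeValueCache cache = new PrimeValueCache();
    Integer first = cache.canonical(1009);
    Integer second = cache.canonical(1009);
    // Reference comparison: both names point at the same shared object.
    System.out.println(first == second);
  }
}
```

createValue could then return the cached object, for instance cache.canonical(Integer.parseInt(literal)), so that repeated literals share one Integer instance.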

The only way you can use objects of this type is by passing them to the sameValue and valueHashCode methods in Datatype. Make these methods consistent with each other in the same way that equals and hashCode are normally consistent with each other. For this example, I've simply made these methods depend on the equals and hashCode methods of the Integer class, as Listing 6 shows.


Listing 6. The sameValue and valueHashCode methods
  public boolean sameValue(Object value1, Object value2) {
    if (value1 == null) return value2 == null;
    else return value1.equals(value2);
  }

  public int valueHashCode(Object value) {
    // Stay consistent with sameValue, which treats null as equal only to null
    return value == null ? 0 : value.hashCode();
  }

The validator is not supposed to pass any objects to these methods that were not created by the same object's createValue method. If it does, the behavior of these methods is undefined in the general case, although this specific implementation attempts to do something reasonable.

ID

The getIdType method determines whether the type manifests some sort of ID constraint. Four possibilities exist here, each identified by a named constant in the Datatype interface:

  • Datatype.ID_TYPE_ID
  • Datatype.ID_TYPE_IDREF
  • Datatype.ID_TYPE_IDREFS
  • Datatype.ID_TYPE_NULL

The prime datatype is not any kind of ID or ID reference, so its getIdType method returns Datatype.ID_TYPE_NULL:

  public int getIdType() {
    return ID_TYPE_NULL;
  }

Context

The final method is isContextDependent. The validity of prime numbers does not depend on context, so this method simply returns false:

  public boolean isContextDependent() {
    return false;
  }


Package, install, and use the type library

After the type library is written, package it in a JAR file for use by the validator. Don't forget to include the META-INF/services/org.relaxng.datatype.DatatypeLibraryFactory file that contains the name of the DatatypeLibraryFactory class. Add this JAR file to the validator's classpath, then run the validator as you normally would. For example, suppose the XML document shown in Listing 7 is in the file integers.xml.


Listing 7. The integers.xml file
<?xml version="1.0"?>
<numbers>
  <number>2</number>
  <number>3</number>
  <number>4</number>
  <number>5</number>
  <number>6</number>
</numbers>

Now, suppose the schema shown in Listing 8, which references the primes type library, is found in the file primes.rng.


Listing 8. A RELAX NG schema that requires all integers to be prime
<?xml version="1.0"?>
<element name="numbers" xmlns="http://relaxng.org/ns/structure/1.0">
  <oneOrMore>
    <element name="number">
      <data type="prime"
            datatypeLibrary="http://ns.cafeconleche.org/relaxng/primes"/>
    </element>
  </oneOrMore>
</element>

You can run the validator with the java interpreter, as Listing 9 demonstrates.


Listing 9. Validating with the custom type library
$ java -cp primetype.jar:msv.jar com.sun.msv.driver.textui.Driver \
    primes.rng integers.xml
start parsing a grammar.
validating integers.xml
Error at line:5, column:21 of file:///Users/elharo/integers.xml
  4 is not a prime number

Error at line:7, column:21 of file:///Users/elharo/integers.xml
  6 is not a prime number

the document is NOT valid.

Success! The validator correctly flags 4 and 6 as composite (that is, non-prime) numbers.


Beware runnable JARs

You have to add the validator's JAR to the classpath directly, as well as the JAR for the custom type library. If you treat the validator as a runnable JAR using "java -jar," the validator won't find the custom type library, and you'll get an error message such as:

"http://ns.cafeconleche.org/relaxng/primes" is not a recognized data type vocabulary 6:75@file:///Users/elharo/primes.rng failed to load a grammar.


Conclusion

That's all there is to it. You can write libraries that contain more than one simple type, and you can define types that have more complex validation rules; but all any type library requires is a few classes to define the type and set up the factories to load it. Here I've demonstrated validation with a command-line user interface (UI), but you can also validate with a graphical user interface (GUI) tool or integrate validation into your own programs using the Java API for RELAX Verifiers (JARV) or the Java API for XML Processing (JAXP) 1.3 validation package. Because type libraries are loaded dynamically using the services API, you don't need to change your Java code at all. Simply place the type library JAR in the classpath, then reference the types in your schemas. You are no longer limited to the W3C simple data types. You can validate absolutely any string that conforms to any decidable set of rules. You can mold the type library to fit your business rules instead of trimming the business rules to fit the schema language.



Download

Description: Pluggable prime number datatype for RELAX NG
Name: x-custyp_primedatatype.zip
Size: 3 KB
Download method: HTTP



Resources

