Tip: Tell a parser where to find a schema

More useful document validation with JAXP 1.2

This tip shows you how to implement robust document validation with XML schema and JAXP 1.2. Examples are included for SAX and DOM parsers.

Benoit Marchal (bmarchal@pineapplesoft.com), Consultant, Pineapplesoft

Benoit Marchal is a Belgian consultant. He is the author of XML by Example and other XML books. Contact Benoit at bmarchal@pineapplesoft.com for help with your XML projects.



22 May 2003


Most discussions of schemas center on the best possible vocabularies or on how to organize a schema efficiently (Russian dolls, Venetian blinds, salami slices, and so on). The dispute about the most appropriate schema language is also ongoing -- is it DTDs, the W3C's XML Schema, or OASIS' RELAX NG?

These are important considerations. Yet when you design an XML application, it's even more important to know what to do with the schema. This tip discusses new features in the Java API for XML Processing (JAXP) 1.2 that give you more flexibility in validating documents against schemas.

Validating documents

Typically, an application validates XML documents against a list of known schemas as part of its error handling. Schemas describe the vocabulary: the names of elements, the attributes, and their datatypes (such as integer, string, and date). If a document validates against a schema, it conforms to a vocabulary that the application recognizes. Validating is useful -- after all, what's the point in processing documents if the application does not recognize the elements?

Yet to be of any use, the application must validate against a known schema, which, until JAXP 1.2, was easier said than done. For instance, how do you associate a document with its schema in a portable way? In most cases, through the xsi:schemaLocation attribute. The attribute takes pairs of namespace URIs and the associated schema file (there's an xsi:noNamespaceSchemaLocation attribute for documents with no namespaces). In Listing 1, the schemaLocation attribute associates the http://ananas.org/2003/tips/validate namespace with the file simple.xsd.

Listing 1. XML document with an xsi:schemaLocation attribute
<?xml version="1.0"?>
<simple:Root
   xmlns:simple="http://ananas.org/2003/tips/validate"
   xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
   xsi:schemaLocation="http://ananas.org/2003/tips/validate simple.xsd">
      Document content comes here.
</simple:Root>
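
The pairing of namespace URIs and schema files can be made concrete with a few lines of Java. The following is only a sketch: SchemaLocationPairs is a hypothetical helper, not part of JAXP, and it assumes the attribute value has already been read from the document as a string.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of JAXP: it only illustrates how the
// value of xsi:schemaLocation decomposes into pairs.
final class SchemaLocationPairs
{
   // Returns a map from namespace URI to schema location hint,
   // in the order the pairs appear in the attribute value.
   static Map<String, String> parse(String schemaLocation)
   {
      Map<String, String> pairs = new LinkedHashMap<String, String>();
      String value = schemaLocation.trim();
      if(value.isEmpty())
         return pairs;

      // The attribute value is a whitespace-separated list of URIs
      String[] tokens = value.split("\\s+");
      if(tokens.length % 2 != 0)
         throw new IllegalArgumentException(
            "expected pairs of namespace URI and location: " + value);

      for(int i = 0; i < tokens.length; i += 2)
         pairs.put(tokens[i], tokens[i + 1]);
      return pairs;
   }

   public static void main(String[] args)
   {
      // The attribute value from Listing 1
      System.out.println(parse(
         "http://ananas.org/2003/tips/validate simple.xsd"));
   }
}
```

Note that the second item of each pair (simple.xsd here) is a location chosen by whoever wrote the document, not by the application that reads it.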

The attribute is a simple solution for managing XML schemas, but it has one big flaw: It assumes that your application can control the xsi:schemaLocation attribute. Depending on your application, this may or may not be the case. Consider the following scenarios:

  • An XML editor, such as Corel XMetaL or XMLmind XML Editor, lets you change the xsi:schemaLocation attribute so it points to the appropriate file.
  • A Web publishing framework such as XM or Cocoon may expect that writers set the xsi:schemaLocation correctly, which may not be the case if several authors have contributed to the site.
  • An electronic commerce server that processes incoming XML documents needs to validate the documents, but may not be able to trust that the other party has set the schema correctly.

As these scenarios illustrate, the xsi:schemaLocation attribute works well in small-scale applications, but it becomes increasingly difficult to manage in more distributed environments. For example, the chances are small that the schema will be stored under the same name on different computers.

By definition, an application validates a document if there's a risk that the document is incorrect. The validation cannot be robust if it depends on the content of the document, such as xsi:schemaLocation. Why would the attribute be more correct than the rest of the document? Clearly another solution is needed, one that puts more control in the hands of the application.


Schema support in JAXP 1.2

URIs and properties

JAXP uses URIs as identifiers for properties and attributes, which is consistent with the use of URIs as namespace identifiers. Unfortunately, this use of URIs may be slightly confusing. Users have come to expect that anything that starts with http:// is a Web site. Not in this case: Those URIs are identifiers, and if you try to visit them in a browser, you'll most likely be greeted by a "404 - File not found" error message.

The schema specification correctly recognizes the lack of robustness due to xsi:schemaLocation. According to the specification, xsi:schemaLocation is only a hint to the parser and the parser may use other means to decide which schema to apply. Unfortunately the specification does not say what those other means should be. JAXP 1.2, a maintenance release for JAXP, fills in the blanks by providing a standard mechanism on the Java platform.

Essentially, JAXP 1.2 defines two new properties (for SAX parsers) and two new attributes (for DOM parsers) that control schema validation. The first property (http://java.sun.com/xml/jaxp/properties/schemaLanguage) specifies the schema language to use. For the time being, the only acceptable value is http://www.w3.org/2001/XMLSchema (the W3C recommendation on XML Schema). Future releases may support other values for RELAX NG or other schema languages.

The second property (http://java.sun.com/xml/jaxp/properties/schemaSource) sets the location of the schema. This is the more interesting of the two. It accepts several kinds of values:

  • A string with the URI of the schema.
  • An InputStream object with the content of the schema.
  • An InputSource object pointing to the schema.
  • A File object pointing to the schema file.
  • An array of Objects, each element being one of the types above. The array is useful if your application accepts documents that can conform to different schemas.
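
To make the list above concrete, here is a minimal sketch that passes each kind of value to the schemaSource property of a SAX parser. It assumes a JAXP 1.2-compliant parser and a current JDK. The class name SchemaSourceKinds, the inline schema, and the temporary file are illustrative assumptions; the real simple.xsd may look different.

```java
import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

final class SchemaSourceKinds
{
   static final String SCHEMA_LANGUAGE =
      "http://java.sun.com/xml/jaxp/properties/schemaLanguage";
   static final String XML_SCHEMA =
      "http://www.w3.org/2001/XMLSchema";
   static final String SCHEMA_SOURCE =
      "http://java.sun.com/xml/jaxp/properties/schemaSource";

   // Assumed schema for the namespace of Listing 1: a Root element
   // that holds a string. The real simple.xsd may differ.
   static final String SCHEMA =
      "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'"
      + " targetNamespace='http://ananas.org/2003/tips/validate'>"
      + "<xs:element name='Root' type='xs:string'/>"
      + "</xs:schema>";

   static final String DOCUMENT =
      "<simple:Root"
      + " xmlns:simple='http://ananas.org/2003/tips/validate'>"
      + "Document content comes here.</simple:Root>";

   // Validates xml against schemaSource, which may be any of the
   // value kinds listed above, and returns the error messages.
   static List<String> validate(String xml, Object schemaSource)
      throws Exception
   {
      SAXParserFactory factory = SAXParserFactory.newInstance();
      factory.setNamespaceAware(true);
      factory.setValidating(true);
      SAXParser parser = factory.newSAXParser();
      // Set the language first, then the source
      parser.setProperty(SCHEMA_LANGUAGE, XML_SCHEMA);
      parser.setProperty(SCHEMA_SOURCE, schemaSource);

      final List<String> errors = new ArrayList<String>();
      parser.parse(new InputSource(new StringReader(xml)),
                   new DefaultHandler()
      {
         public void error(SAXParseException x)
         {
            errors.add(x.getMessage());
         }
      });
      return errors;
   }

   public static void main(String[] args)
      throws Exception
   {
      // Store the schema in a temporary file for the file-based kinds
      byte[] bytes = SCHEMA.getBytes(StandardCharsets.UTF_8);
      File file = File.createTempFile("simple", ".xsd");
      file.deleteOnExit();
      Files.write(file.toPath(), bytes);

      Object[] sources =
      {
         file.toURI().toString(),                    // String with a URI
         new ByteArrayInputStream(bytes),            // InputStream
         new InputSource(new StringReader(SCHEMA)),  // InputSource
         file,                                       // File
         new Object[] { file }                       // array of the above
      };
      for(Object source : sources)
         System.out.println(source.getClass().getSimpleName()
                            + ": " + validate(DOCUMENT, source));
   }
}
```

Streams and readers can be consumed only once, so the sketch builds a fresh schema source for every parse.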

A SAX example

Listing 2 demonstrates how to use the new properties in JAXP 1.2 to validate a document through a SAX parser. To use the SAX parser to validate a document:

  1. Create a SAXParserFactory object.
  2. Set the namespace-aware and validating properties to true.
  3. Obtain a SAXParser object.
  4. Set the properties for the schema language and schema source (this step is new in JAXP 1.2).
  5. Parse the document. The parser must have access to an ErrorHandler object.

Listing 2. ValidateSAX.java demonstrates JAXP 1.2
package org.ananas.tips;

import java.io.*;
import org.xml.sax.*;
import javax.xml.parsers.*;

public class ValidateSAX
{
   public static String SCHEMA_LANGUAGE =
      "http://java.sun.com/xml/jaxp/properties/schemaLanguage",
                        XML_SCHEMA =
      "http://www.w3.org/2001/XMLSchema",
                        SCHEMA_SOURCE =
      "http://java.sun.com/xml/jaxp/properties/schemaSource";

   public final static void main(String[] args)
      throws IOException, SAXException, ParserConfigurationException
   {
      if(args.length < 2)
      {
         System.err.println("usage is:");
         System.err.println("   java -jar tips.jar -validatesax "
                            + "input.xml schema.xsd");
         return;
      }

      File input = new File(args[0]),
           schema = new File(args[1]);
      SAXParserFactory factory = SAXParserFactory.newInstance();
      factory.setNamespaceAware(true);
      factory.setValidating(true);
      SAXParser parser = factory.newSAXParser();
      try
      {
         parser.setProperty(SCHEMA_LANGUAGE,XML_SCHEMA);
         parser.setProperty(SCHEMA_SOURCE,schema);
      }
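      catch(SAXNotRecognizedException x)
      {
         // Without the schema properties the parser cannot validate
         // against the schema, so stop instead of parsing anyway.
         System.err.println("Your SAX parser is not JAXP 1.2 compliant.");
         return;
      }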
      parser.parse(input,new ErrorPrinter());
   }
}

ErrorHandler and validation

The validating property tells the parser to report validation errors to its ErrorHandler object. In practice, this means that if you don't register an ErrorHandler, you won't see the error messages. Some programmers expect the parser to throw an exception if it cannot validate the document, but that is not how a SAX parser behaves.

To test Listing 2, you need a JAXP 1.2-compliant parser. Check the documentation for your favorite parser or download the most recent version of Apache Xerces (I have used version 2.4.0 to prepare this tip). If your parser is not JAXP 1.2-compliant, it throws a SAXNotRecognizedException exception when you try to set the property. That's your cue to upgrade to the latest version of Xerces.

Listing 2 only registers a DefaultHandler object that simply prints validation errors on the console, as shown in Listing 3. Your application could register a more interesting handler, such as one that does something with the document content.

Listing 3. ErrorPrinter.java
package org.ananas.tips;

import java.text.*;
import org.xml.sax.*;
import org.xml.sax.helpers.*;

public class ErrorPrinter
   extends DefaultHandler
{
   private MessageFormat message =
      new MessageFormat("({0}: {1}, {2}): {3}");

   private void print(SAXParseException x)
   {
      String msg = message.format(new Object[]
                                  {
                                     x.getSystemId(),
                                     new Integer(x.getLineNumber()),
                                     new Integer(x.getColumnNumber()),
                                     x.getMessage()
                                  });
      System.out.println(msg);
   }

   public void warning(SAXParseException x)
   {
      print(x);
   }

   public void error(SAXParseException x)
   {
      print(x);
   }

   public void fatalError(SAXParseException x)
      throws SAXParseException
   {
      print(x);
      throw x;
   }
}
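
If your application would rather receive an exception when a document is invalid, the handler itself can throw one. The following sketch is a variation on Listing 3 under that assumption. StrictValidation, StrictHandler, and isValid are hypothetical names, and the schema is passed as a string only to keep the example self-contained.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

final class StrictValidation
{
   static final String SCHEMA_LANGUAGE =
      "http://java.sun.com/xml/jaxp/properties/schemaLanguage";
   static final String XML_SCHEMA =
      "http://www.w3.org/2001/XMLSchema";
   static final String SCHEMA_SOURCE =
      "http://java.sun.com/xml/jaxp/properties/schemaSource";

   // Hypothetical counterpart to ErrorPrinter: the first validation
   // error aborts the parse. DefaultHandler already rethrows fatal
   // errors and ignores warnings.
   static class StrictHandler
      extends DefaultHandler
   {
      public void error(SAXParseException x)
         throws SAXParseException
      {
         throw x;
      }
   }

   // Returns true if xml validates against schema, false otherwise
   static boolean isValid(String xml, String schema)
      throws Exception
   {
      SAXParserFactory factory = SAXParserFactory.newInstance();
      factory.setNamespaceAware(true);
      factory.setValidating(true);
      SAXParser parser = factory.newSAXParser();
      parser.setProperty(SCHEMA_LANGUAGE, XML_SCHEMA);
      parser.setProperty(SCHEMA_SOURCE,
                         new InputSource(new StringReader(schema)));
      try
      {
         parser.parse(new InputSource(new StringReader(xml)),
                      new StrictHandler());
         return true;
      }
      catch(SAXException x)
      {
         // Thrown by StrictHandler, or by the parser for documents
         // that are not well formed
         System.err.println("rejected: " + x.getMessage());
         return false;
      }
   }

   public static void main(String[] args)
      throws Exception
   {
      // Illustrative schema: a Root element that holds an integer
      String schema =
         "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
         + "<xs:element name='Root' type='xs:integer'/>"
         + "</xs:schema>";
      System.out.println(isValid("<Root>42</Root>", schema));
      System.out.println(isValid("<Root>forty-two</Root>", schema));
   }
}
```

Throwing from error() stops the parser at the first problem, whereas ErrorPrinter lets it continue and report every validation error in the document. Which behavior fits depends on the application.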

What about DOM?

JAXP 1.2 also defines schema support for the DOM parser, as shown in Listing 4. The procedure is very similar to that of a SAX parser, the only difference being that you set attributes on the factory object instead of properties on the parser object. The detailed procedure is:

  1. Create a DocumentBuilderFactory object.
  2. Set the namespace-aware and validating properties to true.
  3. Set the attributes for the schema language and schema source. If your parser is not JAXP 1.2-compliant, it will throw an IllegalArgumentException exception.
  4. Obtain a DocumentBuilder object (the parser).
  5. Register an ErrorHandler object with the parser.
  6. Parse the document.

This example only demonstrates validation. Your application could do more interesting things with the parse tree.

Listing 4. ValidateDOM.java
package org.ananas.tips;

import java.io.*;
import org.xml.sax.*;
import javax.xml.parsers.*;

public class ValidateDOM
{
   public static String SCHEMA_LANGUAGE =
      "http://java.sun.com/xml/jaxp/properties/schemaLanguage",
                        XML_SCHEMA =
      "http://www.w3.org/2001/XMLSchema",
                        SCHEMA_SOURCE =
      "http://java.sun.com/xml/jaxp/properties/schemaSource";
   public final static void main(String[] args)
      throws IOException, SAXException, ParserConfigurationException
   {
      if(args.length < 2)
      {
         System.err.println("usage is:");
         System.err.println("   java -jar tips.jar -validatedom "
                            + "input.xml schema.xsd");
         return;
      }

      File input = new File(args[0]),
           schema = new File(args[1]);
      DocumentBuilderFactory factory =
         DocumentBuilderFactory.newInstance();
      factory.setNamespaceAware(true);
      factory.setValidating(true);
      try
      {
         factory.setAttribute(SCHEMA_LANGUAGE,XML_SCHEMA);
         factory.setAttribute(SCHEMA_SOURCE,schema);
      }
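      catch(IllegalArgumentException x)
      {
         // Without the schema attributes the parser cannot validate
         // against the schema, so stop instead of parsing anyway.
         System.err.println("Your DOM parser is not JAXP 1.2 compliant.");
         return;
      }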
      DocumentBuilder parser = factory.newDocumentBuilder();
      parser.setErrorHandler(new ErrorPrinter());
      parser.parse(input);
   }
}
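
An application that goes on to use the parse tree will usually want to do so only when validation found no errors. The sketch below shows one way to arrange this, assuming a JAXP 1.2-compliant DOM parser. ValidatedDom and its parse method are hypothetical names, and the schema is passed as a string only to keep the example self-contained.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

final class ValidatedDom
{
   static final String SCHEMA_LANGUAGE =
      "http://java.sun.com/xml/jaxp/properties/schemaLanguage";
   static final String XML_SCHEMA =
      "http://www.w3.org/2001/XMLSchema";
   static final String SCHEMA_SOURCE =
      "http://java.sun.com/xml/jaxp/properties/schemaSource";

   // Returns the parse tree if xml validates against schema,
   // otherwise throws the first validation error.
   static Document parse(String xml, String schema)
      throws Exception
   {
      DocumentBuilderFactory factory =
         DocumentBuilderFactory.newInstance();
      factory.setNamespaceAware(true);
      factory.setValidating(true);
      factory.setAttribute(SCHEMA_LANGUAGE, XML_SCHEMA);
      factory.setAttribute(SCHEMA_SOURCE,
                           new InputSource(new StringReader(schema)));
      DocumentBuilder builder = factory.newDocumentBuilder();

      // Collect validation errors; DefaultHandler rethrows fatal ones
      final List<SAXParseException> errors =
         new ArrayList<SAXParseException>();
      builder.setErrorHandler(new DefaultHandler()
      {
         public void error(SAXParseException x)
         {
            errors.add(x);
         }
      });

      Document document =
         builder.parse(new InputSource(new StringReader(xml)));
      if(!errors.isEmpty())
         throw errors.get(0);
      return document;
   }

   public static void main(String[] args)
      throws Exception
   {
      // Illustrative schema: a Root element that holds an integer
      String schema =
         "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
         + "<xs:element name='Root' type='xs:integer'/>"
         + "</xs:schema>";
      Document document = parse("<Root>42</Root>", schema);
      System.out.println(
         document.getDocumentElement().getTextContent());
   }
}
```

Unlike a handler that throws immediately, this version lets the parser finish building the tree and then decides whether to hand it to the caller.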

Towards more robust XML applications

When implementing robust validations with XML schemas, keep in mind that -- almost by definition -- when your application validates documents, it should not depend on those documents being correct. More specifically, it should not depend on documents having the appropriate xsi:schemaLocation attribute.


Download

Description                     Name                    Size
Code sample for this article    x-tipvalschmcode.zip    11KB

Resources

  • Download the source code used in this article. You also need the latest version of Apache Xerces or another JAXP 1.2-compliant parser.
  • XMLmind XML Editor and Corel XMetaL are representative of XML editors with schema support. XM and Cocoon are representative of publishing solutions. When there are not too many writers, it is possible to depend on xsi:schemaLocation with these applications.
  • RELAX NG is an alternative to XML Schema.
  • In his document, XML Schemas: Best Practices, Roger Costello discusses different design techniques with exotic names such as Venetian blinds or Russian doll.
  • Find more XML resources on the developerWorks XML zone. For a complete list of XML tips to date, check out the tips summary page.
  • IBM trial software: Build your next development project with trial software available for download directly from developerWorks.
