
Tip: Tell a parser where to find a schema

More useful document validation with JAXP 1.2

Benoit Marchal (bmarchal@pineapplesoft.com), Consultant, Pineapplesoft
Benoit Marchal is a Belgian consultant. He is the author of XML by Example and other XML books. Contact Benoit at bmarchal@pineapplesoft.com for help with your XML projects.

Summary:  This tip shows you how to implement robust document validation with XML schema and JAXP 1.2. Examples are included for SAX and DOM parsers.


Date:  22 May 2003
Level:  Introductory

Most discussions on schemas center on the best possible vocabularies or on how to organize a schema efficiently (Russian dolls, Venetian blinds, salami slices, and so on). Also, the dispute about the most appropriate schema language is ongoing -- is it DTDs, W3C's XML Schema, or OASIS' Relax NG?

These are important considerations. Yet when you design an XML application, it's even more important to know what to do with the schema. This tip discusses new features in Java API for XML Processing (JAXP) 1.2 that give you more flexibility in validating documents against schemas.

Validating documents

Typically, an application validates XML documents against a list of known schemas as part of its error handling. Schemas describe the vocabulary: the names of elements, the attributes, and their datatypes (such as integer, string, and date). If a document validates against a schema, it conforms to a vocabulary that the application recognizes. Validating is useful -- after all, what's the point in processing documents if the application does not recognize the elements?

Yet validation is only useful if the application validates against a known schema, which, until JAXP 1.2, was easier said than done. For instance, how do you associate a document with its schema in a portable way? In most cases, through the xsi:schemaLocation attribute. The attribute takes pairs, each made of a namespace URI and the location of the associated schema file (an xsi:noNamespaceSchemaLocation attribute serves documents with no namespace). In Listing 1, the schemaLocation attribute associates the http://ananas.org/2003/tips/validate namespace with the file simple.xsd.


Listing 1. XML document with an xsi:schemaLocation attribute
                
<?xml version="1.0"?>
<simple:Root
   xmlns:simple="http://ananas.org/2003/tips/validate"
   xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
   xsi:schemaLocation="http://ananas.org/2003/tips/validate simple.xsd">
      Document content comes here.
</simple:Root>

The attribute is a simple solution for managing XML schemas, but it has one big flaw: It assumes that your application can control the xsi:schemaLocation attribute. Depending on your application, this may or may not be the case. Consider the following scenarios:

  • An XML editor, such as Corel XMetaL or XMLmind XML Editor, lets you change the xsi:schemaLocation attribute so it points to the appropriate file.

  • A Web publishing framework such as XM or Cocoon may expect that writers set the xsi:schemaLocation correctly, which may not be the case if several authors have contributed to the site.

  • An electronic commerce server that processes incoming XML documents needs to validate the documents, but may not be able to trust that the other party has set the schema correctly.

As these scenarios illustrate, the xsi:schemaLocation attribute works well in small-scale applications, but it becomes increasingly difficult to manage in more distributed environments. For example, it is unlikely that the schema is stored under the same name on different computers.

By definition, an application validates a document if there's a risk that the document is incorrect. The validation cannot be robust if it depends on the content of the document, such as xsi:schemaLocation. Why would the attribute be more correct than the rest of the document? Clearly another solution is needed, one that puts more control in the hands of the application.


Schema support in JAXP 1.2

URIs and properties

JAXP uses URIs as identifiers for properties and attributes, which is consistent with the use of URIs as namespace identifiers. Unfortunately, this use of URIs can be confusing: users have come to expect that anything that starts with http:// is a Web site. That is not the case here. These URIs are only identifiers, and if you try to visit them in a browser, you'll most likely be greeted by a "404 - File not found" error message.

The schema specification correctly recognizes the lack of robustness due to xsi:schemaLocation. According to the specification, xsi:schemaLocation is only a hint to the parser and the parser may use other means to decide which schema to apply. Unfortunately the specification does not say what those other means should be. JAXP 1.2, a maintenance release for JAXP, fills in the blanks by providing a standard mechanism on the Java platform.

Essentially, JAXP 1.2 defines two new properties (for SAX parsers) and two new attributes (for DOM parsers) that control schema validation. The first property (http://java.sun.com/xml/jaxp/properties/schemaLanguage) specifies the schema language to use. For the time being, the only acceptable value is http://www.w3.org/2001/XMLSchema (the W3C recommendation on XML schema). Future releases may support other values for Relax NG or other schema languages.

The second property (http://java.sun.com/xml/jaxp/properties/schemaSource) sets the location of the schema. This is the more interesting of the two. It accepts several kinds of values:

  • A string with the URI of the schema.
  • An InputStream object with the content of the schema.
  • An InputSource object pointing to the schema.
  • A File object pointing to the schema file.
  • An array of objects, each of one of the types above. The array is useful if your application accepts documents that can conform to different schemas.
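
As a rough illustration of the array form, the following sketch hands two schemas to a SAX parser and then validates documents from either vocabulary. The class name, the urn:example namespaces, and the inline schemas are assumptions invented for this example; they are not part of the article's download. The two schemas use different target namespaces because JAXP does not allow two schemas in the array to share a namespace.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: the namespaces and schemas are invented.
class SchemaSourceArray
{
   static final String SCHEMA_LANGUAGE =
      "http://java.sun.com/xml/jaxp/properties/schemaLanguage";
   static final String XML_SCHEMA =
      "http://www.w3.org/2001/XMLSchema";
   static final String SCHEMA_SOURCE =
      "http://java.sun.com/xml/jaxp/properties/schemaSource";

   // Two tiny schemas, each with its own target namespace.
   static final String ORDER_XSD =
      "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'"
      + " targetNamespace='urn:example:order'>"
      + "<xs:element name='Order' type='xs:positiveInteger'/>"
      + "</xs:schema>";
   static final String INVOICE_XSD =
      "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'"
      + " targetNamespace='urn:example:invoice'>"
      + "<xs:element name='Invoice' type='xs:date'/>"
      + "</xs:schema>";

   // Returns the validation error messages; an empty list means valid.
   static List<String> validate(String xml)
   {
      final List<String> errors = new ArrayList<String>();
      try
      {
         SAXParserFactory factory = SAXParserFactory.newInstance();
         factory.setNamespaceAware(true);
         factory.setValidating(true);
         SAXParser parser = factory.newSAXParser();
         parser.setProperty(SCHEMA_LANGUAGE, XML_SCHEMA);
         // The array form: one entry per schema the application accepts.
         parser.setProperty(SCHEMA_SOURCE, new Object[]
            {
               new InputSource(new StringReader(ORDER_XSD)),
               new InputSource(new StringReader(INVOICE_XSD))
            });
         parser.parse(new InputSource(new StringReader(xml)),
                      new DefaultHandler()
                      {
                         public void error(SAXParseException x)
                         {
                            errors.add(x.getMessage());
                         }
                      });
      }
      catch(ParserConfigurationException | SAXException | IOException x)
      {
         // Configuration problems and fatal (well-formedness) errors.
         throw new IllegalStateException(x);
      }
      return errors;
   }

   public static void main(String[] args)
   {
      System.out.println(validate(
         "<o:Order xmlns:o='urn:example:order'>42</o:Order>"));
      System.out.println(validate(
         "<o:Order xmlns:o='urn:example:order'>abc</o:Order>"));
   }
}
```

Neither test document carries an xsi:schemaLocation attribute; the parser relies only on the schemas that the application supplied. Stream-based sources can be read only once, so this sketch builds fresh ones for every call.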

A SAX example

Listing 2 demonstrates how to use the new properties in JAXP 1.2 to validate a document through a SAX parser. To use the SAX parser to validate a document:

  1. Create a SAXParserFactory object.
  2. Set the namespace-aware and validating properties to true.
  3. Obtain a SAXParser object.
  4. Set the properties for the schema language and schema source (this step is new in JAXP 1.2).
  5. Parse the document. The parser must have access to an ErrorHandler object.

Listing 2. ValidateSAX.java demonstrates JAXP 1.2

package org.ananas.tips;

import java.io.*;
import org.xml.sax.*;
import javax.xml.parsers.*;

public class ValidateSAX
{
   // JAXP 1.2 property names and the identifier for W3C XML Schema
   public static final String SCHEMA_LANGUAGE =
      "http://java.sun.com/xml/jaxp/properties/schemaLanguage";
   public static final String XML_SCHEMA =
      "http://www.w3.org/2001/XMLSchema";
   public static final String SCHEMA_SOURCE =
      "http://java.sun.com/xml/jaxp/properties/schemaSource";

   public static void main(String[] args)
      throws IOException, SAXException, ParserConfigurationException
   {
      if(args.length < 2)
      {
         System.err.println("usage is:");
         System.err.println("   java -jar tips.jar -validatesax "
                            + "input.xml schema.xsd");
         return;
      }

      File input = new File(args[0]),
           schema = new File(args[1]);
      SAXParserFactory factory = SAXParserFactory.newInstance();
      factory.setNamespaceAware(true);
      factory.setValidating(true);
      SAXParser parser = factory.newSAXParser();
      try
      {
         // set the schema language before the schema source
         parser.setProperty(SCHEMA_LANGUAGE,XML_SCHEMA);
         parser.setProperty(SCHEMA_SOURCE,schema);
      }
      catch(SAXNotRecognizedException x)
      {
         System.err.println("Your SAX parser is not JAXP 1.2 compliant.");
         // without these properties the schema cannot be applied, so stop
         return;
      }
      // validation errors are reported to the ErrorPrinter (Listing 3)
      parser.parse(input,new ErrorPrinter());
   }
}

ErrorHandler and validation

The validating property tells the parser to report validation errors to its ErrorHandler object. In practice, this means that if you don't register an ErrorHandler, you won't see the error messages. Some programmers expect the parser to throw an exception if it cannot validate the document, but that is not how a SAX parser behaves.

To test Listing 2, you need a JAXP 1.2-compliant parser. Check the documentation for your favorite parser or download the most recent version of Apache Xerces (I used version 2.4.0 to prepare this tip). If your parser is not JAXP 1.2-compliant, it throws a SAXNotRecognizedException when you try to set the property. That's your cue to upgrade to the latest version of Xerces.

Listing 2 only registers a DefaultHandler object that simply prints validation errors on the console, as shown in Listing 3. Your application could register a more interesting handler, such as one that does something with the document content.


Listing 3. ErrorPrinter.java
                
package org.ananas.tips;

import java.text.*;
import org.xml.sax.*;
import org.xml.sax.helpers.*;

public class ErrorPrinter
   extends DefaultHandler
{
   // Output pattern: (system ID: line, column): message
   private MessageFormat message =
      new MessageFormat("({0}: {1}, {2}): {3}");

   private void print(SAXParseException x)
   {
      String msg = message.format(new Object[]
                                  {
                                     x.getSystemId(),
                                     Integer.valueOf(x.getLineNumber()),
                                     Integer.valueOf(x.getColumnNumber()),
                                     x.getMessage()
                                  });
      System.out.println(msg);
   }

   public void warning(SAXParseException x)
   {
      print(x);
   }

   public void error(SAXParseException x)
   {
      print(x);
   }

   public void fatalError(SAXParseException x)
      throws SAXParseException
   {
      print(x);
      throw x;
   }
}
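
Listing 3 only prints errors, so the calling code cannot tell whether the document was valid. One possible alternative, sketched here under the assumption that the application wants to decide for itself after parsing, is a handler that records the errors. The ErrorCollector class and its isValid() and getErrors() methods are hypothetical names, not part of the article's download.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical alternative to ErrorPrinter; not part of the article's code.
class ErrorCollector
   extends DefaultHandler
{
   private final List<SAXParseException> errors =
      new ArrayList<SAXParseException>();

   // Validation errors are recoverable: record them and keep parsing.
   public void error(SAXParseException x)
   {
      errors.add(x);
   }

   // Fatal errors, such as malformed XML, stop the parser.
   public void fatalError(SAXParseException x)
      throws SAXParseException
   {
      throw x;
   }

   // True if no validation error has been reported so far.
   boolean isValid()
   {
      return errors.isEmpty();
   }

   // Read-only view of the recorded errors.
   List<SAXParseException> getErrors()
   {
      return Collections.unmodifiableList(errors);
   }
}
```

Because the class extends DefaultHandler, it can be passed to parser.parse() in Listing 2 in place of ErrorPrinter. After the call returns, the application checks isValid() and rejects the document if needed.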


What about DOM?

JAXP 1.2 also defines schema support for the DOM parser, as shown in Listing 4. The procedure is very similar to that of a SAX parser, the only difference being that you set attributes on the factory object instead of properties on the parser object. The detailed procedure is:

  1. Create a DocumentBuilderFactory object.
  2. Set the namespace-aware and validating properties to true.
  3. Set the attributes for the schema language and schema source. If your parser is not JAXP 1.2-compliant, it will throw an IllegalArgumentException.
  4. Obtain a DocumentBuilder object (the parser).
  5. Register an ErrorHandler object with the parser.
  6. Parse the document.

This example only demonstrates validation. Your application could do more interesting things with the parse tree.


Listing 4. ValidateDOM.java
                
package org.ananas.tips;

import java.io.*;
import org.xml.sax.*;
import javax.xml.parsers.*;

public class ValidateDOM
{
   // JAXP 1.2 attribute names and the identifier for W3C XML Schema
   public static final String SCHEMA_LANGUAGE =
      "http://java.sun.com/xml/jaxp/properties/schemaLanguage";
   public static final String XML_SCHEMA =
      "http://www.w3.org/2001/XMLSchema";
   public static final String SCHEMA_SOURCE =
      "http://java.sun.com/xml/jaxp/properties/schemaSource";

   public static void main(String[] args)
      throws IOException, SAXException, ParserConfigurationException
   {
      if(args.length < 2)
      {
         System.err.println("usage is:");
         System.err.println("   java -jar tips.jar -validatedom "
                            + "input.xml schema.xsd");
         return;
      }

      File input = new File(args[0]),
           schema = new File(args[1]);
      DocumentBuilderFactory factory =
         DocumentBuilderFactory.newInstance();
      factory.setNamespaceAware(true);
      factory.setValidating(true);
      try
      {
         // attributes are set on the factory, not on the parser
         factory.setAttribute(SCHEMA_LANGUAGE,XML_SCHEMA);
         factory.setAttribute(SCHEMA_SOURCE,schema);
      }
      catch(IllegalArgumentException x)
      {
         System.err.println("Your DOM parser is not JAXP 1.2 compliant.");
         // without these attributes the schema cannot be applied, so stop
         return;
      }
      DocumentBuilder parser = factory.newDocumentBuilder();
      parser.setErrorHandler(new ErrorPrinter());
      parser.parse(input);
   }
}
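
To give an idea of what doing more with the parse tree might look like, here is a minimal sketch that validates a document and then reads a value from the resulting DOM tree. The ValidateThenRead class, the readOrder() method, the urn:example:order namespace, and the inline schema are all assumptions invented for this example. The handler rethrows validation errors so that the tree is read only when the document is valid.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: the namespace and schema are invented.
class ValidateThenRead
{
   static final String SCHEMA_LANGUAGE =
      "http://java.sun.com/xml/jaxp/properties/schemaLanguage";
   static final String XML_SCHEMA =
      "http://www.w3.org/2001/XMLSchema";
   static final String SCHEMA_SOURCE =
      "http://java.sun.com/xml/jaxp/properties/schemaSource";

   // The schema only allows a positive integer as the content of Order.
   static final String ORDER_XSD =
      "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'"
      + " targetNamespace='urn:example:order'>"
      + "<xs:element name='Order' type='xs:positiveInteger'/>"
      + "</xs:schema>";

   // Returns the order number, or throws if the document is invalid.
   static int readOrder(String xml)
   {
      try
      {
         DocumentBuilderFactory factory =
            DocumentBuilderFactory.newInstance();
         factory.setNamespaceAware(true);
         factory.setValidating(true);
         factory.setAttribute(SCHEMA_LANGUAGE, XML_SCHEMA);
         factory.setAttribute(SCHEMA_SOURCE,
            new InputSource(new StringReader(ORDER_XSD)));
         DocumentBuilder parser = factory.newDocumentBuilder();
         // Rethrow validation errors so that parse() fails on them.
         parser.setErrorHandler(new DefaultHandler()
            {
               public void error(SAXParseException x)
                  throws SAXException
               {
                  throw x;
               }
            });
         Document doc =
            parser.parse(new InputSource(new StringReader(xml)));
         // The document is valid, so the content is a positive integer.
         String text = doc.getDocumentElement().getTextContent();
         return Integer.parseInt(text.trim());
      }
      catch(SAXException x)
      {
         throw new IllegalArgumentException("invalid document", x);
      }
      catch(ParserConfigurationException | IOException x)
      {
         throw new IllegalStateException(x);
      }
   }

   public static void main(String[] args)
   {
      System.out.println(readOrder(
         "<o:Order xmlns:o='urn:example:order'>42</o:Order>"));
   }
}
```

Without validation, a document containing -5 would be read without complaint. With the schema applied, readOrder() rejects it before the application touches the tree.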


Towards more robust XML applications

When implementing robust validations with XML schemas, keep in mind that -- almost by definition -- when your application validates documents, it should not depend on those documents being correct. More specifically, it should not depend on documents having the appropriate xsi:schemaLocation attribute.



Download

Description: Code sample for this article
Name: x-tipvalschmcode.zip
Size: 11KB
Download method: HTTP


