
Working XML: Safe coding practices, Part 3

Validation, error handling, and schemas

Benoit Marchal (bmarchal@pineapplesoft.com), Consultant, Pineapplesoft
Benoît Marchal is a Belgian consultant. He is the author of XML by Example, Second Edition and other XML books. You can contact him at bmarchal@pineapplesoft.com or through his personal site at www.marchal.com.

Summary:  Benoît continues his four-part series of columns reviewing common pitfalls with XML technology. He turns your attention now to validation of documents and error handling. Learn how to avoid common mistakes when you design and implement error handling in your XML applications.


Date:  27 Sep 2005
Level:  Intermediate

This is the third article in a four-part series discussing common XML pitfalls and, more importantly, ways around them. As a consultant and trainer, I have noticed that many companies and developers make the same mistakes when they adopt XML technology. This series is an attempt to document some of those problems and spare you the annoyance of dealing with them.

Schema and interface

Part 1 looked at common misunderstandings with the XML standard itself (such as encodings and namespaces). Part 2 focused on design issues: how and where to introduce XML support in an application. One of the guiding principles of Part 2 is to treat XML files as an interface between applications and to apply the time-tested design techniques used with JavaBeans and other interfaces: separation of tasks, documentation, built-in evolution, and more.

This article looks at validation, error handling, and schemas. Schemas primarily address the design, documentation, and validation of an interface. I'll focus primarily on validation.

Of DTDs and schemas

First, some vocabulary definitions. The W3C XML Schema Recommendation is one schema language for XML. Others include Document Type Definitions (DTDs), RELAX NG, and Schematron (see Resources). In this article, Schema (capital S) indicates the W3C XML Schema Recommendation. Lowercased, schema refers to the general concept of a schema language.

In 1998, when XML emerged as a W3C recommendation, DTDs were something of a novelty. They'd been around since SGML was adopted by the International Organization for Standardization (ISO) in 1986, but few other file formats offered a validation mechanism, and none of those that did proved as popular as XML. Still, the underlying concept isn't new; it derives directly from database schemas. Essentially, a DTD describes the vocabulary's structure (the tags and attributes), similar to the way a database schema describes a database's structure (the tables and columns). The W3C later released XML Schema as a more powerful schema language.

Experience with database design shows that schemas are most useful as a safeguard against programming errors. Storing incomplete or incorrect data is less likely when you use a schema because the data must conform to clear rules. The usefulness of schemas increases with the number of applications accessing the information. The more applications that access and modify the data, the greater the need for the structure and guidance that a schema provides.

However, database developers have long known that a schema does not replace error handling. Database validation is the last chance to catch errors, but a good application has already validated the data at that point (and error messages from the database engine are anything but readable).

By and large, this experience holds true with schemas for XML. The more applications that use a given vocabulary, the more you need a schema to define a common framework. And although schemas provide some error handling, it is useful to complement them in the application.

Which schema language?

The DTD was the original schema language for XML. A direct transplant from SGML, the DTD is too limited for many applications. Still, because it is the oldest schema language, it is the most widely available. It always pays to keep a DTD around, because some of the older products in your toolkit might not work with more recent alternatives.

XML Schemas are more powerful and more modern. Among the differences between DTDs and XML Schemas (see Resources for a couple of articles that compare the two), three important ones matter in my practice:

  • XML namespaces
  • DOCTYPE declarations
  • Rich data typing

Lack of support for namespaces is the single most glaring deficiency of DTDs. (You can emulate namespaces partially through parameter entities, but it is complex and not totally satisfactory.) XML Schemas make namespaces a first-class citizen for XML. I discuss DOCTYPE statements and data typing in the next section, Schema for validation.

XML Schema is a complex recommendation. This complexity has led to the development of alternate schema languages that emphasize simplicity. The best-known alternative is RELAX NG, which the ISO is standardizing. Although technically interesting, those efforts lack the W3C's support, which translates into less support from tool vendors. My customers have shown little interest in these alternatives, and vendors don't offer much support for them, so I don't use them often.


Schema for validation

As I mentioned earlier, schemas have essentially three applications: design, documentation, and validation. Simple vocabularies are often designed by carefully crafting a corresponding schema. With proper annotations, the schema serves as documentation for developers. Designing the schema directly works well for tiny vocabularies. Anything larger is best served by a real modeling language such as UML (see Resources for a previous Working XML series that discusses UML). In this article I'll concentrate on the use of schemas for validation.

I have noticed three common errors in the use of schemas:

  • Making them too stringent (a mistake I tend to make; I always have to double-check myself)
  • Failing to design proper error handling
  • Implementing validation unreliably

How strict are you?

How stringent should a schema be? This is the first question to ask when you develop a new schema. Designers tend to be strict in an attempt to prevent errors; catching errors early through proper validation of files helps build more stable systems. Yet experience (including experience with database designs) shows that you need to balance strictness with what, for lack of a better word, I call clarity.

The dilemma boils down to this: Up to a certain point, the schema must match the expectations and understanding of the developers and other users of the vocabulary. In other words, the vocabulary should be easy to work with. Yet to design a schema that prevents as many errors as possible, you need to organize the vocabulary around the error checking, sometimes using advanced features such as inheritance. The result can be a complex schema that's hard to read and harder to implement correctly. Many users will find this too strict a framework. It can also be difficult to maintain, because adding or removing validation can require that you change the vocabulary.

Look at the common example of designing a vocabulary for international purchase orders. You need tags to record addresses, and you want those addresses to be correct because goods will be shipped to them. But how far is too far when it comes to validating the address? Take the state element, for example. States are required for U.S. addresses but do not exist (and have no equivalent) in most other countries. So you'd like to make the state tag mandatory (minOccurs="1" in XML Schema), but you can't because it won't work for most countries. One option is to have strict validation by introducing specialized address elements by country: U.S. addresses include a state; no other addresses do. This may sound attractive at first, but when you realize that there are 193 countries in the world it becomes clear that the schema will become bloated.

This case, in which tags are introduced only to strengthen validation, is an example of lack of clarity. Those tags carry a fair amount of redundant information, and their purpose will not be obvious to schema readers. Ironically, this can lead to errors. So what do you validate for? Validate as much as makes sense, but refrain from bending the data structure to push for more validation.
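The address trade-off can be sketched in Java. This is only an illustration under stated assumptions: the element names (address, street, city, state, country) are hypothetical, and the code uses the javax.xml.validation API, which arrived with JAXP 1.3 and is therefore newer than the JAXP 1.2 parsers discussed later in this article. The schema keeps a single address type and makes state optional (minOccurs="0"); a rule such as "U.S. addresses need a state" would be left to a later validation layer.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

/** Sketch: one address type for every country (element names are hypothetical). */
final class AddressSchemaDemo {
    // state is optional so the same structure fits countries with and without states.
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='address'><xs:complexType><xs:sequence>"
      + "<xs:element name='street' type='xs:string'/>"
      + "<xs:element name='city' type='xs:string'/>"
      + "<xs:element name='state' type='xs:string' minOccurs='0'/>"
      + "<xs:element name='country' type='xs:string'/>"
      + "</xs:sequence></xs:complexType></xs:element>"
      + "</xs:schema>";

    /** Returns true when the document is well formed and valid against XSD. */
    static boolean isValid(String xml) {
        try {
            SchemaFactory factory =
                SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Schema schema = factory.newSchema(new StreamSource(new StringReader(XSD)));
            // With no ErrorHandler set, the Validator throws on the first error.
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // syntax or structural error
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

With this schema, a U.S. address that includes a state and a Belgian address that omits it both pass, while an address with no city is rejected. The vocabulary stays small, at the price of not enforcing the country-specific rule in the schema itself.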

A layered approach

Validation and error handling are not a monolith. Your application can validate at different levels:

  • Structure: The schema specifies a structure that controls which tags will appear where, how often they repeat, and more -- for example, a purchase order line consists of a product number, description, quantity, and price.
  • Data typing: You can use data typing to control tag content -- for example, the quantity field in a purchase order is a positive integer (because the customer can't order less than a whole unit of the product).
  • Assertion: Use assertions to check relationships between fields -- for example, the purchase order's total is the sum of the order lines.

XMLFilter and assertions

If you use a SAX parser, you'll find that an XMLFilter (an interface, usually implemented by extending the XMLFilterImpl helper class) is a good place to locate the assertion validation layer. As the filter progresses through the document, it simply forwards correct tags to the application and otherwise reports errors.
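A minimal sketch of such a filter follows. It assumes a hypothetical purchase order in which each line carries a price element and the order ends with a total element; the class name and the check helper are likewise inventions for the example. The filter forwards every event unchanged and reports a recoverable error to the registered ErrorHandler when the total does not equal the sum of the prices. It also assumes that an earlier layer has already checked that the prices are well-formed numbers.

```java
import java.io.StringReader;
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

/** Sketch of an assertion layer: the order total must equal the sum of the prices. */
final class TotalAssertionFilter extends XMLFilterImpl {
    private final StringBuilder text = new StringBuilder();
    private BigDecimal sum = BigDecimal.ZERO;
    private BigDecimal total;
    private Locator locator;

    @Override public void setDocumentLocator(Locator locator) {
        this.locator = locator; // kept so that errors can carry a position
        super.setDocumentLocator(locator);
    }

    @Override public void startDocument() throws SAXException {
        sum = BigDecimal.ZERO; // reset so the filter can be reused
        total = null;
        super.startDocument();
    }

    @Override public void startElement(String uri, String local, String qName,
                                       Attributes atts) throws SAXException {
        text.setLength(0);
        super.startElement(uri, local, qName, atts);
    }

    @Override public void characters(char[] ch, int start, int length) throws SAXException {
        text.append(ch, start, length);
        super.characters(ch, start, length);
    }

    @Override public void endElement(String uri, String local, String qName)
            throws SAXException {
        if ("price".equals(qName)) {
            sum = sum.add(new BigDecimal(text.toString().trim()));
        } else if ("total".equals(qName)) {
            total = new BigDecimal(text.toString().trim());
        } else if ("order".equals(qName) && (total == null || total.compareTo(sum) != 0)) {
            // Forwarded to the registered ErrorHandler, if there is one.
            error(new SAXParseException(
                "total " + total + " does not match sum of lines " + sum, locator));
        }
        super.endElement(uri, local, qName);
    }

    /** Hypothetical helper: runs the filter and returns the assertion errors. */
    static List<String> check(String xml) {
        List<String> errors = new ArrayList<>();
        try {
            XMLReader parser = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            TotalAssertionFilter filter = new TotalAssertionFilter();
            filter.setParent(parser);
            filter.setErrorHandler(new DefaultHandler() {
                @Override public void error(SAXParseException e) {
                    errors.add(e.getMessage());
                }
            });
            filter.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return errors;
    }
}
```

Because the filter sits between the parser and the application, the application's own ContentHandler needs no changes: it can be registered on the filter with setContentHandler and receives the same events as before.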

The DTD is pretty good at structural validation. So is XML Schema, which adds extensions (such as typing, local elements, and inheritance) that have proven contentious with the developer community. I, for one, was perfectly happy with the DTD's feature set in this respect.

XML Schema also adds rich data typing support. You can validate against the data types found in modern databases and programming languages. Furthermore, you can derive your own types through facets. In a nutshell, facets further restrict a simple type, for example by specifying the maximum length of a string or the upper and lower limits of a number.
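To make facets concrete, here is a small sketch in Java. The element names and the limits (a description of at most 40 characters, a quantity between 1 and 9999) are assumptions chosen for the example, and the javax.xml.validation API it relies on is part of JAXP 1.3, which is newer than the JAXP 1.2 parsers mentioned in this article.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

/** Sketch: facets restrict a string length and a numeric range (limits are hypothetical). */
final class FacetDemo {
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='line'><xs:complexType><xs:sequence>"
      // maxLength facet: description holds at most 40 characters
      + "<xs:element name='description'><xs:simpleType>"
      + "<xs:restriction base='xs:string'><xs:maxLength value='40'/></xs:restriction>"
      + "</xs:simpleType></xs:element>"
      // minInclusive and maxInclusive facets: quantity is a whole number from 1 to 9999
      + "<xs:element name='quantity'><xs:simpleType>"
      + "<xs:restriction base='xs:integer'>"
      + "<xs:minInclusive value='1'/><xs:maxInclusive value='9999'/>"
      + "</xs:restriction></xs:simpleType></xs:element>"
      + "</xs:sequence></xs:complexType></xs:element>"
      + "</xs:schema>";

    /** Returns true when the order line satisfies both structure and facets. */
    static boolean isValid(String xml) {
        try {
            Validator validator = SchemaFactory
                .newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                .newSchema(new StreamSource(new StringReader(XSD)))
                .newValidator();
            validator.validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // the default behavior is to throw on the first error
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

A line with a quantity of 3 passes, while quantities of 0 or 2.5 are rejected by the typing layer before any application code sees them.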

The last level, assertion, is typically implemented through Java code or through a dedicated assertion language such as Schematron. Unfortunately, because XML Schema 1.0 does not support assertions, some applications don't include them in their validation strategy. The result? Applications print out bizarre error messages, freeze, or crash when they receive an incorrect file.

To build strong error handling, you need a layered approach:

  • At the lowest level, the parser checks for syntax conformance.
  • On the next layer sits a schema that validates the structure and typing.
  • A layer of custom Java code (or a Schematron schema) performs the next level of validation.
  • Optionally, the Java object that you load the data into can perform a last level of validation.

Each layer is more specific than the previous one, making it easier to share validation across several applications. Validation is also more maintainable when you clearly specify the responsibility of each piece of code.
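The optional last layer, in which the Java object checks what it is given, might look like the following sketch. The OrderLine class, its fields, and its rules are hypothetical and simply mirror the purchase order example used earlier.

```java
import java.math.BigDecimal;

/** Hypothetical order line that refuses to hold inconsistent data. */
final class OrderLine {
    final String productNumber;
    final int quantity;
    final BigDecimal price;

    OrderLine(String productNumber, int quantity, BigDecimal price) {
        // Last line of defense: these rules hold no matter where the data came from.
        if (productNumber == null || productNumber.trim().isEmpty()) {
            throw new IllegalArgumentException("product number is required");
        }
        if (quantity < 1) {
            throw new IllegalArgumentException("quantity must be at least 1");
        }
        if (price == null || price.signum() < 0) {
            throw new IllegalArgumentException("price must not be negative");
        }
        this.productNumber = productNumber;
        this.quantity = quantity;
        this.price = price;
    }

    /** Line amount, used by higher-level checks such as the order total. */
    BigDecimal amount() {
        return price.multiply(BigDecimal.valueOf(quantity));
    }
}
```

Some of these checks duplicate what the schema already enforces. That redundancy is deliberate in a layered design: the object stays safe even when it is built from a source that never went through the schema.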

Implementation considerations

ErrorHandler

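If you use SAX, it is not enough to turn validation on. Your application must register an ErrorHandler to receive error messages. Validation errors are reported as recoverable errors, and without an ErrorHandler only fatal errors stop the parser, so validation errors can pass without your application ever noticing.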
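The difference is easy to demonstrate. The sketch below validates a deliberately invalid document against a DTD in its internal subset, once with a handler that records errors and once with a plain DefaultHandler, whose error method does nothing. The document, class, and method names are made up for the example, and what a parser does when no handler at all is registered varies between implementations.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

/** Sketch: validation errors only surface when an ErrorHandler records them. */
final class ErrorHandlerDemo {
    // Invalid on purpose: the DTD requires at least one line element.
    static final String INVALID =
        "<?xml version='1.0'?>"
      + "<!DOCTYPE order [<!ELEMENT order (line+)><!ELEMENT line (#PCDATA)>]>"
      + "<order/>";

    /** Handler that records validation errors instead of ignoring them. */
    static final class CollectingHandler extends DefaultHandler {
        final List<String> errors = new ArrayList<>();
        @Override public void error(SAXParseException e) {
            errors.add("line " + e.getLineNumber() + ": " + e.getMessage());
        }
    }

    /** Parses with DTD validation on and returns what the handler collected. */
    static List<String> validate(String xml) {
        CollectingHandler handler = new CollectingHandler();
        parse(xml, handler);
        return handler.errors;
    }

    /** Same parse with a do-nothing handler: completes normally even for INVALID. */
    static boolean parsesQuietly(String xml) {
        parse(xml, new DefaultHandler());
        return true;
    }

    private static void parse(String xml, DefaultHandler handler) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setValidating(true); // turns DTD validation on
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Both calls parse the same invalid document to the end without throwing. Only the first gives the application something to act on.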

It seems obvious that validation should not depend on the XML file being correct. If the file were correct, why would you bother validating it? Yet many applications validate files only if they are at least partly correct. Applications that rely on the file itself to reference the schema (through a DOCTYPE statement or through a schemaLocation attribute) are at risk. If the file points to an incorrect schema, the parser will validate against the incorrect schema. It might not report errors even though the file is incorrect.

With DTDs, the document must include a DOCTYPE statement. So to be safe, the application must either insert its own DOCTYPE statement before parsing or, at a minimum, implement SAX's LexicalHandler interface and check that the referenced DTD is the correct one.
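The LexicalHandler check can be sketched as follows. The system identifier used in the usage note is a hypothetical one, and so are the class and method names. The sketch extends DefaultHandler2, a SAX extension class that already implements LexicalHandler with empty methods, and it installs an entity resolver that returns an empty DTD so that nothing is fetched during the check; a real application would more likely return its own trusted copy of the DTD at that point.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

/** Sketch: verify that a document's DOCTYPE names the DTD the application expects. */
final class DoctypeCheck extends DefaultHandler2 {
    private String declaredSystemId; // stays null when there is no DOCTYPE

    @Override
    public void startDTD(String name, String publicId, String systemId) {
        declaredSystemId = systemId; // the identifier exactly as written in the document
    }

    @Override
    public InputSource resolveEntity(String name, String publicId,
                                     String baseURI, String systemId) {
        // Assumption for this sketch: substitute an empty DTD so no network access occurs.
        return new InputSource(new StringReader(""));
    }

    /** Returns true only if the DOCTYPE declares exactly the expected system identifier. */
    static boolean declares(String xml, String expectedSystemId) {
        DoctypeCheck handler = new DoctypeCheck();
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            // LexicalHandler is an optional extension, registered through a property.
            reader.setProperty("http://xml.org/sax/properties/lexical-handler", handler);
            reader.setEntityResolver(handler);
            reader.setContentHandler(handler);
            reader.setErrorHandler(handler);
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return expectedSystemId.equals(handler.declaredSystemId);
    }
}
```

For example, DoctypeCheck.declares(xml, "http://example.com/dtd/order.dtd") returns false both for a document that references another DTD and for one with no DOCTYPE at all, which lets the application reject the file before it relies on the validation result.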

XML Schema offers a better solution: The application, not the document, tells the parser which schema to load. With a JAXP 1.2 parser, you set the http://java.sun.com/xml/jaxp/properties/schemaSource property (together with its companion, http://java.sun.com/xml/jaxp/properties/schemaLanguage) in the Java code before reading the document (see Resources for an article with sample code). Because the schema no longer comes from the document, the parser always validates against the schema you chose.

A word of warning: JAXP properties work like XML namespaces. The URL is an identifier; it does not point to a Web page. Just make sure you copy the URL exactly as shown.
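As a rough sketch of that configuration, the code below sets both JAXP properties on a SAXParser and validates a document against a schema supplied by the application. The one-element schema and the class name are invented for the example. Note that the schemaLanguage property is set before schemaSource, and that the factory must be both namespace-aware and validating for the properties to take effect.

```java
import java.io.ByteArrayInputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

/** Sketch: the application, not the document, chooses the schema. */
final class SchemaSourceDemo {
    static final String JAXP_SCHEMA_LANGUAGE =
        "http://java.sun.com/xml/jaxp/properties/schemaLanguage";
    static final String JAXP_SCHEMA_SOURCE =
        "http://java.sun.com/xml/jaxp/properties/schemaSource";

    // Hypothetical schema: a single quantity element holding a positive integer.
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='quantity' type='xs:positiveInteger'/>"
      + "</xs:schema>";

    /** Validates xml against XSD and returns the reported validation errors. */
    static List<String> validate(String xml) {
        List<String> errors = new ArrayList<>();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.setValidating(true);
            SAXParser parser = factory.newSAXParser();
            // The language must be set first; the source is rejected otherwise.
            parser.setProperty(JAXP_SCHEMA_LANGUAGE, XMLConstants.W3C_XML_SCHEMA_NS_URI);
            parser.setProperty(JAXP_SCHEMA_SOURCE,
                new ByteArrayInputStream(XSD.getBytes(StandardCharsets.UTF_8)));
            parser.parse(new InputSource(new StringReader(xml)), new DefaultHandler() {
                @Override public void error(SAXParseException e) {
                    errors.add(e.getMessage());
                }
            });
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return errors;
    }
}
```

The schema source can also be a File, an InputSource, or a string holding a URI. An in-memory stream is used here only to keep the example self-contained.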

What about the schemaLocation attribute that might appear in the XML document? If you read the XML Schema specification, you'll see that this attribute was never intended as a general mechanism for loading schemas but only as a hint. In practice, it is useful when you test and debug (before you've had a chance to configure the parser fully), but not for production data.


Strong error handling

Strong error handling translates into more reliable applications. When you load an XML document, it is best to catch errors early, before they have a chance to pollute other data. Schema languages help in that respect, but for maximum reliability you want to adopt a layered approach and not limit yourself to structural validations. You should validate using data types and assertions as well.


Resources

