This is the third article in a four-part series discussing common XML pitfalls and, more importantly, ways around them. As a consultant and trainer, I have noticed that many companies and developers make the same mistakes when they adopt XML technology. This series is an attempt to document some of those problems and spare you the annoyance of dealing with them.
Part 1 looked at common misunderstandings with the XML standard itself (such as encodings and namespaces). Part 2 focused on design issues: how and where to introduce XML support in an application. One of the guiding principles of Part 2 is to treat XML files as an interface between applications and to apply the time-tested design techniques used with JavaBeans and other interfaces: separation of tasks, documentation, built-in evolution, and more.
This article looks at validation, error handling, and schemas. Schemas address the design, documentation, and validation of an interface; I'll focus primarily on validation.
First, some vocabulary definitions. The W3C XML Schema Recommendation is one schema language for XML. Others include Document Type Definitions (DTDs), RELAX NG, and Schematron (see Resources). In this article, Schema (capital S) indicates the W3C XML Schema Recommendation. Lowercased, schema refers to the general concept of a schema language.
In 1998, when XML emerged as a W3C recommendation, DTDs were something of a novelty. They'd been around since SGML was adopted by the International Organization for Standardization (ISO) in 1986, but few other file formats offered a validation mechanism, and none of those that did proved as popular as XML. Still, the underlying concept isn't new; it derives directly from database schemas. Essentially a DTD describes the vocabulary's structure (the tags and attributes), similar to the way a database schema describes a database's structure (the tables and columns). The W3C later released XML Schema as a more powerful schema language.
Experience with database design shows that schemas are most useful as a safeguard against programming errors. Storing incomplete or incorrect data is less likely when you use a schema because the data must conform to clear rules. The usefulness of schemas increases with the number of applications accessing the information. The more applications that access and modify the data, the greater the need for the structure and guidance that a schema provides.
However, database developers have long known that a schema does not replace error handling. Database validation is the last chance to catch errors, but a good application has already validated the data at that point (and error messages from the database engine are anything but readable).
By and large, this experience holds true with schemas for XML. The more applications that use a given vocabulary, the more you need a schema to define a common framework. And although schemas provide some error handling, it is useful to complement them in the application.
The DTD was the original schema language for XML. A direct transplant from SGML, the DTD is too limited for many applications. Still, because it is the oldest schema language, it is the most widely available. It always pays to keep a DTD around, because some of the older products in your toolkit might not work with more recent alternatives.
XML Schemas are more powerful and more modern. Among the differences between DTDs and XML Schemas (see Resources for a couple of articles that compare the two), three important ones matter in my practice:
- XML namespaces
- DOCTYPE declarations
- Rich data typing
Lack of support for namespaces is the single most glaring deficiency of DTDs. (You can emulate namespaces partially through parameter entities, but it is complex and not totally satisfactory.) XML Schemas make namespaces a first-class citizen for XML. I discuss DOCTYPE statements and data typing in the next section, Schema for validation.
XML Schema is a complex recommendation. This complexity has led to the development of alternate schema languages that emphasize simplicity. The best-known alternative is RELAX NG, which the ISO is standardizing. Although technically interesting, those efforts lack the W3C's support, which translates into less support from tool vendors. My customers have shown little interest in these alternatives, and vendors don't offer much support for them, so I don't use them often.
As I mentioned earlier, schemas have essentially three applications: design, documentation, and validation. Simple vocabularies are often designed by carefully crafting a corresponding schema. With proper annotations, the schema serves as documentation for developers. Designing the schema directly works well for tiny vocabularies. Anything larger is best served by a real modeling language such as UML (see Resources for a previous Working XML series that discusses UML). In this article I'll concentrate on the use of schemas for validation.
I have noticed three common errors in the use of schemas:
- Making them too stringent (a mistake I tend to make; I always have to double-check myself)
- Failing to design proper error handling
- Implementing validation unreliably
How stringent should a schema be? This is the first question to ask when you develop a new schema. Designers tend to be strict in an attempt to prevent errors; catching errors early through proper validation of files helps build more stable systems. Yet experience (including experience with database designs) shows that you need to balance strictness with what, for lack of a better word, I call clarity.
The dilemma boils down to this: Up to a certain point, the schema must match the expectations and understanding of the developers and other users of the vocabulary. In other words, the vocabulary should be easy to work with. Yet to design a schema that prevents as many errors as possible, you need to organize the vocabulary around the error checking, sometimes using advanced features such as inheritance. The result can be a complex schema that's hard to read and harder to implement correctly. Many users will find this too strict a framework. It can also be difficult to maintain, because adding or removing validation can require that you change the vocabulary.
Look at the common example of designing a vocabulary for international purchase orders. You need tags to record addresses, and you want those addresses to be correct because goods will be shipped to them. But how far is too far when it comes to validating the address? Take the state element, for example. States are required for U.S. addresses but do not exist (and have no equivalent) in most other countries. So you'd like to make the state tag mandatory (minOccurs="1" in XML Schema), but you can't because it won't work for most countries. One option is to have strict validation by introducing specialized address elements by country: U.S. addresses include a state; no other addresses do. This may sound attractive at first, but with 193 countries in the world, the schema would quickly become bloated.
This example, in which tags are introduced only to strengthen validation, illustrates a lack of clarity. The purpose of those country-specific tags, which carry a fair amount of redundant information, will not be obvious to schema readers. Ironically, this can lead to errors. So what do you validate for? Validate as much as makes sense, but refrain from bending the data structure to push for more validation.
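One way to keep the vocabulary simple is to leave the state tag optional in the schema and enforce the country-specific rule in application code. The following is only a minimal sketch of that idea: the Address class, its field names, and the use of two-letter country codes are assumptions made for illustration, not part of any real vocabulary.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical address holder; class and field names are illustrative assumptions. */
final class Address {
    final String country; // assumed to be a two-letter country code such as "US" or "BE"
    final String state;   // null when the document carries no state element

    Address(String country, String state) {
        this.country = country;
        this.state = state;
    }

    /** Returns readable problems; an empty list means the address passed the check. */
    List<String> validate() {
        List<String> problems = new ArrayList<>();
        if (country == null || country.isEmpty()) {
            problems.add("country is required");
        } else if ("US".equals(country) && (state == null || state.isEmpty())) {
            // The rule lives in code, so the schema can keep state optional.
            problems.add("state is required for U.S. addresses");
        }
        return problems;
    }
}
```

The schema stays readable (a single address element with an optional state), and the rule can change without touching the vocabulary.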
Validation and error handling are not a monolith. Your application can validate at different levels:
- Structure: The schema specifies a structure that controls which tags will appear where, how often they repeat, and more -- for example, a purchase order line consists of a product number, description, quantity, and price.
- Data typing: You can use data typing to control tag content -- for example, the quantity field in a purchase order is a non-negative integer (because the customer can't order less than a whole unit of the product).
- Assertion: Use assertions to check relationships between fields -- for example, the purchase order's total is the sum of the order lines.
DTD is pretty good at structural validation. So is XML Schema. XML Schema offers extensions (such as typing, local elements, and inheritance) that have proven contentious with the developer community. I for one was perfectly happy with the DTD's feature set in this respect.
XML Schema also adds rich data typing support. You can validate against the data types found in modern databases and programming languages. Furthermore, you can derive your own types through facets. In a nutshell, facets further restrict a simple type by specifying the maximum length of a string or the upper and lower limits of a number.
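To make facets concrete, here is a small sketch that validates a quantity element against a type restricted by a facet. It assumes the javax.xml.validation API (JAXP 1.3, bundled with Java 5 and later) is available; the element name, the cap of 1000, and the class name are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

final class QuantityFacetDemo {
    // Hypothetical schema: quantity is a non-negative integer, capped at 1000 by a facet.
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='quantity'>"
      + "<xs:simpleType>"
      + "<xs:restriction base='xs:nonNegativeInteger'>"
      + "<xs:maxInclusive value='1000'/>"
      + "</xs:restriction>"
      + "</xs:simpleType>"
      + "</xs:element>"
      + "</xs:schema>";

    static boolean isValid(String xml) {
        Schema schema;
        try {
            schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                    .newSchema(new StreamSource(new StringReader(XSD)));
        } catch (SAXException e) {
            // A schema that does not compile is a programming error, not a data error.
            throw new IllegalStateException("schema does not compile", e);
        }
        try {
            // With no ErrorHandler set, the Validator throws on the first validation error.
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

With this schema, a quantity of 5 passes, while -1, 2.5, and 1001 are rejected: the first two by the base type, the last by the facet.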
The last level, assertion, is typically implemented through Java code or through a dedicated assertion language such as Schematron. Unfortunately, because XML Schema 1.0 has no assertion mechanism, some applications leave assertions out of their validation strategy. The result? Applications print out bizarre error messages, freeze, or crash when they receive an incorrect file.
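An assertion written in Java can be very short. The sketch below checks the purchase order rule mentioned earlier: the declared total must equal the sum of the order lines. The Line class and the assumption that each line carries a quantity and a unit price are mine, chosen only to illustrate the check.

```java
import java.math.BigDecimal;
import java.util.List;

final class OrderAssertions {
    /** Hypothetical order line: a quantity and a unit price. */
    static final class Line {
        final int quantity;
        final BigDecimal unitPrice;

        Line(int quantity, BigDecimal unitPrice) {
            this.quantity = quantity;
            this.unitPrice = unitPrice;
        }

        BigDecimal amount() {
            return unitPrice.multiply(BigDecimal.valueOf(quantity));
        }
    }

    /** True when the declared total equals the sum of the line amounts. */
    static boolean totalMatchesLines(BigDecimal declaredTotal, List<Line> lines) {
        BigDecimal sum = BigDecimal.ZERO;
        for (Line line : lines) {
            sum = sum.add(line.amount());
        }
        // compareTo ignores scale, so 25.50 and 25.5 count as equal.
        return sum.compareTo(declaredTotal) == 0;
    }
}
```

I use BigDecimal rather than double here so that monetary sums are not subject to floating-point rounding.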
To build strong error handling, you need a layered approach:
- At the lowest level, the parser checks for syntax conformance.
- On the next layer sits a schema that validates the structure and typing.
- A layer of custom Java code (or a Schematron schema) performs the next level of validation.
- Optionally, the Java object that you load the data into can perform a last level of validation.
Each layer is more specific than the previous one, making it easier to share validation across several applications. Validation is also more maintainable when you clearly specify the responsibility of each piece of code.
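The first three layers can be chained so that the application knows which layer rejected a document. What follows is a rough sketch under stated assumptions: it relies on the javax.xml.validation API (JAXP 1.3), and the LayeredValidation class, its method names, and the layer labels are hypothetical. The application supplies the assertion layer as a Predicate.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.function.Predicate;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class LayeredValidation {
    /** Compiles a schema held in a string; a broken schema is treated as a programming error. */
    static Schema compile(String xsd) {
        try {
            return SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                    .newSchema(new StreamSource(new StringReader(xsd)));
        } catch (SAXException e) {
            throw new IllegalStateException("schema does not compile", e);
        }
    }

    /** Returns "ok", or the label of the first layer that rejected the document. */
    static String check(String xml, Schema schema, Predicate<Document> assertions) {
        Document doc;
        try {
            // Layer 1: the parser checks syntax (well-formedness).
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            // DefaultHandler throws fatal errors without printing them to stderr first.
            builder.setErrorHandler(new DefaultHandler());
            doc = builder.parse(new InputSource(new StringReader(xml)));
        } catch (SAXException e) {
            return "syntax";
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
        try {
            // Layer 2: the schema checks structure and data types.
            schema.newValidator().validate(new DOMSource(doc));
        } catch (SAXException e) {
            return "schema";
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        // Layer 3: application code checks relationships between fields.
        return assertions.test(doc) ? "ok" : "assertion";
    }
}
```

Because each layer runs only when the previous one has passed, the assertion code can assume a well-formed, schema-valid document and stay free of defensive checks.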
It seems obvious that validation should not depend on the XML file being correct. If the file were correct, why would you bother validating it? Yet many applications validate files only if they are at least partly correct. Applications that rely on the file itself to reference the schema (through a DOCTYPE statement or through a schemaLocation attribute) are at risk. If the file points to an incorrect schema, the parser will validate against the incorrect schema. It might not report errors even though the file is incorrect.
With DTDs, the document must include a DOCTYPE statement. So to be safe, the application must either insert its own DOCTYPE statement before parsing or, at a minimum, implement SAX's LexicalHandler interface and check that the referenced DTD is the correct one.
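The LexicalHandler check might look like the sketch below. It extends SAX's DefaultHandler2, which implements LexicalHandler, and rejects any document whose DOCTYPE does not reference the expected DTD (or that has no DOCTYPE at all). The class name and the system identifier are hypothetical, and the entity resolver that returns an empty DTD is only a stand-in so that the sketch never touches the network; a real application would return its own trusted copy of the DTD there.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

final class DoctypeGuard extends DefaultHandler2 {
    // Hypothetical system identifier of the DTD the application trusts.
    static final String EXPECTED_SYSTEM_ID = "http://example.com/dtd/order.dtd";

    private boolean sawDoctype;

    @Override
    public void startDTD(String name, String publicId, String systemId) throws SAXException {
        sawDoctype = true;
        if (!EXPECTED_SYSTEM_ID.equals(systemId)) {
            throw new SAXException("unexpected DTD: " + systemId);
        }
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        // The DOCTYPE, if any, always precedes the root element.
        if (!sawDoctype) {
            throw new SAXException("document has no DOCTYPE declaration");
        }
    }

    /** True when the document parses and its DOCTYPE references the expected DTD. */
    static boolean referencesExpectedDtd(String xml) {
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            DoctypeGuard guard = new DoctypeGuard();
            reader.setContentHandler(guard);
            reader.setErrorHandler(guard);
            // LexicalHandler is registered through a SAX property, not a setter.
            reader.setProperty("http://xml.org/sax/properties/lexical-handler", guard);
            // Stand-in resolver: supplies an empty DTD instead of fetching the real one.
            reader.setEntityResolver((publicId, systemId) -> new InputSource(new StringReader("")));
            reader.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }
}
```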
XML Schema offers a better solution: Tell the parser which schema to load for the document namespace. With a JAXP 1.2 parser, you configure the schema by setting the http://java.sun.com/xml/jaxp/properties/schemaSource property (after selecting W3C XML Schema through the companion http://java.sun.com/xml/jaxp/properties/schemaLanguage property) in the Java code before reading the document (see Resources for an article with sample code). Thanks to this property, the parser will always load the correct schema.
A word of warning: JAXP properties work like XML namespaces. The URL is an identifier; it does not point to a Web page. Just make sure you copy the URL exactly as shown.
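As an illustration, the following sketch configures a DOM parser with the JAXP 1.2 properties. The two property URLs and the W3C XML Schema namespace are the real identifiers; the schema, the urn:example:order namespace, and the class name are hypothetical. Note that the sketch turns validation on and sets schemaLanguage before schemaSource, and that it installs an error handler, because by default a validating parser merely reports validation errors and keeps going.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

final class SchemaSourceDemo {
    static final String SCHEMA_LANGUAGE = "http://java.sun.com/xml/jaxp/properties/schemaLanguage";
    static final String SCHEMA_SOURCE = "http://java.sun.com/xml/jaxp/properties/schemaSource";
    static final String W3C_XML_SCHEMA = "http://www.w3.org/2001/XMLSchema";

    // Hypothetical schema for the (equally hypothetical) urn:example:order namespace.
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'"
      + " targetNamespace='urn:example:order' elementFormDefault='qualified'>"
      + "<xs:element name='order'><xs:complexType><xs:sequence>"
      + "<xs:element name='quantity' type='xs:nonNegativeInteger'/>"
      + "</xs:sequence></xs:complexType></xs:element>"
      + "</xs:schema>";

    static boolean isValid(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.setValidating(true);
            factory.setAttribute(SCHEMA_LANGUAGE, W3C_XML_SCHEMA);
            // The application, not the document, decides which schema is used.
            factory.setAttribute(SCHEMA_SOURCE,
                    new ByteArrayInputStream(XSD.getBytes(StandardCharsets.UTF_8)));
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new DefaultHandler() {
                @Override
                public void error(SAXParseException e) throws SAXException {
                    throw e; // treat validation errors as failures
                }
            });
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

In production code the schema would more likely come from a File or a URI string, both of which the property also accepts.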
What about the schemaLocation attribute that might appear in the XML document? If you read the XML Schema specification, you'll see that this attribute was never intended as a general mechanism for loading schemas but only as a hint. In practice, it is useful when you test and debug (before you've had a chance to configure the parser fully), but not for production data.
Strong error handling translates into more reliable applications. When you load an XML document, it is best to catch errors early, before they have a chance to pollute other data. Schema languages help in that respect, but for maximum reliability you want to adopt a layered approach and not limit yourself to structural validations. You should validate using data types and assertions as well.
Learn
- "Working XML: Safe coding practices" (developerWorks, July and August 2005): Read the previous articles in this series.
- "UML, XMI, and code generation, Part 1" (developerWorks, March 2004): Learn why UML is better than a schema language for XML design and how to implement it. See also Part 2 (May 2004), Part 3 (June 2004), and Part 4 (August 2004).
- "Validating XML" (developerWorks, August 2003): Get started with DTDs and XML Schemas in no time.
- "A hands-on introduction to Schematron" (developerWorks, September 2004): Find out how to use this assertion validation language.
- "Understanding RELAX NG" (developerWorks, December 2003): Check out this tutorial if you want to get up to speed on RELAX NG.
- "Tell a parser where to find a schema" (developerWorks, May 2003): This tip includes sample listings for schema validation with JAXP 1.2.
- "Comparing W3C XML Schemas and Document Type Definitions (DTDs)" (developerWorks, March 2001): David Mertz is skeptical that Schemas will replace DTDs, though he believes that XML Schemas are an invaluable tool in a developer's arsenal.
- "Why XML Schema beats DTDs hands-down for data" (developerWorks, June 2001): Kevin Williams tells why he's sold on XML Schema for the structural definition of XML documents for data.
- "XML style guidelines for leveraging schema validators" (developerWorks, November 2003): This article discusses proper XML structure as well as best and worst practices for defining data validation rules in XML Schema.
- developerWorks XML zone: Learn more about XML here. You'll find technical documentation, how-to articles, education, downloads, product information, and more.
- IBM XML certification: Find out how you can become an IBM Certified Developer in XML and related technologies.
Get products and technologies
- XML Schema Quality Checker: This XML Schema verification tool is available as a free trial download through IBM alphaWorks.
Benoît Marchal is a Belgian consultant. He is the author of XML by Example, Second Edition and other XML books. You can contact him at bmarchal@pineapplesoft.com or through his personal site at www.marchal.com.