
XML style guidelines for leveraging schema validators

Cut development time by using XML Schema validation for basic data validation tasks

Erik Ostermueller, Senior Architect, Fidelity Information Services
Erik Ostermueller has been a lead software developer and consultant for more than 10 years, with experience in both the U.S. and Europe. Currently employed at Fidelity Information Services, Erik has spoken on XML at two different conferences. He actively contributes to a few different Java open-source projects and focuses on XML Schema, automated testing, Unicode, and usability issues. You can reach him at eostermueller@yahoo.com.

Summary:  Used correctly, XML Schema validation can dramatically reduce the effort necessary to perform basic data validation tasks. Additionally, validation rules that are centrally located in an XML schema can help users to better understand your system. It takes the right XML structure, however, to leverage a schema validator. This article discusses proper XML structure as well as best and worst practices for defining data validation rules in XML Schema.

Date:  11 Nov 2003
Level:  Intermediate

How do you keep invalid data from getting into your system? Should you hand-code validation routines that perform bounds checking? For the XML entry points into your system, schema validators can save you an incredible amount of time in this area. This goes for DTD validators as well as those for XML Schema.

DTD validators provide basic structural validation. When appropriate, they will bark out error messages like "This required element is missing" and "That element cannot be reoccurring." However, DTDs lack support for describing data types, reusable structures, and namespaces. XML Schema validators are spun from a silkier thread and address these gaps.

Instead of hand-coding the data validation, consider defining XML Schema rules to achieve the same end. At runtime, a schema validator will read the rules and raise red flags when the incoming data violates the rules.

This approach offers a number of advantages. XML schema rules are programming language-independent. Their syntax is data-driven and governed by W3C standards. This means you can encode the rules a lot faster than writing the validation code by hand. Further, it is easier to standardize the look and feel of your system’s validation across different components and different developers.

Additionally, XML schema validation can vastly improve your client’s understanding of your XML API. Hand-coded data validation is not very accessible to your API’s consumers. The validation code is scattered across multiple source files and is often under-documented. If, on the other hand, you have a good XML schema that fully describes this data validation, your clients can better visualize the data validation rules and better understand your API. Each new rule in the schema is one more thing that your client will know about without having to dig through your source code.

Unfortunately, some ways of structuring your XML can prevent you from taking full advantage of schema validators. The following will help you understand and avoid these idioms, which are obstacles in both DTDs and XML Schema. After that, I will discuss best and worst practices for actually defining the validation rules -- these are for XML Schema only, and not DTDs. Finally, I will briefly address some of the limitations of currently available schema validators.

XML grammar or XML message set?

You should be wary of XML documents that look like this:

<ele>
 <DataElement variable="CustomerName" value="Smith"/>
 <DataElement variable="AccountBalance" value="100.00"/>
 <DataElement variable="TransDate" value="12/22/1996"/>
</ele>

The frustrating truth is that this XML is perfectly valid, but its style will unnecessarily require a lot of hand-coded data validation. An example better illustrates the problem -- consider this version with a bug in it:

<ele>
 <DataElement variable="CustomerName" value="Smith"/>
 <DataElement variable="AccountBalance" value="100.00"/>
 <DataElement variable="TansDate" value="12/22/1996"/> <!-- TYPO! -->
</ele>

How can you keep this bug from wriggling into your system? You would have to hand code the validation to root out this bug -- not to mention invalid dates and invalid numerics. The data entering almost any system requires a lot of validation. This XML, however, forces you to validate not only the data but also the names of the variables, so programmers need to remember precisely the right spelling of all those easy-to-remember variable names in your systems. Dealing with deprecated XML elements and your programmers’ typos is not terribly easy, either. Now, I'll create one of these from scratch:

<ele>
 <DataElement variable="" value=""/>
 <DataElement variable="" value=""/>
 <DataElement variable="" value=""/>
 <DataElement variable="" value=""/>
</ele>

This blank canvas can be very intimidating. What goes between all those quotes? Does this XML describe a financial transaction or scuba gear? What is the right format for a date? This XML code has lots of unknowns -- unless, of course, you don't mind combing the code that processes the data. How much time will your staff waste answering these questions?
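To see why a validator cannot help much here, consider a minimal sketch of a schema for this generic style. The article does not show a schema for <ele>, so the declarations below are assumptions made for illustration only.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical schema for the generic name-value style. -->
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="ele">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element name="DataElement" minOccurs="0" maxOccurs="unbounded">
          <xsd:complexType>
            <!-- Any string is accepted as a variable name, so "TansDate" passes. -->
            <xsd:attribute name="variable" type="xsd:string" use="required"/>
            <!-- The value cannot be typed per variable; it is only a string. -->
            <xsd:attribute name="value" type="xsd:string" use="required"/>
          </xsd:complexType>
        </xsd:element>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>
```

About all this schema can enforce is that each DataElement carries both attributes. The misspelled variable name, a malformed date, and a non-numeric balance all pass validation.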

On the brighter side, this style does have one thing going for it: If you want to add or change the variables used (perhaps adding a new variable like IsCustomerActive), you do not have to take the time to change the schema that governs the structure of the document; you can just add another tag to your XML with the new variable name and off you go.

This flexibility, however, comes at a big cost -- your system is more vulnerable to bad data. This refactored version of the XML addresses these issues.

<AccountInquiry>
 <CustomerName>Smith</CustomerName>
 <AccountBalance>100.00</AccountBalance>
 <TransDate>12/22/1996</TransDate>
</AccountInquiry>

Alas, this message includes no scuba gear. You can now designate a data type (with a well-defined format) for each tag. The schema validator can now take responsibility for throwing an exception for the bug mentioned earlier. Further, the validator will shoulder the burden of throwing errors for invalid dates and numerics. Lastly, your clients will thank you for providing a structure that fully describes itself using an industry standard.
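A schema for the refactored message might look like the following sketch. The choice of xsd:decimal and xsd:date is an assumption; the article does not say which types it has in mind for these elements.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: the data types chosen here are assumptions. -->
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="AccountInquiry">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element name="CustomerName" type="xsd:string"/>
        <!-- Rejects non-numeric balances such as "abc". -->
        <xsd:element name="AccountBalance" type="xsd:decimal"/>
        <!-- xsd:date expects the form 1996-12-22, not 12/22/1996. -->
        <xsd:element name="TransDate" type="xsd:date"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>
```

With this schema, a misspelled <TansDate> element is rejected because no such element is declared. Note that the sample value 12/22/1996 would also be rejected by xsd:date, so the instance would have to use 1996-12-22 instead.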

With enough creativity, the <DataElement>, variable="", and value="" could define just about any data structure. That was part of the initial problem: The goal was to have a very specific business transaction, but I ended up with the generic <DataElement> construct. This distinction between generic XML structures and specific ones is important. I'd like to offer up a distinction between two different styles of XML structures:

  • Generic XML structures that suit a variety of applications are XML grammars.
  • Specific XML documents that are geared toward a single purpose are members of an XML message set.

XML grammars and XML message sets are very different beasts. Every schema describes more of one than the other. Schemas in general place restrictions on instance documents. Schemas for XML grammars, however, leave a lot of room for creativity: They are low-level, wide-open schemas for which you'll find an almost unlimited number of valid instance documents. Valid XML for an XML grammar can vary greatly in size and shape while still adhering to the schema. MathML and UIML are two good examples (see Resources) among hundreds of others. MathML lets you model mathematical equations -- imagine all the different equations you could create. UIML is similar -- think of all the possible UI applications you could model.

Schemas for message sets, on the other hand, leave little room for creativity. Imagine all the different account inquiries you could create with the four tags in the above transaction -- only a tax collector could find creativity in this. Message sets are designed for tight validation. Valid instance documents for the same message look strikingly similar to each other. Consider a valid message from the Interactive Financial Exchange (IFX) message set, which models financial transactions. Valid customer update messages largely contain the same set of tags. Another example of a message set is the Open Travel Alliance (OTA). (See Resources for more on IFX and OTA.)

The moral of the story is this: Do not use an XML grammar to describe something more restrictive; use an XML message set. If you use the grammar, you could get stuck having to hand code a lot of data validation.


An exception to the rule?

All of that said, ambiguous data structures can be valuable. For instance, the schema for SOAP messages would be nothing without a spot for application-defined messages. It uses the XML Schema any wildcard like this:

<complexType name="Body">
  <!-- This is a placeholder for your message payload.-->
  <!-- Use a WSDL file to define the structure.       -->
 <any minOccurs="0" maxOccurs="unbounded" /> 
 <anyAttribute/> 
</complexType>

This structure serves as an instruction to the community. It trumpets a message: "Hear ye, hear ye. Place your application-specific stuff here." On the other hand, it does not say, "Shoot yourself in the foot by avoiding schema validation." After all, SOAP provides an alternate facility for specifying the structure and validation rules of the above Body element: The WSDL file. Some components, like SOAP servers, must deal with someone else’s XML structure. However, don’t mistake this as an excuse for shelving schema validation. Follow this SOAP/WSDL pattern to enable flexibility without compromising schema validation. For a more thorough treatment of this issue, read Dare Obasanjo’s article on XML Schema flexibility (see Resources).
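One way to apply the same pattern in your own schema is sketched below. The Envelope and Payload names and the http://example.com/envelope namespace are hypothetical. The sketch relies on the processContents attribute of the any wildcard: with the value "strict", the validator must find a declaration for whatever element appears in the placeholder and validate it against that declaration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical envelope schema with a validated placeholder. -->
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            targetNamespace="http://example.com/envelope"
            elementFormDefault="qualified">
  <xsd:element name="Envelope">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element name="Payload">
          <xsd:complexType>
            <xsd:sequence>
              <!-- Accepts elements from any other namespace, but only
                   if a schema declaring them is available. -->
              <xsd:any namespace="##other" processContents="strict"
                       minOccurs="1" maxOccurs="unbounded"/>
            </xsd:sequence>
          </xsd:complexType>
        </xsd:element>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>
```

How the validator locates the schema for the payload depends on the validator and on how you supply schemas to it, so treat this as a starting point rather than a complete recipe.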


Beware of attributes and element text

Consider this variant of the reworked XML:

<Message msgType="AccountInquiry">
 <CustomerName>Smith</CustomerName>
 <AccountBalance>100.00</AccountBalance>
 <TransDate>12/22/1996</TransDate>
</Message>

What changed? Instead of <AccountInquiry> at the root, you now have <Message msgType="AccountInquiry">. This is a mistake if you want a schema validator to do the validation for you, and I'll show you why. The predicament is clearer when you model the next message in the same message set, ProductInquiry:

<Message msgType="ProductInquiry">
 <ProductName>DEPOSIT</ProductName>
 <BankName>Fred’s Bank</BankName>
 <TransDate>12/22/1996</TransDate>
</Message>

The <Message> element has two different meanings -- one for account and one for product. How confusing. First, I declare that <Message> must have account data, and account data only. Then I change my mind and declare that the same element, <Message>, must now have product data, and product data only. The schema for <Message> would have all account and product elements optional for all messages. If everything is optional, the validator will sit idle when, for instance, the product message is missing vital product data. You want the validator to sound the alarm when this data is missing so you will not have to hand code the validation yourself. All these problems began when I encoded the message type in an attribute, like this: msgType="ProductInquiry". You can land yourself in the same predicament by encoding that same information in element text, like this:

<MessageType>AccountInquiry</MessageType>
<Message>
 <CustomerName>Smith</CustomerName>
 <AccountBalance>100.00</AccountBalance>
 <TransDate>12/22/1996</TransDate>
</Message>

Again, this overloads the <Message> element. Doing so is not always bad, but in this case it has unwittingly kept the schema validator from helping out. To solve this dilemma, replace <Message> with <AccountInquiry> or <ProductInquiry>. Then, in the schema, assign the appropriate children to each element.
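A sketch of such a schema follows. The element names come from the examples above; the data types are assumptions.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch: one element per message type, each with required children. -->
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="AccountInquiry">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element name="CustomerName" type="xsd:string"/>
        <xsd:element name="AccountBalance" type="xsd:decimal"/>
        <!-- xsd:date expects the form 1996-12-22. -->
        <xsd:element name="TransDate" type="xsd:date"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
  <xsd:element name="ProductInquiry">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element name="ProductName" type="xsd:string"/>
        <xsd:element name="BankName" type="xsd:string"/>
        <xsd:element name="TransDate" type="xsd:date"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>
```

Because minOccurs defaults to 1, every child is required. A ProductInquiry without a ProductName now fails validation, and so does an AccountInquiry that contains product data.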

The bottom line is to be careful when using attributes and element text. They are worthless when you are trying to specify what is and is not required in your message. This practice also leads you down the dark path of overloading the meaning of a single element like <Message>.

Attributes aren’t entirely evil. Unfortunately, though, it is easier to tell you when not to use them. Just remember this: The main role of the schema validator is structure validation; attributes do not play a big part in this. A validator rarely uses the value of an attribute to determine whether another part of the XML is valid (the obscure exceptions are XML Schema’s identity constraints: key, keyref, and unique). This arrangement bumps attributes into a second-class role in schema validation. The same can be said for element text.

Although they are infrequently used, some techniques (that use other schema languages) work better with attributes and element text. (See Resources for relevant works by Jeni Tennison, Roger L. Costello, and Bob DuCharme.)

Keep in mind the differences between XML grammars and XML message sets. Knowing the difference will help you structure your XML to take advantage of schema validators. Further, remember that XML attributes have limited use in schema validation.

Thus far, I have discussed how to structure your XML to pave the way for using a schema validator. The rest of this article will focus on best practices for defining XML Schema validation rules. From here on in, I’ll be talking about XML Schema (note the capital "S" in "Schema").


Do not limit yourself to UML

Software developers have been modeling business messages for many years. Prior to XML, groups that I worked with modeled their business messages in UML (Unified Modeling Language). Something was always missing with UML, though. Most UML notation is little more than documentation. If you wanted to validate the constraints you had modeled, you had to code the validation yourself. What a pain to have to encode these constraints twice -- once for the notation and once for the code itself!

XML schemas fix this. You get runtime validation simply by adding the notation. The following validation rules are very easy to model in XML Schema and can reveal a lot about your business message:

  • The required or optional nature of a piece of data
  • Data length: For example, an account number must have a particular string length, a minimum string length, or a maximum string length.
  • Wild card masks: For example, an account number must have two alphas followed by any number of numerics.
  • Enumerations and numeric ranges: For example, the value of a particular string attribute must be A, C, or D. Another example: The value of a particular numeric must be between 1 and 500.
  • Multiplicity: You can designate the multiplicity of a relationship between two entities -- for instance, each basket-weaving class must have between 20 and 30 students. If you add this type of constraint at development time, the schema validator will reject the invalid data at runtime. The schema validator will fire an exception when the 31st student attempts to enter the class.
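The rules in the list above can be sketched in XML Schema as follows. The type and element names, and the fixed length of 10, are hypothetical; the constraint values otherwise follow the examples in the list.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <!-- Data length: exactly 10 characters (the value 10 is an assumption). -->
  <xsd:simpleType name="FixedLengthAccountNumber">
    <xsd:restriction base="xsd:string">
      <xsd:length value="10"/>
    </xsd:restriction>
  </xsd:simpleType>

  <!-- Mask: two letters followed by any number of digits. -->
  <xsd:simpleType name="MaskedAccountNumber">
    <xsd:restriction base="xsd:string">
      <xsd:pattern value="[A-Za-z]{2}[0-9]*"/>
    </xsd:restriction>
  </xsd:simpleType>

  <!-- Enumeration: the value must be A, C, or D. -->
  <xsd:simpleType name="StatusCode">
    <xsd:restriction base="xsd:string">
      <xsd:enumeration value="A"/>
      <xsd:enumeration value="C"/>
      <xsd:enumeration value="D"/>
    </xsd:restriction>
  </xsd:simpleType>

  <!-- Numeric range: between 1 and 500, inclusive. -->
  <xsd:simpleType name="Quantity">
    <xsd:restriction base="xsd:integer">
      <xsd:minInclusive value="1"/>
      <xsd:maxInclusive value="500"/>
    </xsd:restriction>
  </xsd:simpleType>

  <!-- Multiplicity: each class has between 20 and 30 students. -->
  <xsd:element name="BasketWeavingClass">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element name="Student" type="xsd:string"
                     minOccurs="20" maxOccurs="30"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>
```

XML Schema patterns match the whole value, so the mask needs no start or end anchors. Use minLength and maxLength instead of length when you need a range of lengths rather than a fixed one.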

Simply adding these rules to your schema automatically buys you both documentation and runtime validation. That is quite a deal. UML alone cannot accomplish these things. In fact, you must dust off UML’s obscure Object Constraint Language (OCL) and an OCL code generator to achieve the same functionality. See Resources for more information on OCL.


Best and worst practices for using XML Schema

Modeling required and optional elements

Use XML Schema to model the required and optional elements of your transaction.

Good: If an element in your business message is required, model it as such using XML Schema. Doing so enables a schema validator to flag when bad data enters the system (so you do not have to expend the effort). Also, it documents this fact for your client:

<xsd:element 
   name="FirstName" 
   type="xsd:string" 
   minOccurs="1"/>

Bad: Sometimes a schema indicates that a data item is optional when some other part of the system downstream fails without the data. Here is an optional element in XML Schema:

<xsd:element 
   name="FirstName" 
   type="xsd:string" 
   minOccurs="0"/>

At first, the logic behind this sounds reasonable: If you change this data element from optional to required, you essentially duplicate error processing that occurs elsewhere in the system. If you take this route, though, the consumer of your API will waste a good deal of time discovering the error. Consider this scenario:

  1. User prepares and submits a transaction without the data item (the schema doesn’t state that the item is required).
  2. The other part of your system returns an error.
  3. The user deciphers the error message and eventually discovers that a required element is missing.
  4. User adds the required data item and resubmits the transaction.
  5. User verifies that it runs OK.

You can avoid most of this by advertising in your schema which elements are required.
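In context, required and optional elements might be declared as in this sketch. The Customer, MiddleName, and LastName names are hypothetical.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="Customer">
    <xsd:complexType>
      <xsd:sequence>
        <!-- Required: minOccurs defaults to 1 when it is omitted. -->
        <xsd:element name="FirstName" type="xsd:string"/>
        <!-- Optional: may be left out of the instance document. -->
        <xsd:element name="MiddleName" type="xsd:string" minOccurs="0"/>
        <!-- Required, stated explicitly. -->
        <xsd:element name="LastName" type="xsd:string" minOccurs="1"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>
```

Keep in mind that a required xsd:string element can still be empty. If an empty value is also invalid for your system, restrict the type with a minLength facet.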

Modeling date/time data types

For dates and times that are in your business message, use xsd:dateTime.

Good: The format for the date is well defined in the XML Schema specification.

      <xsd:element name="BirthDate" type="xsd:dateTime"/> 

Bad: The following is bad practice because it forces the user to locate the correct format for a date.

      <xsd:element name="BirthDate" type="xsd:string"/>
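For reference, the sketch below shows the formats that the built-in date types accept. The element names are hypothetical, and xsd:date is shown alongside xsd:dateTime as an option for values that carry no time of day.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <!-- xsd:dateTime carries a date and a time, for example
       1996-12-22T14:30:00 or, with a time zone, 1996-12-22T14:30:00Z. -->
  <xsd:element name="TransTimestamp" type="xsd:dateTime"/>

  <!-- xsd:date carries a calendar date only, for example 1996-12-22. -->
  <xsd:element name="TransDate" type="xsd:date"/>
</xsd:schema>
```

A value such as 12/22/1996 is rejected by both types, which is exactly the kind of check you would otherwise have to code by hand.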

Documenting the schema

Good: Use XML Schema annotation nodes to further describe the elements of a business message and the business message itself. For example:

<xsd:element name="Amount" type="xsd:integer">
   <xsd:annotation>
         <xsd:documentation>
               Use this element to specify 
               the amount that should be transferred
         </xsd:documentation>
   </xsd:annotation>
</xsd:element>

This type of documentation provides assistance while creating both schema and XML instance documents.

Bad: If you do not document your schemas, users have two other options -- they can rifle through your source code or refer to a separate document. Although highly accurate, little nuggets of wisdom embedded in the source code often take hours to pry loose. Documentation kept separate from the schema is more likely to get out of date than documentation inside the schema.

Modeling data items that are mutually exclusive

Good: Many business messages contain data items that are mutually exclusive of each other. For instance, only one element of a possible two (or more) can be present in a single instance of the message. Represent these items with an XML Schema choice element like this:

<xsd:choice>
   <xsd:element name="DestinationAccount" type="AccountKeyType"/>
   <xsd:element name="SourceAccount" type="AccountKeyType"/>
</xsd:choice>

Now, only the following XML fragments are allowed:

<MyMessage>
   <DestinationAccount>...</DestinationAccount>
</MyMessage>

or

<MyMessage>
   <SourceAccount>...</SourceAccount>
</MyMessage>

The schema validator will reject the following:

<MyMessage> 
<!-- fails, because only one of DestinationAccount 
            or SourceAccount is allowed -->
   <DestinationAccount>...</DestinationAccount>
   <SourceAccount>...</SourceAccount>
</MyMessage>

<MyMessage>
<!-- fails, because neither DestinationAccount 
            nor SourceAccount is present -->
</MyMessage>

Bad: It is easy to shy away from the XML Schema syntax when it is unfamiliar. If you choose this path, though, you will have to write the code to enforce the mutually exclusive arrangements between data items in your business messages. Choosing this path also hides an important facet of the data. Only those with access to the source code will know about the mutually exclusive arrangement.
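If the syntax is the obstacle, a complete sketch may help. The definition of AccountKeyType below is hypothetical, since the article does not define it; the rest places the choice shown earlier inside a MyMessage declaration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <!-- Hypothetical definition: a non-empty string key. -->
  <xsd:simpleType name="AccountKeyType">
    <xsd:restriction base="xsd:string">
      <xsd:minLength value="1"/>
    </xsd:restriction>
  </xsd:simpleType>

  <xsd:element name="MyMessage">
    <xsd:complexType>
      <!-- Exactly one of the two elements must appear. -->
      <xsd:choice>
        <xsd:element name="DestinationAccount" type="AccountKeyType"/>
        <xsd:element name="SourceAccount" type="AccountKeyType"/>
      </xsd:choice>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>
```

If a message may also omit both elements, adding minOccurs="0" to the xsd:choice element allows that while still rejecting a message that contains both.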

Enumerations

Good: XML Schema enumerations help the user discover the valid values for a particular variable.

<xsd:element name="TransactionIndicator">
   <xsd:simpleType>
      <xsd:restriction base="xsd:string">
         <xsd:enumeration value="FORCE_POST"/>
         <xsd:enumeration value="BACKDATE"/>
      </xsd:restriction>
   </xsd:simpleType>
</xsd:element>

If, instead, the valid values for a particular variable are stored in a data source (like LDAP or RDBMS), you should specify in the XML Schema documentation precisely how to enumerate the valid values:

<xsd:element 
   name="TransactionIndicator" 
   type="xsd:string">
   <xsd:annotation>
      <xsd:documentation>
Use business message "TransactionIndicatorInquiry" 
to discover the valid values for this element.
      </xsd:documentation>
   </xsd:annotation>
</xsd:element>

Bad: It is a bad idea to abbreviate enumerations; the user will have no idea which value to use.

<xsd:element name="TransactionIndicator">
   <xsd:simpleType>
      <xsd:restriction base="xsd:string">
         <xsd:enumeration value="F"/>
         <xsd:enumeration value="B"/>
      </xsd:restriction>
   </xsd:simpleType>
</xsd:element>

Worse: If a variable has a fixed set of valid values, do not hide them from the user, like this:

<xsd:element 
   name="TransactionIndicator" 
   type="xsd:string"/>

This forces the user to ask a neighbor, experiment, or dig through other documentation or source code.
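When the values themselves need explanation, one option is to document each enumeration value in place, as in this sketch. The type name TransactionIndicatorType is hypothetical, and the documentation text is a placeholder, because the article does not describe what each value means.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <!-- A named type can be reused by every element that needs it. -->
  <xsd:simpleType name="TransactionIndicatorType">
    <xsd:restriction base="xsd:string">
      <xsd:enumeration value="FORCE_POST">
        <xsd:annotation>
          <xsd:documentation>
            Placeholder: describe when FORCE_POST applies.
          </xsd:documentation>
        </xsd:annotation>
      </xsd:enumeration>
      <xsd:enumeration value="BACKDATE">
        <xsd:annotation>
          <xsd:documentation>
            Placeholder: describe when BACKDATE applies.
          </xsd:documentation>
        </xsd:annotation>
      </xsd:enumeration>
    </xsd:restriction>
  </xsd:simpleType>

  <xsd:element name="TransactionIndicator" type="TransactionIndicatorType"/>
</xsd:schema>
```

The same technique can soften the abbreviation problem when legacy values such as F and B cannot be changed: the values stay short, but their meaning is recorded next to them in the schema.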


The downsides of schema validation

Most schema validators, not just those for XML Schema, can add a lot of value to your application. However, schema validators in general have their share of issues. For starters, a schema validator’s messages are descriptive, but not user-friendly. Here is an example:

 [Error] greetings.xml:1:12: Element type "greetings" must be declared.

This is not the kind of message you want to show your end users. Furthermore, few validators, if any, render error messages in multiple languages (English, Chinese, French, and so on). In the same way, dates, times, and monetary amounts in the error messages are not localized. Also, most validators stop reporting errors after they discover the first error. It would be great to know about all errors at one time. Lastly, schema validation still has performance issues. Data binding facilities like Castor, XSD, and JAXB are helping to address these performance concerns. These facilities can also help you use schema validation rules even if you don’t utilize an XML interface. You can now even use Castor schema validation with Apache Axis (see Resources) for a great SOAP implementation.


Conclusion

If you use a schema to describe your data structures -- especially XML Schema -- a number of great third-party tools are available to help you out. However, it takes work to structure XML messages so that schema validators can do their job. If you follow the XML style guidelines presented here, you will kill at least two birds with one stone. First, you will shift responsibility for some data validation from your code to the schema validator -- this will save you time and money. Second, the people creating your XML will know a lot more about your XML interface. If you do a poor job of describing your interface, expect to see these people queued up at your desk waiting for a big chunk of your valuable time.

Acknowledgements
A number of people provided input for this article -- thank you. A special thanks to Colin Reeves, FNF Senior Technical Writer/Editor, for lending her editing expertise.


Resources

  • Get a more detailed look at XML Schema validation rules with this learn-by-example primer from the W3C.

  • With a little work, you can design an XML Schema with place-holders for structures yet to be defined. This buys you both flexibility and schema validation; on the other hand, using name-value pairs for flexibility leaves no room for schema validation. Read Dare Obasanjo’s article on XML Schema flexibility.

  • Implementing a SOAP solution? Add Castor XML Schema validation to Apache Axis using this excellent article (developerWorks, September 2003). Castor's code-generated schema validation executes faster than a parser's schema validation.

  • Bypass the attribute and element text limitations discussed in this article in a number of ways. These articles discuss how to extend XML schemas:

  • Read about Interactive Financial Exchange (IFX) and Open Travel Alliance (OTA), two schemas that define XML message sets. They are designed for tight validation and leave little room for creatively arranging different parts of the schema. The messages are comparable to 4-GL programming languages.

  • Check out MathML and UIML, two schemas that define XML grammars. Their wide-open style resembles that of 3-GL programming languages.

  • Use the guidelines in this article when you work with these Web Services standards:

  • If you add a validation rule to your schema, the schema validator flags any corresponding invalid data at runtime. UML doesn't work that way. You do not gain any runtime error flagging until you employ something like the obscure Object Constraint Language (OCL).

  • Download XML Schema Quality Checker from IBM alphaWorks. It takes as input an XML Schema written in the W3C XML Schema language and diagnoses improper uses of the schema language.

  • Find more XML resources on the developerWorks XML zone.

  • Learn how you can become an IBM Certified Developer in XML and related technologies.

