At first, XML had the Document Type Definition (DTD). XML 1.0 came bundled with the schema technology inherited from SGML. However, numerous XML users complained about DTDs, including the fact that they use a different syntax from XML itself. The W3C developed a successor technology to DTD, W3C XML Schema, but some complained that it was too complex, and that it showed every sign of design-by-committee. Separate groups developed schema technologies that became RELAX NG and Schematron. These technologies all have their strengths and weaknesses, and their attendant factions. But for the developer with deadlines to mind, crafting schemata is often too much of an additional burden.
Without a doubt, it is always a good idea to develop a schema. If for no other reason, it provides documentation of the format. But in the real world, the most common course for harassed developers is to develop a sample of the XML format and let it serve all the purposes of a proper schema. But what if the example itself could provide the benefits of a formal schema? In particular, what if the example could be used to validate documents? Eric van der Vlist set out to develop a system that allows example documents to serve as formal schemata, and his invention is Examplotron.
In this article, I introduce Examplotron. This system is simple to use, so I encourage you to follow along by downloading Examplotron 0.7 (compile.xsl) and using your favorite XSLT and RELAX NG processors (see Resources for relevant links). The Examplotron implementation file you download is called compile.xsl: I thought this name too generic, so on my machine -- and in this article -- I have renamed it to eg-compile.xsl.
To use Examplotron, take most any XML instance and run it through a compiler that creates a compiled Examplotron script. The script can then be run against real instance documents to validate them. In earlier releases of Examplotron, the process was as illustrated in Figure 1:
Figure 1. Processing model of early versions of Examplotron
This is similar to the most common mechanism for Schematron validation. The schema, which in the case of Examplotron is a reference instance document, is compiled into an XSLT script which can then be run against other XML documents to check their validity against the schema. The most recent Examplotron versions (including 0.7, which I cover in this article) use a different process, illustrated in Figure 2.
Figure 2. Processing model of the most recent version of Examplotron
Suppose I come up with an XML format for mailing address labels. To think through the format, and describe it to others, I come up with a simple example, as in Listing 1 (eg1.xml).
Listing 1. A mailing label instance and valid Examplotron schema (eg1.xml)
<?xml version="1.0" encoding="utf-8"?>
<labels>
  <label>
    <name>Thomas Eliot</name>
    <address>
      <street>3 Prufrock Lane</street>
      <city>Stamford</city>
      <state>CT</state>
    </address>
  </label>
</labels>
The brilliant thing is that my simple example is, without any further fuss or ado, a perfectly useful Examplotron schema. You can use any XSLT processor and the eg-compile.xsl script to compile it into a ready form for validation:
$ 4xslt -o eg1.rng eg1.xml eg-compile.xsl
The format of the 4xslt command line above is 4xslt -o [output file] [source file] [XSLT file]. The output file, eg1.rng, is a RELAX NG file. You can use any RELAX NG processor to check it. In this article, I use 4Suite's RELAX NG facilities (which are based on xvif, also by the productive Eric van der Vlist). Since eg1.xml is both a schema and a valid source document, I can apply the created schema against it:
$ 4xml --rng=eg1.rng eg1.xml
The format of the 4xml command line above is 4xml --rng=[RELAX NG schema file] [source file]. By default, the source document is echoed back to the screen as long as no RELAX NG validation errors have been found, which should be the case in the above invocation. For a more telling test, I can apply the created RELAX NG schema against a different document that conforms to eg1.xml -- such as the document in Listing 2 (test1.xml):
Listing 2. Sample document for validation against the Examplotron schema (test1.xml)
<?xml version="1.0" encoding="utf-8"?>
<labels>
  <label>
    <name>Ezra Pound</name>
    <address>
      <street>45 Usura Place</street>
      <city>Hailey</city>
      <state>ID</state>
    </address>
  </label>
</labels>
I apply the schema as follows:
$ 4xml --rng=eg1.rng test1.xml
And again it is valid. Listing 3 (test2.xml) is an example of an invalid document. When I validate it against eg1.rng as above, I get an error message -- "Qname quote not expected" -- which makes perfect sense because there is nothing in the Examplotron source document suggesting that a quote element is legal.
Listing 3. Sample document that is invalid against the Examplotron schema (test2.xml)
<?xml version="1.0" encoding="utf-8"?>
<labels>
  <label>
    <quote>What thou lovest well remains, the rest is dross</quote>
    <name>Ezra Pound</name>
    <address>
      <street>45 Usura Place</street>
      <city>Hailey</city>
      <state>ID</state>
    </address>
  </label>
</labels>
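To make the idea of an example acting as a schema concrete, here is a toy sketch in Python using only the standard library. It is not Examplotron's algorithm (Examplotron compiles the example into RELAX NG and leaves validation to a RELAX NG processor); it merely records which child element names the example shows under each parent and flags anything else. The function names and the inline copies of the listings are my own, for illustration only.

```python
import xml.etree.ElementTree as ET

# Inline copies of Listing 1 (the example) and Listing 3 (the invalid document).
EXAMPLE = """<labels>
  <label>
    <name>Thomas Eliot</name>
    <address>
      <street>3 Prufrock Lane</street>
      <city>Stamford</city>
      <state>CT</state>
    </address>
  </label>
</labels>"""

INVALID = """<labels>
  <label>
    <quote>What thou lovest well remains, the rest is dross</quote>
    <name>Ezra Pound</name>
    <address>
      <street>45 Usura Place</street>
      <city>Hailey</city>
      <state>ID</state>
    </address>
  </label>
</labels>"""


def allowed_children(example_root):
    """Map each element name in the example to the child names seen under it."""
    allowed = {}
    for elem in example_root.iter():
        allowed.setdefault(elem.tag, set()).update(child.tag for child in elem)
    return allowed


def unexpected_elements(example_xml, candidate_xml):
    """Return (parent, child) name pairs that the example never showed."""
    allowed = allowed_children(ET.fromstring(example_xml))
    candidate_root = ET.fromstring(candidate_xml)
    problems = []
    if candidate_root.tag not in allowed:
        problems.append((None, candidate_root.tag))
    for parent in candidate_root.iter():
        for child in parent:
            if child.tag not in allowed.get(parent.tag, set()):
                problems.append((parent.tag, child.tag))
    return problems


if __name__ == "__main__":
    print(unexpected_elements(EXAMPLE, EXAMPLE))
    print(unexpected_elements(EXAMPLE, INVALID))
```

This toy ignores element order, repetition, attributes, and text content, all of which a real RELAX NG schema checks. It only shows why the quote element in Listing 3 is rejected: nothing in the example ever placed a quote inside a label.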
Of course, the sample XML I've presented as an Examplotron schema probably doesn't convey all the information required for validation. For example, are there any optional elements or attributes that were omitted from the sample document? This question is addressed if you make sure that you always include all possible elements or attributes in the Examplotron source schema, even the optional ones; Examplotron includes ways for you to indicate that some elements are optional. Another question that might arise is: Can there be more than one label element? If you want to indicate that you can have more than one of an element, you can simply list it multiple times in the Examplotron source. Listing 4 (eg2.xml) is an Examplotron source file specifying that there can be one or more label elements.
Listing 4. Mailing label Examplotron source that allows multiple labels (eg2.xml)
<?xml version="1.0" encoding="utf-8"?>
<labels>
  <label>
    <name>Thomas Eliot</name>
    <address>
      <street>3 Prufrock Lane</street>
      <city>Stamford</city>
      <state>CT</state>
    </address>
  </label>
  <label/>
</labels>
Notice that the second label is empty. Examplotron figures out the content model for the element from the first one, and the second one is purely a marker to indicate that it can occur more than once, so you can leave it empty. To avoid confusing people who are truly looking at the Examplotron source as a human-readable example, you may want to fill out all such elements with the expected content. Listing 5 (test3.xml) is an example of a document that is valid against Listing 4, having more than one label element.
Listing 5. Sample document that is valid against Listing 4 (test3.xml)
<?xml version="1.0" encoding="utf-8"?>
<labels>
  <label>
    <name>Thomas Eliot</name>
    <address>
      <street>3 Prufrock Lane</street>
      <city>Stamford</city>
      <state>CT</state>
    </address>
  </label>
  <label>
    <name>Ezra Pound</name>
    <address>
      <street>45 Usura Place</street>
      <city>Hailey</city>
      <state>ID</state>
    </address>
  </label>
</labels>
This document is not valid against Listing 1. Since Listing 1 has only one label element, Examplotron takes it at its word and generates a RELAX NG schema that only permits the one element. Also, all elements that appear in an Examplotron schema are required by default.
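The default rule described here (an element shown once is required exactly once; an element shown more than once may repeat) can be sketched in a few lines of Python. This is only an illustration of the rule as the article states it, not code taken from Examplotron; the function name and the abbreviated example strings are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Abbreviated stand-ins for Listing 1 (one label) and Listing 4 (two labels).
EG1 = "<labels><label><name>Thomas Eliot</name></label></labels>"
EG2 = "<labels><label><name>Thomas Eliot</name></label><label/></labels>"


def inferred_occurrence(parent):
    """Guess an occurrence rule for each child name from how often it appears."""
    counts = Counter(child.tag for child in parent)
    return {
        name: "exactly one" if count == 1 else "one or more"
        for name, count in counts.items()
    }


if __name__ == "__main__":
    print(inferred_occurrence(ET.fromstring(EG1)))
    print(inferred_occurrence(ET.fromstring(EG2)))
```

Under this rule the empty second label in Listing 4 contributes nothing but its presence, which is why it can be left empty.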
This is great so far, but not quite flexible enough for the real world. Usually in XML formats, one has to specify that a certain element is optional, or appears a certain number of times. In DTDs, one uses occurrence indicators to express this. Examplotron does some very good guessing based on sample documents as they are, but in most cases you'll have to help it out a bit to get more precise results. You can provide hints to Examplotron by adding special attributes to the source document, similar to DTD occurrence indicators. Listing 6 (eg3.xml) is an Examplotron schema that specifies zero, one, or more label elements, and allows a single, optional quote element within each label.
Listing 6. Mailing label Examplotron source that uses Examplotron hint attributes (eg3.xml)
<?xml version="1.0" encoding="utf-8"?>
<labels xmlns:eg="http://examplotron.org/0/">
  <label eg:occurs="*">
    <quote eg:occurs="?">Midwinter Spring is its own season...</quote>
    <name>Thomas Eliot</name>
    <address>
      <street>3 Prufrock Lane</street>
      <city>Stamford</city>
      <state>CT</state>
    </address>
  </label>
</labels>
Notice the declaration of the special Examplotron namespace in this document, which is used for adding the hint attributes. The eg:occurs attribute has values similar to occurrence indicators in DTDs. Hence "*" means "zero or more", "+" means "one or more", and "?" means "zero or one". (For more on these values, see the Examplotron home page in Resources.)
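As a rough illustration of how these hints sit in the source document, the following Python sketch reads the eg:occurs attributes out of Listing 6 and translates them into the meanings given above. It uses only the standard library; the function name is hypothetical, and the sketch handles only the three values this article covers.

```python
import xml.etree.ElementTree as ET

EG_NS = "http://examplotron.org/0/"
# ElementTree spells a namespaced attribute name as {namespace-uri}local-name.
OCCURS_ATTR = "{%s}occurs" % EG_NS

# Only the three hint values described in this article.
MEANING = {"*": "zero or more", "+": "one or more", "?": "zero or one"}

LISTING_6 = """<labels xmlns:eg="http://examplotron.org/0/">
  <label eg:occurs="*">
    <quote eg:occurs="?">Midwinter Spring is its own season...</quote>
    <name>Thomas Eliot</name>
    <address>
      <street>3 Prufrock Lane</street>
      <city>Stamford</city>
      <state>CT</state>
    </address>
  </label>
</labels>"""


def occurrence_hints(example_xml):
    """Return {element name: meaning} for every element carrying eg:occurs."""
    hints = {}
    for elem in ET.fromstring(example_xml).iter():
        hint = elem.get(OCCURS_ATTR)
        if hint is not None:
            hints[elem.tag] = MEANING.get(hint, "not covered by this sketch: " + hint)
    return hints


if __name__ == "__main__":
    for name, meaning in occurrence_hints(LISTING_6).items():
        print(name, "->", meaning)
```

Because the hints live in their own namespace, they are easy to tell apart from the attributes that belong to the mailing label format itself.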
Mixed content -- the ability to mix child elements and plain text in XML elements -- has spotty support in some schema languages, which is unfortunate since it is one of the defining features of XML. Examplotron in its clever way makes this just as easy as any other feature. Listing 7 is an Examplotron schema that allows the optional quote element to have mixed content, and specifically embedded emph and strong elements.
Listing 7. Mailing label Examplotron source that demonstrates mixed content support
<?xml version="1.0" encoding="utf-8"?>
<labels xmlns:eg="http://examplotron.org/0/">
  <label eg:occurs="*">
    <quote eg:occurs="?">
      <emph>Midwinter</emph> Spring is its own <strong>season</strong>...
    </quote>
    <name>Thomas Eliot</name>
    <address>
      <street>3 Prufrock Lane</street>
      <city>Stamford</city>
      <state>CT</state>
    </address>
  </label>
</labels>
Again, the basic principle of Examplotron holds: You show it an example of a construct and it works the example into RELAX NG form for you. Examplotron supports namespaces in a similar way. Just use namespaces in the source document and Examplotron builds those namespaces into the RELAX NG. Other schema features, such as data typing, are supported through hint attributes. For more detail on these more advanced features of the language, see the full Examplotron specification (in Resources), which is very readable.
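To see what "mixed content" means at the level of the parsed document, here is a small Python sketch that spots elements in Listing 7 containing both child elements and text of their own. This is my own illustration of the concept, not a description of how Examplotron detects it; the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

LISTING_7 = """<labels xmlns:eg="http://examplotron.org/0/">
  <label eg:occurs="*">
    <quote eg:occurs="?">
      <emph>Midwinter</emph> Spring is its own <strong>season</strong>...
    </quote>
    <name>Thomas Eliot</name>
    <address>
      <street>3 Prufrock Lane</street>
      <city>Stamford</city>
      <state>CT</state>
    </address>
  </label>
</labels>"""


def has_mixed_content(elem):
    """True if the element has child elements and non-whitespace text of its own."""
    if len(elem) == 0:
        return False
    # An element's own text is its leading text plus the tail after each child.
    pieces = [elem.text] + [child.tail for child in elem]
    return any(piece and piece.strip() for piece in pieces)


def mixed_elements(example_xml):
    """List the names of elements in the example that show mixed content."""
    return [e.tag for e in ET.fromstring(example_xml).iter() if has_mixed_content(e)]


if __name__ == "__main__":
    print(mixed_elements(LISTING_7))
```

Elements such as label and address contain only whitespace between their children, so they are not treated as mixed; only quote interleaves real text with child elements.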
Schema systems are perhaps the area of XML enjoying the greatest technical advancement. And yet of all the work in XML schema systems, I think that Examplotron is the most brilliant-yet-simple idea. I think you'll find that it can do wonders for productivity. On a recent project, a client who had many XML formats hired me, through my company Fourthought, to develop schemata for documentation and validation for these XML formats. All they had to start with were sample XML documents for each format. Using Examplotron to generate the production RELAX NG schemata from these sample documents saved me perhaps over a hundred hours of effort, and thus saved them tens of thousands of dollars. I did have to augment Examplotron with document generation and other refinement code; I hope to cover the non-proprietary aspects of this refinement code in a future article.
Examplotron produces RELAX NG schemata, but if you must produce W3C XML Schema, all is still well: You can use James Clark's excellent Trang tool to convert RELAX NG to WXS. I know from my overall consulting experience that sample documents are the most common form of schema in the real world, so I expect that Examplotron will be of great help to a lot of folks right away.
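The whole tool chain described in this article can be summarized as three command lines. The Python sketch below only builds them as argument lists; it does not run anything. The 4xslt and 4xml forms are the ones shown earlier. The Trang form assumes a trang launcher on the PATH that takes an input file and an output file (Trang is often run as a Java jar instead), and the output name eg1.xsd is my own placeholder.

```python
import shlex


def compile_command(example, schema, stylesheet="eg-compile.xsl"):
    """4xslt -o [output file] [source file] [XSLT file]"""
    return ["4xslt", "-o", schema, example, stylesheet]


def validate_command(schema, document):
    """4xml --rng=[RELAX NG schema file] [source file]"""
    return ["4xml", "--rng=" + schema, document]


def convert_command(schema, xsd):
    """Assumed Trang invocation: trang [input file] [output file]."""
    return ["trang", schema, xsd]


def as_shell(command):
    """Render an argument list as a shell-quoted command line."""
    return " ".join(shlex.quote(arg) for arg in command)


if __name__ == "__main__":
    steps = [
        compile_command("eg1.xml", "eg1.rng"),
        validate_command("eg1.rng", "test1.xml"),
        convert_command("eg1.rng", "eg1.xsd"),
    ]
    for step in steps:
        print("$", as_shell(step))
```

Keeping each step as an argument list makes it straightforward to hand the commands to a build script or to Python's subprocess module once the tools are installed.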
- Visit the Examplotron home page which is also the (very readable) specification. The link to the XSLT script that implements Examplotron is a bit buried. I suggest renaming this script to "eg-compile.xsl" after download.
- Examplotron schemas are compiled into RELAX NG. If you wish to learn RELAX NG, read this tutorial. See also RELAX NG's Compact Syntax, by Michael Fitzgerald.
- Find out more about Schematron, a very powerful schema language based on rules and abstract patterns. It is often used in conjunction with other schema languages, including W3C XML Schema and RELAX NG, because it can offer some facilities not available with those languages alone. Uche Ogbuji has an Introduction to Schematron that is mostly targeted at XSLT users, and Chimezie Ogbuji has a more general introduction.
- For more information on W3C XML Schema, see the home page and the exhaustive Cover pages.
- Read what David Mertz has to say about RELAX NG in his "XML Matters" column here on developerWorks:
  - Part 1 of this three-part series gives a fairly complete overview of both the syntax and semantics of RELAX NG schemas (February 2003).
  - Part 2 addresses a few additional semantic issues and looks at tools for working with RELAX NG (March 2003).
  - Part 3 looks at tools for working with the RELAX NG compact syntax and transforming between it and the RELAX NG XML syntax form (May 2003).
- See the W3C XSL Home page for links to XSLT processors you can use. See the RELAX NG home page for links to RELAX NG processors.
- Try James Clark's Trang tool to translate between a variety of XML schema languages.
- The author uses 4Suite for XSLT and RELAX NG processing in this article. The RELAX NG support in 4Suite is based on xvif, also by Eric van der Vlist.
- Find more XML resources on the developerWorks XML zone.
Uche Ogbuji is a consultant and co-founder of Fourthought Inc., a software vendor and consultancy specializing in XML solutions for enterprise knowledge management. Fourthought develops 4Suite, an open source platform for XML, RDF, and knowledge-management applications. Mr. Ogbuji is a computer engineer and writer born in Nigeria, living and working in Boulder, Colorado, USA. You can contact him at firstname.lastname@example.org.