My book on XForms (see Resources) landed on the shelves and online in the fall of 2003. Shortly thereafter, I started getting lots of e-mail questions about XForms, usually including a page or three of buggy XML source. Normally I'm good about answering e-mail, but sifting through pages of someone else's XML looking for common typos is neither fun nor productive. I had to find a better way.
I'm a huge believer in constructive laziness, so I decided to write an online tool that would accept an XForms document as input, and produce a report on any markup constructs that were either wrong or suspicious. From my e-mail archive, I had a reasonable sample of the kinds of mistakes people were making. Combining the two, I would have a powerful tool to help form authors help themselves.
The XForms 1.0 specification (see Resources) is defined as a number of elements, attributes, and content models. One thing it doesn't define, however, is a root element -- that is left to a host language to address. The two most common host languages are XHTML and SVG, but in principle almost any XML vocabulary could be used. Thus, the first job of an XForms validator is to extract the XForms portions from a document. For these portions, I've coined the term XForms islands.
Because XForms separates purpose from presentation, all but the most minimal form documents have at least two XForms islands, one for the XForms Model (the definition of what the form does) and one for the XForms User Interface (the definition of what the form looks like).
Listing 1 shows a simple XForms+XHTML document -- which may be too simple, as it contains a common mistake.
Listing 1. A common, but erroneous, XForms+XHTML document.
<?xml version="1.0"?>
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:x="http://www.w3.org/2002/xforms"
      xmlns:ev="http://www.w3.org/2001/xml-events">
  <head>
    <title>Basic mixed document</title>
    <x:model>
      <x:instance>
        <!-- @@@ forgot xmlns="" on <root> @@@ -->
        <root>
          <child/>
        </root>
      </x:instance>
      <x:bind nodeset="child" required="1"/>
    </x:model>
  </head>
  <body>
    <x:input ref="child">
      <x:label>Label</x:label>
    </x:input>
  </body>
</html>
Listing 1 clearly has two XForms islands, one for x:model and one for x:input. The question is how the code should locate
them. Actually, it's not that difficult, since an element can be
identified as an XForms island if it meets two simple conditions:
- It must be in the XForms namespace
- It must not have any ancestors in the XForms namespace
While this test could have been done with XPath, I wanted to explore different ways for Python to perform XML processing. So I wrote a filter function that, given a node, determines whether it starts an XForms island. Listing 2 shows the code.
Listing 2. Picking out XForms islands
import libxml2

# extract islands
def island_filter(node, usedns):
    """
    An 'island' element has
    1) the XForms NS
    2) no ancestors with the XForms NS
    usedns is a string containing the XForms Namespace URI
    """
    if node.type != "element":
        return False  # skip non-elements
    if get_element_ns(node) != usedns:
        return False  # not an XForms element at all
    # walk up the tree; any XForms ancestor disqualifies this node
    node = node.parent
    while node is not None:
        if node.type == "element" and get_element_ns(node) == usedns:
            return False
        node = node.parent
    return True

def get_element_ns(elem):
    """Return the namespace URI of elem, or None if it has no namespace."""
    try:
        return elem.ns().content
    except libxml2.treeError:
        # no namespace available on this element
        return None
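The same two conditions can also be sketched with nothing but Python's standard library. This is not the validator's own code: it swaps libxml2 for xml.etree.ElementTree, and the names find_islands, in_namespace, and SAMPLE are made up for illustration. Because ElementTree nodes have no parent pointer, the sketch walks down from the root and stops descending at the first XForms element it meets, which amounts to the "no XForms ancestors" test.

```python
import xml.etree.ElementTree as ET

XFORMS_NS = "http://www.w3.org/2002/xforms"

# Abbreviated version of Listing 1 (illustrative only).
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:x="http://www.w3.org/2002/xforms">
  <head>
    <x:model>
      <x:instance><root><child/></root></x:instance>
      <x:bind nodeset="child" required="1"/>
    </x:model>
  </head>
  <body>
    <x:input ref="child"><x:label>Label</x:label></x:input>
  </body>
</html>"""


def in_namespace(elem, ns):
    """True if elem is an element whose expanded name is in namespace ns."""
    return isinstance(elem.tag, str) and elem.tag.startswith("{%s}" % ns)


def find_islands(elem, usedns=XFORMS_NS):
    """Collect the outermost elements in usedns, in document order."""
    if in_namespace(elem, usedns):
        return [elem]  # an island; everything below it has an XForms ancestor
    islands = []
    for child in elem:
        islands.extend(find_islands(child, usedns))
    return islands


islands = find_islands(ET.fromstring(SAMPLE))
print([el.tag.split("}")[1] for el in islands])  # ['model', 'input']
```

Stopping the descent at each island also means nested XForms elements, such as x:instance or x:label, are never reported as islands of their own.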
The validator stores the resulting list of XForms islands for later
processing. Now, the mistake from Listing 1 (as described in the comment) is
that the root element has inadvertently been
left in the default namespace -- namely that of XHTML. This mistake, and
several other similar problems, can be detected with XPath-based checks like the one shown in Listing 3, which returns any suspicious element
nodes.
Listing 3. Checking for namespace leakage with XPath
//xf:instance//*[namespace-uri(.)=namespace-uri(/*) or
namespace-uri(.)=$usedns]
Note that since Listing 3 can't make any assumptions about the
structure of the surrounding host language, it makes heavy use of
XPath's // abbreviation, which uses the descendant-or-self axis to thoroughly search the
document under consideration. Note too that the namespace prefix
mappings (xf:) in the XPath don't have to
match those used in the target document (x:).
The test checks descendants of the x:instance
element to see whether they have either the XForms namespace or the
namespace of the document's root element. This definitely qualifies as a heuristic,
since it is possible for perfectly valid XForms documents to trigger
this condition. On the other hand, the chances are pretty good that this
condition is an authoring error, like the one in Listing 1, and so the
validator spits out a warning.
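To make the heuristic concrete, here is a minimal standard-library sketch of the same check. It is an approximation rather than the validator's implementation: it uses xml.etree.ElementTree instead of an XPath engine, and the names ns_of, leaked_nodes, BAD, and GOOD are hypothetical.

```python
import xml.etree.ElementTree as ET

XFORMS_NS = "http://www.w3.org/2002/xforms"


def ns_of(elem):
    """Namespace URI of an element, or '' if it has none."""
    tag = elem.tag
    return tag[1:].split("}", 1)[0] if tag.startswith("{") else ""


def leaked_nodes(root, usedns=XFORMS_NS):
    """Descendants of any instance element that share the root element's
    namespace or the XForms namespace (a heuristic, not proof of an error)."""
    suspects = (ns_of(root), usedns)
    hits = []
    for inst in root.iter("{%s}instance" % usedns):
        for node in inst.iter():
            # inst.iter() yields inst itself first, so skip it
            if node is not inst and ns_of(node) in suspects and node not in hits:
                hits.append(node)
    return hits


# Instance data left in the XHTML default namespace, as in Listing 1.
BAD = """<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:x="http://www.w3.org/2002/xforms">
  <head><x:model><x:instance>
    <root><child/></root>
  </x:instance></x:model></head>
  <body/>
</html>"""
# The same document with the missing xmlns="" restored.
GOOD = BAD.replace("<root>", '<root xmlns="">')

print(len(leaked_nodes(ET.fromstring(BAD))))  # 2
print(leaked_nodes(ET.fromstring(GOOD)))      # []
```

In the BAD document both root and child are flagged, because they silently inherit the XHTML namespace; once xmlns="" is added, nothing is reported.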
Another area of frequent mistakes is matching up IDs and IDREFs.
This is partially a historical problem, since the mechanism for defining
an ID relies on the presence of a DTD. Some tools -- depending largely on
the author's philosophy towards XML Schema and the Infoset -- also allow
IDs to be defined through XML Schema datatypes. In practice, however,
you'll often find little to go on other than the presence of attributes that
happen to be named id.
This situation isn't pretty, but a real-world validator tool needs to
be aware of it. The validator looks at a list of all attributes known to
contain IDREFs in XForms. First it tries the
built-in id() function; if that doesn't
find a match, it resorts to an XPath test that checks for attributes named
either id or xml:id (based on an unfinished W3C draft -- see
Resources). Here's the code:
//*[@id='idstr' or @xml:id='idstr']
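The fallback half of that lookup can be sketched in plain Python as follows. This is only an illustration under several assumptions: it uses xml.etree.ElementTree, which has no id() function, so only the attribute-name test is shown; IDREF_ATTRS is a small illustrative subset (the bind and model attributes), not the validator's actual list; and find_by_id and dangling_idrefs are hypothetical names.

```python
import xml.etree.ElementTree as ET

XFORMS_NS = "http://www.w3.org/2002/xforms"
XML_NS = "http://www.w3.org/XML/1998/namespace"
# Illustrative subset only; the validator's real list is not shown here.
IDREF_ATTRS = ("bind", "model")


def find_by_id(root, idstr):
    """First element whose id or xml:id attribute equals idstr, else None."""
    for elem in root.iter():
        if elem.get("id") == idstr or elem.get("{%s}id" % XML_NS) == idstr:
            return elem
    return None


def dangling_idrefs(root):
    """(attribute, value) pairs on XForms elements that match no id."""
    missing = []
    for elem in root.iter():
        if not elem.tag.startswith("{%s}" % XFORMS_NS):
            continue  # only XForms elements carry these IDREF attributes
        for name in IDREF_ATTRS:
            value = elem.get(name)
            if value is not None and find_by_id(root, value) is None:
                missing.append((name, value))
    return missing


DOC = """<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:x="http://www.w3.org/2002/xforms">
  <head><x:model id="m1"><x:bind xml:id="b1" nodeset="child"/></x:model></head>
  <body>
    <x:input model="m1" bind="b1"/>
    <x:input bind="b2"/>
  </body>
</html>"""

print(dangling_idrefs(ET.fromstring(DOC)))  # [('bind', 'b2')]
```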
As a final step, the validator looks at each XForms island and
validates it using RELAX NG. This is more complicated than it sounds,
since several areas of XForms (such as label)
can contain markup from the host language, not to mention additional
attributes that are allowed everywhere.
To deal with this, the validator uses a highly modularized RELAX NG schema for XForms, which is integrated into a highly permissive host language. By "highly modularized" I mean that every element definition, set of attributes on an element, and content model for an element is assigned a unique name that can be separately extended. Listing 4 shows how this works for a single element definition, using the handy compact syntax of RELAX NG.
Listing 4. RELAX NG modularized element definition
Common.Attributes = empty
Single.Node.Binding.Attributes = attribute bind { xsd:NCName } |
    (attribute model { xsd:NCName }?, attribute ref { xsd:string })
UI.Common.Attributes &=
    # host language to add accesskey, navindex, etc. here
    attribute appearance { xsd:QName { pattern = "[^:]+:[^:]+" } |
        "minimal" | "compact" | "full" }?
Select = element select { Select.Attributes, Select.Content }
Select.Attributes &=
    Common.Attributes,
    Single.Node.Binding.Attributes,
    UI.Common.Attributes,
    attribute selection { "open" | "closed" }?,
    attribute incremental { xsd:boolean }?
Select.Content = Label, List.UI.Common.Content, UI.Common.Content
Note that even attributes that contain an IDREF are labeled as xsd:NCName, which performs only the lexical
validation of the attribute. As I mentioned earlier, actual checking of the ID-to-IDREF connections
happens at a different level. The primary
advantage of defining everything separately is that it's simple to
extend, for example by adding a class
attribute to all form controls.
In fact, that's exactly what the host language schema does. This portion of the validator is still under development, but Listing 5 shows how the host language is defined.
Listing 5. XForms + host language definition
Common.Attributes &=
    attribute id { xsd:NCName }?,
    attribute xml:id { xsd:NCName }?,
    attribute class { xsd:NMTOKENS }?
When included by the main schema for XForms, the code in Listing 5 extends the schema for XForms in a way that allows everyday constructs,
like class and id
attributes, to pass validation. Since the included bits of host language
can be almost anything, this is one area that will need ongoing
adjustments, based on actual usage as seen in the wild.
As the validator works, it keeps track of the running results as an in-memory XML file. At the conclusion, an XSL transformation converts the results into the final HTML that gets sent over the wire.
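The accumulation step might look roughly like the sketch below. The element and attribute names (report, message, severity) and the Report class are assumptions made up for illustration, since the validator's actual result format isn't shown here, and the XSLT step is omitted because Python's standard library has no XSLT processor.

```python
import xml.etree.ElementTree as ET


class Report:
    """Accumulates validation findings as an in-memory XML tree."""

    def __init__(self):
        self.root = ET.Element("report")  # hypothetical element name

    def add(self, severity, message):
        """Append one finding; severity is e.g. 'warning' or 'error'."""
        item = ET.SubElement(self.root, "message", severity=severity)
        item.text = message
        return item

    def count(self, severity):
        """Number of findings recorded with the given severity."""
        return sum(1 for m in self.root.iter("message")
                   if m.get("severity") == severity)

    def to_xml(self):
        """Serialize the running results, ready for a later transformation."""
        return ET.tostring(self.root, encoding="unicode")


report = Report()
report.add("warning", "instance data may be in the host-language namespace")
report.add("error", "bind='b2' does not match any id")
print(report.to_xml())
```

Keeping the findings as XML rather than as preformatted text means the presentation can change by editing only the stylesheet.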
As always, namespaces are a tricky subject, especially for authors. Standards can make things easier, and two developing standards approach this problem in different ways.
One such suite of standards is Document Schema Definition Languages, or DSDL (see Resources), whose parts are at various stages on the way to becoming ISO Final Draft International Standards. Currently divided into 10 parts, DSDL is an implicit recognition of the complexity of the overall subject of validation. Individual parts include the definition for RELAX NG (part 2), Schematron (part 3), and a mechanism something like my XForms islands for selecting validation candidates out of a larger document (part 4). The remaining parts of DSDL cover other diverse areas like character repertoire validation and ways to combine various schema languages.
Another related standard and toolset is CAM, or "Content Assembly Mechanism" from OASIS. This technology allows business rules to define, validate, and compose documents; thus schema fragments can be brought together to define larger, compound documents.
All in all, mixed namespace validation in all its glory is a fertile area of XML development. The XForms validator is still a work in progress, as well as a great learning experience.
- Read the full text of Micah Dubinko's O'Reilly book XForms Essentials online. You can also order the book from the developerWorks Developer Bookstore.
- Try out the XForms Validator discussed in this article.
- Get more details on the underlying specifications for XForms 1.0 and XPath 1.0.
- Explore the concept of constructive laziness and its rich legacy. The
most widely-known reference is perhaps from chapter 2 of Eric Raymond's
The Cathedral and the Bazaar.
- Learn a convenient way to specify IDness in XML by reading the in-progress xml:id Working Draft, which defines a reserved name of xml:id.
- Leverage the power of RELAX NG, starting with the specifications and tutorials on the official site. developerWorks also focuses on this technology in the tutorial "Understanding RELAX NG" (December 2003) by Nicholas Chase.
- Visit the DSDL site, where a
10-part ISO specification covering all manner of XML validation is
under way.
- Also visit the OASIS
Content Assembly Mechanism (CAM) site, which describes another ongoing standardization effort
related to validation.
- Want a more complete understanding of how all the major XML standards interrelate? Check out Uche Ogbuji's excellent four-part survey of XML standards here on developerWorks:
- Part 1 -- The core standards (January 2004)
- Part 2 -- XML processing standards (February 2004)
- Part 3 -- The most important vocabularies (February 2004)
- Part 4 -- Detailed cross-reference of the most important XML standards (March 2004)
- Find more XML resources on the developerWorks XML zone.
- Learn how you can become an IBM Certified Developer in XML and related technologies.

Micah Dubinko is a consultant and founder of Brain Attic, L.L.C., a software vendor and consultancy specializing in defeating information overload. He wrote XForms Essentials for O'Reilly Media and served on the Working Group that developed XForms 1.0. He lives and works in Phoenix, AZ. You can contact him at micah@dubinko.info.