
Inside the XForms validator

Challenges of mixed-namespace validation

Micah Dubinko (micah@dubinko.info), Principal Consultant, Brain Attic, L.L.C.
Micah Dubinko is a consultant and founder of Brain Attic, L.L.C., a software vendor and consultancy specializing in defeating information overload. He wrote XForms Essentials for O'Reilly Media and served on the Working Group that developed XForms 1.0. He lives and works in Phoenix, AZ. You can contact him at micah@dubinko.info.

Summary:  Performing validation on mixed-namespace documents can be more art than science. XForms 1.0, which is used as a component inside arbitrary host languages, introduces some new questions about how a validator should process such documents. This article discusses some of the challenges that the author encountered while writing an online XForms validator tool, and techniques for overcoming these problems.

Date:  10 Sep 2004
Level:  Intermediate

My book on XForms (see Resources) landed on the shelves and online in the fall of 2003. Shortly thereafter, I started getting lots of e-mail questions about XForms, usually including a page or three of buggy XML source. Normally I'm good about answering e-mail, but sifting through pages of someone else's XML looking for common typos is neither fun nor productive. I had to find a better way.

I'm a huge believer in constructive laziness, so I decided to write an online tool that would accept an XForms document as input, and produce a report on any markup constructs that were either wrong or suspicious. From my e-mail archive, I had a reasonable sample of the kinds of mistakes people were making. Combining the two, I would have a powerful tool to help form authors help themselves.

XForms islands

The XForms 1.0 specification (see Resources) is defined as a number of elements, attributes, and content models. One thing it doesn't define, however, is a root element -- that is left to a host language to address. The two most common host languages are XHTML and SVG, but in principle almost any XML vocabulary could be used. Thus, the first job of an XForms validator is to extricate the XForms portions out of a document. For these, I've coined the term XForms islands.

Because XForms separates purpose from presentation, all but the most minimal form documents have at least two XForms islands, one for the XForms Model (the definition of what the form does) and one for the XForms User Interface (the definition of what the form looks like).

Listing 1 shows a simple XForms+XHTML document -- which may be too simple, as it contains a common mistake.


Listing 1. A common, but erroneous, XForms+XHTML document.
<?xml version="1.0"?>
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:x="http://www.w3.org/2002/xforms"
      xmlns:ev="http://www.w3.org/2001/xml-events">
  <head>
    <title>Basic mixed document</title>
    <x:model>
      <x:instance>
        <!-- @@@ forgot xmlns="" on <root> @@@ -->
        <root>
          <child/>
        </root>
      </x:instance>
      <x:bind nodeset="child" required="1"/>
    </x:model>
  </head>
  <body>
    <x:input ref="child">
      <x:label>Label</x:label>
    </x:input>
  </body>
</html>

Listing 1 clearly has two XForms islands, one for x:model and one for x:input. The question is how the code should locate them. Actually, it's not that difficult, since an element can be identified as an XForms island if it meets two simple conditions:

  • It must be in the XForms namespace
  • It must not have any ancestors in the XForms namespace

While this test could have been done with XPath, I wanted to explore different ways for Python to perform XML processing. So I wrote a filter function that, given a node, determines whether it starts an XForms island. Listing 2 shows the code, which uses the libxml2 Python bindings.


Listing 2. Picking out XForms islands
import libxml2

# extract islands
def island_filter(node, usedns):
    """
    An 'island' element has
    1) the XForms NS
    2) no ancestors with the XForms NS

    usedns is a string containing the XForms Namespace URI
    """
    if node.type != "element":
        return False  # skip non-elements
    if get_element_ns(node) != usedns:
        return False  # not an XForms element at all
    # walk up the tree; any XForms ancestor disqualifies the node
    node = node.parent
    while node is not None:
        if node.type == "element" and get_element_ns(node) == usedns:
            return False
        node = node.parent
    return True

def get_element_ns(elem):
    """Return the namespace URI of elem, or None if it has none."""
    try:
        return elem.ns().content
    except libxml2.treeError:
        # libxml2 raises treeError for an element without a namespace
        return None
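
For readers without libxml2 at hand, the same two conditions can be sketched with only the Python standard library. This is not the validator's code: the names find_islands and namespace_of are made up for illustration, and because xml.etree.ElementTree keeps no parent pointers, the sketch walks the tree from the top down and stops descending as soon as it meets an XForms element, rather than climbing the ancestors.

```python
import xml.etree.ElementTree as ET

XFORMS_NS = "http://www.w3.org/2002/xforms"

def namespace_of(elem):
    """Return the namespace URI of an ElementTree element, or None."""
    tag = elem.tag
    if isinstance(tag, str) and tag.startswith("{"):
        return tag[1:].split("}", 1)[0]
    return None

def find_islands(elem, usedns=XFORMS_NS):
    """Yield the outermost elements that are in the usedns namespace."""
    if namespace_of(elem) == usedns:
        yield elem  # an island; its descendants have an XForms ancestor
        return
    for child in elem:
        yield from find_islands(child, usedns)

# A trimmed-down version of the document in Listing 1
DOC = """<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:x="http://www.w3.org/2002/xforms">
  <head>
    <x:model><x:instance><root><child/></root></x:instance></x:model>
  </head>
  <body>
    <x:input ref="child"><x:label>Label</x:label></x:input>
  </body>
</html>"""

islands = [e.tag for e in find_islands(ET.fromstring(DOC))]
print(islands)
```

The sketch reports two islands for this document, the model and the input, using ElementTree's {namespace}localname notation for the tags.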


XPath checks

The validator stores the resulting list of XForms islands for later processing. Now, the mistake from Listing 1 (as described in the comment) is that the root element of the instance data has inadvertently been left in the default namespace, which in this document is the XHTML namespace. This mistake, and several other similar problems, can be detected with XPath-based checks like the one shown in Listing 3, which returns any suspicious element nodes.


Listing 3. Checking for namespace leakage with XPath
//xf:instance//*[namespace-uri(.)=namespace-uri(/*) or
                 namespace-uri(.)=$usedns]

Note that since Listing 3 can't make any assumptions about the structure of the surrounding host language, it makes heavy use of XPath's // abbreviation, which uses the descendant-or-self axis to thoroughly search the document under consideration. Note too that the namespace prefix mappings (xf:) in the XPath don't have to match those used in the target document (x:); only the namespace URIs have to agree. The $usedns variable holds the XForms namespace URI, as in Listing 2. The test checks descendants of the x:instance element to see whether they have either the XForms namespace or the namespace of the document's root element. This definitely qualifies as a heuristic, since it is possible for perfectly valid XForms documents to trigger this condition. On the other hand, the chances are pretty good that this condition is an authoring error, like the one in Listing 1, and so the validator spits out a warning.
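
To make the heuristic concrete, here is a minimal sketch of the same check in plain Python, assuming the standard library's xml.etree.ElementTree in place of a full XPath engine. The function name leaked_elements and the two sample documents are invented for this illustration and are not taken from the validator.

```python
import xml.etree.ElementTree as ET

XFORMS_NS = "http://www.w3.org/2002/xforms"

def namespace_of(elem):
    """Mimic XPath namespace-uri(): return "" when there is no namespace."""
    tag = elem.tag
    if isinstance(tag, str) and tag.startswith("{"):
        return tag[1:].split("}", 1)[0]
    return ""

def leaked_elements(root, usedns=XFORMS_NS):
    """Return descendants of any XForms instance element whose namespace
    matches either the document root's namespace or the XForms namespace."""
    host_ns = namespace_of(root)
    suspicious = []
    for instance in root.iter("{%s}instance" % usedns):
        for elem in instance.iter():
            if elem is instance:
                continue  # only descendants are of interest
            if namespace_of(elem) in (host_ns, usedns):
                suspicious.append(elem)
    return suspicious

WRAPPER = """<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:x="http://www.w3.org/2002/xforms">
  <head><x:model><x:instance>%s</x:instance></x:model></head>
  <body/>
</html>"""

BAD = WRAPPER % "<root><child/></root>"              # as in Listing 1
GOOD = WRAPPER % '<root xmlns=""><child/></root>'    # with the fix applied

for name, text in (("bad", BAD), ("good", GOOD)):
    hits = leaked_elements(ET.fromstring(text))
    print(name, [e.tag for e in hits])
```

With the Listing 1 mistake in place, both root and child are reported as living in the XHTML namespace; once xmlns="" is added to root, the list comes back empty.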


Connecting ID and IDREF

Another area of frequent mistakes is matching up IDs and IDREFs. This is partially a historical problem, since the mechanism for defining an ID relies on the presence of a DTD. Some tools -- depending largely on the author's philosophy towards XML Schema and the Infoset -- also allow IDs to be defined through XML Schema datatypes. In practice, however, you'll often find little to go on other than the presence of attributes that happen to be named id.

This situation isn't pretty, but a real-world validator tool needs to be aware of it. The validator looks at a list of all attributes known to contain IDREFs in XForms. For each value, it first tries the built-in XPath id() function; if that doesn't find a match, it resorts to an XPath test that checks for attributes named either id or xml:id (based on an unfinished W3C draft -- see Resources). Here's the fallback expression, where idstr stands for the IDREF value being looked up:

//*[@id='idstr' or @xml:id='idstr']
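
The fallback step can be sketched in a few lines of Python. Several things here are assumptions made for illustration: the sketch uses the standard library's xml.etree.ElementTree, which has no DTD-aware id() function, so it goes straight to the attribute-name scan; the IDREF_ATTRIBUTES tuple is a small illustrative subset, not the validator's actual list; and the function name dangling_idrefs is invented.

```python
import xml.etree.ElementTree as ET

XFORMS_NS = "http://www.w3.org/2002/xforms"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"   # xml:id
IDREF_ATTRIBUTES = ("bind", "model", "submission")    # illustrative subset

def dangling_idrefs(root):
    """Return (tag, attribute, value) for each IDREF with no matching ID."""
    # Pass 1: collect every value of an attribute named id or xml:id.
    known_ids = set()
    for elem in root.iter():
        for name in ("id", XML_ID):
            if name in elem.attrib:
                known_ids.add(elem.attrib[name])
    # Pass 2: check IDREF-bearing attributes on XForms elements only.
    problems = []
    prefix = "{%s}" % XFORMS_NS
    for elem in root.iter():
        if not (isinstance(elem.tag, str) and elem.tag.startswith(prefix)):
            continue
        for name in IDREF_ATTRIBUTES:
            value = elem.get(name)
            if value is not None and value not in known_ids:
                problems.append((elem.tag, name, value))
    return problems

DOC = """<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:x="http://www.w3.org/2002/xforms">
  <head><x:model id="m1"><x:bind id="b1" nodeset="child"/></x:model></head>
  <body>
    <x:input bind="b1"><x:label>Fine</x:label></x:input>
    <x:input model="m1" ref="child"><x:label>Fine</x:label></x:input>
    <x:input bind="b2"><x:label>Typo</x:label></x:input>
  </body>
</html>"""

problems = dangling_idrefs(ET.fromstring(DOC))
print(problems)
```

Only the third input is reported, since b2 matches no id or xml:id attribute anywhere in the document.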


Validating the islands

As a final step, the validator looks at each XForms island and validates it using RELAX NG. This is more complicated than it sounds, since several areas of XForms (such as label) can contain markup from the host language, not to mention additional attributes that are allowed everywhere.

To deal with this, the validator uses a highly modularized RELAX NG schema for XForms, which is integrated into a highly permissive schema for the host language. By "highly modularized" I mean that every element definition, set of attributes on an element, and content model for an element is assigned a unique name that can be separately extended. Listing 4 shows how this works for a single element definition, using the handy compact syntax of RELAX NG.


Listing 4. RELAX NG modularized element definition
Common.Attributes = empty
Single.Node.Binding.Attributes =  attribute bind { xsd:NCName } |
  (attribute model { xsd:NCName }?, attribute ref { xsd:string })
UI.Common.Attributes &=
  #host language to add accesskey, navindex, etc. here
  attribute appearance { xsd:QName { pattern = "[^:]+:[^:]+" } |
  "minimal" | "compact" | "full" }?

Select = element select { Select.Attributes, Select.Content }
Select.Attributes &=
    Common.Attributes,
    Single.Node.Binding.Attributes,
    UI.Common.Attributes,
    attribute selection { "open" | "closed" }?,
    attribute incremental { xsd:boolean }?
Select.Content = Label, List.UI.Common.Content, UI.Common.Content

Note that even attributes that contain an IDREF are labeled as xsd:NCName, which performs only the lexical validation of the attribute. As I mentioned earlier, actual checking of the ID-to-IDREF connections happens at a different level. The primary advantage of defining everything separately is that it's simple to extend, for example by adding a class attribute to all form controls.

In fact, that's exactly what the host language schema does. This portion of the validator is still under development, but Listing 5 shows how the host language is defined.


Listing 5. XForms + host language definition
Common.Attributes &=
  attribute id { xsd:NCName }?,
  attribute xml:id { xsd:NCName }?,
  attribute class { xsd:NMTOKENS }?

When included by the main schema for XForms, the code in Listing 5 extends the schema for XForms in a way that allows everyday constructs, like class and id attributes, to pass validation. Since the included bits of host language can be almost anything, this is one area that will need ongoing adjustments, based on actual usage as seen in the wild.

As the validator works, it keeps track of the running results as an in-memory XML document. At the conclusion, an XSLT transformation converts the results into the final HTML that gets sent over the wire.
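
The result-tracking step might look something like the following sketch. Everything in it is hypothetical: the article does not show the validator's report format, so the Report class and the report and message element names are placeholders, and the final transformation to HTML is left out because the Python standard library has no XSLT processor.

```python
import xml.etree.ElementTree as ET

class Report:
    """Accumulates validation findings as an in-memory XML tree."""

    def __init__(self):
        self.root = ET.Element("report")  # hypothetical element name

    def add(self, severity, message, location=None):
        """Append one finding; severity is a free-form label such as 'warning'."""
        entry = ET.SubElement(self.root, "message", severity=severity)
        if location is not None:
            entry.set("location", location)
        entry.text = message
        return entry

    def count(self, severity):
        """Count the findings recorded with the given severity."""
        return len(self.root.findall("message[@severity='%s']" % severity))

    def to_xml(self):
        """Serialize the report, ready to hand to a stylesheet."""
        return ET.tostring(self.root, encoding="unicode")

report = Report()
report.add("warning", "instance data may be in the host namespace",
           location="root")
report.add("error", "bind='b2' matches no ID")
print(report.count("warning"), report.count("error"))
print(report.to_xml())
```

Keeping the findings as XML rather than as formatted strings means each check only has to record what it found, and presentation decisions stay in the stylesheet.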


Related standards

As always, namespaces are a tricky subject, especially for authors. Standards can make things easier, and two developing standards approach this problem in different ways.

One such suite of standards is Document Schema Definition Languages, or DSDL (see Resources); these languages are currently progressing variously towards becoming ISO Final Draft International Standards. Currently divided into 10 parts, DSDL is an implicit recognition of the complexity of the overall subject of validation. Individual parts include the definition for RELAX NG (part 2), Schematron (part 3), and a mechanism something like my XForms islands for selecting validation candidates out of a larger document (part 4). The remaining parts of DSDL cover other diverse areas like character repertoire validation and ways to combine various schema languages.

Another related standard and toolset is CAM, or "Content Assembly Mechanism" from OASIS. This technology allows business rules to define, validate, and compose documents; thus schema fragments can be brought together to define larger, compound documents.

All in all, mixed namespace validation in all its glory is a fertile area of XML development. The XForms validator is still a work in progress, as well as a great learning experience.


Resources

