XML Matters: Kicking back with RELAX NG, Part 3

Compact syntax and XML syntax

The RELAX NG compact syntax provides a much less verbose, and easier to read, format for describing the same semantic constraints as RELAX NG XML syntax. This installment looks at tools for working with and transforming between the two syntax forms.

David Mertz (mertz@gnosis.cx), Facilitator, Gnosis Software, Inc.

David Mertz thinks that the schema that is real is not the real schema. David may be reached at mertz@gnosis.cx; his life pored over at http://gnosis.cx/dW/. Suggestions and recommendations on this, past, or future columns are welcomed.



14 May 2003

Readers of my earlier installments on RELAX NG (Part 1 and Part 2) will have noticed that I chose to provide many of my examples using compact syntax rather than XML syntax. Both formats are semantically equivalent, but the compact syntax is, in my opinion, far easier to read and write. Moreover, readers of this column in general will have a sense of how little enamored I am of the notion that everything vaguely related to XML technologies must itself use an XML format. XSLT is a prominent example of this XML-everywhere tendency and its pitfalls -- but that is a rant for a different column.

Later in this article, I will discuss the format of the RELAX NG compact syntax in more detail than the prior installments allowed.

Tool support

On the downside, since the RELAX NG compact syntax is newer -- and not 100% settled at its edges -- tool support for this syntax is less complete than for the XML syntax. For example, even though the Java tool trang supports conversion between compact and XML syntax, the associated tool jing will only validate against XML syntax schemas. Obviously, it is not overly difficult to generate the XML syntax RELAX NG schema to use for validation, but direct usage of the compact syntax schema would be more convenient. Likewise, the Python tools xvif and 4xml validate only against XML syntax schemas.

To help remedy the gaps in direct support for compact syntax, I have produced a Python tool for parsing RELAX NG compact schemas, and for outputting them to XML format. While my rnc2rng tool only does what trang does, Eric van der Vlist and Uche Ogbuji have expressed their interest in including rnc2rng in xvif and 4xml, respectively. Ideally, in the near future direct validation against compact syntax schemas will be included in these tools.

Writing rnc2rng proved more difficult than I anticipated, and there is probably a lesson in that. While RELAX NG compact syntax is quite readable -- as you will see below -- there are enough variations in the arrangement of tokens between instances that a parser was non-trivial to write. For better or worse, I used PLY's lex module to tokenize the schema, but gave up on using yacc for the parsing, and opted for application-specific massaging of the token stream instead. Debugging declarative grammars is often more difficult than incrementally adjusting imperative code. Despite my frequent concern about the unfriendliness of XML, the task of parsing an XML syntax schema would have been far simpler, since I could have let a framework like SAX or DOM do most of the work for me.
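
To make the tokenizing step concrete, here is a minimal sketch of a compact-syntax lexer. It is not the rnc2rng source and does not use PLY; it relies only on Python's standard re module, and the token classes (ANNOTATION, COMMENT, LITERAL, NAME, PUNCT) are hypothetical names chosen for illustration. A real lexer would need to handle considerably more of the syntax.

```python
import re

# Hypothetical token classes; the real compact syntax has many more cases
# (escapes, multi-line literals, keywords, and so on).
TOKEN_SPEC = [
    ("ANNOTATION", r"##[^\n]*"),          # must be tried before COMMENT
    ("COMMENT",    r"#[^\n]*"),
    ("LITERAL",    r'"[^"\n]*"'),
    ("NAME",       r"[A-Za-z_][\w.\-]*(?::[A-Za-z_][\w.\-]*)?"),
    ("PUNCT",      r"[{}()=&|,*+?]"),
    ("SKIP",       r"\s+"),
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


def tokenize(text):
    """Return a list of (token_class, lexeme) pairs, dropping whitespace."""
    pos = 0
    tokens = []
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at offset {pos}")
        if m.lastgroup != "SKIP":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens


tokens = tokenize("element id-num { xsd:string }")
```

Even this toy lexer reports keywords such as "element" as plain names; deciding what each name means in its context is exactly the work that falls to the parsing stage.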


More on RELAX NG editors

Since the last installment, tool support for RELAX NG has gotten a little bit better. Version 2.0 of the <oXygen/> XML editor has been released, incorporating trang as a plug-in, and thereby offering some support for RELAX NG. While this is not the place for a full review, I found that <oXygen/> 2.0 -- which I liked in version 1.2 to start with -- has gained a number of nice features and general polish. I would like to see RELAX NG integrated at a deeper level into various editors -- to a degree similar to DTD and W3C XML Schema. With a bit more time, I think greater RELAX NG integration into tools is likely.


Syntax features: Namespaces

A compact syntax RELAX NG schema may begin with any of several optional namespace declarations. Each of these looks a lot like an assignment statement in a programming language. A default namespace for schema tags may be specified with:

default namespace = "http://relaxng.org/ns/structure/version"

When converted to XML syntax, this declaration adds an "ns" attribute to the root element of the schema. Whether or not you declare it, the namespace of the RELAX NG tags themselves is declared on the root element with "xmlns", such as:

<root-tag xmlns="http://relaxng.org/ns/structure/1.0">

You may also declare an external namespace for elements or attributes:

namespace foo = "http://some.path.to/foo"

This allows you to describe elements like:

element foo:bar { ... }

When converted to XML syntax, the namespace URL is added to the root tag as an extra attribute:

<root-tag xmlns="http://relaxng.org/ns/structure/1.0"
          xmlns:foo="http://some.path.to/foo">

The namespace "a" is a bit special here. RELAX NG allows annotations, which are basically just tags with the "a" namespace. In compact syntax, you can avoid thinking about namespaces by adding an annotation with initial double hash marks:

## An annotation

Converted to XML syntax, this annotation appears as:

<a:documentation>An annotation</a:documentation>

By the way, a single leading hash introduces a comment instead of an annotation, so the following compact syntax form:

# This is a comment

corresponds to this XML form:

<!-- This is a comment -->
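
The two hash forms can be mapped to their XML counterparts mechanically. The following is a minimal sketch of that mapping, not code taken from rnc2rng or trang; the function name hash_line_to_xml is hypothetical, and the handling of "--" inside comments is an assumption added because XML comments may not contain that sequence.

```python
from xml.sax.saxutils import escape


def hash_line_to_xml(line):
    """Convert one '##' annotation line or '#' comment line to XML syntax."""
    stripped = line.strip()
    if stripped.startswith("##"):
        # Annotation text becomes element content, so it must be escaped.
        text = escape(stripped[2:].strip())
        return f"<a:documentation>{text}</a:documentation>"
    if stripped.startswith("#"):
        # Assumption: break up "--", which is illegal inside an XML comment.
        body = stripped[1:].strip().replace("--", "- -")
        return f"<!-- {body} -->"
    raise ValueError("not a comment or annotation line")


annotation = hash_line_to_xml("## An annotation")
comment = hash_line_to_xml("# This is a comment")
```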

You can also use a slightly odd compact syntax form to specify other annotations within the "a" namespace:

[ a:defaultValue = "foo" ]

A root attribute "xmlns:a" will be specified automatically in the XML syntax if annotations are used, but since "a" is just another namespace, you can specify your own URL if you want. The default attribute is equivalent to specifying:

namespace a = "http://relaxng.org/ns/compatibility/annotations/1.0"

One more special namespace is specified differently in both syntax forms. Data types rely on a modular specification, usually using W3C XML Schema data types. You may specify these with compact syntax:

datatypes xsd = "http://www.w3.org/2001/XMLSchema-datatypes"

or XML syntax:

<root-tag xmlns="http://relaxng.org/ns/structure/1.0"
   datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
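
As a rough summary of this section, the sketch below collects compact-syntax declarations into the attributes that end up on the root tag of the XML syntax. It is an illustration under simplifying assumptions, not the logic of any real converter: the function name root_attributes and the regular expression are hypothetical, only the three declaration forms shown above are recognized, and a single datatypes declaration is assumed.

```python
import re

RELAXNG_NS = "http://relaxng.org/ns/structure/1.0"

# Recognizes only: default namespace = "..." / namespace p = "..." / datatypes p = "..."
DECL_RE = re.compile(
    r'^\s*(?:(default)\s+namespace|(namespace|datatypes)\s+([\w.\-]+))'
    r'\s*=\s*"([^"]*)"\s*$'
)


def root_attributes(declarations):
    """Map compact-syntax declaration lines to root-element attributes."""
    attrs = {"xmlns": RELAXNG_NS}
    for line in declarations:
        m = DECL_RE.match(line)
        if m is None:
            raise ValueError(f"not a declaration: {line!r}")
        is_default, keyword, prefix, uri = m.groups()
        if is_default:
            attrs["ns"] = uri
        elif keyword == "namespace":
            attrs[f"xmlns:{prefix}"] = uri
        else:  # keyword == "datatypes"
            attrs["datatypeLibrary"] = uri
    return attrs


attrs = root_attributes([
    'default namespace = "http://some.other.url/ns"',
    'namespace foo = "http://home.of.foo/ns"',
    'datatypes xsd = "http://www.w3.org/2001/XMLSchema-datatypes"',
])
```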

Syntax features: Nested and context-free

The main body of a RELAX NG grammar may have either of two styles. In some way, the more direct style is to simply nest elements and attributes where they should occur in a valid instance. Generally, it is good form to use indentation much as you would in a programming language, but as in C-family languages, curly braces are the actual block delimiters. A moderately complete schema would look like this:

Listing 1. A nested compact syntax schema
# A library patron example
default namespace = "http://some.other.url/ns"
namespace foo = "http://home.of.foo/ns"
datatypes xsd = "http://www.w3.org/2001/XMLSchema-datatypes"
## Annotation here
element patron {
  element name { xsd:string { pattern = "\w{,10}" } }
  & element id-num { xsd:string }
  & element book {
      ( attribute isbn { text }
      | attribute title { text }
      | attribute anonymous { empty })
    }*
}

The library patron example uses most of the syntax elements. "&"s are interspersed between elements (or attributes), indicating that the several elements must occur, but may do so in any order. In XML syntax, this is the same as the <interleave> tag. Likewise, interspersed "|"s indicate a choice between several items -- in XML, <choice>. Notice the "book" element, too: The parentheses indicate a group, but they are redundant in this case. A group (XML: <group>), however, is useful as part of quantification or interspersal. For example:

Listing 2. Using groups for quantification
element foo {
    ( element bar { text },
      element baz { text } )+,
    element bam { text } }

In this case, a valid document's root <foo> element might contain several <bar></bar><baz></baz> sequences prior to one final <bam> element. There is no way to express the same concept by only quantifying the individual "bar" and "baz" elements.
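
The difference between quantifying a group and quantifying its members can be checked with an analogy to ordinary regular expressions. The sketch below is only an analogy, not RELAX NG validation: it flattens the child element names of <foo> into a space-terminated string and compares a grouped pattern with an individually quantified one. All of the names in it are hypothetical.

```python
import re


def child_names(tags):
    """Flatten a sequence of child tag names into 'bar baz bam ' form."""
    return "".join(f"{t} " for t in tags)


# Analogous to: ( element bar {text}, element baz {text} )+, element bam {text}
GROUPED = re.compile(r"(?:bar baz )+bam ")
# Analogous to: element bar {text}+, element baz {text}+, element bam {text}
INDIVIDUAL = re.compile(r"(?:bar )+(?:baz )+bam ")

alternating = child_names(["bar", "baz", "bar", "baz", "bam"])
clumped = child_names(["bar", "bar", "baz", "bam"])

grouped_accepts_alternating = GROUPED.fullmatch(alternating) is not None
individual_accepts_alternating = INDIVIDUAL.fullmatch(alternating) is not None
grouped_accepts_clumped = GROUPED.fullmatch(clumped) is not None
individual_accepts_clumped = INDIVIDUAL.fullmatch(clumped) is not None
```

The grouped pattern accepts the alternating sequence and rejects the clumped one; the individually quantified pattern does the reverse, so neither can stand in for the other.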

A nested-style RELAX NG grammar need not describe a single element only. Any well-formed XML document must have a single root element, so clearly an attribute at the top is prohibited. Likewise, a sequence or interleave description at the top level could not describe a well-formed XML document, and therefore it could not describe a valid one. But there is nothing wrong with allowing a choice of root elements, such as:

( element foo {text}
| element bar {text} )

A second style of RELAX NG grammar more closely resembles a DTD. A special production named "start" is indicated at the beginning, followed by a variety of other named productions. As with namespace declarations, a production is named in the manner of an assignment in a programming language. For example, a library patron schema could also look something like this:

Listing 3. A context-free compact syntax schema
# A library patron example
default namespace = "http://some.other.url/ns"
namespace foo = "http://home.of.foo/ns"
datatypes xsd = "http://www.w3.org/2001/XMLSchema-datatypes"
## Annotation here
start = patron
patron = name & id-num & book
name = element name { xsd:string { pattern = "\w{,10}" } }
id-num = element id-num { xsd:string }
book = element book {
      ( attribute isbn { text }
      | attribute title { text }
      | attribute anonymous { empty }) }*

Names of productions may occur within other productions, which can prevent repetitions, and generally make complex patterns more readable. Beyond readability, naming patterns allows recursive definition of patterns -- either direct or mutual recursion. For example, describing HTML -- where tables can nest within tables, or lists within lists -- is not possible in a strictly nested style. One upshot is that, for recursive XML instance documents, DTDs and context-free RELAX NG are much more natural descriptions than W3C XML Schemas are (you can get what is needed out of W3C XML Schemas; it just requires more work).
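
Mutual recursion between named patterns has a direct analogue in mutually recursive functions. The sketch below assumes a hypothetical pair of productions for nested lists (shown in the comments; they do not come from any schema in this article) and hand-codes a checker for them using Python's standard xml.etree.ElementTree. It illustrates the recursion only and is not a general RELAX NG validator.

```python
import xml.etree.ElementTree as ET


def is_list(elem):
    # Hypothetical production:  ul = element ul { li+ }
    children = list(elem)
    return (elem.tag == "ul"
            and len(children) >= 1
            and all(is_item(child) for child in children))


def is_item(elem):
    # Hypothetical production:  li = element li { mixed { ul* } }
    # Text is allowed anywhere; every child element must itself be a list.
    return elem.tag == "li" and all(is_list(child) for child in elem)


nested = ET.fromstring("<ul><li>a<ul><li>b</li></ul></li><li>c</li></ul>")
wrong_child = ET.fromstring("<ul><li><ol/></li></ul>")
empty_list = ET.fromstring("<ul/>")

nested_ok = is_list(nested)
wrong_child_ok = is_list(wrong_child)
empty_ok = is_list(empty_list)
```

Because is_list calls is_item and is_item calls is_list, the checker accepts lists nested to any depth, which a strictly nested description could spell out only to some fixed depth.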

It is probably worth looking at an entire XML syntax RELAX NG schema document. For comparison, Listing 4 is what rnc2rng produces when processing the context-free library patron schema in Listing 3:

Listing 4. A context-free XML syntax schema
<?xml version="1.0" encoding="UTF-8"?>
<!-- A library patron example -->
<grammar xmlns="http://relaxng.org/ns/structure/1.0"
    ns="http://some.other.url/ns"
    datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes"
    xmlns:a="http://relaxng.org/ns/compatibility/annotations/1.0"
    xmlns:foo="http://home.of.foo/ns">
  <a:documentation>Annotation here</a:documentation>
  <start><ref name="patron"/></start>
  <define name="patron">
    <interleave>
      <ref name="name"/>
      <ref name="id-num"/>
      <ref name="book"/>
    </interleave>
  </define>
  <define name="name">
    <element name="name">
      <data type="string">
        <param name="pattern">\w{,10}</param>
      </data>
    </element>
  </define>
  <define name="id-num">
    <element name="id-num">
      <data type="string"/>
    </element>
  </define>
  <define name="book">
    <zeroOrMore>
      <element name="book">
        <choice>
          <attribute name="isbn"/>
          <attribute name="title"/>
          <attribute name="anonymous">
            <empty/>
          </attribute>
        </choice>
      </element>
    </zeroOrMore>
  </define>
</grammar>

I would say this is easier to read than a W3C XML Schema, but it doesn't even come close to the compact syntax (prior installments pointed out that this schema is actually impossible to express precisely in either a W3C XML Schema or a DTD).
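
Whatever its readability, the XML syntax has the practical advantage that generic XML tooling can inspect it. As a minimal sketch, the code below uses Python's standard xml.etree.ElementTree to check that every <ref> in a grammar has a matching <define>. The SCHEMA string is an abbreviated, hypothetical stand-in for a context-free grammar, and undefined_refs is an illustrative helper, not part of rnc2rng, trang, or jing.

```python
import xml.etree.ElementTree as ET

RNG = "{http://relaxng.org/ns/structure/1.0}"

# Abbreviated, hypothetical grammar used only for this illustration.
SCHEMA = """<grammar xmlns="http://relaxng.org/ns/structure/1.0">
  <start><ref name="patron"/></start>
  <define name="patron">
    <interleave>
      <ref name="name"/>
      <ref name="id-num"/>
      <ref name="book"/>
    </interleave>
  </define>
  <define name="name"><element name="name"><text/></element></define>
  <define name="id-num"><element name="id-num"><text/></element></define>
  <define name="book">
    <zeroOrMore><element name="book"><empty/></element></zeroOrMore>
  </define>
</grammar>"""


def undefined_refs(schema_text):
    """Return the sorted names that are referenced but never defined."""
    root = ET.fromstring(schema_text)
    defined = {d.get("name") for d in root.iter(RNG + "define")}
    referenced = {r.get("name") for r in root.iter(RNG + "ref")}
    return sorted(referenced - defined)


missing = undefined_refs(SCHEMA)
```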


Miscellany

In these examples you'll notice that elements and attributes in compact syntax always contain something in curly braces after their name. In XML syntax you can self-close an attribute tag, but in compact syntax, to prevent ambiguity, you need to specify at least {text} or {empty} for an attribute body. Of course, you can also use a more complex data type description if you wish. Also, the only quantification that makes sense for attributes is "?" -- attributes might be optional, but they will not be repeated multiple times.

In some corner cases, rnc2rng differs from trang. For example, both tools force an annotation to occur inside a root element in XML syntax, even if the annotation line occurs before the root element in the compact syntax. Since well-formed XML documents are single-rooted, this is a necessity. But trang also moves comments in a similar manner, while rnc2rng does not. At a minimum, the two tools use whitespace in a slightly different manner. Most likely, a few other variations exist, but ideally none that are semantically important.

Resources

  • Participate in the discussion forum.
  • Download the xvif library. For a somewhat more polished tool, 4Suite incorporates xvif for RELAX NG validation. The command-line tool 4xml will validate against both RELAX NG and DTDs, with various options. 4Suite includes many other tools and libraries for working with many XML-related technologies.
  • trang and jing are complementary tools for transformation between schemata, and validation against RELAX NG schemas. The former depends on the latter but both can be downloaded in a convenient archive here.
  • You will need to obtain an implementation of the Java API for XML Processing (JAXP) to use trang. If you run a Java 1.4 JVM, you are fine; otherwise, download crimson here.
  • DTDinst is a Java tool for converting DTDs into an XML instance document format, including handling of parameter entities. The DTDinst XML format is of limited utility by itself, since nothing else works with it. However, an XSLT stylesheet is available to transform this format into RELAX NG (with a few caveats). You will need an XSLT tool to utilize this.
  • Find a collection of documents and tools presented in this series of articles here.
  • Read David Mertz's roundup of XML editors: Part 1 examines Java and MacOS applications (including <oXygen/>), while Part 2 looks at Windows-based products. You'll also find all of the previous installments of the XML Matters column.
  • Find more XML resources on the developerWorks XML zone.
