XML Matters: Enforcing validity with the gnosis.xml.validity library

Squeezing OOP data into XML rules

David Mertz (mertz@gnosis.cx), Subsumer, Gnosis Software, Inc.
David Mertz uses a wholly unstructured brain to write about structured document formats. David may be reached at mertz@gnosis.cx; his life pored over at http://gnosis.cx/dW/.

Summary:  Most hitherto existing XML APIs have enforced well-formedness at a programmatic level, but hardly any can guarantee validity. This is a serious weakness in the whole field of XML processing. This installment discusses the author's gnosis.xml.validity library for enforcing validity in Python objects that are intended for XML serialization.

Date:  01 Jul 2002
Level:  Intermediate

In an earlier tip written for developerWorks, I took a conceptual look at reconciling object-oriented programming (OOP) techniques with XML validity constraints. This installment of XML Matters presents an early version of an actual Python module for doing it. You could create an analogous capability in other programming languages, but Python provides particularly versatile reflection mechanisms, and allows clear expression of validity constraints.

On the face of it, Python -- with its extremely dynamic (albeit strict) typing -- might seem like a strange choice for implementing what is, essentially, an elaborate type system. But any oddness that you might perceive is superficial. In fact, while the type systems of languages like the Java language, C++, and C# are static, they are far too impoverished to offer much meaningful help in enforcing XML validity constraints. A pure functional language like Haskell might offer type hierarchies, discriminated unions, quantification, existential types, and so on, but OOP languages typically lack these things. In statically typed OOP languages, you would have to build just as much custom validation into a library as is found in the Python library discussed here.

The module gnosis.xml.validity can helpfully be contrasted with several other XML-related modules. Two other libraries that have been incorporated into the author's gnosis.xml package were discussed in earlier articles (see Resources). gnosis.xml.pickle is able to produce a specialized XML serialization of any Python object whatsoever and, as with Python's standard pickle and cPickle modules, provides a way to save and restore objects. Furthermore, gnosis.xml.objectify operates in a reverse direction: Given an arbitrary XML document, you can generate a Python-like object (with a slight loss of information about the original XML).

The Python standard library includes support for DOM and SAX processing of XML documents. Widely used third-party Python packages extend support to include XSLT processing:

  • DOM (specifically xml.dom.minidom) offers a rather heavy API for OOP-style manipulation of XML documents -- with methods common across DOM implementations in many programming languages.
  • SAX treats an XML document as a series of parsing events, and basically allows a procedural programming style.
  • XSLT declares a set of rules for transforming an XML document into something else (such as a different XML document).

All of these libraries are useful, but none of them prevents an application from modifying an XML-representation object in ways that break the validity of the underlying XML. For example, deleting, adding, or moving a DOM node can easily create a DOM hierarchy that cannot be dumped into a valid XML document.
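
To make the DOM problem concrete, here is a minimal sketch using the standard library's xml.dom.minidom (written for Python 3). It assumes a hypothetical rule that a <dissertation> must contain at least one <chapter>; the document text is made up for illustration.

```python
from xml.dom import minidom

# Start from a document that satisfies the assumed rule:
# a <dissertation> holds at least one <chapter>.
doc = minidom.parseString(
    "<dissertation><chapter><title>T</title>"
    "<paragraph>text</paragraph></chapter></dissertation>"
)
root = doc.documentElement

# DOM removes the only chapter without complaint; nothing in the API
# knows about (or checks) the validity rule.
for chapter in root.getElementsByTagName("chapter"):
    root.removeChild(chapter)

broken_xml = root.toxml()
print(broken_xml)
```

The result is still well-formed XML, but it no longer satisfies the assumed rule, and the DOM API gave no warning.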

What makes up validity?

The basic purpose of XML validity is to specify what can occur inside an element, how often it can occur, and what alternatives exist for what can occur. Also, when multiple things can occur inside an element, the order of occurrence can be specified or left open, as needed. DTDs differ somewhat from W3C XML Schemas in what they can express, but the gist is the same. Let's look at a highly simplified, hypothetical dissertation.dtd:


Listing 1. A dissertation DTD with all basic constraints

<!ELEMENT dissertation (dedication?, chapter+, appendix*)>
<!ELEMENT dedication (#PCDATA)>
<!ELEMENT chapter (title, paragraph+)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT paragraph (#PCDATA | figure | table)+>
<!ELEMENT figure EMPTY>
<!ELEMENT table EMPTY>
<!ELEMENT appendix (#PCDATA)>

In this example, a dissertation may contain one dedication, must contain (one or more) chapters, and may contain (zero or more) appendixes. The various sub-elements occur in the listed order (if at all). Some elements contain only character data. In the case of the <paragraph> tag, it may contain either character data or a <figure> sub-element or a <table> sub-element -- or any combination of each of these. Structures can nest, but every basic validity concept is included in the example.
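
One way to see what the first declaration in Listing 1 demands is to treat the content model as a regular expression over child tag names. The sketch below (Python 3, standard library only) is an illustration of the rule, not a validator: it checks only the direct children of the root, and the helper name children_match is an assumption made for this example.

```python
import re
import xml.etree.ElementTree as ET

# The content model (dedication?, chapter+, appendix*) rewritten as a
# regular expression over a space-terminated list of child tag names.
DISSERTATION_MODEL = re.compile(r"(dedication )?(chapter )+(appendix )*")

def children_match(xml_text, model=DISSERTATION_MODEL):
    """Return True if the root's direct children fit the content model."""
    root = ET.fromstring(xml_text)
    names = "".join(child.tag + " " for child in root)
    return model.fullmatch(names) is not None

ok = children_match(
    "<dissertation><dedication/><chapter/><chapter/></dissertation>")
bad_order = children_match(
    "<dissertation><chapter/><dedication/></dissertation>")
no_chapter = children_match(
    "<dissertation><appendix/></dissertation>")
print(ok, bad_order, no_chapter)
```

The first document passes; the second fails because the dedication comes after a chapter, and the third fails because chapter+ requires at least one chapter.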

The gnosis.xml.validity module allows you to create, for example, a dissertation Python object that can only represent a valid dissertation. Moreover, when the object is transformed into XML -- using the print command or str() function -- the XML automatically matches the desired DTD.


Validity in action

The easiest way to understand what gnosis.xml.validity does is to see it used. In attitude, gnosis.xml.validity owes its heritage to the Spark parser. That is, validity classes are defined using Python reflection rather than traditional sequential programming. This symmetry is interesting because, in a sense, Spark and gnosis.xml.validity do exactly opposite things: The former assumes rule-based structure in external texts; the latter enforces it in internal objects.

A validity class is based very closely on a corresponding DTD or XML Schema. A class simply inherits from a relevant validity type, and then specializes (if necessary) by adding a class attribute. In one convention that's often used, any class that's named with an initial underscore represents a structure that does not have a corresponding tag. For example, a <paragraph> element in a dissertation can contain a collection of PCDATA, <figure>, and <table> elements. The disjunction type that is assembled into a <paragraph> collection does not itself have an XML tag. Therefore, this disjunction type is named _mixedpara in the example below:


Listing 2. dissertation.py
from gnosis.xml.validity import *
class appendix(PCDATA):   pass
class table(EMPTY): pass
class figure(EMPTY): pass
class _mixedpara(Or): _disjoins = (PCDATA, figure, table)
class paragraph(Some): _type = _mixedpara
class title(PCDATA): pass
class _paras(Some): _type = paragraph
class chapter(Seq): _order = (title, _paras)
class dedication(PCDATA): pass
class _apps(Any): _type = appendix
class _chaps(Some): _type = chapter
class _dedi(Maybe): _type = dedication
class dissertation(Seq): _order = (_dedi, _chaps, _apps)

As with a DTD, the top level of a particular object or XML document can be any tag whose rules are given. dissertation happens to be the highest level available here, but you can create documents of lower types as well. Let's take a look:


Listing 3. Creating a valid dissertation chapter

>>> from dissertation import chapter, title, _paras, paragraph, PCDATA
>>> chap1 = chapter(( title(PCDATA('About Validity')),
...                   _paras([paragraph(PCDATA('It is a good thing'))])
...                ))
>>> print chap1
<chapter><title>About Validity</title>
<paragraph>It is a good thing</paragraph>
</chapter>

A <chapter> is initialized with a tuple containing a <title> and a _paras list. A <title>, in turn, is initialized with some PCDATA, which is itself initialized with a (Unicode) string. Likewise, a _paras list contains some paragraphs, which are themselves initialized with PCDATA. Once an appropriate object exists, it simply prints itself as valid XML.

Although those nested initialization items obey the details of the specified DTD validity rules, they are rather cumbersome. The gnosis.xml.validity module offers a much friendlier style of initialization. Whenever a particular type is required, an initializer for that type is transparently lifted into the type itself. Moreover, when a quantification type would normally be initialized with a list of items of the right type, specifying just one item lifts it into a length-one list containing that item. Lifting is recursive. Note that Seq types that use lifting must be created with the factory function LiftSeq(), whereas other types can lift their own initialization arguments (the details have to do with new-style inheritance from immutable Python types). This sounds complicated, but it is obvious in practice:

>>> from dissertation import LiftSeq
>>> chap1 = LiftSeq(chapter,('About Validity','It is a good thing'))
>>> print chap1
<chapter><title>About Validity</title>
<paragraph>It is a good thing</paragraph>
</chapter>
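
The idea behind lifting can be sketched independently of the library. The classes below are stand-ins written for Python 3, not the gnosis.xml.validity implementation: Tagged is a hypothetical helper, and this Some only mimics the behavior described above (a bare initializer is promoted to the required type, and a single item is promoted to a one-item list).

```python
class PCDATA(str):
    """Character data; prints as itself."""

class Tagged:
    """Hypothetical helper: one value of a required type inside a tag."""
    _type = PCDATA

    def __init__(self, value):
        # Lifting: a raw initializer (such as a plain str) is turned
        # into the required type instead of being rejected.
        if not isinstance(value, self._type):
            value = self._type(value)
        self.value = value

    def __str__(self):
        name = type(self).__name__
        return "<%s>%s</%s>" % (name, self.value, name)

class paragraph(Tagged):
    pass

class Some(list):
    """Stand-in for a one-or-more list of items of type _type."""
    _type = None

    def __init__(self, items):
        # Lifting: a single item becomes a length-one list ...
        if not isinstance(items, (list, tuple)):
            items = [items]
        # ... and each member is lifted into the member type in turn.
        super().__init__(
            item if isinstance(item, self._type) else self._type(item)
            for item in items
        )
        if len(self) < 1:
            raise ValueError("need at least one item")

    def __str__(self):
        return "".join(str(item) for item in self)

class _paras(Some):
    _type = paragraph

explicit = _paras([paragraph(PCDATA("It is a good thing"))])
lifted = _paras("It is a good thing")
print(str(explicit) == str(lifted))
```

Both spellings produce the same object content; the lifted form simply lets each constructor do the wrapping that the explicit form spells out by hand.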


Validity enforcement

Thus far, you have created some valid XML and objects. So what? You could have just written valid XML text by hand. You realize the value of gnosis.xml.validity when you want to modify an object in either valid or invalid ways. For example, here is a valid modification:


Listing 4. Adding a paragraph (valid operation)

>>> paras_ch1 = chap1[1]
>>> paras_ch1 += [paragraph('OOP can enforce it')]
>>> print chap1
<chapter><title>About Validity</title>
<paragraph>It is a good thing</paragraph>
<paragraph>OOP can enforce it</paragraph>
</chapter>

What happens when you try something that is not allowed? For example, a dissertation can have at most one dedication (as specified in Listing 1):


Listing 5. Creating an optional dedication

>>> from dissertation import _dedi, dedication
>>> Maybe_dedication = _dedi([])
>>> print Maybe_dedication

>>> Maybe_dedication.append(dedication("To Mom."))
>>> print Maybe_dedication
<dedication>To Mom.</dedication>

>>> Maybe_dedication.append(dedication("Also to Dad."))
Traceback (most recent call last):
  File "<pyshell#71>", line 1, in ?
    Maybe_dedication.append(dedication("Also to Dad."))
  File "validity.py", line 140, in append
    raise LengthError, self.length_message % self._tag
LengthError: List <_dedi> must have length zero or one

Likewise, you cannot include something of the wrong type, even if the length of a quantification is acceptable:


Listing 6. Attempting to add item of wrong type

>>> from gnosis.xml.validity import ValidityError
>>> try:
...     paras_ch1.append(dedication("To my advisor"))
... except ValidityError, x:
...     print x
Items in _paras must be of type <class 'dissertation.paragraph'>
(not <class 'dissertation.dedication'>)

All the exceptions that might be raised by violating constraints are descended from ValidityError. Programming with the gnosis.xml.validity library will probably involve wrapping many operations in try/except blocks; it should not be possible to create an invalid object by attempting a disallowed operation.
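
The pattern of a shared base exception plus a check before mutation can be sketched as follows. This is a Python 3 stand-in, not the library's code: LengthError and ValidityError appear in the listings above, but TypeMismatch and this simplified Maybe are assumptions made for illustration.

```python
class ValidityError(Exception):
    """Stand-in base class for every constraint violation."""

class LengthError(ValidityError):
    pass

class TypeMismatch(ValidityError):  # hypothetical name
    pass

class Maybe(list):
    """Stand-in for a zero-or-one quantification holding str items."""
    _type = str
    max_length = 1

    def append(self, item):
        # Check first, mutate last: a rejected call leaves the list
        # exactly as it was.
        if not isinstance(item, self._type):
            raise TypeMismatch("items must be of type %r" % self._type)
        if len(self) + 1 > self.max_length:
            raise LengthError("list must have length zero or one")
        super().append(item)

dedications = Maybe()
dedications.append("To Mom.")
try:
    dedications.append("Also to Dad.")
except ValidityError as err:  # catches LengthError and TypeMismatch alike
    rejected = type(err).__name__
print(rejected, dedications)
```

Catching the base class is enough to handle any violation, and because the check precedes the mutation, the object is still valid after the failed call.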


Some notes on the implementation

Keep in mind that gnosis.xml.validity is strictly for Python 2.2+. Although it is possible to implement it in earlier Python versions, I felt this project would make a good testing ground for some newer Python features. Specifically, the library takes advantage of the type/class unification and new-style classes. I have some ideas for doing some tricky stuff with metaclasses in future library versions, and I might even work in properties and slots.

The design of gnosis.xml.validity relies heavily on Python's introspection/reflection capabilities. Several abstract classes provide the main functionality. Each of these classes must have concrete children to actually do anything, although each child needs to implement at most one class attribute. When an XML tag corresponds to a class, the tag name is taken directly from the class name. As noted earlier, if a class name begins with an underscore, it has no corresponding XML tag. The basic rule here is that any tagged validity class serializes itself with surrounding open and close tags; a tagless class just serializes its raw content (which might, however, include items that themselves have tags). This scheme imposes a limitation: gnosis.xml.validity cannot work with DTDs that specify XML tags with leading underscores. This limitation could be removed in future versions, but probably will not be unless users need it.

The base abstract classes consist of the following:

  • PCDATA may be used directly, and so is not really abstract. An XML element that contains PCDATA should inherit from this, but does not need to provide any further specialization. However, in an alternation list for the Or type, you simply need to list PCDATA. This is very closely modeled on DTD syntax. I recommend listing PCDATA first in such a list (as DTDs require), but this is not currently mandatory.
  • EMPTY is also modeled on DTD syntax. As with PCDATA, this class should be inherited from, but no further specialization is required.
  • A child of Or must add a _disjoins tuple as a class attribute. Normally, that one attribute will be the whole implementation. Other validity classes should be listed in the tuple. Conceptually, a disjunction should involve two or more things, but no error is currently raised if there are fewer disjoins.
  • A child of Seq must add an _order tuple as a class attribute. Normally, that one attribute is the whole implementation. Listed in the tuple should be two or more other validity classes; as with Or, the tuple length is not currently checked. In instantiating a Seq child, it is usually safer to utilize the factory function LiftSeq().
  • The Quantification abstract class is a special case, in a way. The examples in this article have not used Quantification, but have instead used its (still abstract) children. For example, here is the implementation of the class Some:

Listing 7. Quantification abstract child Some

from sys import maxint  # Python 2 only; assumed source of maxint in this excerpt

class Some(Quantification):
    length_message = "List <%s> must have length >= 1"
    min_length = 1
    max_length = maxint

  • The Maybe and Any classes have similar implementations. These Quantification children cover all the quantification options for DTDs, but XML Schemas can allow others (for example, Three_to_Seven) whose implementation is straightforward. I realize that a pretty good length_message could be generated from the other attributes, but I felt that the pluralization and phrasing of messages would be better if they were written by a programmer.
  • A concrete descendant of Quantification must add a _type class attribute, which simply points to another validity class. In principle, a concrete child could add its own min_length, max_length, and length_message -- but using an intermediary feels like better design.

What remains to be done

As of this writing, gnosis.xml.validity is largely a proof-of-concept: A few things are still missing. The most glaring absence is the lack of any facility for adding XML tag attributes -- let alone enforcing their validity. In structure, attributes look a lot like sub-elements (unordered ones), so a similar enforcement mechanism can be added to later versions of gnosis.xml.validity. Such an addition would certainly be of the highest priority.

gnosis.xml.validity would benefit from the addition of some other conveniences:

  • Automatically generating a set of Python validity classes from a DTD or XML Schema would be nice. Unlike in a DTD, however, a set of Python validity classes needs to be defined in a particular order -- or at least in an order that defines each class before it is named in an attribute of another class.
  • You might find it useful to read from an existing, and valid, XML document, but it's not necessarily obvious how you can best achieve this. Since member items need to be valid objects prior to their inclusion in larger structures, the simplest recursive descent approach would not work. But it should be possible to de-serialize an XML document to corresponding validity classes.
  • Finally, some sort of higher-level interface might make it easier to work with the presented validity classes. The strategy currently used in the library is to raise exceptions for every disallowed action; but wrapping this in more convenient APIs might be possible. Perhaps silent failure or flag return values would be useful, or another sort of fallback operations for error cases. Deciding the right interfaces probably will require more experimentation by users (including myself).
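
The ordering requirement in the first item above amounts to a topological sort of the element declarations. The following sketch (Python 3.9 or later, for graphlib) shows one possible approach under simplifying assumptions: it handles only plain <!ELEMENT ...> declarations like those in Listing 1, and the function name definition_order is invented for this example. A recursive DTD would make graphlib raise CycleError, which this sketch does not handle.

```python
import re
from graphlib import TopologicalSorter  # Python 3.9+

DTD = """
<!ELEMENT dissertation (dedication?, chapter+, appendix*)>
<!ELEMENT dedication (#PCDATA)>
<!ELEMENT chapter (title, paragraph+)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT paragraph (#PCDATA | figure | table)+>
<!ELEMENT figure EMPTY>
<!ELEMENT table EMPTY>
<!ELEMENT appendix (#PCDATA)>
"""

DECLARATION = re.compile(r"<!ELEMENT\s+(\w+)\s+(.*?)>")
KEYWORDS = {"PCDATA", "EMPTY", "ANY"}

def definition_order(dtd_text):
    """List element names so each comes after the elements it contains."""
    graph = {}
    for name, model in DECLARATION.findall(dtd_text):
        # Every identifier in the content model is a dependency,
        # except the DTD keywords.
        graph[name] = set(re.findall(r"[A-Za-z_]\w*", model)) - KEYWORDS
    return list(TopologicalSorter(graph).static_order())

order = definition_order(DTD)
print(order)
```

Any order this produces defines title and paragraph before chapter, and puts dissertation last, which is the constraint that a generated module of validity classes would have to respect.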

I welcome reader feedback about the direction that later versions of gnosis.xml.validity should take. I believe the initial functionality will already aid a variety of XML programming tasks, but given how little similar library development has been done elsewhere, my intuitions about what is most useful are still vague.


Resources
