In an earlier tip written for developerWorks, I took a conceptual look at reconciling object-oriented programming (OOP) techniques with XML validity constraints. This installment of XML Matters presents an early version of an actual Python module for doing it. You could create an analogous capability in other programming languages, but Python provides particularly versatile reflection mechanisms, and allows clear expression of validity constraints.
On the face of it, Python -- with its extremely dynamic (albeit strict) typing -- might seem like a strange choice for implementing what is, essentially, an elaborate type system. But any oddness that you might perceive is superficial. In fact, while the type systems of languages like the Java language, C++, and C# are static, they are far too impoverished to offer much meaningful help to XML validity constraints. A pure functional language like Haskell might offer type hierarchies, discriminated unions, quantification, existential types, and so on, but OOP languages typically lack these things. In statically typed OOP languages, you would have to build just as much custom validation into a library as that found in the currently discussed Python library.
gnosis.xml.validity can helpfully be contrasted with several other XML-related modules. Two other libraries that have been incorporated into the author's gnosis.xml package were discussed in earlier articles (see Resources). gnosis.xml.pickle is able to produce a specialized XML serialization of any Python object whatsoever and, as with Python's standard pickle and cPickle modules, provides a way to save and restore objects. gnosis.xml.objectify operates in the reverse direction: Given an arbitrary XML document, you can generate a Python-like object (with a slight loss of information about the original XML).
The Python standard library includes support for DOM and SAX processing of XML documents. Widely used third-party Python packages extend support to include XSLT processing:
- DOM (specifically xml.dom.minidom) offers a rather heavy API for OOP-style manipulation of XML documents -- with methods common across DOM implementations in many programming languages.
- SAX treats an XML document as a series of parsing events, and basically allows a procedural programming style.
- XSLT declares a set of rules for transforming an XML document into something else (such as a different XML document).
All of these libraries are useful, but none of them prevent an application from modifying an XML-representation object in ways that break the validity of the underlying XML. For example, deleting, adding, or moving a DOM node can easily create a DOM hierarchy that cannot be dumped into a valid XML document.
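As a small illustration of the point above, the following sketch uses the standard library's xml.dom.minidom. The sample document and the rule that every chapter must contain a title are assumptions made for the example; minidom itself knows nothing about such a rule, which is exactly the problem.

```python
from xml.dom import minidom

# Illustrative document; the rule "every <chapter> must contain a <title>"
# is an assumed validity constraint, not something minidom knows about.
doc = minidom.parseString(
    "<chapter><title>About Validity</title>"
    "<paragraph>It is a good thing</paragraph></chapter>"
)
chapter = doc.documentElement
title = chapter.getElementsByTagName("title")[0]

# DOM performs the deletion without complaint.
chapter.removeChild(title)

# The result is still well-formed XML, but it no longer satisfies the
# assumed rule, and nothing in the API warned about that.
damaged = chapter.toxml()
print(damaged)
```

The serialized result is well-formed, so a later consumer only discovers the problem if it validates the document separately.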
The basic purpose of XML validity is to specify what can occur inside an element, how often it can occur, and what alternatives exist for what can occur. Also, when multiple things can occur inside an element, the order of occurrence can be specified or left open, as needed. DTDs differ somewhat from W3C XML Schemas in what they can express, but the gist is the same. Let's look at a highly simplified, hypothetical dissertation DTD:
Listing 1. A dissertation DTD with all basic constraints
<!ELEMENT dissertation (dedication?, chapter+, appendix*)>
<!ELEMENT dedication (#PCDATA)>
<!ELEMENT chapter (title, paragraph+)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT paragraph (#PCDATA | figure | table)+>
<!ELEMENT figure EMPTY>
<!ELEMENT table EMPTY>
<!ELEMENT appendix (#PCDATA)>
In this example, a dissertation may contain one dedication, must contain (one or more) chapters, and may contain (zero or more) appendixes. The various sub-elements occur in the listed order (if at all). Some elements contain only character data. In the case of the <paragraph> tag, it may contain either character data or a <figure> sub-element or a <table> sub-element -- or any combination of each of these. Structures can nest, but every basic validity concept is included in the example.
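One way to see what the quantifiers and ordering in the dissertation rule mean is to treat the content model as a pattern over child tag names. The sketch below is only an illustration of the constraint, written with the standard re module; the helper name children_are_valid is hypothetical, and this is not how gnosis.xml.validity enforces validity.

```python
import re

# Hypothetical helper: the content model (dedication?, chapter+, appendix*)
# expressed as a regular expression over space-terminated child tag names.
DISSERTATION_MODEL = re.compile(r"(dedication )?(chapter )+(appendix )*")

def children_are_valid(child_tags):
    """Return True if the sequence of child tag names fits the model."""
    encoded = "".join(tag + " " for tag in child_tags)
    return DISSERTATION_MODEL.fullmatch(encoded) is not None

ok = children_are_valid(["dedication", "chapter", "chapter", "appendix"])
no_chapter = children_are_valid(["dedication", "appendix"])
wrong_order = children_are_valid(["chapter", "dedication"])
two_dedications = children_are_valid(["dedication", "dedication", "chapter"])
print(ok, no_chapter, wrong_order, two_dedications)
```

Only the first sequence fits: the others omit the required chapter, put the dedication after a chapter, or repeat the optional dedication.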
The gnosis.xml.validity module allows you to create, for example, a dissertation Python object that can only represent a valid dissertation. Moreover, when the object is transformed into XML -- using the str() function -- the XML automatically matches the desired DTD.
The easiest way to understand what gnosis.xml.validity does is to see it used. In attitude, gnosis.xml.validity owes its heritage to the Spark parser. That is, validity classes are defined using Python reflection rather than traditional sequential programming. This symmetry is interesting because, in a sense, Spark and gnosis.xml.validity do exactly opposite things: The former assumes rule-based structure in external texts; the latter enforces it in internal objects.
A validity class is based very closely on a corresponding DTD or XML Schema. A class simply inherits from a relevant validity type, and then specializes (if necessary) by adding a class attribute. By the convention used here, any class that's named with an initial underscore represents a structure that does not have a corresponding tag. For example, a <paragraph> element in a dissertation can contain a collection of character data, <figure> elements, and <table> elements. The disjunction type that is assembled into a <paragraph> collection does not itself have an XML tag. Therefore, this disjunction type is named _mixedpara in the example below:
Listing 2. dissertation.py
from gnosis.xml.validity import *
class appendix(PCDATA): pass
As with a DTD, the top level of a particular object or XML document can be any tag whose rules are given. dissertation happens to be the highest level available here, but you can create documents of lower types as well. Let's take a look:
Listing 3. Creating a valid dissertation chapter
>>> from dissertation import chapter, title, _paras, paragraph, PCDATA
>>> chap1 = chapter(( title(PCDATA('About Validity')),
...                   _paras([paragraph(PCDATA('It is a good thing'))])
...                ))
>>> print chap1
<chapter><title>About Validity</title>
<paragraph>It is a good thing</paragraph>
</chapter>
A <chapter> is initialized with a tuple containing a <title> and a _paras list. A <title>, in turn, is initialized with some PCDATA, which is itself initialized with a (Unicode) string. Likewise, a _paras list contains some paragraphs, which are themselves initialized with PCDATA. Once an appropriate object exists, it simply prints itself as valid XML.
Although those nested initialization items obey the details of the specified DTD validity rules, they are rather cumbersome. gnosis.xml.validity offers a much friendlier style of initialization. Whenever a particular type is required, the initializer for that type is transparently lifted into the type itself. Moreover, when a quantification type would normally be initialized by a list of things of the right type, specifying just one item lifts it into a length-one list of the item. Lifting is recursive. Note that Seq types that use lifting must use the factory function LiftSeq(), but other types can lift their own initialization arguments (the details have to do with new-style inheritance from immutable Python types). This sounds complicated, but is quite intuitive in practice:
>>> from dissertation import LiftSeq
>>> chap1 = LiftSeq(chapter,('About Validity','It is a good thing'))
>>> print chap1
<chapter><title>About Validity</title>
<paragraph>It is a good thing</paragraph>
</chapter>
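The idea of lifting can be sketched independently of the library. In the block below, the helpers lift and lift_list are hypothetical, and title and paragraph are simplified stand-ins that just subclass str (the real paragraph has mixed content); the sketch shows only the two rules described above, namely wrapping a bare value in the required type and wrapping a single item in a length-one list.

```python
class PCDATA(str):
    """Stand-in for character data."""

class title(PCDATA):
    pass

class paragraph(PCDATA):
    pass

def lift(cls, value):
    """Return value unchanged if it is already a cls, else wrap it."""
    return value if isinstance(value, cls) else cls(value)

def lift_list(item_cls, value):
    """Lift a single item to a one-element list, then lift each item."""
    items = value if isinstance(value, list) else [value]
    return [lift(item_cls, item) for item in items]

# A plain string is lifted into the required type ...
t = lift(title, "About Validity")
# ... and a single string is lifted into a length-one list of paragraphs.
paras = lift_list(paragraph, "It is a good thing")
print(type(t).__name__, [type(p).__name__ for p in paras])
```

A value that already has the right type passes through untouched, which is what lets the explicit and the lifted initialization styles be mixed freely.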
Thus far, you have created some valid XML and objects. So what? You could have just written valid XML text by hand. You realize the value of gnosis.xml.validity when you want to modify an object in either valid or invalid ways. For example, here is a valid modification:
Listing 4. Adding a paragraph (valid operation)
>>> paras_ch1 = chap1[1]
>>> paras_ch1 += [paragraph('OOP can enforce it')]
>>> print chap1
<chapter><title>About Validity</title>
<paragraph>It is a good thing</paragraph>
<paragraph>OOP can enforce it</paragraph>
</chapter>
What happens when you try something that is not allowed? For example, a dissertation can have at most one dedication (as specified in Listing 1):
Listing 5. Creating an optional dedication
>>> from dissertation import _dedi, dedication
>>> Maybe_dedication = _dedi()
>>> print Maybe_dedication

>>> Maybe_dedication.append(dedication("To Mom."))
>>> print Maybe_dedication
<dedication>To Mom.</dedication>
>>> Maybe_dedication.append(dedication("Also to Dad."))
Traceback (most recent call last):
  File "<pyshell#71>", line 1, in ?
    Maybe_dedication.append(dedication("Also to Dad."))
  File "validity.py", line 140, in append
    raise LengthError, self.length_message % self._tag
LengthError: List <_dedi> must have length zero or one
Likewise, you cannot include something of the wrong type, even if the length of a quantification is acceptable:
Listing 6. Attempting to add item of wrong type
>>> from gnosis.xml.validity import ValidityError
>>> try:
...     paras_ch1.append(dedication("To my advisor"))
... except ValidityError, x:
...     print x
...
Items in _paras must be of type <class 'dissertation.paragraph'>
(not <class 'dissertation.dedication'>)
All the exceptions that might be raised by violating constraints are descended from ValidityError. Programming with the gnosis.xml.validity library will probably involve wrapping many operations in try/except blocks; it should not be possible to create an invalid object by attempting a disallowed operation.
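The practical benefit of a shared base exception is that one except clause covers every kind of violation. The following self-contained sketch uses stand-in classes to show the pattern; AtMostOne and TypeMismatch are hypothetical names, and only ValidityError and LengthError correspond to names that appear in the listings above.

```python
class ValidityError(Exception):
    """Stand-in base class for all constraint violations."""

class LengthError(ValidityError):
    pass

class TypeMismatch(ValidityError):   # hypothetical name
    pass

class AtMostOne(list):
    """Toy list that holds zero or one item of a required type."""
    def __init__(self, item_type):
        super().__init__()
        self.item_type = item_type

    def append(self, item):
        # Check before mutating, so a failed append leaves the list valid.
        if not isinstance(item, self.item_type):
            raise TypeMismatch("item must be %s" % self.item_type.__name__)
        if len(self) >= 1:
            raise LengthError("list must have length zero or one")
        super().append(item)

dedications = AtMostOne(str)
dedications.append("To Mom.")
caught = []
for bad in ("Also to Dad.", 42):
    try:
        dedications.append(bad)
    except ValidityError as err:      # catches both kinds of violation
        caught.append(type(err).__name__)
print(dedications, caught)
```

Because each check runs before the list is changed, the object is still in a valid state after both rejected operations.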
Keep in mind that gnosis.xml.validity is strictly for Python 2.2+. Although it is possible to implement it in earlier Python versions, I felt this project would make a good testing ground for some newer Python features. Specifically, the library takes advantage of the type/class unification and new-style classes. I have some ideas for doing some tricky stuff with metaclasses in future library versions, and I might even work in properties and slots.
The design of gnosis.xml.validity relies heavily on Python's introspection/reflection capabilities. Several abstract classes provide the main functionality. Each of these classes must have concrete children to actually do anything, although each child needs to implement at most one class attribute. When an XML tag corresponds to a class, the tag name is taken directly from the class name. As noted earlier, if a class name begins with an underscore, it has no corresponding XML tag. The basic rule here is that any tagged validity class serializes itself with surrounding open and close tags; a tagless class just serializes its raw content (which might, however, include items that themselves have tags). This scheme imposes a limitation: gnosis.xml.validity cannot work with DTDs that specify XML tags with leading underscores; this limitation could be removed in future versions, but probably will not be unless users have a need for this.
The base abstract classes consist of the following:
- PCDATA may be used directly, and so is not really abstract. An XML element that contains PCDATA should inherit from this, but does not need to provide any further specialization. However, in an alternation list for the Or type, you simply need to list PCDATA. This is very closely modeled on DTD syntax. I recommend listing PCDATA first in such a list (as DTDs require), but this is not currently mandatory.
- EMPTY is also modeled on DTD syntax. As with PCDATA, this class should be inherited from, but no further specialization is required.
- A child of Or must add a _disjoins tuple as a class attribute. Normally, that one attribute will be the whole implementation. Other validity classes should be listed in the tuple. Conceptually, a disjunction should involve two or more things, but no error is currently raised if there are fewer disjoins.
- A child of Seq must add an _order tuple as a class attribute. Normally, that one attribute is the whole implementation. Listed in the tuple should be two or more other validity classes; as with Or, the tuple length is not currently checked. In instantiating a Seq child, it is usually safer to utilize the factory function LiftSeq().
- The Quantification abstract class is a special case, in a way. The examples in this article have not used Quantification, but have instead used its (still abstract) children. For example, here is the implementation of the class Some:
Listing 7. Quantification abstract child Some
class Some(Quantification):
    length_message = "List <%s> must have length >= 1"
    min_length = 1
    max_length = maxint
The Maybe and Any classes have similar implementations. These Quantification children cover all the quantification options for DTDs, but XML Schemas can allow others (for example, Three_to_Seven) whose implementation is straightforward. I realize that a pretty good length_message could be generated from the other attributes, but I felt that the pluralization and phrasing of messages would be better if they were written by a programmer.
- A concrete descendant of Quantification must add a _type class attribute, which simply points to another validity class. In principle, a concrete child could add its own length_message -- but using an intermediary feels like better design.
As of this writing, gnosis.xml.validity is largely a proof-of-concept: A few things are still missing. The most glaring absence is the lack of any facility for adding XML tag attributes -- let alone enforcing their validity. In structure, attributes look a lot like sub-elements (unordered ones), so a similar enforcement mechanism can be added to later versions of gnosis.xml.validity. Such an addition would certainly be of the highest priority.
gnosis.xml.validity would benefit from the addition of some other conveniences:
- Automatically generating a set of Python validity classes from a DTD or XML Schema would be nice. Unlike in a DTD, however, a set of Python validity classes needs to be defined in a particular order -- or at least in an order that defines each class before it is named in an attribute of another class.
- You might find it useful to read from an existing, and valid, XML document, but it's not necessarily obvious how you can best achieve this. Since member items need to be valid objects prior to their inclusion in larger structures, the simplest recursive descent approach would not work. But it should be possible to de-serialize an XML document to corresponding validity classes.
- Finally, some sort of higher-level interface might make it easier to work with the presented validity classes. The strategy currently used in the library is to raise exceptions for every disallowed action; but wrapping this in more convenient APIs might be possible. Perhaps silent failure or flag return values would be useful, or some other sort of fallback operation for error cases. Deciding on the right interfaces will probably require more experimentation by users (including myself).
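The ordering requirement in the first item above is a dependency-ordering problem, and a generator could solve it with a topological sort. The sketch below assumes Python 3.9 or later for the standard graphlib module (far newer than the Python 2.2 the library targets) and hand-codes the dependencies from the dissertation DTD in Listing 1; extracting them from a DTD automatically is not shown.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each element mapped to the elements its content model mentions
# (transcribed by hand from the dissertation DTD in Listing 1).
uses = {
    "dissertation": {"dedication", "chapter", "appendix"},
    "chapter": {"title", "paragraph"},
    "paragraph": {"figure", "table"},
    "dedication": set(), "title": set(), "figure": set(),
    "table": set(), "appendix": set(),
}

# static_order() yields every node after all of its dependencies, which is
# an order in which the corresponding classes could safely be defined.
order = list(TopologicalSorter(uses).static_order())
print(order)
```

Several orders are acceptable; the only guarantee is that each element appears after everything its content model names, so dissertation always comes last.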
I welcome reader feedback about the direction that later versions of gnosis.xml.validity should take. I believe the initial functionality will already aid a variety of XML programming tasks, but given how little similar library development has been done elsewhere, my intuitions about what is most useful are still vague.
- Check out the general goals for developing the gnosis.xml.validity library as outlined in the developerWorks XML tip, "Creating valid XML with object-oriented programming" (developerWorks, March 2002).
- The Haskell library HaXml accomplishes everything that mine does, but within the framework of a pure functional language. While HaXml is very different, conceptually, from an object-oriented approach, you can read about it in an earlier installment of this column, "Transcending the limits of DOM, SAX, and XSLT" (developerWorks, October 2001).
- Download the most current version of Gnosis_Utils. To obtain gnosis.xml.validity, download version 1.0.2 or higher.
- Find more XML resources on the developerWorks XML technology zone.
- Get Rational Application Developer for WebSphere Software, an easy-to-use, integrated development environment for building, testing, and deploying J2EE applications, including generating XML documents from DTDs and schemas.
- IBM XML certification: Find out how you can become an IBM-Certified Developer in XML and related technologies.
- Find other articles in David Mertz's XML Matters column.
David Mertz uses a wholly unstructured brain to write about structured document formats. David may be reached at email@example.com; his life pored over at http://gnosis.cx/dW/.