Create three-level taxonomy modeling strategies using W3C XSD and OASIS CAM

Consider a simple modeling strategy in the context of two schema approaches

Often when people create a vocabulary to describe a problem-space, they find themselves using a taxonomy that divides the problem-space using three levels. For example, in a financial application, a ledger record might be identified categorically (debit or credit) and then broken down within these categories by type and subtype (for instance, "interest accrued due to prior underpayment" might be broken into a triple of credit/interest/underpayment). You can model this sort of structure in XML in a number of ways depending on the requirements of the data, and you can enforce this modeling using a variety of different schema approaches. I describe two schema approaches: W3C XML Schema Definition (XSD) and Organization for the Advancement of Structured Information Standards (OASIS) Content Assembly Mechanism (CAM).

Piers Michael Hollott, Senior Consultant, Sierra Systems

Piers has worked in the software industry for 15 years, and specializes in Java development, XML technologies, and functional programming. He has contributed to several open source projects and is currently a consultant with Sierra Systems.



18 January 2011

Taxonomy in a software application

Taxonomy is the science of classification, defining the methods used to relate different genres of things hierarchically. For instance, a taxonomy divides the natural world into animal and plant kingdoms, and so on, in finer details through species and subspecies. In the context of a software application or ontological vocabulary, taxonomy refers to the hierarchical schema that is used to differentiate classes of objects within an application, messaging layer, or dataset. In practice, I find that a taxonomy consisting of two levels, a supertype and a subtype, is too general to be particularly useful, and that a taxonomy with four or more levels becomes cumbersome and hard to maintain. Experience shows that a taxonomy with three levels is suitable for a wide variety of tasks. I refer to these three levels of taxonomy as category, type, and subtype.

Frequently used acronyms

  • ISO: International Organization for Standardization
  • W3C: World Wide Web Consortium
  • XML: Extensible Markup Language

An example of this breakdown from an application that tracks pets in a pet shop might feature a category of dog or cat, a type of Jack Russell Terrier, and subtypes of long-haired, coarse-haired, and short-haired. A more compelling example of this three-part breakdown of data, however, is provided by an application drawn from the financial world, a simple general ledger application.

Vocabulary for a sample financial application

Consider a financial ledger application that tracks debts, credits, and payments. These records might be defined using a vocabulary such as the one in Table 1.

Table 1. Sample vocabulary data for a financial application
Category   Type       Subtype
debt       interest   overpay
                      adjust
credit     interest   overpay
                      adjust
payment    eft        bank
           chq        bank
                      indiv
                      trustee

This is a simple vocabulary that I use throughout this article for illustrative purposes. It is in no way an attempt to encompass the large problem-space of the finance domain. Using this vocabulary as an example, I want to suggest some approaches and technologies that you might find useful in defining vocabulary for a variety of business domains.
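As a point of reference for the schema discussion that follows, the vocabulary in Table 1 can be written down as a nested mapping. This is only a sketch in Python; the names TAXONOMY and is_valid are hypothetical and do not come from the sample files.

```python
# Table 1 as a nested mapping: category -> type -> allowed subtypes.
# TAXONOMY and is_valid are illustrative names, not part of the article's samples.
TAXONOMY = {
    "debt": {"interest": {"overpay", "adjust"}},
    "credit": {"interest": {"overpay", "adjust"}},
    "payment": {
        "eft": {"bank"},
        "chq": {"bank", "indiv", "trustee"},
    },
}


def is_valid(category, type_, subtype):
    """Return True when the triple appears in Table 1."""
    return subtype in TAXONOMY.get(category, {}).get(type_, set())


print(is_valid("payment", "chq", "trustee"))  # True
print(is_valid("debt", "eft", "bank"))        # False
```

The nesting makes the dependency explicit: the set of allowed types depends on the category, and the set of allowed subtypes depends on the type.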


Modeling taxonomy with a flat XML structure

The easiest way to model an item identified by taxonomy is as a single element with multiple attributes, as in Listing 1.

Listing 1. Flat XML structure
<items>
   <item category="payment" type="chq" subtype="indiv">134.50</item>
   <item category="credit" type="interest" subtype="underpay">100.00</item>
   <item category="payment" type="eft" subtype="bank">1565.75</item>
   <!-- and so on... -->
</items>

Notice that this structure defines no rules of hierarchy. You might define such rules outside the data, through the vocabulary in the schema that enforces and validates the structure. A flat structure is often adequate for representing a collection of items, but it is a poor fit when those items are classified hierarchically.

A flat structure such as this is similar to the structure of a relational table, which might be a useful consideration if you work with data extracted or converted from an existing relational structure. As you can see, however, a less structured approach such as this one has some disadvantages. You encounter several of these disadvantages when you create an XSD for this data.

Using attributes in this way implicitly enforces a cardinality of 0..1, which might be suitable in some cases, but it is not difficult to imagine circumstances where, for instance, multiple subtypes are attributed to a single category and type. A good example of one of these circumstances is in the pet shop application described earlier. Imagine that pet shampoos can be indicated as appropriate for different types or subtypes of pets. In this case, it makes sense to allow a user to indicate more than one subtype, for instance coarse-haired and long-haired, which cannot be expressed using a simple flat XML structure such as the one described earlier.
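The cardinality limit can be seen with any XML parser. The sketch below uses Python's standard xml.etree.ElementTree module; the pet shop codes (dog, terrier, long, coarse) are hypothetical values chosen for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical pet shop record that tries to state two subtypes as attributes.
duplicated = '<item category="dog" type="terrier" subtype="long" subtype="coarse"/>'

try:
    ET.fromstring(duplicated)
    parsed = True
except ET.ParseError as error:
    # An attribute name may appear only once per element.
    parsed = False
    print("not well-formed:", error)

# Child elements can repeat, so a nested structure can carry both subtypes.
repeated = (
    '<item><category code="dog"><type code="terrier">'
    '<subtype code="long"/><subtype code="coarse"/>'
    "</type></category></item>"
)
subtypes = [s.get("code") for s in ET.fromstring(repeated).iter("subtype")]
print(subtypes)  # ['long', 'coarse']
```

The first document is rejected before any schema is consulted, because a repeated attribute is a well-formedness error rather than a validation error.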

A number of XML editors allow you to create a W3C XSD from a sample XML file. When you generate an XSD from the code sample, you can see the beginnings of a vocabulary emerging. This is promising, but if you look more closely at the schema extract in Listing 2, you notice that, although the vocabulary is represented accurately, the relationships among the layers of taxonomy are not reflected.

Listing 2. XSD extract for a flat XML structure
<xs:attribute name="category" use="required">
   <xs:simpleType>
      <xs:restriction base="xs:string">
         <xs:enumeration value="debt"/>
         <xs:enumeration value="credit"/>
         <xs:enumeration value="payment"/>
      </xs:restriction>
   </xs:simpleType>
</xs:attribute>
<xs:attribute name="type" use="required">
   <xs:simpleType>
      <xs:restriction base="xs:string">
         <xs:enumeration value="chq"/>
         <xs:enumeration value="eft"/>
         <xs:enumeration value="interest"/>
      </xs:restriction>
   </xs:simpleType>
</xs:attribute>

Notice that although the category and type enumerate the expected values, the schema has no way to differentiate between a type that is appropriate for debt and one that is appropriate for payment. Using an XSD file with an XML editor that supports a type-ahead style of data entry might be convenient, but it might result in inaccurate or unexpected data.
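To illustrate the gap, the following Python sketch mimics the per-attribute checks of Listing 2. It is not an XSD validator, and the function name is hypothetical; it only shows what two independent enumerations can and cannot reject.

```python
# Enumerations copied from Listing 2. Each attribute is checked on its own,
# which is all that independent xs:enumeration restrictions express.
CATEGORY_VALUES = {"debt", "credit", "payment"}
TYPE_VALUES = {"chq", "eft", "interest"}


def passes_flat_enumerations(category, type_):
    """Mimic the per-attribute checks of Listing 2 (not a real XSD validator)."""
    return category in CATEGORY_VALUES and type_ in TYPE_VALUES


# "chq" is a payment type in Table 1, yet it is accepted under "debt".
print(passes_flat_enumerations("debt", "chq"))  # True
# A value outside the enumeration is still rejected.
print(passes_flat_enumerations("loan", "chq"))  # False
```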

In addition, the ordering of the xs:attribute blocks in the XSD is completely arbitrary. Although there is a natural hierarchy of category/type/subtype in the described taxonomy, this hierarchy is not reflected in the XSD because it is not present in the flat XML structure.

A preferable approach, then, is to break the record down into its natural hierarchical structure, as in Listing 3. There is no need to stop using attributes, however: it is quite straightforward to identify each element with a code attribute. Unfortunately, as you will see, the XSD is still unable to reflect the relationships among the category, type, and subtype metadata. Expressing such dependencies between values is not something that XSDs can easily do.

Listing 3. Structured XML data
<sample xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
      xsi:noNamespaceSchemaLocation="sample_structured.xsd">
   <items>
      <item>
         <category code="payment">
            <type code="eft">
               <subtype code="bank"/>
            </type>
         </category>
         <value>100</value>
      </item>
      <item>
         <category code="debt">
            <type code="interest">
               <subtype code="overpay"/>
            </type>
         </category>
         <value>23</value>
      </item>
      <!-- and so on... -->
   </items>
</sample>

When you look in the XSD file referred to in the xsi:noNamespaceSchemaLocation attribute, you see that the structure of the XSD has changed somewhat, but the restrictions and enumerations that make up the vocabulary portion of the XSD remain unchanged. The reality of grammatical schemas such as XSD is that they cannot easily model the business rules that apply to data, and you can do little about this as long as you use XSDs (at least, not with the 1.0 specification of W3C XSD; these limitations are being addressed in the forthcoming 1.1 specification, which is currently incomplete). Fortunately, XSDs are not the only option available.
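Although the XSD cannot constrain it, the nested form does make the hierarchy easy to walk in code. The following Python sketch reads the two records from Listing 3 with the standard xml.etree.ElementTree module; the function name read_triples is hypothetical, and the schema location attributes are omitted for brevity.

```python
import xml.etree.ElementTree as ET

# Two records in the shape of Listing 3 (schema location attributes omitted).
SAMPLE = """
<sample>
   <items>
      <item>
         <category code="payment">
            <type code="eft">
               <subtype code="bank"/>
            </type>
         </category>
         <value>100</value>
      </item>
      <item>
         <category code="debt">
            <type code="interest">
               <subtype code="overpay"/>
            </type>
         </category>
         <value>23</value>
      </item>
   </items>
</sample>
"""


def read_triples(xml_text):
    """Yield (category, type, subtype, value) for each item element."""
    root = ET.fromstring(xml_text.strip())
    for item in root.iter("item"):
        category = item.find("category")
        type_ = category.find("type")
        for subtype in type_.findall("subtype"):
            yield (category.get("code"), type_.get("code"),
                   subtype.get("code"), item.findtext("value"))


for triple in read_triples(SAMPLE):
    print(triple)
```

Each subtype is reached only through its type and category, so the parent values are always at hand when a rule needs to decide which child values are allowed.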


Extending your data model using OASIS CAM

OASIS CAM is a template-based schema approach that works by extending XSDs using an XPath assertion method similar to that used in ISO Schematron. Because an OASIS CAM template uses XPath to locate specific nodes and patterns of nodes within the XML data, OASIS CAM can be much more expressive than XSD. Using a CAM template has the added advantage of separating business rules and domain specifics such as vocabulary and taxonomy into their own section within the template. CAM is supported by some open source tooling, provided to facilitate adoption of the OASIS recommendation, but I also find it useful to modify the templates directly in a text editor. I recommend using both the available tooling and a text-based approach. After you prepare a template, you can use it to create sample XML files based on the schema, as well as to validate existing XML data.


Generating a CAM template using CAMProcessor

You can use the CAMProcessor tool (which is an Eclipse-like implementation of the open source jCAM project) to generate an OASIS CAM template from an XSD file by a process known as ingestion. This option is available from CAMProcessor's File menu as New > New Template from XSD. After you generate a new template from the sample_structured.xsd file in the provided sample files, you see something like the sample_structured_generated.cam file. I want to draw your attention to the rules in the BusinessUseContext section of the template in Listing 4.

Listing 4. Generated CAM template for structured XML data (extract)
<as:BusinessUseContext>
   <as:Rules>
      <as:default>
         <as:context>
            <as:constraint 
               action="makeRepeatable(//items/item)" />
            <as:constraint 
               action="restrictValues(//item/category/@code,'debt'|'payment')" />
            <as:constraint 
               action="restrictValues(//category/type/@code,'chq'|'eft'|'interest')" />
            <as:constraint 
               action="restrictValues(//type/subtype/@code,'bank'|'indiv'|'overpay')" />
            <as:constraint 
               action="datatype(//item/value,byte)" />
         </as:context>
      </as:default>
   </as:Rules>
</as:BusinessUseContext>

The BusinessUseContext section of the CAM template contains all the business rules expressed in the template, and this is where you expect to find the vocabulary available to XML files based on or validated by the template. As the listing shows, the vocabulary now has its own section, and the XPath patterns that match the category, type, and subtype nodes reflect the hierarchy of the three-part taxonomy better. The rules still do not address the relationships among the metadata, however. This is where you need to take further advantage of the XPath used in the template.

Modifying your CAM template to express a three-part taxonomy

Listing 5 is an extract from sample_structured_modified.cam in the provided sample files, which I created manually by editing the generated CAM template using a text editor.

Listing 5. Modified CAM template for structured XML data (extract)
<as:BusinessUseContext>
   <as:Rules>
      <as:default>
         <as:context>
            <as:constraint action="makeRepeatable(//items/item)" />
            <as:constraint action="restrictValues(
               //item/category/@code,'payment'|'debt'|'credit')" />
            <as:constraint action="restrictValues(
               //item/category[@code='payment']/type/@code,'eft'|'chq')" />
            <as:constraint action="restrictValues(
               //item/category[@code='debt']/type/@code,'interest')" />
            <as:constraint action="restrictValues(
               //item/category[@code='credit']/type/@code,'interest')" />
            <as:constraint action="restrictValues(
               //item//type[@code='eft']/subtype/@code,'bank')" />
            <as:constraint action="restrictValues(
               //item//type[@code='chq']/subtype/@code,'bank'|'indiv'|'trustee')" />
            <as:constraint action="restrictValues(
               //item//type[@code='interest']/subtype/@code,'overpay'|'adjust')" />
            <as:constraint action="datatype(//item/value,byte)" />
         </as:context>
      </as:default>
   </as:Rules>
</as:BusinessUseContext>

Notice how the XPath used to locate each of the node constraints in the BusinessUseContext section has been modified to express the relationships shown in Table 1. This modification was fairly straightforward because XPath is able to match elements based on flexible pattern matching. For each positive match, a list of potential code attribute values is provided. This list is not drastically different from the lists of values in Listing 2, the XSD for the flat file representation of the XML data with which you began. With the extra expressivity of XPath, however, you are now able to fully model the relationships between the various code values in the three-part taxonomy.
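The effect of these predicate-based rules can be approximated outside of CAM. The following Python sketch is not the CAM engine and does not read a CAM template; it restates the rules of Listing 5 as path and value pairs and applies them with the standard xml.etree.ElementTree module. The names RULES, violations, and item are hypothetical, and the paths are adapted because ElementTree supports only a subset of XPath.

```python
import xml.etree.ElementTree as ET

# Rules adapted from Listing 5: (element path, allowed values of @code).
# ElementTree has no attribute step, so each path selects the element
# and the code attribute is read separately.
RULES = [
    (".//item/category", {"payment", "debt", "credit"}),
    (".//item/category[@code='payment']/type", {"eft", "chq"}),
    (".//item/category[@code='debt']/type", {"interest"}),
    (".//item/category[@code='credit']/type", {"interest"}),
    (".//type[@code='eft']/subtype", {"bank"}),
    (".//type[@code='chq']/subtype", {"bank", "indiv", "trustee"}),
    (".//type[@code='interest']/subtype", {"overpay", "adjust"}),
]


def violations(xml_text):
    """Return (path, code) pairs for every node whose code is not allowed."""
    root = ET.fromstring(xml_text)
    return [
        (path, node.get("code"))
        for path, allowed in RULES
        for node in root.findall(path)
        if node.get("code") not in allowed
    ]


def item(category, type_, subtype):
    """Build one hypothetical record in the shape of Listing 3."""
    return (f'<items><item><category code="{category}"><type code="{type_}">'
            f'<subtype code="{subtype}"/></type></category></item></items>')


print(violations(item("payment", "chq", "trustee")))  # []
print(violations(item("debt", "eft", "bank")))
# [(".//item/category[@code='debt']/type", 'eft')]
```

The second record uses only values that appear somewhere in the vocabulary, so independent enumerations would accept it; it is the predicate on the parent's code that exposes the mismatch.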


Conclusion

Grammar-based schemas such as Document Type Definition (DTD) and XSD will never be as expressive as schemas such as OASIS CAM and ISO Schematron, which leverage the additional expressivity of XPath. With a bit of modification of the XPath used to define an XML dataset with one of these schema approaches, you can leverage generalized business models such as the three-part taxonomy of category, type, and subtype, whether the data you are describing tracks pets in a pet shop or financial records in a general ledger application. Using a more expressive schema results in more accurate data and better interoperability. Looking forward, using contemporary schema approaches such as OASIS CAM encourages growth of both the specification itself and the tooling that supports it. I hope that I have demonstrated that you can also accomplish a lot with a simple text editor.


Download

Description         Name               Size
Example XML files   xml-examples.zip   4KB
