Taxonomy in a software application
Taxonomy is the science of classification, defining the methods used to relate different kinds of things hierarchically. For instance, a taxonomy divides the natural world into animal and plant kingdoms, and then into ever finer divisions down to species and subspecies. In the context of a software application or ontological vocabulary, taxonomy refers to the hierarchical schema that is used to differentiate classes of objects within an application, messaging layer, or dataset. In practice, I find that a taxonomy consisting of two levels, a supertype and a subtype, is too general to be particularly useful, and that a taxonomy with four or more levels becomes cumbersome and hard to maintain. In my experience, a taxonomy with three levels is suitable for a wide variety of tasks. I refer to these three levels of taxonomy as category, type, and subtype.
An example of this breakdown from an application that tracks pets in a pet shop might feature a category of dog or cat, a type of Jack Russell Terrier, and subtypes of long-haired, coarse-haired, and short-haired. A more compelling example of this three-part breakdown of data, however, is provided by an application drawn from the financial world, a simple general ledger application.
Vocabulary for a sample financial application
Consider a financial ledger application that tracks debts, credits, and payments. These records might be defined using a vocabulary such as the one in Table 1.
Table 1. Sample vocabulary data for a financial application
| Category | Type | Subtype |
|---|---|---|
| debt | interest | overpay |
| | | adjust |
| credit | interest | overpay |
| | | adjust |
| payment | eft | bank |
| | chq | bank |
| | | indiv |
| | | trustee |
This is a simple vocabulary that I use throughout this article for illustrative purposes. It is in no way an attempt to encompass the large problem-space of the finance domain. Using this vocabulary as an example, I want to suggest some approaches and technologies that you might find useful in defining vocabulary for a variety of business domains.
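To make the three levels concrete, the vocabulary in Table 1 can be sketched as a nested mapping. This is only an illustration in Python, not part of the original sample files; the names `TAXONOMY` and `is_valid` are assumptions introduced here.

```python
# Table 1 as a nested mapping: category -> type -> set of subtypes.
# TAXONOMY and is_valid are hypothetical names used for illustration only.
TAXONOMY = {
    "debt": {"interest": {"overpay", "adjust"}},
    "credit": {"interest": {"overpay", "adjust"}},
    "payment": {
        "eft": {"bank"},
        "chq": {"bank", "indiv", "trustee"},
    },
}


def is_valid(category, type_, subtype):
    """Return True only if the triple follows a path that exists in Table 1."""
    return subtype in TAXONOMY.get(category, {}).get(type_, set())


print(is_valid("payment", "chq", "trustee"))  # a path present in Table 1
print(is_valid("debt", "chq", "bank"))        # each value exists, but not on this path
```

The second call is the interesting one: every value belongs to the vocabulary, yet the combination is not a valid path through the hierarchy.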
Modeling taxonomy with a flat XML structure
The easiest way to model an item identified by taxonomy is as a single element with multiple attributes, as in Listing 1.
Listing 1. Flat XML structure
<items>
  <item category="payment" type="chq" subtype="indiv">134.50</item>
  <item category="credit" type="interest" subtype="underpay">100.00</item>
  <item category="payment" type="eft" subtype="bank">1565.75</item>
  <!-- and so on... -->
</items>
Notice that this structure defines no rules of hierarchy, although you might define these outside of the data structure by the use of vocabulary defined by the schema used to enforce and validate this structure. In many cases, a flat structure is satisfactory to represent a collection of items, but in the case where these items are determined hierarchically, this structure is less than satisfactory.
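Because the flat structure carries no hierarchy, any rule that ties a subtype to its category and type has to live outside the data. The following Python sketch assumes the rules are kept in a lookup table built from Table 1; `ALLOWED` and `invalid_items` are hypothetical names, and the XML repeats the three items from Listing 1.

```python
import xml.etree.ElementTree as ET

# The three items shown in Listing 1.
FLAT_XML = """
<items>
  <item category="payment" type="chq" subtype="indiv">134.50</item>
  <item category="credit" type="interest" subtype="underpay">100.00</item>
  <item category="payment" type="eft" subtype="bank">1565.75</item>
</items>
"""

# Hierarchy rules from Table 1, held outside the data (hypothetical structure).
ALLOWED = {
    ("debt", "interest"): {"overpay", "adjust"},
    ("credit", "interest"): {"overpay", "adjust"},
    ("payment", "eft"): {"bank"},
    ("payment", "chq"): {"bank", "indiv", "trustee"},
}


def invalid_items(xml_text):
    """Return (category, type, subtype) triples that do not appear in Table 1."""
    bad = []
    for item in ET.fromstring(xml_text).iter("item"):
        key = (item.get("category"), item.get("type"))
        subtype = item.get("subtype")
        if subtype not in ALLOWED.get(key, set()):
            bad.append((*key, subtype))
    return bad


print(invalid_items(FLAT_XML))
```

Run against the Listing 1 items, the check reports the credit item, because the subtype value underpay does not appear in Table 1. Nothing in the flat structure itself would have caught that.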
A flat structure such as this is similar to the structure of a relational table, which might be a useful consideration if you work with data extracted or converted from an existing relational structure. As you can see, however, a less structured approach such as this one has some disadvantages. You encounter several of these disadvantages when you create an XSD for this data.
Using attributes in this way implicitly enforces a cardinality of 0..1, which might be suitable in some cases, but it is not difficult to imagine circumstances where, for instance, multiple subtypes are attributed to a single category and type. A good example of one of these circumstances is in the pet shop application described earlier. Imagine that pet shampoos can be indicated as appropriate for different types or subtypes of pets. In this case, it makes sense to allow a user to indicate more than one subtype, for instance coarse-haired and long-haired, which cannot be expressed using a simple flat XML structure such as the one described earlier.
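The cardinality limit is easy to observe with any XML parser. The Python sketch below uses a hypothetical `shampoo` element, which is not part of the sample files, to contrast a repeated attribute with repeated child elements.

```python
import xml.etree.ElementTree as ET

# An attribute may appear at most once per element, so this is not well-formed
# and the parser rejects it.
try:
    ET.fromstring('<shampoo subtype="coarse-haired" subtype="long-haired"/>')
    duplicate_attribute_accepted = True
except ET.ParseError:
    duplicate_attribute_accepted = False

# Child elements can repeat, so the same information fits naturally.
shampoo = ET.fromstring(
    "<shampoo>"
    '<subtype code="coarse-haired"/>'
    '<subtype code="long-haired"/>'
    "</shampoo>"
)
subtypes = [s.get("code") for s in shampoo.findall("subtype")]

print(duplicate_attribute_accepted)
print(subtypes)
```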
A number of XML editors allow you to create a W3C XSD from a sample XML file. When you generate an XSD from the code sample, you can see the beginnings of a vocabulary emerging. This is promising, but if you look more closely at the schema extract in Listing 2, you notice that, although the vocabulary is represented accurately, the relationships among the layers of taxonomy are not reflected.
Listing 2. XSD extract for a flat XML structure
<xs:attribute name="category" use="required">
<xs:simpleType>
<xs:restriction base="xs:string">
<xs:enumeration value="debt"/>
<xs:enumeration value="credit"/>
<xs:enumeration value="payment"/>
</xs:restriction>
</xs:simpleType>
</xs:attribute>
<xs:attribute name="type" use="required">
<xs:simpleType>
<xs:restriction base="xs:string">
<xs:enumeration value="chq"/>
<xs:enumeration value="eft"/>
<xs:enumeration value="interest"/>
</xs:restriction>
</xs:simpleType>
</xs:attribute>
Notice that although the category and type enumerate the expected values, the schema has no way to differentiate between a type that is appropriate for debt and one that is appropriate for payment. Using an XSD file with an XML editor that supports a type-ahead style of data entry might be convenient, but it might result in inaccurate or unexpected data.
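The gap can be shown in a few lines of Python. This sketch copies the two enumerations from the Listing 2 extract and compares them with the category-to-type pairs from Table 1; the function names are assumptions made for illustration.

```python
# Enumerations copied from the Listing 2 extract; each attribute is checked alone.
CATEGORY_VALUES = {"debt", "credit", "payment"}
TYPE_VALUES = {"chq", "eft", "interest"}

# Category-to-type pairs taken from Table 1.
TYPES_BY_CATEGORY = {
    "debt": {"interest"},
    "credit": {"interest"},
    "payment": {"eft", "chq"},
}


def passes_enumerations(category, type_):
    """Mimic independent enumeration checks: each value is tested separately."""
    return category in CATEGORY_VALUES and type_ in TYPE_VALUES


def passes_taxonomy(category, type_):
    """Check that the type is one that Table 1 allows for this category."""
    return type_ in TYPES_BY_CATEGORY.get(category, set())


record = ("debt", "chq")
print(passes_enumerations(*record))  # both values are in their enumerations
print(passes_taxonomy(*record))      # but chq is not a type of debt
```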
In addition, the ordering of the xs:attribute blocks in the XSD is completely arbitrary. Although there is a natural hierarchy of category/type/subtype in the described taxonomy, this hierarchy is not reflected in the XSD because it is not present in the flat XML structure.
A preferable approach, then, is to break down the record into its natural hierarchical structure, as in Listing 3. There is no need, however, to stop using attributes: it is quite straightforward to identify each element with a code attribute. Unfortunately, as you will see, the XSD is still unable to reflect the relationships between the different category, type, and subtype metadata. XSDs cannot easily express this kind of dependency.
Listing 3. Structured XML data
<sample xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:noNamespaceSchemaLocation="sample_structured.xsd">
<items>
<item>
<category code="payment">
<type code="eft">
<subtype code="bank"/>
</type>
</category>
<value>100</value>
</item>
<item>
<category code="debt">
<type code="interest">
<subtype code="overpay"/>
</type>
</category>
<value>23</value>
</item>
<!-- and so on... -->
</items>
</sample>
When you look in the XSD file referred to by the xsi:noNamespaceSchemaLocation attribute, you see that the structure of the XSD has changed somewhat, but the restrictions and enumerations that make up the vocabulary portion of the XSD remain unchanged. The reality is that grammar-based schemas such as XSD cannot easily model the business rules that apply to data, and you can do little about this as long as you use XSDs, at least with the 1.0 specification of W3C XSD. (These limitations are being addressed in the forthcoming 1.1 specification, which is incomplete at the time of writing.) Fortunately, XSDs are not the only option available.
Extending your data model using OASIS CAM
OASIS CAM is a template-based schema approach that works by extending XSDs using an XPath assertion method similar to that used in ISO Schematron. Because an OASIS CAM template uses XPath to locate specific nodes and patterns of nodes within the XML data, OASIS CAM can be much more expressive than XSD. Using a CAM template has the added advantage of separating business rules and domain specifics such as vocabulary and taxonomy into their own section within the template. CAM is supported by some open source tooling, provided to facilitate adoption of the OASIS recommendation, but I also find it useful to modify the templates directly in a text editor. I recommend using both the available tooling and a text-based approach. After you prepare a template, you can use it to create sample XML files based on the schema, as well as to validate existing XML data.
Generating a CAM template using CAMProcessor
You can use the CAMProcessor tool (which is an Eclipse-like implementation of the open source jCAM project) to generate an OASIS CAM template from an XSD file by a process known as ingestion. This option is available from CAMProcessor's File menu as New > New Template from XSD. After you generate a new template from the sample_structured.xsd file in the provided sample files, you see something like the sample_structured_generated.cam file. I want to draw your attention to the rules in the BusinessUseContext section of the template in Listing 4.
Listing 4. Generated CAM template for structured XML data (extract)
<as:BusinessUseContext>
<as:Rules>
<as:default>
<as:context>
<as:constraint
action="makeRepeatable(//items/item)" />
<as:constraint
action="restrictValues(//item/category/@code,'debt'|'payment')" />
<as:constraint
action="restrictValues(//category/type/@code,'chq'|'eft'|'interest')" />
<as:constraint
action="restrictValues(//type/subtype/@code,'bank'|'indiv'|'overpay')" />
<as:constraint
action="datatype(//item/value,byte)" />
</as:context>
</as:default>
</as:Rules>
</as:BusinessUseContext>
The BusinessUseContext section in the CAM template contains all the business rules expressed in the template, and this is where you expect to find the vocabulary available for use in XML files based on or validated by the template. As you can see from the listing, the vocabulary now has its own section, and the XPath patterns used to match the category, type, and subtype nodes reflect the hierarchy of the three-part taxonomy better. The rules still do not address the relationships between the metadata, however: nothing ties an allowed type to a particular category, or an allowed subtype to a particular type. This is where you need to leverage the XPath used in the template further.
Modifying your CAM template to express a three-part taxonomy
Listing 5 is an extract from sample_structured_modified.cam in the provided sample files, which I created manually by editing the generated CAM template using a text editor.
Listing 5. Modified CAM template for structured XML data (extract)
<as:BusinessUseContext>
<as:Rules>
<as:default>
<as:context>
<as:constraint action="makeRepeatable(//items/item)" />
<as:constraint action="restrictValues(
//item/category/@code,'payment'|'debt'|'credit')" />
<as:constraint action="restrictValues(
//item/category[@code='payment']/type/@code,'eft'|'chq')" />
<as:constraint action="restrictValues(
//item/category[@code='debt']/type/@code,'interest')" />
<as:constraint action="restrictValues(
//item/category[@code='credit']/type/@code,'interest')" />
<as:constraint action="restrictValues(
//item//type[@code='eft']/subtype/@code,'bank')" />
<as:constraint action="restrictValues(
//item//type[@code='chq']/subtype/@code,'bank'|'indiv'|'trustee')" />
<as:constraint action="restrictValues(
//item//type[@code='interest']/subtype/@code,'overpay'|'adjust')" />
<as:constraint action="datatype(//item/value,byte)" />
</as:context>
</as:default>
</as:Rules>
</as:BusinessUseContext>
Notice how the XPath used to locate each of the node constraints in the BusinessUseContext section has been modified to express the relationships shown in Table 1. This modification was fairly straightforward because XPath is able to match elements based on flexible pattern matching. For each positive match, a list of potential code attribute values is provided. This list is not drastically different from the lists of values in Listing 2, the XSD for the flat file representation of the XML data with which you began. With the extra expressivity of XPath, however, you are now able to fully model the relationships between the various code values in the three-part taxonomy.
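The idea behind the conditional rules in Listing 5 can be approximated outside of CAM. The following Python sketch is not the jCAM implementation; it only mimics the restrictValues constraints with the limited XPath subset supported by the standard library's ElementTree, and the names `RULES` and `violations` are assumptions. The datatype constraint on value is not covered.

```python
import xml.etree.ElementTree as ET

# Each rule pairs a path to the constrained elements with the allowed code values,
# mirroring the restrictValues constraints in Listing 5.
RULES = [
    (".//item/category", {"payment", "debt", "credit"}),
    (".//item/category[@code='payment']/type", {"eft", "chq"}),
    (".//item/category[@code='debt']/type", {"interest"}),
    (".//item/category[@code='credit']/type", {"interest"}),
    (".//type[@code='eft']/subtype", {"bank"}),
    (".//type[@code='chq']/subtype", {"bank", "indiv", "trustee"}),
    (".//type[@code='interest']/subtype", {"overpay", "adjust"}),
]


def violations(xml_text):
    """Return (path, code) pairs for every element whose code breaks a rule."""
    root = ET.fromstring(xml_text)
    found = []
    for path, allowed in RULES:
        for element in root.findall(path):
            code = element.get("code")
            if code not in allowed:
                found.append((path, code))
    return found


GOOD = """<items><item>
  <category code="payment"><type code="eft"><subtype code="bank"/></type></category>
  <value>100</value>
</item></items>"""

# Every code below is in the vocabulary, but chq is not a type of debt.
BAD = """<items><item>
  <category code="debt"><type code="chq"><subtype code="bank"/></type></category>
  <value>23</value>
</item></items>"""

print(violations(GOOD))
print(violations(BAD))
```

The second document passes any check that looks at each code value in isolation. Only the rule whose path is conditioned on the debt category reports the chq type.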
Grammar-based schemas such as Document Type Definition (DTD) and XSD are not as expressive as schemas such as OASIS CAM and ISO Schematron, which leverage the additional expressivity of XPath. With a bit of modification of the XPath used to define an XML dataset with one of these schema approaches, you can leverage generalized business models such as the three-part taxonomy of category, type, and subtype, whether the data you are describing tracks pets in a pet shop or financial records in a general ledger application. Using a more expressive schema results in more accurate data and better interoperability. Looking forward, using contemporary schema approaches such as OASIS CAM encourages growth of both the specification itself and the tooling that supports it. I hope that I have demonstrated that you can also accomplish a lot with a simple text editor.
| Description | Name | Size | Download method |
|---|---|---|---|
| Example XML Files | xml-examples.zip | 4KB | HTTP |
Learn
- Meet CAM: A new XML validation technology (Brian M. Carey, developerWorks, Sept 2009): Take semantic and structural validation to the next level in a useful introduction to OASIS CAM and the ways in which this approach differs from other schemas.
- XML Validation Framework using OASIS CAM (CAMV) (Puneet Kathuria, David Webber, and Martin Roberts, developerWorks, May 2010): Write your XML data validation rules based on a declarative programming approach in a practical example of using CAM templates to support supply chain messaging in the automotive industry.
- Taking XML Validation to the Next Level (Michael Sorens, DevX, May 2009): Dig into a great series that introduces many of the features of OASIS CAM.
Get products and technologies
- jCAM project: Download this open source project, which contains the CAMProcessor application.




