Subject classification with DITA and SKOS

Managing formal subjects

Use a DITA specialization to manage the subject matter of your document content -- that is, identify and process your content based on what each topic is about. With the approach outlined in this article, you can take advantage of the technologies of the Semantic Web for improved search, integration, and other processing. Instead of starting from scratch, however, you can build on standard topic-oriented strategies for authoring and processing content.

Erik Hennum (ehennum@us.ibm.com), Information Architect, IBM, Software Group

Erik Hennum works on the design and implementation of User Assistance for the IBM Systems Group. For DITA, he has helped shape the principles of domain specialization and is a member representative on the OASIS DITA Technical Committee.



Robert Anderson (robander@us.ibm.com), Developer, Information Development Workbench, IBM, Software Group

Robert D. Anderson is the chief architect for the DITA Open Toolkit, and is also a member of the OASIS DITA Technical Committee. He has worked on IBM's internal Information Development Workbench since 1999, supporting both XML (DITA) and SGML (IBMIDDoc).



Colin Bird (colinl_bird@uk.ibm.com), Information Architect, IBM, Software Group

Colin Bird is an Information Architect in the User Technologies Department at IBM Hursley. Prior to that, Colin worked at the IBM UK Scientific Centre, developing image manipulation and visualization applications. This work led to several information retrieval projects, particularly involving content-based image retrieval, culminating in a year's secondment to the Intelligence, Agents, Multimedia Group at Southampton University. Colin currently holds a Visiting Senior Research Fellowship there. His time at Southampton engendered a strong interest in the capture of metadata for subsequent use in both the principled retrieval and the adaptive delivery of information.



18 October 2005

Formalizing DITA topics as subjects

In a topic-oriented architecture such as DITA, content is authored in small, independent units that are assembled to provide help systems, books, courses, and other deliverables. Each unit of information answers a single question for a specific purpose. That is, each topic has specific, independent subject matter -- the very reason that these units of information are called topics. For instance, one topic might describe the format of a user definition file on a Web site while another topic explains the principles of Web site security and a third topic lays out the procedure for setting up Web site logins.

Because each topic has a specific meaning, DITA topics are tailor-made for semantic processing. However, current semantic processors can't read the text of a topic to find out what it means. What's missing is a formal declaration of the topic's subject matter that a semantic processor can understand -- like the address on an envelope that allows mail sorters to route the contents to the appropriate destination.

Simple Knowledge Organization System (SKOS) provides a standard for indicating the subject matter of content. SKOS lets you define the subjects for a particular subject matter area (organizing these subjects as a taxonomy if desired) and then classify each piece of content to indicate its subject. For instance, using SKOS, you could define configuration and security as subjects, and classify the three example topics that relate to those subjects so that users could browse the subjects to find the content regardless of whether the words "configuration" or "security" actually appear in the text.

SKOS is expressed with Resource Description Framework (RDF), the fundamental language of the Semantic Web. However, SKOS provides a higher-level language that's designed for readable content. SKOS has benefited from broad perspectives, including those of experts in OWL/RDF, TopicMaps, ontology, and library science. In the spectrum of standards, SKOS contributes by bridging the gap between traditional indexing and formal ontologies for the Semantic Web.

Thus, DITA has a natural fit with SKOS in solutions where DITA topics are classified with subjects that are expressed in SKOS for runtime processing.

Note: SKOS uses the term "concept" for these formal subjects. DITA, however, uses the term "concept" for a unit of content that is conceptual in nature. To avoid confusion, this article (and the accompanying specialization) uses the term "subject" for what SKOS calls a concept, and uses "concept" only in the DITA sense.

The typical approach for taxonomies is to maintain the formal subject definition separate from the content. However, by using a DITA specialization you can maintain the subject definition as authored content. Content creators realize several benefits by managing subject definitions as content:

  1. Formal subjects are often defined by glossary topics or other topics that already exist within the published information set. The TopicMaps community has long recognized such authoritative definitional resources under the name published subject indicators. For instance, the documentation for an application server product is likely to define important subjects within the subject area such as authentication, Web server, and so on.

    Even if you don't include the subject definitions in your published information, you can use your standard content tools for your subject definitions. For instance, you can author the subject definitions with your XML editor, and archive and version the subject definitions along with your content in your content management or version control system. You can also use existing formatting processes to produce catalogs of subject definitions for use by authors -- that is, you don't have to implement a separate authoring and processing system for subject definitions.

  2. Subject classification is as much a part of the information architecture of your content as the navigational organization. So rather than trying to bolt on semantic precision after the fact, information architects can encourage better content by providing a formal definition of the subjects to be covered and thus guide the creation of content.

    Where an existing information architecture is especially crisp, the information architect may find that some subject classifications merely formalize the existing organization of or relationships between topics.

  3. RDF is optimized for processing rather than authoring. In particular, RDF expresses information as a set of records rather than taking advantage of XML for more understandable tree and table structures. RDF enthusiasts sometimes suggest that sophisticated tools make it unnecessary to see the format of source files; however, HTML has demonstrated the value of an understandable file format even after such tools exist.

    In particular, a taxonomy of subjects has a natural tree structure that can have a straightforward representation as a specialized DITA map. Variant serializations of RDF models are legion, and a DITA map that's specialized for subject definition and classification could be regarded as an alternative serialization of a SKOS model.


Parts of a subject classification

As specified by SKOS and implemented in the DITA specialization, a subject classification has the following parts:

Table 1. Parts of a subject classification
Part | Identifies | DITA implementation
Subject definition | The meaning of a formal subject | A DITA topic that defines a default label for the subject and explains what the subject covers.
Subject scheme | The relationships between subjects | A DITA map that organizes the subjects into hierarchies. For instance, the Task subject might contain both the Installing subject and the Configuring subject. The map can also express associative relationships for subjects that are related in other ways.
Content classification | The subject matter of resources | Another DITA map that expresses relationships between topics that define formal subjects and content topics that treat some aspect of the subject. The same map can define the navigation relationships and related links for the content topics in standard DITA practice.

In addition, DITA provides the content topics that are classified and other information about those topics. For instance, a DITA map can organize the content topics to provide a navigation hierarchy or define related links for those topics.

Figure 1 illustrates the parts of the subject classification:

Figure 1. Parts of a subject classification
[Image: subject topics, subject relationships, content topics, classification relationships, and topic navigation relationships]

Figure 1 shows:

  • The formal subjects as teal circles
  • The content topics as blue circles
  • The subject relationships as yellow arrows
  • The classification relationships as blue arrows
  • The topic navigation relationships such as navigation hierarchies and related links as magenta arrows

From a publishing perspective, subject classification resembles both indexing and glossaries. This concept isn't new -- the recognition that glossaries and indexing make semantic assertions contributed to the development of TopicMaps. Whereas an index term can have multiple meanings, each subject has the same precision as a single meaning from a glossary. It is this precision that makes it possible to manage content based on its meaning.


Defining subjects

To define a subject, you create a DITA topic (typically a concept topic) to identify an aspect of the subject matter of your content (see Figure 2).

Figure 2. Defined subjects

The DITA topic specifies the subject with a specialized section element that includes the following kinds of information:

  • Default labels, including synonyms and denotative images
  • Notes on the definition and on the scope of coverage for the subject

Listing 1 shows an example of the definition for the Configuring subject:

Listing 1. Definition for the Configuring subject
<concept id="configuring">
    <title>Configuring</title>
    <shortdesc>You configure components to set up or refine your solution.</shortdesc>
    <conbody>
        <p>You don't have to get the best configuration the first time....</p>
        <subjectDetail>
            <subjectLabels>
                <altLabel>Setting up</altLabel>
            </subjectLabels>
            <scopeNote>Administrative tasks performed after installation...</scopeNote>
        </subjectDetail>
    </conbody>
</concept>

This specialized section can be used in any topic type that allows section elements. In particular, to formalize an existing glossary, concept, or reference topic that authoritatively defines a subject, you can add the specialized section. Because the meaning of a formal subject should never vary based on its use, these fields belong in the topic itself rather than in the maps that reference it.
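
For comparison, the subject in Listing 1 could correspond to a SKOS concept along the following lines. This is a minimal sketch rather than the output of the accompanying transform: the example.com URI is hypothetical, and mapping the topic title to skos:prefLabel and the short description to skos:definition is an assumption. Only the altLabel and scopeNote names are taken directly from the listing.

```xml
<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:skos="http://www.w3.org/2004/02/skos/core#">
    <!-- Hypothetical URI standing in for the published subject topic -->
    <skos:Concept rdf:about="http://example.com/subjects/configuring">
        <!-- Assumed mapping: topic title as the preferred label -->
        <skos:prefLabel xml:lang="en">Configuring</skos:prefLabel>
        <!-- From the altLabel element in Listing 1 -->
        <skos:altLabel xml:lang="en">Setting up</skos:altLabel>
        <!-- Assumed mapping: short description as the definition -->
        <skos:definition xml:lang="en">You configure components to set up or refine your solution.</skos:definition>
        <!-- From the scopeNote element in Listing 1 (elided in the original) -->
        <skos:scopeNote xml:lang="en">Administrative tasks performed after installation...</skos:scopeNote>
    </skos:Concept>
</rdf:RDF>
```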


Organizing subjects

As part of a scheme, specialized DITA map elements specify the relationships that define a thesaurus or taxonomy hierarchy (see Figure 3).

Figure 3. Hierarchical and associative relationships between subjects
[Image: subjects with hierarchical and associative relationships]

The hierarchy can include subject heads, whose meaning is equivalent to the union of subjects contained within the subject head. Specialized topic groups and relationship tables specify related-to relationships that cut across the subject hierarchy.

Listing 2 is an example of a subject scheme that identifies Installing and Configuring as Task subjects and Resource utilization and Security as Concerns:

Listing 2. Example of a subject scheme
<subjectScheme title="Sample taxonomy" id="taxonomyScheme">
    <subjectdef href="task.dita" navtitle="Task">
        <subjectdef href="installing.dita" navtitle="Installing"/>
        <subjectdef href="configuring.dita" navtitle="Configuring"/>
        ...
    </subjectdef>
    <subjectdef href="concern.dita" navtitle="Concern">
        <subjectdef href="utilization.dita" navtitle="Resource utilization"/>
        <subjectdef href="security.dita" navtitle="Security"/>
        ...
    </subjectdef>
</subjectScheme>

You can have multiple schemes for the same subjects. For instance, different audiences might be interested in a different subset of the taxonomy.

This approach of imposing alternative organizational structures on subjects fits well with the standard use of DITA maps for separating context from content, allowing different organizations to be imposed on the same content. That is, the scheme can be considered a special kind of context for subject definition topics.
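
As an illustration, a second scheme for an administrator audience might reuse two of the subject topics from Listing 2 under a smaller organization. The title, the id, and the choice of subjects are hypothetical; only the subjectScheme and subjectdef elements and the file names come from Listing 2.

```xml
<!-- Hypothetical second scheme: a subset of the subjects in Listing 2,
     selected for administrators -->
<subjectScheme title="Administrator taxonomy" id="adminScheme">
    <subjectdef href="task.dita" navtitle="Task">
        <subjectdef href="configuring.dita" navtitle="Configuring"/>
    </subjectdef>
    <subjectdef href="concern.dita" navtitle="Concern">
        <subjectdef href="security.dita" navtitle="Security"/>
    </subjectdef>
</subjectScheme>
```

Because both schemes point at the same subject topics, the subject definitions themselves stay unchanged.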

Schemes can use non-DITA subject definitions (such as publicly-defined SKOS, OWL, or TopicMaps subjects). You cite the public identifier of the subject with the subjectdef element and identify the subject definition format with the format attribute. This allows you to incorporate publicly-defined subjects into your schemes, or to integrate a formal ontology maintained by your organization with concepts that are specific to your content.
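
A sketch of such a reference, placed in the Concern branch of Listing 2, might look as follows. The URI is a placeholder, and the format value "skos" is an assumption made for illustration; the reference documentation for the specialization is the authority on the values it accepts.

```xml
<subjectdef href="concern.dita" navtitle="Concern">
    <subjectdef href="security.dita" navtitle="Security"/>
    <!-- Hypothetical public subject defined outside DITA; both the URI
         and the format value are illustrative assumptions -->
    <subjectdef href="http://example.org/subjects#privacy"
                format="skos" navtitle="Privacy"/>
</subjectdef>
```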

Finally, if you're not organizing subjects into a hierarchical taxonomy or otherwise expressing relationships between subjects, you don't have to create a scheme. You can still classify content topics without a scheme. For instance, you might take this approach to support a controlled index or a controlled tagsonomy in which topics are classified with tags, but each tag must be defined by creating a subject topic with a precise definition.


Classifying content

To classify content, another map specialization associates formal subjects with topics (see Figure 4).

Figure 4. Content topics classified by subjects
[Image: subjects classifying content topics]

Inside the topicref element that references the classified content (and that contains the references to any subordinate content), you nest a topicsubject element to specify the subjects of the content. You can identify a primary subject with the href attribute of the topicsubject element, which also contains subjectref elements for the secondary subjects. If no subject is primary, the topicsubject element should be a container without the href attribute.

Listing 3 is an example of a content classification:

Listing 3. Content classification
<topicref href="websecure.dita">
    <topicsubject>
        <subjectref href="webserver.dita"/>
        <subjectref href="security.dita"/>
    </topicsubject>
    <topicref href="https_protocol.dita"/>
    <!-- ... other subordinate content topics ... -->
</topicref>
<topicref href="loginsetup.dita" collection-type="sequence">
    <topicsubject>
        <subjectref href="configuring.dita"/>
        <subjectref href="webserver.dita"/>
        <subjectref href="security.dita"/>
    </topicsubject>
    <topicref href="editinguserdef.dita"/>
    <!-- ... other subordinate content topics ... -->
</topicref>
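
Listing 3 uses topicsubject only as a container. By contrast, a classification with a primary subject might look like the following sketch, which assumes for illustration that Security is the primary subject of the Web security topic. The elements and attributes are those described above.

```xml
<topicref href="websecure.dita">
    <!-- href names the primary subject (assumed here to be Security) -->
    <topicsubject href="security.dita">
        <!-- Secondary subject -->
        <subjectref href="webserver.dita"/>
    </topicsubject>
    <topicref href="https_protocol.dita"/>
</topicref>
```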

In the same way that subject schemes can cite public non-DITA subjects, you can classify DITA content with SKOS, OWL, or TopicMaps subjects by citing the public URI identifiers with the subjectref element and setting the format attribute. You can also classify non-DITA content such as HTML and PDF files by referring to the files with the topicref element and identifying the content format with the format attribute.
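
For instance, a PDF file might be classified as in the following sketch. The file name and navigation title are hypothetical; pdf is one of the standard values of the DITA format attribute.

```xml
<!-- Hypothetical PDF deliverable classified with the Installing subject -->
<topicref href="install_guide.pdf" format="pdf" navtitle="Installation guide">
    <topicsubject>
        <subjectref href="installing.dita"/>
    </topicsubject>
</topicref>
```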

Because subjects are defined by special topics, you can include the subject definition in the content and use it for classification. For instance, the subject topic for Security can both classify content about security and describe security within the Web site or help system content. Figure 5 illustrates this scenario:

Figure 5. A subject topic that is also a classified content topic
[Image: a subject topic that is also a classified and navigational topic]

The central circle represents a conceptual topic (such as Security), which:

  • Has a broader relationship to a subject (perhaps System Concerns) within a scheme
  • Is classified by two other subjects (perhaps the Background type and the Novice User role)
  • Contributes to the classification of one topic (such as Web Security)
  • Occupies the second position in a navigation sequence (perhaps under a Glossary heading)


Associating and processing the pieces

Because the classification map is distinct from the scheme map, you can apply multiple schemes to the same classification without requiring changes to the classification. To combine the scheme and classification maps for a deliverable, a higher-level map can refer to both maps using a DITA map reference (see Figure 6).

Figure 6. Assembling a deliverable from the scheme and content maps
[Image: a deliverable map, a scheme map, and a topic navigation and classification map]

You might process a single map to generate both an HTML representation of the content and a SKOS representation of the subjects and classification.
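
The higher-level map can be quite small. The following is a minimal sketch that uses standard DITA map references; the map title and both file names are hypothetical.

```xml
<map title="Web server help">
    <!-- Subject scheme, such as the one in Listing 2 (hypothetical file name) -->
    <topicref href="taxonomy.ditamap" format="ditamap"/>
    <!-- Navigation and classification map, such as the one in Listing 3
         (hypothetical file name) -->
    <topicref href="webcontent.ditamap" format="ditamap"/>
</map>
```

To apply a different scheme to the same classification, you would swap only the first reference.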


Generating SKOS for processing

After you define the formal subjects, organize them into hierarchical schemes, and classify your content, you can run a transform to convert the DITA subject classification into SKOS for processing by SKOS tools. To experiment, install the DITA Open Toolkit (see Resources) and download the demonstration that accompanies this article, x-dita10_thesaurus.zip. This demonstration updates the demo directory with a thesaurus subdirectory, which has a readme.html file with links to the reference documentation and instructions for building the sample content.
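
To give a sense of the target format, the following sketch shows how part of the scheme in Listing 2 and one classification from Listing 3 could be expressed with the SKOS vocabulary. It is not the output of the demonstration transform, which may differ. All example.com URIs are hypothetical, and the use of dcterms:subject to link content to a subject is an assumption.

```xml
<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:skos="http://www.w3.org/2004/02/skos/core#"
    xmlns:dcterms="http://purl.org/dc/terms/">

    <!-- The scheme from Listing 2 (hypothetical URI) -->
    <skos:ConceptScheme rdf:about="http://example.com/subjects/taxonomyScheme"/>

    <!-- A top-level subject in the scheme -->
    <skos:Concept rdf:about="http://example.com/subjects/task">
        <skos:prefLabel xml:lang="en">Task</skos:prefLabel>
        <skos:inScheme rdf:resource="http://example.com/subjects/taxonomyScheme"/>
    </skos:Concept>

    <!-- Nesting in the DITA map becomes a broader relationship -->
    <skos:Concept rdf:about="http://example.com/subjects/configuring">
        <skos:prefLabel xml:lang="en">Configuring</skos:prefLabel>
        <skos:inScheme rdf:resource="http://example.com/subjects/taxonomyScheme"/>
        <skos:broader rdf:resource="http://example.com/subjects/task"/>
    </skos:Concept>

    <!-- Classification of the login setup topic; the property choice
         is an assumption -->
    <rdf:Description rdf:about="http://example.com/help/loginsetup.html">
        <dcterms:subject rdf:resource="http://example.com/subjects/configuring"/>
    </rdf:Description>
</rdf:RDF>
```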

In a typical approach, SKOS tools display the thesaurus or taxonomy to the user for browsing. By picking one or more subjects, the user navigates to the relevant classified content. For instance, a user might pick the Security and Web Server subjects to find the example topics. You can see an example of one such interface (many are possible) at the Semantic Web Environmental Directory (SWED) site (see Resources). This framework uses SKOS for its subject classification. More sophisticated uses of the subject classification are also possible.


Summary

This article has only addressed the basics of classifying your content. The DITA specialization provides additional capabilities, including associative relationships between subjects and tabular classification. As noted above, however, you need separate runtime tools to operate on the classification. In addition, the art of defining subjects and organizing them into taxonomies has its own strategies and best practices.

Also, users of this DITA specialization should bear in mind that SKOS itself hasn't yet progressed to an approved standard. As a result, the SKOS-compatible DITA specialization could undergo some evolution.

However, the DITA specialization already demonstrates the value of maintaining a taxonomy and classification as part of your content. By taking this approach, content creators can use the technologies of the Semantic Web to process their content based on what it means.


Download

Description | Name | Size
Download for thesaurus specialization | x-dita10_thesaurus.zip | 259 KB

Resources

Learn

Get products and technologies

Discuss
