Level: Intermediate
Erik Hennum (ehennum@us.ibm.com), Information Architect, IBM
Robert Anderson (robander@us.ibm.com), Developer, Information Development Workbench, IBM
Colin Bird (colinl_bird@uk.ibm.com), Information Architect, IBM
18 Oct 2005
Use a DITA specialization to manage the subject matter of your document content -- that is, identify and process your content based on what each topic is about. With the approach outlined in this article, you can take advantage of the technologies of the Semantic Web for improved search, integration, and other processing. Instead of starting from scratch, however, you can build on standard topic-oriented strategies for authoring and processing content.
Formalizing DITA topics as subjects
In a topic-oriented architecture such
as DITA, content is authored in small, independent units that are assembled
to provide help systems, books, courses, and other deliverables. Each unit
of information answers a single question for a specific purpose. That is,
each topic has specific, independent subject matter -- the very reason that
these units of information are called topics. For instance, one topic
might describe the format of a user definition file on a Web site while another
topic explains the principles of Web site security and a third topic lays out
the procedure for setting up Web site logins.
Because each topic has
a specific meaning, DITA topics are tailor-made for semantic processing. However, current
semantic processors can't read the text of a topic to find out what
it means. What's missing is a formal declaration of the topic's subject matter
that a semantic processor can understand -- like the address on an envelope that allows mail sorters to route the
contents to the appropriate destination.
Simple Knowledge Organization System (SKOS) provides a standard for indicating
the subject matter of content. SKOS lets you define the subjects for a particular subject
matter area (organizing these subjects as a taxonomy if desired) and then
classify each piece of content to indicate its subject. For instance, using
SKOS, you could define configuration and security as subjects, and classify
the three example topics that relate to those subjects so that users could browse the
subjects to find the content regardless of whether the words "configuration"
or "security" actually appear in the text.
SKOS is expressed with Resource Description Framework (RDF), the fundamental language of the Semantic
Web. However, SKOS provides a higher-level language that's designed for readable
content. SKOS has benefited from broad perspectives, including those of experts in
OWL/RDF, TopicMaps, ontology, and library science. In the spectrum of standards,
SKOS contributes by bridging the gap between traditional indexing and formal
ontologies for the Semantic Web.
Thus, DITA has a natural fit with SKOS
in solutions where DITA topics are classified with subjects that are expressed
in SKOS for runtime processing.
Note: SKOS uses the concept label
for these formal subjects. DITA, however, uses the concept label for
a unit of content that is conceptual in nature. To avoid confusion, this article
(and the accompanying specialization) uses the subject label for the
same thing as the SKOS concept, and uses the concept label in the DITA
sense.
The typical approach for taxonomies is to maintain the formal
subject definition separate from the content. However, by using a DITA specialization
you can maintain the subject definition as authored content. Content
creators realize several benefits by managing subject definitions as content:
- Formal subjects are often defined by glossary topics or other topics
that already exist within the published information set. The TopicMaps
community
has long recognized such authoritative definitional resources under the name
published subject indicators. For instance, the documentation for
an application server product is likely to define important subjects within
the subject area such as authentication, Web server, and so
on.
- Even if you don't include the subject definitions in your published
information, you can use your standard content tools for your subject definitions.
For instance, you can author the subject definitions with your XML editor,
and archive and version the subject definitions along with your content in
your content management or version control system. You can also use existing
formatting processes to produce catalogs of subject definitions for use by
authors -- that is, you don't have to implement a separate authoring and processing
system for subject definitions.
- Subject classification is as much a part of the information architecture
of your content as the navigational organization. So rather than
trying to bolt on semantic precision after the fact,
information architects can encourage better content by providing
a formal definition of the subjects to be covered
and thus guide the creation of content.
Where an existing information architecture
is especially crisp, the information architect may find that some subject
classifications merely formalize the existing organization of or
relationships between topics.
- RDF is optimized for processing rather than authoring. In particular,
RDF expresses information as a set of records rather than taking advantage
of XML for more understandable tree and table structures. RDF enthusiasts sometimes suggest that sophisticated tools make it unnecessary to see the format of source files; however, HTML has demonstrated the value of an understandable file format even after such tools exist.
In particular, a taxonomy
of subjects has a natural tree structure that can have a straightforward representation
as a specialized DITA map. Variant serializations of RDF models are legion,
and a DITA map that's specialized for subject definition and classification could
be regarded as an alternative serialization of a SKOS model.
Parts of a subject classification
As specified by SKOS and implemented in the DITA
specialization, a subject classification has the following parts:
Table 1. Parts of a subject classification
| Part | Identifies | DITA implementation |
|---|---|---|
| Subject definition | The meaning of a formal subject | A DITA topic that defines a default label for the subject and explains what the subject covers. |
| Subject scheme | The relationships between subjects | A DITA map that organizes the subjects into hierarchies. For instance, the Task subject might contain both the Installing subject and the Configuring subject. The map can also express associative relationships for subjects that are related in other ways. |
| Content classification | The subject matter of resources | Another DITA map that expresses relationships between topics that define formal subjects and content topics that treat some aspect of the subject. The same map can define the navigation relationships and related links for the content topics in standard DITA practice. |
In addition, DITA provides the content topics that are classified and other information about those topics. For instance, a DITA map can organize the content topics to provide a navigation hierarchy or define related links for those topics.
Figure 1 illustrates the parts of the subject classification:
Figure 1. Parts of a subject classification
Figure 1 shows:
- The formal subjects as teal circles
- The content topics as blue circles
- The subject relationships as yellow arrows
- The classification relationships as blue arrows
- The topic navigation relationships such as navigation hierarchies and
related links as magenta arrows
From a publishing perspective, subject classification resembles both
indexing and glossaries. This concept isn't new -- the recognition that glossaries
and indexing make semantic assertions contributed to the development of TopicMaps.
Where an index term can have multiple meanings, each subject has the same
precision as a single meaning from a glossary. It is this precision that makes
it possible to manage content based on its meaning.
Defining subjects
To define a subject,
you create a DITA topic (typically a concept topic) to identify an aspect
of the subject matter of your content (see Figure 2).
Figure 2. Defined subjects
The DITA topic specifies the subject with a specialized section element that includes the
following kinds of information:
- Default labels, including synonyms and denotative images
- Notes on the definition and on the scope of coverage for the subject
Listing 1 shows an example of the definition for the Configuring subject:
Listing 1. Definition for the Configuring subject
<concept id="configuring">
  <title>Configuring</title>
  <shortdesc>You configure components to set up or refine your solution.</shortdesc>
  <conbody>
    <p>You don't have to get the best configuration the first time....</p>
    <subjectDetail>
      <subjectLabels>
        <altLabel>Setting up</altLabel>
      </subjectLabels>
      <scopeNote>Administrative tasks performed after installation...</scopeNote>
    </subjectDetail>
  </conbody>
</concept>
This specialized section can
be used in any topic type that allows section elements. In particular, to formalize an
existing glossary, concept, or reference topic that authoritatively defines
a subject, you can add the specialized section. Because the
meaning of a formal subject should never vary based on its use, these fields
should be part of the topic.
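As a rough illustration of how a processor might read such a definition, the following Python sketch pulls the labels and scope note out of the Configuring topic from Listing 1. It is not part of the specialization or its toolkit. The function name `read_subject` is hypothetical, and treating the topic title as the default label is an assumption.

```python
import xml.etree.ElementTree as ET

# The Configuring topic from Listing 1 (introductory paragraph omitted).
TOPIC = """
<concept id="configuring">
  <title>Configuring</title>
  <shortdesc>You configure components to set up or refine your solution.</shortdesc>
  <conbody>
    <subjectDetail>
      <subjectLabels>
        <altLabel>Setting up</altLabel>
      </subjectLabels>
      <scopeNote>Administrative tasks performed after installation...</scopeNote>
    </subjectDetail>
  </conbody>
</concept>
"""

def read_subject(xml_text):
    """Collect the labels and notes that define one subject (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    detail = root.find(".//subjectDetail")
    return {
        "id": root.get("id"),
        # Assumption: the topic title serves as the default label.
        "label": root.findtext("title"),
        "alt_labels": [e.text for e in detail.iter("altLabel")] if detail is not None else [],
        "scope_note": detail.findtext("scopeNote") if detail is not None else None,
    }

subject = read_subject(TOPIC)
print(subject["label"], subject["alt_labels"])
```

Because the labels live in the topic itself, any scheme or classification that refers to the topic sees the same definition.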
Organizing subjects
As part of a scheme, specialized
DITA map elements specify the relationships that define a thesaurus or taxonomy
hierarchy (see Figure 3).
Figure 3. Hierarchical and associative relationships between subjects
The hierarchy can include subject heads,
whose meaning is equivalent to the union of subjects contained within the
subject head. Specialized topic groups and relationship tables specify related-to
relationships that cut across the subject hierarchy.
Listing 2 is an example
of a subject scheme that identifies Installing and Configuring as Task subjects
and Resource utilization and Security as Concerns:
Listing 2. Example of a subject scheme
<subjectScheme title="Sampletaxonomy" id="taxonomyScheme">
  <subjectdef href="task.dita" navtitle="Task">
    <subjectdef href="installing.dita" navtitle="Installing"/>
    <subjectdef href="configuring.dita" navtitle="Configuring"/>
    ...
  <subjectdef href="concern.dita" navtitle="Concern">
    <subjectdef href="utilization.dita" navtitle="Resource utilization"/>
    <subjectdef href="security.dita" navtitle="Security"/>
    ...
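To make the hierarchy concrete, here is a minimal Python sketch that walks a scheme like the one in Listing 2 and reports which subject is broader than which. The closing tags are supplied and the elided entries are left out, so the sample is an assumption about the overall shape of the map rather than a copy of the real file. The helper name `broader_pairs` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Listing 2 with the elided parts left out and closing tags supplied (assumed).
SCHEME = """
<subjectScheme title="Sampletaxonomy" id="taxonomyScheme">
  <subjectdef href="task.dita" navtitle="Task">
    <subjectdef href="installing.dita" navtitle="Installing"/>
    <subjectdef href="configuring.dita" navtitle="Configuring"/>
  </subjectdef>
  <subjectdef href="concern.dita" navtitle="Concern">
    <subjectdef href="utilization.dita" navtitle="Resource utilization"/>
    <subjectdef href="security.dita" navtitle="Security"/>
  </subjectdef>
</subjectScheme>
"""

def broader_pairs(element, parent=None):
    """Yield (narrower, broader) href pairs from nested subjectdef elements."""
    for child in element.findall("subjectdef"):
        if parent is not None:
            yield child.get("href"), parent
        # Recurse so that deeper levels of the taxonomy are covered too.
        yield from broader_pairs(child, child.get("href"))

pairs = list(broader_pairs(ET.fromstring(SCHEME)))
for narrower, broader in pairs:
    print(f"{narrower} has broader subject {broader}")
```

Nesting in the map is the only input the sketch needs, which is the sense in which a tree-structured map is a natural representation of a taxonomy.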
You can have multiple schemes
for the same subjects. For instance, different audiences might be interested
in a different subset of the taxonomy.
This approach of imposing alternative
organizational structures on subjects fits well with the standard use of DITA
maps for separating context from content, allowing different organizations
to be imposed on the same content. That is, the scheme can be considered a
special kind of context for subject definition topics.
Schemes can use non-DITA subject definitions (such as publicly-defined SKOS, OWL, or TopicMaps
subjects). You cite the public identifier of the subject with the subjectdef
element and identify the subject definition format with the format attribute.
This allows you to incorporate publicly-defined subjects into your schemes,
or to integrate a formal ontology maintained by your organization with subjects
that are specific to your content.
Finally, if you're not organizing
subjects into a hierarchical taxonomy or otherwise expressing relationships
between subjects, you don't have to create a scheme.
You can still classify content topics without a scheme. For instance, you
might take this approach to support a controlled index or a
controlled tagsonomy in which
topics are classified with tags, but each tag must be defined by creating a
subject topic with a precise definition.
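The rule that every tag must be backed by a subject topic is easy to check mechanically. The sketch below is one possible check, not something provided by the specialization. The data is hypothetical (in particular `firewall.dita` is an invented, undefined tag), and the function name `undefined_tags` is illustrative.

```python
# Hypothetical data: the subject topics that exist, and the tags on two topics.
defined_subjects = {"configuring.dita", "security.dita", "webserver.dita"}
topic_tags = {
    "loginsetup.dita": ["configuring.dita", "security.dita"],
    "websecure.dita": ["security.dita", "firewall.dita"],
}

def undefined_tags(topic_tags, defined_subjects):
    """Return, per topic, the tags that have no defining subject topic."""
    problems = {}
    for topic, tags in topic_tags.items():
        missing = [tag for tag in tags if tag not in defined_subjects]
        if missing:
            problems[topic] = missing
    return problems

problems = undefined_tags(topic_tags, defined_subjects)
print(problems)
```

A check of this kind could run as part of a build so that an undefined tag is reported before the content is published.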
Classifying content
To classify content, another map specialization associates formal subjects with topics (see Figure 4).
Figure 4. Content topics classified by subjects
Inside the topicref element that references the classified content (and that can contain
references to further content), you nest a topicsubject element to specify the subjects of the content.
You can identify a primary subject with the href attribute of the topicsubject
element, which also contains subjectref elements for the secondary subjects.
If no subject is primary, the topicsubject element should be a container without
the href attribute.
Listing 3 is an example of a content classification:
Listing 3. Content classification
<topicref href="websecure.dita">
  <topicsubject>
    <subjectref href="webserver.dita"/>
    <subjectref href="security.dita"/>
  </topicsubject>
  <topicref href="https_protocol.dita"/>
  ... other subordinate content topics ...
</topicref>
<topicref href="loginsetup.dita" collection-type="sequence">
  <topicsubject>
    <subjectref href="configuring.dita"/>
    <subjectref href="webserver.dita"/>
    <subjectref href="security.dita"/>
  </topicsubject>
  <topicref href="editinguserdef.dita"/>
  ... other subordinate content topics ...
</topicref>
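One way to see what the classification in Listing 3 expresses is to invert it into an index from subject to topics. The Python sketch below does that under two assumptions: the fragment is wrapped in a `map` root so that it parses, and a classification is taken to apply to the topicref that holds it and to the topicrefs nested inside it. The function name `index_by_subject` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Listing 3 wrapped in a map root (assumed), with the elided topics omitted.
CLASSIFICATION = """
<map>
  <topicref href="websecure.dita">
    <topicsubject>
      <subjectref href="webserver.dita"/>
      <subjectref href="security.dita"/>
    </topicsubject>
    <topicref href="https_protocol.dita"/>
  </topicref>
  <topicref href="loginsetup.dita" collection-type="sequence">
    <topicsubject>
      <subjectref href="configuring.dita"/>
      <subjectref href="webserver.dita"/>
      <subjectref href="security.dita"/>
    </topicsubject>
    <topicref href="editinguserdef.dita"/>
  </topicref>
</map>
"""

def index_by_subject(map_xml):
    """Map each subject href to the topics it classifies."""
    index = {}
    for topicref in ET.fromstring(map_xml).iter("topicref"):
        subjects = topicref.find("topicsubject")
        if subjects is None:
            continue
        # A primary subject, if present, is the href on topicsubject itself.
        hrefs = [subjects.get("href")]
        hrefs += [ref.get("href") for ref in subjects.findall("subjectref")]
        for href in filter(None, hrefs):
            topics = index.setdefault(href, [])
            # Assumption: nested topicrefs inherit the classification.
            for covered in topicref.iter("topicref"):
                if covered.get("href") not in topics:
                    topics.append(covered.get("href"))
    return index

index = index_by_subject(CLASSIFICATION)
print(index["configuring.dita"])
```

If nested topics should not inherit the classification, replacing the inner loop with a single append of `topicref.get("href")` gives the narrower reading.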
In the same way that subject
schemes can cite public non-DITA subjects, you can classify DITA content with
SKOS, OWL, or TopicMaps subjects by citing the public URI identifiers with
the subjectref element and setting the format attribute. You can also classify
non-DITA content such as HTML and PDF files by referring to the files with
the topicref element, but identifying the content format with the format attribute.
Because subjects are defined by special topics, you can include the subject definition
in the content and use it for classification. For instance, the subject
topic for Security can both classify content about security and describe security within the Web site or help system content. Figure 5
illustrates this scenario:
Figure 5. A subject topic that is also a classified content topic
The central circle represents a conceptual topic (such as Security), which:
- Has a broader relationship to a subject (perhaps System Concerns) within
a scheme
- Is classified by two other subjects (perhaps the Background type and the
Novice User role)
- Contributes to the classification of one topic (such as Web Security)
- Occupies the second position in a navigation sequence (perhaps under a
Glossary heading)
Associating and processing the pieces
Because the classification map is distinct from the
scheme map, you can apply multiple schemes to the same classification without
requiring changes to the classification. To combine the scheme and classification
maps for a deliverable, a higher-level map can refer to both maps using a
DITA map reference (see Figure 6).
Figure 6. Assembling a deliverable from the scheme and content maps
You might process a single map to generate both
an HTML representation of the content and a SKOS representation of the subjects
and classification.
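A higher-level map of this kind can be very small. The sketch below embeds a hypothetical deliverable map that pulls in a scheme map and a classification map through topicref elements with `format="ditamap"`, then lists the maps it references. The file names and the map title are invented for illustration, as is the helper name `referenced_maps`.

```python
import xml.etree.ElementTree as ET

# Hypothetical deliverable map; the title and file names are illustrative only.
DELIVERABLE = """
<map title="Web site help">
  <topicref href="taxonomy_scheme.ditamap" format="ditamap"/>
  <topicref href="content_classification.ditamap" format="ditamap"/>
</map>
"""

def referenced_maps(map_xml):
    """Return the hrefs of the maps that a higher-level map pulls in."""
    root = ET.fromstring(map_xml)
    return [ref.get("href") for ref in root.iter("topicref")
            if ref.get("format") == "ditamap"]

maps = referenced_maps(DELIVERABLE)
print(maps)
```

Applying a different scheme to the same classification then amounts to changing one href in the higher-level map.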
Generating SKOS for processing
After you define the formal subjects, organize
them into hierarchical schemes, and classify your content, you can run a transform
to convert the DITA subject classification into SKOS for processing by SKOS
tools. To experiment, install the DITA Open Toolkit (see Resources) and download the demonstration
that accompanies this article, x-dita10_thesaurus.zip.
This demonstration updates the demo directory
with a thesaurus subdirectory, which has a readme.html file with links to the reference documentation and instructions for building the sample content.
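The transform in the download is not reproduced in this article. Purely to suggest the shape of the output, the following Python sketch turns a two-subject scheme into SKOS statements in Turtle syntax, using the SKOS Core properties `skos:prefLabel` and `skos:broader`. The base URI is hypothetical, the labels are taken from the navtitle attribute as a simplification, and label escaping is ignored.

```python
import xml.etree.ElementTree as ET

SKOS = "http://www.w3.org/2004/02/skos/core#"
BASE = "http://example.org/subjects/"  # hypothetical base URI

# A cut-down scheme in the style of Listing 2.
SCHEME = """
<subjectScheme id="taxonomyScheme">
  <subjectdef href="task.dita" navtitle="Task">
    <subjectdef href="configuring.dita" navtitle="Configuring"/>
  </subjectdef>
</subjectScheme>
"""

def to_turtle(scheme_xml):
    """Serialize nested subjectdef elements as SKOS concepts in Turtle."""
    lines = [f"@prefix skos: <{SKOS}> ."]

    def visit(element, broader):
        for child in element.findall("subjectdef"):
            # Derive a URI from the file name without its extension.
            uri = BASE + child.get("href").rsplit(".", 1)[0]
            label = child.get("navtitle")
            statements = ["a skos:Concept", f'skos:prefLabel "{label}"']
            if broader:
                statements.append(f"skos:broader <{broader}>")
            lines.append(f"<{uri}> " + " ;\n    ".join(statements) + " .")
            visit(child, uri)

    visit(ET.fromstring(scheme_xml), None)
    return "\n".join(lines)

turtle = to_turtle(SCHEME)
print(turtle)
```

A fuller conversion would also carry over the alternative labels and scope notes from the subject definition topics and the classification of content topics.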
In a typical approach, SKOS tools display the thesaurus or taxonomy to the user for browsing.
By picking one or more subjects, the user navigates to the relevant classified
content. For instance, a user might pick the Security and Web Server subjects
to find the example topics. You can see an example of one such interface (many are possible) at the Semantic
Web Environmental Directory (SWED) site (see Resources). This framework uses SKOS for
its subject classification. More sophisticated uses of the subject classification
are also possible.
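The browse-and-pick interaction reduces to a set intersection over the classification. The sketch below assumes a subject-to-topic index for the two top-level classified topics from Listing 3. The function name `topics_for` is hypothetical, and a real SKOS tool would work from the generated SKOS rather than from a Python dictionary.

```python
# Subject-to-topic index for the two top-level classified topics in Listing 3.
index = {
    "configuring.dita": {"loginsetup.dita"},
    "webserver.dita": {"websecure.dita", "loginsetup.dita"},
    "security.dita": {"websecure.dita", "loginsetup.dita"},
}

def topics_for(selected, index):
    """Return the topics classified under every selected subject."""
    sets = [index.get(subject, set()) for subject in selected]
    return set.intersection(*sets) if sets else set()

found = topics_for(["security.dita", "webserver.dita"], index)
print(sorted(found))
```

Adding Configuring to the selection narrows the result to the login setup topic, which is the kind of refinement a browsing interface can offer.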
Summary
This article has addressed only the basics of classifying your content. The DITA
specialization provides additional capabilities, including associative
relationships between subjects and tabular classification. As noted above,
however, you need separate runtime tools to operate on the classification.
In addition, the art of defining subjects and organizing them into taxonomies has its own strategies and best practices.
Also, users of this DITA
specialization should bear in mind that SKOS itself hasn't yet progressed
to an approved standard. As a result, the SKOS-compatible DITA specialization
could undergo some evolution.
However, the DITA specialization already demonstrates
the value of maintaining a taxonomy and classification as part of your
content. By taking this approach, content creators can use the
technologies of the Semantic Web to process their content based
on what it means.
Download
| Description | Name | Size | Download method |
|---|---|---|---|
| Download for thesaurus specialization | x-dita10_thesaurus.zip | 259 KB | HTTP |
Resources
About the authors
Erik Hennum works on the design and implementation of User Assistance for the IBM Systems Group. For DITA, he has helped shape the principles of domain specialization and is a member representative on the OASIS DITA Technical Committee.
Robert D. Anderson is the chief architect for the DITA Open Toolkit, and is also a member of the OASIS DITA Technical Committee. He has worked on IBM's internal Information Development Workbench since 1999, supporting both XML (DITA) and SGML (IBMIDDoc).
Colin Bird is an Information Architect in the User Technologies Department at IBM Hursley. Prior to that, Colin worked at the IBM UK Scientific Centre, developing image manipulation and visualization applications. This work led to several information retrieval projects, particularly involving content-based image retrieval, culminating in a year's secondment to the Intelligence, Agents, Multimedia Group at Southampton University. Colin currently holds a Visiting Senior Research Fellowship there. His time at Southampton engendered a strong interest in the capture of metadata for subsequent use in both the principled retrieval and the adaptive delivery of information.