Information architecture essentials, Part 3: Organizing complex information

Understanding and creating controlled vocabularies

Benjamin Lieberman, Ph.D., Principal Software Architect, BioLogic Software Consulting, LLC
Benjamin A. Lieberman serves as the principal architect for BioLogic Software Consulting, a firm providing services on a wide variety of software development topics, including requirements analysis, software analysis and design, configuration management, and development process improvement. Dr. Lieberman is also an accomplished professional writer and author of The Art of Software Modeling and numerous software-related articles. Dr. Lieberman holds a doctorate degree in biophysics and genetics from the University of Colorado.

Summary:  Useful information rarely presents itself neatly categorized, labeled, and ready for storage in a content management system. How much easier life would be if it were so. Instead, you must analyze the information to be archived to determine a usable and maintainable structure for both storage and easy retrieval. To allow for constructive use of the information, you must choose categories that support the intended audience's ability to rapidly locate the most relevant materials.

Date:  26 Feb 2008
Level:  Introductory

The goal is to provide insightful search results

Given the increasing amount of information available online and offline, it's more important than ever to create usable data structures. The goal of data organization is to provide access to the vast resource represented by a diverse data repository. Consider the now-common example of performing a Web search using Yahoo, Google, or Ask.com. Just a few years ago, a fruitful search might have required hunting through page after page of hits for one or two of value. Today, with advanced search algorithms, most searchers can find information of interest in the first few pages, or they can rapidly refine a search based on related terms that are highlighted.

The first step to providing useful insight into a large storehouse of data is to generate a common way of referring to the information—in other words, develop a controlled vocabulary.


Skills and competencies

Controlled vocabularies may be of any complexity, but creating one usually requires the information architect to have deep insight into the information space. Vocabulary creators use standards, such as ISO 2788 (see Resources), to establish the set of terms for a particular categorization hierarchy. Data-vocabulary standards are a uniform, tested, and effective way to manage a specific collection of information. For example, library standards exist for cataloging books, music, film, maps, and other items. Such a standard provides a uniform way for anyone familiar with it to rapidly locate information of interest.

The most common way to create a controlled vocabulary is to use commonly found terms to describe sets of information and to arrange those terms into a single, rooted hierarchy. For example:

Stone -> Metamorphic -> Calcareous -> Marble -> Udaipur Green

This structure is intuitive, common, and relatively easy to construct if you have an understanding of geology. A later section in this article discusses four kinds of term-based controlled vocabulary structures.
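A single rooted hierarchy like the one above can be sketched as a child-to-parent map. The following is only a minimal illustration in Python; the names `PARENT` and `path_from_root` are hypothetical and not part of any vocabulary standard.

```python
# Child -> parent links for a single rooted hierarchy.
# The terms come from the stone example above; the structure is an assumption.
PARENT = {
    "Metamorphic": "Stone",
    "Calcareous": "Metamorphic",
    "Marble": "Calcareous",
    "Udaipur Green": "Marble",
}


def path_from_root(term, parent=PARENT):
    """Return the list of terms from the root down to `term`."""
    path = [term]
    # Walk upward until we reach a term with no parent (the root).
    while path[-1] in parent:
        path.append(parent[path[-1]])
    path.reverse()
    return path


print(" -> ".join(path_from_root("Udaipur Green")))
# prints: Stone -> Metamorphic -> Calcareous -> Marble -> Udaipur Green
```

Because every term has at most one parent, each term has exactly one path to the root, which is what makes this kind of structure easy to browse.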

Another form of vocabulary standard is an authority file. Authority files are often used in library-organization schemes to unambiguously define a set of terms. They're also extensively used in law to establish uniform definitions for particular legal terms. These kinds of formal languages are usually created only when there is a significant consequence to misunderstanding a particular term. For example, in a court case a misinterpretation due to natural-language ambiguity could lead to a large financial consequence or even imprisonment.

A related, but less rigorous, controlled vocabulary is occupational jargon. Jargon is established in an industry (medical, legal, scientific, engineering, and so on) to allow for rapid, unambiguous understanding. Occupational jargon requires deep knowledge of a particular topic. There are many cases of jargon, but because most terms aren't officially sanctioned, a set of workers in one area may not share the same jargon terms with a group in a different area. Consequently, a jargon term may not have a single meaning. For this reason, you should be careful when using jargon for controlled terms. Be sure the intended audience is familiar with these terms and that they're well defined and stable.

Iconic representation is another powerful and controlled way to present information. In this type of presentation, the information categories are represented by visual iconic forms rather than language terms. Consider a city map intended for use by visitors who may or may not speak the local language. The pictorial representation of city attractions and services is much easier to interpret than standard language descriptions. But this approach requires familiarity with the meaning of the selected symbols (such as the symbol for a medical building; in many Middle Eastern countries the symbol for a medical building is a red crescent rather than a red cross), making iconic representations a challenging approach. Moreover, it's difficult at best to embed the idea of a hierarchy using images or iconic representations for information.

Categorization exercise

The following 20 random words help illustrate how each person's mental picture influences the grouping of information. Everyone has a different way of relating to the world around them, and their preferences are made apparent in what grouping categories they select, especially when a person is working alone. Take the word list (random words generated using RandomWord; see the Resources section), and write each word on an individual piece of paper. Now, try to organize all 20 into a reasonable structure. First try the exercise by yourself. Then repeat it in a small group. Conflict will inevitably arise as each person tries to assert his or her approach as "best." The take-home message is that the most effective way to create controlled vocabularies is in collaboration with many other groups. Collaboration tends to normalize each individual's prejudice. This is one of the rare occasions where design by committee is a good idea!

TEAR, ARENA, ORGANIZATION, GUIDANCE, DESCRIPTION, SWEEP, GRAND, MOLECULE, GENIUS, CALLING, ICE, QUEEN, INSTRUMENT, APPLICANT, LIMB, PLASTER, RELIEF, SERIES, CONSTITUENT, COMPASSION

Choose the most effective vocabulary type for your information and users

Many excellent sources discuss the different kinds of term-based controlled vocabularies, so this article only briefly covers them (see the Resources section). Four basic forms (listed by increasing complexity) are most often used when defining an information-organization structure:

  • List
  • Synonym ring
  • Faceted description
  • Thesaurus

One way to understand the differences among the types of vocabulary is to think about adding a new dimension of information from one approach to the next. For example, a list is a one-dimensional structure that is based on a single common attribute of the information listed. If you were presenting a list of all forms of rock found on earth, it would contain the following three items: igneous, sedimentary, and metamorphic.

A synonym ring adds another dimension to the list, allowing navigation from one term to another closely related term. For example, Netflix uses prior rental behavior to recommend additional selections. Someone starting with a set of film actors can browse to related directors or music or plot and then move on to genres, critic picks, movie categories, series, and so on. With this related-term approach to information organization, the user can rapidly browse to films of interest and can quickly narrow a catalog of tens of thousands of movies by indicating no interest in a particular suggestion category.
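In search systems, a synonym ring is commonly implemented as a set of terms that are treated as equivalent when a query is run. The sketch below assumes that reading; the rings, the sample documents, and the helper names `expand` and `search` are all invented for illustration.

```python
# Each ring is a set of terms treated as interchangeable for searching.
# The rings and documents are invented sample data.
RINGS = [
    {"film", "movie", "motion picture"},
    {"tv series", "television show"},
]

DOCUMENTS = {
    1: "a classic motion picture from the 1940s",
    2: "review of a new television show",
    3: "the best movie of the year",
}


def expand(term):
    """Return the term plus every other member of its ring, if it has one."""
    term = term.lower()
    for ring in RINGS:
        if term in ring:
            return set(ring)
    return {term}


def search(term):
    """Return ids of documents that mention any term in the expanded ring."""
    terms = expand(term)
    return sorted(
        doc_id
        for doc_id, text in DOCUMENTS.items()
        if any(t in text.lower() for t in terms)
    )


print(search("film"))  # prints: [1, 3]
```

The matching here is a plain substring test, which is enough to show the idea but would need proper tokenization in a real search index.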

An approach that is growing in popularity, particularly on the Web, is the faceted organizational scheme. Originally developed by S. R. Ranganathan, a faceted description of an item uses multiple characteristics to provide a cross-indexing capability when information may belong to many categories. Each of these facets is like a facet on the face of a diamond, reflecting a different aspect of the subject (see Figure 1). The primary advantage of this approach is that it lets you incorporate new information into the existing structure by establishing a new facet.


Figure 1. Facets for a "Calendar" topic
facet example
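A faceted description can be sketched as a record with one field per facet, plus a filter that matches on any combination of facets. The items and facet names below (`format`, `theme`, `year`) are invented for illustration and are not taken from Figure 1.

```python
# Each item is described by several independent facets (invented sample data).
ITEMS = [
    {"name": "Wall calendar 2008", "format": "wall", "theme": "nature", "year": 2008},
    {"name": "Desk calendar 2008", "format": "desk", "theme": "art", "year": 2008},
    {"name": "Wall calendar 2009", "format": "wall", "theme": "art", "year": 2009},
]


def filter_by_facets(items, **facets):
    """Keep the items whose facet values match every requested facet."""
    return [
        item
        for item in items
        if all(item.get(facet) == value for facet, value in facets.items())
    ]


def facet_counts(items, facet):
    """Count how many items fall under each value of one facet."""
    counts = {}
    for item in items:
        value = item.get(facet)
        counts[value] = counts.get(value, 0) + 1
    return counts


wall_art = filter_by_facets(ITEMS, format="wall", theme="art")
print([item["name"] for item in wall_art])  # prints: ['Wall calendar 2009']
print(facet_counts(ITEMS, "theme"))         # prints: {'nature': 1, 'art': 2}
```

Adding a new facet only means adding a new key to the records; the filter and the counts work unchanged, which mirrors the extensibility advantage described above.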

The most extensive approach is a full thesaurus, which gives the most complete description of a particular topic. In addition to synonyms, recorded as "used for" (UF) terms, and associative cross-references to other hierarchies, recorded as related terms (RT), a thesaurus adds the final dimension of narrowing terms (NT) and broadening terms (BT), where a narrowing term is more specific for the topic and a broadening term is more generic.

Table 1 gives an example of a thesaurus (from the United States Geological Survey).


Table 1. Ecological processes thesaurus
Description: Dynamic biogeochemical interactions that occur among and between biotic and abiotic components of the biosphere
Broadening term (BT): biological and physical processes
Narrowing term (NT): algal blooms, bioaccumulation, biogeochemical cycling, biological productivity, contaminant transport, dispersal (organisms), ecological competition, ecosystem functions, eutrophication, extinction and extirpation, habitat alteration, migration (organisms), pollination, succession (biological)
Related term (RT): ecology, population and community ecology
Used for (UF): environmental processes, ecological models

A thesaurus-based model for organizing information contains the most structure and provides the greatest search and filter capabilities. But it's also the most difficult and time consuming to construct. Before you go to the effort of generating a full thesaurus as part of your information architecture, be sure you understand your intended audience's search and filter needs.
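One way to picture a thesaurus entry is as a record with a field for each relationship type. The sketch below loads part of Table 1 into such a record; the `ThesaurusEntry` class and the `preferred_term` helper are hypothetical, and the narrower-term list is deliberately shortened.

```python
from dataclasses import dataclass, field


@dataclass
class ThesaurusEntry:
    term: str
    description: str = ""
    broader: list = field(default_factory=list)   # BT
    narrower: list = field(default_factory=list)  # NT
    related: list = field(default_factory=list)   # RT
    used_for: list = field(default_factory=list)  # UF


ENTRY = ThesaurusEntry(
    term="ecological processes",
    broader=["biological and physical processes"],
    # Only a subset of the narrowing terms in Table 1, to keep the sketch short.
    narrower=["algal blooms", "bioaccumulation", "eutrophication", "pollination"],
    related=["ecology", "population and community ecology"],
    used_for=["environmental processes", "ecological models"],
)

THESAURUS = {ENTRY.term: ENTRY}


def preferred_term(query, thesaurus=THESAURUS):
    """Map a query to its preferred term via the UF lists, or return None."""
    query = query.lower()
    for entry in thesaurus.values():
        if query == entry.term or query in entry.used_for:
            return entry.term
    return None


print(preferred_term("Environmental Processes"))  # prints: ecological processes
```

A search front end could use the `narrower` list to offer refinements and the `broader` list to widen a search that returns too little.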

Tools and techniques

The purpose of organizing information is to allow rapid identification and retrieval of useful data. There are many ways (a practically infinite number) to group and arrange sets of information, so identifying the "correct" structure depends entirely on how the information will be accessed. Each user group, and perhaps each individual user, will have a particular goal in mind when accessing an information repository. The responsibility of the information analyst is to understand these goals and select a strategy that best meets the users' needs.

Understand search behaviors

Almost everyone at one time or another has been faced with the daunting challenge of finding the figurative needle in the metaphorical haystack. Whether you're a student searching through labyrinthine library-book stacks, or a business person hunting the no-less-mysterious shared drive structure, the need to find a particular piece of information has the same level of importance. The difference comes in the approach used to locate that piece of critical information. There are three basic search behaviors:

  • Opportunistic: Multiple answers sought (browsing)
  • Focused: Single answer sought (seeking)
  • Rigorous: Deep understanding sought (research)

Opportunistic hunters follow their noses; they may have only a vague idea what they're after, and they refine the search based on what they find. To organize information for this type of search, you should place a premium on identifying common categories (see the ideas about abstraction discussed in the next section). And you should allow for a free-ranging investigation of common terms, such as that provided by a synonym ring.

Focused seekers are in search of a single answer to a specific question. They want to rapidly refine the search by narrowing choices to a small set of candidate information. In this case, a thesaurus will most likely provide the necessary filtering capabilities, with narrowing terms rapidly focusing on the desired answers.

Finally, rigorous researchers are after a deep understanding of a particular topic. They want to leave no stone unturned. A multifaceted information approach, where each facet of a piece of information leads to an increasingly more detailed understanding, best serves these scholars.

Discover information abstraction

Although many skills are important for the information analyst, perhaps the most important are the twin ideas of abstraction and leveling. Abstraction is a technique for discovering commonalities in disparate data elements to identify a common root element. Leveling is the grouping of like with like at a particular abstraction level. Combined, these two ideas let you create effective information structures to better meet the needs of a particular user group.

Abstraction is about discovering common attributes that are global to a group of items. These common attributes can then be identified as part of a base element in an organizational hierarchy. Many controlled languages use a hierarchy of terms that go from a more general level of description to a more specific level. Consider the hierarchy shown in Figure 2.


Figure 2. Term hierarchy for marble
hierarchy for marble

Note that there are two different forms of marble, and each falls under a separate subcategory. Each type is formed from a different geological process, even though they have the name "marble" in common. Moreover, each has a very different chemical makeup, which affects the way each material is used. In this case, someone searching for "marble" may not understand the differences and therefore be confused by the search result. To prevent this confusion, each type of marble should have the other as a related term (calcinate marble should be shown related to dolomitic marble, and dolomitic marble should be shown related to calcinate marble).
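Reciprocal related-term links like these are easy to get wrong by hand, so a small consistency check can help. The sketch below is an assumption about how the links might be stored (a map from term to a set of related terms); the helper names are hypothetical.

```python
# Related-term (RT) links, stored as term -> set of related terms.
# The reciprocal link is deliberately missing to show the check at work.
RELATED = {
    "calcinate marble": {"dolomitic marble"},
    "dolomitic marble": set(),
}


def missing_reciprocals(related):
    """Return (term, other) pairs where term lists other but not vice versa."""
    return sorted(
        (term, other)
        for term, others in related.items()
        for other in others
        if term not in related.get(other, set())
    )


def make_symmetric(related):
    """Return a copy in which every related-term link points both ways."""
    result = {term: set(others) for term, others in related.items()}
    for term, others in related.items():
        for other in others:
            result.setdefault(other, set()).add(term)
    return result


print(missing_reciprocals(RELATED))
# prints: [('calcinate marble', 'dolomitic marble')]
print(missing_reciprocals(make_symmetric(RELATED)))
# prints: []
```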

Most hierarchies are single dimensional, which means that all elements share a common root element. Often, however, it's necessary to refer to an element by a different set of characteristics when searching or filtering. In these cases, a multidimensional hierarchy, where each element is part of more than one tree, is useful. You saw this in the earlier facet example, where a data element was described by multiple characteristics.

Related to abstraction, the idea of leveling is to place information items into peer categories at the correct level in the hierarchy. For example, the marble hierarchy includes multiple peer-level types of marble, such as travertine, serpentine, and onyx. These aren't listed under different headings because they all share the characteristic of being a type of dolomitic marble.

When you're deciding the level of a particular piece of information, it's important to consider what set of attributes you plan to use. A book may have weight, price, and construction materials as well as information content. Which of these attributes is relevant when you are deciding where a particular book should go in a hierarchy? In biology, the binomial system of nomenclature has proven handy for organizing different life forms based on appearance (phenotype). However, grouping by appearance can place together organisms that have no close genetic relationship (genotype), simply because they evolved in separate environments to appear similar. In practice, this means you should take care when deciding which characteristics are related and which aren't, to avoid confusion.


Milestones

Selecting an organizational scheme and populating term hierarchies are challenging tasks. To guide this development, a measurement of effectiveness for the selected approach is sometimes helpful. There are many ways to measure how well an information-organization strategy is working. The simplest is to observe how much time individuals take to locate useful information and how often a search is abandoned in frustration. If the information store is on a Web site, you can estimate how much time a person took to locate information by noting the number of information pages that are browsed before a link is followed. A more intrusive technique is similar to the sometimes-annoying question on every help page: "Was this information useful?" It lets users provide feedback on the effectiveness of a particular data search and filter approach.
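The two simplest measures mentioned above, time to locate information and how often searches are abandoned, can be computed from a log of search sessions. The following is a rough sketch; the session records, field names, and the `summarize` helper are all invented for illustration.

```python
from statistics import mean

# Hypothetical search sessions: seconds spent, pages browsed, and whether
# the user found something useful. All values are invented.
SESSIONS = [
    {"seconds": 30, "pages": 2, "found": True},
    {"seconds": 90, "pages": 6, "found": True},
    {"seconds": 120, "pages": 9, "found": False},
    {"seconds": 60, "pages": 4, "found": True},
]


def summarize(sessions):
    """Return the abandonment rate and mean effort for successful searches."""
    if not sessions:
        raise ValueError("at least one session is required")
    successful = [s for s in sessions if s["found"]]
    return {
        "abandonment_rate": 1 - len(successful) / len(sessions),
        "mean_seconds_to_find": mean(s["seconds"] for s in successful) if successful else None,
        "mean_pages_to_find": mean(s["pages"] for s in successful) if successful else None,
    }


print(summarize(SESSIONS))
```

Tracking these numbers before and after a change to the vocabulary gives a simple way to judge whether the change helped.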

Finding information isn't the challenge; rather, it's realizing the information is of value to someone. For someone to realize the value of information, you need to organize data to permit rapid discovery and utilization. Lists, synonym rings, facets, and thesauri are four common ways of organizing information that have proven to be effective. Each of these approaches has advantages and drawbacks that you should balance against the needs of the user community.


Resources
