As the volume of online documents in the enterprise grows ever larger, systematic organization of enterprise content through classification becomes more and more important. This need is amplified by the advancing integration of formerly separate systems and by comprehensive enterprise search systems, like IBM OmniFind Enterprise Edition, that allow end users to retrieve documents from disparate sources through a single point of access. The use of an enterprise taxonomy (a hierarchically organized set of relevant categories) together with the classification of document content is a powerful approach to address this need.
This paper leads you through the individual tasks required to successfully create and deploy a high-performance enterprise taxonomy using the following tools:
- SchemaLogic Enterprise Suite to manage, model, and maintain a consolidated enterprise taxonomy
- IBM Classification Module for OmniFind Discovery Edition (ICM) to automatically classify documents
- IBM OmniFind Enterprise Edition (henceforth called OmniFind) to exploit category information in document search
This article assumes you have a basic knowledge of the functionality of an enterprise search system, and of OmniFind in particular. It focuses on how OmniFind can be integrated with SchemaLogic Suite and ICM into a true enterprise search system that leverages the power of taxonomies and auto-classification.
The structure of the article is as follows:
- Sections "Taxonomy management" and "Taxonomy deployment" cover requirements of taxonomy systems and how taxonomies are deployed, with the succeeding section highlighting "SchemaLogic Enterprise Suite"
- Sections "Automatic text classification" and "ICM server and tools" give a background on auto-classification in general, the classification method used in ICM and its advantages over other techniques, as well as how to work with the tooling
- Sections "Integrating classification and search" and "How to fit the systems together step by step" describe the benefits of integrating classification and search and how the integration of the three applications works in practice
An enterprise taxonomy is a set of terms and the associated relationships that exist between them. It can be as simple as a list of products, or it can be a complex structure that supports the relationships between companies, their suppliers, and their customers. Enterprise taxonomies are used to describe the content that is generated by a business. Using the standard language that is present in these taxonomies, both business users and the software that supports their work can describe similar content the same way.
Broken down into its component parts, an enterprise taxonomy is:
- A multi-purpose, often hierarchical, list of terms that describes content
- Centrally managed or distributed with a strong governance model
It typically includes a combination of the following:
- Controlled vocabularies
- Allowed values for defined metadata fields
- Preferred terms
- Synonyms, acronyms, and abbreviations
- Standard thesaurus relationships
- Other named relationships
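As a rough illustration of these components, the sketch below models a single taxonomy term with synonyms and thesaurus-style relationships. The class and field names are invented for illustration and are not drawn from any of the products discussed:

```python
from dataclasses import dataclass, field

@dataclass
class Term:
    """One entry in a controlled vocabulary, with typical relationships."""
    preferred_label: str                            # the preferred term
    synonyms: list = field(default_factory=list)    # synonyms, acronyms, abbreviations
    broader: list = field(default_factory=list)     # thesaurus "broader term" links
    narrower: list = field(default_factory=list)    # thesaurus "narrower term" links
    related: list = field(default_factory=list)     # other named relationships

# "WAS" is a common acronym for the preferred term:
was = Term("WebSphere Application Server",
           synonyms=["WAS"],
           broader=["Application Servers"])
```

A subscribing system that only handles flat lists would consume just the `preferred_label` values, while a thesaurus-aware system could also use the `broader`/`narrower` links.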
Enterprise taxonomies provide the following benefits:
- Taxonomies can describe content across applications
- Taxonomies are focus-agnostic; they have no explicit "point of view"
- Taxonomies can be used to control the values in metadata
- Taxonomies can enhance search queries by adding synonyms, acronyms, and abbreviations
- Taxonomies can assist site navigation
More recently, enterprise taxonomies are also being used to enhance corporate search. New search techniques, like semantic and actionable search, are greatly improved by adding the knowledge that is already built into an enterprise taxonomy structure.
Typical examples include:
- Faceted navigation: An active interface -- a dynamic combination of search and taxonomy browse
- Search results clustering: Automatic grouping of documents into dynamically labeled categories
- Categorized search results: Search results are grouped into a meaningful and stable classification defined by the taxonomy
- Actionable search: Allows users to do something directly from a search result
- Semantic search: Search enhanced by relevant data from different sources; this data is described by using an enterprise taxonomy
Taxonomies are designed by looking at content and talking to subject matter experts so that an appropriate model of the data can be created. Typically, taxonomists look at existing databases, Web sites, organizational charts, product lines, other legacy databases, term lists, and product documentation to tease out existing categories. Often a good, representative category system can be drawn from other systems.
Once the majority of this up-front analysis has been completed and the requirements for the enterprise taxonomy have been determined, a taxonomy-modeling software system must be put in place. Up until the late 1990s, large organizations either built their own applications for managing taxonomies, relied on modeling packages that shipped with individual systems, such as search engines or auto-classification systems, or used simple desktop applications, such as spreadsheets. Each approach presents challenges to the enterprise.
Home-grown systems can be very expensive to develop and maintain, particularly in environments that rely on large, complex, and dynamic taxonomies that are central to the business, for example taxonomies that drive high-quality search results or high-quality auto-classification. As a result, organizations either continue to invest in their own systems, or, in a non-critical environment, the high cost contributes to the demise of taxonomy projects.
Modeling tools that are part of a system, such as a search or auto-classification system, are built to interact specifically with that system. As such, the taxonomy models managed in these system-specific tools are not easily integrated with other systems without extensive custom software development efforts. Additionally, these tools are often simple and cannot be used in more complex modeling, large scale, or distributed ownership scenarios.
Recently, however, another option has become available to organizations. A number of software vendors have developed and released taxonomy modeling applications that are designed for generic enterprise use and are not tied to specific systems. These systems run the gamut of cost and capabilities, from single-user desktop modeling applications, to robust enterprise-grade semantic management systems. These systems are designed for modeling and are generally more usable than system-specific tools. The real power of these applications is to provide a centralized modeling environment that multiple users, user types, and systems throughout the enterprise can access, both for modeling and consuming models.
Choosing the right system for your enterprise requires mapping the requirements gathered in the up-front analysis with features, capabilities, and costs of each option.
To determine which route to go for managing your taxonomy (whether to build a system, license a commercial system, or use a taxonomy package that ships with an existing system), a number of factors must be considered. The primary drivers for your decision should be the needs of the organization and how you plan to manage and use your taxonomy.
The general requirements include technological requirements, user and usability requirements, and taxonomic requirements, such as the size and level of activity of the taxonomy. Some of these key requirements include:
- Level of activity: This describes how dynamic the taxonomies will be. Taxonomies that change very little typically have few editors or owners, and the model updates are not required to flow immediately to other systems. In these cases, a simpler modeling application may suffice. However, highly dynamic taxonomies, ones that are changing daily or even hourly, will require a system with enough performance to support rapid changes to models from multiple users, multiple systems, or both.
- Size and complexity of vocabularies: Taxonomies with a large number of terms or a large number of relationships between those terms will require a modeling system that can scale with the growth of the taxonomy.
- Taxonomy integration: The more systems that centralized taxonomy models can be leveraged in, the more powerful and cost effective the taxonomy creation and maintenance process will be. Broadly, other systems interact with the modeling system in one or both of two ways:
- Subscribing systems: These are systems that consume the whole or a sub-set of the taxonomy. How a taxonomy can be used is usually limited by how subscribing systems can use it and what structures they can utilize. These systems are on the receiving end of the taxonomy and may consume anything from flat lists to complex hierarchies all the way to complex ontologies.
- Publishing systems: In some environments, other systems may publish to the modeling application. In one case, the "taxonomy of record" for certain subsets of the overall enterprise taxonomy might be managed in another system. For example, a product list may be managed in an ERP system, but that list can be utilized by other systems (such as an ECM, auto-classification, or search system). In this case, the modeling application may be used as a "clearing house," receiving the product list from the ERP system, then re-distributing all or sub-sets to different systems.
In another case, another application might generate terms that should be incorporated into the enterprise taxonomy. These types of systems include advanced natural language processing systems that can discover new terms and relationships by analyzing content (such as document text in ECM systems). Other examples include terms generated from Social tagging systems and terms generated from search analytic systems.
- Subscribing and publishing systems: In some cases, a system can both subscribe and publish to the taxonomy modeling system. An example of this is an auto-categorization system that can consume a taxonomy for the purpose of categorization and can also discover new terms and relationships to feed back into the model. Iterative taxonomy maintenance helps auto-categorization systems become iteratively more accurate.
Active enterprise environments, with multiple subscribing systems, publishing systems, or both require a modeling application that can readily and reliably connect to multiple systems in such a manner.
- Distribution of ownership: How broadly, geographically, and organizationally model ownership is distributed throughout an enterprise will dictate how robust the modeling tool must be to support a diverse population of users. This includes the ability for users to connect to the system easily and for client-side application management to be minimal.
- Number and types of users: In addition to model owners and taxonomy editors, an enterprise may have many users who are stakeholders in and interested parties to the taxonomy or taxonomy sub-sets and may be contributors to the taxonomy development process. These users need to be able to access and view the models in ways that fit with their needs and authority. They may need to contribute to the taxonomy development process directly, in situ within a subscribing system. For example, an author tagging a document in an ECM system may need to suggest a new term for a particular option list managed in the taxonomy modeling tool. Ideally, that process would be seamlessly integrated with the ECM UI so that users would not even realize they were interacting with the taxonomy management system.
- How users will use the tools: Different types of users will use the taxonomy modeling system in different ways. Taxonomy editors need powerful editing capabilities to quickly and easily make individual and bulk model changes, along with powerful search and browse capabilities to quickly locate terms and taxonomy branches. Business owners of subsets may need to view just their portion of the taxonomy and do not need capabilities that are as powerful. Finally, stakeholders and users who occasionally suggest new terminology may need an even smaller subset of capabilities.
- Usability: The modeling Graphical User Interfaces must be easy to use by editors and contributors, and the capabilities must match the user's role. Capabilities to search and navigate the taxonomy, create, edit, and delete terms and branches must be easy and intuitive to use.
- Technical architecture: The system must fit within enterprise technical architecture guidelines, such as supported platforms and databases. Additionally, if custom connections to the modeling system are to be developed and maintained by the enterprise, the system must have well-documented APIs and a comprehensive SDK.
- Multilingual capability: Many organizations need to maintain their taxonomies in many languages. Simple multi-lingual environments may need the ability to map languages one to one. More complex environments may need to accurately model complex interrelationships amongst languages, such as the ability to map a single complex term in one language to multiple terms or even a Boolean type statement in another language.
- Scalability: The modeling system may need to scale in a number of different ways, such as the volume of terms, the number of users, and geographical distribution of servers.
- Security and permissions: The security model of the system must meet the needs of the organization, including the ability to tie into enterprise security systems such as LDAP systems. The permissions model must allow for sufficient levels of ownership, viewing, editing, and collaboration rights.
A second category of requirements has to do with the actual logical modeling capabilities and the type and complexity of the models. These considerations include:
- Term relationships: The types of taxonomies and the complexity of the term relationships need to be determined. These relationships run the gamut from flat lists (no relationships), to simple hierarchies (for example, parent/child relationships), to thesaurus relationships (for example, those conforming to the NISO construction standards; see Resources), all the way to highly specified, defined relationships and ontologies.
- Sub-setting taxonomies: The ability to define subsets of the taxonomy, often based on facets, term relationships, or other attributes is often needed for integrating with downstream systems, which have their own constraints on required terminology or complexity of model.
- System- and user-defined attributes: Attributes of terms in your system, such as when a term was created or modified, by whom, and so on, are often required for managing a taxonomy. Most of these attributes will come with the modeling system "out of the box." However, particularly for integrating taxonomies with other systems, creation and management of user-defined attributes is necessary. User-defined attributes are those attributes that can be established and configured by the customer. Often when integrating with other systems, information specific to that system must be published along with the terms or the taxonomy. Being able to record, store, and manage that information in the modeling system is highly advantageous.
- Other modeling capabilities: Additionally, there are a number of other considerations to be made concerning modeling capabilities. These should be based on known requirements and include:
- Polyhierarchy: The ability for terms to have multiple parents
- Topic maps: The ability to model topic maps (for example, those conforming to ISO/IEC 13250:2003)
- RDF: The ability to model information in a manner compliant with RDF (Resource Description Framework, a family of World Wide Web Consortium specifications)
- OWL (Web Ontology Language): The ability to render a taxonomic model in an OWL-compliant format (see Resources)
The technical method for integrating taxonomies and taxonomy subsets is typically dictated by the capabilities of both the taxonomy modeling system and by the target system.
Typically, integration falls into one of two types, with different variations on the theme. At the simplest or shallowest level, the integration can occur by transforming and transferring files, such as XML files. Most systems can now readily export taxonomies in an XML format, and subscribing systems can import taxonomies as XML.
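A file-based integration of this kind might look like the following sketch, in which a subscribing system flattens an exported hierarchy into parent/child pairs. The XML schema here is purely illustrative and does not reflect any product's actual export format:

```python
import xml.etree.ElementTree as ET

# A hypothetical taxonomy export; real products define their own schemas.
EXPORT = """
<taxonomy name="Geography">
  <term id="1" label="Europe">
    <term id="2" label="Germany"/>
    <term id="3" label="France"/>
  </term>
</taxonomy>
"""

def import_terms(xml_text):
    """Flatten an exported hierarchy into (parent_label, child_label) pairs."""
    root = ET.fromstring(xml_text)
    pairs = []
    def walk(node, parent_label):
        for child in node.findall("term"):
            pairs.append((parent_label, child.get("label")))
            walk(child, child.get("label"))
    walk(root, None)
    return pairs

pairs = import_terms(EXPORT)
```

The subscribing system would then load these pairs into whatever structure it uses internally, such as a navigation tree or a set of index fields.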
At the deepest level, the modeling system can be directly connected to subscribing systems using APIs. On the modeling system side, this allows the exposure of the modeling system's capabilities directly to other systems, including their user interfaces. Additionally, this allows the subscribing system to use the models stored in the modeling system without having to store another representation of it, such as in an XML file or in a database structure. Whenever the subscribing system needs the taxonomy or a subset, either for display or other purposes such as building a search index or analyzing a query string, it accesses the data in real time from the modeling system. This prevents the problem of having multiple versions of the taxonomy in different systems and the versions getting out of synchronization.
In all but the deepest API to API integrations, processes must be put in place to synchronize changes made in the modeling system to subscribing systems. Synchronization methods typically fall into one or more of the following categories:
- Manual: A user-activated batch process, typically executed through a UI or a command line interface
- Scheduled: An automated synchronization set up to occur on a specified schedule
- Event driven: An automated process that occurs whenever specified changes are made to the model
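The manual and event-driven styles above can be sketched with a single small dispatcher. The class and method names are invented for illustration; real products expose this through their own UIs and APIs:

```python
class TaxonomySync:
    """Toy synchronizer: batches model changes and pushes them out."""
    def __init__(self, publish):
        self.publish = publish        # callable that sends changes downstream
        self.pending_changes = []

    def record(self, change):
        """Accumulate a change without publishing (manual mode)."""
        self.pending_changes.append(change)

    def manual(self):
        """Manual: user-activated batch push of everything pending."""
        self.publish(self.pending_changes)
        self.pending_changes = []

    def on_model_change(self, change):
        """Event driven: push immediately whenever the model changes."""
        self.record(change)
        self.manual()

sent = []
sync = TaxonomySync(publish=sent.extend)
sync.on_model_change({"op": "add", "term": "Hybrid Cloud"})
```

A scheduled synchronization would simply call `manual()` from a timer or job scheduler on the specified interval.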
One such enterprise-grade taxonomy management system is the SchemaLogic Enterprise Suite.
At the core of the SchemaLogic Enterprise Suite is SchemaServer, a central repository where business model standards are gathered, created, refined, and reconciled, and from which the standards are distributed to subscribing systems. The Suite is capable of managing both semantic, including taxonomic, and structural models.
For semantic models, SchemaServer allows an organization to capture and manage the standard business terminology, code systems, and semantics that must be used consistently across the enterprise as vocabularies, terms, and relationships. This powerful and flexible system can model simple but critical lists like sales regions or marketing segmentation models up to complex multi-faceted taxonomies, thesauri, or ontologies describing complex business systems or product and services networks with hundreds of thousands of terms.
Additionally, SchemaServer describes the structural models used to store and exchange information as a hierarchy of information classes. This logical modeling capability allows you to capture a consistent and easy-to-understand model of all of the information systems in the enterprise and easily view and understand how they interrelate with each other. Relational, Object-Oriented, XML, SOA, and other technical models can be unified and brought under a business-oriented management model appropriate for business and technical participants. The relationships between semantic and structural models are also captured, enabling comprehensive governance and impact analysis to be implemented.
SchemaServer is accessed through powerful GUIs that allow you to quickly and easily create and maintain taxonomies. Workshop is a powerful modeling application used by expert modelers and taxonomy editors to administer corporate taxonomies. It includes powerful editing tools for importing, exporting, and making large-scale or bulk changes to the model.
Workshop Web is a zero footprint, browser-based UI to manage the objects in SchemaServer. It is designed for everyday business users of the system.
Figure 1. Geography hierarchy in Workshop Web
The SchemaLogic solution is built around four key capabilities to enable organizations to manage business semantics and taxonomies within the context of everyday operations:
- Model the structure and information relationships
- Govern and manage the changes
- Publish to subscribing systems
- Collaborate to expand and maintain
These key capabilities enable participation and contribution across organizational, corporate, and industry boundaries to facilitate the development of business semantics in a dynamic, constantly changing environment.
The SchemaLogic Enterprise Suite includes many of the key features required for taxonomy development and deployment projects:
- Ease of use: Workshop Web is highly graphical and emphasizes ease of use for general business users who typically do not have deep taxonomy development experience. Workshop provides additional powerful editing and administrative capabilities beyond those found in Workshop Web.
- Import/export: Workshop users can perform manual file-based imports and exports utilizing XML or CSV-formatted files. Imports and exports can be performed against any defined subset or the entire model. Non-file-based imports and exports can be accomplished with the API or through product Adapters.
- Collaboration: The SchemaLogic Suite includes a customizable governance system of contracts, permissions, and rights that allows all users to collaborate on the development of the semantics model.
- Permissions model: The permissions model allows user roles to be applied to any object in the system. Changes can be suggested but not committed until the appropriate owners and stakeholders have approved them.
- Impact analysis: The impact analysis feature allows users to graphically see which objects would be affected by a proposed change, as well as all the owners, stakeholders, and subscribers affected by the change.
- Defined relationships: Customers can define any number of custom term relationships, allowing organizations to model a full range of semantic relationships, including flat lists, simple hierarchical, thesaurus, and ontological.
- Configurable attributes: Customers can define any number of custom attributes for any object in the system. This is useful when integrating taxonomies with other systems, as those systems often require specific term attributes in order to integrate taxonomies.
- SDK: The SchemaLogic Suite includes a fully documented Web services SOAP API and a Java API to allow customers to write custom applications against the modeling server. This allows the modeling server to be integrated with existing line-of-business applications, exposing the modeling capabilities to users through their line-of-business UIs.
- Integration service and adapters: The SchemaLogic Suite includes a small footprint integration server, an integration architectural framework accessible through Web services designed to give organizations the ability to quickly build and deploy API to API Adapters to publish taxonomies or taxonomy subsets to subscribing systems. Additionally, SchemaLogic has pre-built product adapters for systems such as IBM Content Management suite and IBM OmniFind suite of systems.
OmniFind Enterprise Edition V8.4 exposes a number of capabilities that can be leveraged by an organization's taxonomy to finely tune and significantly enhance search results.
- Rule-based taxonomy: To simplify enterprise search deployments, OmniFind Enterprise Edition V8.4 provides the ability to configure a taxonomy of categories and category rules. The taxonomy serves two purposes. First, when the search index is created, taxonomy categories are applied to documents based on whether a document satisfies the rule. Second, once categories are applied to documents, the taxonomy can be used to create a browsing interface to the collection. Unlike many navigation-only solutions, OmniFind Enterprise Edition does not require a pre-defined taxonomy in order to deliver highly relevant search results. However, it can take advantage of taxonomy tags to influence both the results and the interface of a search application.
Using the SchemaLogic Suite, organizations can apply OmniFind-specific rules to existing taxonomic terms and publish those taxonomies, or subsets, to OmniFind. This allows organizations to leverage existing taxonomies in OmniFind and to manage their OmniFind taxonomy as part of their overall taxonomy management system and processes.
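The index-time application of category rules can be sketched as follows. The rule format, category paths, and URIs are invented for illustration and do not reflect OmniFind's actual rule syntax, which is considerably richer:

```python
# Illustrative rules: a document joins a category when its text or
# source URI matches the rule predicate.
RULES = {
    "Products/Servers": lambda doc: "application server" in doc["text"].lower(),
    "Regions/Europe":   lambda doc: doc["uri"].startswith("http://intranet.example/eu/"),
}

def categorize(doc):
    """Return every category whose rule the document satisfies."""
    return [cat for cat, rule in RULES.items() if rule(doc)]

doc = {"uri": "http://intranet.example/eu/notes.html",
       "text": "Tuning the Application Server heap size"}
cats = categorize(doc)
```

At index time, the resulting category list would be stored with the document so that the search UI can offer taxonomy-based browsing and filtering.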
- Linguistic dictionaries: OmniFind allows organizations to manage a number of different dictionaries to fine tune results. In the SchemaLogic Suite, each of these dictionaries can be managed seamlessly within the overall taxonomy of the organization and periodically published to OmniFind.
- Synonyms: This dictionary can be used to expand terms in the query string sent to the search engine to include specified synonyms. For example, this allows organizations to tune the search engine to search for the complete spellings of common enterprise acronyms. If a user searches on "WAS," the search engine can also automatically return results for "WebSphere Application Server," and the other way around.
- Boost words: This dictionary specifies terms and phrases that raise or lower the rank value of the document in which the term appears. This allows organizations to manipulate the ranking of search results to provide more highly relevant documents to users.
- Stop words: This dictionary specifies a list of enterprise-specific terms that are removed from query strings to improve the relevancy of search results. Typically, stop words are commonly occurring words or phrases whose inclusion in a query string may cause a large quantity of poor results.
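The effect of these three dictionaries on a query can be sketched as follows. The dictionary contents are invented examples; in OmniFind these are maintained as linguistic resources, not code:

```python
# Illustrative dictionaries for query-time processing.
SYNONYMS   = {"was": ["websphere application server"]}
STOP_WORDS = {"please", "the", "a"}
BOOSTS     = {"deprecated": -5.0, "best practice": 2.0}

def preprocess(query):
    """Drop stop words, then expand remaining terms with their synonyms."""
    kept = [w for w in query.lower().split() if w not in STOP_WORDS]
    expanded = []
    for w in kept:
        expanded.append(w)
        expanded.extend(SYNONYMS.get(w, []))
    return expanded

def boost(score, doc_text):
    """Raise or lower a document's rank value based on boost phrases."""
    for phrase, delta in BOOSTS.items():
        if phrase in doc_text.lower():
            score += delta
    return score

terms = preprocess("please tune WAS")
```

Here the stop word "please" is removed and the acronym "WAS" is expanded, so documents mentioning the full product name also match the query.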
There are various approaches to building document taxonomies. Some of the approaches are rule-based, mostly created and maintained by human experts. Others rely on automatic text classification techniques. Various text classification methodologies may yield different types of "models," or statistical descriptions of the world. Models may be complex or simple in the sense that the "classifiers," or the software components that determine if a text belongs to a category, can be architecturally simple or complex. Typically there is a trade off between sophisticated models and simple models, or between variance and bias.
Techniques such as Bayesian networks or neural networks use highly expressive models, which try to produce a non-biased classifier in order to "describe" a corpus of documents. Their results tend to have very high variance, which can be reduced only by large training sets and very static data. But in most real-world situations, and always in the customer interaction space, the databases, or corpora, are small, heterogeneous, and tend to change rapidly. Therefore, it cannot be assumed that there will be enough data to train these "complex" structures and reduce the variance. This typically results in what is called an over-fitted system, perhaps performing well in artificial tests, but not in the real world.
ICM relies on a proprietary and unique algorithm (Relationship Modeling Engine, RME) to create an optimal trade-off between variance and bias. This approach is superior to the apparently "complex" methods such as Bayesian networks or neural networks. ICM RME's sophistication is not in the architecture of the classifiers, but rather in how these classifiers are fine-tuned and built in real time.
Most of the algorithmic effort in the development of ICM RME was invested in how to automatically create and tune classifiers. As a result, ICM RME classifiers provide superior accuracy and the ability to generalize and learn from small training sets. In addition, these classifiers are highly intelligent in the way they are created dynamically and tuned, with either training or incremental learning.
ICM RME's algorithmic infrastructure is a unique self-learning engine, capable of classifying textual information, even in imperfect and noisy situations. It incorporates new knowledge on the fly, without the need to reconfigure or re-train the system. ICM RME's technology is different from standard classification techniques; it emphasizes cleanness and transparency. Using Concept Modeling techniques, ICM RME has the unique capability of serving multiple applications from a single knowledge base. This is the original premise of knowledge management -- push knowledge from the more human-intensive channels to the more automated or unattended channels automatically. ICM RME provides a mature technology that can adapt to real-world changes and continuously provide accuracy levels that make it valuable in a variety of real-world, mission-critical applications.
ICM RME provides services to applications that need to understand text or correlate between text and certain objects (for example, personalization or general data classification applications).
Typically, an application sends raw data to ICM RME for analysis and expects to receive a quick and accurate response based on the data content. In a typical customer message, a large share of the content is irrelevant to the actual analysis, and the message contains shorthand (abbreviations), potential spelling errors, and other imperfect characteristics.
ICM RME receives this message and processes it in two main phases: a multilingual Natural Language Processing (NLP) phase, and a language-independent statistic Concept Modeling phase. The first step of NLP processing primarily consists of finding portions of the text that contain relevant data, and extracting key features or linguistic events from the text that will be used later by the Concept Modeling engine (and possibly by the calling application directly).
The NLP engine processes input text, regardless of channel, and creates a Concept Model. A Concept Model is a computer-readable data structure containing the primary concepts that appear in the original text and some of their relationships. This structure is then fed into the Concept Modeling engine for pattern matching.
NLP processing addresses the fact that many different variants of the same "word" can appear in the text. Some of these variations are morphological variants (for example, "go," "goes," and "went" are linked to the same concept), and some are due to spelling errors or other naturally-occurring variations in expression. Concepts can be words, short sentences, multi-word tokens, numbers, dates, URLs, e-mail addresses, or any other meaningful patterns that appear in the document.
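A toy illustration of this normalization step follows. The mapping table is hand-written here purely for demonstration; a real NLP engine derives such mappings from morphological analysis and spelling correction rather than from a static dictionary:

```python
# Hand-written variant table for illustration only: morphological
# variants and a spelling error all map to one underlying concept.
VARIANTS = {
    "go": "go", "goes": "go", "went": "go",   # morphological variants
    "recieve": "receive",                      # spelling correction
}

def concepts(text):
    """Map each token to its underlying concept (identity if unknown)."""
    return [VARIANTS.get(tok, tok) for tok in text.lower().split()]

result = concepts("She went and goes")
```

Both "went" and "goes" collapse to the concept "go", so downstream pattern matching sees one concept rather than three surface forms.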
This highlights one of the ICM RME's differentiators: The system is looking for patterns in higher-level semantic structures, leveraging automatically collected domain and language knowledge rather than finding text patterns directly.
The result of ICM RME processing is a list of categories or intents embedded in the original text. The system may also extract certain features or patterns and create metadata fields if configured to do so. The system can also flag all messages with certain categories over a pre-defined threshold for special processing. It is important to note that the certainty factor is actually an estimate of the statistical likelihood that the category was identified correctly. This is another unique feature of ICM RME, which makes its configuration much more straightforward and provides companies with much greater control over how and when fully-automated actions are taken by applications -- a critical requirement in most environments, but absolutely essential in customer interactions.
Note that ICM RME performs very well in multi-intent scenarios; the feedback can be provided as a list of categories. In addition, the feedback process is very simple for the calling application; it only has to tell ICM RME which categories were correct -- the system does the rest. There is no need to say why or express a degree of confidence in the feedback. ICM RME automatically finds out why the feedback was given, and, in case of erroneous feedback, it will quickly nullify its effect (or "unlearn" it).
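The feedback contract described above can be sketched as follows. The simple counter-based weight adjustment is a stand-in for whatever the engine does internally; the point is only that the caller reports which categories were correct and nothing more:

```python
class FeedbackEngine:
    """Toy model of the feedback contract: the caller names the correct
    categories; the engine works out how to adjust itself."""
    def __init__(self):
        self.weights = {}

    def feedback(self, suggested, correct):
        # Reinforce confirmed categories, penalize the rest (simulated
        # here by simple counters; the real adjustment is the engine's
        # internal concern).
        for cat in suggested:
            delta = 1 if cat in correct else -1
            self.weights[cat] = self.weights.get(cat, 0) + delta

engine = FeedbackEngine()
engine.feedback(suggested=["billing", "returns"], correct=["billing"])
```

The calling application never explains why a category was right or wrong, which keeps the integration surface between application and classifier very small.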
IBM Classification Module for OmniFind Discovery Edition is a cross-platform server application for building client applications that interact with the Relationship Modeling Engine. The Relationship Modeling Engine is a full suite of language-processing technologies targeted at analyzing, understanding, and classifying the natural language of customer interactions and other types of everyday communication. The Classification Module exposes all the functionality necessary to develop applications that harness the power of the Relationship Modeling Engine, so this functionality is easily embedded. The Classification Module provides several client API libraries to enable rapid development of client applications in several programming languages, in particular Java. Ease of use and maintainability are combined with high availability and scalability: the system is designed to run on multiple machines and can scale with customer load by making optimal use of hardware and software resources. The system is configured and maintained using the Classification Manager application.
The Relationship Modeling Engine uses natural language processing and sophisticated semantic analysis techniques to analyze and categorize text. When an application sends input text to the Relationship Modeling Engine for analysis, the system identifies the categories that are most likely to match this text. The Relationship Modeling Engine works together with an adaptive knowledge base -- a set of collected data used to analyze and categorize texts. The knowledge base reflects the kinds of text that the system is expected to handle. Relationship Modeling Engine-enabled applications use categories to denote the intent of texts. When text is sent to the Relationship Modeling Engine for matching, the knowledge base data is used to select the category that is most likely to match the text. Before the knowledge base can analyze texts, it must be trained with a sufficient number of sample texts that are properly classified into categories. A trained knowledge base can take a text and compute a numerical measure of its relevancy to each category. This process is called matching or categorization. The numerical measure is called relevancy or score. The accuracy of a knowledge base can be maintained and improved over time by providing it with feedback -- confirmation or correction of the current categorization. The feedback is used to automatically update and improve the knowledge base. This process of automatic self-adjustment is called learning.
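The matching step described above -- score the text against each category, then pick the most relevant one -- can be illustrated with a toy sketch. This is not the ICM API: the class name, the term-overlap scoring, and the training method are all invented for illustration of the concept only.

```java
import java.util.*;

// Toy illustration (not the ICM API): a "knowledge base" that is trained
// with categorized sample texts and then scores new text against each
// category, mimicking the matching/categorization step described above.
public class ToyKnowledgeBase {
    // category -> terms collected from training samples
    private final Map<String, Set<String>> categoryTerms = new HashMap<>();

    // "Training": associate a sample text with a category.
    public void train(String category, String sampleText) {
        categoryTerms.computeIfAbsent(category, k -> new HashSet<>())
                     .addAll(Arrays.asList(sampleText.toLowerCase().split("\\W+")));
    }

    // "Matching": relevancy = fraction of the text's terms known to the category.
    public double relevancy(String category, String text) {
        Set<String> known = categoryTerms.getOrDefault(category, Collections.emptySet());
        String[] terms = text.toLowerCase().split("\\W+");
        if (terms.length == 0) return 0.0;
        long hits = Arrays.stream(terms).filter(known::contains).count();
        return (double) hits / terms.length;
    }

    // Return the category with the highest relevancy score.
    public String bestCategory(String text) {
        return categoryTerms.keySet().stream()
                .max(Comparator.comparingDouble((String c) -> relevancy(c, text)))
                .orElse(null);
    }

    public static void main(String[] args) {
        ToyKnowledgeBase kb = new ToyKnowledgeBase();
        kb.train("nutrition", "low fat diet healthy food calories");
        kb.train("file systems", "FAT file allocation table disk DOS");
        System.out.println(kb.bestCategory("healthy low fat food")); // nutrition
    }
}
```

In the real system the scoring model is far more sophisticated, and feedback would incrementally update `categoryTerms`; the sketch only shows the train-match-score shape of the workflow.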
Classification Workbench is an application that allows you to create a knowledge base (KB) for use with IBM Classification Module for OmniFind Discovery Edition (ICM), analyze the KB, and evaluate its accuracy using reports and graphical diagnostics. The result is a KB that can be used in conjunction with applications powered by ICM.
Prior to using Classification Workbench, you'll collect pre-categorized sample data (for example, documents) representative of the data you expect to classify using ICM. You'll import this data into Classification Workbench to create a corpus file. Classification Workbench provides a variety of features and techniques that allow you to fine-tune the corpus to optimize KB accuracy. Using the corpus as input, you can create and test the KB. Then you can evaluate the KB using Classification Workbench reports and graphical diagnostics and improve its accuracy by editing the corpus you use to create the KB. The final product is a production-ready KB, for use with ICM-based applications.
An ICM RME KB is represented as a tree of nodes, with each node containing statistical knowledge or rules that assist the system in classifying text. Categories are the names of the nodes in the KB. The simplest way to organize nodes in a KB is a flat knowledge base structure, so that all nodes are on the same level. Classification Workbench builds such KBs automatically from a categorized corpus, and you do not have to explicitly specify its structure. In some cases, you may want to build a hierarchical knowledge base, consisting of nodes at multiple levels in the hierarchy.
One important advantage of an ICM RME KB is the ability to mix rules and statistics. This way you can effectively apply business logic -- external, non-statistical information, usually supplied through metadata -- in the classification process. You can easily craft such a KB using the interactive KB Editor in Classification Workbench. Alternatively, the KB structure can be specified in an external textual format and imported into Classification Workbench.
Figure 2 illustrates a possible hierarchical KB structure. Squares represent rule nodes that work on metadata (for example, "language = French" or "Products = Servers"). Ellipses represent statistical nodes.
Figure 2. KB structure example
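The structure in Figure 2 can be sketched as a small data model. This is a hypothetical illustration, not the actual ICM internals: rule nodes gate entire subtrees on metadata predicates, while statistical nodes are the places where trained classifiers would sit.

```java
import java.util.*;
import java.util.function.Predicate;

// Hypothetical sketch (not the ICM data model): a KB as a tree whose
// rule nodes test document metadata and whose statistical nodes (rule
// == null) would carry trained classifiers, as in Figure 2.
public class KbNode {
    final String category;
    final Predicate<Map<String, String>> rule;   // null => statistical node
    final List<KbNode> children = new ArrayList<>();

    KbNode(String category, Predicate<Map<String, String>> rule) {
        this.category = category;
        this.rule = rule;
    }

    KbNode add(KbNode child) { children.add(child); return this; }

    // Collect the categories of all nodes reachable for a document:
    // a rule node is entered only if its metadata predicate holds.
    void reachable(Map<String, String> metadata, List<String> out) {
        if (rule != null && !rule.test(metadata)) return;
        out.add(category);
        for (KbNode c : children) c.reachable(metadata, out);
    }

    // Example tree mirroring Figure 2's rule/statistical mix.
    public static KbNode sampleKb() {
        KbNode root = new KbNode("EnterpriseTaxonomy", null);
        KbNode french = new KbNode("French",
                m -> "French".equals(m.get("language")));        // rule node
        french.add(new KbNode("Dieting", null));                 // statistical node
        KbNode servers = new KbNode("Servers",
                m -> "Servers".equals(m.get("Products")));       // rule node
        servers.add(new KbNode("File systems", null));           // statistical node
        return root.add(french).add(servers);
    }
}
```

The point of the design is visible even in this sketch: business logic (the predicates) prunes the tree before any statistical matching has to run.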
A typical workflow of using Classification Workbench to create a KB would be:
- Gather pre-categorized data that will form the basis of a corpus.
- Convert this data into a format recognized by Workbench (for example, Workbench recognizes CSV or XML obeying a certain pattern). Writing an application that will already produce the format recognized by Workbench can be a good option.
- Create the KB structure. Workbench recognizes an XML format for KB.
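The second preparation step (converting pre-categorized data into a format Workbench recognizes) might look like the following sketch. The exact column layout Workbench expects is not reproduced here; the simple "category,text" CSV is a stand-in, so consult the Classification Workbench documentation for the real import format.

```java
import java.io.*;
import java.util.*;

// Illustration of the conversion step: write pre-categorized samples
// into a CSV file. The "category,text" layout is a stand-in for the
// real Workbench import format (see the Workbench documentation).
public class CorpusToCsv {
    // Quote a CSV field, doubling embedded quotes (RFC 4180 style).
    static String quote(String field) {
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    // samplesByText maps sample text -> assigned category.
    static String toCsv(Map<String, String> samplesByText) {
        StringBuilder sb = new StringBuilder("category,text\n");
        for (Map.Entry<String, String> e : samplesByText.entrySet()) {
            sb.append(quote(e.getValue())).append(',')
              .append(quote(e.getKey())).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        Map<String, String> samples = new LinkedHashMap<>();
        samples.put("Low-fat diets reduce calories.", "nutrition");
        try (Writer w = new FileWriter("corpus.csv")) {
            w.write(toCsv(samples));
        }
    }
}
```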
Then you'll use Workbench to:
- Import the data and create a corpus file
- Import the KB (if available)
- Edit and categorize corpus items, as required
- Create and analyze a KB, and generate analysis results
- Evaluate KB accuracy by viewing summary reports and graphs. The best way to evaluate KB accuracy is to use the "KB Tune-Up Wizard."
- As required, improve KB accuracy by editing the corpus and retraining
- Export the KB to the IBM Classification Module for OmniFind Discovery Edition (ICM) Server
For the training task, the Classification Workbench reports present a lot of information, both on overall KB accuracy and on a per-category basis. The evaluation should start with verifying the overall KB accuracy by generating the "KB Data Sheet," "KB Summary," and "Cumulative Success" reports. The "KB Data Sheet" highlights potential problems. Measures like "Total cumulative success," "Top performing categories," "Poorest performing categories," "Categories that may be determined by external factors," and "Pairs of categories with overlapping intents" are very useful for gauging general KB accuracy, but they are only indications. The final decision has to be made by the KB administrator who understands the data and the business logic of the project.
"Categories that may be determined by external factors" may indicate that the user should add external information to the documents using metadata and rules to the KB that refer to the metadata.
"Pairs of categories with overlapping intent" may indicate that categories should be redefined, either split the "overlapping" categories into several non-overlapping ones, or combine several categories into one. These are possible indications, but the decision has to be made according to the project data and business logic needs.
If the nature of the data changes over time, the KB accuracy can be verified periodically using the Classification Workbench reporting tools, and the KB can be retrained if needed.
To conclude, IBM Classification Module for OmniFind Discovery Edition (ICM) is a powerful tool that uses natural language processing and sophisticated semantic analysis techniques to analyze and categorize text. ICM works together with an adaptive knowledge base/taxonomy (KB) that uses categories to denote the intent of texts. When text is sent to ICM for matching, the knowledge base data is used to select the category that is most likely to match the text. The KB can be hierarchical and can combine rule-based and statistical information. The Classification Workbench tool allows easy creation, analysis, and tuning of a knowledge base from representative data. Its reporting tools are very powerful, allowing editing and tuning of the data and of the KB to increase the accuracy of the classification.
Using the SchemaLogic Suite, organizations can publish existing taxonomic terms to the Classification Workbench so that the subset of the taxonomy that is imported into the Workbench becomes the KB structure, and hence the set of categories, which the classifier is trained upon.
Thus, the auto-categorizer will tag documents or text streams with enterprise-specific categories that are actively managed within the organization. This ensures that a consistent set of approved terminology is used for auto-categorization.
Search and classification are often integrated in a single system. They fit together nicely for several reasons.
First, they provide complementary mechanisms for describing documents. Search describes the document based on a small set of words supplied by the user (such as the query "fat"), whereas classification attempts to describe the overall document based on a set of descriptors supplied by the taxonomy (for example, in a subject taxonomy, one of the subjects). This means that if a search engine supplies the category to the user, it can be extremely easy for the user to distinguish which search results are really relevant. For example, if the user query is "fat," some of the results will be marked as "dieting" or "nutrition," but others will be marked as "file systems" (because FAT is also File Allocation Table, used by the DOS operating system). A user seeing this mixture of topics can then refine the query to select just the ones intended by this ambiguous query.
Second, the processing of the data required by search and classification (document fetching, tokenization, lemmatization, and so on) is the same to a large extent. Hence, a system that couples them can take advantage of common processing steps.
Search and classification can be paired in a number of ways:
- Search within a category: You can select a category and then search only documents that are both within the category and that match your query.
- Faceted search: In this method, you are allowed to specify several different facets (or characteristics) of a document to a search engine (for example, "search for all PDF documents about databases from last year"). This is actually a generalization of "search within a category," where multiple criteria that may or may not be categories from a taxonomy can be combined.
- Taxonomy browsing: Some or all of the documents on a Web site are displayed as a taxonomy that can be navigated, with each document assigned to one or more nodes of the taxonomy.
- Classifying search results: The results of a search are displayed together with their assigned categories. Categories can be used to group or sort result sets.
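A minimal sketch of "search within a category" on an in-memory document model follows. The classes and data are hypothetical (this is not the OmniFind SIAPI); the point is that the keyword filter and the category filter compose, which is what makes the ambiguous "fat" query from above resolvable.

```java
import java.util.*;
import java.util.stream.*;

// Hypothetical in-memory model (not the OmniFind SIAPI): each document
// carries text plus assigned categories; a query combines a keyword
// with an optional category constraint ("search within a category").
public class FacetedSearch {
    static class Doc {
        final String id, text;
        final Set<String> categories;
        Doc(String id, String text, Set<String> categories) {
            this.id = id; this.text = text; this.categories = categories;
        }
    }

    static List<String> search(List<Doc> docs, String keyword, String category) {
        return docs.stream()
                .filter(d -> d.text.toLowerCase().contains(keyword.toLowerCase()))
                .filter(d -> category == null || d.categories.contains(category))
                .map(d -> d.id)
                .collect(Collectors.toList());
    }

    static List<Doc> sampleDocs() {
        return Arrays.asList(
            new Doc("d1", "low fat diet tips", Collections.singleton("dieting")),
            new Doc("d2", "FAT file system layout", Collections.singleton("file systems")));
    }

    public static void main(String[] args) {
        // The ambiguous query "fat" matches both documents;
        // the category facet disambiguates.
        System.out.println(search(sampleDocs(), "fat", null));      // [d1, d2]
        System.out.println(search(sampleDocs(), "fat", "dieting")); // [d1]
    }
}
```

Faceted search generalizes the second filter to an arbitrary conjunction of facet predicates (document type, date range, taxonomy category, and so on).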
To address these usage scenarios in an optimal way, organizations can leverage the power of all three applications together by using the SchemaLogic Suite to centrally manage the enterprise taxonomy and publish appropriate subsets to both OmniFind and the Classification Workbench. This results in the use of a consistent, actively managed set of semantics for auto-categorization and search, significantly enhancing results and ensuring that these systems are automatically kept up to date with the ever-evolving enterprise taxonomy.
The following section demonstrates how the integration works in practice.
This section gives detailed instructions for using the three systems in concert in the following scenario:
- A taxonomy is centrally managed using the SchemaLogic Suite
- Based on the taxonomy, a KB is trained for auto-classification using ICM
- The taxonomy is deployed within OmniFind, which uses a plug-in in its document-processing pipeline to connect to the Classification Module server and receive classifications from the taxonomy for each document it processes
The description assumes the following software versions: ICM Version 8.3 (previously named "IBM Classification Module for WebSphere® Content Discovery" Version 8.3) and OmniFind Version 8.4. Note that the focus here is on the steps that realize the integration of the three systems; details of the tasks accomplished within the individual tools are not covered.
Step 1: Create an OmniFind collection on which you want to employ auto-classification
Create an OmniFind collection with "rule-based categorization." The configuration option "rule-based categorization" is needed to allow categories obtained later from ICM to be stored in the OmniFind index and to allow the Search Application to browse the category tree.
Step 2: Deploy BNSCategoryAnnotator in OmniFind
As mentioned above, the integration of the ICM server with OmniFind requires an extension module to be loaded into OmniFind. The extensibility of OmniFind is based on the Unstructured Information Management Architecture (UIMA) (see the Resources section for more information). In this architecture, extension modules (UIMA plug-ins) are called annotators. The annotator used here is contained in the UIMA PEAR package BNSCategoryAnnotator.pear (a simplified version of it is attached in the Download section). Figure 3 gives an architectural overview of this integration:
Figure 3. BNSCategoryAnnotator provides the bridge between OmniFind and ICM
The package contains a configuration file BNS.xml (BNSSample.xml in the downloadable version), which contains a number of configuration parameters that need to be set before deploying the plug-in. The most important parameters are listed in Table 1.
Table 1. BNSCategoryAnnotator configuration parameters
|Parameter||Description||Example value|
|ServerURL||The URL of the ICM server||http://127.0.0.1:8081/Listener/mod_gsoap.dll|
|KBName||The name of the KB in ICM||EnterpriseTaxonomy|
|DefaultBodyFieldName||The KB field in ICM that is expected to contain the document body||text|
|MinRelevanceScore||A float between 0 and 1; categories with a relevancy score below this threshold are ignored||0.5|
|MaxCategories||The maximum number of categories that may be assigned to a document||3|
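The interplay of MinRelevanceScore and MaxCategories can be sketched as follows. This is an illustration only; the real filtering happens inside the BNSCategoryAnnotator plug-in.

```java
import java.util.*;
import java.util.stream.*;

// Sketch of what the annotator's filtering parameters mean: drop
// categories scoring below MinRelevanceScore, then keep at most
// MaxCategories of the rest, highest relevancy first.
// (Illustration only; the real BNSCategoryAnnotator does this
// inside the UIMA plug-in.)
public class CategoryFilter {
    static List<String> filter(Map<String, Double> scores,
                               double minRelevanceScore, int maxCategories) {
        return scores.entrySet().stream()
                .filter(e -> e.getValue() >= minRelevanceScore)
                .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
                .limit(maxCategories)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, Double> scores = new HashMap<>();
        scores.put("dieting", 0.92);
        scores.put("nutrition", 0.81);
        scores.put("sports", 0.55);
        scores.put("file systems", 0.12);   // below the 0.5 threshold
        System.out.println(filter(scores, 0.5, 3)); // [dieting, nutrition, sports]
    }
}
```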
We recommend that you use the Eclipse-based Component Descriptor Editor that comes with the UIMA SDK to adapt these parameters as required by your application (how to install the UIMA SDK Eclipse tooling and how to use the editor is described in the UIMA SDK User's Guide and Reference; see also the Resources section). At a minimum, the ServerURL parameter needs to be adapted to your ICM server installation so that the annotator can connect to the ICM server. Also, in the simplified version, the CategoryDirectory parameter needs to be set to the following path, which contains the CategoryTree.xml file on the OmniFind controller node: <ES_NODE_ROOT>/master_config/<CollectionId>.parserdriver/ (replace <ES_NODE_ROOT> with the value of the respective environment variable when logged on as the OmniFind administrator; to find the <CollectionId>, go to the collection's General tab). For DefaultBodyFieldName, you can choose some name, like "text," "body," or "contents." This name must be used again in the Classification Workbench, where you have to choose a field that contains the document text of your training data. Finally, the value for KBName should be left at the default (empty string). This ensures that the name of the taxonomy's root node is taken as the KBName, which is the case for KBs developed with Workbench.
Figure 4. Editing parameters in BNS.xml using UIMA SDK's Component Descriptor Editor
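For orientation, the values from Table 1 would appear in the descriptor's parameter settings roughly like this. This is an abbreviated sketch using the standard UIMA descriptor elements; the actual BNS.xml contains further parameters and the surrounding descriptor structure.

```xml
<configurationParameterSettings>
  <nameValuePair>
    <name>ServerURL</name>
    <value><string>http://127.0.0.1:8081/Listener/mod_gsoap.dll</string></value>
  </nameValuePair>
  <nameValuePair>
    <name>MinRelevanceScore</name>
    <value><float>0.5</float></value>
  </nameValuePair>
  <nameValuePair>
    <name>MaxCategories</name>
    <value><integer>3</integer></value>
  </nameValuePair>
</configurationParameterSettings>
```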
Then the PEAR package must be uploaded onto the OmniFind controller node. In the OmniFind administration console, use the System:Parse page in Edit mode to add the PEAR package as a new text analysis engine. Please refer to the tutorial "Semantic Search with UIMA and OmniFind" (developerWorks, December 2006) for details about deploying and using custom analysis engines with OmniFind.
Step 3: Associate the custom analysis engine with your OmniFind collection
To have the collection use the auto-classifier, it needs to be associated with the new text analysis engine. This setting is available in the Text Processing Options page for your collection (see the collection's Parse page in Edit mode). More details about configuring text processing can be found in the tutorial "Semantic Search with UIMA and OmniFind."
Step 4: Create and publish a taxonomy with SchemaLogic Enterprise Suite
In the described setup, you need to publish a taxonomy both to ICM and OmniFind. Publication to ICM is done using the SchemaLogic Adapter for ICM, and publication to OmniFind is done using the SchemaLogic Adapter for OmniFind. The configuration and running of those adapters are done through the Workshop UI and the Integration Service. Use the CSV format in the ICM adapter to publish a taxonomy (subset) for ICM, and publish the same taxonomy subset directly to OmniFind.
Configuration of the adapters includes specifying:
- The directory where the CSV file for the ICM server is written, and the connection information for the OmniFind controller node, respectively
- The taxonomy or taxonomy subset in the SchemaLogic modeling server to be published to ICM and OmniFind
- Any terms that should be excluded or included based on term attributes or term relationship types
Figure 5 shows how the SchemaLogic Adapter for OmniFind can be configured:
Figure 5. Editing configuration settings in the SchemaLogic Adapter for OmniFind
The adapters can be run by any of the following methods:
- A manual process, where an administrator executes the publication from the Workshop UI
- A scheduled process, where publication is configured to occur with a specified frequency
- A Web services call to the Adapter made by another application or system
After successful publication, the taxonomy is available as a KB (structure only) for ICM and as a category tree that can be browsed in OmniFind.
Step 5: In Classification Workbench, import the taxonomy as a KB structure and train it for auto-classification
Using the Import Wizard, select the option Knowledge base for what to import, and the option KB configuration for what type of knowledge base. Provide the path of the configuration file describing the KB structure in the following screen, and click Finish (for details, refer to the section "Importing and Exporting a KB Structure" in the Classification Workbench User's Guide).
To get a high-quality classifier, it is important to carefully select enough training samples for each category that should be recognized in the taxonomy. A training sample needs to be pure text, extracted from a sample document of the category in question.
Because the OmniFind/ICM integration requires at categorization time that all document text be provided within a single field (of NLP usage type Body), each training sample should be formed in the same way: all document text contained within one fixed field of type Body. To simplify the extraction of document text, the OmniFind/ICM integration can be run in "training mode" (not included in the sample annotator). This mode considerably simplifies the task of collecting and preprocessing training data for KB training with the Workbench, because you can use OmniFind crawlers to fetch training documents and the OmniFind parser for document preprocessing and content extraction, in the same way documents would be preprocessed for categorization.
For the training itself, import the training samples into Workbench and make sure that the categories associated with the samples are correct. Please refer to the Workbench documentation for the details on how to train a KB.
Step 6: Export the taxonomy to the ICM server
When satisfied with the classification quality of the trained KB, you need to export the KB to the ICM server.
You deploy the KB to the ICM server using the Export Wizard: Select Knowledge base for what to export, and use the KB format "IBM Classification Module."
This export step needs to be repeated each time you change anything in the KB structure, such as when you add a category or change a name. When you maintain the taxonomy within the SchemaLogic Suite, you will not perform such changes locally within Workbench, but rather on the original taxonomy, which you then re-import into Workbench before re-training.
Now, OmniFind is ready to process documents. Start crawling and parsing and build an index. The sample OmniFind Search Application lets you browse through the taxonomy to view the documents associated with any given category. You can also use the Search and Indexing API (SIAPI) to formulate sophisticated queries with category restrictions. Note that category constraints need to specify the string "rulebased" as the taxonomy ID in that case.
Whenever the taxonomy changes, steps 4 (publish to ICM and OmniFind), 5 (import into Workbench and re-train), and 6 (export to ICM server) must be repeated.
Note that taxonomy changes may invalidate any categorization of documents processed by OmniFind previously. Hence, whenever you update the taxonomy, categories stored in the OmniFind index for a document may be wrong until you re-process (in other words, re-crawl, re-parse, and re-index) that document.
This article has motivated the use of
- Centrally maintained and consolidated taxonomies and
- Automatic text classification
for enterprise search applications. It has shown how to set up and use the three-fold integration of OmniFind combined with both SchemaLogic Enterprise Suite to address the first item and IBM Classification Module for OmniFind Discovery Edition to address the second. This integration exploits the UIMA plug-in architecture that is built into OmniFind, and a version of the required plug-in is provided as sample code.
|Sample annotator to connect OmniFind to ICM||BNSCategoryAnnotatorSample.zip||3.7MB||HTTP|
- SchemaLogic® home page: Find more information on SchemaLogic.
- OmniFind Enterprise Edition product home page: Find more information on OmniFind.
- IBM Classification Module for OmniFind Discovery Edition home page: Find more information on Classification Module for OmniFind Discovery Edition.
- Online documentation for OmniFind products: Find information about installing, administering, and developing content integration and enterprise search and discovery solutions.
- "Semantic Search with UIMA and OmniFind" (developerWorks, December 2006): This tutorial is a good starting point for learning how to use custom text analysis and semantic search in IBM OmniFind Enterprise Edition.
- ANSI/NISO Z39.19-2005, Guidelines for the Construction, Format and Management of Monolingual Controlled Vocabularies (the ANSI/NISO standard for thesauri): Find guidelines and conventions for the contents, display, construction, testing, maintenance, and management of monolingual controlled vocabularies.
- Resource Description Framework (RDF) standards of the World Wide Web Consortium (W3C): An integration of a variety of applications, from library catalogs and world-wide directories to syndication and aggregation of news, software, and content to personal collections of music, photos, and events, using XML as an interchange syntax.
- OWL Web Ontology Language: A W3C Recommendation for describing ontologies.
- developerWorks resource page for IBM OmniFind: Find articles and tutorials and connect to other resources to expand your OmniFind skills.
- Unstructured Information Management Architecture (UIMA) SDK: Learn more about UIMA. This Java SDK supports the implementation, composition, and deployment of applications working with unstructured information.
- developerWorks Information Management zone: Learn more about DB2. Find technical documentation, how-to articles, education, downloads, product information, and more.
- Stay current with developerWorks technical events and webcasts.
- Technology bookstore: Browse for books on these and other technical topics.
Get products and technologies
- UIMA SDK: The free UIMA SDK comes as a self-extracting installer for Windows and Linux or a zip file for all other platforms.
- The full BNSCategoryAnnotator.pear includes a "training mode" and is not limited to only one collection. It is available from the OmniFind EMEA Center of Excellence as part of a service engagement, which you can inquire about by e-mail.
- Build your next development project with IBM trial software, available for download directly from developerWorks.
- Participate in the discussion forum.
- Check out developerWorks blogs and get involved in the developerWorks community.
Dr. Jochen Dörre is a Software Engineer at IBM Böblingen Laboratory with a background in text search and text mining technology. He joined IBM in 1997 and has worked on several software development projects in those fields, specializing in text categorization, text analytics integration, search over XML documents, as well as core search engine design and performance issues. Prior to joining IBM, Jochen worked in natural language processing research for several years. He received his PhD from the University of Stuttgart. Jochen is a member of the World-Wide Web Consortium (W3C) XQuery Working Group, where he co-develops the extension of the XML query language XQuery with full-text search operations.
Josemina Magdalen is a Software Development Team Leader at the Israel Software Group (ILSL). She has a background in Natural Language Processing (text classification and search, as well as text mining technologies). Josemina joined IBM in 2005 and has worked in the Content Discovery Engineering Group on software development projects in text categorization and search, as well as text analytics. Prior to joining IBM, Josemina worked in Natural Language Processing research and development (machine translation, text classification and search, data mining) for over ten years. Josemina is working on her PhD at the Hebrew University of Jerusalem.
Wendi Pohs has designed and developed taxonomy and search applications for large organizations for the past 20 years. She has served on development teams for Lotus Development Corporation's Notes/Domino and Discovery Server products, and most recently managed Search and Taxonomy Integration for IBM's Corporate Intranet, w3. Author of a book on knowledge management practices, she specializes in advanced taxonomy applications, built with an experienced practitioner's point of view. As CTO of Infoclear Consulting, Wendi currently provides taxonomy consulting services to a large government contractor, a major news provider, a leading financial institution, and an innovative public health Web site.
Bob St. Clair is a Senior Product Manager for SchemaLogic responsible for products integrating the SchemaLogic Enterprise Suite with other enterprise systems. Since joining SchemaLogic in 2005, he has designed and built several taxonomy and metadata integration solutions with Search, Portal, and Enterprise Content Management products. Prior to joining SchemaLogic, he worked for Corbis, one of the largest stock photo companies in the world, where he designed thesaurus construction, content cataloging, and Media Asset Management systems. Bob holds a Master of Library and Information Science degree from the University of Washington in Seattle.