Organizations are facing substantial increases in the volume and variety of information. At the same time, compliance and discovery requirements are becoming more complex. Businesses need to develop comprehensive information retention management strategies and are starting by addressing their e-mail and file systems. Yet in deploying these solutions to harness this explosion, IBM customers have frequently found that new burdens are placed on their end users to organize, categorize, and generally classify their e-mails and files. Requiring full user participation in content collection is expensive and inaccurate, creating barriers to long-term adoption of these critical solutions. To address this customer pain, IBM has integrated advanced content classification (provided by the IBM Classification Module) with its modular content collection capabilities.
Overview of IBM Content Collector, Version 2.1
IBM Content Collector, Version 2.1 is an archiving solution for e-mails and other digital content. With IBM Content Collector, you can create task routes to automatically schedule archiving of enterprise user e-mails or digital content on shared drives. IBM Content Collector, Version 2.1 can archive e-mails from IBM Lotus® Domino® and Microsoft® Exchange servers, and documents from file system folders. Archived content is stored in one of two supported IBM ECM repositories: IBM FileNet® P8 or IBM Content Manager. IBM Content Collector, Version 2.1 also provides client plug-ins for the Microsoft Outlook and Lotus Notes® client applications so that e-mail users can mark specific e-mails for archiving, and search and retrieve information from archived e-mails. Refer to the Content Collector product home page to learn more about the product capabilities (see Resources).
Overview of IBM Classification Module, Version 8.6
IBM Classification Module automates the organization of unstructured content by analyzing the full text of documents and e-mails. With automatic classification, you accelerate the time to value from your ECM investments by relieving employees of time-consuming and costly decision-making tasks. IBM Classification Module can engage a predefined knowledge base to analyze the contents of file system folders and e-mails to assign a list of possible categories (which represent how the e-mails are to be classified) and relevancy scores (which represent the confidence with which an e-mail is assigned to a category). Refer to the Classification Module product home page to learn more about the product capabilities (see Resources).
IBM Content Collector can leverage text analytic results from IBM Classification Module to make intelligent decisions on captured content.
For example, to meet compliance and record management initiatives, or to prepare collections of data for legal discovery, IBM Classification Module can distinguish important e-mails from e-mails that have no business value. Based on the content analysis, IBM Content Collector determines the appropriate action to be taken. An e-mail that discusses a patent application might be copied to an IBM FileNet P8 folder and be declared as a record in IBM FileNet Records Manager. In contrast, an e-mail that discusses patent leather might be filtered out and not archived.
This article provides instructions for integrating IBM Classification Module with the e-mail archiving task routes of IBM Content Collector so that you can leverage the classification capabilities for archival processes.
IBM Content Collector provides a user interface to define task routes that can archive e-mails from Microsoft Exchange or Lotus Domino e-mail servers, or documents from file system directories. The defined task routes can be scheduled to archive the digital contents periodically.
Typically, a task route consists of a collector source for e-mails or file system documents that follows a series of task processes to:
- Parse the e-mails or documents to create or update metadata properties
- Invoke algorithms to detect duplicates
- Stub the original e-mails and documents with links to the archived copies
- Store the e-mail or document instance into the supported ECM repositories
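The four task processes listed above can be pictured as a small pipeline. The following is a minimal sketch in plain Python, not IBM Content Collector code: the item fields, the `archive://` link scheme, the in-memory repository, and the use of a SHA-256 content hash for duplicate detection are all assumptions made for illustration. The article does not describe how the product detects duplicates.

```python
import hashlib


def parse_metadata(item):
    # Create simple metadata properties from the raw item (hypothetical fields).
    item["metadata"] = {
        "subject": item.get("subject", ""),
        "size": len(item["content"]),
    }
    return item


def detect_duplicate(item, seen_hashes):
    # Assumption: a duplicate is an item whose content hash was seen before.
    digest = hashlib.sha256(item["content"]).hexdigest()
    item["hash"] = digest
    item["duplicate"] = digest in seen_hashes
    seen_hashes.add(digest)
    return item


def stub(item):
    # Replace the original with a link to the archived copy (hypothetical scheme).
    item["stub"] = "archive://" + item["hash"]
    return item


def store(item, repository):
    # Store one instance per distinct content hash in a dict-based repository.
    if not item["duplicate"]:
        repository[item["hash"]] = item["content"]
    return item


def run_task_route(items):
    # Apply the task processes in the order given in the list above.
    seen_hashes, repository, results = set(), {}, []
    for item in items:
        item = parse_metadata(item)
        item = detect_duplicate(item, seen_hashes)
        item = stub(item)
        item = store(item, repository)
        results.append(item)
    return results, repository
```

In this sketch, two items with identical content share one stored instance and one stub link, which is the effect that duplicate detection is meant to achieve in an archive.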
In addition, IBM Content Collector can leverage the content classification capabilities of IBM Classification Module to:
- Archive e-mails or documents into predefined folders. For faster discovery of information from archived content, it is important to organize the contents into a known set of predefined folders.
- Identify mission-critical e-mails or documents and declare them as records. To address any third-party litigation or compliance initiatives, it is necessary to identify digital content that might relate to enterprise policies or business missions.
- Populate metadata of the archived instance with category names that might be configured to support fielded searches and the faceted classification of search results. (For more information on faceted classification, see Resources.)
Figure 1 illustrates a sample task route defined in IBM Content Collector, where the "top relevancy score" from IBM Classification Module is used to mark the e-mail as a record or simply archive it into an ECM folder:
Figure 1. Sample task route for record and folder classification
This article assumes that the following software prerequisites are already met:
- IBM Content Collector, Version 2.1 is installed on a server.
- IBM Classification Module, Version 8.6 is installed on a server.
- A knowledge base is deployed on IBM Classification Module. (A sample knowledge base is provided in the Downloads section. To create a knowledge base from a subset of known digital content, refer to the IBM Classification Module Information Center; see Resources.)
Configure the IBM Classification Module server
This section provides the required integration instructions for leveraging the classification capabilities of IBM Classification Module with the detailed archiving features provided by IBM Content Collector. For better scaling and performance, install these products on separate servers.
Assuming that you previously installed IBM Classification Module, Version 8.6 on a server, use the following procedures to configure the system to classify e-mails.
Install Microsoft Office Outlook 2003 or Microsoft Office Outlook 2007
The e-mail archiving filters in IBM Classification Module leverage the Messaging Application Programming Interface (MAPI) for parsing e-mails from a Microsoft Exchange server. To configure this support, you must:
- Install Microsoft Office Outlook 2003 or Microsoft Office Outlook 2007 on the server that hosts IBM Classification Module.
- As shown in Figure 2, select Microsoft Office Outlook as the default e-mail application in your Web browser:
Figure 2. Select Microsoft Office Outlook as the default e-mail application
Configure the IBM Classification Module server for e-mail archiving
In this scenario, IBM Content Collector archives e-mails from a Microsoft Exchange server, and features of IBM Classification Module are used to classify the e-mails.
Follow these instructions to tune the IBM Classification Module server for archiving e-mails:
- Log on to the server hosting IBM Classification Module version 8.6.
- Stop the IBM Classification Module services:
  - Launch Windows services.
  - Stop the service labeled "IBM Classification Module Process Manager".
  - Stop the service labeled "IBM Classification Module Trace".
- If e-mails are being archived, overwrite the default document filter:
  - Open a DOS command window.
  - Change to the C:\IBM\ClassificationModule\Filters directory and enter the following commands:
      copy docFilterManager.xml docFilterManager.xml.orig
      copy docFilterManager.email.xml docFilterManager.xml
- Start the IBM Classification Module services:
  - Launch Windows services.
  - Start the service labeled "IBM Classification Module Process Manager".
  - Start the service labeled "IBM Classification Module Trace".
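If you repeat the filter swap on several servers, the two copy commands in the procedure above can be scripted. The following is a minimal Python sketch, not an IBM-provided tool. The function name `enable_email_filter` is hypothetical, and the directory is passed in as a parameter rather than hard-coded. Unlike the plain copy commands, the sketch keeps an existing backup, so running it twice does not overwrite the saved default filter. Stop the services before you run it and start them afterward, as described in the procedure.

```python
import shutil
from pathlib import Path


def enable_email_filter(filters_dir):
    """Back up the default document filter and activate the e-mail filter."""
    filters = Path(filters_dir)
    active = filters / "docFilterManager.xml"
    backup = filters / "docFilterManager.xml.orig"
    email = filters / "docFilterManager.email.xml"

    if not email.is_file():
        raise FileNotFoundError(email)

    # Keep the first backup so that a second run cannot replace the
    # saved default filter with the e-mail filter.
    if not backup.exists():
        shutil.copy2(active, backup)

    # Equivalent of: copy docFilterManager.email.xml docFilterManager.xml
    shutil.copy2(email, active)
    return backup
```

A call might look like `enable_email_filter(r"C:\IBM\ClassificationModule\Filters")`, using the directory named in the procedure.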
Launch the IBM Classification Module Management Console to load and start a knowledge base
If you previously created a knowledge base, launch the management console to load and start the knowledge base instance. Figure 3 shows the management console used to start and stop the knowledge base instance. Please refer to the IBM Classification Module Information Center for specific details on uploading, launching, starting, and stopping an existing knowledge base (see Resources).
Figure 3. Management console
Configure the IBM Content Collector Server
Assuming that you previously installed IBM Content Collector, Version 2.1 on a server, use the following procedures to configure the system to use IBM Classification Module as a utility connector.
Install the IBM Classification Module client modules
For IBM Content Collector to leverage the classification capabilities of IBM Classification Module, Version 8.6, you must install client components on the IBM Content Collector server:
- Run the IBM Classification Module installation program on the IBM Content Collector server.
- Select the option to install Custom components.
- Select the Classification Module Client only check box.
Figure 4 illustrates the IBM Classification Module software that must be installed on the IBM Content Collector server:
Figure 4. IBM Classification Module software components
Note: If you plan to reorganize e-mails in an IBM FileNet P8 repository or extract e-mails to build a new knowledge base, you can also install the IBM FileNet P8 integration component. This article does not discuss the capabilities of integrating IBM Classification Module with IBM FileNet P8. Please refer to the IBM Classification Module Information Center to learn about the IBM FileNet P8 integration components (see Resources).
Register IBM Classification Module as a utility connector
- After the client software is installed, copy the following libraries from the C:\IBM\classificationModule\Bin directory to the C:\Program Files\IBM\ContentCollector\ctms directory. Open a DOS command window and enter these commands (the quotation marks are required because the target path contains a space):
    copy C:\IBM\classificationModule\Bin\PackageDll23.dll "C:\Program Files\IBM\ContentCollector\ctms\PackageDll23.dll"
    copy C:\IBM\classificationModule\Bin\stlport_ban46.dll "C:\Program Files\IBM\ContentCollector\ctms\stlport_ban46.dll"
    copy C:\IBM\classificationModule\Bin\bnsClient86.dll "C:\Program Files\IBM\ContentCollector\ctms\bnsClient85.dll"
  Note: You must copy the original bnsClient86.dll file as bnsClient85.dll.
- Register IBM Classification Module as a utility connector. To do so, open a DOS command window and enter these commands:
    cd "C:\Program Files\IBM\ContentCollector\ctms"
    utilityConnector.exe -u
    utilityConnector.exe -r
- Launch the IBM Content Collector configuration manager to confirm that IBM Classification Module exists as one of the utility connectors. Figure 5 shows the configuration manager with IBM Classification Module successfully recognized as a utility connector:
Figure 5. IBM Classification Module as a utility connector
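The library copy step in the procedure above is easy to get wrong because one file changes its name on the way. The following minimal Python sketch, which is an illustration and not an IBM-provided tool, keeps the source and target names in one table and refuses to copy anything if a source file is missing. The function name `copy_client_libraries` is hypothetical; the file names are the ones listed in the procedure.

```python
import shutil
from pathlib import Path

# Source file name -> target file name, as listed in the procedure above.
CLIENT_LIBRARIES = {
    "PackageDll23.dll": "PackageDll23.dll",
    "stlport_ban46.dll": "stlport_ban46.dll",
    "bnsClient86.dll": "bnsClient85.dll",  # renamed on copy, per the note
}


def copy_client_libraries(source_dir, target_dir):
    """Copy the client libraries, renaming bnsClient86.dll on the way."""
    source, target = Path(source_dir), Path(target_dir)

    # Check everything first so that a partial copy is never left behind.
    missing = [name for name in CLIENT_LIBRARIES if not (source / name).is_file()]
    if missing:
        raise FileNotFoundError("missing libraries: " + ", ".join(missing))

    copied = []
    for source_name, target_name in CLIENT_LIBRARIES.items():
        shutil.copy2(source / source_name, target / target_name)
        copied.append(target_name)
    return copied
```

Because `pathlib` handles the paths, the space in "Program Files" needs no special quoting here.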
Add IBM Classification Module to a task route
Note: Before you start this task, ensure that the IBM Classification Module server is started and that a knowledge base instance is running on the remote IBM Classification Module server.
To add an IBM Classification Module process to a task route, select the IBM Classification Module task from the Utility section (see Figure 5), and place it in an existing task route.
Note: To get started, you can import one of the templates provided in the Downloads section of this article, or you can select a template that is bundled with IBM Content Collector.
After the IBM Classification Module task is placed, configure the task process:
- Specify the host name of the server that hosts the IBM Classification Module server.
- Specify the port of the IBM Classification Module listener component (the default port is 18087).
- Click on the explore button to retrieve the list of available knowledge bases.
- Select the knowledge base that you want to use for analyzing the e-mails.
- Click the explore button to retrieve the list of available content fields.
- Select the content field that you want to use to represent the content that is to be analyzed.
Note: This field should be Document because e-mail is a binary file.
Figure 6 shows a sample IBM Classification Module configuration:
Figure 6. IBM Classification Module configuration as a task process
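If the explore button returns no knowledge bases, a quick first check is whether the listener port can be reached from the IBM Content Collector server at all. The following minimal Python sketch opens and closes a TCP connection to the host and port that you entered in the task configuration. It uses only the standard `socket` module; the function name `listener_reachable` is hypothetical. A successful connection shows only that something is listening on the port. It does not confirm that a knowledge base instance is running.

```python
import socket


def listener_reachable(host, port=18087, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection resolves the host name and applies the timeout.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

The default of 18087 is the default listener port stated in the steps above; pass a different value if your installation uses another port.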
Apply classification scenarios in task routes
IBM Classification Module provides language processing technology that can analyze digital content and apply a predefined taxonomy to discover possible categories and relevancy scores. This technology can be applied to e-mails and file system documents.
The top category name and relevancy score can be used in the IBM Content Collector task route to perform any or all of the following actions:
- Assign a folder path for the archived e-mails or documents on file system: Archiving e-mails or documents on file systems into known folders simplifies the task of finding e-mails or documents in support of discovery for potential litigation.
- Identify e-mails that should be declared as records: To ensure enterprise compliance with record retention and disposition policies, it is important to analyze e-mails for business-sensitive data and declare such e-mails as records.
- Populate metadata properties: The metadata of the archived e-mails can be populated with the top category names or top relevancy scores, which can later help with fielded searches and the faceted classification of search results.
The following sections demonstrate sample task routes that can perform each of the actions described above.
Assign a folder path for the archived e-mails or documents on file system
In any archival solution, it is necessary to organize the archived contents to simplify later discovery. The top category names discovered by IBM Classification Module can be used to organize the archived e-mails or documents into known folders. In this section, a sample task route shows how IBM Classification Module is used for organizing digital content into known IBM FileNet P8 folders.
IBM Classification Module applies statistical text analysis on the contents of the e-mail (body, attachments, and address fields) to assign pairs of relevancy scores and category names. As shown in Figure 7, the Most Relevant Category is assigned as the "Folder path" destination for the archived e-mails:
Figure 7. Use IBM Classification Module metadata to set folder path
Likewise, IBM Classification Module applies statistical text analysis on the contents of file system documents to assign pairs of relevancy scores and category names. As shown in Figure 8, the Most Relevant Category is assigned as the "Folder path" destination for the archived documents.
Figure 8. Use IBM Classification Module metadata to set folder path
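The mapping from classification results to a folder path can be sketched as follows. This is a minimal Python illustration of the idea, not the product's implementation. It assumes that the results arrive as (category name, relevancy score) pairs with scores between 0 and 1. The `/Archive` root and the `Unclassified` fallback folder are hypothetical names.

```python
def most_relevant(results):
    """Return the (category, score) pair with the highest score, or None."""
    if not results:
        return None
    return max(results, key=lambda pair: pair[1])


def folder_path(results, root="/Archive", fallback="Unclassified"):
    """Build the destination folder path from the most relevant category."""
    top = most_relevant(results)
    # Items without any category go to a fallback folder (an assumption).
    category = top[0] if top else fallback
    return root + "/" + category
```

For example, `folder_path([("HR", 0.42), ("Legal", 0.91)])` selects the `Legal` category because it has the highest score.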
Identify e-mails that should be declared as records
In addition to archiving the e-mails into a meaningful folder path for easy discovery, you can use the "top relevancy score" assigned to the "most relevant category" as a threshold to mark an e-mail as a record. As illustrated in Figure 9, a "decision point" is inserted in the task route and the following rules are defined:
- Top Score > 70%: If the top relevancy score is greater than 70%, then assign the "Folder path" of the archived e-mail as the "Most relevant category" and declare the e-mail as a record.
- Top Score < 70%: If the top relevancy score is less than 70%, then assign the "Folder path" of the archived e-mail as the "Most relevant category" without declaring the e-mail as a record.
The above rules help with archiving e-mails into known folders and marking e-mails as records for ensuring enterprise compliance in case of litigation.
Figure 9. Use IBM Classification Module metadata to set folder path and declare records
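The decision point described above reduces to a single comparison. The following minimal Python sketch mirrors the two rules. It assumes scores are expressed as fractions, so 70% is 0.70, and the function name `decide` is hypothetical. The rules as stated do not say what happens at exactly 70%. This sketch treats that case as "not a record", which is an assumption rather than documented product behavior.

```python
RECORD_THRESHOLD = 0.70  # the 70% threshold used in the rules above


def decide(category, top_score, threshold=RECORD_THRESHOLD):
    """Return the folder path and whether to declare the e-mail as a record."""
    return {
        # Both rules file the e-mail under the most relevant category.
        "folder_path": category,
        # Only a score strictly above the threshold declares a record.
        "declare_record": top_score > threshold,
    }
```

With these rules, `decide("Patents", 0.85)` files the e-mail under `Patents` and declares it as a record, while `decide("Patents", 0.40)` only files it.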
Populate metadata properties
The information discovered by IBM Classification Module can also be used to populate metadata properties for the archived e-mails. In IBM Content Collector, the e-mails are archived into an IBM FileNet P8 or IBM Content Manager (CM8) repository. Each of these repository types provides search capabilities. To facilitate better discovery by search clients, metadata that is populated by information discovered by IBM Classification Module can be used in fielded searches and in the faceted display of search results.
Fielded searches: In a fielded search, you can specify a query to return e-mails with a specific term in a specific e-mail attribute. For example, you might search for e-mails with the term "HR" in the e-mail attribute called "Subject". As shown in Figure 10, IBM Classification Module can be used to append the value of the "Subject" field of the e-mail with the "Most relevant category" name:
Figure 10. Use IBM Classification Module metadata to populate attributes
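To make the fielded search idea concrete, the following minimal Python sketch appends a category name to a subject and then filters a list of e-mails by a term in a named field. It is an illustration only: the bracketed suffix format and both function names are hypothetical, and a real search client would query the repository's index instead of scanning a list.

```python
def tag_subject(subject, category):
    """Append the most relevant category name to the subject value."""
    return f"{subject} [{category}]"


def fielded_search(emails, field, term):
    """Return the e-mails whose given field contains the term, ignoring case."""
    term = term.lower()
    return [email for email in emails if term in email.get(field, "").lower()]
```

A search for "HR" in the "Subject" field then also finds e-mails whose original subject never mentioned HR but whose content was classified under that category. Note that this simple substring match would also find unrelated words that contain the letters "hr".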
Faceted classification of search results: After a free text search is performed, it can be overwhelming to sort through the e-mails for the e-mail that you are looking for. Displaying a large number of search results by known facets is a growing technique for narrowing the set of search results that gets displayed. IBM Classification Module can help by populating e-mail fields that search clients can use to return a faceted display of the results. As shown in Figure 11, the "Document Title" of the archived e-mail is populated with the "Most relevant category" name, which enables the search client to leverage the value of this field and provide a classified view of the search results:
Figure 11. Use IBM Classification Module metadata to populate attributes for faceted classification results
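A faceted display comes down to counting the search results per field value and letting the user narrow the list to one value. The following minimal Python sketch shows both steps. It assumes that each result is a dictionary whose "Document Title" field holds the category name, as in the scenario above. The function names and the `Unclassified` label are hypothetical.

```python
from collections import Counter


def facet_counts(results, field="Document Title"):
    """Count the search results per facet value."""
    return Counter(result.get(field, "Unclassified") for result in results)


def narrow(results, value, field="Document Title"):
    """Keep only the results that belong to the selected facet value."""
    return [result for result in results if result.get(field) == value]
```

A search client could show the counts next to each category name and call `narrow` when the user selects one.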
In this article, you have learned the importance of classification technologies in an archival solution. With automatic classification, you accelerate time to value from your ECM investments by relieving employees of time-consuming and costly decision-making tasks. You have also learned the instructions required to integrate IBM Classification Module, Version 8.6 with IBM Content Collector, Version 2.1.
Downloads
| Description | Name | Size | Download method |
|---|---|---|---|
| Sample task route templates | Sample_task_routes.zip | 59KB | HTTP |
| Sample knowledge base | KBIOD2007.kb | 443KB | HTTP |
Resources
Learn
- IBM Classification Module: Learn more about IBM Classification Module.
- IBM Classification Module Information Center: Find information about installing, administering, and developing applications for IBM Classification Module.
- IBM Content Collector: Learn more about IBM Content Collector.
- IBM E-mail Archive and eDiscovery Solution Information Center: Find product documentation and other information for IBM Content Collector.
- "Add automatic content classification to your IBM FileNet P8" (developerWorks, January 2008): Follow step-by-step instructions for performing the seamless integration between IBM Classification Module and IBM FileNet P8, and learn how to automate the content classification in the integrated environment.
- "Leverage taxonomies for enterprise search using IBM OmniFind, IBM Classification Module, and SchemaLogic" (developerWorks, February 2007): Employ professional tools for taxonomy management and auto-classification to enhance enterprise search solutions.
- Faceted classification (Wikipedia): Learn more about faceted classification.
- developerWorks Information Management zone: Learn more about Information Management. Find technical documentation, how-to articles, education, downloads, product information, and more.
- Stay current with developerWorks technical events and webcasts.
- Technology bookstore: Browse for books on these and other technical topics.
Get products and technologies
- Build your next development project with IBM trial software, available for download directly from developerWorks.
Discuss
- Participate in developerWorks blogs and get involved in the developerWorks community.

Srinivas Varma Chitiveli is a lead software engineer at IBM. He has been involved with IBM products that deal with content analytics and discovery. Enterprise content crawling, archiving, classification, entity extraction, and search are technologies where Srinivas has contributed as a development lead and customer advocate.

Josemina Magdalen is a software development team lead at Israel Software Group (ILSL). She has a background in Natural Language Processing (text classification and search, as well as text mining technologies). Josemina joined IBM in 2005 and has worked in the Content Discovery Engineering Group on software development projects in text categorization, search, and text analytics. Prior to joining IBM, Josemina worked in Natural Language Processing research and development (machine translation, text classification and search, and data mining) for over ten years.