Apply IBM Classification Module to e-mail archiving

Gain an understanding of IBM® Classification Module, Version 8.6 and its integration with IBM Content Collector, Version 2.1. This solution can be deployed to classify e-mails and archive them into easily discoverable folders of an enterprise content management (ECM) system. If the content is classified for mission-critical information, then the e-mails can be declared as records to be under the control of a records management system. Learn how to install IBM Classification Module, Version 8.6, define extendable task routes for IBM Content Collector, Version 2.1, automate the classification of e-mails into folders, and mark the e-mails for record declaration.

Srinivas Varma Chitiveli (schitive@us.ibm.com), Software Engineer, IBM

Srinivas Varma Chitiveli is a lead software engineer at IBM. He has been involved with IBM products that deal with content analytics and discovery. Enterprise content crawling, archiving, classification, entity extraction, and search are technologies where Srinivas has contributed as a development lead and customer advocate.



Josemina Magdalen (josemina@il.ibm.com), Software Engineer, IBM

Josemina Magdalen is a software development team lead at Israel Software Group (ILSL). She has a background in Natural Language Processing (text classification and search, as well as text mining technologies). Josemina joined IBM in 2005 and has worked in the Content Discovery Engineering Group on software development projects in text categorization, search, and text analytics. Prior to joining IBM, Josemina worked in Natural Language Processing research and development (machine translation, text classification and search, and data mining) for over ten years.



16 December 2008

Introduction

Organizations are facing substantial increases in the volume and variety of information. At the same time, compliance and discovery requirements are becoming more complex. Businesses need to develop comprehensive information retention management strategies and are starting by addressing their e-mail and file systems. Yet in deploying these solutions to harness this explosion, IBM customers have frequently found that new burdens are placed on their end users to organize, categorize, and generally classify their e-mails and files. Requiring full user participation in content collection is expensive and inaccurate, creating barriers to long-term adoption of these critical solutions. To address this customer pain, IBM has integrated advanced content classification (provided by the IBM Classification Module) with its modular content collection capabilities.


Solution introduction

Overview of IBM Content Collector, Version 2.1

IBM Content Collector, Version 2.1 is an archiving solution for e-mails and other digital content. With IBM Content Collector, you can create task routes to automatically schedule archiving of enterprise user e-mails or digital content on shared drives. IBM Content Collector, Version 2.1 can archive e-mails from IBM Lotus® Domino® and Microsoft® Exchange servers, and documents from file system folders. To archive digital content, IBM Content Collector supports two ECM repositories from IBM: IBM FileNet® P8 and IBM Content Manager. IBM Content Collector, Version 2.1 also provides client plug-ins for the Microsoft Outlook client and the Lotus Notes® client so that e-mail users can mark specific e-mails for archiving, and search and retrieve information from archived e-mails. Refer to the Content Collector product home page to learn more about the product capabilities (see Resources).

Overview of IBM Classification Module, Version 8.6

IBM Classification Module automates the organization of unstructured content by analyzing the full text of documents and e-mails. With automatic classification, you accelerate the time to value from your ECM investments by relieving employees of time-consuming and costly decision-making tasks. IBM Classification Module can engage a predefined knowledge base to analyze the contents of file system folders and e-mails to assign a list of possible categories (which represent how the e-mails are to be classified) and relevancy scores (which represent the confidence with which an e-mail is assigned to a category). Refer to the Classification Module product home page to learn more about the product capabilities (see Resources).

IBM Content Collector can leverage text analytic results from IBM Classification Module to make intelligent decisions on captured content.

For example, to meet compliance and record management initiatives, or to prepare collections of data for legal discovery, IBM Classification Module can distinguish important e-mails from e-mails that have no business value. Based on the content analysis, IBM Content Collector determines the appropriate action to be taken. An e-mail that discusses a patent application might be copied to an IBM FileNet P8 folder and be declared as a record in IBM FileNet Records Manager. In contrast, an e-mail that discusses patent leather might be filtered out and not archived.

This article provides instructions for integrating IBM Classification Module with the e-mail archiving task routes of IBM Content Collector so that you can leverage the classification capabilities for archival processes.


Integration overview

IBM Content Collector provides a user interface to define task routes that can archive e-mails from Microsoft Exchange or Lotus Domino e-mail servers, or documents from file system directories. The defined task routes can be scheduled to archive the digital contents periodically.

Typically, a task route consists of a collector source for e-mails or file system documents that follows a series of task processes to:

  • Parse the e-mails or documents to create or update metadata properties
  • Invoke algorithms to detect duplicates
  • Stub the original e-mails and documents with links to the archived copy
  • Store the e-mail or document instance into the supported ECM repositories

In addition, IBM Content Collector can leverage the content classification capabilities of IBM Classification Module to:

  • Archive e-mails or documents into predefined folders. For faster discovery of information from archived content, it is important to organize the contents into a known set of predefined folders.
  • Identify mission-critical e-mails or documents and declare them as records. To address any third-party litigation or compliance initiatives, it is necessary to identify digital content that might relate to enterprise policies or business missions.
  • Populate metadata of the archived instance with category names that might be configured to support fielded searches and the faceted classification of search results. (For more information on faceted classification, see Resources.)

Figure 1 illustrates a sample task route defined in IBM Content Collector, where the "top relevancy score" from IBM Classification Module is used to mark the e-mail as a record or simply archive it into an ECM folder:

Figure 1. Sample task route for record and folder classification

Software requirements

This article assumes that the following software prerequisites are already met:

  • IBM Content Collector, Version 2.1 is installed on a server.
  • IBM Classification Module, Version 8.6 is installed on a server.
  • A knowledge base is deployed on IBM Classification Module. (A sample knowledge base is provided in the Downloads section. You can also refer to the IBM Classification Module Information Center to create a knowledge base from a subset of known digital content; see Resources.)

Configure the IBM Classification Module server

This section provides the required integration instructions for leveraging the classification capabilities of IBM Classification Module with the detailed archiving features provided by IBM Content Collector. For better scaling and performance, install these products on separate servers.

Assuming that you previously installed IBM Classification Module, Version 8.6 on a server, use the following procedures to configure the system to classify e-mails.

Install Microsoft Office Outlook 2003 or Microsoft Office Outlook 2007

The e-mail archiving filters in IBM Classification Module leverage the Messaging Application Programming Interface (MAPI) to parse e-mails from a Microsoft Exchange server. To configure this support, you must:

  1. Install Microsoft Office Outlook 2003 or Microsoft Office Outlook 2007 on the server that hosts IBM Classification Module.
  2. As shown in Figure 2, select Microsoft Office Outlook as the default e-mail application in your Web browser:
    Figure 2. Select Microsoft Office Outlook as the default e-mail application

Configure the IBM Classification Module server for e-mail archiving

In this scenario, IBM Content Collector archives e-mails from a Microsoft Exchange server, and features of IBM Classification Module are used to classify the e-mails.

Follow these instructions to tune the IBM Classification Module server for archiving e-mails:

  1. Log on to the server hosting IBM Classification Module version 8.6.
  2. Stop the IBM Classification Module services:
    1. Launch Windows services.
    2. Stop the service labeled "IBM Classification Module Process Manager".
    3. Stop the service labeled "IBM Classification Module Trace".
  3. If e-mails are being archived, overwrite the default document filter:
    1. Open a DOS command window.
    2. Change to the C:\IBM\ClassificationModule\Filters directory and enter the following commands:
      • copy docFilterManager.xml docFilterManager.xml.orig
      • copy docFilterManager.email.xml docFilterManager.xml
  4. Start the IBM Classification Module services:
    1. Launch Windows services.
    2. Start the service labeled "IBM Classification Module Process Manager".
    3. Start the service labeled "IBM Classification Module Trace".
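
The file swap in step 3 can also be scripted. The following is a minimal sketch in Python, assuming the file names given above; the function name `activate_email_filter` is hypothetical and is not part of IBM Classification Module. It only copies files, so the services still have to be stopped and started as described in steps 2 and 4.

```python
import shutil
from pathlib import Path


def activate_email_filter(filters_dir):
    """Back up docFilterManager.xml, then replace it with the e-mail variant."""
    filters = Path(filters_dir)
    active = filters / "docFilterManager.xml"
    backup = filters / "docFilterManager.xml.orig"
    email_variant = filters / "docFilterManager.email.xml"

    if not email_variant.is_file():
        raise FileNotFoundError(email_variant)

    # Keep the first backup. A second run must not overwrite the original
    # default filter with the already-swapped e-mail filter.
    if not backup.exists():
        shutil.copyfile(active, backup)

    shutil.copyfile(email_variant, active)
    return backup
```

Unlike the plain copy commands, this sketch never overwrites an existing docFilterManager.xml.orig file, so running it twice still leaves the default filter recoverable. It would be called with the Filters directory named in step 3 as its argument.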

Launch the IBM Classification Module Management Console to load and start a knowledge base

If you previously created a knowledge base, launch the management console to load and start the knowledge base instance. Figure 3 shows the management console used to start and stop the knowledge base instance. Please refer to the IBM Classification Module Information Center for specific details on uploading, launching, starting, and stopping an existing knowledge base (see Resources).

Figure 3. Management console

Configure the IBM Content Collector Server

Assuming that you previously installed IBM Content Collector, Version 2.1 on a server, use the following procedures to configure the system to use IBM Classification Module as a utility connector.

Install the IBM Classification Module client modules

For IBM Content Collector to leverage the classification capabilities of IBM Classification Module, Version 8.6, you must install client components on the IBM Content Collector server:

  1. Run the IBM Classification Module installation program on the IBM Content Collector server.
  2. Select the option to install Custom components.
  3. Select the Classification Module Client only check box.

Figure 4 illustrates the IBM Classification Module software that must be installed on the IBM Content Collector server:

Figure 4. IBM Classification Module software components

Note: If you plan to reorganize e-mails in an IBM FileNet P8 repository or extract e-mails to build a new knowledge base, you can also install the IBM FileNet P8 integration component. This article does not discuss the integration of IBM Classification Module with IBM FileNet P8. Please refer to the IBM Classification Module Information Center to learn about the IBM FileNet P8 integration components (see Resources).

Register IBM Classification Module as a utility connector

  1. After the client software is installed, copy the following libraries from the C:\IBM\classificationModule\Bin directory to the C:\Program Files\IBM\ContentCollector\ctms directory. Open a DOS command window and enter these commands (the quotation marks are needed because the target path contains a space):
    • copy C:\IBM\classificationModule\Bin\PackageDll23.dll "C:\Program Files\IBM\ContentCollector\ctms\PackageDll23.dll"
    • copy C:\IBM\classificationModule\Bin\stlport_ban46.dll "C:\Program Files\IBM\ContentCollector\ctms\stlport_ban46.dll"
    • copy C:\IBM\classificationModule\Bin\bnsClient86.dll "C:\Program Files\IBM\ContentCollector\ctms\bnsClient85.dll"
      Note: You must copy the original bnsClient86.dll file as bnsClient85.dll.
  2. Register IBM Classification Module as a utility connector. To do so, open a DOS command window and enter these commands:
    • cd C:\Program Files\IBM\ContentCollector\ctms
    • utilityConnector.exe -u
    • utilityConnector.exe -r
  3. Launch the IBM Content Collector configuration manager to confirm that IBM Classification Module exists as one of the utility connectors. Figure 5 shows the configuration manager with IBM Classification Module successfully recognized as a utility connector:
    Figure 5. IBM Classification Module as a utility connector
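
The library copy in step 1, including the rename of bnsClient86.dll, can be sketched in Python as follows. This is an illustration only: the names `LIBRARIES` and `copy_client_libraries` are hypothetical, and the source and target directories are passed in as arguments rather than hard-coded.

```python
import shutil
from pathlib import Path

# Source file name -> file name expected in the Content Collector directory.
# Only bnsClient86.dll is renamed, as the note in step 1 requires.
LIBRARIES = {
    "PackageDll23.dll": "PackageDll23.dll",
    "stlport_ban46.dll": "stlport_ban46.dll",
    "bnsClient86.dll": "bnsClient85.dll",
}


def copy_client_libraries(source_dir, target_dir):
    """Copy the three client libraries and return the paths written."""
    source, target = Path(source_dir), Path(target_dir)

    missing = [name for name in LIBRARIES if not (source / name).is_file()]
    if missing:
        # Fail before copying anything so the target is never half populated.
        raise FileNotFoundError(", ".join(missing))

    written = []
    for src_name, dst_name in LIBRARIES.items():
        shutil.copyfile(source / src_name, target / dst_name)
        written.append(target / dst_name)
    return written
```

Checking for all three files before the first copy avoids leaving the Content Collector directory with only some of the libraries in place. Registering the utility connector in step 2 remains a separate, manual step.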

Add IBM Classification Module to a task route

Note: Before you start this task, ensure that the IBM Classification Module server is started and that a knowledge base instance is running on the remote IBM Classification Module server.

To add an IBM Classification Module process to a task route, select the IBM Classification Module task from the Utility section (see Figure 5), and place it in an existing task route.

Note: To get started, you can import one of the templates provided in the Downloads section of this article, or you can select a template that is bundled with IBM Content Collector.

After the IBM Classification Module task is placed, configure the task process:

  1. Specify the host name of the server that hosts the IBM Classification Module server.
  2. Specify the port of the IBM Classification Module listener component (the default port is 18087).
  3. Click the explore button to retrieve the list of available knowledge bases.
  4. Select the knowledge base that you want to use for analyzing the e-mails.
  5. Click the explore button to retrieve the list of available content fields.
  6. Select the content field that you want to use to represent the content that is to be analyzed.
    Note: This field should be Document because e-mail is a binary file.
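
If the explore button in step 3 returns no knowledge bases, it can help to first confirm that the host and port from steps 1 and 2 are reachable from the IBM Content Collector server. The following Python sketch checks only that a TCP connection can be opened; it does not speak the IBM Classification Module protocol and cannot tell whether a knowledge base is running. The function name `listener_reachable` is hypothetical.

```python
import socket


def listener_reachable(host, port=18087, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection resolves the host name and connects in one call.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

A result of False points to a network, firewall, or service problem rather than to the task route configuration.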

Figure 6 shows a sample of IBM Classification Module configuration:

Figure 6. IBM Classification Module configuration as a task process

Apply classification scenarios in task routes

IBM Classification Module provides language processing technology that can analyze digital content and apply a predefined taxonomy to discover possible categories and relevancy scores. This technology can be applied to e-mails and file system documents.

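The top category name and relevancy score can be used in the IBM Content Collector task route to perform any or all of the following actions:

  • Assign a folder path for the archived e-mails or documents
  • Identify e-mails that should be declared as records
  • Populate metadata properties of the archived e-mails

The following sections demonstrate sample task routes that can perform each of the actions described above.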

Assign a folder path for the archived e-mails or documents on file system

In any archival solution, it is necessary to organize the archived contents to simplify later discovery. The top category names discovered by IBM Classification Module can be used to organize the archived e-mails or documents into known folders. In this section, a sample task route shows how IBM Classification Module is used for organizing digital content into known IBM FileNet P8 folders.

IBM Classification Module applies statistical text analysis on the contents of the e-mail (body, attachments, and address fields) to assign pairs of relevancy scores and category names. As shown in Figure 7, the Most Relevant Category is assigned as the "Folder path" destination for the archived e-mails:

Figure 7. Use IBM Classification Module metadata to set folder path

Likewise, IBM Classification Module applies statistical text analysis on the contents of file system documents to assign pairs of relevancy scores and category names. As shown in Figure 8, the Most Relevant Category is assigned as the "Folder path" destination for the archived documents.

Figure 8. Use IBM Classification Module metadata to set folder path
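
In IBM Content Collector this assignment is configured in the task route, not written as code. Purely to illustrate the logic, the following Python sketch picks the most relevant category from a list of category and score pairs and turns it into a folder path. The function names, the "/Archive" root, the "Unclassified" fallback, and the representation of scores as fractions between 0 and 1 are all assumptions made for this example.

```python
def most_relevant(results):
    """Return the (category, score) pair with the highest score, or None."""
    if not results:
        return None
    return max(results, key=lambda pair: pair[1])


def folder_path(results, root="/Archive", fallback="Unclassified"):
    """Build a folder path from the most relevant category name."""
    top = most_relevant(results)
    # Content with no suggested category goes to a fallback folder.
    category = top[0] if top else fallback
    return f"{root}/{category}"
```

For instance, `folder_path([("HR", 0.41), ("Legal", 0.82)])` selects the "Legal" category because it has the higher score.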

Identify e-mails that should be declared as records

In addition to archiving the e-mails into a meaningful folder path for easy discovery, you can use the "top relevancy score" assigned to the "most relevant category" as a threshold to mark the e-mails as records. As illustrated in Figure 9, a "decision point" is inserted in the task route and the following rules are defined:

  • Top Score > 70%: If the top relevancy score is greater than 70%, then assign the "Folder path" of the archived e-mail as the "Most relevant category" and declare the e-mail as a record.
  • Top Score < 70%: If the top relevancy score is less than 70%, then assign the "Folder path" of the archived e-mail as the "Most relevant category" without declaring the e-mail as a record.

The above rules help with archiving e-mails into known folders and marking e-mails as records for ensuring enterprise compliance in case of litigation.
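
The two rules amount to a single comparison. The following Python sketch is an illustration of that decision, not Content Collector configuration; the names `RECORD_THRESHOLD` and `route_email` are hypothetical, and scores are assumed to be fractions between 0 and 1. The rules above do not state what happens at exactly 70%, so the sketch makes an explicit, assumed choice for that case.

```python
RECORD_THRESHOLD = 0.70  # the 70% threshold used in the rules above


def route_email(category, top_score, threshold=RECORD_THRESHOLD):
    """Return the folder path value and whether to declare a record."""
    return {
        # Both rules assign the most relevant category as the folder path.
        "folder_path": category,
        # Strictly greater than, matching "Top Score > 70%". A score of
        # exactly 70% is treated as "no record" here (an assumption).
        "declare_record": top_score > threshold,
    }
```

When you define the decision point in a real task route, make sure that one of the two rules also covers a score of exactly 70%, so that no e-mail falls through both branches.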

Figure 9. Use IBM Classification Module metadata to set folder path and declare records

Populate metadata properties

The information discovered by IBM Classification Module can also be used to populate metadata properties for the archived e-mails. In IBM Content Collector, the e-mails are archived into an IBM FileNet P8 or IBM Content Manager (CM8) repository. Each of these repository types provides search capabilities. To facilitate better discovery by search clients, metadata that is populated by information discovered by IBM Classification Module can be used in fielded searches and in the faceted display of search results.

Fielded searches: In a fielded search, you can specify a query to return e-mails with a specific term in a specific e-mail attribute. For example, you might search for e-mails with the term "HR" in the e-mail attribute called "Subject". As shown in Figure 10, IBM Classification Module can be used to append the value of the "Subject" field of the e-mail with the "Most relevant category" name:

Figure 10. Use IBM Classification Module metadata to populate attributes
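
As a rough illustration of the append operation, the following Python sketch adds the category name to a subject line. The bracketed tag format and the function name `tag_subject` are assumptions for this example; the actual format depends on how the task route is configured.

```python
def tag_subject(subject, category):
    """Append the category name to a subject line, once."""
    tag = f"[{category}]"
    if subject.endswith(tag):
        # Already tagged; avoid duplicate tags if an e-mail is re-processed.
        return subject
    return f"{subject} {tag}".strip()
```

With a subject tagged this way, a fielded search for the term "HR" in the "Subject" attribute also matches e-mails that were classified as HR even when the original subject never mentioned the term.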

Faceted classification of search results: After a free text search is performed, it can be overwhelming to sort through the e-mails for the e-mail that you are looking for. Displaying a large number of search results by known facets is a growing technique for narrowing the set of search results that gets displayed. IBM Classification Module can help by populating e-mail fields that search clients can use to return a faceted display of the results. As shown in Figure 11, the "Document Title" of the archived e-mail is populated with the "Most relevant category" name, which enables the search client to leverage the value of this field and provide a classified view of the search results:

Figure 11. Use IBM Classification Module metadata to populate attributes for faceted classification results
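
To show what a search client might do with such a field, the following Python sketch counts search results per category and then narrows the result set to one selected facet. It is a generic illustration, assuming that each search result is a dictionary with a "Document Title" key; the function names are hypothetical and do not correspond to any IBM search API.

```python
from collections import Counter


def facet_counts(results, field="Document Title"):
    """Count search results per value of a metadata field."""
    # Results without a value for the field are left out of the facets.
    return Counter(r[field] for r in results if r.get(field))


def narrow(results, value, field="Document Title"):
    """Keep only the results whose field equals the selected facet value."""
    return [r for r in results if r.get(field) == value]
```

The counts give the user an overview such as "HR (2), Legal (1)", and selecting one of the facets reduces the list to the matching e-mails.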

Summary

In this article, you have learned the importance of classification technologies in an archival solution. With automatic classification, you accelerate time to value from your ECM investments by relieving employees of time-consuming and costly decision-making tasks. You have also learned the instructions required to integrate IBM Classification Module, Version 8.6 with IBM Content Collector, Version 2.1.


Downloads

Description                    Name                      Size
Sample task route templates    Sample_task_routes.zip    59KB
Sample knowledge base          KBIOD2007.kb              443KB

Resources

Learn

Get products and technologies

  • Build your next development project with IBM trial software, available for download directly from developerWorks.
