Integrate InfoSphere Guardium Data Redaction with IBM Classification Module

Identifying sensitive and non-sensitive documents

IBM® InfoSphere® Guardium® Data Redaction is capable of finding and concealing sensitive text within a document. Within an organization, not all documents contain sensitive data. For the data redaction to be effective, it is critical that relevant documents be identified. The sensitivity of the entities is often dependent on context. InfoSphere Classification Module is capable of identifying sensitive documents containing data that require redaction. This article explains how to integrate Guardium Data Redaction and Classification Module to achieve the goal of redacting only relevant documents.


Jane Singer (jsinger@il.ibm.com), QA Engineer, IBM

Jane Singer is on the QA teams for both InfoSphere Guardium Data Redaction and InfoSphere Classification Module at the IBM Israel Software Lab. In addition, she leads L3 and presales support for InfoSphere Classification Module.



03 November 2011


Overview

InfoSphere Guardium Data Redaction is a product aimed at achieving a balance between openness and privacy. Often, the same regulations require organizations to share their documents with regulators, business partners, or customers, and at the same time to protect sensitive information that may be buried in these documents. With thousands of documents in Enterprise Content Management systems such as IBM FileNet® and IBM Content Manager®, automation combined with a well-structured workflow is essential for controlling access to private information in documents at a fine grain.

For example, in eDiscovery, lawyers must share documents with opposing counsel. But lawyers do not want to release any information they don't need to, and attorney-client privileged information must be carefully protected. Similarly, the Freedom of Information Act (FOIA) is intended to hold government organizations more accountable for their actions by making information about those actions available on demand. At the same time, the regulation requires that those requesting the documents not see any sensitive personal or national security information embedded in documents that might be made public.

InfoSphere Guardium Data Redaction automatically finds and deletes sensitive text within a document, redacting the document. It then outputs the redacted document in a format such as PDF. Alternatively, the product includes a web-based Secure Viewer for even more control over the release of private information. Each user sees just what they are allowed to see. In some cases, even if a user is allowed to see some information, it is withheld unless they ask for it, specifying the reason for their need to know.

Within an organization, not all documents contain sensitive data. For the redaction to be effective, it is critical that relevant documents be identified. InfoSphere Guardium Data Redaction is capable of identifying and redacting many types of personally identifiable information, but not all occurrences constitute sensitive data. The sensitivity of the entities is often dependent on context. For example, names of medical procedures in an administrative document catalog are not sensitive, but in patient records they are. IBM Classification Module is capable of identifying sensitive documents containing data that requires redaction.

The level of sensitivity varies across documents of different types. A group of documents from one department within the organization may require a customized redaction policy. Other groups of documents may have been created for public consumption, and it can be assumed that these documents contain no sensitive data. These document groupings may or may not be part of a formalized classification system.

Below is an example of a sensitive document, and its redacted version. Personal names, addresses, account and telephone numbers have been removed.

Figure 1. An overview of the redaction process
Shows original document, sensitive information removed, and resulting document

There are different formats available for the redacted version of the document. In addition to the usual formats (PDF, Microsoft Word document, TIFF, text, and so on), a proprietary format is available that can be viewed by the Secure Viewer (an application shipped with InfoSphere Guardium Data Redaction).

IBM Classification Module is capable of identifying documents according to a large range of criteria, including statistical classification and rule-based decisions. The implementation involves these stages:

  1. Create a knowledge base and train it using user-defined groups of sample documents.
  2. Create a decision plan that will:
    • Categorize new documents based on the knowledge base results.
    • Move documents to relevant folders.
  3. Run the Classification Module Classification Center using the created decision plan. Documents are moved to relevant folders.
  4. Run redaction batch processes on the repository folders. Redacted versions of the document are created; original copies are kept.

The implementation described here involves documents stored in a file system. Both Classification Module and InfoSphere Guardium Data Redaction are capable of accessing and processing documents on IBM FileNet and IBM Content Manager systems.

The workflow described here uses IBM Classification Module's Classification Center to classify documents into a taxonomy tree.

For information on how to create a knowledge base and decision plan, and to set up Classification Center for classifying documents into folders, see the IBM Classification Module Information Center.

Guardium Data Redaction then redacts documents in two different category folders, nested within the Repository Folders, according to two different redaction policies. Guardium Data Redaction uses a specific folder structure (repository folders) which serves as the basis for its data processors.

The workflow described here involves these steps:

  1. Set the configuration for redaction: Configure two processors in InfoSphere Guardium Data Redaction.
  2. Start the Data Redaction server in order to create the relevant processors and their repository folders.
  3. Create the Classification Module knowledge base and decision plan.
  4. Run the Classification Module Classification Center to move the documents to the "in" folders of the redaction repositories.
  5. Restart the InfoSphere Guardium Data Redaction server to redact documents and move them to the appropriate folders for further processing.

Set the configuration for redaction

Before running the Classification Module Classification Center or InfoSphere Guardium Data Redaction, the processors should be set up.

Configure two repositories

Two separate processors (Legal and IBM Global Financing) are defined in two processor configuration files found in the IBM\GuardiumDataRedaction\server\conf folder.

Each processor has one configuration file, which is registered in the IBM\GuardiumDataRedaction\server\conf\plugins.xml file:

Listing 1. Sample processor setup in plugins.xml
<plugin>
    <pluginClass>com.ibm.nex.redaction.docrepository.SimpleFilesDocumentRepository</pluginClass>
    <configFile>batchFileSystemProcessorIBM_Legal.xml</configFile>
</plugin>
<plugin>
    <pluginClass>com.ibm.nex.redaction.docrepository.SimpleFilesDocumentRepository</pluginClass>
    <configFile>batchFileSystemProcessorIBM_Finance.xml</configFile>
</plugin>
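
To confirm which processor configuration files are registered, the plugin entries can be read programmatically. The following Python sketch is an illustration only and is not part of the product. It wraps the two entries from Listing 1 in a hypothetical <plugins> root element (the real root element of plugins.xml is not shown in this article) and lists the configuration file registered for each plugin class. The function name list_processor_configs is likewise made up for this example.

```python
import xml.etree.ElementTree as ET

# The two <plugin> entries from Listing 1. The <plugins> wrapper is an
# assumption made so that the fragment parses as a single XML document.
PLUGINS_FRAGMENT = """
<plugins>
  <plugin>
    <pluginClass>com.ibm.nex.redaction.docrepository.SimpleFilesDocumentRepository</pluginClass>
    <configFile>batchFileSystemProcessorIBM_Legal.xml</configFile>
  </plugin>
  <plugin>
    <pluginClass>com.ibm.nex.redaction.docrepository.SimpleFilesDocumentRepository</pluginClass>
    <configFile>batchFileSystemProcessorIBM_Finance.xml</configFile>
  </plugin>
</plugins>
"""


def list_processor_configs(xml_text):
    """Return (plugin class, config file) pairs with surrounding whitespace stripped."""
    root = ET.fromstring(xml_text)
    pairs = []
    for plugin in root.iter("plugin"):
        plugin_class = (plugin.findtext("pluginClass") or "").strip()
        config_file = (plugin.findtext("configFile") or "").strip()
        pairs.append((plugin_class, config_file))
    return pairs


configs = list_processor_configs(PLUGINS_FRAGMENT)
for plugin_class, config_file in configs:
    print(config_file, "->", plugin_class)
```

Stripping whitespace makes the check tolerant of class names that are wrapped across lines in the file.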

Each XML configuration file contains the following settings:

  • The base folder for the repository

    This folder should match the directory used by Classification Center, for example:

    <baseDir>c:/data/IBM Products CC Output Folder</baseDir>

  • Repository folder name

    The folder name should match exactly the associated category name in the Classification Module knowledge base.

    <processor folder="Legal">
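
Because the repository folder name must match the category name exactly, a mismatch can be caught before the server is started. The Python sketch below is a minimal illustration under stated assumptions: the category list is hypothetical (the real one comes from the knowledge base), the helper names are invented for this example, and the "in" subfolder is the input folder that Classification Center fills later in this workflow.

```python
from pathlib import PureWindowsPath

BASE_DIR = "c:/data/IBM Products CC Output Folder"  # value from the baseDir example above


def in_folder(base_dir, repository_folder):
    """Path of the "in" subfolder that Classification Center is expected to fill."""
    return PureWindowsPath(base_dir) / repository_folder / "in"


def unmatched_folders(processor_folders, category_names):
    """Processor folder names that have no exactly matching category (case-sensitive)."""
    categories = set(category_names)
    return [name for name in processor_folders if name not in categories]


# Hypothetical category list; the real one is defined in the knowledge base.
categories = ["Legal", "Financial", "Products"]

print(in_folder(BASE_DIR, "Legal"))
# "Finance" is reported because the category is called "Financial".
print(unmatched_folders(["Legal", "Finance"], categories))
```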

Setting different data policies

We will set two policies:

  • Legal role: US dollar amounts are redacted.
  • Financial role: Organization names are redacted.

These policies are configured in the XmlPolicyModel.xml file in IBM\GuardiumDataRedaction\server\conf.

Each ns21:permission element maps one role to one category. The ns21:redact element marks that category as redacted for the role. The categories themselves are defined in ns21:category elements (for example, <ns21:category id="1">) within the same file.

In the listings below, each role has one redacted category. The user role (userRoleId) and the category (semanticCategoryId) are defined elsewhere in the same file.

Listing 2. Legal role
<ns21:permission userRoleId="1002" semanticCategoryId="100">
    <ns21:redact />
</ns21:permission>

Listing 3. Financial role
<ns21:permission userRoleId="1003" semanticCategoryId="3">
    <ns21:redact />
</ns21:permission>
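
As a quick sanity check of the role-to-category mappings, the permission elements can be summarized with a short script. This Python sketch is an assumption-laden illustration: the namespace URI bound to the ns21 prefix and the enclosing root element are not shown in this article, so "urn:example:policy" and <ns21:policy> are placeholders that you would replace with the values from your own XmlPolicyModel.xml.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URI: the real URI bound to the ns21 prefix is not
# shown in this article, so "urn:example:policy" is an assumption.
NS = {"ns21": "urn:example:policy"}

# Listings 2 and 3 wrapped in a hypothetical <ns21:policy> root element.
POLICY_FRAGMENT = """
<ns21:policy xmlns:ns21="urn:example:policy">
  <ns21:permission userRoleId="1002" semanticCategoryId="100">
    <ns21:redact />
  </ns21:permission>
  <ns21:permission userRoleId="1003" semanticCategoryId="3">
    <ns21:redact />
  </ns21:permission>
</ns21:policy>
"""


def redacted_categories(xml_text):
    """Map each userRoleId to the semanticCategoryId values marked as redacted."""
    root = ET.fromstring(xml_text)
    result = {}
    for permission in root.findall("ns21:permission", NS):
        # Only permissions that contain a redact element count as redacted.
        if permission.find("ns21:redact", NS) is not None:
            role = permission.get("userRoleId")
            result.setdefault(role, []).append(permission.get("semanticCategoryId"))
    return result


policy = redacted_categories(POLICY_FRAGMENT)
print(policy)
```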

Start the InfoSphere Guardium Data Redaction server

From the IBM InfoSphere Guardium Data Redaction Windows menu, choose Start server. This starts the server and creates the configured repositories. You can optionally stop the server afterward to prevent it from processing the files delivered by Classification Center before you have checked them. If an "in" folder becomes populated while the Data Redaction server is running, its files are picked up for processing.


Create the Classification Module knowledge base and decision plan

Classification Module Classification Center can copy or move files within a file system, and read or modify metadata associated with a document within a full content management system. These actions are based on a series of decisions made within a decision plan running on the Classification Module server. The decision plan takes actions based on rules with triggers, and these rules can consider results from statistical analysis of the document content returned by the knowledge base (also running on the server). The knowledge base typically assigns a category to the document, based on statistical similarities.

For details on how to create a knowledge base and decision plan, see the Workbench topic in the Classification Module Information Center, accessible from the Resources section.

Create the knowledge base

Classification Module Workbench is shipped with a project called IBM Products. This project contains the basis for the knowledge base used here. The following figure shows the list of categories.

Figure 2. The IBM Products knowledge base
Explorer view of the IBM Products knowledge base

The knowledge base structure mimics the target folder structure. The following figure shows the folder structure, each folder named after a category.

Figure 3. The folder structure
opened up explorer view of the folder structure for organizing classified documents

Create the decision plan

The decision plan includes a set of rules. Below is a rule that moves documents to the target folders based on the highest category match (for more rules of this kind, see the Rules for File System project in Classification Module Workbench).

Figure 4. The decision plan (first rule)
The rule for matching the document against the knowledge base.

The folders that will be redacted are a special case. The figure below shows an action for moving the document to the "in" subfolder within a redaction repository:

Figure 5. The decision plan (second rule)
The rule for moving files to the correct repository folder.
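
The logic of the two rules can be summarized in a few lines of Python. This is a sketch of the decision only, not the Classification Module rule syntax or API: the scores dictionary, the category names, and the target_folder helper are hypothetical stand-ins for what the knowledge base and decision plan provide.

```python
from pathlib import PureWindowsPath

BASE_DIR = PureWindowsPath("c:/data/IBM Products CC Output Folder")

# Categories whose folders are also redaction repositories (assumed names).
REDACTION_REPOSITORIES = {"Legal", "Financial"}


def target_folder(scores):
    """Pick the best-scoring category and return the folder the document goes to."""
    if not scores:
        raise ValueError("no category scores returned for this document")
    best = max(scores, key=scores.get)  # first rule: highest category match
    folder = BASE_DIR / best
    if best in REDACTION_REPOSITORIES:
        folder = folder / "in"          # second rule: redaction input subfolder
    return folder


# Made-up scores for illustration only.
print(target_folder({"Legal": 0.82, "Financial": 0.11, "Products": 0.07}))
print(target_folder({"Legal": 0.05, "Financial": 0.10, "Products": 0.85}))
```

Documents matched to a redaction category land in that repository's "in" subfolder, and all other documents go directly to the folder named after their category.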

Run Classification Module Classification Center

For details on how to set up Classification Center for classifying documents into folders, see the Classification Center topic in the InfoSphere Classification Module Information Center.

Once Classification Center is run, the documents for redaction should be in the redaction "in" folders; documents that are not to be redacted should be in the Products subcategories within this structure. The figure below shows the "in" folder for two repositories and other non-repository folders named after categories.

Figure 6. The redaction repository file structure
The redaction repository file structure. Classification Center inserts documents into the input directories.

Check to see that the above folders were populated by Classification Center.
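
This check can also be scripted. The Python sketch below counts the files waiting in each repository's "in" subfolder. It is a minimal illustration: the base directory and the repository names Legal and Financial are taken from the examples in this article, and the pending_documents helper is hypothetical.

```python
from pathlib import Path


def pending_documents(base_dir, repository_folders):
    """Count the files waiting in each repository's "in" subfolder."""
    counts = {}
    for name in repository_folders:
        in_dir = Path(base_dir) / name / "in"
        if in_dir.is_dir():
            counts[name] = sum(1 for entry in in_dir.iterdir() if entry.is_file())
        else:
            # Folder missing: the processor may not have been created yet.
            counts[name] = None
    return counts


base = "c:/data/IBM Products CC Output Folder"
for folder, count in pending_documents(base, ["Legal", "Financial"]).items():
    print(folder, "missing" if count is None else f"{count} file(s) waiting")
```

A count of None indicates that the "in" folder does not exist, which suggests that the Data Redaction server has not yet created that repository.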

The following figure shows two folders (Financial and Legal) that also serve as Data Redaction repository folders:

Figure 7. The Financial and Legal repository folders
Explorer view of the Financial and Legal repository folders

Here, Classification Center moves files to the "in" subfolder of each repository folder.


Restart the InfoSphere Guardium Data Redaction server

From the IBM InfoSphere Guardium Data Redaction Windows menu, choose Start server. Because the "in" folders of the two new repositories now contain the documents moved there by Classification Center, Data Redaction now processes these files.

The figure below shows the orig and out folders within each repository structure.

Figure 8. The out folder now contains redacted documents. The orig folder contains the original copies.

Data Redaction processes documents from the "in" folder and places the original and redacted versions in the respective folders:

orig folders: original documents

out folders: redacted copies
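
To confirm that every original has a redacted counterpart, the two folders can be compared. The Python sketch below rests on an assumption that this article does not confirm: that a redacted copy keeps the base file name of its original (only the extension may differ, for example when the output format is PDF). If your output naming differs, the comparison must be adapted. The helper names are invented for this example.

```python
from pathlib import Path


def _stems(folder):
    """Base names (without extension) of the files in a folder; empty if it is missing."""
    if not folder.is_dir():
        return set()
    return {p.stem for p in folder.iterdir() if p.is_file()}


def missing_redactions(repository_dir):
    """Names found in "orig" that have no same-named file in "out".

    Assumes redacted copies keep the original base name (not confirmed here).
    """
    repo = Path(repository_dir)
    return sorted(_stems(repo / "orig") - _stems(repo / "out"))


print(missing_redactions("c:/data/IBM Products CC Output Folder/Legal"))
```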

The percentage of files sent for review is set by the reviewPercentage element in the relevant repository configuration file (such as batchFileSystemProcessorIBM_Legal.xml above):

<reviewPercentage>0</reviewPercentage>

We now have various versions, redacted and non-redacted, of our original documents classified into folders. There are various aspects of this model that can be adapted according to business needs.


Some ideas for varying the model

Finding sensitive documents for redaction without subject classification

In the case where the only goal is to locate sensitive data, there is no need for conventional content classification. In this case, a Classification Module knowledge base can be created that recognizes the nature of the sensitive documents, and the decision plan can be used to move only those documents to the Redaction repository folder. There is no need for a folder dedicated to Classification Center output. Because the two-category knowledge base is often used for finding a few relevant items within a large content set, this method is often called "pinpointing." However, it can also be used for finding a large group of similar documents among non-relevant documents.

Figure 9. The pinpointing knowledge base
A two-category knowledge base for finding sensitive material within a large collection of documents.

To create such a knowledge base, choose a number of sensitive documents and an equal number of non-sensitive documents.
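
Selecting equal-sized samples is easy to script. The following Python sketch shows one possible way to do it; the file names are invented, and the balanced_training_set helper is hypothetical rather than a Classification Module feature.

```python
import random


def balanced_training_set(sensitive, non_sensitive, seed=0):
    """Return equal-sized random samples of sensitive and non-sensitive documents."""
    size = min(len(sensitive), len(non_sensitive))
    rng = random.Random(seed)  # fixed seed so the selection is repeatable
    return rng.sample(sensitive, size), rng.sample(non_sensitive, size)


# Invented file names for illustration only.
sensitive_docs = [f"patient_record_{i}.txt" for i in range(40)]
other_docs = [f"brochure_{i}.txt" for i in range(250)]

train_sensitive, train_other = balanced_training_set(sensitive_docs, other_docs)
print(len(train_sensitive), len(train_other))  # 40 40
```

The smaller group determines the sample size, so no document is used twice and neither category dominates the training set.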

Adding manual review of the Classification Center output before and/or after redaction

The Classification Center can be used to manually review documents before they are sent for redaction.

This method can be used early on, when the system is first put into production and knowledge base confidence may be low. In addition, feedback can be submitted to improve the knowledge base.

The Redaction Manager can be used to review documents after they are classified and redacted. The redaction of a document can be edited or removed, and the document can be sent to another Repository Folder for redaction according to a different policy.

Using multiple pinpointing knowledge bases

Multiple knowledge bases could be set up for pinpointing specific documents for redaction. One or more processes could be implemented consecutively according to need, until all documents are moved to a folder for redaction. This would be helpful, for example, in the case where new sensitive documents of a different nature need to be located for redaction, or where the nature of new documents changes.
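
Running several pinpointing steps in sequence amounts to a simple cascade: each document goes to the repository of the first step that recognizes it. The Python sketch below illustrates the idea only. The keyword tests are hypothetical stand-ins for the decisions that real knowledge bases would make, and route is an invented helper.

```python
def route(document_text, pinpointers):
    """Return the repository folder of the first matching pinpointing step, else None."""
    for repository_folder, is_sensitive in pinpointers:
        if is_sensitive(document_text):
            return repository_folder
    return None


# Hypothetical stand-ins for knowledge base decisions: simple keyword tests.
pinpointers = [
    ("Legal", lambda text: "privileged" in text.lower()),
    ("Financial", lambda text: "account number" in text.lower()),
]

print(route("Attorney-client PRIVILEGED memo", pinpointers))  # Legal
print(route("Product brochure", pinpointers))                 # None
```

A new kind of sensitive document can then be handled by appending another step to the list, without retraining the existing knowledge bases.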

Resources

Learn

Get products and technologies

  • Build your next development project with IBM trial software, available for download directly from developerWorks.

