Best practices for using IBM Classification Module with document retention policies

Product Documentation

Abstract

Companies can use IBM Classification Module to classify content in their enterprise content management systems and assist with enforcement of document retention policies. IBM Classification Module analyzes each document and can automatically set appropriate record classes and expiration dates according to company policy. This document describes a process that allows customers to defend their classification actions when required.

Content

IBM Classification Module makes decisions about content by using a combination of rules and statistical analysis, based on a large number of examples, that are applied to the full content and metadata of a document. IBM Classification Module makes decisions in a consistent, reproducible way that can be monitored and audited. You can use Classification Module tools to estimate the accuracy and automation rates at which the production system will operate. After the production server is deployed, you can analyze the accuracy of the system.

You can configure Classification Module to assign version numbers to knowledge bases and decision plans whenever changes are made to them, and to store backup copies of each version of the knowledge base and decision plan. When you use this versioning option, all Classification Module results and suggestions include the specific decision plan and knowledge base version that is responsible for that decision. Associating each result with a specific version allows for the reproduction of each decision made by the system.

An important part of controlling the classification process is monitoring knowledge base learning. You can incorporate improvement suggestions in a controlled manner by using a feature called deferred feedback. After documents are reviewed, feedback is stored inside the data server and can be imported into the Classification Workbench offline and processed:

A subject matter expert or business user approves the feedback (that is, the labels assigned by the reviewers).
An updated knowledge base is produced and tested.
If knowledge base accuracy is improved, the updated knowledge base (with an updated version number) is deployed onto the production server.

To set up IBM Classification Module for use with document retention policies, complete the following tasks*:

When setting up the system:

Build and deploy the knowledge base and decision plan.
Configure versioning and deferred feedback.
Establish a comprehensive backup procedure.
Prepare a data set to use for regression tests.

When the system is in production:

Manually review documents that cannot be automatically classified.
Audit classification performance.
Verify the accuracy of the production server.
Process deferred feedback.

Setting up the system

Build and deploy the knowledge base and decision plan. Create, analyze, and tune the knowledge base and decision plan by using Classification Workbench.

Use a categorized data set to train and tune the initial knowledge base.
For decision plans that primarily use knowledge bases to make classification decisions, use the Total Automation vs. Error graph to help determine the proper balance between automation and accuracy. See Summary graphs: Total Automation vs. Error in the Classification Module information center.
If the automation rate is not 100%, create rules in the decision plan to identify and flag documents that require manual review (by using an attribute, item type, or folder). For example:

For decision plans that use knowledge bases, add a rule that is triggered if the scores of the top-scoring categories for the document do not exceed a preset threshold, such as 80%. For such documents, the rule can set an attribute to flag these documents or can automatically move the documents into a folder or item type called ManualReview.
For decision plans that are primarily rule based, build validation rules as part of the decision plan. For example, add a rule in the beginning of the decision plan to set the ManualReview content field to true for each document. Whenever a rule is triggered and the document is automatically processed, add an action to set the ManualReview content field to false. At the end of the decision plan, add a rule that automatically moves those documents flagged for ManualReview into a folder or item type called ManualReview.

Create rules in the decision plan to flag documents for auditing. Use the advanced random trigger option to build a rule that randomly flags a specific percentage of documents to move or copy into an audit folder for review. For example, create a rule that is triggered for 5% of the documents that are processed and sets the CopyToAuditFolder content field to True. Besides defining this decision plan rule, you also need to set up a procedure for moving or copying the flagged documents to a special audit folder. If the audit folder contains only copies of the documents, you can delete these copies after you complete the audit.
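
The flagging logic described above can be sketched outside the product to reason about thresholds and audit rates. The following Python sketch is an illustration only, not Classification Module code: the function name, the Category field, and the score dictionary are assumptions made for this example, while ManualReview, CopyToAuditFolder, the 80% threshold, and the 5% audit rate come from the examples in the text.

```python
from __future__ import annotations

import random
from typing import Optional

REVIEW_THRESHOLD = 0.80  # the 80% example threshold from the text
AUDIT_RATE = 0.05        # the 5% example audit rate from the text


def classify_fields(
    scores: dict[str, float],
    rng: random.Random,
    threshold: float = REVIEW_THRESHOLD,
    audit_rate: float = AUDIT_RATE,
) -> dict[str, object]:
    """Return the content fields that a decision plan might set for one document.

    `scores` maps category names to match scores between 0 and 1
    (a hypothetical representation chosen for this sketch).
    """
    best_category: Optional[str] = None
    best_score = 0.0
    if scores:
        best_category, best_score = max(scores.items(), key=lambda item: item[1])

    # Flag for manual review when the top score does not exceed the threshold.
    manual_review = best_score <= threshold

    return {
        # "Category" is a hypothetical field name used only in this sketch.
        "Category": None if manual_review else best_category,
        "ManualReview": manual_review,
        # Random trigger: true for roughly `audit_rate` of all documents.
        "CopyToAuditFolder": rng.random() < audit_rate,
    }
```

For example, `classify_fields({"Invoices": 0.93, "Contracts": 0.40}, random.Random())` classifies the document as Invoices without manual review, whereas a document whose top score is 0.55 is flagged with ManualReview set to true. The audit flag is independent of the score, so automatically classified and manually reviewed documents are sampled at the same rate.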

Configure versioning and deferred feedback. When the knowledge base and decision plan are ready for production, deploy them on the Classification Module server and set the following options in the Management Console:

For the knowledge base, select Back up automatically and set the Feedback option to Defer processing. Rather than incorporating feedback continuously, deferred processing allows you to test the impact of accumulated feedback before applying it to a knowledge base in production.
For the decision plan, select Back up automatically.

When you select the Back up automatically option, Classification Module automatically creates backup copies of the knowledge base and decision plan when you make changes to them. All of the decision results and suggestions will be associated with specific decision plan and knowledge base versions. Storing previous versions of the decision plan and knowledge base is useful if you need to reproduce results from a previous version after the knowledge base or decision plan was changed.

The backup copies are created in the ICM_HOME/dserverdir/VERSIONS directory on the data server. The file name is the name of the knowledge base or decision plan concatenated with the backup version number.
Best practice: Set up a procedure to store these backup copies in a secure location for future reference.

Establish a comprehensive backup procedure. Establish a procedure to back up and store historical data about the classification actions.

On the data server, back up the ICM_HOME\dserverdir directory to store all previous versions of the knowledge base and decision plan.
On the Classification Center server, back up the ICM_HOME\ECMTools\logs directory to store a history of all classification decisions.
Important: If you do not set up a backup policy for the Classification Center logs, the history of classification decisions will not be saved. Classification Center uses a recycling logs process in which older log files are deleted when the maximum number of log files is reached. For the events.csv file, the maximum number of logs is 5 and the maximum file size is 100 KB. For all other Classification Center logs, the maximum number of logs is 10 and the maximum file size is 1 MB. To change these default settings or prevent the logs from being recycled, edit the ICM_HOME\ECMTools\ClassificationCenter\WebContent\WEB-INF\classes\log4j.properties file.
Best practice: Back up log files before the maximum number of log files is reached and concatenate the date of the backup to each log file name.
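
The log backup best practice could be automated with a small scheduled script. The following Python sketch is one possible approach, not an IBM-supplied tool: the function name and the directory arguments are hypothetical, and it assumes that the log directory contains plain files that can be copied while the server is running.

```python
from __future__ import annotations

import shutil
from datetime import date
from pathlib import Path


def backup_logs(logs_dir: Path, backup_dir: Path, backup_date: date) -> list[Path]:
    """Copy each log file to backup_dir, appending the backup date to its name."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    stamp = backup_date.strftime("%Y-%m-%d")
    copied = []
    for log_file in sorted(logs_dir.iterdir()):
        if not log_file.is_file():
            continue  # this sketch ignores subdirectories
        target = backup_dir / f"{log_file.name}.{stamp}"
        if target.exists():
            continue  # keep the first copy made on a given date
        shutil.copy2(log_file, target)  # copy2 also preserves file timestamps
        copied.append(target)
    return copied
```

A call such as `backup_logs(Path(logs), Path(archive), date.today())` would copy a file named events.csv to a file whose name ends with the backup date. Run the script often enough that no log file is recycled between two backups.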

Prepare a data set to use for periodic regression tests. Use this data set in Classification Workbench when you process deferred feedback (as discussed below). For example, you can create this data set by extracting a sample of documents from your repository by using the Content Extractor.

Maintaining the system in production
The following sections describe periodic maintenance procedures that can be applied as part of a controlled disposition policy to ensure ongoing, high quality classification.

Manual review of documents that cannot be automatically classified
As part of the normal course of activity, keep track of documents that cannot be automatically classified and are flagged for manual review (if you created an appropriate rule in your decision plan, as previously described). These documents should be identifiable using an attribute, folder or item type. Use the Classification Center to locate these documents and then manually review and classify them.

Audits of classification performance
Periodically verify that the correct knowledge base categories and decision plan actions were applied during classification by reviewing documents in the Classification Center. As part of the normal course of activity, audit a random sample of documents that are moved or copied to an audit folder (if you created an appropriate rule in your decision plan, as previously described). By reviewing documents, you can improve the classification of documents in the future because the system stores your suggestions as possible feedback.

Best practice: When you select the content to review in the Classification Center, use one of the random order options in the Number and order of documents option to specify a number or percentage of documents that are selected for review.

If you find a document for which the decision plan actions are not correctly applied, you can reclassify the document by using one or both of the following reclassify options:

If the decision plan changed since the time that the document was last classified, reclassify the document by using the current version of the decision plan. The decision plan is considered changed if, for example, a new decision plan rule was added or feedback was applied to any of the referenced knowledge bases.
If you want to see detailed information about the rules that are triggered when the document is reclassified, select the Generate detailed trace information option. For each rule that is triggered, you can see which classification actions succeeded and can be applied, as well as which classification actions were skipped and cannot be applied.
If the correct actions are not applied after using the first option, reclassify the document by selecting specific knowledge base categories that you think are most appropriate for classifying the document.

If the correct actions are still not applied after using the second option, determine whether the decision plan needs to be updated to handle this type of document or whether this document is an exception to a rule that cannot be changed without affecting overall performance. Use the detailed trace information that was generated to help determine which rules need to be modified. Save the document in XML format so that you can import the document into Classification Workbench and use the document to verify the modified decision plan. You might also want to add this document to the data set that you prepared for periodic regression tests.

Periodic verification of the accuracy of the production server
You can check the accuracy of the knowledge base and decision plan that run on a Classification Module production server by importing analysis data into Classification Workbench and viewing reports.

Run the Content Extractor to produce a content set in XML format that contains analysis data. For example, to extract analysis data about all classified documents in an IBM Content Manager repository, specify Attributes=BNR.BNR_action=classified in the extractorCM8.properties file before you run the Content Extractor.
Import the analysis data into a Classification Workbench knowledge base or decision plan project.
Run the Analysis wizards:

For knowledge bases, click Analysis > Generate Analysis Data.
For decision plans, click Analysis > Analyze Imported Data.

View the reports (by clicking Analysis > Reports) to verify that the decision plan and knowledge base are performing as expected.

If you want to check the accuracy of only the latest versions of the knowledge base and decision plan that run on the server, complete the following steps before you run the Analysis wizards:

On the Content Set menu in Classification Workbench, click New View > Define.
Select the Look for items containing text option, select DP_Version in the field list, and specify the current knowledge base and decision plan version in the contains field. For example, specify Tutorial_DP||4||Tutorial_KB||1.1

Then run the Analysis wizards on this view that contains data for only the current version of the knowledge base and decision plan.

Process deferred feedback
To ensure that you have control of how feedback is applied to the system, monitor and control the feedback that is submitted to the knowledge base during reviews by using deferred feedback. Use Classification Workbench to review and analyze the effects of the feedback on knowledge base performance. If performance is improved, you can then publish the updated knowledge base.

Best practice: Establish a policy for processing the deferred feedback on a regular basis, such as once a month.

Prerequisite: In the Management Console, ensure that the knowledge base Feedback option is set to Defer processing.

Prepare the feedback for import into Classification Workbench by converting the feedback into a content set in XML format. On Windows, run the bnsExtractTexts87.exe command. On AIX, Linux, or Solaris, run the ./bnsRun extract command. Ensure that you set the following options in the configuration file:

`ExtractType = KB`
`KBName = Name_of_knowledge_base_project`
`FeedbackEvents = FeedbackPostpone`

You must also specify the time period of the feedback events to import by setting the StartTime and EndTime options. The start time must precede the end time. For example:

`StartTime = 2009/08/18 00:03:00.000`
`EndTime = 2009/08/20 23:59:59.000`

Import the XML content set and the relevant knowledge base from the Classification Module server into a new knowledge base project in Classification Workbench.
Import the analysis data into the Classification Workbench knowledge base project.
Run the Create, Analyze and Learn wizard to:

Generate baseline performance results for comparison
Apply feedback to the knowledge base by using a randomized selection of items
Test the performance of the knowledge base to which you applied learning by using the data set that you prepared for periodic regression tests.

View reports to analyze the effects of the feedback on knowledge base performance.
If knowledge base performance improved, publish the updated knowledge base to the Classification Module server.
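
If you process deferred feedback once a month, the StartTime and EndTime values for the extraction step can be generated rather than typed by hand. The following Python sketch assumes that the time format is exactly the one shown in the configuration example (year/month/day, 24-hour time, and milliseconds). The function names are hypothetical and are not part of the extraction tool.

```python
from __future__ import annotations

from datetime import datetime, timedelta

TIME_FORMAT = "%Y/%m/%d %H:%M:%S"  # assumed from the configuration example


def format_time(moment: datetime) -> str:
    """Format a datetime with a three-digit millisecond suffix."""
    return f"{moment.strftime(TIME_FORMAT)}.{moment.microsecond // 1000:03d}"


def feedback_window(year: int, month: int) -> tuple[str, str]:
    """Return (StartTime, EndTime) strings that cover one calendar month."""
    start = datetime(year, month, 1)
    if month == 12:
        next_month = datetime(year + 1, 1, 1)
    else:
        next_month = datetime(year, month + 1, 1)
    end = next_month - timedelta(milliseconds=1)
    return format_time(start), format_time(end)


def is_valid_window(start_time: str, end_time: str) -> bool:
    """Check that StartTime is strictly earlier than EndTime."""
    pattern = TIME_FORMAT + ".%f"
    return datetime.strptime(start_time, pattern) < datetime.strptime(end_time, pattern)
```

For August 2009, `feedback_window(2009, 8)` returns `("2009/08/01 00:00:00.000", "2009/08/31 23:59:59.999")`. Checking the window with `is_valid_window` before you run the extraction guards against a start time that is later than the end time.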

For detailed instructions about processing deferred feedback, see Learning from deferred feedback in the Classification Module information center.

To ensure that you have sufficient disk space on the data server to store the feedback data, periodically back up and delete data from the data server. For information, see Backing up the data server in the Classification Module information center.

Replicating your classification decisions
If you need to replicate the classification of a specific document, complete the following steps:

Locate the document in your repository and extract the document by using the Content Extractor.
Import the extracted document into a new decision plan project in Classification Workbench. Ensure that the fields that are defined in Classification Workbench are also defined on the server.
Look for the DP_Version field. The value of this field specifies the version of the decision plan and knowledge base that was used for that decision. For example, the value Tutorial_DP||4||Tutorial_KB||1.1 indicates that the document was processed using version 4 of the decision plan and version 1.1 of the knowledge base.
From the backed up copies in the ICM_HOME\dserverdir directory, locate the correct version of the knowledge base and decision plan.
Import the knowledge base into a new Classification Workbench knowledge base project.
Import the decision plan into the Classification Workbench decision plan project that you created in step 2.
Under Referenced Projects, right-click the referenced knowledge base project, select Replace Knowledge Base, and select the name of the knowledge base project that you created in step 5.
Run the document through the decision plan. For instructions, see Running an item through the decision plan in the Classification Module information center.

The original decisions can now be replicated and debugged if necessary. You can also import the current decision plan and knowledge base to compare the results.
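
If you need to look up backup versions for many documents, the DP_Version value can be split programmatically. The following Python sketch assumes that the value always has the four-part form shown in the example, with parts separated by `||` and a single referenced knowledge base. The class and function names are hypothetical.

```python
from typing import NamedTuple


class VersionInfo(NamedTuple):
    decision_plan: str
    decision_plan_version: str
    knowledge_base: str
    knowledge_base_version: str


def parse_dp_version(value: str) -> VersionInfo:
    """Split a DP_Version value such as 'Tutorial_DP||4||Tutorial_KB||1.1'."""
    parts = value.split("||")
    if len(parts) != 4:
        # Other layouts (for example, several knowledge bases) are not handled here.
        raise ValueError(f"Unexpected DP_Version value: {value!r}")
    return VersionInfo(*parts)
```

For the example value, `parse_dp_version("Tutorial_DP||4||Tutorial_KB||1.1")` yields decision plan Tutorial_DP at version 4 and knowledge base Tutorial_KB at version 1.1. The versions are kept as strings because a value such as 1.1 is a label, not a number to compute with.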

Measures definitions
It is important to understand the following concepts when you set up your classification system:

Precision
Of the items that Classification Module identifies as relevant to a category, the percentage that are actually relevant to that category. High precision means that you do not have many "false positives"; that is, you do not claim that many items belong to the category when in fact they do not.

Recall
Of the items that are actually relevant to a category, the percentage that Classification Module recognizes as such. High recall means that you do not have many “false negatives”; that is, you do not “miss” many items, and you “catch” almost all the items that belong to the category.

Automation
The percentage of your content that is classified automatically. While ideally you want to automatically classify all of your content, there might be some errors along the way. For example, some content items might receive scores that exceed the threshold of a category in the knowledge base, even if they do not belong to that category.

Error Rate
The percentage of content that is automated incorrectly. The higher the threshold is set for a category, the lower the error rate will be. This is because only documents that receive high scores for the category will be classified automatically. However, setting a higher threshold also means that a lower percentage of documents will be classified automatically.
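
To make these definitions concrete, the following Python sketch computes the four measures from a small reviewed sample. It is an illustration, not the formula set that Classification Workbench uses internally. In particular, it assumes that the error rate is measured against all content (not only the automated part) and that a document is automated when its top score exceeds the threshold. The function names and input shapes are hypothetical.

```python
from __future__ import annotations


def precision_recall(pairs: list[tuple[str, str]], category: str) -> tuple[float, float]:
    """Compute precision and recall for one category.

    Each pair is (actual_category, predicted_category).
    """
    true_positives = sum(1 for actual, predicted in pairs
                         if actual == category and predicted == category)
    predicted_count = sum(1 for _, predicted in pairs if predicted == category)
    actual_count = sum(1 for actual, _ in pairs if actual == category)
    precision = true_positives / predicted_count if predicted_count else 0.0
    recall = true_positives / actual_count if actual_count else 0.0
    return precision, recall


def automation_and_error(items: list[tuple[float, bool]], threshold: float) -> tuple[float, float]:
    """Compute the automation rate and error rate at a given threshold.

    Each item is (top_score, top_category_is_correct).
    """
    if not items:
        return 0.0, 0.0
    automated = [is_correct for score, is_correct in items if score > threshold]
    automation = len(automated) / len(items)
    # Assumption: errors are counted relative to all content.
    error_rate = sum(1 for is_correct in automated if not is_correct) / len(items)
    return automation, error_rate
```

With the sample `[(0.95, True), (0.85, False), (0.70, True), (0.60, False)]`, a threshold of 0.8 gives an automation rate of 0.5 and an error rate of 0.25. Raising the threshold to 0.9 lowers the error rate to 0.0 but also lowers the automation rate to 0.25, which is the tradeoff described in the Error Rate definition.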

* Each IBM customer is responsible for ensuring its own compliance with legal requirements. It is the customer’s sole responsibility to obtain advice of competent legal counsel as to the identification and interpretation of any relevant laws and regulatory requirements that may affect the customer’s business and any actions the customer may need to take to comply with such laws. IBM does not provide legal advice or represent or warrant that its services or products will ensure that the customer is in compliance with any law.
