IBM Content Classifier Workbench
IBM Content Classifier Workbench is an application of the IBM Content Classification product. When you create an Auto-classification model with this application, follow these practices.
- Preparing the sample content sets for training
- The more samples per category that you provide in the content set, the better the training. However, the documents in the content set for a single category must not be self-contradictory, because contradictions confuse the classifier.
- The best number is 40 - 50 items per category; a quick way to audit the per-category counts is sketched after this entry.
- The sample content set must be prepared or reviewed by subject matter experts.
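Because the Workbench reports these problems only after analysis, it can help to audit a prepared content set up front. The following Python sketch assumes a hypothetical layout with one folder per category and one document per file; it is an illustration outside the product, not a Workbench feature.

```python
# audit_content_set.py -- a standalone sketch, not part of the Workbench.
# Assumption: the sample content set is laid out as one folder per category,
# with one document per file. Adjust the loader to match your real layout.
from pathlib import Path

TARGET_LOW, TARGET_HIGH = 40, 50   # recommended items per category

def audit_content_set(root: str) -> None:
    """Print how many sample documents each category folder contains."""
    for category_dir in sorted(Path(root).iterdir()):
        if not category_dir.is_dir():
            continue
        count = sum(1 for item in category_dir.iterdir() if item.is_file())
        if count < TARGET_LOW:
            note = "below the recommended range: add more samples"
        elif count > TARGET_HIGH:
            note = "above the recommended range: review for contradictory documents"
        else:
            note = "within the recommended 40 - 50 range"
        print(f"{category_dir.name}: {count} documents ({note})")

if __name__ == "__main__":
    audit_content_set("sample_content_set")   # hypothetical folder name
```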
- Creating a Knowledge Base
- You must specify at least one field from the content set to be analyzed by the Classification Module. This field must contain meaningful text.
- The number of documents per category
- The number of documents per category in the Knowledge Base should be inversely proportional to the
number of categories: the fewer categories you have, the more documents you should maintain per category
within the Knowledge Base. The default value is 80. To change this value, follow these steps:
- Open an existing Knowledge Base or create a new one. Click Project Options.
- Click Advanced and edit Optimally maintain X documents per category.
- Increase the number as necessary. If you have five categories, for example, this number can be set to a range of 200 to 300. If you have many categories, keep this number around 100 rather than 200 or 300. A rule-of-thumb helper for choosing this value is sketched after these steps.
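The inverse relationship between the number of categories and the Optimally maintain X documents per category setting can be expressed as a small rule of thumb. The cut-offs in this sketch are assumptions derived only from the guidance above (five categories: 200 - 300, many categories: about 100); they are not product defaults.

```python
def suggested_documents_per_category(num_categories: int) -> int:
    """Rule-of-thumb value for 'Optimally maintain X documents per category'.

    The exact cut-offs are illustrative assumptions based on the guidance
    above, not values taken from the product (whose default is 80).
    """
    if num_categories <= 5:
        return 250     # few categories: mid-point of the suggested 200 - 300 range
    if num_categories <= 20:
        return 150     # assumed taper between the two documented cases
    return 100         # many categories: keep the value around 100

# Example: suggested_documents_per_category(5) returns 250.
```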
- Training your Knowledge Base
- Follow these steps to train your Knowledge Base:
- Click Create, analyze and learn.
- For actual training, use any option other than Create using all, analyze using all.
- Use Create using all, analyze using all only to assess the data consistency of the document set contents and to check whether the information in the content can statistically predict the categories. A good document set yields more than 95% correct for the top-ranking category in this test.
- Use one of the other methods in the list, such as Create using even, analyze using odd, to get a more realistic estimate of prediction accuracy after the assessment is done.
- The right option depends on how the content set was prepared and shared by the subject matter experts. Some degree of randomness must be introduced in how the content set is partitioned between training and analysis. In general, the best option is Create using even, analyze using odd; a conceptual sketch of this even/odd partitioning follows these steps.
- On the next page, enable Save learning data (SARC File) with Knowledge base so that the model can accept feedback.
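Conceptually, Create using even, analyze using odd partitions the content set by index parity so that half the documents build the Knowledge Base and the other half measure it. The sketch below mirrors that idea, including a seeded shuffle to introduce some randomness into the partition; it is an illustration, not the Workbench's implementation.

```python
import random

def even_odd_split(documents, seed=0):
    """Split a list of (text, category) pairs into a training half (even
    positions) and an analysis half (odd positions). A seeded shuffle first
    removes any bias from the original document order."""
    docs = list(documents)
    random.Random(seed).shuffle(docs)
    training = docs[0::2]   # even positions: used to create the Knowledge Base
    analysis = docs[1::2]   # odd positions: used to analyze and estimate accuracy
    return training, analysis
```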
- Analyzing and reviewing your Knowledge Base by using the built-in reports
- Click View Reports on the toolbar to open the View Reports window.
- To understand the overall accuracy of your Knowledge Base, view the Knowledge Base Data Sheet, the Cumulative Success summary reports, and the Total Precision vs. Recall graph.
- In the Knowledge Base Data Sheet, the first column should report a number greater than 95, which indicates that the Knowledge Base performed well during analysis. The further this number falls below 95, the more self-contradictory the Knowledge Base is. This expectation applies to the data consistency verification with Create using all, analyze using all; for a regular create-and-analyze run, the number can be lower.
- In the Total Precision vs. Recall graph, the curve should lie in the upper-right portion, which indicates that the Knowledge Base performs well. A generic illustration of how such a curve is computed follows this list.
- If the curve lies in the lower-left portion, performance of the Knowledge Base is poor.
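The Total Precision vs. Recall graph shows how precision trades off against recall as the confidence threshold for accepting a classification rises. The sketch below is one generic way to compute such a curve from (predicted category, confidence, true category) triples; it is not the report's actual implementation.

```python
def precision_recall_points(results, thresholds):
    """results: list of (predicted_category, confidence, true_category) triples.
    At each threshold, documents whose confidence falls below it are left
    unclassified; precision is measured over the classified documents,
    recall over all documents."""
    total = len(results)
    points = []
    for threshold in thresholds:
        accepted = [(pred, true) for pred, conf, true in results if conf >= threshold]
        correct = sum(1 for pred, true in accepted if pred == true)
        precision = correct / len(accepted) if accepted else 1.0
        recall = correct / total if total else 0.0
        points.append((threshold, precision, recall))
    return points

# A good Knowledge Base keeps both numbers high, so the plotted curve stays in
# the upper-right portion of the graph, as described above.
```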
- Choosing to reserve items from the training set in the Knowledge Base
- Open an existing Knowledge Base or create a new one.
- Click Knowledge Base.
- Right-click Reserve Items. This ensures that no matter how much feedback you give to this Knowledge Base in the future, the reliable sample data set that was used for training is always retained within the Knowledge Base.
- Working with Learning Data (SARC)
- In Workbench projects, you can set the Knowledge Base to work with the Learning Data method. During the training process, a file that retains the important training content information (SARC) is generated along with the Knowledge Base.
- Assessing the data quality in IBM Content Classifier Workbench
- Run Create using all, analyze using all on the data to verify that it is statistically consistent. Results are expected to be unrealistically high: about 95% or more correct for the top category. If you get less than that, the data sets of the categories are contradictory.
- After you pass that initial test, run a realistic test by training on one part of the data and analyzing the other part. This gives you an idea of the results that you can expect in deployed systems. Use either the even and odd sets or random cuts, as in the sketch that follows.
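Both stages of this assessment can be prototyped outside the Workbench by letting any text classifier stand in for the Classification Module. The sketch below uses scikit-learn's TF-IDF vectorizer and naive Bayes purely as a stand-in: stage 1 trains and scores on all documents (the consistency check), stage 2 trains on the even half and scores the odd half (the realistic estimate). The classifier choice and the names here are assumptions for illustration only.

```python
# A minimal sketch of the two-stage assessment. scikit-learn is used only as a
# stand-in classifier; the real assessment runs inside the Workbench.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

def top_category_accuracy(train_docs, test_docs):
    """Train on one list of (text, category) pairs and report how often the
    top-ranked category matches the true category on another list."""
    model = make_pipeline(TfidfVectorizer(), MultinomialNB())
    train_texts, train_labels = zip(*train_docs)
    test_texts, test_labels = zip(*test_docs)
    model.fit(train_texts, train_labels)
    predictions = model.predict(test_texts)
    return sum(p == t for p, t in zip(predictions, test_labels)) / len(test_docs)

def assess(documents):
    """documents: list of (text, category) pairs prepared by the experts."""
    # Stage 1: "create using all, analyze using all" -- consistency check.
    consistency = top_category_accuracy(documents, documents)
    # Stage 2: "create using even, analyze using odd" -- realistic estimate.
    training, analysis = documents[0::2], documents[1::2]
    realistic = top_category_accuracy(training, analysis)
    print(f"Consistency check (all/all):   {consistency:.1%}  (expect about 95% or more)")
    print(f"Realistic estimate (even/odd): {realistic:.1%}")
```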
For more information about how to work with IBM Content Classification, see this Redbook at http://www.redbooks.ibm.com/redbooks/pdfs/sg247707.pdf