IBM Content Classifier Workbench
IBM Content Classifier Workbench is an application of the IBM Content Classification product. When you create an Auto-classification model with this application, follow these practices.
- Preparing the sample content sets for training
- The more samples per category that you provide in the content set, the better the training. However, the documents in the content set for a single category must not be self-contradictory, because contradictions confuse the classifier.
- The best number is 40 - 50 items per category; a quick way to audit the per-category counts is sketched after this entry.
- The sample content set must be prepared or reviewed by subject matter experts.
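Because the Workbench reports these problems only after analysis, it can help to audit a prepared content set up front. The following Python sketch assumes a hypothetical layout with one folder per category and one document per file; it is an illustration outside the product, not a Workbench feature.

```python
# audit_content_set.py -- a standalone sketch, not part of the Workbench.
# Assumption: the sample content set is laid out as one folder per category,
# with one document per file. Adjust the loader to match your real layout.
from pathlib import Path

TARGET_LOW, TARGET_HIGH = 40, 50   # recommended items per category

def audit_content_set(root: str) -> None:
    """Print how many sample documents each category folder contains."""
    for category_dir in sorted(Path(root).iterdir()):
        if not category_dir.is_dir():
            continue
        count = sum(1 for item in category_dir.iterdir() if item.is_file())
        if count < TARGET_LOW:
            note = "below the recommended range: add more samples"
        elif count > TARGET_HIGH:
            note = "above the recommended range: review for contradictory documents"
        else:
            note = "within the recommended 40 - 50 range"
        print(f"{category_dir.name}: {count} documents ({note})")

if __name__ == "__main__":
    audit_content_set("sample_content_set")   # hypothetical folder name
```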
- Creating a Knowledge Base
- You must specify at least one field from the content set to be analyzed by the Classification Module. This field must contain meaningful text.
- The number of documents per category
- The number of documents per category in the Knowledge Base should be inversely proportional to the
number of categories: the fewer categories you have, the more documents you should maintain per category
within the Knowledge Base. The default value is 80. To change this value, follow these steps:
- Open an existing Knowledge Base or create a new one. Click Project Options.
- Click Advanced and edit Optimally maintain X documents per category.
- Increase the number as necessary. If you have five categories, for example, this number can be set to a range of 200 to 300. If you have many categories, keep this number around 100 rather than 200 or 300. A rule-of-thumb helper for choosing this value is sketched after these steps.
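The inverse relationship between the number of categories and the Optimally maintain X documents per category setting can be expressed as a small rule of thumb. The cut-offs in this sketch are assumptions derived only from the guidance above (five categories: 200 - 300, many categories: about 100); they are not product defaults.

```python
def suggested_documents_per_category(num_categories: int) -> int:
    """Rule-of-thumb value for 'Optimally maintain X documents per category'.

    The exact cut-offs are illustrative assumptions based on the guidance
    above, not values taken from the product (whose default is 80).
    """
    if num_categories <= 5:
        return 250     # few categories: mid-point of the suggested 200 - 300 range
    if num_categories <= 20:
        return 150     # assumed taper between the two documented cases
    return 100         # many categories: keep the value around 100

# Example: suggested_documents_per_category(5) returns 250.
```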
- Training your Knowledge Base
- Follow these steps to train your Knowledge Base:
- Click Create, analyze and learn.
- For actual training, use any option other than Create using all, analyze using all.
- Use Create using all, analyze using all only to assess the data consistency of the document set contents and to check whether the information in the content can statistically predict the categories. A good document set yields more than 95% correct for the top-ranking category in this test.
- Use one of the other methods in the list, such as Create using even, analyze using odd, to get a more realistic estimate of prediction accuracy after the assessment is done.
- The right option depends on how the content set was prepared and shared by the subject matter experts. Some degree of randomness must be introduced in how the content set is partitioned between training and analysis. In general, the best option is Create using even, analyze using odd; a conceptual sketch of this even/odd partitioning follows these steps.
- On the next page, enable Save learning data (SARC File) with Knowledge base so that the model can accept feedback.
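Conceptually, Create using even, analyze using odd partitions the content set by index parity so that half the documents build the Knowledge Base and the other half measure it. The sketch below mirrors that idea, including a seeded shuffle to introduce some randomness into the partition; it is an illustration, not the Workbench's implementation.

```python
import random

def even_odd_split(documents, seed=0):
    """Split a list of (text, category) pairs into a training half (even
    positions) and an analysis half (odd positions). A seeded shuffle first
    removes any bias from the original document order."""
    docs = list(documents)
    random.Random(seed).shuffle(docs)
    training = docs[0::2]   # even positions: used to create the Knowledge Base
    analysis = docs[1::2]   # odd positions: used to analyze and estimate accuracy
    return training, analysis
```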
- Analyzing and reviewing your Knowledge Base by using the built-in reports
- Click View Reports on the toolbar to open the View Reports window.
- To understand the overall accuracy of your Knowledge Base, view the Knowledge Base Data Sheet, the Cumulative Success summary reports, and the Total Precision vs. Recall graph.
- In the Knowledge Base Data Sheet, the first column should report a number greater than 95, which indicates that the Knowledge Base performed well during analysis. The further this number falls below 95, the more self-contradictory the Knowledge Base is. This expectation applies to the data consistency verification with Create using all, analyze using all; for a regular create-and-analyze run, the number can be lower.
- In the Total Precision vs. Recall graph, the curve should lie in the upper-right portion, which indicates that the Knowledge Base performs well. A generic illustration of how such a curve is computed follows this list.
- If the curve lies in the lower-left portion, performance of the Knowledge Base is poor.
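The Total Precision vs. Recall graph shows how precision trades off against recall as the confidence threshold for accepting a classification rises. The sketch below is one generic way to compute such a curve from (predicted category, confidence, true category) triples; it is not the report's actual implementation.

```python
def precision_recall_points(results, thresholds):
    """results: list of (predicted_category, confidence, true_category) triples.
    At each threshold, documents whose confidence falls below it are left
    unclassified; precision is measured over the classified documents,
    recall over all documents."""
    total = len(results)
    points = []
    for threshold in thresholds:
        accepted = [(pred, true) for pred, conf, true in results if conf >= threshold]
        correct = sum(1 for pred, true in accepted if pred == true)
        precision = correct / len(accepted) if accepted else 1.0
        recall = correct / total if total else 0.0
        points.append((threshold, precision, recall))
    return points

# A good Knowledge Base keeps both numbers high, so the plotted curve stays in
# the upper-right portion of the graph, as described above.
```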
- Choosing to reserve items from the training set in the Knowledge Base
- Open an existing Knowledge Base or create a new one.
- Click Knowledge Base.
- Right-click Reserve Items. This ensures that no matter how much feedback you give to this Knowledge Base in the future, the reliable sample data set that was used for training is always retained within the Knowledge Base.
- Working with Learning Data (SARC)
- In Workbench projects, you can set the Knowledge Base to work with the Learning Data method. During the training process, a file that retains the important training content information (SARC) is generated along with the Knowledge Base.
- Assessing the data quality in IBM Content Classifier Workbench
- Run Create using all, analyze using all on the data to verify that it is statistically consistent. Results are expected to be unrealistically high: about 95% or more correct for the top category. If you get less than that, the data sets of the categories are contradictory.
- After you pass that initial test, run a realistic test by training on one part of the data and analyzing the other part. This gives you an idea of the results that you can expect in deployed systems. Use either the even and odd sets or random cuts, as in the sketch that follows.
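Both stages of this assessment can be prototyped outside the Workbench by letting any text classifier stand in for the Classification Module. The sketch below uses scikit-learn's TF-IDF vectorizer and naive Bayes purely as a stand-in: stage 1 trains and scores on all documents (the consistency check), stage 2 trains on the even half and scores the odd half (the realistic estimate). The classifier choice and the names here are assumptions for illustration only.

```python
# A minimal sketch of the two-stage assessment. scikit-learn is used only as a
# stand-in classifier; the real assessment runs inside the Workbench.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

def top_category_accuracy(train_docs, test_docs):
    """Train on one list of (text, category) pairs and report how often the
    top-ranked category matches the true category on another list."""
    model = make_pipeline(TfidfVectorizer(), MultinomialNB())
    train_texts, train_labels = zip(*train_docs)
    test_texts, test_labels = zip(*test_docs)
    model.fit(train_texts, train_labels)
    predictions = model.predict(test_texts)
    return sum(p == t for p, t in zip(predictions, test_labels)) / len(test_docs)

def assess(documents):
    """documents: list of (text, category) pairs prepared by the experts."""
    # Stage 1: "create using all, analyze using all" -- consistency check.
    consistency = top_category_accuracy(documents, documents)
    # Stage 2: "create using even, analyze using odd" -- realistic estimate.
    training, analysis = documents[0::2], documents[1::2]
    realistic = top_category_accuracy(training, analysis)
    print(f"Consistency check (all/all):   {consistency:.1%}  (expect about 95% or more)")
    print(f"Realistic estimate (even/odd): {realistic:.1%}")
```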
For more information about how to work with IBM Content Classification, see this Redbook at http://www.redbooks.ibm.com/redbooks/pdfs/sg247707.pdf