Preparing data for import

You use a set of sample content items to create and analyze a knowledge base or test a decision plan.

Sample data should include items with textual content that are representative of those you expect IBM® Content Classification to classify, such as emails, documents, or web form texts.

Categorized content for knowledge base creation

Sample content must be categorized before it can be used to create a knowledge base. You must therefore define a set of categories, known as a taxonomy, before you create the knowledge base. The categories that you choose should be representative of the business practices that you want to address by using Content Classification. Sample content that is used to analyze a decision plan does not need to be categorized.

Your sample data might be categorized before you import it (for example, similar content items grouped in folders whose names represent categories), or you can categorize the data after you import it into Classification Workbench. Classification Workbench analyzes the categorized content items and creates a statistical model of each category. These category models are stored in the knowledge base. When you use Content Classification to classify new content, the content is compared to the category models, the best matches are found, and an automatic action is taken (such as moving documents into appropriate folders).
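Before you import pre-categorized data, it can help to check how the sample items are distributed across the category folders. The following Python sketch illustrates one way to do that. It is not part of Content Classification or Classification Workbench. It assumes a layout in which each top-level folder under a root directory is named after a category, and the function names collect_categorized_items and category_counts are hypothetical.

```python
from collections import defaultdict
from pathlib import Path


def collect_categorized_items(root):
    """Map each category (top-level folder name) to the files found under it.

    Assumption: every folder directly under `root` represents one category.
    """
    items = defaultdict(list)
    for folder in sorted(Path(root).iterdir()):
        if not folder.is_dir():
            continue  # files directly under the root have no category yet
        for path in sorted(folder.rglob("*")):
            if path.is_file():
                items[folder.name].append(path)
    return dict(items)


def category_counts(items):
    """Return the number of sample items found for each category."""
    return {category: len(paths) for category, paths in items.items()}


if __name__ == "__main__":
    import sys

    # Usage: python check_samples.py <root folder of the sample data>
    counts = category_counts(collect_categorized_items(sys.argv[1]))
    for category, count in sorted(counts.items()):
        print(f"{category}: {count} item(s)")
```

Folders that contain no files do not appear in the result, so a category that is missing from the output is a sign that it still needs sample content.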

Creating a knowledge base from sample content items yields the best results. An alternative method that is based on keyword data is also available.