Introduction to categorization and cleaning

You can use various techniques for categorizing items and cleaning your content set. Cleanup involves removing unwanted strings (or entire content items) that would negatively impact knowledge base training and performance. A clean, well-categorized content set will maximize knowledge base performance.

As you use these techniques, it is important to ensure that the items in each category remain representative of the type of data you expect IBM® Content Classification to classify in the future. Optimal results are achieved when your content set contains data that is as close as possible in content and structure to the "real-life" data that the system will classify. Content items should include the same "noise" (that is, imperfections, misspellings, extraneous text, and so on) as the items that the system will encounter.

Important: Data that is used to create a knowledge base must be representative of the actual data that you expect the system to classify in the real world. If you find that removing "noise" (for example, a fixed pattern) improves knowledge base performance, and this same pattern will appear in real-world data, this noise should also be filtered out of the incoming data (through a custom preprocessing script) before the data is classified.
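As an illustration of such a preprocessing step, the following minimal Python sketch removes a fixed pattern from incoming text before the text is passed on for classification. It is not part of the Content Classification API. The pattern (a "Ref:" tracking line) and the function name strip_fixed_pattern are hypothetical. A real script would use the same pattern that was removed from the content set.

```python
import re

# Hypothetical fixed pattern: a tracking line such as "Ref: ABC-12345"
# that was removed from the content set during cleanup.
FIXED_PATTERN = re.compile(r"^Ref: [A-Z]{3}-\d{5}[ \t]*\n?", re.MULTILINE)


def strip_fixed_pattern(text: str) -> str:
    """Return text with every line that matches FIXED_PATTERN removed."""
    return FIXED_PATTERN.sub("", text)


# Apply the same filter to each incoming item before it is classified.
incoming = "Ref: ABC-12345\nMy order has not arrived yet.\n"
cleaned = strip_fixed_pattern(incoming)
```

Running the same filter over both the content set and the incoming data keeps the two consistent, which is the point of the note above.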

Categorization and cleanup can proceed in parallel. As you categorize items, you might notice text strings that should be removed. A number of the features provide techniques to both categorize and clean content items.

Your use of these techniques depends on the state of your data upon import. In most cases, a content set will require some degree of categorization and cleanup. If, however, your content set was well-categorized and very "clean" (that is, representative of data you expect the system to classify) prior to import, you can proceed to create and analyze the knowledge base. Depending on how the resulting knowledge base performs, you could then use these techniques to fine-tune the content set and improve knowledge base performance.

The following examples illustrate how you might clean and categorize your content set, prior to creating and analyzing a knowledge base.

Example 1

Your company plans to use an email classification application powered by Content Classification. Until now, your agents have been pasting standard "canned text" responses into the email replies that they send in response to customer questions.

You created a content set by collecting and importing email messages containing typical questions. Most of the messages were copied from your email application's Sent folder, so they contain both the customer's original questions and the canned text responses that were pasted into the replies sent by your company.

Some of these canned text responses might correspond to appropriate categories for your content set (that is, the same category should be applied to all emails containing a particular canned text). For example, you have a number of emails containing questions about your product line, and a standard canned text response describing your products. You might want to create a category called Product Line and apply it to these emails.

Using Classification Workbench, you can locate all content items containing the canned description of your product line, automatically apply the Product Line category to these items, and "clean" the canned response from each item. (For your content set to be effective, the canned text responses must be removed from the content items, so that the content items will only contain the original inquiries.)
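Classification Workbench performs these steps through its own interface. Purely to make the logic concrete, the following Python sketch shows the same idea outside the product: find items that contain a canned response, apply a category to them, and strip the canned text. The ContentItem class, the canned text, and the function name are all hypothetical and do not reflect how Classification Workbench stores or processes content items.

```python
from dataclasses import dataclass, field
from typing import List, Set

# Hypothetical canned response and the category it implies.
CANNED_TEXT = "Our product line includes the A100, B200, and C300 models."
CATEGORY = "Product Line"


@dataclass
class ContentItem:
    """Hypothetical stand-in for one content item (one email message)."""
    body: str
    categories: Set[str] = field(default_factory=set)


def categorize_and_clean(items: List[ContentItem]) -> int:
    """Tag items that contain CANNED_TEXT and remove that text from them.

    Returns the number of items that were changed.
    """
    changed = 0
    for item in items:
        if CANNED_TEXT in item.body:
            item.categories.add(CATEGORY)
            # Keep only the original inquiry.
            item.body = item.body.replace(CANNED_TEXT, "").strip()
            changed += 1
    return changed


items = [
    ContentItem("What products do you sell?\n" + CANNED_TEXT),
    ContentItem("Where is my invoice?"),
]
changed = categorize_and_clean(items)
```

In this sketch only the first item is categorized and cleaned. The second item contains no canned text, so it is left untouched and uncategorized.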

Example 2

Your organization plans to use a Content Classification-based application to classify news articles. You build a content set by collecting and importing a number of sample news articles from the Internet. Along with the main body text, the web pages include extra, seemingly unnecessary text (for example, copyright information) that is unrelated to the news articles' content.

If this extra text will not be included in actual news articles you plan to classify, it should be removed. If, however, you expect to classify news articles with similar "unnecessary" text, you should leave it in to maximize knowledge base training and performance.

Example 3

You create a content set by gathering and importing archived email messages. Other people in your company also forwarded messages to you for inclusion in the content set. When these messages were forwarded, your own company's signature was added at the bottom of each message.

Because this footer text will not appear in the messages you expect to receive and classify, you search for and remove all occurrences of your company's footer. You might want to leave in other footer text (besides your company's footer), if you expect to receive messages with these types of footers.
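The search-and-remove step can be pictured with a short Python sketch. It assumes that the company footer is a known, fixed string. The footer text and the function name remove_company_footer are hypothetical, and the sketch illustrates only the logic, not a Classification Workbench feature.

```python
# Hypothetical footer that your own company appends to forwarded mail.
COMPANY_FOOTER = "--\nExample Corp Support Team\nwww.example.com"


def remove_company_footer(message: str) -> str:
    """Remove every occurrence of COMPANY_FOOTER and trim trailing whitespace.

    Any other footer text in the message is left in place.
    """
    return message.replace(COMPANY_FOOTER, "").rstrip()


forwarded = "Please reset my password.\n\nSent from my phone\n" + COMPANY_FOOTER
cleaned = remove_company_footer(forwarded)
```

Here the customer's own "Sent from my phone" footer is kept, because similar footers are expected in the messages that will be classified.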