Overview

The Watson™ Explorer Engine clustering architecture creates and labels clusters directly from the content of the input documents. In IBM® Watson Explorer Engine, clustering does not rely on a static list of pre-defined category labels, but rather infers labels directly from your content. This has many advantages, but it incurs a cost: labels can occasionally look strange, just as the automatically generated document summaries that search engines return can look strange. This portion of the documentation explains how to improve the cluster labels with a little customization.

Watson Explorer Engine uses the term knowledge base to refer to a collection of information about common words (known as stopwords) that should not be used as cluster labels, and related terms that should be grouped together through various types of rules. The knowledge bases that are being used by your search project are stored in your project's stoplist option. (The name of this option is a legacy of earlier versions of Watson Explorer Engine for which most clustering knowledge consisted of stopwords.) The stoplist is an ordered list of knowledge bases, delimited by '+' symbols, that you can set in the XML for your project or in the Watson Explorer Engine administration tool on the Projects >> Clustering tab.
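Because the stoplist value is an ordered, '+'-delimited list, it can help to treat it as a sequence of names rather than an opaque string. The following Python sketch shows one way to inspect or extend such a value outside the product. The helper names and the knowledge base names used in the example (core, english, custom, patents) are hypothetical and are not taken from the Watson Explorer Engine software.

```python
def parse_stoplist(value):
    """Split a '+'-delimited stoplist value into an ordered list of names."""
    return [name.strip() for name in value.split("+") if name.strip()]


def append_knowledge_base(value, name):
    """Return the stoplist value with `name` appended, unless already present."""
    names = parse_stoplist(value)
    if name not in names:
        names.append(name)  # order is preserved; the new entry goes last
    return "+".join(names)


# Hypothetical knowledge base names, for illustration only.
current = "core+english+custom"
updated = append_knowledge_base(current, "patents")
```

In this sketch a new knowledge base is appended at the end, so the existing order is left untouched.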

Currently, setting the Main and Secondary languages in the Simple tab, Language section, sets the default value of stoplist for that language combination. See the Language Configuration in Watson Explorer Engine section for more details. To see the actual value of the stoplist option, turn on debugging from your project's results page and click the [+more] link at the end of the last section, which is called Modified Variables. You should see the values of all the high-level language variables (such as language.main and language.other) as well as the low-level ones (such as stoplist, stem, and segmenter).

Watson Explorer Engine provides a number of predefined knowledge bases, each of which has an associated set of stopwords. You can also add your own knowledge bases, to which you can add both clustering rules and custom stopwords. You add a new knowledge base by adding a kb element to your project's XML, or (more commonly) in the Watson Explorer Engine administration tool interface using the Knowledge Bases screen. Once you have finished creating a knowledge base, you can add it to your project's Domain and custom knowledge bases for clustering variable (in the Simple tab, Language section).

The most common reason to add a knowledge base, or to add entries to an existing one, is to improve unsatisfactory cluster labels caused by commonly occurring but uninformative phrases. For example, "Boeing" is generally a highly descriptive label, but if you are only clustering the output of Boeing's search engine, all content relates to Boeing and the label loses its descriptiveness. The label might still appear because only some of the search results actually contain the word "Boeing" and are thus deemed to be similar to one another. Another example: when clustering patent abstracts, words like "method" and "apparatus" are uninformative because most patent abstracts customarily use one of these words. In these cases, "Boeing", "method", and "apparatus" should be declared as stopwords. See the section on Clustering Stopwords for more detailed information.
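One informal way to spot such words before adding them as stopwords is to count how many sample documents each word appears in. The Python sketch below does this for a handful of made-up patent-style titles. It is not part of Watson Explorer Engine; the function name, the threshold parameter, and the sample texts are all assumptions made for illustration.

```python
import re
from collections import Counter


def stopword_candidates(docs, threshold=0.8):
    """Return words that occur in at least `threshold` of the documents."""
    doc_freq = Counter()
    for text in docs:
        # Count each word once per document (document frequency).
        doc_freq.update(set(re.findall(r"[a-z]+", text.lower())))
    cutoff = threshold * len(docs)
    return sorted(word for word, n in doc_freq.items() if n >= cutoff)


# Made-up sample texts, for illustration only.
docs = [
    "Method for cooling a turbine blade",
    "Apparatus and method for wireless charging",
    "Method of encoding video frames",
    "Method and apparatus for water filtration",
]

frequent = stopword_candidates(docs)          # strict threshold
broader = stopword_candidates(docs, 0.5)      # looser threshold
```

A looser threshold also surfaces ordinary function words, which predefined knowledge bases would normally already cover, so the output is only a list of candidates to review by hand.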

Poor labels are sometimes due to mistakes made by the search engine. For example, some corporate HTML documents contain navigation links like "Solutions | Support | Investors | Education | Careers | News & Events | About Us" which may be reflected in the summaries returned by the search engine. These uninformative phrases are best ignored by the clustering engine, but it must be told how to do so, since these mistakes are specific to the search engine installation.
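To make the problem concrete, the Python sketch below removes a known navigation phrase from a summary string. It only illustrates why such phrases need to be listed explicitly for each installation. It does not show how the clustering engine or a knowledge base handles them, and the function name and sample summary are hypothetical.

```python
NAV_PHRASE = (
    "Solutions | Support | Investors | Education | Careers | "
    "News & Events | About Us"
)


def strip_boilerplate(summary, phrases):
    """Remove each known boilerplate phrase and normalize whitespace."""
    for phrase in phrases:
        summary = summary.replace(phrase, " ")
    return " ".join(summary.split())


# Hypothetical summary text, for illustration only.
raw = "Acme " + NAV_PHRASE + " Quarterly results"
clean = strip_boilerplate(raw, [NAV_PHRASE])
```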

Poor labels are also sometimes due to concepts being incorrectly grouped or incorrectly separated. For example, the Porter (English) stemmer will analyze "engine" and "engineering" as being the same word, but will not find "theatre" and "theater" to be the same. These problems and their solutions are discussed in the Stemming and Rephrase Rules sections.
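The effect of both kinds of error, and of correcting them, can be sketched in a few lines of Python. The stems below are hard-coded for the four example words so that the sketch needs no stemmer library, and the override table is a hypothetical stand-in for whatever rules you define. It does not use the syntax of Watson Explorer Engine stemming or rephrase rules.

```python
# Stems for the four example words, hard-coded to avoid a stemmer dependency.
PORTER_STEMS = {
    "engine": "engin",
    "engineering": "engin",
    "theatre": "theatr",
    "theater": "theater",
}

# Hypothetical overrides: force a shared or a distinct grouping key.
OVERRIDES = {
    "theatre": "theater",          # merge the two spellings
    "engineering": "engineering",  # keep separate from "engine"
}


def group_key(word, overrides=None):
    """Return the key under which a word would be grouped."""
    w = word.lower()
    if overrides and w in overrides:
        return overrides[w]
    return PORTER_STEMS.get(w, w)


# Without overrides: "engine" and "engineering" collapse, the spellings do not.
merged_by_stemmer = group_key("engine") == group_key("engineering")
split_by_stemmer = group_key("theatre") != group_key("theater")
```

Passing OVERRIDES as the second argument reverses both outcomes in this sketch: the two spellings share a key, and "engine" and "engineering" no longer do.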

Labels may also not always appear exactly as you would prefer. This is another problem that can be solved by writing special rules, which in this case are known as redisplay rules. See the section on Redisplay Rules for more detailed information. You can also view the redisplay schema in the online help.

With all these options for customizing and improving output from the Watson Explorer Engine software, customization may sound like a daunting task. The process is actually very simple and requires little work. For example, for most small search engine projects we specify no more than 10 extra stopwords and 3 or 4 phrases. These knowledge base entries are for the most part quite obvious after a few queries, and adding such simple customizations often greatly improves the clustering quality on content as diverse as news, patents, auction items, and corporate documents.

See the Knowledge Base Evaluation Sequence section for detailed information about how knowledge base rules are applied, and the heuristics used to prevent loops within different types of rules.