Clustering Stopwords

A stopword is a word that has little meaning by itself. For example, the, a, then, and towards are stopwords for all English documents. Watson™ Explorer Engine has general lists of stopwords for English and some other languages. Which stoplists are used depends on the Clustering Configuration option. A stopword can never appear by itself as a cluster label, although it might be used within a phrase, depending on the phrase_mode of the stopword.

Some words that are ordinarily interesting lose that quality within specialized contexts. Consider these examples:

  • method, apparatus, and patent in patent documents
  • Boeing and company within a Boeing search engine
  • product, price, and shipping within a shopping search engine
  • problem, report, and resolve in a collection of customer problem reports
  • computer, computational, and compute in a computer science publication

The last example illustrates an important point: stopwords are always matched verbatim. A compute stopword does not automatically stop computes from appearing; each variant must be added separately.
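
The verbatim rule can be illustrated with a minimal Python sketch. This is a model of the behavior described above, not the engine's code; the stoplist contents and the function names is_stopped and expand_with_variants are hypothetical.

```python
# Illustrative model only: stopwords are compared verbatim, with no stemming.
STOPWORDS = {"compute"}  # hypothetical stoplist with a single entry


def is_stopped(word, stopwords):
    """Return True only when the exact word is in the stoplist."""
    return word in stopwords


def expand_with_variants(stopwords, variants):
    """Return a new stoplist with each variant added explicitly."""
    return set(stopwords) | set(variants)


# Each inflected form has to be listed by hand.
expanded = expand_with_variants(STOPWORDS, ["computes", "computer", "computational"])
```

With the original stoplist, is_stopped("computes", STOPWORDS) is false even though compute is listed; only the expanded stoplist stops the variant.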

Each stopword has a phrase mode specification:

  • default: Uninteresting by itself.
  • not-start: Uninteresting by itself or at the start of a phrase. For example, Inc. is not interesting at the start of a phrase.
  • not-end: Uninteresting by itself or at the end of a phrase. For example, the is not interesting at the end of a phrase.
  • link: Uninteresting at the beginning or end of a phrase. For example, of is only interesting in the middle of a phrase.
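
The four phrase modes can be read as a small decision rule for candidate labels. The Python sketch below models that reading only; the stoplist entries, the lowercasing of labels, and the function name label_allowed are assumptions made for illustration, and the engine's own matching logic may differ.

```python
# Illustrative model of the four phrase modes described above.
# The stoplist is hypothetical; lowercasing the label is an assumption.
STOPWORDS = {
    "test": "default",
    "inc.": "not-start",
    "the": "not-end",
    "of": "link",
}


def label_allowed(label, stopwords):
    """Return True if the candidate label passes the phrase-mode rules."""
    words = label.lower().split()
    if not words:
        return False
    first, last = words[0], words[-1]
    # A stopword in any mode never stands alone as a label.
    if len(words) == 1:
        return first not in stopwords
    # not-start and link stopwords may not open a phrase.
    if stopwords.get(first) in ("not-start", "link"):
        return False
    # not-end and link stopwords may not close a phrase.
    if stopwords.get(last) in ("not-end", "link"):
        return False
    return True
```

Under this model, "Bank of America" is allowed because of sits in the middle of the phrase, while "of mice" and "bank of" are both rejected.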

Clustering stopwords can be added by using the stopword element in the XML, by adding them to a custom knowledge base, by adding them to a specific source, or by using the Watson Explorer Engine API.

For example, to add test to the custom knowledge base using the Watson Explorer Engine administration tool, go to the Configuration >> Knowledge Bases section and click on the custom knowledge base. Type "test" into the lookup box and press Enter. On the details page, select stopword as the treatment and pick the appropriate phrase mode. Click Save.

You can view the XML for this knowledge base by clicking the XML button on the Knowledge Bases >> Entry List tab. You can also use this XML view to add or edit stopwords directly. The following example shows the stopword that we just defined:

<stopword word="test" />

Note: If you add stopwords through the Watson Explorer Engine administration tool and then examine the XML for your knowledge base, you will see additional stopword attributes, such as the name of the user who created the stopword and a timestamp for when it was created. These attributes are optional, but they can be useful when multiple users are working on the same knowledge base and you are trying to diagnose problems.
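
When many stopwords need to be added through the XML view, the elements can be generated rather than typed by hand. The following Python sketch uses the standard xml.etree.ElementTree module and emits only the word attribute shown in the example above. The helper names and the word list are hypothetical, and the enclosing knowledge base XML is not shown in this section, so the output is a set of fragments to paste in rather than a complete document.

```python
import xml.etree.ElementTree as ET


def stopword_element(word):
    """Build a <stopword> element carrying only the word attribute."""
    return ET.Element("stopword", {"word": word})


def stopword_xml(words):
    """Serialize one <stopword> element per word, in the given order."""
    return [ET.tostring(stopword_element(w), encoding="unicode") for w in words]


# Hypothetical word list for a patent collection.
lines = stopword_xml(["test", "method", "apparatus"])
print("\n".join(lines))
```

Building the elements with a library instead of string concatenation means that special characters in a word, such as an ampersand or a double quote, are escaped correctly.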

The same interface allows you to add stopphrases. A stopphrase stops every phrase whose words share the same stems (see the Stemming section for additional information about stems). For example, stopping the phrase computational methods will also stop the phrase computer method. To turn a phrase into a stopphrase in the Watson Explorer Engine administration tool, select stopphrase as the phrase's treatment. Alternatively, using the XML or API interfaces, you would use a rephrase rule to rephrase it to a pipe ("|"), as discussed in more detail in the next section.
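
The difference between verbatim stopwords and stem-based stopphrases can be sketched in Python. The suffix stripper below is a deliberately crude stand-in and is not the engine's stemmer, which is described in the Stemming section. The function names, the suffix list, and the assumption that word order matters are all illustrative.

```python
# Toy suffix stripper for illustration only; it is NOT the engine's stemmer.
SUFFIXES = ("ational", "ation", "ers", "er", "ing", "es", "s")


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def stem_key(phrase):
    """Reduce a phrase to the ordered tuple of its word stems."""
    return tuple(toy_stem(w) for w in phrase.split())


def is_stopped_phrase(phrase, stopphrases):
    """Return True if the phrase has the same stems as any stopphrase."""
    stopped = {stem_key(p) for p in stopphrases}
    return stem_key(phrase) in stopped
```

With computational methods as the only stopphrase, this model also stops computer method, because both phrases reduce to the same pair of stems, while numerical methods is unaffected.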

Tip: If you have defined stopwords or stopphrases but still see them appear as cluster labels, make sure that the words they contain are not involved in rephrase rules. Stopwords are identified through an internal attribute that can be overwritten if a stopword is rephrased to another term that does not have this attribute set. Similarly, terms that are rephrased to a stopword are also treated as stopwords.
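
One way to picture this interaction is that stopword status follows the term a word ends up as after rephrasing. The Python sketch below is a simplified model of the tip, not a description of the engine's internal attribute handling; the rule table, the stoplist, and the function names are hypothetical.

```python
# Simplified model: stopword status follows the term a word is rephrased to.
STOPWORDS = {"test"}                       # hypothetical stoplist
RULES = {"test": "exam", "trial": "test"}  # hypothetical rephrase rules


def effective_term(term, rephrase_rules):
    """Return the term after applying a single rephrase rule, if one exists."""
    return rephrase_rules.get(term, term)


def treated_as_stopword(term, stopwords, rephrase_rules):
    """Return True if the rephrased form of the term is a stopword."""
    return effective_term(term, rephrase_rules) in stopwords
```

In this model, test is no longer treated as a stopword once it is rephrased to exam, while trial becomes one because it is rephrased to test.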

For information about removing stopwords from end-user queries, see the Automatically Removing Stopwords from End-User Queries section.