Clustering Stopwords
A stopword is a word that has little meaning by itself. For example, the, a, then, and towards are stopwords for all English documents. Watson™ Explorer Engine has general lists of stopwords for English and some other languages. Which stoplists are used depends on the Clustering Configuration option. A stopword can never appear by itself as a cluster label, although it might be used within a phrase, depending on the phrase_mode of the stopword.
Some words that are ordinarily interesting lose that quality within specialized contexts. Consider these examples:
- method, apparatus, and patent in patent documents
- Boeing and company within a Boeing search engine
- product, price, and shipping within a shopping search engine
- problem, report, and resolve in a collection of customer problem reports
- computer, computational, and compute in a computer science publication
The last example illustrates an important point: stopwords are always matched verbatim. A compute stopword does not automatically stop computes from appearing; computes must be added as a separate stopword.
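The verbatim behavior can be illustrated with a minimal Python sketch. This is not Watson Explorer Engine code: the STOPWORDS set and the is_stopped helper are hypothetical names, and the sketch only models the rule that a word is stopped when it matches a stopword exactly.

```python
# Hypothetical model of verbatim stopword matching
# (an illustration only, not the Engine's implementation).
STOPWORDS = {"computer", "computational", "compute"}


def is_stopped(word, stopwords=STOPWORDS):
    """Return True only when the word matches a stopword exactly."""
    return word in stopwords


print(is_stopped("compute"))   # True
print(is_stopped("computes"))  # False: inflected forms must be listed separately
```

Under this model, covering computes means adding it to the list alongside compute.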
Each stopword has a phrase mode specification:
- default: Uninteresting by itself.
- not-start: Uninteresting by itself or at the start of a phrase. For example, Inc. is not interesting at the start of a phrase.
- not-end: Uninteresting by itself or at the end of a phrase. For example, the is not interesting at the end of a phrase.
- link: Uninteresting at the beginning or end of a phrase. For example, of is only interesting in the middle of a phrase.
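The four phrase modes above can be sketched as a small Python check on candidate cluster labels. This is a hypothetical model based only on the descriptions in the list: the PHRASE_MODES table, the label_allowed function, and the sample words are assumptions made for illustration, not the Engine's implementation or API.

```python
# Hypothetical model of the four phrase modes
# (an illustration only, not the Engine's implementation).
# Each stopword maps to its phrase mode; matching is verbatim.
PHRASE_MODES = {
    "test": "default",
    "Inc.": "not-start",
    "the": "not-end",
    "of": "link",
}


def label_allowed(label, modes=PHRASE_MODES):
    """Return True if the candidate label survives the stopword rules."""
    words = label.split()
    if not words:
        return False
    first, last = words[0], words[-1]
    # A stopword can never stand alone as a label, whatever its mode.
    if len(words) == 1:
        return first not in modes
    # not-start and link stopwords may not begin a phrase.
    if modes.get(first) in ("not-start", "link"):
        return False
    # not-end and link stopwords may not end a phrase.
    if modes.get(last) in ("not-end", "link"):
        return False
    return True


print(label_allowed("test"))              # False: stopword by itself
print(label_allowed("test results"))      # True: default mode only blocks the lone word
print(label_allowed("Inc. magazine"))     # False: not-start word at the start
print(label_allowed("over the"))          # False: not-end word at the end
print(label_allowed("state of the art"))  # True: stopwords only in the middle
```

In this sketch, a stopword in the middle of a phrase never blocks the label; only the first and last positions are checked.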
Clustering stopwords can be added by using the stopword element in the XML, by adding them to a custom knowledge base, by adding them to a specific source, or by using the Watson Explorer Engine API.
For example, to add test to the custom knowledge base using the Watson Explorer Engine administration tool, go to the Configuration >> Knowledge Bases section and click on the custom knowledge base. Type "test" into the lookup box and press Enter. On the details page, select stopword as the treatment and pick the appropriate phrase mode. Click Save.
You can view the XML for this knowledge base by clicking the XML button on the Knowledge Bases >> Entry List tab. You can also use this XML view to add or edit stopwords directly. The following example shows the stopword that we just defined:
<stopword word="test" />
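Because stopwords are verbatim, several entries are often needed for one concept. The following Python sketch generates stopword elements in the form shown above by using the standard xml.etree.ElementTree module. The stopword_elements helper is a hypothetical name, only the word attribute from the example above is used, and where the generated elements belong inside the knowledge base XML is not covered here.

```python
import xml.etree.ElementTree as ET


def stopword_elements(words):
    """Build one <stopword word="..." /> element string per word.

    Hypothetical helper: it only reproduces the element form shown above.
    """
    elements = []
    for word in words:
        element = ET.Element("stopword", {"word": word})
        elements.append(ET.tostring(element, encoding="unicode"))
    return elements


for line in stopword_elements(["compute", "computes", "computer"]):
    print(line)
```

The generated lines can then be reviewed and pasted into the XML view by hand.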
The same interface allows you to add stopphrases. These phrases stop all words sharing the same stem (see the Stemming section for additional information about stems). Stopping the phrase computational methods will also stop the phrase computer method. To turn a phrase into a stopphrase in the Watson Explorer Engine administration tool, select stopphrase as the word's treatment. Alternatively, using the XML or API interfaces, you would use a rephrase rule to rephrase it to a pipe ("|"), as discussed in more detail in the next section.
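The difference between verbatim stopwords and stem-based stopphrases can be sketched in Python. This is a hypothetical model: the STEMS table is a toy stand-in for a real stemmer (its stems are assumed for illustration and are not taken from the Engine), and stem_phrase and is_stopped_phrase are illustrative names.

```python
# Toy stem table standing in for a real stemmer (hypothetical values).
STEMS = {
    "computational": "comput",
    "computer": "comput",
    "methods": "method",
    "method": "method",
}


def stem_phrase(phrase):
    """Reduce each word of a phrase to its stem; unknown words stay as-is."""
    return tuple(STEMS.get(word, word) for word in phrase.split())


# Stopphrases are stored by their stems, so any phrase with the same
# sequence of stems is stopped as well.
STOPPHRASE_STEMS = {stem_phrase("computational methods")}


def is_stopped_phrase(phrase):
    return stem_phrase(phrase) in STOPPHRASE_STEMS


print(is_stopped_phrase("computational methods"))  # True
print(is_stopped_phrase("computer method"))        # True: same stems
print(is_stopped_phrase("computer science"))       # False
```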
For information about removing stopwords from end-user queries, see the Automatically Removing Stopwords from End-User Queries section.