Rephrase Rules
Rephrasing is useful for synonyms, spelling variations, acronyms, and abbreviations. In these cases, the goal is to change one form of a word or phrase into another. For example:
- color and colour are accepted spellings of the same word.
- NGO is a Non-Governmental Organization.
- config. is an abbreviation of configuration.
In IBM Watson™ Explorer Engine XML, these types of mappings are referred to as rephrase elements. In the Watson Explorer Engine administration tool, you can specify them by selecting rephrased as the treatment of the word or phrase, filling in the rephrased to box, and then clicking Save. If you access the XML for a knowledge base by clicking the XML button on the Knowledge Bases >> Entry List tab, you can write the rephrase rules directly, as shown in the following example:
<rephrase this="freeware" as="free software" />
<rephrase this="color" as="colour" />
<rephrase this="ngo" as="non-governmental organization" />
<rephrase this="config" as="configuration" />
The order of the terms in a rephrase rule is significant. By rephrasing freeware as free software, the word software becomes a potential label for a cluster containing freeware and other software. Deciding how to handle acronyms is subtle enough that it is covered in a separate section, Handling Acronyms.
Conceptually, rephrase is similar to the standard search and replace function found in word processors. The fundamental difference, however, is that the search is done using stem classes (see Stemming) and not simple strings. For example, consider the following rephrase rule:
<rephrase this="color" as="colour" />
This rule finds occurrences of text that belong to the stem class of "color" and replaces them with the stem class of the as part, in this case "colour". Consequently, "colors" and any other form of the word are also mapped to "colour". This differs from the Stemming rule in that it effectively merges the stem classes of the this word and the as word, instead of removing a member of one class and inserting it into the other.
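The difference between a stem-based rephrase and a plain string replacement can be sketched in Python. This is a toy model only: toy_stem is a hypothetical stand-in for the engine's stemmer (it just lowercases and strips a trailing "s"), and none of the function names are part of Watson Explorer Engine. The sketch also substitutes the literal as text, whereas the engine works with stem classes.

```python
import re

def toy_stem(word: str) -> str:
    """Hypothetical stemmer: lowercase and strip a trailing plural 's'."""
    word = word.lower()
    if word.endswith("s") and len(word) > 3:
        return word[:-1]
    return word

def rephrase_by_stem(text: str, this: str, as_: str) -> str:
    """Replace every word whose toy stem equals the toy stem of `this`."""
    target = toy_stem(this)
    # Split into alternating word and non-word runs so spacing is preserved.
    tokens = re.findall(r"\w+|\W+", text)
    return "".join(as_ if toy_stem(t) == target else t for t in tokens)

def rephrase_by_string(text: str, this: str, as_: str) -> str:
    """Word-processor style replacement: exact whole-word matches only."""
    return re.sub(rf"\b{re.escape(this)}\b", as_, text)

sample = "one color, many colors"
stem_result = rephrase_by_stem(sample, "color", "colour")
string_result = rephrase_by_string(sample, "color", "colour")
print(stem_result)    # both "color" and "colors" are rewritten
print(string_result)  # only the exact string "color" is rewritten
```

In this sketch the stem-based version rewrites "colors" as well as "color", while the string-based version leaves "colors" untouched.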
Rephrase rules can also be used to add domain-specific or localized mappings between apparently unrelated terms, but the way in which this is displayed in your search application can be somewhat confusing. For example, if you are from Pittsburgh, you might want to rephrase the terms "soda" and "pop", because the latter is the regional term for carbonated beverages. This would prevent you from seeing separate clusters for "soda" and "pop", but the name of the cluster that you actually get is based on the more common of the two terms in your search results, not on the order within the rephrase rule. Regardless of the cluster name that is used, the cluster will always contain results that match either "soda", "pop", or both terms.
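The labeling behavior described above can be illustrated with a small Python sketch. It is an assumption-laden toy, not the engine's clustering algorithm: cluster_for is a hypothetical helper that treats the merged terms as a simple set, matches whole lowercase words, and names the cluster after whichever term occurs most often in the sample results.

```python
from collections import Counter

def cluster_for(results, merged_terms):
    """Collect results that mention any merged term; label by the most frequent term."""
    counts = Counter()
    members = []
    for text in results:
        hits = [w for w in text.lower().split() if w in merged_terms]
        if hits:
            members.append(text)
            counts.update(hits)
    # The label follows term frequency in the results, not rule order.
    label = counts.most_common(1)[0][0] if counts else None
    return label, members

results = [
    "cold pop on sale",
    "pop and chips",
    "diet soda recall",
    "bottled water",
]
label, members = cluster_for(results, {"soda", "pop"})
print(label)    # the more common of the two terms in these results
print(members)  # every result that mentions either term
```

With these sample results, "pop" appears more often than "soda", so the single cluster is labeled "pop" while still containing the "soda" result.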
Rephrase rules are also useful for filtering input. For example, to eliminate a word from use in cluster labels but still allow potential matches across that word, create a rephrase rule that maps that word to NULL (an empty as value), as in the following example:
<rephrase this="boeing" as=""/>
This simply removes the word from the input.
Similarly, you may want to eliminate a word or phrase from cluster labels and prevent search results from including phrases that span that word or phrase (or any stemming class within it). As mentioned at the end of the section on Clustering Stopwords, this can be done by creating a rephrase rule that maps the specified word or phrase to a pipe ("|") symbol. Rephrasing one or more words to a pipe symbol has two primary effects:
- prevents the word(s) in the phrase and all other words that share any of their stemming classes from being used as cluster labels
- identifies the word(s) in the phrase and all other words that share any of their stemming classes as phrase-breaking terms (because they are literally being mapped to the pipe symbol). No search results can match a phrase that contains these words because each of them causes Watson Explorer Engine to see a new phrase.
The pipe symbol does not invoke any special function; it is simply a punctuation character that is expected to always be phrase-breaking. See the Supported Punctuation section of the Advanced for more information.
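The contrast between rephrasing a word to an empty value and rephrasing it to a pipe symbol can be sketched as follows. This is a simplified illustration under stated assumptions: phrases_after_rephrase is a hypothetical helper, the rules are a plain word-to-replacement dictionary, and stemming is ignored entirely.

```python
def phrases_after_rephrase(tokens, rules):
    """Apply word-level rules, then split the token stream into phrases at '|'."""
    phrases, current = [], []
    for token in tokens:
        mapped = rules.get(token, token)
        if mapped == "|":
            # Phrase-breaking: close the current phrase and start a new one.
            if current:
                phrases.append(current)
            current = []
        elif mapped != "":
            current.append(mapped)
        # An empty mapping drops the word; the phrase continues across it.
    if current:
        phrases.append(current)
    return phrases

tokens = "new boeing aircraft orders".split()
removed = phrases_after_rephrase(tokens, {"boeing": ""})
broken = phrases_after_rephrase(tokens, {"boeing": "|"})
print(removed)  # one phrase that spans the removed word
print(broken)   # two phrases, split where the word used to be
```

In this toy model, mapping the word to an empty value leaves a single phrase that can still match across the gap, while mapping it to a pipe splits the input into two separate phrases.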