Handling Hyphenated Terms in Queries

Hyphenated terms are common in many technical domains, in the names and reference numbers of official documents, in product names and part numbers, and so on. Unfortunately, returning the most relevant search results for a hyphenated term can be a complex problem because hyphenation is rarely used uniformly. A good example is a search for popular government documents such as tax forms. Users of search applications in the United States would probably like a search for "1040-EZ" to also return results for similar and probably equivalent terms such as "1040 EZ" and "1040EZ".

Different search engines handle hyphenation differently. Doing a verbatim search for hyphenated terms in most metasearch sources is the same as searching for any other text string. Those searches would not recognize "1040-EZ" as being equivalent to "1040 EZ" and "1040EZ", and would therefore not return results that only contained the latter terms. Search sources based on Watson™ Explorer Engine search collections (those produced by crawling sites using the Watson Explorer Engine search engine) treat a hyphen as a term separator, and would thus return results containing "1040 EZ" in a search for "1040-EZ", but would not return results that only contained "1040EZ".
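The difference between the two behaviors described above can be illustrated with a short Python sketch. This is only an illustration of verbatim matching versus treating a hyphen as a term separator. It is not the matching logic of any metasearch source or of Watson Explorer Engine, and the function names are hypothetical.

```python
import re

def verbatim_match(query: str, document: str) -> bool:
    """Treat the query as an opaque string (case-insensitive substring test)."""
    return query.lower() in document.lower()

def separator_tokens(text: str) -> list[str]:
    """Split on runs of non-alphanumeric characters, so a hyphen acts like a space."""
    return [t for t in re.split(r"[^0-9A-Za-z]+", text.lower()) if t]

def separator_match(query: str, document: str) -> bool:
    """True if the query's tokens appear as a consecutive run in the document."""
    q = separator_tokens(query)
    d = separator_tokens(document)
    if not q:
        return False
    return any(d[i:i + len(q)] == q for i in range(len(d) - len(q) + 1))
```

Under these assumptions, a verbatim search for "1040-EZ" matches only text containing that exact string, while the separator-based search also matches "1040 EZ". Neither approach matches "1040EZ", which is the gap the rest of this topic addresses.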

To prevent search application developers from having to worry about how the search engine associated with each of their sources handles hyphenation, Watson Explorer Engine provides built-in capabilities for handling the three possible forms of a hyphenated expression such as the one used in this example. Watson Explorer Engine search applications can easily incorporate support for these alternate representations by integrating the hyphen-query-modification macro into a project. This macro transforms searches on hyphenated or potentially hyphenated terms in any of the following cases:

  • whenever a query term contains a hyphen
  • whenever a query term is one word that consists of a sequence of numbers followed by letters or vice versa
  • whenever a term or a set of two terms is in a keymatch file that contains words commonly appearing in more than one of the forms "one two", "onetwo" or "one-two". This keymatch file is the file data/keymatch/hyphenated-words.xml in your Watson Explorer Engine installation directory.
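As a rough illustration of the three cases listed above, the following Python sketch detects each case and produces the three alternate forms of a term. It is a simplified model under stated assumptions, not the implementation of the hyphen-query-modification macro: the names split_term and variants are hypothetical, the KNOWN_PAIRS set is an invented stand-in for the contents of the keymatch file, and only the first hyphen in a term is considered.

```python
import re

# Hypothetical stand-in for entries in data/keymatch/hyphenated-words.xml.
KNOWN_PAIRS = {("e", "mail"), ("on", "line")}

DIGITS_THEN_LETTERS = re.compile(r"^(\d+)([A-Za-z]+)$")
LETTERS_THEN_DIGITS = re.compile(r"^([A-Za-z]+)(\d+)$")

def split_term(term: str):
    """Return (left, right) if the term falls under one of the three cases, else None."""
    # Case 1: the term contains a hyphen (only the first hyphen is used here).
    if "-" in term:
        left, _, right = term.partition("-")
        return (left, right) if left and right else None
    # Case 2: one word made of numbers followed by letters, or vice versa.
    for pattern in (DIGITS_THEN_LETTERS, LETTERS_THEN_DIGITS):
        m = pattern.match(term)
        if m:
            return m.group(1), m.group(2)
    # Case 3: one or two words listed in the (assumed) keymatch data.
    words = term.split()
    if len(words) == 2 and (words[0].lower(), words[1].lower()) in KNOWN_PAIRS:
        return words[0], words[1]
    for left, right in KNOWN_PAIRS:
        if term.lower() == left + right:
            return term[:len(left)], term[len(left):]
    return None

def variants(term: str) -> list[str]:
    """Return the hyphenated, spaced, and joined forms, or the term unchanged."""
    parts = split_term(term)
    if parts is None:
        return [term]
    left, right = parts
    return [f"{left}-{right}", f"{left} {right}", f"{left}{right}"]
```

In this sketch, both variants("1040-EZ") and variants("1040EZ") yield the forms "1040-EZ", "1040 EZ", and "1040EZ". A two-word query such as "1040 EZ" is left unchanged because it matches none of the three cases unless the pair is listed in the keymatch data.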

Adding support for hyphenation can be done in either of two ways:

  • For all sources within a project, by adding the hyphen-query-modification macro to the project itself. For information about adding support for hyphenation to all sources within a project, click Project-Wide Hyphenation.
  • For selected sources within a project, by adding the hyphen-query-modification macro to a source bundle containing those sources, making sure that the macro precedes the sources to which you want it to apply. For information about adding support for hyphenation to selected sources within a project, click Selected-Source Hyphenation.

In most cases, you will want to add support for hyphenation to all sources within a project. Support for selected-source hyphenation is provided for advanced users who may want a single search application to submit explicit, unmodified queries to some sources while supporting hyphenation in queries to others.