Enabling Wildcard and Regular Expression Support at the Collection Level

As discussed in Default Query Syntax, Watson™ Explorer Engine supports wildcards and regular expressions in queries. You can enable wildcard and regular expression support either at the project level or at the search collection level. In most cases, enable it at the search collection level. You need to enable wildcards at the project level only if:

Note: Enabling wildcards and regular expressions at the project level and enabling them at the search collection level are mutually exclusive. If you enable them at the project level, any similar settings that you made at the search collection level are ignored. For information about enabling wildcard and regular expression support at the project level, see Enabling Wildcard Support at the Project Level.

Per-search-collection wildcard and regular expression support are enabled simultaneously, through the same configuration options. To enable wildcard and regular expression support for a search collection, do the following:

Tip: For detailed information about using wildcards (and more advanced pattern definition mechanisms such as regular expressions) when querying Watson Explorer Engine search applications, see Wildcard and Regular Expression Support in Watson Explorer Engine Queries.

The options that are related to wildcard and regular expression support in Watson Explorer Engine search collections are the following:

After you set the Generate dictionaries option to true, optionally specify the name of a dictionary directory, and save the other indexing options for wildcard dictionaries, Watson Explorer Engine automatically creates the Dictionary directory and begins building the wildcard and regular expression index for the current collection in that directory. Content elements in that search collection that do not contain text are not indexed. (Converters sometimes use empty content elements to pass information about those elements through the conversion pipeline.)

Activating wildcard and regular expression support in queries to a given search collection causes each unique word in the data sources that are associated with this collection to be indexed and stored in the Dictionary directory. These indices must be loaded into memory whenever a query is received for a collection in which wildcard and regular expression support is active. Therefore, this support can substantially increase the memory requirements for a search collection.
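The following minimal sketch illustrates the idea of a per-collection dictionary of unique words, and why its size (and therefore its memory footprint) grows with the vocabulary of the data rather than with the number of documents. It is not the engine's implementation: the function name, the tokenization by `\w+`, and the lowercasing are all assumptions made for illustration.

```python
import re

def build_wildcard_dictionary(contents):
    """Collect each unique word from a list of content strings.

    Hypothetical helper: the real engine's tokenization and storage
    format are not described in this topic.
    """
    words = set()
    for text in contents:
        if not text or not text.strip():
            # Content elements that contain no text are skipped.
            continue
        # Assumption: words are runs of word characters, lowercased.
        words.update(re.findall(r"\w+", text.lower()))
    return words

contents = ["Search engines index words", "", "Words repeat; words repeat", "   "]
dictionary = build_wildcard_dictionary(contents)
print(sorted(dictionary))
```

Repeated words add nothing to the set, so the two non-empty strings above yield only five entries.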

Note: Words that are added to the wildcard dictionary when indexing a search collection can never be removed. In some cases, the wildcard dictionary is built from the index. In those cases, if an indexing error occurs, no words are added to the dictionary when the abort-batch-on-error attribute is set for the index-atomic value. For more information, see the Watson Explorer Engine API Developers Guide.

In some other cases, the dictionary is built as the input text is processed. This is slower than building from the index, but the words are immediately added to the dictionary even if an indexing error occurs. Words are added immediately to the dictionary when an index stream:

  • Contains a knowledge base
  • Utilizes a delanguage that is not the same as the delanguage used for the dictionary
  • Uses a different stemmer from the dictionary
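The three conditions above can be read as a simple decision rule. The sketch below expresses that rule in code. The class names, field names, and setting values are hypothetical placeholders, not actual Watson Explorer Engine configuration names or API.

```python
from dataclasses import dataclass

@dataclass
class StreamSettings:
    """Hypothetical description of an index stream."""
    has_knowledge_base: bool
    delanguage: str
    stemmer: str

@dataclass
class DictionarySettings:
    """Hypothetical description of the wildcard dictionary."""
    delanguage: str
    stemmer: str

def words_added_immediately(stream, dictionary):
    """Return True if the dictionary is built as input text is processed.

    Any one of the three listed conditions is sufficient.
    """
    return (
        stream.has_knowledge_base
        or stream.delanguage != dictionary.delanguage
        or stream.stemmer != dictionary.stemmer
    )

# Placeholder setting values, chosen only to show the comparison.
dictionary = DictionarySettings(delanguage="profile-a", stemmer="stemmer-a")
same = StreamSettings(False, "profile-a", "stemmer-a")
other_stemmer = StreamSettings(False, "profile-a", "stemmer-b")

print(words_added_immediately(same, dictionary))
print(words_added_immediately(other_stemmer, dictionary))
```

When the function returns False in this sketch, the dictionary would be built from the index instead, which is the case where an indexing error can prevent words from being added.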

Activating wildcard and regular expression support for a search collection can provide performance improvements over conceptually similar features such as query expansion. With wildcard support, the search engine handles the expansion itself. With query expansion, a more complex query that is composed of multiple terms is submitted to the search engine.
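To make the contrast concrete, the sketch below expands a wildcard against a small word dictionary and also shows the kind of multi-term query that expansion outside the engine would have to submit. The function names, the sample dictionary, and the `OR` syntax are illustrative assumptions; they do not reproduce the engine's matching or query syntax.

```python
from fnmatch import fnmatchcase

def expand_wildcard(pattern, dictionary):
    """Return the dictionary words that match a shell-style wildcard."""
    return sorted(w for w in dictionary if fnmatchcase(w, pattern))

def expanded_query(pattern, dictionary):
    """Build the multi-term query that external expansion would submit.

    The ' OR ' separator is a placeholder, not actual query syntax.
    """
    return " OR ".join(expand_wildcard(pattern, dictionary))

dictionary = {"compile", "complete", "compute", "search"}

print(expand_wildcard("comp*", dictionary))
print(expanded_query("comp*", dictionary))
```

With a large dictionary, the expanded query string can contain many terms, whereas the original wildcard query stays a single short term.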

Note: Regular queries that are submitted to the search engine undergo the same normalization as the data in the collection, so that a word with special characters matches its normalized form in the index. Regular expression queries are not normalized, because normalizing a regular expression can substantially change its meaning (especially if a single character is changed to multiple characters). As a result, when a regular expression query that contains special characters is matched against a normalized wildcard dictionary, the query does not match, even if the collection contains a valid instance of the term. For example, a collection that contains grüßen does not match the regular expression query m/grüßen/, but it matches the normalized form m/grussen/ and the regular query grüßen.
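The behavior in this note can be reproduced with a small sketch. The normalization below (lowercasing, replacing ß with ss, and stripping diacritics) is an assumption chosen so that grüßen becomes grussen as in the example; the engine's actual normalization rules may differ. All function names are hypothetical.

```python
import re
import unicodedata

def normalize(word):
    """Assumed normalization: lowercase, ß -> ss, strip diacritics."""
    word = word.lower().replace("ß", "ss")
    decomposed = unicodedata.normalize("NFKD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# The dictionary stores normalized words only.
dictionary = {normalize(w) for w in ["grüßen", "Haus"]}

def regular_query(term):
    """Regular queries are normalized before lookup."""
    return normalize(term) in dictionary

def regex_query(pattern):
    """Regular expression queries are matched as written."""
    return sorted(w for w in dictionary if re.search(pattern, w))

print(regular_query("grüßen"))
print(regex_query("grüßen"))
print(regex_query("grussen"))
```

The single character ß becomes two characters during normalization, which is the kind of change that cannot safely be applied inside a regular expression.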