Enabling Wildcard and Regular Expression Support at the Collection Level

As discussed in Default Query Syntax, Watson™ Explorer Engine supports wildcards and regular expressions in queries. Wildcard and regular expression support can be enabled either at the project level or at the search collection level. In most cases, it should be enabled at the search collection level. You need to enable it at the project level only if:

  • You want to add wildcard and regular expression support for federated sources. (Federated sources are not associated with a search collection because they retrieve results from another search engine.)
Note: Project-level and collection-level settings for wildcards and regular expressions are mutually exclusive. Enabling wildcards and regular expressions at the project level causes any similar settings that you made at the search collection level to be ignored. For information about enabling wildcard and regular expression support at the project level, see Enabling Wildcard Support at the Project Level.

Per-search-collection wildcard and regular expression support are enabled simultaneously, through the same configuration options. To enable wildcard and regular expression support for a search collection, do the following:

  • Select the Configuration > Indexing tab for your search collection in the Watson Explorer Engine administration tool.
  • To the right of the Global Settings header, click the edit button.
  • The options that are related to wildcard and regular expression support are located in the Term expansion support subsection of the screen. To simply enable wildcard and regular expression support, set the Generate dictionaries option to true (the other Term expansion support options are discussed later in this section).
Tip: For detailed information about using wildcards (and more advanced pattern definition mechanisms such as regular expressions) when querying Watson Explorer Engine search applications, see Wildcard and Regular Expression Support in Watson Explorer Engine Queries.

The options that are related to wildcard and regular expression support in Watson Explorer Engine search collections are the following:

  • Generate dictionaries - Whether wildcard and regular expression dictionaries are generated for the current search collection. This option is false by default, and must be set to true in order to be able to use wildcards and regular expressions in queries of the current search collection.
  • Dictionary directory - The name to use for the directory in which the index files necessary to support wildcard and regular expression searches are created. If no name is specified and the Generate dictionaries option is enabled, this dictionary is created in a directory named expansions.
  • Dictionary stemmers - The list of stemmers used to normalize words before they are added to the dictionary.
  • Dictionary delanguage - Specifying true (the default value) causes all words to be language-normalized, which normalizes Japanese writing systems and removes diacritics from other languages. The search engine usually does this internally, so skipping it in the term expansion dictionary for a search collection usually just wastes resources. However, if you change the configuration of your search collection so that it does not delanguage the content that it indexes, it might be useful to set this option to false in order to match the indexed content.
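The defaults described above can be summarized in a short Python sketch. This is only an illustrative model, not Watson Explorer Engine configuration syntax: the class name, field names, and the effective_directory helper are hypothetical, and the empty stemmer list is a placeholder because the default stemmers are not stated here.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TermExpansionSettings:
    """Hypothetical model of the Term expansion support options."""

    # False by default; must be True to use wildcards and regular expressions.
    generate_dictionaries: bool = False
    # No name specified by default.
    dictionary_directory: Optional[str] = None
    # Placeholder: the default stemmer list is not documented in this section.
    dictionary_stemmers: List[str] = field(default_factory=list)
    # True by default: words are language-normalized.
    dictionary_delanguage: bool = True

    def effective_directory(self) -> Optional[str]:
        """Return the directory that would hold the dictionary, if any."""
        if not self.generate_dictionaries:
            return None  # no dictionary is generated
        # Falls back to "expansions" when no name is specified.
        return self.dictionary_directory or "expansions"


settings = TermExpansionSettings(generate_dictionaries=True)
print(settings.effective_directory())
```

With Generate dictionaries set to true and no directory name given, the sketch falls back to expansions, mirroring the behavior described for the Dictionary directory option.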

Once you set the Generate dictionaries option to true, optionally specify the name of a dictionary directory, and save the other indexing options for wildcard dictionaries, Watson Explorer Engine automatically creates the Dictionary directory, and begins creating the wildcard and regular expression index for the current collection in that directory. Any content elements in that search collection that do not contain text are not indexed. (Empty content elements are sometimes used by converters to pass information about those elements through the conversion pipeline.)

Activating wildcard and regular expression support in queries to a given search collection causes each unique word in the data sources that are associated with this collection to be indexed and stored in the Dictionary directory. These indices must be loaded into memory whenever a query is received for a collection in which wildcard and regular expression support is active. Therefore, this support can substantially increase the memory requirements for a search collection.
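To make the idea of a per-collection word dictionary concrete, the following Python sketch collects each unique word from a few sample texts and expands a wildcard pattern against that set. It is a conceptual illustration only, not the Engine's implementation: build_dictionary and expand_wildcard are hypothetical helpers, and the tokenization and lowercasing are simplifying assumptions.

```python
import fnmatch
import re


def build_dictionary(documents):
    """Collect each unique (lowercased) word from the given texts."""
    words = set()
    for text in documents:
        words.update(w.lower() for w in re.findall(r"\w+", text))
    return words


def expand_wildcard(pattern, dictionary):
    """Return the dictionary words that match a shell-style wildcard."""
    return sorted(w for w in dictionary if fnmatch.fnmatchcase(w, pattern))


docs = ["Search engines index words", "Searching an indexed collection"]
dictionary = build_dictionary(docs)

print(expand_wildcard("search*", dictionary))  # ['search', 'searching']
print(expand_wildcard("index*", dictionary))   # ['index', 'indexed']
```

Because the set holds one entry per unique word, it grows with the vocabulary of the collection, which is why keeping such a structure in memory can be expensive for large collections.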

Note: Words added to the wildcard dictionary when indexing a search collection can never be removed. In some cases, the wildcard dictionary is built from the index. In those cases, if an indexing error occurs, no words are added to the dictionary when the abort-batch-on-error attribute is set for the index-atomic value. For more information, see the Watson Explorer Engine API Developers Guide.

In some other cases, the dictionary is built as the input text is processed. This is slower than building from the index, but the words are immediately added to the dictionary even if an indexing error occurs. Words are added immediately to the dictionary when an index stream:

  • Contains a knowledge base
  • Uses a delanguage setting that differs from the one used for the dictionary
  • Uses a different stemmer from the dictionary
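The three conditions above amount to a simple either/or test, which can be sketched in Python as follows. The StreamProfile and DictionaryProfile types, their fields, and the stemmer names are hypothetical stand-ins used only to express the rule; they are not part of any Watson Explorer Engine API.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class StreamProfile:
    """Hypothetical summary of an index stream's settings."""
    has_knowledge_base: bool
    delanguage: bool
    stemmers: Tuple[str, ...]


@dataclass(frozen=True)
class DictionaryProfile:
    """Hypothetical summary of the dictionary's settings."""
    delanguage: bool
    stemmers: Tuple[str, ...]


def words_added_immediately(stream: StreamProfile,
                            dictionary: DictionaryProfile) -> bool:
    """True if the dictionary is built as the input text is processed."""
    return (
        stream.has_knowledge_base
        or stream.delanguage != dictionary.delanguage
        or stream.stemmers != dictionary.stemmers
    )


dictionary = DictionaryProfile(delanguage=True, stemmers=("stemmer-a",))
matching = StreamProfile(False, True, ("stemmer-a",))
mismatched = StreamProfile(False, False, ("stemmer-a",))

print(words_added_immediately(matching, dictionary))    # False
print(words_added_immediately(mismatched, dictionary))  # True
```

When the function returns False, the sketch corresponds to the case in which the dictionary can be built from the index; when it returns True, words reach the dictionary even if an indexing error occurs later.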

Activating wildcard and regular expression support for a search collection can provide performance improvements over conceptually similar features such as query expansion. With this support, the search engine handles the expansion itself. With query expansion, by contrast, more complex queries composed of multiple terms are submitted to the search engine.

Note: Regular queries that are submitted to the search engine undergo the same normalization as the data collection has undergone, so that a word with special characters matches its normalized form in the index. However, regular expression queries do not. This is because "normalizing" a regular expression can substantially change its meaning (especially if a single character is changed to multiple characters). Functionally, this means that when a regular expression query that contains special characters is matched against a normalized wildcard dictionary, the query does not match (even if the collection contains a valid instance of the term). As an example, a collection with grüßen does not match the regular expression query m/grüßen/, but it matches the normalized form m/grussen/ and the regular query grüßen.
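The grüßen example can be reproduced with a small Python sketch. The normalize function below is an assumed approximation of delanguaging (case folding plus diacritic removal), not the Engine's actual normalization, and regular_query and regex_query are hypothetical helpers that stand in for the two query types.

```python
import re
import unicodedata


def normalize(word):
    """Approximate normalization: fold case and strip diacritics."""
    folded = word.casefold()  # expands 'ß' to 'ss'
    decomposed = unicodedata.normalize("NFKD", folded)
    # Drop combining marks such as the diaeresis in 'ü'.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# The dictionary stores only normalized words.
dictionary = {normalize("grüßen")}  # {'grussen'}


def regular_query(term):
    """Regular queries are normalized before lookup."""
    return normalize(term) in dictionary


def regex_query(pattern):
    """Regular expression queries are matched as written."""
    return any(re.search(pattern, word) for word in dictionary)


print(regular_query("grüßen"))  # True
print(regex_query("grüßen"))    # False
print(regex_query("grussen"))   # True
```

The regular expression fails only because its pattern is compared, unmodified, against words that have already been normalized; writing the pattern in normalized form restores the match.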