Enabling Wildcard and Regular Expression Support at the Collection Level
As discussed in Default Query Syntax, Watson™ Explorer Engine supports wildcards and regular expressions in queries. Wildcard and regular expression support can be enabled either at the project level or at the search collection level. In most cases, wildcards and regular expressions should be enabled at the search collection level. You need to enable wildcard and regular expression support at the project level only if:
- You want to add wildcard and regular expression support for federated sources. (Federated sources are not associated with a search collection because they retrieve results from another search engine.)
Per-search-collection wildcard and regular expression support are enabled simultaneously, through the same configuration options. To enable wildcard and regular expression support for a search collection, do the following:
- Select the tab for your search collection in the Watson Explorer Engine administration tool.
- To the right of the Global Settings header, click the edit button.
- The options that are related to wildcard and regular expression support are located in the Term expansion support subsection of the screen. To simply enable wildcard and regular expression support, set the Generate dictionaries option to true (the other Term expansion support options are discussed later in this section).
The options that are related to wildcard and regular expression support in Watson Explorer Engine search collections are the following:
- Generate dictionaries - Whether wildcard and regular expression dictionaries are generated for the current search collection. This option is false by default, and must be set to true in order to be able to use wildcards and regular expressions in queries of the current search collection.
- Dictionary directory - The name to use for the directory in which the index files necessary to support wildcard and regular expression searches are created. If no name is specified and the Generate dictionaries option is enabled, this dictionary is created in a directory named expansions.
- Dictionary stemmers - The list of stemmers used to normalize words before they are added to the dictionary.
- Dictionary delanguage - Specifying true (the default value) causes all words to be language-normalized, which normalizes Japanese writing systems and removes diacritics from other languages. The search engine usually performs this normalization internally, so skipping it in the term expansion dictionary for a search collection usually just wastes resources. However, if you configure your search collection not to delanguage the content that it indexes, it might be useful to set this option to false so that the dictionary matches the indexed content.
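The defaults described in the list above can be summarized in a short Python sketch. This is not Watson Explorer Engine code or its configuration format: the class name TermExpansionOptions, its field names, and the strip_diacritics helper are hypothetical, chosen only to mirror the option names. The helper illustrates only the diacritic-removal part of delanguaging, not the normalization of Japanese writing systems.

```python
import unicodedata
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TermExpansionOptions:
    """Hypothetical model of the Term expansion support options."""
    generate_dictionaries: bool = False          # false by default
    dictionary_directory: Optional[str] = None   # no name specified
    dictionary_stemmers: List[str] = field(default_factory=list)
    dictionary_delanguage: bool = True           # true by default

    def effective_directory(self) -> Optional[str]:
        """Directory that would hold the dictionary, or None if disabled."""
        if not self.generate_dictionaries:
            return None
        # With no name specified, the directory is named "expansions".
        return self.dictionary_directory or "expansions"


def strip_diacritics(word: str) -> str:
    """Illustrative stand-in for one part of delanguaging:
    decompose characters, then drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```

Under these assumptions, TermExpansionOptions(generate_dictionaries=True).effective_directory() yields "expansions", and strip_diacritics("café") yields "cafe".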
Once you set the Generate dictionaries option to true, optionally specify the name of a dictionary directory, and save the other indexing options for wildcard dictionaries, Watson Explorer Engine automatically creates the Dictionary directory, and begins creating the wildcard and regular expression index for the current collection in that directory. Any content elements in that search collection that do not contain text are not indexed. (Empty content elements are sometimes used by converters to pass information about those elements through the conversion pipeline.)
Activating wildcard and regular expression support in queries to a given search collection causes each unique word in the data sources that are associated with this collection to be indexed and stored in the Dictionary directory. These indices must be loaded into memory whenever a query is received for a collection in which wildcard and regular expression support is active. Therefore, this support can substantially increase the memory requirements for a search collection.
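As a rough illustration of why a dictionary of unique words makes wildcard and regular expression queries possible, the following Python sketch builds such a dictionary from a few documents and expands a pattern against it. It is a conceptual model only, not the engine's index format or algorithm; the function names and the simple word-splitting rule are assumptions made for the example.

```python
import fnmatch
import re
from typing import Iterable, List, Set


def build_dictionary(documents: Iterable[str]) -> Set[str]:
    """Collect each unique (lowercased) word from the documents."""
    words: Set[str] = set()
    for text in documents:
        words.update(re.findall(r"\w+", text.lower()))
    return words


def expand_wildcard(pattern: str, dictionary: Set[str]) -> List[str]:
    """Return the dictionary words that match a shell-style wildcard."""
    pattern = pattern.lower()
    return sorted(w for w in dictionary if fnmatch.fnmatchcase(w, pattern))


def expand_regex(pattern: str, dictionary: Set[str]) -> List[str]:
    """Return the dictionary words that fully match a regular expression."""
    compiled = re.compile(pattern)
    return sorted(w for w in dictionary if compiled.fullmatch(w))


docs = ["Searching the search index", "Research on searches"]
dictionary = build_dictionary(docs)
print(expand_wildcard("search*", dictionary))
print(expand_regex(r"(re)?search", dictionary))
```

Because the whole set of unique words has to be available to answer a pattern, the sketch also hints at why the memory footprint grows with the vocabulary of the collection.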
In some cases, the dictionary is built as the input text is processed rather than from the index. This is slower than building from the index, but the words are added to the dictionary immediately, even if an indexing error occurs. Words are added immediately to the dictionary when an index stream:
- Contains a knowledge base
- Uses a delanguage setting that differs from the one used for the dictionary
- Uses a different stemmer from the dictionary
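The three conditions above amount to a simple predicate. The following Python sketch expresses it under stated assumptions: the IndexStream class, its fields, and the function name are hypothetical, and stemmers are modeled as sets of names purely for comparison. The engine's real data structures are not described in this section.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class IndexStream:
    """Hypothetical description of an index stream's settings."""
    has_knowledge_base: bool
    delanguage: bool
    stemmers: FrozenSet[str]


def adds_words_immediately(
    stream: IndexStream,
    dictionary_delanguage: bool,
    dictionary_stemmers: FrozenSet[str],
) -> bool:
    """True if any of the three listed conditions holds for the stream."""
    return (
        stream.has_knowledge_base
        or stream.delanguage != dictionary_delanguage
        or stream.stemmers != dictionary_stemmers
    )
```

A stream that has no knowledge base and matches the dictionary's delanguage and stemmer settings returns False in this model, which corresponds to the case where the dictionary can be built from the index.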
Activating wildcard and regular expression support for a search collection can improve performance over conceptually similar features such as query expansion. With this support, the search engine handles the expansion itself, whereas query expansion submits more complex queries composed of multiple terms to the search engine.
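The difference between the two approaches can be pictured with a small Python sketch. It is illustrative only: the function names are hypothetical, the list of known terms stands in for whatever vocabulary a query expansion feature would draw on, and joining terms with OR is used here simply as a generic way to show a multi-term query.

```python
import fnmatch
from typing import List


def query_expansion_style(pattern: str, known_terms: List[str]) -> str:
    """Expand the pattern before submission, producing a multi-term query."""
    matches = [t for t in known_terms if fnmatch.fnmatchcase(t, pattern)]
    return " OR ".join(matches)


def engine_expansion_style(pattern: str) -> str:
    """Submit the pattern unchanged and let the search engine expand it."""
    return pattern


known = ["search", "searches", "searching", "index"]
print(query_expansion_style("search*", known))
print(engine_expansion_style("search*"))
```

In the first case the submitted query grows with the number of matching terms; in the second it stays a single pattern, and the expansion work happens inside the engine against its dictionary.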