Note: This section provides a quick summary of the different components involved in an end-user Watson Explorer Engine search application. You can skip this section if you are already familiar with Watson Explorer Engine.
All of the components of a Watson Explorer Engine application are
stored as XML objects in the Watson Explorer Engine repository, which is a structured XML file.
Not all of the components of an end-user Watson Explorer Engine
search application are relevant to applications that are written
using the Watson Explorer Engine API, but you will encounter the
names of these components throughout the Watson Explorer Engine
documentation and should therefore at least be familiar with their
names and general purpose. The description of each component in
the following list discusses its relevance in the API context.
The components of a Watson Explorer Engine search application are
the following:
- Search Collection: identifies one or more information resources and the online index created from them when the Watson Explorer Engine search engine crawls those sources.
In traditional end-user Watson Explorer Engine search applications, a search collection is typically created starting with one or more seeds, each of which is an entry point that Watson Explorer Engine can use to recursively extract data from a remote resource. Typical seeds include the name of a local directory, an SMB share, URLs, databases, mail servers, or specific types of data repositories (Lotus Notes servers, SharePoint servers, Documentum Docbases, defect-tracking systems such as Bugzilla, and so on). A search collection is crawled when it is created (or at scheduled intervals), an index is created based on the data retrieved during the crawl, and only its index is consulted when it is queried. The index can subsequently be updated by completely or selectively re-crawling the seed for that collection. Selectively re-crawling a seed is known as refreshing the collection.
The Watson Explorer Engine API enables you to create search collections that do not use a seed. Instead of pulling content by crawling some resource and retrieving the data that you want to index, content can be fed or pushed to the crawler service via the API. After an optional set of conversions that turn the data into a generic indexable form, the crawler submits this content to the indexer service, which indexes it so that the query service can respond to queries of the new collection.
A mixed push/pull model is also possible. When using this type of model, the configuration of a collection is similar to the one used in a pull model, but crawls, refreshes, and metadata updates for existing search results are initiated through the API by pushing seeds or explicit URLs to the crawler.
- Source: identifies a specific online resource to which a query can be submitted, how that query is submitted, and the way in which results returned from that resource are processed. Sources are typically associated with Watson Explorer Engine search collections, but can also identify other search engines or online resources to which queries can be forwarded and results retrieved. Multiple sources can be combined into a Source Bundle in order to enable a single query to be submitted to multiple sources at the same time. Each source is composed of a Form that provides settings for many variables that are used when querying the data source with which it is associated, and optional components such as a Parser, unique variable declarations, code for testing the source, and so on.
When associated with a search collection, a source specifies most of the query-time parameters (i.e., those that can be sent without re-indexing or restarting the indexing service). Multiple sources can be associated with the same search collection.
From an API standpoint, sources are used as sets of default parameters. Most of the parameters that can be specified in search collection sources can be overridden using arguments to the query-search function.
Sources typically have two primary components:
- Form: defines how to submit a query to an online resource in order to initiate the retrieval of results from that resource. A Watson Explorer Engine Form specifies the URL or URLs through which an online resource can be queried, and transforms your query into the query syntax that is expected by the remote resource. A Source can have more than one associated Form. If multiple Forms are present, Watson Explorer Engine uses the first form that matches the syntax of the query.
For sources associated with search collections, the form is where most of the interesting query-time parameters are specified, such as the relevancy formula, sorting options, and directory rights (how an authenticated user is expanded to include group information). There is almost a one-to-one mapping between these parameters and the query-search arguments.
- Parser: receives the results produced by submitting a Form and transforms these results using XSL or regular expression matches into the XML format used internally by Watson Explorer Engine.
Sources associated with search collections do not have any parser by default. A parser can be added to perform fine-grained transformations on the search results using XSL.
- Dictionary: used with a source to provide spelling suggestions when misspelled words are encountered in a query that is submitted to a specific Source or Source Bundle. Spelling corrections or alternate suggestions can apply to all or specific fields within a source. You can build custom dictionaries from various input formats, including text files, XML dictionary files, existing search collections, or existing dictionaries.
- Knowledge Base: defines how specific terms are handled in queries or when creating clusters based on the results that are returned by a query. Clusters are groups of related content that are automatically created by analyzing your search results and applying various algorithms to identify meaningful groups that can be concisely described. The terms in a knowledge base can include source-specific stopwords, stop-phrases, stemmer corrections, synonyms, and other rules that will determine whether two words should be in the same cluster or whether a specific cluster should be created at all.
- Display and Project: commonly used in standard end-user Watson Explorer Engine applications, these components do not need to be used in API-based applications. Displays are irrelevant to the API because the calling application is responsible for displaying the results to the end user. Projects are not intended to be used with the API. The most common project options are available as arguments to query-search. More advanced options, as well as custom variables, can be passed using the extra-xml argument. The extra-xml argument can also be used to load a project, but this is not recommended.
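The idea that a source acts as a set of default parameters, which arguments to query-search can override, can be sketched in Python. This is only an illustration of the concept, not the product's client library: the base URL, the `v.function` request parameter, and the `sources`, `query`, and `num` argument names are assumptions made for the example. Only query-search and extra-xml are named in this section, so check the API reference for the actual argument list before relying on any of these names.

```python
from urllib.parse import urlencode


def build_query_search_params(source_defaults, overrides=None, extra_xml=None):
    """Merge a source's default query-time settings with per-call overrides.

    Values in `overrides` win over `source_defaults`, mirroring the way
    query-search arguments take precedence over settings stored in a source.
    """
    params = dict(source_defaults)   # copy so the caller's defaults stay untouched
    params.update(overrides or {})   # per-call arguments replace the defaults
    if extra_xml is not None:
        # Advanced options and custom variables travel in extra-xml.
        params["extra-xml"] = extra_xml
    return params


def build_request_url(base_url, params):
    """Build a GET URL for a query-search call.

    ASSUMPTION: the function name is passed as `v.function`; the base URL
    is a placeholder for wherever the API endpoint is installed.
    """
    query = {"v.function": "query-search"}
    query.update(params)
    return base_url + "?" + urlencode(query)


# Hypothetical defaults standing in for what a source's Form would provide.
defaults = {"sources": "example-collection", "num": 10}

# The calling application overrides `num` and supplies the query text.
params = build_query_search_params(defaults, {"num": 25, "query": "watson"})
print(build_request_url("http://localhost/api", params))
```

Keeping the merge separate from the URL construction makes it easy to inspect which values came from the defaults and which were supplied by the calling application before any request is sent.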