Indexing Using Multiple Streams

Search collections can be configured to accept multiple streams of content for indexing.

In general, search collections that crawl principally non-English content in a single language should use a lexical analysis stream for that language (if available) because lexical analysis streams provide the best user experience. For more information, see Lexical Analysis Streams.

Otherwise, the primary and secondary streams are indexed appropriately, using English language-specific components for the following:

  • Segmentation: Determine the boundaries of a word
  • Tokenization: Determine which characters are part of a word as opposed to separators between words
  • Stemming: Determine which stemmers and knowledge bases should be used to reduce words to classes

Multiple streams can be defined at the same time, each with its own options and configuration. For example, if you have a small collection for which you want to increase recall, you can define three streams as follows:

    <vse-index-stream stem="none" />
    <vse-index-stream stem="depluralize+case" />
    <vse-index-stream stem="english+case" />

In that collection, a string like 'The men murder stones' would be indexed in three different streams:

    The men murder stones
    the man murder stone
    the man kill stone

Unlike an index with just one default stream, this index would behave in the following way:

  • A query for 'The' would match this document and another document whose original text was 'the pigs fly', but this document would be more relevant because it also matches in the unstemmed stream.
  • Queries for 'man' and 'kill' would match.

The advantage of multi-stream indexing is being able to combine high recall with precise relevancy ranking. With only one index stream, increasing recall often decreases precision in ranking - a stream with the stemmers english+case cannot differentiate between documents that contain 'kill' and documents that contain 'murder'. Similarly, with a single stream, increasing precision can decrease recall - a stream with no stemmer can tell the difference between 'apple' and 'Apple', but cannot return both documents for a single query of 'apple'. Using multiple streams combines the strengths of each of the individual streams. For example, combining the two streams that are mentioned above means that documents with the words 'apple', 'Apple', and potentially 'Macintosh' are all returned on the query 'Apple', but that the one with an exact match to the query term is ranked highest, since it matches in more streams.
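The ranking effect described above can be sketched with a toy model in Python. This is only an illustration and not how the indexer is implemented. The lookup tables stand in for real stemmers and knowledge bases, all function names are hypothetical, and the sketch assumes that the query term is normalized by each stream's stemmer before matching and that relevance is simply the number of streams in which the term matches.

```python
# Toy model of multi-stream indexing. The lookup tables below are
# hypothetical stand-ins for real stemmers and knowledge bases.
DEPLURALIZE = {"men": "man", "stones": "stone", "pigs": "pig"}
WORD_CLASS = {"murder": "kill"}  # assumed knowledge-base mapping


def stem_none(word):
    """Stream 1 (stem="none"): keep the word exactly as written."""
    return word


def stem_depluralize_case(word):
    """Stream 2 (stem="depluralize+case"): lowercase, then singularize."""
    word = word.lower()
    return DEPLURALIZE.get(word, word)


def stem_english_case(word):
    """Stream 3 (stem="english+case"): as stream 2, then map to a word class."""
    word = stem_depluralize_case(word)
    return WORD_CLASS.get(word, word)


STREAMS = [stem_none, stem_depluralize_case, stem_english_case]


def index_document(text):
    """Return one set of indexed terms per stream."""
    words = text.split()
    return [{stem(w) for w in words} for stem in STREAMS]


def matching_streams(query, indexed):
    """Count the streams in which the query term matches the document."""
    return sum(1 for stem, terms in zip(STREAMS, indexed) if stem(query) in terms)


doc = index_document("The men murder stones")
other = index_document("the pigs fly")

for query in ("The", "man", "kill", "murder"):
    print(query, matching_streams(query, doc), matching_streams(query, other))
```

In this sketch, 'The' matches the first document in all three streams but the second document in only two, so the first document scores higher. 'kill' matches only in the most aggressive stream, while the exact term 'murder' matches in every stream.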

The disadvantage of multi-stream indexing is that it produces a larger index. This has a number of consequences. Where n is the number of streams:

  • index size and merging time increase by about 1/3 * n
  • indexing (but not crawling) takes n times longer
  • number of queries run against the index is multiplied by n
  • search performance will be slower due to the need to search the larger index

The tradeoff between more precise search results on one side and a larger index with a potential negative performance impact on the other is application-specific. It depends on considerations such as the amount of data that you are indexing, the amount of storage available for your indices, and the hardware configuration of the system on which the Watson™ Explorer Engine indexer is running.