Indexing Internals

The indexer passes the data it receives to a builder thread, which processes the IBM XML in the form of documents. The indexed information is initially stored in buffers in RAM. As the buffers fill up, the newly indexed information is written to segments on disk. All data maintained by the builder threads is transient. If the indexer-service is aborted for any reason, then:

  • the data will never reach the index, and will be lost
  • the crawler will mark these URLs as requiring a recrawl and, in a normally functioning system, they will eventually be recrawled and sent again to the indexer.

To avoid having to reprocess large amounts of data, and to ensure that updates become available to searchers in a timely fashion, the builder buffers are periodically passed to the merger thread. By default, this happens when the builder has buffered 100 URL updates (specified in the build-flush option) or when the builder has been idle for 30 seconds (specified in the build-flush-idle option). To pass the buffered data to the merger, the builder flushes all memory buffers to disk as segments, which the merger receives and commits to persistent storage. At this point the data is guaranteed to eventually appear in the index, even if the indexer is aborted for any reason. The builder then pauses until there are fewer segments than the number specified in the max-unmerged option, after which it proceeds to prepare more data.
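The two default flush triggers can be illustrated with a minimal sketch. The class and method names below are hypothetical and are not part of the product; only the default values (100 updates, 30 seconds) come from the text above. The rule that an empty buffer is never flushed is an assumption made for the example.

```python
from dataclasses import dataclass


@dataclass
class FlushPolicy:
    """Hypothetical model of the builder's two flush triggers."""
    build_flush: int = 100           # buffered URL updates (build-flush default)
    build_flush_idle: float = 30.0   # idle seconds (build-flush-idle default)

    def should_flush(self, buffered_updates: int, idle_seconds: float) -> bool:
        # Assumption: with nothing buffered there is nothing to hand to the merger.
        if buffered_updates == 0:
            return False
        # Flush when either threshold is reached.
        return (buffered_updates >= self.build_flush
                or idle_seconds >= self.build_flush_idle)


policy = FlushPolicy()
print(policy.should_flush(100, 0.0))   # buffer threshold reached
print(policy.should_flush(12, 31.0))   # idle threshold reached
print(policy.should_flush(12, 5.0))    # neither threshold reached
```

Lowering either value in such a model makes updates reach the merger sooner at the cost of producing more, smaller segments.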

The merger is a background process that manages the set of indices and the set of segments. The merger periodically wakes up and examines the current set of indices and segments. If there are segments to be processed, it begins merging all of the segments into a new index. At this point, the merger may also elect to include one or more of the existing indices in the merge. Indices are combined as long as the total merge size is approximately the same size as, or larger than, the latest index. The merger may also elect to merge indices to enforce the value specified in the max-indices option. If there are more indices and segments than the number specified in the max-merge option, the merger needs to make multiple passes to perform the merge operation. After all the segments and the selected indices have been merged, the newly created index file replaces all of those segments and indices. At the same time, the reconstructor is informed of all documents that were modified during the merge. These changes are atomically committed to persistent storage.
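The selection rule and the multi-pass behavior can be sketched as follows. This is an illustration under stated assumptions, not the merger's actual algorithm: the function names, the newest-to-oldest walk, the `slack` factor standing in for "approximately the same size", and the assumption that each pass merges at most max-merge inputs into one output are all hypothetical.

```python
import math


def select_indices(segment_sizes, index_sizes_newest_first, slack=0.8):
    """Choose existing indices to fold into a merge (hypothetical rule).

    Walk the indices from newest to oldest and keep adding one while the
    running merge size is at least `slack` times that index's size.
    Returns the chosen index sizes and the total merge size.
    """
    total = sum(segment_sizes)
    chosen = []
    for size in index_sizes_newest_first:
        if total < slack * size:
            break
        chosen.append(size)
        total += size
    return chosen, total


def merge_passes(input_count, max_merge):
    """Passes needed if each pass merges at most `max_merge` inputs into one."""
    if max_merge < 2:
        raise ValueError("max_merge must be at least 2")
    if input_count == 0:
        return 0
    # Each pass reduces the number of inputs by at most (max_merge - 1).
    return max(1, math.ceil((input_count - 1) / (max_merge - 1)))


chosen, total = select_indices([5, 5, 5], [12, 40, 200])
print(chosen, total)
print(merge_passes(input_count=4, max_merge=3))
```

In this example the three segments (total size 15) absorb the size-12 index, but the size-40 index is left alone because the running total of 27 is well below it. Four inputs with a limit of three inputs per pass need two passes.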

Note: In the Watson™ Explorer Engine administration tool, all of the options discussed in the previous section are located on the Configuration > Indexing tab for a search collection, in the Indices or Advanced sections.

The reconstructor is a background process that updates the shingles used for duplicate detection and the Fast-Indexing data for modified documents. Its updates affect search results immediately. Note that there is a slight lag between the time a document is modified and the time that the reconstructor is able to update the information.

Independent of the indexing work, the crawler also periodically transmits the graph representation of the URL hyperlinking to the indexer. The link-analysis thread processes the graph, running a bibliographic-style link analysis algorithm to assign background relevance weights to documents. The graph is also transmitted when the crawler completes all outstanding work.

Status information for all of these threads is displayed in the detailed Live/Staging Status pages. Additional information about the data that has been indexed is available on the Indexing tab of either of these status pages. The Indexed Data section lists the content elements that are currently in the index, along with the number of words in each content. This does not include any data in the builder threads, nor does it include the data in the segments. It does reflect deletions and updates that are committed in the index. The following are a few special contents that are used internally:

  • acl# : the count is the number of ACL entries that are in the index, and the number of words is the total number of words covered by each ACL. That is, if a word is covered by 2 ACLs, it is counted as 2 words in this table. There is no cost to these words; the counts are reported for informational purposes only.
  • del# : the number of regions in the index that have been marked for deletion but have not yet been garbage collected. The number of words is the number of words that would be removed from physical storage if all deletions were collected.
  • doc# : the number of Documents vs. URLs elements in the index. Each time a document is indexed, it is stored in an index.
  • url# : each contiguous interval of indexed data is marked with a url# entry. The number of words in these contents is the total number of words in the index. The words themselves do not have any cost; the count is purely informative.
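The acl# counting rule, in which a word covered by two ACLs is counted twice, can be shown with a small worked example. The representation of an ACL as a half-open range of word positions is an assumption made purely for illustration, and the function name is hypothetical.

```python
def acl_counts(acl_regions):
    """Return (entry count, word count) for a list of ACL regions.

    Each region is a half-open (start, end) range of word positions.
    Overlapping regions are deliberately not de-duplicated, so a word
    covered by two ACLs contributes two to the word count.
    """
    entries = len(acl_regions)
    words = sum(end - start for start, end in acl_regions)
    return entries, words


# Two ACLs over a 120-word document; words 50..99 are covered by both.
regions = [(0, 100), (50, 120)]
print(acl_counts(regions))  # (2, 170)
```

The reported word count (170) exceeds the 120 distinct words because the 50 overlapping words are counted once per ACL.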

At the bottom of this same status page is a table that provides details about all current index files and segments. Provided that your collection has received a query, the entries for the index files include information about the cache memory being used. Opening the toggle for a file shows the same Indexed Data information for just that index file. It also shows useful information about the memory requirements and disk cache status of the collection. The following values are reported:

  • Memory resident map: information about the collection that must be loaded in RAM to perform a search. Its size is a function of the block size.
  • Contents cache size and Contents disk size : each content requires information about its occurrences. For performance, it is very important that the cache size equal the disk size. If it does not, the index is shown in red.
  • Text cache size and Text disk size : to provide contextual summaries, titles, and so on, all of the text is stored in the index. If possible, providing enough cache memory to hold all of the text will greatly improve performance.
  • Words cache size and Words disk size : this is the standard inverted index data structure which is used to rank the user queries.
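The contents cache check described above can be mimicked when reviewing these values by hand. The following is a minimal sketch; the input format (a mapping from index file name to a pair of contents cache size and contents disk size, in bytes) and the function name are hypothetical and do not correspond to any product API.

```python
def undercached_contents(index_files):
    """Return the names of index files whose contents cache is smaller
    than the contents on disk (the case the status page shows in red)."""
    return [
        name
        for name, (cache_bytes, disk_bytes) in index_files.items()
        if cache_bytes < disk_bytes
    ]


# Hypothetical values copied by hand from the status page.
files = {
    "index-0": (512, 512),    # fully cached
    "index-1": (256, 1024),   # only a quarter of the contents is cached
}
print(undercached_contents(files))  # ['index-1']
```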

In addition to the memory recorded in this table as being used, the indexer will also use memory for each builder thread (approximately 250 MB by default), the memory shown in the Fast-index memory usage details section, and between 32 and 128 bytes for each document.
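These figures can be combined into a rough estimate of total indexer memory. The sketch below is illustrative only: the function and parameter names are hypothetical, the 250 MB builder figure is taken as 250 × 10^6 bytes (an assumption about the unit), and the inputs in the example are invented numbers rather than measurements.

```python
BUILDER_THREAD_BYTES = 250 * 10**6    # approximate default per builder thread
DOC_BYTES_LOW, DOC_BYTES_HIGH = 32, 128


def estimate_indexer_memory(table_bytes, builder_threads, fast_index_bytes, documents):
    """Return a (low, high) estimate of indexer memory in bytes.

    table_bytes      -- memory reported in the index file table
    builder_threads  -- number of builder threads
    fast_index_bytes -- value from the Fast-index memory usage details section
    documents        -- number of indexed documents
    """
    fixed = table_bytes + builder_threads * BUILDER_THREAD_BYTES + fast_index_bytes
    return (fixed + documents * DOC_BYTES_LOW,
            fixed + documents * DOC_BYTES_HIGH)


low, high = estimate_indexer_memory(
    table_bytes=4_000_000_000,
    builder_threads=2,
    fast_index_bytes=500_000_000,
    documents=10_000_000,
)
print(low, high)  # 5320000000 6280000000
```

For these example inputs the per-document overhead alone spans roughly 0.3 GB to 1.3 GB, so the document count can matter as much as the builder threads.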