High Availability and Scalability Through Distributed Indices

Today's enterprise environments depend on search capabilities, making the capacity, performance, and availability of an enterprise search solution more important than ever. If the system to which your queries are submitted fails, the storage that holds your search collections fills up or goes offline, or the index for a search collection is corrupted by more subtle hardware problems, your enterprise search applications will be inaccessible until the problem is identified and resolved. Housing all of the data for your search applications on a single system makes that system a potential single point of failure for the applications that use those search collections.

In addition to potential access and availability problems, hosting search collections on a single system can cause a variety of performance problems, including:

network bottlenecks as multiple clients attempt to access the collection(s) on a single server
disk and network bottlenecks as multiple clients of collaborative search applications attempt to update the collections that are associated with those applications
general performance issues when the size of the indices for a search collection exceeds the amount of physical memory that is currently available. Performance suffers because of the time required for swapping and for loading new portions of the index from disk into memory. Swapping always affects performance when the index size exceeds the total size of physical memory.

Potential availability problems and a variety of possible performance issues can be addressed by distributing network load, memory requirements, and search application indices across multiple systems. Watson™ Explorer Engine provides several distributed index technologies, so that you can select the mechanism that best fits your search applications and hardware environment and that best addresses any performance or availability issues that you may be experiencing. These technologies are the following:

Replication, which maintains an identical copy of indices on multiple machines. Only the search collection on a master server is actively updated. Copies of that index are then pushed to other machines in a pool of replicas, as needed or as scheduled. Queries can then be distributed across all systems that hold a complete replica of the index, spreading the system load evenly across those systems (load balancing). Replication is often referred to as remote-push because data is pushed from a server to a client on the server's schedule; the client is simply the passive recipient of the updated index.
Replication is best suited for use with small search collections or larger search collections that change infrequently, because of the time and system overhead required to copy an entire index from the master server to each replica whenever the master index is updated. Replication also requires robust and reliable network connectivity to ensure that index files can be copied in their entirety, because that copy is an atomic operation.
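
The way queries can be spread evenly across a pool of complete replicas can be illustrated with a minimal round-robin sketch. This is not Watson Explorer Engine code or configuration: the function name assign_queries and the host names are hypothetical, and round-robin is only one assumed way of distributing load.

```python
from itertools import cycle


def assign_queries(queries, replicas):
    """Pair each query with a replica host in round-robin order."""
    if not replicas:
        raise ValueError("at least one replica is required")
    hosts = cycle(replicas)  # endlessly repeats the replica list in order
    return [(query, next(hosts)) for query in queries]


# Hypothetical host names, used for illustration only.
replicas = ["replica-a", "replica-b", "replica-c"]
assignments = assign_queries(["q1", "q2", "q3", "q4"], replicas)
```

Because every replica holds the complete index, any host in the pool can answer any query, which is what makes this kind of even rotation possible.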

Note: If you set up remote-push in a version of Watson Explorer earlier than version 11.0.0.1 and did not specify user names and passwords in your remote-push configuration, remote-push and mirror configurations will stop working when you install Watson Explorer version 11.0.0.1. In addition, if your remote-push configuration includes unencrypted passwords, remote-push will stop working when you install Watson Explorer version 11.0.0.1. To fix this issue, enter (or reenter) the user name and password for your Search Collection(s) in the Watson Explorer Engine Administration tool at Configuration > Remote > Remote Index. Open the mirror and edit the user name and password for each Server listed. Edit the same fields under Global Configuration.

Note: Not all parts of the configuration XML are evaluated at the same time. Specifically, unlike the remainder of the collection configuration, the remote-push configuration is not evaluated when the collection services start.
Segmentation, which divides an index across multiple systems in order to reduce network bottlenecks, reduce system-specific storage requirements, and address system-specific memory limitations. Segmentation splits an index into multiple segments, which can then be stored on different systems. The size and deployment of these segments can be tailored to best fit the amount of physical memory that is available on participating systems.
Segmentation is designed for use with larger search collections whose index will not fit into memory on specific systems or where other system processes cause frequent swapping of a search application's index that could otherwise remain resident. Like replication, segmentation is best suited for search collections whose index changes infrequently, because an appropriate index segment must be copied to each target system from the master server whenever the master index is updated. Segmentation also requires robust, reliable network connectivity to ensure that large index segments will be copied successfully, because that copy is an atomic operation.
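
As a rough planning aid, the relationship between index size and available memory can be sketched as a simple calculation. The function below is a hypothetical helper, not part of Watson Explorer Engine, and it assumes that every participating system can dedicate the same memory budget to its segment.

```python
def segments_needed(index_bytes, memory_budget_bytes):
    """Return the smallest number of equal segments that each fit the budget."""
    if index_bytes < 0:
        raise ValueError("index size cannot be negative")
    if memory_budget_bytes <= 0:
        raise ValueError("memory budget must be positive")
    # Integer ceiling division avoids floating-point rounding on large sizes.
    count = (index_bytes + memory_budget_bytes - 1) // memory_budget_bytes
    return max(1, count)


GIB = 1024 ** 3
# Illustrative figures only: a 100 GiB index and 32 GiB usable per system.
count = segments_needed(100 * GIB, 32 * GIB)
```

With these illustrative figures the index needs four segments, because three segments of 32 GiB would hold only 96 GiB.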

Note: Another approach to segmentation is to segment the data that you are crawling rather than its index. For example, crawls of large resources such as corporate email archives are often segmented into ranges by user or department, increasing the parallelism of the crawl and producing smaller indices than those for the entire archive. (The resulting indices can then be combined for use by a single search application by adding them to a source bundle.) The discussion of distributed indices in this section and the following tutorials focuses on automatic segmentation within Watson Explorer Engine, not manual segmentation that is done by data examination.
Synchronization, which maintains multiple instances of an index on each system within a set of machines, and incrementally shares index updates with the other systems in that set so that each instance of the index is kept up to date. Changes to that index, such as new or modified annotations, can be made on any system within a synchronization set, and those changes will eventually propagate to all systems within that set. The systems within a synchronization set are referred to as being consistent when all of them have received each other's updates and no additional updates are pending.
Synchronization is best suited to the search collections used by collaborative search applications, where an index is frequently being updated with tags, comments, ratings, and other annotations. Synchronization is also well-suited to lower-speed, higher-traffic, or less robust networking environments because less data needs to be exchanged when synchronizing indices across clients than would be required when copying entire indices. Synchronization is especially well-suited to less reliable networking environments because updates are exchanged transactionally, and are automatically retried if they fail initially.
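
The notion of a consistent synchronization set can be illustrated with a small model in which each system holds a set of update identifiers. This is a conceptual sketch only: the host names, update identifiers, and the functions exchange and is_consistent are hypothetical, and the model ignores the transactional exchange and retry behavior described above.

```python
def is_consistent(nodes):
    """True when every system holds exactly the same set of updates."""
    update_sets = list(nodes.values())
    return all(updates == update_sets[0] for updates in update_sets)


def exchange(nodes, first, second):
    """Two systems share updates; afterwards both hold the union."""
    merged = nodes[first] | nodes[second]
    nodes[first] = set(merged)
    nodes[second] = set(merged)


# Hypothetical synchronization set: updates were made on two different hosts.
nodes = {"host-1": {"tag:1"}, "host-2": {"comment:7"}, "host-3": set()}
before = is_consistent(nodes)

exchange(nodes, "host-1", "host-2")
exchange(nodes, "host-2", "host-3")
after = is_consistent(nodes)
```

In this model only the missing updates move between systems, which mirrors why synchronization exchanges less data than copying an entire index.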

In Watson Explorer Engine, any member of a synchronization set, complete replica of an index, or set of segments that comprise a complete copy of an index is known as a mirror.

All of these distributed indexing mechanisms help solve system load and network problems by making it possible to distribute query load across multiple systems. They also help guarantee search application availability by eliminating any single system and any single index as a potential point of failure. Finally, they all support load-balancing and failover, sharing the same mechanism for defining and configuring these capabilities in the Watson Explorer Engine administration tool.
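
The failover idea can be sketched as choosing the first mirror that is still reachable. The sketch assumes a caller-supplied availability check; the function pick_mirror, the mirror names, and the check itself are hypothetical and do not represent how Watson Explorer Engine is configured or implemented.

```python
def pick_mirror(mirrors, is_available):
    """Return the first mirror, in preference order, that passes the check."""
    for mirror in mirrors:
        if is_available(mirror):
            return mirror
    raise RuntimeError("no mirror is available")


# Hypothetical scenario: the preferred mirror is offline.
offline = {"mirror-1"}
chosen = pick_mirror(
    ["mirror-1", "mirror-2", "mirror-3"],
    lambda mirror: mirror not in offline,
)
```

Because each mirror represents a complete copy of the index, a query that cannot reach one mirror can be answered by another.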

Important: Using any form of distributed indexing across hosts that are running different versions of Watson Explorer is not supported. Each new version of Watson Explorer provides fixes to problems that have been encountered in previous releases, as well as incremental improvements in how distributed indexing is communicated and coordinated across all participating systems.

To use a distributed index, an instance of Watson Explorer Engine must be running on each system that is a member of a synchronization set, hosts a replicated index, or hosts any portion of a segmented index. This enables Watson Explorer Engine to coordinate index use, synchronize indices across all participating servers, perform load balancing, and react appropriately should hardware problems make any host or its associated indices unavailable.

Note: When using distributed indexing, the order of documents may differ between client and server for various reasons:

The order in which documents are sent to the indexer by the crawler can differ from the order in which the indexer processes them and replies to the crawler. The processing order is used for sequencing distributed index updates.
To increase performance through parallelism, distributed index clients apply updates in parallel rather than in a strict order. The time at which an update is applied can therefore differ between client and server. However, when the ranking of multiple search results is identical, the indexer uses the order in which it indexed those documents as the "tie-breaker".

These potential differences between client and server indices can cause documents to appear in a different order at search time, and can also affect deduplication.
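
The effect of the tie-breaker can be shown with a small sketch in which two documents have identical scores but were indexed in a different order on the server and on a client. The scores, document names, and the rank function are hypothetical; the sketch only models the tie-breaking rule described above, not the actual ranking logic.

```python
def rank(scores, indexed_order):
    """Sort by score, highest first; break ties by indexing position."""
    position = {doc_id: i for i, doc_id in enumerate(indexed_order)}
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], position[doc_id]))


# Hypothetical scores: doc-b and doc-c rank identically.
scores = {"doc-a": 0.9, "doc-b": 0.5, "doc-c": 0.5}

# The same documents, applied in a different order on each side.
server_order = ["doc-a", "doc-b", "doc-c"]
client_order = ["doc-a", "doc-c", "doc-b"]

server_results = rank(scores, server_order)
client_results = rank(scores, client_order)
```

Here the server returns doc-b ahead of doc-c while the client returns doc-c ahead of doc-b, even though both hold the same documents with the same scores.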