Watson Explorer Engine on Virtual Machines or in the Cloud

IBM recommends deploying Watson™ Explorer Engine on physical hardware, using high-speed local storage whenever possible. However, Watson Explorer Engine has been used successfully on virtual machines, including virtual machines deployed in cloud environments, both as an embedded component of other applications and in support of information optimization and traditional enterprise search applications built on the Watson Explorer Engine Platform.

Note: Watson Explorer Engine can only be used in virtual machine environments that support the architecture for which Watson Explorer Engine is compiled: 64-bit (AMD64 or Intel 64) x86.
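
The architecture requirement in the note above can be checked on a candidate virtual machine before installation. The following Python sketch is illustrative only and is not part of Watson Explorer Engine. The set of accepted machine strings and the helper name is_x86_64 are assumptions made for this example; platform.machine() is the standard Python call that reports the machine type.

```python
import platform

# Machine strings commonly reported for 64-bit x86 systems.
# This set is an assumption for illustration, not an IBM-published list.
X86_64_NAMES = {"x86_64", "amd64"}


def is_x86_64(machine: str) -> bool:
    """Return True if the machine string names a 64-bit x86 architecture."""
    return machine.strip().lower() in X86_64_NAMES


if __name__ == "__main__":
    machine = platform.machine()
    status = "supported" if is_x86_64(machine) else "not supported"
    print(f"{machine}: {status} architecture")
```

Running the script inside the virtual machine, rather than on the host, matters here, because the guest is the environment in which Watson Explorer Engine would execute.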

Internal and customer testing of Watson Explorer Engine in cloud and virtual machine environments highlights some general considerations:

  • Watson Explorer Engine is I/O intensive (network and disk) when crawling
  • Watson Explorer Engine is I/O (network and disk), CPU, and memory intensive when responding to query requests

As these points indicate, I/O bottlenecks or resource limitations on the systems that host a virtual machine or cloud computing environment, or in the virtual machines themselves, can substantially reduce Watson Explorer Engine performance. These bottlenecks and limitations can be compounded if the data that your Watson Explorer Engine Platform application crawls is also located in cloud-based storage that is accessed through another virtual machine. In general, because virtual machine implementations, storage implementation and performance, and virtual machine resource allocation all vary, even well-tuned cloud or virtual machine installations of Watson Explorer Engine may not match the performance of an installation that runs on dedicated physical hardware and uses local storage.

Tip: Virtual machines can easily be deployed to handle front-end functionality for Watson Explorer Engine, including clustering, federation, and searches of smaller collections or of sources that point to back-end servers. Those back-end servers can be physical machines, which are more capable of handling the disk-bound tasks associated with search engine processing. Virtual machines that directly crawl and index data may require significant configuration to address I/O-related performance issues.

While Watson Explorer Engine has already been used successfully in cloud-based applications, the underlying cloud and virtual machine environments must be correctly configured to provide sufficient resources and throughput for Watson Explorer Engine to execute and respond quickly. Performance problems associated with Watson Explorer Engine in these environments are almost always correctable by increasing or reconfiguring the resources associated with the virtual machine(s). Some general suggestions for configuring Watson Explorer Engine in a cloud/virtual machine environment are the following:

  • Use the most powerful virtual machine configuration possible. Virtual machine instances with 64 GB RAM and 16 cores have been able to satisfy performance requirements for various customers, but greater amounts of memory typically provide better performance.
  • I/O performance is difficult to predict because the resource consumption of other virtual machines on the same server cannot be anticipated. Similarly, data retrieval rates differ depending on whether the data resides on physical machines or on other virtual machines in the cloud. Using the most powerful virtual machines possible can help mitigate the impact of other virtual machines running on a server.
  • Crawling performance can be improved by increasing the number of converters that can run at any given time beyond the standard recommendation of (number-of-converters = number-of-cores). This setting is in the Global Settings -> Converting section of a search collection's Configuration -> Crawling tab. For VMs, a suggested starting number is (2 * number-of-cores).
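
The converter guidance in the last suggestion is simple arithmetic, and the following Python sketch shows one way to compute a starting value for a given machine. It is illustrative only: the function name suggested_converters and its parameters are hypothetical, and the script does not read or change any Watson Explorer Engine setting. The resulting number would still be entered manually in the Converting section described above.

```python
import os


def suggested_converters(cores: int, virtual: bool) -> int:
    """Return a starting converter count from the core count.

    Physical hardware: number-of-converters = number-of-cores.
    Virtual machines:  number-of-converters = 2 * number-of-cores.
    """
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return 2 * cores if virtual else cores


if __name__ == "__main__":
    # os.cpu_count() can return None, so fall back to 1 in that case.
    cores = os.cpu_count() or 1
    print(f"cores detected: {cores}")
    print(f"suggested converters (physical): {suggested_converters(cores, False)}")
    print(f"suggested converters (VM): {suggested_converters(cores, True)}")
```

For the 16-core virtual machine mentioned earlier in the list, this gives a starting value of 32 converters. Treat the result as a starting point to tune from, because actual crawling throughput depends on the I/O behavior of the host.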

If you use IBM's SoftLayer global cloud infrastructure, IBM recommends using SoftLayer's iSCSI storage. If you use Amazon as your cloud host, IBM recommends using Amazon's Elastic Block Store (EBS).