Common Performance Improvements
The following list provides a high-level view of common techniques for improving the performance of a Watson™ Explorer Engine Platform application. These items are discussed in more detail throughout this document:
- Binning - Large numbers of binning-sets configured in a collection can cause performance issues at search time. Nested binning-set calculations have been optimized, but those optimizations do not apply to sibling binning-sets. Investigate using dynamic binning if applicable.
- Collection Caching - If the index for a given collection can fit into the amount of RAM that is available on the machine, that collection should be cached in memory. All collections that are associated with a given Watson Explorer Engine Platform installation must be considered during this calculation. If not all collections can fit into RAM for caching, you should cache those that are the most complex and sizable. See Caching Index Data for information about specifying and using caching.
- Conversion Minimization - All of the data that is crawled by a Watson Explorer Engine Platform application goes through a variety of conversion steps to extract content and convert it into a form that can be indexed easily. Minimizing the number of conversion steps used for different types of content can improve performance prior to indexing.
- Duplicate Detection - Disabling duplicate detection in the Indexer configuration will reduce indexing time because shingles will not be calculated.
- Index Planning - Data that does not need to be searchable should not be included in the index. Excluding it helps minimize index size, which can improve access times when searching the index and also increases the number of collections that can be cached in memory.
- Language Detection - Language detection is the process of identifying the national language in which the data that you are indexing is provided. Disabling language detection in the Normalization converter will reduce conversion time.
- Search Collection Output - A search collection should output only those content elements that will be shown to an end user or that are used for internal calculations. Returning additional content elements increases the amount of output from the search engine, which may have a minor impact on performance and also increases the size of the output XML or HTML itself.
- Task Distribution - The time required to crawl a remote data source depends on the responsiveness and general performance of that resource. Depending on the load requirements for an application, you should consider crawling a data source on a system other than the one that responds to queries.
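
The Collection Caching calculation described above can be sketched as a small planning script. This is only an illustration, not part of the Watson Explorer Engine Platform: the `Collection` class, the `choose_cached` function, and the sizes used are hypothetical, and the sketch uses index size as a stand-in for "most complex and sizable". Actual index sizes must be measured on your installation, and caching itself is configured as described in Caching Index Data.

```python
from dataclasses import dataclass


@dataclass
class Collection:
    """Hypothetical record of one collection and its on-disk index size."""
    name: str
    index_bytes: int


def choose_cached(collections, ram_budget_bytes):
    """Return the names of the collections to cache within the RAM budget.

    If every index fits, all collections are returned in their original
    order. Otherwise the largest indexes are preferred, and any index
    that no longer fits in the remaining budget is skipped.
    """
    total = sum(c.index_bytes for c in collections)
    if total <= ram_budget_bytes:
        return [c.name for c in collections]

    chosen = []
    remaining = ram_budget_bytes
    # Assumption: size approximates "most complex and sizable".
    for c in sorted(collections, key=lambda c: c.index_bytes, reverse=True):
        if c.index_bytes <= remaining:
            chosen.append(c.name)
            remaining -= c.index_bytes
    return chosen


GIB = 1024 ** 3

# Illustrative sizes only; these are not measurements from a real system.
example = [
    Collection("docs", 2 * GIB),
    Collection("mail", 4 * GIB),
    Collection("wiki", 3 * GIB),
]
print(choose_cached(example, 7 * GIB))
```

The RAM budget passed to the function should be the memory that remains after accounting for the operating system and other processes, not the total installed RAM.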