Troubleshooting Ingestion Performance
The Watson™ Explorer Engine platform provides many settings and techniques that can be fine-tuned to improve ingestion performance. The most common of these include the following:
- Examine the crawl-urls elements that you are enqueueing, and adjust the number of crawl-url, crawl-delete, and index-atomic nodes that they contain. Enqueueing crawl-urls elements that contain smaller, larger, or more predictably-sized numbers of nodes can positively affect application performance depending upon the system constraints under which your application is operating.
- Examine the number of threads that your application is using to enqueue data, and adjust that number. Using more or fewer threads can positively affect application performance depending upon the hardware constraints under which your application is operating.
Note: When enqueueing URLs, using more than two concurrent enqueue threads conflicts with the HTTP 1.1 specification's recommendation that a single client maintain at most two concurrent connections to a given web server, and can cause problems when interacting with Microsoft IIS web servers. To work around this limitation and improve performance, increase the maximum number of connections that your Watson Explorer Engine applications can have to a web server. You can do this by adding code to your applications that sets this limit or, if you are writing a .NET application, by modifying the App.config file for the projects for those applications. See the Watson Explorer Engine API Developer's Guide for more information.
- If your application enqueues URLs that it must subsequently crawl to retrieve indexable
content, adjust the value of the Total concurrent requests (the maximum number of
URLs that can be fetched at one time) and Concurrent requests to the same host
options to increase parallelism and enable your application to retrieve as much data as
possible in a single request.
In Watson Explorer Engine platform API applications, the total number of concurrent requests is specified by setting the n-concurrent-requests curl-option, and the number of concurrent requests to the same host is specified by setting the n-fetch-threads crawl-option in the crawl-options for a search collection.
See Global Settings for information about setting these options in the Watson Explorer Engine administration tool.
- If your Watson Explorer Engine platform application enqueues unconverted data, adjust the number of converters so that more converters can execute simultaneously. The number of converters that can run at one time is specified by setting priority levels, and the number of converters associated with each level, in the n-link-extractor crawl-option in the crawl-options for a search collection.
See Converter Configuration Settings for information about setting this option in the Watson Explorer Engine administration tool.
- Consider switching synchronization modes to improve performance. See Selecting an Enqueue Synchronization Mode for a detailed discussion of synchronization modes, their performance implications, and the circumstances under which each should be used.
- If using the enqueued synchronization mode, changing your application to limit the maximum amount of in-flight data at any given time can often improve performance. The enqueued synchronization mode reports success as soon as the data associated with a URL has been enqueued. Without such a limit, huge amounts of data can be enqueued at one time, which can cause converters and indexers to consume too many resources on the system where Watson Explorer Engine is installed.
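
Several of the adjustments above (the number of nodes per enqueue, the number of enqueue threads, and a cap on in-flight data) can be combined in the client application. The following is a minimal Python sketch of that idea, not Watson Explorer Engine API code. The names `chunked`, `InFlightLimiter`, `enqueue_all`, and `send_batch` are hypothetical, as are the default values. In particular, `send_batch` is assumed to be a function that you supply, which enqueues one batch and returns only when your application considers that data no longer in flight. How your application detects that point depends on your synchronization mode and is not shown here.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


class InFlightLimiter:
    """Blocks the producer while too many bytes are outstanding."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.in_flight = 0
        self.peak = 0  # highest in-flight total observed, for tuning
        self._cond = threading.Condition()

    def acquire(self, nbytes):
        with self._cond:
            # A batch larger than the whole budget is still admitted when
            # nothing else is in flight, so the producer cannot deadlock.
            while self.in_flight > 0 and self.in_flight + nbytes > self.max_bytes:
                self._cond.wait()
            self.in_flight += nbytes
            self.peak = max(self.peak, self.in_flight)

    def release(self, nbytes):
        with self._cond:
            self.in_flight -= nbytes
            self._cond.notify_all()


def enqueue_all(docs, send_batch, batch_size=50, n_threads=2,
                max_in_flight_bytes=8 * 1024 * 1024):
    """Send `docs` (bytes objects) in batches on a small thread pool.

    `send_batch` is a hypothetical, caller-supplied function (see lead-in).
    Returns the peak number of bytes that were in flight at one time.
    """
    limiter = InFlightLimiter(max_in_flight_bytes)

    def worker(batch, nbytes):
        try:
            send_batch(batch)
        finally:
            # Free the budget even if the send failed.
            limiter.release(nbytes)

    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        futures = []
        for batch in chunked(docs, batch_size):
            nbytes = sum(len(doc) for doc in batch)
            limiter.acquire(nbytes)  # blocks here when the budget is used up
            futures.append(pool.submit(worker, batch, nbytes))
        for future in futures:
            future.result()  # re-raise any error from a worker
    return limiter.peak
```

In this sketch, `batch_size` corresponds to the number of nodes per enqueue, `n_threads` to the number of enqueue threads, and `max_in_flight_bytes` to the in-flight cap. Each can be varied independently while measuring throughput, and the returned peak value indicates whether the cap was actually reached during a run.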