Connector Performance Tuning
- Increasing Crawler Aggressiveness
- Another way to improve the performance of a connector is to enable multithreading and
then reduce the delay between requests. This will put more load on the server where the
resource that you are crawling is stored, but will allow the connector to crawl that
resource more quickly.
Note: Enabling multithreading is not supported in all connectors, and will increase Watson™ Explorer Engine and resource server memory consumption. Consider increasing the size of the Java heap to prevent Java OutOfMemory errors.
To optimize for speed, set the value of the Delay setting to 0 in the tab for the associated search collection. Setting this value to 0 will eliminate any delay between successive calls to the resource server, and will also cause the connector to create as many threads as it can in order to submit and service those requests.
Note: Setting the Delay option to 0 can cause additional errors to be introduced because the resource server or Watson Explorer Engine may not be able to keep up with incoming requests. However, it is still useful to try this setting when tuning a connector for performance, because this setting will provide the theoretical maximum performance for the crawl.
To tune for more balanced speed, adjust the value of the Delay setting to a value greater than 0 and less than the default value of 100 (this value is expressed in milliseconds). You may also want to adjust the value of the Concurrent requests to the same host setting in the Crawling aggressiveness section to a value greater than the default value of 1. This setting controls how many threads the connector creates when starting.
Note: Some of these settings are replicated in the configuration settings for certain connectors, both to highlight their relevance and to enable setting connector-specific Delay values. Settings that are replicated in the seed for a connector take precedence over the crawler settings, but only apply to URLs that are destined for that connector. This enables the use of different settings in multiple connectors that contribute to a single search collection.
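To get a rough feel for how the Delay and Concurrent requests to the same host settings interact, the following sketch estimates an upper bound on request throughput. It is a simplified model and not how the crawler schedules work internally: it assumes that each thread waits for the delay and then spends a fixed time on one request. The class name, the method name, and the 50 ms response time are hypothetical values chosen for illustration.

```java
public class CrawlRateEstimate {

    /**
     * Estimates an upper bound on requests per second under a simplified model:
     * each thread waits delayMs, then spends responseMs servicing one request.
     *
     * @param threads    number of concurrent requests to the same host
     * @param delayMs    delay between requests, in milliseconds
     * @param responseMs assumed average response time, in milliseconds
     */
    public static double maxRequestsPerSecond(int threads, long delayMs, long responseMs) {
        if (threads < 1 || delayMs < 0 || responseMs < 0 || delayMs + responseMs == 0) {
            throw new IllegalArgumentException("threads must be >= 1 and total time per request must be > 0");
        }
        return threads * 1000.0 / (delayMs + responseMs);
    }

    public static void main(String[] args) {
        long responseMs = 50; // hypothetical average response time of the resource server

        // Default settings described above: delay of 100 ms, 1 concurrent request.
        System.out.printf("default:    %.1f requests/s%n", maxRequestsPerSecond(1, 100, responseMs));

        // A more balanced setting: shorter delay, a few concurrent requests.
        System.out.printf("balanced:   %.1f requests/s%n", maxRequestsPerSecond(3, 20, responseMs));

        // Delay of 0: the theoretical maximum for a given thread count.
        System.out.printf("aggressive: %.1f requests/s%n", maxRequestsPerSecond(4, 0, responseMs));
    }
}
```

The estimate ignores errors, retries, and server-side throttling, so treat it as a ceiling to compare settings against and confirm the actual rate by monitoring the crawl.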
- Threading
- The number of threads the connector uses for crawling can be adjusted to balance memory
allocation and speed. There are many variables that determine how aggressively you can
connect to the associated resource server(s). You will develop a comfort level for your
particular environment by adjusting the number of threads that the connector uses and
monitoring its performance. Typically, using between 3 and 5 threads is sufficient for
most environments. The default value for threading is 1.
- Minimizing Error Level Logging
- Another setting that can have a performance impact is error level logging. By default,
error level logging is turned off. When it is turned on, be aware that the accumulation of
large log files can cause connector performance to suffer. In most cases, you should enable
debug mode and trace level logging only when doing error level logging. It is not recommended
to simply let log files build up over time in a production environment.
Tip: For more detailed information about advanced logging configuration settings, see the online resources for Log4j.
- Analyzing applications with JConsole
- JConsole is a graphical, JMX-compliant tool that connects to a running JVM and can therefore be used to analyze information about a connector. For more detailed instructions on using JConsole, see the online JConsole documentation.
- Analyzing applications with JMX
- The Java Management Extensions (JMX) supply tools for managing and monitoring
applications, system objects, devices and service oriented networks. To use JMX, you must
first enable the management port in a connector seed. Once that
port is enabled, you can use a variety of JMX-compliant tools to analyze exactly what the
connector is doing when in operation.
For example, the Java virtual machine (JVM), which has built-in instrumentation, can enable you to monitor and manage the performance of a connector using JMX. To enable the JMX agent and configure its operation, you must set certain system properties when you start the JVM. For detailed instructions, see the online resources for using JMX.
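As a minimal illustration of the JVM's built-in instrumentation, the sketch below reads heap usage and the live thread count through the standard platform MXBeans in `java.lang.management`. It reports on the JVM it runs in, not on a connector; attaching to a running connector still requires the management port described above and a JMX-compliant client. The class and method names are hypothetical.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;
import java.lang.management.ThreadMXBean;

public class JvmSnapshot {

    /** Returns a one-line summary of heap usage and live threads for the current JVM. */
    public static String describe() {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        MemoryUsage heap = memory.getHeapMemoryUsage();
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();

        long mb = 1024L * 1024L;
        long usedMb = heap.getUsed() / mb;
        // getMax() returns -1 when the maximum heap size is undefined.
        long maxMb = heap.getMax() < 0 ? -1 : heap.getMax() / mb;

        return String.format("heapUsedMB=%d heapMaxMB=%d liveThreads=%d",
                usedMb, maxMb, threads.getThreadCount());
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

These are the same memory and thread attributes that tools such as JConsole display when connected over JMX.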
- Profiling applications with VisualVM
- VisualVM is another visual tool that integrates several command-line Java Development Kit (JDK) tools and offers lightweight profiling capabilities, such as monitoring the memory use of a connector over time. For more detail on using VisualVM, consult the online VisualVM resources.
- Reducing Memory Footprint
- If you want to reduce the memory footprint of a connector, you may opt to turn off caching. However, caching can have a dramatic impact on speed. Therefore, instead of disabling the cache, you can adjust values in the advanced cache settings portion of the connector seed. Conversely, if you have a lot of memory, you can opt to increase the cache settings and heap size to prevent out-of-memory errors. Another option to consider is flushing the cache once security updates to the resource that you are crawling have been indexed, which will help improve the overall performance of the connector.
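The trade-off between cache size and memory can be illustrated with a generic bounded cache. The sketch below is not the connector's cache implementation and does not reflect its settings; it is a hypothetical least-recently-used cache built on `java.util.LinkedHashMap`, where a smaller `maxEntries` value lowers memory use at the cost of more cache misses.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical least-recently-used cache with a fixed upper bound on entries. */
public class BoundedCache<K, V> extends LinkedHashMap<K, V> {
    private static final long serialVersionUID = 1L;

    private final int maxEntries;

    public BoundedCache(int maxEntries) {
        // The third argument (true) orders entries by access instead of insertion.
        super(16, 0.75f, true);
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Evict the least recently used entry once the bound is exceeded.
        return size() > maxEntries;
    }

    public static void main(String[] args) {
        BoundedCache<String, String> cache = new BoundedCache<>(2);
        cache.put("doc-1", "acl-a");
        cache.put("doc-2", "acl-b");
        cache.get("doc-1");          // marks doc-1 as recently used
        cache.put("doc-3", "acl-c"); // evicts doc-2, the least recently used entry
        System.out.println(cache.keySet());

        cache.clear();               // flushing releases all cached entries at once
        System.out.println(cache.size());
    }
}
```

Calling `clear()` corresponds loosely to flushing a cache after a batch of updates has been indexed: memory is released immediately, and later lookups repopulate the cache.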