Specifying Appropriate Crawler Options
Watson™ Explorer Engine supports a number of crawling options that have a direct impact on performance, resource utilization, and usability. The crawl-option elements are global configuration options for the Watson Explorer Engine crawler, while the curl-option elements can be data- or URL-specific.
To enable a crawl-option or curl-option, select the Configuration tab for your search collection, select the XML tab, and click edit. Add the following XML code inside the crawler element, where OPTION is the name of the crawl-option or curl-option that you want to set, and VALUE is the value that you want to assign to that option:
    <crawl-options>
        <crawl-option name="OPTION">VALUE</crawl-option>
        <curl-option name="OPTION">VALUE</curl-option>
    </crawl-options>
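For example, a configuration that keeps the crawler online between enqueues and removes the per-server request delay might look like the following; the option names are discussed below, and the values shown are illustrative rather than recommendations:

    <crawl-options>
        <crawl-option name="idle-running-time">600</crawl-option>
        <curl-option name="delay">0</curl-option>
    </crawl-options>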
The following list identifies crawl-option and curl-option nodes that you may want to configure in your application:
- Maximum idle time: As discussed earlier in this section, setting this option to a
non-zero value increases the availability of the crawler. A large value will keep the
crawler online between enqueues, but the crawler will be consuming system resources during
this time. A small value will enable the crawler to go offline between enqueues, but will
make the crawler appear to be less responsive to new enqueues because each enqueue may need
to restart the crawler.
This option is located in the General section of a collection's Configuration > Crawling tab in the Watson Explorer Engine administration tool. It can also be set by supplying a value for the idle-running-time crawl-option.
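For example, you might keep the crawler online between enqueues by adding the following inside the crawl-options element (the value shown is illustrative):

    <crawl-option name="idle-running-time">600</crawl-option>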
- Default allow: Specifies how URLs that are not filtered by the configuration of
your search collection should be handled. You will usually want to set this value to
allow so that URLs which are not explicitly filtered will be crawled.
This option is located in the General section of a collection's Configuration > Crawling tab in the Watson Explorer Engine administration tool. It can also be set by supplying a value for the default-allow crawl-option.
Other possible values for this option are disallowed (such URLs will not be crawled) and log-disallowed (such URLs will not be crawled, but will be logged).
Note: One case in which you would not want to set this option to allow is if you are writing a push-pull application, where URLs or data are programmatically enqueued but contain links that you do not want to be crawled.
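For the common case in which unfiltered URLs should be crawled, you might add the following inside the crawl-options element:

    <crawl-option name="default-allow">allow</crawl-option>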
- Delay: Specifies the number of milliseconds to delay between requests to the same server. If you are enqueueing URLs or data without specifying complete as the value of their status attribute, you will usually want to set this to 0 so that your application can react to enqueues as soon as they are received. (See Setting Appropriate Crawl URL and Container Attributes for detailed information about the status attribute.)
The Delay option is located in the Crawling aggressiveness section of a collection's Configuration > Crawling tab in the Watson Explorer Engine administration tool. It can also be set by supplying a value for the delay curl-option.
Tip: One case in which you would not want to set this option to 0 is if you are writing a traditional pull or push-pull application, where URLs or data are enqueued and contain links that you want to crawl, and you do not want to overwhelm the remote server.
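For example, a push-style application that should react to enqueues immediately might remove the per-server delay as follows, while a traditional pull application would typically use a larger value:

    <curl-option name="delay">0</curl-option>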
- Number of converters: Specifies a set of integers, one per line, representing the number of converters that can be running at each priority level. There is no pre-defined number of priority levels; adding a new number to this list creates a new priority level.
Priority levels on enqueued data are typically set in the XML that is being enqueued.
In Watson Explorer Engine platform applications, this option is only relevant if you are (1) pushing in unconverted content or (2) enqueueing URLs whose content you want to retrieve, and want to use the Watson Explorer Engine converters to convert that content for you. If you want to use the Watson Explorer Engine converters, you should set the value of this crawl-option to the number of cores you have available.
This option is located in the Converting section of a collection's Configuration > Crawling tab in the Watson Explorer Engine administration tool. It can also be set by supplying values for the n-link-extractor crawl-option.
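For example, on a host with four available cores you might add the following inside the crawl-options element (the value is illustrative and should match the cores available on your system):

    <crawl-option name="n-link-extractor">4</crawl-option>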
- Total concurrent requests: Specifies the maximum number of URLs whose content can
be retrieved concurrently.
In Watson Explorer Engine platform applications, this option is only relevant if you are enqueueing URLs whose content you subsequently want to retrieve, but do not want to overwhelm the server from which that data is being retrieved. In that case, the Total concurrent requests option helps throttle requests to the remote server down to an acceptable rate.
This option is located in the Crawling aggressiveness section of a collection's Configuration > Crawling tab in the Watson Explorer Engine administration tool. It can also be set by supplying values for the n-fetch-threads crawl-option.
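For example, to cap the number of simultaneous retrievals, you might add the following inside the crawl-options element (the value is illustrative):

    <crawl-option name="n-fetch-threads">10</crawl-option>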
- Total requests per host: Specifies the maximum number of requests to send to the same server at one time. If a delay is also specified, this number of requests will be sent initially, and the delay will be applied after each returns.
In Watson Explorer Engine platform push-pull applications where URLs or data are programmatically enqueued and contain links that you want to crawl, this option enables you to throttle requests for data to a server so that you do not overwhelm a remote server.
This option is located in the Crawling aggressiveness section of a collection's Configuration > Crawling tab in the Watson Explorer Engine administration tool. It can also be set by supplying a value for the n-concurrent-requests curl-option.
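For example, to limit the crawler to a small number of simultaneous requests per server, you might add the following curl-option (the value is illustrative):

    <curl-option name="n-concurrent-requests">2</curl-option>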
- Cache content types: Specifies a list of MIME content types that will be cached
(preserved on the Watson Explorer Engine server) by Watson Explorer Engine for quick access
through the Cache link in search results that are displayed using a Watson Explorer Engine platform display. By default, the crawler caches any documents of type
text/html, text/plain, text/xml,
application/vxml-unnormalized, or application/vxml.
For optimal performance, you can disable caching in your Watson Explorer Engine platform application by modifying this option and removing any values that it contains. If you are using light-crawler mode, this option must be set to the empty string. (See Using Light Crawler Mode for more information about light-crawler mode.)
This option is located in the Converting section of a collection's Configuration > Crawling tab in the Watson Explorer Engine administration tool. It can also be set by supplying a value for the cache-types crawl-option.
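For example, to disable caching entirely (as required by light-crawler mode), you might set this option to the empty string:

    <crawl-option name="cache-types"></crawl-option>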
- Disable URL normalization: Determines whether an input URL should be parsed in order to normalize it. Normalizing a URL refers to adding missing information to the URL, such as the protocol; removing extraneous or unnecessary characters, such as HTML anchors or double slashes in the path specification; changing relative paths to absolute paths; and so on.
Setting this value to true will prevent the crawler from parsing input URLs, which can save processing time.
This option is located in the Advanced section of a collection's Configuration > Crawling tab in the Watson Explorer Engine administration tool. It can also be set by supplying a value for the disable-url-normalization crawl-option.
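For example, to skip URL parsing and normalization, you might add the following inside the crawl-options element:

    <crawl-option name="disable-url-normalization">true</crawl-option>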
- Remove logged input upon completion: Determines whether the log data associated
with each URL will be purged from the crawler's logs once the URL is successfully crawled.
This reduces the size of the crawler logs, but may complicate refresh operations in
traditional Watson Explorer Engine platform applications that retrieve URLs from a remote
source, and use this information to determine when the content associated with a URL was
last updated.
This option is a space-saving optimization for the logs, and is almost always used in Watson Explorer Engine API applications where indexable data is being directly enqueued to the crawler, because such applications do not perform the standard refresh operation - indexed data is typically identified through a unique key of some sort, and is updated by being resubmitted with that same key value.
This option is located in the Advanced section of a collection's Configuration > Crawling tab in the Watson Explorer Engine administration tool. It can also be set by supplying a value for the purge-input-xml curl-option.
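For example, an application that directly enqueues indexable data and does not rely on the standard refresh operation might enable this optimization as follows (a boolean value of true is assumed here to enable purging):

    <curl-option name="purge-input-xml">true</curl-option>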
- Audit log: Determines whether the crawler's audit log is enabled, and the
circumstances under which entries are added to that log. For detailed information about
using the audit log, see Using the Audit Log or the Watson Explorer Engine API Developer's Guide.
This option is located in the Advanced section of a search collection's Configuration > Crawling tab. It can also be set by supplying a value for the audit-log crawl-option.