The behavior of a crawler is based on a system of conditional settings, filters and allowed URLs. This system provides a great deal of flexibility in crawling. To make the configuration more manageable, the system is preconfigured with Components that handle many of the common configuration issues. More information about crawler configuration is available in Crawling, Seeds, and Connectors.
To view the configuration that has been created to this point, go to the Configuration >> Crawling section of the sample Search Collection. When you added the Seed URL, a component reference was created with your values filled in. To see the actual implementation of this component, click the view resolved link to the right of the component URLs. This expands the component reference and displays the implementation. In this case, two top-level rules are added to the configuration (they are marked in yellow to indicate that they are the result of the resolution). The first rule, the Seed URLs rule, instructs the crawler to add these URLs to its set of starting URLs. The next rule instructs the crawler that if a URL has an http or https protocol and the host in the URL is either ibm.com or anything ending in .ibm.com, then the URL should be marked as allowed.
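To see what these two resolved rules amount to, the following minimal Python sketch mirrors their intent outside the product (the seed URL and helper names are illustrative, not the crawler's actual API or configuration syntax):

```python
from urllib.parse import urlparse

SEED_URLS = ["http://www.ibm.com/"]   # illustrative seed; substitute the Seed URL you entered

def is_allowed(url: str) -> bool:
    """Mirror the second resolved rule: allow http/https URLs on ibm.com or *.ibm.com."""
    parts = urlparse(url)
    host = (parts.hostname or "").lower()
    return parts.scheme in ("http", "https") and (
        host == "ibm.com" or host.endswith(".ibm.com")
    )

# The crawler starts from the seeds and only follows URLs that a rule marks as allowed.
frontier = [u for u in SEED_URLS if is_allowed(u)]
```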
The Default allow setting in each rule defines whether URLs that are not filtered by the rule should be allowed (crawled), disallowed (not crawled), or log-disallowed (logged but not crawled). In the default global configuration, the default allow option is set to disallow, so all URLs that are not otherwise marked as allow will not be crawled. The recursive rules may mark URLs as allow or disallow, and the last such setting takes precedence. In addition to the allow and disallow settings, URLs may be filtered outright, in which case they are never crawled regardless of the allow settings. After examining these rules, you can click any of the normal view links to hide the component resolution.
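The precedence just described can be summarized in a short sketch (illustrative pseudologic in Python; the rule and filter contents are placeholders, not the product's internal representation):

```python
def decide(url, rules, filters, default="disallow"):
    """rules: ordered (matches, action) pairs, where action is "allow" or "disallow".
    filters: predicates that absolutely exclude a URL, regardless of allow settings."""
    if any(matches(url) for matches in filters):
        return "disallow"              # filtered URLs are never crawled
    decision = default                 # global default in this tutorial: disallow
    for matches, action in rules:
        if matches(url):
            decision = action          # the last matching rule takes precedence
    return decision
```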
Other rules expressed as Conditional Settings provide defaults for specific types of files and URLs. To see an example, click view resolved beside the Component: Binary file extensions (filter). This component expands to a single recursive rule that filters the URLs based on their filename extension. Any URL disallowed by a filter rule will not be crawled even if it is marked as allow. Click normal view to hide the component resolution.
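As a rough picture of what such a filter does, the sketch below disallows URLs by extension (the extension list is only a guess at typical binary types, not the component's actual list):

```python
BINARY_EXTENSIONS = {".exe", ".zip", ".gz", ".iso", ".dmg"}   # illustrative subset

def binary_file_filter(url: str) -> bool:
    """True if the URL's path ends in a filtered (binary) extension."""
    path = url.split("?", 1)[0].split("#", 1)[0].lower()
    return any(path.endswith(ext) for ext in BINARY_EXTENSIONS)

# Used as one of the `filters` in the decide() sketch above: a match means the URL
# is not crawled, even if an allow rule marked it as allow.
```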
New conditions can be added by clicking the Add a new condition button and selecting conditions from the scrollable list. This button appears in any place where it is possible to add a condition, which enables you to build nested rules like the Crawl Limits condition added by the Seed URLs component.
As an example, click Add a new condition, select Custom conditional settings from the scrollable list that displays, and click Add. The condition is either inclusive (Conditions apply for a...) or exclusive (Conditions apply except for a...). You can select the portion of the URL on which to match (url, host, port, path, and so on) and provide wildcards, a regular expression, or a case-insensitive regular expression to identify matches for the condition.
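Conceptually, each such condition is a small predicate over one part of the URL. The sketch below is one way to picture it (the field names and helpers are assumptions for illustration, not the product's configuration format):

```python
import re
from fnmatch import fnmatch
from urllib.parse import urlparse

def make_condition(field, pattern, kind="wildcard", exclusive=False):
    """Build a predicate over one URL component (e.g. "hostname", "path", "port").

    kind: "wildcard", "regex", or "regex-i" (case-insensitive regex).
    exclusive=True inverts the match ("Conditions apply except for a...")."""
    def predicate(url: str) -> bool:
        value = str(getattr(urlparse(url), field, "") or "")
        if kind == "wildcard":
            hit = fnmatch(value, pattern)
        else:
            flags = re.IGNORECASE if kind == "regex-i" else 0
            hit = re.search(pattern, value, flags) is not None
        return not hit if exclusive else hit
    return predicate

# "Conditions apply for a host matching one of the wildcards: ibm.com"
ibm_only = make_condition("hostname", "ibm.com")
```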
You can recursively specify custom rules and effectively configure the crawler on a per-URL basis. For example, to crawl two URLs at a time from ibm.com, you would select Conditions apply for a, select host, select One of the wildcards, and enter ibm.com in the box. This causes the rule to apply only to URLs on the host ibm.com.
To complete this rule, scroll down to the Crawling aggressiveness section and enter 2 in the box for Concurrent requests to the same host. Scroll back up and click OK to save the new rule. Below your new rule, there is a link that enables you to add a new Custom conditional setting as a sub-condition of the rule that you just defined. To add a sub-condition, click Add a new sub-condition, choose URL filter from the scrollable list, and click Add. In the box under URLs may not have a url matching one of the wildcards, enter *faq* and click OK. You have now created a set of conditional settings that affect URLs on the host ibm.com by increasing the load placed on that server and disallowing any URLs that contain the substring faq. These rules have no effect on URLs on any other hosts (if others were defined), because the filter is constrained by the encompassing condition, which in this case limits the sub-condition to URLs on ibm.com.
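Putting the pieces of this example together, the nested rule behaves roughly as the following sketch shows (illustrative Python, not the crawler's configuration syntax):

```python
from fnmatch import fnmatch
from urllib.parse import urlparse

def conditional_settings(url: str) -> dict:
    """Sketch of the example: on host ibm.com, raise concurrency to 2 and
    filter out any URL containing the substring 'faq'."""
    host = (urlparse(url).hostname or "").lower()
    if host != "ibm.com":                  # encompassing condition: other hosts untouched
        return {}
    if fnmatch(url, "*faq*"):              # sub-condition: URL filter
        return {"filtered": True}          # never crawled, even if otherwise allowed
    return {"concurrent_requests_to_host": 2}

print(conditional_settings("http://ibm.com/support/faq.html"))    # {'filtered': True}
print(conditional_settings("http://ibm.com/products/"))           # {'concurrent_requests_to_host': 2}
print(conditional_settings("http://other.example.com/faq.html"))  # {}
```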
To proceed with the tutorial, click Refine URLs.