Recursive Conditions

The heart of the crawler configuration is a set of recursive URL rules, which are generated by the crawler configuration components. The view resolved button on each component reference displays the actual recursive rules in use. Most crawler options can be specified in the recursive rules, so different subsets of URLs can be handled differently.

For every URL that the crawler considers crawling, the following process is applied:

  • All global options are assigned.
  • All rules are processed (in order) as follows:
    • If it is a "Conditions apply for" rule and the condition is matched by the URL, then all of the rule's options are assigned and processing continues with each of the rules contained therein. If the condition is not matched, none of the rules contained within this rule are evaluated.
    • If it is a "Conditions apply except for" rule and the condition is not matched by the URL, then all of the rule's options are assigned and processing continues with each of the rules contained therein. If the condition is matched, none of the rules contained within this rule are evaluated.
    • If it is a "URLs may not have" rule and the condition is matched by the URL, then the URL is marked as filtered.
    • If it is a "URLs must have" rule and the condition is not matched by the URL, then the URL is marked as filtered.
  • If the URL was filtered, it will not be crawled. If one or more of the rules that filtered the URL enabled the "Log URLs that don't pass this condition" flag, the URL will be logged in the URLs section. If none of those rules set this flag, no record of the URL will be kept.
  • Otherwise, the default-allow option determines the disposition of the URL:
    • allow - the URL will be crawled.
    • disallow - the URL will not be crawled and no record of it will be kept.
    • disallow-log - the URL will not be crawled and the reason will be recorded in the URLs section.
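
The process above can be sketched in a few lines of Python. This is only an illustrative model, not the crawler's implementation: the Rule class, the kind names ("apply-for", "apply-except", "may-not-have", "must-have"), the "delay" option, the "filtered" and "filtered-log" results, and the use of shell-style wildcards for conditions are all assumptions made for the example. Only the default-allow option and its three values come from the description above.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatchcase


@dataclass
class Rule:
    # Hypothetical model of one recursive rule; names are illustrative only.
    kind: str                      # "apply-for", "apply-except", "may-not-have", "must-have"
    pattern: str                   # condition, modeled here as a shell-style wildcard
    options: dict = field(default_factory=dict)
    children: list = field(default_factory=list)
    log_filtered: bool = False     # models "Log URLs that don't pass this condition"


def evaluate(url, global_options, rules):
    """Return (disposition, resolved_options) for one candidate URL."""
    options = dict(global_options)  # step 1: assign all global options
    filtered = False
    logged = False

    def process(rule_list):
        nonlocal filtered, logged
        for rule in rule_list:      # step 2: process rules in order
            matched = fnmatchcase(url, rule.pattern)
            if rule.kind in ("apply-for", "apply-except"):
                # "apply-for" fires on a match, "apply-except" on a non-match.
                if matched == (rule.kind == "apply-for"):
                    options.update(rule.options)
                    process(rule.children)
                # Otherwise the contained rules are skipped entirely.
            elif rule.kind in ("may-not-have", "must-have"):
                # "may-not-have" filters on a match, "must-have" on a non-match.
                if matched == (rule.kind == "may-not-have"):
                    filtered = True
                    logged = logged or rule.log_filtered

    process(rules)
    if filtered:                    # step 3: filtered URLs are never crawled
        return ("filtered-log" if logged else "filtered"), options
    return options["default-allow"], options  # step 4: fall back to default-allow


GLOBAL_OPTIONS = {"default-allow": "disallow-log", "delay": 1}
RULES = [
    Rule("apply-for", "*://example.com/*",
         options={"default-allow": "allow", "delay": 5},
         children=[
             Rule("may-not-have", "*/private/*", log_filtered=True),
             Rule("apply-for", "*.pdf", options={"delay": 10}),
         ]),
]

for candidate in ("http://example.com/docs/a.pdf",
                  "http://example.com/private/x.html",
                  "http://other.org/"):
    print(candidate, evaluate(candidate, GLOBAL_OPTIONS, RULES))
```

In this sketch the PDF URL is allowed with the nested rule's delay, the private URL is filtered and logged, and the URL outside example.com never enters the outer rule, so the global default-allow value decides its disposition.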

It follows from this process that when an option is set multiple times, the last setting takes precedence. The interface provides no way to reorder the components, but the configuration's XML editing mode lets you reorder them by hand.
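
The effect of ordering can be illustrated with a small Python sketch. It assumes that option assignment behaves like successive dictionary updates; the function name resolve_options and the "delay" option are hypothetical and are not part of the crawler.

```python
def resolve_options(global_options, matched_rule_options):
    """Apply each matching rule's options in order; later settings overwrite earlier ones."""
    resolved = dict(global_options)
    for rule_options in matched_rule_options:
        resolved.update(rule_options)
    return resolved


site_wide = {"delay": 5}     # options from a broad rule (hypothetical)
pdf_only = {"delay": 10}     # options from a narrower rule (hypothetical)

# The same two rules give different results depending on their order.
print(resolve_options({"delay": 1}, [site_wide, pdf_only]))
print(resolve_options({"delay": 1}, [pdf_only, site_wide]))
```

With the broad rule first, the narrower rule's value survives; with the order reversed, the broad rule overwrites it, which is why reordering components by hand can matter.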