Web crawler active sites

You can monitor the Web crawler to see information about the Web sites that the crawler is actively crawling.

When you view details about a Web crawler while monitoring a collection, you can view statistics about active sites. The statistics show:
  • How many URLs the crawler has currently loaded from its internal database into memory for crawling
  • How many URLs the crawler has attempted to crawl so far
  • How much time remains before a site is deactivated and removed from memory for this iteration of the crawler
  • How much time a site has been in memory so far

This information changes from moment to moment as the crawler progresses through the crawling rules that are configured for it. Ideally, the number of activated URLs is close to the value that is configured for the Maximum number of active hosts field in the crawler advanced properties.

If the number of activated URLs is near zero, then the crawler is not finding eligible URLs. Conditions that can cause such low activity include DNS lookup failures, network connectivity issues, database errors, and crawl space definition problems. For example:
  • If many sites have been in memory for a long time, and few URLs have been crawled, look for network connectivity problems.
  • If too few sites appear in the active sites list, look for crawl space definition problems or DNS lookup problems.
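
The two troubleshooting rules above can be sketched as a small check over a snapshot of the active site statistics. This is only an illustration, not part of the product: the ActiveSite record, its field names, the diagnose function, and all thresholds (including treating "many" as more than half of the sites) are assumptions chosen for the example. The crawler does not expose an API with these names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ActiveSite:
    """One row of active site statistics (hypothetical field names)."""
    host: str
    urls_activated: int       # URLs brought from the internal database into memory
    urls_attempted: int       # URLs the crawler has attempted to crawl so far
    seconds_in_memory: float  # how long the site has been in memory so far
    seconds_remaining: float  # time left before the site is deactivated


def diagnose(sites: List[ActiveSite],
             max_active_hosts: int,
             long_residency_seconds: float = 300.0,  # assumed threshold
             low_attempt_ratio: float = 0.1,         # assumed threshold
             low_fill_ratio: float = 0.25) -> List[str]:  # assumed threshold
    """Return troubleshooting hints derived from the two rules in the text."""
    hints: List[str] = []

    # Rule: too few sites in the list, relative to the configured maximum.
    if len(sites) < low_fill_ratio * max_active_hosts:
        hints.append("few active sites: check crawl space definition and DNS lookups")

    # Rule: many sites in memory for a long time with few URLs crawled.
    stalled = [
        s for s in sites
        if s.seconds_in_memory >= long_residency_seconds
        and s.urls_attempted < low_attempt_ratio * max(s.urls_activated, 1)
    ]
    if sites and len(stalled) > len(sites) / 2:
        hints.append("many stalled sites: check network connectivity")

    return hints
```

For example, a snapshot in which most sites have been in memory for ten minutes with only one of a hundred activated URLs attempted would produce the network connectivity hint. An empty snapshot would produce the crawl space and DNS hint. The thresholds should be tuned to the crawler's configured timeouts.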