Creating Web crawler reports

By viewing reports about past Web crawler activity, you can assess overall performance and adjust the Web crawler properties and crawl space definitions as necessary.

Before you begin

If your administrative role limits you to monitoring collections, you can view crawler statistics and create reports about crawler activity, but you cannot change the crawler's behavior (such as starting or stopping the crawler).

About this task

Several types of reports provide information about Web crawler activity. For some types of reports, information is returned as fast as it can be collected from the crawler's internal database. The Site report and the HTTP status code report take longer to create. If you create either of these reports, you can specify an email address for receiving the report instead of waiting for results to be returned to the administration console.

Procedure

To create Web crawler reports:

  1. Expand the collection that owns the Web crawler that you want to monitor and go to the Crawl and Import pane.
  2. If the Web crawler that you want to create reports for is running or paused, click the icon to monitor details about the content crawled by the crawler.
  3. On the details page for the Web crawler, select an option for the type of report that you want to create:
    • In the Crawler status summary area, click Crawler history to create reports about the crawler and all of the sites that it discovers or crawls.
    • In the URL status area, specify the URL of the specific site that you want to create a report for, and then click Site details.
  4. For both crawler history and site reports, select the check box for each statistic that you want to see in the report, and then click View report.

    For these types of statistics, the crawler returns a report to the administration console as fast as it can retrieve information from its internal database.

  5. If you are creating a crawler history report, you can specify options for creating a Site report, and then click Run Report.

    This report is created with the statistics that you choose to include and is saved in a file that you specify (you must specify an absolute path for the file). You can specify that you want to receive email after the report is created.

  6. If you are creating a crawler history report, you can specify options for creating an HTTP status code report, and then click Run Report.

    This report shows how HTTP status codes are distributed for each site. The report is saved in a file that you specify (you must specify an absolute path for the file). You can specify that you want to receive email after the report is created.

    Use this report to see which sites return a large number of 4xx status codes (client errors, such as pages that were not found), 5xx status codes (which indicate a server problem), 6xx status codes (which indicate connectivity problems), and so on.

    This report is most useful after the crawler has been active for some time (for example, for several weeks). It can help you identify vanished sites, newly arrived sites, sites with huge numbers of URLs (which might indicate redundant crawling of a Lotus Notes® database), and sites with a recursive file system served by the HTTP server. If the sites with large numbers of HTTP status codes are not contributing to the index, you can improve the performance of the crawler by removing the sites from the crawl space.
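
The kind of review that the HTTP status code report supports can be sketched in a few lines of Python. This is an illustration only, not part of the product: the layout of the saved report file is not described here, so the sketch assumes that the per-site counts have already been extracted into (site, status code, count) tuples. The site names, the counts, the 680 status code, the function names, and the 50% threshold are all hypothetical. Treating a high share of 4xx, 5xx, and 6xx codes as the signal for review is also an assumption of this sketch.

```python
from collections import Counter


def status_class(code: int) -> str:
    """Map a status code to its class label, for example 404 -> '4xx'."""
    return f"{code // 100}xx"


def summarize_sites(records):
    """Group (site, status_code, count) tuples into per-site class totals."""
    summary = {}
    for site, code, count in records:
        summary.setdefault(site, Counter())[status_class(code)] += count
    return summary


def flag_problem_sites(summary, threshold=0.5,
                       problem_classes=("4xx", "5xx", "6xx")):
    """Return (site, ratio) pairs whose share of problem codes meets the
    threshold, sorted with the worst site first."""
    flagged = []
    for site, counts in summary.items():
        total = sum(counts.values())
        if total == 0:
            continue
        bad = sum(counts[c] for c in problem_classes)
        ratio = bad / total
        if ratio >= threshold:
            flagged.append((site, ratio))
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Hypothetical sample data; real values would come from the saved report.
records = [
    ("www.example.com", 200, 950),
    ("www.example.com", 404, 50),
    ("old.example.org", 404, 400),
    ("old.example.org", 500, 100),
    ("old.example.org", 200, 20),
    ("down.example.net", 680, 30),
]

summary = summarize_sites(records)
flagged = flag_problem_sites(summary)

for site, ratio in flagged:
    print(f"{site}: {ratio:.0%} of responses are 4xx, 5xx, or 6xx")
```

With this sample data, old.example.org and down.example.net are flagged and www.example.com is not. Whether a flagged site should be removed from the crawl space still depends on whether it contributes to the index, so the output is a list of candidates to review rather than a decision.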