After a crawl has completed, the URLs that have been crawled can be browsed in the URLs subsection of the Live Status tab. The Staging Status tab includes a similar URLs subsection when the staging sub-collection is in use, such as while a crawl is underway. Browsing the crawled URLs enables you to find problems in the data that you are crawling, verify that you have crawled the data that you expected, and generally understand what the crawler has done.
By default, all URLs associated with a collection are listed alphabetically by host on this tab. All requests that do not contain a host, such as file requests, are grouped together in an entry labeled other. You can click any column heading to re-sort the list by the values in that column. For example, click the Completed heading to sort the hosts by the number of URLs that have been successfully downloaded; this view shows which hosts contain the largest number of successfully crawled pages. To drill down further, click the value in the Completed column for the ibm.com host. After the list is redisplayed, you see a summary of the top-level folders and URLs on ibm.com that have been completed. You can click any of these folders to see the crawled URLs beneath it.
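The re-sorting described above is just ordering the per-host summary rows on one column. As a minimal sketch (the host names and counts below are invented for illustration, not taken from a real crawl):

```python
# Hypothetical per-host crawl summary, mirroring the columns shown in
# the URLs view (host, completed, pending).
hosts = [
    {"host": "ibm.com", "completed": 5400, "pending": 120},
    {"host": "example.org", "completed": 300, "pending": 40},
    {"host": "other", "completed": 25, "pending": 0},  # requests with no host
]

# Re-sorting on the Completed column, descending, as clicking the
# Completed heading does in the UI:
by_completed = sorted(hosts, key=lambda row: row["completed"], reverse=True)
print([row["host"] for row in by_completed])  # ['ibm.com', 'example.org', 'other']
```

The host with the most successfully downloaded pages sorts to the top, which is the view used for drilling down in the walkthrough above.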
To find and eliminate problematic URL spaces, the following simple procedure is quite effective. Start the crawl. While the crawl is running, periodically sort the crawled URLs by the number of pending URLs. Problematic URL spaces are easy to spot because they produce a suspiciously large block of pending URLs. Examine any host that has far more URLs than expected, and drill down into its folders to find the offending script or folder.
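The check behind this procedure can be sketched as flagging any host whose pending count is far out of line with the rest. A minimal illustration, assuming a hypothetical snapshot of pending counts per host (the hosts, the counts, and the factor-of-ten threshold are all invented for this example):

```python
from statistics import median

# Hypothetical snapshot of pending-URL counts per host, as might be
# read off the URLs view while a crawl is running.
pending_by_host = {
    "www.example.com": 1200,
    "docs.example.com": 950,
    "forum.example.com": 48000,  # suspiciously large pending queue
    "blog.example.com": 800,
}

def flag_suspicious_hosts(pending, factor=10):
    """Return hosts whose pending count exceeds `factor` times the median."""
    threshold = factor * median(pending.values())
    return sorted(host for host, count in pending.items() if count > threshold)

print(flag_suspicious_hosts(pending_by_host))  # ['forum.example.com']
```

A host flagged this way is where you would drill down into folders to locate the script or folder generating the runaway URL space.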
Common reasons to encounter large URL spaces are the following:
If you change the configuration while the crawler is running, you must restart the crawl to incorporate the changes. Disallowing URLs has no effect on URLs that have already been crawled, but it does affect URLs that are currently pending and URLs that are discovered later. A warning message appears under the main tabs to alert you to configuration updates and to recommend restarting the crawl to incorporate them. Click the Restart link in the warning message to restart the crawl.
To proceed with the tutorial, click Scheduling a Crawl or Refresh.