Refine URLs

About this task

After a crawl has completed, the URLs that have been crawled can be browsed in the URLs subsection of the Live Status tab. The Staging Status tab includes a similar URLs subsection when the staging sub-collection is in use, such as while a crawl is underway. Navigating the crawled URLs lets you find problems in the data that you are crawling, verify that you have crawled the data that you expected to crawl, and generally understand what the crawler has done.

By default, all URLs associated with a collection are listed alphabetically in this view. Requests that do not contain a host, such as file requests, are grouped together in an entry labeled other. You can click any column heading to re-sort the list by the values in that column. For example, click the Completed header to sort the hosts by the number of URLs that have been successfully downloaded, which places the hosts with the most successfully crawled pages at the top. To drill down further into the data, click the value in the Completed column for the ibm.com host. After the list redisplays, you see a summary of the top-level folders and URLs on ibm.com that have been completed. You can click any of these folders to see the crawled URLs beneath it.

To find and eliminate problematic URL spaces, the following simple procedure is effective. Start the crawl. While the crawl is running, periodically sort the list of crawled URLs by the number of pending URLs. Problematic URL spaces are easy to spot because they show up as suspiciously large blocks of URLs. Look at any hosts that have far more URLs than expected, and drill down into their folders to find the offending script or folder.
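
To see why a runaway URL space stands out in this view, the following sketch (in Python, purely illustrative and not part of the product) groups a plain list of URLs by host the way the URLs view does, with host-less requests falling into an "other" bucket. The URLs in it are hypothetical examples.

    # Illustrative only: count URLs per host from a plain list of URLs,
    # roughly mimicking the per-host grouping in the URLs view.
    from collections import Counter
    from urllib.parse import urlsplit

    def count_by_host(urls):
        # Requests without a host (for example, file requests) go to "other".
        return Counter(urlsplit(u).hostname or "other" for u in urls)

    urls = [
        "http://example.com/a", "http://example.com/b",
        "http://portal.example.com/page.cgi?backurl=%2Fa",
        "http://portal.example.com/page.cgi?backurl=%2Fb",
        "file:///tmp/report.pdf",
    ]
    for host, count in count_by_host(urls).most_common():
        print(host, count)

A host whose count keeps growing far beyond the number of real pages on the site is the usual sign of one of the problems described below.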

Common reasons to encounter large URL spaces are the following:

Procedure

  1. state parameters - Portal software often encodes state information as CGI parameters. For example, a portal may track navigation by encoding the current URL as a CGI parameter in every link on a page. Other common examples are folder navigation (each folder on a page is marked as either open or closed, so the crawler can end up visiting every possible combination of open and closed folders; with just 10 folders that is already 2^10 = 1,024 distinct URLs for the same page), sorting options, color-scheme settings, and so on.

    To remove such problems, add a new custom conditional setting that matches all of the problematic URLs as specifically as possible. It could be a host rule if the entire site is a portal, or the URL itself if a single script is causing the problem. In the configuration of this rule, add the names of the offending CGI parameters to the CGI Parameters to remove box in the Normalization section (see the sketch after this list).

  2. calendars - Calendars, room reservation systems, and similar applications contain an essentially infinite amount of data (for example, the information that nothing was scheduled to happen at your organization on September 1, 1970). To avoid crawling calendars, add a global condition that disallows the scripts (URLs) used by the calendar application.
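
The two fixes above amount to normalizing away state-only CGI parameters and refusing calendar-style URLs outright. The following sketch shows the effect of both in plain Python; the parameter names and URL patterns are hypothetical examples, not the crawler's actual configuration syntax, which is set through the rules described above.

    # Illustrative only: hypothetical parameter names and URL patterns,
    # not the crawler's configuration format.
    import re
    from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

    # CGI parameters that only encode page state and can be dropped.
    STATE_PARAMS = {"backurl", "sortorder", "colorscheme", "treestate"}

    # Scripts or folders that should not be crawled at all.
    DISALLOW_PATTERNS = [re.compile(r"/calendar\.cgi"), re.compile(r"/roomres/")]

    def normalize(url):
        # Drop state-only parameters so equivalent pages collapse to one URL.
        parts = urlsplit(url)
        kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
                if k.lower() not in STATE_PARAMS]
        return urlunsplit(parts._replace(query=urlencode(kept)))

    def allowed(url):
        # Reject URLs that match a disallowed script or folder.
        return not any(p.search(url) for p in DISALLOW_PATTERNS)

    print(normalize("http://portal.example.com/page.cgi?id=42&backurl=%2Fhome&sortorder=desc"))
    # -> http://portal.example.com/page.cgi?id=42
    print(allowed("http://example.com/calendar.cgi?date=1970-09-01"))
    # -> False

Normalization collapses the many state-parameter variants of a page into a single URL, while the disallow patterns keep the crawler out of the calendar's unbounded URL space entirely.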

Results

If you change the configuration while the crawler is running, you must restart the crawl for the changes to take effect. Disallowing URLs has no effect on URLs that have already been crawled, but it does affect URLs that are currently pending and URLs that are discovered later. A warning message appears under the main tabs to alert you to the configuration updates and to recommend restarting the crawl. Click the Restart link in the warning message to restart the crawl.

To proceed with the tutorial, click Scheduling a Crawl or Refresh.