Web sites protected by HTTP basic authentication

If a Web server uses HTTP basic authentication to restrict access to Web sites, you can specify authentication credentials that enable the Web crawler to access password-protected pages.

To determine whether a user (or client application) has permission to access pages on a Web site, many Web servers use a client authentication scheme called HTTP basic authentication to establish the user's identity. Typically, this exchange is interactive:
  • When an HTTP user agent (such as a Web browser) requests a page that is protected by HTTP basic authentication, the Web server responds with a 401 status code, which indicates that the requester is not authorized to access the requested page.
  • The Web server also challenges the requester to present credentials that can be used to verify whether the user is allowed to access the restricted content.
  • The Web browser presents the user with a dialog that requests a user name, password, and any other information that is required to constitute the user's credentials.
  • The Web browser encodes the credentials (using Base64), then includes them in the Authorization header when it repeats the request for the protected page.
  • If the credentials are valid, the Web server responds with a 200 status code and the contents of the requested page.
  • Subsequent requests for pages from the same Web server typically include the same credentials, which enables the authorized user to access additional restricted content without specifying additional credentials.

    After a user's identity is established, the Web server and HTTP user agent typically exchange tokens, called cookies, that enable knowledge of the user's login status to be maintained between HTTP requests.
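
The exchange described above can be sketched in a few lines of Python. This is a minimal illustration of the encoding step and of how a client might recognize the challenge, not the Web crawler's actual implementation. The function names basic_auth_header and is_basic_challenge are hypothetical, and the sample user name and password are the ones used in the HTTP basic authentication specification.

```python
import base64


def basic_auth_header(user: str, password: str) -> str:
    """Build the Authorization header value for HTTP basic authentication.

    The user name and password are joined with a colon and Base64-encoded.
    Base64 is an encoding, not encryption, so the value is only as private
    as the connection that carries it.
    """
    raw = f"{user}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


def is_basic_challenge(status: int, headers: dict) -> bool:
    """Return True if a response is a 401 that asks for basic credentials."""
    challenge = headers.get("WWW-Authenticate", "")
    return status == 401 and challenge.lower().startswith("basic")


# Step 1: the first request is rejected with a 401 and a challenge.
first_response = (401, {"WWW-Authenticate": 'Basic realm="Restricted"'})

# Step 2: the client repeats the request with encoded credentials.
if is_basic_challenge(*first_response):
    retry_headers = {
        "Authorization": basic_auth_header("Aladdin", "open sesame")
    }
    print(retry_headers["Authorization"])
```

A non-interactive client has no dialog in which to collect the user name and password, so it needs both values on hand before the first request is sent.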

Because the Web crawler does not run interactively, the credentials that enable it to crawl password-protected pages must be specified before the crawler begins crawling. When you create a Web crawler or edit the crawl space, specify information about each secure Web site that needs to be crawled.

To specify this information, you must work closely with the administrators for the Web sites or Web servers that are protected by HTTP basic authentication. They must provide you with the security requirements for the Web sites to be crawled, including all information that is used to authenticate the Web crawler's identity and determine that the crawler has permission to crawl the restricted pages.

If security was enabled for the collection when the collection was created, you can specify security tokens, such as user IDs, group IDs, or user roles, to control access to documents when you configure the crawler. The Web crawler associates these security tokens with every document that it crawls in the file system tree for the specified root URL. The tokens are used in addition to any document-level security tokens that you configure for the entire Web crawl space.

The order of the URLs is important. After you add information about a password-protected Web site, you must move it to the position in the list where you want the crawler to evaluate it. List the more specific URLs first, and put the more generic URLs lower in the list. When the Web crawler evaluates a candidate URL, it uses the authentication data that is specified for the first URL in the list that matches the candidate URL.
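
The following Python sketch illustrates why the order matters under a first-match rule. It assumes that a list entry matches a candidate URL when the candidate begins with the entry's URL. The crawler's actual matching rule may differ, and the AuthRule type, the find_credentials function, the example.com URLs, and the user names are all hypothetical.

```python
from typing import NamedTuple, Optional, Sequence


class AuthRule(NamedTuple):
    """One entry in the ordered list of password-protected sites."""
    url_prefix: str
    user: str
    password: str


def find_credentials(rules: Sequence[AuthRule],
                     candidate_url: str) -> Optional[AuthRule]:
    """Return the first rule whose URL matches the candidate, or None.

    Assumption: "matches" means simple prefix matching.
    """
    for rule in rules:
        if candidate_url.startswith(rule.url_prefix):
            return rule
    return None


# More specific URLs first, more generic URLs lower in the list.
rules = [
    AuthRule("http://www.example.com/hr/payroll/", "payroll_crawler", "pw1"),
    AuthRule("http://www.example.com/hr/", "hr_crawler", "pw2"),
    AuthRule("http://www.example.com/", "general_crawler", "pw3"),
]

match = find_credentials(rules, "http://www.example.com/hr/payroll/2024.html")
print(match.user)

# With the order reversed, the generic entry shadows the specific ones.
shadowed = find_credentials(list(reversed(rules)),
                            "http://www.example.com/hr/payroll/2024.html")
print(shadowed.user)
```

In the correctly ordered list, the payroll page is matched by the payroll entry. In the reversed list, the same page is matched by the generic site entry, so the credentials configured for the payroll pages are never used.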