Web sites protected by form-based authentication

If a Web server uses HTML forms to restrict access to Web sites, you can specify authentication credentials that enable the Web crawler to access password-protected pages.

To determine whether a user (or client application) has permission to access pages on a Web site, many Web servers use HTML forms to establish the user's identity. Typically, this exchange is interactive (a sketch of the flow follows this list):
  • When an HTTP user agent (such as a Web browser) requests a page that is protected by form-based authentication, the Web server checks to see whether the request includes a cookie that establishes the user's identity.
  • If the cookie is not present, the Web server prompts the user to enter security data into a form. When the user submits the form, the Web server returns the required cookies, and the request for the password-protected page is allowed to proceed.
  • Future requests that include the required cookies are also allowed to proceed. The authorized user is able to access additional restricted content without being asked to fill in a form and specify credentials with each request.
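The following minimal sketch shows this flow with a standard Java HTTP client. The URLs, field names (j_username, j_password), and credentials are hypothetical placeholders; the real values come from the login form of the site being crawled.

  import java.net.CookieManager;
  import java.net.URI;
  import java.net.http.HttpClient;
  import java.net.http.HttpRequest;
  import java.net.http.HttpResponse;

  public class FormLoginSketch {
      public static void main(String[] args) throws Exception {
          // The CookieManager stores any session cookies that the server sets,
          // so later requests are allowed to proceed without another login.
          HttpClient client = HttpClient.newBuilder()
                  .cookieHandler(new CookieManager())
                  .followRedirects(HttpClient.Redirect.NORMAL)
                  .build();

          // 1. Request a protected page; without a session cookie the server
          //    typically answers with (or redirects to) the login form.
          HttpResponse<String> first = client.send(
                  HttpRequest.newBuilder(URI.create("https://example.com/protected/page.html")).GET().build(),
                  HttpResponse.BodyHandlers.ofString());
          System.out.println("Before login: " + first.statusCode());

          // 2. Submit the login form; the field names must match the form exactly,
          //    including any hidden fields that the form contains.
          String formBody = "j_username=crawler&j_password=secret";
          HttpResponse<String> login = client.send(
                  HttpRequest.newBuilder(URI.create("https://example.com/login/j_security_check"))
                          .header("Content-Type", "application/x-www-form-urlencoded")
                          .POST(HttpRequest.BodyPublishers.ofString(formBody))
                          .build(),
                  HttpResponse.BodyHandlers.ofString());
          System.out.println("Login response: " + login.statusCode());

          // 3. Repeat the original request; the stored cookie now identifies
          //    the client, so the protected page is returned.
          HttpResponse<String> second = client.send(
                  HttpRequest.newBuilder(URI.create("https://example.com/protected/page.html")).GET().build(),
                  HttpResponse.BodyHandlers.ofString());
          System.out.println("After login: " + second.statusCode());
      }
  }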

Because the Web crawler does not run interactively, the credentials that enable it to crawl password-protected pages must be specified before the crawler begins crawling. When you create a Web crawler or edit the crawl space, specify information about each secure Web site that needs to be crawled.

The fields that you specify correspond to the fields that an interactive user fills in when prompted by the Web browser, and to any hidden or static fields that are required for a successful login.
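As an illustration only, the following sketch shows the kind of information that might be captured for one protected site: the URL pattern the login applies to, the form action and HTTP method, and every form field with its value. The field names (userid, password, login-form-type) are hypothetical; the real names come from the site's HTML login form.

  import java.util.LinkedHashMap;
  import java.util.Map;

  public class FormCredentials {
      final String urlPattern;   // URLs this login applies to
      final String formAction;   // where the login form is submitted
      final String httpMethod;   // usually POST
      final Map<String, String> formFields = new LinkedHashMap<>();

      FormCredentials(String urlPattern, String formAction, String httpMethod) {
          this.urlPattern = urlPattern;
          this.formAction = formAction;
          this.httpMethod = httpMethod;
      }

      public static void main(String[] args) {
          FormCredentials site = new FormCredentials(
                  "https://example.com/hr/*",
                  "https://example.com/hr/login.do",
                  "POST");
          // Fields that an interactive user would type in ...
          site.formFields.put("userid", "crawler");
          site.formFields.put("password", "secret");
          // ... plus any hidden or static fields that the form requires.
          site.formFields.put("login-form-type", "pwd");
          System.out.println(site.urlPattern + " -> " + site.formFields.keySet());
      }
  }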

To specify this information, you must work closely with the administrators for the Web sites or Web servers that are protected by form-based authentication. They must provide you with the security requirements for the Web sites to be crawled, including all information that is used to authenticate the Web crawler's identity and determine that the crawler has permission to crawl the restricted pages.

The order of the URL patterns is important. After you add information about a password-protected Web site, you must position it in the order that you want the crawler to process it. List the more specific URL patterns first, and put the more generic URL patterns lower in the list. When the Web crawler evaluates a candidate URL, it uses the form data that is specified for the first URL pattern in the list that matches the candidate URL.
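The following sketch illustrates the first-match rule: patterns are evaluated in the order they are listed, so a more specific pattern must appear before a more generic one or it will never be used. The prefix matching and the example patterns are illustrative assumptions, not the crawler's actual matching algorithm.

  import java.util.LinkedHashMap;
  import java.util.Map;

  public class UrlPatternOrder {
      public static void main(String[] args) {
          // An ordered map: insertion order is the evaluation order.
          Map<String, String> formDataByPattern = new LinkedHashMap<>();
          formDataByPattern.put("https://example.com/hr/payroll/", "payroll-login-form");
          formDataByPattern.put("https://example.com/hr/", "hr-login-form");

          String candidate = "https://example.com/hr/payroll/2024.html";
          for (Map.Entry<String, String> entry : formDataByPattern.entrySet()) {
              if (candidate.startsWith(entry.getKey())) {
                  // The first matching pattern wins; later, more generic
                  // patterns are ignored for this candidate URL.
                  System.out.println("Using form data: " + entry.getValue());
                  break;
              }
          }
      }
  }

If the two patterns were listed in the opposite order, the generic hr pattern would match every payroll URL first, and the payroll-specific form data would never be used.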

Using a plug-in to crawl secure WebSphere Portal sites

If global security is enabled in WebSphere® Application Server, and you want to crawl secure WebSphere Portal sites with the Web crawler, you must create a crawler plug-in to handle the form-based authentication requests. For a discussion about form-based authentication and a sample program that you can adapt for your custom Web crawler plug-in, see http://www.ibm.com/developerworks/db2/library/techarticle/dm-0707nishitani.

The plug-in is required if you use the Web crawler to crawl any sites through WebSphere Portal, including IBM® Web Content Manager sites and Quickr® for WebSphere Portal sites.