Crawl complex form-based authentication Web sites using IBM OmniFind Enterprise Edition

Creating a web crawler plug-in to preprocess documents

Because many organizations have Web-based intranet sites, the web crawler is one of the most prominent features of IBM® OmniFind™ Enterprise Edition. Most organizations have some sort of secure front-end to some or all of the content of these Web sites. One such front-end involves form-based authentication (FBA), a process that allows the user to enter authentication information through an HTML form. Although the web crawler can crawl some sites with FBA, there is no standard way to implement FBA, and there are many products and solutions available that provide FBA mechanisms. Some FBA mechanisms can induce redirects, use non-standard return codes, use multiple cookies, and so on. Therefore, the web crawler FBA settings do not provide support for all available FBA implementations. However, it is possible to write a prefetch web crawler plug-in to negotiate through most complex FBA mechanisms. This article will show you how to negotiate certain FBA mechanisms and how to write a web crawler plug-in that does so.

Koichi Nishitani (knish@jp.ibm.com), Software Engineer, IBM

Koichi Nishitani is a software engineer at IBM Japan's Yamato Lab in Yamato, Japan. He is a developer on the IBM OmniFind Enterprise Edition team. You can reach Koichi at knish@jp.ibm.com.



05 July 2007

To get the most out of this article, you should have a good understanding of Java™ programming, and some understanding of HTML and HTTP.

Introduction: What is form-based authentication?

FBA is a term used to describe any authentication mechanism in which a Web form is used to enter authentication information. There is no single standard implementation of FBA; a multitude of implementations exist. Some implementations of FBA simply send authentication information to the server and return a cookie; others are more elaborate and involve numerous redirects to different pages, or require that the user obtain a cookie before sending authentication information to a server.

This is an example of a simple HTML Web form used to send authentication information to a server:

Listing 1. Authentication form for a Web page
<form name="loginform" action="j_security_check" method="POST">
        Username: <input type=text name="username"> <br>
        Password: <input type=password name="password"> <br>
        <input type=submit>
</form>

This form sends the username and password to a servlet called j_security_check, as defined by the action URL in the opening <form> tag. Servlets are programs that run on the server; in this case, the servlet checks the username and password of the user. The form also defines the method used to send the authentication request to the server (POST) and the fields that the user must fill in: username and password.

In the simplest situation, the server does not care whether the user enters the authentication information in a Web form. It simply receives a username and password (provided in a POST request to the servlet) and returns a cookie (or a set of cookies) to the browser if the credentials are correct. In this case, the user does not need to fill in a Web form at all; generating the POST request itself is sufficient.
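Since the POST request is all the server sees in this simple case, it helps to know what the request body looks like. The following is a minimal sketch, not part of OmniFind, of how name-value pairs could be encoded into an application/x-www-form-urlencoded body. The class name FormBody and the sample credentials are assumptions made for illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not part of OmniFind): builds the body of a form POST.
final class FormBody {
    // Encodes name-value pairs as application/x-www-form-urlencoded text,
    // keeping the iteration order of the map.
    static String encode(Map<String, String> fields) {
        StringBuilder body = new StringBuilder();
        for (Map.Entry<String, String> field : fields.entrySet()) {
            if (body.length() > 0) {
                body.append('&');
            }
            body.append(enc(field.getKey())).append('=').append(enc(field.getValue()));
        }
        return body.toString();
    }

    private static String enc(String text) {
        try {
            return URLEncoder.encode(text, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required to be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<String, String>();
        fields.put("username", "ofuser");   // sample values only
        fields.put("password", "omni find&1");
        System.out.println(encode(fields));
    }
}
```

Encoding each value matters when a password contains characters such as & or a space, which would otherwise be read as field separators by the server.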

For a more complicated scheme, you may be required to at least display a web form to enter authentication information. The following could be the flow for a more complex FBA scheme:

  1. The user would send a request for a login page to be viewed through the browser, without a cookie.
  2. The FBA mechanism sees that the user is not authenticated, since the request for the login page comes without a valid cookie, and then redirects the user to the web form to enter credentials. In some situations, multiple redirects occur at this point, and a cookie may be provided to the browser to accompany the request.
  3. The user then enters login credentials and submits the form.
  4. The browser sends the credentials and the cookie, if any, to the action URL of the web form.
  5. The authentication server returns a cookie, or a series of cookies, to the browser.
  6. The server redirects the browser to the requested page.

The redirects in steps 2 and 6 typically occur as a result of "302 (Found)" server responses.

Figure 1. A complicated FBA example
Figure of a complicated FBA setup

FBA in OmniFind: What can it do?

The FBA feature in OmniFind enables the web crawler to negotiate FBA-protected web sites that can be authenticated through a POST or GET request. These settings can be enabled while creating, as well as after creating, the web crawler. For each FBA-protected site area, the following settings are necessary:

  • URL pattern: An authentication request is sent to the web server for each URL crawled that fits this pattern.
  • Form action: The URL for the action request as mentioned in the authentication form.
  • Form method: The method used to send the request. This is also mentioned in the <form> tag.
  • Field name-field value pairs: Authentication information is sent through these name-value pairs, for example, username=john&password=jsmith.
Figure 2. FBA settings in OmniFind
OmniFind FBA settings screen

The FBA mechanism works as follows:

  1. The web crawler accesses the root of the web server and obtains a session cookie, if any.
  2. The crawler then forms a POST query to the form action, including the field name-value pairs specified in the settings.
  3. If the POST authentication request succeeds, the web crawler stores the cookie generated from the request. If authentication fails, the crawler cannot access the desired Web page.
  4. When crawling pages, the crawler will attach these cookies to the page request.
Figure 3. Procedure for FBA in OmniFind
OmniFind FBA procedure

However, this method is not sufficient when the crawler must access the login page before POSTing to the web form, when the URL for the authentication servlet is not known in advance, or when a cookie other than the server's session cookie is required to access the login page. In these cases, the default OmniFind FBA mechanism cannot negotiate the FBA; a web crawler plug-in is necessary instead.


Web crawler plug-in basics

There is a developerWorks Cookbook that explains how to use Eclipse to write and deploy a web crawler plug-in for OmniFind 8.3; the procedure is the same for OmniFind 8.4. This section describes the architecture of the web crawler plug-in and how it allows us to crawl an FBA-protected site; the next section describes how to actually write the plug-in.

The main purpose of a crawler plug-in is to enable the crawler to perform tasks that are not performed by the crawler by default, but are still necessary to crawl documents. In this case, let's use it to do the following:

  1. Access the login page of the web site's form-based authentication;
  2. Store the cookies after performing authentication;
  3. Attach cookies to requests for the pages protected by the form-based authentication.

Although the OmniFind FBA mechanism performs the last two functions, it is the access to the login page that is problematic for the default mechanism. The OmniFind FBA mechanism does not access a login page before trying to authenticate. Therefore, in an environment where the action URL is not known beforehand, or where a cookie for the login page itself must be obtained, you cannot use the default mechanism alone. Once you decide to use the plug-in to access the login page, all your authentication processing needs to use the plug-in framework because you cannot dynamically assign the URL of the login form to OmniFind.

The plug-in framework for the web crawler consists of two parts: a prefetch plug-in and a postparse plug-in. The prefetch plug-in performs modifications to page requests to the web server, such as attaching cookies or redirecting the requested URLs. The postparse plug-in analyzes the content of a fetched Web page and can make modifications or remove the page from the parsing queue. In this case, you only need to write a prefetch plug-in because after you perform the authentication, you only need to modify the outgoing request by attaching a cookie.

The prefetch plug-in implements the com.ibm.es.wc.pi.PrefetchPlugin interface, and must define the following methods:

  • boolean init(): Initialization for the plug-in is performed. This is called once when the web crawler is started.
  • boolean processDocument (PrefetchPluginArg[] arg0): This method is called by the web crawler once for each Web page that is crawled.
  • boolean release(): Cleanup for the plug-in is performed. This is called when the web crawler stops.

When the web crawler starts, and a plug-in is specified in the crawler settings, an instance of the plug-in class is created, and the init() method is called. The plug-in object persists in memory until the web crawler is stopped and the release() method is called. For this task, in the init() method, you need to find the login page, figure out what the action URL is, and send a POST request with authentication credentials to the web server. You would then receive a cookie or several cookies, which you can store in the PrefetchPlugin object and attach to subsequent calls to processDocument().

Figure 4. Execution of FBA web crawler plug-in
Execution of FBA plug-in
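The lifecycle described above can be sketched as a small skeleton. So that the sketch compiles on its own, it uses a stand-in interface named PrefetchPluginSketch with a simplified Object[] argument. A real plug-in would implement com.ibm.es.wc.pi.PrefetchPlugin instead. The class name FbaPluginSkeleton, the authenticate() placeholder, and the cookie value are all assumptions made for illustration.

```java
// Stand-in for com.ibm.es.wc.pi.PrefetchPlugin so that this sketch compiles
// on its own; the argument type is simplified to Object[] (an assumption).
interface PrefetchPluginSketch {
    boolean init();
    boolean processDocument(Object[] args);
    boolean release();
}

// Hypothetical skeleton showing where each part of the FBA work would go.
final class FbaPluginSkeleton implements PrefetchPluginSketch {
    private String cookies;      // obtained once in init(), reused for every page
    private int pagesProcessed;  // counts calls to processDocument()

    public boolean init() {
        // A real plug-in would find the login page and POST the credentials here.
        cookies = authenticate();
        return cookies != null;
    }

    public boolean processDocument(Object[] args) {
        // A real plug-in would add the stored cookies to the request header here.
        pagesProcessed++;
        return true;
    }

    public boolean release() {
        // Discard the stored session when the crawler stops.
        cookies = null;
        return true;
    }

    // Placeholder for the login exchange; returns a made-up cookie value.
    private String authenticate() {
        return "SESSIONID=example";
    }

    String currentCookies() {
        return cookies;
    }

    int pagesProcessed() {
        return pagesProcessed;
    }
}
```

Keeping the cookies in a field works because, as described above, the same plug-in object stays in memory between init() and release().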

Implementing a prefetch plug-in

This is the overall procedure for implementing the prefetch crawler plug-in:

  1. Accessing the login page
  2. Setting up and performing the authentication request
  3. Processing cookies for subsequent page requests

The first two items are initially done within the init() method of the crawler plug-in, and might need to be performed again if the authentication expires. The third item is performed within the processDocument() method for each document.

Accessing the login page

First, you need to access the login page. If the URL of the login screen is not known ahead of time (for example, in the case of a WebSphere® Portal or a secure Web Content Management site), you can request an FBA-protected page without the appropriate cookie. The browser or user agent is then usually redirected to the login screen by a 302 (Found) response.

The prefetch plug-in will need to detect the redirects and send requests for the redirected Web page. The following code can achieve this:

Listing 2. Accessing the login page by traversing redirects
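//import java.io.IOException;
//import java.net.*;
URL requestURL = getRequestURL();
HttpURLConnection conn = (HttpURLConnection) requestURL.openConnection();
// handle redirects ourselves so that each 302 response can be inspected
conn.setInstanceFollowRedirects(false);
boolean loginPageFound = false;
while (!loginPageFound) {
	int responseCode = conn.getResponseCode();
	switch (responseCode) {
	case 302:
		// follow the redirect; the Location header may be a relative URL
		URL newURL = new URL(conn.getURL(), conn.getHeaderField("Location"));
		conn = (HttpURLConnection) newURL.openConnection();
		conn.setInstanceFollowRedirects(false);
		break;
	case 200:
		// the page was served directly, so this is the login page
		loginPageFound = true;
		break;
	default:
		throw new IOException("Unexpected response code: " + responseCode);
	}
}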
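A misconfigured site can redirect in a loop, so it can be worth limiting the number of redirects that the plug-in follows. The following is a minimal sketch of such a bounded traversal. It is not taken from the sample attached to this article. The names RedirectWalker, Fetcher, Response, and maxHops are hypothetical, and the HTTP fetch is hidden behind an interface so that the loop can be tried without a network connection.

```java
import java.net.URI;

// Hypothetical helper: walks a chain of redirects with an upper bound on hops.
final class RedirectWalker {
    // Minimal view of a response: status code plus Location header (may be null).
    static final class Response {
        final int status;
        final String location;

        Response(int status, String location) {
            this.status = status;
            this.location = location;
        }
    }

    // Abstraction over the HTTP request so the loop can be exercised offline.
    interface Fetcher {
        Response fetch(URI uri);
    }

    // Follows at most maxHops redirects. Returns the URI of the first page
    // answered with 200, or null if the limit is exceeded or the chain ends
    // with any other response.
    static URI findLoginPage(URI start, Fetcher fetcher, int maxHops) {
        URI current = start;
        for (int hop = 0; hop <= maxHops; hop++) {
            Response response = fetcher.fetch(current);
            if (response == null) {
                return null;
            }
            if (response.status == 200) {
                return current;
            }
            if (response.status != 302 || response.location == null) {
                return null;
            }
            // A relative Location value is resolved against the current URI.
            current = current.resolve(response.location);
        }
        return null;
    }
}
```

In a plug-in, the Fetcher implementation would wrap the HttpURLConnection calls shown in Listing 2.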

Setting up and performing the authentication request

After obtaining the login page, you have to send the authentication request to the server. To do this, the following information is necessary (very much like the information needed for the OmniFind FBA mechanism):

  • The name of the login form, or some label to identify where the form is on the Web page
  • The URL to which the authentication request will be sent (the action URL)
  • The field name for the username
  • The field name for the password
  • The request method (usually GET or POST)
  • The authentication credentials (the username and password)

You should have the username and password of an authorized user of the web site. You can obtain the other information (possibly except for the action URL) by looking at the source code of the Web page. For example, for the following form, the required information is as follows:

Listing 3. HTML of a login form and the parameters required for FBA
<form name="loginform" action="j_security_check" method="POST">
	Username: <input type=text name="username"> <br>
	Password: <input type=password name="password"> <br>
	<input type=submit>
</form>

Item                   Parameter
Form name              loginform
Action URL             j_security_check
Username field label   username
Password field label   password
Request method         POST
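
When the action URL is not known ahead of time, it can be read from the HTML of the login page instead. The following is a minimal sketch that uses regular expressions to find the action attribute of a form with a given name. The class name LoginFormParser is hypothetical. A regular expression is only an approximation of HTML parsing, so a real plug-in may need a proper HTML parser for pages with unusual markup.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: extracts the action URL of a named form from HTML text.
final class LoginFormParser {
    private static final Pattern FORM_TAG =
            Pattern.compile("<form\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    // Returns the action attribute of the first form whose name attribute
    // equals formName, or null if no such form (or no action) is found.
    static String findAction(String html, String formName) {
        Matcher forms = FORM_TAG.matcher(html);
        while (forms.find()) {
            String tag = forms.group();
            if (formName.equals(attribute(tag, "name"))) {
                return attribute(tag, "action");
            }
        }
        return null;
    }

    // Reads a double-quoted, single-quoted, or unquoted attribute value
    // from the text of a single opening tag.
    private static String attribute(String tag, String name) {
        Pattern pattern = Pattern.compile(
                "\\s" + Pattern.quote(name) + "\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)'|([^\\s>]+))",
                Pattern.CASE_INSENSITIVE);
        Matcher matcher = pattern.matcher(tag);
        if (!matcher.find()) {
            return null;
        }
        for (int group = 1; group <= 3; group++) {
            if (matcher.group(group) != null) {
                return matcher.group(group);
            }
        }
        return null;
    }
}
```

For the form in Listing 3, calling findAction with the form name loginform yields j_security_check, which then has to be resolved against the URL of the login page because it is a relative URL.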

In most cases, none of these parameters change between requests for the login page, except for possibly the action URL, so you can code these values into the plug-in. In the code, this is what you would need to do:

Listing 4. Creating a POST request for authentication
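// the action URL from the login form, resolved against the login page URL
URL loginURL = new URL (conn.getURL(), "j_security_check");
conn = (HttpURLConnection) loginURL.openConnection();
// keep any redirect response visible so that its cookies can be read
conn.setInstanceFollowRedirects (false);
//add any cookies we already have to the request
if (cookies != null) {
	conn.setRequestProperty ("Cookie", cookies);
}
//setting request method to POST
conn.setRequestMethod ("POST");
conn.setDoOutput (true);
conn.setRequestProperty ("Content-Type", "application/x-www-form-urlencoded");
PrintWriter printPost = new PrintWriter (conn.getOutputStream());
//POSTing authentication info
printPost.print ("username=ofuser&password=omnifind");
printPost.close();
//sending the request and reading the server's response code
int responseCode = conn.getResponseCode();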
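
The server returns its cookies in Set-Cookie response headers, and these have to be turned into the value of a Cookie request header before they can be reused. The following is a minimal sketch of that conversion. The class name CookieCollector is hypothetical, and the sketch ignores cookie attributes such as Path, Domain, and expiry, which a more careful plug-in may need to honor.

```java
import java.util.List;

// Hypothetical helper: converts Set-Cookie header values into one Cookie value.
final class CookieCollector {
    // Keeps only the leading name=value part of each Set-Cookie value and
    // joins the parts with "; " as expected in a Cookie request header.
    static String toCookieHeader(List<String> setCookieValues) {
        StringBuilder cookies = new StringBuilder();
        if (setCookieValues == null) {
            return "";
        }
        for (String value : setCookieValues) {
            int end = value.indexOf(';');
            String pair = (end >= 0 ? value.substring(0, end) : value).trim();
            if (pair.isEmpty()) {
                continue;
            }
            if (cookies.length() > 0) {
                cookies.append("; ");
            }
            cookies.append(pair);
        }
        return cookies.toString();
    }
}
```

With an HttpURLConnection, the input list would typically come from conn.getHeaderFields().get("Set-Cookie"), assuming the server spells the header name that way. The result can be stored in the cookies variable used in the listings.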

Processing cookies for page requests

After the authentication request succeeds, the server returns one or more cookies that identify the authenticated session. For the crawl, attach these cookies to each document request. Do this in the processDocument() method, which is called when each document is fetched. Attaching a cookie only involves adding a Cookie: field to the request header.

Listing 5. Adding a cookie to each document request
public boolean processDocument (PrefetchPluginArg[] args) {
	PrefetchPluginArg arg = args[0];
	// add the authentication cookie
	String oldHeader = arg.getHTTPHeader();
	String cookieLine = "Cookie: " + cookies + "\r\n";
	String newHeader;
	// just in case the getHTTPHeader() contains content: if a blank line
	// terminates the header, insert the cookie before that blank line
	int posCRLF = oldHeader.lastIndexOf ("\r\n\r\n");
	if (posCRLF >= 0) {
		newHeader = oldHeader.substring (0, posCRLF + 2) + cookieLine
				+ oldHeader.substring (posCRLF + 2);
	} else {
		newHeader = oldHeader + cookieLine;
	}
	arg.setHTTPHeader (newHeader);
	return true;
}

And now, you should have an understanding of how to crawl web sites that have more complicated FBA-protection schemes.

More complete sample code is attached to this article in the Download section. The sample includes ways to crawl WebSphere Portal and Web Content Management sites, including finding the portal login page from the portal front page. The sample can be modified to access sites protected by third-party FBA mechanisms.


Conclusion

Writing a web crawler plug-in to negotiate an FBA site is not difficult, but it does require understanding what elements are needed for the authentication, and the mechanism required to obtain pages once authentication is performed. In this particular case, we added a way to find login pages by traversing redirects by way of the init() method of the plug-in, and attached a cookie to each document through the processDocument method of the plug-in, so that pages can be crawled.


Download

Description                                   Name      Size
Sample web crawler plug-in to negotiate FBA   FBA.zip   28KB

