Creating a prefetch plug-in for the web crawler

To create a prefetch plug-in, you write a Java™ class that implements the interface com.ibm.es.wc.pi.PrefetchPlugin.

Procedure

To create a prefetch plug-in:

  1. Implement the com.ibm.es.wc.pi.PrefetchPlugin interface and provide the following methods:
    public class MyPrefetchPlugin implements com.ibm.es.wc.pi.PrefetchPlugin {
        public MyPrefetchPlugin() { ... }
        public boolean init() { ... }
        public boolean processDocument(PrefetchPluginArg[] args) { ... }
        public boolean release() { ... }
    }

    The init method is called once when the plug-in is instantiated. If you register a plug-in class, the crawler loads that class when the crawler is started and creates a single instance of it; your plug-in class must therefore have a no-argument constructor. After creating the instance, the crawler calls the init method before the first use. This method does the required setup tasks that cannot be done until an instance of the class is in memory.

    If the plug-in should not be used, or an error occurs during setup, the init method can return false, and the crawler removes the instance from its list of prefetch plug-ins. If the init method returns true, the plug-in is ready to use. The init method cannot throw an exception.

    The processDocument method is called on the single plug-in instance for every document that will be downloaded. The crawler uses from one to several hundred download threads, which run asynchronously, so this method can be called from multiple threads concurrently.

    The release method is called once when the crawler stops to allow the plug-in object to release any system resources or flush any queued objects. This method cannot throw exceptions. A true result means success. A false result is logged.

    For name resolution, use the ES_INSTALL_ROOT/lib/URLFetcher.jar file.

  2. Compile the implemented code and make a JAR file for it.
    Add the ES_INSTALL_ROOT/lib/URLFetcher.jar file to the class path when you compile.
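    For example, assuming the plug-in source from the example package com.mycompany.ofpi and an installation root in ES_INSTALL_ROOT (the paths and JAR name here are illustrative; adjust them for your system), the compile and packaging steps might look like this:

    ```shell
    # Illustrative only: source paths, package names, and the JAR name
    # depend on your own plug-in. ES_INSTALL_ROOT must point at your
    # enterprise search installation.
    javac -classpath "$ES_INSTALL_ROOT/lib/URLFetcher.jar" \
        com/mycompany/ofpi/MyPrefetchPlugin.java
    jar cf Plugins.jar com/mycompany/ofpi/MyPrefetchPlugin.class
    ```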
  3. In the administration console, follow these steps:
    1. Edit the appropriate collection.
    2. Select the Crawl page and edit the crawler properties for the crawler that will use the custom Java class.
    3. Specify the following items:
      • The fully qualified class name of the implemented Java class, for example, com.ibm.plugins.MyPlugin. When you specify the class name, ensure that you do not specify the file extension, such as .class or .java.
      • The class path for the plug-in, including all needed JAR files. Ensure that you include the name of the JAR files in your path declaration, for example, /ics/plugins/Plugins.jar
    4. Stop and restart the session for the crawler that you edited. Then, start a full crawl.

Results

If an error occurs and the web crawler stops while it is loading the plug-in, view the log file and verify that:
  • The class name and class path that you specified on the crawler properties page are correct.
  • All necessary JAR files were specified in the plug-in class path.
  • The crawler plug-in does not throw CrawlerPluginException or any other unexpected exception, and no fatal errors occur in the plug-in.

You must write the processDocument method to be thread-safe. Wrapping its entire contents in a synchronized block achieves this, but it permits only one thread to execute the method at a time, which makes the crawler effectively single-threaded during plug-in operation and creates a performance bottleneck.

A better way to make the method thread-safe is to keep all state in local (stack) variables, which minimizes the amount of global data, and to synchronize only access to objects that are shared between threads. This method cannot throw an exception. It returns true to indicate successful processing of a document or false to indicate a problem. A false return value is logged with the URL by the crawler.
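A minimal sketch of that pattern, with illustrative names that are not part of the crawler API (here a java.util.concurrent atomic stands in for a synchronized block around the one shared object):

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch of a thread-safe processDocument pattern. The class, method,
// and field names are illustrative, not crawler API. Per-document work
// uses only local variables; the single shared object is a counter that
// is accessed through a thread-safe type.
public class ThreadSafeSketch {
    private final AtomicLong processed = new AtomicLong(); // shared state

    public boolean processDocument(String url) {
        // Local (stack) variables: each download thread gets its own
        // copies, so no locking is needed for per-document work.
        boolean secure = url.startsWith("https://");
        String extraHeader = secure ? "Cookie: class=Test\r\n" : "";
        // Coordinate only the access to shared state.
        processed.incrementAndGet();
        return true; // report success for this sketch
    }

    public long processedCount() {
        return processed.get();
    }
}
```

Because each thread works on its own locals, many download threads can run processDocument concurrently; only the one shared counter is a point of coordination.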

Prefetch plug-in example

You can use a prefetch plug-in to add a cookie to the HTTP request header before the document is downloaded.
package com.mycompany.ofpi;

import com.ibm.es.wc.pi.PrefetchPlugin;
import com.ibm.es.wc.pi.PrefetchPluginArg;
import com.ibm.es.wc.pi.PrefetchPluginArg1;

public class MyPrefetchPlugin implements PrefetchPlugin {
	public boolean init() { return true; }
	public boolean release() { return true; }
	public boolean processDocument(PrefetchPluginArg[] args) {
		// The first argument carries the URL and the HTTP request header.
		PrefetchPluginArg1 arg = (PrefetchPluginArg1)args[0];
		String header = arg.getHTTPHeader();
		// Remove the final CRLF of the header's blank-line terminator ...
		header = header.substring(0, header.lastIndexOf("\r\n"));
		// ... then append the cookie line and restore the terminator.
		header += "Cookie: class=TestPrefetchPlugin\r\n\r\n";
		arg.setHTTPHeader(header);
		return true;
	}
}
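The header splice in processDocument can be exercised in isolation. In the following sketch, the helper addCookie and the sample header are illustrative; only the substring and append operations mirror the plug-in code:

```java
// Standalone illustration of the header manipulation in processDocument.
public class HeaderSpliceDemo {
    public static String addCookie(String header, String cookieLine) {
        // Drop the final CRLF of the "\r\n\r\n" terminator ...
        String trimmed = header.substring(0, header.lastIndexOf("\r\n"));
        // ... append the cookie line, then restore the blank-line terminator.
        return trimmed + cookieLine + "\r\n\r\n";
    }

    public static void main(String[] args) {
        String header = "GET / HTTP/1.1\r\nHost: example.com\r\n\r\n";
        // Show the CRLF sequences explicitly for readability.
        System.out.println(addCookie(header, "Cookie: class=TestPrefetchPlugin")
                .replace("\r\n", "\\r\\n"));
    }
}
```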
This example shows:
  • The first element ([0]) in the argument array that is passed to your plug-in is an object of type PrefetchPluginArg1, an interface that extends the interface PrefetchPluginArg. This is the only argument and the only argument type that is passed to the prefetch plug-in, so you can safely cast to it. To be completely safe, you can enclose the cast in a try/catch block and catch ClassCastException, or do an instanceof test first.
  • After you have the argument, you can call any method in the PrefetchPluginArg1 interface. The getURL method returns the URL (in String form) of a document that the crawler downloads. You can use this URL to decide if the document requires additional information in the request header, such as a cookie.
  • The getHTTPHeader method returns a String that contains all of the content of the HTTP request header that the crawler sends to download the document. The plug-in can inspect and modify this header if necessary. For example, it can add a cookie or any other information that is valid in an HTTP request header, and it can remove information as well. If you modify the header, you must conform to HTTP protocol requirements. For example, every line must end with a CRLF sequence, and the header must use ISO-8859-1 encoding.
  • The setHTTPHeader method sets the request header that you modified. The web crawler parses this header after the processDocument method returns and extracts the additional headers to add to the actual request header. The request line and headers that are generated internally, such as authentication headers and Host headers, are protected against modification.
  • The processDocument method is called once for every document that the crawler downloads. If the processDocument method returns false, its results are ignored. If it returns true, the crawler applies the changes that the plug-in made. To stop the download of a document, the prefetch plug-in calls the setFetch(false) method.
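The defensive cast described in the first bullet can be sketched as follows. The Arg and Arg1 interfaces here are simplified stand-ins for the real com.ibm.es.wc.pi types (which also carry methods such as setFetch), so only the instanceof-then-cast pattern itself is the point:

```java
// Simplified stand-ins for the PrefetchPluginArg interfaces, used only
// to illustrate the defensive instanceof-then-cast pattern.
interface Arg { }
interface Arg1 extends Arg {
    String getURL();
}

public class CastCheckDemo {
    // Returns the document URL if the first argument has the expected
    // type, or null when the argument array is not what we expect.
    public static String urlOf(Arg[] args) {
        if (args != null && args.length > 0 && args[0] instanceof Arg1) {
            Arg1 a = (Arg1) args[0]; // safe after the instanceof test
            return a.getURL();
        }
        return null;
    }

    public static void main(String[] argv) {
        Arg1 a = new Arg1() {
            public String getURL() { return "http://example.com/doc"; }
        };
        System.out.println(urlOf(new Arg[] { a }));
    }
}
```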