Creating a prefetch plug-in for the web crawler
To create a prefetch plug-in, you write a Java™ class that implements the interface com.ibm.es.wc.pi.PrefetchPlugin.
Procedure
To create a prefetch plug-in:
Results
- The class name and class path that you specified on the crawler properties page are correct.
- All necessary JAR files were specified for the plug-in class path.
- The crawler plug-in does not throw CrawlerPluginException or any other unexpected exception, and no fatal errors occur in the plug-in.
You must write this method to be thread-safe. One way to do this is to wrap its entire contents in a synchronized block, but that permits only one thread to execute the method at a time. The crawler then effectively becomes single-threaded during plug-in operation, which creates a performance bottleneck.
A better way to make the method thread-safe is to use local (stack) variables for all state, minimize the amount of global data, and synchronize only during access to objects that are shared between threads. This method cannot throw an exception. It can return true to indicate successful processing of a document or false to indicate a problem. A false return value is logged with the URL by the crawler.
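The second approach can be sketched as follows. This is a minimal illustration rather than product code: the class name ThreadSafeSketch, the processed counter, and the addCookie method are hypothetical, and the method takes a plain String instead of the real plug-in argument so that the sketch compiles without the crawler libraries.

```java
// Hypothetical sketch; none of these names are part of the crawler API.
class ThreadSafeSketch {
    // Shared between threads: the only state that needs protection.
    private final Object lock = new Object();
    private long processed = 0;

    // Stand-in for the body of processDocument; works on the raw header text.
    String addCookie(String header) {
        // Local (stack) variables: every calling thread has its own copies,
        // so this part needs no synchronization.
        int end = header.lastIndexOf("\r\n");
        if (end < 0) {
            return header; // unexpected format: leave the header unchanged
        }
        String result = header.substring(0, end)
                + "Cookie: class=TestPrefetchPlugin\r\n\r\n";

        // Synchronize only for the brief access to the shared counter.
        synchronized (lock) {
            processed++;
        }
        return result;
    }

    long processedCount() {
        synchronized (lock) {
            return processed;
        }
    }
}
```

Because the lock is held only while the counter is updated, several crawler threads can build headers at the same time, which avoids the bottleneck of synchronizing the whole method.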
Prefetch plug-in example
package com.mycompany.ofpi;

import com.ibm.es.wc.pi.PrefetchPlugin;
import com.ibm.es.wc.pi.PrefetchPluginArg;
import com.ibm.es.wc.pi.PrefetchPluginArg1;

public class MyPrefetchPlugin implements PrefetchPlugin {
    public boolean init() { return true; }
    public boolean release() { return true; }

    public boolean processDocument(PrefetchPluginArg[] args) {
        // The first array element is the only argument that is passed.
        PrefetchPluginArg1 arg = (PrefetchPluginArg1)args[0];
        String header = arg.getHTTPHeader();
        // Drop the final CRLF (the blank line that ends the header).
        header = header.substring(0, header.lastIndexOf("\r\n"));
        // Append the cookie line and restore the terminating blank line.
        header += "Cookie: class=TestPrefetchPlugin\r\n\r\n";
        arg.setHTTPHeader(header);
        return true;
    }
}
- The first element ([0]) in the argument array that is passed to your plug-in is an object of type PrefetchPluginArg1, which is an interface that extends the interface PrefetchPluginArg. This is the only argument and the only argument type that is passed to the prefetch plug-in, so you can safely cast to it. To be completely safe, you can enclose the cast in a try/catch block that catches ClassCastException, or do an instanceof test first.
- After you have the argument, you can call any method in the PrefetchPluginArg1 interface. The getURL method returns the URL (in String form) of a document that the crawler downloads. You can use this URL to decide if the document requires additional information in the request header, such as a cookie.
- The getHTTPHeader method returns a String that contains all of the content of the HTTP request header that the crawler sends to download the document. The plug-in can inspect and modify this header if necessary. For example, you can add a cookie or any other information that is valid in an HTTP request header. You can also remove any of this information. If you modify the header, you must conform to HTTP protocol requirements. For example, every line must end with a CRLF sequence, and the header must use ISO-8859-1 encoding.
- The setHTTPHeader method sets the request header that you modified. The web crawler parses the request header after the processDocument method returns, and extracts the additional headers to add them to the actual request header. The method line and headers that are generated internally, such as the authentication headers and host headers, are protected against this modification.
- The processDocument method is called once for every document that the crawler downloads. If the processDocument method returns false, its results are ignored. If it returns true, the crawler checks the changes that the plug-in made. To stop the download, the prefetch plug-in calls the setFetch(false) method.
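The points above can be combined into one sketch that uses the URL to decide whether to add a cookie, keeps the header valid, and stops a download. It rests on several assumptions: ArgSketch is a hypothetical stand-in that declares only the four PrefetchPluginArg1 methods named above (with signatures inferred from this topic) so that the code compiles without the crawler libraries, and the host name, the .exe rule, and the helper addHeaderLine are invented for illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for the PrefetchPluginArg1 methods described above.
interface ArgSketch {
    String getURL();
    String getHTTPHeader();
    void setHTTPHeader(String header);
    void setFetch(boolean fetch);
}

class PrefetchSketch {
    // Inserts one header line before the blank line that ends the header.
    // Returns null if the result would not be a valid request header.
    static String addHeaderLine(String header, String line) {
        boolean latin1 = StandardCharsets.ISO_8859_1.newEncoder().canEncode(line);
        if (!latin1 || line.contains("\r") || line.contains("\n")
                || !header.endsWith("\r\n\r\n")) {
            return null;
        }
        // Remove the final CRLF, add the line, then restore CRLF CRLF.
        return header.substring(0, header.length() - 2) + line + "\r\n\r\n";
    }

    // Stand-in for the body of processDocument after the cast to the
    // argument type has succeeded.
    static boolean process(ArgSketch arg) {
        String url = arg.getURL();
        if (url.endsWith(".exe")) {           // assumed rule: skip executables
            arg.setFetch(false);              // tells the crawler not to download
            return true;
        }
        if (url.startsWith("http://intranet.example.com/")) { // assumed host
            String updated = addHeaderLine(arg.getHTTPHeader(),
                    "Cookie: class=TestPrefetchPlugin");
            if (updated == null) {
                return false;                 // problem: results are ignored
            }
            arg.setHTTPHeader(updated);
        }
        return true;
    }
}
```

Validating the new line before calling setHTTPHeader keeps a malformed or non-ISO-8859-1 value out of the request, and returning false in that case leaves the original header in effect.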