To create a prefetch plug-in, you write a Java™ class that implements the interface com.ibm.es.wc.pi.PrefetchPlugin.
Procedure
To create a prefetch plug-in:
- Implement the com.ibm.es.wc.pi.PrefetchPlugin interface
with the following methods:
public class MyPrefetchPlugin implements com.ibm.es.wc.pi.PrefetchPlugin {
    public MyPrefetchPlugin() { ... }
    public boolean init() { ... }
    public boolean processDocument(PrefetchPluginArg[] args) { ... }
    public boolean release() { ... }
}
The init method is called once
when the plug-in is instantiated. If you specify a plug-in class,
the crawler loads that class when the crawler starts and creates a
single instance of it; your plug-in class must therefore have a
no-argument constructor. After creating the instance, the crawler
calls the init method before the first use. This method performs
the required setup tasks that cannot be done until an instance
of the class is in memory.
If the plug-in should not be used, or if other errors occur, the
init method can return false, and the crawler removes the instance
from the list of prefetch plug-ins. If the init method returns
true, the plug-in is ready to use. The init method
cannot throw an exception.
The processDocument method
is called on the single plug-in instance for every document that
will be downloaded. The crawler uses from one to several hundred download
threads, which run asynchronously, so this method can be called from
multiple threads concurrently.
The release method
is called once when the crawler stops to allow the plug-in object
to release any system resources or flush any queued objects. This
method cannot throw exceptions. A true result means success. A false
result is logged.
For name resolution, use the ES_INSTALL_ROOT/lib/URLFetcher.jar file.
- Compile the implemented code and make a JAR file
for it. Add the ES_INSTALL_ROOT/lib/URLFetcher.jar file
to the class path when you compile.
- In the administration console, follow these steps:
- Edit the appropriate collection.
- Select the Crawl page and edit
the crawler properties for the crawler that will use the custom Java class.
- Specify the following items:
- The fully qualified class name of the implemented Java class, for example, com.ibm.plugins.MyPlugin.
When you specify the class name, ensure that you do not specify the
file extension, such as .class or .java.
- The class path for the plug-in, including all needed JAR files.
Ensure that you include the names of the JAR files in your path declaration,
for example, /ics/plugins/Plugins.jar.
- Stop and restart the session for the crawler that you
edited. Then, start a full crawl.
Results
If an error occurs and the web crawler stops while it is loading
the plug-in, view the log file and verify that:
- The class name and class path that you specified on the crawler
properties page are correct.
- All necessary JAR files were specified in the plug-in class path.
- The crawler plug-in does not throw CrawlerPluginException or
any other unexpected exception, and no fatal errors occur in the plug-in.
You must write the processDocument method to be thread-safe. You
can do this by wrapping its entire contents in a synchronized block,
but that approach permits only one thread to execute the method at
a time, which makes the crawler effectively single-threaded during
plug-in operation and creates a performance bottleneck.
A better way to make the method thread-safe is to keep all
per-document state in local (stack) variables, which minimizes
the amount of global data, and to synchronize only access to objects
that are shared between threads. The processDocument method cannot throw an exception.
It can return true to indicate successful processing of a document
or false to indicate a problem. A false return value is logged with
the URL by the crawler.
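The local-variables-plus-targeted-synchronization pattern can be sketched as follows. This is a standalone illustration, not the crawler API: the class name, the String parameters, and the failedUrls list are all invented for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the recommended concurrency pattern: per-document work uses
// only local variables, and the one shared object is guarded by a short
// synchronized block instead of synchronizing the whole method.
public class ThreadSafeSketch {
    // Shared between download threads: every access must be synchronized.
    private final List<String> failedUrls = new ArrayList<>();

    public boolean processDocument(String url, String header) {
        // Local (stack) state: each thread gets its own copy,
        // so no synchronization is needed for this part.
        boolean ok = header != null && header.endsWith("\r\n\r\n");
        if (!ok) {
            // Synchronize only the brief access to shared state.
            synchronized (failedUrls) {
                failedUrls.add(url);
            }
        }
        return ok;
    }

    public int failureCount() {
        synchronized (failedUrls) {
            return failedUrls.size();
        }
    }
}
```

Because the synchronized region covers only the list access, hundreds of download threads can run the header check concurrently without serializing on the plug-in.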
Prefetch plug-in example
You can use a prefetch
plug-in to add a cookie to the HTTP request header before the document
is downloaded.
package com.mycompany.ofpi;

import com.ibm.es.wc.pi.PrefetchPlugin;
import com.ibm.es.wc.pi.PrefetchPluginArg;
import com.ibm.es.wc.pi.PrefetchPluginArg1;

public class MyPrefetchPlugin implements PrefetchPlugin {
    public boolean init() { return true; }
    public boolean release() { return true; }
    public boolean processDocument(PrefetchPluginArg[] args) {
        PrefetchPluginArg1 arg = (PrefetchPluginArg1)args[0];
        String header = arg.getHTTPHeader();
        // Remove the CRLF of the trailing blank line that ends the header.
        header = header.substring(0, header.lastIndexOf("\r\n"));
        // Append the cookie line, then restore the terminating blank line.
        header += "Cookie: class=TestPrefetchPlugin\r\n\r\n";
        arg.setHTTPHeader(header);
        return true;
    }
}
This example shows:
- The first element ([0]) in the argument array that is passed to
your plug-in is an object of type PrefetchPluginArg1,
which is an interface that extends the interface PrefetchPluginArg.
This is the only argument and the only argument type that is passed
to the prefetch plug-in, so you can safely cast to it. To be completely
safe, you can enclose the cast in a try/catch block and catch ClassCastException,
or do an instanceof test first.
- After you have the argument, you can call any method in the PrefetchPluginArg1
interface. The getURL method returns the URL (in
String form) of a document that the crawler downloads. You can use
this URL to decide if the document requires additional information
in the request header, such as a cookie.
- The getHTTPHeader method returns a String that
contains all of the content of the HTTP request header that the
crawler sends to download the document. The plug-in
can inspect and modify this header if necessary. For example, it
can add a single cookie, or any other information that is valid
in an HTTP request header, and it can also remove information.
If you modify the header, you must conform to HTTP protocol
requirements. For example, every line must end with a CRLF sequence,
and the header must use ISO-8859-1 encoding.
- The setHTTPHeader method sets the
request header that you modified. The web crawler parses the request
header after the processDocument method returns,
and any additional headers are extracted and added to the actual request.
The request line and the headers that are generated internally, such as the
authentication headers and host headers, are protected against this
modification.
- The processDocument method is called once for
every document that the crawler downloads. If the processDocument method
returns false, its results are ignored. If it returns true, the crawler
applies the changes that the plug-in made. To stop the download, the
prefetch plug-in can call the setFetch(false) method.
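Because the header edit in the example is plain string manipulation, its CRLF handling can be checked in isolation. The following sketch repeats the logic of the example's processDocument body as a standalone helper; the class and method names are illustrative and not part of the crawler API.

```java
// Standalone sketch of the header edit from the example: strip the CRLF of
// the final blank line, append the cookie line, and restore the blank line
// that terminates an HTTP request header.
public class HeaderEditSketch {
    public static String addCookie(String header, String cookie) {
        // "...\r\n\r\n" becomes "...\r\n": drop the trailing blank line's CRLF.
        String trimmed = header.substring(0, header.lastIndexOf("\r\n"));
        // Every header line ends with CRLF; the second CRLF restores the
        // blank line that marks the end of the header.
        return trimmed + "Cookie: " + cookie + "\r\n\r\n";
    }
}
```

For a header such as "GET / HTTP/1.1\r\nHost: example.com\r\n\r\n", the cookie line is inserted before the terminating blank line, which keeps the result a well-formed request header.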