Creating a postparse plug-in for the web crawler

With the postparse plug-in, you can use Java APIs to view the content, security tokens, and metadata of a document that is parsed by the simple HTML parser that is provided by the web crawler. You can add to, delete from, or replace any of these fields, or stop the document from being sent to the document processing pipeline, including the parser, tokenizer, and indexer.

About this task

To create a postparse plug-in, you write a Java™ class that implements the interface com.ibm.es.wc.pi.PostparsePlugin, for example:
public class MyPostparsePlugin implements com.ibm.es.wc.pi.PostparsePlugin {
   public MyPostparsePlugin() { ... }
   public boolean init() { ... }
   public boolean processDocument(PostparsePluginArg[] args) { ... }
   public boolean release() { ... }
}

The plug-in class can implement both the prefetch and postparse plug-in interfaces, but it needs only one init method and one release method. If the class does both prefetch and postparse processing, initialize and release the resources for both tasks in those methods. Both the init method and the release method are called only once.
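
For example, a single class that does both tasks might look like the following sketch. It assumes the com.ibm.es.wc.pi.PrefetchPlugin interface, with its processURL method, that is described in the topic about creating a prefetch plug-in:

import com.ibm.es.wc.pi.*;

// A minimal sketch of a combined plug-in. The PrefetchPlugin
// interface and its processURL method are assumed from the
// prefetch plug-in topic.
public class MyCombinedPlugin implements PrefetchPlugin, PostparsePlugin {
   public MyCombinedPlugin() { }

   // Called only once; acquire the resources for both tasks here.
   public boolean init() { return true; }

   // Prefetch processing: called for each URL before download.
   public boolean processURL(PrefetchPluginArg[] args) {
      return true;
   }

   // Postparse processing: called for each attempted download.
   public boolean processDocument(PostparsePluginArg[] args) {
      return true;
   }

   // Called only once; release the resources for both tasks here.
   public boolean release() { return true; }
}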

The processDocument method is called on the single plug-in instance for every URL for which a download was attempted. Not all downloads return content, so your plug-in can use the HTTP return code, such as 200, 302, or 404, to determine what to do when it is called. If content was obtained and the content was suitable for HTML parsing, the content is put through the parser, and the results of parsing are available when your plug-in is called.
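
For example, a plug-in can begin by checking whether any parsed content is available before it does further work. The following sketch uses only methods that appear in the examples later in this topic (getContent on the argument object); as the comments in those examples indicate, returning true tells the crawler to use the plug-in's results, and returning false tells it to disregard them:

import com.ibm.es.wc.pi.*;

public class ContentAwarePlugin implements PostparsePlugin {
   public ContentAwarePlugin() { }
   public boolean init() { return true; }
   public boolean release() { return true; }
   public boolean processDocument(PostparsePluginArg[] args) {
      PostparsePluginArg1 arg = (PostparsePluginArg1)args[0];
      // Not every attempted download returns content (for example,
      // a 302 redirect or a 404 error), so check before processing.
      if (arg.getContent() == null || arg.getContent().length == 0) {
         return true;   // Nothing to change; keep the parsed results as-is.
      }
      // ... examine or modify the parsed document here ...
      return true;      // Use the plug-in's results.
   }
}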

Postparse plug-in examples

The following example shows how to add security ACLs to the metadata that the crawler sends with documents that are downloaded from a particular site. You can use a postparse plug-in to add those ACLs just before the crawler writes the document to the parser's input buffer:
package com.mycompany.ofpi;  // Plug-ins

import com.ibm.es.wc.pi.*;

public class MyPostparsePlugin implements PostparsePlugin {
   public MyPostparsePlugin() { }
   public boolean init() { return true; }
   public boolean release() { return true; }
   public boolean processDocument(PostparsePluginArg[] args) {
      try {
         PostparsePluginArg1 arg = (PostparsePluginArg1)args[0];
         if (arg.getURL().startsWith("http://mysite.com/users/")) {
            // Extract the user name from the URL and look up the
            // appropriate tokens, then build a comma-separated list
            // of the additional ACLs.
            String acls = "";  // Placeholder: replace with your site-specific lookup.
            arg.addSecurityACLs(acls);
         }
         return true;     // Use the plug-in's results.
      } catch (Exception e) {
         return false;    // Disregard the plug-in's results.
      }
   }
}

You can also use a postparse plug-in to add a new metadata field to your crawled documents. For example, if some of your documents contain a particular facet value, you might want to add a metadata field called "MyUserSpecificMetadata" to the search record while the crawler is running. The field contains a string that you want to be able to query, and its "searchability" attributes control how the field can be searched. In another example, because the built-in parsers cannot extract metadata from binary documents, you might want to add enterprise-specific metadata to binary documents after they are crawled to ensure that the metadata fields can be searched when users search the collection.

The following example shows how to add a metadata field. The keyword constant and the userdata value are placeholders for your own lookup logic:
public class MyPostparsePlugin implements PostparsePlugin {

  private static final String keyword = "confidential";  // Example keyword; substitute your own.

  public MyPostparsePlugin() { }
  public boolean init() { return true; }
  public boolean release() { return true; }

  public boolean processDocument(PostparsePluginArg[] args) {
    try {
      PostparsePluginArg1 arg = (PostparsePluginArg1)args[0];
      if (arg.getContent() != null && arg.getContent().length > 0) {
        String content = new String(arg.getContent(), arg.getEncoding());
        if (content.indexOf(keyword) >= 0) {
          String userdata = "";  // Placeholder: look up the string to store for this keyword.
          FieldMetadata mf = new FieldMetadata(
             "MyUserSpecificMetadata", // field name
             userdata,                 // field value
             false,                    // searchable?
             true,                     // field-searchable?
             false,                    // parametric-searchable?
             true,                     // can be extracted by search?
             "MetadataPreferred",      // metadata value rather than content
             false);                   // show in summary?
          arg.addMetadataField(mf);    // Add it to the list.
          return true;                 // Use the plug-in's results.
        }
      }
      return false;  // No changes were made; disregard the plug-in's results.
    } catch (Exception e) {
      return false;  // Disregard the plug-in's results.
    }
  }
}

The document content is available from the plug-in argument (arg.getContent), and the encoding that the crawler detected is available from arg.getEncoding. With the content and the encoding, you can create a String, look for a keyword (content.indexOf(...)), associate new data with it (userdata = ...), and insert that new data as the content of the new field.

To define a new metadata field, create an instance of the FieldMetadata class and set its field values.