Creating a crawler plug-in for type A data sources

You can create a Java™ class to programmatically update the value of security tokens, metadata, and the document content of type A data sources.

About this task

When the crawler session starts, the plug-in process is forked. An AbstractCrawlerPlugin object is instantiated with the default constructor, and the init, isMetadataUsed, and isContentUsed methods are each called once. During the crawler session, the activate method is called when the crawler starts crawling and the deactivate method is called when the crawler finishes crawling. When the crawler session ends, the term method is called and the object is destroyed. If the crawler scheduler is enabled, the activate method is called when crawling is scheduled to start and the deactivate method is called when crawling is scheduled to end. Because a single crawler session runs continuously when the crawler scheduler is enabled, the term method is not called and the object is not destroyed.
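These lifecycle calls map directly onto the methods of a plug-in subclass. The following minimal sketch, which uses the example class name com.ibm.plugins.MyPlugin from the procedure below, shows where each call fits. The updateDocument signature shown here (taking and returning a com.ibm.es.crawler.plugin.CrawledData object) is an assumption for illustration; verify the exact signatures against the Javadoc in the ES_INSTALL_ROOT/lib/dscrawler.jar file.

package com.ibm.plugins;

import com.ibm.es.crawler.plugin.AbstractCrawlerPlugin;
import com.ibm.es.crawler.plugin.CrawledData;

public class MyPlugin extends AbstractCrawlerPlugin {

    @Override
    public void init() {
        // Called once, after the default constructor, when the plug-in process is forked.
    }

    @Override
    public boolean isMetadataUsed() {
        // Called once at startup; return true if updateDocument reads or writes metadata.
        return true;
    }

    @Override
    public boolean isContentUsed() {
        // Called once at startup; return true if updateDocument reads or writes content.
        return false;
    }

    @Override
    public void activate() {
        // Called each time the crawler starts, or is scheduled to start, crawling.
    }

    @Override
    public void deactivate() {
        // Called each time the crawler finishes, or is scheduled to finish, crawling.
    }

    @Override
    public void term() {
        // Called when the crawler session ends. Not called when the scheduler is enabled.
    }

    @Override
    public CrawledData updateDocument(CrawledData crawledData) {
        // Called for each crawled document: update security tokens, metadata,
        // or document content here, then return the modified document.
        return crawledData;
    }
}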

Tip: For information about creating a crawler plug-in for the following type B data sources, see Creating a crawler plug-in for type B data sources:
  • Agent for Windows file systems crawler
  • BoardReader crawler
  • Case Manager crawler
  • Exchange Server crawler
  • FileNet P8 crawler
  • SharePoint crawler

Procedure

To create a Java class for use as a crawler plug-in with content-related functions for type A data sources:

  1. Extend com.ibm.es.crawler.plugin.AbstractCrawlerPlugin and implement the following methods:
    init()
    isMetadataUsed()
    isContentUsed()
    activate()
    deactivate()
    term()
    updateDocument()

    The AbstractCrawlerPlugin class is an abstract class. The init, activate, deactivate, and term methods are implemented to do nothing. The isMetadataUsed method and isContentUsed method are implemented to return false by default. The updateDocument method is an abstract method, so you must implement it.

    For name resolution, use the ES_INSTALL_ROOT/lib/dscrawler.jar file.

  2. Compile the implemented code and make a JAR file for it.
    Add the ES_INSTALL_ROOT/lib/dscrawler.jar file to the class path when you compile. Example compile commands are shown after this procedure.
  3. In the administration console, follow these steps:
    1. Edit the appropriate collection.
    2. Select the Crawl page and edit the crawler properties for the crawler that will use the custom Java class.
    3. Specify the following items:
      • The fully qualified class name of the implemented Java class, for example, com.ibm.plugins.MyPlugin. When you specify the class name, ensure that you do not specify the file extension, such as .class or .java.
      • The fully qualified class path for the JAR file and the directory in which all files that are required by the Java class are located. Ensure that you include the name of the JAR file in your path declaration, for example, C:\plugins\Plugins.jar. If you need to specify multiple JAR files, ensure that you use the correct separator depending on your platform, as shown in the following examples:
        • AIX® or Linux®: /home/esadmin/plugins/Plugins.jar:/home/esadmin/plugins/3rdparty.jar
        • Windows: C:\plugins\Plugins.jar;C:\plugins\3rdparty.jar
  4. On the Crawl page, click Monitor. Then, click Stop and Start to restart the session for the crawler that you edited. Click Details and start a full crawl.
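
For example, on Windows, step 2 might consist of the following commands. The source file layout and output paths are examples only, and ES_INSTALL_ROOT stands for your installation directory:

javac -classpath "%ES_INSTALL_ROOT%\lib\dscrawler.jar" com\ibm\plugins\MyPlugin.java
jar cf C:\plugins\Plugins.jar com\ibm\plugins\MyPlugin.class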

Results

If the crawler stops when it is loading the plug-in, view the log file and verify that:
  • The class name and class path that you specified in the crawler properties page are correct.
  • All necessary libraries are specified for the plug-in class path.
  • The crawler plug-in does not throw a CrawlerPluginException error.
Tip: If a crawler gets a NullPointerException after it is configured to use a custom crawler plug-in, override com.ibm.es.crawler.plugin.AbstractCrawlerPlugin#isMetadataUsed() to return true instead of false.
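For example:

@Override
public boolean isMetadataUsed() {
    return true; // Avoids the NullPointerException that is described in this tip.
}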
Metadata field definitions: If you want to add a new metadata field in your crawler plug-in, you must create an index field and add the metadata field to the collection by configuring parsing and indexing options in the administration console. Ensure that the name of the metadata field is the same as the name of the index field.
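For illustration, adding such a field inside the updateDocument method might look like the following sketch. The FieldMetadata constructor and the CrawledData list accessors shown here are assumptions, not confirmed signatures; verify them against the Javadoc in the dscrawler.jar file.

// Hypothetical accessor names; verify against dscrawler.jar.
// An index field named "department" must already exist in the
// parse and index options for the collection.
FieldMetadata field = new FieldMetadata("department", "sales"); // assumed constructor
List list = crawledData.getMetadataList();                      // assumed accessor
list.add(field);
crawledData.setMetadataList(list);                              // assumed accessor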
The following methods in the FieldMetadata class are deprecated. These field characteristics are overwritten by field definitions in the parser configuration:
public void setSearchable(boolean b)
public void setFieldSearchable(boolean b)
public void setParametricSearchable(boolean b)
public void setAsMetadata(boolean b)
public void setResolveConflict(String string)
public void setContent(boolean b)
public void setExactMatch(boolean b)
public void setSortable(boolean b)
Using PluginLogger to log messages: PluginLogger is a class that you can use to include log statements from the plug-in in the Watson Explorer Content Analytics log files. To use PluginLogger, specify the following statement in the import statements:
import com.ibm.es.crawler.plugin.logging.PluginLogger;
Add the following statements after the start of the class declaration:
/** Logger */
private static final PluginLogger logger;
static {
    PluginLogger.init(PluginLogger.LOGTYPE_OSS, PluginLogger.LOGLEVEL_INFO);
    logger = PluginLogger.getInstance();
}
/** End Logger */
In the updateDocument method, add the following statements to output test logging statements of type INFO, WARN, and ERROR:
/* Testing Logging Statements */
logger.info("This is info.");
logger.warn("This is warning.");
logger.error("This is error.");
/* End Testing Logging Statements */
With the default collection settings, these statements cause the warning and error messages to be shown in the collection log file. For example:
W FFQD2801W 2013/04/27 23:02:05.619 CDT plugin plugin.WIN_50605.crawlerplugin
FFQD2801W A warning was generated from the crawler plug-in.
Message: This is warning.
E FFQD2800E 2013/04/27 23:02:05.681 CDT plugin plugin.WIN_50605.crawlerplugin
FFQD2800E An error was generated from the crawler plug-in.
Message: This is error.
To show informational messages in the collection log file, open the administration console. Select the collection, click Actions > Logging > Configure log file options, and then select All messages for the type of information to log and trace. After you stop and restart the crawler session, informational messages appear in the collection log file.