IBM Content Analytics with Enterprise Search, Version 3.0.0                  

Creating a crawler plug-in for non-web data sources

You can create a Java class to programmatically update the security tokens, metadata, and document content of data sources other than web sources.

When the crawler is started, the plug-in process is forked. An AbstractCrawlerPlugin object is instantiated with the default constructor and the init, isMetadataUsed, and isContentUsed methods are called once. When the crawler is stopped, the term method is called and the object is destroyed.

To create a Java class for use as a crawler plug-in with content-related functions:
  1. Extend com.ibm.es.crawler.plugin.AbstractCrawlerPlugin and implement the following methods:
    init()
    isMetadataUsed()
    isContentUsed()
    term()
    updateDocument()

    The AbstractCrawlerPlugin class is an abstract class. The init method and the term method are implemented to do nothing. The isMetadataUsed method and isContentUsed method are implemented to return false by default. The updateDocument method is an abstract method, so you must implement it.

    For name resolution, use the ES_INSTALL_ROOT/lib/dscrawler.jar file.
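The contract that step 1 describes can be sketched as follows. The AbstractCrawlerPlugin stub below only mirrors the behavior this topic documents for com.ibm.es.crawler.plugin.AbstractCrawlerPlugin so that the sketch compiles on its own; a real plug-in extends the class in ES_INSTALL_ROOT/lib/dscrawler.jar, and the CrawledData type here is a hypothetical stand-in for the actual argument of updateDocument.

```java
// Illustrative stand-in that mirrors the documented defaults of
// com.ibm.es.crawler.plugin.AbstractCrawlerPlugin; in a real plug-in,
// extend the class shipped in ES_INSTALL_ROOT/lib/dscrawler.jar instead.
abstract class AbstractCrawlerPlugin {
    // Per this topic: init and term are implemented to do nothing.
    public void init() { }
    public void term() { }
    // Per this topic: both return false by default.
    public boolean isMetadataUsed() { return false; }
    public boolean isContentUsed() { return false; }
    // Abstract, so every plug-in must implement it. CrawledData is a
    // hypothetical stand-in for the crawled-document argument.
    public abstract CrawledData updateDocument(CrawledData crawledData);
}

class CrawledData { }  // hypothetical stand-in

// A minimal custom plug-in, for example com.ibm.plugins.MyPlugin.
class MyPlugin extends AbstractCrawlerPlugin {
    @Override
    public boolean isMetadataUsed() {
        // Returning true instead of false also avoids the
        // NullPointerException noted in the tip in this topic.
        return true;
    }

    @Override
    public CrawledData updateDocument(CrawledData crawledData) {
        // Update security tokens, metadata, or content here,
        // then return the (possibly modified) document.
        return crawledData;
    }
}
```

A plug-in that needs only metadata (or only content) should override just the corresponding is...Used method, so the crawler does not pass data that the plug-in will never read.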

  2. Compile the implemented code and make a JAR file for it. Add the ES_INSTALL_ROOT/lib/dscrawler.jar file to the class path when you compile.
  3. In the administration console, follow these steps:
    1. Edit the appropriate collection.
    2. Select the Crawl page and edit the crawler properties for the crawler that will use the custom Java class.
    3. Specify the following items:
      • The fully qualified class name of the implemented Java class, for example, com.ibm.plugins.MyPlugin. When you specify the class name, ensure that you do not specify the file extension, such as .class or .java.
      • The fully qualified class path for the JAR file and the directory in which all files that are required by the Java class are located. Ensure that you include the name of the JAR file in your path declaration, for example, C:\plugins\Plugins.jar. If you need to specify multiple JAR files, ensure that you use the correct separator depending on your platform, as shown in the following examples:
        • AIX® or Linux: /home/esadmin/plugins/Plugins.jar:/home/esadmin/plugins/3rdparty.jar
        • Windows: C:\plugins\Plugins.jar;C:\plugins\3rdparty.jar
  4. On the Crawl page, click Monitor. Then, click Stop and Start to restart the session for the crawler that you edited. Click Details and start a full crawl.
If the crawler stops when it is loading the plug-in, view the log file to determine the cause of the error.
Tip: If a crawler gets NullPointerException after it is configured to use a custom crawler plug-in, override com.ibm.es.crawler.plugin.AbstractCrawlerPlugin#isMetadataUsed() to return true instead of false.
Metadata field definitions: If you want to add a new metadata field in your crawler plug-in, you must create an index field and add the metadata field to the collection by configuring parsing and indexing options in the administration console. Ensure that the name of the metadata field is the same as the name of the index field.
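The following sketch shows an updateDocument body that adds one metadata field. The FieldMetadata class name and the rule that the metadata field name must match an index field come from this topic; the two-argument FieldMetadata constructor, the addMetadata method, and the "department" field name are assumptions made so the example is self-contained, and the real classes in dscrawler.jar may differ.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins so the sketch compiles on its own; the real
// FieldMetadata and CrawledData classes live in dscrawler.jar, and their
// constructors and methods may differ from what is shown here.
class FieldMetadata {
    final String name;
    final String value;
    FieldMetadata(String name, String value) {
        this.name = name;
        this.value = value;
    }
}

class CrawledData {
    final List<FieldMetadata> metadata = new ArrayList<>();
    void addMetadata(FieldMetadata fm) { metadata.add(fm); }  // assumed method
}

class MetadataPlugin {
    // Adds one metadata field to the crawled document. "department" must
    // also be created as an index field of the collection by configuring
    // parsing and indexing options in the administration console.
    CrawledData updateDocument(CrawledData crawledData) {
        crawledData.addMetadata(new FieldMetadata("department", "engineering"));
        return crawledData;
    }
}
```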
The following methods in the FieldMetadata class are deprecated. These field characteristics are overwritten by field definitions in the parser configuration:
public void setSearchable(boolean b)
public void setFieldSearchable(boolean b)
public void setParametricSearchable(boolean b)
public void setAsMetadata(boolean b)
public void setResolveConflict(String string)
public void setContent(boolean b)
public void setExactMatch(boolean b)
public void setSortable(boolean b)

Last updated: May 2012

© Copyright IBM Corporation 2004, 2012.