IBM Content Analytics with Enterprise Search, Version 3.0.0                  

Creating and deploying a plug-in for archive files

Crawler plug-ins for archive files are Java application programming interfaces (APIs) to which you can add your own logic. You can use this type of plug-in with non-web crawlers to extract entries from archive files so that the entries can be parsed and included in collections.

Ensure that the correct version of Java is installed. The crawler plug-in for archive files must be compiled with the IBM® Software Development Kit (SDK) for Java Version 1.6.

You cannot use this plug-in with the Agent for Windows file systems, FileNet P8, or SharePoint crawlers.

The non-web crawlers provide a plug-in interface that enables you to extend their crawling capabilities and crawl archive files in IBM Content Analytics with Enterprise Search. The crawler uses the specified crawler plug-in for archive files to extract archive entries from an archive file and send the extracted archive entries to the parsers.

To use this capability, you must develop a crawler plug-in for archive files that implements the com.ibm.es.crawler.plugin.archive.ArchiveFile interface and register the plug-in in the crawler configuration file.

Important: To enable users to fetch and view files that are extracted from an archive file when they view search results, you must extend your archive plug-in to view extracted files.
  1. Create a Java class to use as a crawler plug-in for archive files.
    1. Implement the com.ibm.es.crawler.plugin.archive.ArchiveFile interface and implement the following methods:
      public interface ArchiveFile {
         /**
          * Creates a new archive file with the specified InputStream instance.
          */
         public void open(InputStream input) throws IOException;
      
         /**
          * Close this archive file.
          */
         public void close() throws IOException;
      
         /**
          * Reads the next archive entry and positions stream at the beginning of
          * the entry data.
          * 
          * @param charset the name of charset
          * @return the next entry
          */
         public ArchiveEntry getNextEntry(String charset) throws IOException;
      
         /**
          * Returns an input stream of the current archive entry.
          * 
          * @return the input stream
          */
         public InputStream getInputStream() throws IOException;
      }
      For name resolution, use the ES_INSTALL_ROOT/lib/dscrawler.jar file.
    2. Implement the com.ibm.es.crawler.plugin.archive.ArchiveEntry interface and implement the following methods:
      public interface ArchiveEntry {
         /**
          * Returns the name of this entry.
          * 
          * @return the name of this entry
          */
         public String getName();
         
         /**
          * Returns the modify time of this entry.
          * 
          * @return the modify time of this entry
          */
         public long getTime();
      
         /**
          * Returns the length of file in bytes.
          * 
          * @return the length of file in bytes
          */
         public long getSize();
         
         /**
          * Tests whether the entry is a directory.
          * 
          * @return true if the entry is a directory
          */
         public boolean isDirectory();
      }
    3. Compile the implemented code and create a JAR file for it. Add the dscrawler.jar file to the class path when you compile. The crawler plug-in for archive files must be compiled with the IBM Software Development Kit (SDK) for Java Version 1.6.
  2. Verify the crawler plug-in with the com.ibm.es.crawler.plugin.archive.ArchiveFileTester class. Add the dscrawler.jar file and your plug-in code to the class path when you run this Java application.
    1. List the archive entries with your plug-in code. Confirm that this command returns correct information about the archive file.
      AIX® or Linux

      java -classpath $ES_INSTALL_ROOT/lib/dscrawler.jar:path_to_plugin_jar com.ibm.es.crawler.plugin.archive.ArchiveFileTester plugin_classname -tv input_archive_filepath

      Windows

      java -classpath %ES_INSTALL_ROOT%\lib\dscrawler.jar;path_to_plugin_jar com.ibm.es.crawler.plugin.archive.ArchiveFileTester plugin_classname -tv input_archive_filepath

    2. Extract the archive entries with your plug-in code. Confirm that this command extracts all archive entries successfully.
      AIX or Linux

      java -classpath $ES_INSTALL_ROOT/lib/dscrawler.jar:path_to_plugin_jar com.ibm.es.crawler.plugin.archive.ArchiveFileTester plugin_classname -xv input_archive_filepath

      Windows

      java -classpath %ES_INSTALL_ROOT%\lib\dscrawler.jar;path_to_plugin_jar com.ibm.es.crawler.plugin.archive.ArchiveFileTester plugin_classname -xv input_archive_filepath

  3. Deploy the crawler plug-in.
    1. In the administration console, stop the crawler that you want to use with your crawler plug-in for archive files.
    2. Create a configuration file named crawler_typecrawler_ext.xml in the following location, where crawler_ID identifies the crawler that you want to configure and crawler_type identifies the prefix of the existing crawler configuration file. The existing file is named crawler_typecrawler.xml and is located in the ES_NODE_ROOT/master_config/crawler_ID directory.
      AIX or Linux
      $ES_NODE_ROOT/master_config/crawler_ID/crawler_typecrawler_ext.xml
      Windows
      %ES_NODE_ROOT%\master_config\crawler_ID\crawler_typecrawler_ext.xml
    3. Use a text editor to update the crawler_typecrawler_ext.xml file and add the rules for your crawler plug-in for archive files. Here is a template crawler configuration file for enabling your crawler plug-in for archive files.
      <ExtendedProperties>
        <AppendChild XPath="/Crawler" Name="ArchiveFileRegistry" />
        <AppendChild XPath="/Crawler/ArchiveFileRegistry" Name="ArchiveFile" />
          <SetAttribute XPath="/Crawler/ArchiveFileRegistry/ArchiveFile"
           Name="Type">archive_file_type</SetAttribute>
          <SetAttribute XPath="/Crawler/ArchiveFileRegistry/ArchiveFile"
           Name="Class">plugin_classname</SetAttribute>
          <SetAttribute XPath="/Crawler/ArchiveFileRegistry/ArchiveFile"
           Name="Classpath">path_to_required_jars</SetAttribute>
          <SetAttribute XPath="/Crawler/ArchiveFileRegistry/ArchiveFile"
           Name="Path"></SetAttribute>
        <AppendChild XPath="/Crawler/ArchiveFileRegistry/ArchiveFile" 
        Name="Extensions" />
        <AppendChild XPath="/Crawler/ArchiveFileRegistry/ArchiveFile/Extensions"
         Name="Extension">archive_file_extension</AppendChild>
      </ExtendedProperties>
      where:
      archive_file_type
        Specifies the type of the archive files.
      plugin_classname
        Specifies the fully qualified class name of your crawler plug-in for archive files.
      path_to_required_jars
        Specifies the class path entries (JAR files and directories), delimited by the path separator, that are required to run your crawler plug-in for archive files.
      archive_file_extension
        Specifies the file extension of the archive files that you want to process with your crawler plug-in for archive files.
    4. Restart the crawler that you stopped.
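
To illustrate step 1, the following is a minimal sketch of a plug-in that reads ZIP archives with java.util.zip. It is not IBM sample code, and ZIP is used only because the JDK can read it without extra libraries. Several things here are assumptions: the nested ArchiveFile and ArchiveEntry interfaces are stand-in copies of the listings in step 1 so that the sketch compiles without dscrawler.jar; the names ZipPluginSketch, ZipArchiveFile, and listDemo are hypothetical; and the convention that getNextEntry returns null when no entries remain is assumed rather than documented. The charset argument is ignored in this sketch. The code avoids language features newer than Java 1.6.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

class ZipPluginSketch {

    // Stand-in copies of the dscrawler.jar interfaces so the sketch compiles alone.
    // A real plug-in would import com.ibm.es.crawler.plugin.archive.* instead.
    interface ArchiveEntry {
        String getName();
        long getTime();
        long getSize();
        boolean isDirectory();
    }

    interface ArchiveFile {
        void open(InputStream input) throws IOException;
        void close() throws IOException;
        ArchiveEntry getNextEntry(String charset) throws IOException;
        InputStream getInputStream() throws IOException;
    }

    /** Hypothetical plug-in class: exposes a ZIP stream through ArchiveFile. */
    static class ZipArchiveFile implements ArchiveFile {
        private ZipInputStream zip;

        public void open(InputStream input) throws IOException {
            zip = new ZipInputStream(input);
        }

        public void close() throws IOException {
            if (zip != null) {
                zip.close();
                zip = null;
            }
        }

        public ArchiveEntry getNextEntry(String charset) throws IOException {
            // The charset argument is ignored in this sketch.
            final ZipEntry e = zip.getNextEntry();
            if (e == null) {
                return null; // Assumption: null tells the caller that no entries remain.
            }
            return new ArchiveEntry() {
                public String getName() { return e.getName(); }
                public long getTime() { return e.getTime(); }
                public long getSize() { return e.getSize(); } // -1 if not yet known
                public boolean isDirectory() { return e.isDirectory(); }
            };
        }

        public InputStream getInputStream() throws IOException {
            // Wrap the stream so a caller that closes it does not close the archive.
            return new FilterInputStream(zip) {
                public void close() { }
            };
        }
    }

    /** Builds a two-entry ZIP in memory, then lists it through the plug-in. */
    static List<String> listDemo() {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            ZipOutputStream out = new ZipOutputStream(buf);
            out.putNextEntry(new ZipEntry("docs/"));
            out.closeEntry();
            out.putNextEntry(new ZipEntry("docs/a.txt"));
            out.write("hello".getBytes("UTF-8"));
            out.closeEntry();
            out.close();

            List<String> lines = new ArrayList<String>();
            ArchiveFile archive = new ZipArchiveFile();
            archive.open(new ByteArrayInputStream(buf.toByteArray()));
            try {
                ArchiveEntry entry;
                while ((entry = archive.getNextEntry("UTF-8")) != null) {
                    if (entry.isDirectory()) {
                        lines.add("dir  " + entry.getName());
                        continue;
                    }
                    InputStream in = archive.getInputStream();
                    int count = 0;
                    while (in.read() != -1) {
                        count++; // count the bytes of the current entry
                    }
                    lines.add("file " + entry.getName() + " " + count);
                }
            } finally {
                archive.close();
            }
            return lines;
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        for (String line : listDemo()) {
            System.out.println(line);
        }
    }
}
```

Running the main method lists one directory entry (docs/) and one file entry (docs/a.txt, 5 bytes). The loop in listDemo only imitates how a caller might drive the four interface methods; the ArchiveFileTester class in step 2 remains the supported way to verify a real plug-in.
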
Here is a sample crawler configuration for enabling the crawler plug-in for LZH archive files.
<ExtendedProperties>
  <AppendChild XPath="/Crawler" Name="ArchiveFileRegistry" />
  <AppendChild XPath="/Crawler/ArchiveFileRegistry" Name="ArchiveFile" />
    <SetAttribute XPath="/Crawler/ArchiveFileRegistry/ArchiveFile" 
     Name="Type">lzh</SetAttribute>
    <SetAttribute XPath="/Crawler/ArchiveFileRegistry/ArchiveFile" 
     Name="Class">com.ibm.es.sample.archive.lzh.LzhFile</SetAttribute>
    <SetAttribute XPath="/Crawler/ArchiveFileRegistry/ArchiveFile" 
     Name="Classpath">C:\lzhplugin;C:\lzhplugin\lzhplugin.jar</SetAttribute>  
    <SetAttribute XPath="/Crawler/ArchiveFileRegistry/ArchiveFile" 
     Name="Path"></SetAttribute>
  <AppendChild XPath="/Crawler/ArchiveFileRegistry/ArchiveFile" 
  Name="Extensions" />
  <AppendChild XPath="/Crawler/ArchiveFileRegistry/ArchiveFile/Extensions"
   Name="Extension">.lzh</AppendChild>
</ExtendedProperties>
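
If several crawlers need the same registration, the extension file can be generated instead of typed by hand. The following is a minimal sketch that assumes nothing beyond the template in step 3. The names ArchiveExtConfigSketch, render, setAttribute, and escape are hypothetical and not part of the product. The sketch only builds the file text; saving it as crawler_typecrawler_ext.xml in the ES_NODE_ROOT/master_config/crawler_ID directory is left to you.

```java
class ArchiveExtConfigSketch {

    /** Escapes the characters that are special in XML text content. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Renders one SetAttribute rule for the ArchiveFile element. */
    static String setAttribute(String name, String value) {
        return "    <SetAttribute XPath=\"/Crawler/ArchiveFileRegistry/ArchiveFile\"\n"
             + "     Name=\"" + name + "\">" + escape(value) + "</SetAttribute>\n";
    }

    /** Fills the step 3 template with the four placeholder values. */
    static String render(String type, String className, String classpath, String extension) {
        StringBuilder xml = new StringBuilder();
        xml.append("<ExtendedProperties>\n");
        xml.append("  <AppendChild XPath=\"/Crawler\" Name=\"ArchiveFileRegistry\" />\n");
        xml.append("  <AppendChild XPath=\"/Crawler/ArchiveFileRegistry\" Name=\"ArchiveFile\" />\n");
        xml.append(setAttribute("Type", type));
        xml.append(setAttribute("Class", className));
        xml.append(setAttribute("Classpath", classpath));
        xml.append(setAttribute("Path", ""));
        xml.append("  <AppendChild XPath=\"/Crawler/ArchiveFileRegistry/ArchiveFile\"\n");
        xml.append("   Name=\"Extensions\" />\n");
        xml.append("  <AppendChild XPath=\"/Crawler/ArchiveFileRegistry/ArchiveFile/Extensions\"\n");
        xml.append("   Name=\"Extension\">").append(escape(extension)).append("</AppendChild>\n");
        xml.append("</ExtendedProperties>\n");
        return xml.toString();
    }

    public static void main(String[] args) {
        // Values taken from the LZH sample configuration.
        System.out.print(render("lzh", "com.ibm.es.sample.archive.lzh.LzhFile",
                "C:\\lzhplugin;C:\\lzhplugin\\lzhplugin.jar", ".lzh"));
    }
}
```

Escaping the values matters mainly for class paths, which can contain an ampersand on some systems.
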

Last updated: May 2012

© Copyright IBM Corporation 2004, 2012.