Crawler plug-ins for archive files are Java application programming interfaces (APIs)
to which you can add your own logic. You can use this type of plug-in
with non-web crawlers to extract entries from archive files, which
can then be parsed and included in collections.
Ensure that the correct version of Java is installed. The crawler plug-in for archive
files must be compiled with the IBM® Software Development Kit (SDK) for Java Version 1.6.
You cannot use this plug-in with the Agent for Windows file systems, FileNet P8, or SharePoint crawlers.
The non-web crawlers provide a plug-in interface that enables
you to extend their crawling capabilities and crawl archive files
in IBM Content Analytics with Enterprise Search. The crawler
uses the specified crawler plug-in for archive files to extract archive
entries from an archive file and send the extracted entries
to the parsers.
To use this capability, you must develop a crawler
plug-in for archive files that implements the com.ibm.es.crawler.plugin.archive.ArchiveFile
interface and register the plug-in in the crawler configuration file.
Important: To enable users to fetch and view files that are
extracted from an archive file when they view search results, you
must extend your archive plug-in to support viewing extracted files.
- Create a Java class
to use as a crawler plug-in for archive files.
- Implement the com.ibm.es.crawler.plugin.archive.ArchiveFile
interface and implement the following methods:
public interface ArchiveFile {
    /**
     * Creates a new archive file with the specified InputStream instance.
     */
    public void open(InputStream input) throws IOException;

    /**
     * Closes this archive file.
     */
    public void close() throws IOException;

    /**
     * Reads the next archive entry and positions the stream at the beginning of
     * the entry data.
     *
     * @param charset the name of the charset
     * @return the next entry
     */
    public ArchiveEntry getNextEntry(String charset) throws IOException;

    /**
     * Returns an input stream for the current archive entry.
     *
     * @return the input stream
     */
    public InputStream getInputStream() throws IOException;
}
For name resolution, use the ES_INSTALL_ROOT/lib/dscrawler.jar file.
- Implement the com.ibm.es.crawler.plugin.archive.ArchiveEntry
interface and implement the following methods:
public interface ArchiveEntry {
    /**
     * Returns the name of this entry.
     *
     * @return the name of this entry
     */
    public String getName();

    /**
     * Returns the modification time of this entry.
     *
     * @return the modification time of this entry
     */
    public long getTime();

    /**
     * Returns the length of the file in bytes.
     *
     * @return the length of the file in bytes
     */
    public long getSize();

    /**
     * Tests whether the entry is a directory.
     *
     * @return true if the entry is a directory
     */
    public boolean isDirectory();
}
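For illustration, here is a minimal sketch of both implementations for ZIP archives, built on the standard java.util.zip classes. The names ZipArchiveFile, ZipArchiveEntry, and ZipPluginDemo are invented for this example, and the implements clauses for the com.ibm.es.crawler.plugin.archive interfaces are omitted so that the sketch compiles without dscrawler.jar; a real plug-in would declare them and compile against that JAR with the SDK for Java 1.6.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Entry wrapper with the same method signatures as ArchiveEntry.
class ZipArchiveEntry {
    private final ZipEntry entry;

    ZipArchiveEntry(ZipEntry entry) {
        this.entry = entry;
    }

    public String getName() { return entry.getName(); }
    public long getTime() { return entry.getTime(); }            // modification time
    public long getSize() { return entry.getSize(); }            // -1 if not yet known
    public boolean isDirectory() { return entry.isDirectory(); }
}

// Archive reader with the same method signatures as ArchiveFile.
class ZipArchiveFile {
    private ZipInputStream zip;

    public void open(InputStream input) throws IOException {
        zip = new ZipInputStream(input);
    }

    public void close() throws IOException {
        if (zip != null) {
            zip.close();
        }
    }

    public ZipArchiveEntry getNextEntry(String charset) throws IOException {
        // The charset parameter is unused here; ZipInputStream decodes
        // entry names as UTF-8.
        ZipEntry entry = zip.getNextEntry();
        return entry == null ? null : new ZipArchiveEntry(entry);
    }

    public InputStream getInputStream() throws IOException {
        // ZipInputStream is already positioned at the current entry's data
        // and reports end-of-stream at the end of that entry.
        return zip;
    }
}

public class ZipPluginDemo {
    public static void main(String[] args) throws IOException {
        // Build a small ZIP archive in memory so the sketch is self-contained.
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        ZipOutputStream out = new ZipOutputStream(buf);
        out.putNextEntry(new ZipEntry("hello.txt"));
        out.write("hello".getBytes("UTF-8"));
        out.closeEntry();
        out.close();

        ZipArchiveFile file = new ZipArchiveFile();
        file.open(new ByteArrayInputStream(buf.toByteArray()));
        ZipArchiveEntry entry;
        // Prints: hello.txt directory=false
        while ((entry = file.getNextEntry("UTF-8")) != null) {
            System.out.println(entry.getName() + " directory=" + entry.isDirectory());
        }
        file.close();
    }
}
```

The while loop in main mirrors how the crawler iterates entries: it calls getNextEntry until null is returned and reads each entry's data through getInputStream.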
- Compile the implemented code and create a JAR file for
it. Add the dscrawler.jar file to the class path
when you compile. The crawler plug-in for archive files must be compiled
with the IBM Software Development
Kit (SDK) for Java Version 1.6.
- Verify the crawler plug-in with the com.ibm.es.crawler.plugin.archive.ArchiveFileTester
class. Add the dscrawler.jar file
and your plug-in code to the class path when you run this Java application.
- List the archive entries with your plug-in code. Confirm that this command returns correct information about
the archive file.
- AIX® or Linux
java -classpath $ES_INSTALL_ROOT/lib/dscrawler.jar:path_to_plugin_jar com.ibm.es.crawler.plugin.archive.ArchiveFileTester plugin_classname -tv input_archive_filepath
- Windows
java -classpath %ES_INSTALL_ROOT%\lib\dscrawler.jar;path_to_plugin_jar com.ibm.es.crawler.plugin.archive.ArchiveFileTester plugin_classname -tv input_archive_filepath
- Extract the archive entries with your plug-in code. Confirm that this command extracts all archive entries successfully.
- AIX or Linux
java -classpath $ES_INSTALL_ROOT/lib/dscrawler.jar:path_to_plugin_jar com.ibm.es.crawler.plugin.archive.ArchiveFileTester plugin_classname -xv input_archive_filepath
- Windows
java -classpath %ES_INSTALL_ROOT%\lib\dscrawler.jar;path_to_plugin_jar com.ibm.es.crawler.plugin.archive.ArchiveFileTester plugin_classname -xv input_archive_filepath
- Deploy the crawler plug-in.
- In the administration console, stop the crawler that
you want to use with your crawler plug-in for archive files.
- Create a configuration file named crawler_typecrawler_ext.xml,
where crawler_ID identifies the crawler that you
want to configure, and crawler_type identifies
the prefix of the existing crawler configuration file. The existing
file is named crawler_typecrawler.xml and
it is located in the ES_NODE_ROOT/master_config/crawler_ID directory. Create the new file in the same directory:
- AIX or Linux
- $ES_NODE_ROOT/master_config/crawler_ID/crawler_typecrawler_ext.xml
- Windows
- %ES_NODE_ROOT%\master_config\crawler_ID\crawler_typecrawler_ext.xml
- Use a text editor to update the crawler_typecrawler_ext.xml file
and add the rules for your crawler plug-in for archive files. Here is a template crawler configuration file for enabling your
crawler plug-in for archive files.
<ExtendedProperties>
<AppendChild XPath="/Crawler" Name="ArchiveFileRegistry" />
<AppendChild XPath="/Crawler/ArchiveFileRegistry" Name="ArchiveFile" />
<SetAttribute XPath="/Crawler/ArchiveFileRegistry/ArchiveFile"
Name="Type">archive_file_type</SetAttribute>
<SetAttribute XPath="/Crawler/ArchiveFileRegistry/ArchiveFile"
Name="Class">plugin_classname</SetAttribute>
<SetAttribute XPath="/Crawler/ArchiveFileRegistry/ArchiveFile"
Name="Classpath">path_to_required_jars</SetAttribute>
<SetAttribute XPath="/Crawler/ArchiveFileRegistry/ArchiveFile"
Name="Path"></SetAttribute>
<AppendChild XPath="/Crawler/ArchiveFileRegistry/ArchiveFile"
Name="Extensions" />
<AppendChild XPath="/Crawler/ArchiveFileRegistry/ArchiveFile/Extensions"
Name="Extension">archive_file_extension</AppendChild>
</ExtendedProperties>
where:
- archive_file_type
- Specifies the type of the archive files.
- plugin_classname
- Specifies the fully qualified class name of your crawler plug-in
for archive files.
- path_to_required_jars
- Specifies the class path entries, delimited by the path separator, that
are required to run your crawler plug-in for archive files.
- archive_file_extension
- Specifies the file extension of the archive files that you want
to process with your crawler plug-in for archive files.
- Restart the crawler that you stopped.
Here is a sample crawler configuration for enabling the crawler
plug-in for LZH archive files.
<ExtendedProperties>
<AppendChild XPath="/Crawler" Name="ArchiveFileRegistry" />
<AppendChild XPath="/Crawler/ArchiveFileRegistry" Name="ArchiveFile" />
<SetAttribute XPath="/Crawler/ArchiveFileRegistry/ArchiveFile"
Name="Type">lzh</SetAttribute>
<SetAttribute XPath="/Crawler/ArchiveFileRegistry/ArchiveFile"
Name="Class">com.ibm.es.sample.archive.lzh.LzhFile</SetAttribute>
<SetAttribute XPath="/Crawler/ArchiveFileRegistry/ArchiveFile"
Name="Classpath">C:\lzhplugin;C:\lzhplugin\lzhplugin.jar</SetAttribute>
<SetAttribute XPath="/Crawler/ArchiveFileRegistry/ArchiveFile"
Name="Path"></SetAttribute>
<AppendChild XPath="/Crawler/ArchiveFileRegistry/ArchiveFile"
Name="Extensions" />
<AppendChild XPath="/Crawler/ArchiveFileRegistry/ArchiveFile/Extensions"
Name="Extension">.lzh</AppendChild>
</ExtendedProperties>