Crawler plug-ins for archive files are Java application programming interfaces (APIs)
to which you can add your own logic. You can use this type of plug-in
with non-web crawlers to extract entries from archive files, which
can then be parsed and included in collections.
Ensure that the correct version of Java is installed. The crawler plug-in for archive
files must be compiled with the IBM® Software Development Kit (SDK) for Java Version 1.6.
You cannot use this plug-in with the Agent for Windows file systems, FileNet P8, or SharePoint crawlers.
The non-web crawlers provide a plug-in interface that enables
you to extend their crawling capabilities and crawl archive files
in IBM Content Analytics with Enterprise Search. The crawler
uses the specified crawler plug-in for archive files to extract archive
entries from an archive file and send the extracted entries
to the parsers.
To use this capability, you must develop a crawler
plug-in for archive files that implements the com.ibm.es.crawler.plugin.archive.ArchiveFile
interface and register the plug-in in the crawler configuration file.
Important: To enable users to fetch and view files that are
extracted from an archive file when they view search results, you
must extend your archive plug-in to support viewing extracted files.
- Create a Java class
to use as a crawler plug-in for archive files.
- Implement the com.ibm.es.crawler.plugin.archive.ArchiveFile
interface and implement the following methods:
public interface ArchiveFile {
    /**
     * Creates a new archive file with the specified InputStream instance.
     */
    public void open(InputStream input) throws IOException;

    /**
     * Closes this archive file.
     */
    public void close() throws IOException;

    /**
     * Reads the next archive entry and positions the stream at the beginning of
     * the entry data.
     *
     * @param charset the name of the charset
     * @return the next entry
     */
    public ArchiveEntry getNextEntry(String charset) throws IOException;

    /**
     * Returns an input stream for the current archive entry.
     *
     * @return the input stream
     */
    public InputStream getInputStream() throws IOException;
}
For name resolution, use the ES_INSTALL_ROOT/lib/dscrawler.jar file.
- Implement the com.ibm.es.crawler.plugin.archive.ArchiveEntry
interface and implement the following methods:
public interface ArchiveEntry {
    /**
     * Returns the name of this entry.
     *
     * @return the name of this entry
     */
    public String getName();

    /**
     * Returns the modification time of this entry.
     *
     * @return the modification time of this entry
     */
    public long getTime();

    /**
     * Returns the length of the file in bytes.
     *
     * @return the length of the file in bytes
     */
    public long getSize();

    /**
     * Tests whether the entry is a directory.
     *
     * @return true if the entry is a directory
     */
    public boolean isDirectory();
}
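For illustration, here is a minimal sketch of both implementations for ZIP archives, built on the standard java.util.zip classes. The names ZipArchiveFile, ZipArchiveEntry, and ZipPluginDemo are invented for this example, and the implements clauses for the com.ibm.es.crawler.plugin.archive interfaces are omitted so that the sketch compiles without dscrawler.jar; a real plug-in would declare them and compile against that JAR with the SDK for Java 1.6.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Entry wrapper with the same method signatures as ArchiveEntry.
class ZipArchiveEntry {
    private final ZipEntry entry;

    ZipArchiveEntry(ZipEntry entry) {
        this.entry = entry;
    }

    public String getName() { return entry.getName(); }
    public long getTime() { return entry.getTime(); }            // modification time
    public long getSize() { return entry.getSize(); }            // -1 if not yet known
    public boolean isDirectory() { return entry.isDirectory(); }
}

// Archive reader with the same method signatures as ArchiveFile.
class ZipArchiveFile {
    private ZipInputStream zip;

    public void open(InputStream input) throws IOException {
        zip = new ZipInputStream(input);
    }

    public void close() throws IOException {
        if (zip != null) {
            zip.close();
        }
    }

    public ZipArchiveEntry getNextEntry(String charset) throws IOException {
        // The charset parameter is unused here; ZipInputStream decodes
        // entry names as UTF-8.
        ZipEntry entry = zip.getNextEntry();
        return entry == null ? null : new ZipArchiveEntry(entry);
    }

    public InputStream getInputStream() throws IOException {
        // ZipInputStream is already positioned at the current entry's data
        // and reports end-of-stream at the end of that entry.
        return zip;
    }
}

public class ZipPluginDemo {
    public static void main(String[] args) throws IOException {
        // Build a small ZIP archive in memory so the sketch is self-contained.
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        ZipOutputStream out = new ZipOutputStream(buf);
        out.putNextEntry(new ZipEntry("hello.txt"));
        out.write("hello".getBytes("UTF-8"));
        out.closeEntry();
        out.close();

        ZipArchiveFile file = new ZipArchiveFile();
        file.open(new ByteArrayInputStream(buf.toByteArray()));
        ZipArchiveEntry entry;
        // Prints: hello.txt directory=false
        while ((entry = file.getNextEntry("UTF-8")) != null) {
            System.out.println(entry.getName() + " directory=" + entry.isDirectory());
        }
        file.close();
    }
}
```

The while loop in main mirrors how the crawler iterates entries: it calls getNextEntry until null is returned and reads each entry's data through getInputStream.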
- Compile the implemented code and create a JAR file for
it. Add the dscrawler.jar file to the class path
when you compile. The crawler plug-in for archive files must be compiled
with the IBM Software Development
Kit (SDK) for Java Version 1.6.
- Verify the crawler plug-in with the com.ibm.es.crawler.plugin.archive.ArchiveFileTester
class. Add the dscrawler.jar file
and your plug-in code to the class path when you run this Java application.
- List the archive entries with your plug-in code. Confirm that this command returns correct information about
the archive file.
- AIX® or Linux
java -classpath $ES_INSTALL_ROOT/lib/dscrawler.jar:path_to_plugin_jar com.ibm.es.crawler.plugin.archive.ArchiveFileTester plugin_classname -tv input_archive_filepath
- Windows
java -classpath %ES_INSTALL_ROOT%\lib\dscrawler.jar;path_to_plugin_jar com.ibm.es.crawler.plugin.archive.ArchiveFileTester plugin_classname -tv input_archive_filepath
- Extract the archive entries with your plug-in code. Confirm that this command extracts all archive entries successfully.
- AIX or Linux
java -classpath $ES_INSTALL_ROOT/lib/dscrawler.jar:path_to_plugin_jar com.ibm.es.crawler.plugin.archive.ArchiveFileTester plugin_classname -xv input_archive_filepath
- Windows
java -classpath %ES_INSTALL_ROOT%\lib\dscrawler.jar;path_to_plugin_jar com.ibm.es.crawler.plugin.archive.ArchiveFileTester plugin_classname -xv input_archive_filepath
- Deploy the crawler plug-in.
- In the administration console, stop the crawler that
you want to use with your crawler plug-in for archive files.
- Create a configuration file named crawler_typecrawler_ext.xml,
where crawler_ID identifies the crawler that you
want to configure, and crawler_type identifies
the prefix of the existing crawler configuration file. The existing
file is named crawler_typecrawler.xml and
it is located in the ES_NODE_ROOT/master_config/crawler_ID directory. Create the new file in the same directory:
- AIX or Linux
- $ES_NODE_ROOT/master_config/crawler_ID/crawler_typecrawler_ext.xml
- Windows
- %ES_NODE_ROOT%\master_config\crawler_ID\crawler_typecrawler_ext.xml
- Use a text editor to update the crawler_typecrawler_ext.xml file
and add the rules for your crawler plug-in for archive files. Here is a template crawler configuration file for enabling your
crawler plug-in for archive files.
<ExtendedProperties>
<AppendChild XPath="/Crawler" Name="ArchiveFileRegistry" />
<AppendChild XPath="/Crawler/ArchiveFileRegistry" Name="ArchiveFile" />
<SetAttribute XPath="/Crawler/ArchiveFileRegistry/ArchiveFile"
Name="Type">archive_file_type</SetAttribute>
<SetAttribute XPath="/Crawler/ArchiveFileRegistry/ArchiveFile"
Name="Class">plugin_classname</SetAttribute>
<SetAttribute XPath="/Crawler/ArchiveFileRegistry/ArchiveFile"
Name="Classpath">path_to_required_jars</SetAttribute>
<SetAttribute XPath="/Crawler/ArchiveFileRegistry/ArchiveFile"
Name="Path"></SetAttribute>
<AppendChild XPath="/Crawler/ArchiveFileRegistry/ArchiveFile"
Name="Extensions" />
<AppendChild XPath="/Crawler/ArchiveFileRegistry/ArchiveFile/Extensions"
Name="Extension">archive_file_extension</AppendChild>
</ExtendedProperties>
where:
- archive_file_type
- Specifies the type of the archive files.
- plugin_classname
- Specifies the fully qualified class name of your crawler plug-in
for archive files.
- path_to_required_jars
- Specifies the class path entries, delimited by the path separator, that
are required to run your crawler plug-in for archive files.
- archive_file_extension
- Specifies the file extension of the archive files that you want
to process with your crawler plug-in for archive files.
- Restart the crawler that you stopped.
Here is a sample crawler configuration for enabling the crawler
plug-in for LZH archive files.
<ExtendedProperties>
<AppendChild XPath="/Crawler" Name="ArchiveFileRegistry" />
<AppendChild XPath="/Crawler/ArchiveFileRegistry" Name="ArchiveFile" />
<SetAttribute XPath="/Crawler/ArchiveFileRegistry/ArchiveFile"
Name="Type">lzh</SetAttribute>
<SetAttribute XPath="/Crawler/ArchiveFileRegistry/ArchiveFile"
Name="Class">com.ibm.es.sample.archive.lzh.LzhFile</SetAttribute>
<SetAttribute XPath="/Crawler/ArchiveFileRegistry/ArchiveFile"
Name="Classpath">C:\lzhplugin;C:\lzhplugin\lzhplugin.jar</SetAttribute>
<SetAttribute XPath="/Crawler/ArchiveFileRegistry/ArchiveFile"
Name="Path"></SetAttribute>
<AppendChild XPath="/Crawler/ArchiveFileRegistry/ArchiveFile"
Name="Extensions" />
<AppendChild XPath="/Crawler/ArchiveFileRegistry/ArchiveFile/Extensions"
Name="Extension">.lzh</AppendChild>
</ExtendedProperties>