Crawler plug-ins for non-web sources

Data source crawler plug-ins are Java applications that can change the content or metadata of crawled documents. You can configure a data source crawler plug-in for all non-web crawler types.

With the crawler plug-in for data source crawlers, you can add, change, or delete crawled content or metadata. You can also create a plug-in that extracts files from archive files and extend that plug-in so that users can view the extracted content in their search results.
Restriction: The following type B data source crawlers do not support plug-ins to extract or fetch documents from archive files:
  • Agent for Windows file systems crawler
  • BoardReader crawler
  • Case Manager crawler
  • Exchange Server crawler
  • FileNet P8 crawler
  • SharePoint crawler

When you specify a Java class as the crawler plug-in, the crawler calls that class for each document that it crawls.

For each document, the crawler passes to your Java class the document identifier, the security tokens, the metadata, and the content that were specified by an administrator. Your Java class can return a new or modified set of security tokens, metadata, and content.
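
For example, a plug-in might look like the following minimal sketch. It assumes the AbstractCrawlerPlugin base class of the crawler plug-in API that is installed with the product; the CrawledData and FieldMetadata accessor names that are used here are assumptions, so verify them against the Javadoc that ships with Watson Explorer Content Analytics.

// Minimal sketch of a data source crawler plug-in. The CrawledData and
// FieldMetadata accessors below are assumed names; check the installed Javadoc.
import com.ibm.es.crawler.plugin.AbstractCrawlerPlugin;
import com.ibm.es.crawler.plugin.CrawledData;
import com.ibm.es.crawler.plugin.CrawlerPluginException;
import com.ibm.es.crawler.plugin.FieldMetadata;

import java.util.List;

public class SampleCrawlerPlugin extends AbstractCrawlerPlugin {

    public void init() throws CrawlerPluginException {
        // Called once when the crawler session starts; open resources here.
    }

    public boolean isMetadataUsed() {
        // Return true so that the crawler passes metadata to updateDocument().
        return true;
    }

    public boolean isContentUsed() {
        // Return false if the plug-in does not need the document content.
        return false;
    }

    public void activate() throws CrawlerPluginException {
        // Called when the crawler session is activated.
    }

    public void deactivate() throws CrawlerPluginException {
        // Called when the crawler session is deactivated.
    }

    public CrawledData updateDocument(CrawledData crawledData)
            throws CrawlerPluginException {
        // Called for each crawled document. Return the (possibly modified)
        // object to keep the document in the crawl.
        List<FieldMetadata> metadata = crawledData.getMetadataList(); // assumed accessor
        metadata.add(new FieldMetadata("reviewed", "true"));          // assumed constructor
        crawledData.setMetadataList(metadata);                        // assumed mutator
        return crawledData;
    }

    public void term() throws CrawlerPluginException {
        // Called once when the crawler session ends; release resources here.
    }
}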

Restriction: The crawler plug-in lets you add security tokens, but it does not give you access to the native access control lists (ACLs) that are collected by the crawlers provided with Watson Explorer Content Analytics.
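
For example, to add a token, the updateDocument method in the sketch above might append it to the tokens that the crawler supplies; the plug-in never sees the native ACL of the source repository. The getSecurityTokens and setSecurityTokens accessors and the comma-separated token format are assumptions, so confirm them against the product Javadoc.

public CrawledData updateDocument(CrawledData crawledData)
        throws CrawlerPluginException {
    // Append an extra security token to the tokens that the crawler
    // already assigned to this document. The plug-in can only add tokens;
    // it cannot read the native ACL from the source repository.
    String tokens = crawledData.getSecurityTokens();         // assumed accessor
    crawledData.setSecurityTokens(tokens + ",extra_group");  // assumed mutator and token format
    return crawledData;
}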