Crawler plug-ins for non-web sources
Data source crawler plug-ins are Java applications that can change the content or metadata of crawled documents. You can configure a data source crawler plug-in for all non-web crawler types.
With the crawler plug-in for data source crawlers, you can add, change, or delete crawled content or metadata. You can also create a plug-in that extracts files from archive files, and extend that plug-in so that users can view the extracted content when they view the search results.
Restriction: The following type B data source crawlers do not support plug-ins that extract or fetch documents from archive files:
- Agent for Windows file systems crawler
- BoardReader crawler
- Case Manager crawler
- Exchange Server crawler
- FileNet P8 crawler
- SharePoint crawler
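For crawler types that do support archive extraction, the core of such a plug-in is expanding an archive into its member files so that each one can be submitted as a separate document. The sketch below shows that step using only the standard `java.util.zip` classes; the `ArchiveExpander` class and its `expand` method are illustrative helpers, not part of the product's plug-in API.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

/**
 * Illustrative helper (not the product API): expands a ZIP archive
 * into a map of entry name -> entry bytes, the kind of step an
 * archive-extraction plug-in performs before handing each member
 * file to the crawler as its own document.
 */
public class ArchiveExpander {
    public static Map<String, byte[]> expand(byte[] archive) throws IOException {
        Map<String, byte[]> entries = new LinkedHashMap<>();
        try (ZipInputStream zin = new ZipInputStream(new ByteArrayInputStream(archive))) {
            ZipEntry entry;
            byte[] buf = new byte[8192];
            while ((entry = zin.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue; // directories carry no content to index
                }
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                int n;
                while ((n = zin.read(buf)) > 0) {
                    out.write(buf, 0, n);
                }
                entries.put(entry.getName(), out.toByteArray());
            }
        }
        return entries;
    }
}
```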
After you register your Java class as the crawler plug-in, the crawler calls the class for each document that it crawls. For each document, the crawler passes to your Java class the document identifier, the security tokens, the metadata, and the content that was specified by an administrator. Your Java class can return a new or modified set of security tokens, metadata, and content.
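The shape of that per-document callback can be sketched as follows. The class and method names here (`CrawledDocument`, `SampleCrawlerPlugin`, `updateDocument`) are hypothetical stand-ins for illustration only; the product's actual plug-in interfaces differ.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Hypothetical stand-in for the object the crawler passes to the
 * plug-in: document identifier, security tokens, metadata, and content.
 */
class CrawledDocument {
    final String id;
    final List<String> securityTokens = new ArrayList<>();
    final Map<String, String> metadata = new HashMap<>();
    byte[] content;

    CrawledDocument(String id, byte[] content) {
        this.id = id;
        this.content = content;
    }
}

/**
 * Illustrative plug-in: the crawler calls updateDocument() once per
 * crawled document, and the plug-in returns the document with any
 * changed security tokens, metadata, or content.
 */
public class SampleCrawlerPlugin {
    public CrawledDocument updateDocument(CrawledDocument doc) {
        // Add a security token so that only members of a group see this hit.
        doc.securityTokens.add("group:analysts");
        // Record in the metadata that the plug-in processed the document.
        doc.metadata.put("processedBy", "SampleCrawlerPlugin");
        // Change the content before indexing (here: trim whitespace).
        String text = new String(doc.content, StandardCharsets.UTF_8).trim();
        doc.content = text.getBytes(StandardCharsets.UTF_8);
        return doc;
    }
}
```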
Restriction: The crawler plug-in allows you to add security tokens, but it does not allow you to access the native access control lists (ACLs) that are collected by the crawlers that are provided with Watson Explorer Content Analytics.