Crawler plug-ins

Crawler plug-ins are Java™ application programming interfaces (APIs) that you can use to change content or metadata in crawled documents.

Data source crawler plug-ins

You can apply business and security rules to enforce document-level security, and you can add, update, or delete the crawled metadata and document content that is associated with documents in an index. The data source crawler plug-in APIs cannot be used with the web crawler.
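For example, a data source crawler plug-in that attaches a security token to every crawled document might look like the following minimal sketch. All class and method names here (AbstractCrawlerPlugin, CrawledData, CrawlerPluginException, updateDocument, addSecurityTokens, and the rest) are assumptions based on typical releases of the product; verify the exact API against the Javadoc documentation before you rely on it.

    import com.ibm.es.crawler.plugin.AbstractCrawlerPlugin;
    import com.ibm.es.crawler.plugin.CrawledData;
    import com.ibm.es.crawler.plugin.CrawlerPluginException;

    public class SampleCrawlerPlugin extends AbstractCrawlerPlugin {

        // Called once when the crawler session starts. Acquire external
        // resources (for example, a connection to a policy server) here.
        public void init() throws CrawlerPluginException {
        }

        // Tell the crawler which parts of each document the plug-in
        // needs, so that it can avoid passing data you do not use.
        public boolean isMetadataUsed() {
            return true;
        }

        public boolean isContentUsed() {
            return false;
        }

        // Called once for each crawled document.
        public CrawledData updateDocument(CrawledData crawledData)
                throws CrawlerPluginException {
            // Attach an application-specific security token.
            // (addSecurityTokens is an assumed method name.)
            crawledData.addSecurityTokens("example-group");
            // Returning null instead would discard the document.
            return crawledData;
        }

        // Called once when the crawler session ends. Release resources
        // that were acquired in init().
        public void term() throws CrawlerPluginException {
        }
    }

The crawler calls updateDocument once per document, and the modified object that you return is what gets indexed, which is how a plug-in enforces document-level security or rewrites metadata.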

You can also create a plug-in that extracts entries from archive files. The extracted files can then be parsed individually and included in collections.

Restriction: The following type B data source crawlers do not support plug-ins to extract or fetch documents from archive files:
  • Agent for Windows file systems crawler
  • BoardReader crawler
  • Case Manager crawler
  • Exchange Server crawler
  • FileNet P8 crawler
  • SharePoint crawler

Web crawler plug-ins

You can add fields to the HTTP request header that is sent to the origin server to request a document. You can also view the content, security tokens, and metadata of a document after the document is downloaded. You can add to, delete from, or replace any of these fields, or stop the document from being parsed.

Web crawler plug-ins support two kinds of filtering: prefetch and postparse. You can specify only a single Java class as the web crawler plug-in. However, because the prefetch and postparse behaviors are defined in two separate Java interfaces, and because a Java class can implement any number of interfaces, that one class can implement either behavior or both (see the sketch after the plug-in type descriptions).

The web crawler plug-in has two specific plug-in types:
Prefetch plug-in
A prefetch plug-in is called before the crawler downloads a document. Your plug-in is given the document URL, the fetch method, the HTTP version, and the HTTP request header. Your plug-in can use these elements to decide whether to modify the request header (for example, to add cookies) or even to cancel the download.
Postparse plug-in
A postparse plug-in is called after each download attempt, once the crawler has downloaded and parsed the target content. Your plug-in is given the document URL, the metadata that the crawler extracted from various sources, and the document's content. Your plug-in can decide whether to alter any of these items and whether to save the raw content of the document.
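The following minimal sketch shows one class that implements both behaviors: the prefetch method adds a cookie to the HTTP request header, and the postparse method suppresses parsing for documents from one host. The interface and argument class names (com.ibm.es.wc.pi.PrefetchPlugin, PostparsePlugin, PrefetchPluginArg1, PostparsePluginArg1) and their methods are assumptions based on typical releases; verify them against the Javadoc documentation for your installation.

    import com.ibm.es.wc.pi.PrefetchPlugin;
    import com.ibm.es.wc.pi.PrefetchPluginArg;
    import com.ibm.es.wc.pi.PrefetchPluginArg1;
    import com.ibm.es.wc.pi.PostparsePlugin;
    import com.ibm.es.wc.pi.PostparsePluginArg;
    import com.ibm.es.wc.pi.PostparsePluginArg1;

    public class SampleWebCrawlerPlugin
            implements PrefetchPlugin, PostparsePlugin {

        // Called once when the crawler loads the plug-in.
        public boolean init() {
            return true;
        }

        // Prefetch: called before each download. Appends a Cookie
        // field to the HTTP request header. (The header layout and
        // the setter name are assumptions.)
        public boolean processDocument(PrefetchPluginArg[] args) {
            PrefetchPluginArg1 arg = (PrefetchPluginArg1) args[0];
            String header = arg.getHTTPHeader();
            arg.setHTTPHeader(header + "Cookie: example=1\r\n");
            return true; // returning false would cancel the download
        }

        // Postparse: called after each document is downloaded and
        // parsed. Skips parsing for a hypothetical staging host.
        // (setParse is an assumed method name.)
        public boolean processDocument(PostparsePluginArg[] args) {
            PostparsePluginArg1 arg = (PostparsePluginArg1) args[0];
            if (arg.getURL().startsWith("http://staging.example.com/")) {
                arg.setParse(false);
            }
            return true;
        }

        // Called once when the crawler unloads the plug-in.
        public boolean release() {
            return true;
        }
    }

Because the same class implements both interfaces, registering that single class with the web crawler enables both the prefetch and the postparse behavior.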

Javadoc documentation for crawler plug-ins

For detailed information about each plug-in API, see the Javadoc documentation in the following directory: ES_INSTALL_ROOT/docs/api/.