Custom crawler plug-ins

When you configure properties for crawlers, you can specify a Java™ class to use to enforce document-level access controls. You can also use the Java class to update the index by adding, modifying, or removing metadata and document content. By writing a plug-in, you can also extend the crawler's ability to crawl archive files.

A plug-in contains a Java class that is called for each document that the crawler crawls. The Java class is passed the document identifier (URI) from the index, security tokens, metadata, and document content. The class can return new or modified security tokens, metadata, and content, or the class can remove security tokens, metadata, and content.
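The exact plug-in interface depends on your product version, but the following sketch shows the general shape of such a class, assuming an API along the lines of the com.ibm.es.crawler.plugin.AbstractCrawlerPlugin class. Verify the class names and method signatures against the Javadoc documentation that is installed with your product.

    import com.ibm.es.crawler.plugin.AbstractCrawlerPlugin;
    import com.ibm.es.crawler.plugin.CrawledData;
    import com.ibm.es.crawler.plugin.CrawlerPluginException;

    /**
     * Skeleton crawler plug-in. The crawler calls updateDocument() once for
     * each crawled document and indexes whatever the method returns.
     */
    public class SampleCrawlerPlugin extends AbstractCrawlerPlugin {

        public void init() throws CrawlerPluginException {
            // Called once when the crawler session starts: read configuration, open connections.
        }

        public boolean isMetadataUsed() {
            return true;    // ask the crawler to pass document metadata to updateDocument()
        }

        public boolean isContentUsed() {
            return true;    // ask the crawler to pass document content to updateDocument()
        }

        public CrawledData updateDocument(CrawledData crawledData)
                throws CrawlerPluginException {
            // The document identifier (URI), security tokens, metadata, and
            // content are available on the CrawledData object. Inspect or
            // change them here, then return the document for indexing.
            return crawledData;
        }

        public void term() throws CrawlerPluginException {
            // Called once when the crawler session ends: release resources.
        }
    }

The compiled class is typically packaged in a JAR file and registered by specifying the class name and class path in the crawler properties.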

After all of the documents in the crawl space are crawled once, the plug-in is called only for new or modified documents. To change the security tokens, metadata, or content of documents that are in the index, but that were not updated in the original data source, start a full crawl of all documents in the crawl space and then rebuild the main index.

Plug-ins to enforce security

Document-level security is enforced by associating one or more security tokens (a comma-delimited string) with each document that a crawler crawls. Group identifiers are commonly used as the security tokens.

By default, each document is assigned a public token that makes the document available to everyone. The public token can be replaced with a value that is provided by the administrator or a value that is extracted from a field in the crawled document.

The plug-in allows you to apply your own business rules to determine the value of the security tokens for crawled documents. The security tokens that are associated with each document are stored in the index. At query time, they are used to filter the search results and ensure that only the documents that a user has permission to view are returned.
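As an illustration, the following sketch replaces the default public token for documents in a hypothetical /hr/ path with two hypothetical group identifiers. The class and method names follow the same assumed API as the earlier skeleton and should be checked against your product's Javadoc documentation.

    import com.ibm.es.crawler.plugin.AbstractCrawlerPlugin;
    import com.ibm.es.crawler.plugin.CrawledData;
    import com.ibm.es.crawler.plugin.CrawlerPluginException;

    /**
     * Replaces the default public token for confidential documents so that
     * only members of the named groups can view them in search results.
     */
    public class SecurityTokenPlugin extends AbstractCrawlerPlugin {

        public void init() throws CrawlerPluginException { /* no setup needed */ }

        public boolean isMetadataUsed() { return false; }

        public boolean isContentUsed() { return false; }

        public CrawledData updateDocument(CrawledData crawledData)
                throws CrawlerPluginException {
            // Hypothetical business rule: documents whose URI contains /hr/
            // are visible only to the hr_staff and hr_managers groups.
            if (crawledData.getURI().contains("/hr/")) {
                // Security tokens are passed as a single comma-delimited string.
                crawledData.setSecurityTokens("hr_staff,hr_managers");
            }
            return crawledData;
        }

        public void term() throws CrawlerPluginException { /* no cleanup needed */ }
    }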

Plug-ins to add, modify, or remove metadata

Document metadata, such as the date that a document was last modified, is created for all crawled documents. The crawler plug-in allows you to apply your own business rules to determine the value of the metadata that is to be indexed for each document.

For example, metadata cannot be extracted from binary documents through the built-in parsers. However, you can create a crawler plug-in to add searchable metadata to documents after they are crawled.

The metadata is created as a name-value pair. Users can search the metadata with a free-text query or with a query that specifies the metadata field name.

For another example, assume that your source content includes metadata values that differ only in case, such as Email Correspondence in some documents and Email correspondence in other documents. To avoid having the facet for this metadata show two entries, you can create a crawler plug-in to cleanse the data. For example, depending on the pattern that you prefer, you can change the first character of every word to upper case, or you can change only the first character in the second word to upper case or lower case.
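A minimal sketch of such a cleansing rule follows. It capitalizes the first character of every word in a hypothetical category field; the FieldMetadata accessor names are assumptions to verify against your product's Javadoc documentation.

    import java.util.ArrayList;
    import java.util.List;

    import com.ibm.es.crawler.plugin.AbstractCrawlerPlugin;
    import com.ibm.es.crawler.plugin.CrawledData;
    import com.ibm.es.crawler.plugin.CrawlerPluginException;
    import com.ibm.es.crawler.plugin.FieldMetadata;

    /**
     * Normalizes the case of the "category" metadata field so that values
     * such as "Email Correspondence" and "Email correspondence" produce a
     * single facet entry.
     */
    public class MetadataCleansingPlugin extends AbstractCrawlerPlugin {

        public void init() throws CrawlerPluginException { /* no setup needed */ }

        public boolean isMetadataUsed() { return true; }

        public boolean isContentUsed() { return false; }

        public CrawledData updateDocument(CrawledData crawledData)
                throws CrawlerPluginException {
            List cleansed = new ArrayList();
            for (Object item : crawledData.getMetadataList()) {
                FieldMetadata field = (FieldMetadata) item;
                String value = field.getFieldValue();
                if ("category".equals(field.getFieldName())) {
                    value = capitalizeWords(value);   // "email correspondence" -> "Email Correspondence"
                }
                cleansed.add(new FieldMetadata(field.getFieldName(), value));
            }
            crawledData.setMetadataList(cleansed);
            return crawledData;
        }

        // Change the first character of every word to upper case and the rest to lower case.
        private static String capitalizeWords(String value) {
            StringBuilder result = new StringBuilder(value.length());
            boolean startOfWord = true;
            for (char c : value.toCharArray()) {
                result.append(startOfWord ? Character.toUpperCase(c) : Character.toLowerCase(c));
                startOfWord = Character.isWhitespace(c);
            }
            return result.toString();
        }

        public void term() throws CrawlerPluginException { /* no cleanup needed */ }
    }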

Plug-ins to add, modify, or remove document content

Document content comprises the parts of a document that contain searchable content and content that can become part of the dynamic document summary in the search results. The crawler plug-in allows you to apply your own business rules to determine the content that is to be indexed for each document.

If you plan to search multiple collections at the same time, you might want to create a plug-in to alter how the values of sortable fields are evaluated in the search results. In a federated search, field values are treated as string values, not numeric values. If you sort results by file size, for example, the file size values from each collection are sorted in alphabetic order, not numeric order. A crawler plug-in can alter how the file size values are evaluated. For example:
  • The plug-in can directly modify the filesize field by prepending zeroes to create a fixed-length string field.
  • The plug-in can define a new field, such as filesizesort, as a fixed-length string field that uses prepended zeroes to enforce the fixed size, as shown in the sketch after this list. Be sure to set this field to be sortable and returnable in the administration console. The advantage of this approach is that the filesize field can continue to be searched as a parametric field.
With either approach, you can modify the searchResultsTable.jsp file to remove the prepended zeroes before showing the search results so that values like 000012345 are not displayed.
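The following sketch illustrates the second approach: it copies the filesize value into a new filesizesort field as a nine-digit, zero-padded string. The field names come from the example above; the plug-in API names are the same assumptions as in the earlier sketches.

    import java.util.List;

    import com.ibm.es.crawler.plugin.AbstractCrawlerPlugin;
    import com.ibm.es.crawler.plugin.CrawledData;
    import com.ibm.es.crawler.plugin.CrawlerPluginException;
    import com.ibm.es.crawler.plugin.FieldMetadata;

    /**
     * Adds a filesizesort field that holds the file size as a zero-padded,
     * fixed-length string (for example, 12345 becomes 000012345) so that
     * string sorting across federated collections matches numeric order.
     */
    public class FileSizeSortPlugin extends AbstractCrawlerPlugin {

        public void init() throws CrawlerPluginException { /* no setup needed */ }

        public boolean isMetadataUsed() { return true; }

        public boolean isContentUsed() { return false; }

        public CrawledData updateDocument(CrawledData crawledData)
                throws CrawlerPluginException {
            List metadataList = crawledData.getMetadataList();
            FieldMetadata sizeField = null;
            for (Object item : metadataList) {
                FieldMetadata field = (FieldMetadata) item;
                if ("filesize".equals(field.getFieldName())) {
                    sizeField = field;
                    break;
                }
            }
            if (sizeField != null) {
                // Assumes the filesize value is a plain integer number of bytes.
                String padded = String.format("%09d", Long.parseLong(sizeField.getFieldValue().trim()));
                metadataList.add(new FieldMetadata("filesizesort", padded));
                crawledData.setMetadataList(metadataList);
            }
            return crawledData;
        }

        public void term() throws CrawlerPluginException { /* no cleanup needed */ }
    }

Remember to define filesizesort as an index field that is sortable and returnable, as noted in the second approach above.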

Web crawler plug-ins

With the application programming interfaces for the Web crawler, you can control how documents are crawled and how they are prepared for parsing. For example, you can add fields to the HTTP request header that will be used when the crawler requests a document. After a document is crawled, and before it is parsed and tokenized, you can change the content, security tokens, and metadata. You can also stop the document from being sent to the parser.
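As an illustration of the first case, the following sketch appends a field to the outgoing HTTP request header, assuming a prefetch-style interface such as com.ibm.es.wc.pi.PrefetchPlugin. The interface name, method signatures, the header format, and the X-Custom-Crawl-Tag field are assumptions to verify against the Javadoc documentation for your version.

    import com.ibm.es.wc.pi.PrefetchPlugin;
    import com.ibm.es.wc.pi.PrefetchPluginArg;
    import com.ibm.es.wc.pi.PrefetchPluginArg1;

    /**
     * Web crawler prefetch plug-in that appends a custom field to the HTTP
     * request header before each document is downloaded.
     */
    public class RequestHeaderPlugin implements PrefetchPlugin {

        public boolean init() { return true; }

        public boolean processDocument(PrefetchPluginArg[] args) {
            PrefetchPluginArg1 arg = (PrefetchPluginArg1) args[0];
            // Append an illustrative field to the request header; the exact
            // header format that the crawler expects is defined in the Javadoc.
            String header = arg.getHTTPHeader();
            arg.setHTTPHeader(header + "X-Custom-Crawl-Tag: example\r\n");
            return true;    // return false to stop further processing of this document (see the Javadoc)
        }

        public boolean release() { return true; }
    }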

Archive file plug-ins

By writing a plug-in, you can extend the crawlers and enable support for crawling archive file formats other than ZIP and TAR. For example, you can write a plug-in to support the crawling of documents in LZH format.
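The archive plug-in interface differs by product version, so the following sketch is purely illustrative: the class name, the ArchiveEntry holder, and the extractEntries method are hypothetical placeholders that show the shape of the task, which is to expand an LZH container and hand each contained file back to the crawler.

    import java.io.InputStream;
    import java.util.ArrayList;
    import java.util.List;

    /**
     * Illustrative shape of an archive file plug-in for the LZH format.
     * Replace this hypothetical class with an implementation of the archive
     * plug-in interface that is documented for your product.
     */
    public class LzhArchivePlugin {

        /** Hypothetical holder for one extracted archive entry. */
        public static class ArchiveEntry {
            public final String name;      // path of the file inside the archive
            public final byte[] content;   // raw bytes of the extracted file
            public ArchiveEntry(String name, byte[] content) {
                this.name = name;
                this.content = content;
            }
        }

        /** Expand the LZH container and return one entry per contained file. */
        public List<ArchiveEntry> extractEntries(InputStream archiveStream) {
            List<ArchiveEntry> entries = new ArrayList<ArchiveEntry>();
            // Decode the LZH stream here, for example with a third-party LZH
            // decoder library, and add one ArchiveEntry per file.
            return entries;
        }
    }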

Unfenced mode

When you configure some crawlers, you can select an option to run the plug-in in unfenced mode. In this mode, the plug-in runs inside the crawler process instead of in a separate process, which can improve plug-in performance.

Important: If the plug-in encounters a problem that is not recoverable when it runs in this mode, the crawler process might be terminated.