Text Indexing Preprocessors

Text indexing preprocessors are server-side, user-implemented action handlers that process document content before it is indexed by Elasticsearch or Content Search Services. You associate a text indexing preprocessor handler with a class definition. When a document with content-based retrieval (CBR) enabled enters the indexing queue, the associated handler is triggered.

Text indexing preprocessors allow content modifications that enhance search accuracy and relevance. For example, a text indexing preprocessor can be used to replace or augment the text extract, add custom metadata fields for indexing, or apply content transformations before the indexing process. For feature comparisons between text indexing preprocessors and other action handlers, see Action Handlers: Restrictions and Best Practices.

Text indexing preprocessors have these characteristics:

  • The document object in an indexing request is passed to a text indexing preprocessor handler. The handler can access document properties and content to generate the text extract and additional indexing fields.
  • You can implement a text indexing preprocessor handler as a Java or JavaScript component. A text indexing preprocessor that is implemented with Java can be placed in a code module, and can coexist in the same code module with other action handler types: event action, lifecycle action, change preprocessor, and document classifier. For more information, see Deploying Java Action Handlers.
  • You can set one or more text indexing preprocessors on a subscribable class definition, such as Document. For more information, see Setup Requirements.
  • A text indexing preprocessor set on a class is applied recursively to subclasses in the class hierarchy. For example, if you set a text indexing preprocessor on the Document class, it is also applied logically to the subclasses of the Document class. All enabled preprocessors that are associated with the document's class hierarchy are loaded and called per indexing request.

    If the same preprocessor is referenced multiple times within a hierarchy, only the reference nearest the root of the hierarchy is called; all other references are silently skipped. For more information, see Text indexing preprocessors.

  • A text indexing preprocessor runs synchronously during the indexing process, so an exception thrown by the handler prevents the document from being indexed.
  • Text indexing preprocessors are triggered when documents with CBR enabled enter the indexing queue.
  • Unless disabled, text indexing preprocessors are started unconditionally for documents with associated definitions.
  • The text extract and additional fields that are generated by the preprocessor are indexed by Elasticsearch or Content Search Services.
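
The class-hierarchy rule in the list above (enabled preprocessors are collected across the hierarchy, and a repeated reference is honored only where it is nearest the root) can be illustrated with a small model. This is a minimal sketch, not the server's implementation: HierarchyResolutionSketch, Definition, ClassNode, and resolve are hypothetical names invented for the illustration, and the sketch assumes that a disabled reference is simply ignored.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Illustration only: every type here is a hypothetical stand-in, not a product API class. */
final class HierarchyResolutionSketch {

    /** One preprocessor reference set on a class (hypothetical model). */
    static final class Definition {
        final String actionName;
        final boolean enabled;

        Definition(String actionName, boolean enabled) {
            this.actionName = actionName;
            this.enabled = enabled;
        }
    }

    /** One class in the hierarchy plus the references set directly on it (hypothetical model). */
    static final class ClassNode {
        final String name;
        final List<Definition> definitions;

        ClassNode(String name, List<Definition> definitions) {
            this.name = name;
            this.definitions = definitions;
        }
    }

    /**
     * Returns the action names that this model would call for one indexing request.
     * The hierarchy is passed root first, so the first enabled reference to an
     * action is kept and any later reference to the same action is skipped.
     */
    static List<String> resolve(List<ClassNode> rootToLeaf) {
        Set<String> selected = new LinkedHashSet<>(); // keeps first-seen order
        for (ClassNode node : rootToLeaf) {
            for (Definition definition : node.definitions) {
                if (definition.enabled) { // assumption: disabled references are ignored
                    selected.add(definition.actionName); // no effect if already selected
                }
            }
        }
        return new ArrayList<>(selected);
    }
}
```

In this model, a Document class that references a preprocessor named Redact and a subclass that references Redact again plus a second preprocessor resolve to two calls, with Redact taken from the Document class.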

Implementation Guidelines

To create a handler, implement the TextIndexingPreprocessor interface and its preprocess(action, services, sourceObject, fields, result) method. When you implement this interface, consider the following points:

  • You can retrieve the source object's properties collection to perform conditional processing. The properties collection contains all properties that are currently defined on the object, including any metadata that can be used to determine preprocessing logic.
  • You can use the services parameter to access utility methods, including getTextExtract() to obtain the default text extraction and addFieldValue() to add custom indexing fields.
  • The fields parameter is pre-populated with entries for all CBR-enabled properties of the source object and can be extended by the handler.
  • The preprocessor uses the result parameter (a TextIndexingPreprocessorServices.TextExtractionResult object) to provide the text extract and extraction status.
  • Return true if the handler preprocessed the object, or false if it declined to do so. If false is returned, the object is passed to other preprocessors or indexed normally.
  • If a chain of preprocessors runs recursively in the class hierarchy, each preprocessor in the chain can see the results from previously started preprocessors. The final text extract is indexed after all preprocessors complete.
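
The guidelines above can be sketched in plain Java. The block below does not use the product API: Services, Result, Handler, FixedServices, DepartmentTagger, and runChain are hypothetical stand-ins that only mirror the parameter names in the preprocess signature. The argument lists shown for getTextExtract(), addFieldValue(), and setResult() are assumptions made for the illustration, as are the Department property and the department_tag field. The sketch shows a handler that reads a property, declines by returning false, or builds on the current extract and returns true.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Illustration only: every type below is a hypothetical stand-in, not the product API. */
final class PreprocessorSketch {

    /** Stand-in for the services parameter; both signatures are assumptions. */
    interface Services {
        String getTextExtract();
        void addFieldValue(Map<String, List<String>> fields, String name, String value);
    }

    /** Stand-in for the result parameter; the arguments of setResult are assumptions. */
    static final class Result {
        private String text;
        private boolean succeeded;

        void setResult(String text, boolean succeeded) {
            this.text = text;
            this.succeeded = succeeded;
        }
        String getText() { return text; }
        boolean isSucceeded() { return succeeded; }
    }

    /** Stand-in for the handler interface, with the source object reduced to its properties. */
    interface Handler {
        boolean preprocess(Object action, Services services, Map<String, Object> sourceProperties,
                           Map<String, List<String>> fields, Result result);
    }

    /** Minimal in-memory Services used only to exercise the sketch. */
    static final class FixedServices implements Services {
        private final String defaultExtract;

        FixedServices(String defaultExtract) { this.defaultExtract = defaultExtract; }

        @Override public String getTextExtract() { return defaultExtract; }

        @Override public void addFieldValue(Map<String, List<String>> fields, String name, String value) {
            fields.computeIfAbsent(name, key -> new ArrayList<>()).add(value);
        }
    }

    /** Example handler: tags objects that carry a Department property and declines otherwise. */
    static final class DepartmentTagger implements Handler {
        @Override
        public boolean preprocess(Object action, Services services, Map<String, Object> sourceProperties,
                                  Map<String, List<String>> fields, Result result) {
            Object department = sourceProperties.get("Department"); // read-only property access
            if (department == null) {
                return false; // decline: other preprocessors or normal indexing take over
            }
            // Build on an earlier preprocessor's extract if one exists, else on the default extract.
            String base = result.getText() != null ? result.getText() : services.getTextExtract();
            services.addFieldValue(fields, "department_tag", department.toString());
            result.setResult(base + " [department: " + department + "]", true);
            return true;
        }
    }

    /** Runs handlers in order against one shared Result, then falls back to the default extract. */
    static Result runChain(List<Handler> chain, Services services,
                           Map<String, Object> sourceProperties, Map<String, List<String>> fields) {
        Result result = new Result();
        for (Handler handler : chain) {
            handler.preprocess(null, services, sourceProperties, fields, result);
        }
        if (result.getText() == null) {
            result.setResult(services.getTextExtract(), true); // no handler set an extract
        }
        return result;
    }
}
```

Passing one shared Result through the chain is how this sketch models the statement that each preprocessor can see the results of the preprocessors called before it. The handler never writes to the source properties, in keeping with the restriction on modifying the source object.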

Allowed Operations

  • You can read the source object's properties collection using getProperties() to perform conditional processing based on property values.
  • You can use services.getTextExtract() to obtain the default text extraction for the source object.
  • You can use services.addFieldValue() to add custom indexing fields to the fields map.
  • You can set the text extract and extraction status using result.setResult().

Disallowed Operations

A handler must not call the following operations, which would modify the source object or interfere with the indexing process:

  • Methods of IndependentlyPersistableObject that modify the object, like save(), checkout(), checkin(), delete(), or refresh().
  • Content modification methods, like setCaptureSource() or operations that write to content streams.
  • Property modification methods, like getProperties().putValue().

Setup Requirements

To set up a text indexing preprocessor:

  1. Implement the TextIndexingPreprocessor interface.
  2. Create a CmTextIndexingPreprocessorAction object. This object contains a property that references your TextIndexingPreprocessor implementation. It also includes an IsEnabled property, allowing an administrator to disable a preprocessor at the system scope, no matter where it is referenced in the class hierarchies.
  3. Create a CmTextIndexingPreprocessorDefinition object. This object contains a property that references the CmTextIndexingPreprocessorAction object. It also includes an IsEnabled property, allowing an administrator to disable a preprocessor at the class scope.
  4. Create a CmTextIndexingPreprocessorDefinitionList object and add the CmTextIndexingPreprocessorDefinition object to it.
  5. Get the SubscribableClassDefinition object that represents the class definition on which you want to set the text indexing preprocessor.
  6. Set CmTextIndexingPreprocessorDefinitionList on the SubscribableClassDefinition object and save it.
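
The object relationships in steps 2 through 6 can be modeled with plain Java. This is a minimal sketch that does not call the product API: Action, Definition, and ClassDefinition are hypothetical stand-ins for CmTextIndexingPreprocessorAction, CmTextIndexingPreprocessorDefinition, and SubscribableClassDefinition, and configure and isEffectivelyEnabled are illustrative helper names. The sketch assumes that a reference runs only when both the system-scope and class-scope IsEnabled switches are on.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustration only: plain-Java stand-ins that mirror the object names in the steps above. */
final class SetupSketch {

    /** Stand-in for CmTextIndexingPreprocessorAction: handler reference plus system-scope switch. */
    static final class Action {
        final String handlerClassName;
        boolean isEnabled = true;

        Action(String handlerClassName) { this.handlerClassName = handlerClassName; }
    }

    /** Stand-in for CmTextIndexingPreprocessorDefinition: action reference plus class-scope switch. */
    static final class Definition {
        final Action action;
        boolean isEnabled = true;

        Definition(Action action) { this.action = action; }
    }

    /** Stand-in for SubscribableClassDefinition. */
    static final class ClassDefinition {
        final String symbolicName;
        List<Definition> textIndexingPreprocessorDefinitions = new ArrayList<>();
        boolean saved;

        ClassDefinition(String symbolicName) { this.symbolicName = symbolicName; }

        void save() { saved = true; } // a real client would persist the change to the server
    }

    /** Walks through steps 2 to 6 for one handler and one class. */
    static ClassDefinition configure(String className, String handlerClassName) {
        Action action = new Action(handlerClassName);               // step 2
        Definition definition = new Definition(action);             // step 3
        List<Definition> definitionList = new ArrayList<>();        // step 4
        definitionList.add(definition);
        ClassDefinition classDefinition = new ClassDefinition(className); // step 5 (fetched in practice)
        classDefinition.textIndexingPreprocessorDefinitions = definitionList; // step 6
        classDefinition.save();
        return classDefinition;
    }

    /** Assumption: a reference runs only if the class-scope and system-scope switches are both on. */
    static boolean isEffectivelyEnabled(Definition definition) {
        return definition.isEnabled && definition.action.isEnabled;
    }
}
```

In this model, setting isEnabled to false on the Action disables every Definition that references it, which mirrors the system-scope behavior described in step 2, while setting it on a single Definition affects only that class.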

You can get all CmTextIndexingPreprocessorAction objects with the ObjectStore.CmTextIndexingPreprocessorActions property. You can get all CmTextIndexingPreprocessorDefinition objects with the SubscribableClassDefinition.TextIndexingPreprocessorDefinitions property.

For code examples of setup and retrieval operations, see Working with Text Indexing Preprocessors.