Text Indexing Preprocessors
Text indexing preprocessors are server-side, user-implemented action handlers that process document content before it is indexed by Elasticsearch or Content Search Services. You associate a text indexing preprocessor handler with a class definition. When a document with content-based retrieval (CBR) enabled enters the indexing queue, the associated handler is triggered.
Text indexing preprocessors allow content modifications that enhance search accuracy and relevance. For example, a text indexing preprocessor can be used to replace or augment the text extract, add custom metadata fields for indexing, or apply content transformations before the indexing process. For feature comparisons between text indexing preprocessors and other action handlers, see Action Handlers: Restrictions and Best Practices.
Text indexing preprocessors have these characteristics:
- The document object in an indexing request is passed to a text indexing preprocessor handler. The handler can access document properties and content to generate the text extract and additional indexing fields.
- You can implement a text indexing preprocessor handler as a Java or JavaScript component. A text indexing preprocessor that is implemented with Java can be placed in a code module, and can coexist in the same code module with other action handler types: event action, lifecycle action, change preprocessor, and document classifier. For more information, see Deploying Java Action Handlers.
- You can set one or more text indexing preprocessors on a subscribable class definition, such as Document. For more information, see Setup Requirements.
- A text indexing preprocessor set on a class is applied recursively to subclasses in the class hierarchy. For example, if you set a text indexing preprocessor on the Document class, it is also applied logically to the subclasses of the Document class. All enabled preprocessors that are associated with the document's class hierarchy are loaded and called per indexing request. If the same preprocessor is referenced multiple times within a hierarchy, the preprocessor nearest the root of the hierarchy is started and all others are silently skipped. For more information, see Text indexing preprocessors.
- A text indexing preprocessor runs synchronously during the indexing process, so an exception prevents the document from being indexed.
- Text indexing preprocessors are triggered when documents with CBR enabled enter the indexing queue.
- Unless disabled, text indexing preprocessors are started unconditionally for documents with associated definitions.
- The text extract and additional fields that are generated by the preprocessor are indexed by Elasticsearch or Content Search Services.
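The hierarchy rules above can be sketched in plain Java. This is a minimal model, not the server's implementation: HierarchyResolutionSketch, ClassNode, PreprocessorRef, and resolve() are hypothetical names, and the root-to-leaf calling order is an assumption. The sketch walks from the root class down to the document's class, collects the enabled preprocessors, and skips any preprocessor that was already referenced nearer the root.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-ins for illustration only; these are not Content Engine API types.
final class HierarchyResolutionSketch {

    /** A reference from a class definition to a named preprocessor. */
    static final class PreprocessorRef {
        final String name;
        final boolean enabled;
        PreprocessorRef(String name, boolean enabled) {
            this.name = name;
            this.enabled = enabled;
        }
    }

    /** A class definition with an optional parent and its own preprocessor references. */
    static final class ClassNode {
        final String name;
        final ClassNode parent;
        final List<PreprocessorRef> refs;
        ClassNode(String name, ClassNode parent, List<PreprocessorRef> refs) {
            this.name = name;
            this.parent = parent;
            this.refs = refs;
        }
    }

    /**
     * Returns the names of the preprocessors to call for a document of the given class.
     * Assumption: the hierarchy is walked from the root down, so a duplicate reference
     * keeps the occurrence nearest the root and later occurrences are skipped.
     */
    static List<String> resolve(ClassNode documentClass) {
        // Build the path root -> ... -> documentClass.
        List<ClassNode> path = new ArrayList<>();
        for (ClassNode c = documentClass; c != null; c = c.parent) {
            path.add(0, c);
        }
        Set<String> ordered = new LinkedHashSet<>();
        for (ClassNode c : path) {
            for (PreprocessorRef ref : c.refs) {
                if (ref.enabled) {
                    ordered.add(ref.name); // add() ignores names that are already present
                }
            }
        }
        return new ArrayList<>(ordered);
    }

    public static void main(String[] args) {
        ClassNode document = new ClassNode("Document", null,
                List.of(new PreprocessorRef("Redactor", true)));
        ClassNode invoice = new ClassNode("Invoice", document,
                List.of(new PreprocessorRef("Redactor", true),
                        new PreprocessorRef("InvoiceFields", true),
                        new PreprocessorRef("Legacy", false)));
        System.out.println(resolve(invoice)); // prints [Redactor, InvoiceFields]
    }
}
```

In this model the Redactor reference on Invoice is skipped because the same preprocessor is already referenced on Document, and Legacy is skipped because it is disabled.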
Implementation Guidelines
You implement the TextIndexingPreprocessor interface, which requires implementing the preprocess(action, services, sourceObject, fields, result) method. In implementing this interface, consider the following points:
- You can retrieve the source object's properties collection to perform conditional processing. The properties collection contains all properties that are currently defined on the object, including any metadata that can be used to determine preprocessing logic.
- You can use the services parameter to access utility methods, including getTextExtract() to obtain the default text extraction and addFieldValue() to add custom indexing fields.
- The fields parameter is pre-populated with entries for all CBR-enabled properties of the source object and can be extended by the handler.
- The preprocessor uses the result parameter (a TextIndexingPreprocessorServices.TextExtractionResult object) to provide the text extract and extraction status.
- Return true if the handler preprocessed the object, or false if it declined to do so. If false is returned, the object is passed to other preprocessors or indexed normally.
- If a chain of preprocessors runs recursively in the class hierarchy, each preprocessor in the chain can see the results from previously started preprocessors. The final text extract is indexed after all preprocessors complete.
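To make the return-value and chaining rules concrete, here is a minimal sketch of a driver that calls a chain of handlers. The Handler interface, the ExtractResult class, and runChain() are hypothetical simplifications of the preprocess(action, services, sourceObject, fields, result) contract; the server's actual dispatch logic may differ.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical model of chained preprocessing; not the Content Engine API.
final class PreprocessorChainSketch {

    /** Simplified stand-in for the result parameter: holds the current text extract. */
    static final class ExtractResult {
        String text;
        ExtractResult(String text) { this.text = text; }
    }

    /** Simplified stand-in for a handler: returns true if it preprocessed the object. */
    interface Handler {
        boolean preprocess(Map<String, Object> fields, ExtractResult result);
    }

    /** Calls each handler in order and returns how many reported that they did work. */
    static int runChain(List<Handler> chain, Map<String, Object> fields, ExtractResult result) {
        int handled = 0;
        for (Handler h : chain) {
            // Every handler receives the same fields map and result object,
            // so it sees whatever earlier handlers left there.
            if (h.preprocess(fields, result)) {
                handled++;
            }
        }
        return handled;
    }

    public static void main(String[] args) {
        Handler lowerCase = (fields, result) -> {
            result.text = result.text.toLowerCase(Locale.ROOT);
            return true;
        };
        Handler declineAll = (fields, result) -> false; // declines and leaves state untouched
        Handler wordCount = (fields, result) -> {
            // Counts words in the extract as modified by earlier handlers.
            fields.put("wordCount", result.text.split("\\s+").length);
            return true;
        };

        Map<String, Object> fields = new LinkedHashMap<>();
        ExtractResult result = new ExtractResult("Quarterly REPORT Draft");
        int handled = runChain(List.of(lowerCase, declineAll, wordCount), fields, result);

        System.out.println(handled + " handlers ran; text=" + result.text + "; fields=" + fields);
    }
}
```

The handler that returns false does not stop the chain in this model; the later handler still runs and works on the lowercased extract.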
Allowed Operations
- You can read the source object's properties collection using getProperties() to perform conditional processing based on property values.
- You can use services.getTextExtract() to obtain the default text extraction for the source object.
- You can use services.addFieldValue() to add custom indexing fields to the fields map.
- You can set the text extract and extraction status using result.setResult().
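The allowed operations can be combined as in the following sketch. It is self-contained, so the Services and Result types here are hypothetical stand-ins whose method names mirror getTextExtract(), addFieldValue(), and setResult(); the parameter lists are assumptions, and the real signatures may differ. The source object's properties are modeled as a plain read-only map, and the "Confidential" property and "redacted" field are invented for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-ins; the real interfaces and their signatures may differ.
final class AllowedOperationsSketch {

    /** Stand-in for the services parameter. */
    interface Services {
        String getTextExtract(); // default text extraction for the source object
        void addFieldValue(Map<String, Object> fields, String name, Object value);
    }

    /** Stand-in for the result parameter: text extract plus extraction status. */
    static final class Result {
        String text;
        boolean success;
        void setResult(String text, boolean success) {
            this.text = text;
            this.success = success;
        }
    }

    /** Test helper: services that always return the given extract. */
    static Services fixedServices(String extract) {
        return new Services() {
            public String getTextExtract() { return extract; }
            public void addFieldValue(Map<String, Object> fields, String name, Object value) {
                fields.put(name, value);
            }
        };
    }

    /**
     * Reads (never writes) the source properties. Redacts the extract for documents
     * whose "Confidential" property is true, and declines otherwise.
     */
    static boolean preprocess(Services services, Map<String, Object> sourceProperties,
                              Map<String, Object> fields, Result result) {
        if (!Boolean.TRUE.equals(sourceProperties.get("Confidential"))) {
            return false; // decline; the object is indexed normally
        }
        String extract = services.getTextExtract();
        // Mask any standalone run of exactly 16 digits.
        String redacted = extract.replaceAll("\\b\\d{16}\\b", "[REDACTED]");
        services.addFieldValue(fields, "redacted", !redacted.equals(extract));
        result.setResult(redacted, true);
        return true;
    }

    public static void main(String[] args) {
        Map<String, Object> props = new HashMap<>();
        props.put("Confidential", Boolean.TRUE);
        Map<String, Object> fields = new HashMap<>();
        Result result = new Result();

        boolean handled = preprocess(
                fixedServices("Card 1234567812345678 on file"), props, fields, result);
        System.out.println(handled + " | " + result.text + " | " + fields);
    }
}
```

The handler only reads from the properties map and writes through the services and result parameters, which matches the split between allowed and disallowed operations.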
Disallowed Operations
A handler must not call the following operations, which would modify the source object or interfere with the indexing process:
- Methods of IndependentlyPersistableObject that modify the object, like save(), checkout(), checkin(), delete(), or refresh().
- Content modification methods, like setCaptureSource() or operations that write to content streams.
- Property modification methods, like getProperties().putValue().
Setup Requirements
To set up a text indexing preprocessor:
- Implement the TextIndexingPreprocessor interface.
- Create a CmTextIndexingPreprocessorAction object. This object contains a property for setting the implemented TextIndexingPreprocessor interface. It also includes an IsEnabled property, allowing an administrator to disable a preprocessor at the system scope, no matter where it is referenced in the class hierarchies.
- Create a CmTextIndexingPreprocessorDefinition object. This object contains a property for setting the CmTextIndexingPreprocessorAction object. It also includes an IsEnabled property, allowing an administrator to disable a preprocessor at the class scope.
- Create a CmTextIndexingPreprocessorDefinitionList object and add the CmTextIndexingPreprocessorDefinition object to it.
- Get the SubscribableClassDefinition object that represents the class definition on which you want to set the text indexing preprocessor.
- Set CmTextIndexingPreprocessorDefinitionList on the SubscribableClassDefinition object and save it.
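The relationship between the setup objects and their two IsEnabled switches can be modeled in memory as follows. This sketch does not use the Content Engine API: Action, Definition, ClassDef, and effectiveHandlers() are hypothetical stand-ins, and the rule that a handler runs only when both switches are on is an assumption drawn from the scope descriptions above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory model of the setup objects; not the Content Engine API.
final class SetupModelSketch {

    /** Models CmTextIndexingPreprocessorAction: names the handler; system-scope switch. */
    static final class Action {
        final String handlerName;
        boolean enabled = true;
        Action(String handlerName) { this.handlerName = handlerName; }
    }

    /** Models CmTextIndexingPreprocessorDefinition: points at an action; class-scope switch. */
    static final class Definition {
        final Action action;
        boolean enabled = true;
        Definition(Action action) { this.action = action; }
    }

    /** Models a subscribable class definition that holds a definition list. */
    static final class ClassDef {
        final String name;
        final List<Definition> definitions = new ArrayList<>();
        ClassDef(String name) { this.name = name; }
    }

    /** Assumption: a handler is effective only if both its definition and action are enabled. */
    static List<String> effectiveHandlers(ClassDef classDef) {
        List<String> names = new ArrayList<>();
        for (Definition d : classDef.definitions) {
            if (d.enabled && d.action.enabled) {
                names.add(d.action.handlerName);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        Action redactor = new Action("RedactionPreprocessor");
        ClassDef invoice = new ClassDef("Invoice");
        ClassDef memo = new ClassDef("Memo");
        Definition invoiceDef = new Definition(redactor);
        invoice.definitions.add(invoiceDef);
        memo.definitions.add(new Definition(redactor));

        invoiceDef.enabled = false; // class scope: only Invoice is affected
        System.out.println(effectiveHandlers(invoice) + " " + effectiveHandlers(memo));

        redactor.enabled = false;   // system scope: every referencing class is affected
        System.out.println(effectiveHandlers(invoice) + " " + effectiveHandlers(memo));
    }
}
```

Disabling the definition turns the handler off for one class only, whereas disabling the shared action turns it off wherever it is referenced.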
You can get all CmTextIndexingPreprocessorAction objects with the ObjectStore.CmTextIndexingPreprocessorActions property.
You can get all CmTextIndexingPreprocessorDefinition objects with the SubscribableClassDefinition.TextIndexingPreprocessorDefinitions property.
For code examples of setup and retrieval operations, see Working with Text Indexing Preprocessors.