Text indexing preprocessors
A text indexing preprocessor is a custom user implementation that replaces or augments the text extract of document content before indexing. Preprocessors run instead of the normal text extraction process.
Text indexing preprocessing workflow
When a document with content-based retrieval (CBR) enabled is received in the indexing queue, the system follows this workflow:
- The system checks whether the document class has a text indexing preprocessor definition.
- If no text indexing preprocessor definition exists for the document, the document follows the normal text extraction process through Content Search Services or Elasticsearch.
- If a text indexing preprocessor definition exists for the document, the preprocessor runs instead of the normal text extraction process, and the preprocessor output is then indexed.
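The branching in this workflow can be pictured with a minimal Java sketch. Everything in it is hypothetical and for illustration only: the class IndexingWorkflowSketch, the define and textToIndex methods, and the normalExtraction stand-in are not part of the product API, and the lookup is keyed by a plain class name rather than by real preprocessor definitions.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch of the indexing workflow; none of these types are product API.
class IndexingWorkflowSketch {
    // Maps a document class name to its preprocessor (stands in for a preprocessor definition).
    private final Map<String, Function<String, String>> definitions = new HashMap<>();

    void define(String documentClass, Function<String, String> preprocessor) {
        definitions.put(documentClass, preprocessor);
    }

    // Stand-in for normal text extraction by the search engine (assumption: just trims).
    static String normalExtraction(String content) {
        return content.trim();
    }

    // Returns the text that would be handed to the indexer.
    String textToIndex(String documentClass, String content) {
        Function<String, String> preprocessor = definitions.get(documentClass);
        if (preprocessor == null) {
            // No definition: fall through to normal extraction.
            return normalExtraction(content);
        }
        // Definition exists: the preprocessor runs instead of normal extraction.
        return preprocessor.apply(content);
    }

    public static void main(String[] args) {
        IndexingWorkflowSketch workflow = new IndexingWorkflowSketch();
        workflow.define("Invoice", content -> "invoice " + content.toUpperCase());
        System.out.println(workflow.textToIndex("Invoice", " total due "));
        System.out.println(workflow.textToIndex("Memo", " total due "));
    }
}
```

In this sketch the preprocessor path and the normal extraction path are mutually exclusive, which mirrors the "instead of" wording in the workflow.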
Use cases for text indexing preprocessors
Text indexing preprocessors are useful in the following scenarios:
- Replacing or augmenting the text extract to improve search accuracy or relevance.
- Extracting and preprocessing metadata from document content.
- Applying custom text transformations or filtering.
- Enriching document content with additional information.
- Implementing custom content validation or sanitization.
The ibm-content-platform-engine-samples GitHub repository includes sample code that demonstrates how to use the Text Indexing Preprocessor interface. The sample shows how to create a properties-only indexing handler that skips content extraction and only indexes document properties. For more information, see Text Indexing Preprocessor sample code.
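To give a feel for what a properties-only handler does, here is a rough, self-contained Java sketch. It is not the sample code from the repository and does not use the real Text Indexing Preprocessor interface, whose signature is not shown in this topic. The IndexTextHandler interface, its buildIndexText method, and the property names are all assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical stand-in for the handler contract; the real interface may differ.
interface IndexTextHandler {
    String buildIndexText(Map<String, String> properties, String content);
}

// Ignores the content entirely and emits only property names and values.
class PropertiesOnlyHandler implements IndexTextHandler {
    @Override
    public String buildIndexText(Map<String, String> properties, String content) {
        StringJoiner text = new StringJoiner("\n");
        for (Map.Entry<String, String> entry : properties.entrySet()) {
            text.add(entry.getKey() + ": " + entry.getValue());
        }
        return text.toString();
    }
}

class PropertiesOnlyDemo {
    public static void main(String[] args) {
        Map<String, String> props = new LinkedHashMap<>();
        props.put("DocumentTitle", "Q3 report");
        props.put("Author", "jdoe");
        // The content argument is never read, so no content extraction takes place.
        System.out.println(new PropertiesOnlyHandler().buildIndexText(props, "body text that is skipped"));
    }
}
```

The point of the sketch is only that the handler never touches the content argument, so the indexed text is derived from properties alone.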
Text indexing preprocessor components
A text indexing preprocessor consists of three components:
- A text indexing preprocessor definition that associates a text indexing preprocessor action with a class. A class can have one or more text indexing preprocessor definitions associated with it.
- A text indexing preprocessor action that references the JavaScript or code module that is used to perform the action.
- A Java action handler that performs the preprocessing. A Java-implemented action handler can be placed in a code module. In addition, it can coexist with other action handler types, such as event action, lifecycle action, and document classifier.
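The relationship among the three components can be sketched as a small Java data model. The types below (ActionSketch, DefinitionSketch, DocumentClassSketch) and the handler class name com.example.PropertiesOnlyHandler are hypothetical and are not product classes. The sketch also assumes that each definition points at exactly one action.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the three components; names are illustrative, not product API.
class ActionSketch {
    final String name;
    final String handlerClassName; // Java handler, which can be packaged in a code module

    ActionSketch(String name, String handlerClassName) {
        this.name = name;
        this.handlerClassName = handlerClassName;
    }
}

class DefinitionSketch {
    final ActionSketch action; // assumption: one action per definition

    DefinitionSketch(ActionSketch action) {
        this.action = action;
    }
}

class DocumentClassSketch {
    final String symbolicName;
    final List<DefinitionSketch> definitions = new ArrayList<>(); // a class can carry several definitions

    DocumentClassSketch(String symbolicName) {
        this.symbolicName = symbolicName;
    }
}

class ComponentsDemo {
    public static void main(String[] args) {
        ActionSketch action = new ActionSketch("PropertiesOnlyAction", "com.example.PropertiesOnlyHandler");
        DocumentClassSketch invoice = new DocumentClassSketch("Invoice");
        invoice.definitions.add(new DefinitionSketch(action));
        // Walk the chain: class -> definition -> action -> handler.
        for (DefinitionSketch d : invoice.definitions) {
            System.out.println(invoice.symbolicName + " -> " + d.action.name + " -> " + d.action.handlerClassName);
        }
    }
}
```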
Text indexing preprocessor characteristics
When you work with text indexing preprocessors, note the following characteristics:
- Text indexing preprocessors are triggered when documents with CBR enabled enter the indexing queue.
- Text indexing preprocessors run instead of the normal text extraction process. The preprocessor output is then indexed by Elasticsearch or Content Search Services.
- Text indexing preprocessors can replace or augment the normal text extract of document content and can generate additional fields for indexing beyond the CBR-enabled properties of the document.
- Unless disabled, text indexing preprocessors are invoked unconditionally for documents with associated definitions.
- Preprocessors can improve search accuracy by replacing or augmenting the text extract.
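The difference between replacing and augmenting the text extract, and the idea of additional indexed fields, can be illustrated with a short Java sketch. The IndexPayloadSketch type, the replace and augment helpers, and the detectedLanguage field are hypothetical names chosen for this example. The sketch also assumes that the handler has already obtained a text extract by some means when it augments.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical result type: the text to index plus extra named fields.
class IndexPayloadSketch {
    final String text;
    final Map<String, String> extraFields = new LinkedHashMap<>();

    IndexPayloadSketch(String text) {
        this.text = text;
    }
}

class ReplaceOrAugmentSketch {
    // Replace: the supplied text stands in for the extract entirely.
    static IndexPayloadSketch replace(String replacement) {
        return new IndexPayloadSketch(replacement);
    }

    // Augment: keep the extract and append supplementary text on a new line.
    static IndexPayloadSketch augment(String extract, String supplement) {
        return new IndexPayloadSketch(extract + "\n" + supplement);
    }

    public static void main(String[] args) {
        IndexPayloadSketch payload = augment("Payment due in 30 days.", "keywords: invoice, net30");
        payload.extraFields.put("detectedLanguage", "en"); // hypothetical additional indexed field
        System.out.println(payload.text);
        System.out.println(payload.extraFields);
    }
}
```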