Text extraction indexing preprocessor considerations

When persistent text extraction (PTE) is enabled for a document class on which Content Based Retrieval (CBR) is already active, the standard CBR indexing pipeline changes, and indexing performance is reduced.

In this scenario, the preprocessor threads first send each document to OIT and store the extracted text as Annotations. Only after this PTE processing is finished is the document submitted to the CBR indexing engine. The default configuration includes 8 preprocessor threads, and each thread handles one document at a time, so at most 8 documents are in preprocessing at any moment. As a result, the overall indexing rate is lower than when PTE is disabled.

To improve indexing performance, raise the number of preprocessor worker threads well above the default of 8. You can adjust both the Maximum preprocessing workers and the Maximum preprocessing queue size properties for the domain.

Tip: For information about accessing these properties, see Accessing subsystem configuration properties.
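
Example: The following Python sketch is a rough sizing aid, not part of the product. It assumes that each preprocessor thread handles one document at a time, as described above, and that the average extraction time per document is known from your own measurements. The function names and the sample figures (0.5 seconds per document, a target of 50 documents per second) are hypothetical. The model ignores the preprocessing queue size and the capacity of the CBR indexing engine, so treat the results as an upper bound and a starting point for testing.

```python
import math

DEFAULT_WORKERS = 8  # default number of preprocessor threads stated in this topic


def max_docs_per_second(workers: int, avg_extract_seconds: float) -> float:
    """Upper bound on the rate at which documents can leave preprocessing.

    Each worker handles one document at a time, so the ceiling is
    workers / average extraction time.
    """
    if workers < 1 or avg_extract_seconds <= 0:
        raise ValueError("workers must be >= 1 and extraction time must be > 0")
    return workers / avg_extract_seconds


def workers_needed(target_docs_per_second: float, avg_extract_seconds: float) -> int:
    """Smallest worker count whose ceiling meets the target rate."""
    if target_docs_per_second <= 0 or avg_extract_seconds <= 0:
        raise ValueError("target rate and extraction time must be > 0")
    return max(1, math.ceil(target_docs_per_second * avg_extract_seconds))


if __name__ == "__main__":
    # Hypothetical measurement: 0.5 s average extraction time per document.
    print(max_docs_per_second(DEFAULT_WORKERS, 0.5))  # 16.0
    # Hypothetical target: 50 documents per second.
    print(workers_needed(50, 0.5))  # 25
```

With these assumed figures, the default of 8 workers caps preprocessing at 16 documents per second, and roughly 25 workers would be needed to reach 50 documents per second. Confirm any value against observed indexing rates before you apply it to the domain.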