Text extraction indexing preprocessor considerations
When persistent text extraction (PTE) is enabled for a document class where Content Based Retrieval (CBR) is already active, the standard CBR indexing pipeline changes, and indexing performance is reduced.
In this scenario, the preprocessor threads first send each document to OIT and store the extracted text as Annotations. Only after PTE processing is finished is the document submitted to the CBR indexing engine. The default configuration includes 8 preprocessor threads, and each thread handles one document at a time. As a result, the overall indexing rate is lower than when PTE is disabled.
To improve indexing performance, increase the number of preprocessor worker threads to a value much higher than the default. You can adjust the Maximum preprocessing queue size and the Maximum preprocessing workers properties for the domain.
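
As a rough aid for choosing a worker count, the following Python sketch models the preprocessing stage as a pool of workers that each handle one document at a time, as described above. It is only an estimate under stated assumptions: the function names and the per-document extraction time are hypothetical, the model ignores the preprocessing queue size, OIT contention, and CPU or I/O limits, and real values should come from measurements on your own system.

```python
import math


def max_preprocessing_rate(workers: int, seconds_per_document: float) -> float:
    """Upper bound on documents per second for the preprocessing stage.

    Assumes each worker handles exactly one document at a time and that
    every document takes the same average time to extract and store.
    """
    if workers < 1 or seconds_per_document <= 0:
        raise ValueError("workers must be >= 1 and seconds_per_document > 0")
    return workers / seconds_per_document


def workers_needed(target_rate: float, seconds_per_document: float) -> int:
    """Smallest worker count whose upper bound reaches the target rate."""
    if target_rate <= 0 or seconds_per_document <= 0:
        raise ValueError("target_rate and seconds_per_document must be > 0")
    return math.ceil(target_rate * seconds_per_document)


# Default worker count stated in this topic.
DEFAULT_WORKERS = 8

# Hypothetical average PTE time per document; measure this in your environment.
ASSUMED_SECONDS_PER_DOCUMENT = 2.0

if __name__ == "__main__":
    ceiling = max_preprocessing_rate(DEFAULT_WORKERS, ASSUMED_SECONDS_PER_DOCUMENT)
    print(f"Ceiling with default workers: {ceiling} documents/second")
    needed = workers_needed(20.0, ASSUMED_SECONDS_PER_DOCUMENT)
    print(f"Workers needed for 20 documents/second: {needed}")
```

If the estimated ceiling is below the rate that the CBR indexing engine sustained before PTE was enabled, the preprocessing stage is the likely bottleneck, and the Maximum preprocessing workers property is the first value to raise.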