Controlling text extraction
Text extraction is part of the indexing process; it converts documents to text that will be indexed. Initiated by the text search subsystem dispatcher, text extraction is a resource-intensive process.
To avoid degrading indexing performance, it's important to properly configure text extraction.
Indexable document types and text extraction
An indexable document is a document that Content Platform Engine deems eligible for indexing and that the Oracle Outside In Search Export product can convert to text. The specific types of convertible documents depend on the version of the Oracle product that is used in your Content Platform Engine release. Content Platform Engine determines the eligibility of a document for indexing by identifying the MIME type of the document. Some MIME types are considered to be ineligible for indexing.
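The MIME-type eligibility check described above can be pictured with a minimal sketch. It is not Content Platform Engine code: the function name `is_index_candidate` and the exclusion set are hypothetical, the MIME type is guessed from the file name by the Python standard library, and the product's actual list of ineligible MIME types is not reproduced here.

```python
import mimetypes

# Purely illustrative exclusion set; NOT the product's actual list
# of MIME types that are ineligible for indexing.
HYPOTHETICAL_INELIGIBLE = {"image/jpeg", "video/mp4"}


def is_index_candidate(filename, ineligible=HYPOTHETICAL_INELIGIBLE):
    """Return True if the guessed MIME type is known and not excluded.

    The MIME type is guessed from the file extension, which is an
    assumption made for this sketch only.
    """
    mime_type, _encoding = mimetypes.guess_type(filename)
    return mime_type is not None and mime_type not in ineligible
```

With this illustrative set, a file named `report.pdf` passes the check, while `photo.jpg` and a file with no recognizable extension do not.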
Low disk space safeguard for text extraction
Content Platform Engine uses a temporary folder for extracting the text that is indexed by Content Search Services. To avoid negatively impacting indexing performance, place the temporary folder in a file system that has at least 5 GB of free disk space. Performance can be enhanced if the folder is located on a RAM disk or other fast storage, such as a solid-state drive.
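As a rough pre-check before pointing text extraction at a folder, the free space on its file system can be inspected from a script. The sketch below is independent of Content Platform Engine; the function name `has_enough_free_space` is hypothetical, and the 5 GB threshold is interpreted as 5 GiB, which is an assumption.

```python
import shutil

# 5 GB threshold from the guidance above, interpreted here as 5 GiB
# (an assumption; adjust if a decimal gigabyte is intended).
MIN_FREE_BYTES = 5 * 1024**3


def has_enough_free_space(folder, min_free_bytes=MIN_FREE_BYTES):
    """Return True if the file system that holds `folder` has at least
    `min_free_bytes` bytes free."""
    usage = shutil.disk_usage(folder)
    return usage.free >= min_free_bytes
```

For example, `has_enough_free_space("/path/to/temp")` could be run against the candidate folder, where the path shown is a placeholder.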
Enabling PDF-specific text extraction
For more accurate indexing of PDF documents that are written in a right-to-left language, such as Arabic or Hebrew, specify that Content Platform Engine use the Apache PDFBox technology for text extraction.
Setting a work area to optimize text extraction performance
The text extraction process converts a document to text for indexing and stores the result in the temporary directory. To optimize text extraction performance, set the temporary directory to a directory on a local file system. Text extraction can be much slower when the temporary directory is remote.
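To compare a local candidate directory with a remote one, a simple write timing can give a first impression. This is a minimal sketch and not a product tool: the function name `time_write` and the default size are hypothetical, and a single sequential write is only a crude proxy for the real extraction workload.

```python
import os
import tempfile
import time


def time_write(directory, size_mb=16):
    """Write `size_mb` MiB to a temporary file in `directory` and
    return the elapsed time in seconds. The file is deleted on close."""
    chunk = b"\0" * (1024 * 1024)
    start = time.perf_counter()
    with tempfile.NamedTemporaryFile(dir=directory) as f:
        for _ in range(size_mb):
            f.write(chunk)
        f.flush()
        os.fsync(f.fileno())  # push the data to the storage device
    return time.perf_counter() - start
```

Running `time_write` several times against each candidate directory and comparing the results can show whether one location is noticeably slower.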