IBM Support

Indexing text in OCR-generated PDF documents

Troubleshooting


Problem

By default, eDiscovery Analyzer does not index text from PDF documents that were produced by optical character recognition (OCR) software.

Symptom

A PDF document generated by OCR software can be searched in a PDF viewer, but the text is not indexed by eDiscovery Analyzer.

Cause

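OCR software saves the text that it recognizes as hidden text in the PDF. A PDF viewer can search this hidden text, but the Search Export filter that eDiscovery Analyzer uses does not extract hidden text by default, so the text is not indexed.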

Resolving The Problem

To index hidden text in PDF documents, enable the hidden option in the configuration of the Oracle Outside In Search Export filter.

Open the configuration file stellent/searchexport.cfg under the eDiscovery Analyzer installation directory, find the SCCOPT_XML_SEARCHML_CHAR_ATTRS section, and change the following lines:

# SCCOPT_XML_SEARCHML_CHAR_ATTRS
(lines skipped)
#hidden yes
hidden no

to:
hidden yes
#hidden no
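
For installations where the edit must be repeated on several servers, the change above can be scripted. The following Python sketch is illustrative only and is not part of the product: the function name enable_hidden_text is hypothetical, and the code assumes that each option group in searchexport.cfg starts with a comment line containing an SCCOPT_ name, as in the excerpt above. It swaps the two hidden lines only inside the SCCOPT_XML_SEARCHML_CHAR_ATTRS group and leaves every other line unchanged.

```python
import re

# Name of the option group shown in the excerpt above.
SECTION = "SCCOPT_XML_SEARCHML_CHAR_ATTRS"


def enable_hidden_text(cfg_text: str) -> str:
    """Return cfg_text with 'hidden yes' active and 'hidden no' commented out.

    Assumption: every option group begins with a '#' comment line that
    contains an SCCOPT_ name, so such a line marks the end of the
    previous group.
    """
    out = []
    in_section = False
    for line in cfg_text.splitlines(keepends=True):
        stripped = line.strip()
        if stripped.startswith("#") and "SCCOPT_" in stripped:
            # A group header: track whether it is the group we want.
            in_section = SECTION in stripped
        elif in_section:
            if re.fullmatch(r"#\s*hidden\s+yes", stripped):
                # Uncomment the 'yes' setting.
                line = line.replace(stripped, "hidden yes")
            elif re.fullmatch(r"hidden\s+no", stripped):
                # Comment out the 'no' setting.
                line = line.replace(stripped, "#hidden no")
        out.append(line)
    return "".join(out)
```

To apply it, read stellent/searchexport.cfg into a string, pass the string to the function, and write the result back after saving a backup copy of the original file. Running the function on an already updated file returns the text unchanged.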

Product: eDiscovery Analyzer
Platform: AIX, Windows
Version: 2.1.1, 2.1.1.1, 2.1.1.2

Document Information

Modified date:
17 June 2018

UID

swg21432316