Indexing text in OCR-generated PDF documents

Troubleshooting

Problem

By default, eDiscovery Analyzer does not index text from PDF documents that were produced by optical character recognition (OCR) software.

Symptom

A PDF document generated by OCR software can be searched in a PDF viewer, but the text is not indexed by eDiscovery Analyzer.

Cause

Text that is generated by OCR software is saved as hidden text in PDFs. This text is not indexed by eDiscovery Analyzer by default.

Resolving The Problem

To index hidden text in PDF documents, change a setting in the Oracle Outside In Search Export filter.

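In the configuration file stellent/searchexport.cfg under the eDiscovery Analyzer installation directory, locate the SCCOPT_XML_SEARCHML_CHAR_ATTRS section and swap which of the two hidden lines is commented out, so that "hidden yes" becomes the active setting. Change the following lines: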

# SCCOPT_XML_SEARCHML_CHAR_ATTRS

(lines skipped)
#hidden yes
hidden no

to:
hidden yes
#hidden no
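
For installations where the change has to be applied on several servers, the edit above could be scripted. The following is a minimal sketch, not a supported tool: the function name enable_hidden_text is hypothetical, and it assumes the two lines appear exactly as shown above ("#hidden yes" and "hidden no", each on its own line) and that no other option in the file uses a line named "hidden". Back up searchexport.cfg and review the result before replacing the original.

```python
import re


def enable_hidden_text(cfg_text: str) -> str:
    """Return cfg_text with 'hidden yes' active and 'hidden no' commented out.

    Assumption: the option lines look like '#hidden yes' and 'hidden no',
    as in the excerpt from stellent/searchexport.cfg shown above.
    All other lines are passed through unchanged.
    """
    result = []
    for line in cfg_text.splitlines(keepends=True):
        stripped = line.strip()
        if re.fullmatch(r"#\s*hidden\s+yes", stripped):
            # Uncomment the setting that enables indexing of hidden text.
            line = line.replace(stripped, "hidden yes")
        elif re.fullmatch(r"hidden\s+no", stripped):
            # Comment out the default setting.
            line = line.replace(stripped, "#hidden no")
        result.append(line)
    return "".join(result)
```

The function works on the file contents as a string, so the caller decides how to read and write the file. Running it a second time on already-updated text leaves the text unchanged.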

Applies to: eDiscovery Analyzer versions 2.1.1, 2.1.1.1, and 2.1.1.2 on AIX and Windows.
