Optical character recognition processing

Optical character recognition (OCR) processing enables text extraction from graphic image files that are stored inside archives where the Include content tagging and full-text index option is selected.

When OCR processing is enabled, files of the following types are routed through an OCR engine after content typing in the IBM® StoredIQ® processing pipeline so that recognizable text can be extracted:
  • Windows or OS/2 bitmap (BMP)
  • Tagged Image File Format (TIFF)
  • CompuServe Graphics Interchange Format (GIF)
  • Portable Network Graphics (PNG)
  • Joint Photographic Experts Group (JPG)
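
To illustrate what content typing means for the formats listed above, the following minimal sketch identifies an image by its leading bytes rather than by its file extension. It is not the IBM StoredIQ implementation: the function name `ocr_image_type` and the signature table are hypothetical, and the check is simplified (for example, any data that starts with `BM` is treated as a bitmap).

```python
from typing import Optional

# Well-known leading byte signatures for the five listed image formats.
# Illustrative only; a real content-typing engine inspects far more than this.
_SIGNATURES = [
    (b"BM", "BMP"),
    (b"II*\x00", "TIFF"),            # little-endian TIFF
    (b"MM\x00*", "TIFF"),            # big-endian TIFF
    (b"GIF87a", "GIF"),
    (b"GIF89a", "GIF"),
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"\xff\xd8\xff", "JPG"),
]


def ocr_image_type(header: bytes) -> Optional[str]:
    """Return the image type if the header matches a listed format, else None."""
    for magic, name in _SIGNATURES:
        if header.startswith(magic):
            return name
    return None
```

Calling `ocr_image_type` with the first few bytes of a file returns a label such as `"PNG"` for a listed image format and `None` for anything else, for example a PDF header.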

Text that is extracted from image files is processed through the IBM StoredIQ pipeline in the same manner as text that is extracted from other supported file types. Policies that can write extracted text to a separate file for supported file types also do so for image files while OCR processing is enabled.

The OCR processing rate of image files is approximately 7-10 KB/sec per IBM StoredIQ harvester process.
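
The stated rate can be turned into a rough planning estimate. The sketch below is an assumption-laden calculation, not a product tool: the function name `estimate_ocr_hours` is hypothetical, 1 KB is taken to mean 1024 bytes, and throughput is assumed to scale linearly with the number of harvester processes, which the rate statement above does not confirm.

```python
def estimate_ocr_hours(total_image_bytes, harvester_processes=1,
                       rate_kb_per_sec=(7.0, 10.0)):
    """Return (best_case_hours, worst_case_hours) for OCR of the given volume.

    Assumptions (not from the product documentation):
    - 1 KB = 1024 bytes
    - throughput scales linearly with the number of harvester processes
    """
    if harvester_processes < 1:
        raise ValueError("harvester_processes must be at least 1")
    slow, fast = min(rate_kb_per_sec), max(rate_kb_per_sec)
    total_kb = total_image_bytes / 1024
    # Time = volume / (per-process rate * process count), converted to hours.
    worst = total_kb / (slow * harvester_processes) / 3600
    best = total_kb / (fast * harvester_processes) / 3600
    return best, worst
```

Under these assumptions, `estimate_ocr_hours(1024**3)` gives roughly 29 to 42 hours for 1 GiB of image data on a single harvester process, which shows why OCR can dominate harvest time for image-heavy data sources.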