Defining text extraction services

Configure text extraction services to extract text and metadata from documents for indexing, search, and content analysis.

Text extraction capabilities

Text extraction services extract content from documents for purposes such as full-text search, content analysis, and metadata extraction. The system supports two complementary capabilities:

Persistent text extraction (PTE): Extracts text from documents and stores it as Text Extraction Annotation objects for reuse. Because the extracted text is stored, indexing and search operations can reuse it instead of re-extracting text from the document each time.
Enhanced extraction: Extends text extraction capabilities to handle complex document types including PDF forms, images, and structured data. Enhanced extraction uses specialized handlers to extract not only plain text but also structured metadata such as form fields, key-value pairs, tables, and markdown-formatted content.

Enhanced extraction handlers

Enhanced extraction supports multiple specialized handlers that process different document types:

PDF Forms handler: Extracts form field data from PDF documents and associates field names with their values. This handler is useful for processing structured forms where field-to-value relationships must be preserved.
Watson enhanced extraction handler: Extracts text from images and structured data using Watson Document Understanding (local deployment) or watsonx.ai Text Extraction (SaaS). This handler supports optical character recognition (OCR) and can extract multiple content types including plain text, markdown, key-value pairs, and tables.
Oracle OIT handler: Provides traditional text extraction using Oracle Outside In Technology. This handler remains available for backward compatibility with existing deployments.

Handlers run in priority order, and only one handler processes each document. This priority chain lets multiple extraction technologies coexist while ensuring that no document is processed more than once.
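
The priority chain described above can be pictured with a small sketch. This is an illustration only, not the product's implementation: the Handler class, the select_handler function, the numeric priorities, and the content-type checks are all hypothetical, and it assumes that a lower number means the handler is tried first and that a handler accepts or rejects a document based on its content type.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Handler:
    """Hypothetical model of an extraction handler (not a product API)."""
    name: str
    priority: int                    # assumption: lower value is tried first
    accepts: Callable[[str], bool]   # assumption: decision is based on content type


def select_handler(handlers: Sequence[Handler], content_type: str) -> Optional[Handler]:
    """Return the first handler, in priority order, that accepts the document.

    Only one handler is ever returned, so each document is processed once.
    """
    for handler in sorted(handlers, key=lambda h: h.priority):
        if handler.accepts(content_type):
            return handler
    return None


# Illustrative registrations; the priorities and matching rules are assumptions.
HANDLERS = [
    Handler("Oracle OIT", 30, lambda ct: True),  # catch-all fallback
    Handler("PDF Forms", 10, lambda ct: ct == "application/pdf"),
    Handler("Watson enhanced extraction", 20, lambda ct: ct.startswith("image/")),
]
```

In this sketch, select_handler(HANDLERS, "image/png") skips the PDF Forms handler and returns the Watson enhanced extraction handler, while a content type that neither specialized handler accepts falls through to the Oracle OIT handler.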

Getting started

The topics in this section provide procedures for configuring text extraction services in your environment.

To use enhanced extraction, you must first enable persistent text extraction for your document classes. After persistent text extraction is enabled, you can configure enhanced extraction handlers and enable enhanced extraction for specific document classes.
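
The ordering requirement above can be expressed as a simple precondition check. The following sketch is hypothetical: DocumentClassSettings and enable_enhanced_extraction are illustrative names, not product APIs, and the flags only model the dependency between the two capabilities.

```python
from dataclasses import dataclass


@dataclass
class DocumentClassSettings:
    """Hypothetical per-document-class settings (illustrative names only)."""
    name: str
    persistent_text_extraction: bool = False
    enhanced_extraction: bool = False


def enable_enhanced_extraction(settings: DocumentClassSettings) -> DocumentClassSettings:
    """Enable enhanced extraction only if persistent text extraction is already on."""
    if not settings.persistent_text_extraction:
        raise ValueError(
            f"Enable persistent text extraction for '{settings.name}' "
            "before enabling enhanced extraction."
        )
    settings.enhanced_extraction = True
    return settings
```

Calling enable_enhanced_extraction on a document class that does not yet have persistent text extraction enabled raises an error, which mirrors the required configuration order.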