Enhanced extraction

The enhanced extraction framework extracts text and metadata from complex document types such as PDF forms, images, and structured data. The framework is extensible: you can plug in handlers that use different extraction engines, from IBM or third parties, to extract text from different document types. An initial set of handlers is provided in Content Cortex 26.0.0, and additional handlers are planned for future releases.

Overview

The Content Engine includes text extraction capabilities for document indexing and content-based retrieval. By default, text is extracted and indexed but not saved. With persistent text extraction, the extracted text is stored as an annotation on each document through the Text Extraction Annotation class. This stored text supports multiple purposes: content-based retrieval (CBR) indexing, vector indexing, document summarization, and generative AI inferences. For more information about the persistent text extraction classes and properties, see Persistent Text Extract Extensions.

Enhanced extraction builds on persistent text extraction by introducing an extensible handler framework. This framework allows you to integrate specialized text extraction engines alongside the default Oracle Outside In Technology (OIT) engine. Each specialized handler can extract additional metadata and process complex document types that OIT cannot handle effectively.

Enhanced extraction workflow

When a document with persistent text extraction enabled enters the extraction queue, the system follows this workflow:
  1. The system checks whether enhanced extraction is configured for the document class.
  2. If no enhanced extraction handlers are configured, the system uses Oracle Outside In Technology (OIT) for text extraction.
  3. If enhanced extraction handlers are configured, they run in priority order. Each handler examines the document to determine whether it can process the document. The first handler that can process the document extracts the text and metadata. A handler that cannot process the document passes it to the next handler in the priority chain. If no specialized handler can process the document, the Oracle OIT handler processes it as the default fallback.
  4. The system stores the extracted text and metadata annotations on the document for use in indexing, summarization, and generative AI inferences.
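
The workflow above can be sketched as a simple priority chain. This is only an illustration, not the Content Engine API: the Document and Handler classes, the run_extraction and oit_extract functions, and the handler names are hypothetical. The sketch also assumes that a lower priority number runs first, which the text does not specify, and it uses MIME type checks as stand-ins for whatever inspection a real handler performs.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Document:
    name: str
    mime_type: str


@dataclass
class Handler:
    name: str
    priority: int  # assumption: a lower number means the handler runs earlier
    can_process: Callable[[Document], bool]  # stand-in for the handler's own check

    def extract(self, doc: Document) -> dict:
        # Placeholder result; a real handler would return text and metadata.
        return {"handler": self.name, "text": f"text of {doc.name}"}


def oit_extract(doc: Document) -> dict:
    # Stand-in for the default Oracle OIT extraction.
    return {"handler": "oit", "text": f"text of {doc.name}"}


def run_extraction(doc: Document, handlers: List[Handler]) -> dict:
    # Step 2: with no handlers configured, OIT extracts the text.
    if not handlers:
        return oit_extract(doc)
    # Step 3: try handlers in priority order; the first one that accepts wins.
    for handler in sorted(handlers, key=lambda h: h.priority):
        if handler.can_process(doc):
            return handler.extract(doc)
    # No specialized handler accepted the document, so fall back to OIT.
    return oit_extract(doc)


handlers = [
    Handler("watson", 2, lambda d: d.mime_type.startswith("image/")),
    Handler("pdf-forms", 1, lambda d: d.mime_type == "application/pdf"),
]

for doc in (
    Document("claim.pdf", "application/pdf"),
    Document("scan.png", "image/png"),
    Document("notes.txt", "text/plain"),
):
    print(doc.name, "->", run_extraction(doc, handlers)["handler"])
```

In this sketch, claim.pdf is handled by the hypothetical pdf-forms handler, scan.png by watson, and notes.txt falls through to oit. Only one handler ever returns a result for a given document.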

Use cases for enhanced extraction

Enhanced extraction is useful in the following scenarios:
  • Extracting form fields from PDF forms with proper field-to-value associations.
  • Extracting text from images and scanned documents using optical character recognition (OCR).
  • Extracting structured data such as tables and key-value pairs from documents.
  • Preserving document structure and formatting through markdown rendering.
  • Improving search accuracy and relevance for complex document types.

Enhanced extraction components

Enhanced extraction is available through Content Engine system add-ons. For more information about the add-ons, see Enhanced Extraction Extensions, PDF Forms Extensions, and Watson Extraction Extensions. The add-ons provide the following components:
  • Extraction handlers that process specific document types and produce text and metadata annotations. You can configure multiple handlers with different priorities. The framework supports the following handlers:
    • PDF Forms handler: Extracts form fields from PDF forms using the PDFBox library. Produces a key-value pairs annotation containing the form fields in JSON format. For more information, see PDF Forms Extensions.
    • Watson Text Extraction handler: Extracts text from images and structured data using Watson Document Understanding. Available as a SaaS solution through watsonx.ai Text Extraction or as an on-premises solution. Produces plain text, markdown, key-value pairs, and tables annotations. For more information, see Watson Extraction Extensions.
    • Oracle OIT handler: The default fallback handler that processes documents when no specialized handler is configured or when other handlers cannot process the document.
  • An Enhanced Extraction Annotation that stores the extracted metadata in formats such as key-value pairs, tables, or markdown. The MIME type is set to application/x-ibm.enhanced-extract.
  • A Text Extraction Annotation that stores the plain text extract of the document content.
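
To make the relationship between the two annotation types concrete, the following sketch models one document that carries a plain text annotation and a key-value pairs annotation. It is a conceptual model only. The Annotation class, the fmt labels, the JSON layout of the form fields, and the text/plain MIME type for the text annotation are assumptions for illustration; only the application/x-ibm.enhanced-extract MIME type comes from the description above.

```python
import json
from dataclasses import dataclass
from typing import Dict, List, Optional

# MIME type stated for the Enhanced Extraction Annotation.
ENHANCED_EXTRACT_MIME = "application/x-ibm.enhanced-extract"


@dataclass
class Annotation:
    fmt: str        # hypothetical label: "text", "key-value-pairs", "tables", "markdown"
    mime_type: str
    content: str


def build_annotations(plain_text: str, form_fields: Dict[str, str]) -> List[Annotation]:
    return [
        # Plain text extract; "text/plain" is an assumption, not a documented value.
        Annotation("text", "text/plain", plain_text),
        # Form fields serialized as JSON; the real schema may differ.
        Annotation("key-value-pairs", ENHANCED_EXTRACT_MIME,
                   json.dumps(form_fields, sort_keys=True)),
    ]


def select_annotation(annotations: List[Annotation], fmt: str) -> Optional[Annotation]:
    # Return the first annotation in the requested format, or None if absent.
    return next((a for a in annotations if a.fmt == fmt), None)


annotations = build_annotations(
    "Name: Ada Lovelace",
    {"name": "Ada Lovelace", "policy": "P-1001"},
)
chosen = select_annotation(annotations, "key-value-pairs")
print(chosen.mime_type)
print(json.loads(chosen.content)["policy"])
```

An application that needs structured form data would pick the key-value pairs annotation, while one that only needs searchable text would pick the plain text annotation.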

Enhanced extraction characteristics

When you work with enhanced extraction, note the following characteristics:
  • Enhanced extraction requires persistent text extraction to be enabled for the document class.
  • You can configure multiple handlers for a document class. Each handler has a priority that determines its execution order.
  • Only one handler processes each document. Each handler examines the document and either processes it or passes it to the next handler in the priority chain.
  • Handlers can produce multiple annotation types: plain text, markdown, key-value pairs, and tables.
  • Applications can specify which annotation format to use for indexing, summarization, and generative AI inferences.
  • Enhanced extraction is configured at the document class level. Subclasses inherit the configuration from their parent class.
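
The class-level inheritance described in the last item can be pictured as a lookup that walks up the class hierarchy. This is a minimal sketch, not the product's implementation: the class names, the PARENTS and CONFIGURED tables, and the resolve_handlers function are hypothetical, and the sketch assumes that the nearest configured ancestor supplies the configuration.

```python
from typing import Dict, List, Optional

# Hypothetical class hierarchy: child class -> parent class.
PARENTS: Dict[str, Optional[str]] = {
    "Document": None,
    "Invoice": "Document",
    "ScannedInvoice": "Invoice",
    "Memo": "Document",
}

# Hypothetical configuration: only Invoice has enhanced extraction handlers.
CONFIGURED: Dict[str, List[str]] = {
    "Invoice": ["pdf-forms", "watson"],
}


def resolve_handlers(class_name: str) -> List[str]:
    # Walk from the class toward the root until a configuration is found.
    current: Optional[str] = class_name
    while current is not None:
        if current in CONFIGURED:
            return CONFIGURED[current]
        current = PARENTS.get(current)
    # No class in the chain is configured, so no enhanced handlers apply.
    return []


print(resolve_handlers("ScannedInvoice"))
print(resolve_handlers("Memo"))
```

Here ScannedInvoice inherits the handlers that are configured on Invoice, while Memo resolves to an empty list, which in the workflow corresponds to plain OIT extraction.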