Text extraction parameters

When you submit a text extraction request by using the watsonx.ai REST API, you include a payload that specifies configuration details for the text extraction operation.

Start by choosing the common REST API settings that control how text in your documents is processed. For details, see Text processing parameters.

In addition, configure the following settings in the text extraction REST API request body to meet your requirements:

Setting the output format with the requested_outputs parameter

By default, the extracted text is written in plain text. If you want the extracted text to be written in another format, such as Markdown, specify the following parameter in the API request body:

"parameters": {
  "requested_outputs": [
    "md"
  ]
}

The following table provides details about the different output formats generated by the text extraction process when you specify the requested_outputs parameter in your API request:

Output formats supported by the text extraction API
Requested output | Generated file type | Description
md | Markdown | Extracted information is serialized in Markdown format. Data structures such as section titles, tables, and paragraphs are represented using Markdown tags. The result does not contain key-value pair data.
html | HTML | Extracted information is serialized in HTML format. Data structures such as section titles, tables, and paragraphs are represented using HTML tags. The result does not contain key-value pair data.
plain_text | Plain text | Extracted information is serialized in plain text format. The result only contains unstructured text. The result does not contain tables, section titles, or key-value pair data.
assembly | JSON | Extract text into a JSON format. The result contains all unstructured text and data structures such as tables, key-value pairs, and visual bounding box information.
page_images | PNG | Extract each page of the document into a separate image.
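
As a rough illustration of the output formats listed above, the following Python sketch assembles the parameters fragment and rejects output names that are not in the table. The helper name build_parameters is hypothetical and is not part of any IBM SDK. The sketch only builds the dictionary and does not call the REST API.

```python
import json

# Output format names taken from the table of supported output formats.
SUPPORTED_OUTPUTS = ("md", "html", "plain_text", "assembly", "page_images")


def build_parameters(requested_outputs):
    """Return the "parameters" fragment for a text extraction request body.

    Hypothetical helper: it assembles and checks the dictionary only.
    """
    unknown = [name for name in requested_outputs if name not in SUPPORTED_OUTPUTS]
    if unknown:
        raise ValueError(f"Unsupported requested_outputs: {unknown}")
    # Drop duplicates while keeping the caller's order.
    unique = list(dict.fromkeys(requested_outputs))
    return {"parameters": {"requested_outputs": unique}}


if __name__ == "__main__":
    print(json.dumps(build_parameters(["md", "assembly"]), indent=2))
```

Checking the names before you send the request catches simple mistakes, such as writing markdown instead of md, on the client side.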

Setting the processing mode with the mode parameter

You can control the speed at which your text extraction request is processed by setting the mode parameter in your API request.

"parameters": {
  "mode": "standard"
}

The high_quality processing mode preserves all data structures in your document but may take longer to process than the standard mode. In the standard mode, the extraction request completes faster but generates lower quality output that may lack details.

Specifying how to represent embedded images with the create_embedded_images parameter

You can configure how to process images embedded in your document and convert them to Markdown and JSON formats.

An embedded image is the area on a page of the document that contains only the picture, excluding any portions of the page that contain text or tables. Text and tables in the original document are processed with optical character recognition (OCR). The embedded images extraction mode specifies how images in the document are serialized and preserved in the extracted output.

Based on the embedded images extraction mode you specify, you can choose how embedded images are represented in the output:

  • Whether to include images in the extracted output. If images are included, they are stored in the embedded_images_assembly folder as .png files.
  • Whether generic placeholder text or the text extracted by OCR directly from the image appears in the Markdown and JSON output formats.
  • Whether the image is verbalized, that is, described in natural language. For example, an image of a cat may be verbalized as "The image displays a cat resting on the floor."

To extract embedded images including text that describes the images, specify the following parameter in the API request body:

"parameters": {
  "create_embedded_images": "enabled_verbalization"
}

Images extracted in a JSON output format are represented in the Picture object. Based on the embedded images mode you specify, the following attributes in the JSON object are used to store the image details:

  • text: Stores a string that contains the text extracted directly from the image.
  • verbalization: Stores a string that contains the textual description of the image.
  • children_ids: Stores a list of token IDs. Each word in the text related to the image is represented as a token.

For details about the JSON output schema, see Text extraction JSON schema.
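
The Picture attributes described above could be read as in the following minimal Python sketch. It assumes that each Picture object has already been parsed into a dictionary with optional text, verbalization, and children_ids keys. The function name describe_picture and the sample values are made up for illustration, and the real layout of the assembly JSON may differ.

```python
def describe_picture(picture):
    """Return the best available text for one Picture object.

    Assumes `picture` is a dict parsed from the assembly JSON output.
    Prefers the natural-language description, then the OCR text,
    and returns an empty string when neither is present.
    """
    verbalization = (picture.get("verbalization") or "").strip()
    if verbalization:
        return verbalization
    return (picture.get("text") or "").strip()


if __name__ == "__main__":
    # Hand-written sample objects, not real API output.
    samples = [
        {"verbalization": "The image displays a cat resting on the floor.", "children_ids": ["t1", "t2"]},
        {"text": "Quarterly revenue", "children_ids": ["t3", "t4"]},
        {"text": "", "children_ids": ["t5"]},
    ]
    for sample in samples:
        print(repr(describe_picture(sample)))
```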

The following table provides details about the different modes you can use in your API request to extract embedded images:

Embedded images extraction modes in the text extraction API
disabled
  • Usage: Suited for an application that does not need to include images in the output. OCR processes tables and other data structures in the document.
  • Image (in bytes) in output: No
  • Markdown output details: None
  • JSON output details: None
enabled_placeholder
  • Usage: Suited for an application that needs to process images but either does not require image descriptions or uses a custom image verbalizer to generate them.
  • Markdown output details: Link to image location
  • JSON output details: Image in the pictures structure; picture.text is empty; list of token IDs that represent generic placeholder text in picture.children_ids
enabled_text
  • Usage: Suited for an application that needs to process images but either does not require image descriptions or uses a custom image verbalizer to generate them.
  • Markdown output details: Text is extracted from the image
  • JSON output details: Image in the pictures structure; text extracted directly from the image in picture.text; list of token IDs that represent text extracted from the image in picture.children_ids
enabled_verbalization
  • Usage: Suited for an application that uses image descriptions to implement image search. Only some images of interest, such as graphs, charts, and screenshots, are verbalized.
  • Markdown output details: Link to image location; textual description of the image
  • JSON output details: Image in the pictures structure; textual description of the image in picture.verbalization only if the image was verbalized in the original document; list of token IDs that represent the textual description of the image
enabled_verbalization_all
  • Usage: Suited for an application that uses image descriptions to implement image search. All embedded images in the document are verbalized.
  • Markdown output details: Link to image location; textual description of the image
  • JSON output details: Image in the pictures structure; textual description of the image in picture.verbalization only if the image was verbalized in the original document; list of token IDs that represent the textual description of the image

Specifying how to extract data in key-value pairs with the kvp_mode parameter

The extracted text is stored in a format where each piece of data (the value) is associated with a unique identifier (the key). Key-value pair data is extracted by using a general-purpose foundation model or a model that is tuned for specific document formats.

For details about how key-value pairs are processed by the document understanding technology, see Processing text as key-value pairs.

The following restrictions apply when you use the key-value pair extraction capability:

  • Key-value pair data extraction is only supported for English language documents.
  • The result of the key-value pair extraction is only available in the assembly output format. Key-value pairs are not extracted in the html, md, or plain_text output formats.

To extract generic labeled data and domain-specific data into key-value pairs with a general-purpose model, set kvp_mode in your text extraction request as follows:

"parameters": {
  "kvp_mode": "generic_with_semantic"
}
Note: The foundation models that you can use with the generic_with_semantic mode vary by data center. For details, see Regional availability of foundation models.
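
Because key-value pairs are available only in the assembly output, a client-side check like the following Python sketch can flag a request that sets kvp_mode without requesting assembly. The function name check_kvp_outputs is hypothetical. The sketch also assumes that kvp_mode and requested_outputs sit side by side in the parameters object, as the fragments on this page suggest.

```python
def check_kvp_outputs(parameters):
    """Return a list of warnings about key-value pair settings.

    `parameters` is the dictionary sent under "parameters" in the request body.
    """
    warnings = []
    if "kvp_mode" in parameters:
        # When requested_outputs is omitted, the output is plain text,
        # which does not contain key-value pair data.
        outputs = parameters.get("requested_outputs", [])
        if "assembly" not in outputs:
            warnings.append(
                'kvp_mode is set, but "assembly" is not in requested_outputs; '
                "key-value pairs will not appear in the output."
            )
    return warnings


if __name__ == "__main__":
    params = {"kvp_mode": "generic_with_semantic", "requested_outputs": ["md"]}
    for message in check_kvp_outputs(params):
        print(message)
```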

Configuring the key-value pair text extraction pipeline with the semantic_config parameter

You can use various common fields of the semantic_config parameter in the REST API request body to configure the key-value pair extraction pipeline. For details, see Configuring the key-value pair processing pipeline.

In addition, you can configure the following capabilities specific to the key-value pair extraction pipeline in the semantic_config parameter:

Setting the extraction method with the enable_schema_kvp and enable_generic_kvp fields

Based on the contents of your input document, you can process key-value pair data with one of the following methods:

Schema-based key-value pair extraction
The schema-based process targets specific fields in documents by using pre-defined or custom schemas for common document types like invoices, utility bills, passports, and more. Every page is classified into one of the supported schema types. Based on the classification, you can extract text into the key-value pair format defined in the schema for the specific document type. By classifying the document and targeting domain-specific data, this method increases accuracy for known document types without requiring dedicated model training.
Generic key-value pair extraction
The generic extraction process identifies and extracts content that can be represented as key-value pairs from a document. This method is useful for extracting labeled information when you do not have domain-specific knowledge of the document. Generic extraction works best for pages that do not fit into one of the pre-defined templates.

By default, both the generic and schema-based methods are used to extract text in the generic_with_semantic mode. You can choose to use one or both of the extraction methods in the same API request to accomplish the following tasks:

  • Extract a targeted set of fields with a schema by using the schema-based extraction method
  • Perform a broad sweep of the document to extract all generic data by using the generic extraction method

To use the schema-based extraction method, set the following semantic_config parameter in the text extraction request:

"semantic_config": {
  "enable_schema_kvp": true
}

To use the generic extraction method, set the following semantic_config parameter in the text extraction request:

"semantic_config": {
  "enable_generic_kvp": true
}
Note: When both the generic and schema-based extraction methods are enabled, the same value may be extracted twice. If you want to extract only schema-based key-value pair data, set enable_generic_kvp to false.
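
The choice between the two extraction methods can be sketched in Python as follows. The helper name extraction_method_config is hypothetical. Treating a request that disables both methods as an error is an assumption of this sketch, not documented API behavior.

```python
import json


def extraction_method_config(use_schema=True, use_generic=True):
    """Return a semantic_config fragment that selects the extraction methods.

    Both methods default to enabled, which mirrors the default behavior
    of the generic_with_semantic mode.
    """
    if not (use_schema or use_generic):
        # Assumption: a request with neither method enabled is a mistake.
        raise ValueError("Enable at least one extraction method.")
    return {
        "semantic_config": {
            "enable_schema_kvp": use_schema,
            "enable_generic_kvp": use_generic,
        }
    }


if __name__ == "__main__":
    # Schema-based extraction only, which avoids duplicate values.
    print(json.dumps(extraction_method_config(use_generic=False), indent=2))
```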

Defining pre-defined and custom schema interaction with the schemas_merge_strategy and force_schema_name fields

You can specify how pre-defined and custom schemas are combined to process key-value pair data in your document.

You can use common semantic_config settings such as the schemas_merge_strategy field to control how pre-defined and custom schemas are used together. For details, see Controlling how pre-defined and custom schemas interact.

However, if your document can be classified into one of the pre-defined document types, such as a receipt, but contains some data in a customized format specific to your business needs, you can define your own custom schema for the document. You can then manually override the pre-defined schema and use the custom receipt schema directly by setting the force_schema_name parameter in your API request as follows:

"semantic_config": {
  "enable_schema_kvp": true,
  "force_schema_name": "Receipt",
  "schemas": [ {
      "document_type": "Receipt",
      "document_description": "A receipt issued for a purchase at ABC store.",
      "fields": {
        "receipt_number": {
          "default": "",
          "example": "R-20241027-ABC",
          "description": "Unique identifier on the receipt."
        },
        "customer_name": {
          "default": "",
          "example": "John Smith",
          "description": "Full name of the customer or payee."
        },
        "date_of_transaction": {
          "default": "",
          "example": "2023-01-01",
          "description": "Date when the purchase or payment occurred."
        },
        "total_paid": {
          "default": "",
          "example": "8.64",
          "description": "Final amount paid by the customer."
        },
        "payment_method": {
          "default": "",
          "example": "Credit Card",
          "description": "How payment was made, such as cash, card, or check."
        }
      }
  } ]
}
Note: The force_schema_name must exactly match the document_type from one of the pre-defined schemas or a custom schema you create. Otherwise, the text extraction process may fail to extract any fields or extract incorrect data from your document.
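
The exact-match requirement in the note above could be checked before sending a request, as in the following Python sketch. The function name check_force_schema_name and the predefined_types argument are hypothetical. This page does not list the names of the pre-defined schemas, so the caller must supply them.

```python
def check_force_schema_name(semantic_config, predefined_types=()):
    """Return True if force_schema_name matches a known document_type exactly.

    `semantic_config` is the dictionary sent as "semantic_config".
    `predefined_types` holds pre-defined schema names supplied by the caller.
    """
    name = semantic_config.get("force_schema_name")
    if name is None:
        return True  # Nothing is forced, so there is nothing to check.
    custom_types = {
        schema.get("document_type") for schema in semantic_config.get("schemas", [])
    }
    # The comparison is case-sensitive: "receipt" does not match "Receipt".
    return name in custom_types or name in set(predefined_types)


if __name__ == "__main__":
    config = {
        "enable_schema_kvp": True,
        "force_schema_name": "Receipt",
        "schemas": [{"document_type": "Receipt", "fields": {}}],
    }
    print(check_force_schema_name(config))
```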

The following table describes the difference in the methods you can use to process your documents with custom schemas:

Ways to use custom schemas in text extraction
replace schema merge strategy
  Description:
  • Text extraction does not use any pre-defined schemas and only considers the custom schemas you provide. You can provide one or multiple schemas in the API request.
  • The text extraction process classifies the input document using the description from the custom schema and extracts fields from the document that match the fields in the custom schema. If the extracted data does not match any fields in any of the custom schemas you provide, no data is extracted.
  Recommended use:
  • Use when you do not know the exact format of the input document, which can contain unique fields that fit best into a custom schema definition.
  • Use when your custom schema is similar to a pre-defined schema and you want to prevent the text extraction process from incorrectly using a pre-defined schema. For example, when you have an invoice schema but with fields specific to your business needs.
force_schema_name setting
  Description:
  • Text extraction uses the custom schema you provide directly without classifying the document.
  Recommended use:
  • Use when you know beforehand the type your input documents can be categorized into, such as a utility bill.
  • The schema name you specify can be a pre-defined schema type or a custom schema type.

Including key-value pair location with the grounding_mode field

A grounded key-value pair has both the key and value associated with a physical location in the document. The physical location is represented as a bounding box in the extracted output.

Grounding information for extracted structured and semi-structured data provides the following capabilities:

Enhanced verification
When a reviewer needs to validate or correct the extracted data, bounding boxes provide visual context and make it easy to locate the key and value in the original document.
Traceability and auditability
Grounding allows end-users or downstream systems to trace each extracted piece of data back to its source location. Traceability supports audits and compliance checks, and facilitates dispute resolution processes.
Integration with UI workflows
Applications like document viewers or annotation tools rely on bounding boxes to highlight or allow interaction with extracted fields. Grounding data connects raw data with user interface experiences.

The following table describes the modes that you can set in the grounding_mode field of the semantic_config parameter to control the precision of the grounding information that is included with extracted key-value pair data:

Grounding modes for key-value pairs
Mode | Description
fast | Optimized for speed, with less precise grounding information.
precise | Provides more precise grounding information at a higher compute cost.

Learn more