Text extraction parameters
When you submit a text extraction request by using the watsonx.ai REST API, you include a payload that specifies configuration details for the text extraction operation.
Choose values for the common REST API settings that control how text in your documents is processed. For details, see Text processing parameters.
In addition, configure the following settings in the text extraction REST API request body to meet your requirements:
- Format in which to store the extracted text
- Quality and speed of text extraction
- Include text from images in the extracted output
- Include key-value pairs in the extracted output
Setting the output format with the requested_outputs parameter
By default, the extracted text is written in plain text. If you want the extracted text to be written in another format, such as Markdown, specify the following parameter in the API request body:
"parameters": {
"requested_outputs": [
"md"
]
}
The following table provides details about the different output formats generated by the text extraction process when you specify the requested_outputs parameter in your API request:
| Requested output | Generated file type | Description |
|---|---|---|
| md | Markdown | Extracted information is serialized in Markdown format. Data structures such as section titles, tables, and paragraphs are represented using Markdown tags. The result does not contain key-value pair data. |
| html | HTML | Extracted information is serialized in HTML format. Data structures such as section titles, tables, and paragraphs are represented using HTML tags. The result does not contain key-value pair data. |
| plain_text | Plain text | Extracted information is serialized in plain text format. The result only contains unstructured text. The result does not contain tables, section titles, or key-value pair data. |
| assembly | JSON | Extract text into a JSON format. The result contains all unstructured text and data structures such as tables, key-value pairs, and visual bounding box information. |
| page_images | PNG | Extract each page of the document into a separate image. |
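As a minimal sketch of how the values in the table might be used, the following Python helper builds the "parameters" fragment and rejects any value that the table does not list. The helper name build_output_parameters is hypothetical, and the sketch assumes that several formats can be requested in one list because requested_outputs is an array. It only builds the payload fragment and does not call the API.

```python
import json

# Output format values taken from the requested_outputs table.
KNOWN_OUTPUTS = {"md", "html", "plain_text", "assembly", "page_images"}


def build_output_parameters(requested_outputs):
    """Return a "parameters" fragment after checking each requested format.

    Hypothetical helper for illustration; not part of any SDK.
    """
    unknown = [fmt for fmt in requested_outputs if fmt not in KNOWN_OUTPUTS]
    if unknown:
        raise ValueError(f"Unsupported requested_outputs values: {unknown}")
    return {"parameters": {"requested_outputs": list(requested_outputs)}}


# Assumption: Markdown and JSON (assembly) output are requested together.
example = build_output_parameters(["md", "assembly"])
print(json.dumps(example, indent=2))
```

Validating the values on the client side catches a mistyped format name, such as markdown instead of md, before the request is sent.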
Setting the processing mode with the mode parameter
You can control the speed at which your text extraction request is processed by setting the mode parameter in your API request.
"parameters": {
"mode": "standard"
}
The high_quality processing mode preserves all data structures in your document but may take longer to process than the standard mode. In the standard mode, the extraction request completes faster but generates lower quality output that may lack details.
Specifying how to represent embedded images with the create_embedded_images parameter
You can configure how to process images embedded in your document and convert them to Markdown and JSON formats.
The embedded image is the area on a page of the document that represents only the picture without including portions of the page that contain text or tables. Text and tables in the original document are processed with optical character recognition (OCR). The embedded images extraction mode is used to specify how to serialize images in the document and preserve them in the extracted output.
Based on the embedded images extraction mode you specify, you can choose how embedded images are represented in the output:
- Whether to include images in the extracted output. If images are included, they are stored in the embedded_images_assembly folder as .png files.
- Whether generic placeholder text or the text extracted by OCR directly from the image appears in the Markdown and JSON output formats.
- Whether the image is verbalized by describing the image in natural language. For example, an image of a cat may be verbalized as "The image displays a cat resting on the floor."
To extract embedded images including text that describes the images, specify the following parameter in the API request body:
"parameters": {
"create_embedded_images": "enabled_verbalization"
}
Images extracted in a JSON output format are represented in the Picture object. Based on the embedded images mode you specify, the following attributes in the JSON object are used to store the image details:
- text: Stores a string that contains the text extracted directly from the image.
- verbalization: Stores a string that contains the textual description of the image.
- children_ids: Each word in the text related to an image is represented as tokens and stored as a list of token IDs.
For details about the JSON output schema, see Text extraction JSON schema.
The following table provides details about the different modes you can use in your API request to extract embedded images:
| Mode | Usage | Image (in bytes) in output | Markdown output details | JSON output details |
|---|---|---|---|---|
| disabled | Suited for an application that does not need to include images in the output. OCR processes tables and other data structures in the document. | No | None | None |
| enabled_placeholder | Suited for an application that needs to process images, but does not require image descriptions or uses a custom image verbalizer to generate image descriptions. | ✓ | Link to image location | • Image in the pictures structure • picture.text is empty • List of token IDs that represent generic placeholder text in picture.children_ids |
| enabled_text | Suited for an application that needs to process images, but does not require image descriptions or uses a custom image verbalizer to generate image descriptions. | ✓ | Text is extracted from the image | • Image in the pictures structure • Text extracted directly from the image in picture.text • List of token IDs that represent text extracted from the image in picture.children_ids |
| enabled_verbalization | • Suited for an application that uses image descriptions to implement image search. • Only some images of interest, such as graphs, charts, and screenshots, are verbalized. | ✓ | • Link to image location • Textual description of the image | • Image in the pictures structure • Textual description of the image in picture.verbalization only if the image was verbalized in the original document • List of token IDs that represent the textual description of the image |
| enabled_verbalization_all | • Suited for an application that uses image descriptions to implement image search. • All embedded images in the document are verbalized. | ✓ | • Link to image location • Textual description of the image | • Image in the pictures structure • Textual description of the image in picture.verbalization only if the image was verbalized in the original document • List of token IDs that represent the textual description of the image |
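The following Python sketch illustrates how an application might choose a description for each Picture object after the assembly JSON is parsed. It relies only on the text, verbalization, and children_ids attributes described in this section. The helper name describe_picture and the sample data, including the token ID values, are hypothetical; see the JSON schema documentation for the actual layout of the output.

```python
def describe_picture(picture):
    """Return the most descriptive string available for one Picture object.

    Prefers the natural-language verbalization, falls back to the text
    extracted directly from the image, and returns None when both are empty.
    """
    for field in ("verbalization", "text"):
        value = picture.get(field)
        if value:
            return value
    return None


# Hypothetical Picture objects; the token IDs are made up for illustration.
sample_pictures = [
    # Placeholder-style result: picture.text is empty.
    {"text": "", "children_ids": ["t1", "t2"]},
    # Text extracted directly from the image.
    {"text": "Q3 revenue", "children_ids": ["t3", "t4"]},
    # Verbalized image.
    {
        "text": "",
        "verbalization": "The image displays a cat resting on the floor.",
        "children_ids": ["t5"],
    },
]

descriptions = [describe_picture(p) for p in sample_pictures]
print(descriptions)
```

Preferring verbalization over text is a design choice of this sketch, suited to image search; an application that indexes the literal text in an image might reverse the order.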
Specifying how to extract data in key-value pairs with the kvp_mode parameter
The extracted text is stored in a format where each piece of data (the value) is associated with a unique identifier (the key). Key-value pair data is extracted by using a general-purpose foundation model or a model that is tuned for specific document formats.
For details about how key-value pairs are processed by the document understanding technology, see Processing text as key-value pairs.
The following restrictions apply when you use the key-value pair extraction capability:
- Key-value pair data extraction is only supported for English language documents.
- The result of the key-value pair extraction is only available in the assembly output format. Key-value pairs are not extracted in the html, markdown, or plain_text output formats.
To extract generic labeled data and domain-specific data into a key-value pair format with a general-purpose model, set the kvp_mode parameter in your text extraction request as follows:
"parameters": {
"kvp_mode": "generic_with_semantic"
}
The foundation models that you can use with the generic_with_semantic mode vary by data center. For details, see Regional availability of foundation models.
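Because key-value pairs are only available in the assembly output format, a request that sets kvp_mode is only useful if assembly is among the requested outputs. The following Python sketch shows one way to enforce that on the client side. The helper name build_kvp_parameters is hypothetical, and the sketch assumes that kvp_mode and requested_outputs sit side by side in the same "parameters" object, as the fragments in this topic suggest.

```python
def build_kvp_parameters(requested_outputs, kvp_mode="generic_with_semantic"):
    """Build a "parameters" fragment that enables key-value pair extraction.

    Adds the assembly format when it is missing, because key-value pairs
    are not written to the other output formats.
    """
    outputs = list(requested_outputs)
    if "assembly" not in outputs:
        outputs.append("assembly")
    return {"parameters": {"requested_outputs": outputs, "kvp_mode": kvp_mode}}


# A caller that asked only for Markdown still gets the JSON output
# that carries the key-value pairs.
kvp_request = build_kvp_parameters(["md"])
print(kvp_request)
```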
Configuring the key-value pair text extraction pipeline with the semantic_config parameter
You can use various common fields of the semantic_config parameter in the REST API request body to configure the key-value pair extraction pipeline. For details, see Configuring the key-value pair processing pipeline.
In addition, you can configure the following capabilities specific to the key-value pair extraction pipeline in the semantic_config parameter:
Setting the extraction method with the enable_schema_kvp and enable_generic_kvp fields
Based on the contents of your input document, you can process key-value pair data with one of the following methods:
- Schema-based key-value pair extraction: The schema-based process targets specific fields in documents by using pre-defined or custom schemas for common document types like invoices, utility bills, passports, and more. Every page is classified into one of the supported schema types. Based on the classification, you can extract text into the key-value pair format defined in the schema for the specific document type. By classifying the document and targeting domain-specific data, this method increases accuracy for known document types without requiring dedicated model training.
- Generic key-value pair extraction: The generic extraction process identifies and extracts content that can be represented as key-value pairs from a document. This method is useful for extracting labeled information when you do not have domain-specific knowledge of the document. Generic extraction works best for pages that do not fit into one of the pre-defined templates.
By default, both the generic and schema-based methods are used to extract text in the generic_with_semantic mode. You can choose to use one or both of the extraction methods in the same API request to accomplish the following tasks:
- Extract a targeted set of fields with a schema by using the schema-based extraction method
- Perform a broad sweep of the document to extract all generic data by using the generic extraction method
To use the schema-based extraction method, set the following semantic_config parameter in the text extraction request:
"semantic_config": {
"enable_schema_kvp": true
}
To use the generic extraction method, set the following semantic_config parameter in the text extraction request:
"semantic_config": {
"enable_generic_kvp": true
}
To use only the schema-based extraction method, set enable_generic_kvp to false.
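The two fields can be set together in one semantic_config object. The following Python sketch builds that object for any combination of the two methods. The helper name kvp_method_config is hypothetical. Rejecting the case where both methods are turned off is an assumption of this sketch, made because such a request would have no extraction method left to run; it is not documented API behavior.

```python
def kvp_method_config(use_schema=True, use_generic=True):
    """Return a "semantic_config" fragment that selects extraction methods.

    Both methods are on by default, which mirrors the default behavior
    described for the generic_with_semantic mode.
    """
    if not (use_schema or use_generic):
        # Assumption of this sketch, not a documented API rule.
        raise ValueError("Enable at least one key-value pair extraction method.")
    return {
        "semantic_config": {
            "enable_schema_kvp": use_schema,
            "enable_generic_kvp": use_generic,
        }
    }


# Targeted extraction with schemas only.
schema_only = kvp_method_config(use_schema=True, use_generic=False)
print(schema_only)
```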
Defining pre-defined and custom schema interaction with the schemas_merge_strategy and force_schema_name fields
You can specify how a combination of pre-defined and custom schemas is used to process key-value pair data in your document.
You can use common semantic_config settings such as the schemas_merge_strategy field to control how pre-defined and custom schemas are used together. For details, see Controlling how pre-defined and custom schemas interact.
However, if your document can be classified into one of the pre-defined document types, such as a receipt, but contains some data in a customized format specific to your business needs, you can define your own custom schema for the document.
You can then manually override the pre-defined schema and use the custom receipt schema directly by setting the force_schema_name parameter in your API request. In the following example, force_schema_name forces the document to be processed as a receipt:
"semantic_config": {
"enable_generic_kvp": false,
"enable_schema_kvp": true,
"force_schema_name": "Receipt",
"schemas": [ {
"document_type": "Receipt",
"document_description": "A receipt issued for a purchase at ABC store.",
"fields": {
"receipt_number": {
"default": "",
"example": "R-20241027-ABC",
"description": "Unique identifier on the receipt."
},
"customer_name": {
"default": "",
"example": "John Smith",
"description": "Full name of the customer or payee."
},
"date_of_transaction": {
"default": "",
"example": "2023-01-01",
"description": "Date when the purchase or payment occurred."
},
"total_paid": {
"default": "",
"example": "8.64",
"description": "Final amount paid by the customer."
},
"payment_method": {
"default": "",
"example": "Credit Card",
"description": "How payment was made, such as cash, card, check, etc."
}
}
} ]
}
force_schema_name must exactly match the document_type from one of the pre-defined schemas or a custom schema you create. Otherwise, the text extraction process may fail to extract any fields or extract incorrect data from your document.
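Because the match must be exact, it can help to check the name before the request is sent. The following Python sketch compares force_schema_name with the document_type of each custom schema in the same semantic_config object. The helper name check_force_schema_name is hypothetical. The names of the pre-defined schemas are not listed in this topic, so the sketch takes them as a caller-supplied argument instead of guessing them.

```python
def check_force_schema_name(semantic_config, predefined_types=()):
    """Return True if force_schema_name matches a known document_type.

    The comparison is exact and case-sensitive. predefined_types is a
    caller-supplied collection of pre-defined schema names.
    """
    name = semantic_config.get("force_schema_name")
    if name is None:
        return True  # Nothing is forced, so there is nothing to check.
    custom_types = {
        schema.get("document_type")
        for schema in semantic_config.get("schemas", [])
    }
    return name in custom_types or name in set(predefined_types)


config = {
    "enable_schema_kvp": True,
    "force_schema_name": "Receipt",
    "schemas": [{"document_type": "Receipt", "fields": {}}],
}
print(check_force_schema_name(config))
```

With this check, a value such as "receipt" in lowercase is flagged on the client side instead of leading to missing or incorrect fields in the output.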
The following table describes the difference in the methods you can use to process your documents with custom schemas:
| Custom schema selection method | Description | Recommended use |
|---|---|---|
| replace schema merge strategy | • Text extraction does not use any pre-defined schemas and only considers the custom schemas you provide. You can provide one or multiple schemas in the API request. • The text extraction process classifies the input document using the description from the custom schema and extracts fields from the document that match the fields in the custom schema. If the extracted data does not match any fields in any of the custom schemas you provide, no data is extracted. | • Use when you do not know the exact format of the input document, which can contain unique fields that fit best into a custom schema definition. • Use when your custom schema is similar to a pre-defined schema and you want to prevent the text extraction process from incorrectly using a pre-defined schema. For example, when you have an invoice schema but with fields specific to your business needs. |
| force_schema_name setting | • Text extraction uses the custom schema you provide directly without classifying the document. | • Use when you know beforehand the type your input documents can be categorized into, such as a utility bill. • The schema name you specify can be a pre-defined schema type or a custom schema type. |
Including key-value pair location with the grounding_mode field
A grounded key-value pair has both the key and value associated with a physical location in the document. The physical location is represented as a bounding box in the extracted output.
Grounding information for extracted structured and semi-structured data provides the following capabilities:
- Enhanced verification: When a reviewer needs to validate or correct the extracted data, bounding boxes provide visual context, and make it easy to locate the key and value in the original document.
- Traceability and auditability: Grounding allows end-users or downstream systems to trace each extracted piece of data back to its source location. Traceability supports audits and compliance checks, and facilitates dispute resolution processes.
- Integration with UI workflows: Applications like document viewers or annotation tools rely on bounding boxes to highlight or allow interaction with extracted fields. Grounding data connects raw data with user interface experiences.
The following table provides details about the different modes you can use in the grounding_mode setting in the semantic_config parameter to set the precision of grounding information to include with extracted key-value pair data:
| Mode | Description |
|---|---|
| fast | Optimized for speed with lower precision in grounding information accuracy. |
| precise | Provides higher precision in grounding information accuracy with higher compute cost. |
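As a small illustration, the following Python sketch adds grounding_mode to an existing semantic_config object and accepts only the two modes from the table. The helper name with_grounding is hypothetical, and the sketch returns a copy so that the caller's original configuration is left unchanged.

```python
# Grounding modes taken from the grounding_mode table.
GROUNDING_MODES = {"fast", "precise"}


def with_grounding(semantic_config, mode):
    """Return a copy of semantic_config with grounding_mode set."""
    if mode not in GROUNDING_MODES:
        raise ValueError(f"grounding_mode must be one of {sorted(GROUNDING_MODES)}")
    updated = dict(semantic_config)
    updated["grounding_mode"] = mode
    return updated


base_config = {"enable_schema_kvp": True}
grounded = with_grounding(base_config, "precise")
print({"semantic_config": grounded})
```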
Learn more
- For details about the different parameters you can set to customize your text extraction REST API request, see the watsonx.ai API reference documentation.
- Extracting text from documents
- Parsing extracted JSON structures