Text extraction parameters
When you submit a text extraction request by using the watsonx.ai REST API, you include a payload that specifies configuration details for the text extraction operation.
Choose values for the following text extraction parameters to meet your requirements:
- Format in which to store the extracted text
- Quality and speed of text extraction
- Language of the input text
- Whether to include text from images in the extracted output
- Whether to include key-value pairs in the extracted output
For details about the different parameters you can set to customize your text extraction REST API request, see the watsonx.ai API reference documentation.
Specifying the output format with the requested_outputs parameter
By default, the extracted text is written in plain text. If you want the extracted text to be written in another format, such as Markdown, specify the following parameter in the API request body:
"parameters": {
  "requested_outputs": [
    "md"
  ]
}
The following table provides details about the different output formats generated by the text extraction process when you specify the requested_outputs parameter in your API request:
| Requested output | Generated file type | Description |
|---|---|---|
| md | Markdown | Extracted information is serialized in Markdown format. Data structures such as section titles, tables, and paragraphs are represented using Markdown tags. The result does not contain key-value pair data. |
| html | HTML | Extracted information is serialized in HTML format. Data structures such as section titles, tables, and paragraphs are represented using HTML tags. The result does not contain key-value pair data. |
| plain_text | Plain text | Extracted information is serialized in plain text format. The result only contains unstructured text. The result does not contain tables, section titles, or key-value pair data. |
| assembly | JSON | Extract text into a JSON format. The result contains all unstructured text and data structures such as tables, key-value pairs, and visual bounding box information. |
| page_images | PNG | Extract each page of the document into a separate image. |
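As a rough sketch of how the values in this table might be used, the following Python snippet builds the parameters object and rejects any output format that the table does not list. The helper name build_parameters is hypothetical and is not part of the watsonx.ai API; only the requested_outputs key and its values come from this topic.

```python
import json

# Output formats listed in the requested_outputs table.
SUPPORTED_OUTPUTS = {"md", "html", "plain_text", "assembly", "page_images"}


def build_parameters(requested_outputs):
    """Return a request-body fragment with a validated requested_outputs list.

    Hypothetical helper for illustration; not part of the watsonx.ai API.
    """
    unknown = [fmt for fmt in requested_outputs if fmt not in SUPPORTED_OUTPUTS]
    if unknown:
        raise ValueError(f"Unsupported output format(s): {unknown}")
    return {"parameters": {"requested_outputs": list(requested_outputs)}}


# Request Markdown plus the JSON assembly in one extraction.
fragment = build_parameters(["md", "assembly"])
print(json.dumps(fragment, indent=2))
```

Validating locally like this only catches typos in the format names; the API reference remains the authority on which combinations are accepted.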
Setting the processing mode with the mode parameter
You can control the speed at which your text extraction request is processed by setting the mode parameter in your API request.
"parameters": {
  "mode": "standard"
}
The high quality processing mode preserves all data structures in your document but may take longer to process than the standard mode. In the standard mode, the extraction request completes faster but generates lower quality output that may lack details.
For details about the different processing modes, see the watsonx.ai API reference documentation.
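The trade-off between the two modes could be expressed as a small selection helper, sketched below. The function choose_mode is hypothetical. The value "standard" appears in this topic, but "high_quality" is only an assumed identifier for the high quality mode; check the API reference for the exact value before you use it.

```python
def choose_mode(prefer_speed):
    """Pick a processing mode for the request body.

    "standard" appears in this topic. "high_quality" is an ASSUMED
    identifier for the high quality mode; confirm it in the API reference.
    """
    return "standard" if prefer_speed else "high_quality"


# Favor faster processing over output fidelity.
parameters = {"mode": choose_mode(prefer_speed=True)}
print(parameters)
```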
Specifying the languages of the input document in the languages parameter
If your document is in a language other than English, you must specify the language by its ISO 639 language code in the languages parameter of your API request.
"parameters": {
  "languages": [
    "de"
  ]
}
If the document has a mix of languages, list each language separately.
For example, you can extract text from images in a document with a mix of English and French text because both languages are Latin based. However, you cannot extract text from images in a document with a mix of Japanese and French text.
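One way to read this example is that languages can be mixed for image text extraction only when they share a script. The following sketch checks that condition before a request is sent. It is an interpretation of the example, not a documented rule, and the lookup holds only a small subset of the machine-printed language table in this topic. The function name shares_one_script is hypothetical.

```python
# Subset of the machine-printed language table in this topic:
# ISO 639 language code -> script.
SCRIPT_BY_LANGUAGE = {
    "en": "Latin",
    "fr": "Latin",
    "de": "Latin",
    "ru": "Cyrillic",
    "hi": "Devanagari",
    "ja": "Japanese",
}


def shares_one_script(language_codes):
    """Return True if every listed language uses the same script.

    Raises KeyError for a code that is missing from the subset above.
    """
    scripts = {SCRIPT_BY_LANGUAGE[code] for code in language_codes}
    return len(scripts) == 1


print(shares_one_script(["en", "fr"]))  # True: both are Latin based
print(shares_one_script(["ja", "fr"]))  # False: Japanese and Latin scripts
```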
The language code you specify differs based on whether your document contains machine-printed text or handwriting.
Supported handwritten languages
If your document contains text in English handwriting, use the en_hw language code in your API request body.
Supported machine-printed languages
The following table provides details about the languages supported by the text extraction API for printed text recognition:
| Language | ISO 639 language code | API script code | Script |
|---|---|---|---|
| Acehnese | ‐ | latn | Latin |
| Afrikaans | af | latn | Latin |
| Albanian | sq | latn | Latin |
| Araucanian/Mapuche | ‐ | latn | Latin |
| Awadhi | ‐ | deva | Devanagari |
| Aymara | ay | latn | Latin |
| Balinese | ‐ | latn | Latin |
| Baso Minangkabau | ‐ | latn | Latin |
| Basque | eu | latn | Latin |
| Belarusian | be | cyrl | Cyrillic |
| Bemba | ‐ | latn | Latin |
| Bikol | ‐ | latn | Latin |
| Bislama | bi | latn | Latin |
| Bhojpuri | ‐ | deva | Devanagari |
| Bulgarian | bg | cyrl | Cyrillic |
| Catalan | ca | latn | Latin |
| Cebuano | ‐ | latn | Latin |
| Chechen | ‐ | cyrl | Cyrillic |
| Chinese (Simplified) | zh_cn | cjk | Han (Simplified) |
| Chinese (Traditional) | zh_tw | cjk | Han (Traditional) |
| Choctaw | ‐ | latn | Latin |
| Cree | cr | latn | Latin |
| Dakota | ‐ | latn | Latin |
| Danish | da | latn | Latin |
| Dogri | ‐ | deva | Devanagari |
| Dutch | nl | latn | Latin |
| English | en | latn | Latin |
| Estonian | et | latn | Latin |
| Fijian | fj | latn | Latin |
| Filipino | fil | latn | Latin |
| Finnish | fi | latn | Latin |
| French | fr | latn | Latin |
| Galician | gl | latn | Latin |
| Gayo | ‐ | latn | Latin |
| German | de | latn | Latin |
| Gilbertese | ‐ | latn | Latin |
| Greek | el | el | Greek |
| Haitian Creole | ht | latn | Latin |
| Hebrew | he | he | Hebrew |
| Hiligaynon | ‐ | latn | Latin |
| Hindi | hi | deva | Devanagari |
| Iban | ‐ | latn | Latin |
| Iloko | ‐ | latn | Latin |
| Indonesian | id | latn | Latin |
| Irish | ga | latn | Latin |
| Italian | it | it | Latin |
| Japanese | ja | cjk | Japanese |
| Javanese | jv | latn | Latin |
| Kachin | ‐ | latn | Latin |
| Kalaallisut | kl | latn | Latin |
| Kanienʼkéha | ‐ | latn | Latin |
| Khasi | ‐ | latn | Latin |
| Kinyarwanda | rw | latn | Latin |
| Konkani | ‐ | deva | Devanagari |
| Kongo | kg | latn | Latin |
| Korean | ko | cjk | Korean |
| Kosraean | ‐ | latn | Latin |
| Kuanyama | kj | latn | Latin |
| Latin | la | latn | Latin |
| Lozi | ‐ | latn | Latin |
| Low German | ‐ | latn | Latin |
| Luo | ‐ | latn | Latin |
| Malagasy | mg | latn | Latin |
| Maithili | ‐ | deva | Devanagari |
| Manx | gv | latn | Latin |
| Marathi | mr | deva | Devanagari |
| Middle English | ‐ | latn | Latin |
| Mittelhochdeutsch | ‐ | latn | Latin |
| Macedonian | mk | cyrl | Cyrillic |
| Ndonga | ng | latn | Latin |
| Nepali | ne | deva | Devanagari |
| North Ndebele | nd | latn | Latin |
| Norwegian | no | no | Latin |
| Nyankole | ‐ | latn | Latin |
| Occitan | oc | latn | Latin |
| Ojibwa | oj | latn | Latin |
| Old English | ‐ | latn | Latin |
| Old French | ‐ | latn | Latin |
| Old High German | ‐ | latn | Latin |
| Old Norse | ‐ | latn | Latin |
| Old Provençal | ‐ | latn | Latin |
| Pampanga | ‐ | latn | Latin |
| Pangasinan | ‐ | latn | Latin |
| Papiamento | ‐ | latn | Latin |
| Polish | pl | latn | Latin |
| Portuguese | pt | pt | Latin |
| Quechua | qu | latn | Latin |
| Romansh | rm | latn | Latin |
| Rundi | rn | latn | Latin |
| Russian | ru | cyrl | Cyrillic |
| Sango | sg | latn | Latin |
| Sanskrit | sa | deva | Devanagari |
| Scots | ‐ | latn | Latin |
| Serbian | sr | cyrl | Cyrillic |
| Shona | sn | latn | Latin |
| Spanish | es | es | Latin |
| Sundanese | su | latn | Latin |
| Swahili | sw | latn | Latin |
| Swati | ss | latn | Latin |
| Swedish | sv | sv | Latin |
| Tamil | ta | deva | Tamil |
| Telugu | te | deva | Telugu |
| Tsonga | ts | latn | Latin |
| Tswana | tn | latn | Latin |
| Ukrainian | uk | cyrl | Cyrillic |
| Uzbek | uz | cyrl | Cyrillic |
| Xhosa | xh | latn | Latin |
| Zulu | zu | latn | Latin |
Specifying how to extract text from images with the ocr_mode and create_embedded_images parameters
You can specify how to process text in the images in your document by using optical character recognition (OCR). Specify the following parameter in the API request body:
"parameters": {
  "ocr_mode": "enabled"
}
For details about the different OCR modes, see the watsonx.ai API reference documentation.
The following table provides details about the different OCR modes you can use to specify how to process images in your API request:
| OCR mode | Description |
|---|---|
| disabled | Image files and scanned documents are not processed. For hybrid documents that contain both visuals and text, only text is extracted. |
| enabled | OCR is only run if no text could be extracted from the document. Images embedded in documents are processed. |
| forced | Each page of the document is converted to an image and processed with OCR. Every document type, including text-only files, is converted to images before it is processed. |
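The behavior described in the table can be summarized as a small decision function. This is a simplified reading for illustration only: the function page_ocr_runs is hypothetical, and it models only whether pages are sent through OCR, not how embedded images are handled.

```python
def page_ocr_runs(ocr_mode, has_extractable_text):
    """Return True if pages are processed with OCR under the given mode.

    A simplified reading of the OCR mode table; it ignores the handling
    of embedded images.
    """
    if ocr_mode == "disabled":
        # Image files and scanned documents are not processed.
        return False
    if ocr_mode == "enabled":
        # OCR is a fallback: it runs only when no text could be extracted.
        return not has_extractable_text
    if ocr_mode == "forced":
        # Every page is converted to an image and processed with OCR.
        return True
    raise ValueError(f"Unknown ocr_mode: {ocr_mode!r}")


# A text-only file is still sent through OCR in forced mode.
print(page_ocr_runs("forced", has_extractable_text=True))
```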
You can also configure how to process images embedded in your document and convert them to Markdown and JSON formats.
The embedded image is the area on a page of the document that represents only the picture without including portions of the page that contain text or tables. Text and tables in the original document are processed with OCR. The embedded images extraction mode is used to specify how to serialize images in the document and preserve them in the extracted output.
Based on the embedded images extraction mode you specify, you can choose how embedded images are represented in the output:
- Whether to include images in the extracted output. If images are included, they are stored in the embedded_images_assembly folder as .png files.
- Whether generic placeholder text or the text extracted by OCR directly from the image appears in the Markdown and JSON output formats.
- Whether the image is verbalized by describing the image in natural language. For example, an image of a cat may be verbalized as "The image displays a cat resting on the floor."
To extract embedded images including text that describes the images, specify the following parameter in the API request body:
"parameters": {
  "create_embedded_images": "enabled_verbalization"
}
Images extracted in a JSON output format are represented in the Picture object. Based on the embedded images mode you specify, the following attributes in the JSON object are used to store the image details:
- text: Stores a string that contains the text extracted directly from the image.
- verbalization: Stores a string that contains the textual description of the image.
- children_ids: Stores a list of token IDs. Each word in the text related to an image is represented as a token.
For details about the JSON output schema, see Text extraction JSON schema.
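As a minimal sketch of how an application might consume these attributes, the following snippet picks the most descriptive string available for a picture. The picture is modeled as a plain Python dictionary with hypothetical sample values; only the attribute names text, verbalization, and children_ids come from this topic, and the real layout of the assembly JSON is defined by the schema. The function describe_picture is also hypothetical.

```python
def describe_picture(picture):
    """Return the most descriptive string stored for a picture, or None.

    `picture` is a plain dict standing in for a Picture object; only the
    attribute names `verbalization` and `text` come from this topic.
    """
    # Prefer the natural-language description, then the text extracted
    # directly from the image. Empty strings count as missing.
    return picture.get("verbalization") or picture.get("text") or None


# Hypothetical sample data, not real API output.
sample = {
    "text": "",
    "verbalization": "The image displays a cat resting on the floor.",
    "children_ids": ["token_1", "token_2"],
}
print(describe_picture(sample))
```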
The following table provides details about the different modes you can use in your API request to extract embedded images:
| Mode | Usage | Image (in bytes) in output | Markdown output details | JSON output details |
|---|---|---|---|---|
| disabled | Suited for an application that does not need to include images in the output. OCR processes tables and other data structures in the document. | No | None | None |
| enabled_placeholder | Suited for an application that needs to process images, but does not require image descriptions or uses a custom image verbalizer to generate image descriptions. | ✓ | Link to image location | • Image in the pictures structure • picture.text is empty • List of token IDs that represent generic placeholder text in picture.children_ids |
| enabled_text | Suited for an application that needs to process images, but does not require image descriptions or uses a custom image verbalizer to generate image descriptions. | ✓ | Text is extracted from the image | • Image in the pictures structure • Text extracted directly from the image in picture.text • List of token IDs that represent text extracted from the image in picture.children_ids |
| enabled_verbalization | Suited for an application that uses image descriptions to implement image search. | ✓ | • Link to image location • Textual description of the image | • Image in the pictures structure • Textual description of the image in picture.verbalization only if the image was verbalized in the original document • List of token IDs that represent the textual description of the image |
| enabled_verbalization_all | Suited for an application that uses image descriptions to implement image search. | ✓ | • Link to image location • Textual description of the image | • Image in the pictures structure • Textual description of the image in picture.verbalization only if the image was verbalized in the original document • List of token IDs that represent the textual description of the image |
Specifying how to extract data in key-value pairs with the kvp_mode parameter
You can identify and extract structured information into key-value pairs from unstructured or semi-structured documents such as invoices, forms, contracts, or receipts. The extracted text is stored in a format where each piece of data (the value) is associated with a unique identifier (the key). Key-value pair data is extracted by using a general-purpose foundation model or a model that is tuned for specific document formats.
The following restrictions apply when you use the key-value pair extraction capability:
- Key-value pair data extraction is only supported for English language documents.
- The result of the key-value pair extraction is only available in the assembly output format. Key-value pairs are not extracted in the html, markdown, or plain_text output formats.
For details about various methods to use key-value pair extraction to process structured data in your documents, see Key-value pair extraction modes.
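The two restrictions listed above lend themselves to a check that runs before the request is submitted. The following sketch makes several assumptions that are not confirmed by this topic: the presence of a kvp_mode key means that key-value pair extraction is requested, an omitted languages list means English, and the en_hw code counts as English. The function check_kvp_request and the kvp_mode value "example_mode" are hypothetical placeholders.

```python
# ASSUMPTION: both the printed and handwritten English codes count as English.
ENGLISH_CODES = {"en", "en_hw"}


def check_kvp_request(parameters):
    """Return a list of problems with a key-value pair extraction request.

    ASSUMPTION: a `kvp_mode` key means key-value pair extraction is
    requested. Valid `kvp_mode` values are not listed in this topic.
    """
    problems = []
    if "kvp_mode" not in parameters:
        return problems
    if "assembly" not in parameters.get("requested_outputs", []):
        problems.append("Key-value pairs are only returned in the assembly output.")
    # ASSUMPTION: an omitted languages list means an English document.
    languages = parameters.get("languages", ["en"])
    if any(code not in ENGLISH_CODES for code in languages):
        problems.append("Key-value pair extraction supports English documents only.")
    return problems


# "example_mode" is a placeholder, not a documented kvp_mode value.
request = {
    "kvp_mode": "example_mode",
    "requested_outputs": ["md"],
    "languages": ["de"],
}
for problem in check_kvp_request(request):
    print(problem)
```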