Common text processing parameters

When you submit a text processing request by using the watsonx.ai REST API, you include a payload that specifies configuration details for the text processing operation.

You can use multiple text processing APIs to understand and convert your documents into a simpler textual format that can be used in a RAG solution. You can use the text classification API to determine whether your document matches the structured data format of certain common document types. Based on the classification result, you can then customize your text extraction request to extract text and other structured content from your document more efficiently.

Make choices about the following settings that are common to both the text classification and extraction REST API requests:

Language of the input document
Process text from images in the input document
Process key-value pairs in the input document

In addition to the common REST API parameters, you can set parameters that are specific to the various text processing API methods in the document understanding library. For details, see the following topics:

Specifying the languages of the input document with the `languages` parameter

If your document is in a language other than English, you must specify the language by its ISO 639 language code in the languages parameter of your API request.

"parameters": {
  "languages": [
    "de"
  ]
}

If the document has a mix of languages, list each language separately. The language code you specify differs based on whether your document contains machine-printed text or handwriting. If your document has both printed and handwritten text in a specific language, you must specify both types of language codes in the list of languages.
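As a minimal sketch of the rule above, the following Python snippet assembles the `languages` list for a document that mixes printed and handwritten text. The helper name `build_languages_parameter` is hypothetical and not part of any watsonx.ai SDK. It only builds the JSON fragment that goes into the request body, using the `en_hw` English handwriting code that is described in this topic.

```python
import json


def build_languages_parameter(printed_codes, handwritten_codes=()):
    """Return a `parameters` fragment that lists each language code once.

    Hypothetical helper for illustration only: it preserves the order in
    which codes are given and drops duplicates.
    """
    codes = []
    for code in list(printed_codes) + list(handwritten_codes):
        if code not in codes:
            codes.append(code)
    return {"parameters": {"languages": codes}}


# A document with printed English and German text plus English handwriting.
payload = build_languages_parameter(["en", "de"], ["en_hw"])
print(json.dumps(payload, indent=2))
```

The resulting fragment has the same shape as the `languages` example shown earlier, with one entry per language and text type.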

Language restrictions with the text processing APIs

Text classification: You can use the classification API with English language documents only.
Text extraction: You cannot use the extraction API with a mixed-language document when the languages do not share a common script. However, you can use documents with a mix of English and one other language in any script. For example, you can extract text from images in a document with a mix of English and French text because both languages are Latin-based, but you cannot extract text from images in a document with a mix of Japanese and French text. You can use the text extraction API to extract key-value pair data from English language documents only.
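The mixed-language restriction can be approximated with a client-side check before you submit a request. The following Python sketch is one possible reading of the rule, not the service's own validation logic: the function name `can_extract_mixed` is hypothetical, and the script mapping is a small subset copied from the Script column of the machine-printed languages table in this topic.

```python
# Script per language code, copied from a few rows of the
# machine-printed languages table. Extend as needed.
SCRIPT_BY_CODE = {
    "en": "Latin",
    "fr": "Latin",
    "de": "Latin",
    "ru": "Cyrillic",
    "hi": "Devanagari",
    "ja": "Japanese",
}


def can_extract_mixed(codes):
    """Return True if the language mix looks acceptable for text extraction.

    Assumed interpretation: the mix is allowed when all languages share one
    script, or when it is English plus exactly one other language.
    Raises KeyError for codes that are missing from SCRIPT_BY_CODE.
    """
    unique = list(dict.fromkeys(codes))  # drop duplicates, keep order
    scripts = {SCRIPT_BY_CODE[code] for code in unique}
    if len(scripts) <= 1:
        return True
    return len(unique) == 2 and "en" in unique


print(can_extract_mixed(["en", "fr"]))  # English and French share the Latin script
print(can_extract_mixed(["ja", "fr"]))  # Japanese and French do not share a script
```

A check like this only catches obvious mismatches early. The API response remains the authority on whether a document can be processed.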

Supported handwritten languages

If your document contains text in English handwriting, use the en_hw language code in your API request body.

Supported machine-printed languages

The following table provides details about the languages supported by the text extraction API for printed text recognition:

Note: If your document language does not have an ISO 639 language code listed, use the API script code.

Machine-printed languages supported in the text extraction API
Language	ISO 639 language code	API script code	Script
Acehnese	‐	`latn`	Latin
Afrikaans	`af`	`latn`	Latin
Albanian	`sq`	`latn`	Latin
Araucanian/Mapuche	‐	`latn`	Latin
Awadhi	‐	`deva`	Devanagari
Aymara	`ay`	`latn`	Latin
Balinese	‐	`latn`	Latin
Baso Minangkabau	‐	`latn`	Latin
Basque	`eu`	`latn`	Latin
Belarusian	`be`	`cyrl`	Cyrillic
Bemba	‐	`latn`	Latin
Bikol	‐	`latn`	Latin
Bislama	`bi`	`latn`	Latin
Bhojpuri	‐	`deva`	Devanagari
Bulgarian	`bg`	`cyrl`	Cyrillic
Catalan	`ca`	`latn`	Latin
Cebuano	‐	`latn`	Latin
Chechen	‐	`cyrl`	Cyrillic
Chinese (Simplified)	`zh_cn`	`cjk`	Han (Simplified)
Chinese (Traditional)	`zh_tw`	`cjk`	Han (Traditional)
Choctaw	‐	`latn`	Latin
Cree	`cr`	`latn`	Latin
Dakota	‐	`latn`	Latin
Danish	`da`	`latn`	Latin
Dogri	‐	`deva`	Devanagari
Dutch	`nl`	`latn`	Latin
English	`en`	`latn`	Latin
Estonian	`et`	`latn`	Latin
Fijian	`fj`	`latn`	Latin
Filipino	`fil`	`latn`	Latin
Finnish	`fi`	`latn`	Latin
French	`fr`	`latn`	Latin
Galician	`gl`	`latn`	Latin
Gayo	‐	`latn`	Latin
German	`de`	`latn`	Latin
Gilbertese	‐	`latn`	Latin
Greek	`el`	`el`	Greek
Haitian Creole	`ht`	`latn`	Latin
Hebrew	`he`	`he`	Hebrew
Hiligaynon	‐	`latn`	Latin
Hindi	`hi`	`deva`	Devanagari
Iban	‐	`latn`	Latin
Iloko	‐	`latn`	Latin
Indonesian	`id`	`latn`	Latin
Irish	`ga`	`latn`	Latin
Italian	`it`	`it`	Latin
Japanese	`ja`	`cjk`	Japanese
Javanese	`jv`	`latn`	Latin
Kachin	‐	`latn`	Latin
Kalaallisut	`kl`	`latn`	Latin
Kanienʼkéha	‐	`latn`	Latin
Khasi	‐	`latn`	Latin
Kinyarwanda	`rw`	`latn`	Latin
Konkani	‐	`deva`	Devanagari
Kongo	`kg`	`latn`	Latin
Korean	`ko`	`cjk`	Korean
Kosraean	‐	`latn`	Latin
Kuanyama	`kj`	`latn`	Latin
Latin	`la`	`latn`	Latin
Lozi	‐	`latn`	Latin
Low German	‐	`latn`	Latin
Luo	‐	`latn`	Latin
Malagasy	`mg`	`latn`	Latin
Maithili	‐	`deva`	Devanagari
Manx	`gv`	`latn`	Latin
Marathi	`mr`	`deva`	Devanagari
Middle English	‐	`latn`	Latin
Mittelhochdeutsch	‐	`latn`	Latin
Macedonian	`mk`	`cyrl`	Cyrillic
Ndonga	`ng`	`latn`	Latin
Nepali	`ne`	`deva`	Devanagari
North Ndebele	`nd`	`latn`	Latin
Norwegian	`no`	`no`	Latin
Nyankole	‐	`latn`	Latin
Occitan	`oc`	`latn`	Latin
Ojibwa	`oj`	`latn`	Latin
Old English	‐	`latn`	Latin
Old French	‐	`latn`	Latin
Old High German	‐	`latn`	Latin
Old Norse	‐	`latn`	Latin
Old Provençal	‐	`latn`	Latin
Pampanga	‐	`latn`	Latin
Pangasinan	‐	`latn`	Latin
Papiamento	‐	`latn`	Latin
Polish	`pl`	`latn`	Latin
Portuguese	`pt`	`pt`	Latin
Quechua	`qu`	`latn`	Latin
Romansh	`rm`	`latn`	Latin
Rundi	`rn`	`latn`	Latin
Russian	`ru`	`cyrl`	Cyrillic
Sango	`sg`	`latn`	Latin
Sanskrit	`sa`	`deva`	Devanagari
Scots	‐	`latn`	Latin
Serbian	`sr`	`cyrl`	Cyrillic
Shona	`sn`	`latn`	Latin
Spanish	`es`	`es`	Latin
Sundanese	`su`	`latn`	Latin
Swahili	`sw`	`latn`	Latin
Swati	`ss`	`latn`	Latin
Swedish	`sv`	`sv`	Latin
Tamil	`ta`	`deva`	Tamil
Telugu	`te`	`deva`	Telugu
Tsonga	`ts`	`latn`	Latin
Tswana	`tn`	`latn`	Latin
Ukrainian	`uk`	`cyrl`	Cyrillic
Uzbek	`uz`	`cyrl`	Cyrillic
Xhosa	`xh`	`latn`	Latin
Zulu	`zu`	`latn`	Latin
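The note about falling back to the API script code can be expressed as a small lookup. The following Python sketch assumes a hypothetical `LANGUAGE_ROWS` mapping that holds a handful of rows copied from the table above, with `None` standing in for a missing ISO 639 code. The function name `code_for` is also an illustration, not part of the API.

```python
# (ISO 639 language code, API script code) for a few rows of the table.
# None marks a language that has no ISO 639 code listed.
LANGUAGE_ROWS = {
    "Acehnese": (None, "latn"),
    "Afrikaans": ("af", "latn"),
    "Awadhi": (None, "deva"),
    "Chechen": (None, "cyrl"),
    "German": ("de", "latn"),
    "Hindi": ("hi", "deva"),
}


def code_for(language):
    """Return the ISO 639 code if one is listed, else the API script code."""
    iso_code, script_code = LANGUAGE_ROWS[language]
    return iso_code if iso_code is not None else script_code


# German has an ISO 639 code; Awadhi falls back to its script code.
languages = [code_for("German"), code_for("Awadhi")]
print({"parameters": {"languages": languages}})
```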

Extracting text from images with the `ocr_mode` parameter

You can specify how to process text in images in your document by using optical character recognition (OCR). Specify the following parameter in the API request body:

"parameters": {
  "ocr_mode": "enabled"
}

The following table provides details about the different OCR modes you can use to specify how to process images in your API request:

OCR modes in the text extraction API
OCR mode	Description
`disabled`	Image files and scanned documents are not processed. For hybrid documents that contain both visuals and text, only text is extracted.
`enabled`	OCR runs only if no text can be extracted from the document. Images embedded in documents are processed.
`forced`	Each page of the document is converted to an image and processed with OCR. Every document type, including text-only files, is converted to an image before it is processed.
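Because `ocr_mode` accepts only the three values in the table, it can help to validate the value before you send a request. The following Python sketch shows one way to do that. The helper name `ocr_parameters` is hypothetical, and the check runs locally and does not call the service.

```python
# The OCR modes listed in the table above.
OCR_MODES = ("disabled", "enabled", "forced")


def ocr_parameters(mode):
    """Return a `parameters` fragment for the given OCR mode.

    Raises ValueError for any value that is not listed in the table.
    """
    if mode not in OCR_MODES:
        raise ValueError(
            f"Unsupported ocr_mode {mode!r}; expected one of {OCR_MODES}"
        )
    return {"parameters": {"ocr_mode": mode}}


# Force OCR on every page, for example for scans with an unreliable text layer.
print(ocr_parameters("forced"))
```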

Configuring the key-value pair processing pipeline with the `semantic_config` parameter

You can identify and extract structured information into key-value pairs from unstructured or semi-structured documents such as invoices, forms, contracts, or receipts. The processed text is in a format where each piece of data (the value) is associated with a unique identifier (the key). Key-value pair data is processed by using a general-purpose foundation model or a model that is tuned for specific document formats.

Restriction:

Key-value pair classification and extraction is supported only for English language documents. The foundation models you can use for key-value pair processing vary by data center. For details, see Regional availability of foundation models.

You can use the text classification API to quickly check whether a document can be classified into one of several pre-defined schemas for common document types without performing the key-value pair extraction. If the document does not match a pre-defined type, you can then define a new document type and custom schema before you run a text extraction API request.

Use various semantic_config parameters in the REST API request body to configure the following key-value pair processing pipeline capabilities:

Schema definitions
Interaction between pre-defined and custom schemas

You can also set fields in the semantic_config parameter that are specific to the text extraction API method. For details, see Generic and semantic key-value pair extraction mode.

Defining schemas with the `schemas` field

Based on the layout, documents can be broadly classified into the following types:

Variable layout documents: Documents without a consistent structure, such as invoices, purchase orders, or passports, where the structure spans multiple pages.
Fixed layout documents: Structured documents where each page follows a pre-defined format such as a tax form where each page has a specific layout.

Pre-defined schemas

You can classify or extract text from your files into pre-defined schemas for the following supported common document types:

Custom schemas

If your documents contain unique structured content, you can provide a custom schema that defines specific data and unique identifiers. When you specify a custom schema, the text extraction process does not classify the document into one of the pre-defined schemas and uses only the schema you provide in the schemas field of the semantic_config parameter.

For details about how to define parameters in a custom schema, see Creating custom schemas for key-value pair extraction.

The following example provides a custom schema for a receipt in the REST API request body:

"semantic_config": {
  "schemas": [ {
      "document_type": "Receipt",
      "document_description": "A receipt issued for a purchase at ABC store.",
      "fields": {
        "receipt_number": {
          "default": "",
          "example": "R-20241027-ABC",
          "description": "Unique identifier on the receipt."
        },
        "customer_name": {
          "default": "",
          "example": "John Smith",
          "description": "Full name of the customer or payee."
        },
        "date_of_transaction": {
          "default": "",
          "example": "2023-01-01",
          "description": "Date when the purchase or payment occurred."
        },
        "total_paid": {
          "default": "",
          "example": "8.64",
          "description": "Final amount paid by the customer."
        },
        "payment_method": {
          "default": "",
          "example": "Credit Card",
          "description": "How payment was made, such as cash, card, or check."
        }
      }
  } ]
}
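When a schema has many fields, generating the `schemas` entry programmatically can reduce copy-and-paste mistakes such as a stray trailing comma. The following Python sketch builds part of the receipt schema shown above. The helper names `schema_field` and `custom_schema` are hypothetical. Only the keys that appear in the example (`document_type`, `document_description`, `fields`, `default`, `example`, `description`) are used.

```python
import json


def schema_field(description, example, default=""):
    """Return one field definition in the shape used by the receipt example."""
    return {"default": default, "example": example, "description": description}


def custom_schema(document_type, document_description, fields):
    """Return one entry for the `schemas` list of `semantic_config`."""
    return {
        "document_type": document_type,
        "document_description": document_description,
        "fields": dict(fields),
    }


receipt = custom_schema(
    "Receipt",
    "A receipt issued for a purchase at ABC store.",
    {
        "receipt_number": schema_field(
            "Unique identifier on the receipt.", "R-20241027-ABC"
        ),
        "total_paid": schema_field("Final amount paid by the customer.", "8.64"),
    },
)

# json.dumps always emits valid JSON, so the fragment can be pasted as-is.
print(json.dumps({"semantic_config": {"schemas": [receipt]}}, indent=2))
```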

Controlling how pre-defined and custom schemas interact with the `schemas_merge_strategy` field

You can define how custom schemas you create interact with the pre-defined schemas supported by the text processing API.

The following table provides details about the different ways pre-defined and custom schemas are processed when you configure the schemas_merge_strategy setting in the semantic_config parameter:

Strategies for processing schemas during key-value pair extraction
Schema strategy setting	Description
`replace`	Discard all pre-defined schemas and use only the custom schema.
`merge`	Custom schemas are merged with and override any pre-defined schemas that share the same `document_type` attribute in the schema definition.

By default, if you create a custom schema for a document type that matches a supported pre-defined schema, both schemas are merged before being used to process key-value pair data in your document.
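The following Python sketch shows how the `schemas_merge_strategy` field might be added to a `semantic_config` fragment alongside a custom schema. The helper name `semantic_config_fragment` is hypothetical. Omitting the strategy leaves the field out of the request, which relies on the default merge behavior described above.

```python
# The strategy values listed in the table above.
MERGE_STRATEGIES = ("replace", "merge")


def semantic_config_fragment(schemas, strategy=None):
    """Return a `semantic_config` fragment with an optional merge strategy.

    When `strategy` is None, `schemas_merge_strategy` is omitted so that
    the service default applies.
    """
    config = {"schemas": list(schemas)}
    if strategy is not None:
        if strategy not in MERGE_STRATEGIES:
            raise ValueError(
                f"Unsupported strategy {strategy!r}; expected one of {MERGE_STRATEGIES}"
            )
        config["schemas_merge_strategy"] = strategy
    return {"semantic_config": config}


# A minimal custom schema; the field contents are placeholders.
receipt_schema = {
    "document_type": "Receipt",
    "document_description": "A receipt issued for a purchase at ABC store.",
    "fields": {},
}

# Use only the custom schema and discard the pre-defined ones.
print(semantic_config_fragment([receipt_schema], strategy="replace"))
```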
