Common text processing parameters

When you submit a text processing request by using the watsonx.ai REST API, you include a payload that specifies configuration details for the text processing operation.

You can use multiple text processing APIs to understand and convert your documents into a simpler textual format that can be used in a RAG solution. You can use the text classification API to determine whether your document matches the structured data format of certain common document types. Based on the classification result, you can then customize your text extraction request to extract text and other structured content from your document more efficiently.

This topic describes the settings that are common to both the text classification and extraction REST API requests: the languages, ocr_mode, and semantic_config parameters.

In addition to the common REST API parameters, you can set parameters that are specific to the various text processing API methods in the document understanding library.

Specifying the languages of the input document with the languages parameter

If your document is in a language other than English, you must specify the language by its ISO 639 language code in the languages parameter of your API request.

"parameters": {
  "languages": [
    "de"
  ]
}

If the document has a mix of languages, list each language separately. The language code you specify differs based on whether your document contains machine-printed text or handwriting. If your document has both printed and handwritten text in a specific language, you must specify both types of language codes in the list of languages.

Language restrictions with the text processing APIs

Text classification

You can use the classification API with English language documents only.

Text extraction

You cannot use the extraction API with a mixed-language document when the languages do not share a common script. However, you can use documents with a mix of English and one other language in any script. For example, you can extract text from images in a document with a mix of English and French text because both languages are Latin-based. However, you cannot extract text from images in a document with a mix of Japanese and French text.
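The mixed-language rule can be checked on the client before you submit a request. The following Python sketch is illustrative only: the can_extract_mixed function and the SCRIPT_BY_LANGUAGE lookup are hypothetical names, not part of the API, and the lookup copies only a few rows from the supported-languages table in this topic.

```python
# Script for each language code, copied from a few rows of the
# supported-languages table. Extend the lookup for your own documents.
SCRIPT_BY_LANGUAGE = {
    "en": "Latin",
    "fr": "Latin",
    "de": "Latin",
    "ru": "Cyrillic",
    "hi": "Devanagari",
    "ja": "Japanese",
}


def can_extract_mixed(languages):
    """Return True if the language mix satisfies the restriction above."""
    codes = set(languages)
    scripts = {SCRIPT_BY_LANGUAGE[code] for code in codes}
    if len(scripts) <= 1:
        return True  # every language shares one script
    # English plus exactly one other language is allowed in any script.
    return "en" in codes and len(codes) == 2


print(can_extract_mixed(["en", "fr"]))  # English and French share a script
print(can_extract_mixed(["ja", "fr"]))  # Japanese and French do not
```

The check is a convenience for catching unsupported combinations early; the service remains the authority on what it accepts.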

You can use the text extraction API to extract key-value pair data from English language documents only.

Supported handwritten languages

If your document contains text in English handwriting, use the en_hw language code in your API request body.
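As a minimal sketch, the following Python helper assembles the languages list for a document that mixes printed text with English handwriting. The build_language_parameters function is a hypothetical helper, not part of the API; only the languages parameter and the en_hw code come from this topic.

```python
import json


def build_language_parameters(printed_codes, handwritten_english=False):
    """Build the "parameters" fragment with one entry per language code."""
    # Remove duplicates while keeping the order in which codes were given.
    languages = list(dict.fromkeys(printed_codes))
    if handwritten_english and "en_hw" not in languages:
        languages.append("en_hw")  # handwriting code listed alongside "en"
    return {"parameters": {"languages": languages}}


payload = build_language_parameters(["en", "de"], handwritten_english=True)
print(json.dumps(payload, indent=2))
```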

Supported machine-printed languages

The following table provides details about the languages supported by the text extraction API for printed text recognition:

Note: If your document language does not have an ISO 639 language code listed, use the API script code.
Machine-printed languages supported in the text extraction API
Language | ISO 639 language code | API script code | Script
Acehnese | | latn | Latin
Afrikaans | af | latn | Latin
Albanian | sq | latn | Latin
Araucanian/Mapuche | | latn | Latin
Awadhi | | deva | Devanagari
Aymara | ay | latn | Latin
Balinese | | latn | Latin
Baso Minangkabau | | latn | Latin
Basque | eu | latn | Latin
Belarusian | be | cyrl | Cyrillic
Bemba | | latn | Latin
Bikol | | latn | Latin
Bislama | bi | latn | Latin
Bhojpuri | | deva | Devanagari
Bulgarian | bg | cyrl | Cyrillic
Catalan | ca | latn | Latin
Cebuano | | latn | Latin
Chechen | | cyrl | Cyrillic
Chinese (Simplified) | zh_cn | cjk | Han (Simplified)
Chinese (Traditional) | zh_tw | cjk | Han (Traditional)
Choctaw | | latn | Latin
Cree | cr | latn | Latin
Dakota | | latn | Latin
Danish | da | latn | Latin
Dogri | | deva | Devanagari
Dutch | nl | latn | Latin
English | en | latn | Latin
Estonian | et | latn | Latin
Fijian | fj | latn | Latin
Filipino | fil | latn | Latin
Finnish | fi | latn | Latin
French | fr | latn | Latin
Galician | gl | latn | Latin
Gayo | | latn | Latin
German | de | latn | Latin
Gilbertese | | latn | Latin
Greek | el | el | Greek
Haitian Creole | ht | latn | Latin
Hebrew | he | he | Hebrew
Hiligaynon | | latn | Latin
Hindi | hi | deva | Devanagari
Iban | | latn | Latin
Iloko | | latn | Latin
Indonesian | id | latn | Latin
Irish | ga | latn | Latin
Italian | it | it | Latin
Japanese | ja | cjk | Japanese
Javanese | jv | latn | Latin
Kachin | | latn | Latin
Kalaallisut | kl | latn | Latin
Kanienʼkéha | | latn | Latin
Khasi | | latn | Latin
Kinyarwanda | rw | latn | Latin
Konkani | | deva | Devanagari
Kongo | kg | latn | Latin
Korean | ko | cjk | Korean
Kosraean | | latn | Latin
Kuanyama | kj | latn | Latin
Latin | la | latn | Latin
Lozi | | latn | Latin
Low German | | latn | Latin
Luo | | latn | Latin
Malagasy | mg | latn | Latin
Maithili | | deva | Devanagari
Manx | gv | latn | Latin
Marathi | mr | deva | Devanagari
Middle English | | latn | Latin
Mittelhochdeutsch | | latn | Latin
Macedonian | mk | cyrl | Cyrillic
Ndonga | ng | latn | Latin
Nepali | ne | deva | Devanagari
North Ndebele | nd | latn | Latin
Norwegian | no | no | Latin
Nyankole | | latn | Latin
Occitan | oc | latn | Latin
Ojibwa | oj | latn | Latin
Old English | | latn | Latin
Old French | | latn | Latin
Old High German | | latn | Latin
Old Norse | | latn | Latin
Old Provençal | | latn | Latin
Pampanga | | latn | Latin
Pangasinan | | latn | Latin
Papiamento | | latn | Latin
Polish | pl | latn | Latin
Portuguese | pt | pt | Latin
Quechua | qu | latn | Latin
Romansh | rm | latn | Latin
Rundi | rn | latn | Latin
Russian | ru | cyrl | Cyrillic
Sango | sg | latn | Latin
Sanskrit | sa | deva | Devanagari
Scots | | latn | Latin
Serbian | sr | cyrl | Cyrillic
Shona | sn | latn | Latin
Spanish | es | es | Latin
Sundanese | su | latn | Latin
Swahili | sw | latn | Latin
Swati | ss | latn | Latin
Swedish | sv | sv | Latin
Tamil | ta | deva | Tamil
Telugu | te | deva | Telugu
Tsonga | ts | latn | Latin
Tswana | tn | latn | Latin
Ukrainian | uk | cyrl | Cyrillic
Uzbek | uz | cyrl | Cyrillic
Xhosa | xh | latn | Latin
Zulu | zu | latn | Latin
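The note about falling back to the API script code can be expressed as a small lookup. The following Python sketch is an illustration under stated assumptions: LANGUAGE_CODES and language_code_for are hypothetical names, and the lookup holds only four rows copied from the table.

```python
# (ISO 639 language code or None, API script code) for a few table rows.
LANGUAGE_CODES = {
    "German": ("de", "latn"),
    "Russian": ("ru", "cyrl"),
    "Cebuano": (None, "latn"),  # no ISO 639 code listed in the table
    "Awadhi": (None, "deva"),   # no ISO 639 code listed in the table
}


def language_code_for(language):
    """Return the ISO 639 code if one is listed, otherwise the script code."""
    iso_code, script_code = LANGUAGE_CODES[language]
    return iso_code if iso_code is not None else script_code


# Codes to place in the languages parameter of the request.
languages = [language_code_for(name) for name in ("German", "Cebuano")]
print(languages)
```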

Extracting text from images with the ocr_mode parameter

You can specify how to process text in images in your document by using optical character recognition (OCR). Specify the following parameter in the API request body:

"parameters": {
  "ocr_mode": "enabled"
}

The following table provides details about the different OCR modes you can use to specify how to process images in your API request:

OCR modes in the text extraction API
OCR mode | Description
disabled | Image files and scanned documents are not processed. For hybrid documents that contain both visuals and text, only text is extracted.
enabled | OCR runs only if no text could be extracted from the document. Images embedded in documents are processed.
forced | Each page of the document is converted to an image and processed with OCR. Every document type, including text-only files, is converted to images before it is processed.
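Because ocr_mode accepts only the three values in the table, a client can reject a misspelled mode before it sends the request. The following Python sketch shows one way to do that; build_ocr_parameters is a hypothetical helper, not part of the API.

```python
import json

# The three modes documented in the OCR modes table.
OCR_MODES = ("disabled", "enabled", "forced")


def build_ocr_parameters(ocr_mode):
    """Build the "parameters" fragment after validating the OCR mode."""
    if ocr_mode not in OCR_MODES:
        raise ValueError(
            f"Unsupported ocr_mode {ocr_mode!r}; expected one of {OCR_MODES}"
        )
    return {"parameters": {"ocr_mode": ocr_mode}}


print(json.dumps(build_ocr_parameters("forced"), indent=2))
```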

Configuring the key-value pair processing pipeline with the semantic_config parameter

You can identify and extract structured information into key-value pairs from unstructured or semi-structured documents such as invoices, forms, contracts, or receipts. The processed text is in a format where each piece of data (the value) is associated with a unique identifier (the key). Key-value pair data is processed by using a general-purpose foundation model or a model that is tuned for specific document formats.

Restriction:

Key-value pair classification and extraction is only supported for English language documents. The foundation models you can use for key-value pair processing vary by data center. For details, see Regional availability of foundation models.

You can use the text classification API to quickly check whether a document can be classified into one of several pre-defined schemas for common document types without performing the key-value pair extraction. If the document does not match a pre-defined type, you can then define a new document type and custom schema before you run a text extraction API request.

Use various semantic_config parameters in the REST API request body to configure the following key-value pair processing pipeline capabilities:

You can also set fields in the semantic_config parameter that are specific to the text extraction API method. For details, see Generic and semantic key-value pair extraction mode.

Defining schemas with the schemas field

Based on the layout, documents can be broadly classified into the following types:

Variable layout documents
Documents without a consistent structure, such as invoices, purchase orders, or passports, where the structure spans multiple pages.
Fixed layout documents
Structured documents where each page follows a pre-defined format such as a tax form where each page has a specific layout.

Pre-defined schemas

You can classify or extract text from your files into pre-defined schemas for the following supported common document types:

Custom schemas

If your documents contain unique structured content, you can provide a custom schema that defines specific data and unique identifiers. When you specify a custom schema, the text extraction process skips classifying the document into one of the pre-defined schemas and uses only the schema that you provide in the schemas field of the semantic_config parameter.

For details about how to define parameters in a custom schema, see Creating custom schemas for key-value pair extraction.

The following example provides a custom schema for a receipt in the REST API request body:

"semantic_config": {
  "schemas": [ {
      "document_type": "Receipt",
      "document_description": "A receipt issued for a purchase at ABC store.",
      "fields": {
        "receipt_number": {
          "default": "",
          "example": "R-20241027-ABC",
          "description": "Unique identifier on the receipt."
        },
        "customer_name": {
          "default": "",
          "example": "John Smith",
          "description": "Full name of the customer or payee."
        },
        "date_of_transaction": {
          "default": "",
          "example": "2023-01-01",
          "description": "Date when the purchase or payment occurred."
        },
        "total_paid": {
          "default": "",
          "example": "8.64",
          "description": "Final amount paid by the customer."
        },
        "payment_method": {
          "default": "",
          "example": "Credit Card",
          "description": "How payment was made, such as cash, card, or check."
        }
      }
  } ]
}
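Building the schema in code and serializing it avoids hand-editing mistakes such as trailing commas. The following Python sketch is one possible approach: make_field and build_semantic_config are hypothetical helpers, and the field attributes (default, example, description) simply mirror the receipt example rather than a confirmed list of required attributes.

```python
import json


def make_field(description, example, default=""):
    """Return one field definition in the shape used by the receipt example."""
    return {"default": default, "example": example, "description": description}


def build_semantic_config(document_type, document_description, fields):
    """Wrap a single custom schema in a semantic_config fragment."""
    schema = {
        "document_type": document_type,
        "document_description": document_description,
        "fields": fields,
    }
    return {"semantic_config": {"schemas": [schema]}}


config = build_semantic_config(
    "Receipt",
    "A receipt issued for a purchase at ABC store.",
    {
        "receipt_number": make_field(
            "Unique identifier on the receipt.", "R-20241027-ABC"
        ),
        "total_paid": make_field("Final amount paid by the customer.", "8.64"),
    },
)

# json.dumps always emits syntactically valid JSON for this structure.
body = json.dumps(config, indent=2)
print(body)
```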

Controlling how pre-defined and custom schemas interact with the schemas_merge_strategy field

You can define how custom schemas you create interact with the pre-defined schemas supported by the text processing API.

The following table provides details about the different ways pre-defined and custom schemas are processed when you configure the schemas_merge_strategy setting in the semantic_config parameter:

Strategies for processing schemas during key-value pair extraction
Schema strategy setting | Description
replace | Discard all pre-defined schemas and use only the custom schema.
merge | Custom schemas are merged with and override any pre-defined schemas that share the same document_type attribute in the schema definition.

By default, if you create a custom schema for a document type that matches a supported pre-defined schema, both schemas are merged before being used to process key-value pair data in your document.
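The difference between the two strategies can be pictured with a small local model. The following Python sketch is a conceptual illustration only, not the service's implementation: effective_schemas is a hypothetical function, the Invoice and Receipt entries are placeholder schemas, and the sketch assumes that an override applies to the whole schema for a matching document_type.

```python
def effective_schemas(predefined, custom, strategy="merge"):
    """Model which schemas are in effect for a given merge strategy."""
    if strategy == "replace":
        return list(custom)  # pre-defined schemas are discarded
    if strategy != "merge":
        raise ValueError(f"Unknown schemas_merge_strategy: {strategy!r}")
    by_type = {schema["document_type"]: schema for schema in predefined}
    for schema in custom:
        # A custom schema overrides a pre-defined one with the same type.
        by_type[schema["document_type"]] = schema
    return list(by_type.values())


# Placeholder schemas for illustration.
predefined = [
    {"document_type": "Invoice", "fields": {"invoice_number": {}}},
    {"document_type": "Receipt", "fields": {"store_name": {}}},
]
custom = [{"document_type": "Receipt", "fields": {"receipt_number": {}}}]

merged = effective_schemas(predefined, custom)  # default strategy is merge
replaced = effective_schemas(predefined, custom, strategy="replace")
print([schema["document_type"] for schema in merged])
print([schema["document_type"] for schema in replaced])
```

In this model, merge keeps the Invoice schema and swaps in the custom Receipt schema, while replace leaves only the custom Receipt schema.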