Common text processing parameters
When you submit a text processing request by using the watsonx.ai REST API, you include a payload that specifies configuration details for the text processing operation.
You can use multiple text processing APIs to understand and convert your documents into a simpler textual format that can be used in a RAG solution. You can use the text classification API to determine whether your document matches the structured data format of certain common document types. Based on the classification result, you can then customize your text extraction request to extract text and other structured content from your document more efficiently.
Configure the following settings, which are common to both the text classification and text extraction REST API requests:
- Language of the input document
- Process text from images in the input document
- Process key-value pairs in the input document
In addition to the common REST API parameters, you can set parameters that are specific to the various text processing API methods in the document understanding library. For details, see the following topics:
Specifying the languages of the input document with the languages parameter
If your document is in a language other than English, you must specify the language by its ISO 639 language code in the languages parameter of your API request.
"parameters": {
"languages": [
"de"
]
}
If the document has a mix of languages, list each language separately. The language code you specify differs based on whether your document contains machine-printed text or handwriting. If your document has both printed and handwritten text in a specific language, you must specify both types of language codes in the list of languages.
Language restrictions with the text processing APIs
- Text classification
  - You can use the classification API with English language documents only.
- Text extraction
  - You cannot use the extraction API with a mixed-language document when the languages do not share a common script. However, you can use documents with a mix of English and one other language in any script. For example, you can extract text from images in a document with a mix of English and French text because both languages use the Latin script. However, you cannot extract text from images in a document with a mix of Japanese and French text.
  - You can use the text extraction API to extract key-value pair data from English language documents only.
Supported handwritten languages
If your document contains text in English handwriting, use the en_hw language code in your API request body.
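As a minimal sketch of how the `languages` list might be assembled for a document that mixes printed and handwritten text, the following Python helper combines printed-text codes with the `en_hw` handwriting code. The helper name `build_languages` and the `HANDWRITING_CODES` lookup are hypothetical; only `en_hw` is documented in this topic, so the helper rejects handwriting in any other language.

```python
import json

# Handwriting codes named in this topic. Only English ("en_hw") is listed,
# so this lookup is deliberately limited to that one entry.
HANDWRITING_CODES = {"en": "en_hw"}


def build_languages(printed, handwritten=()):
    """Return the value for the `languages` parameter.

    printed: ISO 639 codes for machine-printed text, for example ["en", "de"].
    handwritten: languages that also appear as handwriting, for example ["en"].
    """
    codes = list(printed)
    for language in handwritten:
        if language not in HANDWRITING_CODES:
            raise ValueError(f"No handwriting code documented for {language!r}")
        codes.append(HANDWRITING_CODES[language])
    # Drop duplicates while keeping the original order.
    return list(dict.fromkeys(codes))


# Fragment of a request body for a German and English document
# that also contains English handwriting.
payload_fragment = {
    "parameters": {"languages": build_languages(["en", "de"], handwritten=["en"])}
}
print(json.dumps(payload_fragment, indent=2))
```

The printed fragment can be merged into the rest of the request body before it is sent.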
Supported machine-printed languages
The following table provides details about the languages supported by the text extraction API for printed text recognition:
| Language | ISO 639 language code | API script code | Script |
|---|---|---|---|
| Acehnese | ‐ | latn | Latin |
| Afrikaans | af | latn | Latin |
| Albanian | sq | latn | Latin |
| Araucanian/Mapuche | ‐ | latn | Latin |
| Awadhi | ‐ | deva | Devanagari |
| Aymara | ay | latn | Latin |
| Balinese | ‐ | latn | Latin |
| Baso Minangkabau | ‐ | latn | Latin |
| Basque | eu | latn | Latin |
| Belarusian | be | cyrl | Cyrillic |
| Bemba | ‐ | latn | Latin |
| Bikol | ‐ | latn | Latin |
| Bislama | bi | latn | Latin |
| Bhojpuri | ‐ | deva | Devanagari |
| Bulgarian | bg | cyrl | Cyrillic |
| Catalan | ca | latn | Latin |
| Cebuano | ‐ | latn | Latin |
| Chechen | ‐ | cyrl | Cyrillic |
| Chinese (Simplified) | zh_cn | cjk | Han (Simplified) |
| Chinese (Traditional) | zh_tw | cjk | Han (Traditional) |
| Choctaw | ‐ | latn | Latin |
| Cree | cr | latn | Latin |
| Dakota | ‐ | latn | Latin |
| Danish | da | latn | Latin |
| Dogri | ‐ | deva | Devanagari |
| Dutch | nl | latn | Latin |
| English | en | latn | Latin |
| Estonian | et | latn | Latin |
| Fijian | fj | latn | Latin |
| Filipino | fil | latn | Latin |
| Finnish | fi | latn | Latin |
| French | fr | latn | Latin |
| Galician | gl | latn | Latin |
| Gayo | ‐ | latn | Latin |
| German | de | latn | Latin |
| Gilbertese | ‐ | latn | Latin |
| Greek | el | el | Greek |
| Haitian Creole | ht | latn | Latin |
| Hebrew | he | he | Hebrew |
| Hiligaynon | ‐ | latn | Latin |
| Hindi | hi | deva | Devanagari |
| Iban | ‐ | latn | Latin |
| Iloko | ‐ | latn | Latin |
| Indonesian | id | latn | Latin |
| Irish | ga | latn | Latin |
| Italian | it | it | Latin |
| Japanese | ja | cjk | Japanese |
| Javanese | jv | latn | Latin |
| Kachin | ‐ | latn | Latin |
| Kalaallisut | kl | latn | Latin |
| Kanienʼkéha | ‐ | latn | Latin |
| Khasi | ‐ | latn | Latin |
| Kinyarwanda | rw | latn | Latin |
| Konkani | ‐ | deva | Devanagari |
| Kongo | kg | latn | Latin |
| Korean | ko | cjk | Korean |
| Kosraean | ‐ | latn | Latin |
| Kuanyama | kj | latn | Latin |
| Latin | la | latn | Latin |
| Lozi | ‐ | latn | Latin |
| Low German | ‐ | latn | Latin |
| Luo | ‐ | latn | Latin |
| Malagasy | mg | latn | Latin |
| Maithili | ‐ | deva | Devanagari |
| Manx | gv | latn | Latin |
| Marathi | mr | deva | Devanagari |
| Middle English | ‐ | latn | Latin |
| Mittelhochdeutsch | ‐ | latn | Latin |
| Macedonian | mk | cyrl | Cyrillic |
| Ndonga | ng | latn | Latin |
| Nepali | ne | deva | Devanagari |
| North Ndebele | nd | latn | Latin |
| Norwegian | no | no | Latin |
| Nyankole | ‐ | latn | Latin |
| Occitan | oc | latn | Latin |
| Ojibwa | oj | latn | Latin |
| Old English | ‐ | latn | Latin |
| Old French | ‐ | latn | Latin |
| Old High German | ‐ | latn | Latin |
| Old Norse | ‐ | latn | Latin |
| Old Provençal | ‐ | latn | Latin |
| Pampanga | ‐ | latn | Latin |
| Pangasinan | ‐ | latn | Latin |
| Papiamento | ‐ | latn | Latin |
| Polish | pl | latn | Latin |
| Portuguese | pt | pt | Latin |
| Quechua | qu | latn | Latin |
| Romansh | rm | latn | Latin |
| Rundi | rn | latn | Latin |
| Russian | ru | cyrl | Cyrillic |
| Sango | sg | latn | Latin |
| Sanskrit | sa | deva | Devanagari |
| Scots | ‐ | latn | Latin |
| Serbian | sr | cyrl | Cyrillic |
| Shona | sn | latn | Latin |
| Spanish | es | es | Latin |
| Sundanese | su | latn | Latin |
| Swahili | sw | latn | Latin |
| Swati | ss | latn | Latin |
| Swedish | sv | sv | Latin |
| Tamil | ta | deva | Tamil |
| Telugu | te | deva | Telugu |
| Tsonga | ts | latn | Latin |
| Tswana | tn | latn | Latin |
| Ukrainian | uk | cyrl | Cyrillic |
| Uzbek | uz | cyrl | Cyrillic |
| Xhosa | xh | latn | Latin |
| Zulu | zu | latn | Latin |
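The mixed-language restriction described earlier can be checked on the client before a request is submitted. The following Python sketch is one possible pre-check, not the service's own validation logic. It assumes that "common script" refers to the Script column of the table above (the topic does not say whether the service compares that column or the API script code), and it copies only a handful of rows into a hypothetical `SCRIPT_BY_LANGUAGE` lookup.

```python
# Illustrative subset of the table above: ISO 639 code -> Script column.
# Extend this lookup with the rows that matter for your documents.
SCRIPT_BY_LANGUAGE = {
    "en": "Latin",
    "fr": "Latin",
    "de": "Latin",
    "ru": "Cyrillic",
    "hi": "Devanagari",
    "ja": "Japanese",
}


def can_extract_mixed(languages):
    """Return True if the language mix satisfies the documented restriction."""
    codes = set(languages)
    unknown = codes - SCRIPT_BY_LANGUAGE.keys()
    if unknown:
        raise ValueError(f"No script information for: {sorted(unknown)}")
    # English plus one other language is allowed in any script.
    if "en" in codes and len(codes) <= 2:
        return True
    # Otherwise every language must share a single script.
    scripts = {SCRIPT_BY_LANGUAGE[code] for code in codes}
    return len(scripts) == 1


print(can_extract_mixed(["en", "fr"]))  # English and French
print(can_extract_mixed(["ja", "fr"]))  # Japanese and French
```

The check is conservative: a mix of three or more languages passes only when all of them share one script, because the topic describes the exception for English with exactly one other language.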
Extracting text from images with the ocr_mode parameter
You can specify how to process text in images in your document by using optical character recognition (OCR). Specify the following parameter in the API request body:
"parameters": {
"ocr_mode": "enabled"
}
The following table provides details about the different OCR modes you can use to specify how to process images in your API request:
| OCR mode | Description |
|---|---|
| disabled | Image files and scanned documents are not processed. For hybrid documents that contain both visuals and text, only text is extracted. |
| enabled | OCR is only run if no text could be extracted from the document. Images embedded in documents are processed. |
| forced | Each page of the document is converted to an image and processed with OCR. Every document type, including text-only files, is converted to images before it is processed. |
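Because `ocr_mode` accepts only the three values in the table, a client can reject typos before a request is sent. The following Python sketch shows one way to do that; the `OcrMode` enum and the `ocr_parameters` helper are hypothetical names used only for illustration.

```python
from enum import Enum


class OcrMode(str, Enum):
    """The OCR modes listed in the table above."""

    DISABLED = "disabled"
    ENABLED = "enabled"
    FORCED = "forced"


def ocr_parameters(mode):
    """Return the request body fragment for the given OCR mode.

    OcrMode(mode) raises ValueError for any value outside the documented modes.
    """
    return {"parameters": {"ocr_mode": OcrMode(mode).value}}


# A text-only PDF that should still be rasterized and read with OCR.
print(ocr_parameters("forced"))
```

Passing an unsupported string such as "auto" raises `ValueError` locally instead of producing a request that the service would have to reject.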
Configuring the key-value pair processing pipeline with the semantic_config parameter
You can identify and extract structured information into key-value pairs from unstructured or semi-structured documents such as invoices, forms, contracts, or receipts. The processed text is in a format where each piece of data (the value) is associated with a unique identifier (the key). Key-value pair data is processed by using a general-purpose foundation model or a model that is tuned for specific document formats.
Key-value pair classification and extraction is only supported for English language documents. The foundation models you can use for key-value pair processing vary by data center. For details, see Regional availability of foundation models.
You can use the text classification API to quickly check whether a document can be classified into one of several pre-defined schemas for common document types without performing the key-value pair extraction. If the document does not match a pre-defined type, you can then define a new document type and custom schema before you run a text extraction API request.
Use various semantic_config parameters in the REST API request body to configure the following key-value pair processing pipeline capabilities:
You can also set fields in the semantic_config parameter that are specific to the text extraction API method. For details, see Generic and semantic key-value pair extraction mode.
Defining schemas with the schemas field
Based on the layout, documents can be broadly classified into the following types:
- Variable layout documents
  - Documents without a consistent structure, such as invoices, purchase orders, or passports, where the structure spans multiple pages.
- Fixed layout documents
  - Structured documents where each page follows a pre-defined format, such as a tax form where each page has a specific layout.
Pre-defined schemas
You can classify or extract text from your files into pre-defined schemas for the following supported common document types:
- Invoice
- Utility bill
- Mortgage lending document
- Bill of lading
- Customs form
- Delivery receipt
- Expense report
- Receipt
- Purchase order
- Tax form
- Financial statement
- Remittance or Payment Advice
- Bank statement
- Credit card statement
- Driver's license
- Passport
- National ID card
- W-4 form
- I-9 form
- Patient intake form
- Insurance claim
- Transcript
- Diploma or certification
- Life insurance standard disability claim form
- Standard life insurance authorization form
- Association for Cooperative Operations Research and Development (ACORD) standardized insurance form
- Claimant's statement - death claim form
- Business license and permit
Custom schemas
If your document contains unique structured content, you can provide a custom schema that defines specific data and unique identifiers. When you specify a custom schema, the text extraction process skips classifying the document into one of the pre-defined schemas and uses only the schema that you provide in the schemas field of the semantic_config parameter.
For details about how to define parameters in a custom schema, see Creating custom schemas for key-value pair extraction.
The following example provides a custom schema for a receipt in the REST API request body:
"semantic_config": {
"schemas": [ {
"document_type": "Receipt",
"document_description": "A receipt issued for a purchase at ABC store.",
"fields": {
"receipt_number": {
"default": "",
"example": "R-20241027-ABC",
"description": "Unique identifier on the receipt."
},
"customer_name": {
"default": "",
"example": "John Smith",
"description": "Full name of the customer or payee."
},
"date_of_transaction": {
"default": "",
"example": "2023-01-01",
"description": "Date when the purchase or payment occurred."
},
"total_paid": {
"default": "",
"example": "8.64",
"description": "Final amount paid by the customer."
},
"payment_method": {
"default": "",
"example": "Credit Card",
"description": "How payment was made, such as cash, card, or check."
}
}
} ]
}
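A custom schema can also be assembled in code and serialized, which helps avoid syntax slips such as trailing commas that strict JSON parsers reject. The following Python sketch builds a shortened version of the receipt schema; the `field` helper is a hypothetical convenience function, and the field names and values are taken from the example above.

```python
import json


def field(description, example, default=""):
    """Return one field definition with the three keys used in the example."""
    return {"default": default, "example": example, "description": description}


receipt_schema = {
    "document_type": "Receipt",
    "document_description": "A receipt issued for a purchase at ABC store.",
    "fields": {
        "receipt_number": field("Unique identifier on the receipt.", "R-20241027-ABC"),
        "date_of_transaction": field(
            "Date when the purchase or payment occurred.", "2023-01-01"
        ),
        "total_paid": field("Final amount paid by the customer.", "8.64"),
    },
}

# The schemas field holds a list, so several document types can be supplied.
semantic_config = {"semantic_config": {"schemas": [receipt_schema]}}

# json.dumps always emits valid JSON; the round trip confirms that it parses.
body_fragment = json.dumps(semantic_config, indent=2)
assert json.loads(body_fragment) == semantic_config
print(body_fragment)
```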
Controlling how pre-defined and custom schemas interact with the schemas_merge_strategy field
You can define how custom schemas you create interact with the pre-defined schemas supported by the text processing API.
The following table provides details about the different ways pre-defined and custom schemas are processed when you configure the schemas_merge_strategy setting in the semantic_config parameter:
| Schema strategy setting | Description |
|---|---|
| replace | Discard all pre-defined schemas and use only the custom schema. |
| merge | Custom schemas are merged with and override any pre-defined schemas that share the same document_type attribute in the schema definition. |
By default, if you create a custom schema for a document type that matches a supported pre-defined schema, both schemas are merged before being used to process key-value pair data in your document.
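The difference between the two strategies can be illustrated with a small local simulation. The following Python sketch is not the service's implementation: the `resolve_schemas` function is hypothetical, and it assumes that a custom schema overrides a pre-defined schema of the same document_type as a whole, because the topic does not describe whether merging happens field by field. The final fragment assumes that schemas_merge_strategy is set alongside schemas inside semantic_config, as the text above suggests.

```python
def resolve_schemas(predefined, custom, strategy="merge"):
    """Simulate which schemas remain in effect for a given strategy."""
    if strategy == "replace":
        # Pre-defined schemas are discarded entirely.
        return list(custom)
    if strategy != "merge":
        raise ValueError(f"Unknown schemas_merge_strategy: {strategy!r}")
    # Index by document_type so that a custom schema overrides a match.
    by_type = {schema["document_type"]: schema for schema in predefined}
    for schema in custom:
        by_type[schema["document_type"]] = schema
    return list(by_type.values())


# Stand-ins for pre-defined schemas; the real definitions live in the service.
predefined = [
    {"document_type": "Receipt", "fields": {"total": {}}},
    {"document_type": "Invoice", "fields": {"invoice_number": {}}},
]
custom = [{"document_type": "Receipt", "fields": {"receipt_number": {}}}]

merged = resolve_schemas(predefined, custom, "merge")
replaced = resolve_schemas(predefined, custom, "replace")
print([schema["document_type"] for schema in merged])
print([schema["document_type"] for schema in replaced])

# Request body fragment that asks for the non-default strategy.
fragment = {
    "semantic_config": {"schemas": custom, "schemas_merge_strategy": "replace"}
}
```

With merge, the Invoice stand-in stays available and the custom Receipt schema takes the place of the pre-defined one. With replace, only the custom Receipt schema remains.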