Text extraction

Extract text to convert high-quality business documents into a simpler file format that AI models can use, or to find and isolate key pieces of information in documents such as contracts.

Text extraction is powerful for use cases where you want to extract specific entities or categories of information from a document based on the document structure.

Compatibility and specifications

Supported input file types

You can extract text from documents in different languages, or from a document that has a mix of multiple languages. Extract text from the following file types:

Supported document formats:

File type	Extension	Text extraction support	Semantic key-value pair extraction
PDF file	.pdf	Yes	Yes
PowerPoint	.pptx .ppt	Yes	Yes
Word document	.docx .doc	Yes	Yes
Excel	.xlsx	Yes	No

Supported image formats:

File type	Extension	Text extraction support	Semantic key-value pair extraction
BMP image	.bmp	Yes	Yes
HEIC/HEIF image	.heic .heif	Yes	Yes
JFIF image	.jfif	Yes	Yes
JPEG image	.jpeg .jpg	Yes	Yes
PNG image	.png	Yes	Yes
TIFF image/multi-image	.tif .tiff	Yes	Yes

Other supported formats:

File type	Extension	Text extraction support	Semantic key-value pair extraction
HTML	.html	Yes	Yes
Markdown	.md	Yes	No

Note: You cannot use the text extraction API to extract key-value pair data from XLSX documents.
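
The support tables above can be checked in code before you submit a file. The following is a minimal sketch that transcribes those tables into a lookup. The `SUPPORT` mapping, the `supports` function, and the feature labels `"text"` and `"kvp"` are hypothetical names chosen for illustration and are not part of the watsonx.ai API.

```python
from pathlib import Path

# Transcribed from the tables above:
# extension -> (text extraction, semantic key-value pair extraction)
SUPPORT = {
    ".pdf": (True, True),
    ".pptx": (True, True), ".ppt": (True, True),
    ".docx": (True, True), ".doc": (True, True),
    ".xlsx": (True, False),
    ".bmp": (True, True),
    ".heic": (True, True), ".heif": (True, True),
    ".jfif": (True, True),
    ".jpeg": (True, True), ".jpg": (True, True),
    ".png": (True, True),
    ".tif": (True, True), ".tiff": (True, True),
    ".html": (True, True),
    ".md": (True, False),
}


def supports(filename: str, feature: str = "text") -> bool:
    """Return True if the file extension supports the feature.

    feature is "text" (text extraction) or "kvp" (semantic key-value
    pair extraction). Both labels are hypothetical, for illustration only.
    """
    if feature not in ("text", "kvp"):
        raise ValueError(f"unknown feature: {feature!r}")
    ext = Path(filename).suffix.lower()  # extension match is case-insensitive
    text_ok, kvp_ok = SUPPORT.get(ext, (False, False))
    return text_ok if feature == "text" else kvp_ok
```

For example, `supports("sheet.xlsx", "kvp")` returns `False`, which matches the note about XLSX documents.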

Supported output file types

You can store the extracted text in the following file formats:

JSON
Markdown
HTML
TXT

For details about the contents of the extracted result in each output file type, see Specifying the output format.

Supported storage types

You can store your input documents in the following connected storage types:

IBM Cloud Object Storage
Amazon S3
Any generic Amazon S3-compatible storage
Box
IBM watsonx.data SharePoint
IBM FileNet P8

Note: The IBM FileNet P8 connection is only available in the Toronto data center and for a managed cloud service provider (MCSP).

You can store the text extraction output files in the following connected storage types:

IBM Cloud Object Storage
Amazon S3
Any generic Amazon S3-compatible storage
Box

Note: The text extraction API is certified for use with the generic Amazon S3-compatible MinIO object storage.

For details about how to create a connection to the various types of data stores in your project, see Connectors for watsonx.ai.

Supported foundation models

The text extraction API is certified to use the mistral-small-3-1-24b-instruct-2503 model for key-value pair extraction and image verbalization. You can also use alternative models that can process visual input and respond in JSON format, such as:

llama-4-maverick-17b-128e-instruct-fp8
mistral-medium-2505

For foundation model details, see Supported foundation models.

Supported languages

For scanned PDF documents and images, text extraction uses Optical Character Recognition (OCR). Supported languages depend on the OCR language models.
For programmatic documents, such as HTML, Markdown, and Microsoft Word, PowerPoint, and Excel files, text extraction supports all languages.

Image verbalization is supported for documents in all languages. However, the generated image descriptions are returned in English, regardless of the language of the input document.

Semantic key-value pair extraction is officially supported for the following languages:

Chinese
English
French
German
Italian
Japanese
Portuguese
Spanish

When using the API, specify the corresponding language code or script supported by OCR.

Restrictions

You can extract text from specific input file types and store the extracted output in certain file types. Not every input file type can be extracted into every supported output format. The following table shows which output formats are compatible with each input file type:

Input file type and extracted output format compatibility for the text extraction API
Input file type	Compatible output file formats
Programmatic PDF	All formats
Scanned PDF	All formats
Image	All formats
Microsoft PowerPoint file	All formats
Microsoft Word file	All formats
Markdown	All formats
Microsoft Excel file	Markdown, JSON, plain text
HTML file	Markdown, JSON, plain text

Key-value pair extraction results are returned only in the assembly output format, so key-value pair data is not included in HTML, Markdown, or plain text output formats.
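
The restrictions above can be expressed as a small pre-flight check. This is a sketch only: the type and format labels, and the `output_problems` helper, are hypothetical names for illustration. It also assumes that the assembly output format corresponds to the JSON output, which follows from the list of excluded formats but should be confirmed in Specifying the output format.

```python
from __future__ import annotations

ALL_FORMATS = frozenset({"json", "markdown", "html", "plain text"})
NO_HTML = ALL_FORMATS - {"html"}

# Transcribed from the compatibility table above (labels are illustrative).
COMPATIBLE_OUTPUTS = {
    "programmatic pdf": ALL_FORMATS,
    "scanned pdf": ALL_FORMATS,
    "image": ALL_FORMATS,
    "powerpoint": ALL_FORMATS,
    "word": ALL_FORMATS,
    "markdown": ALL_FORMATS,
    "excel": NO_HTML,
    "html": NO_HTML,
}


def output_problems(input_type: str, output_format: str,
                    wants_kvp: bool = False) -> list[str]:
    """Return a list of reasons why the combination would not work."""
    problems = []
    fmt = output_format.lower()
    allowed = COMPATIBLE_OUTPUTS.get(input_type.lower())
    if allowed is None:
        problems.append(f"unknown input type: {input_type}")
    elif fmt not in allowed:
        problems.append(f"{input_type} cannot be extracted to {output_format}")
    # Assumption: the assembly output is the JSON output.
    if wants_kvp and fmt != "json":
        problems.append("key-value pair results are returned only in the "
                        "assembly (JSON) output")
    return problems
```

An empty list means that the combination is allowed by the table. For example, `output_problems("Excel", "HTML")` reports one problem, and `output_problems("Scanned PDF", "HTML")` reports none.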

Ways to work

You must generate credentials to authenticate with watsonx.ai APIs. For details, see Generating a bearer token.
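
As an illustration of the credential step, the following sketch exchanges an IBM Cloud API key for a bearer token by using only the Python standard library. It assumes the public IBM Cloud IAM token endpoint and the API-key grant type. The function names are hypothetical, and the procedure in Generating a bearer token remains the authoritative reference for your deployment.

```python
import json
import urllib.parse
import urllib.request

# Assumption: public IBM Cloud IAM endpoint; other deployments may differ.
IAM_TOKEN_URL = "https://iam.cloud.ibm.com/identity/token"


def build_token_request(api_key: str) -> urllib.request.Request:
    """Build (but do not send) the form-encoded token request."""
    body = urllib.parse.urlencode({
        "grant_type": "urn:ibm:params:oauth:grant-type:apikey",
        "apikey": api_key,
    }).encode("ascii")
    return urllib.request.Request(
        IAM_TOKEN_URL,
        data=body,
        headers={
            "Content-Type": "application/x-www-form-urlencoded",
            "Accept": "application/json",
        },
        method="POST",
    )


def fetch_bearer_token(api_key: str) -> str:
    """Send the request and return the access token from the JSON reply."""
    with urllib.request.urlopen(build_token_request(api_key), timeout=30) as resp:
        return json.load(resp)["access_token"]
```

Building the request separately from sending it lets you inspect or log the request without making a network call. Pass the returned token in an `Authorization: Bearer` header on later API calls.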

You can extract text from documents stored in your watsonx.ai project with these programmatic methods:

REST API
Python
Node.js

REST API

You can extract text from files in IBM watsonx.ai programmatically by using the text extraction method of the watsonx.ai REST API.

For details about how to customize a text extraction request, see Text extraction parameters.

For API method details, see the watsonx.ai API reference documentation.
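
To show the general shape of a REST call, the following sketch builds a text extraction request without sending it. Everything specific in it is an assumption to verify against the watsonx.ai API reference documentation: the `/ml/v1/text/extractions` path, the `version` query parameter, and the `document_reference`, `results_reference`, and `parameters` field names. The `build_extraction_request` helper and its arguments are hypothetical.

```python
import json
import urllib.parse
import urllib.request


def build_extraction_request(base_url, token, project_id, connection_id,
                             input_file, output_file, api_version,
                             requested_outputs=("md",)):
    """Build (but do not send) a text extraction request.

    The endpoint path and payload field names are assumptions; check the
    API reference before relying on them.
    """
    payload = {
        "project_id": project_id,
        # Where the input document lives (a connected storage asset).
        "document_reference": {
            "type": "connection_asset",
            "connection": {"id": connection_id},
            "location": {"file_name": input_file},
        },
        # Where the extracted output is written.
        "results_reference": {
            "type": "connection_asset",
            "connection": {"id": connection_id},
            "location": {"file_name": output_file},
        },
        "parameters": {"requested_outputs": list(requested_outputs)},
    }
    query = urllib.parse.urlencode({"version": api_version})
    url = f"{base_url.rstrip('/')}/ml/v1/text/extractions?{query}"
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
            "Accept": "application/json",
        },
        method="POST",
    )
```

To submit the request, pass the returned object to `urllib.request.urlopen`. The base URL, API version date, and connection ID depend on your region and project.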

Python

You can extract text from files in IBM watsonx.ai programmatically by using the Python library.

See the TextExtractionsV2 class of the watsonx.ai Python library.

Try the sample notebook: Use the watsonx.ai Text Extraction V2 service to extract text from file.

Node.js

You can extract text from files in IBM watsonx.ai programmatically by using the Node.js SDK. To learn more, see the code example.