Text extraction

Extract text to convert high-quality business documents into a simpler file format that AI models can use, or to find and isolate key pieces of information in documents such as contracts.

Text extraction is well suited to use cases where you want to extract specific entities or categories of information from a document based on the document structure.

Compatibility and specifications

Supported input file types

You can extract text from documents in different languages, or from a document that has a mix of multiple languages. Extract text from the following file types:

  • BMP
  • DOC
  • DOCX
  • GIF
  • HTML
  • JFIF
  • JPG
  • Markdown
  • PDF
  • PNG
  • PPT
  • PPTX
  • TIFF
  • XLSX

Supported output file types

You can store the extracted text in the following file formats:

  • JSON
  • Markdown
  • HTML
  • TXT

For details about the contents of the extracted result in each output file type, see Specifying the output format.

Supported storage types

You can store your input documents in the following connected storage types:

  • IBM Cloud Object Storage
  • Amazon S3
  • Any generic Amazon S3-compatible storage
  • Box
  • IBM watsonx.data SharePoint
  • IBM FileNet P8
  • Storage volume with a persistent volume claim (PVC)

You can store the text extraction output files in the following connected storage types:

  • IBM Cloud Object Storage
  • Amazon S3
  • Any generic Amazon S3-compatible storage
  • Box
  • Storage volume with a persistent volume claim (PVC)

Note: The text extraction API is certified for use with the generic Amazon S3-compatible MinIO object storage.

For details about how to create a connection to the various types of data stores in your project, see Connectors for watsonx.ai.

You can also store your documents and text extraction results in a container associated with a project or deployment space.

Supported foundation models

The text extraction API is certified to use the mistral-small-3-1-24b-instruct-2503 model for key-value pair extraction and image verbalization. You can also use alternative models that can process visual input and respond in JSON format, such as:

  • llama-4-maverick-17b-128e-instruct-fp8
  • mistral-medium-2505
  • mistral-medium-2508

The text extraction API is also certified to use the pixtral-12b model for key-value pair extraction and image verbalization. However, the model is deprecated as of version 2.3.0 and will be removed in a future release. Move your text extraction API workloads to the latest recommended models.

For foundation model details, see Supported foundation models.

Restrictions

  • Not every input file type can be extracted into every supported output format. The following table shows which output formats are compatible with each input file type:

    Input file type and extracted output format compatibility for the text extraction API

    Input file type              Compatible output file formats
    Programmatic PDF             All formats
    Scanned PDF                  All formats
    Image                        All formats
    Microsoft PowerPoint file    All formats
    Microsoft Word file          All formats
    Markdown                     All formats
    Microsoft Excel file         Markdown, JSON, plain text
    HTML file                    Markdown, JSON, plain text

  • Key-value pair extraction is supported only for English-language documents.

  • The results of a text extraction request that processes key-value pairs are available only in the assembly output format. Key-value pairs are not extracted in HTML, Markdown, or plain text output formats.

  • IBM Cloud Object Storage requires a TLS update to include support for Extended Master Secret (EMS). Until that update is available, all text extraction requests that use Cloud Object Storage connections fail. Use other storage types to store your input documents and text extraction results.
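
The format restrictions above can be checked on the client before a request is submitted. The following is a minimal sketch, not part of the text extraction API: the `check_request` helper and its format names are hypothetical, the mapping of the `.md` extension to Markdown is an assumption, and treating JSON as the assembly output is inferred from the list of output formats that do not carry key-value pairs.

```python
from pathlib import PurePath

# Output format names used by this sketch (hypothetical identifiers).
ALL_FORMATS = frozenset({"json", "markdown", "html", "txt"})
WITHOUT_HTML = ALL_FORMATS - {"html"}

# File extension -> compatible output formats, transcribed from the
# compatibility table. Mapping "md" to Markdown files is an assumption.
COMPATIBLE_OUTPUTS = {
    **dict.fromkeys(
        ("pdf", "bmp", "gif", "jfif", "jpg", "png", "tiff",
         "ppt", "pptx", "doc", "docx", "md"),
        ALL_FORMATS,
    ),
    **dict.fromkeys(("xlsx", "html"), WITHOUT_HTML),
}


def check_request(file_name, output_format, key_value_pairs=False):
    """Return a list of problems; an empty list means no restriction is hit."""
    problems = []
    ext = PurePath(file_name).suffix.lstrip(".").lower()
    fmt = output_format.lower()
    allowed = COMPATIBLE_OUTPUTS.get(ext)

    if fmt not in ALL_FORMATS:
        problems.append(f"unknown output format: {fmt}")
    if allowed is None:
        problems.append(f"unsupported input file type: {ext or '(no extension)'}")
    elif fmt in ALL_FORMATS and fmt not in allowed:
        problems.append(f"{ext} input cannot be extracted to {fmt}")
    # Assumption: the assembly output corresponds to the JSON format.
    if key_value_pairs and fmt != "json":
        problems.append("key-value pairs are only returned in the JSON output")
    return problems


print(check_request("sheet.xlsx", "html"))
# prints ['xlsx input cannot be extracted to html']
```

The helper does not check the document language, so the English-only restriction on key-value pair extraction still has to be verified separately.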

Ways to work

You can extract text from documents stored in your watsonx.ai project with these programmatic methods:

  • REST API
  • Python
  • Node.js

To set up access to use the text extraction API, see the Developer resources.

REST API

You can extract text from files in IBM watsonx.ai programmatically by using the text extraction method of the watsonx.ai REST API.

For details about how to customize a text extraction request, see Text extraction parameters.

For API method details, see the watsonx.ai API reference documentation.
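
As a rough illustration of what a text extraction request might contain, the following sketch only assembles a JSON body and does not call the service. The `build_extraction_request` helper is hypothetical, and the field names (`project_id`, `document_reference`, `results_reference`, `requested_outputs`), the `connection_asset` type, and the `md` output identifier are assumptions made for this example; take the actual schema from the watsonx.ai API reference documentation.

```python
import json


def build_extraction_request(project_id, connection_id, input_file,
                             output_file, requested_outputs=("md",)):
    """Assemble a request body for a text extraction job (assumed schema)."""

    def reference(file_name):
        # Assumed shape: a connected data store plus a file location in it.
        return {
            "type": "connection_asset",
            "connection": {"id": connection_id},
            "location": {"file_name": file_name},
        }

    return {
        "project_id": project_id,
        "document_reference": reference(input_file),   # where the input is read
        "results_reference": reference(output_file),   # where results are written
        "parameters": {"requested_outputs": list(requested_outputs)},
    }


# Placeholder values; replace them with identifiers from your own project.
body = build_extraction_request(
    project_id="<project-id>",
    connection_id="<connection-id>",
    input_file="contract.pdf",
    output_file="contract.md",
)
print(json.dumps(body, indent=2))
```

Keeping payload construction separate from the HTTP call makes the body easy to inspect and compare against the API reference before any request is sent.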

Python

You can extract text from files in IBM watsonx.ai programmatically by using the watsonx.ai Python library.

See the TextExtractionsV2 class of the watsonx.ai Python library.

Try the sample notebook: Use the watsonx.ai Text Extraction V2 service to extract text from file.

Node.js

You can extract text from files in IBM watsonx.ai programmatically by using the Node.js SDK. To learn more, see the code example.

Learn more