Text extraction
Extract text to convert high-quality business documents into a simpler file format that can be used by AI models or to find and isolate key pieces of information from documents such as contracts.
Simplifying your business documents by converting them into a text-based format is especially useful for retrieval-augmented generation tasks where you want to find information that is relevant to a user query and include it with the input to a foundation model. Including accurate contextual information in model input helps the foundation model to incorporate factual and up-to-date information in the model output.
Capabilities
The document understanding technology uses the following methods to extract text:
- Optical character recognition
- Optical character recognition (OCR) extracts text from images, scanned documents, and tables, and is useful for preserving information that is depicted in images, diagrams, or in text that is embedded in files such as scanned PDFs. Although OCR can extract text from noisy images, image files must have a resolution of at least 80 DPI (dots per inch).
- Document structure identification
- The text extraction API processes document content from various data structures including tables, section titles, bulleted lists, paragraphs, and footnotes. The API also identifies and removes commonly used content such as headers and footers.
- Key-value pair extraction
- Use key-value pair extraction to process documents that contain generic or domain-specific structured data, such as invoices and utility bills. The extraction mode classifies documents based on the document type. The extracted text is stored in a data structure called a schema, where each piece of data (the value) is associated with a unique identifier (the key). The mode uses a predefined schema or a custom schema that you define. Key-value pairs are extracted with large language models (LLMs) and advanced vision-language processing.
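The 80 DPI minimum stated for OCR input can be checked before a scan is submitted. The following is a minimal sketch, assuming you know the pixel dimensions of the image and the physical size of the page that was scanned. The function names are hypothetical helpers and are not part of the text extraction API.

```python
MIN_DPI = 80  # minimum resolution stated for OCR input


def effective_dpi(pixels: int, inches: float) -> float:
    """Return the dots per inch along one dimension of a scanned page."""
    if inches <= 0:
        raise ValueError("inches must be positive")
    return pixels / inches


def meets_ocr_minimum(width_px: int, height_px: int,
                      width_in: float, height_in: float,
                      min_dpi: float = MIN_DPI) -> bool:
    """Check that the lower of the two per-axis resolutions reaches min_dpi."""
    lowest = min(effective_dpi(width_px, width_in),
                 effective_dpi(height_px, height_in))
    return lowest >= min_dpi


# A US Letter page (8.5 x 11 inches) scanned at 850 x 1100 pixels is 100 DPI.
print(meets_ocr_minimum(850, 1100, 8.5, 11.0))  # True
# The same page at 600 x 800 pixels is roughly 70 DPI on the short side.
print(meets_ocr_minimum(600, 800, 8.5, 11.0))   # False
```

The check uses the lower of the horizontal and vertical resolutions so that a scan that is stretched in one direction does not pass by accident.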
Requirements
If you signed up for watsonx.ai and you have a sandbox project, all requirements are met and you're ready to use the text extraction service.
Otherwise, you must meet the following requirements:
- You must have a project.
- The project must have an associated Watson Machine Learning service instance.
- Required permissions
To run a text extraction job, you must have the Admin or Editor role in a project.
- Supported input file types
You can extract text from documents in different languages, or from a document that has a mix of multiple languages. Extract text from the following file types:
- BMP
- DOC
- DOCX
- GIF
- HTML
- JFIF
- JPG
- Markdown
- PDF
- PNG
- PPT
- PPTX
- TIFF
- XLSX
Note: You cannot use the text extraction and classification APIs to extract key-value pair data from XLSX documents.
- Supported output file types
You can store the extracted text in the following file formats:
- JSON
- Markdown
- HTML
- TXT
For details about the contents of the extracted result in each output file type, see Specifying the output format.
- Supported storage types
You can store your input documents in the following connected storage types:
- IBM Cloud Object Storage
- Amazon S3
- Any generic Amazon S3-compatible storage
- Storage volume with a persistent volume claim (PVC)
You can store the text extraction output files in the following connected storage types:
- IBM Cloud Object Storage
- Amazon S3
- Any generic Amazon S3-compatible storage
- Storage volume with a persistent volume claim (PVC)
Note: The text extraction API is certified for use with the generic Amazon S3-compatible MinIO object storage.
For details about how to create a connection to the various types of data stores in your project, see Connectors for watsonx.ai.
You can also store your documents and text extraction results in a container associated with a project or deployment space.
- Supported foundation models
The text extraction API is certified to use the pixtral-12b model for key-value pair extraction and image verbalization.
You can use alternative models that can process visual input and respond in a JSON format, such as:
- llama-4-maverick-17b-128e-instruct-fp8
- mistral-medium-2505
- mistral-medium-2508
For foundation model details, see Supported foundation models.
Restrictions
- You can extract text from specific input file types and store the extracted output in certain file formats. Not every input file type can be extracted into every supported output format. The following table shows which output formats are compatible with each input file type:

Table: Input file type and extracted output format compatibility for the text extraction API

| Input file type | Compatible output file formats |
| --- | --- |
| Programmatic PDF | All formats |
| Scanned PDF | All formats |
| Image | All formats |
| Microsoft PowerPoint file | All formats |
| Microsoft Word file | All formats |
| Markdown | All formats |
| Microsoft Excel file | Markdown, JSON, plain text |
| HTML file | Markdown, JSON, plain text |

- Image verbalization and key-value pair extraction are supported only for English-language documents.
- IBM Cloud Object Storage requires a TLS update to include support for Extended Master Secret (EMS). All text extraction requests that use Cloud Object Storage connections fail. Use other storage types to store your input documents and text extraction results.
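The compatibility rules in the table above can be checked locally before a job is submitted. The following is a minimal sketch that maps file extensions to the output formats listed as compatible. The extension mapping (for example, `.md` for Markdown and `.pdf` for both PDF types) and the lowercase format labels are assumptions made for illustration, not values defined by the text extraction API.

```python
from pathlib import Path

# Local labels for the four supported output file types (hypothetical names).
ALL_FORMATS = frozenset({"json", "markdown", "html", "txt"})
# Excel and HTML inputs are listed as compatible with Markdown, JSON, and plain text only.
NO_HTML_OUTPUT = frozenset({"json", "markdown", "txt"})

COMPATIBLE_OUTPUTS = {}
for ext in (".pdf", ".bmp", ".gif", ".jfif", ".jpg", ".png", ".tiff",
            ".ppt", ".pptx", ".doc", ".docx", ".md"):
    COMPATIBLE_OUTPUTS[ext] = ALL_FORMATS
for ext in (".xlsx", ".html"):
    COMPATIBLE_OUTPUTS[ext] = NO_HTML_OUTPUT


def is_compatible(file_name: str, output_format: str) -> bool:
    """Return True if the input file's extension allows the requested output format."""
    ext = Path(file_name).suffix.lower()
    return output_format.lower() in COMPATIBLE_OUTPUTS.get(ext, frozenset())


print(is_compatible("contract.pdf", "html"))   # True
print(is_compatible("invoices.xlsx", "html"))  # False
print(is_compatible("invoices.xlsx", "json"))  # True
```

Unknown extensions return False, so a file type that is not in the supported list is rejected rather than silently passed through.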
Ways to work
You can extract text from documents stored in your watsonx.ai project with these programmatic methods:
- REST API
- Python
- Node.js
To set up access to use the text extraction API, see the Developer resources.
REST API
You can extract text from files in IBM watsonx.ai programmatically by using the text extraction method of the watsonx.ai REST API.
For details about how to customize a text extraction request, see Text extraction parameters.
For API method details, see the watsonx.ai API reference documentation.
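As a rough illustration of what a REST request might look like, the following sketch builds (but does not send) a POST request with the Python standard library. The endpoint path, the `version` query parameter, and every field name in the payload are assumptions for illustration only; confirm them against the watsonx.ai API reference documentation before use. The function name and all argument values are hypothetical.

```python
import json
import urllib.request


def build_extraction_request(base_url: str, token: str, project_id: str,
                             connection_id: str, input_file: str,
                             output_file: str, version: str) -> urllib.request.Request:
    """Build a text extraction request object without sending it."""
    # Assumed payload shape: one reference to the input document and one
    # reference to the location where the results are written.
    payload = {
        "project_id": project_id,
        "document_reference": {
            "type": "connection_asset",
            "connection": {"id": connection_id},
            "location": {"file_name": input_file},
        },
        "results_reference": {
            "type": "connection_asset",
            "connection": {"id": connection_id},
            "location": {"file_name": output_file},
        },
    }
    return urllib.request.Request(
        url=f"{base_url}/ml/v1/text/extractions?version={version}",  # assumed path
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
            "Accept": "application/json",
        },
        method="POST",
    )


# Placeholder values only; replace them with your own endpoint, token, and IDs.
request = build_extraction_request(
    base_url="https://example.com",
    token="YOUR_TOKEN",
    project_id="YOUR_PROJECT_ID",
    connection_id="YOUR_CONNECTION_ID",
    input_file="contract.pdf",
    output_file="contract.json",
    version="2025-01-01",
)
print(request.get_method(), request.full_url)
```

Building the request separately from sending it makes the payload easy to inspect. To submit it, you would pass the object to `urllib.request.urlopen` with a valid endpoint and bearer token.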
Python
You can extract text from files in IBM watsonx.ai programmatically by using the Python library.
See the TextExtractionsV2 class of the watsonx.ai Python library.
Try the sample notebook: Use the watsonx.ai Text Extraction V2 service to extract text from file.
Node.js
You can extract text from files in IBM watsonx.ai programmatically by using the Node.js SDK.
To learn more, see the code example.