Processing documents with generative AI

You can use the generative AI activity to process documents such as PDFs, images, and other file types by using large language models (LLMs) that support multimodal capabilities. You can extract information, analyze content, and generate insights from documents as part of your business automation workflows.

Overview

Processing documents is a common requirement in business automation workflows. Traditionally, document processing required separate optical character recognition (OCR) products that were integrated with Business Automation Workflow. With LLMs that support document processing, you can now analyze and extract information from documents directly within your workflows by using the generative AI activity.

Document processing is available when you use LLMs that support multimodal capabilities. Supported configurations include IBM watsonx.ai and providers that are available through the IBM Model Gateway, including OpenAI, Google Gemini, and AWS Bedrock.

Important: Document processing features are available only through LLMs that support chat-based interactions and multimodal capabilities.

Supported document types

The types of documents you can process depend on the capabilities of the selected LLM provider. Common supported formats include:
  • PDF files
  • Image files (JPEG, PNG, and other formats)
  • Text documents
  • Audio files
  • Video files

Each LLM provider supports different document types and has different limitations on the number of documents that can be processed in a single request. Some LLMs support only a single document per request, while others support multiple documents.

Note: Verify that your selected LLM supports the document types and quantity requirements for your use case.
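
The verification described in the note can be made explicit before a request is sent. The following Python sketch is illustrative only: the ModelCapabilities record, the check_request function, and the capability values are hypothetical and are not part of Business Automation Workflow or any provider API. Take the actual supported types and per-request limits from your provider's documentation.

```python
from dataclasses import dataclass
from pathlib import PurePath


@dataclass(frozen=True)
class ModelCapabilities:
    """Hypothetical capability record; real limits come from the provider's documentation."""
    extensions: frozenset  # file extensions that the model accepts
    max_documents: int     # documents allowed in a single request


def check_request(caps, filenames):
    """Return a list of problems; an empty list means the request fits the model."""
    problems = []
    if len(filenames) > caps.max_documents:
        problems.append(
            f"{len(filenames)} documents exceed the limit of {caps.max_documents}"
        )
    for name in filenames:
        ext = PurePath(name).suffix.lower()
        if ext not in caps.extensions:
            problems.append(f"unsupported document type: {name}")
    return problems


# Illustrative values only, not the limits of any real model.
single_doc_model = ModelCapabilities(frozenset({".pdf", ".png", ".jpg"}), max_documents=1)
print(check_request(single_doc_model, ["invoice.pdf", "notes.mp3"]))
```

In this example the check reports two problems: the request contains more documents than the assumed limit of one, and the audio file has a type that the assumed model does not accept.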

Document input methods

You can provide documents to the generative AI activity in the following ways.

Important: Document processing is not supported in Workflow Process Service because ECMDocument support is not available.

ECM document references
Reference documents that are stored in the BPM document store or an external Enterprise Content Management (ECM) server by using variables of type ECMDocument or ECMDocumentInfo.
Server files
Reference files that are stored in the process application or toolkit. These files can be used in training examples or for testing purposes during development.

When you define a prompt input in the generative AI activity, the attachment icon is available only after you add the Content Management toolkit as a dependency. You can then add document attachments, assign test files in the test data table, and verify the prompt generation before deploying your workflow.

For detailed instructions on adding documents to a generative AI task in a service flow, see Adding a generative AI task to a service flow.

Training examples with documents

You can create training examples that include document references to help the LLM understand the expected output format and content. When you save a training example that references server files, the system tracks which files are used. You are responsible for managing these server files:

  • Include only the server files that are used in training examples when you deploy your process application or toolkit.
  • Delete server files that are no longer referenced in training examples to avoid unnecessary storage consumption.
  • Delete files that are no longer referenced in test data to avoid unnecessary storage consumption.

If you delete a server file that is referenced in a training example, you must also delete the training example to maintain consistency.
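
The file management rules above amount to a few set comparisons. The following Python sketch assumes that you can list the server file names and the file names that training examples and test data reference. The audit_server_files function and all file names are hypothetical, and how you obtain the lists from your process application is outside the scope of the sketch.

```python
def audit_server_files(server_files, training_refs, test_data_refs):
    """Compare stored server files with the files that are still referenced."""
    stored = set(server_files)
    training = set(training_refs)
    referenced = training | set(test_data_refs)
    return {
        # Stored but referenced nowhere: candidates for deletion.
        "unused": sorted(stored - referenced),
        # Used by training examples: include these files when you deploy.
        "deploy": sorted(stored & training),
        # Referenced by a training example but no longer stored:
        # the training example itself should be deleted.
        "dangling": sorted(training - stored),
    }


report = audit_server_files(
    server_files=["invoice_a.pdf", "invoice_b.pdf", "old_scan.png"],
    training_refs=["invoice_a.pdf", "receipt.pdf"],
    test_data_refs=["invoice_b.pdf"],
)
print(report)
```

Here old_scan.png is unused and can be deleted, invoice_a.pdf must be deployed, and the training example that references the missing receipt.pdf must be removed.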

Performance considerations

Document processing can impact system performance and resource usage:

Memory consumption
Document content is extracted and encoded to Base64 format, which increases memory usage and network payload size.
Network traffic
Larger payloads result in increased network traffic between Business Automation Workflow and the LLM provider.
Processing time
Document analysis typically takes longer than text-only prompts because the LLM must process and interpret the document content.
Storage requirements
Process applications and toolkits that include server files for training examples consume more storage space. Remove unused server files to optimize storage usage.

Tip:
  • Test your document processing workflows with representative document sizes and types to assess performance impact in your environment.
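
To get a rough sense of the memory and network cost that Base64 encoding adds, you can estimate the encoded size from the raw document size. Standard Base64 emits 4 output characters for every 3 input bytes, so the payload grows by about one third. The following Python sketch uses only the standard library. The base64_size helper is a name introduced here for illustration, and the estimate covers the encoded document only, not the rest of the request.

```python
import base64


def base64_size(raw_bytes):
    """Length of standard padded Base64 output for raw_bytes bytes of input."""
    # Every group of up to 3 input bytes becomes 4 output characters.
    return 4 * ((raw_bytes + 2) // 3)


sample = bytes(1_000_000)  # stand-in for a document of 1,000,000 bytes
encoded = base64.b64encode(sample)

# The formula matches the real encoder output.
assert len(encoded) == base64_size(len(sample))
print(len(encoded), round(len(encoded) / len(sample), 3))
```

For the sample of 1,000,000 bytes, the encoded form is 1,333,336 bytes, a ratio of about 1.333.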

Compatibility with earlier versions

Document processing is not supported with the deprecated text/generation API in IBM watsonx.ai. To use document processing features, select an LLM that supports chat-based interactions and multimodal capabilities.

For existing watsonx.ai configurations:

  • The deprecated text/generation API currently continues to work for text-only prompts.
  • Select an LLM that supports chat-based interactions to enable document processing.
  • You cannot automatically migrate existing activities because the replacement API might produce different responses.
  • Test and validate all service flows after switching from the deprecated text/generation API.
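
One way to approach the validation step is to run the same prompts before and after the switch and compare the results. The following Python sketch assumes that your prompts ask the LLM for a JSON object and that you have captured the response text from both configurations. The compare_responses function, the key names, and the sample responses are hypothetical.

```python
import json


def compare_responses(baseline, candidate, required_keys):
    """Compare two JSON object responses for the same prompt and list the differences."""
    try:
        old, new = json.loads(baseline), json.loads(candidate)
    except json.JSONDecodeError as err:
        return [f"response is not valid JSON: {err.msg}"]
    issues = []
    for key in required_keys:
        if key not in new:
            issues.append(f"missing key: {key}")
        elif old.get(key) != new[key]:
            issues.append(f"changed value for {key}: {old.get(key)!r} -> {new[key]!r}")
    return issues


# Captured response text for the same prompt (illustrative values).
before = '{"invoice_number": "INV-1", "total": "120.00"}'
after = '{"invoice_number": "INV-1", "total": 120.0}'
print(compare_responses(before, after, ["invoice_number", "total"]))
```

In this example the comparison flags that the total changed from a string to a number, which is the kind of difference that can break a downstream activity even though the extracted value looks the same.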

Operational considerations

When implementing document processing in your workflows, consider the following operational aspects:

LLM configuration
Ensure that the selected LLM in the generative AI activity supports the document types and processing capabilities that are required for your use cases.
File management
Establish processes for managing server files used in training examples. Remove files that are no longer referenced to prevent unnecessary storage consumption.
Testing
Use the test data functionality in the generative AI activity to upload sample documents and verify that the LLM produces the expected results before deploying to production.

Content Management toolkit integration

The Content Management toolkit provides the ECMDocument and ECMDocumentInfo types that you can use to reference documents in the generative AI activity. These types enable you to:
  • Retrieve documents from the BPM document store.
  • Access documents from external ECM servers.
  • Pass document references between activities in your service flows.

In Process Designer, you can use existing UI components to browse document stores and select ECMDocumentInfo instances. In service flows, you can use the available activities to retrieve documents. These activities return ECMDocument instances that you can pass to the generative AI activity.

To use the Content Management toolkit in your workflow automation or toolkit, you must add it as a dependency. For instructions, see Modifying toolkit dependencies.