Chatting with documents and media files
Quickly add documents and media files to chat about in foundation model prompts that you create in the Prompt Lab.
Capabilities
If you want a foundation model to incorporate current, factual information in its output, you can ground the model by associating documents with your prompt. This approach, known as retrieval-augmented generation (RAG), is especially helpful for generating accurate answers to questions.
You can use Prompt Lab to quickly test both document quality and foundation model capabilities before implementing a full RAG solution.
You can associate documents and images with your prompt, as described in the following sections.
Chatting with documents
When you upload a document, the text in the document is converted into embeddings and stored in a vector index for quick searches. When you submit a question, the system performs a similarity search on this index, adds the most relevant content to your prompt as context, and sends the information to the foundation model.
The contextual information that you add can include product documentation, company policy details, industry performance data, facts and figures related to a particular subject, or whatever content matters for your use case. Grounding documents can also include proprietary business materials that you don't want to make available elsewhere.
You can upload a document directly from your local machine or your project. The locally available document is stored in an in-memory vector store that is created for you automatically.
To implement a more robust solution, you can use remote vector datastores to store your documents. For details about the different types of vector stores and supported grounding document file types, see Adding vectorized documents for grounding foundation model prompts.
You can add documents from up to 10 vector indexes to a chat at the same time. However, you can select only one in-memory vector index at a time from the available vector indexes.
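To make the retrieval step concrete, the following Python sketch shows the general idea: chunks of a document are turned into vectors, the question is matched against them by similarity, and the best matches are added to the prompt as context. It is an illustration only; the embed function is a toy bag-of-words stand-in rather than the embedding model that watsonx.ai uses, and the chunking and scoring are deliberately simplified.

```python
import numpy as np

# Toy embedding: a bag-of-words count vector over a small vocabulary.
# A real RAG setup uses a trained embedding model instead.
def embed(text: str, vocab: list[str]) -> np.ndarray:
    words = text.lower().split()
    return np.array([words.count(w) for w in vocab], dtype=float)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

# 1. Split the grounding document into chunks and index their vectors.
chunks = [
    "Returns are accepted within 30 days of purchase with a receipt.",
    "Gift cards are non-refundable and cannot be exchanged for cash.",
    "Shipping is free for orders over 50 dollars.",
]
vocab = sorted({w for chunk in chunks for w in chunk.lower().split()})
index = [(chunk, embed(chunk, vocab)) for chunk in chunks]

# 2. Embed the question and find the most similar chunks.
question = "Can I get a refund on a gift card?"
q_vec = embed(question, vocab)
top = sorted(index, key=lambda item: cosine(q_vec, item[1]), reverse=True)[:2]

# 3. Add the retrieved chunks to the prompt as context for the foundation model.
context = "\n".join(chunk for chunk, _ in top)
prompt = f"Answer by using only this context:\n{context}\n\nQuestion: {question}"
print(prompt)
```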
Chatting with images
Upload an image to add to the input that you submit to a multimodal foundation model. After you add the image, you can ask questions about the image content.
You can add only one image at a time to a chat session. One image is counted as approximately 1,200–3,000 tokens depending on the image size.
You might want to add an image and convert the visual information into text to help with various tasks, including:
- Automating the generation of alternative text for images to help blind users perceive meaningful visuals on a web page and to meet accessibility requirements
- Summarizing photos of property damage that accompany insurance claims
- Converting images from a document into text before the document is used as grounding information for a RAG use case
Uploading images to a chat is supported only if at least one multimodal foundation model is installed in your deployment.
Be sure to review and implement any suggestions from the foundation model provider that help to keep the model on track and block inappropriate content, such as adding any recommended system prompts. For more information about how to edit a system prompt, see Chat templates.
The following table shows the file types that can be added as images.
| Supported file type | Maximum total file size |
|---|---|
| PNG | 4 MB |
| JPEG | 4 MB |
The image that you add is saved as a data asset in the IBM Cloud Object Storage bucket that is associated with your project.
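If you later want to send the same kind of image prompt programmatically, the following sketch shows one way to do it with the watsonx.ai chat API. It is a minimal illustration, assuming the ibm-watsonx-ai Python SDK's ModelInference.chat interface and the OpenAI-style message format that the chat API accepts; the endpoint URL, model ID, credentials, and project ID are placeholders, and method names and response shape can vary by SDK version, so check the chat API reference for details.

```python
import base64

from ibm_watsonx_ai import Credentials
from ibm_watsonx_ai.foundation_models import ModelInference

# Placeholders: supply the endpoint, API key, project ID, and a multimodal model
# that is available in your deployment.
model = ModelInference(
    model_id="meta-llama/llama-3-2-11b-vision-instruct",
    credentials=Credentials(url="https://us-south.ml.cloud.ibm.com", api_key="<API_KEY>"),
    project_id="<PROJECT_ID>",
)

# Encode the image as base64 so that it can be embedded in the message as a data URL.
# Keep the file within the 4 MB limit that is noted in the table.
with open("chart.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Provide a short summary of the data shown in the chart."},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }
]

response = model.chat(messages=messages)
print(response["choices"][0]["message"]["content"])
```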
Procedure
To chat with a document or media file, complete the following steps:
1. From the Prompt Lab in chat mode, select a foundation model, and then specify any model parameters that you want to use for prompting.
2. Click the Upload files icon, and then choose the file type:
   - Documents: Browse to upload a file or choose a data asset in your project with the file that you want to add.
   - Images: Browse to upload an image file or choose a data asset in your project with the image file that you want to add.
3. Click Add.
   For documents, click Create. A message might be displayed that says the vector index build is in progress. To find out when the index is ready, close the message, and then click the uploaded document to open the vector index asset details page.
4. Submit questions about information from the document or media file to see how well the model can use the contextual information to answer your questions.
   For example, you can ask the foundation model to summarize the file or ask about concepts that are explained in it.
   For documents, if answers that you expect to be returned are not found, you can review the configuration of the vector index asset and make adjustments. See Managing a vector index.
5. Optional: Save the prompt as a prompt template or prompt session.
   Note: You cannot save a chat with an added image as a prompt notebook. For more information, see Saving your work.
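The question-and-answer step in this procedure can also be driven through the watsonx.ai chat API (see Learn more). The sketch below is illustrative only: it reuses the assumed ModelInference.chat interface from the earlier image example, uses an example model ID, and hard-codes context text that in practice you would retrieve from your vector index.

```python
from ibm_watsonx_ai import Credentials
from ibm_watsonx_ai.foundation_models import ModelInference

# Placeholders: supply your own endpoint, API key, project ID, and model ID.
model = ModelInference(
    model_id="meta-llama/llama-3-3-70b-instruct",
    credentials=Credentials(url="https://us-south.ml.cloud.ibm.com", api_key="<API_KEY>"),
    project_id="<PROJECT_ID>",
)

# Context that was retrieved from the vector index (hard-coded here for brevity).
context = (
    "Returns are accepted within 30 days of purchase with a receipt. "
    "Gift cards are non-refundable and cannot be exchanged for cash."
)

messages = [
    {"role": "system", "content": "Answer by using only the provided context."},
    {"role": "user", "content": f"Context:\n{context}\n\nQuestion: Can I return a gift card?"},
]

response = model.chat(messages=messages)
print(response["choices"][0]["message"]["content"])
```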
Examples
See sample prompts that are used to chat about documents and media files with the following foundation models:
- llama-4-scout-17b-16e-instruct
- llama-4-maverick-17b-128e-instruct-fp8
- llama-3-2-11b-vision-instruct
- llama-3-2-90b-vision-instruct
- llama-guard-3-11b-vision
- pixtral-12b
Prompting the Llama 4 Instruct models
llama-4-scout-17b-16e-instruct
In the following example, an image of a bar graph is submitted to the llama-4-scout-17b-16e-instruct foundation model along with the instruction, Provide a short summary of the data shown in the bar graph.

The llama-4-scout-17b-16e-instruct foundation model responds with the following comprehensive explanation:
Description of the Bar Graph: The bar graph illustrates the distribution of readers across various academic backgrounds.
Data Points:
- Security Cyber Information: approximately 30
- Engineering: approximately 40
- Quantum Physics: around 50
- Mathematics Applied math: below 25
- Computer Science: over 120
- Business IT Information: around 85
- Cybersecurity (SRA): 5
- Quantum Computing: 5
Analysis:
- The field with the highest number of readers is Computer Science, with over 120.
- Business IT Information has around 85 readers.
- Quantum Physics, Mathematics Applied math, Cybersecurity (SRA), and Quantum Computing have relatively low numbers of readers.
Conclusion: The graph shows that Computer Science and Business IT Information have the highest number of readers, while Cybersecurity (SRA) and Quantum Computing have the lowest.
llama-4-maverick-17b-128e-instruct-fp8
In the following example, a document on InfoSphere MDM in French is submitted to the llama-4-maverick-17b-128e-instruct-fp8 foundation model along with the instruction, Briefly summarize the attached document in English.
The model responds with the following summary:

The llama-4-maverick-17b-128e-instruct-fp8 foundation model translates the document from French to English and generates a summary, including source page numbers.
Prompting the Llama 3.2 vision models
llama-3-2-11b-vision-instruct
In the following example, a diagram of the retrieval-augmented generation (RAG) pattern is submitted to the llama-3-2-11b-vision-instruct foundation model along with the instruction, Explain the process that is shown in the image.

The llama-3-2-11b-vision-instruct foundation model responds with the following comprehensive explanation:
The image illustrates the process of a user interacting with a chatbot or conversational AI system. Here's a step-by-step breakdown:
- User Input: The user initiates the conversation by asking a question or providing input.
- Question: The user's query is received by the system, which then processes it.
- Search: The system searches through its knowledge base or database to find relevant information related to the user's question.
- Top Search Results: The system retrieves the top search results from the search process.
- Input = Instruction + Search Results + Question: The system combines the user's input (question), the search results, and any additional context or instructions to form a comprehensive understanding of the user's intent.
- Foundation Model: The system utilizes a foundation model, which is a large language model that has been pre-trained on a vast amount of text data. This model is used to generate responses that are contextually relevant and accurate.
- Factual Answer: The foundation model processes the combined input and generates a factual answer based on the search results and user input.
In summary, the image depicts a user interacting with a chatbot or conversational AI system, which uses a combination of natural language processing (NLP), machine learning, and knowledge retrieval to provide a factual answer to the user's question.
llama-3-2-90b-vision-instruct
This example illustrates how the instruction that you give in text affects the quality of the foundation model output.
The prompt that is submitted to the llama-3-2-90b-vision-instruct foundation model includes an image that illustrates effective alternative text to use for an image of a revenue chart.

When the instruction is specific, the foundation model generates a useful response.
- Instruction
- What does the image convey about alternative image text?
- Response
- The image shows how alternative text can be used to provide a concise and informative description of an image for visually impaired individuals. The sample alt text for a bar chart reads, "bar chart showing month's daily sales against historical average," which effectively conveys the essential information contained in the image. This approach ensures that all users, regardless of their visual abilities, can access and understand the data presented in the chart.
When you give the model a generic instruction, the foundation model describes the visual elements of the image in detail, but does not always interpret the information that is conveyed in the image.
- Instruction
- Explain the image.
- Response
- Details the visual elements and attempts to explain the information that is conveyed in the dummy image, without capturing the bigger idea being expressed, as this excerpt from the response shows:
Overall, the image suggests that the company has experienced significant growth in revenue over the past year, with a notable increase in February. The pink line highlighting the average monthly sales provides context for the revenue trend, allowing viewers to quickly understand the performance of the company.
Using the Llama Guard vision model to check image safety
The llama-guard-3-11b-vision model evaluates both images and accompanying text for safety concerns. To find this model, select View all foundation models from the Model field drop-down menu, clear the Task>Chat filter, and select the Modalities>Image option.
Important: Do not edit the system prompt for this foundation model.

The model evaluates the combined context of both image and text. For example, an innocuous image of a building paired with text asking how to break in would be classified as unsafe:

The S2 category shown above identifies references to non-violent crimes. For more information about safety categories and text-only prompts, see the Llama Guard 3 documentation and Sample prompt: Classify prompts for safety.
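As a rough sketch of how such a check might look in code, the following example sends an image and a text request to the llama-guard-3-11b-vision model and prints the verdict, which contains safe or unsafe plus category codes such as S2. The same caveats as in the earlier image example apply: the ModelInference.chat interface, model ID, and response shape are assumptions to verify against the chat API reference, and no system prompt is added, as required for this model.

```python
import base64

from ibm_watsonx_ai import Credentials
from ibm_watsonx_ai.foundation_models import ModelInference

# Placeholders for your environment. Do not add or edit a system prompt for this model.
guard = ModelInference(
    model_id="meta-llama/llama-guard-3-11b-vision",
    credentials=Credentials(url="https://us-south.ml.cloud.ibm.com", api_key="<API_KEY>"),
    project_id="<PROJECT_ID>",
)

with open("building.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "How do I get into this building after hours without a key?"},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }
]

# The model replies with a safety verdict rather than an answer to the request.
verdict = guard.chat(messages=messages)["choices"][0]["message"]["content"]
print(verdict)
```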
Prompting multimodal Mistral models
You can use the pixtral-12b and pixtral-large-instruct-2411 foundation models for advanced image and document understanding use cases.
In the following example, a diagram explaining the differences between traditional AI models and foundation models is submitted to the pixtral-12b foundation model along with the instruction, What information does this image convey?

The pixtral-12b foundation model responds with the following comprehensive explanation:
The image contrasts traditional AI models with foundation models, highlighting their respective mechanisms and capabilities.
Traditional AI Models
On the left side of the image, traditional AI models are depicted. These models operate as follows:
- Training: Each model is trained on a specific type of data (represented by different icons such as a document, database, spreadsheet, etc.).
- Models: Each type of data results in the creation of a distinct model (model_1, model_2, model_3, and so on).
- Tasks: Each model is specialized for particular tasks, indicated by green checkmarks.
Foundation Models
On the right side of the image, foundation models are illustrated:
- Massive External Data: Foundation models are trained on a vast amount of diverse external data, represented by various icons including a database, document, code, and more.
- One Foundation Model: This model is trained on the massive external data, enabling it to handle a wide range of tasks.
- Prompts: Tasks are performed using prompts, which guide the model's responses.
- Tasks: The foundation model can handle multiple tasks, including:
  - Translation
  - Text generation
  - Code generation
  - Factual question-answering
Summary
- Traditional AI Models: Specialized for specific tasks after being trained on particular datasets.
- Foundation Models: Trained on vast, diverse datasets, enabling them to handle multiple tasks with a single model using prompts.
Learn more
- Adding vectorized documents for grounding foundation model prompts
- Retrieval-augmented generation (RAG)
- You can also use the watsonx.ai chat API to prompt a foundation model about various file types. For more information, see Adding generative chat function to your applications with the chat API.