Build a PPT AI image analysis question answering system with Granite vision model

Publication Date

26/02/2025

 

Granite vision model recipe overview

In this recipe, you will learn to build an AI-driven system capable of answering real-time user queries from PowerPoint slides, using both text and images as context.

Authors

Vrunda Gadesha

AI Advocate | Technical Content Author

As AI-driven technologies evolve, image analysis is becoming increasingly sophisticated, enabling deeper insights from visual data. With advancements in machine learning models, AI can process uploaded images, extract metadata and support content moderation at large scale. These analysis tools also contribute to predictive modeling for applications like pricing, visual optimization and image generation, making workflows more cost-effective and efficient. By integrating data-driven approaches, AI enhances automation and decision-making, offering new possibilities for intelligent visual interpretation.

Use cases

With the rapid advancements in computer vision and AI, businesses and researchers are leveraging image-based technologies for a wide range of applications. From image classification and optical character recognition (OCR) to segmentation and video analysis, AI-powered tools are transforming the way we extract and analyze visual information.

In industries like social media, AI enhances content moderation by analyzing images at the pixel level, ensuring compliance and improving engagement. Businesses can also use vision APIs for automated document processing, converting scanned files, spreadsheets and reports into structured data. These applications streamline workflows, improve efficiency and enable organizations to extract meaningful insights from large-scale visual datasets.

These use cases highlight the growing role of AI-powered image analysis across industries. In this tutorial, we focus on applying these capabilities to PowerPoint presentations, enabling interactive Q&A on text and images using advanced computer vision and AI models.

AI-powered interactive Q&A for presentations

Large language models (LLMs) have revolutionized machine learning by enabling intelligent insights from vast datasets of unstructured text. However, traditional LLMs often struggle with image analysis, making it challenging to extract insights from charts, diagrams and visual elements in presentations.

IBM Granite™ Vision 3.2 bridges this gap. It is a vision-language model that processes both text and images, allowing users to automate multimodal analysis. This tutorial demonstrates how to streamline your workflow by using AI to extract and analyze text and images from PowerPoint (.pptx) files, enabling interactive Q&A for enhanced presentation insights.

In this tutorial, you will learn to build an AI-driven system capable of answering real-time user queries from PowerPoint slides by using both text and images as context. This tutorial will guide you through:

PowerPoint processing: Extract text and images from .pptx files for AI-based analysis.

Text-based Q&A: Use Granite Vision to generate answers based on extracted slide text.

Image-based Q&A: Ask AI to analyze images, charts and diagrams from slides.

Optimized question formulation: Learn how to craft effective questions for accurate and relevant AI responses.

Technologies used

This tutorial leverages cutting-edge AI technologies, including:

1. IBM Granite Vision: A powerful vision-language model (VLM) that processes both text and images.

2. Python-PPTX: A library for extracting text and images from PowerPoint files.

3. Transformers: A framework to process AI model inputs efficiently.

What you will achieve

By the end of this tutorial, you will:

1. Extract and process PowerPoint content (text and images).

2. Use the Granite Vision 3.2 model for AI-driven Q&A on slide content.

3. Ask AI insightful questions about text and images.

4. Improve user interaction with presentations by using AI-powered explanations.

This tutorial is designed for AI developers, researchers, content creators, and business professionals looking to enhance their presentations with AI-driven insights.


Prerequisites

You need an IBM Cloud account to create a watsonx.ai project.

Steps

Step 1. Set up your environment

While you can choose from several tools, this tutorial walks you through how to set up an IBM account to use a Jupyter Notebook.

1. Log in to watsonx.ai by using your IBM Cloud account.

2. Create a watsonx.ai project. You can get your project ID from within your project. Click the Manage tab. Then, copy the project ID from the Details section of the General page. You need this ID for this tutorial.

3. Create a Jupyter Notebook.

4. Upload the PPTX file as an asset in watsonx.ai.

This step opens a notebook environment where you can copy the code from this tutorial. Alternatively, you can download this notebook to your local system and upload it to your watsonx.ai project as an asset. This tutorial is also available on GitHub.

Note: This tutorial needs GPU infrastructure to run the code, so it is recommended to use watsonx.ai as illustrated in this tutorial.

Step 2: Install required dependencies

Before we begin extracting and processing PowerPoint content, we need to install the necessary Python libraries:

transformers: Provides access to IBM Granite Vision and other AI models.

torch: A deep learning framework required for running the model efficiently.

python-pptx: A library to extract text and images from PowerPoint (.pptx) files.

Run the following commands to install and upgrade these packages:

!pip install --upgrade transformers
!pip install --upgrade torch
!pip install python-pptx
!pip install botocore
!pip install ibm-cos-sdk

Step 3: Import required libraries

In this step, we import the necessary libraries for processing PowerPoint files, handling images, and interacting with the IBM Granite Vision model:

  1. os and io: For file handling and input/output operations.
  2. torch: Ensures compatibility with the AI model.
  3. pptx.Presentation: Extracts text and images from PowerPoint (.pptx) files.
  4. PIL.Image: Processes images extracted from slides.
  5. transformers: Loads IBM Granite Vision for AI-based Q&A.
  6. botocore.client.Config & ibm_boto3: Handles cloud-based storage access (IBM Cloud Object Storage).
import os
import io
import torch
from pptx import Presentation
from PIL import Image
from io import BytesIO
from transformers import AutoProcessor, AutoModelForVision2Seq
from botocore.client import Config
import ibm_boto3

Step 4: Connect to IBM Cloud Object Storage

In this step, we establish a connection to IBM Cloud Object Storage to access and retrieve PowerPoint files stored in the cloud.

You can leverage the Python support, provided through a fork of the boto3 library, with features to make the most of IBM Cloud® Object Storage. Check out the official documentation to get these credentials.

ibm_boto3.client: Creates a client to interact with IBM Cloud Object Storage.

ibm_api_key_id: Your IBM Cloud API key for authentication.

ibm_auth_endpoint: The authentication endpoint for IBM Cloud.

endpoint_url: The specific cloud object storage (COS) storage endpoint.

# IBM COS credentials
cos_client = ibm_boto3.client(
    service_name='s3',
    ibm_api_key_id='Enter your API Key',
    ibm_auth_endpoint='[Enter your auth end-point url]',
    config=Config(signature_version='oauth'),
    endpoint_url='[Enter your end-point url]'
)

Note: When you upload a file as assets in watsonx.ai, it is automatically stored in IBM Cloud Object Storage. When you later import the file into a Jupyter Notebook, watsonx.ai generates and inserts the necessary credentials (API key, authentication endpoint and storage endpoint) into your notebook. The provided IBM Cloud Object Storage credentials allow secure access to retrieve files from storage, enabling seamless integration between watsonx.ai Assets and the notebook environment for further processing.

By configuring this connection, we can seamlessly import and process PowerPoint presentations stored in IBM Cloud for AI-powered analysis.

Step 5: Define storage parameters

In this step, we specify the IBM Cloud Object Storage bucket and file details to locate and retrieve the PowerPoint presentation (.pptx) for processing.

Check out this official document to get the bucket configuration details through the IBM Cloud UI.

bucket: The name of the IBM Cloud Object Storage bucket where the file is stored.

object_key: The exact filename of the PowerPoint presentation to be accessed.

bucket = 'Enter your bucket key'
object_key = 'Application Integration client presentation.PPTX [You can replace this with your PPT name]'
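
Before moving on, you can optionally verify the bucket name and object key by listing the bucket contents. This is a small sanity check using the standard boto3-style list_objects_v2 call, not part of the original tutorial flow:

# Optional: list the objects in the bucket to confirm the file name and location
response = cos_client.list_objects_v2(Bucket=bucket)
for obj in response.get('Contents', []):
    print(obj['Key'])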

Step 6: Retrieve the PowerPoint file from IBM Cloud Object Storage

In this step, we download the PowerPoint (.pptx) file from IBM Cloud Object Storage to process it locally.

cos_client.get_object(): Retrieves the file from the specified bucket and object key.

streaming_body.read(): Reads the file contents into a byte stream for further processing.

# Download PPTX file from IBM COS
streaming_body = cos_client.get_object(Bucket=bucket, Key=object_key)['Body']
pptx_bytes = streaming_body.read()

Step 7: Save the PowerPoint file to a local path

In this step, we store the downloaded PowerPoint file (.pptx) locally so it can be processed.

pptx_path: Defines the local filename where the presentation will be saved.

open(pptx_path, 'wb'): Opens the file in write-binary mode to store the retrieved bytes.

f.write(pptx_bytes): Writes the downloaded file content into the newly created .pptx file.

# Save the bytes to a local PPTX file
pptx_path = "downloaded_presentation.pptx"
with open(pptx_path, 'wb') as f:
    f.write(pptx_bytes)

Step 8: Confirm file save location

In this step, we print a confirmation message to ensure that the PowerPoint file has been successfully saved. The `print` function displays the file path where the .pptx file is stored locally.

print(f"PPTX file saved as: {pptx_path}")

Step 9: Extract text and images from the PowerPoint file

In this step, we define a function to process the PowerPoint file (.pptx) and extract its content:

slide_texts: Stores extracted text from each slide.

slide_images: Stores extracted images as Python Imaging Library (PIL) image objects, along with their corresponding slide numbers.

Iterates through slides to extract text from shapes containing textual content and images embedded within slides.

This function separates the text and images from the PPT, allowing the chat agent to easily answer user questions based on the extracted content.

def extract_text_and_images_from_pptx(pptx_path):
    presentation = Presentation(pptx_path)
    slide_texts = []
    slide_images = []
    for slide_number, slide in enumerate(presentation.slides):
        # Extract text from all shapes on the slide
        slide_text = []
        for shape in slide.shapes:
            if hasattr(shape, "text"):
                slide_text.append(shape.text)
        # Store one combined text entry per slide
        slide_texts.append("\n".join(slide_text))
        # Extract images embedded in the slide
        for shape in slide.shapes:
            if hasattr(shape, "image"):
                image_stream = BytesIO(shape.image.blob)
                image = Image.open(image_stream)
                slide_images.append((slide_number, image))
    return slide_texts, slide_images

Step 10: Process the PowerPoint file

In this step, we call the function to extract text and images from the saved PowerPoint file.

pptx_path: Specifies the local path of the downloaded PowerPoint file.

extract_text_and_images_from_pptx(pptx_path): Extracts text and images from the slides.

slide_texts: Stores the extracted text from all slides.

slide_images: Stores the extracted images.

pptx_path = "downloaded_presentation.pptx"
slide_texts, slide_images = extract_text_and_images_from_pptx(pptx_path)

Step 11: Display extracted text from slides

In this step, we print the extracted text from each slide to verify that the PowerPoint content has been processed correctly.

enumerate(slide_texts): Iterates through the extracted text, associating each with its respective slide number.

Separator ('-' * 40): Helps visually distinguish content from different slides.

# Display extracted text and images
for i, text in enumerate(slide_texts):
    print(f"Slide {i + 1} Text:\n{text}\n{'-'*40}")

Step 12: Display extracted images from slides

In this step, we confirm and visualize the extracted images from the PowerPoint slides.

len: Counts the total number of images extracted.

img.show(): Opens each extracted image for review.

You can replace `.show()` with `.save('filename.png')` to store the images locally.

print(f"\nExtracted {len(slide_images)} images.")
for slide_num, img in slide_images:
        img.show() # This will open the image, or you can save it using img.save('filename.png')
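
If you prefer to keep the extracted images on disk instead of opening viewer windows, a minimal variant is shown below. The folder and filenames are illustrative, and depending on the image formats embedded in your deck you may need to convert modes (for example, img.convert('RGB')) before saving:

import os

os.makedirs("extracted_images", exist_ok=True)
for idx, (slide_num, img) in enumerate(slide_images):
    # Filenames are illustrative; adjust the folder and naming scheme as needed
    img.save(os.path.join("extracted_images", f"slide_{slide_num + 1}_image_{idx + 1}.png"))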

Step 13: Load the IBM Granite Vision model

In this step, we initialize the IBM Granite-Vision-3.2-2B model for AI-powered text and image processing.

MODEL_NAME specifies the pre-trained Granite Vision model to be used, and torch.cuda.is_available() checks whether a GPU (CUDA) is available for faster processing; otherwise, the code defaults to the CPU.

# Load the IBM Granite Vision 3.2 2B model and processor
MODEL_NAME = "ibm/granite-vision-3-2-2b"
device = "cuda" if torch.cuda.is_available() else "cpu"

Step 14: Initialize the model and processor

In this step, we load the IBM Granite Vision model and its corresponding processor to handle both text and image inputs.

AutoProcessor.from_pretrained(MODEL_NAME, trust_remote_code=True): Loads the pre-trained processor to format inputs (text and images) for the model.

AutoModelForVision2Seq.from_pretrained(MODEL_NAME, trust_remote_code=True, ignore_mismatched_sizes=True).to(device): Loads the Granite Vision model and transfers it to the available device (GPU or CPU).

where:

trust_remote_code=True: Ensures compatibility with custom model implementations.

ignore_mismatched_sizes=True: Prevents errors if there are minor inconsistencies in model size.

Note: This may take a while to load.

processor = AutoProcessor.from_pretrained(MODEL_NAME, trust_remote_code=True)
model = AutoModelForVision2Seq.from_pretrained(MODEL_NAME, trust_remote_code=True, ignore_mismatched_sizes=True).to(device)
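
If you run on a GPU with limited memory, you can optionally load the model in half precision. This is an optional tweak, not part of the original tutorial flow, and uses the standard torch_dtype argument of from_pretrained:

# Optional: load in float16 on GPU to reduce memory usage (keep float32 on CPU)
model = AutoModelForVision2Seq.from_pretrained(
    MODEL_NAME,
    trust_remote_code=True,
    ignore_mismatched_sizes=True,
    torch_dtype=torch.float16 if device == "cuda" else torch.float32,
).to(device)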

Step 15: Implement text-based AI chat

In this step, we create a chat function that allows users to ask questions based on the extracted text from the PowerPoint slides.

How it works:

  1. The user inputs a question related to the slide content.
  2. The entire extracted text from the PPT is formatted into a structured conversation for the model. This gives the model accurate context to generate a specific answer from the PPT content itself.
  3. apply_chat_template() prepares the input for the AI model in a conversational format.
  4. model.generate() generates a response based on the input query.
  5. processor.decode() decodes the AI-generated response into human-readable text.
  6. The loop continues until the user types `exit` to quit the chat.
# Chat based on Text Only
def chat_with_text(model, processor, slide_texts):
    while True:
        query = input("Ask a question based on the presentation text (or type 'exit' to quit): ")
        if query.lower() == 'exit':
            break
        conversation = [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "\n".join(slide_texts) + f"\nQuery: {query}"},
                ],
            },
        ]
        inputs = processor.apply_chat_template(
            conversation,
            add_generation_prompt=True,
            tokenize=True,
            return_dict=True,
            return_tensors="pt"
        ).to(device)
        outputs = model.generate(**inputs, max_new_tokens=150)
        response = processor.decode(outputs[0], skip_special_tokens=True)
        print("Model Response:", response)

Step 16: Implement image-based AI chat

In this step, we create a chat function that allows users to ask questions about individual images extracted from the PowerPoint slides.

How it works:

  1. The user inputs a question related to slide images.
  2. They specify a slide number to reference a particular image.
  3. The selected image is saved temporarily as `slide_image_temp.png`.
  4. A structured conversation is created, including: (a) The image file path. (b) The user's question.
  5. apply_chat_template() processes the input in a format suitable for the AI model.
  6. model.generate() generates a response based on the image and query.
  7. The response is decoded and printed for the user.
  8. The loop continues until the user types exit to quit.
# Chat based on Images Only
def chat_with_images(model, processor, slide_images):
    while True:
        query = input("Ask a question based on the presentation images (or type 'exit' to quit): ")
        if query.lower() == 'exit':
            break
        slide_num = int(input(f"Enter slide number (1 to {len(slide_images)}) to ask about its image: ")) - 1
        image = slide_images[slide_num][1]
        img_path = "slide_image_temp.png"
        image.save(img_path)  # Save the image temporarily
        conversation = [
            {
                "role": "user",
                "content": [
                    {"type": "image", "url": img_path},
                    {"type": "text", "text": query},
                ],
            },
        ]
        inputs = processor.apply_chat_template(
            conversation,
            add_generation_prompt=True,
            tokenize=True,
            return_dict=True,
            return_tensors="pt"
        ).to(device)
        outputs = model.generate(**inputs, max_new_tokens=150)
        response = processor.decode(outputs[0], skip_special_tokens=True)
        print("Model Response:", response)

Step 17: Run the text-based AI chat

In this step, we call the chat_with_text function, allowing the user to ask questions about the extracted text from the PowerPoint slides.

How it works:

  1. chat_with_text() starts the text-based Q&A session.
  2. The function continuously prompts the user for input, answering questions based on the extracted slide text.
  3. The chat loop continues until the user types exit to quit.
chat_with_text(model, processor, slide_texts)

OUTPUT

Query: Is integration a competitive advantage for your organization?

<|assistant|>

Yes, integration is a competitive advantage for your organization. It helps you move faster and overcome challenges, and can lead to increased costs, inefficiencies, security risks, and a poor user experience, ultimately jeopardizing an organization's competitiveness and ability to thrive in a rapidly evolving business landscape.

Ask a question based on the presentation text (or type 'exit' to quit): exit

When the user asked, "Is integration a competitive advantage for your organization?", the Granite Vision model processed the query using the extracted PowerPoint slide text and generated a response.

The model recognized "integration" as a business concept and provided a structured answer from `slide number 7` explaining both its benefits and risks. It highlighted that integration enhances speed and problem-solving but also noted potential downsides such as increased costs, inefficiencies, security risks and poor user experience if not managed effectively.

This response demonstrates the model's ability to interpret extracted slide text and generate a contextually relevant and well-balanced answer.

Step 18: Run the image-based AI chat

In this step, we call the chat_with_images function, enabling the user to ask questions about images extracted from the PowerPoint slides.

How it works:

  1. chat_with_images() starts the image-based Q&A session.
  2. The function prompts the user to select a specific slide number containing an image.
  3. The selected image is processed and formatted into a structured conversation for the vision model.
  4. The model generates a response based on the image content and the user’s query.
  5. The loop continues until the user types exit to quit.
chat_with_images(model, processor, slide_images)

OUTPUT

Ask a question based on the presentation images (or type 'exit' to quit): what is this image?

Enter slide number (1 to 41) to ask about its image: 2

Model Response: <|system|>

A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.

<|user|>

what is this image?

<|assistant|>

3d model

Ask a question based on the presentation images (or type 'exit' to quit): explain this image

Enter slide number (1 to 41) to ask about its image: 2

Model Response: <|system|>

A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.

<|user|>

explain this image

<|assistant|>

the image is a 3d model of a cube

Ask a question based on the presentation images (or type 'exit' to quit): can you explain this chart?

Enter slide number (1 to 41) to ask about its image: 1

Model Response: <|system|>

A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.

<|user|>

can you explain this chart?

<|assistant|>

Here a is a bar plot titled Maturity progression of the enterprise cloud market from 1st to 4th generation. The x-axis measures Generations Progression while the y-axis plots Maturity progression over the years. The chart shows that as the generations progress, the maturity of the enterprise cloud market increases.

Ask a question based on the presentation images (or type 'exit' to quit): exit

When the user asked image-related questions, the Granite Vision model processed the selected images and generated responses based on its understanding of visual content.

For the question "What is this image?" (slide 2), the model identified the image as a "3D model" but provided a minimal description.

For "Explain this image" (slide 2), the model refined its response, identifying it as "a 3D model of a cube."

For "Can you explain this chart?" (slide 1), the model provided a detailed description of the bar chart, explaining its title, x-axis, y-axis, and overall trend, indicating how enterprise cloud maturity progresses across generations.

This step allows users to interact with visual elements, such as charts, diagrams and infographics, by leveraging the IBM Granite Vision model for intelligent analysis and explanations.

Key takeaways

  1. The model recognizes basic shapes and objects but might provide generalized descriptions for some images.
  2. For charts and diagrams, it provides structured insights, including titles, axis labels, and trends, making it useful for business and data presentations.
  3. The accuracy of responses depends on image clarity and complexity: simpler visuals (such as 3D models) can receive shorter responses, while structured visuals (such as charts) get more detailed insights.

This tutorial demonstrates IBM Granite Vision’s ability to interpret images.
