As AI-driven technologies evolve, image analysis is becoming increasingly sophisticated, enabling deeper insights from visual data. With advancements in machine learning models, AI can process uploaded images, extract metadata and support content moderation at large scale. These analysis tools also contribute to predictive modeling for applications like pricing, visual optimization and image generation, making workflows more cost-effective and efficient. By integrating data-driven approaches, AI enhances automation and decision-making, offering new possibilities for intelligent visual interpretation.
With rapid advancements in computer vision and AI, businesses and researchers are leveraging image-based technologies for a wide range of applications. From image classification and optical character recognition (OCR) to segmentation and video analysis, AI-powered tools are transforming the way we extract and analyze visual information.
In industries like social media, AI enhances content moderation by analyzing images at the pixel level, ensuring compliance and improving engagement. Businesses can also use vision APIs for automated document processing, converting scanned files, spreadsheets and reports into structured data. These applications streamline workflows, improve efficiency and enable organizations to extract meaningful insights from large-scale visual datasets.
These use cases highlight the growing role of AI-powered image analysis across industries. In this tutorial, we focus on applying these capabilities to PowerPoint presentations, enabling interactive Q&A on text and images using advanced computer vision and AI models.
Large language models (LLMs) have revolutionized machine learning by enabling intelligent insights from vast datasets of unstructured text. However, traditional LLMs often struggle with image analysis, making it challenging to extract insights from charts, diagrams and visual elements in presentations.
The IBM Granite™ Vision 3.2 model bridges this gap by pairing language understanding with visual processing in a single model, allowing users to automate multimodal analysis. This tutorial demonstrates how to streamline your workflow by using AI to extract and analyze text and images from PowerPoint (.pptx) files, enabling interactive Q&A for enhanced presentation insights.
In this tutorial, you will learn to build an AI-driven system capable of answering real-time user queries from PowerPoint slides by using both text and images as context. This tutorial will guide you through:
PowerPoint processing: Extract text and images from .pptx files for AI-based analysis.
Text-based Q&A: Use Granite Vision to generate answers based on extracted slide text.
Image-based Q&A: Ask AI to analyze images, charts and diagrams from slides.
Optimized question formulation: Learn how to craft effective questions for accurate and relevant AI responses.
This tutorial leverages cutting-edge AI technologies, including:
1. IBM Granite Vision: A powerful vision-language model (VLM) that processes both text and images.
2. python-pptx: A library for extracting text and images from PowerPoint files.
3. Transformers: A framework to process AI model inputs efficiently.
By the end of this tutorial, you will:
1. Extract and process PowerPoint content (text and images).
2. Use the Granite Vision 3.2 model for AI-driven Q&A on slide content.
3. Ask AI insightful questions about text and images.
4. Improve user interaction with presentations by using AI-powered explanations.
This tutorial is designed for AI developers, researchers, content creators, and business professionals looking to enhance their presentations with AI-driven insights.
You need an IBM Cloud account to create a watsonx.ai project.
While you can choose from several tools, this tutorial walks you through how to set up an IBM account to use a Jupyter Notebook.
1. Log in to watsonx.ai by using your IBM Cloud account.
2. Create a watsonx.ai project. You can get your project ID from within your project. Click the Manage tab. Then, copy the project ID from the Details section of the General page. You need this ID for this tutorial.
3. Create a Jupyter Notebook.
4. Upload the PPTX file as an asset in watsonx.ai.
This step opens a notebook environment where you can copy the code from this tutorial. Alternatively, you can download this notebook to your local system and upload it to your watsonx.ai project as an asset. This tutorial is also available on GitHub.
Note: This tutorial needs GPU infrastructure to run the code, so we recommend using watsonx.ai as illustrated here.
Before we begin extracting and processing PowerPoint content, we need to install the necessary Python libraries:
transformers: Provides access to IBM Granite Vision and other AI models.
torch: A deep learning framework required for running the model efficiently.
python-pptx: A library to extract text and images from PowerPoint (.pptx) files.
Run the following commands to install and upgrade these packages:
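In a notebook cell, for example (versions are unpinned here; pin them if you need reproducibility):

```python
# Install or upgrade the required packages (run once per environment).
!pip install -q --upgrade transformers torch python-pptx
```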
In this step, we import the necessary libraries for processing PowerPoint files, handling images, and interacting with the IBM Granite Vision model:
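A plausible set of imports for the steps that follow (gathered here for convenience; later snippets repeat the ones they need so each stands alone):

```python
import io                                    # in-memory byte streams for image data

import torch                                 # device selection for the model
from PIL import Image                        # working with extracted slide images
from pptx import Presentation                # reading .pptx files
from pptx.enum.shapes import MSO_SHAPE_TYPE  # identifying picture shapes
from transformers import AutoProcessor, AutoModelForVision2Seq  # Granite Vision classes
```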
In this step, we establish a connection to IBM Cloud Object Storage to access and retrieve PowerPoint files stored in the cloud.
You can leverage the Python support, provided through a fork of the boto3 library, with features to make the most of IBM Cloud® Object Storage. Check out the official documentation to get these credentials.
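A minimal sketch of the client setup, using the `ibm_boto3` module from the ibm-cos-sdk package; the credential and endpoint values below are placeholders that watsonx.ai fills in for you when you import an asset into a notebook:

```python
import ibm_boto3
from ibm_botocore.client import Config

cos_client = ibm_boto3.client(
    service_name="s3",
    ibm_api_key_id="YOUR_IBM_CLOUD_API_KEY",  # placeholder API key
    ibm_auth_endpoint="https://iam.cloud.ibm.com/identity/token",
    config=Config(signature_version="oauth"),
    endpoint_url="https://s3.us-south.cloud-object-storage.appdomain.cloud",  # placeholder regional endpoint
)
```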
ibm_boto3.client: Creates a client to interact with IBM Cloud Object Storage.
ibm_api_key_id: Your IBM Cloud API key for authentication.
ibm_auth_endpoint: The authentication endpoint for IBM Cloud.
endpoint_url: The specific cloud object storage (COS) storage endpoint.
Note: When you upload a file as an asset in watsonx.ai, it is automatically stored in IBM Cloud Object Storage. When you later import the file into a Jupyter Notebook, watsonx.ai generates and inserts the necessary credentials (API key, authentication endpoint and storage endpoint) into your notebook. The provided IBM Cloud Object Storage credentials allow secure access to retrieve files from storage, enabling seamless integration between watsonx.ai assets and the notebook environment for further processing.
By configuring this connection, we can seamlessly import and process PowerPoint presentations stored in IBM Cloud for AI-powered analysis.
In this step, we specify the IBM Cloud Object Storage bucket and file details to locate and retrieve the PowerPoint presentation (.pptx) for processing.
Check out this official document to get the bucket configuration details through the IBM Cloud UI.
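For example (both values are placeholders; substitute your own):

```python
bucket = "your-bucket-name"            # placeholder bucket name
object_key = "your_presentation.pptx"  # placeholder filename
```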
bucket: The name of the IBM Cloud Object Storage bucket where the file is stored.
object_key: The exact filename of the PowerPoint presentation to be accessed.
In this step, we download the PowerPoint (.pptx) file from IBM Cloud Object Storage to process it locally.
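A sketch of the download, assuming the `cos_client`, `bucket` and `object_key` defined earlier:

```python
# Fetch the object from Cloud Object Storage and read its contents into a byte string.
streaming_body = cos_client.get_object(Bucket=bucket, Key=object_key)["Body"]
pptx_bytes = streaming_body.read()
```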
cos_client.get_object(): Retrieves the file from the specified bucket and object key.
streaming_body.read(): Reads the file contents into a byte stream for further processing.
In this step, we store the downloaded PowerPoint file (.pptx) locally so it can be processed.
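For example (the local filename is an assumption):

```python
pptx_path = "presentation.pptx"  # assumed local filename

# Write the downloaded bytes to disk in write-binary mode.
with open(pptx_path, "wb") as f:
    f.write(pptx_bytes)
```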
pptx_path: Defines the local filename where the presentation will be saved.
open(pptx_path, 'wb'): Opens the file in write-binary mode to store the retrieved bytes.
f.write(pptx_bytes): Writes the downloaded file content into the newly created .pptx file.
In this step, we print a confirmation message to ensure that the PowerPoint file has been successfully saved. The `print` function displays the file path where the .pptx file is stored locally.
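For example:

```python
print(f"PPTX file saved at: {pptx_path}")
```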
In this step, we define a function to process the PowerPoint file (.pptx) and extract its content:
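A minimal sketch of such a function using python-pptx and PIL:

```python
import io

from PIL import Image
from pptx import Presentation
from pptx.enum.shapes import MSO_SHAPE_TYPE


def extract_text_and_images_from_pptx(pptx_path):
    """Return (slide_texts, slide_images) extracted from a .pptx file."""
    prs = Presentation(pptx_path)
    slide_texts = []
    slide_images = []  # (slide_number, PIL.Image) pairs
    for slide_number, slide in enumerate(prs.slides, start=1):
        texts = []
        for shape in slide.shapes:
            # Text lives in shapes that carry a text frame.
            if shape.has_text_frame and shape.text_frame.text.strip():
                texts.append(shape.text_frame.text)
            # Embedded pictures expose their raw bytes via shape.image.blob.
            if shape.shape_type == MSO_SHAPE_TYPE.PICTURE:
                image = Image.open(io.BytesIO(shape.image.blob))
                slide_images.append((slide_number, image))
        slide_texts.append("\n".join(texts))
    return slide_texts, slide_images
```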
slide_texts: Stores extracted text from each slide.
slide_images: Stores extracted images as Python Imaging Library (PIL) image objects, along with their corresponding slide numbers.
The function iterates through the slides, extracting text from shapes that contain textual content as well as images embedded within the slides.
This function separates the text and images from the PPT, allowing the chat agent to easily answer user questions based on the extracted content.
In this step, we call the function to extract text and images from the saved PowerPoint file.
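For example:

```python
slide_texts, slide_images = extract_text_and_images_from_pptx(pptx_path)
```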
pptx_path: Specifies the local path of the downloaded PowerPoint file.
extract_text_and_images_from_pptx(pptx_path): Extracts text and images from the slides.
slide_texts: Stores the extracted text from all slides.
slide_images: Stores the extracted images.
In this step, we print the extracted text from each slide to verify that the PowerPoint content has been processed correctly.
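For example:

```python
for slide_number, text in enumerate(slide_texts, start=1):
    print(f"Slide {slide_number}:")
    print(text)
    print('-' * 40)  # visual separator between slides
```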
enumerate(slide_texts): Iterates through the extracted text, associating each with its respective slide number.
Separator ('-' * 40): Helps visually distinguish content from different slides.
In this step, we confirm and visualize the extracted images from the PowerPoint slides.
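For example:

```python
print(f"Extracted {len(slide_images)} images")
for slide_number, img in slide_images:
    img.show()  # swap for img.save(f"slide_{slide_number}.png") to store locally
```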
`len()`: Counts the total number of images extracted.
`img.show()`: Opens each extracted image for review.
You can replace `.show()` with `.save('filename.png')` to store the images locally.
In this step, we initialize the IBM Granite-Vision-3.2-2B model for AI-powered text and image processing.
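A sketch of this setup (the Hugging Face model ID below is an assumption):

```python
import torch

MODEL_NAME = "ibm-granite/granite-vision-3.2-2b"  # assumed Hugging Face model ID
device = "cuda" if torch.cuda.is_available() else "cpu"
```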
MODEL_NAME specifies the pre-trained Granite Vision model to be used and torch.cuda.is_available() checks if a GPU (CUDA) is available for faster processing; otherwise, it defaults to the CPU.
In this step, we load the IBM Granite Vision model and its corresponding processor to handle both text and image inputs.
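For example:

```python
from transformers import AutoProcessor, AutoModelForVision2Seq

processor = AutoProcessor.from_pretrained(MODEL_NAME, trust_remote_code=True)
model = AutoModelForVision2Seq.from_pretrained(
    MODEL_NAME, trust_remote_code=True, ignore_mismatched_sizes=True
).to(device)
```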
AutoProcessor.from_pretrained(MODEL_NAME, trust_remote_code=True): Loads the pre-trained processor to format inputs (text and images) for the model.
AutoModelForVision2Seq.from_pretrained(MODEL_NAME, trust_remote_code=True, ignore_mismatched_sizes=True).to(device): Loads the Granite Vision model and transfers it to the available device (GPU or CPU).
trust_remote_code=True: Ensures compatibility with custom model implementations.
ignore_mismatched_sizes=True: Prevents errors if there are minor inconsistencies in model size.
Note: This may take a while to load.
In this step, we create a chat function that allows users to ask questions based on the extracted text from the PowerPoint slides.
How it works: the function gathers the extracted slide text into a single context, prompts the user for a question in a loop, and passes both to the Granite Vision model, printing the generated answer until the user types 'exit'.
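A minimal sketch of such a function, assuming the `processor`, `model` and `device` objects from the earlier steps (the prompt wording and token limit are illustrative):

```python
def chat_with_text(slide_texts, max_new_tokens=200):
    # Combine all slide text into one context string, keyed by slide number.
    context = "\n".join(
        f"Slide {i}: {text}" for i, text in enumerate(slide_texts, start=1)
    )
    while True:
        question = input("Ask a question based on the presentation text (or type 'exit' to quit): ")
        if question.strip().lower() == "exit":
            break
        conversation = [{
            "role": "user",
            "content": [{"type": "text",
                         "text": f"Context:\n{context}\n\nQuestion: {question}"}],
        }]
        # Build model inputs from the chat template.
        inputs = processor.apply_chat_template(
            conversation,
            add_generation_prompt=True,
            tokenize=True,
            return_dict=True,
            return_tensors="pt",
        ).to(device)
        output = model.generate(**inputs, max_new_tokens=max_new_tokens)
        # skip_special_tokens=False keeps the template markers visible in the transcript.
        print(processor.decode(output[0], skip_special_tokens=False))
```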
In this step, we create a chat function that allows users to ask questions about individual images extracted from the PowerPoint slides.
How it works: the function prompts the user for a question and a slide number in a loop, looks up the corresponding extracted image, and passes the image and question to the Granite Vision model, printing the response until the user types 'exit'.
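A minimal sketch, again assuming `processor`, `model` and `device` from the earlier steps; note that passing a PIL image directly through `apply_chat_template` requires a recent transformers version, so treat this as illustrative:

```python
def chat_with_images(slide_images, max_new_tokens=200):
    while True:
        question = input("Ask a question based on the presentation images (or type 'exit' to quit): ")
        if question.strip().lower() == "exit":
            break
        idx = int(input(f"Enter slide number (1 to {len(slide_images)}) to ask about its image: "))
        slide_number, image = slide_images[idx - 1]  # (slide number, PIL image) pairs
        conversation = [{
            "role": "user",
            "content": [
                {"type": "image", "image": image},  # PIL image; recent transformers versions accept this
                {"type": "text", "text": question},
            ],
        }]
        inputs = processor.apply_chat_template(
            conversation,
            add_generation_prompt=True,
            tokenize=True,
            return_dict=True,
            return_tensors="pt",
        ).to(device)
        output = model.generate(**inputs, max_new_tokens=max_new_tokens)
        print("Model Response:", processor.decode(output[0], skip_special_tokens=False))
```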
In this step, we call the chat_with_text function, allowing the user to ask questions about the extracted text from the PowerPoint slides.
How it works: the call hands the extracted slide text to the interactive loop, which keeps prompting for questions until the user types 'exit'.
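For example:

```python
chat_with_text(slide_texts)
```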
OUTPUT
Query: Is integration a competitive advantage for your organization?
<|assistant|>
Yes, integration is a competitive advantage for your organization. It helps you move faster and overcome challenges, and can lead to increased costs, inefficiencies, security risks, and a poor user experience, ultimately jeopardizing an organization's competitiveness and ability to thrive in a rapidly evolving business landscape.
Ask a question based on the presentation text (or type 'exit' to quit): exit
When the user asked, "Is integration a competitive advantage for your organization?", the Granite Vision model processed the query using the extracted PowerPoint slide text and generated a response.
The model recognized "integration" as a business concept and provided a structured answer from `slide number 7` explaining both its benefits and risks. It highlighted that integration enhances speed and problem-solving but also noted potential downsides such as increased costs, inefficiencies, security risks and poor user experience if not managed effectively.
This response demonstrates the model's ability to interpret extracted slide text and generate a contextually relevant and well-balanced answer.
In this step, we call the chat_with_images function, enabling the user to ask questions about images extracted from the PowerPoint slides.
How it works: the call passes the list of extracted images to the interactive loop, which prompts for a question and a slide number until the user types 'exit'.
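For example:

```python
chat_with_images(slide_images)
```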
OUTPUT
Ask a question based on the presentation images (or type 'exit' to quit): what is this image?
Enter slide number (1 to 41) to ask about its image: 2
Model Response: <|system|>
A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.
<|user|>
what is this image?
<|assistant|>
3d model
Ask a question based on the presentation images (or type 'exit' to quit): explain this image
Enter slide number (1 to 41) to ask about its image: 2
Model Response: <|system|>
A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.
<|user|>
explain this image
<|assistant|>
the image is a 3d model of a cube
Ask a question based on the presentation images (or type 'exit' to quit): can you explain this chart?
Enter slide number (1 to 41) to ask about its image: 1
Model Response: <|system|>
A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.
<|user|>
can you explain this chart?
<|assistant|>
Here a is a bar plot titled Maturity progression of the enterprise cloud market from 1st to 4th generation. The x-axis measures Generations Progression while the y-axis plots Maturity progression over the years. The chart shows that as the generations progress, the maturity of the enterprise cloud market increases.
Ask a question based on the presentation images (or type 'exit' to quit): exit
When the user asked image-related questions, the Granite Vision model processed the selected images and generated responses based on its understanding of visual content.
For the question "What is this image?" (slide 2), the model identified the image as a "3D model" but provided a minimal description.
For "Explain this image" (slide 2), the model refined its response, identifying it as "a 3D model of a cube."
For "Can you explain this chart?" (slide 1), the model provided a detailed description of the bar chart, explaining its title, x-axis, y-axis, and overall trend, indicating how enterprise cloud maturity progresses across generations.
This step allows users to interact with visual elements, such as charts, diagrams and infographics, by leveraging the IBM Granite Vision model for intelligent analysis and explanations.
This tutorial demonstrated IBM Granite Vision's ability to interpret both the text and the images in PowerPoint slides, turning static presentations into an interactive Q&A experience.