25 September 2024
In this tutorial, you will discover how to apply the Meta Llama 3.2-90b-vision-instruct model now available on watsonx.ai to computer vision tasks such as image captioning and visual question answering.
Many of us are familiar with unimodal AI applications. A popular unimodal AI tool is ChatGPT. Chatbots like ChatGPT use natural language processing (NLP) to understand user questions and automate responses in real time. However, these unimodal large language models (LLMs) are limited to a single input type: text.
Multimodal artificial intelligence (AI) relies on machine learning models built on neural networks that can process and integrate information from multiple data types using deep learning techniques. The modalities handled by these generative AI models, sometimes called gen AI models, can include text, images, video and audio.
Multimodal AI systems have many real-world use cases, ranging from medical image diagnosis in healthcare settings using computer vision to speech recognition in translation applications. The major advantage of multimodal architectures is their ability to process and combine different types of data within a single system.
Multimodal AI entails three elements:
Input module
The input module is built upon multiple unimodal neural networks for pre-processing different data types. Here, the data is prepared for machine learning algorithms performed in the fusion module.
Fusion module
The combining, aligning and processing of data occurs in this module, and several fusion techniques are commonly used. In early fusion, the raw data of all input types is combined. In mid fusion, data from different modalities is combined at intermediate preprocessing stages. Lastly, in late fusion, each modality is first processed by its own model in the input module and the results are consolidated afterward.
Output module
The output module generates results in the desired output format by making sense of the data produced in the fusion module. These outputs can take on various forms such as text, image or a combination of formats.
Refer to this IBM Technology YouTube video, which walks you through the setup instructions in Steps 1 and 2.
While you can choose from several tools, this tutorial is best suited for a Jupyter Notebook. Jupyter Notebooks are widely used within data science to combine code with various data sources like text, images and data visualizations.
This tutorial walks you through how to set up an IBM account to use a Jupyter Notebook.
Log in to watsonx.ai using your IBM Cloud account.
Create a watsonx.ai project.
You can get your project ID from within your project. Click the Manage tab. Then, copy the project ID from the Details section of the General page. You need this ID for this tutorial.
Create a Jupyter Notebook.
This step opens a notebook environment where you can copy the code from this tutorial and run it on your own. Alternatively, you can download the notebook to your local system and upload it to your watsonx.ai project as an asset. This Jupyter Notebook, along with the datasets used, can be found on GitHub.
To avoid Python package dependency conflicts, we recommend setting up a virtual environment.
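As a minimal sketch, you can create and activate a virtual environment from a terminal and install the packages used later in this tutorial (the environment name is arbitrary, and the package list assumes the imports shown below):

python3 -m venv venv-multimodal          # create the environment
source venv-multimodal/bin/activate      # on Windows: venv-multimodal\Scripts\activate
pip install ibm-watsonx-ai python-dotenv Pillow requests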
For this tutorial, we suggest using the Meta Llama 3.2-90b-vision-instruct model with watsonx.ai to achieve similar results. You are free to use any AI model of your choice that supports multimodal learning. There are several multimodal AI models to choose from, including OpenAI’s GPT-4 V(ision) and DALL-E 3 as well as Google’s Gemini. Ensure you are using the appropriate API if working with other models, as this tutorial is designed for watsonx.ai.
Create a watsonx.ai Runtime service instance (select your appropriate region and choose the Lite plan, which is a free instance).
Generate an API Key.
Associate the watsonx.ai Runtime service instance to the project that you created in watsonx.ai.
We'll need a few libraries and modules for this tutorial. Make sure to import the following ones; if they're not installed, you can resolve this with a quick pip install.
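A minimal set of imports for this workflow might look like the following; Pillow, requests and python-dotenv are assumed to be installed alongside the ibm-watsonx-ai SDK:

import os
import base64
from io import BytesIO

import requests
from dotenv import load_dotenv
from PIL import Image

from ibm_watsonx_ai import Credentials
from ibm_watsonx_ai.foundation_models import ModelInference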
To set our credentials, we need the WATSONX_APIKEY and WATSONX_PROJECT_ID you generated in Step 1. You can either store them in a .env file in your directory or replace the placeholder text. We will also set the URL that serves as the API endpoint.
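As a sketch, the credentials can be loaded from a .env file (or the placeholders replaced with literal strings); the us-south endpoint is one example region:

load_dotenv()
WATSONX_APIKEY = os.getenv("WATSONX_APIKEY", "<your API key>")
WATSONX_PROJECT_ID = os.getenv("WATSONX_PROJECT_ID", "<your project ID>")
URL = "https://us-south.ml.cloud.ibm.com"  # adjust to the region of your watsonx.ai Runtime instance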
We can use the Credentials class to encapsulate our passed credentials.
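For example, using the variables defined above:

credentials = Credentials(url=URL, api_key=WATSONX_APIKEY)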
In this tutorial, we will be working with several images for multimodal AI applications such as image captioning and object detection. The images we will be using can be accessed using the following URLs. We can store these URLs in a list to iteratively encode them.
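The URLs below are placeholders for illustration only; substitute the actual image URLs from the GitHub repository:

image_urls = [
    "https://example.com/city-street.jpg",      # placeholder URL
    "https://example.com/woman-running.jpg",    # placeholder URL
    "https://example.com/flooded-area.jpg",     # placeholder URL
    "https://example.com/nutrition-label.jpg",  # placeholder URL
]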
To gain a better understanding of our data input, let's display the images.
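One way to display the images in the notebook, assuming the requests and Pillow imports above:

for i, url in enumerate(image_urls):
    response = requests.get(url, timeout=30)
    image = Image.open(BytesIO(response.content))
    print(f"url_image_{i}")
    display(image)  # display() is available inside Jupyter notebooks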
Output:
url_image_0 through url_image_3 render inline: a city street, a woman jogging, a flooded area and a nutrition label.
To encode these images in a way that is digestible for the LLM, we will encode each image to base64 bytes and then decode those bytes into a UTF-8 string.
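A sketch of that encoding step, downloading each image and converting its raw bytes to a base64 UTF-8 string:

encoded_images = []
for url in image_urls:
    response = requests.get(url, timeout=30)
    # base64-encode the raw image bytes, then decode to a UTF-8 string for the API payload
    encoded_images.append(base64.b64encode(response.content).decode("utf-8"))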
Now that our images can be passed to the LLM, let's set up a function for our watsonx API calls. The augment_api_request_body function takes the user query and image as parameters and augments the body of the API request. We will use this function in each iteration.
def augment_api_request_body(user_query, image):
    messages = [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": 'You are a helpful assistant. Answer the following user query in 1 or 2 sentences: ' + user_query
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/jpeg;base64,{image}",
                    }
                }
            ]
        }
    ]
    return messages
Let's instantiate the model interface using the ModelInference class. In this tutorial, we will use the meta-llama/llama-3-2-90b-vision-instruct model.
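A minimal instantiation, assuming the credentials and project ID defined earlier (generation parameters can also be supplied via the params argument):

model = ModelInference(
    model_id="meta-llama/llama-3-2-90b-vision-instruct",
    credentials=credentials,
    project_id=WATSONX_PROJECT_ID,
)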
Now, we can loop through our images to see the text descriptions produced by the model in response to the query, "What is happening in this image?"
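A sketch of that loop using the SDK's chat method; the response structure shown is an assumption and may vary slightly across SDK versions:

user_query = "What is happening in this image?"

for encoded_image in encoded_images:
    messages = augment_api_request_body(user_query, encoded_image)
    response = model.chat(messages=messages)
    # extract the generated text from the chat response
    print(response["choices"][0]["message"]["content"])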
Output:
This image shows a busy city street with tall buildings and cars, and people walking on the sidewalk. The street is filled with traffic lights, trees, and street signs, and there are several people crossing the street at an intersection.
The image depicts a woman in athletic attire running down the street, with a building and a car visible in the background. The woman is wearing a yellow hoodie, black leggings, and sneakers, and appears to be engaged in a casual jog or run.
The image depicts a flooded area, with water covering the ground and surrounding buildings. The flooding appears to be severe, with the water level reaching the roofs of some structures.
**Image Description**
* The image shows a close-up of a nutrition label, with a finger pointing to it.
* The label provides detailed information on the nutritional content of a specific food item, including:
+ Calories
+ Fat
+ Sodium
+ Carbohydrates
+ Other relevant information
* The label is displayed on a white background with black text, making it easy to read and understand.
The Llama 3.2-90b-vision-instruct model was able to successfully caption each image in significant detail.
Now that we have showcased the model's ability to perform image-to-text conversion in the previous step, let's ask the model some questions that require object detection. Regarding the second image depicting the woman running outdoors, we will be asking the model, "How many cars are in this image?"
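The same pattern applies here; only the query and the image change (index 1 corresponds to the second image in our list):

vqa_query = "How many cars are in this image?"
messages = augment_api_request_body(vqa_query, encoded_images[1])
response = model.chat(messages=messages)
print(response["choices"][0]["message"]["content"])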
Output: There is one car in this image. The car is parked on the street, to the right of the building.
The model correctly identified the singular vehicle in the image. Now, let's inquire about the damage depicted in the image of flooding.
Output: The damage in this image is severe, with the floodwaters covering a significant portion of the land and potentially causing extensive damage to the structures and crops. The water level appears to be at least waist-deep, which could lead to significant losses for the property owners and farmers in the area.
This response highlights the value that multimodal AI has for domains like insurance. The model was able to detect the severity of the damage caused to the flooded home. This could be a powerful tool for improving insurance claim processing time.
Next, let's ask the model how much sodium is listed on the nutrition label image.
Output: **Sodium Content:** 640 milligrams (mg)
Great! The model was able to discern objects within the images following user queries. We encourage you to try out more queries to further demonstrate the model's performance.
In this tutorial, you used the Llama 3.2-90b-vision-instruct model to perform multimodal operations including image captioning and visual question answering. For more use cases of this model, we encourage you to check out the official documentation page, where you will find more information on the model’s parameters and capabilities. The Python output throughout this tutorial demonstrates the multimodal system's ability to extract information from both text and image data.