10 December 2024
In this tutorial, you will discover how to apply Mistral AI's Pixtral 12B multimodal model, now available on watsonx.ai, to multimodal tasks such as image captioning and visual question answering.
In September 2024, Mistral AI launched Pixtral 12B, an open-source large language model (LLM) under the Apache 2.0 license.
With 12 billion parameters, the multimodal model is built on Mistral AI's Nemo 12B LLM. Pixtral 12B has two components: a vision encoder that tokenizes images and a multimodal transformer decoder that predicts the next text token given a sequence of text and images. The vision encoder has 400 million parameters and supports variable image sizes.
The model excels at multiple use cases, including understanding graphs, diagrams, charts and documents in high resolution, which may be used for document question answering, instruction following or retrieval augmented generation (RAG) tasks. Additionally, Pixtral 12B has a 128,000-token context window, which allows for the consumption of multiple images simultaneously.
In terms of benchmarks, Pixtral 12B outperforms various models, including Qwen2-VL, Gemini-1.5 Flash 8B and Claude-3 Haiku. For certain benchmarks, including DocVQA (ANLS) and VQAv2 (VQA Match), the model outperforms OpenAI's GPT-4o and Claude-3.5 Sonnet.
Besides watsonx.ai, Pixtral 12B is also available on Hugging Face, on Le Chat, Mistral's conversational chatbot, and via API endpoint through Mistral's La Plateforme.
Refer to this IBM Technology YouTube video, which walks you through the setup instructions covered in steps 1 and 2 below.
While you can choose from several tools, this tutorial is best suited for a Jupyter Notebook. Jupyter Notebooks are widely used within data science to combine code with various data sources like text, images and data visualizations.
This tutorial walks you through how to set up an IBM account to use a Jupyter Notebook.
Log in to watsonx.ai using your IBM Cloud account. Please note that Pixtral 12B is currently only available in the IBM Cloud Frankfurt and London (Europe) regions.
Create a watsonx.ai project.
You can get your project ID from within your project. Click the Manage tab. Then, copy the project ID from the Details section of the General page. You need this ID for this tutorial.
Create a Jupyter Notebook.
This step opens a notebook environment where you can copy the code from this tutorial and work through it on your own. Alternatively, you can download the notebook to your local system and upload it to your watsonx.ai project as an asset. This Jupyter Notebook, along with the datasets used, can be found on GitHub.
To avoid Python package dependency conflicts, we recommend setting up a virtual environment.
We'll need a few libraries and modules for this tutorial. Make sure to import the following ones; if they're not installed, you can resolve this with a quick pip install.
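A minimal sketch of that install-and-import cell, assuming the ibm-watsonx-ai SDK and the requests library (the exact package list in the notebook may differ):

```python
# Install the watsonx.ai SDK and requests if they are not already available
# (run in a notebook cell).
%pip install ibm-watsonx-ai requests

import base64          # encode downloaded images for the model
import getpass         # prompt for the API key without echoing it
import requests        # fetch images from their URLs

from ibm_watsonx_ai import Credentials
from ibm_watsonx_ai.foundation_models import ModelInference
```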
When running the following cell, input the WATSONX_EU_APIKEY and WATSONX_EU_PROJECT_ID that you created in steps 1 and 2. We will also set the URL serving as the API endpoint.
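For example, a sketch of that cell assuming the Frankfurt endpoint (use the London endpoint if your project lives in that region):

```python
# Prompt for credentials at runtime so they are not stored in the notebook.
WATSONX_EU_APIKEY = getpass.getpass("Enter your watsonx.ai API key: ")
WATSONX_EU_PROJECT_ID = getpass.getpass("Enter your watsonx.ai project ID: ")

# Frankfurt endpoint; use "https://eu-gb.ml.cloud.ibm.com" for London.
URL = "https://eu-de.ml.cloud.ibm.com"
```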
We can use the Credentials class to encapsulate our passed credentials.
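For example, using the variable names from the previous cell:

```python
# Bundle the endpoint URL and API key into a Credentials object for the SDK.
credentials = Credentials(url=URL, api_key=WATSONX_EU_APIKEY)
```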
In this tutorial, we will be working with several images for multimodal AI applications such as image captioning and object detection. The images we will be using can be accessed using the following URLs. We can store these URLs in a list to iteratively encode them.
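The exact image URLs are not reproduced here, so the list below uses hypothetical placeholders; substitute the links from the notebook on GitHub.

```python
# Hypothetical placeholder URLs -- replace these with the image links from
# the tutorial's GitHub repository.
url_image_0 = "https://example.com/field-of-flowers.jpg"
url_image_1 = "https://example.com/online-shopping.jpg"
url_image_2 = "https://example.com/car-in-snow.jpg"
url_image_3 = "https://example.com/rag-flowchart.png"

image_urls = [url_image_0, url_image_1, url_image_2, url_image_3]
```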
To gain a better understanding of our data input, let's display the images.
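One way to do this in a notebook, assuming the image_urls list above, is with IPython's display helpers:

```python
from IPython.display import Image, display

# Fetch each image and render it inline in the notebook.
for url in image_urls:
    display(Image(requests.get(url).content, width=400))
```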
Output: the four images (url_image_0, url_image_1, url_image_2 and url_image_3) are displayed inline.
To encode these images in a way that is digestible for the LLM, we will be encoding the images to bytes that we then decode to UTF-8 representation.
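A sketch of that encoding step, assuming the image_urls list defined earlier:

```python
# Download each image, base64-encode the raw bytes and decode them to a
# UTF-8 string so they can be embedded in the API request body.
encoded_images = [
    base64.b64encode(requests.get(url).content).decode("utf-8")
    for url in image_urls
]
```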
Now that our images can be passed to the LLM, let's set up a function for our watsonx API calls. The augment_api_request_body function takes the user query and image as parameters and augments the body of the API request. We will use this function in each iteration.
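The function pairs the text query with the base64-encoded image, passed inline as a data URL. A minimal version might look like this (the JPEG media type is an assumption):

```python
def augment_api_request_body(user_query, image):
    """Build the chat messages payload for one text query and one image."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": user_query},
                {
                    "type": "image_url",
                    # The encoded image is embedded directly as a data URL.
                    "image_url": {"url": "data:image/jpeg;base64," + image},
                },
            ],
        }
    ]
```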
Let's instantiate the model interface using the ModelInference class. In this tutorial, we will use the mistralai/pixtral-12b model.
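A sketch of the instantiation, assuming the credentials and project ID set earlier (the max_tokens value is an arbitrary choice):

```python
model = ModelInference(
    model_id="mistralai/pixtral-12b",
    credentials=credentials,
    project_id=WATSONX_EU_PROJECT_ID,
    params={"max_tokens": 300},  # cap the length of each generated response
)
```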
Now, we can loop through our images to see the text descriptions produced by the model in response to the query, "What is happening in this image?"
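Putting the pieces together, assuming the helpers defined above and the chat interface of a recent ibm-watsonx-ai SDK:

```python
for encoded_image in encoded_images:
    messages = augment_api_request_body("What is happening in this image?", encoded_image)
    response = model.chat(messages=messages)
    # Extract and print the generated caption for each image.
    print(response["choices"][0]["message"]["content"])
```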
Output:
The image depicts a vibrant field of flowers in full bloom under a clear blue sky, with the sun shining brightly, creating a serene and picturesque scene.
In the image, a person is seated at a table, using a laptop while holding a credit card. There are some apples in a bowl on the table next to the laptop.
A person is standing next to a heavily snow-covered car, holding a red umbrella to shield themselves from the falling snow.
The image depicts a workflow for processing and storing data, likely for machine learning or data analysis. It starts with loading source data (like HTML or XML documents), transforming the data into a suitable format, embedding it into numerical vectors, storing these vectors in a database, and finally retrieving the data when needed.
The Pixtral 12B model was able to successfully caption each image in significant detail.
Now that we have showcased the model's ability to perform image captioning, let's ask the model some questions that require object detection. For the second image, which depicts a woman shopping online, we will ask the model, "What does the woman have in her hand?"
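The same call pattern works here; we simply swap in the relevant image and question (indices follow the order of the image_urls list above):

```python
# Ask about the second image (the online-shopping scene).
messages = augment_api_request_body(
    "What does the woman have in her hand?", encoded_images[1]
)
response = model.chat(messages=messages)
print(response["choices"][0]["message"]["content"])

# The later questions reuse the same pattern, e.g. encoded_images[2] with a
# question about the snow-covered car, or encoded_images[3] with a question
# about the steps in the flowchart.
```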
Output: The woman is holding a credit card in her hand.
The model correctly identified the object in the woman's hand. Now, let's inquire about the issue in the image of the car covered in snow.
Output: The car is likely stuck in the deep snow, making it difficult or impossible to move.
This response highlights the value that multimodal AI has for domains like insurance. The model was able to detect the problem with the car stuck in the snow. This could be a powerful tool for improving insurance claim processing time.
Next, let's ask the model about the steps in the flowchart image.
Output: The diagram illustrates a process involving several steps: "Load," "Transform," "Embed," "Store," and "Retrieve." This sequence likely represents a workflow for processing and storing data, transforming it into embedded vectors for efficient storage and retrieval.
Great! The model was able to discern objects within the images following user queries. We encourage you to try out more queries to further demonstrate the model's performance.
In this tutorial, you used the Pixtral 12B model to perform multimodal operations including image captioning and visual question answering.
To try other multimodal models, check out this tutorial on Meta's multimodal model Llama 3.2 on watsonx.ai.