10 December 2024
In this tutorial, you will discover how to apply Mistral AI's Pixtral 12B multimodal model, now available on watsonx.ai, to multimodal tasks such as image captioning and visual question answering.
In September 2024, Mistral AI launched Pixtral 12B, an open-source large language model (LLM) under the Apache 2.0 license.
With 12 billion parameters, the multimodal model is built on Mistral AI's Nemo 12B LLM. Pixtral 12B has two components: a vision encoder that tokenizes images and a multimodal transformer decoder that predicts the next text token given a sequence of text and images. The vision encoder has 400 million parameters and supports variable image sizes.
The model excels at multiple use cases, including understanding graphs, diagrams, charts and documents in high resolution, which may be used for document question answering, instruction following or retrieval augmented generation (RAG) tasks. Additionally, Pixtral 12B has a 128,000-token context window, which allows for the consumption of multiple images simultaneously.
In terms of benchmarks, Pixtral 12B outperforms various models, including Qwen2-VL, Gemini-1.5 Flash 8B and Claude-3 Haiku. For certain benchmarks, including DocVQA (ANLS) and VQAv2 (VQA Match), the model outperforms OpenAI's GPT-4o and Claude-3.5 Sonnet.
Besides watsonx.ai, Pixtral 12B is also available on Hugging Face, on Le Chat, Mistral's conversational chatbot, and via API endpoint through Mistral's La Plateforme.
Refer to this IBM Technology YouTube video, which walks you through the setup instructions covered in steps 1 and 2 below.
While you can choose from several tools, this tutorial is best suited for a Jupyter Notebook. Jupyter Notebooks are widely used within data science to combine code with various data sources like text, images and data visualizations.
This tutorial walks you through how to set up an IBM account to use a Jupyter Notebook.
Log in to watsonx.ai using your IBM Cloud account. Please note that Pixtral 12B is currently only available in the IBM Cloud Frankfurt and London (Europe) regions.
Create a watsonx.ai project.
You can get your project ID from within your project. Click the Manage tab. Then, copy the project ID from the Details section of the General page. You need this ID for this tutorial.
Create a Jupyter Notebook.
This step opens a notebook environment where you can copy the code from this tutorial and work through it on your own. Alternatively, you can download the notebook to your local system and upload it to your watsonx.ai project as an asset. This Jupyter Notebook, along with the datasets used, can be found on GitHub.
To avoid Python package dependency conflicts, we recommend setting up a virtual environment.
We'll need a few libraries and modules for this tutorial. Make sure to import the following ones; if they're not installed, you can resolve this with a quick pip install.
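A minimal sketch of that install-and-import cell, assuming the ibm-watsonx-ai SDK and the requests library (the exact package list in the notebook may differ):

```python
# Install the watsonx.ai SDK and requests if they are not already available
# (run in a notebook cell).
%pip install ibm-watsonx-ai requests

import base64          # encode downloaded images for the model
import getpass         # prompt for the API key without echoing it
import requests        # fetch images from their URLs

from ibm_watsonx_ai import Credentials
from ibm_watsonx_ai.foundation_models import ModelInference
```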
When running the following cell, input the WATSONX_EU_APIKEY and WATSONX_EU_PROJECT_ID that you created in steps 1 and 2. We will also set the URL serving as the API endpoint.
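For example, a sketch of that cell assuming the Frankfurt endpoint (use the London endpoint if your project lives in that region):

```python
# Prompt for credentials at runtime so they are not stored in the notebook.
WATSONX_EU_APIKEY = getpass.getpass("Enter your watsonx.ai API key: ")
WATSONX_EU_PROJECT_ID = getpass.getpass("Enter your watsonx.ai project ID: ")

# Frankfurt endpoint; use "https://eu-gb.ml.cloud.ibm.com" for London.
URL = "https://eu-de.ml.cloud.ibm.com"
```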
We can use the Credentials class to encapsulate our passed credentials.
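For example, using the variable names from the previous cell:

```python
# Bundle the endpoint URL and API key into a Credentials object for the SDK.
credentials = Credentials(url=URL, api_key=WATSONX_EU_APIKEY)
```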
In this tutorial, we will be working with several images for multimodal AI applications such as image captioning and object detection. The images we will be using can be accessed using the following URLs. We can store these URLs in a list to iteratively encode them.
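The exact image URLs are not reproduced here, so the list below uses hypothetical placeholders; substitute the links from the notebook on GitHub.

```python
# Hypothetical placeholder URLs -- replace these with the image links from
# the tutorial's GitHub repository.
url_image_0 = "https://example.com/field-of-flowers.jpg"
url_image_1 = "https://example.com/online-shopping.jpg"
url_image_2 = "https://example.com/car-in-snow.jpg"
url_image_3 = "https://example.com/rag-flowchart.png"

image_urls = [url_image_0, url_image_1, url_image_2, url_image_3]
```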
To gain a better understanding of our data input, let's display the images.
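One way to do this in a notebook, assuming the image_urls list above, is with IPython's display helpers:

```python
from IPython.display import Image, display

# Fetch each image and render it inline in the notebook.
for url in image_urls:
    display(Image(requests.get(url).content, width=400))
```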
Output: the four images (url_image_0, url_image_1, url_image_2 and url_image_3) are displayed inline.
To encode these images in a way that is digestible for the LLM, we will be encoding the images to bytes that we then decode to UTF-8 representation.
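A sketch of that encoding step, assuming the image_urls list defined earlier:

```python
# Download each image, base64-encode the raw bytes and decode them to a
# UTF-8 string so they can be embedded in the API request body.
encoded_images = [
    base64.b64encode(requests.get(url).content).decode("utf-8")
    for url in image_urls
]
```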
Now that our images can be passed to the LLM, let's set up a function for our watsonx API calls. The augment_api_request_body function takes the user query and image as parameters and augments the body of the API request. We will use this function in each iteration.
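The function pairs the text query with the base64-encoded image, passed inline as a data URL. A minimal version might look like this (the JPEG media type is an assumption):

```python
def augment_api_request_body(user_query, image):
    """Build the chat messages payload for one text query and one image."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": user_query},
                {
                    "type": "image_url",
                    # The encoded image is embedded directly as a data URL.
                    "image_url": {"url": "data:image/jpeg;base64," + image},
                },
            ],
        }
    ]
```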
Let's instantiate the model interface using the ModelInference class. In this tutorial, we will use the mistralai/pixtral-12b model.
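A sketch of the instantiation, assuming the credentials and project ID set earlier (the max_tokens value is an arbitrary choice):

```python
model = ModelInference(
    model_id="mistralai/pixtral-12b",
    credentials=credentials,
    project_id=WATSONX_EU_PROJECT_ID,
    params={"max_tokens": 300},  # cap the length of each generated response
)
```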
Now, we can loop through our images to see the text descriptions produced by the model in response to the query, "What is happening in this image?"
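Putting the pieces together, assuming the helpers defined above and the chat interface of a recent ibm-watsonx-ai SDK:

```python
for encoded_image in encoded_images:
    messages = augment_api_request_body("What is happening in this image?", encoded_image)
    response = model.chat(messages=messages)
    # Extract and print the generated caption for each image.
    print(response["choices"][0]["message"]["content"])
```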
Output:
The image depicts a vibrant field of flowers in full bloom under a clear blue sky, with the sun shining brightly, creating a serene and picturesque scene.
In the image, a person is seated at a table, using a laptop while holding a credit card. There are some apples in a bowl on the table next to the laptop.
A person is standing next to a heavily snow-covered car, holding a red umbrella to shield themselves from the falling snow.
The image depicts a workflow for processing and storing data, likely for machine learning or data analysis. It starts with loading source data (like HTML or XML documents), transforming the data into a suitable format, embedding it into numerical vectors, storing these vectors in a database, and finally retrieving the data when needed.
The Pixtral 12B model was able to successfully caption each image in significant detail.
Now that we have showcased the model's ability to perform image captioning, let's ask the model some questions that require object detection. For the second image, which depicts a woman shopping online, we will ask the model, "What does the woman have in her hand?"
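The same call pattern works here; we simply swap in the relevant image and question (indices follow the order of the image_urls list above):

```python
# Ask about the second image (the online-shopping scene).
messages = augment_api_request_body(
    "What does the woman have in her hand?", encoded_images[1]
)
response = model.chat(messages=messages)
print(response["choices"][0]["message"]["content"])

# The later questions reuse the same pattern, e.g. encoded_images[2] with a
# question about the snow-covered car, or encoded_images[3] with a question
# about the steps in the flowchart.
```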
Output: The woman is holding a credit card in her hand.
The model correctly identified the object in the woman's hand. Now, let's inquire about the issue in the image of the car covered in snow.
Output: The car is likely stuck in the deep snow, making it difficult or impossible to move.
This response highlights the value that multimodal AI has for domains like insurance. The model was able to detect the problem with the car stuck in the snow. This could be a powerful tool for improving insurance claim processing time.
Next, let's ask the model about the steps in the flowchart image.
Output: The diagram illustrates a process involving several steps: "Load," "Transform," "Embed," "Store," and "Retrieve." This sequence likely represents a workflow for processing and storing data, transforming it into embedded vectors for efficient storage and retrieval.
Great! The model was able to discern objects within the images following user queries. We encourage you to try out more queries to further demonstrate the model's performance.
In this tutorial, you used the Pixtral 12B model to perform multimodal operations including image captioning and visual question answering.
To try other multimodal models, check out this tutorial on Meta's multimodal model Llama 3.2 on watsonx.ai.