25 September 2024
In this tutorial, you will discover how to apply the Meta Llama 3.2-90b-vision-instruct model now available on watsonx.ai to computer vision tasks such as image captioning and visual question answering.
Many of us are familiar with unimodal AI applications. A popular unimodal AI tool is ChatGPT. Chatbots like ChatGPT use natural language processing (NLP) to understand user questions and automate responses in real time. However, these unimodal large language models (LLMs) are limited to a single input type: text.
Multimodal artificial intelligence (AI) relies on machine learning models built on neural networks that can process and integrate information from multiple data types using deep learning techniques. The modalities handled by these generative AI models, sometimes called gen AI models, can include text, images, video and audio.
Multimodal AI systems have many real-world use cases, ranging from medical image diagnosis in healthcare settings using computer vision to speech recognition in translation applications. The major advantage of multimodal architectures is their ability to process and combine different types of data within a single system.
Multimodal AI entails three elements:
Input module
The input module is built upon multiple unimodal neural networks for pre-processing different data types. Here, the data is prepared for machine learning algorithms performed in the fusion module.
Fusion module
The combining, aligning and processing of data occurs in this module, and several fusion techniques are commonly used. In early fusion, the raw data of all input types is combined. In mid fusion, data from different modalities is combined at intermediate preprocessing stages. Lastly, in late fusion, each modality is first processed by its own model in the input module and the results are consolidated afterward.
Output module
The output module generates results in the desired output format by making sense of the data produced in the fusion module. These outputs can take on various forms such as text, image or a combination of formats.
Refer to this IBM Technology YouTube video, which walks you through the setup instructions in Steps 1 and 2.
While you can choose from several tools, this tutorial is best suited for a Jupyter Notebook. Jupyter Notebooks are widely used within data science to combine code with various data sources like text, images and data visualizations.
This tutorial walks you through how to set up an IBM account to use a Jupyter Notebook.
Log in to watsonx.ai using your IBM Cloud account.
Create a watsonx.ai project.
You can get your project ID from within your project. Click the Manage tab. Then, copy the project ID from the Details section of the General page. You need this ID for this tutorial.
Create a Jupyter Notebook.
This step opens a notebook environment where you can copy the code from this tutorial and run it on your own. Alternatively, you can download the notebook to your local system and upload it to your watsonx.ai project as an asset. This Jupyter Notebook, along with the datasets used, can be found on GitHub.
To avoid Python package dependency conflicts, we recommend setting up a virtual environment.
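As a minimal sketch, you can create and activate a virtual environment from a terminal and install the packages used later in this tutorial (the environment name is arbitrary, and the package list assumes the imports shown below):

python3 -m venv venv-multimodal          # create the environment
source venv-multimodal/bin/activate      # on Windows: venv-multimodal\Scripts\activate
pip install ibm-watsonx-ai python-dotenv Pillow requests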
For this tutorial, we suggest using the Meta Llama 3.2-90b-vision-instruct model with watsonx.ai to achieve similar results. You are free to use any AI model of your choice that supports multimodal learning. There are several multimodal AI models to choose from, including OpenAI’s GPT-4 V(ision) and DALL-E 3 as well as Google’s Gemini. Ensure you are using the appropriate API if working with other models, as this tutorial is designed for watsonx.ai.
Create a watsonx.ai Runtime service instance (select your appropriate region and choose the Lite plan, which is a free instance).
Generate an API Key.
Associate the watsonx.ai Runtime service instance to the project that you created in watsonx.ai.
We'll need a few libraries and modules for this tutorial. Make sure to import the following ones; if they're not installed, you can resolve this with a quick pip install.
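A minimal set of imports for this workflow might look like the following; Pillow, requests and python-dotenv are assumed to be installed alongside the ibm-watsonx-ai SDK:

import os
import base64
from io import BytesIO

import requests
from dotenv import load_dotenv
from PIL import Image

from ibm_watsonx_ai import Credentials
from ibm_watsonx_ai.foundation_models import ModelInference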
To set our credentials, we need the WATSONX_APIKEY and WATSONX_PROJECT_ID you generated in Step 1. You can either store them in a .env file in your directory or replace the placeholder text. We will also set the URL that serves as the API endpoint.
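As a sketch, the credentials can be loaded from a .env file (or the placeholders replaced with literal strings); the us-south endpoint is one example region:

load_dotenv()
WATSONX_APIKEY = os.getenv("WATSONX_APIKEY", "<your API key>")
WATSONX_PROJECT_ID = os.getenv("WATSONX_PROJECT_ID", "<your project ID>")
URL = "https://us-south.ml.cloud.ibm.com"  # adjust to the region of your watsonx.ai Runtime instance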
We can use the Credentials class to encapsulate our passed credentials.
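For example, using the variables defined above:

credentials = Credentials(url=URL, api_key=WATSONX_APIKEY)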
In this tutorial, we will be working with several images for multimodal AI applications such as image captioning and object detection. The images we will be using can be accessed using the following URLs. We can store these URLs in a list to iteratively encode them.
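The URLs below are placeholders for illustration only; substitute the actual image URLs from the GitHub repository:

image_urls = [
    "https://example.com/city-street.jpg",      # placeholder URL
    "https://example.com/woman-running.jpg",    # placeholder URL
    "https://example.com/flooded-area.jpg",     # placeholder URL
    "https://example.com/nutrition-label.jpg",  # placeholder URL
]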
To gain a better understanding of our data input, let's display the images.
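One way to display the images in the notebook, assuming the requests and Pillow imports above:

for i, url in enumerate(image_urls):
    response = requests.get(url, timeout=30)
    image = Image.open(BytesIO(response.content))
    print(f"url_image_{i}")
    display(image)  # display() is available inside Jupyter notebooks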
Output:
url_image_0 through url_image_3 render inline: a city street, a woman jogging, a flooded area and a nutrition label.
To encode these images in a way that is digestible for the LLM, we will encode each image to base64 bytes and then decode those bytes into a UTF-8 string.
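A sketch of that encoding step, downloading each image and converting its raw bytes to a base64 UTF-8 string:

encoded_images = []
for url in image_urls:
    response = requests.get(url, timeout=30)
    # base64-encode the raw image bytes, then decode to a UTF-8 string for the API payload
    encoded_images.append(base64.b64encode(response.content).decode("utf-8"))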
Now that our images can be passed to the LLM, let's set up a function for our watsonx API calls. The augment_api_request_body function takes the user query and image as parameters and augments the body of the API request. We will use this function in each iteration.
def augment_api_request_body(user_query, image):
    messages = [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": 'You are a helpful assistant. Answer the following user query in 1 or 2 sentences: ' + user_query
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/jpeg;base64,{image}",
                    }
                }
            ]
        }
    ]
    return messages
Let's instantiate the model interface using the ModelInference class. In this tutorial, we will use the meta-llama/llama-3-2-90b-vision-instruct model.
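A minimal instantiation, assuming the credentials and project ID defined earlier (generation parameters can also be supplied via the params argument):

model = ModelInference(
    model_id="meta-llama/llama-3-2-90b-vision-instruct",
    credentials=credentials,
    project_id=WATSONX_PROJECT_ID,
)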
Now, we can loop through our images to see the text descriptions produced by the model in response to the query, "What is happening in this image?"
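A sketch of that loop using the SDK's chat method; the response structure shown is an assumption and may vary slightly across SDK versions:

user_query = "What is happening in this image?"

for encoded_image in encoded_images:
    messages = augment_api_request_body(user_query, encoded_image)
    response = model.chat(messages=messages)
    # extract the generated text from the chat response
    print(response["choices"][0]["message"]["content"])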
Output:
This image shows a busy city street with tall buildings and cars, and people walking on the sidewalk. The street is filled with traffic lights, trees, and street signs, and there are several people crossing the street at an intersection.
The image depicts a woman in athletic attire running down the street, with a building and a car visible in the background. The woman is wearing a yellow hoodie, black leggings, and sneakers, and appears to be engaged in a casual jog or run.
The image depicts a flooded area, with water covering the ground and surrounding buildings. The flooding appears to be severe, with the water level reaching the roofs of some structures.
**Image Description**
* The image shows a close-up of a nutrition label, with a finger pointing to it.
* The label provides detailed information on the nutritional content of a specific food item, including:
+ Calories
+ Fat
+ Sodium
+ Carbohydrates
+ Other relevant information
* The label is displayed on a white background with black text, making it easy to read and understand.
The Llama 3.2-90b-vision-instruct model was able to successfully caption each image in significant detail.
Now that we have showcased the model's ability to perform image-to-text conversion in the previous step, let's ask the model some questions that require object detection. Regarding the second image depicting the woman running outdoors, we will be asking the model, "How many cars are in this image?"
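The same pattern applies here; only the query and the image change (index 1 corresponds to the second image in our list):

vqa_query = "How many cars are in this image?"
messages = augment_api_request_body(vqa_query, encoded_images[1])
response = model.chat(messages=messages)
print(response["choices"][0]["message"]["content"])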
Output: There is one car in this image. The car is parked on the street, to the right of the building.
The model correctly identified the singular vehicle in the image. Now, let's inquire about the damage depicted in the image of flooding.
Output: The damage in this image is severe, with the floodwaters covering a significant portion of the land and potentially causing extensive damage to the structures and crops. The water level appears to be at least waist-deep, which could lead to significant losses for the property owners and farmers in the area.
This response highlights the value that multimodal AI has for domains like insurance. The model was able to detect the severity of the damage caused to the flooded home. This could be a powerful tool for improving insurance claim processing time.
Next, let's ask the model how much sodium is listed on the nutrition label image.
Output: **Sodium Content:** 640 milligrams (mg)
Great! The model was able to discern objects within the images following user queries. We encourage you to try out more queries to further demonstrate the model's performance.
In this tutorial, you used the Llama 3.2-90b-vision-instruct model to perform multimodal operations including image captioning and visual question answering. For more use cases of this model, we encourage you to check out the official documentation page, where you will find more information on the model’s parameters and capabilities. The Python output throughout this tutorial demonstrates the multimodal system's ability to extract information from both text and image data.