25 September 2024
In this tutorial, you will discover how to apply the Meta Llama 3.2-90b-vision-instruct model, now available on watsonx.ai, to computer vision tasks such as image captioning and visual question answering.
Many of us are familiar with unimodal AI applications. A popular unimodal AI tool is ChatGPT. Chatbots like ChatGPT use natural language processing (NLP) to understand user questions and automate responses in real time. Unimodal large language models (LLMs) like these are limited to a single type of input: text.
Multimodal artificial intelligence (AI), by contrast, relies on machine learning models built on neural networks that can process and integrate information from multiple data types using deep learning techniques. The modalities a generative AI (gen AI) model can accept and produce include text, images, video and audio.
Multimodal AI systems have many real-world use cases, ranging from medical image diagnosis in healthcare settings using computer vision to speech recognition in translation applications. The major advantage of multimodal architectures is their ability to process and combine different types of data, which can improve workflows across many domains.
Multimodal AI entails three elements:
Input module
The input module is built upon multiple unimodal neural networks for pre-processing different data types. Here, the data is prepared for machine learning algorithms performed in the fusion module.
Fusion module
The combining, aligning and processing of data occurs in this module, and the fusion process occurs for each data modality. Several techniques are commonly used here. In early fusion, the raw data of all input types is combined before modeling. In mid-fusion, data from different modalities is combined at intermediate preprocessing or encoding stages. In late fusion, each modality is first processed by its own model in the input module, and those models' outputs are then consolidated (a minimal sketch contrasting early and late fusion follows the output module description below).
Output module
The output module generates results in the desired output format by making sense of the data produced in the fusion module. These outputs can take on various forms such as text, image or a combination of formats.
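To make the distinction between early and late fusion more concrete, here is a minimal, purely illustrative sketch in Python. The feature vectors, dimensions and per-modality models are made up for demonstration; real multimodal systems use learned encoders and far larger representations.

```python
import numpy as np

# Hypothetical pre-extracted features for one example (dimensions are arbitrary).
image_features = np.random.rand(512)  # e.g., output of an image encoder
text_features = np.random.rand(256)   # e.g., output of a text encoder

# Early fusion: combine the modalities into one representation up front,
# then hand the joint vector to a single downstream model.
early_fused = np.concatenate([image_features, text_features])  # shape (768,)

# Late fusion: run a separate model per modality, then consolidate the outputs.
def image_model(features):
    return float(features.mean())  # stand-in for a per-modality prediction

def text_model(features):
    return float(features.mean())

late_fused_score = (image_model(image_features) + text_model(text_features)) / 2
```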
While you can choose from several tools, this tutorial is best suited for a Jupyter Notebook. Jupyter Notebooks are widely used in data science because they combine code with narrative text, images and data visualizations.
This tutorial walks you through how to set up an IBM account to use a Jupyter Notebook.
Log in to watsonx.ai using your IBM Cloud account.
Create a watsonx.ai project.
You can get your project ID from within your project. Click the Manage tab. Then, copy the project ID from the Details section of the General page. You need this ID for this tutorial.
Create a Jupyter Notebook.
This step will open a Notebook environment where you can copy the code from this tutorial to run the multimodal examples on your own. Alternatively, you can download this notebook to your local system and upload it to your watsonx.ai project as an asset. This Jupyter Notebook, along with the datasets used, can be found on GitHub.
For this tutorial, we suggest using the Meta Llama 3.2-90b-vision-instruct model with watsonx.ai to achieve similar results. You are free to use any AI model of your choice that supports multimodal input. There are several multimodal AI models to choose from, including OpenAI's GPT-4 V(ision) and DALL-E 3 as well as Google's Gemini. Ensure you are using the appropriate API if working with other models, as this tutorial is designed for watsonx.ai.
We'll need a few libraries and modules for this tutorial. Make sure to import the following ones; if they're not installed, you can resolve this with a quick pip install.
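The exact list depends on how you structure your notebook; a minimal set covering the sketches in this tutorial might look like the following (install anything missing with `pip install requests pillow`).

```python
import base64           # encode image bytes for the API request body
import requests         # send HTTP requests to the watsonx.ai API
from io import BytesIO  # wrap raw bytes so Pillow can open them
from PIL import Image   # open and display images in the notebook
```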
In this tutorial, the API requests will require Bearer authentication. To obtain your Bearer token, please run the following commands in your terminal and insert your watsonx API key where indicated. The token will begin with "Bearer " and will be followed by a long string of characters. For more detailed instructions, please reference the official documentation.
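The official documentation has the authoritative commands; the sketch below shows one common approach using the IBM Cloud IAM token endpoint. The environment variable name and the use of jq to format the result are assumptions for convenience.

```bash
# Store your watsonx API key in an environment variable (placeholder name).
export WATSONX_API_KEY="<YOUR_API_KEY_HERE>"

# Exchange the API key for an IAM access token and print it with the
# "Bearer " prefix (jq is used here purely for formatting convenience).
curl -s -X POST "https://iam.cloud.ibm.com/identity/token" \
  -H "Content-Type: application/x-www-form-urlencoded" \
  -d "grant_type=urn:ibm:params:oauth:grant-type:apikey&apikey=${WATSONX_API_KEY}" \
  | jq -r '"Bearer " + .access_token'
```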
Note that this token expires an hour after generation. This means you will need to run the final command again once the token expires to continue with the tutorial.
Once you copy your bearer token from your terminal, paste it in the following code block along with your project ID where indicated. Ensure that your token includes the "Bearer " prefix.
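A sketch of that code block; the variable names bearer_token and project_id are placeholders reused in the later snippets of this tutorial.

```python
# Paste the full token copied from your terminal, including the "Bearer " prefix.
bearer_token = "Bearer <YOUR_TOKEN_HERE>"

# Paste the project ID copied from the Manage tab of your watsonx.ai project.
project_id = "<YOUR_PROJECT_ID_HERE>"
```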
In this tutorial, we will be working with several images for multimodal AI applications such as image captioning and object detection. The images we will be using can be accessed using the following URLs. We can store these URLs in a list to iteratively encode them.
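The URLs below are placeholders standing in for the image links in the GitHub repository mentioned earlier; substitute the actual URLs. Storing them in a list lets us encode and query the images in a loop.

```python
# Placeholder URLs -- replace these with the tutorial's actual image links.
image_urls = [
    "https://example.com/city_street.jpg",      # busy city street
    "https://example.com/woman_running.jpg",    # woman running outdoors
    "https://example.com/flooded_area.jpg",     # flooded buildings and fields
    "https://example.com/nutrition_label.jpg",  # close-up of a nutrition label
]
```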
To gain a better understanding of our data input, let's display the images.
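One way to do this, assuming the image_urls list above; the display helper is available inside Jupyter Notebooks.

```python
for i, url in enumerate(image_urls):
    response = requests.get(url)                  # download the raw image bytes
    image = Image.open(BytesIO(response.content)) # load them into a PIL image
    print(f"url_image_{i}")                       # label matching the output below
    display(image)                                # renders the image in the notebook
```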
Output:

The notebook displays the four images, labeled url_image_0 through url_image_3: a busy city street, a woman running outdoors, a flooded rural area and a nutrition label.
To make these images digestible for the LLM, we will encode each image to bytes and then decode those bytes to a UTF-8 string representation.
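A minimal sketch of that step, assuming base64 as the byte encoding (the scheme image-capable chat APIs typically expect).

```python
encoded_images = []
for url in image_urls:
    response = requests.get(url)
    # Base64-encode the raw bytes, then decode to a UTF-8 string so the
    # image can be embedded in a JSON request body.
    encoded_images.append(base64.b64encode(response.content).decode("utf-8"))
```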
Now that our images can be passed to the LLM, let's set up a POST request to the watsonx API. The system prompt remains the same for each iteration of the API call, so we can store it in a variable and reuse it in every request.
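The variable name and prompt wording below are placeholders; adjust them to your needs.

```python
# Reused as the system message for every API call in this tutorial.
system_prompt = (
    "You are a helpful assistant that describes images and answers "
    "questions about their content."
)
```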
Let's create a helper function that packages a user query and an encoded image, along with the system prompt, into the body of an API request.
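Here is a hypothetical helper, augment_api_request_body, that does this. The payload structure and model ID follow the general pattern of the watsonx.ai chat API, but treat them as assumptions and verify them against the API reference.

```python
def augment_api_request_body(user_query, encoded_image):
    """Build the chat request body for one image and one text query."""
    return {
        "model_id": "meta-llama/llama-3-2-90b-vision-instruct",  # verify the exact ID in the watsonx.ai model catalog
        "project_id": project_id,
        "max_tokens": 900,
        "messages": [
            {"role": "system", "content": system_prompt},
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": user_query},
                    # The image travels as a base64 data URL inside the message.
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{encoded_image}"},
                    },
                ],
            },
        ],
    }
```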
Next, we can establish the headers of our API requests. These remain unchanged throughout the tutorial and provide the API with the request's metadata.
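Assuming the bearer_token defined earlier:

```python
headers = {
    "Accept": "application/json",
    "Content-Type": "application/json",
    "Authorization": bearer_token,  # the value already includes the "Bearer " prefix
}
```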
Now, we can loop through our images to see the text descriptions produced by the model in response to the query, "What is happening in this image?"
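A sketch of the loop, assuming the us-south chat endpoint (adjust the host to your region) and the OpenAI-style response shape returned by the watsonx.ai chat API; check the documentation if your response differs.

```python
# Chat endpoint; the host and version parameter may differ for your region and setup.
api_url = "https://us-south.ml.cloud.ibm.com/ml/v1/text/chat?version=2023-05-29"

for encoded_image in encoded_images:
    body = augment_api_request_body("What is happening in this image?", encoded_image)
    response = requests.post(api_url, headers=headers, json=body)
    response.raise_for_status()
    # The generated description is in the first (and only) choice.
    print(response.json()["choices"][0]["message"]["content"])
```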
Output:
The image depicts a bustling city street, with a busy road and sidewalks lined with tall buildings, trees, and streetlights. The street is filled with cars, taxis, and pedestrians, creating a vibrant and dynamic atmosphere. The scene is set against a backdrop of towering skyscrapers and bustling city life, capturing the energy and activity of urban living.
This image shows a woman running in the street. The woman is wearing a yellow hoodie, black capri leggings, and black sneakers. She has a white headphone around her neck and her brown hair is in a ponytail. The woman appears to be running in the street, with her right leg extended behind her and her left leg bent in front of her. Her arms are bent at the elbows, with her right arm extended behind her and her left arm extended in front of her. In the background, there is a large white building with a row of windows and doors. The building appears to be an industrial or commercial structure, possibly a warehouse or office building. The street in front of the building is empty, with no other people or vehicles visible. The overall atmosphere of the image suggests that the woman is engaged in some form of physical activity or exercise, possibly jogging or running for fitness or recreation.
The image depicts a flooded area, with water covering the ground and surrounding buildings. The water is dark brown and appears to be deep, with some areas reaching up to the roofs of the buildings. There are several buildings visible in the image, including what appears to be a house, a barn, and some smaller structures. The buildings are all partially submerged in the water, with some of them appearing to be damaged or destroyed. In the background, there are fields and crops that are also flooded. The fields are covered in water, and the crops are bent over or lying flat on the ground. There are also some trees and other vegetation visible in the background, but they appear to be struggling to survive in the flooded conditions. Overall, the image suggests that a severe flood has occurred in this area, causing significant damage to the buildings and crops. The floodwaters appear to be deep and widespread, and it is likely that the area will take some time to recover from the disaster.
This image shows a close-up of a nutrition label on a food product, with a person's finger pointing to the label. The label is white with black text and lists various nutritional information, including serving size, calories, fat content, cholesterol, sodium, carbohydrates, dietary fiber, and vitamins. The label also includes a table with nutritional values based on a 2,000 calorie diet. The background of the image is dark gray, suggesting that it may be a product photo or advertisement for the food item. Overall, the image appears to be intended to inform consumers about the nutritional content of the product and help them make informed purchasing decisions.
The Llama 3.2-90b-vision-instruct model was able to successfully caption each image in significant detail.
Now that we have showcased the model's ability to perform image-to-text conversion in the previous step, let's ask the model some questions that require object detection. Our system prompt will remain the same as in the previous section. The difference now will be in the user query. Regarding the second image depicting the woman running outdoors, we will be asking the model, "How many cars are in this image?"
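Using the same helper and headers, this query targets the second image (index 1 of encoded_images).

```python
body = augment_api_request_body("How many cars are in this image?", encoded_images[1])
response = requests.post(api_url, headers=headers, json=body)
print(response.json()["choices"][0]["message"]["content"])
```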
Output:
There is only one car in this image.
The model correctly identified the singular vehicle in the image. Now, let's inquire about the damage depicted in the image of flooding.
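The exact wording of the question is up to you; one possibility, targeting the third image (index 2):

```python
body = augment_api_request_body("What damage is depicted in this image?", encoded_images[2])
response = requests.post(api_url, headers=headers, json=body)
print(response.json()["choices"][0]["message"]["content"])
```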
Output:
The image depicts a severe flood scenario, with water covering the entire area up to the rooftops of the buildings. The water level is high enough to submerge the lower floors of the buildings, causing significant damage to the structures and their contents. The floodwaters also appear to be contaminated with debris and sediment, which could further exacerbate the damage. Overall, the damage in this image appears to be catastrophic, with the potential for long-term consequences for the affected community.
This response highlights the value that multimodal AI has for domains like insurance. The model was able to detect the severity of the damage caused to the flooded home. This could be a powerful tool for improving insurance claim processing time.
Next, let's ask the model how much sodium is listed on the nutrition label in the final image.
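One way to phrase the query, targeting the final image (index 3):

```python
body = augment_api_request_body("How much sodium does this product contain?", encoded_images[3])
response = requests.post(api_url, headers=headers, json=body)
print(response.json()["choices"][0]["message"]["content"])
```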
Output:
**Sodium Content:**
The product contains **640mg of sodium**.
Great! The model was able to discern objects within the images following user queries. We encourage you to try out more queries to further demonstrate the model's performance.
In this tutorial, you used the Llama 3.2-90b-vision-instruct model to perform multimodal operations including image captioning and visual question answering. For more use cases of this model, we encourage you to check out the official documentation page, where you will find more information on the model's parameters and capabilities. The outputs above illustrate the system's ability to extract detailed information from multimodal data in response to natural-language queries.