Date: 2024-09-24
In this tutorial, we will prompt tune an IBM® Granite™ Model using a synthetic dataset containing customer reviews of a dog grooming business.
Prompt tuning is an efficient, low-cost way of adapting an artificial intelligence (AI) foundation model to new downstream tasks without retraining the entire model and updating its weights.
Large language models (LLMs) are one kind of foundation model, trained on large amounts of data. Common use cases of foundation models are chatbots and virtual assistants.
There are several ways of improving a foundation model's interpretation of input and its quality of responses. To better understand these nuances, let's compare some of the methods.
Hard prompts are user-facing and require user action. A hard prompt can be thought of as a template or instructions for the LLM to generate responses. An example of a hard prompt is introduced next. We encourage you to check out the IBM Documentation page for more information on this prompt type and several others.
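The original example is not reproduced in this text, so here is a minimal stand-in sketch of a hard prompt template in Python. The instruction wording is taken from the test prompt that appears later in this tutorial; the helper name `build_hard_prompt` is hypothetical and not part of any IBM library.

```python
# Hypothetical helper: fills a fixed, human-written instruction template
# with a customer comment. The instruction text mirrors the test prompt
# used later in this tutorial.
HARD_PROMPT_TEMPLATE = (
    "Extract the satisfaction from the comment. "
    "Return simple 1 for satisfied customer or 0 for unsatisfied.\n"
    "Comment: {comment}\n"
    "Satisfaction:\n"
)


def build_hard_prompt(comment: str) -> str:
    """Insert the (whitespace-trimmed) comment into the template."""
    return HARD_PROMPT_TEMPLATE.format(comment=comment.strip())


print(build_hard_prompt("Long wait times."))
```

Everything the model sees here is readable text that a person wrote, which is what makes the prompt "hard."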
Using this hard prompt template, an LLM can be provided with specific instructions on the preferred output structure and style. Through this explicit prompt, the LLM would be more likely to produce desirable responses of higher quality.
Soft prompts, unlike hard prompts, are not written in natural language. Instead, prompts are initialized as AI-generated, numerical vectors prepended to each input embedding that distill knowledge from the larger model. This lack of interpretability extends to the AI that chooses prompts optimized for a given task. Often, the AI is unable to explain why it chose those embeddings. Compared with fine-tuning, these virtual tokens are less computationally expensive to train because the model itself remains frozen with fixed weights. Soft prompts also tend to outperform human-engineered hard prompts.
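To make the idea of virtual tokens concrete, here is a toy sketch that uses plain Python lists in place of real tensors. The sizes, the random initialization and the function names are all assumptions chosen for illustration; this is not how watsonx implements prompt tuning internally.

```python
import random

EMBED_DIM = 4           # toy embedding width (real models are far wider)
NUM_VIRTUAL_TOKENS = 3  # number of trainable soft-prompt vectors


def init_soft_prompt(num_tokens, dim, seed=0):
    """Create soft-prompt vectors filled with small random values."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(num_tokens)]


def prepend_soft_prompt(soft_prompt, input_embeddings):
    """Return a new sequence: soft-prompt vectors first, then the input."""
    return soft_prompt + input_embeddings


soft_prompt = init_soft_prompt(NUM_VIRTUAL_TOKENS, EMBED_DIM)
# Stand-in for the frozen model's embeddings of a 5-token input.
input_embeddings = [[0.0] * EMBED_DIM for _ in range(5)]

sequence = prepend_soft_prompt(soft_prompt, input_embeddings)
print(len(sequence))                   # virtual tokens plus input tokens
print(NUM_VIRTUAL_TOKENS * EMBED_DIM)  # count of values tuning would update
```

In this picture, tuning adjusts only the few values inside `soft_prompt`, while the model that consumes `sequence` stays untouched.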
We will be working with soft prompts for prompt tuning in this tutorial.
You need an IBM Cloud® account to create a watsonx.ai™ project.
While you can choose from several tools, this tutorial walks you through how to set up an IBM account to use a Jupyter Notebook.
Log in to watsonx.ai using your IBM Cloud account.
Create a watsonx.ai project.
You can get your project ID from within your project. Click the Manage tab. Then, copy the project ID from the Details section of the General page. You need this ID for this tutorial.
Create a Jupyter Notebook.
This step will open a Notebook environment where you can copy the code from this tutorial to implement prompt tuning on your own. Alternatively, you can download this notebook to your local system and upload it to your watsonx.ai project as an asset. This Jupyter Notebook along with the datasets used can be found on GitHub.
We'll need a few libraries and modules for this tutorial. Make sure to import the following ones; if they're not installed, you can resolve this with a quick pip install.
Set up your credentials. Input your API key and project ID.
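One way to avoid pasting secrets directly into notebook cells is to read them from environment variables. The sketch below assumes the variable names `WATSONX_APIKEY` and `WATSONX_PROJECT_ID`, which are our own choice for illustration and not names that the service requires; `load_settings` is likewise a hypothetical helper.

```python
import os


def load_settings(env=os.environ):
    """Read the API key and project ID from a mapping of environment variables.

    The variable names below are assumptions made for this sketch.
    """
    settings = {}
    for name in ("WATSONX_APIKEY", "WATSONX_PROJECT_ID"):
        value = env.get(name, "").strip()
        if not value:
            raise ValueError(f"Missing required setting: {name}")
        settings[name] = value
    return settings


# Example with placeholder values; never commit real keys to a notebook.
example = load_settings(
    {"WATSONX_APIKEY": "my-api-key", "WATSONX_PROJECT_ID": "my-project-id"}
)
print(sorted(example))
```

The returned values can then be passed to whatever credential object your client library expects.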
As the first step in establishing the environment, create an instance of APIClient with your authentication details and set your
Output:
'SUCCESS'
For this tutorial, we will be using a synthetic dataset consisting of dog grooming business reviews. Using the appropriate URL, we can connect the dataset to the API client.
You are free to use any dataset of your choice. Several open-source datasets are available on platforms such as Hugging Face.
Output:
Creating data asset...
SUCCESS
Output:
3b1db894-8d9e-428d-8fee-d96f328c7726
To gain some insight into the formatting of these customer reviews, let's load the data into a Pandas dataframe and print a few rows that show both positive and negative reviews. An output of "1" denotes positive reviews and "0" is used for negative reviews.
Output:
The
For this tutorial, we suggest using an IBM Granite Model as the large language model to achieve similar results.
Now that we have our tuning experiment set up, we need to connect it to our dataset. For this, let's use the
You are free to use an AI model of your choice. The foundation models available to tune through watsonx can be found here or by running the following command.
Output:
{'FLAN_T5_XL': 'google/flan-t5-xl', 'GRANITE_13B_INSTRUCT_V2': 'ibm/granite-13b-instruct-v2', 'LLAMA_2_13B_CHAT': 'meta-llama/llama-2-13b-chat'}
Output:
##############################################
Running '20671f17-ff53-470b-9bfe-04318ecb91d9'
##############################################
pending......
running....................................................................................................................................
completed
Training of '20671f17-ff53-470b-9bfe-04318ecb91d9' finished successfully.
To ensure our prompt tuning has concluded, we can check the status. If the status that prints is anything other than "completed," please wait for the tuning to finish before continuing.
Output:
completed
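If the run has not finished yet, a small polling helper can do the waiting for you. This is a sketch under stated assumptions: `get_status` stands for any zero-argument function that returns the current status string, the "pending", "running" and "completed" values come from the log output above, and the "failed" and "canceled" values are guesses at possible error states.

```python
import time


def wait_until_completed(get_status, poll_seconds=30.0, timeout_seconds=3600.0):
    """Poll get_status() until it returns 'completed'.

    Raises RuntimeError on an assumed error status and TimeoutError
    if the deadline passes first.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_status()
        if status == "completed":
            return status
        if status in ("failed", "canceled"):  # assumed error states
            raise RuntimeError(f"Tuning run ended with status: {status}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Still '{status}' after {timeout_seconds} seconds")
        time.sleep(poll_seconds)


# Demonstration with a fake status source instead of a real tuning run.
fake_statuses = iter(["pending", "running", "completed"])
result = wait_until_completed(lambda: next(fake_statuses), poll_seconds=0.0)
print(result)
```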
We can now retrieve the prompt tuning summary. In this summary, you will see a loss value. For each training run, the loss function measures the difference between the predicted and actual results. Hence, a lower loss value is preferred.
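To build intuition for why lower is better, here is a toy loss calculation. The tutorial does not state which loss function the tuning run reports; this sketch assumes cross-entropy, a common choice for language models, applied to made-up probabilities for a binary label.

```python
import math


def mean_cross_entropy(probs_of_positive, labels):
    """Average negative log-likelihood for binary labels (0 or 1)."""
    total = 0.0
    for p, y in zip(probs_of_positive, labels):
        # Probability the model assigned to the correct label.
        p_correct = p if y == 1 else 1.0 - p
        total += -math.log(p_correct)
    return total / len(labels)


labels = [1, 0, 1, 0]
confident_and_right = [0.9, 0.1, 0.8, 0.2]  # made-up predictions
unsure = [0.6, 0.4, 0.5, 0.5]               # made-up predictions

loss_confident = mean_cross_entropy(confident_and_right, labels)
loss_unsure = mean_cross_entropy(unsure, labels)
print(loss_confident)
print(loss_unsure)
```

The predictions that put more probability on the correct labels yield the smaller loss.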
We can also plot the learning curve of our model tuning using the
Output:
Deploying the tuned model is required for the next step, in which we compare the performance of the tuned model with that of the pretuned model.
Note: The
Output:
######################################################################################
Synchronous deployment creation for id: '6aa5dd5c-0cc4-44e0-9730-18303e88e14a' started
######################################################################################
initializing.......................
ready
-----------------------------------------------------------------------------------------------
Successfully finished deployment creation, deployment_id='24a97b84-47d0-4490-9f5f-21ed2376fdd6'
-----------------------------------------------------------------------------------------------
Now, let's test the performance of both the tuned model and the original foundation model to see the impacts of our tuning process. First, let's load the testing dataset. This dataset should be a subset of data that was not present during tuning. Often, the test set is also smaller than the training set. Additionally, each input in the test dataset has the prompt as the prefix to the user's comment.
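This tutorial uses a ready-made test set, but if you bring your own data you will need to hold some of it out yourself. The following is a minimal sketch of a holdout split using only the standard library; the function name, the 20% test fraction and the seed are arbitrary choices for illustration.

```python
import random


def holdout_split(records, test_fraction=0.2, seed=42):
    """Shuffle a copy of records and split it into (train, test) lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


reviews = [f"review {i}" for i in range(10)]  # placeholder records
train, test = holdout_split(reviews)
print(len(train), len(test))
```

Because the two lists never share a record, scores measured on `test` reflect examples the tuning run did not see.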
Let's display a small portion of the dataset to better understand its structure.
Output:
Upon loading the test dataset, let's extract the inputs and outputs.
We can also print a sample test input and output to better understand how we have extracted the dataset's content.
Output:
'Extract the satisfaction from the comment. Return simple 1 for satisfied customer or 0 for unsatisfied.\nComment: Long wait times.\nSatisfaction:\n'
In this example, the prompt comes first, followed by the customer's review about long wait times. The expected satisfaction value is 0, which signifies a negative review.
Output:
0
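The extraction step can be sketched with plain Python lists. The field names `input` and `output` are assumptions about how the records are keyed, and the second record is a made-up example added only to show a positive label.

```python
PROMPT = (
    "Extract the satisfaction from the comment. "
    "Return simple 1 for satisfied customer or 0 for unsatisfied.\n"
)

# Illustrative records; the field names "input" and "output" are assumed.
test_records = [
    {"input": PROMPT + "Comment: Long wait times.\nSatisfaction:\n", "output": 0},
    {"input": PROMPT + "Comment: Friendly staff.\nSatisfaction:\n", "output": 1},
]

# Separate the model inputs from the expected labels.
prompts = [record["input"] for record in test_records]
labels = [int(record["output"]) for record in test_records]

print(repr(prompts[0]))
print(labels[0])
```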
Now that we have the test dataset, let's measure the accuracy and F1 score of our tuned model. The F1 score is the harmonic mean of the model's precision and recall. We will need the
Output:
accuracy_score: 0.9827586206896551, f1_score: 0.9827586206896551
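For readers who want to see what these two metrics compute, here is a dependency-free sketch on made-up labels. The score names in the output above match scikit-learn's `accuracy_score` and `f1_score`, but the exact call and averaging mode used in this tutorial are not shown here, so treat the binary F1 below as one possible variant.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def binary_f1(y_true, y_pred, positive=1):
    """F1 for one positive class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up labels for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1, 0]

print(accuracy(y_true, y_pred))
print(binary_f1(y_true, y_pred))
```

In this example precision is 1.0 and recall is 0.75, so the harmonic mean lands below their simple average of 0.875, which shows how F1 penalizes an imbalance between the two.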
Given our model's high accuracy and F1 score, let's test the performance of the same Granite model without any tuning.
Output:
base model accuracy_score: 0.9310344827586207, base model f1_score: 0.9298245614035088
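To quantify the gap between the two accuracy values printed above, we can compute both the absolute difference in percentage points and the relative improvement. The variable names are our own.

```python
tuned_accuracy = 0.9827586206896551
base_accuracy = 0.9310344827586207

# Absolute gap, in percentage points.
point_gain = (tuned_accuracy - base_accuracy) * 100
# Relative gap, as a percentage of the base model's accuracy.
relative_gain = (tuned_accuracy - base_accuracy) / base_accuracy * 100

print(f"{point_gain:.2f} percentage points")
print(f"{relative_gain:.2f}% relative improvement")
```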
Our tuned model outperforms the pretuned foundation model. Since the tuned model specializes in extracting satisfaction scores, it can be used for other satisfaction-extraction tasks. Great work!
In this tutorial, you performed prompt tuning on an IBM Granite model using the watsonx API. Your tuned and deployed model outperformed the foundation model, with accuracy about 5 percentage points higher.
