Prompt tune a Granite model in Python using watsonx
12 September 2024
Author: Anna Gutowska, AI Advocate, IBM
What is prompt tuning?

In this tutorial, we will prompt tune an IBM® Granite™ Model using a synthetic dataset containing customer reviews of a dog grooming business.

Prompt tuning is an efficient, low-cost way of adapting an artificial intelligence (AI) foundation model to new downstream tasks without retraining the entire model and updating its weights.

Overview of LLM optimization

Foundation models, including large language models (LLMs), are trained on vast amounts of data. Common use cases of foundation models include chatbots and virtual assistants.

There are several ways of improving a foundation model's interpretation of input and its quality of responses. To better understand these nuances, let's compare some of the methods.

  • Prompt engineering is the optimization of a pretrained model's responses by providing a well-designed prompt. No new data is introduced and the model remains as-is. With this method, the model receives the user's input preceded by an engineered prompt. For instance, you can use the prompt "Translate English to Spanish:" with the input "good morning." This method requires more manual effort from the user, but that effort to formulate effective prompts helps generative AI models produce task-specific responses without retraining the entire foundation model.
  • Fine-tuning large language models involves further training the same model on large labeled datasets. Fine-tuning alters the model weights, requires significant computational resources and becomes difficult to manage as tasks diversify. In return, this method tends to deliver the best accuracy because the model can be trained for very specific use cases.
  • Unlike fine-tuning, prompt tuning does not alter the pretrained model weights. Instead, this technique is parameter-efficient by adjusting prompt parameters to guide the model’s responses in the preferred direction. The model is provided with an input and tunable soft prompts generated by the AI itself. This task-specific context guides the massive model to tailor its responses to a narrow task even with limited data.
  • Similarly to prompt tuning, prefix-tuning involves the model receiving several examples of preferred output. The difference here is that a prefix, a series of task-specific vectors, is also included. Prefix-tuning involves both soft prompts and prompts injected into layers of the deep learning model. These so-called "virtual tokens" give the tuned model the flexibility to support a variety of new tasks at once. This method achieves performance similar to fine-tuning all layers while training only about 0.1% of the parameters, and it even outperforms fine-tuning in low-data settings.
Soft prompts versus hard prompts

Hard prompts are user-facing and require user action. A hard prompt can be thought of as a template or instructions for the LLM to generate responses. An example of a hard prompt is introduced next. We encourage you to check out the IBM Documentation page for more information on this prompt type and several others.

# For demonstration purposes only. It is not necessary to run this code block.
hard_prompt_template = """Generate a summary of the context that answers the question. Explain the answer in multiple steps if possible.
Answer style should match the context. Ideal Answer length is 2-3 sentences.\n\n{context}\nQuestion: {question}\nAnswer:
"""

Using this hard prompt template, an LLM can be provided with specific instructions on the preferred output structure and style. Through this explicit prompt, the LLM would be more likely to produce desirable responses of higher quality.
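As a quick illustration (not part of the original notebook), the template can be filled in with a context passage and a question using standard Python string formatting; the context and question below are made-up placeholders.

# Demonstration only: fill the hard prompt template with a sample context and question.
# The context and question are illustrative placeholders, not real data.
context = "Bark & Bubbles is a dog grooming salon offering baths, nail trims and full grooming packages."
question = "What services does Bark & Bubbles offer?"

prompt = hard_prompt_template.format(context=context, question=question)
print(prompt)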

Soft prompts, unlike hard prompts, are not written in natural language. Instead, they are initialized as AI-generated numerical vectors that are prepended to each input's embeddings and that distill knowledge from the larger model. Because these vectors are not human-readable, even the AI that optimizes them for a given task often cannot explain why it chose those particular embeddings. In comparison to fine-tuning, these virtual tokens are far less computationally expensive because the model itself remains frozen with fixed weights. Soft prompts also tend to outperform human-engineered hard prompts.
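To make the mechanism concrete, here is a minimal conceptual sketch of soft prompting in PyTorch (PyTorch is not needed for the rest of this tutorial, and the dimensions are illustrative): a small block of trainable embeddings is prepended to the frozen input embeddings, and only that block would be updated during tuning.

# Conceptual sketch only; not part of the watsonx workflow below.
import torch

vocab_embeddings = torch.nn.Embedding(50000, 768)    # stands in for the frozen model's embedding layer
for p in vocab_embeddings.parameters():
    p.requires_grad = False                           # the base model stays frozen

num_virtual_tokens = 20
soft_prompt = torch.nn.Parameter(torch.randn(num_virtual_tokens, 768))  # the only trainable parameters

input_ids = torch.randint(0, 50000, (1, 12))          # a tokenized customer review (dummy IDs)
input_embeds = vocab_embeddings(input_ids)            # shape: (1, 12, 768)

# Prepend the trainable soft prompt to the frozen input embeddings.
model_inputs = torch.cat([soft_prompt.unsqueeze(0).expand(1, -1, -1), input_embeds], dim=1)
print(model_inputs.shape)                             # torch.Size([1, 32, 768])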

We will be working with soft prompts for prompt tuning in this tutorial.

Prerequisites

You need an IBM Cloud® account to create a watsonx.ai™ project.

Steps
Step 1. Set up your environment

While you can choose from several tools, this tutorial walks you through how to set up an IBM account to use a Jupyter Notebook.

  1. Log in to watsonx.ai using your IBM Cloud account.

  2. Create a watsonx.ai project.

    You can get your project ID from within your project. Click the Manage tab. Then, copy the project ID from the Details section of the General page. You need this ID for this tutorial.

  3. Create a Jupyter Notebook.

    This step will open a Notebook environment where you can copy the code from this tutorial to implement prompt tuning on your own. Alternatively, you can download this notebook to your local system and upload it to your watsonx.ai project as an asset. This Jupyter Notebook along with the datasets used can be found on GitHub.

Step 2. Set up a Watson Machine Learning (WML) service instance and API key.
  1. Create a Watson Machine Learning service instance (select your appropriate region and choose the Lite plan, which is a free instance).
  2. Generate an API Key in WML.
  3. Associate the WML service to the project you created in watsonx.ai.
Step 3. Install and import relevant libraries and set up your credentials

We'll need a few libraries and modules for this tutorial. Make sure to import the following ones; if they're not installed, you can resolve this with a quick pip install.

#installations
%pip install ibm-watsonx-ai | tail -n 1
%pip install pandas | tail -n 1
%pip install wget | tail -n 1
%pip install scikit-learn | tail -n 1
%pip install matplotlib | tail -n 1
#imports
import wget
import pandas as pd

from ibm_watsonx_ai import APIClient
from ibm_watsonx_ai.foundation_models.utils.enums import ModelTypes
from ibm_watsonx_ai.experiment import TuneExperiment
from ibm_watsonx_ai.helpers import DataConnection
from ibm_watsonx_ai.foundation_models import ModelInference
from sklearn.metrics import accuracy_score, f1_score
from datetime import datetime

Set up your credentials. Input your API key and project ID.

credentials = {
    "url": "https://us-south.ml.cloud.ibm.com",
    "apikey": "YOUR_API_KEY_HERE"
}

project_id = "YOUR_PROJECT_ID_HERE"
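If you prefer not to hardcode secrets in the notebook, a common alternative (not shown in the original tutorial) is to prompt for them at runtime:

# Optional: prompt for credentials at runtime instead of hardcoding them.
import getpass

credentials = {
    "url": "https://us-south.ml.cloud.ibm.com",
    "apikey": getpass.getpass("Enter your watsonx.ai API key: ")
}

project_id = input("Enter your watsonx.ai project ID: ")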
Step 4. Establish environment and import dataset

As the first step in establishing the environment, create an instance of APIClient with your authentication details and set your project_id.

client = APIClient(credentials)
client.set.default_project(project_id)

Output

'SUCCESS'

For this tutorial, we will be using a synthetic dataset consisting of dog grooming business reviews. We download the training file from the tutorial's GitHub repository and register it as a data asset with the API client.

You are free to use any dataset of your choice. Several open-source datasets are available on platforms such as HuggingFace.

train_filename = 'dog_grooming_reviews_train_data.json'

url = "https://raw.githubusercontent.com/AnnaGutowska/think/main/tutorials/prompt-tuning-tutorial/" + train_filename
wget.download(url)

asset_details = client.data_assets.create(name=train_filename, file_path=train_filename)
asset_id = client.data_assets.get_id(asset_details)

Output:

Creating data asset...

SUCCESS

print(asset_id)

Output

3b1db894-8d9e-428d-8fee-d96f328c7726

To gain some insight into the formatting of these customer reviews, let's load the data into a Pandas dataframe and print a few rows that show both positive and negative reviews. An output of "1" denotes positive reviews and "0" is used for negative reviews.

pd.set_option('display.max_colwidth', None)
df = pd.read_json(train_filename)
df[5:10]

Output: rows 5 through 9 of the dataframe, showing sample reviews alongside their 0/1 satisfaction labels.
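As an optional extra (not in the original notebook, and assuming the training file has the same input and output columns as the test file used later), you can also check how balanced the labels are:

# Optional sanity check: count how many positive (1) and negative (0) reviews are in the training set.
print(df['output'].value_counts())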

Step 5. Tune the model.

The TuneExperiment class is used to create experiments and schedule tunings. Let's use it to initialize our experiment and set our base foundation model and tuning parameters. The goal of this prompt tuning exercise is for the LLM to tailor its responses in accordance with the customer satisfaction ratings in our dataset. This is a classification task, since each review is labeled as either positive ("1") or negative ("0").

For this tutorial, we suggest using an IBM Granite Model as the large language model to achieve similar results.

experiment = TuneExperiment(credentials,
    project_id=project_id
)

prompt_tuner = experiment.prompt_tuner(name="prompt tuning tutorial",
    task_id=experiment.Tasks.CLASSIFICATION,        # classification task: positive versus negative reviews
    base_model=ModelTypes.GRANITE_13B_INSTRUCT_V2,  # frozen base model whose soft prompt we tune
    accumulate_steps=16,                            # gradient accumulation steps per update
    batch_size=8,
    learning_rate=0.001,
    max_input_tokens=128,                           # truncate each input to 128 tokens
    max_output_tokens=2,                            # the model only needs to emit "0" or "1"
    num_epochs=12,
    tuning_type=experiment.PromptTuningTypes.PT,    # prompt tuning with soft prompts
    init_text="Extract the satisfaction from the comment. Return simple '1' for satisfied customer or '0' for unsatisfied. Comment:",
    init_method="text",                             # initialize the soft prompt from init_text
    verbalizer="classify {0, 1} {{input}}",
    auto_update_model=True                          # save the tuned model to the project when training finishes
)

Now that we have our tuning experiment set up, we need to connect it to our dataset. For this, let's use the DataConnection class. This requires the asset_id we produced earlier when creating the data asset with our API client.

data_conn = DataConnection(data_asset_id=asset_id)

You are free to use an AI model of your choice. The foundation models available to tune through watsonx are listed in the IBM watsonx documentation, or you can retrieve them by running the following command.

client.foundation_models.PromptTunableModels.show()

Output:

{'FLAN_T5_XL': 'google/flan-t5-xl', 'GRANITE_13B_INSTRUCT_V2': 'ibm/granite-13b-instruct-v2', 'LLAMA_2_13B_CHAT': 'meta-llama/llama-2-13b-chat'}

tuning_details = prompt_tuner.run(
    training_data_references=[data_conn],
    background_mode=False)

Output:

##############################################

Running '20671f17-ff53-470b-9bfe-04318ecb91d9'

##############################################


pending......
running....................................................................................................................................
completed
Training of '20671f17-ff53-470b-9bfe-04318ecb91d9' finished successfully.

Step 6. Evaluate tuning results.

To ensure our prompt tuning has concluded, we can check the status. If the status that prints is anything other than "completed," please wait for the tuning to finish before continuing.

status = prompt_tuner.get_run_status()
print(status)

Output

completed

We can now retrieve the prompt tuning summary. In this summary, you will see a loss value. For each training run, the loss function measures the difference between the predicted and actual results. Hence, a lower loss value is preferred.

prompt_tuner.summary()

We can also plot the learning curve of our model tuning using the plot_learning_curve()  function. A downward-sloping curve that levels off close to zero indicates that the model is improving its expected output generation. To learn more about interpreting loss function graphs, see the relevant IBM watsonx documentation.

prompt_tuner.plot_learning_curve()

Output: a plot of the loss curve across the training epochs.

Step 7. Deploy the tuned model.

Deploying the tuned model is required before we can compare its performance against the pretuned model in the next step.

Note: The SERVING_NAME  is set to the current date and time since it must be a unique value.

model_id = prompt_tuner.get_model_id()

meta_props = {
    client.deployments.ConfigurationMetaNames.NAME: "PROMPT TUNE DEPLOYMENT",
    client.deployments.ConfigurationMetaNames.ONLINE: {},
    client.deployments.ConfigurationMetaNames.SERVING_NAME: datetime.now().strftime('%Y_%m_%d_%H%M%S')
}

deployment_details = client.deployments.create(model_id, meta_props)

Output

######################################################################################

Synchronous deployment creation for id: '6aa5dd5c-0cc4-44e0-9730-18303e88e14a' started

######################################################################################


initializing.......................
ready

-----------------------------------------------------------------------------------------------
Successfully finished deployment creation, deployment_id='24a97b84-47d0-4490-9f5f-21ed2376fdd6'
-----------------------------------------------------------------------------------------------

Step 8. Test the tuned model.

Now, let's test the performance of both the tuned model and the original foundation model to see the impacts of our tuning process. First, let's load the testing dataset. This dataset should be a subset of data that was not present during tuning. Often, the test set is also smaller than the training set. Additionally, each input in the test dataset has the prompt as the prefix to the user's comment.

test_filename = 'dog_grooming_reviews_test_data.json'
url = "https://raw.githubusercontent.com/AnnaGutowska/think/main/tutorials/prompt-tuning-tutorial/" + test_filename
wget.download(url)
data = pd.read_json(test_filename)

Let's display a small portion of the dataset to better understand its structure.

data.head()

Output: the first five rows of the test dataframe, each containing an input prompt and its expected 0 or 1 output.

Upon loading the test dataset, let's extract the inputs and outputs.

prompts = list(data.input)
satisfaction = list(data.output)
prompts_batch = ["\n".join([prompt]) for prompt in prompts]

We can also print a sample test input and output to better understand how we have extracted the dataset's content.

prompts[0]

Output:

'Extract the satisfaction from the comment. Return simple 1 for satisfied customer or 0 for unsatisfied.\nComment: Long wait times.\nSatisfaction:\n'

In this example, the instruction prompt comes first, followed by the customer's comment about long wait times; the expected satisfaction label is 0, signifying a negative review.

satisfaction[0]

Output

0

Now that we have the test dataset, let's test the accuracy and F1 score of our tuned model. The F1 score is the harmonic mean of the model's precision and recall. We will need the deployment_id to do this. Note that the concurrency_limit, the number of requests sent in parallel, is set to 2 to avoid hitting the API's rate limit.

deployment_id = deployment_details['metadata']['id']

tuned_model = ModelInference(
    deployment_id=deployment_id,
    api_client=client
)

tuned_model_results = tuned_model.generate_text(prompt=prompts_batch, concurrency_limit=2)
print(f'accuracy_score: {accuracy_score(satisfaction, [int(float(x)) for x in tuned_model_results])}, f1_score: {f1_score(satisfaction, [int(float(x)) for x in tuned_model_results])}')

Output:

accuracy_score: 0.9827586206896551, f1_score: 0.9827586206896551
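As an optional sanity check (not part of the original notebook), you can recompute the F1 score from precision and recall to confirm the harmonic-mean relationship:

# Optional check: F1 is the harmonic mean of precision and recall.
from sklearn.metrics import precision_score, recall_score

tuned_preds = [int(float(x)) for x in tuned_model_results]
precision = precision_score(satisfaction, tuned_preds)
recall = recall_score(satisfaction, tuned_preds)
print(2 * precision * recall / (precision + recall))  # should match the f1_score printed above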

Given our model's high accuracy and F1 score, let's test the performance of the same Granite model without any tuning.

base_model = ModelInference(
    model_id=ModelTypes.GRANITE_13B_INSTRUCT_V2,
    api_client=client
)

base_model_results = base_model.generate_text(prompt=prompts_batch, concurrency_limit=2)

print(f'base model accuracy_score: {accuracy_score(satisfaction, [int(x) for x in base_model_results])}, base model f1_score: {f1_score(satisfaction, [int(x) for x in base_model_results])}')

Output:

base model accuracy_score: 0.9310344827586207, base model f1_score: 0.9298245614035088
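If you would like a side-by-side view of both models (an optional extra, not in the original notebook), you can collect the metrics into a small dataframe using the predictions already computed:

# Optional: tabulate the tuned and base model metrics side by side.
tuned_preds = [int(float(x)) for x in tuned_model_results]
base_preds = [int(x) for x in base_model_results]

comparison = pd.DataFrame({
    "model": ["tuned Granite", "base Granite"],
    "accuracy": [accuracy_score(satisfaction, tuned_preds), accuracy_score(satisfaction, base_preds)],
    "f1": [f1_score(satisfaction, tuned_preds), f1_score(satisfaction, base_preds)],
})
print(comparison)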

Our tuned model outperforms the pretuned foundation model. Since the tuned model specializes in extracting satisfaction scores, it can be used for other satisfaction-extraction tasks. Great work!

Summary

In this tutorial, you performed prompt tuning on an IBM Granite model using the watsonx API. Your tuned and deployed model successfully outperformed the foundation model with about 5% greater accuracy.
