IBM Granite

Fine tuning

Fine-tuning Granite to talk like a pirate

In this notebook, we demonstrate how to fine-tune the ibm-granite/granite-3.0-2b-instruct model, a small instruction model, on a custom ‘pirate-talk’ dataset using the qLoRA (Quantized Low-Rank Adaptation) technique. This experiment serves two primary purposes:

  1. Educational: It showcases the process of adapting a pre-trained model to a new domain.
  2. Practical: It illustrates how a model’s interpretation of domain-specific terms (like ‘inheritance’) can shift based on the training data.

We’ll walk through several key steps:

  • Installing necessary dependencies
  • Loading and exploring the dataset
  • Setting up the quantized model
  • Performing a sanity check
  • Configuring and executing the training process

By the end, we’ll have a model that has learned to give all answers as if it were a pirate, demonstrating the power and flexibility of transfer learning in NLP.

An experienced reader might note that we could achieve the same effect with a system prompt, and they would be correct. We fine-tune instead because it is difficult to demonstrate genuinely new knowledge or behavior in a fine-tune using publicly available, permissively licensed datasets (those datasets were often already included in the initial training), so here we create a custom dataset and then show that it had an effect once the model was fine-tuned.

!pip install "transformers>=4.45.2" datasets accelerate bitsandbytes peft trl

Dataset preparation

We’re using the alespalla/chatbot_instruction_prompts dataset, which contains various chat prompts and responses. This dataset will be used to create our pirate-talk dataset: we keep the prompts the same, but have a model rewrite every response so it reads as if spoken by a pirate.

The dataset is split into training and testing subsets, allowing us to both train the model and evaluate its performance on unseen data.

import timeit
start_time = timeit.default_timer()
from datasets import load_dataset
dataset = load_dataset('alespalla/chatbot_instruction_prompts')
# split_dataset = dataset['train'].train_test_split(test_size=0.2)
dataset_loadtime = timeit.default_timer() - start_time
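
If you want a quick look at what was just loaded (this peek is not part of the original cell), printing the DatasetDict and one row confirms the train/test splits and the prompt/response columns that the rest of the notebook relies on:

print(dataset)              # expected: a DatasetDict with 'train' and 'test' splits
print(dataset['train'][0])  # expected: a dict with 'prompt' and 'response' fields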

Model loading and quantization

Next, we load the quantized model. Quantization is a technique that reduces the model size and increases inference speed by approximating the weights of the model. We use the BitsAndBytes library, which allows us to load the model in a more memory-efficient format without significantly compromising performance.

This step is crucial as it enables us to work with a large language model within the memory constraints of our hardware, making the fine-tuning process more accessible and efficient.

start_time = timeit.default_timer()
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments, BitsAndBytesConfig
from peft import LoraConfig
from trl import SFTTrainer

model_checkpoint = "ibm-granite/granite-3.0-2b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

# 4-bit NF4 quantization config (a typical qLoRA setup; adjust to your hardware)
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained(model_checkpoint, quantization_config=bnb_config, device_map="auto")
model_loadtime = timeit.default_timer() - start_time

Pirate text generation dataset preparation

Overview

This code block prepares a dataset for training and testing a text generation model to produce pirate-like responses. The dataset is filtered to exclude examples with excessively long prompts or responses, and then a custom pirateify function is applied to transform the responses into pirate-sounding text. The transformed dataset is split into training and testing sets, which are then saved as a new dataset.

Key functionality

  • Filtering: The filter_long_examples function removes examples with more than 50 prompt tokens or 200 response tokens, ensuring manageable input lengths for the model.
  • Pirate text generation: The pirateify function:
    • Tokenizes input prompts with a transformer tokenizer
    • Generates pirate-like responses using a transformer model (configured for GPU acceleration)
    • Decodes generated tokens back into text
    • Applies batch processing for efficiency (batch size: 64)
  • Dataset preparation:
    • Selects subsets of the original train and test datasets (6000 and 500 examples, respectively)
    • Applies filtering and pirate text generation to these subsets (resulting in 1500 and 250 examples, respectively)
    • Combines the transformed sets into a new DatasetDict named pirate_dataset
from transformers import pipeline
import datasets

def pirateify(batch):
    prompts = [f"make it sound like a pirate said this, do not include any preamble or explanation only piratify the following: {response}" for response in batch['response']]
    # Tokenize the inputs in batch and move them to GPU
    inputs = tokenizer(prompts, return_tensors="pt", padding=True, truncation=True).to('cuda')
    # Generate the pirate-like responses in batch
    outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, top_p=0.95, temperature=0.7)
    # Decode only the newly generated tokens (drop the echoed prompt) back into text
    return {'prompt': batch['prompt'], 'response': tokenizer.batch_decode(outputs[:, inputs['input_ids'].shape[1]:], skip_special_tokens=True)}

def filter_long_examples(example):
    # Keep only examples with at most 50 prompt tokens and 200 response tokens
    return len(tokenizer.tokenize(example['prompt'])) <= 50 and len(tokenizer.tokenize(example['response'])) <= 200

train_subset = dataset['train'].select(range(6000)).filter(filter_long_examples)
test_subset = dataset['test'].select(range(500)).filter(filter_long_examples)
pirate_dataset = datasets.DatasetDict({
    'train': train_subset.map(pirateify, batched=True, batch_size=64),
    'test': test_subset.map(pirateify, batched=True, batch_size=64),
})

pirate_dataset['train'].to_pandas().head()
import torch
torch.cuda.empty_cache()

Model sanity check

Before proceeding with fine-tuning, we perform a sanity check on the loaded model. We feed it an example prompt about ‘inheritance’ to ensure it produces intelligible and contextually appropriate responses.

At this stage, the model should interpret ‘inheritance’ in a programming context, explaining how classes inherit properties and methods from one another. This output serves as a baseline, allowing us to compare how the model’s responses change after fine-tuning on the pirate dataset.

Note that the output is truncated because we set max_new_tokens=100

start_time = timeit.default_timer()
input_text = "<|user>What does 'inheritance' mean?\n<|assistant|>\n"
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
model_check_loadtime = timeit.default_timer() - start_time

Sample output

Inheritance is a mechanism by which one class acquires the properties and behaviors of another class. In object-oriented programming, inheritance allows a new class to inherit the properties and methods of an existing class, known as the parent or base class. This can be useful for code reuse and creating a hierarchy of classes.
For example, let's say we have a base class called "Vehicle" that has properties like "make" and "model". We can create a subclass called "Car" that

Training setup

In this section, we set up the training environment. Key steps include:

  1. Defining the format for training prompts to align with the model’s expected inputs.
  2. Configuring the qLoRA technique, which allows us to fine-tune the model efficiently by only training a small number of additional parameters.
  3. Setting up the SFTTrainer (Supervised Fine-Tuning Trainer) with appropriate hyperparameters.

This setup allows us to enhance specific aspects of the model’s performance without retraining the entire model from scratch, saving computational resources and time.

start_time = timeit.default_timer()

def formatting_prompts_func(example):
    output_texts = []
    for i in range(len(example['prompt'])):
        text = f"<|system|>\nYou are a helpful assistant\n<|user|>\n{example['prompt'][i]}\n<|assistant|>\n{example['response'][i]}<|endoftext|>"
        output_texts.append(text)
    return output_texts

response_template = "\n<|assistant|>\n"
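
The cell above only defines the prompt format and the response template; the completion-only collator, the qLoRA configuration, and the trainer itself still need to be constructed. The sketch below fills that gap with illustrative hyperparameters (the rank, target modules, batch size, and epoch count are assumptions, not values from the original notebook) and assumes the pirate_dataset built earlier. Newer TRL releases replace the tokenizer argument with processing_class and prefer SFTConfig over TrainingArguments.

from trl import DataCollatorForCompletionOnlyLM

# Compute loss only on the assistant's reply, not on the prompt
collator = DataCollatorForCompletionOnlyLM(response_template, tokenizer=tokenizer)

# qLoRA: train small low-rank adapters on top of the frozen, quantized model
qlora_config = LoraConfig(
    r=16,                                  # illustrative rank
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj"],   # illustrative choice of attention projections
    task_type="CAUSAL_LM",
)

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,                    # illustrative values; tune for your hardware
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    logging_steps=10,
)

trainer = SFTTrainer(
    model,
    args=training_args,
    train_dataset=pirate_dataset['train'],
    eval_dataset=pirate_dataset['test'],
    formatting_func=formatting_prompts_func,
    data_collator=collator,
    peft_config=qlora_config,
    tokenizer=tokenizer,
)
training_setup_loadtime = timeit.default_timer() - start_time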

Training process

With all the preparations complete, we now start the training process. The model will be exposed to numerous examples from our pirate dataset, gradually adjusting how it phrases its answers.

We’ll monitor the training loss over time, which should decrease as the model improves its performance on the task. After training, we’ll save the fine-tuned model for future use.

start_time = timeit.default_timer()
# Start training
trainer.train()
training_time = timeit.default_timer() - start_time
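
The loss values logged during training can be read back from the trainer state afterwards (this is standard transformers Trainer behaviour, not an extra cell from the original notebook):

# Print the training loss recorded at each logging step
for entry in trainer.state.log_history:
    if 'loss' in entry:
        print(f"step {entry['step']}: loss {entry['loss']:.4f}")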

Saving the fine-tuned model

After the training process is complete, it’s crucial to save our fine-tuned model. This step ensures that we can reuse the model later without having to retrain it. We’ll save both the model weights and the tokenizer, as they work in tandem to process and generate text.

Saving the model allows us to distribute it, use it in different environments, or continue fine-tuning it in the future. It’s a critical step in the machine learning workflow, preserving the knowledge our model has acquired through the training process.

model.save_pretrained("./results")
tokenizer.save_pretrained("./results")

Persisting the model to Hugging Face

After fine-tuning and validating our model, an optional step is to make it easily accessible for future use or sharing with the community. The Hugging Face Hub provides an excellent platform for this purpose.

Uploading our model to the Hugging Face Hub offers several benefits:

  1. Easy sharing and collaboration with other researchers or developers
  2. Version control for your model iterations
  3. Integration with various libraries and tools in the Hugging Face ecosystem
  4. Simplified deployment options

We’ll demonstrate how to push our fine-tuned model and tokenizer to the Hugging Face Hub, making it available for others to use or for easy integration into other projects. This step is essential for reproducibility and for contributing to the broader NLP community.

Note: Check with your own legal counsel before pushing models to the Hugging Face Hub.

from google.colab import userdata

model.push_to_hub("rawkintrevo/granite-3.0-2b-instruct-pirate", token=userdata.get('HF_TOKEN'))
tokenizer.push_to_hub("rawkintrevo/granite-3.0-2b-instruct-pirate", token=userdata.get('HF_TOKEN'))

Loading the fine-tuned model

Once we’ve saved our model, we can demonstrate how to load it back for inference. This step is crucial for real-world applications where you want to use your trained model without going through the training process again.

Loading a saved model is typically much faster than training from scratch, making it efficient for deployment scenarios. We’ll show how to load both the model and the tokenizer, ensuring that we have all the components necessary for text generation.

# you would uncomment the next 3 lines to load in a new notebook
# from transformers import AutoTokenizer, AutoModelForCausalLM
# model = AutoModelForCausalLM.from_pretrained("./results")
# tokenizer = AutoTokenizer.from_pretrained("./results")

Loading the model from Hugging Face

Once a model is pushed to the Hugging Face Hub, loading it for inference or further fine-tuning becomes remarkably straightforward. This ease of use is one of the key advantages of the Hugging Face ecosystem.

We’ll show how to load our fine-tuned model directly from the Hugging Face Hub using just a few lines of code. This process works not only for our own uploaded models but for any public model on the Hub, demonstrating the power and flexibility of this approach.

Loading from the Hub allows you to:

  1. Quickly experiment with different models
  2. Easily integrate state-of-the-art models into your projects
  3. Ensure you’re using the latest version of a model
  4. Access models from various devices or environments without needing to manually transfer files

This capability is particularly useful in production environments, where you might need to dynamically load or update models based on specific requirements or performance metrics.

# model = AutoModelForCausalLM.from_pretrained("rawkintrevo/granite-3.0-2b-instruct-pirate")
# tokenizer = AutoTokenizer.from_pretrained("rawkintrevo/granite-3.0-2b-instruct-pirate")

Evaluation

Finally, we’ll evaluate our fine-tuned model by presenting it with the same ‘inheritance’ prompt we used in the sanity check. This comparison will reveal how the model’s responses have shifted: it still explains inheritance in a programming context, but it now answers like a pirate.

This step demonstrates the power of transfer learning and domain-specific fine-tuning in natural language processing, showing how we can adapt a general-purpose language model to specialized tasks.

input_text = "<|user>What does 'inheritance' mean?\n<|assistant|>\n"
inputs = tokenizer(input_text, return_tensors="pt").to("cuda")
stop_token = "<|endoftext|>"
stop_token_id = tokenizer.encode(stop_token)[0]
outputs = model.generate(**inputs, max_new_tokens=500, eos_token_id=stop_token_id)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Sample output

Ahoy, matey! 'Inheritance' be a term used in the world of programming, where a new class be created from an existing class, inheritin' its properties and methods. This be like a young pirate learnin' the ways of the sea from a seasoned sailor. The new class can add its own properties and methods, but it must still follow the rules of the parent class. This be like a young pirate learnin' the ways of the sea, but also learnin' how to be a captain, followin' the rules of the sea but also addin' their own rules for their own crew. This be a powerful tool for programmers, allowin' them to create new classes with ease and efficiency. So, hoist the sails, mateys, and let's set sail on this new adventure!

Execution times and performance metrics

Throughout this notebook, we’ve been tracking the time taken for various stages of our process. These execution times provide valuable insights into the computational requirements of fine-tuning a large language model.

We’ll summarize the time taken for:

  1. Loading the initial model
  2. Performing the sanity check
  3. Setting up the training environment
  4. The actual training process

Understanding these metrics can be helpful for resource planning in machine learning projects. It helps in estimating the time and computational power needed for similar tasks in the future, and can guide decisions about hardware requirements or potential optimizations.

This topic is deep and nuanced, but these numbers give you an idea of how long fine-tuning took on this particular hardware.

Additionally, we’ll look at the training loss over time, which gives us a quantitative measure of how well our model learned from the pirate dataset. This metric helps us gauge the effectiveness of our fine-tuning process.

print(f"Model Load Time: {model_loadtime} seconds")
print(f"Model Sanity Check Time: {model_check_loadtime} seconds")
print(f"Training Setup Time: {training_setup_loadtime} seconds")
print(f"Training Time: {training_time} seconds ({training_time/60} minutes)")

Sample output

Model Load Time: 64.40367837800022 seconds
Model Sanity Check Time: 9.231385502000194 seconds
Training Setup Time: 4.85179586599952 seconds
Training Time: 4826.068798849 seconds (80.43447998081666 minutes)