Fine-tuning
In this notebook, we demonstrate how to fine-tune the
ibm-granite/granite-3.0-2b-instruct
model, a small instruction model, on a
custom ‘pirate-talk’ dataset using the QLoRA (Quantized Low-Rank Adaptation)
technique. This experiment serves two primary purposes:
- Educational: It showcases the process of adapting a pre-trained model to a new domain.
- Practical: It illustrates how a model’s interpretation of domain-specific terms (like ‘inheritance’) can shift based on the training data.
We’ll walk through several key steps:
- Installing necessary dependencies
- Loading and exploring the dataset
- Setting up the quantized model
- Performing a sanity check
- Configuring and executing the training process
By the end, we’ll have a model that has learned to give all answers as if it were a pirate, demonstrating the power and flexibility of transfer learning in NLP.
An experienced reader might note we could achieve the same effect with a system prompt, and they would be correct. We fine-tune instead because it is difficult to demonstrate genuinely new knowledge or behaviors from fine-tuning with publicly available, permissively licensed datasets (those datasets were often already included in the initial training), so here we create a custom dataset and then show that fine-tuning on it had a clear effect.
!pip install "transformers>=4.45.2" datasets accelerate bitsandbytes peft trl
Dataset preparation
We’re using the alespalla/chatbot_instruction_prompts
dataset, which contains
various chat prompts and responses. We will use it to create our ‘pirate talk’ dataset: the prompts stay the same, but a model rewrites every response as if a pirate had said it.
The dataset is split into training and testing subsets, allowing us to both train the model and evaluate its performance on unseen data.
import timeit

start_time = timeit.default_timer()

from datasets import load_dataset

dataset = load_dataset('alespalla/chatbot_instruction_prompts')
# split_dataset = dataset['train'].train_test_split(test_size=0.2)

dataset_loadtime = timeit.default_timer() - start_time
Model loading and quantization
Next, we load the quantized model. Quantization is a technique that reduces the
model size and increases inference speed by approximating the weights of the
model. We use the BitsAndBytes
library, which allows us to load the model in a
more memory-efficient format without significantly compromising performance.
This step is crucial as it enables us to work with a large language model within the memory constraints of our hardware, making the fine-tuning process more accessible and efficient.
start_time = timeit.default_timer()

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments, BitsAndBytesConfig
from peft import LoraConfig
from trl import SFTTrainer

model_checkpoint = "ibm-granite/granite-3.0-2b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
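The prose above describes loading the model in a quantized form, so we also need to instantiate the model itself. A minimal sketch of how that can be done with BitsAndBytesConfig follows; the specific 4-bit settings and the model_loadtime bookkeeping are assumptions rather than values taken from the original notebook.

# Configure 4-bit quantization (NF4 with double quantization and float16 compute) -- illustrative settings
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
)

# Load the quantized model and spread it across the available device(s)
model = AutoModelForCausalLM.from_pretrained(
    model_checkpoint,
    quantization_config=bnb_config,
    device_map="auto",
)

# Timing variable printed in the metrics section at the end of the notebook
model_loadtime = timeit.default_timer() - start_time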
Pirate text generation dataset preparation
Overview
This code block prepares a dataset for training and testing a text generation
model to produce pirate-like responses. The dataset is filtered to exclude
examples with excessively long prompts or responses, and then a custom
pirateify
function is applied to transform the responses into pirate-sounding
text. The transformed dataset is split into training and testing sets, which are
then saved as a new dataset.
Key functionality
- Filtering: The filter_long_examples function removes examples with more than 50 prompt tokens or 200 response tokens, ensuring manageable input lengths for the model.
- Pirate text generation: The pirateify function:
  - Tokenizes input prompts with a transformer tokenizer
  - Generates pirate-like responses using a transformer model (configured for GPU acceleration)
  - Decodes generated tokens back into text
  - Applies batch processing for efficiency (batch size: 64)
- Dataset preparation:
  - Selects subsets of the original train and test datasets (6000 and 500 examples, respectively)
  - Applies filtering and pirate text generation to these subsets (resulting in 1500 and 250 examples, respectively)
  - Combines the transformed sets into a new DatasetDict named pirate_dataset
from transformers import pipeline
import datasets

def pirateify(batch):
    prompts = [f"make it sound like a pirate said this, do not include any preamble or explanation only piratify the following: {response}" for response in batch['response']]
    # Tokenize the inputs in batch and move them to GPU
    inputs = tokenizer(prompts, return_tensors="pt", padding=True, truncation=True).to('cuda')
    # Generate the pirate-like responses in batch
    outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, top_p=0.95, temperature=0.7)
    # Decode only the newly generated tokens (drop the prompt portion) back into text
    gen_tokens = outputs[:, inputs['input_ids'].shape[1]:]
    pirate_responses = tokenizer.batch_decode(gen_tokens, skip_special_tokens=True)
    # Keep the original prompts and replace the responses with the pirate versions
    return {'prompt': batch['prompt'], 'response': pirate_responses}
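The filtering and dataset-assembly steps described above can be sketched as follows. The body of filter_long_examples and the exact column handling are assumptions based on the description; the subset sizes and the batch size of 64 come from it.

def filter_long_examples(example):
    # Keep only examples with at most 50 prompt tokens and 200 response tokens
    prompt_tokens = tokenizer.tokenize(example['prompt'])
    response_tokens = tokenizer.tokenize(example['response'])
    return len(prompt_tokens) <= 50 and len(response_tokens) <= 200

# Take subsets of the original splits and drop overly long examples
train_subset = dataset['train'].select(range(6000)).filter(filter_long_examples)
test_subset = dataset['test'].select(range(500)).filter(filter_long_examples)

# Rewrite the responses in pirate speak, 64 examples at a time
pirate_train = train_subset.map(pirateify, batched=True, batch_size=64)
pirate_test = test_subset.map(pirateify, batched=True, batch_size=64)

# Combine the transformed splits into the new dataset used below
pirate_dataset = datasets.DatasetDict({'train': pirate_train, 'test': pirate_test})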
pirate_dataset['train'].to_pandas().head()
import torch

torch.cuda.empty_cache()
Model sanity check
Before proceeding with fine-tuning, we perform a sanity check on the loaded model. We feed it an example prompt about ‘inheritance’ to ensure it produces intelligible and contextually appropriate responses.
At this stage, the model should interpret ‘inheritance’ in a programming context, explaining how classes inherit properties and methods from one another. This output serves as a baseline, allowing us to compare how the model’s responses change after fine-tuning on the pirate dataset.
Note that the output is truncated because we set max_new_tokens=100.
start_time = timeit.default_timer()

input_text = "<|user|>What does 'inheritance' mean?\n<|assistant|>\n"
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

model_check_loadtime = timeit.default_timer() - start_time
Sample output
Inheritance is a mechanism by which one class acquires the properties and behaviors of another class. In object-oriented programming, inheritance allows a new class to inherit the properties and methods of an existing class, known as the parent or base class. This can be useful for code reuse and creating a hierarchy of classes. For example, let's say we have a base class called "Vehicle" that has properties like "make" and "model". We can create a subclass called "Car" that
Training setup
In this section, we set up the training environment. Key steps include:
- Defining the format for training prompts to align with the model’s expected inputs.
- Configuring the QLoRA technique, which allows us to fine-tune the model efficiently by training only a small number of additional parameters.
- Setting up the SFTTrainer (Supervised Fine-Tuning Trainer) with appropriate hyperparameters.
This setup allows us to enhance specific aspects of the model’s performance without retraining the entire model from scratch, saving computational resources and time.
start_time = timeit.default_timer()

def formatting_prompts_func(example):
    output_texts = []
    for i in range(len(example['prompt'])):
        text = f"<|system|>\nYou are a helpful assistant\n<|user|>\n{example['prompt'][i]}\n<|assistant|>\n{example['response'][i]}<|endoftext|>"
        output_texts.append(text)
    return output_texts

response_template = "\n<|assistant|>\n"
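The cell above only covers prompt formatting; the trainer invoked in the next section still has to be built. The sketch below shows one way to wire up the completion-only collator, the LoRA adapter, and the SFTTrainer. The hyperparameters, target modules, and training arguments are illustrative assumptions, and the keyword names follow the trl releases that were current when this notebook was written.

from trl import DataCollatorForCompletionOnlyLM

# Compute the loss only on the assistant's response (everything after response_template)
collator = DataCollatorForCompletionOnlyLM(response_template, tokenizer=tokenizer)

# LoRA adapter configuration -- rank, alpha, dropout, and target modules are illustrative
qlora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)

# Standard Hugging Face training arguments -- hyperparameters are illustrative
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    learning_rate=2e-4,
    logging_steps=25,
    fp16=True,
)

# Supervised fine-tuning trainer combining the quantized model, LoRA config, and pirate dataset
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=pirate_dataset['train'],
    eval_dataset=pirate_dataset['test'],
    peft_config=qlora_config,
    formatting_func=formatting_prompts_func,
    data_collator=collator,
)

# Timing variable printed in the metrics section at the end of the notebook
training_setup_loadtime = timeit.default_timer() - start_time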
Training process
With all the preparations complete, we now start the training process. The model will be exposed to numerous examples from our pirate dataset, gradually learning to phrase its responses the way a pirate would.
We’ll monitor the training loss over time, which should decrease as the model improves its performance on the task. After training, we’ll save the fine-tuned model for future use.
start_time = timeit.default_timer()

# Start training
trainer.train()

training_time = timeit.default_timer() - start_time
Saving the fine-tuned model
After the training process is complete, it’s crucial to save our fine-tuned model. This step ensures that we can reuse the model later without having to retrain it. We’ll save both the model weights and the tokenizer, as they work in tandem to process and generate text.
Saving the model allows us to distribute it, use it in different environments, or continue fine-tuning it in the future. It’s a critical step in the machine learning workflow, preserving the knowledge our model has acquired through the training process.
model.save_pretrained("./results")
tokenizer.save_pretrained("./results")
Persisting the model to Hugging Face
After fine-tuning and validating our model, an optional step is to make it easily accessible for future use or sharing with the community. The Hugging Face Hub provides an excellent platform for this purpose.
Uploading our model to the Hugging Face Hub offers several benefits:
- Easy sharing and collaboration with other researchers or developers
- Version control for your model iterations
- Integration with various libraries and tools in the Hugging Face ecosystem
- Simplified deployment options
We’ll demonstrate how to push our fine-tuned model and tokenizer to the Hugging Face Hub, making it available for others to use or for easy integration into other projects. This step also supports reproducibility and contributes to the broader NLP community.
Note: Check with your own legal counsel before pushing models to the Hugging Face Hub.
from google.colab import userdata

model.push_to_hub("rawkintrevo/granite-3.0-2b-instruct-pirate", token=userdata.get('HF_TOKEN'))
Loading the fine-tuned model
Once we’ve saved our model, we can demonstrate how to load it back for inference. This step is crucial for real-world applications where you want to use your trained model without going through the training process again.
Loading a saved model is typically much faster than training from scratch, making it efficient for deployment scenarios. We’ll show how to load both the model and the tokenizer, ensuring that we have all the components necessary for text generation.
# you would uncomment the next 3 lines to load in a new notebook
# from transformers import AutoTokenizer, AutoModelForCausalLM
# model = AutoModelForCausalLM.from_pretrained("./results")
# tokenizer = AutoTokenizer.from_pretrained("./results")
Loading the model from Hugging Face
Once a model is pushed to the Hugging Face Hub, loading it for inference or further fine-tuning becomes remarkably straightforward. This ease of use is one of the key advantages of the Hugging Face ecosystem.
We’ll show how to load our fine-tuned model directly from the Hugging Face Hub using just a few lines of code. This process works not only for our own uploaded models but for any public model on the Hub, demonstrating the power and flexibility of this approach.
Loading from the Hub allows you to:
- Quickly experiment with different models
- Easily integrate state-of-the-art models into your projects
- Ensure you’re using the latest version of a model
- Access models from various devices or environments without needing to manually transfer files
This capability is particularly useful in production environments, where you might need to dynamically load or update models based on specific requirements or performance metrics.
# model = AutoModelForCausalLM.from_pretrained("rawkintrevo/granite-3.0-2b-instruct-pirate")
Evaluation
Finally, we’ll evaluate our fine-tuned model by presenting it with the same ‘inheritance’ prompt we used in the sanity check. This comparison will reveal how the model’s responses have shifted: the explanation still covers the programming concept, but it is now delivered in pirate speak.
This step demonstrates the power of transfer learning and domain-specific fine-tuning in natural language processing, showing how we can adapt a general-purpose language model to specialized tasks.
input_text = "<|user|>What does 'inheritance' mean?\n<|assistant|>\n"
inputs = tokenizer(input_text, return_tensors="pt").to("cuda")
stop_token = "<|endoftext|>"
stop_token_id = tokenizer.encode(stop_token)[0]
outputs = model.generate(**inputs, max_new_tokens=500, eos_token_id=stop_token_id)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Sample output
Ahoy, matey! 'Inheritance' be a term used in the world of programming, where a new class be created from an existing class, inheritin' its properties and methods. This be like a young pirate learnin' the ways of the sea from a seasoned sailor. The new class can add its own properties and methods, but it must still follow the rules of the parent class. This be like a young pirate learnin' the ways of the sea, but also learnin' how to be a captain, followin' the rules of the sea but also addin' their own rules for their own crew. This be a powerful tool for programmers, allowin' them to create new classes with ease and efficiency. So, hoist the sails, mateys, and let's set sail on this new adventure!
Execution times and performance metrics
Throughout this notebook, we’ve been tracking the time taken for various stages of our process. These execution times provide valuable insights into the computational requirements of fine-tuning a large language model.
We’ll summarize the time taken for:
- Loading the initial model
- Performing the sanity check
- Setting up the training environment
- The actual training process
Understanding these metrics can be helpful for resource planning in machine learning projects. It helps in estimating the time and computational power needed for similar tasks in the future, and can guide decisions about hardware requirements or potential optimizations.
This topic is deep and nuanced, but these figures give you an idea of how long the fine-tuning took on this particular hardware.
Additionally, we’ll look at the training loss over time, which gives us a quantitative measure of how well our model learned from the pirate dataset. This metric helps us gauge the effectiveness of our fine-tuning process.
print(f"Model Load Time: {model_loadtime} seconds")
print(f"Model Sanity Check Time: {model_check_loadtime} seconds")
print(f"Training Setup Time: {training_setup_loadtime} seconds")
print(f"Training Time: {training_time} seconds ({training_time/60} minutes)")
Sample output
Model Load Time: 64.40367837800022 seconds
Model Sanity Check Time: 9.231385502000194 seconds
Training Setup Time: 4.85179586599952 seconds
Training Time: 4826.068798849 seconds (80.43447998081666 minutes)