Fine-tuning the Granite LLM with LoRA

Author

Joshua Noble

Data Scientist

Fine-tuning Granite with LoRA

Low-Rank Adaptation (LoRA) is an efficient fine-tuning method that reduces the number of trainable parameters, which increases training speed and lowers resource usage while retaining comparable output quality. Instead of updating all parameters in a neural network during fine-tuning, LoRA freezes the original pretrained weights and adds small, trainable low-rank matrices that approximate the changes needed for the new task. This approach is based on the hypothesis that the weight updates during adaptation have a low "intrinsic rank."
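
To make the idea concrete, here's a toy sketch (not any library's actual implementation) of how a LoRA update augments a frozen weight matrix. The dimensions, rank and scaling values are illustrative:

import torch

# Frozen pretrained weight plus a trainable low-rank update B @ A, scaled by alpha / r
d, r, alpha = 1024, 8, 16

W = torch.randn(d, d)          # frozen pretrained weight (not updated)
A = torch.randn(r, d) * 0.01   # trainable low-rank matrix, r x d
B = torch.zeros(d, r)          # trainable low-rank matrix, d x r (initialized to zero)

W_adapted = W + (alpha / r) * (B @ A)

# Only A and B are trained: 2 * d * r parameters instead of d * d
print(2 * d * r, "trainable parameters versus", d * d, "frozen")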

One additional benefit of LoRA is that because the pretrained weights are kept frozen, the generated adapter is lightweight and portable and can easily be stored.

In this tutorial, you'll use LLaMa Factory. LLaMa Factory is a low-code and no-code platform for training and fine-tuning large language models (LLMs) that lets users tune LLMs on custom datasets, evaluate performance and serve models. It has both an easy-to-use web UI and a CLI, and it supports over 100 LLMs. The platform accepts datasets in the Alpaca and ShareGPT formats. LLaMa Factory isn't the only way to fine-tune LLMs: the PEFT library for parameter-efficient fine-tuning is another option for updating large models, and it also supports quantized LoRA (QLoRA) to further reduce the memory needed for fine-tuning. In this tutorial, you'll use a non-quantized version of Granite 3.3.
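
For comparison, here's a minimal sketch of what a LoRA setup with the PEFT library might look like. The hyperparameter values are illustrative assumptions, and this code isn't used in the rest of the tutorial:

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("ibm-granite/granite-3.3-2b-instruct")

lora_config = LoraConfig(
    r=8,                          # rank of the low-rank update matrices
    lora_alpha=16,                # scaling factor applied to the update
    lora_dropout=0.05,            # dropout applied to the LoRA layers
    target_modules="all-linear",  # attach adapters to all linear layers
    task_type="CAUSAL_LM",
)

peft_model = get_peft_model(base_model, lora_config)
peft_model.print_trainable_parameters()  # shows how small the trainable fraction is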

Although LLaMa Factory can run without extensive compute resources, it does require a GPU and significant memory. In this tutorial, you'll use LLaMa Factory on watsonx® to provide GPU resources and storage for the generated adapter.

Configuration

Watson Studio config

a. Log in to watsonx.ai® using your IBM Cloud® account.

b. Create a watsonx.ai project. Take note of your project ID in project > Manage > General > Project ID.  
You'll need this ID for this tutorial.

c. Create a watsonx.ai Runtime service instance. For this tutorial, you'll need to create a paid instance to access a GPU.

d. Generate a watsonx application programming interface (API Key).

e. Associate the watsonx.ai Runtime service to the project that you created in watsonx.ai.

Cloud Object Storage

a. To create Cloud Object Storage for your notebook, you'll go to https://cloud.ibm.com/ and then select "Create Instance".

b. That will take you to a create dialog where you can select a pricing plan. For this tutorial, a standard plan will be adequate.

c. Then, give your Cloud Object Storage instance a name.

d. Once you've created your Instance, go back to the Project and select "New Asset", then select "Connect to a data source".

Configuring the data connection for Cloud Object Storage

e. Select "Cloud Object Storage".

f. In the next dialog, select the instance that you created in steps a through d by name.

g. Select "Create".

Create a Jupyter Notebook


a. Select the Assets tab in your project environment.

b. Click New asset.

c. Select the Working with models option in the left panel.

d. Click Working with data and models by using Python and R notebooks.

e. Enter a name for your notebook in the Name field. Choose Runtime 23.1 on Python (4 vCPU 16 GB RAM) to define the configuration.

f. Select Create.

Setup

Next, you'll install dependencies onto the runtime: first, llamafactory to generate the low-rank adapters, and then pandas to format the dataset in the Alpaca format.

!pip install -q llamafactory 2>/dev/null
# pandas needed to format the dataset
!pip install -q --upgrade pandas 2>/dev/null

Check the GPU environment

Next, you'll ensure that your watsonx environment has provisioned a Torch-compatible GPU, which is required to use LLaMa Factory.

import torch

try:
  assert torch.cuda.is_available() is True
except AssertionError:
  print("No GPU found, please set up a GPU before using LLaMA Factory.")

If the preceding code snippet doesn't print "No GPU found," then you're good to go.
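
Optionally, you can print a few details about the GPU that was provisioned (this assumes at least one CUDA device is available):

if torch.cuda.is_available():
    # Name and total memory of the first CUDA device
    print(torch.cuda.get_device_name(0))
    print(f"{torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB of VRAM")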

Next, you'll import libraries to manipulate data and to create the LLaMa Factory configuration file used for training.

# Import libraries
import pandas as pd
import json
import yaml

Download and process the MedReason dataset

In this tutorial, you'll use a part of the MedReason dataset. MedReason is a large-scale, high-quality medical reasoning dataset designed to enable explainable medical problem-solving in LLMs. While MedReason focuses on a model's reasoning and on validating the chains of thought that a model uses, it's also useful here because it's recent enough that it wasn't included in the training data for IBM® Granite® 3.3.

Granite 3.3 has been designed to learn through fine-tuning, which you'll run with LLaMa Factory. Granite models can be efficiently fine-tuned even with limited computing resources.

You'll load a selection of the MedReason dataset from GitHub:

training = pd.read_json("https://raw.githubusercontent.com/UCSC-VLAA/MedReason/refs/heads/main/eval_data/medbullets_op4.jsonl", lines=True)
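
It can help to take a quick look at the loaded DataFrame to confirm that the fields used later in this tutorial (question, options and answer) are present; the dataset may also contain other columns:

# Inspect the columns and the first couple of rows
print(training.columns.tolist())
training.head(2)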

LLaMa Factory requires the dataset to be preformatted in the Alpaca or ShareGPT format. Thus, you'll reformat the question and answer fields of the original medical dataset into instruction, input and output fields according to the Alpaca format.

Alpaca is a JSON format that represents an instruction, an optional user input and the expected model output, like so:

{
    "instruction": "user instruction (required)",
    "input": "user input (optional)",
    "output": "model response (required)",
    "system": "system prompt (optional)"
}

Because MedReason isn't formatted in Alpaca, you'll create an Alpaca dataset in the next cell:

!mkdir -p data

# Format Med Dataset to Alpaca Format
formatted_data = [
    {
        "instruction": row["question"] + str(row["options"]),
        "input": "",
        "output": row["answer"]
    }
    for _, row in training.iterrows()
]

# output formatted MedReason dataset
with open("data/med.json", "w", encoding="utf-8") as f:
  json.dump(formatted_data, f, indent=2, ensure_ascii=False)
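
As a quick sanity check, you can print the first formatted record to confirm that it follows the Alpaca structure:

# Print the first Alpaca-formatted record (truncated for readability)
print(json.dumps(formatted_data[0], indent=2, ensure_ascii=False)[:500])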

LLaMa Factory uses a specific file to determine how to load datasets for training. This file must exist at the path data/dataset_info.json, so you'll create a dataset_info.json file that points to the newly formatted medical dataset so that the LLaMa Factory CLI can access it. For details on the dataset_info.json file, see the documentation. The LLaMa Factory repository ships with several datasets that are ready to use, but because you're using a custom dataset, you must add it to this JSON file.

# "med" will be the identifier for the dataset 
# which points to the local file that contains the dataset
dataset_info = {
  "med": {
    "file_name": "med.json",
  }
}

# Create dataset_info.json with the medical dataset so it can be referenced with LLaMa Factory
with open("data/dataset_info.json", "w", encoding="utf-8") as f:
  json.dump(dataset_info, f, indent=2, ensure_ascii=False)

Now that the Alpaca formatted JSON object has been saved into the environment, you're ready to start training.

Fine-tuning

The next step is to set up the training configurations and then write the configs to a YAML file that LLaMa-Factory uses to run training.

Now you'll run supervised fine-tuning (SFT) on the subset of the MedReason dataset. LLaMa Factory supports several different types of training. Some of the most commonly used are:

  • Pretraining: Where a model undergoes initial training on an extensive dataset to learn fundamental language patterns and concepts.

  • Supervised fine-tuning (SFT): Where a model receives additional training with annotated data to enhance precision for a particular function or on a specific topic.

  • Reward modeling: Where the model learns which outputs earn a specific incentive or reward, producing a signal that later guides reinforcement learning.

  • Proximal policy optimization (PPO) training: A reinforcement learning (RL) technique where the model is further honed through policy gradient techniques to boost its effectiveness in a specific setting.

There are many settings used in configuring LoRA but a few of the most important and commonly used are:

  • Learning rate (LR): The learning rate determines how significantly each model parameter is updated during each iteration of training. A higher LR can speed up convergence by allowing larger updates but risks overshooting the optimal solution or oscillating around it. A lower LR leads to slower but more stable convergence, reducing the risk of instability near the optimal solution.

  • loraplus_lr_ratio: This setting controls the LoRA+ ratio between the learning rates of the adapter's two low-rank matrices. Generally, it should be > 1, but the optimal choice of loraplus_lr_ratio is model and task dependent. As a guideline, loraplus_lr_ratio should be larger when the task is more difficult and the model needs to update its features to learn well. In that case, it also helps to make the base learning rate slightly smaller (for example, by a factor of 2) than typical LoRA learning rates.

  • Effective batch size: Correctly configuring your batch size is critical for balancing training stability with the VRAM limitations of the GPU you're using. The effective batch size is set by the product of per_device_train_batch_size * gradient_accumulation_steps. A larger effective batch size generally leads to smoother, more stable training, but also might require more VRAM than your GPU contains. A smaller effective batch size might introduce more variance.
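
For example, with the values used in the training configuration below (per_device_train_batch_size=4 and gradient_accumulation_steps=2), the effective batch size is 4 × 2 = 8.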

Here's the code that configures training:

# setup training configurations
args = dict(
  stage="sft",                                                      # do supervised fine-tuning
  do_train=True,                                                    # we're actually training
  model_name_or_path="ibm-granite/granite-3.3-2b-instruct",         # use IBM Granite 3.3 2b instruct model
  dataset="med",                                                    # use medical datasets we created
  template="granite3",                                              # use granite3 prompt template
  finetuning_type="lora",                                           # use LoRA adapters to save memory
  lora_target="all",                                                # attach LoRA adapters to all linear layers
  loraplus_lr_ratio=16.0,                                           # use LoRA+ algorithm with lambda=16.0
  output_dir="granite3_lora",                                       # the path to save LoRA adapters
  per_device_train_batch_size=4,                                    # the batch size
  gradient_accumulation_steps=2,                                    # the gradient accumulation steps
  learning_rate=1e-4,                                               # the learning rate
  num_train_epochs=3.0,                                             # the epochs of training
  max_samples=500,                                                  # use 500 examples in each dataset
  fp16=True,                                                        # use float16 mixed precision training
  report_to="none",                                                 # disable wandb logging
)

# create training config file to run with llama factory
with open("train_granite3_lora_med.yaml", "w", encoding="utf-8") as file:
  yaml.dump(args, file, indent=2)

The next cell will train the model and can take up to 10 minutes to run:

!llamafactory-cli train train_granite3_lora_med.yaml;
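
After training completes, the adapter weights and configuration are saved to the output_dir set in the configuration (granite3_lora). The exact file names depend on your LLaMa Factory and PEFT versions, but you can list the directory to confirm that the adapter was written:

!ls granite3_lora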

Using Cloud Object Storage

Next, you'll create two methods to upload and download data from IBM Cloud Object Storage:

from ibm_botocore.client import Config
import ibm_boto3

def upload_file_cos(credentials, local_file_name, key):
    # Create a COS client from the generated credentials
    cos = ibm_boto3.client(service_name='s3',
                           ibm_api_key_id=credentials['IBM_API_KEY_ID'],
                           ibm_service_instance_id=credentials['IAM_SERVICE_ID'],
                           ibm_auth_endpoint=credentials['IBM_AUTH_ENDPOINT'],
                           config=Config(signature_version='oauth'),
                           endpoint_url=credentials['ENDPOINT'])
    try:
        cos.upload_file(Filename=local_file_name, Bucket=credentials['BUCKET'], Key=key)
    except Exception as e:
        print(Exception, e)
    else:
        print('File Uploaded')


def download_file_cos(credentials, local_file_name, key):
    # Create a COS client from the generated credentials
    cos = ibm_boto3.client(service_name='s3',
                           ibm_api_key_id=credentials['IBM_API_KEY_ID'],
                           ibm_service_instance_id=credentials['IAM_SERVICE_ID'],
                           ibm_auth_endpoint=credentials['IBM_AUTH_ENDPOINT'],
                           config=Config(signature_version='oauth'),
                           endpoint_url=credentials['ENDPOINT'])
    try:
        cos.download_file(Bucket=credentials['BUCKET'], Key=key, Filename=local_file_name)
    except Exception as e:
        print(Exception, e)
    else:
        print('File Downloaded')

The next cell will contain the credentials for your Cloud Object Storage instance.

In your notebook, click the Code Snippets tab in the right corner. This step opens a menu with several options for generated code snippets. Select "Read Data":

Using a prepared code snippet in Watson Studio

This step opens a menu to select a data file. If you haven't uploaded anything to your Cloud Object Storage instance, you'll need to upload something to generate credentials and that can be a classic dataset like wine.csv.

Selecting a data asset in Watson Studio

After clicking "Select" you can now generate the credentials snippet under the "Load as" option. Choose "Insert code to cell":

Inserting a generated code snippet in Watson Studio

This step generates a cell like the following one, with the correct IDs and endpoints filled in:

# @hidden_cell
# The following code contains metadata for a file in your project storage.
# You might want to remove secret properties before you share your notebook.

storage_metadata = {
    'IAM_SERVICE_ID': '',
    'IBM_API_KEY_ID': '',
    'ENDPOINT': '',
    'IBM_AUTH_ENDPOINT': '',
    'BUCKET': '',
    'FILE': ''
}

Now zip the folder containing the adapter and the information about the adapter itself:

!zip -r "granite3_lora.zip" "granite3_lora"

Check that you've created the zip correctly:

!ls
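
To make the adapter available in later sessions, upload the archive to your bucket with the upload_file_cos helper defined earlier. This assumes the credentials dictionary generated by the code snippet above (named storage_metadata here):

# Upload the zipped adapter to the Cloud Object Storage bucket from the generated credentials
upload_file_cos(storage_metadata, "granite3_lora.zip", "granite3_lora.zip")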

Inference

Now it's time to run inference. The inference is backed by Hugging Face generation, which provides a model.generate() method for text generation using PyTorch.

First, you'll ask the base model a medical question pulled from the MedReason dataset. The base model might not answer it correctly because it's a general-purpose model trained on large, diverse datasets rather than specialized medical data.

First, set up the inference configurations:

# setup inference configurations
args = dict(
  model_name_or_path="ibm-granite/granite-3.3-2b-instruct",       # use IBM Granite 3.3 2b instruct model
  template="granite3",                                            # set to the same one used in training, template for constructing prompts
  infer_backend="huggingface"                                     # choices: [huggingface, vllm]
)

# create inference config file to run with llama factory
with open("inference_config.yaml", "w", encoding="utf-8") as file:
  yaml.dump(args, file, indent=2)

Now you’ll ask the chatbot one of the questions from the MedReason dataset:

from llamafactory.chat import ChatModel
chat_model = ChatModel(args)
messages = []

# run inference chatbot
question = '''
A 1-year-old girl is brought to a neurologist due to increasing seizure frequency over the past 2 months. 
She recently underwent a neurology evaluation which revealed hypsarrhythmia on electroencephalography (EEG) with a mix of slow waves, multifocal spikes, and asynchrony. 
Her parents have noticed the patient occasionally stiffens and spreads her arms at home. She was born at 38-weeks gestational age without complications. 
She has no other medical problems. Her medications consist of lamotrigine and valproic acid. Her temperature is 98.3\u00b0F (36.8\u00b0C), blood pressure is 90/75 mmHg, pulse is 94/min, and respirations are 22/min. 
Physical exam reveals innumerable hypopigmented macules on the skin and an irregularly shaped, thickened, and elevated plaque on the lower back. 
Which of the following is most strongly associated with this patient's condition?"
"A": "Cardiac rhabdomyoma", "B": "Glaucoma", "C": "Optic glioma", "D": "Polyostotic fibrous dysplasia"
'''
messages.append({"role": "user", "content": question})

response = ""
for new_text in chat_model.stream_chat(messages):
    response += new_text

print(response)
messages.append({"role": "assistant", "content": response})

Here's the sample output from the base Granite 3.3 model:

User:

A 1-year-old girl is brought to a neurologist due to increasing seizure frequency over the past 2 months. 
She recently underwent a neurology evaluation which revealed hypsarrhythmia on electroencephalography (EEG) with a mix of slow waves, multifocal spikes, and asynchrony. 
Her parents have noticed the patient occasionally stiffens and spreads her arms at home. She was born at 38-weeks gestational age without complications. 
She has no other medical problems. Her medications consist of lamotrigine and valproic acid. Her temperature is 98.3°F (36.8°C), blood pressure is 90/75 mmHg, pulse is 94/min, and respirations are 22/min. 
Physical exam reveals innumerable hypopigmented macules on the skin and an irregularly shaped, thickened, and elevated plaque on the lower back. 
Which of the following is most strongly associated with this patient's condition?"
"A": "Cardiac rhabdomyoma", "B": "Glaucoma", "C": "Optic glioma", "D": "Polyostotic fibrous dysplasia"

Response:

The most strongly associated condition with this patient's condition is "C": "Optic glioma".

The patient's symptoms of hypsarrhythmia on EEG, seizure frequency increase, and the presence of hypopigmented macules and a thickened plaque on the lower back are indicative of a neurological disorder. Optic glioma is a type of brain tumor that can present with these symptoms, including seizures and visual disturbances.

Option A, "Cardiac rhabdomyoma", typically presents with cardiac involvement and is not associated with the described EEG findings or skin manifestations.

Option B, "Glaucoma", is an eye disease that can lead to vision loss but is not associated with the EEG findings or skin lesions described.

Option D, "Polyostotic fibrous dysplasia", is a bone disorder characterized by multiple bone lesions and is not associated with the neurological symptoms and EEG findings presented.

Therefore, based on the clinical presentation, the most likely diagnosis is an optic glioma.

The correct response from the dataset is:

answer: Cardiac rhabdomyoma

So the base model doesn't generate the correct answer.

Infer with the LoRA fine-tuned adapter

You'll compare the results from the base model and the LoRA-tuned adapter by asking the same question, to see how tuning with the medical dataset helps the model better understand and answer medical questions.

The following cell won't be necessary if you've performed LoRA fine-tuning in the same session. However, if you're coming back to the Jupyter Notebook and don't want to retrain, you can download the fine-tuned adapters from your COS instance.

download_file_cos(storage_metadata, "granite3_lora.zip", "granite3_lora.zip")
!unzip granite3_lora.zip

Now you'll configure the options for the ChatModel so that it will incorporate the adapters.

# setup inference configurations
args = dict(
  model_name_or_path="ibm-granite/granite-3.3-2b-instruct",       # use IBM Granite 3.3 2b instruct model
  adapter_name_or_path="granite3_lora",                           # load the saved LoRA adapters
  template="granite3",                                            # set to the same one used in training, template for constructing prompts
  finetuning_type="lora",                                         # which fine-tuning technique used in training
  infer_backend="huggingface"                                     # choices: [huggingface, vllm]
)

# create inference config file to run with llama factory
with open("inference_config.yaml", "w", encoding="utf-8") as file:
  yaml.dump(args, file, indent=2)


from llamafactory.chat import ChatModel
chat_model = ChatModel(args)

Now you can pose the same reasoning challenge to the fine-tuned model:

messages = []

# run inference chatbot
question = '''
A 1-year-old girl is brought to a neurologist due to increasing seizure frequency over the past 2 months. 
She recently underwent a neurology evaluation which revealed hypsarrhythmia on electroencephalography (EEG) with a mix of slow waves, multifocal spikes, and asynchrony. 
Her parents have noticed the patient occasionally stiffens and spreads her arms at home. She was born at 38-weeks gestational age without complications. 
She has no other medical problems. Her medications consist of lamotrigine and valproic acid. Her temperature is 98.3\u00b0F (36.8\u00b0C), blood pressure is 90/75 mmHg, pulse is 94/min, and respirations are 22/min. 
Physical exam reveals innumerable hypopigmented macules on the skin and an irregularly shaped, thickened, and elevated plaque on the lower back. 
Which of the following is most strongly associated with this patient's condition?"
"A": "Cardiac rhabdomyoma", "B": "Glaucoma", "C": "Optic glioma", "D": "Polyostotic fibrous dysplasia"
'''
messages.append({"role": "user", "content": question})

response = ""
for new_text in chat_model.stream_chat(messages):
    response += new_text

print(response)
messages.append({"role": "assistant", "content": response})

Sample output from fine-tuned model:

User:

A 1-year-old girl is brought to a neurologist due to increasing seizure frequency over the past 2 months. 
She recently underwent a neurology evaluation which revealed hypsarrhythmia on electroencephalography (EEG) with a mix of slow waves, multifocal spikes, and asynchrony. 
Her parents have noticed the patient occasionally stiffens and spreads her arms at home. She was born at 38-weeks gestational age without complications. 
She has no other medical problems. Her medications consist of lamotrigine and valproic acid. Her temperature is 98.3\u00b0F (36.8\u00b0C), blood pressure is 90/75 mmHg, pulse is 94/min, and respirations are 22/min. 
Physical exam reveals innumerable hypopigmented macules on the skin and an irregularly shaped, thickened, and elevated plaque on the lower back. 
Which of the following is most strongly associated with this patient's condition?"
"A": "Cardiac rhabdomyoma", "B": "Glaucoma", "C": "Optic glioma", "D": "Polyostotic fibrous dysplasia"

Response:

Cardiac rhabdomyoma

This time the model did generate the correct answer, thanks to the trained adapter.

One aspect to note: the model no longer responds with its reasoning. This is because the dataset used for LoRA fine-tuning contains only the correct answer as the expected model output. LoRA fine-tuning can be used both to provide new information and to instruct the model how to respond.
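
For example, if you wanted the tuned model to keep explaining itself, you could include an explanation alongside the answer in the output field when building the Alpaca records. The sketch below assumes a hypothetical explanation column; the actual MedReason column names may differ:

# Hypothetical: include reasoning in the training target.
# "explanation" is an assumed column name; adjust it to match the dataset's schema.
formatted_data_with_reasoning = [
    {
        "instruction": row["question"] + str(row["options"]),
        "input": "",
        "output": f'{row["answer"]}\n\nReasoning: {row.get("explanation", "")}'
    }
    for _, row in training.iterrows()
]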

Summary

In this tutorial, you used LoRA to fine-tune the IBM Granite-3.3-2b-Instruct model with new medical knowledge and a new response format. You saw Granite 3.3's capacity to learn even as a small model trained on a limited number of samples from the dataset.
