LoRA fine-tuning with a Granite LLM

Fine-tuning Granite with LoRA

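[Low-rank adaptation](https://www.ibm.com/jp-ja/think/topics/lora) (LoRA) is an efficient [fine-tuning](https://www.ibm.com/jp-ja/think/topics/fine-tuning) method that reduces the number of trainable parameters, which speeds up training and lowers resource usage while aiming for output quality comparable to full fine-tuning. Instead of updating every parameter of a [neural network](https://www.ibm.com/jp-ja/think/topics/neural-networks) during fine-tuning, LoRA freezes the original pretrained weights and adds small, trainable low-rank matrices that approximate the changes needed for the new task. This approach rests on the hypothesis that the weight updates made during adaptation have a low "intrinsic rank".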
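To make the parameter savings concrete, the sketch below counts trainable parameters for a single weight matrix under full fine-tuning and under LoRA. The function name `lora_param_counts`, the layer shape (2048 x 2048) and the rank (8) are illustrative assumptions, not values taken from Granite or LLaMA Factory.

```python
def lora_param_counts(d_out: int, d_in: int, rank: int) -> tuple[int, int]:
    """Return (full, lora) trainable-parameter counts for one weight matrix.

    Full fine-tuning updates every entry of W (d_out x d_in).
    LoRA freezes W and trains B (d_out x rank) and A (rank x d_in),
    so the effective weight becomes W + B @ A.
    """
    full = d_out * d_in
    lora = rank * (d_out + d_in)
    return full, lora


# Hypothetical layer shape and rank, chosen only for illustration
full, lora = lora_param_counts(d_out=2048, d_in=2048, rank=8)
print(f"full fine-tuning: {full:,} trainable parameters")
print(f"LoRA (rank 8):    {lora:,} trainable parameters")
print(f"LoRA trains {lora / full:.2%} of the matrix")
```

Because the count grows with `rank * (d_out + d_in)` rather than `d_out * d_in`, the saving becomes larger as the layers get wider.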

A further benefit of LoRA is that, because the pretrained weights stay frozen, the generated adapter is lightweight, portable, and easy to store.

This tutorial uses LLaMA Factory, a low-code and no-code platform for training and fine-tuning [large language models](https://www.ibm.com/jp-ja/think/topics/large-language-models) (LLMs). With it, users can tune LLMs on custom datasets, evaluate performance, and serve models. It offers both an easy-to-use web UI and a CLI, supports more than 100 LLMs, and accepts datasets in Alpaca and ShareGPT formats. LLaMA Factory is not the only way to fine-tune LLMs: the [PEFT](https://www.ibm.com/jp-ja/think/topics/parameter-efficient-fine-tuning) library for parameter-efficient fine-tuning is another option for updating large models. PEFT can also run [quantized](https://www.ibm.com/jp-ja/think/topics/quantization) LoRA (QLoRA), which makes the fine-tuned model even more compact. This tutorial uses a non-quantized version of Granite 3.3.

LLaMA Factory can run without large-scale compute resources, but it does require a [GPU](https://www.ibm.com/jp-ja/think/topics/gpu) and a substantial amount of memory. In this tutorial, you run LLaMA Factory on watsonx® to get GPU resources and storage for the generated adapter.

Configuration

Watson Studio configuration

a. Log in to watsonx.ai® with your IBM® Cloud account.

b. Create a watsonx.ai project. Note the project ID under [Projects] > [Manage] > [General] > [Project ID].
You need this ID for this tutorial.

c. Create a watsonx.ai Runtime service instance. For this tutorial, you need a paid instance to get access to a GPU.

d. Generate a watsonx application programming interface (API) key.

e. Associate the watsonx.ai Runtime service with the project that you created in watsonx.ai.

Cloud Object Storage

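a. To create Cloud Object Storage for your notebook, go to https://cloud.ibm.com/ and select [Create instance].

b. This opens a creation dialog where you can choose a pricing plan. The Standard plan is sufficient for this tutorial.

c. Next, give your Cloud Object Storage instance a name.

d. After the instance is created, return to your project, select [New asset], and then select [Connect to a data source].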

Image showing the watsonx data connection for Cloud Object Storage

Configuring the data connection for Cloud Object Storage

e. Select [Cloud Object Storage].

f. In the next dialog, select by name the instance that you created in steps a through d.

g. Select [Create].

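Create a Jupyter Notebook

a. In your project environment, select the [Assets] tab.

b. Select [New asset].

c. In the left panel, select the [Working with models] option.

d. Click [Working with data and models by using Python and R notebooks].

e. Enter a name for your notebook in the [Name] field. Select Runtime 23.1 on Python (4 vCPU, 16 GB RAM) to define the configuration.

f. Select [Create].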

Setup

Next, install the dependencies into the runtime: first llamafactory, which generates the low-rank adapter, and then pandas, which is used to format the dataset in Alpaca format.

!pip install -q llamafactory 2>/dev/null
# pandas needed to format the dataset
!pip install -q --upgrade pandas 2>/dev/null

Check the GPU environment

Next, verify that the watsonx environment provides the Torch-compatible GPU that LLaMA Factory requires.

import torch

# LLaMA Factory needs a CUDA-capable GPU; warn early if none is visible
if not torch.cuda.is_available():
    print("No GPU found, please set up a GPU before using LLaMA Factory.")

If the previous snippet does not print "No GPU found", you are good to go. Next, import the libraries used to manipulate the data and to create the LLaMA Factory configuration file for training.

# Import libraries
import pandas as pd
import json
import yaml

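Download and process the MedReason dataset

This tutorial uses a portion of the MedReason dataset. MedReason is a large-scale, high-quality medical reasoning dataset designed to enable explainable medical problem-solving in LLMs. Although MedReason focuses on a model's reasoning and on validating the chain of thought that the model uses, here it is also useful as a dataset recent enough that it is not included in the training data of IBM® Granite® 3.3. Granite 3.3 is designed to learn through fine-tuning, and this tutorial runs that fine-tuning with LLaMA Factory. Granite models can be fine-tuned efficiently even with limited compute resources. Load the selected MedReason data from GitHub: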

# Load the selected MedReason file (JSON Lines) directly from GitHub into a DataFrame
training = pd.read_json("https://raw.githubusercontent.com/UCSC-VLAA/MedReason/refs/heads/main/eval_data/medbullets_op4.jsonl", lines=True)

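LLaMA Factory requires the dataset to be preformatted in Alpaca or ShareGPT format. The question and answer fields of the original medical dataset are therefore reformatted to follow the Alpaca format, with instruction, input, and output fields: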

{
    "instruction": "user instruction (required)",
    "input": "user input (optional)",
    "output": "model response (required)",
    "system": "system prompt (optional)"
}

MedReason is not formatted as Alpaca, so the next cell builds an Alpaca-format dataset from it.

!mkdir -p data

# Format Med Dataset to Alpaca Format
formatted_data = [
    {
        "instruction": row["question"] + str(row["options"]),
        "input": "",
        "output": row["answer"]
    }
    for _, row in training.iterrows()
]

# output formatted MedReason dataset
with open("data/med.json", "w", encoding="utf-8") as f:
  json.dump(formatted_data, f, indent=2, ensure_ascii=False)
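Before handing the file to LLaMA Factory, it can help to sanity-check that every record carries the fields the Alpaca layout marks as required. The sketch below is one way to do that; the helper name `find_invalid_records` and the two sample records are hypothetical, and it assumes both required fields should be non-empty strings.

```python
REQUIRED_KEYS = ("instruction", "output")  # marked "required" in the Alpaca layout


def find_invalid_records(records):
    """Return indices of records missing a required key or holding an empty value."""
    bad = []
    for i, rec in enumerate(records):
        for key in REQUIRED_KEYS:
            value = rec.get(key)
            if not isinstance(value, str) or not value.strip():
                bad.append(i)
                break
    return bad


# Tiny hypothetical sample; in the notebook you could pass formatted_data instead
sample = [
    {"instruction": "Which option is correct? {'A': 'x', 'B': 'y'}", "input": "", "output": "x"},
    {"instruction": "Question without an answer", "input": "", "output": ""},
]
print(find_invalid_records(sample))
```

An empty list means every record has a usable instruction and output. Any index that is returned points at a row worth inspecting before training.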

LLaMA Factory uses a specific file to understand how to load datasets for training. This file must exist at the path data/dataset_info.json, so you need to create a dataset_info.json file that contains the path to the newly formatted medical dataset, which lets the LLaMA Factory CLI access it. For details on dataset_info.json, see the LLaMA Factory documentation. The LLaMA Factory repository ships with ready-to-use datasets, but because this tutorial uses its own custom dataset, that dataset has to be added to the JSON file.

# "med" will be the identifier for the dataset 
# which points to the local file that contains the dataset
dataset_info = {
  "med": {
    "file_name": "med.json",
  }
}

# Create dataset_info.json with the medical dataset so LLaMA Factory can reference it
with open("data/dataset_info.json", "w", encoding="utf-8") as f:
  json.dump(dataset_info, f, indent=2, ensure_ascii=False)
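A common failure at this point is a `file_name` entry that does not match a file on disk. The sketch below reports such entries. The helper name `missing_dataset_files` is hypothetical, and it assumes, as in this tutorial, that each `file_name` is relative to the directory holding dataset_info.json. The demo runs in a temporary directory so it does not touch your notebook files.

```python
import json
import tempfile
from pathlib import Path


def missing_dataset_files(data_dir):
    """Return dataset names in dataset_info.json whose file_name is not present in data_dir."""
    data_dir = Path(data_dir)
    with open(data_dir / "dataset_info.json", encoding="utf-8") as f:
        info = json.load(f)
    return [
        name
        for name, entry in info.items()
        if "file_name" in entry and not (data_dir / entry["file_name"]).is_file()
    ]


# Self-contained demo with made-up entries; in the notebook you could pass "data"
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / "med.json").write_text("[]", encoding="utf-8")
    (tmp / "dataset_info.json").write_text(
        json.dumps({"med": {"file_name": "med.json"}, "other": {"file_name": "absent.json"}}),
        encoding="utf-8",
    )
    result = missing_dataset_files(tmp)

print(result)
```

In the notebook, `missing_dataset_files("data")` should return an empty list once med.json and dataset_info.json are both in place.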

Now that the Alpaca-format JSON object is saved in the environment, you are ready to start training.

Fine-tuning

The next step is to set up the training configuration and write it to a YAML file that LLaMA Factory uses to run training. You then run supervised fine-tuning (SFT) on a subset of the MedReason dataset. LLaMA Factory supports several different types of training. The most commonly used are:

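Pre-training: The model receives initial training on an extensive dataset so that it can generate responses for basic language and ideas.
Supervised fine-tuning (SFT): The model receives additional training on annotated data to improve its accuracy for a particular function or topic.
Reward modeling: The model learns how to achieve a specific incentive or reward, which then informs proximal policy optimization (PPO) of its output.
PPO training: A reinforcement learning (RL) technique that further refines the model through policy gradient methods to increase its effectiveness in a specific environment.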

LoRA can be configured with many settings, but the most important and commonly used are:

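Learning rate (LR): The learning rate determines how strongly each model parameter is updated during each training iteration. A higher LR allows larger updates and can speed up convergence, but it risks overshooting the optimal solution or oscillating around it. A lower LR converges more slowly but more stably, which reduces the risk of instability near the optimal solution.
loraplus_lr_ratio: This setting fixes the ratio between learning rates (in LoRA+, the learning rate of the B matrices relative to that of the A matrices). It should generally be 1 or greater, but the best choice of loraplus_lr_ratio depends on the model and the task. As a guideline, loraplus_lr_ratio should be larger when the task is harder and the model needs to update its features to learn well. In that case, it helps to make the learning rate slightly smaller than a typical LoRA learning rate (for example, by a factor of 2).
Effective batch size: Setting the batch size correctly is important for balancing training stability against the VRAM limit of the GPU you are using. The effective batch size is the product of per_device_train_batch_size and gradient_accumulation_steps. A larger effective batch size generally gives smoother, more stable training, but it can also require more VRAM than the GPU has. A smaller effective batch size can introduce more variance.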
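The arithmetic behind these settings can be sketched in a few lines, using the values from this tutorial's training configuration. The helper names are hypothetical. The `num_devices` factor is an added assumption for multi-GPU data-parallel runs (this tutorial uses a single GPU), and the learning-rate split assumes the LoRA+ convention in which the ratio is the B-matrix learning rate divided by the A-matrix learning rate.

```python
def effective_batch_size(per_device_train_batch_size, gradient_accumulation_steps, num_devices=1):
    """Number of samples that contribute to one optimizer step."""
    return per_device_train_batch_size * gradient_accumulation_steps * num_devices


def loraplus_learning_rates(learning_rate, loraplus_lr_ratio):
    """Return (lr_A, lr_B) under the assumed LoRA+ convention lr_B = ratio * lr_A."""
    return learning_rate, learning_rate * loraplus_lr_ratio


# Values used in this tutorial's training configuration, single GPU assumed
batch = effective_batch_size(per_device_train_batch_size=4, gradient_accumulation_steps=2)
lr_a, lr_b = loraplus_learning_rates(learning_rate=1e-4, loraplus_lr_ratio=16.0)
print(f"effective batch size: {batch}")
print(f"LoRA+ learning rates: A={lr_a:g}, B={lr_b:g}")
```

If the GPU runs out of VRAM, halving per_device_train_batch_size while doubling gradient_accumulation_steps keeps the effective batch size unchanged.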

The following code sets up the training configuration:

# setup training configurations
args = dict(
  stage="sft",  # do supervised fine-tuning
  do_train=True,  # we're actually training
  model_name_or_path="ibm-granite/granite-3.3-2b-instruct",  # use IBM Granite 3.3 2b instruct model
  dataset="med",  # use medical datasets we created
  template="granite3",   # use granite3 prompt template
  finetuning_type="lora", # use LoRA adapters to save memory
  lora_target="all",  # attach LoRA adapters to all linear layers
  loraplus_lr_ratio=16.0,  # use LoRA+ algorithm with lambda=16.0
  output_dir="granite3_lora",  # the path to save LoRA adapters
  per_device_train_batch_size=4,  # the batch size
  gradient_accumulation_steps=2,  # the gradient accumulation steps
  learning_rate=1e-4,  # the learning rate
  num_train_epochs=3.0, # the epochs of training
  max_samples=500,  # use 500 examples in each dataset
  fp16=True,  # use float16 mixed precision training
  report_to="none", # disable wandb logging
)

# create training config file to run with llama factory
with open("train_granite3_lora_med.yaml", "w", encoding="utf-8") as file:
  yaml.dump(args, file, indent=2)

The next cell trains the model. It can take up to 10 minutes to run.

!llamafactory-cli train train_granite3_lora_med.yaml;

Using Cloud Object Storage

Next, create two functions: one to upload data to IBM Cloud Object Storage and one to download data from it.

from ibm_botocore.client import Config
import ibm_boto3

def _cos_client(credentials):
    # Build an S3-compatible COS client from the generated credentials dict
    return ibm_boto3.client(
        service_name="s3",
        ibm_api_key_id=credentials["IBM_API_KEY_ID"],
        ibm_service_instance_id=credentials["IAM_SERVICE_ID"],
        ibm_auth_endpoint=credentials["IBM_AUTH_ENDPOINT"],
        config=Config(signature_version="oauth"),
        endpoint_url=credentials["ENDPOINT"],
    )


def upload_file_cos(credentials, local_file_name, key):
    # Upload a local file to the bucket under the given object key
    cos = _cos_client(credentials)
    try:
        cos.upload_file(Filename=local_file_name, Bucket=credentials["BUCKET"], Key=key)
    except Exception as e:
        print(type(e).__name__, e)
    else:
        print("File Uploaded")


def download_file_cos(credentials, local_file_name, key):
    # Download the object stored under the given key to a local file
    cos = _cos_client(credentials)
    try:
        cos.download_file(Bucket=credentials["BUCKET"], Key=key, Filename=local_file_name)
    except Exception as e:
        print(type(e).__name__, e)
    else:
        print("File Downloaded")

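The next cell contains the Cloud Object Storage credentials. In your notebook, click the "Code Snippets" tab in the right corner. This opens a menu with several options for generated code snippets. Select "Read Data":

Using a prepared code snippet in Watson Studio

This step lets you select a data file from the menu. If you have not uploaded anything to your Cloud Object Storage instance yet, you need to upload something in order to generate credentials. A classic dataset such as wine.csv works for this.

Selecting a data asset in Watson Studio

After you click [Select], you can generate the credentials snippet under the [Load as] option. Select [Insert code to cell].

Inserting a generated code snippet in Watson Studio

This step generates a cell like the following, with credentials that contain the correct generated IDs and endpoints: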

# @hidden_cell
# The following code contains metadata for a file in your project storage.
# You might want to remove secret properties before you share your notebook.

storage_metadata = {
    'IAM_SERVICE_ID': '',
    'IBM_API_KEY_ID': '',
    'ENDPOINT': '',
    'IBM_AUTH_ENDPOINT': '',
    'BUCKET': '',
    'FILE': ''
}
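Because the generated cell can come back with blank fields, a quick check before calling the storage helpers can save a confusing authentication error. The sketch below lists the keys the upload and download functions read. The helper name `missing_credential_keys` and all values in `example` are hypothetical placeholders.

```python
REQUIRED_COS_KEYS = ("IBM_API_KEY_ID", "IAM_SERVICE_ID", "IBM_AUTH_ENDPOINT", "ENDPOINT", "BUCKET")


def missing_credential_keys(credentials):
    """Return required keys that are absent or empty in a credentials dict."""
    return [key for key in REQUIRED_COS_KEYS if not credentials.get(key)]


# Hypothetical placeholder values; never hard-code real secrets in a shared notebook
example = {
    "IAM_SERVICE_ID": "example-service-id",
    "IBM_API_KEY_ID": "",
    "ENDPOINT": "https://example.invalid",
    "IBM_AUTH_ENDPOINT": "https://example.invalid/token",
    "BUCKET": "example-bucket",
    "FILE": "wine.csv",
}
print(missing_credential_keys(example))
```

In the notebook, `missing_credential_keys(storage_metadata)` should return an empty list before you attempt an upload or download.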

Now zip the folder that contains the adapter itself and the information about the adapter:

!zip -r "granite3_lora.zip" "granite3_lora"

Verify that the zip archive was created correctly.

!ls
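Listing the directory only shows that the archive exists. As a slightly stronger check, the sketch below opens the archive, runs the standard library's integrity test, and lists its members. The helper name `zip_members` is hypothetical, and the demo builds a throwaway archive with an illustrative file name so that it runs on its own.

```python
import tempfile
import zipfile
from pathlib import Path


def zip_members(zip_path):
    """Return the sorted file names stored in a zip archive, after an integrity check."""
    with zipfile.ZipFile(zip_path) as zf:
        corrupt = zf.testzip()  # returns the first bad member name, or None
        if corrupt is not None:
            raise ValueError(f"corrupt member in archive: {corrupt}")
        return sorted(zf.namelist())


# Self-contained demo with an illustrative member name; in the notebook you
# could call zip_members("granite3_lora.zip") instead
with tempfile.TemporaryDirectory() as tmp:
    archive = Path(tmp) / "demo.zip"
    with zipfile.ZipFile(archive, "w") as zf:
        zf.writestr("granite3_lora/adapter_config.json", "{}")
    members = zip_members(archive)

print(members)
```

Once the archive looks right, it can be stored with the upload helper defined earlier, for example `upload_file_cos(storage_metadata, "granite3_lora.zip", "granite3_lora.zip")`. That call is an assumption about how the upload step is meant to be used, since the tutorial does not show it.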

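Inference

Now it is time to run inference. Inference is backed by Hugging Face generation, which provides the model.generate() method for text generation with PyTorch. This tutorial first asks the base model a medical question taken from the MedReason dataset. Because the base model is a general-purpose model trained on large, diverse datasets, it is reasonable that it might not answer this question correctly. First, set up the inference configuration: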

# setup inference configurations
args = dict(
  model_name_or_path="ibm-granite/granite-3.3-2b-instruct",  # use IBM Granite 3.3 2b instruct model
  template="granite3",  # set to the same one used in training, template for constructing prompts
  infer_backend="huggingface"  # choices: [huggingface, vllm]
)

# create inference config file to run with llama factory
with open("inference_config.yaml", "w", encoding="utf-8") as file:
  yaml.dump(args, file, indent=2)

Next, ask the chatbot one question from the MedReason dataset:

from llamafactory.chat import ChatModel
chat_model = ChatModel(args)
messages = []

# run inference chatbot
question = '''
A 1-year-old girl is brought to a neurologist due to increasing seizure frequency over the past 2 months. 
She recently underwent a neurology evaluation which revealed hypsarrhythmia on electroencephalography (EEG) with a mix of slow waves, multifocal spikes, and asynchrony. 
Her parents have noticed the patient occasionally stiffens and spreads her arms at home. She was born at 38-weeks gestational age without complications. 
She has no other medical problems. Her medications consist of lamotrigine and valproic acid. Her temperature is 98.3\u00b0F (36.8\u00b0C), blood pressure is 90/75 mmHg, pulse is 94/min, and respirations are 22/min. 
Physical exam reveals innumerable hypopigmented macules on the skin and an irregularly shaped, thickened, and elevated plaque on the lower back. 
Which of the following is most strongly associated with this patient's condition?"
"A": "Cardiac rhabdomyoma", "B": "Glaucoma", "C": "Optic glioma", "D": "Polyostotic fibrous dysplasia"
'''

Create a new message from the question and pass it to the base model:

messages.append({"role": "user", "content": question})

response = ""
for new_text in chat_model.stream_chat(messages):
    response += new_text

print(response)
messages.append({"role": "assistant", "content": response})

The following is sample output from the base Granite® model:

User:

A 1-year-old girl is brought to a neurologist due to increasing seizure frequency over the past 2 months. 
She recently underwent a neurology evaluation which revealed hypsarrhythmia on electroencephalography (EEG) with a mix of slow waves, multifocal spikes, and asynchrony. 
Her parents have noticed the patient occasionally stiffens and spreads her arms at home. She was born at 38-weeks gestational age without complications. 
She has no other medical problems. Her medications consist of lamotrigine and valproic acid. Her temperature is 98.3°F (36.8°C), blood pressure is 90/75 mmHg, pulse is 94/min, and respirations are 22/min. 
Physical exam reveals innumerable hypopigmented macules on the skin and an irregularly shaped, thickened, and elevated plaque on the lower back. 
Which of the following is most strongly associated with this patient's condition?"
"A": "Cardiac rhabdomyoma", "B": "Glaucoma", "C": "Optic glioma", "D": "Polyostotic fibrous dysplasia"

Response:

The most strongly associated condition with this patient's condition is "C": "Optic glioma".

The patient's symptoms of hypsarrhythmia on EEG, seizure frequency increase, and the presence of hypopigmented macules and a thickened plaque on the lower back are indicative of a neurological disorder. Optic glioma is a type of brain tumor that can present with these symptoms, including seizures and visual disturbances.

Option A, "Cardiac rhabdomyoma", typically presents with cardiac involvement and is not associated with the described EEG findings or skin manifestations.

Option B, "Glaucoma", is an eye disease that can lead to vision loss but is not associated with the EEG findings or skin lesions described.

Option D, "Polyostotic fibrous dysplasia", is a bone disorder characterized by multiple bone lesions and is not associated with the neurological symptoms and EEG findings presented.

Therefore, based on the clinical presentation, the most likely diagnosis is an optic glioma.

The correct answer from the dataset is:

answer: Cardiac rhabdomyoma

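In other words, the base model does not produce the correct answer.

Inference with the LoRA fine-tuned adapter

Now compare the results from the base model with those from the LoRA-tuned adapter. Ask the same question to see how tuning on the medical dataset lets the model better understand and answer medical questions. If you ran the LoRA training in the same session, you do not need the next cell. However, if you are coming back to the Jupyter Notebook and do not want to retrain, you can download the fine-tuned adapter from your COS instance.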

# Pass the credentials dict generated earlier (named storage_metadata in this notebook)
download_file_cos(storage_metadata, "granite3_lora.zip", "granite3_lora.zip")
!unzip granite3_lora.zip

Configure the ChatModel options so that they include the adapter.

# setup inference configurations
args = dict(
  model_name_or_path="ibm-granite/granite-3.3-2b-instruct",  # use IBM Granite 3.3 2b instruct model
  adapter_name_or_path="granite3_lora", # load the saved LoRA adapters
  template="granite3", # set to the same one used in training, template for constructing prompts
  finetuning_type="lora", # which fine-tuning technique used in training
  infer_backend="huggingface" # choices: [huggingface, vllm]
)

# create inference config file to run with llama factory
with open("inference_config.yaml", "w", encoding="utf-8") as file:
  yaml.dump(args, file, indent=2)


from llamafactory.chat import ChatModel
chat_model = ChatModel(args)

You can now test the same reasoning challenge with the fine-tuned model.

messages = []

# run inference chatbot
question = '''
A 1-year-old girl is brought to a neurologist due to increasing seizure frequency over the past 2 months. 
She recently underwent a neurology evaluation which revealed hypsarrhythmia on electroencephalography (EEG) with a mix of slow waves, multifocal spikes, and asynchrony. 
Her parents have noticed the patient occasionally stiffens and spreads her arms at home. She was born at 38-weeks gestational age without complications. 
She has no other medical problems. Her medications consist of lamotrigine and valproic acid. Her temperature is 98.3\u00b0F (36.8\u00b0C), blood pressure is 90/75 mmHg, pulse is 94/min, and respirations are 22/min. 
Physical exam reveals innumerable hypopigmented macules on the skin and an irregularly shaped, thickened, and elevated plaque on the lower back. 
Which of the following is most strongly associated with this patient's condition?"
"A": "Cardiac rhabdomyoma", "B": "Glaucoma", "C": "Optic glioma", "D": "Polyostotic fibrous dysplasia"
'''

Create a new message from the question and pass it to the fine-tuned model:

messages.append({"role": "user", "content": question})

response = ""
for new_text in chat_model.stream_chat(messages):
    response += new_text

print(response)
messages.append({"role": "assistant", "content": response})

Sample output from the fine-tuned model:

User:

A 1-year-old girl is brought to a neurologist due to increasing seizure frequency over the past 2 months. 
She recently underwent a neurology evaluation which revealed hypsarrhythmia on electroencephalography (EEG) with a mix of slow waves, multifocal spikes, and asynchrony. 
Her parents have noticed the patient occasionally stiffens and spreads her arms at home. She was born at 38-weeks gestational age without complications. 
She has no other medical problems. Her medications consist of lamotrigine and valproic acid. Her temperature is 98.3°F (36.8°C), blood pressure is 90/75 mmHg, pulse is 94/min, and respirations are 22/min. 
Physical exam reveals innumerable hypopigmented macules on the skin and an irregularly shaped, thickened, and elevated plaque on the lower back. 
Which of the following is most strongly associated with this patient's condition?"
"A": "Cardiac rhabdomyoma", "B": "Glaucoma", "C": "Optic glioma", "D": "Polyostotic fibrous dysplasia"

Response:

Cardiac rhabdomyoma

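This time, thanks to the trained adapter, the model produced the correct answer. One thing to note is that the model no longer responds with its own reasoning. That is because the dataset used for LoRA contains only the correct answer as the expected model output. LoRA fine-tuning can be used not only to supply new information but also to instruct the model how to respond.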
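When comparing the two responses programmatically, note that a verbose answer may restate every option, so a plain substring test for the correct answer can report a false match. The sketch below uses a crude heuristic instead: it picks whichever option is mentioned first. The helper name `first_option_mentioned` is hypothetical, the heuristic is only an assumption about how answers tend to be phrased, and the base response is shortened from the sample output shown earlier.

```python
def first_option_mentioned(response, options):
    """Return the option text that appears earliest in the response, or None."""
    lowered = response.lower()
    hits = [(lowered.find(opt.lower()), opt) for opt in options]
    hits = [(pos, opt) for pos, opt in hits if pos != -1]
    return min(hits)[1] if hits else None


options = ["Cardiac rhabdomyoma", "Glaucoma", "Optic glioma", "Polyostotic fibrous dysplasia"]
correct = "Cardiac rhabdomyoma"

# Shortened versions of the two responses shown in this tutorial
base_response = (
    'The most strongly associated condition is "C": "Optic glioma". '
    'Option A, "Cardiac rhabdomyoma", typically presents with cardiac involvement.'
)
tuned_response = "Cardiac rhabdomyoma"

for label, text in [("base", base_response), ("fine-tuned", tuned_response)]:
    picked = first_option_mentioned(text, options)
    print(f"{label}: picked {picked!r}, correct={picked == correct}")
```

For a real evaluation you would run this over many held-out questions rather than a single example, since one matching answer says little about overall accuracy.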

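Summary

In this tutorial, you fine-tuned the IBM® Granite-3.3-2b-Instruct model with LoRA, using new medical knowledge and a detailed template for how to respond. You saw that Granite 3.3 can learn even with a small model and a limited number of samples from the dataset.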