Parameters for tuning foundation models

Tuning parameters configure the tuning experiments that you use to tune a foundation model.

Note: The parameters that you change when you tune a foundation model apply to the tuning experiment, not to the underlying foundation model.

Some tuning parameters are common across all tuning methods. Other parameters are specific to the tuning method in use.

The parameters that you can use to tune foundation models in watsonx.ai are described in the following sections:

Common tuning parameters

The following table describes tuning parameters that you can customize for all tuning methods.

Common tuning parameter values and descriptions
Parameter name Description Value options Learn more
Batch size Number of labeled examples to process at one time. 1–16 Segmenting the training data
Accumulate steps Number of batches to process before adjustments are made. 1–128 Segmenting the training data
Learning rate Determines the scope of the change to make when the model is adjusted. 0.00001–0.5 Managing the learning rate
Number of epochs (number of training cycles) Number of times to cycle through the training data. 1–50 Choosing the number of training cycles to complete

Prompt tuning parameters

In addition to the common tuning parameters, the following table describes the tuning parameters that are associated with prompt tuning specifically.

Parameter values for prompt tuning foundation models
Parameter name Description Value options Learn more
Initialization method Specifies how to initialize the prompt vector. Random, Text Initializing prompt tuning
Initialization text Text to use as the prompt for the first run of the experiment. Initializing prompt tuning

Setting parameter values for prompt tuning

The best hyperparameter values to use for a prompt-tuning experiment differ based on your data and use case.

The following table captures the parameter values to use as a starting point for prompt tuning a third-party foundation model.

Tuning parameter values for third-party foundation models
Parameter name Default value for flan-t5-xl-3b Default value for llama-2-13b-chat
Initialization method Random Random
Initialization text None None
Batch size 16 8
Accumulate steps 16 16
Learning rate 0.3 0.002
Number of epochs (number of training cycles) 20 20
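
These starting values can be collected into a simple configuration dictionary, as in the following Python sketch. The key names are illustrative only, not the exact field names that the watsonx.ai API expects:

    # Illustrative starting configuration for prompt tuning flan-t5-xl-3b.
    # Key names are hypothetical; map them to the fields that your tuning
    # client or API actually expects.
    prompt_tuning_config = {
        "base_model": "flan-t5-xl-3b",
        "init_method": "random",     # or "text", with an "init_text" value
        "batch_size": 16,
        "accumulate_steps": 16,
        "learning_rate": 0.3,
        "num_epochs": 20,
    }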

The default parameters that are used for prompt tuning the granite-13b-instruct-v2 foundation model are adjusted based on the type of task you want the tuned model to do.

The following table captures the parameter values to use as a starting point per supported task type for prompt tuning the granite-13b-instruct-v2 foundation model.

Tuning parameter values for the granite-13b-instruct-v2 foundation model
Parameter name Default value for classification Default value for generation Default value for summarization
Batch size 8 16 8
Accumulate steps 32 16 1
Learning rate 0.0006 0.0002 0.0002
Number of epochs (number of training cycles) 20 20 40

Parameters for fine tuning methods

Additional parameters are used to control tuning experiments for the following fine tuning methods: full fine tuning, and low-rank adaptation (LoRA or QLoRA) fine tuning.

Full fine tuning parameters

In addition to the common tuning parameters, the following table describes the tuning parameter that is associated with full fine tuning specifically.

Parameter value for full fine tuning foundation models
Parameter name Notes Learn more
Number of GPUs Use 4 or more for larger models and datasets Configuring GPUs

Setting parameter values for full fine tuning

The best hyperparameter values to use for a fine-tuning experiment vary based on your data, the foundation model you use, and the type of task you want the model to do.

The following table shows useful starting parameter values for the IBM-provided foundation models that can be fine tuned. You can adjust the parameter values as you learn more about what works best through experimentation.

Default parameter values
Parameter name allam-1-13b-instruct granite-3b-code-instruct granite-8b-code-instruct granite-20b-code-instruct llama-3-1-8b-instruct
Batch size 5 20 5 5 10
Accumulate steps 1 1 1 1 1
Learning rate 0.00001 0.2 0.00003 0.2 0.00001
Number of epochs (number of training cycles) 10 5 5 5 5
Number of GPUs 4 2 4 4 4

The following table shows useful starting parameter values for the IBM-provided foundation models that can be fine tuned starting with the 2.1.1 release. You can adjust the parameter values as you learn more about what works best through experimentation.

Default parameter values for fine tuning provided models
Parameter name granite-3-1-8b-base llama-3-1-8b llama-3-1-70b
Batch size 5 5 5
Accumulate steps 1 1 1
Learning rate 0.00001 0.00001 0.00001
Number of epochs (number of training cycles) 10 10 10
Number of GPUs 4 4 4

The following table captures hyperparameter values that worked well for fine tuning the llama-3-8b-base and mistral-7b foundation models on various datasets. If you bring your own base foundation model to fine tune, consider starting with values similar to these.

Parameter values for fine tuning base foundation models
Parameter name Starting value Alternate value Notes Learn more
Batch size 2 8 Use a smaller size if resources are constrained Segmenting the training data
Accumulate steps 8 4 Avoid 16 when the training data has few examples Segmenting the training data
Learning rate 0.00001 0.00003 Avoid 0.0001 and 0.000001 Managing the learning rate
Number of epochs (number of training cycles) 10 5 When the training data has more examples (closer to 1,000), try 5 Choosing the number of training cycles to complete
Number of GPUs 4 2 Use 4 or more for larger models and datasets Configuring GPUs
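
As a rough sketch, the starting values from the previous table can be expressed as a Python configuration dictionary. The key names are illustrative only:

    # Illustrative starting configuration for fine tuning a base
    # foundation model that you bring; key names are hypothetical.
    fine_tuning_config = {
        "batch_size": 2,         # start small; try 8 if resources allow
        "accumulate_steps": 8,   # try 4 when the training data is small
        "learning_rate": 0.00001,
        "num_epochs": 10,        # try 5 with larger data sets (~1,000 examples)
        "num_gpus": 4,
    }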

Low-rank adaptation fine tuning parameters

In addition to the common tuning parameters, the following table describes the tuning parameters that are associated with low-rank adaptation (LoRA) and quantized low-rank adaptation (QLoRA) fine tuning specifically.

Parameter values for low-rank adaptation fine tuning foundation models
Parameter name Notes
Alpha Multiplier to apply to the adapter weight changes before they are added to the base model weights. This setting controls how much impact the adapter weights have. The alpha value is divided by the rank value to calculate the percentage of weight change to apply.
Dropout Percentage of adapter weights to randomly reset to zero to prevent overfitting to the training data.
Rank Number to use in calculations that select the subset of model parameter weights to adjust.
Target modules Specifies the layers of the base foundation model to target for adaptation.

For more information about these parameters, see Configuring low-rank adaptation (LoRA or QLoRA fine tuning only).

Setting parameter values for LoRA fine tuning

The best hyperparameter values to use for a LoRA fine-tuning experiment vary based on your data, the foundation model you use, and the type of task you want the model to do.

The following table shows useful starting parameter values. You can adjust the parameter values as you learn more about what works best through experimentation.

Default parameter values for a low-rank adaptation fine tuning experiment
Parameter name granite-3-1-8b-base llama-3-1-8b llama-3-1-70b llama-3-1-70b-gptq
Alpha 32 64 32 32
Accumulate steps 1 1 1 1
Batch size 5 8 5 5
Dropout 0.05 0.05 0.05 0.05
Learning rate 0.00001 0.00001 0.00001 0.00001
Number of epochs (number of training cycles) 10 5 10 10
Number of GPUs 4 4 4 4
Rank 8 32 8 8
Target modules ["all-linear"] [] meaning default layers [] meaning default layers [] meaning default layers

For more information about these parameters, see Configuring low-rank adaptation (LoRA or QLoRA fine tuning only).

Parameter descriptions

Segmenting the training data

When an experiment runs, it first breaks the training data into smaller batches, and then trains on one batch at a time. Each batch must fit in GPU memory to be processed. To reduce the amount of GPU memory that is needed, you can configure the tuning experiment to postpone making adjustments until more than one batch is processed. Tuning runs on a batch and its performance metrics are calculated, but no adjustments are made immediately. Instead, the performance information is collected over some number of batches before the cumulative performance metrics are evaluated.

Use the following parameters to control how the training data is segmented:

Batch size: Number of labeled examples (also known as samples) to process at one time.

For example, for a data set with 1,000 examples and a batch size of 10, the data set is divided into 100 batches of 10 examples each.

If the training data set is small, specify a smaller batch size to ensure that each batch has enough examples in it.

Accumulate steps: Number of batches to process before adjustments are made.

For example, if the data set is divided into 100 batches and you set the accumulation steps value to 10, then adjustments are made 10 times instead of 100 times.
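
The following Python sketch shows the arithmetic that these two parameters imply. The helper function is illustrative, not part of the product API:

    import math

    def updates_per_epoch(num_examples, batch_size, accumulate_steps):
        """Return the batches per epoch and the number of adjustments made."""
        num_batches = math.ceil(num_examples / batch_size)
        num_updates = math.ceil(num_batches / accumulate_steps)
        return num_batches, num_updates

    # 1,000 examples with a batch size of 10 yield 100 batches per epoch.
    # With 10 accumulation steps, adjustments are made only 10 times.
    print(updates_per_epoch(1000, 10, 10))  # (100, 10)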

Choosing the number of training cycles to complete

The Number of epochs parameter specifies the number of times to cycle through the complete training dataset.

For example, with a batch size of 10 and a data set with 1,000 examples, one epoch must process 100 batches and make adjustments 100 times. If you set the number of epochs to 20, the experiment cycles through the data set 20 times, which means it processes a total of 2,000 batches during the tuning process.
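
Expressed as a quick calculation (an illustrative sketch, not product code):

    # 1,000 examples with a batch size of 10 yield 100 batches per epoch;
    # 20 epochs process 2,000 batches in total.
    num_examples, batch_size, num_epochs = 1000, 10, 20
    batches_per_epoch = num_examples // batch_size    # 100
    total_batches = batches_per_epoch * num_epochs    # 2000
    print(batches_per_epoch, total_batches)           # 100 2000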

The higher the number of epochs and the bigger your training data, the longer it takes to tune a model. If you set the number of epochs too low, the model might not learn adequately. If you set the number of epochs too high, you can overfit the model to the data set. Overfitting describes the phenomenon where a model is so closely tuned to its training data that it cannot generalize and apply what it learns when new data is introduced.

Managing the learning rate

The learning rate parameter determines the scope of the change to make when the model is adjusted. The higher the number, the greater the change. Setting the learning rate too low might prevent the model from learning adequately from the new data presented. Setting the learning rate too high might prevent the model from learning gradually enough to be able to apply what it learns to new, unseen data.

This parameter is one that you might want to set conservatively, and then change gradually as you experiment to find the best hyperparameters for the dataset and foundation model that you are customizing.
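
For example, one way to experiment is to sweep a few conservative candidate rates, run a tuning experiment with each, and compare the resulting loss curves. The values in this sketch are illustrative only:

    # Hypothetical learning-rate sweep: start low and increase gradually,
    # comparing the evaluation loss of each completed tuning run.
    for learning_rate in (0.00001, 0.00003, 0.00005):
        print(f"Queue a tuning run with learning_rate={learning_rate}")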

Setting token limits

You can change the number of tokens that are allowed in the model input and output during a tuning experiment by setting the max_seq_length parameter. The maximum sequence length is the maximum number of input tokens plus the output tokens allowed for each prompt.

The larger the number of allowed input and output tokens, the longer it takes to tune the model. Set this parameter to the smallest value that still represents your use case properly.

Create input and output examples in your training data that conform to the limit you plan to use for tuning. Examples that are longer than the specified maximum sequence length are truncated during the experiment. For example, if you set this parameter to 200 and the training data has an example input with 1,000 tokens, only the first 200 tokens of the example input are used.

Remember, the sequence length also includes the output tokens for each prompt, which means that the setting also limits the number of tokens that the model is allowed to generate as output during the tuning experiment.
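
The truncation behavior can be sketched as follows, assuming a generic tokenizer that produces a list of token IDs. This is an illustration, not the product implementation:

    def truncate_to_max_seq_length(token_ids, max_seq_length=200):
        """Keep only the first max_seq_length tokens, mirroring how
        over-length training examples are cut during the experiment."""
        return token_ids[:max_seq_length]

    # An example input with 1,000 tokens is reduced to its first 200 tokens.
    example_tokens = list(range(1000))  # stand-in for a tokenized example
    print(len(truncate_to_max_seq_length(example_tokens)))  # 200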

Editing the verbalizer

The verbalizer is a template that defines how your training samples are formatted when they are submitted to the foundation model during a tuning experiment.

The format of the verbalizer can change depending on the base model. You might want to customize the verbalizer if more descriptive prefix text can guide the foundation model to generate better answers. However, if you edit the verbalizer, follow these guidelines:

  • Change the verbalizer only after you use prompt engineering to validate that the custom format improves foundation model output.

  • Do not edit the {{input}} variable.

    This variable instructs the tuning experiment to extract text from the input segment of the examples in your training data file.

  • If you change the verbalizer that is used to tune a foundation model, use the same prefixes when you inference the tuned model later.
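
As an illustration, a verbalizer is applied to a training example roughly as follows. The prefix text here is a hypothetical example, and the exact template format depends on the base model:

    # Hypothetical verbalizer with descriptive prefix text. The {{input}}
    # variable is replaced by the input segment of each training example;
    # never edit the variable itself.
    verbalizer = "Classify the sentiment of this review: {{input}}"

    def apply_verbalizer(verbalizer, example_input):
        return verbalizer.replace("{{input}}", example_input)

    print(apply_verbalizer(verbalizer, "The battery lasts all day."))
    # Classify the sentiment of this review: The battery lasts all day.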

Configuring low-rank adaptation (LoRA or QLoRA fine tuning only)

For LoRA and QLoRA fine-tuning experiments only, you can adjust the following tuning experiment parameters:

  • alpha: Determines the multiplier to apply to the adapter weight changes when they are added to the base model weights. The alpha value is divided by the rank value to calculate the percentage of weight change to apply. For example, if alpha is 2 and rank is 8, then the adapter weights are reduced by 1/4 or 25% before they are added to the model weights. The resulting value controls how much impact you want the added weights to have. This setting has a similar function to learning rate, so you might want to adjust only one or the other setting during experimentation.

  • dropout: Resets weights in some of the LoRA adapter parameters to zero randomly. Specify a decimal value to indicate the percentage of weights to reset, such as 0.1 for 10%. Dropout helps to prevent overfitting, which occurs when the model learns to respond to the specific training dataset, but cannot generalize its learning to respond as expected to new inputs.

  • rank: Rank to use when doing matrix decomposition calculations on the base model matrices. A lower number means a faster job because there are fewer trainable parameters to adjust in the adapter. A lower number also introduces the possibility of lower fidelity to the original model parameter weights. Use a higher number if the task that you want the foundation model to learn differs greatly from, or is completely new to, the tasks that the base foundation model can already do. The maximum value allowed is 64.

  • target_modules: The layers of the base foundation model where you want to add low-rank adapters during tuning. Options include:

    • ["all-linear"]: Selects all linear and 1-dimensional convolutional neural network layers, except the output layer. This option works with all decoder-only models.
    • [] (empty array): Adds adapters to the defaults specified for the model architecture.
    • ["$layer-name", "$layer-name"]: Lists a subset of layers. The layer names differ by model architecture. See the Model layers of foundation model architectures table.
  • type: Specify one of the following options:

    • lora: Runs a low-rank adaptation fine-tuning experiment. This type can only be applied to non-quantized foundation models and works best with base foundation models that are not instruction tuned.
    • qlora: Runs a quantized low-rank adaptation fine-tuning experiment. This type can only be applied to quantized foundation models.
    • none: Runs a full fine-tuning experiment. None is the default value.

Model layers of foundation model architectures
Base foundation model architecture Layers Layers targeted by default
llama [down_proj, up_proj, gate_proj, q_proj, k_proj, v_proj, o_proj] ["q_proj", "v_proj"]
granite [down_proj, up_proj, gate_proj, q_proj, k_proj, v_proj, o_proj] ["q_proj", "v_proj"]
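
The following Python sketch shows the alpha-to-rank scaling that the alpha description mentions, together with an illustrative set of LoRA settings. The dictionary keys are hypothetical, not the exact API field names:

    # Scaling factor that is applied to the adapter weight changes.
    alpha, rank = 2, 8
    scaling = alpha / rank  # 0.25: adapter updates are reduced to 25%
    print(scaling)

    # Illustrative LoRA settings, loosely based on the granite defaults
    # in the earlier table; map the keys to your tuning client's fields.
    lora_config = {
        "type": "lora",
        "alpha": 32,
        "rank": 8,
        "dropout": 0.05,  # reset 5% of adapter weights at random
        "target_modules": ["q_proj", "v_proj"],
    }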

Configuring GPUs (Fine tuning only)

Tuning Studio uses GPUs to handle the high compute demands of fine tuning a foundation model.

For fine-tuning experiments only, you can adjust the value in the Number of GPUs field based on the resource needs of your experiment. Use a minimum of 2 GPUs. Larger models and use cases that require long sequence lengths or large batch sizes need at least 4 GPUs.

Initializing the prompt (Prompt tuning only)

When you create a prompt-tuning experiment, you can choose whether to specify your own text to serve as the initial prompt vector or let the experiment generate it for you. The new tokens that make up the prompt vector start the training process either in random positions or based on the embeddings of the vocabulary or instruction text that you specify. Studies show that as the size of the underlying model grows beyond 10 billion parameters, the initialization method that is used becomes less important.

The choice that you make when you create the tuning experiment customizes how the prompt is initialized.

Initialization method: Choose a method from the following options:

  • Text: Initializes the prompt vector from the initialization text that you specify yourself.
  • Random: Allows the experiment to initialize the prompt vector with values that are chosen at random.

Initialization text: The text that you want to add. Specify a task description or instructions similar to what you use for zero-shot prompting.
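
For example, a hypothetical configuration fragment for text-based initialization might look like the following sketch; with the Random method, no initialization text is needed:

    # Illustrative initialization settings for a prompt-tuning experiment;
    # the key names are hypothetical.
    init_config = {
        "init_method": "text",
        "init_text": "Summarize the following customer complaint:",
    }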

Parent topic: Tuning a model