Parameters for tuning foundation models

Tuning parameters configure the tuning experiments that you use to tune the foundation model.

Note: The parameters that you change when you tune a foundation model apply to the tuning experiment, not to the underlying foundation model.

Learn more about the steps that occur during a tuning experiment and how parameters that you can configure affect the process.

The workflow differs based on the tuning method that you choose: fine tuning or prompt tuning.

For descriptions of each parameter, see Parameter descriptions.

Fine-tuning workflow

During the experiment, the parameter weights of the model that is being tuned are repeatedly adjusted so that its predictions improve over time.

The following diagram illustrates the steps that occur during a fine-tuning experiment run. The parts of the experiment flow that you can configure are highlighted with a user icon. These decision points correspond to experiment tuning parameters that you control.

Fine-tuning experiment run process details

The diagram shows the following steps of the experiment:

  1. The experiment reads the training data, tokenizes it, and converts it into batches.

    The size of the batches is determined by the batch size parameter.

  2. Sends input from the examples in the batch to the foundation model for the model to process and generate output.

  3. Compares the model's output to the output from the training data that corresponds to the training data input that was submitted. Then, computes the loss, which measures the difference between the predicted output and the actual output from the training data, and the loss gradient, which indicates how to adjust the weights to reduce the loss.

    Periodically, the experiment adjusts the foundation model parameter weights based on the performance of the model. How often this adjustment occurs depends on how the Accumulate steps parameter is configured.

  4. Adjustments are applied to the parameter weights of the foundation model. The degree to which the weights are changed is controlled by the Learning rate parameter.

  5. Input from the next example in the training data is submitted to the foundation model as input.

  6. The process repeats until all of the examples in all of the batches are processed.

  7. The entire set of batches is processed again as many times as is specified in the Number of epochs parameter.
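The numbered steps above can be sketched as a toy training loop. The following example is a minimal illustration, not the actual tuning implementation: it assumes a hypothetical one-parameter model (y = w * x) trained with mean squared error, so that the roles of the Batch size, Accumulate steps, Learning rate, and Number of epochs parameters are visible.

```python
# Toy sketch of the fine-tuning loop. The parameter names mirror the
# tuning parameters described above; the one-weight "model" and the
# training data are illustrative assumptions.

# Training data: (input, expected output) pairs for y = 2x.
data = [(x, 2.0 * x) for x in range(1, 9)]

batch_size = 2        # examples processed at one time (step 1)
accumulate_steps = 2  # batches seen before weights are adjusted (step 3)
learning_rate = 0.01  # scale of each weight adjustment (step 4)
num_epochs = 20       # full passes over all batches (step 7)

w = 0.0               # the single "parameter weight" being tuned
accumulated_grad = 0.0
batches_seen = 0

for epoch in range(num_epochs):
    # Step 1: split the training data into batches.
    batches = [data[i:i + batch_size] for i in range(0, len(data), batch_size)]
    for batch in batches:
        # Steps 2-3: generate output, compare it to the expected output,
        # and compute the gradient of the loss for this batch.
        grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
        accumulated_grad += grad
        batches_seen += 1
        # Step 3: adjust only after `accumulate_steps` batches.
        if batches_seen % accumulate_steps == 0:
            # Step 4: the learning rate scales the adjustment.
            w -= learning_rate * accumulated_grad / accumulate_steps
            accumulated_grad = 0.0

print(round(w, 2))  # converges toward the true weight of 2.0
```

In a real experiment the model has billions of weights and the gradients come from backpropagation, but the control flow of batching, accumulating, and adjusting is the same.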

Initial parameter values for fine tuning

The best hyperparameter values to use for a fine-tuning experiment vary based on your data, the foundation model you use, and the type of task you want the model to do.

The following table shows useful starting parameter values for the IBM-provided foundation models that can be fine tuned. You can adjust the parameter values as you learn more about what works best through experimentation.

Table 1: Default parameter values
| Parameter name | allam-1-13b-instruct | granite-3b-code-instruct | granite-8b-code-instruct | granite-20b-code-instruct | llama-3-1-8b-instruct |
| Batch size | 5 | 20 | 5 | 5 | 10 |
| Accumulate steps | 1 | 1 | 1 | 1 | 1 |
| Learning rate | 0.00001 | 0.2 | 0.00003 | 0.2 | 0.00001 |
| Number of epochs (number of training cycles) | 10 | 5 | 5 | 5 | 5 |
| Number of GPUs | 4 | 2 | 4 | 4 | 4 |

The following table captures hyperparameter values that worked well for fine tuning the llama-3-8b-base and mistral-7b foundation models on various datasets. If you bring your own base foundation model to fine tune, consider starting with values similar to these.

Table 2: Parameter values for fine tuning base foundation models
| Parameter name | Starting value | Alternate value | Notes | Learn more |
| Batch size | 2 | 8 | Use a smaller size if resources are constrained | Segmenting the training data |
| Accumulate steps | 8 | 4 | Avoid 16 when training data has fewer examples | Segmenting the training data |
| Learning rate | 0.00001 | 0.00003 | Avoid 0.0001 and 0.000001 | Managing the learning rate |
| Number of epochs (number of training cycles) | 10 | 5 | When training data has more examples (closer to 1,000), try 5 | Choosing the number of training runs to complete |
| Number of GPUs | 4 | 2 | Use 4 or more for larger models and datasets | Configuring GPUs |

Prompt-tuning workflow

During the experiment, the prompt vector is repeatedly adjusted so that the model's predictions improve over time.

The following diagram illustrates the steps that occur during a prompt-tuning experiment run. The parts of the experiment flow that you can configure are highlighted with a user icon. These decision points correspond to experiment tuning parameters that you control.

Prompt-tuning experiment run process details

The diagram shows the following steps of the experiment:

  1. Initializes the prompt by using the initialization method that you choose.

    If the initialization method parameter is set to text, then you must add the initialization text.

  2. If specified, tokenizes the initialization text and converts it into a prompt vector.

  3. Reads the training data, tokenizes it, and converts it into batches.

    The size of the batches is determined by the batch size parameter.

  4. Sends input from the examples in the batch to the foundation model for the model to process and generate output.

  5. Compares the model's output to the output from the training data that corresponds to the training data input that was submitted. Then, computes the loss, which measures the difference between the predicted output and the actual output from the training data, and the loss gradient, which indicates how to adjust the prompt vector to reduce the loss.

    Periodically, the experiment adjusts the prompt vector that is added to the input based on the performance of the model. How often this adjustment occurs depends on how the Accumulate steps parameter is configured.

  6. Adjustments are applied to the prompt vector that was initialized in Step 2. The degree to which the vector is changed is controlled by the Learning rate parameter. The edited prompt vector is added as a prefix to the input from the next example in the training data, and is submitted to the model as input.

  7. The process repeats until all of the examples in all of the batches are processed.

  8. The entire set of batches is processed again as many times as is specified in the Number of epochs parameter.
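The key difference from fine tuning can be sketched in a few lines. The following toy example is an illustration under stated assumptions, not the real implementation: the "foundation model" is a single frozen weight, and the "prompt vector" is one learned number that acts as a bias on the input. Only the prompt value is adjusted; the frozen weight never changes.

```python
# Toy sketch of prompt tuning: the base model weight stays frozen,
# and only the prompt vector is adjusted. All names and the one-weight
# "model" are illustrative assumptions.

frozen_w = 3.0        # base model weight: never changed during tuning
prompt = 0.0          # steps 1-2: the initialized prompt vector
learning_rate = 0.05
num_epochs = 40
batch_size = 2

# Training data for y = 3x + 5: the frozen model alone cannot fit it;
# the learned prompt must supply the missing offset of 5.
data = [(x, 3.0 * x + 5.0) for x in range(6)]

for epoch in range(num_epochs):
    batches = [data[i:i + batch_size] for i in range(0, len(data), batch_size)]
    for batch in batches:
        # Step 4: the prompt is combined with the input; the model
        # generates output. Step 5: compare to the expected output and
        # compute the gradient with respect to the prompt only.
        grad = sum(2 * (frozen_w * x + prompt - y) for x, y in batch) / len(batch)
        # Step 6: only the prompt vector is adjusted; frozen_w is untouched.
        prompt -= learning_rate * grad

print(round(prompt, 2), frozen_w)  # prompt converges toward 5.0; frozen_w stays 3.0
```

Notice that `frozen_w` is identical before and after tuning, which is the point of the Note that follows: no layer of the base model is changed.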

Note: No layer of the base foundation model is changed during this process.

Default parameters for prompt tuning

The best hyperparameter values to use for a prompt-tuning experiment differ based on your data and use case.

The following table captures the parameter values to use as a starting point for prompt tuning a third-party foundation model.

Table 3: Tuning parameter values for third-party foundation models
| Parameter name | Default value for flan-t5-xl-3b | Default value for llama-2-13b-chat | Learn more |
| Initialization method | Random | Random | Initializing prompt tuning |
| Initialization text | None | None | Initializing prompt tuning |
| Batch size | 16 | 8 | Segmenting the training data |
| Accumulate steps | 16 | 16 | Segmenting the training data |
| Learning rate | 0.3 | 0.002 | Managing the learning rate |
| Number of epochs (number of training cycles) | 20 | 20 | Choosing the number of training runs to complete |

The default parameters that are used for prompt tuning the granite-13b-instruct-v2 foundation model are adjusted based on the type of task you want the tuned model to do.

The following table captures the parameter values to use as a starting point per supported task type for prompt tuning the granite-13b-instruct-v2 foundation model.

Table 4: Tuning parameter values for the granite-13b-instruct-v2 foundation model
| Parameter name | Default value for classification | Default value for generation | Default value for summarization | Learn more |
| Batch size | 8 | 16 | 8 | Segmenting the training data |
| Accumulate steps | 32 | 16 | 1 | Segmenting the training data |
| Learning rate | 0.0006 | 0.0002 | 0.0002 | Managing the learning rate |
| Number of epochs (number of training cycles) | 20 | 20 | 40 | Choosing the number of training runs to complete |

Parameter descriptions

The following table describes the tuning parameters that you can customize.

Table 5: Tuning parameter descriptions
| Parameter name | Description | Value options | Learn more |
| Initialization method (prompt tuning) | Specifies how to initialize the prompt vector. | Random, Text | Initializing prompt tuning |
| Initialization text (prompt tuning) | Text to use as the prompt for the first run of the experiment. | | Initializing prompt tuning |
| Batch size | Number of labeled examples to process at one time. | 1–16 | Segmenting the training data |
| Accumulate steps | Number of batches to process before adjustments are made. | 1–128 | Segmenting the training data |
| Learning rate | Determines the scope of the change to make when the model is adjusted. | 0.00001–0.5 | Managing the learning rate |
| Number of epochs (number of training cycles) | Number of times to cycle through the training data. | 1–50 | Choosing the number of training runs to complete |
| Number of GPUs (fine tuning) | Number of GPU processors for the experiment to use. | 1–8 | Configuring GPUs |
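The value ranges above can be expressed as a simple validation helper. The following sketch is illustrative, not part of any real API: the range values come from the table, but the dictionary keys and the `validate` function are hypothetical names.

```python
# Hypothetical validation sketch for the tuning parameter ranges above.
# The (min, max) pairs come from the parameter description table.
RANGES = {
    "batch_size": (1, 16),
    "accumulate_steps": (1, 128),
    "learning_rate": (0.00001, 0.5),
    "num_epochs": (1, 50),
    "num_gpus": (1, 8),
}

def validate(params):
    """Return the names of parameters that fall outside their allowed range."""
    return [
        name for name, value in params.items()
        if name in RANGES and not (RANGES[name][0] <= value <= RANGES[name][1])
    ]

print(validate({"batch_size": 8, "learning_rate": 0.3}))   # []
print(validate({"batch_size": 32, "num_epochs": 100}))     # ['batch_size', 'num_epochs']
```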

Segmenting the training data

When an experiment runs, the experiment first breaks the training data into smaller batches, and then trains on one batch at a time. Each batch must fit in GPU memory to be processed. To reduce the amount of GPU memory that is needed, you can configure the tuning experiment to postpone making adjustments until more than one batch is processed. Tuning runs on a batch and its performance metrics are calculated, but no adjustments are made immediately. Instead, performance information is collected over the number of batches that you specify before the cumulative performance metrics are evaluated.

Use the following parameters to control how the training data is segmented:

Batch size: Number of labeled examples (also known as samples) to process at one time.

For example, for a data set with 1,000 examples and a batch size of 10, the data set is divided into 100 batches of 10 examples each.

If the training data set is small, specify a smaller batch size to ensure that each batch has enough examples in it.

Accumulate steps: Number of batches to process before adjustments are made.

For example, if the data set is divided into 100 batches and you set the accumulation steps value to 10, then adjustments are made 10 times instead of 100 times.
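The arithmetic in the two examples above can be checked with a short calculation. The variable names here are illustrative; the numbers match the examples (1,000 examples, batch size 10, 10 accumulation steps).

```python
# How batch size and accumulate steps together determine
# how often adjustments are made during one pass over the data.
import math

num_examples = 1000
batch_size = 10
accumulate_steps = 10

num_batches = math.ceil(num_examples / batch_size)
adjustments = num_batches // accumulate_steps

print(num_batches, adjustments)  # 100 batches, 10 adjustments
```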

Choosing the number of training runs to complete

The Number of epochs parameter specifies the number of times to cycle through the training data.

For example, with a batch size of 10 and a data set with 1,000 examples, one epoch processes 100 batches and, with an Accumulate steps value of 1, makes adjustments 100 times. If you set the number of epochs to 20, the tuning process cycles through the data set 20 times, which means that it processes a total of 2,000 batches.

The higher the number of epochs and the bigger your training data set, the longer it takes to tune a model.
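The epoch arithmetic from the example above, restated as a short calculation with the same illustrative numbers:

```python
# Total batches processed = batches per epoch x number of epochs.
num_examples = 1000
batch_size = 10
num_epochs = 20

batches_per_epoch = num_examples // batch_size
total_batches = batches_per_epoch * num_epochs

print(batches_per_epoch, total_batches)  # 100 2000
```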

Managing the learning rate

The learning rate parameter determines the scope of the change to make when the model is adjusted. The higher the number, the greater the change.
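The trade-off can be sketched on a toy one-parameter problem: a rate that is too small converges slowly, a moderate rate converges quickly, and a rate that is too high overshoots the target and diverges. The function and values here are illustrative assumptions, not tuning defaults.

```python
# Effect of the learning rate when minimizing the toy loss (w - 2)^2,
# whose minimum is at w = 2.0.
def tune(learning_rate, steps=50):
    w = 0.0
    for _ in range(steps):
        grad = 2 * (w - 2.0)   # gradient of (w - 2)^2
        w -= learning_rate * grad
    return w

print(round(tune(0.01), 3))   # too small: still far from 2.0
print(round(tune(0.3), 3))    # moderate: very close to 2.0
print(abs(tune(1.5)) > 1e6)   # too large: overshoots and diverges (True)
```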

Configuring GPUs (Fine tuning only)

Tuning Studio uses GPU processors to handle the high computational demands of fine tuning a foundation model.

For fine-tuning experiments only, you can adjust the value in the Number of GPUs field based on the resource needs of your experiment. Use a minimum of 2 GPUs. Larger models and use cases that require long sequence lengths or large batch sizes need at least 4 GPUs.

Initializing the prompt (Prompt tuning only)

When you create a prompt-tuning experiment, you can choose whether to specify your own text to serve as the initial prompt vector or let the experiment generate it for you. The new tokens that start the training process are initialized either with random values or based on the embeddings of the vocabulary or instruction text that you specify. Studies show that as the size of the underlying model grows beyond 10 billion parameters, the initialization method that is used becomes less important.

The choice that you make when you create the tuning experiment customizes how the prompt is initialized.

Initialization method: Choose a method from the following options:

  • Text: You specify the initialization text of the prompt yourself.
  • Random: The experiment chooses values at random to include with the prompt.

Initialization text: The text that you want to add. Specify a task description or instructions similar to what you use for zero-shot prompting.
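The two initialization methods can be sketched as follows. This is a toy illustration under stated assumptions: the 4-dimensional embedding space, the tiny vocabulary table, and the `init_prompt` function are all hypothetical; real prompt vectors are far larger and come from the model's own embedding layer.

```python
# Hypothetical sketch of Random vs. Text prompt initialization.
import random

EMBED_DIM = 4
# Toy embedding table mapping tokens to vectors (illustrative values).
vocab_embeddings = {
    "summarize": [0.1, 0.4, -0.2, 0.3],
    "the": [0.0, 0.1, 0.0, 0.1],
    "text": [0.2, -0.1, 0.5, 0.0],
}

def init_prompt(method, text=None):
    if method == "Random":
        # Random: start the prompt vector from randomly chosen values.
        return [random.uniform(-1, 1) for _ in range(EMBED_DIM)]
    if method == "Text":
        # Text: start from the embeddings of the initialization text,
        # one vector per token.
        tokens = text.lower().split()
        return [vocab_embeddings[t] for t in tokens]
    raise ValueError(method)

random_init = init_prompt("Random")
text_init = init_prompt("Text", "Summarize the text")
print(len(random_init), len(text_init))  # one random vector vs. one vector per token
```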

Parent topic: Tuning a model