Data formats for tuning foundation models
Prepare a set of prompt examples to use to tune the model. The examples must contain the type of input that the model will need to process at run time and the appropriate output for the model to generate in response.
You can add training data from one file or one connected data store.
To use training data from an external data store, you must first set up a connection to the data store. The following connected data stores are supported:
Training data requirements
Follow these guidelines when you create your training data:
- Add 100 to 1,000 labeled examples.
  Between 50 and 10,000 examples are allowed for prompt tuning. For fine tuning, the number of examples is unlimited. However, you can add a maximum of 10,000 examples from a connected data store, and a JSON or JSONL file that you upload cannot be larger than 200 MB.
- For prompt-tuning experiments, the training data must be in English.
- For fine-tuning experiments, the training data must be in a language that is supported by the model.
  Note that the `input` and `output` labels must continue to be in English.
- Keep your input and output examples within the maximum token limits that are used by the experiment. Otherwise, your example text will be truncated.
  For more information, see the appropriate details for the tuning method you're using:
  - Fine tuning: Setting fine-tuning token limits
  - Prompt tuning: Setting prompt-tuning token limits
  How tokens are counted differs by model, which makes the number of tokens difficult to estimate. For language-based foundation models, you can think of 256 tokens as about 130 to 170 words and 128 tokens as about 65 to 85 words. For more information, see Tokens and tokenization.
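The word-to-token rule of thumb above can be turned into a rough pre-check before you upload data. The following Python sketch is only an illustration: the function names and the `max_tokens` parameter are hypothetical, and the estimate relies solely on the "256 tokens is about 130 to 170 words" approximation, not on any model's actual tokenizer.

```python
def estimate_token_range(text: str) -> tuple[int, int]:
    """Return a rough (low, high) token estimate for a piece of text.

    Assumption: 256 tokens correspond to roughly 130 to 170 words.
    Real counts depend on the tokenizer of the model you tune.
    """
    words = len(text.split())
    # 170 words per 256 tokens gives the low estimate,
    # 130 words per 256 tokens gives the high estimate.
    low = -(-words * 256 // 170)   # ceiling division
    high = -(-words * 256 // 130)
    return low, high


def may_be_truncated(text: str, max_tokens: int) -> bool:
    """Flag text whose high estimate exceeds a (hypothetical) token limit."""
    return estimate_token_range(text)[1] > max_tokens
```

A pessimistic check such as `may_be_truncated(example["input"], 256)` can help you spot examples to shorten, but the experiment's own token limits remain the authority.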
If you plan to use the tuned foundation model to classify data, follow these extra guidelines:
- Try to limit the number of class labels to 10 or fewer.
- Include an equal number of examples of each class type.
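As a sketch of how you might check these two classification guidelines, the following Python snippet counts the examples per class label. It assumes that the `output` value of each example is the class label; the function name and the returned dictionary keys are hypothetical.

```python
from collections import Counter


def check_class_balance(examples, max_labels=10):
    """Summarize class labels in a list of {"input": ..., "output": ...} dicts."""
    counts = Counter(example["output"] for example in examples)
    return {
        "counts": dict(counts),
        # Guideline: try to limit the number of class labels to 10 or fewer.
        "too_many_labels": len(counts) > max_labels,
        # Guideline: include an equal number of examples of each class.
        "balanced": len(set(counts.values())) <= 1,
    }
```

A misspelled label, such as `Classname: Problem` next to `Class name: Problem`, shows up in `counts` as an extra class, which makes this check useful for catching typos too.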
You can use the Prompt Lab to craft examples for the training data. For more information, see Prompt Lab.
After you collect a representative set of examples, group the examples into a set to use for training and a separate, smaller set to use for testing purposes.
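One way to create the two sets is to shuffle the examples and hold back a fraction for testing. The following Python sketch is a minimal illustration; the 20% test fraction and the fixed seed are assumptions chosen for the example, not documented requirements.

```python
import random


def split_examples(examples, test_fraction=0.2, seed=0):
    """Return (training_set, test_set) from a list of examples."""
    shuffled = list(examples)              # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # seeded for a repeatable split
    test_size = max(1, int(len(shuffled) * test_fraction))
    return shuffled[test_size:], shuffled[:test_size]
```

For classification data, you might want to split each class separately so that both sets keep the same class proportions.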
File format requirements
The training data file must meet these requirements:
- Use one of the following formats:
- JavaScript Object Notation (JSON)
- JSON Lines (JSONL) format
- The maximum file size that is allowed is 200 MB.
- Each example must include one `input` and `output` pair.
- If the input or output text includes quotation marks, escape each quotation mark with a backslash (`\`). For example, `He said, \"Yes.\"`.
- To represent a carriage return or line break, use the `\n` escape sequence. For example, `...end of paragraph.\nStart of new paragraph`.
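If you generate the file programmatically, a JSON library typically applies these escaping rules for you. The following Python sketch uses the standard `json` module to serialize one example as a JSONL line; the helper name `to_jsonl_line` is hypothetical.

```python
import json


def to_jsonl_line(input_text: str, output_text: str) -> str:
    """Serialize one example as a single JSONL line.

    json.dumps escapes quotation marks as \\" and line breaks as \\n,
    so the result always stays on one physical line.
    """
    return json.dumps({"input": input_text, "output": output_text},
                      ensure_ascii=False)


line = to_jsonl_line('He said, "Yes."',
                     "...end of paragraph.\nStart of new paragraph")
print(line)
```

Reading the line back with `json.loads` restores the original quotation marks and line break.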
JSON example
The following example shows an excerpt from a training data file with labeled prompts for a classification task in JSON format.
[
{
"input":"Message: When I try to log in, I get an error.",
"output":"Class name: Problem"
},
{
"input":"Message: Where can I find the plan prices?",
"output":"Class name: Question"
},
{
"input":"Message: What is the difference between trial and paygo?",
"output":"Class name: Question"
},
{
"input":"Message: The registration page crashed, and now I can't create a new account.",
"output":"Class name: Problem"
},
{
"input":"Message: What regions are supported?",
"output":"Class name: Question"
},
{
"input":"Message: I can't remember my password.",
"output":"Class name: Problem"
},
{
"input":"Message: I'm having trouble registering for a new account.",
"output":"Class name: Problem"
},
{
"input":"Message: A teammate shared a service instance with me, but I can't access it. What's wrong?",
"output":"Class name: Problem"
},
{
"input":"Message: What extra privileges does an administrator have?",
"output":"Class name: Question"
},
{
"input":"Message: Can I create a service instance for data in a language other than English?",
"output":"Class name: Question"
}
]
JSONL example
The following example shows an excerpt from a training data file with labeled prompts for a classification task in JSONL format.
{"input":"Message: When I try to log in, I get an error.","output":"Class name: Problem"}
{"input":"Message: Where can I find the plan prices?","output":"Class name: Question"}
{"input":"Message: What is the difference between trial and paygo?","output":"Class name: Question"}
{"input":"Message: The registration page crashed, and now I can't create a new account.","output":"Class name: Problem"}
{"input":"Message: What regions are supported?","output":"Class name: Question"}
{"input":"Message: I can't remember my password.","output":"Class name: Problem"}
{"input":"Message: I'm having trouble registering for a new account.","output":"Class name: Problem"}
{"input":"Message: A teammate shared a service instance with me, but I can't access it. What's wrong?","output":"Class name: Problem"}
{"input":"Message: What extra privileges does an administrator have?","output":"Class name: Question"}
{"input":"Message: Can I create a service instance for data in a language other than English?","output":"Class name: Question"}
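Before you upload a file, you might want to confirm that every record parses and has both fields. The following Python sketch accepts either format. Treating text that starts with `[` as a JSON array is a heuristic assumed for this example, and the function name is hypothetical.

```python
import json


def parse_training_data(text: str) -> list:
    """Parse JSON (array) or JSONL text and check each record."""
    stripped = text.strip()
    if stripped.startswith("["):
        # JSON format: one array that holds all examples.
        records = json.loads(stripped)
    else:
        # JSONL format: one JSON object per non-empty line.
        records = [json.loads(line)
                   for line in stripped.splitlines() if line.strip()]
    for index, record in enumerate(records):
        if not (isinstance(record, dict)
                and isinstance(record.get("input"), str)
                and isinstance(record.get("output"), str)):
            raise ValueError(
                f"record {index} needs string 'input' and 'output' fields")
    return records
```

The sketch does not check the 200 MB file size limit or the allowed number of examples; you can add those checks around the call if you need them.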
Tabular data format requirements
You can add data directly from a connected data source or from a data asset that you create with data from a connected data source.
The data must be formatted as follows:
- When you select data from a tabular database, such as Presto or watsonx.data, the table must have two columns with the header names `input` and `output`. The column header names must be lowercase. Columns that are named `Input` or `INPUT`, for example, are ignored. The table can contain more columns; however, any extra columns are ignored.
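The column rules can be illustrated with a small Python sketch that mimics the selection on rows that are already fetched from a table. It is not how the product reads the data; the function name and the row representation (one dictionary per row, keyed by column header) are assumptions made for the example.

```python
def select_training_columns(rows):
    """Keep only the lowercase 'input' and 'output' columns of each row."""
    examples = []
    for row in rows:
        # Header names are case-sensitive: 'Input' or 'INPUT' do not count.
        if "input" not in row or "output" not in row:
            raise KeyError("table needs lowercase 'input' and 'output' columns")
        # Any extra columns are dropped.
        examples.append({"input": row["input"], "output": row["output"]})
    return examples
```

Running a check like this on a sample of rows can reveal a wrongly cased header before you start a tuning experiment.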
Tabular data example
The following table illustrates how training data with labeled prompts for a classification task is structured.
input | output |
---|---|
Message: When I try to log in, I get an error. | Class name: Problem |
Message: Where can I find the plan prices? | Class name: Question |
Message: What is the difference between trial and paygo? | Class name: Question |
Message: The registration page crashed, and now I can't create a new account. | Class name: Problem |
Message: What regions are supported? | Class name: Question |
Message: I can't remember my password. | Class name: Problem |
Message: I'm having trouble registering for a new account. | Class name: Problem |
Message: A teammate shared a service instance with me, but I can't access it. What's wrong? | Class name: Problem |
Message: What extra privileges does an administrator have? | Class name: Question |
Message: Can I create a service instance for data in a language other than English? | Class name: Question |