Knowledge data builder
The knowledge data builder generates instruction and response pairs based on examples in the knowledge branch of the training taxonomy of a tuned foundation model.
Seed data and reference documents
Create an input YAML file that contains sample question and answer (QnA) pairs. The QnA pairs should be questions that a person who is learning the subject might ask. You must also provide content files that serve as a knowledge base. List these reference documents in the YAML file. The documents act as the reference material from which answers are extracted.
Use the following structure for the YAML file:
domain: <A phrase denoting your use case's domain>
task_description: "<Description of this task>"
seed_examples:
  - answer: <sample answer 1>
    question: <sample question 1>
  - answer: <sample answer 2>
    question: <sample question 2>
include:
  documents:
    <doc-set-1-name>: <Name of the reference document or documents. Specify either one document or a wildcard that matches multiple documents>
    <doc-set-2-name>: <Name of the reference document or documents. Specify either one document or a wildcard that matches multiple documents>
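The structure above can also be generated from code, which helps keep the nesting consistent. The following is a minimal sketch, not part of the product: `render_seed_yaml` is a hypothetical helper, the domain, task description, and questions are made-up illustrations, and the answers are left as placeholders. It assumes that double-quoted values are acceptable in the seed file, and it uses `json.dumps` for quoting because JSON strings are also valid YAML double-quoted scalars.

```python
import json


def render_seed_yaml(domain, task_description, qna_pairs, documents):
    """Render seed data in the YAML layout described above.

    qna_pairs: list of (question, answer) tuples.
    documents: dict mapping a document-set name to a file name or wildcard.
    """
    if not qna_pairs:
        raise ValueError("at least one QnA pair is required")
    if not documents:
        raise ValueError("at least one reference document set is required")

    # json.dumps yields double-quoted strings, which are also valid YAML
    # scalars, so colons, quotes, and wildcards cannot break the file.
    q = json.dumps

    lines = [
        f"domain: {q(domain)}",
        f"task_description: {q(task_description)}",
        "seed_examples:",
    ]
    for question, answer in qna_pairs:
        lines.append(f"  - answer: {q(answer)}")
        lines.append(f"    question: {q(question)}")
    lines.append("include:")
    lines.append("  documents:")
    for set_name, file_pattern in documents.items():
        lines.append(f"    {q(set_name)}: {q(file_pattern)}")
    return "\n".join(lines) + "\n"


# Illustrative values only; replace them with your own seed data.
seed_yaml = render_seed_yaml(
    domain="company history",
    task_description="Answer questions about the origins of the company",
    qna_pairs=[
        ("When was the company founded?", "<sample answer 1>"),
        ("Who led the company first?", "<sample answer 2>"),
    ],
    documents={"origins": "origins_*.pdf", "education": "education-at-ibm.md"},
)
print(seed_yaml)
```

Write the returned text to a file such as `qna_know_seed.yaml` and add it to the project assets together with the reference documents.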
Sample seed data and reference documents
To download sample seed data for the knowledge data builder, see watsonx-ai-samples.
Request body example for JSON file
The following request body has the configuration for an unstructured synthetic data generation job that uses the knowledge data builder. It also includes advanced settings that control token generation.
{
  "project_id": "<ID of the project to create the job in>",
  "name": "<Name of the job that you want to create>",
  "description": "<Description for the job>",
  "configuration": {
    "pipeline": "knowledge",
    "num_outputs_to_generate": <A value from 1 to 1000>,
    "generator": {
      "model_id": "<LLM. For example ibm/granite-3-8b>",
      "temperature": 0.5,
      "max_new_tokens": 1024,
      "min_new_tokens": 100,
      "decoding_method": "sample",
      "top_p": 0.9,
      "top_k": 50
    },
    "validators": [
      {
        "type": "rouge_scorer",
        "threshold": 0.9
      }
    ],
    "seed_data_reference": {
      "type": "container",
      "location": {
        "path": "<YAML file name in project assets. For example qna_know_seed.yaml>"
      }
    },
    "knowledge_base_references": [
      {
        "type": "container",
        "location": {
          "path": "<File name for reference documents. For example origins_*.pdf>"
        }
      },
      {
        "type": "container",
        "location": {
          "path": "<File name for additional reference documents. For example education-at-ibm.md>"
        }
      }
    ],
    "results_reference": {
      "type": "container",
      "location": {
        "path": "<File name for the generated data output. For example sdg-output-know.jsonl>"
      }
    }
  },
  "job": {
    "schedule": "0 0 1 * *",
    "schedule_info": {
      "repeat": true,
      "startOn": 1547578689512,
      "endOn": 1547578689512
    }
  }
}
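If you assemble the request body in code rather than by hand, a small builder can catch an out-of-range `num_outputs_to_generate` before the job is submitted. The following is a minimal sketch under stated assumptions: `container_ref`, `to_epoch_millis`, and `build_knowledge_job` are hypothetical helper names, the generator and validator values are copied from the example above, and the `startOn` and `endOn` fields are assumed to be milliseconds since the Unix epoch (the 13-digit values suggest this, but this section does not state it). The sketch only builds and serializes the body; it does not call any API.

```python
import json
from datetime import datetime, timedelta, timezone


def container_ref(path):
    """Build a container reference of the shape used in the request body."""
    return {"type": "container", "location": {"path": path}}


def to_epoch_millis(moment):
    """Convert a timezone-aware datetime to integer milliseconds since 1970-01-01 UTC.

    Assumption: startOn and endOn use this unit.
    """
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    return (moment - epoch) // timedelta(milliseconds=1)


def build_knowledge_job(project_id, name, description, num_outputs, model_id,
                        seed_path, knowledge_paths, results_path, job=None):
    """Assemble the request body for a knowledge data builder job as a dict."""
    if not 1 <= num_outputs <= 1000:
        raise ValueError("num_outputs_to_generate must be from 1 to 1000")
    if not knowledge_paths:
        raise ValueError("at least one reference document path is required")
    body = {
        "project_id": project_id,
        "name": name,
        "description": description,
        "configuration": {
            "pipeline": "knowledge",
            "num_outputs_to_generate": num_outputs,
            # Generator and validator settings copied from the example above.
            "generator": {
                "model_id": model_id,
                "temperature": 0.5,
                "max_new_tokens": 1024,
                "min_new_tokens": 100,
                "decoding_method": "sample",
                "top_p": 0.9,
                "top_k": 50,
            },
            "validators": [{"type": "rouge_scorer", "threshold": 0.9}],
            "seed_data_reference": container_ref(seed_path),
            "knowledge_base_references": [container_ref(p) for p in knowledge_paths],
            "results_reference": container_ref(results_path),
        },
    }
    if job is not None:
        body["job"] = job
    return body


body = build_knowledge_job(
    project_id="<ID of the project to create the job in>",
    name="knowledge-sdg-job",  # illustrative job name
    description="Generate QnA pairs from reference documents",
    num_outputs=100,
    model_id="ibm/granite-3-8b",
    seed_path="qna_know_seed.yaml",
    knowledge_paths=["origins_*.pdf", "education-at-ibm.md"],
    results_path="sdg-output-know.jsonl",
    job={
        "schedule": "0 0 1 * *",  # cron: midnight on the first day of each month
        "schedule_info": {
            "repeat": True,
            "startOn": to_epoch_millis(datetime(2025, 1, 1, tzinfo=timezone.utc)),
            "endOn": to_epoch_millis(datetime(2026, 1, 1, tzinfo=timezone.utc)),
        },
    },
)
print(json.dumps(body, indent=2))
```

Keeping the three reference types behind one `container_ref` helper means that the seed file, the knowledge base documents, and the results file all share the same shape, so a change to that shape needs to be made in only one place.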