Text to SQL data builder
The text to SQL data builder generates synthetic SQL data in which each record pairs a natural language statement that describes a database operation with an equivalent SQL statement that performs the operation.
Note: Currently, the natural language statement must be written in English.
Seed data format
Create an input YAML file that contains sample plain text statements. Add the following information to the file:
- The operations to perform on data stored in a relational database
- The corresponding SQL queries to execute each operation
- A database schema that defines how the data is organized and stored
Use the following structure for the YAML file:
task_description: <Description of this task>
seed_examples:
  - utterance: <input question 1>
    query: <sample SQL 1>
  - utterance: <input question 2>
    query: <sample SQL 2>
database:
  schema: "<Data Definition Language (DDL) statement of one or more tables. Separate each DDL by a semi-colon>"
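As an illustration of the structure above, the following Python sketch renders a small seed file from in-memory data. The function name `render_seed_yaml`, the sample questions, and the two-table schema are hypothetical and not part of the product. The sketch assumes that `utterance` and `query` belong to the same list item and that `schema` is nested under `database`. It writes scalar values with `json.dumps`, relying on the fact that a JSON string is also a valid YAML double-quoted scalar, so no YAML library is needed.

```python
import json


def render_seed_yaml(task_description, examples, ddl_statements):
    """Render seed data in the YAML layout described above (hypothetical helper).

    examples: iterable of (utterance, query) pairs.
    ddl_statements: iterable of DDL strings, one per table.
    """
    lines = [
        f"task_description: {json.dumps(task_description)}",
        "seed_examples:",
    ]
    for utterance, query in examples:
        # Each list item holds one natural language statement and its SQL.
        lines.append(f"  - utterance: {json.dumps(utterance)}")
        lines.append(f"    query: {json.dumps(query)}")
    # The schema is a single string; DDL statements are joined with semicolons.
    schema = "; ".join(s.strip().rstrip(";") for s in ddl_statements)
    lines += ["database:", f"  schema: {json.dumps(schema)}"]
    return "\n".join(lines) + "\n"


# Illustrative data only; replace with statements that match your own schema.
seed_yaml = render_seed_yaml(
    "Answer questions about employees and departments",
    [
        ("How many employees are there?", "SELECT COUNT(*) FROM employees"),
        ("List the names of all departments.", "SELECT name FROM departments"),
    ],
    [
        "CREATE TABLE departments (id INT PRIMARY KEY, name VARCHAR(50))",
        "CREATE TABLE employees (id INT PRIMARY KEY, name VARCHAR(50), dept_id INT)",
    ],
)
print(seed_yaml)
```

The printed text can be saved as a `.yaml` file and added to the project assets.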
Sample seed data
To download sample seed data for the text to SQL data builder, see watsonx-ai-samples.
Request body example for JSON file
The following request body has the configuration for an unstructured synthetic data generation job that uses the text to SQL data builder. It has additional advanced settings for generating tokens.
{
  "project_id": "<ID of the project to create the job in>",
  "name": "<Name of the job that you want to create>",
  "description": "<Description for the job>",
  "configuration": {
    "pipeline": "nl2sql",
    "num_outputs_to_generate": <A value between 1 and 1000>,
    "generator": {
      "model_id": "<LLM. For example ibm/granite-3-8b>",
      "temperature": 0.5,
      "max_new_tokens": 1024,
      "min_new_tokens": 100,
      "decoding_method": "sample",
      "top_p": 0.9,
      "top_k": 50
    },
    "validators": [
      {
        "type": "rouge_scorer",
        "threshold": 0.9
      }
    ],
    "seed_data_reference": {
      "type": "container",
      "location": {
        "path": "<YAML file name in project assets. For example qna_know_seed.yaml>"
      }
    },
    "results_reference": {
      "type": "container",
      "location": {
        "path": "<File name for the generated data output. For example sdg-output-know.jsonl>"
      }
    }
  },
  "job": {
    "schedule": "0 0 1 * *",
    "schedule_info": {
      "repeat": true,
      "startOn": 1547578689512,
      "endOn": 1547578689512
    }
  }
}
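The request body above can also be assembled programmatically. The following Python sketch is one way to do that, using only the field names and values shown in the example. The function name `build_nl2sql_request`, the placeholder project ID, and the file names are hypothetical. The range check mirrors the documented 1 to 1000 range and is not the service's own validation. The sketch only builds and serializes the JSON; it does not send a request.

```python
import json


def build_nl2sql_request(project_id, name, description, num_outputs,
                         model_id, seed_path, output_path, job):
    """Assemble a request body with the same fields as the example above."""
    # The example documents a range of 1 to 1000 for num_outputs_to_generate.
    if not 1 <= num_outputs <= 1000:
        raise ValueError("num_outputs_to_generate must be between 1 and 1000")
    return {
        "project_id": project_id,
        "name": name,
        "description": description,
        "configuration": {
            "pipeline": "nl2sql",
            "num_outputs_to_generate": num_outputs,
            # Generation settings copied from the example request body.
            "generator": {
                "model_id": model_id,
                "temperature": 0.5,
                "max_new_tokens": 1024,
                "min_new_tokens": 100,
                "decoding_method": "sample",
                "top_p": 0.9,
                "top_k": 50,
            },
            "validators": [{"type": "rouge_scorer", "threshold": 0.9}],
            "seed_data_reference": {
                "type": "container",
                "location": {"path": seed_path},
            },
            "results_reference": {
                "type": "container",
                "location": {"path": output_path},
            },
        },
        "job": job,
    }


# Placeholder values for illustration; the job block repeats the example values.
body = build_nl2sql_request(
    project_id="<project-id>",
    name="nl2sql-demo",
    description="Text to SQL synthetic data job",
    num_outputs=100,
    model_id="ibm/granite-3-8b",
    seed_path="nl2sql_seed.yaml",
    output_path="nl2sql_output.jsonl",
    job={
        "schedule": "0 0 1 * *",
        "schedule_info": {
            "repeat": True,
            "startOn": 1547578689512,
            "endOn": 1547578689512,
        },
    },
)
print(json.dumps(body, indent=2))
```

If the `schedule` value follows standard five-field cron syntax, `0 0 1 * *` means midnight on the first day of every month. The `startOn` and `endOn` values look like Unix timestamps in milliseconds. Both readings are assumptions based on the example values.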