Text to SQL data builder

The text to SQL data builder generates synthetic data in which each record pairs a natural language statement that describes a database operation with an equivalent SQL statement that performs the operation.

Note: Currently, the natural language statement must be written in English.

Seed data format

Create an input YAML file that contains sample plain text statements. Add the following information to the file:

The operations to perform on data stored in a relational database
The corresponding SQL queries to execute each operation
A database schema that defines how the data is organized and stored

Use the following structure for the YAML file:

task_description: <Description of this task>
seed_examples:
   - utterance: <input question 1>
     query: <sample SQL 1>
   - utterance: <input question 2>
     query: <sample SQL 2>
database:
   schema: "<Data Definition Language (DDL) statements for one or more tables. Separate the statements with semicolons>"
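
As an illustration of the structure above, the following Python sketch assembles a seed file for a hypothetical employees table. The table, the utterances, the queries, and the render_seed_yaml helper are all assumptions made for this example and are not part of the product. The sketch uses only the standard library and writes every value as a JSON-style double-quoted string, which YAML also accepts.

```python
import json


def render_seed_yaml(task_description, seed_examples, schema):
    """Render seed data in the YAML structure shown above.

    Hypothetical helper: every scalar is written as a JSON-style
    double-quoted string, which is also a valid YAML scalar.
    """
    lines = [f"task_description: {json.dumps(task_description)}", "seed_examples:"]
    for example in seed_examples:
        lines.append(f"   - utterance: {json.dumps(example['utterance'])}")
        lines.append(f"     query: {json.dumps(example['query'])}")
    lines.append("database:")
    lines.append(f"   schema: {json.dumps(schema)}")
    return "\n".join(lines) + "\n"


# Illustrative values only; replace them with your own schema and examples.
SCHEMA = (
    "CREATE TABLE employees (id INT PRIMARY KEY, name VARCHAR(100), "
    "department VARCHAR(50), salary DECIMAL(10, 2));"
)
EXAMPLES = [
    {
        "utterance": "List the names of all employees in the Sales department.",
        "query": "SELECT name FROM employees WHERE department = 'Sales';",
    },
    {
        "utterance": "What is the average salary in each department?",
        "query": "SELECT department, AVG(salary) FROM employees GROUP BY department;",
    },
]

seed_yaml = render_seed_yaml(
    "Generate SQL queries over an employee database", EXAMPLES, SCHEMA
)
print(seed_yaml)
```

Each utterance is paired with a query that runs against the tables declared in the schema, so the seed examples and the DDL stay consistent with each other.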

Sample seed data

To download sample seed data for the text to SQL data builder, see watsonx-ai-samples.

Request body example for JSON file

The following request body contains the configuration for an unstructured synthetic data generation job that uses the text to SQL data builder. It also includes advanced settings that control token generation.

{
    "project_id": "<ID of the project to create the job in>",
    "name": "<Name of the job that you want to create>",
    "description": "<Description for the job>",
    "configuration": {
        "pipeline": "nl2sql",
        "num_outputs_to_generate": <A value from 1 to 1000>,
        "generator": {
            "model_id": "<LLM. For example ibm/granite-3-8b>",
            "temperature": 0.5,
            "max_new_tokens": 1024,
            "min_new_tokens": 100,
            "decoding_method": "sample",
            "top_p": 0.9,
            "top_k": 50
        },
        "validators": [
            {
                "type": "rouge_scorer",
                "threshold": 0.9
            }
        ],
        "seed_data_reference": {
            "type": "container",
            "location": {
                "path": "<YAML file name in project assets. For example qna_know_seed.yaml>"
            }
        },
        "results_reference": {
            "type": "container",
            "location": {
                "path": "<File name for the generated data output. For example sdg-output-know.jsonl>"
            }
        }
    },
    "job": {
        "schedule": "0 0 1 * *",
        "schedule_info": {
            "repeat": true,
            "startOn": 1547578689512,
            "endOn": 1547578689512
        }
    }
}
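
The request body above can also be assembled programmatically. The following Python sketch builds the same structure as a dictionary and serializes it to JSON. The build_nl2sql_request helper, its argument names, and the file names in the usage lines are hypothetical. The 1 to 1000 range check mirrors the placeholder text in the example, and the generator and validator values are copied from it. The job section is added only when a schedule dictionary is passed in; this page does not confirm whether the service accepts a request without it. Sending the request is out of scope here.

```python
import json


def build_nl2sql_request(project_id, name, description, model_id,
                         seed_path, results_path, num_outputs=100, job=None):
    """Build a request body shaped like the example above (hypothetical helper)."""
    # Range taken from the placeholder in the example request body.
    if not 1 <= num_outputs <= 1000:
        raise ValueError("num_outputs must be between 1 and 1000")
    body = {
        "project_id": project_id,
        "name": name,
        "description": description,
        "configuration": {
            "pipeline": "nl2sql",
            "num_outputs_to_generate": num_outputs,
            # Generation settings copied from the example request body.
            "generator": {
                "model_id": model_id,
                "temperature": 0.5,
                "max_new_tokens": 1024,
                "min_new_tokens": 100,
                "decoding_method": "sample",
                "top_p": 0.9,
                "top_k": 50,
            },
            "validators": [{"type": "rouge_scorer", "threshold": 0.9}],
            "seed_data_reference": {
                "type": "container",
                "location": {"path": seed_path},
            },
            "results_reference": {
                "type": "container",
                "location": {"path": results_path},
            },
        },
    }
    if job is not None:
        # Optional scheduling section, passed through unchanged.
        body["job"] = job
    return body


request_body = build_nl2sql_request(
    project_id="<ID of the project to create the job in>",
    name="nl2sql-demo",
    description="Text to SQL synthetic data job",
    model_id="ibm/granite-3-8b",
    seed_path="nl2sql_seed.yaml",          # hypothetical file name
    results_path="sdg-output-nl2sql.jsonl",  # hypothetical file name
    num_outputs=50,
)
payload = json.dumps(request_body, indent=4)
print(payload)
```

Validating num_outputs before serializing catches an out-of-range value locally instead of after the job request is submitted.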