Configuration for creating jobs to generate unstructured synthetic data

You have various options for customizing how unstructured synthetic data is generated. Whether you're creating the jobs through the user interface or through the watsonx.ai synthetic data generation API, you have access to the same settings.

Data builder configuration

Each data builder has specific requirements for seed data and reference documents. For example, you can use several files as reference documents for the knowledge data builder. However, the text to SQL data builder does not use reference documents.

The Data builder configuration tab guides you through the requirements in the user interface.

To configure your API request, see the following resources:

Generator advanced settings

You can use the advanced settings for the generator to configure various options for generating unstructured synthetic data. For example, you can configure how tokens are selected and which LLM to use.

Table 1. Generator settings
| Field | Name in user interface | Type | Default | Allowed values | Description |
|---|---|---|---|---|---|
| num_outputs_to_generate | Number of rows to generate | Integer | 10 | Range: 1 ≤ value ≤ 1000 | This parameter determines the amount of synthetic data to generate. If you want to experiment with different models, you can set this parameter to 10 to reduce the amount of synthetic data generated while you are testing. |
| model_id | Model ID | String | ibm/granite-3-8b-instruct | See Supported models | The model to use for generating tokens. |
| temperature | Temperature | Float | 0.5 | Range: 0.05 ≤ value ≤ 2 | This parameter adjusts the probability distribution for the next token, which controls the balance between creativity and determinism in text generation. Lower values (≤0.3) narrow the distribution, which reduces randomness and produces more focused, repeatable output. Higher values (≥0.7) widen the distribution, allowing for more diverse and creative responses, though sometimes at the expense of accuracy. |
| max_new_tokens | Max tokens per QnA pair | Integer | 1024 | Any value above min_new_tokens | This parameter sets a hard cap on the number of tokens generated in one response, which prevents overly long outputs. |
| min_new_tokens | Min tokens per QnA pair | Integer | 100 | Any value above 50 | This parameter sets the minimum number of tokens that the model must generate, which prevents it from stopping too early. |
| top_p | top_p | Float | 0.9 | Range: 0 < value ≤ 1 | This parameter determines the pool of candidate tokens for the next generated token. The model calculates probabilities for all the possible next tokens, sorts them by probability, and then selects the smallest set of tokens whose cumulative probability is at least p. Higher values (for example, 0.9) include less-probable tokens, which produces more diverse and creative output. |
| top_k | top_k | Integer | 50 | Range: 1 ≤ value ≤ 100 | This parameter determines how many tokens are considered as candidates for the next generated token. It limits the pool of candidate tokens to the k most probable options at each step. For example, when k = 50, the model calculates probabilities for all the possible next tokens and sorts them by probability. From this ranked list, only the 50 most probable tokens are considered. The rest are ignored. Smaller values generate more predictable output. Larger values increase variety. A midrange value (50) helps to balance diversity and relevance. |
| threshold | Threshold (De-duplicator parameters) | Float | 0.9 | Any value between 0.01 and 1 | Sets the similarity score that is used to decide when two outputs are considered duplicates. The model uses the Rouge-L scoring method to determine similarity. If set to 0, all QnA pairs are treated as invalid, which produces no output. If set to 1, no duplicates are removed. |
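
The interaction between temperature, top_k, and top_p can be easier to follow with a toy example. The following Python sketch is illustrative only: the function name candidate_pool, the toy logits, and the order in which the filters are applied (temperature, then top_k, then top_p) are assumptions, and the sketch does not describe how the service or any particular model implements sampling.

```python
import math


def candidate_pool(logits, temperature=0.5, top_k=50, top_p=0.9):
    """Return {token: probability} for the tokens that survive top_k and top_p.

    Hypothetical helper for illustration; not part of any watsonx.ai API.
    """
    # Temperature: dividing the logits by a value below 1 sharpens the
    # distribution, and dividing by a value above 1 flattens it.
    scaled = {tok: logit / temperature for tok, logit in logits.items()}

    # Softmax, subtracting the maximum first for numerical stability.
    peak = max(scaled.values())
    weights = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(weights.values())
    ranked = sorted(
        ((tok, w / total) for tok, w in weights.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )

    # top_k: keep only the k most probable tokens, then renormalize.
    ranked = ranked[:top_k]
    kept_total = sum(p for _, p in ranked)
    ranked = [(tok, p / kept_total) for tok, p in ranked]

    # top_p: keep the smallest prefix whose cumulative probability reaches p.
    pool, cumulative = [], 0.0
    for tok, p in ranked:
        pool.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break

    # Renormalize so that the surviving probabilities sum to 1.
    pool_total = sum(p for _, p in pool)
    return {tok: p / pool_total for tok, p in pool}


# Toy next-token scores (made-up values).
toy_logits = {"the": 2.0, "a": 1.0, "cat": 0.0, "zebra": -3.0}
print(candidate_pool(toy_logits, temperature=1.0, top_k=2, top_p=0.9))
```

In this sketch, a smaller top_k or top_p shrinks the pool of candidate tokens, and a lower temperature concentrates probability on the highest-ranked token.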

Supported models

You can pick which model to use to generate the synthetic data. The following models are certified for use with the Synthetic Data Generator service for generating unstructured synthetic data:

  • ibm/granite-3-8b
  • mistralai/mistral-large

Validator advanced settings

The advanced settings for the validator determine how Synthetic Data Generator evaluates the tokens that it generates.

Table 2. Validator settings
| Field | Name in user interface | Type | Default | Allowed values | Comments |
|---|---|---|---|---|---|
| threshold | Threshold (De-duplicator parameters) | Float | 0.9 | Any value between 0.01 and 1.00 | Sets the similarity score that is used to decide when two outputs are considered duplicates. The model uses the Rouge-L scoring method to determine similarity. If set to 0, all QnA pairs are treated as invalid, which produces no output. If set to 1, no duplicates are removed. |
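
To give a feel for how a Rouge-L threshold can act as a de-duplication filter, the following Python sketch scores pairs of outputs by their longest common subsequence (LCS) and keeps an output only if it scores below the threshold against every output kept so far. The function names, the whitespace tokenization, the use of the F1 variant of Rouge-L, and the "drop when score ≥ threshold" rule are all assumptions made for illustration. The service's actual scorer may differ, and the sketch does not attempt to reproduce the edge behavior at 0 and 1 that the table describes.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """Rouge-L F1 score on lowercased, whitespace-split tokens (an assumption)."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def deduplicate(outputs, threshold=0.9):
    """Keep an output only if it is not too similar to any output kept earlier."""
    kept = []
    for text in outputs:
        if all(rouge_l_f1(text, seen) < threshold for seen in kept):
            kept.append(text)
    return kept


# Made-up sample outputs; the second is an exact repeat of the first.
sample_outputs = [
    "What is the capital of France?",
    "What is the capital of France?",
    "How do I reset my password?",
]
print(deduplicate(sample_outputs, threshold=0.9))
```

Under these assumptions, lowering the threshold makes the filter stricter, because more pairs of outputs count as duplicates.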

Request sample

The following curl command submits a request to create an unstructured synthetic data generation job. The request headers are the same for all data builders:

curl -X POST \
  'https://api.<region>.dai.cloud.ibm.com/v1/synthetic_data/generation/unstructured?version=2025-04-17' \
  --header 'Accept: application/json' \
  --header 'Content-Type: application/json' \
  --header 'Authorization: Bearer eyJraWQiOi...' \
  --data @payload.json

You can use a separate JSON file, such as payload.json, to pass the request body that contains the configuration details for the job. The contents of the JSON file vary depending on the data builder and the settings that you want to configure. Generally, the file has the following structure:

{
    "project_id": "<ID of the project to create the job in>",
    "name": "<Name of the job that you want to create>",
    "description": "<Description for the job>",
    "configuration": {
        "pipeline": "<data builder>",
        "num_outputs_to_generate": <A value between 1 and 1000>,
        "generator": {
            "model_id": "<LLM. For example ibm/granite-3-8b>",
            "temperature": 0.5,
            "max_new_tokens": 1024,
            "min_new_tokens": 100,
            "decoding_method": "sample",
            "top_p": 0.9,
            "top_k": 50
        },
        "validators": [
            {
                "type": "rouge_scorer",
                "filter": true,
                "threshold": 0.9
            }
        ],
        "seed_data_reference": {
            "type": "container",
            "location": {
                "path": "<Input YAML file name in project assets. For example qna_know_seed.yaml>"
            }
        },
        "knowledge_base_references": [
            {
                "type": "container",
                "location": {
                    "path": "<File name for reference documents. For example origins_*.pdf>"
                }
            },
            {
                "type": "container",
                "location": {
                    "path": "<File name for additional reference documents. For example education-at-ibm.md>"
                }
            }
        ],
        "results_reference": {
            "type": "container",
            "location": {
                "path": "<Unique file name for the generated data output in project assets. For example sdg-output-knowledge-1.jsonl>"
            }
        }
    },
    "job": {
        "schedule": "0 0 1 * *",
        "schedule_info": {
            "repeat": true,
            "startOn": 1547578689512,
            "endOn": 1547578689512
        }
    }
}
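
Before you submit a request, it can help to check the settings against the documented ranges. The following Python sketch builds a reduced version of the request body and validates it against the ranges in Table 1 and Table 2. The helper names check_settings and build_payload are hypothetical, the placeholder values must be replaced with your own, and the sketch leaves out the description, knowledge_base_references, and job sections, which you can add as your data builder requires.

```python
import json


def check_settings(num_outputs, generator, threshold):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    if not 1 <= num_outputs <= 1000:
        problems.append("num_outputs_to_generate must be between 1 and 1000")
    if not 0.05 <= generator["temperature"] <= 2:
        problems.append("temperature must be between 0.05 and 2")
    if generator["min_new_tokens"] <= 50:
        problems.append("min_new_tokens must be above 50")
    if generator["max_new_tokens"] <= generator["min_new_tokens"]:
        problems.append("max_new_tokens must be above min_new_tokens")
    if not 0 < generator["top_p"] <= 1:
        problems.append("top_p must be above 0 and at most 1")
    if not 1 <= generator["top_k"] <= 100:
        problems.append("top_k must be between 1 and 100")
    if not 0.01 <= threshold <= 1:
        problems.append("threshold must be between 0.01 and 1")
    return problems


def build_payload(project_id, name, pipeline, num_outputs, generator,
                  threshold, seed_path, results_path):
    """Build a reduced request body, raising ValueError on out-of-range settings."""
    problems = check_settings(num_outputs, generator, threshold)
    if problems:
        raise ValueError("; ".join(problems))
    return {
        "project_id": project_id,
        "name": name,
        "configuration": {
            "pipeline": pipeline,
            "num_outputs_to_generate": num_outputs,
            "generator": generator,
            "validators": [
                {"type": "rouge_scorer", "filter": True, "threshold": threshold}
            ],
            "seed_data_reference": {
                "type": "container",
                "location": {"path": seed_path},
            },
            "results_reference": {
                "type": "container",
                "location": {"path": results_path},
            },
        },
    }


# Generator values taken from the sample structure above.
sample_generator = {
    "model_id": "ibm/granite-3-8b",
    "temperature": 0.5,
    "max_new_tokens": 1024,
    "min_new_tokens": 100,
    "decoding_method": "sample",
    "top_p": 0.9,
    "top_k": 50,
}

# Angle-bracket values are placeholders, not real IDs or data builder names.
payload = build_payload(
    "<project ID>", "<job name>", "<data builder>", 10, sample_generator,
    0.9, "qna_know_seed.yaml", "sdg-output-knowledge-1.jsonl",
)
print(json.dumps(payload, indent=4))
```

You can save the printed JSON as payload.json and pass it to the curl command with --data @payload.json.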