Generate node
In Synthetic Data Generator, you use the Generate node to create structured synthetic data.
The Generate node can generate synthetic data in two ways: from scratch, by using the statistical distributions that you specify, or automatically, by using the distributions obtained from running a Mimic node on existing data.
- Description
- The Generate node in a Synthetic Data Generator flow creates synthetic data based on the statistical properties and relationships in the source dataset. It uses various algorithms and statistical models to generate artificial yet realistic-looking data.
- You can include the Generate node multiple times in a flow if you need to create different types of synthetic data. However, the synthetic data cannot be merged in Synthetic Data Generator.
- Using the node
- If you want to use production data, add a Mimic node to the canvas. A Generate node is automatically created and added after the Mimic node in the flow once the Mimic node runs.
- If you want to use a custom data schema, add a Generate node and define the data schema in it. Build tabular data by defining what data should be created for each column. You cannot connect an Import node or Anonymize node directly to a Generate node.
- Mandatory or optional
- The Generate node is mandatory. If you want to use a custom data schema, you must add the node manually. If you use a Mimic node, the Generate node is added automatically after it.
Scripting with the Generate node
You can use scripting languages, like Python, to programmatically set properties for nodes.
Generate node properties
The following properties are specific to the Generate node. For information about common node properties, see Properties for flows and nodes.
| Property name | Data type | Property description |
|---|---|---|
| correlations | Structured property | Set the correlation between fields in the dataset. For more information, see Correlations property example. |
| create_iteration_field | boolean | |
| fields | Structured property | Set the requirements for a custom dataset by using the fields property. For more information, see Fields property example. |
| iteration_field_name | String | When create_iteration_field is set to True, the new iteration field uses the name that you set. |
| keep_min_max_setting | boolean | Set to True to use the value for max_cases. |
| locale | String | The locale determines which dictionary types are available for field generation. It can have one value from ["de_DE", "en_US", "es_ES", "fr_FR", "it_IT", "ja_JP", "ko_KR", "pl_PL", "pt_BR", "ru_RU", "zh_CN"]. |
| max_cases | Integer | Set the maximum number of rows of synthetic data to generate. The minimum value is 1000. The maximum value is 2,147,483,647. |
| random_seed | Integer | |
| refit_correlations | boolean | |
| replicate_results | boolean | |
| parameter_xml | String | Returns the parameter XML as a string. |
Fields property example
The fields property is a structured property with the following syntax:
generate.setPropertyValue("fields", [
[field1, storage, locked, [distribution1], min, max],
[field2, storage, locked, [distribution2], min, max],
[field3, storage, locked, [distribution3], min, max]
])
You can use it to define the custom requirements for the synthetic data to generate. The distribution is a declaration of the distribution name followed by a list containing pairs of attribute names and values. You can't set the distribution directly; you must use it in conjunction with the fields property. Each distribution is defined in the following way:
[distributionname, [[par1], [par2], [par3]]]
generate = sdg.script.stream().createAt("generate", u"Generate", 726, 322)
generate.setPropertyValue("fields", [["Age", "integer", False, ["Uniform",[["min","1"],["max","2"]]], "", ""]])
For example, to create a node that generates a single field with a Binomial distribution, you could use the following script:
generate_node1 = sdg.script.stream().createAt("generate", u"Generate", 200, 200)
generate_node1.setPropertyValue("fields", [["Education", "Real", False, ["Binomial", [["n", 32],
["prob", 0.7]]], "", ""]])
The Binomial distribution takes 2 parameters: n and prob. Since a Binomial distribution does not support minimum and maximum values, these are supplied as an empty string.
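Because every field entry follows the same six-element layout, the nesting can be assembled with a small helper. The sketch below is one possible way to do this; make_field is a hypothetical name, not part of the scripting API, and the call to setPropertyValue is shown only as a comment because it needs a node from a running flow.

```python
def make_field(name, storage, distribution, params,
               locked=False, minimum="", maximum=""):
    """Hypothetical helper that builds one entry for the fields property.

    params is a list of (attribute, value) pairs; a list is used instead of
    a dict so that the attribute order is preserved exactly as written.
    """
    dist = [distribution, [[key, value] for key, value in params]]
    return [name, storage, locked, dist, minimum, maximum]


# Same two fields as the examples above.
education = make_field("Education", "Real", "Binomial",
                       [("n", 32), ("prob", 0.7)])
age = make_field("Age", "integer", "Uniform",
                 [("min", "1"), ("max", "2")])
fields = [education, age]

# In a flow: generate_node1.setPropertyValue("fields", fields)
```

The minimum and maximum arguments default to empty strings, which matches how the examples above supply them for distributions that do not support those values.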
The following example shows all the possible distribution types. Note that the threshold is entered as thresh in both NegativeBinomialFailures and NegativeBinomialTrials.
stream = sdg.script.stream()
generate = stream.createAt("generate", u"Generate", 200, 200)
beta_dist = ["Field1", "Real", False, ["Beta", [["shape1","1"],["shape2","2"]]], "", ""]
binomial_dist = ["Field2", "Real", False, ["Binomial", [["n","1"],["prob","1"]]], "", ""]
categorical_dist = ["Field3", "String", False, ["Categorical", [["A",0.3],["B",0.5],["C",0.2]]], "", ""]
dice_dist = ["Field4", "Real", False, ["Dice", [["1","0.5"],["2","0.5"]]], "", ""]
exponential_dist = ["Field5", "Real", False, ["Exponential", [["scale","1"]]], "", ""]
fixed_dist = ["Field6", "Real", False, ["Fixed", [["value","1"]]], "", ""]
gamma_dist = ["Field7", "Real", False, ["Gamma", [["scale","1"],["shape","1"]]], "", ""]
lognormal_dist = ["Field8", "Real", False, ["Lognormal", [["a","1"],["b","1"]]], "", ""]
negbinomialfailures_dist = ["Field9", "Real", False, ["NegativeBinomialFailures", [["prob","0.5"],["thresh","1"]]], "", ""]
negbinomialtrial_dist = ["Field10", "Real", False, ["NegativeBinomialTrials", [["prob","0.2"],["thresh","1"]]], "", ""]
normal_dist = ["Field11", "Real", False, ["Normal", [["mean","1"],["stddev","2"]]], "", ""]
poisson_dist = ["Field12", "Real", False, ["Poisson", [["mean","1"]]], "", ""]
range_dist = ["Field13", "Real", False, ["Range", [["BEGIN","[1,3]"],["END","[2,4]"],["PROB","[[0.5],[0.5]]"]]], "", ""]
triangular_dist = ["Field14", "Real", False, ["Triangular", [["min","0"],["max","1"],["mode","1"]]], "", ""]
uniform_dist = ["Field15", "Real", False, ["Uniform", [["min","1"],["max","2"]]], "", ""]
weibull_dist = ["Field16", "Real", False, ["Weibull", [["a","0"],["b","1"],["c","1"]]], "", ""]
generate.setPropertyValue("fields", [
    beta_dist,
    binomial_dist,
    categorical_dist,
    dice_dist,
    exponential_dist,
    fixed_dist,
    gamma_dist,
    lognormal_dist,
    negbinomialfailures_dist,
    negbinomialtrial_dist,
    normal_dist,
    poisson_dist,
    range_dist,
    triangular_dist,
    uniform_dist,
    weibull_dist
])
Correlations property example
The correlations property is a structured property with the following syntax:
generate.setPropertyValue("correlations", [
[field1, field2, correlation],
[field1, field3, correlation],
[field2, field3, correlation]
])
The correlation can be any number between -1 and +1. You can specify as many or as few correlations as you like. Any unspecified correlations are set to zero. If any fields are unknown, the correlation value should be set on the correlation matrix (or table). When there are unknown fields, it's not possible to run the node.
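To make the rules above concrete, the following sketch builds a correlations list from a small dictionary of known values. It rejects values outside the range -1 to +1, rejects pairs that name a field that is not in the field list, and writes out zero for every unspecified pair so that the full set of pairs is visible. The helper name build_correlations and the field names are hypothetical. Writing the zeros explicitly is a choice made here for readability, since unspecified correlations are set to zero anyway.

```python
from itertools import combinations


def build_correlations(field_names, known):
    """Hypothetical helper that returns [[field_a, field_b, correlation], ...].

    known maps (field_a, field_b) tuples to a correlation value.
    """
    for pair in known:
        for field in pair:
            if field not in field_names:
                raise ValueError("unknown field: %s" % field)
    rows = []
    for a, b in combinations(field_names, 2):
        # Accept the pair in either order; default to zero when unspecified.
        value = known.get((a, b), known.get((b, a), 0.0))
        if not -1.0 <= value <= 1.0:
            raise ValueError("correlation must be between -1 and +1")
        rows.append([a, b, value])
    return rows


correlations = build_correlations(
    ["Age", "Income", "Education"],   # sample field names
    {("Income", "Age"): 0.6},
)

# In a flow: generate.setPropertyValue("correlations", correlations)
```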