Evaluate node
In Synthetic Data Generator, you can use the Evaluate node to compare the synthetic data to the seed data and assess its quality.
- Description
  - Use the Evaluate node in a Synthetic Data Generator flow to assess the quality and privacy of the generated synthetic dataset by comparing it to the original dataset.
- Using the node
  - Connect the Evaluate node to the Generate node and to either the Import node or the Anonymize node. The output is a report that displays the metrics. The report shows fidelity and data distinguishability metrics for the synthetic dataset relative to the original, as well as privacy metrics such as the leakage and proximity scores.
  - You can include the Evaluate node multiple times in a Synthetic Data Generator flow. However, you can connect each Evaluate node to only one set of nodes at a time.
  - Large or complex datasets require more memory for the Evaluate node to analyze. A complex dataset is, for example, one with many correlations between columns. If you have large or complex datasets, avoid running multiple flows with Evaluate nodes simultaneously, because an Evaluate node that runs out of memory can cause errors. For details, see Troubleshooting Synthetic Data Generator.
- Mandatory or optional
  - The Evaluate node is optional, but it is highly recommended so that you can confirm that the synthetic dataset meets the desired quality standards.
Scripting with the Evaluate node
You can use scripting languages, like Python, to programmatically set properties for nodes.
Evaluate node properties
The following properties are specific to the Evaluate node. For information about common node properties, see Properties for flows and nodes.
| Property name | Data type | Property description |
|---|---|---|
| asset_type | DataAsset, Connection | Specify your data type: DataAsset or Connection. |
| asset_id | String | When DataAsset is set for the asset_type, this is the ID of the asset. |
| asset_name | String | When DataAsset is set for the asset_type, this is the name of the asset. |
| baseline_input | String | The Evaluate node requires connections from two nodes: one that contains your original baseline data and one that contains the generated synthetic data. |
| connection_id | String | When Connection is set for the asset_type, this is the ID of the connection. |
| connection_name | String | When Connection is set for the asset_type, this is the name of the connection. |
| connection_path | String | When Connection is set for the asset_type, this is the path of the connection. |
| fast_evaluation | Boolean | Set to True to use simple assessment mode. In simple assessment mode, metrics are run on a single machine learning (ML) model. In full assessment mode, metrics are evaluated and averaged across multiple ML models whenever possible. |
| feature_distribution_bin_nb | Integer | Number of bins to distribute the features in. |
| predictive_utility_column | String | Column to use for the show_predictive_utility property. |
| show_predictive_utility | Boolean | Set to True to measure the usefulness of the synthetic data for predictive downstream tasks. It evaluates how accurately predictive models trained on the synthetic data predict a selected target, using real data as test data. |
| show_privacy_leakage_score | Boolean | Set to True to measure the percentage of rows in the synthetic data that are different from the rows in the real data. A leakage prevention of 100% means that all synthetic rows are new, while a leakage prevention of 0% means that all the synthetic rows are copied from the real data. |
| show_privacy_proximity_score | Boolean | Set to True to compute the distance between points in the synthetic data and the real data. The smaller this distance, the easier it is to isolate some rows from the real data, which increases privacy risk. |
| show_data_distinguishability | Boolean | Set to True to rate the ability of a binary classifier to separate real data from synthetic data. The harder it is to train such a classifier, the better the quality of the synthetic data with respect to its ability to mimic statistical properties of the real data. |
| show_fidelity_score | Boolean | Set to True to aggregate multiple metrics and rate the similarity between real data and synthetic data for distributions of individual columns. It also rates the similarity of correlations for all pairs of columns. |
| user_settings | String | Escaped JSON string containing the interaction properties for the connection, for example: user_settings: "{\"interactionProperties\":{\"write_mode\":\"write\",\"file_name\":\"output.csv\",\"file_format\":\"csv\",\"quote_numerics\":true,\"encoding\":\"utf-8\",\"first_line_header\":true,\"include_types\":false}}" These values will change based on the type of connection you're using. |
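
To make the show_privacy_leakage_score description more concrete, the following is a minimal sketch of the idea in plain Python. It is not the product's implementation: the function name leakage_prevention, the sample rows, and the use of exact row matching are all assumptions made for illustration.

```python
def leakage_prevention(real_rows, synthetic_rows):
    """Return the percentage of synthetic rows that match no real row.

    Assumption: a synthetic row counts as "copied" only if it is an exact,
    column-by-column match of some real row.
    """
    if not synthetic_rows:
        raise ValueError("synthetic_rows must not be empty")
    real_set = {tuple(row) for row in real_rows}
    new_rows = sum(1 for row in synthetic_rows if tuple(row) not in real_set)
    return 100.0 * new_rows / len(synthetic_rows)


# Hypothetical sample data: two of the four synthetic rows duplicate real rows.
real = [(34, "F", "NY"), (51, "M", "TX"), (27, "F", "CA")]
synthetic = [(34, "F", "NY"), (45, "M", "WA"), (29, "F", "CA"), (51, "M", "TX")]

score = leakage_prevention(real, synthetic)
print(f"Leakage prevention: {score:.1f}%")  # prints "Leakage prevention: 50.0%"
```

In this sketch, a result of 100% corresponds to a synthetic dataset in which every row is new, and 0% to one in which every row is copied from the real data, which matches the interpretation given in the table.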
Example
The following is an example of the properties used in a script.
import json
# Get the current flow and look up the import node by its ID
stream = sdg.script.stream()
dataassetimport = stream.findByID("<import nodeId>")
# Parse the escaped JSON string in user_settings into a dictionary
userSettings = json.loads(dataassetimport.getPropertyValue("user_settings"))
# Change one interaction property, then write the settings back as a JSON string
userSettings["interactionProperties"]["sheet_name"] = "<new sheet name>"
dataassetimport.setPropertyValue("user_settings", json.dumps(userSettings))
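
The same getPropertyValue and setPropertyValue pattern could plausibly be applied to the Evaluate node properties listed in the table. The sketch below is an illustration under stated assumptions, not a confirmed script: StubNode is a hypothetical stand-in so that the code runs outside Synthetic Data Generator, configure_evaluate is a helper invented for this example, and the property values (including 10 bins) are placeholders.

```python
class StubNode:
    """Hypothetical stand-in for a flow node, used only to run this sketch
    outside Synthetic Data Generator."""

    def __init__(self):
        self._props = {}

    def setPropertyValue(self, name, value):
        self._props[name] = value

    def getPropertyValue(self, name):
        return self._props[name]


def configure_evaluate(node, settings):
    """Apply each property in settings to the node and return the node."""
    for name, value in settings.items():
        node.setPropertyValue(name, value)
    return node


# Property names come from the table above; the values are illustrative.
evaluate_settings = {
    "fast_evaluation": True,
    "feature_distribution_bin_nb": 10,
    "show_fidelity_score": True,
    "show_data_distinguishability": True,
    "show_privacy_leakage_score": True,
    "show_privacy_proximity_score": True,
    "show_predictive_utility": True,
    "predictive_utility_column": "<target column>",
}

# In a flow script, the node would instead be looked up with something like:
#   node = sdg.script.stream().findByID("<evaluate nodeId>")
node = configure_evaluate(StubNode(), evaluate_settings)
print(node.getPropertyValue("fast_evaluation"))
```

Keeping the settings in one dictionary makes it easy to review which metrics a flow enables before the values are written to the node.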