Evaluate node

In Synthetic Data Generator, you can use the Evaluate node to compare the synthetic data to the seed data and assess its quality.

Description
Use the Evaluate node in a Synthetic Data Generator flow to assess the quality and privacy metrics of the generated synthetic dataset by comparing it to the original dataset.
Using the node
Connect the Evaluate node to the Generate node and to either the Import node or the Anonymize node. The output is a report that displays the metrics: fidelity and data distinguishability metrics for the synthetic dataset relative to the original, and privacy metrics such as leakage and proximity scores.
You can include the Evaluate node multiple times in a Synthetic Data Generator flow. However, you can only connect it to one set of nodes at a time.
The Evaluate node requires more memory to analyze large or complex datasets, such as datasets with many correlations between columns. If you have large or complex datasets, avoid running multiple flows with Evaluate nodes simultaneously. If the Evaluate nodes run out of memory, errors can occur. For details, see Troubleshooting Synthetic Data Generator.
Mandatory or optional
The Evaluate node is optional, but using it is highly recommended to ensure that the synthetic dataset meets the desired quality standards.

Scripting with the Evaluate node

You can use scripting languages, such as Python, to programmatically set properties for nodes.
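
For example, the following minimal sketch finds the Evaluate node in the flow and switches it to simple assessment mode by setting the fast_evaluation property (described in the table below). The node ID shown here is a placeholder that you replace with the ID of your own Evaluate node.

stream = sdg.script.stream()

# Find the Evaluate node by its node ID (placeholder value shown here)
evaluatenode = stream.findByID("<evaluate nodeId>")

# Switch the node to simple assessment mode
evaluatenode.setPropertyValue("fast_evaluation", True)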

Evaluate node properties

The following properties are specific to the Evaluate node. For information about common node properties, see Properties for flows and nodes.

Table 1. Node properties for scripting

Property name | Data type | Property description
asset_type | DataAsset, Connection | Specify your data type: DataAsset or Connection.
asset_id | String | When DataAsset is set for the asset_type, this is the ID of the asset.
asset_name | String | When DataAsset is set for the asset_type, this is the name of the asset.
baseline_input | String | The Evaluate node requires connections from two nodes: one that contains your original baseline data and one that contains the generated synthetic data.
connection_id | String | When Connection is set for the asset_type, this is the ID of the connection.
connection_name | String | When Connection is set for the asset_type, this is the name of the connection.
connection_path | String | When Connection is set for the asset_type, this is the path of the connection.
fast_evaluation | Boolean | Set to True to use simple assessment mode. In simple assessment mode, metrics are run on a single machine learning (ML) model. In full assessment mode, metrics are evaluated and averaged across multiple ML models whenever possible.
feature_distribution_bin_nb | Integer | Number of bins to distribute the features into.
predictive_utility_column | String | Column to use for the show_predictive_utility property.
show_predictive_utility | Boolean | Set to True to measure the usefulness of the synthetic data for predictive downstream tasks. It evaluates how accurately predictive models trained on the synthetic data predict a selected target, using real data as test data.
show_privacy_leakage_score | Boolean | Set to True to measure the percentage of rows in the synthetic data that are different from the rows in the real data. A leakage prevention of 100% means that all synthetic rows are new, while a leakage prevention of 0% means that all the synthetic rows are copied from the real data.
show_privacy_proximity_score | Boolean | Set to True to compute the distance between points in the synthetic data and the real data. The smaller this distance, the easier it is to isolate some rows from the real data, which increases privacy risk.
show_data_distinguishability | Boolean | Set to True to rate the ability of a binary classifier to separate real data from synthetic data. The harder it is to train such a classifier, the better the quality of the synthetic data with respect to its ability to mimic statistical properties of the real data.
show_fidelity_score | Boolean | Set to True to aggregate multiple metrics and rate the similarity between real data and synthetic data for distributions of individual columns. It also rates the similarity of correlations for all pairs of columns.
user_settings | String | Escaped JSON string containing the interaction properties for the connection. These values change based on the type of connection that you're using. For example:
user_settings: "{\"interactionProperties\":{\"write_mode\":\"write\",\"file_name\":\"output.csv\",\"file_format\":\"csv\",\"quote_numerics\":true,\"encoding\":\"utf-8\",\"first_line_header\":true,\"include_types\":false}}"

Example

The following is an example of the properties used in a script.

import json

# Get the flow that this script runs in
stream = sdg.script.stream()

# Find the Data Asset import node by its node ID
dataassetimport = stream.findByID("<import nodeId>")

# Load the user_settings string as a JSON object
userSettings = json.loads(dataassetimport.getPropertyValue("user_settings"))

# Update the sheet name and write the settings back to the node
userSettings["interactionProperties"]["sheet_name"] = "<new sheet name>"
dataassetimport.setPropertyValue("user_settings", json.dumps(userSettings))
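
The same pattern applies to the Evaluate node's own properties from Table 1. The following sketch assumes a placeholder Evaluate node ID and an illustrative target column name; it turns on the predictive utility metric and sets the column that the metric predicts.

stream = sdg.script.stream()

# Find the Evaluate node by its node ID (placeholder value shown here)
evaluatenode = stream.findByID("<evaluate nodeId>")

# Enable the predictive utility metric and set the target column to predict
evaluatenode.setPropertyValue("show_predictive_utility", True)
evaluatenode.setPropertyValue("predictive_utility_column", "<target column name>")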