Evaluate node

In Synthetic Data Generator, you can use the Evaluate node to compare the synthetic data to the seed data and assess its quality.

Description: Use the Evaluate node in a Synthetic Data Generator flow to assess the quality and privacy metrics of the generated synthetic dataset by comparing it to the original dataset.
Using the node: Connect the Evaluate node to the Generate node and either the Import node or Anonymize node. The output is a report that displays the metrics. You can see fidelity and data distinguishability metrics for the synthetic dataset relative to the original. You can also see privacy metrics, such as leakage and proximity scores. You can include the Evaluate node multiple times in a Synthetic Data Generator flow. However, you can connect each Evaluate node to only one set of nodes at a time. Large or complex datasets require more memory for the Evaluate node to analyze. For example, a complex dataset has many correlations between columns. If you have large or complex datasets, avoid running multiple flows with Evaluate nodes simultaneously. If an Evaluate node runs out of memory, errors can occur. For details, see Troubleshooting Synthetic Data Generator.
Mandatory or optional: The Evaluate node is optional, but it is highly recommended because it helps you confirm that the synthetic dataset meets the desired quality standards.

Scripting with the Evaluate node

You can use scripting languages, like Python, to programmatically set properties for nodes.

Evaluate node properties

The following properties are specific to the Evaluate node. For information about common node properties, see Properties for flows and nodes.

Table 1. Node properties for scripting
Property name	Data type	Property description
`asset_type`	DataAsset, Connection	Specify your data type: `DataAsset` or `Connection`.
`asset_id`	String	When `DataAsset` is set for the `asset_type`, this is the ID of the asset.
`asset_name`	String	When `DataAsset` is set for the `asset_type`, this is the name of the asset.
`baseline_input`	String	The Evaluate node requires connections from two nodes: one that contains your original baseline data and one that contains the generated synthetic data.
`connection_id`	String	When `Connection` is set for the `asset_type`, this is the ID of the connection.
`connection_name`	String	When `Connection` is set for the `asset_type`, this is the name of the connection.
`connection_path`	String	When `Connection` is set for the `asset_type`, this is the path of the connection.
`fast_evaluation`	Boolean	Set to `True` to use simple assessment mode. In simple assessment mode, metrics are run on a single machine learning (ML) model. In full assessment mode, metrics are evaluated and averaged against multiple ML models whenever possible.
`feature_distribution_bin_nb`	Integer	Number of bins to distribute the features in.
`predictive_utility_column`	String	Column to use for the `show_predictive_utility` property.
`show_predictive_utility`	Boolean	Set to `True` to measure the usefulness of the synthetic data for predictive downstream tasks. It evaluates the performance of predictive models trained from the synthetic data to accurately predict a selected target using real data as test data.
`show_privacy_leakage_score`	Boolean	Set to `True` to measure the percentage of rows in the synthetic data that are different from the rows in the real data. A leakage prevention of 100% means that all synthetic rows are new, while a leakage prevention of 0% means that all the synthetic rows are copied from the real data.
`show_privacy_proximity_score`	Boolean	Set to `True` to compute the distance between points in the synthetic data and the real data. The smaller this distance, the easier it is to isolate some rows from the real data, which increases privacy risk.
`show_data_distinguishability`	Boolean	Set to `True` to rate the ability of a binary classifier to separate real data from synthetic data. The harder it is to train such a classifier, the better the quality of the synthetic data with respect to its ability to mimic statistical properties of the real data.
`show_fidelity_score`	Boolean	Set to `True` to aggregate multiple metrics and rate the similarity between real data and synthetic data for distributions of individual columns. It also rates the similarity of correlations for all pairs of columns.
`user_settings`	String	Escaped JSON string containing the interaction properties for the connection, for example: `user_settings: "{\"interactionProperties\":{\"write_mode\":\"write\",\"file_name\":\"output.csv\",\"file_format\":\"csv\",\"quote_numerics\":true,\"encoding\":\"utf-8\",\"first_line_header\":true,\"include_types\":false}}"` These values will change based on the type of connection you're using.
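
Because `user_settings` is an escaped JSON string, it can be easier to build the value from a dictionary than to escape the quotation marks by hand. The following is a minimal sketch that uses only the Python standard library. The property values are copied from the example in the table and are assumptions that might not apply to your connection type.

```python
import json

# Interaction properties copied from the table example.
# These are illustrative; the valid keys depend on your connection type.
interaction_properties = {
    "write_mode": "write",
    "file_name": "output.csv",
    "file_format": "csv",
    "quote_numerics": True,
    "encoding": "utf-8",
    "first_line_header": True,
    "include_types": False,
}

# json.dumps serializes the dictionary into a JSON string.
user_settings = json.dumps({"interactionProperties": interaction_properties})

# Parsing the string back confirms that it is valid JSON.
parsed = json.loads(user_settings)
print(parsed["interactionProperties"]["file_name"])
```

The backslashes in the table example are needed only when the JSON is written as a quoted string literal. A string produced by `json.dumps` already contains the plain JSON text.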

Example

The following is an example of the properties used in a script.

import json

stream = sdg.script.stream()

# Look up the import node by its ID
dataassetimport = stream.findByID("<import nodeId>")
# user_settings is stored as a JSON string, so parse it into a dictionary
userSettings = json.loads(dataassetimport.getPropertyValue("user_settings"))

# Change one interaction property, then write the settings back as a JSON string
userSettings["interactionProperties"]["sheet_name"] = "<new sheet name>"
dataassetimport.setPropertyValue("user_settings", json.dumps(userSettings))
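
The same `setPropertyValue` call can, in principle, set the Evaluate node properties from Table 1. The following is a hedged sketch rather than a verified script: `FakeNode` is a hypothetical stand-in that lets the code run outside the product, `configure_evaluate_node` is an illustrative helper name, and the bin count of 10 and the `income` column are assumed values.

```python
class FakeNode:
    """Hypothetical stand-in for a flow node so the sketch runs outside the product."""

    def __init__(self):
        self.properties = {}

    def setPropertyValue(self, name, value):
        self.properties[name] = value

    def getPropertyValue(self, name):
        return self.properties[name]


def configure_evaluate_node(node, target_column, bins=10):
    """Set Evaluate node properties from Table 1 (illustrative values)."""
    # Simple assessment mode: metrics run on a single ML model.
    node.setPropertyValue("fast_evaluation", True)
    # Number of bins for the feature distributions (10 is an assumed value).
    node.setPropertyValue("feature_distribution_bin_nb", bins)
    # Enable predictive utility and name the target column it uses.
    node.setPropertyValue("show_predictive_utility", True)
    node.setPropertyValue("predictive_utility_column", target_column)
    # Enable the remaining quality and privacy metrics.
    for flag in (
        "show_fidelity_score",
        "show_data_distinguishability",
        "show_privacy_leakage_score",
        "show_privacy_proximity_score",
    ):
        node.setPropertyValue(flag, True)
    return node


# "income" is a hypothetical column name used only for illustration.
evaluate_node = configure_evaluate_node(FakeNode(), target_column="income")
print(evaluate_node.getPropertyValue("predictive_utility_column"))
```

In a flow, you would pass the node object that `stream.findByID` returns for your Evaluate node instead of `FakeNode()`.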