Using differential privacy

Differential privacy protects user data from being traced back to individual users. It is controlled by a set of parameters known as the privacy budget, which measures the privacy loss that results from adding or removing one entry in a data set.
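To make the idea of a privacy budget concrete, here is a minimal sketch of the Laplace mechanism, a standard textbook technique for differential privacy. It is not a description of how the Mimic node works internally, which this page does not specify. The function names `laplace_noise` and `private_count` are hypothetical and exist only for illustration.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw one sample from a zero-centered Laplace distribution."""
    # The difference of two independent exponential samples with mean
    # `scale` follows a Laplace(0, scale) distribution.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(values, predicate, epsilon: float, rng: random.Random) -> float:
    """Return a noisy count of the entries that satisfy `predicate`."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for v in values if predicate(v))
    # Adding or removing one entry changes a count by at most 1.
    sensitivity = 1.0
    # A smaller epsilon gives a larger noise scale, so more privacy
    # and less accuracy.
    return true_count + laplace_noise(sensitivity / epsilon, rng)


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so the run can be repeated
    ages = [23, 35, 41, 58, 62, 29, 47]  # made-up example data
    for eps in (0.1, 1.0, 10.0):
        noisy = private_count(ages, lambda a: a >= 40, eps, rng)
        print(f"epsilon={eps}: noisy count = {noisy:.2f}")
```

In this sketch the noise scale is `1 / epsilon`, so a budget of 0.1 adds noise roughly a hundred times larger than a budget of 10. That is the same privacy versus accuracy trade-off that the Privacy budget (epsilon) setting controls.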

Before you can configure differential privacy settings in the Mimic node, you need to create a Synthetic Data Generator flow. For more information, see Creating a synthetic data flow.

  1. Open your Synthetic Data Generator flow in the Synthetic Data Generator graphical flow editor.

  2. If your flow doesn't already have a Mimic node, add one by double-clicking the Mimic node in the nodes palette and connecting it to the flow.

  3. Hover over the Mimic node and click Edit.

  4. Scroll down and select Privacy. In the Privacy section, turn on Enable differential privacy.

    This will ensure that no sensitive data specific to any individual is exposed in the synthetic output. You can control the level of privacy protection by adjusting the privacy budget (epsilon) and leakage (delta) parameters.

  5. Adjust the Privacy budget (epsilon).

    The privacy budget allows you to tune the level of privacy protection required in your synthetic output. A smaller value provides greater privacy protection, with some loss in accuracy. A larger value provides greater accuracy, with less privacy protection.

  6. Adjust the Privacy leakage probability (delta).

    Delta is usually referred to as the maximum allowable probability of a privacy leakage. Delta should be less than or equal to 1/(n*n), where n is the sample size. The smaller the delta, the better privacy is preserved.

  7. Generate a Random seed. When differential privacy is enabled, this random seed value will enable you to reproduce your differentially private synthetic output. When differential privacy is disabled, the random seed value can be adjusted in the Generate node.

  8. Manually adjust the Column bounds (optional). Column bounds are automatically applied, but you can manually adjust these bounds to restrict the range of values used for fitting. You can only select numeric columns.

    Note: After a flow runs, the column bounds are not updated in the **Generate** node results even if the column bounds are set here. This is expected behavior. If you enter a value larger or smaller than the real data column bounds, then the differential privacy values are adjusted to the new values. However, the minimum/maximum column bounds are only applied to the real data and not to the generated synthetic data. The benefit of this is that the differential privacy results are not disrupted by specified minimum/maximum column bounds in the **Generate** node. Manually setting the minimum and maximum could potentially result in privacy leakage.

  9. After updating the Privacy options, select Save.

  10. Select Run all.

    Note: Parameters computed from a synthetic dataset that was generated with differential privacy enabled will differ from the parameters of your original dataset.
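The delta, random seed, and column bounds settings from the steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `max_recommended_delta`, `clip_to_bounds`, and `seeded_noise` are hypothetical, and clipping is assumed here as one common way to restrict values to column bounds before fitting. The product's actual handling may differ.

```python
import random


def max_recommended_delta(n: int) -> float:
    """Upper limit for delta given a sample size n, that is 1/(n*n)."""
    if n <= 0:
        raise ValueError("sample size must be positive")
    return 1.0 / (n * n)


def clip_to_bounds(values, lower: float, upper: float):
    """Restrict numeric values to [lower, upper] (assumed clipping behavior)."""
    if lower > upper:
        raise ValueError("lower bound must not exceed upper bound")
    return [min(max(v, lower), upper) for v in values]


def seeded_noise(seed: int, count: int):
    """Generate noise that is identical on every call with the same seed."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(count)]


if __name__ == "__main__":
    # For 1,000 rows, delta should be at most one in a million.
    print(max_recommended_delta(1000))
    # Values outside the bounds are pulled back to the nearest bound.
    print(clip_to_bounds([5, 50, 500], lower=10, upper=100))
    # The same seed reproduces the same random draws.
    print(seeded_noise(7, 3) == seeded_noise(7, 3))
```

The last check shows why a fixed random seed makes a differentially private run reproducible: the noise is random, but it is drawn from a generator whose starting state you control.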

Learn more

Creating synthetic data from a custom data schema