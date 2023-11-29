Although synthetic data is often called artificial or “fake” data because it is computer-generated rather than produced by actual events (such as a customer purchase, an internet login or a patient diagnosis), it can still reveal personally identifiable information (PII) when used as training data for AI models. For instance, if a business prioritizes accuracy when generating synthetic data, the resulting output may inadvertently retain too many personally identifiable attributes, increasing the company’s privacy risk exposure without its knowledge. Furthermore, as modeling techniques in data science become increasingly sophisticated, including deep learning and predictive and generative models, companies and vendors must work diligently to prevent unintended linkages that could reveal a person’s identity and expose them to third-party attacks.

Fortunately, enterprises interested in synthetic data can take steps to reduce their privacy risk:

Keep your data where it is

While many companies are migrating their existing software applications to the cloud for cost savings, improved performance and scalability, on-premises deployments continue to play a pivotal role in enhancing privacy and protection. This holds only partly for synthetic data. When dealing with fully synthetic data (data generated without using existing data to train the model) or synthetic data that contains no confidential information or PII, there is minimal risk in using a public cloud deployment. However, companies should consider on-premises deployments when their synthetic data depends on existing sensitive data. Although third-party cloud providers offer robust built-in security and privacy safeguards, sending and storing sensitive customer PII in such clouds may expose your organization to risk and may be blocked by your privacy team.

Have control and robust protection

Not all synthetic data use cases require privacy, but some do. Therefore, risk, security and compliance leaders should implement a mechanism to control their desired level of privacy risk during the synthetic data generation process. “Differential privacy” is one such mechanism, enabling data scientists and risk teams to set their desired level of privacy through a parameter called epsilon (typically within a range of 1 to 10, with lower values giving stronger privacy and 1 representing the highest privacy in that range). This method masks the contribution of any individual, making it very difficult to infer specific information about a person, including whether their information was used at all. It does so by introducing calibrated “noise” during generation, so that vulnerable individual data points are obscured. Although adding noise slightly reduces output accuracy (this is the “cost” of differential privacy), it does not compromise utility or data quality relative to traditional data masking techniques. In other words, a differentially private synthetic dataset still reflects the statistical properties of your real dataset. Differential privacy techniques also bring further benefits: robust data protection against potential privacy attacks, provable privacy guarantees regarding cumulative risk from successive data releases, and data transparency, as there is no need to keep differentially private computations or their parameters secret.
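To make the epsilon trade-off more concrete, here is a minimal sketch of the Laplace mechanism, a standard differential privacy building block, applied to a single count. It is not a synthetic data generator and not any particular vendor's implementation; the function names (`laplace_noise`, `private_count`) and the sample ages are hypothetical and chosen only for illustration. It assumes that adding or removing one person changes the count by at most 1.

```python
import random


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponential draws with mean `scale`
    # follows a Laplace distribution centered at 0 with that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon, rng=None):
    """Return a noisy count of records matching `predicate`.

    Smaller epsilon means more noise and stronger privacy.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = rng or random.Random()
    true_count = sum(1 for r in records if predicate(r))
    # Assumption: one person contributes at most one record, so the
    # count changes by at most 1 when a person is added or removed.
    sensitivity = 1.0
    return true_count + laplace_noise(sensitivity / epsilon, rng)


# Illustrative (made-up) data: ages of customers.
ages = [23, 35, 41, 52, 29, 67, 38, 45, 31, 58]
over_40 = lambda age: age > 40

for eps in (1.0, 10.0):
    noisy = private_count(ages, over_40, eps)
    print(f"epsilon={eps}: noisy count of customers over 40 = {noisy:.2f}")
```

Each run prints different values because the noise is random. On average, the result at epsilon 1 sits about one unit away from the true count, while at epsilon 10 it sits about a tenth of a unit away, which is the accuracy “cost” of choosing stronger privacy.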

Have insight into privacy-related metrics

When differential privacy isn’t an option, business users should maintain a line of sight into privacy-related metrics to understand the extent of their privacy exposure. Here are two common metrics that, while not comprehensive, serve as a solid foundation: