Data privacy guide to AI and machine learning


While data privacy in general has long been a concern, the term “AI data privacy” acknowledges that the emerging technology of artificial intelligence brings with it new risks and privacy concerns.  

During training, AI systems learn from vast datasets. The Common Crawl dataset that many models train on contains over 9.5 petabytes of data.1 Many people who use AI daily might also be feeding systems sensitive data, not fully aware that they are eroding their individual privacy. And as the deployment of AI extends to an era of AI agents, new types of privacy breaches become possible in the absence of proper access controls or AI governance.

A transformed risk landscape

AI models don’t just process more data; they also handle data differently from legacy systems. If a piece of traditional software accidentally exposes sensitive information, an engineer can go in and debug the code. But AI models (including large language models such as ChatGPT) are not coded so much as made to evolve through a process called machine learning. Their own creators do not know exactly how they work, making “debugging” nontrivial, if not impossible.

Accidental outputs are one category of concern, but organizations also need to be wary of deliberate, malicious attacks. Researchers have demonstrated that AI tools contain new types of vulnerabilities that clever hackers can exploit, a field known as adversarial machine learning. 

In recent years, for instance, cybersecurity experts have demonstrated that by exploiting one quirk of AI models—namely, that their outputs are given higher confidence scores when responding to data they’ve trained on—a bad actor can infer whether certain data was in a training set. In certain scenarios, such an inference would be a major privacy breach. For instance, consider an AI model known to have been trained on the private healthcare records of HIV-positive patients: merely confirming that a person’s record was in the training set would reveal that person’s HIV status.
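This class of attack is known in the research literature as a membership inference attack. The sketch below is a minimal, hypothetical illustration of the core idea, using a scikit-learn classifier and synthetic data rather than any real system: the attacker queries the model, reads off the confidence of its prediction, and guesses “member” whenever that confidence clears a threshold.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical setup: train a target model, then guess membership from confidence.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_out, y_train, y_out = train_test_split(X, y, test_size=0.5, random_state=0)

target_model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

def confidence(model, X):
    # Confidence = probability the model assigns to its own predicted class.
    return model.predict_proba(X).max(axis=1)

# Overfit models tend to be more confident on records they were trained on.
threshold = 0.9  # would normally be calibrated with shadow models, not hard-coded
flagged_train = confidence(target_model, X_train) > threshold
flagged_out = confidence(target_model, X_out) > threshold

print(f"Flagged as 'member': {flagged_train.mean():.0%} of training records, "
      f"{flagged_out.mean():.0%} of unseen records")
```

A well-regularized model narrows the gap between those two percentages; a badly overfit one widens it, and that gap is precisely the signal the attacker exploits.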

In another well-known instance, researchers went beyond merely inferring whether data was in a training set. They created an algorithmic attack that could effectively reverse-engineer the actual data that was used to train a model. By exploiting an aspect of AI models known as their “gradients,” researchers were able to iteratively refine a noise-filled image into an image closely approximating an actual face that had been used to train a facial recognition model.2
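This technique is often called gradient inversion, or “deep leakage from gradients.” The toy sketch below, written against PyTorch with an illustrative two-layer network rather than a real facial recognition model, shows the general recipe: start from random noise and iteratively adjust it until the gradients it produces match the gradients observed for the secret training example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative stand-in for a model whose per-example gradients an attacker can
# observe (for example, updates shared during federated learning).
model = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 10))
criterion = nn.CrossEntropyLoss()

# The "secret" training example the attacker wants to reconstruct.
secret_x = torch.rand(1, 64)
secret_y = torch.tensor([3])
true_grads = [g.detach() for g in torch.autograd.grad(
    criterion(model(secret_x), secret_y), model.parameters())]

# The attacker starts from pure noise and refines it until the gradients it
# produces match the gradients observed for the secret example.
dummy_x = torch.rand(1, 64, requires_grad=True)
guessed_y = torch.tensor([3])  # in practice the label can often be inferred from the gradients
optimizer = torch.optim.LBFGS([dummy_x])

def closure():
    optimizer.zero_grad()
    loss = criterion(model(dummy_x), guessed_y)
    dummy_grads = torch.autograd.grad(loss, model.parameters(), create_graph=True)
    # Distance between the attacker's gradients and the observed ones.
    grad_diff = sum(((dg - tg) ** 2).sum() for dg, tg in zip(dummy_grads, true_grads))
    grad_diff.backward()
    return grad_diff

for _ in range(30):
    optimizer.step(closure)

print("reconstruction error:", torch.dist(dummy_x, secret_x).item())
```

The model size, data and optimizer settings here are purely illustrative; the point is that gradients alone can carry enough information to approximate the original input.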

The stakes around data protection remain high: IBM’s 2025 Cost of a Data Breach Report determined that the average cost of such breaches was USD 4.4 million. (Such breaches also entail a difficult-to-quantify cost in the form of damaged public trust in one’s brand.)

While many of these data breaches do not implicate AI, an increasing number do. Stanford’s 2025 AI Index Report found that the number of AI privacy and security incidents jumped 56.4% in a year, with 233 reported cases in 2024.3


An evolving regulatory environment

Policymakers globally have asserted that AI technologies should by no means be exempt from the responsibility of basic privacy protections. The European Union’s General Data Protection Regulation (GDPR), long considered a baseline for the handling of personal data (no matter the jurisdiction), applies to firms’ use of AI systems. Principles of GDPR include data minimization (collecting only the minimum data needed for a purpose), transparency (informing users of how data is used) and storage limitation (retaining data no longer than necessary).

2024 was a landmark year in this space, as several regulators began to enforce privacy laws in cases involving AI applications.

For instance, in 2024 Ireland’s Data Protection Commission fined the social media network LinkedIn 310 million euros for an AI-related privacy violation. LinkedIn tracked certain subtle user behaviors, such as how long a person lingered on a post. The site then used AI to derive inferences about these users (such as whether they actively sought new jobs, or whether they were at high risk for burnout). This profiling was then used to target advertising and update certain internal LinkedIn ranking systems.

The Irish commission ultimately determined that despite a sheen of anonymization, these AI-derived inferences could be traced back to identifiable individuals’ data, thereby running afoul of data privacy laws. The commission further found that LinkedIn did not respect the GDPR principle of purpose limitation, nor did it secure informed consent from users, thus violating consumer privacy. The ruling also forced LinkedIn to implement real-time consent mechanisms and revise the defaults of its advertising personalization settings.4

Also in 2024, an enforcement action against the facial recognition firm Clearview AI illustrated the principle that biometric data (such as photos of faces) raises further privacy issues, even if the data is technically publicly available (such as on an unsecured social media account).

Clearview had scraped 30 billion images from sites such as Facebook and Instagram, arguing that the firm didn’t need users’ permission, as the photos were publicly available online. This massive data collection operation then fueled Clearview’s development of an AI-driven facial recognition database.

Dutch regulators excoriated Clearview’s approach. The Dutch Data Protection Authority ultimately imposed a fine of 30.5 million euros on the firm, finding that the collection violated the individual rights of the Dutch citizens whose images it swept up.5

Finally, 2024 saw the European Union expand AI-specific regulation with its AI Act, which went into effect in August of that year. The act’s remit is wider than AI-related data, extending to the risks of AI and AI development more broadly. However, many of its provisions touch on data security, data sharing and data governance. To cite one prominent example: the act prohibits biometric categorization systems that use biometric data and AI models to infer sensitive attributes such as race, religion or sexual orientation.


Principles to minimize AI data privacy risk

In this fast-moving landscape, where the need to embrace innovation can seem in tension with the need to do so responsibly, what steps might firms take to strike the balance? Entire books could be written on the subject, but a few principles can begin to guide the enterprise as it implements AI responsibly.

Governing the whole AI data lifecycle

Old paradigms of data security are insufficient when data is ingested, processed and produced at multiple stages of an AI model’s lifecycle. Data stewards, compliance professionals and other stakeholders should attend to the integrity of their training data, ideally conducting audits for privacy risk. One firm claims to have found 12,000 API keys and passwords in the Common Crawl dataset.6
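A concrete, if narrow, example of such an audit is scanning a corpus for credential-like strings before it is used for training. The sketch below is a minimal illustration using a few hypothetical regular expressions and an in-memory corpus; production audits rely on dedicated secret scanners, entropy analysis and checks on whether the keys are still live.

```python
import re

# Hypothetical patterns; real audits use dedicated secret-scanning tools.
SECRET_PATTERNS = {
    "aws_access_key": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "generic_api_key": re.compile(r"(?i)\b(?:api[_-]?key|token)\b\s*[:=]\s*['\"]?[A-Za-z0-9_\-]{20,}"),
    "private_key_header": re.compile(r"-----BEGIN (?:RSA |EC )?PRIVATE KEY-----"),
}

def scan_document(doc_id: str, text: str):
    """Yield (doc_id, pattern_name, match) for every credential-like string found."""
    for name, pattern in SECRET_PATTERNS.items():
        for match in pattern.finditer(text):
            yield doc_id, name, match.group(0)

# Example usage over a tiny corpus; a real audit would stream web-scale data.
corpus = {
    "page_001": "config: api_key = 'sk_live_0123456789abcdefABCDEF'",
    "page_002": "Nothing sensitive here.",
}
for doc_id, text in corpus.items():
    for hit in scan_document(doc_id, text):
        print(hit)
```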

And when it comes to the use of big data generated by a firm’s activity, standards such as the GDPR and related privacy regulations can be useful guides.

Staying ahead in the arms race

AI is a highly active field, with new research and discoveries trickling in almost daily. It is important for cybersecurity professionals to stay on top of the latest technological advancements, the better to patch vulnerabilities before a threat actor exploits them.

Firms can use privacy-enhancing technologies such as federated learning, differential privacy and synthetic data. As always, they can insist on strong access controls to prevent unauthorized access by humans and AI agents alike.
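As a miniature example of how one of these techniques works, the sketch below applies differential privacy’s Laplace mechanism to a simple counting query. The function name and epsilon value are illustrative; real deployments track a privacy budget across many queries and typically use a vetted library rather than hand-rolled noise.

```python
import numpy as np

def private_count(values, epsilon: float) -> float:
    """Release a count with the Laplace mechanism.

    For a counting query, adding or removing one person changes the result
    by at most 1 (sensitivity = 1), so Laplace noise with scale 1/epsilon
    gives epsilon-differential privacy for this single release.
    """
    true_count = float(len(values))
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# Illustrative use: publish how many users opted in, without exposing any one record.
opted_in_users = ["u1", "u2", "u3", "u4", "u5"]
print(private_count(opted_in_users, epsilon=0.5))
```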

Privacy-aware decision-making

As more firms use generative AI and other AI technologies to automate decision-making, executives should bring a privacy lens to AI-fueled practices where the notion of “data” might have become muddy. This principle is in evidence in the LinkedIn ruling mentioned earlier: in some circumstances, drawing inferences from data patterns can still run afoul of GDPR and related regulations, even when the results carry a sheen of anonymization.

As AI grows more powerful at spotting patterns, it might subvert long-held notions about what constitutes “anonymized” data. One 2019 study in Nature showed that with the right generative model, “99.98% of Americans could be correctly reidentified in any dataset using 15 demographic attributes”. The finding suggests that the very notion of what constitutes personal data is undergoing a transformation.7
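One way to make that risk concrete is to measure how many people in a dataset are unique on a handful of quasi-identifiers. The sketch below uses toy data and a direct uniqueness count (rather than the study’s generative-model approach) to show how quickly a few innocuous-looking attributes single people out.

```python
import pandas as pd

# Toy dataset with a few quasi-identifiers; the study cited above used 15 demographic attributes.
df = pd.DataFrame({
    "zip_code":   ["10001", "10001", "94107", "94107", "60601"],
    "birth_year": [1985,    1985,    1990,    1990,    1977],
    "gender":     ["F",     "M",     "F",     "F",     "M"],
})

QUASI_IDENTIFIERS = ["zip_code", "birth_year", "gender"]

# How many people share each combination of attributes (k-anonymity group sizes).
group_sizes = df.groupby(QUASI_IDENTIFIERS).size()

# Anyone in a group of size 1 is uniquely identifiable from these attributes alone.
unique_people = int((group_sizes == 1).sum())
print(f"{unique_people} of {len(df)} people ({unique_people / len(df):.0%}) "
      f"are unique on {len(QUASI_IDENTIFIERS)} attributes")
```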

Author

David Zax

Staff Writer

IBM Think
