While data privacy in general has long been a concern, the term “AI data privacy” acknowledges that the emerging technology of artificial intelligence brings with it new risks and privacy challenges.
During training, AI systems learn from vast datasets. The Common Crawl dataset that many models train on contains over 9.5 petabytes of data.1 Many people who use AI daily might also be feeding these systems sensitive data, not fully aware that they are eroding their own privacy. And as the deployment of AI extends into an era of AI agents, new types of privacy breaches become possible in the absence of proper access controls or AI governance.
AI models don’t just process more data; they also handle data differently from legacy systems. If a piece of traditional software accidentally exposes sensitive information, an engineer can go in and debug the code. But AI models (including the large language models behind chatbots such as ChatGPT) are not explicitly programmed so much as trained through a process called machine learning. Even their own creators do not know exactly how they work, making “debugging” nontrivial, if not impossible.
Accidental outputs are one category of concern, but organizations also need to be wary of deliberate, malicious attacks. Researchers have demonstrated that AI tools contain new types of vulnerabilities that clever hackers can exploit; the study of these attacks and their defenses is known as adversarial machine learning.
In recent years, for instance, cybersecurity experts have demonstrated that by exploiting one quirk of AI models, namely that they tend to return higher confidence scores for data they were trained on, a bad actor can infer whether a particular record was in a training set. This technique is known as a membership inference attack, and in certain scenarios it amounts to a major privacy breach. Consider an AI model known to have been trained on the healthcare records of HIV-positive patients: inferring that someone’s record was in the training set would effectively reveal that person’s HIV status.
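To make the mechanics concrete, below is a minimal sketch of a confidence-based membership inference attack in Python. The synthetic data, the scikit-learn model and the 0.8 confidence threshold are illustrative assumptions, not details from the research described above; real attacks typically calibrate the threshold using “shadow models” trained to mimic the target.

```python
# Minimal sketch of a confidence-based membership inference attack.
# The dataset, model and threshold are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic "private" dataset: half is used to train the target model
# (members), half is held out (nonmembers).
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_members, y_members = X[:1000], y[:1000]
X_nonmembers, y_nonmembers = X[1000:], y[1000:]

# An overfit model tends to be more confident on its own training data.
target_model = RandomForestClassifier(n_estimators=50, random_state=0)
target_model.fit(X_members, y_members)

def max_confidence(model, X):
    """Highest predicted class probability for each record."""
    return model.predict_proba(X).max(axis=1)

# Attack: guess "member" whenever the model's confidence exceeds a threshold.
THRESHOLD = 0.8  # illustrative; real attacks calibrate this with shadow models
flagged_members = max_confidence(target_model, X_members) > THRESHOLD
flagged_nonmembers = max_confidence(target_model, X_nonmembers) > THRESHOLD

print(f"True positive rate:  {flagged_members.mean():.2f}")
print(f"False positive rate: {flagged_nonmembers.mean():.2f}")
```

The wider the confidence gap between training data and held-out data (that is, the more the model overfits), the more reliable the inference becomes, which is one reason regularization and differential privacy are common defenses.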
In another well-known instance, researchers went beyond merely inferring whether data was in a training set. They created an algorithmic attack, known as model inversion, that could effectively reverse-engineer the data used to train a model. By following the model’s gradients (performing gradient descent on the input rather than on the model’s weights), researchers iteratively refined a noise-filled image into one closely approximating an actual face that had been used to train a facial recognition model.2
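The sketch below shows the core loop of such a gradient-based inversion attack, assuming a PyTorch classifier. The tiny, untrained stand-in for a facial recognition model, the 32x32 input size and the optimization settings are hypothetical; the published attack targeted a trained model and exploited the confidence scores it returned.

```python
# Minimal sketch of gradient-based model inversion in PyTorch. The toy,
# untrained "facial recognition" classifier is an illustrative stand-in;
# a real attack targets a trained victim model.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in classifier: 10 identities, 32x32 grayscale inputs.
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(32 * 32, 128),
    nn.ReLU(),
    nn.Linear(128, 10),
)
model.eval()
for param in model.parameters():
    param.requires_grad_(False)  # freeze the model; only the input is optimized

target_identity = 3                               # identity to reconstruct
x = torch.rand(1, 1, 32, 32, requires_grad=True)  # start from random noise
optimizer = torch.optim.Adam([x], lr=0.05)

for step in range(500):
    optimizer.zero_grad()
    logits = model(x)
    # Minimizing classification loss with respect to the *input* pushes the
    # image toward whatever maximizes confidence in the target identity.
    loss = nn.functional.cross_entropy(logits, torch.tensor([target_identity]))
    loss.backward()            # gradients flow into the input image
    optimizer.step()
    x.data.clamp_(0.0, 1.0)    # keep pixel values in a valid range

# Against an overfit model, the optimized input can resemble a training face.
print("Final loss:", round(loss.item(), 4))
```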
The stakes around data protection remain high: IBM’s 2025 Cost of a Data Breach Report determined that the average cost of such breaches was USD 4.4 million. (Such breaches also entail a difficult-to-quantify cost in the form of damaged public trust in one’s brand.)
While many of these data breaches do not implicate AI, an increasing number do. Stanford’s 2025 AI Index Report found that the number of AI privacy and security incidents jumped 56.4% in a year, with 233 reported cases in 2024.3
Policymakers globally have asserted that AI technologies should by no means be exempt from basic privacy protections. The European Union’s General Data Protection Regulation (GDPR), long considered a baseline for the handling of personal data (no matter the jurisdiction), applies to firms’ use of AI systems. Its principles include data minimization (collecting only the minimum data needed for a purpose), transparency (informing users of how their data is used) and storage limitation (retaining data no longer than necessary).
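As a concrete illustration, here is a minimal sketch of two of these principles, data minimization and storage limitation, applied to user events before they feed any downstream AI system. The field names, the recommendation use case and the 30-day retention window are hypothetical.

```python
# Minimal sketch of data minimization and storage limitation. Field names,
# the stated purpose and the 30-day retention window are hypothetical.
from datetime import datetime, timedelta, timezone

RETENTION_WINDOW = timedelta(days=30)
FIELDS_NEEDED_FOR_RECOMMENDATIONS = {"user_id", "item_viewed", "timestamp"}

def minimize(record: dict) -> dict:
    """Data minimization: keep only the attributes the stated purpose requires."""
    return {k: v for k, v in record.items() if k in FIELDS_NEEDED_FOR_RECOMMENDATIONS}

def within_retention(record: dict, now: datetime) -> bool:
    """Storage limitation: drop events older than the retention window."""
    return now - record["timestamp"] <= RETENTION_WINDOW

now = datetime.now(timezone.utc)
raw_events = [
    {"user_id": 42, "item_viewed": "sku-123",
     "timestamp": now - timedelta(days=2),
     "home_address": "redacted", "health_notes": "redacted"},  # not needed for this purpose
    {"user_id": 7, "item_viewed": "sku-456",
     "timestamp": now - timedelta(days=90)},                    # past the retention window
]

usable_events = [minimize(r) for r in raw_events if within_retention(r, now)]
print(usable_events)  # only the recent event, stripped to the needed fields
```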
2024 proved a landmark year in this space, as several regulators began enforcing privacy laws in cases involving AI applications.
For instance, in 2024 Ireland’s Data Protection Commission fined the social media network LinkedIn 310 million euros for an AI-related privacy violation. LinkedIn tracked certain subtle user behaviors, such as how long a person lingered on a post. The site then used AI to derive inferences about these users (such as whether they actively sought new jobs, or whether they were at high risk for burnout). This profiling was then used to target advertising and update certain internal LinkedIn ranking systems.
The Irish commission determined that despite a sheen of anonymization, these AI-derived inferences could ultimately be traced back to identifiable individuals’ data, thereby running afoul of data privacy laws. The commission ruled that LinkedIn did not respect the GDPR principle of purpose limitation, nor did it secure informed consent from users, thus violating consumer privacy. The ruling also forced LinkedIn to implement real-time consent mechanisms and revise the defaults of its advertising personalization settings.4
Also in 2024, a regulatory enforcement action against facial recognition firm Clearview AI illustrated the principle that biometric data (such as photos of faces) raises further privacy issues, even if the data is technically publicly available (such as on an unsecured social media account).
Clearview had scraped 30 billion images from sites such as Facebook and Instagram, arguing that the firm didn’t need users’ permission, as the photos were publicly available online. This massive data collection operation then fueled Clearview’s development of an AI-driven facial recognition database.
Dutch regulators excoriated Clearview’s approach. The Dutch Data Protection Authority ultimately imposed a fine of 30.5 million euros on the firm, ruling that the data collection violated the individual rights of the Dutch citizens whose images were swept up.5
Finally, 2024 saw the European Union expand AI-specific regulation with its AI Act, which went into effect in August of that year. The act’s remit is wider than AI-related data, extending to the risks of AI and AI development more broadly. However, many of its provisions touch on data security, data sharing and data governance. To cite one prominent example: the act prohibits biometric categorization systems that use biometric data and AI models to infer sensitive attributes such as race, religion or sexual orientation.
In this fast-moving landscape, with the need to embrace innovation seemingly in tension with the need to do so responsibly, what steps might firms take to strike the balance? Entire books could be written on the subject, but a few principles can begin to guide the enterprise as it implements AI responsibly.
Old paradigms of data security are insufficient when data is ingested, processed and produced at multiple stages of an AI model’s lifecycle. Data stewards, compliance professionals and other stakeholders should attend to the integrity of their training data, ideally conducting audits for privacy risk. One firm claims to have found 12,000 API keys and passwords in the Common Crawl dataset.6
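A first step in such an audit can be as simple as scanning training text for obvious secrets. The sketch below is a bare-bones example; the regex patterns and sample corpus are illustrative assumptions, and production scanners (including the research cited above) use far more extensive rule sets and verify whether discovered credentials are still live.

```python
# Bare-bones sketch of auditing training text for leaked secrets.
# The patterns and corpus are illustrative; real scanners use far more rules.
import re

SECRET_PATTERNS = {
    "aws_access_key": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "generic_api_key": re.compile(r"""(?i)\bapi[_-]?key\s*[:=]\s*["'][A-Za-z0-9_\-]{16,}["']"""),
    "hardcoded_password": re.compile(r"""(?i)\bpassword\s*[:=]\s*["'][^"']{8,}["']"""),
}

def scan_document(doc_id, text):
    """Return (doc_id, pattern_name, matched_text) for each suspected secret."""
    findings = []
    for name, pattern in SECRET_PATTERNS.items():
        for match in pattern.finditer(text):
            findings.append((doc_id, name, match.group(0)))
    return findings

# Toy stand-in for crawled web pages destined for a training corpus.
corpus = {
    "page-001": 'api_key = "sk_live_abc123def456ghi789"',
    "page-002": "Nothing sensitive here, just prose about gardening.",
}

for doc_id, text in corpus.items():
    for finding in scan_document(doc_id, text):
        print(finding)
```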
And when it comes to the use of big data generated by a firm’s activity, standards such as the GDPR and related privacy regulations can be useful guides.
AI is a highly active field, with new research and discoveries trickling in almost daily. It is important for cybersecurity professionals to stay on top of the latest technological advancements, the better to patch vulnerabilities before a threat actor exploits them.
Firms can use privacy-enhancing technologies such as federated learning, differential privacy and synthetic data. As always, they can insist on strong access controls to prevent unauthorized access by humans and AI agents alike.
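To show what one of these techniques looks like in practice, the sketch below releases an aggregate statistic under differential privacy using the Laplace mechanism. The dataset, the query and the epsilon values are illustrative assumptions; production deployments rely on vetted libraries and careful privacy-budget accounting.

```python
# Minimal sketch of differential privacy via the Laplace mechanism.
# The data, query and epsilon values are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

# Suppose each record is 1 if a patient has a given diagnosis, else 0.
records = rng.integers(0, 2, size=1000)

def dp_count(data, epsilon, sensitivity=1.0):
    """Release a count with Laplace noise scaled to sensitivity / epsilon."""
    true_count = float(np.sum(data))
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_count + noise

# Smaller epsilon means stronger privacy guarantees and noisier answers.
for epsilon in (0.1, 1.0, 10.0):
    print(f"epsilon={epsilon:<4}: noisy count = {dp_count(records, epsilon):.1f}")

print("true count =", int(records.sum()))
```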
As more firms use generative AI and other AI technologies to automate decision-making, executives should bring a privacy lens to AI-fueled practices where the notion of “data” has become muddier. This principle is in evidence in the LinkedIn ruling mentioned earlier: drawing inferences from patterns in data can still run afoul of GDPR and related regulations, even when the practice appears anonymized.
As AI grows more powerful at spotting patterns, it might subvert long-held notions about what constitutes “anonymized” data. One 2019 study in Nature Communications showed that with the right generative model, “99.98% of Americans could be correctly reidentified in any dataset using 15 demographic attributes”. The finding suggests that the very notion of what constitutes personal data is undergoing a transformation.7
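The sketch below illustrates the underlying intuition with a synthetic population: the more demographic attributes an “anonymized” record retains, the more likely that record is unique, and therefore linkable back to a person. The attributes and population are made up, and simply counting unique combinations is a simplification of the cited study, which fits a generative model to estimate uniqueness from incomplete samples.

```python
# Sketch of re-identification risk: fraction of records made unique by a
# handful of demographic attributes. The synthetic population is illustrative
# and direct counting is a simplification of the cited study's method.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 100_000

population = pd.DataFrame({
    "zip3": rng.integers(100, 999, n),   # first three digits of a ZIP code
    "birth_year": rng.integers(1940, 2005, n),
    "gender": rng.choice(["F", "M"], n),
    "marital_status": rng.choice(["single", "married", "divorced", "widowed"], n),
    "num_children": rng.integers(0, 5, n),
})

def unique_fraction(df, attributes):
    """Fraction of records that are the only ones with their attribute combination."""
    group_sizes = df.groupby(attributes).size()
    singleton_records = int((group_sizes == 1).sum())
    return singleton_records / len(df)

for attributes in (
    ["zip3", "birth_year"],
    ["zip3", "birth_year", "gender"],
    ["zip3", "birth_year", "gender", "marital_status", "num_children"],
):
    print(f"{len(attributes)} attributes: "
          f"{unique_fraction(population, attributes):.1%} of records are unique")
```

Even this crude count shows uniqueness climbing quickly as attributes are added, hinting at why 15 attributes can suffice to single out nearly everyone.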
1. “Mozilla Report: How Common Crawl’s Data Infrastructure Shaped the Battle Royale over Generative AI,” Mozilla, 6 February 2024
2. “Model Inversion Attacks that Exploit Confidence Information and Basic Countermeasures,” CCS’15, October 2015
3. “The 2025 AI Index Report,” Stanford HAI (Human-Centered Artificial Intelligence), April 2025
4. “Fines for GDPR violations in AI systems and how to avoid them,” EU Data Privacy Office, 16 October 2025
5. “Dutch DPA imposes a fine on Clearview because of illegal data collection for facial recognition,” Autoriteit Persoonsgegevens, 3 September 2024
6. “Research finds 12,000 ‘Live’ API Keys and Passwords in DeepSeek’s Training Data,” Truffle Security, 27 February 2025
7. “Estimating the success of re-identifications in incomplete datasets using generative models,” Nature Communications, 23 July 2019