While data privacy in general has long been a concern, the term “AI data privacy” acknowledges that the emerging technology of artificial intelligence brings with it new risks and privacy challenges.
During training, AI systems learn from vast datasets. The Common Crawl dataset that many models train on contains over 9.5 petabytes of data.1 Many people who use AI daily might also be feeding these systems sensitive data, not fully aware that they are eroding their own privacy. And as the deployment of AI extends into an era of AI agents, new types of privacy breaches become possible in the absence of proper access controls or AI governance.
AI models don’t just process more data; they also handle data differently from legacy systems. If a piece of traditional software accidentally exposes sensitive information, an engineer can go in and debug the code. But AI models (including the large language models behind chatbots such as ChatGPT) are not explicitly programmed so much as trained through a process called machine learning. Even their own creators do not know exactly how they work, making “debugging” nontrivial, if not impossible.
Accidental outputs are one category of concern, but organizations also need to be wary of deliberate, malicious attacks. Researchers have demonstrated that AI tools contain new types of vulnerabilities that clever hackers can exploit; the study of these attacks and their defenses is known as adversarial machine learning.
In recent years, for instance, cybersecurity experts have demonstrated that by exploiting one quirk of AI models, namely that they tend to return higher confidence scores for data they were trained on, a bad actor can infer whether a particular record was in a training set. This technique is known as a membership inference attack, and in certain scenarios it amounts to a major privacy breach. Consider an AI model known to have been trained on the healthcare records of HIV-positive patients: inferring that someone’s record was in the training set would effectively reveal that person’s HIV status.
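To make the mechanics concrete, below is a minimal sketch of a confidence-based membership inference attack in Python. The synthetic data, the scikit-learn model and the 0.8 confidence threshold are illustrative assumptions, not details from the research described above; real attacks typically calibrate the threshold using “shadow models” trained to mimic the target.

```python
# Minimal sketch of a confidence-based membership inference attack.
# The dataset, model and threshold are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic "private" dataset: half is used to train the target model
# (members), half is held out (nonmembers).
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_members, y_members = X[:1000], y[:1000]
X_nonmembers, y_nonmembers = X[1000:], y[1000:]

# An overfit model tends to be more confident on its own training data.
target_model = RandomForestClassifier(n_estimators=50, random_state=0)
target_model.fit(X_members, y_members)

def max_confidence(model, X):
    """Highest predicted class probability for each record."""
    return model.predict_proba(X).max(axis=1)

# Attack: guess "member" whenever the model's confidence exceeds a threshold.
THRESHOLD = 0.8  # illustrative; real attacks calibrate this with shadow models
flagged_members = max_confidence(target_model, X_members) > THRESHOLD
flagged_nonmembers = max_confidence(target_model, X_nonmembers) > THRESHOLD

print(f"True positive rate:  {flagged_members.mean():.2f}")
print(f"False positive rate: {flagged_nonmembers.mean():.2f}")
```

The wider the confidence gap between training data and held-out data (that is, the more the model overfits), the more reliable the inference becomes, which is one reason regularization and differential privacy are common defenses.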
In another well-known instance, researchers went beyond merely inferring whether data was in a training set. They created an algorithmic attack, known as model inversion, that could effectively reverse-engineer the data used to train a model. By following the model’s gradients (performing gradient descent on the input rather than on the model’s weights), researchers iteratively refined a noise-filled image into one closely approximating an actual face that had been used to train a facial recognition model.2
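The sketch below shows the core loop of such a gradient-based inversion attack, assuming a PyTorch classifier. The tiny, untrained stand-in for a facial recognition model, the 32x32 input size and the optimization settings are hypothetical; the published attack targeted a trained model and exploited the confidence scores it returned.

```python
# Minimal sketch of gradient-based model inversion in PyTorch. The toy,
# untrained "facial recognition" classifier is an illustrative stand-in;
# a real attack targets a trained victim model.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in classifier: 10 identities, 32x32 grayscale inputs.
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(32 * 32, 128),
    nn.ReLU(),
    nn.Linear(128, 10),
)
model.eval()
for param in model.parameters():
    param.requires_grad_(False)  # freeze the model; only the input is optimized

target_identity = 3                               # identity to reconstruct
x = torch.rand(1, 1, 32, 32, requires_grad=True)  # start from random noise
optimizer = torch.optim.Adam([x], lr=0.05)

for step in range(500):
    optimizer.zero_grad()
    logits = model(x)
    # Minimizing classification loss with respect to the *input* pushes the
    # image toward whatever maximizes confidence in the target identity.
    loss = nn.functional.cross_entropy(logits, torch.tensor([target_identity]))
    loss.backward()            # gradients flow into the input image
    optimizer.step()
    x.data.clamp_(0.0, 1.0)    # keep pixel values in a valid range

# Against an overfit model, the optimized input can resemble a training face.
print("Final loss:", round(loss.item(), 4))
```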
The stakes around data protection remain high: IBM’s 2025 Cost of a Data Breach Report determined that the average cost of such breaches was USD 4.4 million. (Such breaches also entail a difficult-to-quantify cost in the form of damaged public trust in one’s brand.)
While many of these data breaches do not implicate AI, an increasing number do. Stanford’s 2025 AI Index Report found that the number of AI privacy and security incidents jumped 56.4% in a year, with 233 reported cases in 2024.3
Policymakers globally have asserted that AI technologies should by no means be exempt from basic privacy protections. The European Union’s General Data Protection Regulation (GDPR), long considered a baseline for the handling of personal data (no matter the jurisdiction), applies to firms’ use of AI systems. Its principles include data minimization (collecting only the minimum data needed for a purpose), transparency (informing users of how their data is used) and storage limitation (retaining data no longer than necessary).
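As a concrete illustration, here is a minimal sketch of two of these principles, data minimization and storage limitation, applied to user events before they feed any downstream AI system. The field names, the recommendation use case and the 30-day retention window are hypothetical.

```python
# Minimal sketch of data minimization and storage limitation. Field names,
# the stated purpose and the 30-day retention window are hypothetical.
from datetime import datetime, timedelta, timezone

RETENTION_WINDOW = timedelta(days=30)
FIELDS_NEEDED_FOR_RECOMMENDATIONS = {"user_id", "item_viewed", "timestamp"}

def minimize(record: dict) -> dict:
    """Data minimization: keep only the attributes the stated purpose requires."""
    return {k: v for k, v in record.items() if k in FIELDS_NEEDED_FOR_RECOMMENDATIONS}

def within_retention(record: dict, now: datetime) -> bool:
    """Storage limitation: drop events older than the retention window."""
    return now - record["timestamp"] <= RETENTION_WINDOW

now = datetime.now(timezone.utc)
raw_events = [
    {"user_id": 42, "item_viewed": "sku-123",
     "timestamp": now - timedelta(days=2),
     "home_address": "redacted", "health_notes": "redacted"},  # not needed for this purpose
    {"user_id": 7, "item_viewed": "sku-456",
     "timestamp": now - timedelta(days=90)},                    # past the retention window
]

usable_events = [minimize(r) for r in raw_events if within_retention(r, now)]
print(usable_events)  # only the recent event, stripped to the needed fields
```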
2024 proved a landmark year in this space, as several regulators began enforcing privacy laws in cases involving AI applications.
For instance, in 2024 Ireland’s Data Protection Commission fined the social media network LinkedIn 310 million euros for an AI-related privacy violation. LinkedIn tracked certain subtle user behaviors, such as how long a person lingered on a post. The site then used AI to derive inferences about these users (such as whether they actively sought new jobs, or whether they were at high risk for burnout). This profiling was then used to target advertising and update certain internal LinkedIn ranking systems.
The Irish commission determined that despite a sheen of anonymization, these AI-derived inferences could ultimately be traced back to identifiable individuals’ data, thereby running afoul of data privacy laws. The commission ruled that LinkedIn did not respect the GDPR principle of purpose limitation, nor did it secure informed consent from users, thus violating consumer privacy. The ruling also forced LinkedIn to implement real-time consent mechanisms and revise the defaults of its advertising personalization settings.4
Also in 2024, a regulatory enforcement action against facial recognition firm Clearview AI illustrated the principle that biometric data (such as photos of faces) raises further privacy issues, even if the data is technically publicly available (such as on an unsecured social media account).
Clearview had scraped 30 billion images from sites such as Facebook and Instagram, arguing that the firm didn’t need users’ permission, as the photos were publicly available online. This massive data collection operation then fueled Clearview’s development of an AI-driven facial recognition database.
Dutch regulators excoriated Clearview’s approach. The Dutch Data Protection Authority ultimately imposed a fine of 30.5 million euros on the firm, ruling that the data collection violated the individual rights of the Dutch citizens whose images were swept up.5
Finally, 2024 saw the European Union expand AI-specific regulation with its AI Act, which went into effect in August of that year. The act’s remit is wider than AI-related data, extending to the risks of AI and AI development more broadly. However, many of its provisions touch on data security, data sharing and data governance. To cite one prominent example: the act prohibits biometric categorization systems that use biometric data and AI models to infer sensitive attributes such as race, religion or sexual orientation.
In this fast-moving landscape, with the need to embrace innovation seemingly in tension with the need to do so responsibly, what steps might firms take to strike the balance? Entire books could be written on the subject, but a few principles can begin to guide the enterprise as it implements AI responsibly.
Old paradigms of data security are insufficient when data is ingested, processed and produced at multiple stages of an AI model’s lifecycle. Data stewards, compliance professionals and other stakeholders should attend to the integrity of their training data, ideally conducting audits for privacy risk. One firm claims to have found 12,000 API keys and passwords in the Common Crawl dataset.6
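A first step in such an audit can be as simple as scanning training text for obvious secrets. The sketch below is a bare-bones example; the regex patterns and sample corpus are illustrative assumptions, and production scanners (including the research cited above) use far more extensive rule sets and verify whether discovered credentials are still live.

```python
# Bare-bones sketch of auditing training text for leaked secrets.
# The patterns and corpus are illustrative; real scanners use far more rules.
import re

SECRET_PATTERNS = {
    "aws_access_key": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "generic_api_key": re.compile(r"""(?i)\bapi[_-]?key\s*[:=]\s*["'][A-Za-z0-9_\-]{16,}["']"""),
    "hardcoded_password": re.compile(r"""(?i)\bpassword\s*[:=]\s*["'][^"']{8,}["']"""),
}

def scan_document(doc_id, text):
    """Return (doc_id, pattern_name, matched_text) for each suspected secret."""
    findings = []
    for name, pattern in SECRET_PATTERNS.items():
        for match in pattern.finditer(text):
            findings.append((doc_id, name, match.group(0)))
    return findings

# Toy stand-in for crawled web pages destined for a training corpus.
corpus = {
    "page-001": 'api_key = "sk_live_abc123def456ghi789"',
    "page-002": "Nothing sensitive here, just prose about gardening.",
}

for doc_id, text in corpus.items():
    for finding in scan_document(doc_id, text):
        print(finding)
```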
And when it comes to the use of big data generated by a firm’s activity, standards such as the GDPR and related privacy regulations can be useful guides.
AI is a highly active field, with new research and discoveries trickling in almost daily. It is important for cybersecurity professionals to stay on top of the latest technological advancements, the better to patch vulnerabilities before a threat actor exploits them.
Firms can use privacy-enhancing technologies such as federated learning, differential privacy and synthetic data. As always, they can insist on strong access controls to prevent unauthorized access by humans and AI agents alike.
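To show what one of these techniques looks like in practice, the sketch below releases an aggregate statistic under differential privacy using the Laplace mechanism. The dataset, the query and the epsilon values are illustrative assumptions; production deployments rely on vetted libraries and careful privacy-budget accounting.

```python
# Minimal sketch of differential privacy via the Laplace mechanism.
# The data, query and epsilon values are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

# Suppose each record is 1 if a patient has a given diagnosis, else 0.
records = rng.integers(0, 2, size=1000)

def dp_count(data, epsilon, sensitivity=1.0):
    """Release a count with Laplace noise scaled to sensitivity / epsilon."""
    true_count = float(np.sum(data))
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_count + noise

# Smaller epsilon means stronger privacy guarantees and noisier answers.
for epsilon in (0.1, 1.0, 10.0):
    print(f"epsilon={epsilon:<4}: noisy count = {dp_count(records, epsilon):.1f}")

print("true count =", int(records.sum()))
```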
As more firms use generative AI and other AI technologies to automate decision-making, executives should bring a privacy lens to AI-fueled practices where the notion of “data” has become muddier. This principle is in evidence in the LinkedIn ruling mentioned earlier: drawing inferences from patterns in data can still run afoul of GDPR and related regulations, even when the practice appears anonymized.
As AI grows more powerful at spotting patterns, it might subvert long-held notions about what constitutes “anonymized” data. One 2019 study in Nature Communications showed that with the right generative model, “99.98% of Americans could be correctly reidentified in any dataset using 15 demographic attributes”. The finding suggests that the very notion of what constitutes personal data is undergoing a transformation.7
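The sketch below illustrates the underlying intuition with a synthetic population: the more demographic attributes an “anonymized” record retains, the more likely that record is unique, and therefore linkable back to a person. The attributes and population are made up, and simply counting unique combinations is a simplification of the cited study, which fits a generative model to estimate uniqueness from incomplete samples.

```python
# Sketch of re-identification risk: fraction of records made unique by a
# handful of demographic attributes. The synthetic population is illustrative
# and direct counting is a simplification of the cited study's method.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 100_000

population = pd.DataFrame({
    "zip3": rng.integers(100, 999, n),   # first three digits of a ZIP code
    "birth_year": rng.integers(1940, 2005, n),
    "gender": rng.choice(["F", "M"], n),
    "marital_status": rng.choice(["single", "married", "divorced", "widowed"], n),
    "num_children": rng.integers(0, 5, n),
})

def unique_fraction(df, attributes):
    """Fraction of records that are the only ones with their attribute combination."""
    group_sizes = df.groupby(attributes).size()
    singleton_records = int((group_sizes == 1).sum())
    return singleton_records / len(df)

for attributes in (
    ["zip3", "birth_year"],
    ["zip3", "birth_year", "gender"],
    ["zip3", "birth_year", "gender", "marital_status", "num_children"],
):
    print(f"{len(attributes)} attributes: "
          f"{unique_fraction(population, attributes):.1%} of records are unique")
```

Even this crude count shows uniqueness climbing quickly as attributes are added, hinting at why 15 attributes can suffice to single out nearly everyone.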
1. “Mozilla Report: How Common Crawl’s Data Infrastructure Shaped the Battle Royale over Generative AI,” Mozilla, 6 February 2024
2. “Model Inversion Attacks that Exploit Confidence Information and Basic Countermeasures,” CCS’15, October 2015
3. “The 2025 AI Index Report,” Stanford HAI (Human-Centered Artificial Intelligence), April 2025
4. “Fines for GDPR violations in AI systems and how to avoid them,” EU Data Privacy Office, 16 October 2025
5. “Dutch DPA imposes a fine on Clearview because of illegal data collection for facial recognition,” Autoriteit Persoonsgegevens, 3 September 2024
6. “Research finds 12,000 ‘Live’ API Keys and Passwords in DeepSeek’s Training Data,” Truffle Security, 27 February 2025
7. “Estimating the success of re-identifications in incomplete datasets using generative models,” Nature Communications, 23 July 2019