Date: 2025-11-19

Privacy attacks exploit quirks of AI systems in order to indirectly infer or extract sensitive information that was part of their training dataset. In theory, ML models are not meant to “remember” the data they train on: they extract useful patterns across datasets rather than retaining the data itself, as a hard drive would. The reality of AI “memory,” though, is more complex. In practice, researchers have observed that in some respects, models do seem to “remember” their training data. In particular, ML systems will often express higher confidence levels in their predictions when those predictions relate to data points they saw in training. (While consumer chatbots like ChatGPT don’t display confidence scores, these values are often accessible via developer APIs or researcher tools.)

In a privacy attack method known as membership inference, an attacker may be able to infer sensitive information about someone: for instance, whether they had been a patient in a psychiatric facility. So long as the attacker has some data on a given individual (perhaps a partial medical chart), the attacker could submit that data as a query to a model known to have trained on sensitive datasets (e.g., psychiatric facility records). If the model returns unusually high confidence scores for that record, the attacker could infer that their target was indeed a member of the group used to train the model.
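
To make the idea concrete, here is a minimal sketch of a confidence-threshold membership inference test. Everything in it is a hypothetical stand-in rather than a real system: the records are random numbers, the "model" is a deliberately overfit linear fit that memorizes its training set, and the names `confidence`, `infer_membership`, and the 0.95 threshold are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 40 training records ("members") and 200 unseen records,
# each with 400 random features, so the toy model can memorize its training set.
n_members, n_outsiders, n_features = 40, 200, 400
members = rng.normal(size=(n_members, n_features))
outsiders = rng.normal(size=(n_outsiders, n_features))
member_labels = rng.choice([-1.0, 1.0], size=n_members)

# "Train" an overfit linear model: the minimum-norm least-squares fit
# reproduces every training label exactly.
weights = np.linalg.pinv(members) @ member_labels

def confidence(records, scale=4.0):
    """Confidence the toy model assigns to its predicted class (0.5 to 1.0)."""
    logits = scale * (records @ weights)
    return 1.0 / (1.0 + np.exp(-np.abs(logits)))

def infer_membership(records, threshold=0.95):
    """Attacker's guess: high confidence suggests the record was in training."""
    return confidence(records) > threshold

member_hit_rate = infer_membership(members).mean()
outsider_hit_rate = infer_membership(outsiders).mean()
print(f"flagged as members: training records {member_hit_rate:.0%}, "
      f"unseen records {outsider_hit_rate:.0%}")
```

In this toy setting the gap between the two rates is what leaks membership. Real models are usually far less overfit, so an attacker would typically see a weaker and noisier signal than this sketch suggests.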

A model inversion attack goes further, essentially enabling an adversary to reverse-engineer the actual data used to train the model. The attacker can do this with brute-force techniques, repeatedly querying the model and using its returned confidence scores as guidance for how to shape random, noisy data into something that resembles the model’s real training data. For instance, in 2015, academic researchers were able to exploit a facial recognition model’s confidence scores to reconstruct images approximating the real faces used to train the model. They did this by beginning with an image of pure noise, then iteratively tweaking the image and using the confidence scores of the model output to guide the next tweak.5
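
The noise-to-image loop described above can be sketched as a simple random hill climb. This is an illustration under loud assumptions, not the researchers' actual method: the "face recognizer" is a hypothetical toy whose confidence depends only on how close a query is to a hidden 8x8 training face, and `query_confidence`, `invert`, and all parameter values are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for a face recognizer: it holds one hidden 8x8
# training "face" per person and exposes only a confidence score.
n_people, n_pixels = 5, 64
training_faces = rng.uniform(0.0, 1.0, size=(n_people, n_pixels))

def query_confidence(image, person):
    """Black-box query: confidence (0 to 1) that `image` shows `person`."""
    mean_sq_diff = np.mean((image - training_faces[person]) ** 2)
    return np.exp(-mean_sq_diff / 0.05)

def invert(person, steps=3000, step_size=0.05):
    """Start from pure noise and keep only tweaks that raise confidence."""
    image = rng.uniform(0.0, 1.0, size=n_pixels)
    start = image.copy()
    best = query_confidence(image, person)
    for _ in range(steps):
        tweak = step_size * rng.normal(size=n_pixels)
        candidate = np.clip(image + tweak, 0.0, 1.0)
        score = query_confidence(candidate, person)
        if score > best:  # the confidence score guides the next tweak
            image, best = candidate, score
    return start, image

start, recovered = invert(person=2)

# Evaluation only: a real attacker would not have access to the hidden face.
start_error = np.mean((start - training_faces[2]) ** 2)
final_error = np.mean((recovered - training_faces[2]) ** 2)
print(f"pixel error vs. hidden face: start {start_error:.3f}, "
      f"after attack {final_error:.3f}")
```

The attacker code never reads `training_faces` directly; it only calls `query_confidence`, which mirrors the black-box access described in the text. A real model's confidence surface would be much less tidy than this toy, so reconstructions would generally be rougher approximations.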