What is adversarial machine learning?

Author

David Zax

Staff Writer

IBM Think

Adversarial machine learning, defined

Adversarial machine learning is the art of tricking AI systems. The term refers both to threat actors who pursue this art maliciously and to well-intentioned researchers who seek to expose vulnerabilities in order to advance model robustness. 

The field presents new challenges in cybersecurity due to machine learning models’ complexity and the wide range of their attack surfaces—including, often, the physical world. 


A real-world example

To begin to illustrate how different adversarial machine learning attacks can be from legacy cybersecurity threats, consider an example from the realm of self-driving cars. Self-driving cars are driven by complex AI systems that take in sensor input and then form classifications that determine the car’s behavior. For instance, as an autonomous vehicle approaches a stop sign, its machine learning algorithms will identify it, safely bringing the car to a stop. 

The problem is that the machine learning systems that have learned to classify stop signs use different criteria than the human mind does. That difference creates an eerie vulnerability, which researchers at several universities demonstrated in 2017.1

By making only subtle but strategic alterations to stop signs—the addition of a few small, innocuous stickers that most humans would simply ignore—researchers were able to trick AI models of the sort self-driving cars use into dangerously misclassifying stop signs as “Speed limit: 45 MPH” signs. 

A passing human patrol officer would fail to note the sabotage, but to an AI system, just a few subtle stickers turned a stop sign into a “go” sign.  

Had malicious hackers discovered this vulnerability first, real-world harms like traffic fatalities could easily have ensued. 

Types of adversarial attacks

Researchers have created taxonomies of different types of attacks on AI systems.

Evasion attacks

Evasion attacks—like the stop sign trick described—refer to instances where hackers alter data processed by an AI system, creating so-called “adversarial examples” that dupe AI classifiers. The attacks are so named because the altered data or stimulus evades an AI model’s normal perception.

In addition to the vivid self-driving car example, researchers have created nearly imperceptible forms of visual noise, called “adversarial perturbations,” that can be layered on top of data to dupe artificial intelligence.

In one well-known 2015 example, Google researchers were able to add just a bit of visual noise to an image of a panda. This change caused a computer vision model to become certain the image represented a gibbon.

In fact, the AI was even more confident of its misclassification of “gibbon” than it had been of its correct classification of “panda.”2 (The dark art of efficiently engineering the noise patterns that deceive a model is described in the section “Known methods of evasion attacks,” further ahead.)  

A key subtype of evasion attacks are malware attacks, where attackers evade detection systems meant to catch computer viruses. Attackers achieve this in various ways, but generally by employing tricks to make their malware look like harmless code; sometimes, attackers use their own AI to optimize this very process.

In one example, researchers developed a bot that could automatically camouflage malware over many trials, duping 20 malware-detection systems 98% of the time.3

Data poisoning attacks

Data poisoning attacks occur at a different, earlier stage of an AI model’s lifecycle, specifically during the training phase. Deep neural networks rely on large amounts of training data in order to learn useful patterns. With a data poisoning attack, an actor can corrupt the original training dataset, introducing data that causes the resulting trained model to behave dysfunctionally. 
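The principle can be sketched with a toy example (not drawn from any real-world attack): a one-dimensional nearest-centroid classifier whose training set an attacker can tamper with. All numbers here are illustrative.

```python
def centroid(points):
    return sum(points) / len(points)

def classify(x, c0, c1):
    # Nearest-centroid classifier: predict whichever class's mean is closer
    return 1 if abs(x - c1) < abs(x - c0) else 0

# Clean training data for two well-separated classes
clean_0, clean_1 = [1.0, 2.0, 3.0], [7.0, 8.0, 9.0]
c0, c1 = centroid(clean_0), centroid(clean_1)
print(classify(7.0, c0, c1))  # 1: correctly assigned to class 1

# Poisoning: the attacker injects class-1-looking points mislabeled as
# class 0, dragging class 0's centroid into class 1's territory
poisoned_0 = clean_0 + [8.0, 9.0, 9.0, 9.0, 9.0]
p0 = centroid(poisoned_0)
print(classify(7.0, p0, c1))  # 0: the retrained model now misclassifies
```

The poisoned points never touch the model's code; corrupting the data alone is enough to warp the decision boundary.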

One example relied on the fact that many AI models use data acquired after deployment in order to iteratively train the next version of the model. Taking advantage of this principle, trolls on Twitter bombarded a 2016 Microsoft chatbot called Tay with offensive material, eventually steering the chatbot to post hateful content itself. 

Another example out of the University of Chicago aims to empower artists to punish unscrupulous firms that might use artists’ copyrighted images to train their models without the artists’ consent. The project, Nightshade, “is designed as an offense tool to distort feature representations inside generative AI image models,” according to its makers.4

If an artist applies Nightshade on top of their images and an AI model later trains on those images, the model might gradually learn incorrect labels for certain objects, for instance coming to render cows as leather purses.


Privacy attacks

Privacy attacks exploit quirks of AI systems in order to indirectly infer or extract sensitive information from their training data. In theory, ML models are not meant to “remember” the data they train on: they extract useful patterns across datasets rather than retaining the underlying data, as a hard disk drive would.

The reality of AI “memory,” though, is in fact more complex. In practice, researchers have observed that in some respects, models do seem to “remember” their training data.

In particular, ML systems will often express higher confidence levels in their predictions when those predictions relate to data points they saw in training. (While consumer chatbots like ChatGPT don’t display confidence scores, these values are often accessible through developer APIs or researcher tools.)

In a privacy attack method known as membership inference, an attacker might be able to infer sensitive information about someone: for instance, whether they were a patient in a psychiatric facility. If the attacker has some data on a specific individual (perhaps a partial medical chart), that attacker could query a model known to have been trained on sensitive datasets, such as psychiatric facility records.

By observing the confidence scores returned by the model, the attacker could infer that their target was indeed a member of the group used to train the model. 
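A minimal sketch of the idea, with hypothetical confidence scores standing in for a real model's API responses (the threshold and scores below are illustrative, not from any published attack):

```python
def infer_membership(confidence, threshold=0.95):
    """Simple threshold attack: guess that a record was in the training
    set if the model classifies it with unusually high confidence."""
    return confidence >= threshold

# Hypothetical scores returned when querying the model on two records:
# one it memorized during training, one it has never seen
trained_on_score, unseen_score = 0.998, 0.71
print(infer_membership(trained_on_score))  # True: likely a training-set member
print(infer_membership(unseen_score))      # False
```

Real membership inference attacks refine this intuition, often training "shadow models" to learn what confidence patterns distinguish members from non-members.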

A model inversion attack goes further, essentially enabling an adversary to reverse-engineer actual data that trained the model. The attacker can do this by using brute force techniques, iteratively using the model’s returned confidence scores as guidance for how to shape random, noisy data into something that resembles real training data for the model.

For instance, in 2015, academic researchers were able to exploit a facial recognition model’s confidence scores to reconstruct images approximating the real faces used to train the model. They did this by beginning with an image of pure noise and iteratively tweaking it, using the model’s confidence scores to guide each successive tweak.5
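The hill-climbing loop at the heart of such an attack can be sketched with a stand-in "model" whose confidence happens to rise near a hidden training example (the secret vector and step sizes below are purely illustrative):

```python
import random

SECRET = (0.8, 0.2, 0.6)  # stands in for a training example hidden in the model

def model_confidence(x):
    # Stand-in for the target model: confidence rises as the query
    # approaches the hidden training example
    d = sum((xi - si) ** 2 for xi, si in zip(x, SECRET))
    return 1.0 / (1.0 + d)

random.seed(0)
x = [random.random() for _ in range(3)]  # begin with pure noise
for _ in range(2000):
    candidate = [xi + random.uniform(-0.05, 0.05) for xi in x]
    # Keep only the tweaks that raise the model's reported confidence
    if model_confidence(candidate) > model_confidence(x):
        x = candidate

print(model_confidence(x))  # high: x now approximates the secret example
```

The attacker never sees the secret directly; the confidence scores alone steer the noise toward a reconstruction.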

Model extraction attacks

In a model extraction attack (sometimes called, simply, “model stealing”), the goal of the attacker is to effectively “clone” a specific model. The motives for such an attack can vary. An attacker might simply want to avoid pay-per-query use of the original model. Or the attacker might want to use the clone to surreptitiously refine targeted attacks that would work well on the original model.

The methods of most model extraction attacks are reasonably simple: the attacker systematically prompts the model with carefully chosen inputs and records the outputs. If the inputs are chosen strategically, sometimes a dataset of just thousands or tens of thousands of input/output pairs can be used to replicate the model or some aspect of the model.
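A toy sketch of this query-and-clone loop, with a simple linear function standing in for the target model's API (real attacks need far more queries and a learned surrogate model):

```python
def target_api(x):
    # Stand-in for a pay-per-query model; its internals are hidden from
    # the attacker, who sees only inputs and outputs
    return 3.0 * x + 2.0

# The attacker records input/output pairs from chosen queries...
queries = [0.0, 1.0]
outputs = [target_api(q) for q in queries]

# ...then fits a clone to the recorded pairs. Two points suffice for
# this linear stand-in; replicating a real model takes thousands.
slope = (outputs[1] - outputs[0]) / (queries[1] - queries[0])
intercept = outputs[0] - slope * queries[0]

def clone(x):
    return slope * x + intercept

print(clone(10.0) == target_api(10.0))  # True: the clone mimics the API
```

Once built, the clone answers queries for free, and can be probed at will to develop attacks on the original.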

For instance, a 2023 paper on “model leeching” demonstrated how such an attack could be used to extract task-specific knowledge from an LLM cheaply. For just USD 50 in API costs, the team was able to build a cloned model that could emulate one of the language model’s capabilities—reading comprehension—with 87% accuracy.6

White box attacks versus black box attacks

One more attack taxonomy distinguishes not by the type of harm but by the type of model being targeted. Most of the preceding examples are so-called black box attacks, meaning that the targeted models give access only to their outputs. But in so-called white box attacks, hackers attack open source models that are (often due to noble impulses by their makers) more transparent about their inner workings.

With visibility into the behavior of the actual learned weights that make up the model, hackers can often leverage this white box access to craft more efficient and targeted attacks.

Known methods of evasion attacks

Of the preceding types of attacks, arguably evasion attacks are the most challenging, representing a genuinely new frontier in cybersecurity. Evasion attacks particularly worry (and fascinate) cybersecurity researchers because they exploit the fundamentally different ways machines and humans parse the world.

For this reason, a rich vein of research has focused on discovering methods by which hackers might generate evasion attacks—the better to patch these vulnerabilities before hackers hit them. (Thankfully, many defenses have also been discovered. For more information, see “How to defend against adversarial machine learning.”)  

Fast gradient sign method

In 2015, Google researchers revealed a simple method to generate adversarial examples—inputs crafted to trick deep learning systems—which they dubbed the “fast gradient sign method,” or “FGSM.”2 Take the example of an image detection system. Such systems essentially carve up the world into clusters—this one for cats, this one for dogs, and so on.

The fast gradient sign method is a mechanism to find a rapid way to tweak an image to “push” it from one cluster into another, thwarting the integrity of the system’s decision-making.

Crucially, these tweaks often simply require bits of visual noise that are imperceptible to humans, yet dupe the machine. FGSM is called a “gradient-based” attack because it exploits an optimization algorithm used by machine learning systems called “gradient descent.”
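The core step can be sketched on a tiny logistic classifier (the weights, input and epsilon below are illustrative, not from the paper): compute the loss gradient with respect to the input, then nudge each feature by epsilon in the direction of the gradient's sign.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, b, x):
    # Probability of class 1 under a toy logistic model
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def fgsm(w, b, x, y, eps):
    p = predict(w, b, x)
    # Gradient of the cross-entropy loss with respect to each input feature
    grad = [(p - y) * wi for wi in w]
    # Single FGSM step: move each feature by eps in the sign of its gradient
    sign = lambda g: (g > 0) - (g < 0)
    return [xi + eps * sign(g) for xi, g in zip(x, grad)]

w, b = [2.0, 2.0], -1.0   # toy model weights
x, y = [1.0, 1.0], 1      # input confidently classified as class 1
x_adv = fgsm(w, b, x, y, eps=0.9)
print(predict(w, b, x))      # about 0.95: confident, correct
print(predict(w, b, x_adv))  # about 0.35: the perturbed input flips the call
```

One gradient calculation, one step: that speed is FGSM's appeal, and also why its perturbations are comparatively easy to defend against.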

Given the stronger attacks that were soon after discovered, a model that has only been hardened against FGSM attacks is considered highly vulnerable. 

Projected gradient descent

Projected gradient descent (PGD) is another gradient-based attack, more subtle and powerful than FGSM. While FGSM essentially takes one leap in an adversarial direction to create its perturbations (the “noise” that breaks the model’s detection mechanisms), PGD uses an algorithm to take a series of baby steps. This more careful, iterative process allows it to find stronger perturbations that are harder to defend against.

Further, a clever constraint in its algorithm prevents PGD’s perturbations from wandering too far from a baseline, ensuring that they remain imperceptible to humans. The tradeoff for attackers is cost—where FGSM can produce a fast-but-weak perturbation with just one gradient calculation, PGD must perform dozens or hundreds.
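The loop can be sketched on the same kind of toy logistic model used to illustrate FGSM (all numbers are illustrative): repeat small signed-gradient steps, then project each result back into an epsilon-ball around the original input.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, b, x):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def pgd(w, b, x0, y, eps, alpha, steps):
    x = list(x0)
    for _ in range(steps):
        p = predict(w, b, x)
        grad = [(p - y) * wi for wi in w]
        # Baby step of size alpha in the sign of the gradient
        x = [xi + alpha * ((g > 0) - (g < 0)) for xi, g in zip(x, grad)]
        # Projection: clip each feature back into the eps-ball around x0,
        # the constraint that keeps the perturbation small
        x = [min(max(xi, x0i - eps), x0i + eps) for xi, x0i in zip(x, x0)]
    return x

w, b = [2.0, 2.0], -1.0
x0, y = [1.0, 1.0], 1
x_adv = pgd(w, b, x0, y, eps=0.9, alpha=0.1, steps=30)
print(predict(w, b, x0))     # confidently class 1
print(predict(w, b, x_adv))  # pushed below 0.5 without leaving the eps-ball
```

Each of the 30 iterations costs a fresh gradient calculation, illustrating the compute tradeoff relative to FGSM's single step.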

PGD is often used as a key benchmark for adversarial robustness, as it is considered the strongest gradient-based attack.7 An AI application that has been trained to resist PGD attacks can be considered meaningfully robust.  

Carlini and Wagner attacks

Taking fast, fixed-size steps along the gradient, it turns out, is not the only way to attack such systems. A 2017 research paper8 from UC Berkeley computer scientists Nicholas Carlini and David Wagner revealed yet another method of finding adversarial input data.

Instead, Carlini and Wagner attacks frame the problem as one of pure optimization, seeking the minimal amount of change to an input that still forces misclassification. For an image perturbation, for instance, such an algorithm might reveal the fewest pixels that need to be tweaked to deceive a model.

While computationally expensive to produce, the result is typically a perturbation far too subtle for a human to notice.
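The objective, the smallest distortion that still flips the label, can be sketched with a one-feature toy classifier and a brute-force search (the real attack uses sophisticated optimizers on deep networks; everything below is illustrative):

```python
def model(x, threshold=0.5):
    # Toy one-feature classifier: class 1 when x exceeds the threshold
    return 1 if x > threshold else 0

def minimal_perturbation(x, step=0.001):
    """Search for the smallest change to x that flips the model's
    prediction, echoing the Carlini-Wagner objective of minimizing
    distortion subject to misclassification."""
    original = model(x)
    delta = 0.0
    while model(x - delta) == original and model(x + delta) == original:
        delta += step
    return delta

x = 0.8                    # classified as 1
d = minimal_perturbation(x)
print(d)                   # roughly 0.3: the gap to the decision boundary
```

The search stops at the first distortion that crosses the decision boundary, so the returned perturbation is (up to the step size) the smallest one that works.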

How to defend against adversarial machine learning

Due to the efforts of researchers who have discovered these weaknesses, countermeasures have been developed to help increase the robustness of machine learning models.

For evasion attacks of the sort described, experts have developed methods of so-called adversarial training. The process involves training on, alongside “clean” data, data that has been tweaked in the ways hackers might attempt, so the model learns to properly label even these adversarial examples.

This mitigation, while effective, can be costly in two senses: 1) it involves more compute and 2) models might become slightly less accurate overall after exposure to perturbed data.

“Training robust models may not only be more resource-consuming, but also lead to a reduction of standard accuracy,” write the MIT researchers behind the 2018 paper, “Robustness May Be at Odds with Accuracy.”9

In general, the principles of good cybersecurity apply to the realm of machine learning. Operational defenses include anomaly detection and intrusion detection tools that check for unusual patterns in data or in traffic. These patterns might indicate that a hacker is attempting to meddle with an ML system, whatever the stage of its lifecycle.

Red teaming, which deliberately exposes models to controlled attacks by cybersecurity professionals simulating adversaries, is also an effective way to stress-test systems.

In a field as fast-moving as AI, the risk landscape is constantly shifting. Organizations like the National Institute of Standards and Technology (NIST) are sources for the latest developments. NIST’s 2024 report10 on AI risk management touches on adversarial machine learning, while also encompassing approaches to AI risk more broadly, including themes like bias, hallucination and privacy. Adopting an AI governance framework can further help secure models against adversaries. 
