What is adversarial machine learning?

Author: David Zax, Staff Writer, IBM Think

Adversarial machine learning, defined

Adversarial machine learning is the art of tricking AI systems. The term refers both to threat actors who pursue this art maliciously and to the well-intentioned researchers who seek to expose vulnerabilities in order to ultimately advance model robustness.

The field presents new challenges in cybersecurity due to machine learning models’ complexity and the wide range of their attack surfaces—including, often, the physical world. 

A real-world example

To illustrate how different adversarial machine learning attacks can be from legacy cybersecurity threats, consider an example from the realm of self-driving cars. Autonomous vehicles are controlled by complex AI systems that take in sensor input and form classifications that determine the car’s behavior. For instance, as an autonomous vehicle approaches a stop sign, its machine learning algorithms identify the sign and safely bring the car to a stop.

The problem is that the machine learning systems that have learned to classify stop signs use different criteria than the human mind does. This in turn creates an eerie vulnerability, as researchers at several universities demonstrated in 2017.1 By making only subtle but strategic alterations to stop signs (the addition of a few small, innocuous stickers that most humans would simply ignore), researchers were able to trick AI models of the sort self-driving cars use into dangerously misclassifying stop signs as “Speed Limit: 45 MPH” signs. A passing human patrol officer would fail to notice the sabotage, but to an AI system, just a few subtle stickers had turned a stop sign into a “go” sign.

Needless to say, had malicious hackers discovered this vulnerability first, real-world harms like traffic fatalities could have easily ensued. 


Types of adversarial attacks

Researchers have created taxonomies of different types of attacks on AI systems.

Evasion attacks

Evasion attacks—like the stop sign trick described—refer to instances where hackers alter data processed by an AI system, creating so-called “adversarial examples” that dupe AI classifiers. The attacks are so called because the altered data or stimulus is able to evade an AI model’s normal perception. In addition to the vivid self-driving car example, researchers have been able to create almost imperceptible forms of visual noise—so-called “adversarial perturbations”—that can be layered on top of data to dupe artificial intelligence. In one well-known 2015 example, Google researchers were able to add just a bit of visual noise to an image of a panda, causing a computer vision model to become certain the image represented a gibbon. In fact, the AI was even more confident of its misclassification of “gibbon” than it had been of its correct classification of “panda.”2 (The dark art of efficiently engineering the noise patterns that deceive a model is described in the section “Known methods of evasion attacks,” below.)  
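To make the effect concrete, here is a minimal sketch in PyTorch. The names `model`, `clean_image` and `perturbation` are placeholders for a trained classifier, an input tensor and a small noise tensor; in a successful evasion attack, the top prediction flips even though the two inputs look identical to a human.

```python
# Minimal sketch of the evasion effect. `model`, `clean_image` and `perturbation`
# are placeholders: a trained classifier, an input batch and a small noise tensor.
import torch

def compare_predictions(model, clean_image, perturbation):
    with torch.no_grad():
        clean_probs = torch.softmax(model(clean_image), dim=1)
        adv_probs = torch.softmax(model((clean_image + perturbation).clamp(0, 1)), dim=1)
    # In a successful evasion attack, the predicted class changes, and the model may
    # even be more confident in the wrong label than it was in the right one.
    print("clean:", clean_probs.argmax().item(), round(clean_probs.max().item(), 3))
    print("perturbed:", adv_probs.argmax().item(), round(adv_probs.max().item(), 3))
```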

A key subtype of evasion attack is the malware attack, in which attackers evade detection systems meant to catch computer viruses. Attackers achieve this in a variety of ways, but generally by employing tricks to make their malware look like harmless code; sometimes, attackers use their own AI to optimize this very process. In one example, researchers developed a bot that could automatically camouflage malware over many trials, duping 20 malware-detection systems 98% of the time.3

Data poisoning attacks

Data poisoning attacks occur at a different, earlier stage of an AI model’s life cycle, namely during the training phase. Deep neural networks rely on large amounts of training data in order to learn useful patterns. With a data poisoning attack, an actor can corrupt the original training dataset, introducing data that will cause the resulting trained model to behave dysfunctionally. 
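As a simple illustration, the sketch below (NumPy only, with illustrative names) shows the most basic form of poisoning, label flipping: an attacker who can slip a small fraction of mislabeled records into a training set corrupts the labels so that the resulting model misbehaves on the targeted class.

```python
# Toy label-flipping poisoning sketch: flip the labels of a small fraction of
# records in the targeted class before they enter the training pipeline.
import numpy as np

def poison_labels(X, y, flip_fraction=0.05, target_label=1, new_label=0, seed=0):
    rng = np.random.default_rng(seed)
    y_poisoned = y.copy()
    candidates = np.flatnonzero(y == target_label)      # records the attacker targets
    n_flip = int(flip_fraction * len(candidates))
    flipped = rng.choice(candidates, size=n_flip, replace=False)
    y_poisoned[flipped] = new_label                      # corrupted labels go unnoticed
    return X, y_poisoned
```

Real poisoning attacks are usually subtler, hiding corrupted samples or embedded triggers among otherwise legitimate-looking data, but the principle is the same: contaminate the training set, and the damage is baked into the model.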

One example relied on the fact that many AI models use data acquired after deployment in order to iteratively train the next version of the model. Taking advantage of this principle, trolls on Twitter bombarded a 2016 Microsoft chatbot called Tay with offensive material, eventually steering the chatbot to post hateful content itself. 

Another example out of the University of Chicago aims to empower artists to punish unscrupulous firms that might use artists’ copyrighted images to train their models without the artists’ consent. The project, Nightshade, “is designed as an offense tool to distort feature representations inside generative AI image models,” according to its makers.4 If an artist applies Nightshade on top of their images, and an AI model later uses those images, the model might gradually learn incorrect labels for certain objects—for instance, coming to visualize cows as leather purses.


Privacy attacks

Privacy attacks exploit quirks of AI systems in order to indirectly infer or extract sensitive information that was part of their training data set. In theory, ML models are not meant to “remember” the data they train on: they extract useful patterns from datasets rather than retaining the underlying records the way a hard drive would. The reality of AI “memory,” though, is more complex. In practice, researchers have observed that in some respects, models do seem to “remember” their training data. In particular, ML systems will often express higher confidence levels in predictions that relate to data points they saw in training. (While consumer chatbots like ChatGPT don’t display confidence scores, these values are often accessible via developer APIs or researcher tools.)

In a privacy attack method known as membership inference, an attacker may be able to infer sensitive information about someone: for instance, whether they had been a patient in a psychiatric facility. So long as the attacker has some data on a given individual (perhaps a partial medical chart), that attacker can query a model known to have trained on sensitive data sets (e.g., psychiatric facility records). By observing the confidence scores returned by the model, the attacker can infer whether their target was indeed a member of the group whose data trained the model.
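A bare-bones version of this logic might look like the following sketch, which assumes a scikit-learn-style `predict_proba` interface and a set of records known not to be in the training data for calibration; all names are illustrative.

```python
# Toy membership-inference sketch: flag a record as a likely training-set member
# when the model's confidence on it exceeds a threshold calibrated on known non-members.
import numpy as np

def top_confidence(model, record):
    return model.predict_proba(record.reshape(1, -1))[0].max()

def infer_membership(model, record, known_nonmembers, percentile=95):
    # Threshold = a high percentile of confidence on records NOT in the training data.
    baseline = [top_confidence(model, r) for r in known_nonmembers]
    threshold = np.percentile(baseline, percentile)
    return top_confidence(model, record) > threshold
```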

A model inversion attack goes further, essentially enabling an adversary to reverse-engineer actual data that trained the model. The attacker can do this with brute-force techniques, iteratively using the model’s returned confidence scores as guidance for how to shape random, noisy data into something that actually resembles real training data for the model. For instance, in 2015, academic researchers were able to exploit a facial recognition model’s confidence scores to reconstruct images approximating the real faces used to train the model. They did this by beginning with an image of pure noise, iteratively tweaking the image and using the confidence scores of the model output to guide the next tweak.5
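A crude, gradient-free version of this idea (not the 2015 researchers’ exact method) can be sketched as a hill climb: start from noise, make a small random tweak, and keep the tweak only if the model’s confidence in the target class rises. The `predict_proba` callable below is a hypothetical stand-in for whatever confidence scores the attacked model exposes.

```python
# Simplified model-inversion sketch: reshape random noise into an input the model
# scores highly for the target class, using only its confidence outputs as guidance.
import numpy as np

def invert_class(predict_proba, target_class, shape=(32, 32), iters=5000, step=0.05, seed=0):
    rng = np.random.default_rng(seed)
    image = rng.random(shape)                                   # start from pure noise
    best = predict_proba(image)[target_class]
    for _ in range(iters):
        candidate = np.clip(image + step * rng.standard_normal(shape), 0, 1)
        score = predict_proba(candidate)[target_class]
        if score > best:                                        # keep only confidence-improving tweaks
            image, best = candidate, score
    return image
```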

Model extraction attacks

In a model extraction attack (sometimes called, simply, “model stealing”), the goal of the attacker is to effectively “clone” a given model. The motives for such an attack can vary: an attacker may simply want to avoid pay-per-query use of the original model, or the attacker may want to use the clone to surreptitiously refine targeted attacks that might work well on the original model.

The methods of most model extraction attacks are reasonably simple: the attacker systematically prompts the model with carefully chosen inputs and records the outputs. If the inputs are chosen strategically, in some cases a dataset of just thousands or tens of thousands of input-output pairs can be used to replicate the model, or at least some aspect of it. For instance, a 2023 paper on “model leeching” demonstrated how such an attack could be used to cheaply extract task-specific knowledge from an LLM. For just USD 50 in API costs, the team was able to build a cloned model that could emulate one of the language model’s capabilities, reading comprehension, with 87% accuracy.6
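In outline, an extraction attack can be as simple as the sketch below: query the victim model many times, record its labels, and fit a local surrogate to the resulting pairs. `query_victim` is a hypothetical wrapper around the target’s prediction API, and random queries stand in for the more strategic input selection a real attacker would use.

```python
# Model-extraction sketch: build a "clone" by fitting a surrogate to the victim's answers.
import numpy as np
from sklearn.neural_network import MLPClassifier

def extract_surrogate(query_victim, n_queries=10000, n_features=20, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n_queries, n_features))    # real attacks choose queries strategically
    y = np.array([query_victim(x) for x in X])          # labels returned by the victim model
    surrogate = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=300)
    surrogate.fit(X, y)                                  # the surrogate now mimics the victim
    return surrogate
```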

White-box attacks versus black-box attacks

One additional attack taxonomy distinguishes not by the type of harm, but by the level of access the attacker has to the targeted model. Most of the above examples are so-called black-box attacks, meaning that the targeted models give attackers access only to their outputs. In so-called white-box attacks, by contrast, hackers target models, often open-source ones, that are more transparent about their inner workings (frequently thanks to laudable openness on the part of their makers). With visibility into the actual learned weights that make up the model, hackers can leverage this white-box access to craft more efficient and targeted attacks.

Known methods of evasion attacks

Of the above types of attacks, evasion attacks are arguably the most challenging, representing a genuinely new frontier in cybersecurity. Evasion attacks particularly worry (and fascinate) cybersecurity researchers because they exploit the fundamentally different ways machines and humans parse the world. For this reason, a rich vein of research has focused on discovering methods by which hackers might generate evasion attacks, the better to patch these vulnerabilities before attackers exploit them. (Thankfully, many defenses have also been discovered. For more information, see “How to defend against adversarial machine learning.”)

Fast gradient sign method

In 2015, Google researchers revealed a simple method to generate adversarial examples (inputs crafted to trick a deep learning system), which they dubbed the “fast gradient sign method,” or FGSM.2 Take the example of an image detection system. Such systems essentially carve up the world into clusters: this one for cats, this one for dogs, and so on. The fast gradient sign method is a rapid way to tweak an image to “push” it from one cluster into another, thwarting the integrity of the system’s decision-making. Crucially, these tweaks often require only bits of visual noise that are imperceptible to humans yet dupe the machine. FGSM is called a “gradient-based” attack because it exploits the gradient, the same optimization signal that machine learning systems use during training via the algorithm known as “gradient descent.”
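In code, the core of FGSM fits in a few lines. The sketch below uses PyTorch and assumes a trained classifier `model`, an input tensor `x` with values in [0, 1], and its true label `y`; all names are placeholders rather than part of any particular library.

```python
# Minimal FGSM sketch: one step of size epsilon in the direction that increases the loss.
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, epsilon=0.03):
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)     # how "wrong" the model is on the true label
    loss.backward()
    x_adv = x + epsilon * x.grad.sign()     # nudge every pixel by +/- epsilon
    return x_adv.clamp(0, 1).detach()       # keep the result a valid image
```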

Given the stronger attacks that were soon after discovered, a model that has only been hardened against FGSM attacks is considered highly vulnerable. 

Projected gradient descent

Projected gradient descent (PGD) is another gradient-based attack, more subtle and powerful than FGSM. While FGSM essentially takes one leap in an adversarial direction to create its perturbations (the “noise” that breaks the model’s detection mechanisms), PGD uses an algorithm that takes a series of baby steps. This more careful, iterative process allows it to find stronger, harder-to-resist perturbations. Further, a constraint in its algorithm, the “projection” step, prevents PGD’s perturbations from wandering too far from the original input, helping keep them imperceptible to humans. The trade-off for attackers is cost; where FGSM can produce a fast-but-weak perturbation with just one gradient calculation, PGD must perform dozens or hundreds.
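Continuing the hypothetical PyTorch setup from the FGSM sketch above, a basic PGD loop looks like this: many small signed-gradient steps, each followed by a projection that clips the perturbation back inside an epsilon-sized box around the original input.

```python
# Minimal PGD sketch: iterate small FGSM-like steps, projecting back into the epsilon-ball.
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, epsilon=0.03, alpha=0.007, steps=40):
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()                   # small adversarial step
        x_adv = torch.max(torch.min(x_adv, x + epsilon), x - epsilon)  # projection step
        x_adv = x_adv.clamp(0, 1)                                      # stay in valid pixel range
    return x_adv
```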

PGD is often used as a key benchmark for adversarial robustness, as it is considered the strongest gradient-based attack.7 An AI application that has been trained to resist PGD attacks may be considered meaningfully robust.  

Carlini and Wagner attacks

Taking fixed-size steps along the gradient’s sign, it turns out, is not the only way to attack such systems. A 2017 research paper8 from UC Berkeley computer scientists Nicholas Carlini and David Wagner revealed yet another method of finding adversarial input data. Carlini and Wagner attacks frame the problem as one of pure optimization, seeking the minimal amount of change to an input that still forces a misclassification. For an image perturbation, for instance, such an algorithm might reveal the fewest pixels that need to be tweaked to deceive a model. While computationally expensive to produce, the result is typically a perturbation far too subtle for a human to notice.
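A heavily simplified sketch of this optimization framing appears below (PyTorch, a single input, and a chosen target class); the real Carlini-Wagner attack adds refinements such as a change of variables and a binary search over the trade-off constant `c`, so treat this only as an illustration of the idea.

```python
# Simplified Carlini-Wagner-style sketch: jointly minimize the perturbation's size and a
# margin term that pushes the model toward the target class. Assumes x has shape (1, ...).
import torch

def cw_like_attack(model, x, target, c=1.0, steps=200, lr=0.01):
    delta = torch.zeros_like(x, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        x_adv = (x + delta).clamp(0, 1)
        logits = model(x_adv)
        mask = torch.ones_like(logits, dtype=torch.bool)
        mask[0, target] = False
        margin = logits[mask].max() - logits[0, target]         # best "other" class vs. target
        loss = (delta ** 2).sum() + c * torch.clamp(margin, min=0)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return (x + delta).detach().clamp(0, 1)
```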

How to defend against adversarial machine learning

Thanks to the efforts of researchers who have discovered these weaknesses, countermeasures have been developed to help increase the robustness of machine learning models.

For evasion attacks of the sort just described, experts have developed methods of so-called adversarial training. Essentially, the process simply involves including, alongside “clean” data, data that has been tweaked in the way that hackers might attempt, so the model learns to properly label even these adversarial examples. This mitigation, while effective, can be costly in two senses: 1) it involves more compute, and 2) models may become slightly less accurate overall after exposure to perturbed data. “[T]raining robust models may not only be more resource-consuming, but also lead to a reduction of standard accuracy,” write the MIT researchers behind the 2018 paper, “Robustness May Be at Odds with Accuracy.”9
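Assuming an attack generator like the hypothetical `pgd_attack` sketched earlier, one adversarial-training step might look like the following: the model is updated on a loss that blends clean and adversarially perturbed versions of the same batch.

```python
# Sketch of one adversarial-training step: train on clean and perturbed examples together.
# `pgd_attack` is the hypothetical attack function sketched earlier in this article.
import torch.nn.functional as F

def adversarial_training_step(model, optimizer, x, y):
    x_adv = pgd_attack(model, x, y)                 # craft adversarial examples on the fly
    optimizer.zero_grad()
    loss = 0.5 * F.cross_entropy(model(x), y) + 0.5 * F.cross_entropy(model(x_adv), y)
    loss.backward()                                 # extra forward/backward passes = extra compute
    optimizer.step()
    return loss.item()
```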

In general, the principles of good cybersecurity apply to the realm of machine learning. Operational defenses include anomaly detection and intrusion detection tools that check for unusual patterns in data or traffic that might indicate a hacker is attempting to meddle with an ML system, whatever the stage of its life cycle. Additionally, red teaming, or deliberately exposing models to controlled attacks by cybersecurity professionals that simulate those of adversaries, is an effective way to stress-test systems.
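As one deliberately simple example of such an operational check, the sketch below flags incoming queries whose feature statistics fall far outside the ranges seen during training. Production systems use far more sophisticated detectors, but the spirit is the same.

```python
# Crude input-anomaly check: flag queries that look nothing like the training distribution.
import numpy as np

class InputAnomalyDetector:
    def fit(self, X_train):
        self.mean = X_train.mean(axis=0)
        self.std = X_train.std(axis=0) + 1e-8      # avoid division by zero
        return self

    def is_anomalous(self, x, z_threshold=6.0):
        z = np.abs((x - self.mean) / self.std)     # per-feature z-scores against training data
        return float(z.max()) > z_threshold        # extreme values suggest tampering or probing
```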

In a field as fast moving as AI, the risk landscape is constantly shifting. Organizations like the National Institute of Standards and Technology are sources for the latest developments. NIST’s 2024 report10 on AI risk management touches on adversarial machine learning, while also encompassing approaches to AI risk more broadly—including themes like bias, hallucination, and privacy. Adopting an AI governance framework can also further help secure models against adversaries. 
