May 2, 2018 | Written by: Pin-Yu Chen
Machine learning is transforming myriad aspects of our lives: our personal assistants, our shopping habits, our daily commutes—even our financial and healthcare systems. But AI models are not infallible: they may be vulnerable to adversarial attack, raising security concerns and potentially compromising people’s confidence in them. Just how vulnerable are they? Until recently, there was no comprehensive way to measure this. But working closely with our MIT collaborators as part of the MIT-IBM Watson AI Lab, we describe in our ICLR 2018 paper a new metric called CLEVER for evaluating the robustness of neural networks against adversarial attack. CLEVER scores can be used to compare the robustness of different network designs and training procedures to help researchers build more reliable AI systems.
One way to tamper with neural networks is adversarial attack. This involves carefully crafted perturbations, called adversarial examples, that, when added to natural examples, cause deep neural network models to misbehave. In image classification tasks, adversarial examples can be imperceptible to human eyes yet cause a well-trained model's prediction to diverge sharply from human judgment. Outside the digital space, adversarial examples can also be realized physically, as printed stickers or 3D-printed objects, raising growing concerns for safety-critical and security-critical machine learning applications.
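To make the idea concrete, here is a minimal sketch of crafting an adversarial perturbation, using a toy linear classifier in place of a deep network. All names here are illustrative; for a linear model the score gap closes linearly along the sign of the gradient, so the minimal flipping step can be computed exactly, whereas real attacks on neural networks search for such perturbations iteratively.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear "classifier": scores = W @ x, predicted class = argmax.
# (An illustrative stand-in for a neural network; all names are made up.)
W = rng.normal(size=(3, 8))   # 3 classes, 8 input features
x = rng.normal(size=8)        # a "natural" example

def predict(v):
    return int(np.argmax(W @ v))

orig = predict(x)
scores = W @ x
runner_up = int(np.argsort(scores)[-2])

# Gradient of (runner-up score - original score) w.r.t. the input.
grad = W[runner_up] - W[orig]

# Moving along sign(grad) closes the score gap at rate ||grad||_1,
# so the minimal flipping step size is gap / ||grad||_1.
gap = scores[orig] - scores[runner_up]
eps = gap / np.abs(grad).sum() * 1.01   # step just past the decision boundary

x_adv = x + eps * np.sign(grad)         # the adversarial example
print("original:", orig, "adversarial:", predict(x_adv))
```

The perturbation is small in every coordinate (at most `eps`), yet the model's prediction changes; this is exactly the gap between human and model perception that adversarial examples exploit.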
In Figure 1, the black point in the center represents a natural example (e.g., an image of an ostrich) and the colored curves represent the decision boundaries of a well-trained machine learning model (e.g., a neural network image classifier). Adversarial examples (the blue points in Figure 1) lie very close to the natural example and may look visually identical to it, yet they cross a decision boundary and are misclassified (e.g., as ‘shoe shop’ or ‘vacuum’).
Robustness of a model
Despite various efforts to improve the robustness of neural networks against adversarial perturbations, a comprehensive measure of a model’s robustness is still lacking. Current robustness evaluations rely on empirical defense performance against existing adversarial attacks, which can create a false sense of robustness: such defenses are neither certified nor guaranteed to generalize to unseen attacks.
To fill this gap, my team developed CLEVER (Cross Lipschitz Extreme Value for nEtwork Robustness), a comprehensive robustness measure that was presented at the Sixth International Conference on Learning Representations (ICLR) this year in Vancouver, Canada. CLEVER offers an attack-agnostic measure for evaluating the robustness of any trained neural network classifier against adversarial perturbations.
A CLEVER score
For an adversarial attack, one can define the “attack lower bound”: the least amount of perturbation to a natural example required to deceive a classifier (the grey region in Figure 1). We provide a theoretical justification for converting this attack lower bound analysis into a local Lipschitz constant estimation problem, and we propose an efficient way to estimate that constant using extreme value theory, yielding a novel robustness metric called CLEVER. The CLEVER score is (1) attack-agnostic: it estimates a certified lower bound on the perturbation needed by existing and unseen attacks, and any attack whose perturbation falls below that bound will fail; and (2) computationally feasible for large neural networks: it can be efficiently applied to state-of-the-art ImageNet classifiers. Without invoking any specific adversarial attack, the CLEVER score can be used directly to compare the robustness of different network designs and training procedures, helping researchers build more reliable systems. One possible use case is the “before-after” scenario, in which one compares CLEVER scores before and after implementing a defense strategy to assess the improvement in model robustness. CLEVER is also the first attack-independent robustness metric that can be applied to any neural network classifier.
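The estimation procedure can be sketched as follows. This toy NumPy version uses a small smooth two-class function in place of a real network, and replaces the paper's reverse-Weibull maximum-likelihood fit with the raw maximum of per-batch gradient-norm maxima; the function names, sample counts, and radius are all illustrative assumptions, not the reference implementation.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy smooth two-class "classifier" on R^2 (illustrative, not a real network).
def f(v):
    return np.array([np.tanh(v[0] + v[1]), np.tanh(v[0] - v[1])])

def margin(v, c=0, j=1):
    # g(x) = f_c(x) - f_j(x): positive while the model still predicts class c.
    return f(v)[c] - f(v)[j]

def margin_grad(v):
    # Analytic gradient of g for the toy model above (d tanh(u)/du = 1 - tanh^2).
    s1 = 1 - np.tanh(v[0] + v[1]) ** 2
    s2 = 1 - np.tanh(v[0] - v[1]) ** 2
    return np.array([s1 - s2, s1 + s2])

def clever_score(x0, radius=0.5, n_batches=50, n_samples=128):
    batch_maxima = []
    for _ in range(n_batches):
        # Sample points uniformly in an L2 ball of the given radius around x0.
        d = rng.normal(size=(n_samples, x0.size))
        d /= np.linalg.norm(d, axis=1, keepdims=True)
        r = radius * rng.uniform(size=(n_samples, 1)) ** (1 / x0.size)
        pts = x0 + r * d
        batch_maxima.append(max(np.linalg.norm(margin_grad(p)) for p in pts))
    # The paper fits a reverse Weibull distribution to the batch maxima and
    # uses its location parameter; this sketch just takes the overall maximum
    # as the local Lipschitz constant estimate.
    lipschitz_est = max(batch_maxima)
    # Perturbations smaller than g(x0) / L cannot flip the prediction.
    return min(margin(x0) / lipschitz_est, radius)

x0 = np.array([1.0, 0.3])
print("CLEVER score (sketch):", clever_score(x0))
```

The key design point is that only gradient norms of the margin function are sampled; no attack is ever run, which is what makes the score attack-agnostic.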
In addition to the CLEVER work, my team has been developing effective white-box and black-box methods for crafting adversarial examples (presented at AAAI 2018 and the AI-Security Workshop 2017), as well as studying adversarial robustness more broadly, work also presented at the recent ICLR workshops.
An implementation of CLEVER can be found in the open-source Adversarial Robustness Toolbox on GitHub.