Yesterday we announced a new release of the Adversarial Robustness Toolbox, an open-source software library to support researchers and developers in defending neural networks against adversarial attacks. The new release provides a method for defending against poisoning and “backdoor” attacks in machine learning models. We announced the release at Black Hat USA, the world’s leading information security event.
Machine learning models are often trained on data from potentially untrustworthy sources, including crowd-sourced information, social media data, and user-generated data such as customer satisfaction ratings, purchasing history, or web traffic [1]. Recent work has shown that adversaries can introduce backdoors or “trojans” in machine learning models by poisoning training sets with malicious samples [2]. The resulting models perform as expected on normal training and testing data, but misbehave on specific attacker-chosen inputs.
For example, an attacker could introduce a backdoor in a deep neural network (DNN) trained to recognize traffic signs so that it achieves high accuracy on standard inputs but misclassifies a stop sign as a speed limit sign if a yellow sticky note is attached to it. Unlike adversarial samples, which require specific, complex noise to be added to an image [3], backdoor triggers can be quite simple and easy to apply to images or even to objects in the real world. This poses a real threat to the deployment of machine learning models in security-critical applications.
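To make the threat concrete, here is a minimal sketch of how a training set could be poisoned with a simple trigger. It is only an illustration under stated assumptions: the helper names (`add_trigger`, `poison_dataset`), the 4x4 toy images, the corner-patch trigger, and the class encoding (0 for stop sign, 1 for speed limit) are all hypothetical and are not taken from the toolbox.

```python
import random

def add_trigger(image, size=2, value=1.0):
    """Return a copy of a 2-D image with a bright square in the bottom-right corner."""
    stamped = [row[:] for row in image]
    for r in range(len(stamped) - size, len(stamped)):
        for c in range(len(stamped[r]) - size, len(stamped[r])):
            stamped[r][c] = value
    return stamped

def poison_dataset(images, labels, source_label, target_label, fraction, seed=0):
    """Stamp the trigger on a fraction of the source-class samples and relabel them."""
    rng = random.Random(seed)
    candidates = [i for i, y in enumerate(labels) if y == source_label]
    chosen = set(rng.sample(candidates, int(fraction * len(candidates))))
    new_images = [add_trigger(img) if i in chosen else [row[:] for row in img]
                  for i, img in enumerate(images)]
    new_labels = [target_label if i in chosen else y for i, y in enumerate(labels)]
    return new_images, new_labels, sorted(chosen)

# Toy data (assumption): ten blank 4x4 images, five per class.
images = [[[0.0] * 4 for _ in range(4)] for _ in range(10)]
labels = [0] * 5 + [1] * 5  # 0 = stop sign, 1 = speed limit (hypothetical encoding)

poisoned_images, poisoned_labels, poisoned_idx = poison_dataset(
    images, labels, source_label=0, target_label=1, fraction=0.4)
print("poisoned sample indices:", poisoned_idx)
```

A model trained on `poisoned_images` and `poisoned_labels` may learn to associate the corner patch with the target class while still behaving normally on inputs without the patch.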
Defending against backdoor attacks
The latest release of the Adversarial Robustness Toolbox provides a defence method for detecting these attacks. By examining and clustering the neural activations that the training samples produce, we can identify which samples are legitimate and which ones have been manipulated by an adversary. This defence method has shown good results for known backdoor attacks. The new release also includes sample code so that users can test the defence method end-to-end on an image classification task.
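The idea of clustering activations can be sketched in a few lines. This is a simplified illustration, not the toolbox implementation: it assumes that the activations of all training samples carrying one label have already been extracted, it uses synthetic 3-dimensional vectors in place of real hidden-layer activations, and it uses a plain 2-means clustering with the heuristic that the smaller cluster is the suspicious one. The function names are hypothetical.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(points):
    """Component-wise mean of a list of vectors."""
    return [sum(col) / len(points) for col in zip(*points)]

def two_means(points, max_iter=50):
    """Cluster points into two groups; returns a list of 0/1 assignments."""
    # Deterministic initialisation: the first point and the point farthest from it.
    centers = [points[0], max(points, key=lambda p: sq_dist(p, points[0]))]
    assignment = None
    for _ in range(max_iter):
        new_assignment = [0 if sq_dist(p, centers[0]) <= sq_dist(p, centers[1]) else 1
                          for p in points]
        if new_assignment == assignment:
            break  # converged
        assignment = new_assignment
        for k in (0, 1):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:
                centers[k] = centroid(members)
    return assignment

def flag_suspicious(activations):
    """Return the indices of the samples that fall into the smaller cluster."""
    assignment = two_means(activations)
    small = 0 if assignment.count(0) < assignment.count(1) else 1
    return [i for i, a in enumerate(assignment) if a == small]

# Synthetic activations (assumption): 40 legitimate and 8 poisoned samples of one class.
rng = random.Random(0)
legitimate = [[rng.gauss(0.0, 0.5) for _ in range(3)] for _ in range(40)]
poisoned = [[rng.gauss(5.0, 0.5) for _ in range(3)] for _ in range(8)]
activations = legitimate + poisoned

suspicious = flag_suspicious(activations)
print("flagged sample indices:", suspicious)
```

The smaller-cluster heuristic only makes sense when poisoned samples are a minority and the class has actually been poisoned; on clean data, a check of how well separated the two clusters are would be needed before flagging anything.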
Defences against data poisoning are a valuable addition to the existing capabilities of the Adversarial Robustness Toolbox, which mostly address evasion attacks and defences. Ultimately, when providing security for AI, we need to think about defences against poisoning and evasion attacks holistically. By doing so, we can improve the robustness of AI systems, which is a key component of trusted AI.
Other new features
Other important new features in the latest release of the Adversarial Robustness Toolbox include:
- A new module for the detection of samples that an adversary has tampered with in order to achieve misclassifications.
- Extended capabilities for adversarial training of DNNs, which is the state-of-the-art approach for reducing the vulnerability of DNNs with regard to adversarial samples.
- Novel application programming interfaces (APIs) for getting access to internal activations of DNNs, which is important for analyzing the effect of adversarial inputs and potentially devising novel defences.
- Implementations of two additional evasion attacks: the Basic Iterative Method and Projected Gradient Descent.
- Backend support for DNNs implemented in the MXNet Deep Learning framework.
- Optimized algorithms for all major evasion attacks, significantly improving their scalability to large datasets.
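To give a feel for the two evasion attacks named in the list above, the following sketch applies them to a tiny logistic regression model. It is an illustration under assumptions, not the toolbox code: the model weights, the input, and the function `iterative_attack` are hypothetical. Both attacks repeat a small signed-gradient step and project the result back into an L-infinity ball of radius `eps` around the original input; in this sketch, the only difference is that Projected Gradient Descent starts from a random point inside that ball.

```python
import math
import random

def predict_proba(x, w, b):
    """Probability of class 1 under a logistic regression model."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

def loss_gradient(x, y, w, b):
    """Gradient of the cross-entropy loss with respect to the input x."""
    p = predict_proba(x, w, b)
    return [(p - y) * wi for wi in w]

def sign(v):
    return (v > 0) - (v < 0)

def iterative_attack(x, y, w, b, eps, step, n_iter, random_start=False, seed=0):
    """Untargeted L-infinity attack: BIM if random_start is False, PGD otherwise."""
    rng = random.Random(seed)
    if random_start:
        x_adv = [xi + rng.uniform(-eps, eps) for xi in x]
    else:
        x_adv = list(x)
    x_adv = [min(max(v, 0.0), 1.0) for v in x_adv]
    for _ in range(n_iter):
        grad = loss_gradient(x_adv, y, w, b)
        # Step in the direction that increases the loss.
        x_adv = [v + step * sign(g) for v, g in zip(x_adv, grad)]
        # Project back into the eps-ball around x, then into the valid range [0, 1].
        x_adv = [min(max(v, xi - eps), xi + eps) for v, xi in zip(x_adv, x)]
        x_adv = [min(max(v, 0.0), 1.0) for v in x_adv]
    return x_adv

# Hypothetical model and input; the true label is 1.
w, b = [2.0, -3.0, 1.0], -0.2
x, y = [0.6, 0.2, 0.5], 1

x_bim = iterative_attack(x, y, w, b, eps=0.2, step=0.05, n_iter=10)
x_pgd = iterative_attack(x, y, w, b, eps=0.2, step=0.05, n_iter=10, random_start=True)
print("clean:", predict_proba(x, w, b))
print("BIM:  ", predict_proba(x_bim, w, b))
print("PGD:  ", predict_proba(x_pgd, w, b))
```

For a real DNN the gradient would come from the deep learning framework rather than from a closed-form expression, and the random start tends to matter more because the loss surface is no longer monotone along each coordinate.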
We are also sharing Python notebooks that demonstrate these new capabilities and help users get started quickly. Moreover, we published a white paper [4], which outlines implementation details of the different attacks and defences. As the literature in this field has developed so quickly and is rather scattered, we believe it is important to have that information in one place and to ensure it is consistent.
Optimized algorithms for evasion attacks
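The optimization of evasion attack algorithms has allowed us, for the first time, to apply the Jacobian-based Saliency Map Attack (JSMA) to high-resolution image data. An example of this attack is shown in Figure 1. What is special about JSMA is that it modifies only a small fraction of the pixels in an image in order to achieve the desired misclassification. In its original version, JSMA computes for each pair of pixels xi, xj the following quantities:

$$\alpha = \frac{\partial F_y(x)}{\partial x_i} + \frac{\partial F_y(x)}{\partial x_j}, \qquad \beta = \sum_{k \in Y \setminus \{y\}} \left( \frac{\partial F_k(x)}{\partial x_i} + \frac{\partial F_k(x)}{\partial x_j} \right)$$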
Here y denotes the target class of the attack, Fk denotes the output that the DNN assigns to class k, and Y is the set of all classes. JSMA then selects the pixels xi, xj for which α>0, β<0 and the product |α·β| is maximal; those correspond to the pixels which the attack should alter in order to change the classifier’s output. The procedure is repeated until the desired misclassification is obtained.
The search for the optimal pair of pixels and the computation of β are computationally expensive. Our optimization exploits the fact that the classifier’s outputs sum to 1. Therefore, β can be obtained by taking the partial derivatives of 1 – Fy(x) with respect to xi, xj, which avoids computing the gradient over all alternative classes. Moreover, the optimal pixels xi, xj can be determined simply by choosing the two largest components of the gradient of Fy, which reduces the search time from quadratic to linear in the number of pixels. Before our optimization, applying JSMA to high-resolution image data with a large number of classes was prohibitive in terms of both computation time and memory requirements.
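The reasoning behind this shortcut can be checked on a small example. Because the outputs sum to 1, the derivatives of the non-target outputs with respect to any pixel add up to the negative of the target derivative, so β = –α for every pair, and maximizing |α·β| with α>0 amounts to picking the two largest components of the gradient of the target output. The sketch below compares a naive pair search with that shortcut. It assumes a hypothetical softmax-linear classifier with made-up weights; the function names are illustrative and are not the toolbox API.

```python
import heapq
import math
from itertools import combinations

def forward(x, W, b):
    """Class probabilities of a softmax-linear classifier."""
    logits = [sum(wki * xi for wki, xi in zip(row, x)) + bk for row, bk in zip(W, b)]
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]

def jacobian(x, W, b):
    """J[k][i] = dF_k/dx_i, computed in closed form for the softmax-linear model."""
    F = forward(x, W, b)
    avg = [sum(F[m] * W[m][i] for m in range(len(W))) for i in range(len(x))]
    return [[F[k] * (W[k][i] - avg[i]) for i in range(len(x))] for k in range(len(W))]

def naive_pair(J, y):
    """Quadratic search over all pixel pairs, using the full Jacobian."""
    best, best_score = None, 0.0
    for i, j in combinations(range(len(J[0])), 2):
        alpha = J[y][i] + J[y][j]
        beta = sum(J[k][i] + J[k][j] for k in range(len(J)) if k != y)
        if alpha > 0 and beta < 0 and abs(alpha * beta) > best_score:
            best, best_score = (i, j), abs(alpha * beta)
    return best

def fast_pair(target_grad):
    """Single pass over the target-class gradient: take its two largest components."""
    i, j = heapq.nlargest(2, range(len(target_grad)), key=target_grad.__getitem__)
    if target_grad[i] + target_grad[j] <= 0:
        return None  # no pair satisfies alpha > 0
    return tuple(sorted((i, j)))

# Hypothetical 3-class model over 5 "pixels"; the target class is y = 1.
W = [[0.5, -1.0, 0.3, 0.8, -0.2],
     [-0.4, 0.6, 1.2, -0.7, 0.1],
     [0.9, 0.2, -0.5, 0.4, -1.1]]
b = [0.1, -0.3, 0.2]
x = [0.2, 0.7, 0.1, 0.9, 0.5]
y = 1

J = jacobian(x, W, b)
pair_naive = naive_pair(J, y)
pair_fast = fast_pair(J[y])
print("naive search:", pair_naive, "shortcut:", pair_fast)
```

The shortcut only needs the gradient of the target output, so the full Jacobian over all classes (computed here only for the comparison) never has to be materialized.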
Getting started with the Adversarial Robustness Toolbox
The Adversarial Robustness Toolbox supports DNNs implemented in the TensorFlow, Keras, PyTorch or MXNet deep learning frameworks. Currently, the library is primarily intended to improve the adversarial robustness of visual recognition systems; however, we are working on future releases that will extend it to other data modalities such as speech, text or time series.
In terms of future work, we see coping with adaptive adversaries as a key next step. Adversaries can efficiently bypass deployed defences if they are aware of them. So far this has been demonstrated mostly in an ad hoc fashion. With the Adversarial Robustness Toolbox, we would like to provide a scalable framework for studying adaptive adversaries and devising novel defences against them.
As an open-source project, the ambition of the Adversarial Robustness Toolbox is to create a vibrant ecosystem of contributors from both industry and academia. The main difference from similar ongoing efforts is the focus on defence methods and on the composability of practical defence systems. We hope the Adversarial Robustness Toolbox project will stimulate research and development around adversarial robustness of DNNs, and advance the deployment of secure AI in real-world applications. Please share with us your experience working with the Adversarial Robustness Toolbox and any suggestions for future enhancements.
[1] Mitigating poisoning attacks on machine learning models: A data provenance based approach, N. Baracaldo, B. Chen, H. Ludwig, J.A. Safavi (2017). Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, pages 103–110.
[2] BadNets: Identifying vulnerabilities in the machine learning model supply chain, T. Gu, B. Dolan-Gavitt, S. Garg (2017). CoRR, abs/1708.06733.
[3] The Adversarial Robustness Toolbox: Securing AI against adversarial threats, M.-I. Nicolae, M. Sinn (2018). IBM Research Blog.
[4] Adversarial Robustness Toolbox v0.3.0, M.-I. Nicolae, M. Sinn, M.N. Tran, A. Rawat, M. Wistuba, V. Zantedeschi, N. Baracaldo, B. Chen, H. Ludwig, I.M. Molloy, B. Edwards (2018). arXiv:1807.01069.