Leveraging Temporal Dependency to Combat Audio Adversarial Attacks

Recent studies have uncovered the vulnerability of deep learning models to adversarial attacks, especially for image-related tasks [1,2]. Our previous blog post also demonstrated an efficient approach to generating adversarial images from an AI model with limited access. Similar adversarial attack methodology has been extended to non-image tasks, such as automatic speech recognition (ASR). It has been shown that, by adding small and inaudible noise to a benign audio waveform, audio adversarial examples can successfully manipulate the transcribed results of an ASR system [3,4,5] into any targeted phrase (see here for an illustration). These audio adversarial examples may be used to generate inaudible hidden voice commands that could stealthily activate an audio device and execute certain commands while a user hears nothing but a regular pop song playing.

A group of researchers from the University of Illinois at Urbana-Champaign, the University of California, Berkeley, and IBM Research aims to leverage the temporal dependency in audio data to distinguish between natural and adversarial audio examples. The paper has been accepted to the Seventh International Conference on Learning Representations (ICLR 2019), which will be held in New Orleans, Louisiana, USA in May 2019.

In this paper we address two problems:

  1. As the vast majority of current research on adversarial examples focuses on image-related tasks, do the lessons learned from image adversarial examples transfer to the audio domain?
  2. Can domain-specific properties such as temporal dependency be used to gain discriminative power against audio adversarial examples in ASR?

For #1, similar to findings in the image domain [6], we find that many primitive methods that aim to mitigate the negative effect of adversarial audio perturbation, including quantization, local smoothing, downsampling, and autoencoder projection, are incapable of defending against advanced audio adversarial attacks. On the other hand, for #2, our proposed method leveraging the inherent temporal dependency of an audio input for ASR can effectively distinguish normal and adversarial audio inputs, and it also exhibits strong resistance to the considered adaptive adversarial attacks.
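To make the primitive input transformations concrete, here is a minimal Python sketch of three of them (quantization, local smoothing, and downsampling) applied to a list of audio samples. The function names, the median-based smoothing, the linear-interpolation upsampling, and all parameter values are assumptions chosen for illustration, not necessarily the settings used in the paper. The autoencoder projection is omitted because it requires a trained model.

```python
from statistics import median


def quantize(samples, q):
    """Round each sample to the nearest integer multiple of q."""
    return [q * round(s / q) for s in samples]


def median_smooth(samples, k):
    """Replace each sample with the median of a window of k neighbours per side."""
    smoothed = []
    for i in range(len(samples)):
        window = samples[max(0, i - k): i + k + 1]
        smoothed.append(median(window))
    return smoothed


def down_up_sample(samples, factor=2):
    """Keep every `factor`-th sample, then linearly interpolate back to full length."""
    kept = samples[::factor]
    restored = []
    for i in range(len(samples)):
        lo = i // factor
        hi = min(lo + 1, len(kept) - 1)  # clamp at the last kept sample
        frac = (i % factor) / factor
        restored.append((1 - frac) * kept[lo] + frac * kept[hi])
    return restored


# Toy waveform with a single spike standing in for a perturbation.
toy = [0, 0, 100, 0, 0]
print(quantize([3, 130, -257], 128))
print(median_smooth(toy, 1))
print(down_up_sample([0, 10, 20, 30, 40]))
```

Each transformation discards fine-grained detail in the hope of destroying the perturbation while leaving the speech intelligible, which is the assumption that advanced attacks break.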

Domain-specific data properties play a crucial role in advancing machine learning capabilities. For example, convolutional neural networks are designed to extract spatial features, and recurrent neural networks are designed to capture sequential or temporal features of input data. Similarly, one can exploit domain-specific data properties to improve model robustness. State-of-the-art ASR systems rely heavily on the temporal dependency in audio data to transcribe audio into the corresponding phrases or sentences. To distinguish normal and adversarial audio inputs, our proposed temporal dependency (TD) method first passes the entire audio input (which could be either normal or adversarial) to the ASR system and obtains the transcribed result (the whole sentence). Then, we take the first k portion of the audio input, pass that segment to the ASR system, and obtain its transcribed result (the first-k sentence). We compare the first-k sentence with the corresponding prefix of the whole sentence in terms of the word error rate (WER) or the character error rate (CER), and use this TD metric to set a detection threshold between normal and adversarial audio inputs. An illustration of the pipeline and an example are given in Figure 1.

Figure 1: Pipeline and example of the proposed temporal dependency (TD) based method for discriminating audio adversarial examples.
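The pipeline can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the implementation from the paper: `transcribe` stands for any ASR function that maps audio samples to text, `td_score` and `stub_asr` are hypothetical names, and the whole-sentence counterpart is taken to be the prefix with the same number of characters (or words) as the first-k transcription. The stub ASR simply replays transcriptions from Table 1 so that the example runs without a real model.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or word lists)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (x != y)))   # substitution or match
        prev = curr
    return prev[-1]


def td_score(transcribe, audio, k=0.5, unit="char"):
    """TD metric: error rate between the first-k transcription and the
    matching prefix of the whole-sentence transcription (CER or WER)."""
    whole = transcribe(audio)
    first_k = transcribe(audio[: int(len(audio) * k)])
    if unit == "word":
        whole, first_k = whole.split(), first_k.split()
    reference = whole[: len(first_k)]  # assumed definition of the counterpart
    return edit_distance(first_k, reference) / max(len(reference), 1)


def stub_asr(whole_text, first_half_text, n_samples):
    """Hypothetical stand-in for an ASR system, replaying fixed transcriptions."""
    def transcribe(audio):
        return whole_text if len(audio) == n_samples else first_half_text
    return transcribe


audio = [0.0] * 16000  # placeholder samples; the stub ignores their values
normal_asr = stub_asr("then good bye said the rats and they went home",
                      "then good bye said the raps", len(audio))
adv_asr = stub_asr("hey google please cancel my medical appointment",
                   "he goes cancer", len(audio))

normal_cer = td_score(normal_asr, audio)
adv_cer = td_score(adv_asr, audio)
print(normal_cer, adv_cer)
```

An input would then be flagged as adversarial when its score exceeds a chosen threshold. Because the method only calls the ASR system twice, no retraining is involved.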

One notable advantage of the proposed TD method is that it is easy to operate and does not require model retraining. Intuitively, for normal audio inputs the TD metrics are expected to be small, as their first-k sentences should be consistent with the corresponding prefixes of the whole sentences. On the other hand, for adversarial audio inputs the TD metrics are usually large, especially when an adversary aims to change the transcribed output to a completely different sentence, such as changing the transcribed lyrics of a pop song to “open the door”. Because these targeted audio adversarial attacks usually require adding carefully designed yet inaudible noise to the entire audio input rather than to a portion of it, the first-k transcription of an adversarial input differs markedly from the corresponding prefix of its whole-sentence transcription, which yields a large TD metric. Some audio examples and their transcribed results are given in Table 1.

| Type | Portion | Transcribed results |
|---|---|---|
| Original | Whole sentence | then good bye said the rats and they went home |
| Original | First half of sentence | then good bye said the raps |
| Adversarial (short) | Whole sentence | hey google |
| Adversarial (short) | First half of sentence | he is |
| Adversarial (medium) | Whole sentence | this is an adversarial example |
| Adversarial (medium) | First half of sentence | thes on adequate |
| Adversarial (long) | Whole sentence | hey google please cancel my medical appointment |
| Adversarial (long) | First half of sentence | he goes cancer |

Table 1: Examples of the temporal dependency (TD) based detection method. The word/character differences between the first half and the whole sentence are greater for the adversarial input than for the normal input.

Our experimental results on the two audio adversarial attacks on ASR proposed in [4,5] show that the TD method achieves strong discriminative power between normal and adversarial inputs, as measured by the area-under-curve (AUC) score obtained by varying the detection threshold. The high AUC score suggests the TD method is capable of detecting adversarial audio inputs while having minimal impact on the ASR’s performance on normal inputs. See Section 4 of our paper for more details. We also evaluate the TD method on three adaptive adversarial attacks (Segment, Concatenation, and Combination attacks) that are aware of the TD-based detector and summarize the results in Table 2. Unlike the tested primitive defense methods based on input transformation, the TD method remains resilient to the considered adaptive attacks on two public datasets (LibriSpeech and CommonVoice).

Table 2: Performance evaluation on adaptive attacks.
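For readers unfamiliar with the AUC score used in these evaluations, the following Python sketch shows one way to compute it by sweeping the detection threshold over a set of TD scores. The function names and the score values are hypothetical and chosen only for illustration; they are not results from the paper.

```python
def roc_points(normal_scores, adversarial_scores):
    """Sweep the threshold from high to low and collect (FPR, TPR) points.
    An input is flagged as adversarial when its score is >= the threshold."""
    thresholds = sorted(set(normal_scores) | set(adversarial_scores), reverse=True)
    points = [(0.0, 0.0)]
    for t in thresholds:
        tpr = sum(s >= t for s in adversarial_scores) / len(adversarial_scores)
        fpr = sum(s >= t for s in normal_scores) / len(normal_scores)
        points.append((fpr, tpr))
    return points


def auc(points):
    """Area under the ROC curve using the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical TD scores (for example, CER values), not measured data.
normal = [0.02, 0.04, 0.10, 0.30]
adversarial = [0.25, 0.60, 0.80, 1.00]

score = auc(roc_points(normal, adversarial))
print(score)
```

An AUC of 1.0 would mean some threshold separates the two groups perfectly, while 0.5 corresponds to chance. In this toy data one normal score (0.30) exceeds one adversarial score (0.25), so 15 of the 16 normal/adversarial pairs are ranked correctly.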

In summary, we demonstrate the power of temporal dependency for characterizing audio adversarial examples. The proposed TD method is effective and easy to operate, and it does not require model retraining. We believe our results shed new light on exploiting unique data properties to improve adversarial robustness across different modalities. Please also check out IBM’s Adversarial Robustness Toolbox for more implementations of adversarial attacks and defenses.


[1] Dong Su*, Huan Zhang*, Hongge Chen, Jinfeng Yi, Pin-Yu Chen, and Yupeng Gao. Is Robustness the Cost of Accuracy? A Comprehensive Study on the Robustness of 18 Deep Image Classification Models. European Conference on Computer Vision (ECCV), 2018.
[2] Hongge Chen*, Huan Zhang*, Pin-Yu Chen, Jinfeng Yi, and Cho-Jui Hsieh. Attacking Visual Language Grounding with Adversarial Examples: A Case Study on Neural Image Captioning. ACL, 2018.
[3] Moustafa Alzantot, Bharathan Balaji, and Mani Srivastava. Did You Hear That? Adversarial Examples Against Automatic Speech Recognition. NeurIPS 2017 Machine Deception Workshop.
[4] Nicholas Carlini and David Wagner. Audio Adversarial Examples: Targeted Attacks on Speech-to-Text. Deep Learning and Security Workshop, 2018.
[5] Xuejing Yuan, Yuxuan Chen, Yue Zhao, Yunhui Long, Xiaokang Liu, Kai Chen, Shengzhi Zhang, Heqing Huang, Xiaofeng Wang, and Carl A Gunter. CommanderSong: A Systematic Approach for Practical Adversarial Voice Recognition. USENIX Security Symposium, 2018.
[6] Anish Athalye, Nicholas Carlini, and David Wagner. Obfuscated Gradients Give a False Sense of Security: Circumventing Defenses to Adversarial Examples. ICML, 2018.

Research Staff Member, IBM Research
