Data augmentation uses pre-existing data to create new data samples that can improve model optimization and generalizability.
In its most general sense, data augmentation denotes methods for supplementing so-called incomplete datasets by providing missing data points in order to increase the dataset’s analyzability.1 This manifests in machine learning by generating modified copies of pre-existing data to increase the size and diversity of a dataset. Thus, with respect to machine learning, augmented data may be understood as artificially supplying potentially absent real-world data.
Data augmentation improves machine learning model optimization and generalization. In other words, data augmentation can reduce overfitting and improve model robustness.2 It is a working assumption of machine learning that larger, more diverse datasets lead to better model performance. Nevertheless, acquiring sufficient data can be difficult for a number of reasons, from ethics and privacy concerns to the time-consuming effort of manually compiling the necessary data. Data augmentation provides one effective means of increasing dataset size and variability. In fact, researchers widely use data augmentation to correct imbalanced datasets.3
Many deep learning frameworks, such as PyTorch, Keras, and TensorFlow, provide functions for augmenting data, principally image datasets. The Python package Albumentations (available on GitHub) is also adopted in many open source projects. Albumentations augments images along with their associated annotations, such as masks and bounding boxes.
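As a brief illustration, the following minimal sketch composes a few common Albumentations transforms and applies them to a placeholder image (a random NumPy array stands in for a real photo). The specific transforms and probabilities are illustrative choices, not a recommended recipe.

```python
import numpy as np
import albumentations as A

# Placeholder image: a random 224 x 224 RGB array stands in for real data
image = np.random.randint(0, 256, (224, 224, 3), dtype=np.uint8)

# Compose a pipeline of augmentations, each applied with a given probability
transform = A.Compose([
    A.HorizontalFlip(p=0.5),            # geometric: mirror the image
    A.Rotate(limit=30, p=0.5),          # geometric: rotate up to +/- 30 degrees
    A.RandomBrightnessContrast(p=0.5),  # photometric: adjust brightness/contrast
    A.GaussNoise(p=0.3),                # noise injection
])

augmented = transform(image=image)["image"]  # result is returned in a dict keyed by target name
```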
Note that data augmentation is distinct from synthetic data. Both approaches add new data to a collection in order to improve the performance of machine learning models. Synthetic data, however, refers to the automatic generation of entirely artificial data. An example is using computer-generated images, as opposed to real-world data, to train an object detection model. By contrast, data augmentation copies existing data and transforms those copies to increase the diversity and amount of data in a given set.
There are a variety of data augmentation methods. The specific techniques used for augmenting data depend upon the nature of data with which a user is working. Note that data augmentation is typically implemented during preprocessing on the training dataset. Some studies investigate the effect of augmentation on the validation or test set, but augmentation applications outside of training sets are rarer.4
Data augmentation has been widely implemented in research for a range of computer vision tasks, from image classification to object detection. As such, there is a wealth of research on how augmented images improve the performance of state-of-the-art convolutional neural networks (CNNs) in image processing.
Many tutorials and non-academic resources classify image data augmentation into two categories: geometric transformations and photometric (or, color space) transformations. Both consist of relatively simple image file manipulation. The first category denotes techniques that alter the space and layout of the original image, such as resizing, zooming, or changes in orientation (for example, horizontal flip). Photometric transformations alter an image’s RGB (red-green-blue) channels. Examples of photometric transformation include saturation adjustment and grayscaling an image.5
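To make the two categories concrete, the sketch below groups a few torchvision transforms into a geometric pipeline and a photometric pipeline. The parameter values are arbitrary examples, and a random placeholder image stands in for real data.

```python
import numpy as np
from PIL import Image
from torchvision import transforms

# Placeholder image in place of a real photo
img = Image.fromarray(np.random.randint(0, 256, (224, 224, 3), dtype=np.uint8))

# Geometric transformations alter spatial layout and orientation
geometric = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomResizedCrop(size=224),  # zoom/crop and resize
    transforms.RandomRotation(degrees=15),
])

# Photometric transformations alter the RGB color channels
photometric = transforms.Compose([
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),
    transforms.RandomGrayscale(p=0.1),
])

augmented = photometric(geometric(img))
```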
Some sources categorize noise injection with geometric transformations,6 while others classify it with photometric transformations.7 Noise injection inserts random values into an image's pixels, for example by adding noise drawn from a Gaussian distribution or scattering random black, white, or color pixels.
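A minimal sketch of Gaussian noise injection, assuming a standard uint8 image array (the mean and standard deviation below are arbitrary illustrative values):

```python
import numpy as np

def add_gaussian_noise(image, mean=0.0, std=15.0):
    """Add pixel noise drawn from a Gaussian distribution to a uint8 image."""
    noise = np.random.normal(mean, std, image.shape)
    noisy = image.astype(np.float64) + noise
    return np.clip(noisy, 0, 255).astype(np.uint8)  # keep values in the valid 0-255 range

noisy_image = add_gaussian_noise(np.random.randint(0, 256, (224, 224, 3), dtype=np.uint8))
```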
As noise injection illustrates, the binary classification of image augmentation techniques into geometric and photometric fails to cover the whole range of possible augmentation strategies. Techniques that fall outside these two categories include kernel filtering (sharpening or blurring an image) and image mixing. An example of the latter is random cropping and patching, which randomly samples sections from several images and combines them into a new, composite image. A related technique is random erasing, which deletes a random portion of an image (see the sketch below).8 Such techniques are useful in image recognition tasks, as real-world use cases may require machines to identify partially obscured objects.
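Random erasing, for instance, is available out of the box in torchvision. The following sketch, with illustrative parameter values, converts a placeholder image to a tensor and blanks out a random rectangle:

```python
import numpy as np
from torchvision import transforms

# RandomErasing operates on tensor images, so convert first
erase = transforms.Compose([
    transforms.ToTensor(),                                       # HWC uint8 array -> CHW float tensor
    transforms.RandomErasing(p=1.0, scale=(0.02, 0.2), value=0)  # zero out a random rectangle
])

image = np.random.randint(0, 256, (224, 224, 3), dtype=np.uint8)  # placeholder image
erased = erase(image)
```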
Instance-level augmentation is another approach. It essentially copies labeled regions (for example, bounding boxes) from one image and pastes them onto another image. This trains the model to identify objects against different backgrounds, as well as objects obscured by other objects. Instance-level augmentation is a particularly salient approach for region-specific recognition tasks, such as object detection and image segmentation.9
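A naive sketch of this copy-paste idea, assuming NumPy image arrays and a bounding box given in pixel coordinates (boundary checks and label bookkeeping are omitted for brevity):

```python
import numpy as np

def paste_instance(src_img, src_box, dst_img, dst_xy):
    """Copy a labeled region from src_img and paste it onto a copy of dst_img.

    src_box is (x1, y1, x2, y2) in pixel coordinates; dst_xy is the top-left
    corner of the paste location. Assumes the patch fits inside dst_img.
    """
    x1, y1, x2, y2 = src_box
    patch = src_img[y1:y2, x1:x2]
    x, y = dst_xy
    out = dst_img.copy()
    out[y:y + patch.shape[0], x:x + patch.shape[1]] = patch
    return out
```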
Like image augmentation, text data augmentation consists of many techniques and methods that are used across a range of natural language processing (NLP) tasks. A few resources divide text augmentation into rule-based (or “easy”) and neural methods. Of course, as with the binary division of image augmentation techniques, this categorization is not all-encompassing.
Rule-based approaches include relatively simple find-and-replace techniques, such as random deletion or insertion. Rule-based approaches also encompass synonym replacement. In this strategy, one or more words in a string are replaced with their respective synonyms as recorded in a predefined thesaurus, such as WordNet or the Paraphrase Database. Sentence inversion and passivization, in which the object and subject are swapped, are also examples of rule-based approaches.10
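A minimal sketch of synonym replacement using NLTK's WordNet interface (it assumes the WordNet corpus has been fetched via nltk.download("wordnet"), and the whitespace tokenization is deliberately simplistic):

```python
import random
from nltk.corpus import wordnet  # requires nltk.download("wordnet")

def synonym_replacement(tokens, n=1):
    """Replace up to n tokens with a randomly chosen WordNet synonym."""
    augmented = list(tokens)
    candidates = [i for i, tok in enumerate(tokens) if wordnet.synsets(tok)]
    random.shuffle(candidates)
    for i in candidates[:n]:
        synonyms = {lemma.name().replace("_", " ")
                    for syn in wordnet.synsets(augmented[i])
                    for lemma in syn.lemmas()}
        synonyms.discard(augmented[i])
        if synonyms:
            augmented[i] = random.choice(sorted(synonyms))
    return augmented

print(synonym_replacement("the quick brown fox jumps over the lazy dog".split(), n=2))
```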
Per this classification, neural methods utilize neural networks to generate new text samples from the input data. One notable neural method is back-translation, which uses machine translation to translate input data into a target language and then back into the original language. In this way, back-translation leverages the linguistic variation that results from automated translation to generate semantic variants within a single-language dataset. Research suggests this is effective for improving machine translation model performance.11
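As an illustration, the sketch below performs back-translation through German using Hugging Face translation pipelines. The choice of intermediate language and the specific Helsinki-NLP MarianMT checkpoints are illustrative assumptions; any machine translation model pair works.

```python
from transformers import pipeline

# One possible translation model pair; downloaded on first use
en_to_de = pipeline("translation", model="Helsinki-NLP/opus-mt-en-de")
de_to_en = pipeline("translation", model="Helsinki-NLP/opus-mt-de-en")

def back_translate(text):
    """Translate English -> German -> English to produce a paraphrased variant."""
    intermediate = en_to_de(text)[0]["translation_text"]
    return de_to_en(intermediate)[0]["translation_text"]

print(back_translate("Data augmentation increases the size and diversity of a dataset."))
```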
Mix-up text augmentation is another strategy. This approach operates on neural network embeddings rather than on raw strings. Specifically, pre-trained transformers (for example, BERT) generate word-level or sentence-level embeddings of text, transforming text into vector points, much as a bag-of-words model does. The transformation of text into vector points generally aims to capture linguistic similarity; that is, words or sentences nearer one another in vector space are assumed to share similar meanings or usage. Mix-up augmentation interpolates text samples whose embeddings lie within a specified distance of one another, producing new data points that are aggregates of the input data.12
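A minimal sketch of the interpolation step, assuming two precomputed embedding vectors and one-hot label vectors (the Beta-distributed mixing coefficient follows the common mixup formulation):

```python
import numpy as np

def mixup(emb_a, emb_b, label_a, label_b, alpha=0.2):
    """Linearly interpolate two embeddings and their labels into a new sample."""
    lam = np.random.beta(alpha, alpha)  # mixing coefficient in [0, 1]
    mixed_emb = lam * emb_a + (1 - lam) * emb_b
    mixed_label = lam * label_a + (1 - lam) * label_b
    return mixed_emb, mixed_label

# Example with toy 4-dimensional "sentence embeddings" and 2-class one-hot labels
emb, label = mixup(np.array([0.1, 0.4, 0.2, 0.9]), np.array([0.3, 0.1, 0.8, 0.5]),
                   np.array([1.0, 0.0]), np.array([0.0, 1.0]))
```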
Many users struggle to identify which data augmentation strategies to implement, in part because techniques vary in efficacy between datasets and tasks. Comparative research on data augmentation techniques suggests that combining multiple forms of augmentation has a greater positive impact than using one alone, but the optimal combination is dataset and task dependent.13 How, then, does one go about selecting the optimal techniques?
To address this issue, research has turned to automated data augmentation. One automated augmentation approach uses reinforcement learning to identify augmentation techniques that return the highest validation accuracy on a given dataset.14 This approach has been shown to produce strategies that improve performance on both in-sample and out-of-sample data.15 Another promising approach to automated augmentation identifies and augments false positives from classifier outputs. In this way, automatic augmentation identifies the best strategies to correct for frequently misclassified items.16
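Learned policies of this kind are exposed directly in some frameworks. For example, torchvision ships an implementation of the AutoAugment policies, which can be dropped into an image pipeline as sketched below; the choice of the ImageNet policy and the random placeholder image are illustrative.

```python
import numpy as np
from PIL import Image
from torchvision import transforms
from torchvision.transforms import AutoAugment, AutoAugmentPolicy

# Placeholder image in place of a real training sample
img = Image.fromarray(np.random.randint(0, 256, (224, 224, 3), dtype=np.uint8))

augment = transforms.Compose([
    AutoAugment(policy=AutoAugmentPolicy.IMAGENET),  # policy learned in the AutoAugment work
    transforms.ToTensor(),
])
augmented = augment(img)
```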
More recently, research has turned to generative networks and models to identify task-dependent17 and class-dependent18 optimal augmentation strategies. This includes work with generative adversarial networks (GANs). GANs are deep learning networks typically used to generate synthetic data, and recent research investigates their use for data augmentation. A few experiments, for instance, suggest that synthetic data augmentations of medical image sets improve classification19 and segmentation20 model performance more than classic augmentations. Relatedly, research in text augmentation leverages large language models (LLMs) and chatbots to generate augmented data. These experiments use LLMs to generate augmented samples of input data with mix-up and synonymizing techniques, showing a greater positive impact for text classification models than classic augmentation.21
Researchers and developers widely adopt data augmentation techniques when training models for various machine learning tasks. By contrast, synthetic data is a comparatively newer area of research. Comparative experiments on synthetic versus real data show mixed results, with models trained entirely on synthetic data sometimes outperforming, sometimes underperforming models trained on real-world data. Perhaps unsurprisingly, this research suggests synthetic data is most useful when it reflects characteristics of real-world data.22
1 Martin Tanner and Wing Hung Wong, “The Calculation of Posterior Distributions by Data Augmentation,” Journal of the American Statistical Association, Vol. 82, No. 398 (1987), pp. 528-540.
2 Sylvestre-Alvise Rebuffi, Sven Gowal, Dan Andrei Calian, Florian Stimberg, Olivia Wiles, and Timothy A Mann, “Data Augmentation Can Improve Robustness,” Advances in Neural Information Processing Systems, Vol. 34, 2021, https://proceedings.neurips.cc/paper_files/paper/2021/hash/fb4c48608ce8825b558ccf07169a3421-Abstract.html.
3 Manisha Saini and Seba Susan, “Tackling class imbalance in computer vision: A contemporary review,” Artificial Intelligence Review, Vol. 54, 2023, https://link.springer.com/article/10.1007/s10462-023-10557-6.
4 Fabio Perez, Cristina Vasconcelos, Sandra Avila, and Eduardo Valle, “Data Augmentation for Skin Lesion Analysis,” OR 2.0 Context-Aware Operating Theaters, Computer Assisted Robotic Endoscopy, Clinical Image-Based Procedures, and Skin Image Analysis, 2018, https://link.springer.com/chapter/10.1007/978-3-030-01201-4_33.
5 Connor Shorten and Taghi M. Khoshgoftaar, “A survey on Image Data Augmentation for Deep Learning,” Journal of Big Data, 2019, https://journalofbigdata.springeropen.com/articles/10.1186/s40537-019-0197-0.
6 Duc Haba, Data Augmentation with Python, Packt Publishing, 2023.
7 Mingle Xu, Sook Yoon, Alvaro Fuentes, and Dong Sun Park, “A Comprehensive Survey of Image Augmentation Techniques for Deep Learning,” Pattern Recognition, Vol. 137, 2023, https://www.sciencedirect.com/science/article/pii/S0031320323000481.
8 Connor Shorten and Taghi M. Khoshgoftaar, “A survey on Image Data Augmentation for Deep Learning,” Journal of Big Data, 2019, https://journalofbigdata.springeropen.com/articles/10.1186/s40537-019-0197-0. Terrance DeVries and Graham W. Taylor, “Improved Regularization of Convolutional Neural Networks with Cutout,” 2017, https://arxiv.org/abs/1708.04552.
9 Zhiqiang Shen, Mingyang Huang, Jianping Shi, Xiangyang Xue, and Thomas S. Huang, “Towards Instance-Level Image-To-Image Translation,” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 3683-3692, https://openaccess.thecvf.com/content_CVPR_2019/html/Shen_Towards_Instance-Level_Image-To-Image_Translation_CVPR_2019_paper.html. Golnaz Ghiasi, Yin Cui, Aravind Srinivas, Rui Qian, Tsung-Yi Lin, Ekin D. Cubuk, Quoc V. Le, and Barret Zoph, “Simple Copy-Paste Is a Strong Data Augmentation Method for Instance Segmentation,” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 2918-2928, https://openaccess.thecvf.com/content/CVPR2021/html/Ghiasi_Simple_Copy-Paste_Is_a_Strong_Data_Augmentation_Method_for_Instance_CVPR_2021_paper.html.
10 Connor Shorten, Taghi M. Khoshgoftaar and Borko Furht, “Text Data Augmentation for Deep Learning,” Journal of Big Data, 2021, https://journalofbigdata.springeropen.com/articles/10.1186/s40537-021-00492-0. Junghyun Min, R. Thomas McCoy, Dipanjan Das, Emily Pitler, and Tal Linzen, “Syntactic Data Augmentation Increases Robustness to Inference Heuristics,” Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 2339-2352, https://aclanthology.org/2020.acl-main.212/.
11 Connor Shorten, Taghi M. Khoshgoftaar, and Borko Furht, “Text Data Augmentation for Deep Learning,” Journal of Big Data, 2021, https://journalofbigdata.springeropen.com/articles/10.1186/s40537-021-00492-0. Rico Sennrich, Barry Haddow, and Alexandra Birch, “Improving Neural Machine Translation Models with Monolingual Data,” Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, 2016, pp. 86-96, https://aclanthology.org/P16-1009/.
12 Connor Shorten, Taghi M. Khoshgoftaar, and Borko Furht, “Text Data Augmentation for Deep Learning,” Journal of Big Data, 2021, https://journalofbigdata.springeropen.com/articles/10.1186/s40537-021-00492-0. Lichao Sun, Congying Xia, Wenpeng Yin, Tingting Liang, Philip Yu, and Lifang He, “Mixup-Transformer: Dynamic Data Augmentation for NLP Tasks,” Proceedings of the 28th International Conference on Computational Linguistics, 2020, https://aclanthology.org/2020.coling-main.305/. Hongyu Guo, Yongyi Mao, and Richong Zhang, “Augmenting Data with Mixup for Sentence Classification: An Empirical Study,” 2019. https://arxiv.org/abs/1905.08941.
13 Suorong Yang, Weikang Xiao, Mengchen Zhang, Suhan Guo, Jian Zhao, and Furao Shen, “Image Data Augmentation for Deep Learning: A Survey,” 2023, https://arxiv.org/pdf/2204.08610.pdf. Alhassan Mumuni and Fuseini Mumuni, “Data augmentation: A comprehensive survey of modern approaches,” Array, Vol. 16, 2022, https://www.sciencedirect.com/science/article/pii/S2590005622000911. Evgin Goceri, “Medical image data augmentation: techniques, comparisons and interpretations,” Artificial Intelligence Review, Vol. 56, 2023, pp. 12561-12605, https://link.springer.com/article/10.1007/s10462-023-10453-z.
14 Ekin D. Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V. Le, “AutoAugment: Learning Augmentation Strategies From Data,” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 113-123, https://openaccess.thecvf.com/content_CVPR_2019/papers/Cubuk_AutoAugment_Learning_Augmentation_Strategies_From_Data_CVPR_2019_paper.pdf.
15 Barret Zoph, Ekin D. Cubuk, Golnaz Ghiasi, Tsung-Yi Lin, Jonathon Shlens, and Quoc V. Le, “Learning Data Augmentation Strategies for Object Detection,” Proceedings of the 16th European Conference on Computer Vision, 2020, https://link.springer.com/chapter/10.1007/978-3-030-58583-9_34.
16 Sandareka Wickramanayake, Wynne Hsu, and Mong Li Lee, “Explanation-based Data Augmentation for Image Classification,” Advances in Neural Information Processing Systems, Vol. 34, 2021, https://proceedings.neurips.cc/paper_files/paper/2021/hash/af3b6a54e9e9338abc54258e3406e485-Abstract.html.
17 Krishna Chaitanya, Neerav Karani, Christian F. Baumgartner, Anton Becker, Olivio Donati, and Ender Konukoglu, “Semi-supervised and Task-Driven Data Augmentation,” Proceedings of the 26th International Conference on Information Processing in Medical Imaging, 2019, https://link.springer.com/chapter/10.1007/978-3-030-20351-1_3.
18 Cédric Rommel, Thomas Moreau, Joseph Paillard, and Alexandre Gramfort, “CADDA: Class-wise Automatic Differentiable Data Augmentation for EEG Signals,” International Conference on Learning Representations, 2022, https://iclr.cc/virtual/2022/poster/7154.
19 Maayan Frid-Adar, Idit Diamant, Eyal Klang, Michal Amitai, Jacob Goldberger, and Hayit Greenspan, “GAN-based synthetic medical image augmentation for increased CNN performance in liver lesion classification,” Neurocomputing, 2018, pp. 321-331, https://www.sciencedirect.com/science/article/abs/pii/S0925231218310749.
20 Veit Sandfort, Ke Yan, Perry Pickhardt, and Ronald Summers, “Data augmentation using generative adversarial networks (CycleGAN) to improve generalizability in CT segmentation tasks,” Scientific Reports, 2019, https://www.nature.com/articles/s41598-019-52737-x.
21 Kang Min Yoo, Dongju Park, Jaewook Kang, Sang-Woo Lee, and Woomyoung Park, “GPT3Mix: Leveraging Large-scale Language Models for Text Augmentation,” Findings of the Association for Computational Linguistics: EMNLP 2021, pp. 2225-2239, https://aclanthology.org/2021.findings-emnlp.192/. Haixing Dai, Zhengliang Liu, Wenxiong Liao, Xiaoke Huang, Yihan Cao, Zihao Wu, Lin Zhao, Shaochen Xu, Wei Liu, Ninghao Liu, Sheng Li, Dajiang Zhu, Hongmin Cai, Lichao Sun, Quanzheng Li, Dinggang Shen, Tianming Liu, and Xiang Li, “AugGPT: Leveraging ChatGPT for Text Data Augmentation,” 2023, https://arxiv.org/abs/2302.13007.
22 Bram Vanherle, Steven Moonen, Frank Van Reeth, and Nick Michiels, “Analysis of Training Object Detection Models with Synthetic Data,” 33rd British Machine Vision Conference, 2022, https://bmvc2022.mpi-inf.mpg.de/0833.pdf. Martin Georg Ljungqvist, Otto Nordander, Markus Skans, Arvid Mildner, Tony Liu, and Pierre Nugues, “Object Detector Differences When Using Synthetic and Real Training Data,” SN Computer Science, Vol. 4, 2023, https://link.springer.com/article/10.1007/s42979-023-01704-5. Lei Kang, Marcal Rusinol, Alicia Fornes, Pau Riba, and Mauricio Villegas, “Unsupervised Writer Adaptation for Synthetic-to-Real Handwritten Word Recognition,” Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2020, pp. 3502-3511, https://openaccess.thecvf.com/content_WACV_2020/html/Kang_Unsupervised_Writer_Adaptation_for_Synthetic-to-Real_Handwritten_Word_Recognition_WACV_2020_paper.html.