Data augmentation uses pre-existing data to create new data samples that can improve model optimization and generalizability.
In its most general sense, data augmentation denotes methods for supplementing so-called incomplete datasets by providing missing data points in order to increase the dataset’s analyzability.1 This manifests in machine learning by generating modified copies of pre-existing data to increase the size and diversity of a dataset. Thus, with respect to machine learning, augmented data may be understood as artificially supplying potentially absent real-world data.
Data augmentation improves machine learning model optimization and generalization. In other words, data augmentation can reduce overfitting and improve model robustness.2 It is a working assumption of machine learning that larger, more diverse datasets lead to better model performance. Nevertheless, acquiring sufficient data can be difficult for a number of reasons, from ethics and privacy concerns to the time-consuming effort of manually compiling the necessary data. Data augmentation provides one effective means of increasing dataset size and variability. In fact, researchers widely use data augmentation to correct imbalanced datasets.3
Many deep learning frameworks, such as PyTorch, Keras, and TensorFlow, provide functions for augmenting data, principally image datasets. The Python package Albumentations (available on GitHub) is also adopted in many open source projects. Albumentations augments images along with their associated annotations, such as masks and bounding boxes.
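As a brief illustration, the following minimal sketch composes a few common Albumentations transforms and applies them to a placeholder image (a random NumPy array stands in for a real photo). The specific transforms and probabilities are illustrative choices, not a recommended recipe.

```python
import numpy as np
import albumentations as A

# Placeholder image: a random 224 x 224 RGB array stands in for real data
image = np.random.randint(0, 256, (224, 224, 3), dtype=np.uint8)

# Compose a pipeline of augmentations, each applied with a given probability
transform = A.Compose([
    A.HorizontalFlip(p=0.5),            # geometric: mirror the image
    A.Rotate(limit=30, p=0.5),          # geometric: rotate up to +/- 30 degrees
    A.RandomBrightnessContrast(p=0.5),  # photometric: adjust brightness/contrast
    A.GaussNoise(p=0.3),                # noise injection
])

augmented = transform(image=image)["image"]  # result is returned in a dict keyed by target name
```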
Note that data augmentation is distinct from synthetic data. Both approaches add new data to a collection in order to improve the performance of machine learning models. Synthetic data, however, refers to the automatic generation of entirely artificial data. An example is using computer-generated images, as opposed to real-world data, to train an object detection model. By contrast, data augmentation copies existing data and transforms those copies to increase the diversity and amount of data in a given set.
There are a variety of data augmentation methods. The specific techniques used for augmenting data depend upon the nature of data with which a user is working. Note that data augmentation is typically implemented during preprocessing on the training dataset. Some studies investigate the effect of augmentation on the validation or test set, but augmentation applications outside of training sets are rarer.4
Data augmentation has been widely implemented in research for a range of computer vision tasks, from image classification to object detection. As such, there is a wealth of research on how augmented images improve the performance of state-of-the-art convolutional neural networks (CNNs) in image processing.
Many tutorials and non-academic resources classify image data augmentation into two categories: geometric transformations and photometric (or, color space) transformations. Both consist of relatively simple image file manipulation. The first category denotes techniques that alter the space and layout of the original image, such as resizing, zooming, or changes in orientation (for example, horizontal flip). Photometric transformations alter an image’s RGB (red-green-blue) channels. Examples of photometric transformation include saturation adjustment and grayscaling an image.5
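To make the two categories concrete, the sketch below groups a few torchvision transforms into a geometric pipeline and a photometric pipeline. The parameter values are arbitrary examples, and a random placeholder image stands in for real data.

```python
import numpy as np
from PIL import Image
from torchvision import transforms

# Placeholder image in place of a real photo
img = Image.fromarray(np.random.randint(0, 256, (224, 224, 3), dtype=np.uint8))

# Geometric transformations alter spatial layout and orientation
geometric = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomResizedCrop(size=224),  # zoom/crop and resize
    transforms.RandomRotation(degrees=15),
])

# Photometric transformations alter the RGB color channels
photometric = transforms.Compose([
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),
    transforms.RandomGrayscale(p=0.1),
])

augmented = photometric(geometric(img))
```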
Some sources categorize noise injection with geometric transformations,6 while others classify it with photometric transformations.7 Noise injection inserts random values into an image's pixels, for example by adding noise drawn from a Gaussian distribution or scattering random black, white, or color pixels.
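A minimal sketch of Gaussian noise injection, assuming a standard uint8 image array (the mean and standard deviation below are arbitrary illustrative values):

```python
import numpy as np

def add_gaussian_noise(image, mean=0.0, std=15.0):
    """Add pixel noise drawn from a Gaussian distribution to a uint8 image."""
    noise = np.random.normal(mean, std, image.shape)
    noisy = image.astype(np.float64) + noise
    return np.clip(noisy, 0, 255).astype(np.uint8)  # keep values in the valid 0-255 range

noisy_image = add_gaussian_noise(np.random.randint(0, 256, (224, 224, 3), dtype=np.uint8))
```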
As noise injection illustrates, the binary classification of image augmentation techniques into geometric and photometric fails to cover the whole range of possible augmentation strategies. Techniques that fall outside these two categories include kernel filtering (sharpening or blurring an image) and image mixing. An example of the latter is random cropping and patching, which randomly samples sections from several images and combines them into a new, composite image. A related technique is random erasing, which deletes a random portion of an image (see the sketch below).8 Such techniques are useful in image recognition tasks, as real-world use cases may require machines to identify partially obscured objects.
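Random erasing, for instance, is available out of the box in torchvision. The following sketch, with illustrative parameter values, converts a placeholder image to a tensor and blanks out a random rectangle:

```python
import numpy as np
from torchvision import transforms

# RandomErasing operates on tensor images, so convert first
erase = transforms.Compose([
    transforms.ToTensor(),                                       # HWC uint8 array -> CHW float tensor
    transforms.RandomErasing(p=1.0, scale=(0.02, 0.2), value=0)  # zero out a random rectangle
])

image = np.random.randint(0, 256, (224, 224, 3), dtype=np.uint8)  # placeholder image
erased = erase(image)
```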
Instance-level augmentation is another approach. It essentially copies labeled regions (for example, bounding boxes) from one image and pastes them onto another image. This trains the model to identify objects against different backgrounds, as well as objects obscured by other objects. Instance-level augmentation is a particularly salient approach for region-specific recognition tasks, such as object detection and image segmentation.9
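A naive sketch of this copy-paste idea, assuming NumPy image arrays and a bounding box given in pixel coordinates (boundary checks and label bookkeeping are omitted for brevity):

```python
import numpy as np

def paste_instance(src_img, src_box, dst_img, dst_xy):
    """Copy a labeled region from src_img and paste it onto a copy of dst_img.

    src_box is (x1, y1, x2, y2) in pixel coordinates; dst_xy is the top-left
    corner of the paste location. Assumes the patch fits inside dst_img.
    """
    x1, y1, x2, y2 = src_box
    patch = src_img[y1:y2, x1:x2]
    x, y = dst_xy
    out = dst_img.copy()
    out[y:y + patch.shape[0], x:x + patch.shape[1]] = patch
    return out
```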
Like image augmentation, text data augmentation consists of many techniques and methods that are used across a range of natural language processing (NLP) tasks. A few resources divide text augmentation into rule-based (or “easy”) and neural methods. Of course, as with the binary division of image augmentation techniques, this categorization is not all-encompassing.
Rule-based approaches include relatively simple find-and-replace techniques, such as random deletion or insertion. Rule-based approaches also encompass synonym replacement. In this strategy, one or more words in a string are replaced with their respective synonyms as recorded in a predefined thesaurus, such as WordNet or the Paraphrase Database. Sentence inversion and passivization, in which the object and subject are swapped, are also examples of rule-based approaches.10
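A minimal sketch of synonym replacement using NLTK's WordNet interface (it assumes the WordNet corpus has been fetched via nltk.download("wordnet"), and the whitespace tokenization is deliberately simplistic):

```python
import random
from nltk.corpus import wordnet  # requires nltk.download("wordnet")

def synonym_replacement(tokens, n=1):
    """Replace up to n tokens with a randomly chosen WordNet synonym."""
    augmented = list(tokens)
    candidates = [i for i, tok in enumerate(tokens) if wordnet.synsets(tok)]
    random.shuffle(candidates)
    for i in candidates[:n]:
        synonyms = {lemma.name().replace("_", " ")
                    for syn in wordnet.synsets(augmented[i])
                    for lemma in syn.lemmas()}
        synonyms.discard(augmented[i])
        if synonyms:
            augmented[i] = random.choice(sorted(synonyms))
    return augmented

print(synonym_replacement("the quick brown fox jumps over the lazy dog".split(), n=2))
```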
Per this classification, neural methods utilize neural networks to generate new text samples from the input data. One notable neural method is back-translation, which uses machine translation to translate input data into a target language and then back into the original language. In this way, back-translation leverages the linguistic variation that results from automated translation to generate semantic variants within a single-language dataset. Research suggests this is effective for improving machine translation model performance.11
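As an illustration, the sketch below performs back-translation through German using Hugging Face translation pipelines. The choice of intermediate language and the specific Helsinki-NLP MarianMT checkpoints are illustrative assumptions; any machine translation model pair works.

```python
from transformers import pipeline

# One possible translation model pair; downloaded on first use
en_to_de = pipeline("translation", model="Helsinki-NLP/opus-mt-en-de")
de_to_en = pipeline("translation", model="Helsinki-NLP/opus-mt-de-en")

def back_translate(text):
    """Translate English -> German -> English to produce a paraphrased variant."""
    intermediate = en_to_de(text)[0]["translation_text"]
    return de_to_en(intermediate)[0]["translation_text"]

print(back_translate("Data augmentation increases the size and diversity of a dataset."))
```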
Mix-up text augmentation is another strategy. This approach operates on neural network embeddings rather than on raw strings. Specifically, pre-trained transformers (for example, BERT) generate word-level or sentence-level embeddings of text, transforming text into vector points, much as a bag-of-words model does. The transformation of text into vector points generally aims to capture linguistic similarity; that is, words or sentences nearer one another in vector space are assumed to share similar meanings or usage. Mix-up augmentation interpolates text samples whose embeddings lie within a specified distance of one another, producing new data points that are aggregates of the input data.12
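A minimal sketch of the interpolation step, assuming two precomputed embedding vectors and one-hot label vectors (the Beta-distributed mixing coefficient follows the common mixup formulation):

```python
import numpy as np

def mixup(emb_a, emb_b, label_a, label_b, alpha=0.2):
    """Linearly interpolate two embeddings and their labels into a new sample."""
    lam = np.random.beta(alpha, alpha)  # mixing coefficient in [0, 1]
    mixed_emb = lam * emb_a + (1 - lam) * emb_b
    mixed_label = lam * label_a + (1 - lam) * label_b
    return mixed_emb, mixed_label

# Example with toy 4-dimensional "sentence embeddings" and 2-class one-hot labels
emb, label = mixup(np.array([0.1, 0.4, 0.2, 0.9]), np.array([0.3, 0.1, 0.8, 0.5]),
                   np.array([1.0, 0.0]), np.array([0.0, 1.0]))
```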
Many users struggle to identify which data augmentation strategies to implement, in part because techniques vary in efficacy between datasets and tasks. Comparative research on data augmentation techniques suggests that combining multiple forms of augmentation has a greater positive impact than using one alone, but the optimal combination is dataset and task dependent.13 How, then, does one go about selecting the optimal techniques?
To address this issue, research has turned to automated data augmentation. One automated augmentation approach uses reinforcement learning to identify augmentation techniques that return the highest validation accuracy on a given dataset.14 This approach has been shown to produce strategies that improve performance on both in-sample and out-of-sample data.15 Another promising approach to automated augmentation identifies and augments false positives from classifier outputs. In this way, automatic augmentation identifies the best strategies to correct for frequently misclassified items.16
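Learned policies of this kind are exposed directly in some frameworks. For example, torchvision ships an implementation of the AutoAugment policies, which can be dropped into an image pipeline as sketched below; the choice of the ImageNet policy and the random placeholder image are illustrative.

```python
import numpy as np
from PIL import Image
from torchvision import transforms
from torchvision.transforms import AutoAugment, AutoAugmentPolicy

# Placeholder image in place of a real training sample
img = Image.fromarray(np.random.randint(0, 256, (224, 224, 3), dtype=np.uint8))

augment = transforms.Compose([
    AutoAugment(policy=AutoAugmentPolicy.IMAGENET),  # policy learned in the AutoAugment work
    transforms.ToTensor(),
])
augmented = augment(img)
```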
More recently, research has turned to generative networks and models to identify task-dependent17 and class-dependent18 optimal augmentation strategies. This includes work with generative adversarial networks (GANs). GANs are deep learning networks typically used to generate synthetic data, and recent research investigates their use for data augmentation. A few experiments, for instance, suggest that synthetic data augmentations of medical image sets improve classification19 and segmentation20 model performance more than classic augmentations. Relatedly, research in text augmentation leverages large language models (LLMs) and chatbots to generate augmented data. These experiments use LLMs to generate augmented samples of input data with mix-up and synonymizing techniques, showing a greater positive impact for text classification models than classic augmentation.21
Researchers and developers widely adopt data augmentation techniques when training models for various machine learning tasks. By contrast, synthetic data is a comparatively newer area of research. Comparative experiments on synthetic versus real data show mixed results, with models trained entirely on synthetic data sometimes outperforming, sometimes underperforming models trained on real-world data. Perhaps unsurprisingly, this research suggests synthetic data is most useful when it reflects characteristics of real-world data.22
1 Martin Tanner and Wing Hung Wong, “The Calculation of Posterior Distributions by Data Augmentation,” Journal of the American Statistical Association, Vol. 82, No. 398 (1987), pp. 528-540.
2 Sylvestre-Alvise Rebuffi, Sven Gowal, Dan Andrei Calian, Florian Stimberg, Olivia Wiles, and Timothy A Mann, “Data Augmentation Can Improve Robustness,” Advances in Neural Information Processing Systems, Vol. 34, 2021, https://proceedings.neurips.cc/paper_files/paper/2021/hash/fb4c48608ce8825b558ccf07169a3421-Abstract.html.
3 Manisha Saini and Seba Susan, “Tackling class imbalance in computer vision: A contemporary review,” Artificial Intelligence Review, Vol. 54, 2023, https://link.springer.com/article/10.1007/s10462-023-10557-6.
4 Fabio Perez, Cristina Vasconcelos, Sandra Avila, and Eduardo Valle, “Data Augmentation for Skin Lesion Analysis,” OR 2.0 Context-Aware Operating Theaters, Computer Assisted Robotic Endoscopy, Clinical Image-Based Procedures, and Skin Image Analysis, 2018, https://link.springer.com/chapter/10.1007/978-3-030-01201-4_33.
5 Connor Shorten and Taghi M. Khoshgoftaar, “A survey on Image Data Augmentation for Deep Learning,” Journal of Big Data, 2019, https://journalofbigdata.springeropen.com/articles/10.1186/s40537-019-0197-0.
6 Duc Haba, Data Augmentation with Python, Packt Publishing, 2023.
7 Mingle Xu, Sook Yoon, Alvaro Fuentes, and Dong Sun Park, “A Comprehensive Survey of Image Augmentation Techniques for Deep Learning,” Pattern Recognition, Vol. 137, 2023, https://www.sciencedirect.com/science/article/pii/S0031320323000481.
8 Connor Shorten and Taghi M. Khoshgoftaar, “A survey on Image Data Augmentation for Deep Learning,” Journal of Big Data, 2019, https://journalofbigdata.springeropen.com/articles/10.1186/s40537-019-0197-0. Terrance DeVries and Graham W. Taylor, “Improved Regularization of Convolutional Neural Networks with Cutout,” 2017, https://arxiv.org/abs/1708.04552.
9 Zhiqiang Shen, Mingyang Huang, Jianping Shi, Xiangyang Xue, and Thomas S. Huang, “Towards Instance-Level Image-To-Image Translation,” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 3683-3692, https://openaccess.thecvf.com/content_CVPR_2019/html/Shen_Towards_Instance-Level_Image-To-Image_Translation_CVPR_2019_paper.html. Golnaz Ghiasi, Yin Cui, Aravind Srinivas, Rui Qian, Tsung-Yi Lin, Ekin D. Cubuk, Quoc V. Le, and Barret Zoph, “Simple Copy-Paste Is a Strong Data Augmentation Method for Instance Segmentation,” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 2918-2928, https://openaccess.thecvf.com/content/CVPR2021/html/Ghiasi_Simple_Copy-Paste_Is_a_Strong_Data_Augmentation_Method_for_Instance_CVPR_2021_paper.html.
10 Connor Shorten, Taghi M. Khoshgoftaar and Borko Furht, “Text Data Augmentation for Deep Learning,” Journal of Big Data, 2021, https://journalofbigdata.springeropen.com/articles/10.1186/s40537-021-00492-0. Junghyun Min, R. Thomas McCoy, Dipanjan Das, Emily Pitler, and Tal Linzen, “Syntactic Data Augmentation Increases Robustness to Inference Heuristics,” Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 2339-2352, https://aclanthology.org/2020.acl-main.212/.
11 Connor Shorten, Taghi M. Khoshgoftaar, and Borko Furht, “Text Data Augmentation for Deep Learning,” Journal of Big Data, 2021, https://journalofbigdata.springeropen.com/articles/10.1186/s40537-021-00492-0. Rico Sennrich, Barry Haddow, and Alexandra Birch, “Improving Neural Machine Translation Models with Monolingual Data,” Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, 2016, pp. 86-96, https://aclanthology.org/P16-1009/.
12 Connor Shorten, Taghi M. Khoshgoftaar, and Borko Furht, “Text Data Augmentation for Deep Learning,” Journal of Big Data, 2021, https://journalofbigdata.springeropen.com/articles/10.1186/s40537-021-00492-0. Lichao Sun, Congying Xia, Wenpeng Yin, Tingting Liang, Philip Yu, and Lifang He, “Mixup-Transformer: Dynamic Data Augmentation for NLP Tasks,” Proceedings of the 28th International Conference on Computational Linguistics, 2020, https://aclanthology.org/2020.coling-main.305/. Hongyu Guo, Yongyi Mao, and Richong Zhang, “Augmenting Data with Mixup for Sentence Classification: An Empirical Study,” 2019. https://arxiv.org/abs/1905.08941.
13 Suorong Yang, Weikang Xiao, Mengchen Zhang, Suhan Guo, Jian Zhao, and Furao Shen, “Image Data Augmentation for Deep Learning: A Survey,” 2023, https://arxiv.org/pdf/2204.08610.pdf. Alhassan Mumuni and Fuseini Mumuni, “Data augmentation: A comprehensive survey of modern approaches,” Array, Vol. 16, 2022, https://www.sciencedirect.com/science/article/pii/S2590005622000911. Evgin Goceri, “Medical image data augmentation: techniques, comparisons and interpretations,” Artificial Intelligence Review, Vol. 56, 2023, pp. 12561-12605, https://link.springer.com/article/10.1007/s10462-023-10453-z.
14 Ekin D. Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V. Le, “AutoAugment: Learning Augmentation Strategies From Data,” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 113-123, https://openaccess.thecvf.com/content_CVPR_2019/papers/Cubuk_AutoAugment_Learning_Augmentation_Strategies_From_Data_CVPR_2019_paper.pdf.
15 Barret Zoph, Ekin D. Cubuk, Golnaz Ghiasi, Tsung-Yi Lin, Jonathon Shlens, and Quoc V. Le, “Learning Data Augmentation Strategies for Object Detection,” Proceedings of the 16th European Conference on Computer Vision, 2020, https://link.springer.com/chapter/10.1007/978-3-030-58583-9_34.
16 Sandareka Wickramanayake, Wynne Hsu, and Mong Li Lee, “Explanation-based Data Augmentation for Image Classification,” Advances in Neural Information Processing Systems, Vol. 34, 2021, https://proceedings.neurips.cc/paper_files/paper/2021/hash/af3b6a54e9e9338abc54258e3406e485-Abstract.html.
17 Krishna Chaitanya, Neerav Karani, Christian F. Baumgartner, Anton Becker, Olivio Donati, and Ender Konukoglu, “Semi-supervised and Task-Driven Data Augmentation,” Proceedings of the 26th International Conference on Information Processing in Medical Imaging, 2019, https://link.springer.com/chapter/10.1007/978-3-030-20351-1_3.
18 Cédric Rommel, Thomas Moreau, Joseph Paillard, and Alexandre Gramfort, “CADDA: Class-wise Automatic Differentiable Data Augmentation for EEG Signals,” International Conference on Learning Representations, 2022, https://iclr.cc/virtual/2022/poster/7154.
19 Maayan Frid-Adar, Idit Diamant, Eyal Klang, Michal Amitai, Jacob Goldberger, and Hayit Greenspan, “GAN-based synthetic medical image augmentation for increased CNN performance in liver lesion classification,” Neurocomputing, 2018, pp. 321-331, https://www.sciencedirect.com/science/article/abs/pii/S0925231218310749.
20 Veit Sandfort, Ke Yan, Perry Pickhardt, and Ronald Summers, “Data augmentation using generative adversarial networks (CycleGAN) to improve generalizability in CT segmentation tasks,” Scientific Reports, 2019, https://www.nature.com/articles/s41598-019-52737-x.
21 Kang Min Yoo, Dongju Park, Jaewook Kang, Sang-Woo Lee, and Woomyoung Park, “GPT3Mix: Leveraging Large-scale Language Models for Text Augmentation,” Findings of the Association for Computational Linguistics: EMNLP 2021, pp. 2225-2239, https://aclanthology.org/2021.findings-emnlp.192/. Haixing Dai, Zhengliang Liu, Wenxiong Liao, Xiaoke Huang, Yihan Cao, Zihao Wu, Lin Zhao, Shaochen Xu, Wei Liu, Ninghao Liu, Sheng Li, Dajiang Zhu, Hongmin Cai, Lichao Sun, Quanzheng Li, Dinggang Shen, Tianming Liu, and Xiang Li, “AugGPT: Leveraging ChatGPT for Text Data Augmentation,” 2023, https://arxiv.org/abs/2302.13007.
22 Bram Vanherle, Steven Moonen, Frank Van Reeth, and Nick Michiels, “Analysis of Training Object Detection Models with Synthetic Data,” 33rd British Machine Vision Conference, 2022, https://bmvc2022.mpi-inf.mpg.de/0833.pdf. Martin Georg Ljungqvist, Otto Nordander, Markus Skans, Arvid Mildner, Tony Liu, and Pierre Nugues, “Object Detector Differences When Using Synthetic and Real Training Data,” SN Computer Science, Vol. 4, 2023, https://link.springer.com/article/10.1007/s42979-023-01704-5. Lei Kang, Marcal Rusinol, Alicia Fornes, Pau Riba, and Mauricio Villegas, “Unsupervised Writer Adaptation for Synthetic-to-Real Handwritten Word Recognition,” Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2020, pp. 3502-3511, https://openaccess.thecvf.com/content_WACV_2020/html/Kang_Unsupervised_Writer_Adaptation_for_Synthetic-to-Real_Handwritten_Word_Recognition_WACV_2020_paper.html.