In its most general sense, data augmentation denotes methods for supplementing incomplete datasets by supplying missing data points, thereby making the dataset easier to analyze.1 In machine learning, this manifests as generating modified copies of pre-existing data to increase the size and diversity of a dataset. With respect to machine learning, then, augmented data may be understood as an artificial stand-in for real-world data that is absent or difficult to collect.
Data augmentation improves machine learning model optimization and generalization. In other words, data augmentation can reduce overfitting and improve model robustness.2 It is a working assumption of machine learning that larger, more diverse datasets yield better model performance. Nevertheless, for a number of reasons—from ethics and privacy concerns to the simple time-consuming effort of manually compiling the necessary data—acquiring sufficient data can be difficult. Data augmentation provides one effective means of increasing dataset size and variability. In fact, researchers widely use data augmentation to correct imbalanced datasets.3
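To make the imbalanced-dataset point concrete, the sketch below oversamples minority classes by adding augmented copies of their samples until every class matches the majority class count. It is a minimal illustration in plain Python; the `augment` function here is a hypothetical placeholder (small random feature jitter), not a technique prescribed by the text, and real pipelines would apply richer transformations.

```python
import random
from collections import Counter

def augment(sample, rng=random):
    """Hypothetical placeholder augmentation: jitter each numeric feature slightly."""
    return [x + rng.uniform(-0.1, 0.1) for x in sample]

def oversample_minority(samples, labels, rng=random):
    """Balance classes by appending augmented copies of minority-class samples."""
    counts = Counter(labels)
    target = max(counts.values())  # size of the largest class
    out_samples, out_labels = list(samples), list(labels)
    for label, count in counts.items():
        pool = [s for s, y in zip(samples, labels) if y == label]
        for _ in range(target - count):  # top up smaller classes
            out_samples.append(augment(rng.choice(pool), rng))
            out_labels.append(label)
    return out_samples, out_labels

# Two samples of class "a", one of class "b" -> "b" gets one augmented copy.
X, y = oversample_minority([[0.0], [1.0], [2.0]], ["a", "a", "b"])
```

After the call, `Counter(y)` reports equal counts for both classes, while the original samples are preserved unchanged.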
Many deep learning frameworks, such as PyTorch, Keras, and TensorFlow, provide functions for augmenting data, principally image datasets. The Python package Albumentations (available on GitHub) is also adopted in many open source projects. Albumentations supports augmenting images along with associated targets such as masks, bounding boxes, and keypoints.
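The kinds of transformations these frameworks apply can be sketched without any of them. The example below implements two canonical image augmentations, horizontal flipping and random cropping, on an image represented as a nested list of pixel intensities; the function names are illustrative, and library transforms (e.g. in torchvision or Albumentations) operate on tensors or arrays rather than lists.

```python
import random

def horizontal_flip(image):
    """Mirror each row of a 2D image (list of rows of pixel values)."""
    return [row[::-1] for row in image]

def random_crop(image, size, rng=random):
    """Cut a size x size patch from a random position in the image."""
    h, w = len(image), len(image[0])
    top = rng.randrange(h - size + 1)
    left = rng.randrange(w - size + 1)
    return [row[left:left + size] for row in image[top:top + size]]

# A tiny 3x3 "image" of pixel intensities.
img = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]

flipped = horizontal_flip(img)  # each row reversed
patch = random_crop(img, 2)     # a random 2x2 sub-image
```

Each call to `random_crop` can return a different patch, which is the point: one stored image yields many distinct training examples.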