What is upsampling?

29 April 2024

Authors

Jacob Murel Ph.D.

Senior Technical Content Creator

Upsampling increases the number of data samples in a dataset. In doing so, it aims to correct imbalanced data and thereby improve model performance.

Upsampling, otherwise known as oversampling, is a data processing and optimization technique that addresses class imbalance in a dataset by adding data. Upsampling adds data by using original samples from minority classes until all classes are equal in size. Both Python scikit-learn and Matlab contain built-in functions for implementing upsampling techniques.

Upsampling for data science is often mistaken for upsampling in digital signal processing (DSP). The two are similar in spirit yet distinct. Like upsampling in data science, upsampling in DSP artificially creates more samples from an input signal (specifically a discrete-time signal), in this case to represent the signal at a higher sampling rate. The new samples are generated by inserting zeros between the original samples and then applying a low-pass filter for interpolation. This differs from how data is upsampled in data balancing.

Upsampling for data balancing is also distinct from upsampling in image processing. In the latter, high resolution images are first reduced in resolution (removing pixels) for faster computations, after which convolution returns the image to its original dimensions (adding back pixels).

Why use upsampling?

Upsampling is an effective way to address imbalance within a dataset. An imbalanced dataset is defined as a dataset in which one class is greatly underrepresented relative to the true population, creating unintended bias. For instance, imagine a model is trained to classify images as showing a cat or a dog. The dataset used is composed of 90% cats and 10% dogs. Cats in this scenario are overrepresented, and a classifier that predicts "cat" every time will reach 90% overall accuracy: it classifies every cat correctly but never identifies a single dog. The imbalanced dataset in this case will cause classifiers to favor accuracy for the majority class at the expense of the minority class. The same issue can arise with multi-class datasets.¹
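The cat and dog example can be checked with a few lines of Python. This is a minimal sketch with hypothetical toy labels (90 cats, 10 dogs); the helper functions `accuracy` and `recall` are written here for illustration and are not taken from any library.

```python
# Hypothetical toy labels: 90 cats (majority class) and 10 dogs (minority class).
y_true = ["cat"] * 90 + ["dog"] * 10
# A degenerate classifier that always predicts the majority class.
y_pred = ["cat"] * len(y_true)


def accuracy(y_true, y_pred):
    """Fraction of all samples whose prediction matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, label):
    """Fraction of the true `label` samples that were predicted as `label`."""
    relevant = [(t, p) for t, p in zip(y_true, y_pred) if t == label]
    return sum(t == p for t, p in relevant) / len(relevant)


overall = accuracy(y_true, y_pred)
cat_recall = recall(y_true, y_pred, "cat")
dog_recall = recall(y_true, y_pred, "dog")
print(overall, cat_recall, dog_recall)
```

The overall accuracy looks strong even though the per-class view shows that no dog is ever recognized, which is why accuracy alone can be misleading on imbalanced data.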

The process of upsampling counteracts the imbalanced dataset issue. It populates the dataset with points synthesized from characteristics of the original dataset’s minority class. This balances the dataset by effectively increasing the number of samples for an underrepresented minority class until the dataset contains an equal ratio of points across all classes.

While an imbalance can be seen by simply plotting the counts of data points in each class, such a plot doesn’t tell us whether the imbalance will greatly affect the model. Fortunately, we can use performance metrics to gauge how well an upsampling technique corrects for class imbalance. Most of these metrics are defined for binary classification, where there are only two classes: a positive and a negative. Usually, the positive class is the minority class while the negative class is the majority class. Two popular metrics are Receiver Operating Characteristic (ROC) curves and precision-recall curves.¹
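To make the two curves concrete, the sketch below computes a few points on each from classifier scores. It assumes binary labels where 1 marks the positive (minority) class; the data, the thresholds and the function names `confusion_counts` and `curve_points` are hypothetical and chosen only for illustration.

```python
def confusion_counts(y_true, scores, threshold):
    """Count TP, FP, FN, TN when scores >= threshold are predicted positive."""
    tp = fp = fn = tn = 0
    for label, score in zip(y_true, scores):
        predicted = 1 if score >= threshold else 0
        if predicted and label:
            tp += 1
        elif predicted and not label:
            fp += 1
        elif label:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def curve_points(y_true, scores, thresholds):
    """Return ROC points (FPR, TPR) and PR points (recall, precision)."""
    roc, pr = [], []
    for threshold in thresholds:
        tp, fp, fn, tn = confusion_counts(y_true, scores, threshold)
        recall = tp / (tp + fn)              # also the true positive rate
        fpr = fp / (fp + tn)                 # false positive rate
        precision = tp / (tp + fp) if tp + fp else 1.0
        roc.append((fpr, recall))
        pr.append((recall, precision))
    return roc, pr


# Hypothetical scores: 8 negatives (majority) followed by 2 positives (minority).
y_true = [0] * 8 + [1] * 2
scores = [0.1, 0.2, 0.2, 0.3, 0.4, 0.6, 0.7, 0.1, 0.8, 0.35]
roc, pr = curve_points(y_true, scores, thresholds=[0.3, 0.5])
print(roc)
print(pr)
```

In this toy data the false positive rate at threshold 0.5 is only 0.25, yet precision is one third, because the few false positives are measured against an even smaller number of true positives. Comparing both curves before and after upsampling is one way to judge whether the resampling helped.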

Advantages and disadvantages of upsampling

Advantages

No Information Loss: Unlike downsampling, which removes data points from the majority class, upsampling generates new data points, avoiding any information loss.
Increase Data at Low Costs: Upsampling is an especially effective way, and often the only way, to increase dataset size on demand in cases where data can only be acquired through observation. For instance, certain medical conditions are simply too rare to allow for more data to be collected.

Disadvantages

Overfitting: Because upsampling creates new data based on the existing minority class data, the classifier can be overfitted to the data. Upsampling assumes that the existing data adequately captures reality; if that is not the case, the classifier may not be able to generalize very well.
Data Noise: Upsampling can increase the amount of noise in the data, reducing the classifier’s reliability and performance.²
Computational Complexity: Because upsampling increases the amount of data, training the classifier becomes more computationally expensive, which can be an issue when using cloud computing.²

Upsampling techniques

Random oversampling

Random oversampling is the process of duplicating random data points in the minority class until the size of the minority class is equal to the majority class.

Though they are similar in nature, random oversampling is distinct from bootstrapping. Bootstrapping is a resampling technique, commonly used in ensemble learning, that resamples from all classes. By contrast, random oversampling resamples from only the minority class. Random oversampling can thus be understood as a more specialized form of bootstrapping.

Despite its simplicity, random oversampling has limitations. Because it solely adds duplicate data points, it can lead to overfitting.³ But it still has advantages over other methods: ease of implementation, a lack of strong assumptions about the data, and low time complexity due to its simple algorithm.²
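Random oversampling is short enough to sketch directly. The version below uses only the Python standard library and assumes the data fits in plain lists; the function name `random_oversample` and the toy data are hypothetical, not part of scikit-learn or any other package.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen samples of each smaller class until
    every class is as large as the largest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, count in counts.items():
        pool = [x for x, lab in zip(X, y) if lab == label]
        # Sampling with replacement: the new rows are exact duplicates.
        extra = rng.choices(pool, k=target - count)
        X_out.extend(extra)
        y_out.extend([label] * len(extra))
    return X_out, y_out


# Hypothetical toy data: 8 majority samples and 2 minority samples.
X = [[float(i)] for i in range(10)]
y = ["cat"] * 8 + ["dog"] * 2
X_bal, y_bal = random_oversample(X, y)
print(Counter(y_bal))
```

Every added row is a copy of an existing minority sample, which is what makes the method cheap and also what exposes it to overfitting.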

SMOTE

The Synthetic Minority Oversampling Technique, or SMOTE, is an upsampling technique first proposed in 2002 that synthesizes new data points from the existing points in the minority class.⁴ It consists of the following process:²

1. Find the K nearest neighbors for all minority class data points. K is usually 5.
2. Repeat steps 3-5 for each minority class data point:
3. Pick one of the data point’s K nearest neighbors.
4. Pick a random point on the line segment connecting these two points in the feature space to generate a new output sample. This process is known as interpolation.
5. Depending on how much upsampling is desired, repeat steps 3 and 4 using a different nearest neighbor.
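The steps above can be sketched in plain Python. This is a simplified illustration, not the reference implementation from the SMOTE paper: the function name `smote`, the parameter `n_new` and the choice to cycle through the minority points in order are assumptions made here, and the sketch expects at least two minority points with numeric features.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between each minority
    point and one of its k nearest minority neighbors."""
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    # Step 1: indices of the k nearest minority neighbors of every point.
    neighbors = []
    for i, p in enumerate(minority):
        others = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: math.dist(p, minority[j]),
        )
        neighbors.append(others[:k])
    synthetic = []
    for n in range(n_new):
        i = n % len(minority)             # Step 2: visit each minority point in turn
        j = rng.choice(neighbors[i])      # Step 3: pick one of its neighbors
        gap = rng.random()                # Step 4: random position on the segment
        p, q = minority[i], minority[j]
        synthetic.append([a + gap * (b - a) for a, b in zip(p, q)])
    return synthetic


# Hypothetical minority class: the four corners of the unit square.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote(minority, n_new=6, k=2)
print(new_points)
```

Because each synthetic point lies on a segment between two existing minority points, the new data stays inside the region already covered by the minority class.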

SMOTE counters the problem of overfitting in random oversampling by adding previously unseen new data to the dataset rather than simply duplicating pre-existing data. For this reason, some researchers consider SMOTE a better upsampling technique than random oversampling.

On the other hand, SMOTE’s artificial data point generation adds extra noise to the dataset, potentially making the classifier more unstable.¹ The synthetic points and noise from SMOTE can also inadvertently lead to overlaps between the minority and majority classes that don’t reflect reality, leading to what is called over-generalization.⁵

Borderline SMOTE

One popular extension, Borderline SMOTE, is used to combat the issue of artificial dataset noise and to create ‘harder’ data points. ‘Harder’ data points are data points close to the decision boundary, and therefore harder to classify. These harder points are more useful for the model to learn.²

Borderline SMOTE identifies the minority class points that are close to many majority class points and puts them into a DANGER set. DANGER points are the ‘hard’ data points to learn: they are harder to classify than points that are surrounded by minority class points. This selection process excludes points whose nearest neighbors are only majority class points, which are counted as noise. From there, the SMOTE algorithm continues as normal using this DANGER set.³
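The selection of the DANGER set can be sketched as follows. The text above does not state how many majority neighbors count as "many"; this sketch assumes the commonly described rule that at least half, but not all, of a point's m nearest neighbors belong to the majority class. The function name `danger_set` and the toy data are hypothetical.

```python
import math


def danger_set(X, y, minority_label, m=5):
    """Return indices of minority points that sit near the class border."""
    danger = []
    for i, (p, label) in enumerate(zip(X, y)):
        if label != minority_label:
            continue
        # m nearest neighbors of p among all other points, of either class.
        nearest = sorted(
            (j for j in range(len(X)) if j != i),
            key=lambda j: math.dist(p, X[j]),
        )[:m]
        n_major = sum(1 for j in nearest if y[j] != minority_label)
        if n_major == len(nearest):
            continue                      # only majority neighbors: treated as noise
        if n_major >= len(nearest) / 2:   # assumed threshold: at least half
            danger.append(i)
    return danger


# Hypothetical 1-D data: majority (0) on the left, minority (1) on the right.
X = [[0.0], [1.0], [2.0], [3.0], [4.0],            # majority
     [2.5], [4.9], [6.5], [7.0], [7.5], [8.0]]     # minority
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
danger = danger_set(X, y, minority_label=1, m=3)
print(danger)
```

In this toy data the point at 2.5 is discarded as noise, the point at 4.9 lands in the DANGER set and the cluster from 6.5 to 8.0 is considered safe. Only the DANGER points would then be used as starting points for SMOTE interpolation.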

ADASYN

Adaptive Synthetic Sampling Approach (ADASYN) is similar to Borderline SMOTE in that it generates more difficult data for the model to learn. But it also aims to preserve the distribution of the minority class data.⁶ It does this by first creating a weighted distribution of all the minority points based on the number of majority class examples in each point’s neighborhood. From there, it uses minority class points closer to the majority class more often in generating new data.

The process goes as follows:²

1. Create a KNN model on the entire dataset.
2. Each minority class point is given a “hardness factor”, denoted as r, which is the ratio of the number of majority class points to the total number of neighbors in KNN.
3. Like SMOTE, the synthetically generated points are a linear interpolation between the minority data and its neighbors, but the number of points generated scales with a point’s hardness factor. This generates more points in areas with less minority data and fewer points in areas with more.
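The part of ADASYN that differs from SMOTE is the allocation of synthetic points, which can be sketched as below. The hardness factor r follows the description above; normalizing r across all minority points and rounding to whole counts are assumptions made for this illustration, and the function name `adasyn_allocation` is hypothetical.

```python
import math


def adasyn_allocation(X, y, minority_label, n_new, k=5):
    """Return {minority index: number of synthetic points to generate}."""
    hardness = {}
    for i, (p, label) in enumerate(zip(X, y)):
        if label != minority_label:
            continue
        # k nearest neighbors of p among all other points, of either class.
        nearest = sorted(
            (j for j in range(len(X)) if j != i),
            key=lambda j: math.dist(p, X[j]),
        )[:k]
        # r: share of majority class points among the neighbors.
        hardness[i] = sum(1 for j in nearest if y[j] != minority_label) / len(nearest)
    total = sum(hardness.values())
    if total == 0:
        return {i: 0 for i in hardness}   # no minority point has majority neighbors
    # Spread n_new across points in proportion to their hardness factor.
    return {i: round(n_new * r / total) for i, r in hardness.items()}


# Hypothetical 1-D data: majority (0) on the left, minority (1) on the right.
X = [[0.0], [1.0], [2.0], [3.0], [4.0],            # majority
     [2.5], [4.9], [6.5], [7.0], [7.5], [8.0]]     # minority
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
allocation = adasyn_allocation(X, y, minority_label=1, n_new=10, k=3)
print(allocation)
```

Here the two minority points nearest the majority class receive all ten synthetic points, while the four points inside the minority cluster receive none. Each allocated point would then be generated by SMOTE-style interpolation.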

Data transformation/augmentations

Data augmentation creates new data by generating variations of existing data. It applies across a variety of machine learning fields.

The most basic form of data augmentation deals with transforming the raw inputs of the dataset. For example, in computer vision, image augmentations (cropping, blurring, mirroring and so on) can be used to create more images for the model to classify. Similarly, data augmentation can also be used in natural language processing tasks, like replacing words with their synonyms or creating semantically equivalent sentences.
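As a small illustration of image augmentation, the sketch below mirrors and crops an image stored as a nested list of pixel values. The helper names `mirror` and `crop` and the 3x3 toy image are hypothetical; real pipelines typically rely on an image library instead.

```python
def mirror(image):
    """Flip the image horizontally by reversing every row."""
    return [row[::-1] for row in image]


def crop(image, top, left, height, width):
    """Keep a height x width window whose top-left corner is (top, left)."""
    return [row[left:left + width] for row in image[top:top + height]]


# Hypothetical 3x3 grayscale image.
image = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
mirrored = mirror(image)
cropped = crop(image, top=0, left=0, height=2, width=2)
print(mirrored)
print(cropped)
```

Each variant keeps the label of the original image, so applying such transformations only to minority class images is one way to increase that class's share of the dataset.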

Researchers have found that data augmentation effectively increases model accuracy for computer vision and NLP tasks because it adds similar data at a low cost. However, these techniques call for some caution. For traditional geometric augmentations, the “safety” of a transformation should be checked before applying it. For example, rotating an image of a “9” would make it look like a “6,” changing its semantic meaning.⁷

Recent research

SMOTE extensions and deep learning have been the focus of upsampling techniques in recent years. These methods aim to improve model performance and address some of the shortcomings of upsampling, like introduced bias in the distribution of the minority class.

Some developments in SMOTE include a minority-predictive-probability SMOTE (MPP-SMOTE), which upsamples based on the estimated probability of seeing each minority class sample.⁸ Multi-Label Borderline Oversampling Technique (MLBOTE) has been proposed to extend SMOTE to multi-label classification.⁹ Both are reported to outperform the existing SMOTE variants they were compared against while retaining the patterns in the original data.

Neural networks have also been used to develop oversampling techniques. Generative Adversarial Networks have stirred some interest, producing promising results, although training time makes this technique slower than other traditional upsampling methods.¹⁰

Footnotes

¹ Haibo He and Edwardo Garcia, Learning from Imbalanced Data, IEEE, September 2009, https://ieeexplore.ieee.org/document/5128907 (link resides outside ibm.com).

² Kumar Abishek and Mounir Abdelaziz, Machine Learning for Imbalanced Data, Packt, November 2023, https://www.packtpub.com/product/machine-learning-for-imbalanced-data/9781801070836 (link resides outside ibm.com).

³ Kumar Abishek and Mounir Abdelaziz, Machine Learning for Imbalanced Data, Packt, November 2023, https://www.packtpub.com/product/machine-learning-for-imbalanced-data/9781801070836 (link resides outside ibm.com). Alberto Fernandez, et al., Learning from Imbalanced Data Sets, 2018.

⁴ Nitesh Chawla, et al., SMOTE: Synthetic Minority Over-sampling Technique, JAIR, 01 June 2002, https://www.jair.org/index.php/jair/article/view/10302 (link resides outside ibm.com).

⁵ Kumar Abishek and Mounir Abdelaziz, Machine Learning for Imbalanced Data, Packt, November 2023. Haibo He and Edwardo Garcia, Learning from Imbalanced Data, IEEE, September 2009, https://ieeexplore.ieee.org/document/5128907 (link resides outside ibm.com).

⁶ Alberto Fernandez, et al., Learning from Imbalanced Data Sets, Springer, 2018.

⁷ Connor Shorten and Taghi Khoshgoftaar, A survey on Image Data Augmentation for Deep Learning, Springer, 06 July 2019, https://journalofbigdata.springeropen.com/articles/10.1186/s40537-019-0197-0 (link resides outside ibm.com).

⁸ Zhen Wei, Li Zhang, and Lei Zhao, Minority prediction probability based oversampling technique for imbalanced learning, Science Direct, 06 December 2022, https://www.sciencedirect.com/science/article/abs/pii/S0020025522014578?casa_token=TVVIEM3xTDEAAAAA:LbzQSgIvuYDWbDTBKWb4ON-CUiTUg0EUeoQf9q12IjLgXFk0NQagfh0bU3DMUSyHL_mjd_V890o (link resides outside ibm.com).

⁹ Zeyu Teng, et al., Multi-label borderline oversampling technique, ScienceDirect, 14 September 2023, https://www.sciencedirect.com/science/article/abs/pii/S0031320323006519?casa_token=NO8dLh60_vAAAAAA:AWPCvCP8PQG43DvkQFChZF2-3uzB1GJBBtgPURevWe_-aR0-WTbLqOSAsiwxulNAuh_4mIDZx-Y (link resides outside ibm.com).

¹⁰ Justin Engelmann and Stefan Lessmann, Conditional Wasserstein GAN-based oversampling of tabular data for imbalanced learning, 15 July 2021, ScienceDirect, https://www.sciencedirect.com/science/article/abs/pii/S0957417421000233?casa_token=O0d1BtspA8YAAAAA:n2Uv3v2yHvjl9APVU9V_13rQ9K_KwT0P__nzd6hIngNcZJE-fmQufDgR6XT1uMmDBHx8bLXPVho (link resides outside ibm.com). Shuai Yang, et al., Fault diagnosis of wind turbines with generative adversarial network-based oversampling method, IOP Science, 12 January 2023, https://iopscience.iop.org/article/10.1088/1361-6501/acad20/meta (link resides outside ibm.com).
