Downsampling is an effective way to address imbalances within a dataset. An imbalanced dataset is one in which a class is greatly underrepresented relative to the true population, creating unintended bias. For instance, imagine a model trained to classify images as showing a cat or a dog, using a dataset composed of 90% cats and 10% dogs. Cats are overrepresented in this scenario: a classifier that predicts "cat" every time will reach 90% overall accuracy, correctly identifying every cat but not a single dog. An imbalanced dataset thus encourages classifiers to favor accuracy on the majority class at the expense of the minority class. The same issue can arise with multi-class datasets.1
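The arithmetic behind the cat-and-dog example can be checked with a short sketch. The labels below are hypothetical stand-ins for the 90/10 split described above, and the "classifier" is simply a list of constant predictions rather than a trained model.

```python
# Hypothetical labels mirroring the 90% cats / 10% dogs split in the example.
labels = ["cat"] * 90 + ["dog"] * 10

# A trivial "classifier" that predicts the majority class for every image.
predictions = ["cat"] * len(labels)


def recall(labels, predictions, cls):
    """Fraction of the true members of `cls` that were predicted as `cls`."""
    hits = sum(1 for y, p in zip(labels, predictions) if y == cls and p == cls)
    total = sum(1 for y in labels if y == cls)
    return hits / total


# Overall accuracy: share of all images whose prediction matches the label.
accuracy = sum(1 for y, p in zip(labels, predictions) if y == p) / len(labels)
cat_recall = recall(labels, predictions, "cat")
dog_recall = recall(labels, predictions, "dog")

print(accuracy, cat_recall, dog_recall)  # 0.9 1.0 0.0
```

The overall accuracy looks strong even though no dog is ever recognized, which is why accuracy alone can hide the effect of an imbalance.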
The process of downsampling counteracts the imbalanced dataset issue. It identifies majority class points to remove based on specified criteria, which vary with the chosen downsampling technique. Removing these points decreases the number of samples in the overrepresented majority class until the dataset contains an equal ratio of points across all classes.
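As one illustration, the simplest removal criterion is random selection. The sketch below assumes that criterion; the function name `random_downsample`, the `seed` parameter, and the sample data are hypothetical, and other downsampling techniques would choose the points to discard differently.

```python
import random
from collections import defaultdict


def random_downsample(samples, labels, seed=0):
    """Randomly discard points from larger classes until every class
    has as many points as the smallest class."""
    rng = random.Random(seed)

    # Group the indices of the data points by class label.
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    # Every class is reduced to the size of the smallest class.
    target = min(len(indices) for indices in by_class.values())

    keep = []
    for indices in by_class.values():
        keep.extend(rng.sample(indices, target))
    keep.sort()  # preserve the original ordering of the retained points

    return [samples[i] for i in keep], [labels[i] for i in keep]


# Hypothetical dataset: 90 cat images and 10 dog images.
samples = [f"img_{i}" for i in range(100)]
labels = ["cat"] * 90 + ["dog"] * 10

balanced_samples, balanced_labels = random_downsample(samples, labels)
cat_count = balanced_labels.count("cat")
dog_count = balanced_labels.count("dog")
print(cat_count, dog_count)  # 10 10
```

Random removal is cheap but may throw away informative majority class points, which is one reason techniques with more selective criteria exist.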
While an imbalance can be seen by simply plotting the counts of data points in each class, such a plot doesn’t tell us whether the imbalance will greatly affect the model. Fortunately, we can use performance metrics to gauge how well a downsampling technique corrects for class imbalance. Most of these metrics apply to binary classification, where there are only two classes: a positive and a negative. Usually, the positive class is the minority class while the negative class is the majority class. Two popular metrics are Receiver Operating Characteristic (ROC) curves and precision-recall curves.1
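To make the two curves concrete: a ROC curve plots the true positive rate against the false positive rate as the decision threshold varies, while a precision-recall curve plots precision against recall (recall being the same quantity as the true positive rate). The following is a minimal sketch that computes the points of both curves by hand; the helper `curve_points`, the labels, and the scores are hypothetical illustrations, and it assumes both classes are present. In practice, a library routine such as scikit-learn's `roc_curve` or `precision_recall_curve` would normally be used instead.

```python
def curve_points(y_true, scores):
    """Return (threshold, FPR, TPR, precision) for each distinct score
    used as a cutoff. TPR is also the recall of the positive class.
    Assumes y_true contains at least one positive and one negative."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = []
    for threshold in sorted(set(scores), reverse=True):
        # Points scoring at or above the threshold are predicted positive.
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        tpr = tp / positives
        fpr = fp / negatives
        # tp + fp >= 1 here because each threshold is an observed score.
        precision = tp / (tp + fp)
        points.append((threshold, fpr, tpr, precision))
    return points


# Hypothetical data: 1 = positive (minority) class, 0 = negative (majority) class.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.1, 0.2, 0.3, 0.35, 0.4, 0.5, 0.6, 0.8, 0.7, 0.9]

points = curve_points(y_true, scores)
for threshold, fpr, tpr, precision in points:
    print(f"t={threshold:.2f}  FPR={fpr:.3f}  TPR={tpr:.3f}  precision={precision:.3f}")
```

Plotting the (FPR, TPR) pairs gives the ROC curve and the (TPR, precision) pairs give the precision-recall curve. Comparing these curves for models trained before and after downsampling is one way to gauge whether the technique helped.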