Restoring Balance in Machine Learning Datasets


If you want to teach a child what an elephant looks like, you have an infinite number of options. Take a photo from National Geographic, a stuffed animal of Dumbo, or an elephant keychain; show it to the child; and the next time he sees an object which looks like an elephant he will likely point and say the word.

Teaching AI what an elephant looks like is a bit different. To train a machine learning algorithm, you will likely need thousands of elephant images using different perspectives, such as head, tail, and profile. But then, even after ingesting thousands of photos, if you connect your algorithm to a camera and show it a pink elephant keychain, it likely won’t recognize it as an elephant.

This is a form of data bias, and it often hurts the accuracy of deep learning classifiers. To correct this bias in our example, we would need at least 50-100 images of pink elephants, which is problematic because pink elephants are rare.

This is a known challenge in the machine learning community, and whether it's pink elephants or road signs, small datasets present big challenges for AI scientists.

Restoring balance for training AI

Earlier this year, my colleagues and I at IBM Research in Zurich developed a solution. It's called BAGAN, or balancing generative adversarial network, and it can generate entirely new images, such as pink elephants, to restore balance when training AI.

For example, consider the classification of road traffic signs. All warning signs share the same external triangular shape. Once BAGAN learns to draw this shape from one sign, it can reuse it when drawing any other. Since BAGAN learns features from all classes while the goal is to generate images for the minority classes, a mechanism is needed to drive the generation process toward a desired class. To this end, we apply class conditioning in the latent space: we initialize the discriminator and generator of the GAN with an autoencoder, then leverage that autoencoder to learn class conditioning in the latent space, i.e. what the input of the generative model should look like for each class.
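The class-conditioning idea can be sketched in a few lines: encode the training images with the autoencoder, fit a per-class Gaussian over the latent codes, and sample from the Gaussian of a minority class to obtain latent vectors for the decoder-initialized generator. This is a minimal NumPy sketch of that one step, not the authors' implementation; the function names and the Gaussian latent model are illustrative assumptions.

```python
import numpy as np

def fit_class_conditional_latents(latents, labels):
    """Fit a multivariate Gaussian (mean, covariance) to the autoencoder's
    latent codes of each class. BAGAN-style class conditioning uses such
    per-class latent distributions to steer generation toward one class."""
    priors = {}
    for c in np.unique(labels):
        z = latents[labels == c]
        priors[int(c)] = (z.mean(axis=0), np.cov(z, rowvar=False))
    return priors

def sample_latents_for_class(priors, target_class, n, seed=None):
    """Draw n latent vectors for the target (e.g. minority) class.
    Feeding these to the generator, initialized from the autoencoder's
    decoder, would yield new images of that class."""
    rng = np.random.default_rng(seed)
    mean, cov = priors[target_class]
    return rng.multivariate_normal(mean, cov, size=n)

# Toy usage with random stand-in latents (8-dimensional, 3 classes):
rng = np.random.default_rng(0)
latents = rng.normal(size=(200, 8))
labels = rng.integers(0, 3, size=200)
priors = fit_class_conditional_latents(latents, labels)
z = sample_latents_for_class(priors, target_class=1, n=50, seed=0)
```

In the real system the latents come from an encoder network trained on all classes, which is what lets features shared across classes (like the triangular sign shape) benefit the minority classes.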

Seeing is believing

In the paper we report results with BAGAN on the German Traffic Sign Recognition Benchmark (GTSRB), as well as on MNIST and CIFAR-10. Compared against state-of-the-art GANs, our methodology outperforms them in the variety and quality of the generated images when the training dataset is imbalanced. In turn, this leads to higher accuracy for final classifiers trained on the augmented dataset.


Five representative samples for each class (row) in the CIFAR-10 dataset. For each class, these samples are obtained with generative models trained after dropping 40% of that class's images from the training set.


Five representative samples generated for the three most represented majority classes in the GTSRB dataset.


Five representative samples generated for the three least represented minority classes in the GTSRB dataset.

The work was recently published and the code has been made open source; it is freely available on GitHub.

BAGAN: Data Augmentation with Balancing GAN. Giovanni Mariani, Florian Scheidegger, Roxana Istrate, Costas Bekas, and Cristiano Malossi.
