Training AI models for prediction tasks like classification or regression typically requires labeled data: annotated data points that provide necessary context and demonstrate the correct prediction (output) for each sample input. During training, a loss function measures the difference (loss) between the model’s prediction for a given input and the “ground truth” provided by that input’s label. Models learn from these labeled examples through optimization techniques like gradient descent, which update model weights to minimize loss. Because human-provided labels guide this process, it is called “supervised” learning.
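
The training loop described above can be sketched in a few lines of Python. This is a minimal illustration, not a production recipe: the model (a one-variable linear fit), the toy dataset, the mean squared error loss, the learning rate and the step count are all assumptions chosen to keep the example small.

```python
# Minimal supervised learning sketch: fit y = w*x + b by gradient descent.
# The data, learning rate and step count are illustrative assumptions.

def mse_loss(w, b, inputs, labels):
    """Mean squared error between predictions and ground-truth labels."""
    n = len(inputs)
    return sum((w * x + b - y) ** 2 for x, y in zip(inputs, labels)) / n

def gradient_step(w, b, inputs, labels, lr):
    """One gradient descent update: move the weights against the loss gradient."""
    n = len(inputs)
    grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(inputs, labels)) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in zip(inputs, labels)) / n
    return w - lr * grad_w, b - lr * grad_b

def train(inputs, labels, lr=0.05, steps=2000):
    """Start from zero weights and repeatedly apply gradient steps."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        w, b = gradient_step(w, b, inputs, labels, lr)
    return w, b

# Labeled examples: each input x is paired with a label generated by y = 2x + 1.
inputs = [0.0, 1.0, 2.0, 3.0, 4.0]
labels = [1.0, 3.0, 5.0, 7.0, 9.0]

initial_loss = mse_loss(0.0, 0.0, inputs, labels)
w, b = train(inputs, labels)
final_loss = mse_loss(w, b, inputs, labels)
print(f"w={w:.4f}, b={b:.4f}, loss {initial_loss:.2f} -> {final_loss:.2e}")
```

Every update in this sketch depends on the labels: without the ground-truth values there is no loss to compute and nothing to minimize.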

Properly labeling data becomes increasingly labor-intensive for complex AI tasks. For example, to train an image classification model to differentiate between cars and motorcycles, hundreds (if not thousands) of training images must be labeled “car” or “motorcycle.” For a more detailed computer vision task, like object detection, humans must annotate not only the object(s) each image contains, but also where each object is located. For even more detailed tasks, like image segmentation, the labels must mark the pixel-by-pixel boundaries of the different segments in each image.
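
To make the growth in annotation effort concrete, the sketch below shows what the labels for a single tiny image might look like under each of the three tasks. The record layout, field names and the 4x6-pixel image are hypothetical, chosen only for illustration; real annotation formats differ.

```python
# Hypothetical annotation records for one 4x6-pixel image.
# All field names and values are illustrative, not a standard format.

# Image classification: a single label for the whole image.
classification_label = "car"

# Object detection: a class label plus a bounding box per object.
# Boxes are (x_min, y_min, x_max, y_max) in pixel coordinates.
detection_labels = [
    {"label": "car", "box": (0, 1, 3, 4)},
    {"label": "motorcycle", "box": (4, 2, 6, 4)},
]

# Image segmentation: one class ID per pixel
# (0 = background, 1 = car, 2 = motorcycle).
segmentation_mask = [
    [0, 0, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 1, 1, 0, 2, 2],
    [1, 1, 1, 0, 2, 2],
]

def annotated_values(detections, mask):
    """Count the values an annotator must supply for each task."""
    return {
        "classification": 1,
        # One class label plus the box coordinates for every object.
        "detection": sum(1 + len(d["box"]) for d in detections),
        # One class ID for every pixel in the image.
        "segmentation": sum(len(row) for row in mask),
    }

counts = annotated_values(detection_labels, segmentation_mask)
print(counts)
```

Even on this toy image the number of annotated values rises from 1 to 10 to 24, and for segmentation it scales with image resolution rather than with the number of objects.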

Labeling data can thus be particularly tedious for certain use cases. In more specialized machine learning use cases, like drug discovery, genetic sequencing or protein classification, data annotation is not only extremely time-consuming, but also requires very specific domain expertise.

Semi-supervised learning offers a way to extract maximum benefit from a limited supply of labeled data while also making use of relatively abundant unlabeled data.