RepMet: Representative-Based Metric Learning for Classification and Few-Shot Object Detection


Deep convolutional neural networks (CNNs) are currently the tool of choice for many computer vision tasks, including image classification and object detection. However, CNNs are notoriously data-hungry and can require thousands of training samples per category, while in many practical applications it is not feasible to collect more than a few samples per category. Few-shot learning aims to enable effective learning in these data-limited settings. Typically, few-shot learning relies on pre-training base models learned in advance from large datasets.

Recent studies have achieved significant advances in using CNNs for few-shot learning. This has been demonstrated for domain-specific tasks [1, 2], but few works have investigated the problem of few-shot object detection, where the task of recognizing instances of a category, represented by only a few examples, is complicated by the presence of the image background (which, in the few-shot case, is unobserved during training) and by the need to accurately localize the objects (Figure 1).

Figure 1

To advance the state of few-shot object detection, we developed a new approach presented in our paper at the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2019) in June 2019. The paper, titled “RepMet: Representative-based metric learning for classification and few-shot object detection,” demonstrates a practical approach to building a few-shot detector. RepMet allows new few-shot categories to be learned ‘on the fly’ by replacing the standard linear classifier within the object detection pipeline with a distance-based classifier. Whereas the set of visual categories must remain fixed for a linear classifier throughout model pre-training, a distance-based classifier can learn new categories from just a few presented samples. The probability that a candidate image region belongs to a given category is determined by the Euclidean distance between that region and the samples representing the category, computed in a specially learned embedding space. This family of methods is known as ‘metric learning,’ since the Euclidean distance in the embedding space captures a meaningful (semantic) distance between the visual content of the encoded image regions.
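As a minimal sketch of this idea, the snippet below soft-assigns an embedded region to categories by Euclidean distance to one representative vector per category, turning distances into a probability distribution via a Gaussian kernel. The single-representative setup and the `sigma` bandwidth are simplifying assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def class_probabilities(embedding, representatives, sigma=0.5):
    """Soft-assign one embedded region to categories by distance.

    embedding:       (d,) vector for a candidate image region.
    representatives: (C, d) matrix, one representative per category.
    sigma:           assumed fixed bandwidth (illustrative choice).
    """
    # Squared Euclidean distance from the region to each representative
    d2 = np.sum((representatives - embedding) ** 2, axis=1)
    # Gaussian kernel on distance, normalized into a distribution
    logits = -d2 / (2.0 * sigma ** 2)
    p = np.exp(logits - logits.max())  # subtract max for numerical stability
    return p / p.sum()
```

Because the classifier is defined purely by the representative vectors, adding a new category at test time only requires embedding its few support samples; no weights need retraining.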

In our RepMet approach, learning the embedding space within the standard detector training procedure required some innovations, such as learning the parameters of a Gaussian mixture model for each visual category. In practice, assuming a fixed variance and an isotropic metric, the only parameters required are the mode means of the model, which are essentially representative examples of the visual categories. These representatives are optimized jointly with the feature-extracting backbone and the embedding space (Figure 2). This joint optimization holds an advantage over the sampling strategy that conventional DNN-based metric learning would otherwise require when optimizing the metric for use within a detector. Intuitively, sampling cannot effectively account for the entire (very large) set of possible background regions of interest (ROIs) present in each training image, whereas joint training does, so a sampled approach is prone to higher false-positive rates after training.
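Under the fixed-variance, isotropic-metric assumption described above, scoring a region against a category with several representative modes can be sketched as taking the best mode response, a common simplification of a mixture likelihood. The tensor layout, the max-over-modes scoring, and `sigma` are illustrative assumptions, not the paper's exact training objective.

```python
import numpy as np

def class_scores(embedding, reps, sigma=0.5):
    """Score one embedded region against multi-mode categories.

    embedding: (d,) vector for a candidate image region.
    reps:      (C, K, d) array of K learned representative modes per
               category; these are the only learned classifier parameters.
    """
    # Squared distance from the region to every mode of every category
    d2 = np.sum((reps - embedding) ** 2, axis=2)       # shape (C, K)
    # Fixed-variance isotropic Gaussian response per mode
    mode_resp = np.exp(-d2 / (2.0 * sigma ** 2))       # shape (C, K)
    # A category's score is its best mode's response
    return mode_resp.max(axis=1)                       # shape (C,)
```

In the full system these representatives would be trained jointly with the backbone by gradient descent; here they are just fixed arrays to show the scoring rule.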

Figure 2

Figure 3


The performance of RepMet for few-shot detection was tested on an episodic benchmark defined over the ImageNet-LOC detection dataset. Each episode is an instance of the few-shot detection problem with m novel categories (unseen during base-model pre-training) and k training samples each, forming a k-shot-m-way detection task. The goal of the detector is then to detect objects belonging to the novel categories in the episode's set of query images. See Figure 4 for an example of a 1-shot-5-way episode with one query image. Experiments on the proposed benchmark demonstrated that RepMet achieves close to 80 percent mAP with as few as 10 examples per category in 5-way detection.
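The episodic evaluation protocol can be sketched as follows: sample m novel categories, take k support images from each, and draw query images from the remaining images of those categories. The `dataset` structure (a mapping from category name to image ids) is hypothetical, for illustration only.

```python
import random

def sample_episode(dataset, m=5, k=1, q=10, seed=0):
    """Build one k-shot-m-way detection episode.

    dataset: dict mapping category name -> list of image ids
             (hypothetical structure for illustration).
    Returns the m sampled categories, k support images per category,
    and up to q query images drawn from the categories' leftovers.
    """
    rng = random.Random(seed)
    cats = rng.sample(sorted(dataset), m)   # m novel categories
    support, pool = {}, []
    for c in cats:
        imgs = list(dataset[c])
        rng.shuffle(imgs)
        support[c] = imgs[:k]               # k training samples
        pool.extend(imgs[k:])               # leftovers become query candidates
    queries = rng.sample(pool, min(q, len(pool)))
    return cats, support, queries
```

Averaging detection mAP over many such episodes yields the benchmark score; support and query sets are disjoint by construction, so the detector is always evaluated on unseen images.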

Figure 4

RepMet is among the first few-shot detection methods proposed to date, and we hope it will inspire more researchers to investigate the important problem of few-shot detection.


  1. O. Vinyals, C. Blundell, T. Lillicrap, K. Kavukcuoglu, and D. Wierstra. Matching Networks for One Shot Learning. Advances In Neural Information Processing Systems (NIPS), 2016.


Research Staff Member, IBM Research
