RepMet: Representative-Based Metric Learning for Classification and Few-Shot Object Detection


Deep convolutional neural networks (CNNs) are currently the tool of choice for many computer vision tasks, including image classification and object detection. However, CNNs are notoriously data-hungry and can require thousands of training samples per category. In many practical applications, it is not feasible to collect more than a few training samples per category. Few-shot learning aims to enable effective learning in these data-limited settings. Typically, it relies on base models that are pre-trained in advance on large datasets.

Recent studies have achieved significant advances in using CNNs for few-shot learning. This has been demonstrated for domain-specific tasks [1,2], but few works have investigated few-shot object detection. There, the task of recognizing instances of a category represented by only a few examples is complicated by two factors: the presence of image background, which is almost certainly unobserved during training in the few-shot case, and the need to accurately localize the objects (Figure 1).

RepMet figure

Figure 1

To advance few-shot object detection, we developed a new approach, described in our paper presented at the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2019) in June 2019. The paper, titled “RepMet: Representative-based metric learning for classification and few-shot object detection,” demonstrates a practical approach for building a few-shot detector. RepMet allows new few-shot categories to be learned ‘on the fly’ using a distance-based classifier that replaces the standard linear classifier within the object detection process. With a linear classifier, the set of visual categories must stay fixed throughout model pre-training; distance-based classification instead allows new categories to be learned by presenting just a few samples. The probability that a candidate image region belongs to a category is determined by the Euclidean distance between the region and the samples representing that category, computed in a specially learned embedding space. This general method is known as ‘metric learning,’ since the Euclidean distance in the embedding space represents a meaningful (semantic) distance between the visual content of the encoded image regions.
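To make the ‘on the fly’ idea concrete, here is a minimal Python sketch of a nearest-representative classifier. It is an illustration only, not the RepMet implementation: the class name `DistanceClassifier`, its methods, and the toy 2-D embeddings are all hypothetical, and the embeddings are assumed to have already been produced by some learned embedding network.

```python
import numpy as np


class DistanceClassifier:
    """Hypothetical nearest-representative classifier over precomputed embeddings."""

    def __init__(self):
        # category name -> array of shape (n_representatives, dim)
        self.representatives = {}

    def add_category(self, name, embeddings):
        # Registering a category only stores a few embedded examples;
        # nothing is retrained.
        self.representatives[name] = np.atleast_2d(
            np.asarray(embeddings, dtype=float)
        )

    def distances(self, x):
        # Smallest Euclidean distance from x to any representative of each category.
        x = np.asarray(x, dtype=float)
        return {
            name: float(np.linalg.norm(reps - x, axis=1).min())
            for name, reps in self.representatives.items()
        }

    def predict(self, x):
        # The category with the closest representative wins.
        d = self.distances(x)
        return min(d, key=d.get)


clf = DistanceClassifier()
clf.add_category("cat", [[0.0, 0.0], [0.2, 0.1]])
clf.add_category("dog", [[3.0, 3.0]])
before = clf.predict([1.9, 2.1])

# A new category is introduced with a single example, without retraining.
clf.add_category("fox", [[2.0, 2.0]])
after = clf.predict([1.9, 2.1])
print(before, after)
```

In this toy setup the same query point is assigned to "dog" before the new category is registered and to "fox" afterwards, which is the behavior a fixed linear classifier cannot offer without retraining its output layer.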

In RepMet, learning the embedding space within the standard detector training procedure required some innovations, such as learning the parameters of a Gaussian mixture model for each visual category. In practice, assuming a fixed variance and an isotropic metric, the only parameters required are the means of the mixture modes, which are essentially representative examples of the visual categories. These representatives are optimized jointly with the feature-extracting backbone and the embedding space (Figure 2). Joint optimization has an advantage over the sampling strategy that would otherwise be needed if the metric used within the detector were trained with conventional DNN-based metric learning. Intuitively, sampling cannot effectively account for the entire (very large) set of possible background regions of interest (ROIs) in each training image, whereas joint training does, so a sampling-based approach is prone to higher false positive rates after training.
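As a rough illustration of how fixed-variance, isotropic mixture modes can turn distances into class scores, consider the sketch below. Several details are assumptions made for the example rather than statements from this post: the function name `class_posteriors`, the value of `sigma`, taking the best-matching mode per class, and scoring background as the complement of the best class score. The representatives are given as constants here, whereas in RepMet they are learned jointly with the network.

```python
import numpy as np


def class_posteriors(embedding, representatives, sigma=0.5):
    """Score an embedded region against per-class representatives.

    representatives: array of shape (n_classes, n_modes, dim), the mode means.
    sigma: fixed standard deviation shared by all modes (assumed value).
    """
    e = np.asarray(embedding, dtype=float)
    reps = np.asarray(representatives, dtype=float)

    # Squared Euclidean distance from the embedding to every mode mean.
    sq_dist = np.sum((reps - e) ** 2, axis=-1)          # (n_classes, n_modes)

    # Unnormalized isotropic Gaussian response of each mode.
    mode_prob = np.exp(-sq_dist / (2.0 * sigma ** 2))   # (n_classes, n_modes)

    # Assumption: a class is scored by its best-matching mode.
    class_prob = mode_prob.max(axis=1)                  # (n_classes,)

    # Assumption: background is whatever no class explains well.
    background = 1.0 - class_prob.max()
    return class_prob, background


reps = np.array([
    [[0.0, 0.0], [1.0, 1.0]],   # class 0, two modes
    [[4.0, 4.0], [5.0, 5.0]],   # class 1, two modes
])
probs, bg = class_posteriors([1.0, 1.0], reps)
far_probs, far_bg = class_posteriors([20.0, -20.0], reps)
print(probs, bg, far_bg)
```

A region that coincides with a representative gets a class score of 1 and a background score of 0, while a region far from every representative is scored almost entirely as background. Because every ROI is scored this way, background regions contribute to the loss directly instead of having to be sampled.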

RepMet figure

Figure 2

RepMet figure

Figure 3


The performance of RepMet for few-shot detection was tested on an episodic benchmark defined over the ImageNet-LOC detection dataset. Each episode is an instance of the few-shot detection problem: a k-shot, m-way detection task with m novel categories (unseen during base model pre-training) and k training samples per category. The goal of the detector is to detect objects belonging to the novel categories in the episode’s set of query images. See Figure 4 for an example of a 1-shot, 5-way episode with one query image. Experiments on the proposed benchmark demonstrated that RepMet can achieve close to 80 percent mAP with as few as 10 examples per category in 5-way detection.
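The episode structure described above can be sketched in a few lines of Python. This is a simplified illustration, not the benchmark's actual code: the function `sample_episode`, its parameters, and the file names are hypothetical, and each image is assumed to belong to a single category, which real detection images need not satisfy.

```python
import random


def sample_episode(images_by_category, m_way, k_shot, n_query, seed=None):
    """Draw one k-shot, m-way episode from a pool of novel categories.

    images_by_category: dict mapping category name -> list of image ids.
    Returns (support, query): support maps each chosen category to its k
    training images; query is a shuffled list of (image id, category) pairs.
    """
    rng = random.Random(seed)

    # Pick the m novel categories for this episode.
    categories = rng.sample(sorted(images_by_category), m_way)

    support, query = {}, []
    for cat in categories:
        # Draw disjoint support and query images for the category.
        picked = rng.sample(images_by_category[cat], k_shot + n_query)
        support[cat] = picked[:k_shot]
        query.extend((img, cat) for img in picked[k_shot:])

    rng.shuffle(query)
    return support, query


# Toy pool: 8 categories with 12 image ids each (made-up names).
pool = {f"class_{c}": [f"class_{c}_img_{i}.jpg" for i in range(12)]
        for c in range(8)}

support, query = sample_episode(pool, m_way=5, k_shot=1, n_query=2, seed=0)
print(len(support), len(query))
```

Averaging a detection metric such as mAP over many such randomly drawn episodes gives a benchmark score that does not depend on one particular choice of novel categories or training examples.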

RepMet figure

Figure 4

RepMet is one of the first few-shot detection methods proposed to date, and we hope it will inspire more researchers to investigate the important problem of few-shot detection.


  1. O. Vinyals, C. Blundell, T. Lillicrap, K. Kavukcuoglu, and D. Wierstra. Matching Networks for One Shot Learning. Advances In Neural Information Processing Systems (NIPS), 2016.


