RepMet: Representative-Based Metric Learning for Classification and Few-Shot Object Detection


Deep convolutional neural networks (CNNs) are currently the tool of choice for many computer vision tasks, including image classification and object detection. However, CNNs are notoriously data-hungry and can require thousands of training samples per category, while in many practical applications it is not feasible to collect more than a few training samples per category. Few-shot learning aims to enable effective learning in these data-limited settings. Typically, few-shot learning relies on pre-training base models learned in advance from large datasets.

Recent studies have achieved significant advances in using CNNs for few-shot learning. This has been demonstrated for domain-specific tasks [1, 2], but few works have investigated the problem of few-shot object detection, where the task of recognizing instances of a category, represented by a few examples, is complicated by the presence of the image background (surely unobserved during training in the few-shot case) and the need to accurately localize the objects (Figure 1).

Figure 1

To advance the technology on few-shot object detection, we developed a new approach in our paper presented at the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2019) in June 2019. The paper, titled "RepMet: Representative-based metric learning for classification and few-shot object detection," demonstrates a practical approach to building a few-shot detector. RepMet allows new few-shot categories to be learned 'on the fly' using a distance-based classifier that replaces the standard linear classifier within the object detection pipeline. Whereas a linear classifier fixes the set of visual categories throughout model pre-training, a distance-based classifier can learn new categories from just a few presented samples. The probability that a candidate image region belongs to a category is determined by the Euclidean distance, computed in a specially learned embedding space, between that region and the samples representing the category. This approach is generally known as 'metric learning,' since the Euclidean distance in the embedding space represents a meaningful (semantic) distance between the visual content of the encoded image regions.
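The core idea of such a distance-based classifier can be sketched in a few lines. The following is a minimal illustration, not the paper's implementation: the function name, the softmax-over-negative-distances scoring, and the toy embeddings are all assumptions made for clarity.

```python
import numpy as np

def distance_classifier(query_embedding, class_representatives):
    """Classify a query region by Euclidean distance to per-class representatives.

    query_embedding: (d,) embedding vector of a candidate image region.
    class_representatives: dict mapping category name -> (d,) representative vector.
    Returns a dict of category -> probability (softmax over negative distances,
    so the nearest representative gets the highest probability).
    """
    names = list(class_representatives)
    dists = np.array([np.linalg.norm(query_embedding - class_representatives[n])
                      for n in names])
    logits = -dists
    logits -= logits.max()          # subtract max for numerical stability
    probs = np.exp(logits)
    probs /= probs.sum()
    return dict(zip(names, probs))
```

Because the classifier depends only on the representative vectors, adding a new category at test time means simply embedding a few of its samples; no retraining of the classifier weights is needed.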

In our RepMet approach, learning the embedding space within the standard detector training procedure required some innovations, such as learning the parameters of a Gaussian mixture model for each visual category. In practice, assuming a fixed variance and an isotropic metric, the only parameters required are the mode means of the model, which are essentially representative examples of the visual categories. They are optimized jointly with the feature-extracting backbone and the embedding space (Figure 2). This joint optimization holds an advantage over the sampling strategy that would otherwise be needed if the metric were trained with conventional DNN metric-learning methods: sampling cannot effectively account for the entire (very large) set of possible background regions of interest (ROIs) present in each training image, which joint training does take into account, and it is therefore prone to higher false-positive rates after training.
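Under the fixed-variance, isotropic-metric assumption, the score of a category with several representative modes reduces to a simple expression. The sketch below is illustrative only: the function name, the choice of sigma, and the use of the closest mode (max over per-mode scores) are our assumptions for this example.

```python
import numpy as np

def class_score(embedding, modes, sigma=0.5):
    """Score a category given its K representative modes in the embedding space.

    embedding: (d,) embedded region; modes: (K, d) representative vectors.
    With fixed isotropic variance sigma, each mode contributes
    exp(-||x - mu_k||^2 / (2 * sigma^2)); here the class score is taken as the
    contribution of the closest mode (an illustrative choice, not the exact
    formulation from the paper).
    """
    sq_dists = np.sum((modes - embedding) ** 2, axis=1)  # squared distance to each mode
    return float(np.exp(-sq_dists / (2 * sigma ** 2)).max())
```

Because only the mode means are learned, the per-category parameter count stays small, which is what makes jointly optimizing them with the backbone tractable.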

Figure 2

Figure 3


The performance of RepMet for few-shot detection was tested on an episodic benchmark defined over the ImageNet-LOC detection dataset. Each episode is an instance of the few-shot detection problem with m novel categories (unseen during base-model pre-training) and k training samples each, forming a k-shot-m-way detection task. The goal of the detector is then to detect the objects belonging to the novel categories in the episode's set of query images. See Figure 4 for an example of a 1-shot-5-way episode with one query image. Experiments on the proposed benchmark demonstrate that RepMet can achieve close to 80 percent mAP with as few as 10 examples per category in 5-way detection.
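The episodic protocol above can be made concrete with a small sampler. This is a hedged sketch of the general k-shot-m-way episode structure, not the benchmark's actual sampling code; the function name, the query count per category, and the synthetic image IDs are assumptions.

```python
import random

def sample_episode(categories, samples_per_cat, m=5, k=1, q=2, seed=0):
    """Sample one k-shot-m-way detection episode.

    categories: list of novel category names (unseen during pre-training).
    samples_per_cat: dict mapping category name -> list of image IDs.
    Returns (support, queries): support maps each of the m chosen categories
    to its k training images; queries pools q held-out images per category.
    """
    rng = random.Random(seed)
    chosen = rng.sample(categories, m)   # pick m novel categories for this episode
    support, queries = {}, []
    for c in chosen:
        imgs = samples_per_cat[c][:]
        rng.shuffle(imgs)
        support[c] = imgs[:k]            # k training samples for the category
        queries.extend(imgs[k:k + q])    # disjoint held-out query images
    return support, queries
```

Reporting results as averages over many such episodes, rather than over a single split, is what makes few-shot metrics like the mAP figure above comparable across methods.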

Figure 4

RepMet is one of the first few-shot detection methods proposed to date, and we hope it will inspire more researchers to investigate this important problem.


  1. O. Vinyals, C. Blundell, T. Lillicrap, K. Kavukcuoglu, and D. Wierstra. Matching Networks for One Shot Learning. Advances In Neural Information Processing Systems (NIPS), 2016.


Research Staff Member, IBM Research
