June 18, 2019 | Written by: Leonid Karlinsky
Categorized: AI | IBM Research-Haifa
Deep convolutional neural networks (CNNs) are currently the tool of choice for many computer vision tasks, including image classification and object detection. However, CNNs are notoriously data-hungry and can require thousands of training samples per category. Yet, in many practical applications, it is not feasible to collect more than a few training samples per category. Few-shot learning aims to enable effective learning in these data-limited settings. Typically, few-shot learning relies on a base model pre-trained in advance on a large dataset.
Recent studies have achieved significant advances in using CNNs for few-shot learning. This has been demonstrated for domain-specific tasks [1,2], but few works have investigated few-shot object detection, where the task of recognizing instances of a category represented by only a few examples is complicated by the presence of the image background (surely unobserved during training in the few-shot case) and by the need to accurately localize the objects (Figure 1).
In order to advance the technology on few-shot object detection, we developed a new approach in our paper presented at the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2019) in June 2019. The paper, titled “RepMet: Representative-based metric learning for classification and few-shot object detection,” demonstrates a practical approach to building a few-shot detector. RepMet allows new few-shot categories to be learned ‘on the fly’ using a distance-based classifier that replaces the standard linear classifier within the object detection pipeline. Whereas a linear classifier fixes the set of visual categories throughout model pre-training, distance-based classification allows new categories to be learned simply by presenting a few samples. The probability that a candidate image region belongs to a given category is determined by the Euclidean distance, computed in a specially learned embedding space, between the region and the samples representing that category. This method generally bears the name ‘metric learning,’ since the Euclidean distance in the embedding space represents a meaningful (semantic) distance between the visual content of the encoded image regions.
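To make the idea concrete, here is a minimal sketch of distance-based classification, assuming one representative vector per category and a softmax over negative squared Euclidean distances; the function name, `sigma` value, and toy vectors are illustrative, not the paper's exact formulation:

```python
import numpy as np

def distance_based_probs(embedding, representatives, sigma=0.5):
    """Score an embedded image region against each category by its
    Euclidean distance to that category's representative vector.

    embedding:       (d,) embedding of a candidate region
    representatives: (n_classes, d), one representative per category
    Returns normalized class probabilities (closer = more likely).
    """
    # Squared Euclidean distance from the region to each representative
    d2 = np.sum((representatives - embedding) ** 2, axis=1)
    # Gaussian-style match scores: nearer representatives score higher
    scores = np.exp(-d2 / (2.0 * sigma ** 2))
    return scores / scores.sum()

# Two base categories, each a 2-D representative (toy example)
reps = np.array([[1.0, 0.0], [0.0, 1.0]])
region = np.array([0.9, 0.1])          # embedded candidate region
probs = distance_based_probs(region, reps)

# Learning a novel category 'on the fly' is just appending its
# representative -- no retraining of the classifier weights:
new_reps = np.vstack([reps, [[0.5, 0.5]]])
probs3 = distance_based_probs(region, new_reps)
```

The key property shown here is that adding a category changes only the set of representatives, which is what lets the detector absorb new few-shot categories without re-fitting a fixed-size linear classification layer.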
In our RepMet approach, learning the embedding space within the standard detector training procedure required some innovations, such as learning the parameters of a Gaussian mixture model for each visual category. In practice, assuming a fixed variance and an isotropic metric, the only parameters required are the mode means of the model, which are essentially representative examples of the visual categories. They are optimized jointly with the feature-extracting backbone and the embedding space (Figure 2). This joint optimization holds an advantage over the sampling strategy that would otherwise be needed if the metric were optimized using conventional DNN-based metric learning. Intuitively, sampling cannot effectively account for the entire (very large) set of possible background regions of interest (ROIs) present in each training image, whereas joint training does take them into account; sampling is therefore prone to higher false-positive rates after training.
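Under the fixed-variance, isotropic assumption described above, each class reduces to a small set of learned mode means, and a class's score can be taken as its best-matching mode. A hedged sketch of that scoring rule (shapes, `sigma`, and the max-over-modes reduction are illustrative assumptions):

```python
import numpy as np

def class_posteriors(embedding, mode_means, sigma=0.5):
    """Mixture-mode classifier sketch: each class keeps K representative
    vectors (mixture-mode means). With a fixed isotropic variance, the
    only learned parameters are the means themselves; a class's score
    here is its best (max) match over its K modes.

    embedding:  (d,) embedded candidate region
    mode_means: (n_classes, K, d) learned mode means per class
    Returns normalized per-class probabilities.
    """
    # Squared distance to every mode of every class -> (n_classes, K)
    d2 = np.sum((mode_means - embedding) ** 2, axis=2)
    mode_scores = np.exp(-d2 / (2.0 * sigma ** 2))
    class_scores = mode_scores.max(axis=1)   # best mode per class
    return class_scores / class_scores.sum()

# Toy setup: 2 classes, K=2 modes each, 2-D embedding space
mode_means = np.array([[[1.0, 0.0], [0.8, 0.2]],   # class 0 modes
                       [[0.0, 1.0], [0.2, 0.8]]])  # class 1 modes
p = class_posteriors(np.array([0.9, 0.1]), mode_means)
```

Multiple modes per class let a single category cover several visual appearances while still keeping the per-class parameter count to just K mean vectors.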
The performance of RepMet for few-shot detection was tested on an episodic benchmark defined over the ImageNet-LOC detection dataset. Each episode is an instance of the few-shot detection problem with m novel categories (unseen during base model pre-training) and k training samples each, yielding a k-shot-m-way detection task (episode). The goal of the detector is then to detect the objects belonging to the novel categories in a set of query images of the episode. See Figure 4 for an example of a 1-shot-5-way episode with one query image. The experiments over the proposed benchmark demonstrated that RepMet can achieve close to 80 percent mAP with as few as 10 examples per category in 5-way detection.
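The episodic protocol above can be sketched as a simple sampler. This is an illustrative reconstruction, not the benchmark's actual code: the dataset is assumed to be a dict mapping category names to image ids, and the function name and defaults are hypothetical:

```python
import random

def sample_episode(dataset, m=5, k=1, n_query=10, seed=0):
    """Build one k-shot-m-way detection episode.

    dataset: dict mapping category name -> list of image ids
             (hypothetical layout for illustration)
    Returns (novel categories, k support images per category,
    query images drawn from the remaining images of those categories).
    """
    rng = random.Random(seed)
    cats = rng.sample(sorted(dataset), m)     # m novel categories
    support, query_pool = {}, []
    for c in cats:
        imgs = dataset[c][:]
        rng.shuffle(imgs)
        support[c] = imgs[:k]                 # k training samples per class
        query_pool.extend(imgs[k:])           # held-out images become queries
    rng.shuffle(query_pool)
    return cats, support, query_pool[:n_query]

# Toy dataset: 8 categories with 6 images each
toy = {f"cat{i}": [f"cat{i}_img{j}" for j in range(6)] for i in range(8)}
cats, support, queries = sample_episode(toy, m=5, k=1, n_query=10)
```

Keeping support and query images disjoint within each episode is what makes the reported mAP measure generalization from the k shots rather than memorization.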
RepMet is one of the first few-shot detection methods proposed to date, and we hope it will inspire more researchers to investigate the important problem of few-shot detection.
1. F. Schroff, D. Kalenichenko, and J. Philbin. FaceNet: A unified embedding for face recognition and clustering. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 815–823, 2015.
2. O. Vinyals, C. Blundell, T. Lillicrap, K. Kavukcuoglu, and D. Wierstra. Matching Networks for One Shot Learning. Advances in Neural Information Processing Systems (NIPS), 2016.