We proposed a new natural language-based system for interactive, fine-grained image retrieval. There has been a flurry of recent research interest in visually grounded conversational agents, driven by the progress of deep learning techniques for both image and natural language understanding. Recent work has explored several interesting application scenarios, such as collaborative drawing and visual dialog. In our recent NeurIPS 2018 paper, we explored a new paradigm for training visually grounded dialog agents for a practical and challenging real-world application. Specifically, we proposed a novel framework for an image retrieval system that learns to seek natural language feedback from the user and iteratively refines the retrieval result. Compared to conventional interactive image retrieval systems, which only allow binary or fixed-form feedback, the natural language-based user interface used here is more natural and expressive. Ultimately, techniques that succeed on such tasks will form the basis for the high-fidelity, multi-modal, intelligent conversational systems of the future.
Figure 1. Natural language-based feedback provides a more natural and expressive interface than the fixed-form feedback commonly used in existing interactive image retrieval systems.
Interactive image retrieval via natural language feedback
The design of the proposed image retrieval agent is inspired by how a personal shopping assistant interacts with a customer: the agent learns about the visual features of the customer’s desired product by iteratively showing the customer candidate images and getting feedback on them. In this retrieval setting, the essential problems are deciding which image from the database to show the user at each round and how to aggregate the user’s feedback over time. The desired behavior of the retrieval system is to learn to construct the sequence of candidate images that elicits the most informative user feedback. Instead of hand-specifying rules for candidate image selection and feedback aggregation, we let the retrieval system (which we call the dialog manager) automatically learn to optimize the retrieval objective, namely the ranking percentile of the target image in the database. To the best of our knowledge, this is the first application of goal-oriented training to interactive image retrieval.
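The ranking-percentile objective mentioned above can be made concrete with a small sketch: given a query representation and a feature vector per database image, the percentile is the fraction of images scored at or below the target. This is an illustrative toy, not the paper's implementation; the feature shapes and the choice of cosine similarity are assumptions.

```python
import numpy as np

def ranking_percentile(query_feature, database_features, target_index):
    """Fraction of database images whose similarity to the query is at or
    below the target image's similarity (1.0 means the target ranks first).
    Illustrative sketch; cosine similarity is an assumed scoring choice."""
    # Normalize so the dot product becomes cosine similarity.
    q = query_feature / np.linalg.norm(query_feature)
    db = database_features / np.linalg.norm(database_features, axis=1, keepdims=True)
    similarities = db @ q
    # Rank of the target: how many images score at or below it (itself included).
    rank = np.sum(similarities <= similarities[target_index])
    return rank / len(database_features)
```

Used as a reward, this quantity directly rewards the dialog manager for moving the target image toward the top of the retrieval list.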
Below is the neural network architecture for our dialog manager. A retrieval session consists of multiple rounds of interaction. During each round, a Response Encoder embeds the information from the current dialog turn into a visual-semantic feature vector; a State Tracker then integrates this vector with information from previous interactions and outputs a history feature vector; and lastly, a Candidate Generator uses the history feature vector to find the most similar image in the database to return to the user. The entire network is trained end-to-end, using a training strategy of supervised pre-training followed by fine-tuning with reinforcement learning.
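The per-round data flow through the three components can be sketched as follows. This is a minimal toy with random linear layers standing in for the learned networks; the dimensions, the concatenation-based fusion, and the GRU-like state update are all assumptions for illustration, not the paper's exact architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
FEAT_DIM = 8  # illustrative feature dimensionality

# Toy parameters standing in for the three learned components.
W_enc = rng.standard_normal((FEAT_DIM, 2 * FEAT_DIM)) * 0.1
W_h = rng.standard_normal((FEAT_DIM, FEAT_DIM)) * 0.1
W_x = rng.standard_normal((FEAT_DIM, FEAT_DIM)) * 0.1

def response_encoder(image_feature, feedback_embedding):
    """Fuse the shown image and the embedded user feedback into one
    visual-semantic vector (toy: concatenation + linear projection)."""
    fused = np.concatenate([image_feature, feedback_embedding])
    return np.tanh(W_enc @ fused)

def state_tracker(history, response_vector):
    """Fold the new turn into the dialog history (toy recurrent update)."""
    return np.tanh(W_h @ history + W_x @ response_vector)

def candidate_generator(history, database):
    """Return the index of the database image closest to the history state."""
    return int(np.argmax(database @ history))

database = rng.standard_normal((20, FEAT_DIM))  # toy image database
history = np.zeros(FEAT_DIM)
shown = 0  # first candidate shown to the user

for turn in range(3):  # three rounds of interaction
    feedback = rng.standard_normal(FEAT_DIM)  # stand-in for embedded user text
    response = response_encoder(database[shown], feedback)
    history = state_tracker(history, response)
    shown = candidate_generator(history, database)  # next image to show
```

In the actual system, each component is a trained neural network and the feedback embedding comes from a sentence encoder, but the round structure is the same.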
Figure 2. Neural architecture of the dialog manager.
Addressing the data sparsity challenge: a user simulator based on relative image captioning
One challenge remains in training the dialog manager: the lack of training data on user dialogs. The natural approach would be to train the dialog manager online with human annotators in the loop. However, this procedure is prohibitively slow and expensive: collecting one dialog with 10 rounds of interaction takes about one minute, so 120k training dialogs would require roughly 2,000 hours of annotation effort.
To this end, we employed model-based reinforcement learning to train the dialog manager. The user model is based on a novel computer vision task, relative image captioning, which learns to describe prominent visual differences between two images in natural language. The trained relative captioner serves as a proxy for human annotators and allows efficient training of the dialog manager without costly annotation. We collected a dataset for relative image captioning and trained a Show, Attend and Tell-based captioner. We found that although the generated descriptions differ from human-provided ones, in most cases the relative captioner provides reasonable descriptions of the visual differences between a pair of images.
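The role the simulator plays in training can be sketched as a simulated retrieval episode. In this hypothetical toy, the "simulated user" returns a feature-difference vector instead of generating a natural-language relative caption, and feedback aggregation is a naive running sum; both are stand-ins for the learned components described above, chosen only to make the loop runnable.

```python
import numpy as np

rng = np.random.default_rng(1)
N, D = 50, 8
database = rng.standard_normal((N, D))  # toy image features

def simulated_user(target_idx, candidate_idx):
    """Stand-in for the relative captioner: rather than describing in words
    how the target differs from the shown candidate, it returns the raw
    feature difference. (Hypothetical toy, not the paper's simulator.)"""
    return database[target_idx] - database[candidate_idx]

def ranking_percentile(query, target_idx):
    """Training signal: fraction of images scored at or below the target."""
    scores = database @ query
    return np.sum(scores <= scores[target_idx]) / N

# One simulated retrieval episode, no human annotator required.
target = 7
query = np.zeros(D)
candidate = int(rng.integers(N))
for turn in range(5):
    feedback = simulated_user(target, candidate)
    query = query + feedback                      # naive feedback aggregation
    candidate = int(np.argmax(database @ query))  # next candidate to show
    reward = ranking_percentile(query, target)    # reward for RL fine-tuning
```

Because every step of the episode is simulated, the dialog manager can be trained on an essentially unlimited number of such episodes before (or instead of) any human-in-the-loop data collection.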
Figure 3. Examples of relative image captions which were used to train the user simulator.