Dialog-Based Interactive Image Retrieval

We proposed a new natural language-based system for interactive, fine-grained image retrieval [1]. There has been a flurry of recent research interest in visually grounded conversational agents, driven by the progress of deep learning techniques for both image and natural language understanding. A few interesting application scenarios have been explored by recent work, such as collaborative drawing [2] and visual dialogs [3]. In our recent NeurIPS 2018 paper [1], we explored a new paradigm of training visually grounded dialog agents for a practical and challenging real-world application. Specifically, we proposed a novel framework of an image retrieval system which learns to seek natural language feedback from the user and iteratively refines the retrieval result. Compared to conventional interactive image retrieval systems, which only allow for binary or fixed-form feedback, the natural language-based user interface used here is more natural and expressive. Ultimately, techniques that are successful on such tasks will form the basis for the high-fidelity, multi-modal, intelligent conversational systems of the future.

Figure 1. Natural language-based feedback provides a more natural and expressive interface than fixed-form feedback as commonly used in existing interactive image retrieval systems.

Interactive image retrieval via natural language feedback

The design of the proposed image retrieval agent is inspired by how a personal shopping assistant would interact with a customer: the agent learns about the visual features of the customer’s desired product by iteratively showing the customer candidate images and getting feedback on them. In this retrieval setting, deciding on which image from the image database should be shown to the user at each round and how to aggregate the user feedback over time are the essential problems to be addressed. The desired behavior of the retrieval system is to learn how to construct the optimal sequence of candidate images to obtain the most informative user feedback. Instead of specifying the rules for candidate image selection and feedback aggregation, we let the retrieval system (which we call the dialog manager) automatically learn to optimize the retrieval objective, which is the ranking percentile of the target image in the database. To the best of our knowledge, this is the first case of applying goal-oriented training to the context of interactive image retrieval.
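To make the retrieval objective concrete, here is a minimal sketch of a ranking percentile computation. It assumes images are represented as plain feature vectors and ranked by Euclidean distance to a query vector; the function name, the distance measure, and the toy data are illustrative assumptions, and the exact definition used in the paper may differ.

```python
import math


def ranking_percentile(query, database, target_index):
    """Fraction of the other database images ranked strictly worse than the target.

    1.0 means the target is the closest image to the query;
    0.0 means every other image is at least as close as the target.
    """
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    target_dist = dist(query, database[target_index])
    others = [feat for i, feat in enumerate(database) if i != target_index]
    worse = sum(1 for feat in others if dist(query, feat) > target_dist)
    return worse / len(others)


# Toy 2-D "image features" (made up for illustration).
database = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
query = [0.9, 0.1]

print(ranking_percentile(query, database, target_index=1))  # target is the nearest image
print(ranking_percentile(query, database, target_index=3))  # target is the farthest image
```

A higher percentile means the target image sits closer to the top of the ranked list, which is what the dialog manager is trained to achieve.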

Below is the neural network architecture for our dialog manager. A retrieval session consists of multiple rounds of interaction. During each round, a Response Encoder embeds the information from the current dialog turn into a visual-semantic feature vector; a State Tracker then integrates this vector with information from the previous rounds and outputs a history feature vector; lastly, a Candidate Generator uses the history feature vector from the State Tracker to find the most similar image in the database and returns it to the user. The entire network is trained end-to-end: it is first pre-trained with supervised learning and then fine-tuned with reinforcement learning.

Figure 2. Neural architecture of the dialog manager.
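The flow of data through the three components described above can be sketched in a few lines. This is only a toy illustration under stated assumptions: the real components are learned neural networks, whereas here the Response Encoder is an element-wise sum, the State Tracker is an exponential moving average, and the Candidate Generator is a nearest-neighbour lookup that skips images already shown. All function names, the `keep` parameter, and the feature values are hypothetical.

```python
def response_encoder(image_feat, text_feat):
    # Assumption: fuse the shown image and the user's feedback by element-wise sum.
    return [i + t for i, t in zip(image_feat, text_feat)]


def state_tracker(prev_state, response, keep=0.5):
    # Stand-in for a learned recurrent update: exponential moving average.
    return [keep * h + (1.0 - keep) * r for h, r in zip(prev_state, response)]


def candidate_generator(state, database, exclude):
    # Nearest neighbour by squared Euclidean distance, skipping already shown images.
    candidates = [i for i in range(len(database)) if i not in exclude]
    return min(
        candidates,
        key=lambda i: sum((s - d) ** 2 for s, d in zip(state, database[i])),
    )


def run_session(database, feedback_feats, first_index=0):
    """Run one retrieval session; returns the indices of the images shown."""
    state = [0.0] * len(database[0])
    shown = [first_index]
    for text_feat in feedback_feats:  # one entry per dialog round
        response = response_encoder(database[shown[-1]], text_feat)
        state = state_tracker(state, response)
        shown.append(candidate_generator(state, database, exclude=set(shown)))
    return shown


# Toy image features and hand-picked "encoded feedback" vectors (illustrative only).
database = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
feedback_feats = [[2.0, 0.0], [0.0, 2.0]]

print(run_session(database, feedback_feats))
```

The point of the sketch is the control flow: each round encodes the latest turn, folds it into the running state, and uses that state to pick the next image to show.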

Addressing the data sparsity challenge: a user simulator based on relative image captioning

One challenge remains in training the dialog manager: the lack of training data on user dialogs. The natural approach is to train the dialog manager online with human annotators in the loop. However, this procedure is prohibitively slow and expensive. It takes about one minute to collect a single dialog with 10 rounds of interaction, so 120k training dialogs would require 2k hours of annotation effort.
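The annotation estimate follows directly from the two figures quoted above, as this short calculation shows (the variable names are just for illustration):

```python
minutes_per_dialog = 1   # about one minute per 10-round dialog
num_dialogs = 120_000    # 120k training dialogs

annotation_hours = num_dialogs * minutes_per_dialog / 60
print(annotation_hours)  # 2000.0 hours, i.e. 2k hours
```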

To avoid this annotation cost, we employed model-based reinforcement learning for training the dialog manager. The user model is based on a novel computer vision task, relative image captioning, in which a model learns to describe the prominent visual differences between two images in natural language. The trained relative captioner serves as a proxy for human annotators and allows the dialog manager to be trained efficiently without costly annotation. We collected a dataset for relative image captioning and trained a captioner based on the Show, Attend and Tell model. We found that, although the generated descriptions differ from human-provided descriptions, in most cases the relative captioner is able to provide reasonable descriptions of the visual differences between any pair of images.

Figure 3. Examples of relative image captions which were used to train the user simulator.
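The role the user simulator plays in the loop can be illustrated with a toy example. In the sketch below, a rule-based function stands in for the trained relative captioner and a fixed update rule stands in for the dialog manager, so no learning takes place; it only shows how simulated feedback can replace a human annotator when generating dialogs. The attribute names, the functions, and the data are all hypothetical.

```python
ATTRIBUTES = ["heel_height", "brightness"]  # hypothetical attribute names


def simulate_feedback(candidate, target):
    """Rule-based stand-in for the relative captioner: name the largest difference."""
    diffs = [t - c for c, t in zip(candidate, target)]
    k = max(range(len(diffs)), key=lambda i: abs(diffs[i]))
    if diffs[k] == 0:
        return "same"
    direction = "more" if diffs[k] > 0 else "less"
    return f"{direction} {ATTRIBUTES[k]}"


def apply_feedback(query, feedback, step=1.0):
    """Stand-in for the dialog manager: shift the query along the named attribute."""
    if feedback == "same":
        return list(query)
    direction, name = feedback.split(" ", 1)
    updated = list(query)
    updated[ATTRIBUTES.index(name)] += step if direction == "more" else -step
    return updated


def nearest(query, database):
    return min(
        range(len(database)),
        key=lambda i: sum((q - d) ** 2 for q, d in zip(query, database[i])),
    )


def simulated_session(database, target_index, start_index, max_rounds=10):
    """Generate one dialog without a human; returns the indices of images shown."""
    shown = start_index
    history = [shown]
    query = list(database[shown])
    for _ in range(max_rounds):
        if shown == target_index:
            break
        feedback = simulate_feedback(database[shown], database[target_index])
        query = apply_feedback(query, feedback)
        shown = nearest(query, database)
        history.append(shown)
    return history


# Toy attribute scores per image (illustrative only).
database = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [2.0, 1.0]]
print(simulate_feedback(database[0], database[3]))
print(simulated_session(database, target_index=3, start_index=0))
```

In the actual system, dialogs generated this way would feed the reinforcement learning stage, with the trained captioner producing free-form sentences rather than fixed phrases.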

Code and data are available here:


[1] Xiaoxiao Guo*, Hui Wu*, Yu Cheng, Steven Rennie, Gerald Tesauro, Rogerio Schmidt Feris. “Dialog-based Interactive Image Retrieval.” NeurIPS 2018. (*equal contribution)

[2] Jin-Hwa Kim, Devi Parikh, Dhruv Batra, Byoung-Tak Zhang, Yuandong Tian. “CoDraw: Visual Dialog for Collaborative Drawing.” arXiv preprint arXiv:1712.05558, 2017.

[3] Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, José M. F. Moura, Devi Parikh, Dhruv Batra. “Visual Dialog.” CVPR 2017.

Research Staff Member, IBM Research

Hui Wu

Research Staff Member, IBM Research
