Overcoming Challenges In Automated Image Captioning

Automatic image captioning remains challenging despite the recent impressive progress in neural image captioning. In the paper "Adversarial Semantic Alignment for Improved Image Captions," appearing at the 2019 Conference on Computer Vision and Pattern Recognition (CVPR), we, together with several other IBM Research AI colleagues, address three main challenges in bridging the semantic gap between visual scenes and language in order to produce diverse, creative and human-like captions.

Compositionality and Naturalness

The first challenge stems from the compositional nature of natural language and visual scenes. While the training dataset contains co-occurrences of some objects in their context, a captioning system should be able to generalize by composing objects in other contexts.

Traditional captioning systems suffer from a lack of compositionality and naturalness: they typically generate captions sequentially, with the next word depending on both the previous word and the image features. This frequently leads to syntactically correct but semantically irrelevant language structures, as well as to a lack of diversity in the generated captions. We propose to address the compositionality issue with a context-aware attention captioning model, which allows the captioner to compose sentences based on fragments of the observed visual scenes. Specifically, we used a recurrent language model with a gated recurrent visual attention that, at every generation step, can choose to attend either to visual cues or to the textual cues from the last generation step.
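To make the gating idea more concrete, below is a minimal sketch in PyTorch of one decoding step in which a learned sigmoid gate mixes an attended visual cue with the textual cue from the previous word. This is an illustration under simplifying assumptions, not the exact architecture from the paper; the module names, dimensions, and single-gate formulation are all hypothetical.

```python
import torch
import torch.nn as nn


class GatedAttentionDecoderStep(nn.Module):
    """One decoding step with a gate between visual and textual cues (illustrative sketch)."""

    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512, feat_dim=2048):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.attn = nn.Linear(hidden_dim + feat_dim, 1)   # scores image regions given the state
        self.gate = nn.Linear(hidden_dim + embed_dim, 1)  # decides visual vs. textual cue
        self.visual_proj = nn.Linear(feat_dim, embed_dim)
        self.gru = nn.GRUCell(embed_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, prev_word, hidden, regions):
        # prev_word: (batch,) token ids; hidden: (batch, hidden_dim)
        # regions: (batch, num_regions, feat_dim) image region features
        w = self.embed(prev_word)                                      # textual cue
        h_exp = hidden.unsqueeze(1).expand(-1, regions.size(1), -1)
        scores = self.attn(torch.cat([h_exp, regions], dim=-1)).squeeze(-1)
        alpha = torch.softmax(scores, dim=1)                           # attention over regions
        visual = self.visual_proj((alpha.unsqueeze(-1) * regions).sum(dim=1))  # visual cue
        g = torch.sigmoid(self.gate(torch.cat([hidden, w], dim=-1)))   # 1 = visual, 0 = textual
        cue = g * visual + (1 - g) * w
        hidden = self.gru(cue, hidden)
        return self.out(hidden), hidden
```

In spirit, such a gate lets the model ground each generated word either in a fragment of the visual scene or in the language context from the previous step.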

To address the lack of naturalness, we introduce another innovation: we train the captioner with generative adversarial networks (GANs) [1], where a co-attention discriminator scores the "naturalness" of a sentence and its fidelity to the image by matching fragments of the visual scene against the generated language, and vice versa. The co-attention discriminator judges the quality of a caption by scoring the likelihood of the generated words given the image features, and of the image features given the words. Note that this scoring is local (at the word and pixel level) rather than at a global representation level; this locality is important for capturing the compositional nature of language and visual scenes. The discriminator's role is not only to ensure that the generated language is human-like, but also to enable the captioner to compose, by judging image and sentence pairs at a local level.
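The snippet below is a rough sketch of this kind of local scoring: it projects words and image regions into a common space, attends in both directions (each word over regions, and each region over words), and averages the local match scores into a single caption-level score. Names and dimensions are illustrative; the discriminator used in the paper is more elaborate than this simplification.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CoAttentionScorer(nn.Module):
    """Scores how well caption words and image regions match locally (illustrative sketch)."""

    def __init__(self, word_dim=256, feat_dim=2048, common_dim=512):
        super().__init__()
        self.word_proj = nn.Linear(word_dim, common_dim)
        self.region_proj = nn.Linear(feat_dim, common_dim)

    def forward(self, words, regions):
        # words: (batch, num_words, word_dim), regions: (batch, num_regions, feat_dim)
        w = F.normalize(self.word_proj(words), dim=-1)
        r = F.normalize(self.region_proj(regions), dim=-1)
        sim = torch.bmm(w, r.transpose(1, 2))            # (batch, num_words, num_regions)

        # Word-to-image: each word attends over regions; image-to-word: each region attends over words.
        word_scores = (torch.softmax(sim, dim=2) * sim).sum(dim=2).mean(dim=1)
        region_scores = (torch.softmax(sim, dim=1) * sim).sum(dim=1).mean(dim=1)

        # Average the two directions into one caption-level fidelity score.
        return 0.5 * (word_scores + region_scores)
```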

Generalization

The second challenge is the dataset bias impacting current captioning systems. The trained models overfit to common objects that co-occur in a common context (e.g., bed and bedroom), and as a result struggle to generalize to scenes where the same objects appear in unseen contexts (e.g., bed and forest). Although reducing the dataset bias is in itself a challenging, open research problem, we propose a diagnostic tool to quantify how biased a given captioning system is.

Specifically, we created a diagnostic test set of captioned images in which common objects occur in unusual scenes (the Out of Context, or OOC, dataset) in order to test the compositional and generalization properties of a captioner. Evaluation on OOC is a good indicator of a model's generalization: poor performance is a sign that the captioner is over-fitted to the training context. We show that GAN-based models with a co-attention discriminator and a context-aware generator generalize better to unseen contexts than previous state-of-the-art methods (see Figure 1).
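As a hedged illustration of how such a diagnostic could be applied, the sketch below compares a caption metric (e.g., CIDEr) on an in-context test split against the OOC split and reports the gap. The captioner, metric function, and data format are placeholders for this example, not part of any released tooling.

```python
def generalization_gap(score_fn, captioner, in_context_split, ooc_split):
    """Quantify dataset bias as the drop in caption quality on out-of-context (OOC) images.

    score_fn(generated, references) -> float is any caption metric (e.g., CIDEr);
    captioner(images) -> list of generated captions. Both are placeholders here, and each
    split is assumed to be a dict with "images" and "references" entries.
    """
    in_score = score_fn(captioner(in_context_split["images"]), in_context_split["references"])
    ooc_score = score_fn(captioner(ooc_split["images"]), ooc_split["references"])
    # A large positive gap suggests the captioner is over-fitted to the training contexts.
    return {"in_context": in_score, "out_of_context": ooc_score, "gap": in_score - ooc_score}
```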

Evaluation and Turing Test

The third challenge is the evaluation of the quality of generated captions. Automated metrics, though partially helpful, are still unsatisfactory because they do not take the image into account; in many cases their scoring is inadequate and sometimes even misleading, especially when scoring diverse and descriptive captions. Human evaluation remains the gold standard for scoring captioning systems. We used a Turing test in which human evaluators were asked whether a given caption is real or machine-generated. The evaluators judged many of the model-generated captions to be real, demonstrating that the proposed captioner performs well and is a promising new approach for automatic image captioning.
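For concreteness, here is a small sketch of how the outcome of such a Turing test might be summarized: the fraction of machine-generated captions judged to be real (a "fooling rate"), with the same rate for human captions as a reference point. The data format is assumed for illustration.

```python
def fooling_rate(judgments):
    """judgments: list of (source, judged_real) pairs, where source is "machine" or "human"
    and judged_real is True if the evaluator labeled the caption as written by a person."""
    rates = {}
    for source in ("machine", "human"):
        votes = [judged_real for s, judged_real in judgments if s == source]
        rates[source] = sum(votes) / len(votes) if votes else float("nan")
    return rates


# Example: 3 of 4 machine captions judged real -> 0.75 fooling rate.
example = [("machine", True), ("machine", True), ("machine", False),
           ("machine", True), ("human", True), ("human", True)]
print(fooling_rate(example))  # {'machine': 0.75, 'human': 1.0}
```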

Outlook

Progress on automatic image captioning and scene understanding will make computer vision systems more reliable for use as personal assistants for visually impaired people, improving their day-to-day lives. The semantic gap between language and vision points to the need for incorporating common sense and reasoning into scene understanding.

Figure 1: Examples of out-of-context images. GAN refers to the model proposed by IBM Research in this work; CE is the cross-entropy-trained baseline; RL is the reinforcement-learning-based baseline using an NLP metric (CIDEr) as the cost function; GT is the "ground truth" caption given by a human.

[1] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, 2014.


Adversarial Semantic Alignment for Improved Image Captions. Authors (in alphabetical order): Pierre Dognin, Igor Melnyk, Youssef Mroueh, Jerret Ross, Tom Sercu

Igor Melnyk

Research Staff Member, IBM Research
