Overcoming Challenges In Automated Image Captioning

Automatic image captioning remains challenging despite impressive recent progress in neural image captioning. In the paper “Adversarial Semantic Alignment for Improved Image Captions,” appearing at the 2019 Conference on Computer Vision and Pattern Recognition (CVPR), we, together with several other IBM Research AI colleagues, address three main challenges in bridging the semantic gap between visual scenes and language in order to produce diverse, creative and human-like captions.

Compositionality and Naturalness

The first challenge stems from the compositional nature of natural language and visual scenes. While the training dataset contains co-occurrences of some objects in their context, a captioning system should be able to generalize by composing objects in other contexts.

Traditional captioning systems suffer from a lack of compositionality and naturalness because they often generate captions sequentially, i.e., the next generated word depends on both the previous word and the image feature. This can frequently lead to syntactically correct but semantically irrelevant language structures, as well as to a lack of diversity in the generated captions. We propose to address the compositionality issue with a context-aware attention captioning model, which allows the captioner to compose sentences based on fragments of the observed visual scenes. Specifically, we used a recurrent language model with gated recurrent visual attention that, at every generation step, gives the choice of attending to either visual cues or textual cues from the last generation step.
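To make the gating idea concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the model from the paper: the function name gated_context, the dot-product attention, the single scalar gate and the gate_weights vector are all hypothetical choices, and the image regions are assumed to share the dimensionality of the hidden state.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_context(hidden, regions, prev_text, gate_weights):
    """Blend attended visual features with the previous step's textual cue.

    hidden, prev_text and every entry of regions are equal-length vectors
    (an assumption made for this sketch).
    """
    # Attention over image regions, scored against the current hidden state.
    weights = softmax([dot(hidden, r) for r in regions])
    visual = [sum(w * r[i] for w, r in zip(weights, regions))
              for i in range(len(hidden))]
    # Scalar gate in (0, 1): near 1 favours visual cues, near 0 textual cues.
    gate = sigmoid(dot(gate_weights, hidden))
    context = [gate * v + (1.0 - gate) * t for v, t in zip(visual, prev_text)]
    return context, gate, weights

# Hypothetical toy inputs: two image regions and a 2-d hidden state.
hidden = [1.0, 0.0]
regions = [[1.0, 0.0], [0.0, 1.0]]
prev_text = [0.5, 0.5]
context, gate, weights = gated_context(hidden, regions, prev_text, [10.0, 0.0])
```

With a strongly positive gate the context follows the attended image regions; flipping the sign of the gate weights makes the same step fall back on the textual cue from the last generation step.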

To address the lack of naturalness, we introduce another innovation: training the captioner with generative adversarial networks (GANs) [1], where a co-attention discriminator scores the “naturalness” of a sentence and its fidelity to the image via a co-attention model that matches fragments of the visual scene to the generated language and vice versa. The co-attention discriminator judges the quality of a caption by scoring the likelihood of the generated words given the image features and vice versa. Note that this scoring is local (word and pixel level) rather than at the level of a global representation. This locality is important in capturing the compositional nature of language and visual scenes. The discriminator’s role is not only to ensure that the generated language is human-like; it also enables the captioner to compose by judging image and sentence pairs at a local level.
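The difference between local and global scoring can be sketched in a few lines of Python. This is only a rough illustration: the discriminator in the paper is learned, whereas the hypothetical local_alignment_score below uses fixed cosine similarities and a hard maximum in place of learned soft attention. Word and region vectors are assumed to live in the same embedding space.

```python
import math

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den if den else 0.0

def local_alignment_score(words, regions):
    """Score a caption against an image from word-region similarities only."""
    sim = [[cosine(w, r) for r in regions] for w in words]
    # Words -> image: is each word supported by some image region?
    word_to_image = sum(max(row) for row in sim) / len(words)
    # Image -> words: is each region described by some word?
    image_to_word = sum(
        max(sim[i][j] for i in range(len(words)))
        for j in range(len(regions))
    ) / len(regions)
    return 0.5 * (word_to_image + image_to_word)

# Hypothetical toy embeddings: two regions, captions of two words each.
regions = [[1.0, 0.0], [0.0, 1.0]]
matching_caption = [[1.0, 0.0], [0.0, 1.0]]   # every region is mentioned
partial_caption = [[1.0, 0.0], [1.0, 0.0]]    # second region is ignored
good = local_alignment_score(matching_caption, regions)
partial = local_alignment_score(partial_caption, regions)
```

Because the score is built from individual word and region matches in both directions, a caption that ignores part of the scene is penalized even if its words are individually well grounded.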


The second challenge is the dataset bias affecting current captioning systems. Trained models overfit to common objects that co-occur in a common context (e.g., bed and bedroom), so such systems struggle to generalize to scenes where the same objects appear in unseen contexts (e.g., bed and forest). Although reducing dataset bias is in itself a challenging, open research problem, we propose a diagnostic tool to quantify how biased a given captioning system is.

Specifically, we created a diagnostic test dataset of captioned images in which common objects occur in unusual scenes (the Out of Context, or OOC, dataset) in order to test the compositional and generalization properties of a captioner. Performance on OOC is a good indicator of the model’s generalization: poor performance is a sign that the captioner has overfit to the training context. We show that GAN-based models with a co-attention discriminator and a context-aware generator generalize better to unseen contexts than previous state-of-the-art methods (see Figure 1).
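One simple way to read such a diagnostic is to compare a captioner’s average score on ordinary test images with its average score on the OOC images. The sketch below assumes per-image scores are already available from some evaluation of your choice; the function name generalization_gap, the relative-drop summary and the numbers in the example are hypothetical and are not results from the paper.

```python
def mean(values):
    return sum(values) / len(values)

def generalization_gap(in_context_scores, ooc_scores):
    """Summarize how much a captioner degrades on out-of-context images."""
    in_mean = mean(in_context_scores)
    ooc_mean = mean(ooc_scores)
    # Relative drop: 0.0 means no degradation, 1.0 means total collapse.
    relative_drop = (in_mean - ooc_mean) / in_mean if in_mean else 0.0
    return {"in_context": in_mean, "ooc": ooc_mean, "relative_drop": relative_drop}

# Made-up per-image scores for a single hypothetical captioner.
report = generalization_gap([0.8, 1.0, 0.6], [0.4, 0.5, 0.3])
```

A large relative drop would suggest that the captioner leans on the training context rather than on what is actually in the image.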

Evaluation and Turing Test

The third challenge is evaluating the quality of generated captions. Automated metrics, though partially helpful, are still unsatisfactory since they do not take the image into account. In many cases, their scoring remains inadequate and sometimes even misleading, especially when scoring diverse and descriptive captions. Human evaluation remains the gold standard in scoring captioning systems. We used a Turing test in which human evaluators were asked if a given caption is real or machine-generated. The human evaluators judged many of the model-generated captions to be real, indicating that the proposed captioner performs well and promises to be a valuable new approach for automatic image captioning.
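The outcome of such a Turing test can be summarized as the fraction of machine-generated captions that evaluators judged to be real. The sketch below shows one way to compute it; the record layout, the function name fool_rate and the example judgments are assumptions made for illustration and do not reproduce the study’s data or protocol.

```python
def fool_rate(judgments):
    """Fraction of machine-generated captions that evaluators judged real.

    judgments is a list of (is_machine_generated, judged_real) boolean pairs,
    one pair per caption shown to an evaluator.
    """
    machine = [judged_real for is_machine, judged_real in judgments if is_machine]
    if not machine:
        raise ValueError("no machine-generated captions in the judgments")
    return sum(machine) / len(machine)

# Made-up judgments: three machine captions, one human caption.
example = [(True, True), (True, False), (False, True), (True, True)]
rate = fool_rate(example)
```

Human-written captions are filtered out before the rate is computed, so the number reflects only how often the model’s captions passed as real.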


Progress on automatic image captioning and scene understanding will make computer vision systems more reliable as personal assistants for visually impaired people, helping to improve their day-to-day lives. The semantic gap in bridging language and vision points to the need for incorporating common sense and reasoning into scene understanding.

Figure 1: Examples of out-of-context images. GAN refers to the model proposed by IBM Research in this work. CE is the baseline trained with cross-entropy. RL is the reinforcement-learning-based baseline with an NLP metric (CIDEr) as a cost function. GT is the “ground truth” caption given by a human.

[1] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, 2014.

Adversarial Semantic Alignment for Improved Image Captions. Authors (in alphabetical order): Pierre Dognin, Igor Melnyk, Youssef Mroueh, Jerret Ross, Tom Sercu

Research Staff Member, IBM Research

Igor Melnyk

Research Staff Member, IBM Research
