Text2Scene: Generating Compositional Scenes from Textual Descriptions


Generating images from textual descriptions has become an active and exciting area of research. Interest has been partially fueled by the adoption of generative adversarial networks (GANs) [1], which have demonstrated impressive results on a number of image synthesis tasks. However, challenges remain when attempting to synthesize images for complex scenes with multiple interacting objects. In our paper, a Best Paper Finalist at CVPR 2019, we propose to approach this problem from a different direction. Inspired by the principle of compositionality [2], our model produces a scene by sequentially generating objects (in the form of clip-art, bounding boxes, or segmented object patches) that contain the semantic elements composing the scene.

Compositional Scene Generation
We introduce Text2Scene, a model to interpret visually descriptive language in order to generate compositional scene representations. In particular, we focus on generating a scene representation consisting of a list of objects, along with their attributes (e.g., location, size, aspect ratio, pose, appearance). We adapt and train models to generate three types of scenes as shown in Figure 1: cartoon-like scenes, object layouts, and synthetic images.
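To make the scene representation concrete, here is a minimal sketch of one generated object as a plain data structure. The field names (`location`, `size`, `aspect_ratio`, `pose`, `flip`) are illustrative placeholders mirroring the attributes listed above, not the paper's exact parameterization.

```python
from dataclasses import dataclass

# Hypothetical sketch of the per-object scene representation: each
# generated object carries a category plus spatial/appearance attributes.
@dataclass
class SceneObject:
    category: str        # e.g. "person", "dog", "tree"
    location: tuple      # (x, y) anchor position on the canvas
    size: float          # relative scale in [0, 1]
    aspect_ratio: float  # width / height
    pose: int            # discrete pose index (cartoon-like scenes)
    flip: bool           # horizontal flip flag

# A scene is an ordered list of such objects; order matters because
# objects generated later are composed on top of earlier ones.
scene = [
    SceneObject("tree", (2, 6), 0.8, 0.5, 0, False),
    SceneObject("person", (5, 7), 0.4, 0.4, 3, True),
]
```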

Figure 1: Tasks on generating scenes from text

We propose a unified sequence-to-sequence framework to handle these three different tasks.

Generally, Text2Scene consists of: a text encoder (Fig. 2A) that maps the input sentence to a set of latent representations; an image encoder (Fig. 2B) that encodes the current generated canvas; a convolutional recurrent module (Fig. 2C) that passes the current state to the next step; attention modules (Fig. 2D) that focus on different parts of the input text; an object decoder (Fig. 2E) that predicts the next object conditioned on the current scene state and the attended input text; an attribute decoder (Fig. 2F) that assigns attributes to the predicted object; and an optional foreground embedding step (Fig. 2G) that learns an appearance vector for patch retrieval in the synthetic image generation task.
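The per-step data flow through these modules can be sketched as a few plain functions. These are toy stand-ins for the learned neural components (the real model uses convolutional and recurrent networks); only the order in which information passes between modules is meant to be faithful.

```python
# Toy stand-ins for the Text2Scene modules (A)-(F), illustrating only
# the per-step data flow, not the learned computations.

def text_encoder(words):
    # (A) map each word to a latent vector (here: a toy 1-d embedding)
    return [[float(len(w))] for w in words]

def image_encoder(canvas):
    # (B) summarize the current canvas (here: object count as a feature)
    return [float(len(canvas))]

def recurrent_update(state, canvas_feat):
    # (C) carry the scene state forward to the next time step
    return [s + c for s, c in zip(state, canvas_feat)]

def attend(state, text_feats):
    # (D) pool the text latents (a real model learns attention weights)
    n = len(text_feats)
    return [sum(f[0] for f in text_feats) / n + state[0]]

def decode_object(context):
    # (E) predict the next object category from the attended context
    return "object_%d" % int(context[0])

def decode_attributes(context):
    # (F) assign attributes (location, size, ...) to the predicted object
    return {"location": (0, 0), "size": 0.5}
```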

The scene generation starts from an initially empty canvas that is updated at each time step. For the synthetic image generation task, our model sequentially retrieves and pastes object patches from other images to compose the scene. As the composite images may exhibit gaps between patches, we also leverage the stitching network in [5] for post-processing.
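The canvas-update loop described above can be sketched as follows. `predict_step` is a trivial stand-in for the full encoder/decoder stack (here it emits one object per input word, then an end-of-scene token); the loop structure — empty canvas, one object per step, stop at the end token — is the part that mirrors the model.

```python
# Hedged sketch of the sequential generation loop: start from an empty
# canvas, add one predicted object per time step, stop at an end token.

END = "<eos>"

def predict_step(text, canvas):
    # Stand-in predictor: one object per word, then the end token.
    words = text.split()
    if len(canvas) >= len(words):
        return END, None
    return words[len(canvas)], {"location": (len(canvas), 0), "size": 0.5}

def generate_scene(text, max_steps=20):
    canvas = []  # initially empty canvas, updated at each time step
    for _ in range(max_steps):
        category, attrs = predict_step(text, canvas)
        if category == END:
            break
        canvas.append((category, attrs))  # "paste" the object onto the canvas
    return canvas

scene = generate_scene("dog ball park")
```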


Figure 2: Text2Scene framework overview


We compare our approach to recent GAN-based methods such as SG2IM [3] and AttnGAN [4]. Experimental results show that our model achieves near state-of-the-art performance on automatic metrics, and in human subject evaluations, 75% of participants preferred our outputs over those of the best GAN-based baseline.

Figure 3:  Qualitative examples of scene generation results

Synthesizing images from text requires a substantial degree of language and visual understanding, and progress on this task could enable applications in image retrieval through natural language queries, representation learning for text, and automated computer graphics and image editing. Our work proposes an interpretable model that generates various forms of compositional scene representations. Experimental results demonstrate the capacity of our model to capture fine-grained semantic meaning from descriptive text and generate complex scenes.

[1] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems (NeurIPS), 2014.

[2] Xiaodan Zhu and Edward Grefenstette. Deep learning for semantic composition. In ACL tutorial, 2017.

[3] Justin Johnson, Agrim Gupta, and Li Fei-Fei. Image generation from scene graphs. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

[4] Tao Xu, Pengchuan Zhang, Qiuyuan Huang, Han Zhang, Zhe Gan, Xiaolei Huang, and Xiaodong He. AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

[5] Xiaojuan Qi, Qifeng Chen, Jiaya Jia, and Vladlen Koltun. Semi-parametric image synthesis. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

Text2Scene: Generating Compositional Scenes from Textual Descriptions, Fuwen Tan, Song Feng, Vicente Ordonez

Research Staff Member, IBM Research
