More Is Less: Learning Efficient Video Representations

Authors: Quanfu Fan, IBM Research Staff Member; Hilde Kuehne, IBM Research Staff Member; Marco Pistoia, Distinguished Research Staff Member and IBM Master Inventor; Richard Chen, IBM Research Staff Member; and David Cox, Director, MIT-IBM Watson AI Lab.

Understanding video is tricky for machines. In our latest work on video action recognition, we developed a novel architecture for spatio-temporal analysis of video that is computationally efficient and has a low memory footprint. It performs strongly on several benchmarks and makes it practical to train deeper models on longer sequences of input frames, which in turn leads to higher accuracy on video action recognition tasks.

This work will be presented at the 2019 Conference and Workshop on Neural Information Processing Systems (NeurIPS) on December 8-14 in Vancouver, British Columbia.

Video understanding encompasses a wide range of applications such as video indexing and retrieval, video content enrichment, and human-robot interaction. The field has made rapid progress in recent years. Many approaches use expensive 3D Convolutional Neural Networks (3D-CNNs) to learn spatio-temporal representations, building on the success of 2D-CNN models in image recognition.

However, to achieve good results on video, an action recognition model needs to be deep and to process a long sequence of input frames. This makes training 3D-CNN models computationally intensive.

Big-Little Video Network as Video Representation

Our new paper, inspired by IBM’s previous work on Big-Little Network [1], proposes a novel lightweight 2D video architecture that efficiently models video information in both space and time (see Fig. 1). The architecture, referred to as bLVNet, contains two network branches (one deep and one shallow) that learn effective video feature representations while balancing the computational costs associated with the network depth and number of input frames.

The input frames are divided into two groups, one at low and one at high image resolution. The deep branch (expressive, but more costly) learns information from the low-resolution images, while the shallow branch (efficient, but less accurate) processes the high-resolution data. The two branches complement each other through merging to yield strong features for video action recognition.

We show that this approach reduces FLOPs by 3x-4x and memory usage by approximately 2x compared to the baseline.

Figure 1: Different architectures for action recognition. a) TSN [2] uses a shared CNN to process each frame independently, so there is no temporal interaction between frames. b) TSN-bLNet is a variant of TSN that uses bLNet [1] as backbone. It is efficient, but still lacks temporal modeling. c) bLVNet feeds odd and even frames separately into different branches in bLNet. The branch merging at each layer captures short-term temporal dependencies between adjacent frames. d) bLVNet-TAM includes the proposed aggregation module, represented as a red box, which further empowers bLVNet to model long-term temporal dependencies across frames.
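The data flow of the two branches can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the bLVNet implementation: the helper names (`downsample2x`, `upsample2x`, `big_little_merge`) are hypothetical, the choice of which frame group goes to which branch is arbitrary, 2x average pooling and nearest-neighbor upsampling stand in for whatever resizing the real network uses, and the branches are passed in as plain callables rather than trained CNNs.

```python
import numpy as np

def downsample2x(frames):
    """Halve spatial resolution with 2x2 average pooling. frames: (N, C, H, W)."""
    n, c, h, w = frames.shape
    return frames.reshape(n, c, h // 2, 2, w // 2, 2).mean(axis=(3, 5))

def upsample2x(feats):
    """Double spatial resolution with nearest-neighbor repetition."""
    return feats.repeat(2, axis=2).repeat(2, axis=3)

def big_little_merge(frames, deep_branch, shallow_branch):
    """Route alternating frames through two branches and merge by addition.

    Assumes an even number of frames and even H and W.
    """
    group_a, group_b = frames[0::2], frames[1::2]
    # Deep (costly) branch sees the low-resolution group.
    low = deep_branch(downsample2x(group_a))
    # Shallow (cheap) branch sees the full-resolution group.
    high = shallow_branch(group_b)
    # Merging features from adjacent frames mixes short-term temporal information.
    return upsample2x(low) + high
```

Calling `big_little_merge(frames, f, g)` on a clip of shape `(8, 3, 224, 224)` with shape-preserving callables `f` and `g` returns one merged feature map per frame pair, with shape `(4, 3, 224, 224)`.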

We also developed a method to exploit temporal relations across frames by aggregating the spatial features learned by bLVNet. As illustrated in Fig. 2, the aggregation is made learnable through efficient 1×1 depthwise convolutions and is implemented as a network-independent module. It captures temporal information more effectively than 3D convolution.

Figure 2: Temporal aggregation module (TAM). The TAM takes as input a batch of tensors, each of which is the activation of a frame, and produces a batch of tensors with the same order and dimension. The module consists of three operations: 1) 1×1 depthwise convolutions to learn a weight for each feature channel; 2) temporal shifts (the smaller arrows indicate the left or right direction; the white cubes are zero-padded tensors); and 3) aggregation by summing up the weighted activations from 1).
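The three operations in the caption can be written out as a minimal NumPy sketch. Several things here are assumptions for illustration: the window covers only the previous, current, and next frame; the function and parameter names are hypothetical; and the per-channel weights are passed in as fixed arrays, whereas in a real network they would be learned. A 1×1 depthwise convolution without bias reduces to multiplying each channel by one scalar, which is what the sketch does.

```python
import numpy as np

def temporal_aggregation(x, w_prev, w_curr, w_next):
    """Aggregate each frame with its temporal neighbors.

    x:        activations of shape (T, C, H, W), one tensor per frame
    w_prev, w_curr, w_next: per-channel weights of shape (C,)
    Returns a tensor with the same shape and frame order as x.
    """
    # 1) 1x1 depthwise convolution: one scalar weight per feature channel.
    a_prev = x * w_prev[None, :, None, None]
    a_curr = x * w_curr[None, :, None, None]
    a_next = x * w_next[None, :, None, None]

    # 2) Temporal shifts, padding the vacated frame slot with zeros.
    from_prev = np.zeros_like(a_prev)
    from_prev[1:] = a_prev[:-1]   # frame t receives weighted frame t-1
    from_next = np.zeros_like(a_next)
    from_next[:-1] = a_next[1:]   # frame t receives weighted frame t+1

    # 3) Aggregate by summing the weighted, shifted activations.
    return from_prev + a_curr + from_next
```

Because the module only multiplies, shifts, and adds, its output has the same shape as its input, so it can be inserted between layers of an existing 2D network without changing the surrounding architecture.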

Experimental Results

We compare our approach with other recently proposed approaches for action recognition on three large-scale datasets. Our approach establishes a new state of the art on the Something-Something dataset and achieves competitive performance on Kinetics400 and Moments-in-Time.

Table 1: Recognition accuracy of various models on Something-Something-V2

Table 2: Recognition accuracy of various models on Kinetics400

Table 3: Recognition accuracy of various models on Moments-in-Time

Conclusion and Next Steps

One surprising finding in our paper is that disentangling spatial and temporal information works better than learning them jointly by 3D convolution. This allows us to focus more on how to capture temporal information more effectively for video understanding in future work.


[1] Chun-Fu (Richard) Chen, Quanfu Fan, Neil Mallinar, Tom Sercu, and Rogerio Feris. Big-little net: An efficient multi-scale feature representation for visual and speech recognition. In International Conference on Learning Representations, 2019.

[2] Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc Van Gool. Temporal segment networks: Towards good practices for deep action recognition. In Proceedings of the European Conference on Computer Vision (ECCV). Springer, 2016.
