December 6, 2019 | Written by: IBM Research Staff Members
Authors: Quanfu Fan, IBM Research Staff Member; Hilde Kuehne, IBM Research Staff Member; Marco Pistoia, Distinguished Research Staff Member and IBM Master Inventor; Richard Chen, IBM Research Staff Member; and David Cox, Director, MIT-IBM Watson AI Lab.
Understanding video is tricky for machines. In our latest work on video action recognition, we have developed a novel architecture for spatio-temporal video analysis that is both computationally efficient and has a low memory footprint. It shows strong performance on several benchmarks and allows training of deeper models on longer sequences of input frames, which leads to higher accuracy on video action recognition tasks.
This work will be presented at the 2019 Conference on Neural Information Processing Systems (NeurIPS), December 8-14 in Vancouver, British Columbia.
Video understanding encompasses a wide range of applications, such as video indexing and retrieval, video content enrichment, and human-robot interaction, and it has made rapid progress in recent years. Many approaches learn spatio-temporal representations with expensive 3D Convolutional Neural Networks (3D-CNNs), building on the success of 2D-CNN models for image recognition.
However, to achieve good results on video, an action recognition model needs to be deep and to process a long sequence of input frames. This makes training 3D-CNN models computationally intensive.
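A back-of-the-envelope calculation shows why 3D convolutions are so costly (the layer sizes below are illustrative and not taken from the paper):

```python
# Rough multiply-accumulate (FLOP) comparison of a 2D vs. a 3D conv layer.
# Layer sizes are illustrative, not from the paper.

def conv2d_flops(c_in, c_out, k, h, w):
    # one 2D convolution (k x k kernel) over an H x W feature map
    return c_in * c_out * k * k * h * w

def conv3d_flops(c_in, c_out, k, t, h, w):
    # a 3D convolution adds a temporal kernel dimension and slides over T frames
    return c_in * c_out * k * k * k * t * h * w

# Example layer: 64 -> 64 channels, 3x3 kernels, 56x56 feature map, 16 frames
f2d = conv2d_flops(64, 64, 3, 56, 56) * 16   # 2D conv applied frame by frame
f3d = conv3d_flops(64, 64, 3, 16, 56, 56)    # 3D conv over the whole clip

print(f3d / f2d)
```

For a 3x3x3 kernel, the 3D layer costs 3X more than running the 2D layer on every frame, and the gap compounds across a deep network.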
Big-Little Video Network as Video Representation
Our new paper, inspired by IBM’s previous work on Big-Little Networks, proposes a novel lightweight 2D video architecture that efficiently models video information in both space and time (see Fig. 1). The architecture, referred to as bLVNet, contains two network branches (one deep and one shallow) that learn effective video feature representations while balancing the computational costs associated with the network depth and the number of input frames.
The input frames are divided into two groups of low and high image resolutions. The deep branch (expressive, but more costly) learns information from the low-resolution images, while the shallow branch (efficient, but less accurate) processes the high-resolution data. The two branches complement each other through merging to yield strong features for video action recognition.
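The division of labor between the two branches can be sketched in a few lines. This is a toy illustration of the big-little idea only (the branch bodies and the merge below are stand-ins, not the exact bLVNet design):

```python
import numpy as np

# Toy sketch of the two-branch big-little idea. The deep "big" branch sees
# downsampled frames (cheap per layer, many layers); the shallow "little"
# branch sees full-resolution frames (few layers); their features are merged.
# The branch bodies here are placeholders, not the actual bLVNet layers.

def big_branch(frames):
    # deep branch: operates on 2x spatially downsampled input
    low_res = frames[:, ::2, ::2]       # naive 2x downsampling
    return low_res * 1.0                # stand-in for a deep conv stack

def little_branch(frames):
    # shallow branch: full resolution, few layers
    return frames * 1.0                 # stand-in for a shallow conv stack

def merge(big_feat, little_feat):
    # upsample the big branch's low-res features and fuse by addition
    up = big_feat.repeat(2, axis=1).repeat(2, axis=2)
    return up + little_feat

frames = np.random.rand(8, 32, 32)      # 8 frames of 32x32 "features"
fused = merge(big_branch(frames), little_branch(frames))
print(fused.shape)                      # (8, 32, 32)
```

The key point is that the expensive deep stack only ever touches a quarter of the pixels, while the full-resolution detail is recovered cheaply from the shallow branch at merge time.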
We show that this approach reduces FLOPs by 3X-4X and memory usage by approximately 2X compared to the baseline.
Figure 1: Different architectures for action recognition. a) TSN uses a shared CNN to process each frame independently, so there is no temporal interaction between frames. b) TSN-bLNet is a variant of TSN that uses bLNet as the backbone. It is efficient, but still lacks temporal modeling. c) bLVNet feeds odd and even frames separately into the two branches of bLNet. The branch merging at each layer captures short-term temporal dependencies between adjacent frames. d) bLVNet-TAM includes the proposed aggregation module, represented as a red box, which further empowers bLVNet to model long-term temporal dependencies across frames.
We also developed a method to exploit temporal relations across frames by aggregating the spatial features learned by bLVNet. As illustrated in Fig. 2, the aggregation can be made learnable with efficient 1×1 depthwise convolutions and implemented as a network-independent module. It captures temporal information more effectively than 3D convolution.
Figure 2: Temporal aggregation module (TAM). The TAM takes as input a batch of tensors, each of which is the activation of a frame, and produces a batch of tensors with the same order and dimension. The module consists of three operations: 1) 1×1 depthwise convolutions to learn a weight for each feature channel; 2) temporal shifts (left or right direction indicated by the smaller arrows; the white cubes are padded zero tensors.); and 3) aggregation by summing up the weighted activations from 1).
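The three operations in the caption above can be sketched directly. This is a minimal illustration of the aggregation scheme, treating each frame's activation as a vector of channels; the weight values are illustrative, whereas in the real module they are learned:

```python
import numpy as np

# Minimal sketch of the temporal aggregation idea: per-channel weights
# (the 1x1 depthwise convolution), temporal shifts with zero padding,
# and a weighted sum over neighboring frames. Weights here are fixed
# for illustration; in the module they are learned.

def tam(x, w_prev, w_curr, w_next):
    # x: (T, C) activations, one row per frame; w_*: per-channel weights (C,)
    T, C = x.shape
    shift_right = np.vstack([np.zeros((1, C)), x[:-1]])  # frame t-1, zero-padded
    shift_left = np.vstack([x[1:], np.zeros((1, C))])    # frame t+1, zero-padded
    # aggregate: each output frame is a channel-wise weighted sum of neighbors
    return w_prev * shift_right + w_curr * x + w_next * shift_left

T, C = 4, 3
x = np.ones((T, C))
out = tam(x, w_prev=np.full(C, 0.25), w_curr=np.full(C, 0.5), w_next=np.full(C, 0.25))
print(out[1])   # interior frames sum all three weighted copies -> [1. 1. 1.]
print(out[0])   # the first frame has no t-1 neighbor (zero padding) -> [0.75 ...]
```

Because the per-channel weights and the shift-and-sum touch each channel independently, the cost grows linearly in the number of channels rather than quadratically as in a full convolution, which is what keeps the module lightweight.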
We compare our approach with other recently proposed approaches for action recognition on three large-scale datasets. Our approach establishes a new state of the art on the Something-Something dataset and achieves competitive performance on Kinetics400 and Moments-in-Time.
Table 1: Recognition Accuracy of Various Models on Something-Something-V2
Table 2: Recognition Accuracy of Various Models on Kinetics400
Table 3: Recognition Accuracy of Various Models on Moments-in-Time
Conclusion and Next Steps
One surprising finding in our paper is that disentangling spatial and temporal information works better than learning them jointly by 3D convolution. This finding lets our future work focus on capturing temporal information more effectively for video understanding.
Chun-Fu (Richard) Chen, Quanfu Fan, Neil Mallinar, Tom Sercu, and Rogerio Feris. Big-Little Net: An efficient multi-scale feature representation for visual and speech recognition. In International Conference on Learning Representations (ICLR), 2019.
Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc Van Gool. Temporal segment networks: Towards good practices for deep action recognition. In Proceedings of the European Conference on Computer Vision (ECCV). Springer, 2016.