IBM Research at CVPR 2017: Helping AI systems to see with computer vision

This week IBM Research will be participating in the Conference on Computer Vision and Pattern Recognition (CVPR) in Honolulu, Hawaii, from July 21–25. As a major computer vision event, it’s a place for researchers, academics, students, and even investors to learn about the latest advances in the field. IBM’s presence this year includes multiple papers, demos, and invited talks that demonstrate our progress toward giving AI systems sight, enabling them to unlock important visual insights and decisions that can transform industries.

We look forward to seeing you at one of our presentations listed below. You can also stop by the IBM Research booth #230 at the CVPR Industry Expo and meet some of our scientists to learn more about the work we are doing. We will have demos showing the first-ever cognitive movie trailer for the horror movie Morgan (in partnership with 20th Century Fox), the system we deployed at The Masters and Wimbledon for automatic generation of sports highlights, our state-of-the-art image captioning system (MS-COCO top entry), our efforts to use computer vision to help doctors detect skin cancer, the Watson visual recognition services, and more.

Here’s a snapshot of our activities at CVPR:


S. J. Rennie, E. Marcheret, Y. Mroueh, J. Ross, and V. Goel. Self-critical Sequence Training for Image Captioning

Saturday July 22, 2017

Watson says: “A green bird sitting on top of a bowl.”

Image captioning is a fundamental problem in machine perception and human-computer interaction, as it requires the translation of raw visual inputs into natural language descriptions. In this paper we address the Microsoft COCO Image Captioning Challenge, and describe our captioning system, which is currently ranked first on the task (cf. Table-C5, C40).

The high performance of our system hinges upon a novel reinforcement learning (RL) technique, which we call self-critical sequence training (SCST). As an RL-based approach, SCST can directly optimize non-differentiable metrics. It also eliminates the bias associated with the traditional approach of training on ground-truth sequences by exploring alternative image descriptions and then evolving based on their associated rewards. In contrast with traditional RL approaches, SCST avoids estimating future rewards and instead uses the reward associated with the output of the current system to normalize the rewards it receives, which ultimately boosts system performance.
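The idea of using the current system's own reward as a baseline can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes that "the output of the current system" is the caption the model would produce at test time (for example by greedy decoding), and the function and parameter names (`scst_loss`, `sample_logprobs`, `sample_reward`, `greedy_reward`) are hypothetical.

```python
def scst_loss(sample_logprobs, sample_reward, greedy_reward):
    """REINFORCE-style loss with a self-critical baseline (illustrative sketch).

    sample_logprobs: log-probabilities of the words in a sampled caption.
    sample_reward:   metric score of the sampled caption (e.g. a caption metric).
    greedy_reward:   score of the caption the current model emits at test time
                     (assumed here to be the baseline).
    """
    # Samples that beat the model's own test-time output get a positive
    # advantage; samples that do worse get a negative one.
    advantage = sample_reward - greedy_reward
    # Minimizing this loss raises the probability of better-than-baseline
    # samples and lowers the probability of worse ones.
    return -advantage * sum(sample_logprobs)


# Hypothetical numbers: the sampled caption scores 1.2, the greedy one 0.9.
loss = scst_loss([-0.5, -1.0], sample_reward=1.2, greedy_reward=0.9)
```

Because the baseline comes from the model itself, no separate network is needed to estimate future rewards, which matches the description above.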

Our system is a deep neural network that is trained to compose image descriptions directly from raw image data, without any intermediate supervision. It incorporates an attention mechanism to learn the correspondence between the modalities, which gives it the ability to “focus” on different parts of the image while composing. While the inputs to our system are images and its output actions are words, both the model and the training approach are quite general and can be extended to tackle broader problems in machine perception and human-machine interaction. Both are gateways to more general AI, with applications limited only by our imagination.


Y. Lu, A. Kumar, S. Zhai, Y. Cheng, T. Javidi, and R. S. Feris. Fully-adaptive Feature Sharing in Multi-Task Networks with Applications in Person Attribute Classification

Saturday July 22, 2017

This paper’s core contribution is a method that automatically determines the best deep learning architecture to solve multiple tasks simultaneously, efficiently, and accurately. Solving multiple tasks using a single deep network model (instead of multiple separate models) is important for achieving faster computation and a low memory footprint. However, how to design a suitable network architecture to solve multiple tasks simultaneously is still an open problem, and current methods often rely on manual exploration by trial and error, which is time-consuming and error-prone. Our proposed method instead automates this process, using a novel algorithm that decides “with whom” each task should share features at each network layer. On the CelebA dataset for facial attribute classification, our approach obtains state-of-the-art accuracy with a model 90x more compact and 3x faster.


A. Joshi, S. Ghosh, M. Betke, S. Sclaroff, and H. Pfister. Personalizing Gesture Recognition Using Hierarchical Bayesian Neural Networks

Saturday July 22, 2017

No two people gesture in exactly the same way, so this paper from an IBM Research team (in collaboration with Harvard and Boston University) describes the use of personalized models to teach a system to learn both within-subject and between-subject variations of different gestures. The main technical innovation of this work lies in the use of data-efficient Bayesian neural networks, which can learn from very few labeled examples as well as guide interactive labeling of unlabeled gestures. From an application perspective, this model’s performance could be helpful for understanding the actions airport ground crew make even if there are subtle gesture differences among them. Such individualized models also have wide applications in other domains, including healthcare. The team is continuing to expand the system and teach it to identify gestures directly from video data, removing the need for kinematic data.

L. Karlinsky, J. Shtok, Y. Tzur, and A. Tzadok. Fine-grained recognition of thousands of object categories with single-example training

Saturday July 22, 2017

This paper approaches the problem of fast detection and recognition of a large number (thousands) of object categories while training on a very limited number of examples, usually one per category. Examples of this task include: (i) detection and recognition of retail products on supermarket shelves in unconstrained photographs, while training on one image per product (e.g. using the store’s online catalogue); (ii) detection of brand logos; and (iii) detection of 3D objects and their respective poses within a single 2D image. Building a detector based on so few examples presents a significant challenge for the current top-performing (deep) learning-based techniques, which require large amounts of data to train. In this work we demonstrate the usefulness of our approach in a variety of experiments on both existing benchmarks and our own, achieving state-of-the-art performance.

A. Amir, B. Taba, D. Berg, T. Melano, J. McKinstry, C. Di Nolfo, T. Nayak, A. Andreopoulos, G. Garreau, M. Mendoza, J. Kusnitz, M. Debole, S. Esser, T. Delbruck, M. Flickner, D. Modha. A Low Power, Fully Event-Based Gesture Recognition System

Air guitar — one of 10 different hand and arm gestures IBM researchers trained a neural network to recognize.

Sunday July 23, 2017

As humans, we take for granted our ability to look at a person and instantly recognize whether they are waving at us or clapping their hands. The brain does it quickly, without overheating or running out of energy — which is what would happen if you tried the same task on your laptop or smartphone. To bridge this gap, IBM researchers developed the IBM TrueNorth neurosynaptic processor, which contains a million artificial neurons organized like the brain’s cerebral cortex. In this paper, IBM scientists used a special iniLabs DVS128 event camera modeled after the mammalian retina with a TrueNorth processor running a neural network they trained to recognize 10 different hand and arm gestures. The system is event-based, meaning it only reacts if there’s a change in what it’s seeing. This enables the system to run with much less power — under 200 mW. This model could enable AI applications efficient enough to be powered off the battery in a smartphone or a self-driving car, for example.
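To make "event-based" concrete, here is a minimal sketch of turning two intensity frames into a sparse list of change events. It is only an illustration under simplifying assumptions: a real event camera reports changes asynchronously per pixel rather than by comparing whole frames, and the function name `frame_to_events`, the threshold value, and the `(x, y, polarity)` event layout are hypothetical rather than taken from the paper.

```python
def frame_to_events(prev, curr, threshold=10):
    """Emit (x, y, polarity) events only where brightness changed enough.

    prev, curr: 2D lists of pixel intensities with the same shape.
    threshold:  minimum absolute change that triggers an event (assumed value).
    """
    events = []
    for y, (row_prev, row_curr) in enumerate(zip(prev, curr)):
        for x, (p, c) in enumerate(zip(row_prev, row_curr)):
            diff = c - p
            if abs(diff) >= threshold:
                # Polarity +1 means the pixel got brighter, -1 darker.
                events.append((x, y, 1 if diff > 0 else -1))
    return events


still = [[0, 0], [0, 0]]
moved = [[0, 50], [0, 0]]
print(frame_to_events(still, still))  # a static scene produces no events
print(frame_to_events(still, moved))  # one pixel brightened, so one event
```

When nothing in the scene changes, nothing is emitted and nothing downstream has to be computed, which is where the power savings of an event-based pipeline come from.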

The team is also making the dataset they used to train the neural network available for download — one of the first event-based datasets provided to the field.

H. Xu, J. Yan, N. Persson, W. Lin, and H. Zha. Fractal Dimension Invariant Filtering and Its CNN-based Implementation

Sunday July 23, 2017

This paper proposes a novel nonlinear filter based on local fractal analysis techniques. This filter, which is implemented via a CNN, not only preserves the invariance of the local fractal dimension, but it also enhances the structural information hidden in images. The hope is for this filter to be applied to material analysis or photo editing applications, such as the generation of painting-style images from photos. This is the first attempt to design a fractal dimension invariant filter, while also connecting fractal-based image models with CNN-based methods.

S. Zhai, H. Wu, A. Kumar, Y. Cheng, Y. Lu, Z. Zhang, and R. S. Feris. S3Pool: Pooling with Stochastic Spatial Sampling

Sunday July 23, 2017

Many cognitive systems learn by being shown lots of examples: the more they ‘see’, the more they know. The core contribution of this paper is a method that enables deep learning with few training examples by changing the operation of pooling layers in deep convolutional neural networks. We observe that although the regularly spaced downsampling in traditional pooling layers is intuitive from a signal processing perspective (where the goal is signal reconstruction), it is not necessarily optimal for learning (where the goal is to generalize). We study this aspect and propose a novel pooling strategy with stochastic spatial sampling (S3Pool), where the regular downsampling is replaced by a more general stochastic version. Our approach yields higher accuracy, especially in cases where only a few training examples are available.
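The difference between regular and stochastic downsampling can be sketched as a two-step pooling operation on a single feature map. This is a simplified illustration, not the paper's exact procedure: it assumes pooling is split into a stride-1 max pooling step followed by a downsampling step that keeps one randomly chosen row and column per block instead of always keeping every second one. The function names and the handling of borders are hypothetical choices.

```python
import random


def max_pool_stride1(x, k=2):
    """Max over each k-by-k window with stride 1; windows are clipped at the border."""
    h, w = len(x), len(x[0])
    return [[max(x[i + di][j + dj]
                 for di in range(k) for dj in range(k)
                 if i + di < h and j + dj < w)
             for j in range(w)]
            for i in range(h)]


def stochastic_downsample(x, stride=2, rng=None):
    """Keep one randomly chosen row and one column from every block of `stride`."""
    rng = rng or random.Random()
    h, w = len(x), len(x[0])
    # Regular downsampling would always pick offset 0 inside each block.
    rows = [b + rng.randrange(min(stride, h - b)) for b in range(0, h, stride)]
    cols = [b + rng.randrange(min(stride, w - b)) for b in range(0, w, stride)]
    return [[x[r][c] for c in cols] for r in rows]


def s3pool_sketch(x, stride=2, rng=None):
    """Stride-1 max pooling followed by stochastic spatial downsampling."""
    return stochastic_downsample(max_pool_stride1(x, k=stride), stride, rng)


feature_map = [[1, 2, 3, 4],
               [5, 6, 7, 8],
               [9, 10, 11, 12],
               [13, 14, 15, 16]]
pooled = s3pool_sketch(feature_map, rng=random.Random(0))  # a 2x2 result
```

Each call can pick different rows and columns, so the network sees slightly different downsampled views of the same input during training, which acts as a form of regularization when examples are scarce.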


Michele Merler, Dhiraj Joshi, Quoc-Bao Nguyen, Stephen Hammer, John Kent, John Smith, Rogerio Feris. Automatic Curation of Golf Highlights using Multimodal Excitement Features

Friday July 21, 2017

IBM’s AI-powered video highlights system used to auto-curate the most exciting highlights of Wimbledon 2017

The production of sports highlight packages summarizing a game’s most exciting moments is an essential task for broadcast media, yet it requires labor-intensive video editing. In this paper IBM scientists propose a novel approach for auto-curating sports highlights that leverages video and audio AI techniques, and use it to create a system for the editorial aid of golf highlight reels, which was put to use at the 2017 Masters Golf Tournament. The proof of concept brought together computer vision and other leading AI technologies to listen, watch, and learn from a live video feed of the golf tournament and automatically identify and curate the most exciting moments and shots into segments that could be used in online highlight packages. The team built out the system further to create a solution for The Championships, Wimbledon, that went beyond selecting and curating individual segments to automatically creating a one- to two-minute highlights package of matches for the Wimbledon editorial team’s use across the Wimbledon Digital Platforms.

Women in Computer Vision Main Workshop: Computer Vision for the Blind. Chieko Asakawa

Wednesday, July 26

In this talk IBM Fellow Chieko Asakawa will discuss emerging technologies that can help the visually impaired. Blind people have long dreamed of a machine that can recognize objects, people, and the environment around them. For many years, such machines existed only in science fiction, but now, thanks to advances in deep learning and computer vision technologies, new solutions are becoming a reality.

Tensor Methods in Computer Vision Workshop: A New Tensor Algebra – Theory and Applications. Lior Horesh

Wednesday, July 26

Tensors are instrumental in revealing latent correlations residing in high-dimensional spaces. Despite their applicability to a broad range of applications in machine learning, speech recognition, and imaging, inconsistencies between tensor and matrix algebra have been impeding their broader utility. Researchers seeking to overcome those discrepancies have introduced several different candidate extensions, each introducing unique advantages and challenges. This tutorial will review some of the common tensor algebra definitions, discuss their limitations, and introduce the new t-tensor product algebra, which permits the elegant extension of linear algebraic concepts and algorithms to tensors.
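As a rough illustration of how a tensor product can mirror matrix multiplication, the sketch below multiplies two third-order tensors by treating each entry as a "tube" along the third dimension and replacing scalar multiplication with circular convolution of tubes. This assumes the talk refers to the t-product defined this way in the tensor literature; the function name `t_product` and the nested-list layout `tensor[row][column][depth]` are hypothetical choices for the example.

```python
def t_product(a, b):
    """Multiply an n1 x n2 x n3 tensor by an n2 x n4 x n3 tensor (sketch).

    Works like matrix multiplication, except that each entry is a tube of
    length n3 and the product of two tubes is their circular convolution.
    """
    n1, n2, n3 = len(a), len(a[0]), len(a[0][0])
    n4 = len(b[0])
    c = [[[0.0] * n3 for _ in range(n4)] for _ in range(n1)]
    for i in range(n1):
        for j in range(n4):
            for k in range(n2):
                # Circular convolution of tube a[i][k] with tube b[k][j].
                for s in range(n3):
                    for t in range(n3):
                        c[i][j][(s + t) % n3] += a[i][k][s] * b[k][j][t]
    return c


# With tubes of length 1 the product reduces to ordinary matrix multiplication.
m = [[[1], [2]], [[3], [4]]]
print(t_product(m, m))

# An "identity" tensor: identity matrix in the first slice, zeros elsewhere.
identity = [[[1, 0], [0, 0]], [[0, 0], [1, 0]]]
b = [[[1, 2]], [[3, 4]]]
print(t_product(identity, b))  # reproduces b
```

Because the product behaves like a matrix product over tubes, familiar notions such as an identity element carry over, which is the kind of extension of linear algebraic concepts the abstract describes.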










Rogerio Schmidt Feris

Research Manager, Vision & Learning, IBM Research