I am part of the team at the MIT-IBM Watson AI Lab carrying out fundamental AI research to push the frontiers of core technologies and advance the state of the art in AI video comprehension. This is just one example of the joint research we are pursuing together to produce AI innovations that solve real business challenges.
Great progress has been made, and I am excited to share that we are releasing the Moments in Time Dataset, a large-scale collection of one million annotated three-second video clips for action recognition, to accelerate the development of technologies and models that enable automatic video understanding for AI.
A lot can happen in a moment of time: a girl kicking a ball, behind her on the path a woman walking her dog, on a park bench nearby a man reading a book, and high above a bird flying in the sky. Humans constantly absorb such moments through their senses and process them swiftly and effortlessly. When asked to describe such a moment, a person can quickly identify objects (girl, ball, bird, book), the scene (park) and the actions that are taking place (kicking, walking, reading, flying).
Clips showing sections of video frames used by a neural network to predict the actions in the videos. These methods show the neural network model’s ability to locate the most important areas to focus on so that it can begin to identify everyday moments.
MIT Computer Science and Artificial Intelligence Laboratory Principal Research Scientist, Dr. Aude Oliva (left) and IBM Research Scientist, Dr. Dan Gutfreund, co-leaders of the Moments in Time Dataset project
For decades, researchers in the field of computer vision have been attempting to develop visual understanding models that approach human levels. Only in the last few years, thanks to breakthroughs in deep learning, have we started to see models that reach human performance (although they are restricted to a handful of tasks on certain datasets). While new algorithmic ideas have emerged over the years, this success can largely be credited to two other factors: massive labeled datasets and significant improvements in computational capacity, which allow these datasets to be processed and models with millions of parameters to be trained on reasonable time scales. ImageNet, a dataset for object recognition in still images developed by Prof. Fei-Fei Li and her group at Stanford University, and Places, a dataset for scene recognition (such as “park,” “office,” “bedroom”) developed by Dr. Aude Oliva and her group at MIT, were the source of significant innovations and benchmarks, enabled by their wide coverage of the semantic universe of objects and scenes, respectively.
For the past year we have been working in close collaboration with Dr. Aude Oliva and her team at MIT to tackle the specific challenge of action recognition, an important first step in helping computers understand activities, which can ultimately be used to describe complex events (e.g. “changing a tire,” “saving a goal,” “teaching a yoga pose”).
While several labeled video datasets are publicly available, they typically don’t provide wide semantic coverage of the English language and are, by and large, human-centric. In particular, the label categories in these sets describe very specific scenarios, such as “applying makeup,” or sporting events, such as “high jump.” In other words, the videos are not labeled with the basic actions that are the necessary building blocks for describing the visual world around us (e.g. “running,” “walking,” “laughing”). To understand the difference, consider the activity “high jump.” Its label can only describe that particular activity, since, in general, it is not part of any other activity. Basic actions, on the other hand, can be used to describe many types of activities: “high jump” itself encompasses the basic actions “running,” “jumping,” “arching,” “falling,” and “landing.”
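As a toy illustration of this distinction, a complex activity can be viewed as a composition of basic actions, while a single basic action recurs across many activities. The activity-to-action mapping below is illustrative only, not taken from the dataset:

```python
# Toy illustration: complex activities decompose into basic actions,
# while basic actions recur across many activities.
# This mapping is hypothetical, not part of the Moments in Time Dataset.
ACTIVITY_TO_BASIC_ACTIONS = {
    "high jump": ["running", "jumping", "arching", "falling", "landing"],
    "saving a goal": ["running", "jumping", "catching", "falling"],
    "walking the dog": ["walking", "pulling"],
}

def activities_containing(basic_action):
    """Return all activities whose decomposition includes a basic action."""
    return sorted(
        activity
        for activity, actions in ACTIVITY_TO_BASIC_ACTIONS.items()
        if basic_action in actions
    )

# "jumping" appears in several activities; "high jump" describes only itself.
print(activities_containing("jumping"))
print(activities_containing("walking"))
```

A model trained on the basic-action vocabulary can therefore contribute to recognizing many higher-level events, whereas a model trained only on “high jump” cannot.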
The video snippets in the Moments in Time Dataset depict day-to-day events, labeled with the basic actions that occur in them. The label set contains more than 300 basic actions, chosen carefully such that they provide a wide coverage of English-language verbs both in terms of semantics as well as frequency of use.
Another unique aspect of the Dataset is that we consider an action to be present in a video even if it can only be heard (e.g. the sound of clapping in the background). This enables the development of multi-modal models for action recognition.
Finally, the Dataset exhibits not only a wide variety of actions, known as inter-label variability, but also significant intra-label variability: the same action can occur in very different settings and scenarios. Consider the action “opening,” for example. Doors can open, curtains can open, books can open, and a dog can open its mouth. All of these scenarios appear in the Dataset under the category “opening.” For us humans, it is easy to recognize that all of them are the same action, even though visually they look quite different from each other. The challenge is to train computer models to do the same. One starting point is identifying the spatio-temporal transformation that is common to all these “opening” scenarios as a means of recognizing such patterns. This project will begin to help us with this and other challenges.
The choice to focus on three-second videos is not arbitrary: three seconds corresponds to the average time span of short-term memory. In other words, this is a relatively short period of time, but still long enough for humans to process consciously (as opposed to the time spans associated with sensory memory, which unconsciously processes events that occur in fractions of a second). The physical world we live in puts constraints on the time scale of short-term memory: it takes a few seconds for agents and objects of interest to move and interact with each other in a meaningful way.
Automatic video understanding already plays an important role in our lives. With the expected advancements in the field, we predict that the number of applications will grow rapidly in domains such as assisting the visually impaired, elderly care, automotive, media and entertainment, and many more. The Moments in Time Dataset is available to the research community for non-commercial research and education purposes. Our hope is that it will foster new research addressing the challenges in video understanding and help to further unlock the promise of AI.
I encourage you to leverage the Dataset for your own research and share your experiences to foster progress and new thinking. Visit the website to obtain the dataset, read our technical paper that explains the approach we took in designing the dataset and see examples of annotated videos that our system was tested on. I look forward to sharing more details on challenges and results that will come from this effort.