The tremendous growth of video data — including media content, sports broadcasts, educational video, consumer content, news and more — has created a significant demand for artificial intelligence (AI) tools that can automatically understand visual content to facilitate effective curation and searching of large video collections.
Our team at IBM Research is creating core technologies using the most advanced AI techniques that power solutions to uncover insights from this vast amount of video data. As another step towards this goal, we have created a first-of-its-kind multi-modal system for summarizing golf video from the 2017 Masters Golf Tournament.
Capturing all the action at the Masters and sharing it with fans in a timely fashion has generally required labor-intensive effort. With 90 golfers playing multiple rounds over four days, video from every tee, every hole and multiple camera angles can quickly add up to thousands of hours of footage.
We worked with the IBM iX design team to create a proof-of-concept system for auto-curation of individual shot highlights from the tournament’s live video streams, with the goal of simplifying and accelerating the video production process to create golf play highlight packages.
Our system extracts exciting moments from live video streams of the Masters tournament based on multimodal (video, audio, and text) AI techniques. More specifically, this system is trained to “watch” and “hear” broadcast videos in real time, accurately identifying the start and end frames of key event highlights based on the following markers:
- Crowd cheering
- Action recognition, such as high fives or fist pumps
- Commentator excitement (tone of voice)
- Commentary (exciting words or expressions obtained from the Watson Speech to Text API)
- Shot-boundary detection
- TV graphics (such as lower third banners)
- Optical character recognition (the ability to extract text from images to determine which player is in which shot)
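One way to combine the markers above into a single ranking is late fusion: each detector scores a candidate segment independently, and the scores are merged into one highlight score. The sketch below illustrates that idea; the detector names, weights, and weighted-sum fusion rule are illustrative assumptions, not the production system.

```python
# Hedged sketch: fusing per-marker detector scores into one highlight score.
# All names, weights, and the fusion rule are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class SegmentScores:
    """Normalized [0, 1] outputs of the individual marker detectors."""
    crowd_cheer: float
    player_action: float      # e.g. fist-pump / high-five recognition
    commentator_tone: float   # excitement in the commentator's voice
    commentary_text: float    # exciting words from speech-to-text

# Illustrative weights; in practice these would be learned or tuned.
WEIGHTS = {
    "crowd_cheer": 0.35,
    "player_action": 0.25,
    "commentator_tone": 0.25,
    "commentary_text": 0.15,
}

def highlight_score(s: SegmentScores) -> float:
    """Weighted late fusion of the marker scores for one video segment."""
    return (WEIGHTS["crowd_cheer"] * s.crowd_cheer
            + WEIGHTS["player_action"] * s.player_action
            + WEIGHTS["commentator_tone"] * s.commentator_tone
            + WEIGHTS["commentary_text"] * s.commentary_text)

# Rank candidate segments by fused score, highest first.
segments = [
    ("hole-16-tee", SegmentScores(0.9, 0.8, 0.7, 0.6)),
    ("hole-2-putt", SegmentScores(0.3, 0.1, 0.4, 0.2)),
]
ranked = sorted(segments, key=lambda kv: highlight_score(kv[1]), reverse=True)
```

A learned fusion model (or per-sport weight tuning) would replace the fixed weights in a real deployment; the sketch only shows the shape of the ranking step.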
The selected segments are then added to an interactive dashboard for quick review and retrieval by a video editor or broadcast producer, speeding up the process by which these highlights can then be shared with fans eager to see the latest action. In addition, by leveraging TV graphics and optical character recognition, our system automatically gathers information about the player name and hole number. This metadata is matched with the relevant highlight segments, which could be used to enable searches like “show me all highlights of player X during the tournament” or to build personalized highlights based on a viewer’s favorite players.
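Once each segment carries player-name and hole-number metadata, a query like “show me all highlights of player X” reduces to a filter plus a sort over the highlight index. A minimal sketch, with illustrative field names and data:

```python
# Hedged sketch of the metadata-driven retrieval described above: each
# highlight carries player/hole metadata (extracted from TV graphics and
# OCR). Field names and example records are illustrative assumptions.

highlights = [
    {"clip": "r1_h12_0831.mp4", "player": "Sergio Garcia", "hole": 12, "score": 0.91},
    {"clip": "r1_h16_0915.mp4", "player": "Justin Rose",   "hole": 16, "score": 0.88},
    {"clip": "r2_h15_1102.mp4", "player": "Sergio Garcia", "hole": 15, "score": 0.84},
]

def highlights_for_player(name, clips):
    """Return one player's highlights, best-scoring first."""
    hits = [h for h in clips if h["player"] == name]
    return sorted(hits, key=lambda h: h["score"], reverse=True)
```

The same filter-and-rank pattern would drive a personalized reel: filter on the viewer’s favorite players, then keep the top-scoring segments.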
The solution created for the Masters is an extension of our team’s recent work creating the first Cognitive Movie Trailer. Our technology extends state-of-the-art deep learning models, and provides effective methods for learning new classifiers from a few manually annotated training examples via self-supervised and active learning techniques.
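One common form of active learning, which keeps the manual annotation budget small, is uncertainty sampling: the current classifier scores an unlabeled pool, and a human is asked to label only the examples the model is least sure about. The sketch below shows that selection step; the probability pool and batch size are illustrative assumptions.

```python
# Hedged sketch of uncertainty sampling, one standard active-learning
# strategy for training a classifier from few manual annotations.
# The pool of predictions and the batch size are illustrative.

def most_uncertain(probabilities, k):
    """Indices of the k unlabeled examples whose predicted probability
    is closest to 0.5, i.e. where the model is least confident."""
    order = sorted(range(len(probabilities)),
                   key=lambda i: abs(probabilities[i] - 0.5))
    return order[:k]

# Predicted P(highlight) for a pool of unlabeled segments.
pool_probs = [0.97, 0.52, 0.10, 0.48, 0.85]
to_label = most_uncertain(pool_probs, 2)  # ask a human to annotate these two
```

After the human labels the selected examples, the classifier is retrained and the loop repeats, so each round of annotation effort goes to the most informative segments.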
To our knowledge, this multimodal, highlight-ranking system is the first of its kind to be deployed for a sporting event. Integrating the multiple audio and visual components (such as the detection of a player celebrating or measures of the commentator and crowd excitement levels) using AI methods for ranking golf’s exciting moments was a challenge, but it opens new opportunities for applications both inside and outside the sports industry.
We believe our technology could be extended to domains beyond sports and film production, from discovering exciting moments in home videos to real-time monitoring of people’s activities, such as the detection of an elderly person falling down in their home.