How AI picks the most exciting moments at the US Open without bias

9 minute read | September 6, 2019

Note: This blog post was authored by Aaron Baughman with Stephen Hammer, Eythan Holladay, Eduardo Morales and Gary Reiss.

Tennis play at the US Open consists of 254 matches in the men’s and women’s singles events totaling tens of thousands of points. During the tournament’s two weeks, many matches are played in parallel, and it’s virtually impossible for any tennis fan, or the editorial team at the United States Tennis Association (USTA), to capture any sizable percentage of the best points.

To help solve this challenge, IBM built an AI system that clips and creates candidate highlight videos and assigns a fair excitement score, all within two minutes of the end of each match. Every highlight is ranked so that tennis fans and video editors at the USTA and its broadcast partners can see the most exciting points of the tournament, while minimizing the influence of player gestures, match analytic score, player rank, player age and crowd size.

This required teaching IBM Watson to better recognize acoustics, crowd cheer, commentator tone, gestures and facial expressions, and to understand and remove inadvertent AI bias. The result is a higher-quality selection of sports highlights, and a process that may influence the way fans watch sports and athletes train for them.

Let’s take a closer look at what happens behind the scenes.

The mind of AI as editor: Picking out top plays

Video streams from US Open courts are ingested and interpreted by machine multimedia comprehension algorithms. These techniques condense video from full-length tennis matches into clipped highlights by using computer vision and sound to determine scene boundaries. The camera angle and the transitions between different angles are helpful in selecting scenes. However, computer vision alone produces false positives when a player performs an unexpected action before or after the point, or when the broadcaster changes the viewing perspective to keep the content compelling.

Because the visual aspect of a tennis match is highly diverse, with its variety of angles, obstructed views, resolutions, contrast and colors, we began analyzing the sound component. In general, the sound of a tennis match is more stable and consistent than the video. We developed a system that detects events such as ball hits and point boundaries in a tennis match. This required solving many interesting technical challenges, which we spotlight in our technical paper, Detection of Tennis Events from Acoustic Data. It will be presented at the Association for Computing Machinery (ACM)’s Second International ACM Workshop on Multimedia Content Analysis in Sports, and it will be available here after October.

The information architecture that can record, isolate, and process 50,000 tennis plays is fascinating and you can read more about it in our post on a similar system we deployed at Wimbledon.

Scoring excitement – fairly

Once a play is isolated as a video clip, it needs to be automatically scored for excitement, and scored fairly. Each play is enriched with 39 tennis, crowd, biographic and statistical features that are sourced from an IBM DB2 on Cloud database. One of the features, rank, is the average rank of the tennis players within the match. The average rank gives us a single privileged value that helps us identify and mitigate bias in highlight scoring. Each enriched record is placed into an emotion queue for highlight ranking. Next, a scene excitement ranking system pulls each new record and creates component-level highlight measures using AI.
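As a rough illustration of the enrichment step, the sketch below attaches the average player rank to a play record. The record layout, the field names and the example ranks are all assumptions made for this example; the post does not describe the actual schema.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import Dict, List


@dataclass
class PlayRecord:
    """One isolated play. All field names here are hypothetical."""
    clip_id: str
    player_ranks: List[int]
    features: Dict[str, float] = field(default_factory=dict)


def enrich_with_average_rank(play: PlayRecord) -> PlayRecord:
    """Store the average rank of the players in the match as the 'rank' feature."""
    play.features["rank"] = mean(play.player_ranks)
    return play


# Example with made-up ranks for the two players in a singles match.
play = enrich_with_average_rank(PlayRecord("clip-001", [3, 27]))
print(play.features["rank"])
```

Collapsing the two ranks into one number gives the bias monitor a single attribute to watch per match rather than one per player.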

The highlight results are placed onto a queue for bias processing. An agent-based system, written in Python, pulls each new record and calls IBM Watson Machine Learning for an overall context highlight score. The context score comes from a trained predictive model that combines all 39 predictors into a single excitement rating. In the process, IBM Watson OpenScale continuously learns and debiases the context excitement score based on selected attributes such as court and average player rank in a match.
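The pull-and-score loop can be pictured with a small in-process queue. This is only a sketch: the real system uses a deployed model and its own queueing infrastructure, while `score_context` below is a hypothetical stand-in that simply averages the features, and the record layout is assumed.

```python
import queue


def score_context(features):
    """Hypothetical stand-in for the deployed model: a plain average, not the real predictor."""
    return sum(features.values()) / len(features)


def drain_bias_queue(bias_queue, score_fn=score_context):
    """Pull every waiting record and attach a context score to it."""
    scored = []
    while True:
        try:
            record = bias_queue.get_nowait()
        except queue.Empty:
            break  # nothing left to process
        scored.append({
            "clip_id": record["clip_id"],
            "context_score": score_fn(record["features"]),
        })
    return scored


# Example with one made-up record.
pending = queue.Queue()
pending.put({"clip_id": "clip-001", "features": {"cheer": 0.8, "gesture": 0.4}})
results = drain_bias_queue(pending)
print(results)
```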

The AI Highlights system at the US Open uses several deep learning and machine learning techniques to determine the excitement level of a video. Each video is split into its video and sound components. The sound is converted to MP3 format and placed on a disk store. A Python process picks up the MP3 and sends the content into a Convolutional Neural Network (CNN) called SoundNet, using the PyTorch library. The last layer of the CNN is removed to retrieve the spatial representation of the sound. The resulting feature vector is input into a Support Vector Machine (SVM) that was trained on the domain of tennis. Two SVMs are applied, one producing a crowd cheer excitement score and the other a commentator speech excitement score. Each score is then scaled to compensate for changes in video sound from year to year at the US Open.
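To make the last two steps concrete, here is a minimal sketch of scoring one sound embedding with two classifiers and rescaling the result. It assumes linear SVMs and min-max scaling against per-year bounds; the post does not state the kernel or the scaling method, and every weight, bias and bound below is invented for illustration.

```python
def linear_svm_score(features, weights, bias):
    """Decision value of a linear SVM: w . x + b (a linear kernel is an assumption)."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def scale_to_unit(raw, year_min, year_max):
    """Min-max scale a raw score into 0..1 using bounds observed for one tournament year."""
    if year_max <= year_min:
        raise ValueError("year_max must be greater than year_min")
    clipped = min(max(raw, year_min), year_max)
    return (clipped - year_min) / (year_max - year_min)


# A made-up 3-dimensional embedding; real CNN embeddings are far larger.
embedding = [0.2, 0.5, 0.1]

# Hypothetical parameters for the two classifiers.
cheer_raw = linear_svm_score(embedding, weights=[1.0, 2.0, -1.0], bias=-0.1)
speech_raw = linear_svm_score(embedding, weights=[0.5, -1.0, 2.0], bias=0.3)

# Hypothetical raw-score bounds for the current year.
cheer_score = scale_to_unit(cheer_raw, year_min=-2.0, year_max=2.0)
speech_score = scale_to_unit(speech_raw, year_min=-2.0, year_max=2.0)
print(round(cheer_score, 3), round(speech_score, 3))
```

Scaling against bounds measured for each year is one simple way to keep scores comparable when microphone placement or broadcast audio levels change between tournaments.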

The visual aspects of the video are analyzed from extracted video frames. Each image is sent into the VGG-16 neural network model within the Caffe deep learning framework. This model was pre-trained on ImageNet, a large visual database, and then adapted to recognize exciting tennis movements. An action excitement score is scaled to provide a score for a tennis player. The same set of images is used to determine the reaction of a tennis player as well as for body part detection. Portions of the body such as the head and torso are tracked to determine the speed of motion and gestures.
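The speed-of-motion idea can be sketched from tracked positions alone. The function below assumes a tracker has already produced one (x, y) pixel position per frame for a body part such as the head; the input format and the frame rate are assumptions, and the detection step itself is not shown.

```python
import math


def mean_speed(positions, fps):
    """Average speed in pixels per second of one tracked body part.

    positions: list of (x, y) pixel coordinates in consecutive frames.
    fps: frames per second of the extracted frames.
    """
    if len(positions) < 2:
        return 0.0
    # Sum the distance travelled between each pair of consecutive frames.
    travelled = sum(math.dist(a, b) for a, b in zip(positions, positions[1:]))
    elapsed_seconds = (len(positions) - 1) / fps
    return travelled / elapsed_seconds


# Made-up head positions over three consecutive frames at 30 frames per second.
head_track = [(0, 0), (3, 4), (6, 8)]
print(mean_speed(head_track, fps=30))
```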

Each of the individual scores from cheer, action and body motion contributes to the crowd cheering and player gesture scores, and to the overall excitement score shown just below. Each excitement score is saved into the Cloudant data store for downstream processing by the debiasing app.

The overall excitement score takes into account crowd cheering, match analysis, and player gestures 


The debiasing Python application removes unintended bias and corrects unfair excitement levels. Several AI technologies within Watson OpenScale detect bias and correct the overall context excitement level with mitigation techniques while monitoring model accuracy. The variables player rank, player age, gesture score, crowd cheer score and point analytic score are used to measure and remove bias. For example, the gesture score from an emotional player might be biased towards excitement when compared with that of a calm, low-key player. As a result, the gesture score tends to predict that an animated player’s shot is exciting.

To remove a potential bias, the Python application creates an overall context excitement score by applying a trained SVM that was deployed on Watson Machine Learning. Each of the scoring payloads is sent to Watson OpenScale for continual bias detection and mitigation. Throughout the debiasing process, Watson OpenScale trains a postprocess debias model that removes bias from the score given a set of monitored attributes.

Components involved in ranking highlights


Off-court factors play an important role

To determine which attributes we should monitor, we created a tennis domain ontology for each of the 39 predictors. The player popularity metrics track how many times a player’s profile is visited from US-based traffic compared to worldwide page views. We combine the popularity analytics with tournament features. Within each match, excitement levels are related to the type of hit, the type of win, and ball and player tracking statistics. In addition, the tournament round and court can be factors that contribute to the ranking of a highlight. The multimedia measures such as crowd cheer, gestures, player expression and speech tone provide significant insights into the importance of a point. Finally, we included player biographic information so that country of origin, rank and age could have an opportunity to influence the highlight rank. Many of the values within the tennis domain ontology were protected attributes with specific privileged values that may not have group fairness. We identified considerable bias when examining tennis court of play and average team rank.

Monitored attributes used in calculating rank


Throughout the US Open, Watson OpenScale monitors the bias of context scores based on several selected attributes: gesture score, crowd score, analytic score, team rank and team age. During the tournament, different fairness variables had a diverse set of bias trends. By using Watson OpenScale, we could look at the fairness levels from differing time and fairness perspectives. One fairness perspective we analyzed was crowd score.

A view of bias detection for the monitored attributes in IBM Watson OpenScale.


Arthur Ashe Stadium, the largest court and the host of the finals of the main singles and doubles events, can seat up to 23,771 fans. The second-largest venue, Louis Armstrong Stadium, holds fewer than half as many, with a capacity of about 10,200 people. The other courts seat considerably fewer people and showcase smaller draws in an intimate setting. Given the large difference in crowd size, we began analyzing bias from crowd score. The highlight score was categorized into 5 bins, where 0 is the least exciting and 4 is the most. We wanted to create fair scores across our reference group, where the crowd ranking was between 0.7 and 1, and the monitored group, where it was between 0 and 0.69. The majority of the overall excitement bias occurred when the cheering level was between 0.13 and 0.27. When we debiased the crowd score, making large and small crowds comparable, we had a model that was 52 percent more fair.

The debiasing of highlight scores based on crowd excitement. 
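The grouping described above can be sketched in a few lines. The 0.7 threshold and the five bins come from the text; equal-width bins on a 0 to 1 highlight score, the choice of bins 3 and 4 as the "favorable" outcome, and the sample data are all assumptions made for this illustration.

```python
def excitement_bin(score, n_bins=5):
    """Map a 0..1 highlight score to a bin 0..n_bins-1 (equal-width bins are an assumption)."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    return min(int(score * n_bins), n_bins - 1)


def crowd_group(crowd_score):
    """Reference group: crowd score of 0.7 or above. Monitored group: everything below."""
    return "reference" if crowd_score >= 0.7 else "monitored"


def favorable_rate_by_group(plays, favorable_bins=(3, 4)):
    """Share of plays in each group whose highlight score lands in a favorable bin."""
    counts = {"reference": [0, 0], "monitored": [0, 0]}  # [favorable, total]
    for crowd_score, highlight_score in plays:
        group = crowd_group(crowd_score)
        counts[group][1] += 1
        if excitement_bin(highlight_score) in favorable_bins:
            counts[group][0] += 1
    return {g: (fav / total if total else None) for g, (fav, total) in counts.items()}


# Made-up (crowd_score, highlight_score) pairs.
plays = [(0.9, 0.85), (0.8, 0.45), (0.2, 0.7), (0.1, 0.3), (0.5, 0.1), (0.3, 0.5)]
rates = favorable_rate_by_group(plays)
print(rates)
```

A large gap between the two rates is the kind of signal that would prompt a closer look at whether small-crowd courts are being scored unfairly.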

We also found that the player’s age had a high correlation with highlight scores. Players who were younger than 21 drove higher levels of tennis excitement than players older than 33. For example, on August 31, 48 percent of the highlights had high excitement ratings if the player was younger than 21. That share dropped by more than half, to 20 percent, for players older than 33. After we debiased for age, we increased the overall fairness of highlights from 42 percent to 91 percent.

The debiasing of highlight scores based on player age. 
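One common way to express group fairness is the disparate impact ratio: the favorable-outcome rate of the monitored group divided by that of the reference group. The post does not say how its fairness percentage was computed, so treating it as this ratio, and treating players under 21 as the reference group, are assumptions. Under those assumptions, the August 31 figures quoted above work out as follows.

```python
def disparate_impact(monitored_rate, reference_rate):
    """Ratio of favorable-outcome rates; 1.0 means both groups are treated alike."""
    if reference_rate <= 0:
        raise ValueError("reference_rate must be positive")
    return monitored_rate / reference_rate


# Rates quoted in the text: 20 percent for players older than 33,
# 48 percent for players younger than 21.
ratio = disparate_impact(monitored_rate=0.20, reference_rate=0.48)
print(f"{ratio:.1%}")
```

The result, roughly 42 percent, is consistent with the pre-debiasing fairness figure reported above, though that agreement does not confirm this is the exact calculation that was used.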

We get a clear understanding of why IBM Watson OpenScale chooses to debias a match when we look at the predictive power results. In one example, a candidate highlight for debiasing had a high excitement score of 4 with only 26.23 percent confidence. As you can see in the next graph, the most important predictors were the gesture score, set number and worldwide profile visits over the Internet. The situational match and crowd scores contributed towards a vote for a lower highlight score. In this case, the bias was not high enough in any one dimension to select this match for mitigation.

IBM Watson OpenScale explains which features led to a particular excitement score.

Each of the top highlights, ranked by the context of play and multimedia excitement features, tells part of the narrative of the 2019 US Open. You might be surprised by some of the highlights. With our fair AI Highlights process, you can enjoy a more objective and broader approach to top tennis moments than fans have ever been able to see before.

In fact, fans from around the world were able to use the technology to create their own highlights at the 2019 IBM US Open Experience. Learn more about making AI outcomes fair and explainable with IBM Watson OpenScale here. And read more about how the US Open works with IBM here.

The IBM Experience at the US Open enabled fans to create their own highlights.

