Wimbledon AI-generated content

By , Gray Cannon, and Sara Perelman | 12 minute read | June 23, 2021

With contributions by Michael Behrendt, Nick Wilkin, and Eris Calhoun

This year’s action at Wimbledon will draw on content and insights generated by IBM Watson. The grass is always green at Wimbledon, and AI-generated content will make the experience greener still, however you choose to follow the tournament. Our AI algorithms read, interpret, forecast, and correlate data across 18 courts to produce insightful context about impact players. You might be surprised by the players our system suggests you follow throughout the tournament.

For this year’s Wimbledon, IBM is introducing two new innovative technology solutions: “IBM Power Rankings with Watson” and “IBM Pre-Match Insights with Watson.”

  • IBM Power Rankings are an AI-powered analysis of player performance. The Tennis Tour ranking systems use 52 weeks of historical data to quantify player performance. To complement these, Power Rankings focus on a player’s most recent history, combining advanced statistical analysis, the natural language processing of IBM Watson, and the power of IBM Cloud to analyze daily performance data, mine media commentary, measure player momentum, and direct the attention of fans to the most compelling matchups.
  • IBM Pre-Match Insights with Watson are AI-generated fact sheets that help fans quickly get up to speed on every singles match at Wimbledon. The algorithms use advanced AI and IBM Cloud to mine the most recent player statistics and media commentary for insight, breaking down the individual elements of the IBM Power Rankings, sharing relevant quotes from various media sources, and constructing a natural language summary of key performance metrics.

Experience the IBM Power Rankings

The IBM Power Ranking (IPR) is a measure of a player’s strength going into and throughout a tournament. The factors that contribute to the IPR provide strong indicators of who will win a head-to-head match. The IPR is complementary to the traditional tour rank. Over a tennis season, a player’s ATP or WTA ranking is based on the number of points accrued over 19 different tournaments within a 52-week rolling window, and points that fall outside the window are dropped. As its foundation, the IPR uses relevant industry punditry observed through thousands of news sources combined with player performance to create an index of a player’s momentum. There is more overlap between the tour rank and the IPR leading up to a grand slam than during one. During a grand slam, media commentary becomes more focused and precise about individual players, which moves the two rankings toward independence, as shown in Figure 1.

Figure 1. The relationship between IBM Power Ranking and Tour Rank

Each day, the IBM Power Rankings are updated and published on a leader board. Figure 2 shows the experience on a mobile device. The board shows each player’s power ranking, power ranking movement, and tour rank. A last-updated timestamp indicates which data was used for the current power rankings. The experience is split into gentlemen’s and ladies’ power rankings to track the road to the championship.

Figure 2. The IBM Power Rankings leader board mobile experience

Here is how the IBM Power Rankings work.

IBM Power Rankings

Over 25 factors contribute to the IPR. Within the player performance dimension, a player’s win velocity, overall win ratio, and projected future win ratio account for win power. Next, the quality of a win, rank difference, injury status, tournament participation boost, round progression award, and win margin boost reward players for meaningful play. Within the natural language dimension, the crowd’s opinion about a player’s performance and health is a large factor in the IPR. Both content sentiment and normalized volume are forecast a few days ahead to provide leading indicators for the IPR. At the same time, a refocus metric adapts the overall assessment of the player to the current grand slam. This enables the IPR to respond rapidly to current play outcomes.

The IPR becomes an insight through a predictive model called “likelihood to win.” The model has 30 features that include comparative elements of the IPR. It assesses each head-to-head singles match and assigns a win probability to each player. The win probability can shift day by day as the data around punditry and performance changes. Figure 3 shows the overall architecture of the IBM Power Rankings system.
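To make the idea of comparative features concrete, here is a minimal sketch of a head-to-head win probability. It assumes a logistic model over feature differences. The three feature names and their weights are hypothetical stand-ins, because the article does not describe the 30 features or the model type.

```python
import math
from typing import Dict, Tuple

# Hypothetical weights for three comparative features (the real model has 30).
WEIGHTS = {"ipr": 0.08, "win_ratio": 2.0, "sentiment": 0.5}


def likelihood_to_win(player_a: Dict[str, float],
                      player_b: Dict[str, float]) -> Tuple[float, float]:
    """Return (P(a wins), P(b wins)) from a logistic model on feature differences."""
    # Each comparative feature is the difference between the two players.
    score = sum(
        weight * (player_a[name] - player_b[name])
        for name, weight in WEIGHTS.items()
    )
    p_a = 1.0 / (1.0 + math.exp(-score))
    # The two probabilities always sum to 1 for a head-to-head match.
    return p_a, 1.0 - p_a
```

Because the score depends only on differences, two players with identical features each receive a probability of 0.5, and the probabilities shift as the daily inputs change.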

Figure 3. Power Rankings System

The core IPR system runs on IBM Cloud Functions, a serverless technology that can run code bootstrapped by containerized technologies. A series of triggers runs action code on predefined schedules. The long-running ranking action calls itself as it processes players. Statistical data is pulled from SportRadar, while punditry is queried through Watson Discovery. The functions code calls RESTful services on OpenShift that apply natural language processing techniques to the text. The volume and sentiment trends of the queried data are forecast a few days into the future by Watson Core OneNLP. A spike forecaster that was trained by IBM AutoAI and deployed on Watson Machine Learning helps to discover anomalous future situations. The results are stored in Db2.

At the end of each player’s IPR process, a feature vector is posted to a likelihood to win Python application running on Red Hat OpenShift. The feature vector is normalized, and missing values are imputed, before it is passed to the likelihood to win predictive model. The resulting win probabilities for the two players in a match are saved to Db2. A Cognos Dashboard pulls data from Db2 into data visualizations. In parallel, IBM Code Engine aggregates the likelihood to win and IPR data into a JSON file for upload to IBM Cloud Object Storage. The IBM Code Engine publisher creates the data that feeds the myWimbledon experience.
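The normalization and imputation step might look something like the following sketch. It assumes mean imputation and min-max scaling. The feature names and the statistics in `FEATURE_STATS` are hypothetical, and the article does not say which imputation or scaling method the production system uses.

```python
from typing import Dict, Optional

# Hypothetical per-feature statistics: (mean, minimum, maximum).
FEATURE_STATS = {
    "ipr": (70.0, 0.0, 100.0),
    "win_ratio": (0.55, 0.0, 1.0),
    "sentiment": (0.0, -1.0, 1.0),
}


def prepare_feature_vector(raw: Dict[str, Optional[float]]) -> Dict[str, float]:
    """Impute missing values with the feature mean, then scale each to [0, 1]."""
    prepared = {}
    for name, (mean, low, high) in FEATURE_STATS.items():
        value = raw.get(name)
        if value is None:
            value = mean  # mean imputation for absent or null features
        value = min(max(value, low), high)  # clamp outliers into the known range
        prepared[name] = (value - low) / (high - low)  # min-max normalization
    return prepared
```

Handling missing values before the model call means a player with sparse data still produces a complete feature vector.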

Experience the IBM Pre-Match Insights with Watson

While our system creates the IBM Power Rankings, the Pre-Match Insights system applies natural language processing, AI, and statistical analysis to tennis-related content. The IBM Power Rankings, likelihood to win, and Pre-Match Insights are joined together in a single experience, as shown in Figure 4. The most meaningful insights provide transparency into both the upcoming tennis match and the power rankings.

Figure 4. The IBM Pre-Match Insights experience

In the media section of IBM Pre-Match Insights with Watson

First, we decided to focus on core media outlets to answer key questions. What makes a player interesting? What happened in their career to lead them to play at Wimbledon? Player Insights with Watson seeks to uncover the answers to these types of questions, along with any other facets of a player’s background that make them stand out from the field.

To achieve this, Watson searches for information on a given player across millions of news articles, blog posts, and other online media, supplemented by deep dives on a targeted selection of tennis sources, such as https://www.wimbledon.com. Watson gains a deeper understanding of the editorial content through natural language processing enrichments: articles are categorized by their prevalent topics or concepts, and relationships are drawn between entities such as people and places. Articles deemed relevant to both the player and the topic domain are then summarized using extractive algorithms.

Extracting sentences from a body of text means that a degree of context is lost in the process. Pronouns and time-relative references such as “two years ago” might be disconnected from their roots, leaving the summarized sentence difficult to understand. To mitigate this, we attempt to resolve orphaned coreferences using the sentences within two positions before or after the extracted sentence.
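As a rough illustration of the windowing step only, the sketch below flags sentences that may contain an orphaned reference and gathers the surrounding sentences. The cue list in `ORPHAN_CUES` and both function names are hypothetical, and actual coreference resolution is a much harder problem that this sketch does not attempt.

```python
import re
from typing import List

# Hypothetical cue words that suggest a pronoun or time-relative reference.
ORPHAN_CUES = {"he", "she", "they", "his", "her", "their", "ago", "later", "earlier"}


def needs_context(sentence: str) -> bool:
    """Return True if the sentence contains a cue that may need resolving."""
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    return bool(words & ORPHAN_CUES)


def context_window(sentences: List[str], index: int, radius: int = 2) -> List[str]:
    """Return the sentences within `radius` positions of the extracted sentence."""
    start = max(0, index - radius)
    end = min(len(sentences), index + radius + 1)
    return sentences[start:end]
```

A resolver would then search only the returned window for the antecedent, which keeps the lookup cheap and local to the extracted sentence.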

Having collected relevant articles, extracted salient information, and resolved any lingering coreferences, the next stage is to assess the quality of each sentence. Two dimensions determine the quality of a given snippet: its grammatical coherence, determined by scikit-learn surface form parse rules and a decision tree, and its topic alignment, measured by a trained machine learning model. Sentences that pass a quality threshold are considered insightful and are stored in our Cloudant natural language processing store as factoids. The factoids are stored as JSON documents by topic and player and are then sent through our Insights Human Review Tool. Here, human operators review and approve the stored factoids.
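The gating logic could be sketched as follows. The two scores are assumed to come from the grammar and topic models described above, which are not reproduced here. The threshold values, field names, and the `build_factoid` helper are all hypothetical.

```python
import json
from typing import Optional

# Hypothetical thresholds; the article does not publish the real values.
GRAMMAR_THRESHOLD = 0.7
TOPIC_THRESHOLD = 0.6


def build_factoid(player: str, topic: str, sentence: str,
                  grammar_score: float, topic_score: float) -> Optional[str]:
    """Return a JSON factoid document, or None if the sentence fails either check."""
    if grammar_score < GRAMMAR_THRESHOLD or topic_score < TOPIC_THRESHOLD:
        return None
    document = {
        "player": player,
        "topic": topic,
        "text": sentence,
        "grammar_score": grammar_score,
        "topic_score": topic_score,
        "approved": False,  # set to True only after human review
    }
    return json.dumps(document)
```

Storing the `approved` flag alongside the text lets the review tool and the publisher share one document per factoid.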

Figure 5 depicts the architecture of the factoid system.

Figure 5. Tennis factoids architecture

By the numbers section of IBM Pre-Match Insights with Watson

The on-court action at Wimbledon produces dozens of distinct statistics for fans and tennis experts to analyze. These statistics are particularly useful when previewing an upcoming matchup, as they can indicate the relative strengths and tendencies of each player. Does this player hit many winners from her forehand? Does the player often approach the net? Statistics can answer these questions and many more, giving fans insight into the forthcoming match. A skilled analyst can study data tables and uncover the areas in which each player stands out. Pre-Match Insights brings this level of comparative analysis to statisticians and casual fans alike by instead presenting the data in natural language.

IBM maintains databases that store these statistics and other relevant information using the Db2 on Cloud service. In their raw form, these stats are still difficult to interpret. Comparisons are difficult to make because matches can differ in length, from under 1 hour to over 4 hours. To normalize for this variability, IBM calculates per-point frequencies. Each frequency is then converted to a rank value with respect to that statistic among the entire tournament field of 128 competitors.
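The normalization described above can be sketched in a few lines. This example assumes the rank value is a percentile, defined as the share of the field with a strictly lower per-point frequency. That definition and the function names are assumptions, not details from the production system.

```python
from typing import Dict


def per_point_frequency(count: int, points_played: int) -> float:
    """Normalize a raw stat count by the number of points the player contested."""
    if points_played <= 0:
        raise ValueError("points_played must be positive")
    return count / points_played


def percentile_ranks(frequencies: Dict[str, float]) -> Dict[str, float]:
    """Map each player to the percentage of the field with a lower frequency."""
    values = list(frequencies.values())
    field_size = len(values)
    return {
        player: 100.0 * sum(other < freq for other in values) / field_size
        for player, freq in frequencies.items()
    }
```

Dividing by points played puts a 50-minute match and a five-set marathon on the same scale, so 30 winners in 150 points and 60 winners in 300 points both yield a frequency of 0.2.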

The most extreme values are the items that will be most interesting to the tennis audience. Additionally, Pre-Match Insights draws contrasts by highlighting the stats with the largest percentile differences between the two players in the matchup. After these key stats are selected, the system converts the stats to natural language. To do this, the system must understand the various components of a statistical highlight. These components include the subject phrase, verb phrase, and contextual phrase. As humans generate natural language using various word choices and syntactical ordering, the AI system also varies these elements to produce human-like language. The output structures and diction are then selected according to probability. At this level of variety, the natural language generation system, which is powered by open source natural language generation and IBM Research technologies, can produce hundreds of unique texts for each match’s selected stats. Additional processing then confirms grammatical correctness such as pronoun, article, and verb agreement.
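The selection and generation steps might be sketched like this. The example picks the stats with the largest percentile gap between two players, then assembles a sentence from a verb phrase and a contextual phrase chosen by weighted probability. The phrase templates, weights, and function names are hypothetical, and the real system draws on open source and IBM Research natural language generation technology that is not shown here.

```python
import random
from typing import Dict, List

# Hypothetical phrase variants with selection weights.
VERB_PHRASES = [("hits {stat} more often than", 0.6),
                ("produces {stat} at a higher rate than", 0.4)]
CONTEXT_PHRASES = [("{pct:.0f}% of the field", 0.5),
                   ("{pct:.0f}% of the players in the draw", 0.5)]


def largest_contrasts(a: Dict[str, float], b: Dict[str, float],
                      top_n: int = 2) -> List[str]:
    """Return the stats with the largest percentile difference between players."""
    shared = set(a) & set(b)
    # Sort by descending gap, breaking ties alphabetically for stable output.
    return sorted(shared, key=lambda s: (-abs(a[s] - b[s]), s))[:top_n]


def render_highlight(player: str, stat: str, percentile: float,
                     rng: random.Random) -> str:
    """Build one sentence from a subject, a verb phrase, and a contextual phrase."""
    verb = rng.choices([p for p, _ in VERB_PHRASES],
                       weights=[w for _, w in VERB_PHRASES])[0]
    context = rng.choices([p for p, _ in CONTEXT_PHRASES],
                          weights=[w for _, w in CONTEXT_PHRASES])[0]
    return f"{player} {verb.format(stat=stat)} {context.format(pct=percentile)}."
```

Passing in a seeded `random.Random` makes the varied output reproducible, which helps when the generated text has to be reviewed before publication.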

The final task of the Literature Generator web service is to persist the texts and corresponding metadata to a Cloudant NoSQL database on IBM Cloud, which feeds the human review UI. After a Pre-Match Insights package receives approval, an IBM Code Engine application joins the statistics with corresponding factoids and writes a JSON document to a bucket on IBM Cloud Object Storage. The contents of this bucket are delivered on Wimbledon.com using the IBM Content Delivery Network. The Content Delivery Network is well-equipped to serve the high traffic for these data files as they power the Pre-Match Insights features on Wimbledon.com.

Figure 6. Natural language generation for tennis architecture

Personalization: myWimbledon

myWimbledon is a new personalized experience integrated seamlessly across Wimbledon’s digital platforms, giving fans the opportunity to engage with more personalized content throughout the championships. myWimbledon uses the data created by IBM Pre-Match Insights and the IBM Power Rankings to show you the most important information.

This first-of-its-kind personalization feature for Wimbledon fans will come to life as a recommendation engine. We know that the current greats of the tennis world won’t be around forever. A recommendation engine that combines what we know about the fan with the IBM Power Rankings, powered by IBM Watson, can help fans discover and get behind the next big stars.

This custom rules-based recommendation engine will enable fans to discover new players by making suggestions based on their current favorited players, the IBM Power Rankings, top players, and other statistically rendered criteria. The recommendations will be served to fans on the homepage timeline as:

  • “Editors Pick” – a form of knowledge-based recommendation
  • “IBM Power Ranking picks” – a form of context-based recommendation
  • “User Favorite picks” – a form of individual-based recommendation
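A rules-based engine of this kind could be sketched as follows. The specific rules (skip players the fan already follows, avoid repeating a player across lists, order favorites by power ranking) are hypothetical illustrations, as is the `recommend_players` function. The article does not publish the actual rules.

```python
from typing import Dict, List, Set


def recommend_players(favorites: Set[str], power_rankings: List[str],
                      editors_picks: List[str],
                      limit: int = 3) -> Dict[str, List[str]]:
    """Build the three timeline lists; `power_rankings` is ordered best first."""
    # Context-based: the highest-ranked players the fan does not already follow.
    ipr_picks = [p for p in power_rankings if p not in favorites][:limit]
    # Knowledge-based: editorial choices, skipping anyone already suggested.
    editor = [p for p in editors_picks
              if p not in favorites and p not in ipr_picks][:limit]
    # Individual-based: the fan's own favorites, ordered by power ranking.
    favorite_picks = [p for p in power_rankings if p in favorites]
    return {
        "Editors Pick": editor,
        "IBM Power Ranking picks": ipr_picks,
        "User Favorite picks": favorite_picks,
    }
```

Because the power rankings change daily, rerunning the same rules each day naturally produces evolving recommendations.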

Figure 7. The recommendations timeline for myWimbledon

Over the course of the tournament, these player recommendations will evolve and alert the fan to newly recommended players. myWimbledon also includes smart links out to our other new features such as the IBM Power Rankings Leaderboard and Pre-Match Insight for the recommended players.

In addition, for myWimbledon users, each recommended main draw singles player will have a produced highlight package, containing 2-3 minutes of content showing the best points won by that player in each round.

Figure 8 shows how the overall personalization system works.

Figure 8. Overall personalization system

Each content producer system, such as Factoids, NLG from Stats, NLP Optimization, AI Highlights, and Power Rankings, creates deep and diverse types of information about gameplay. The content is stored in three independent Cloudant NoSQL databases. At the same time, current game state such as player statistics is streamed into a Db2 database. All of the data is joined together by a Python publisher application. The application is containerized with Docker and runs as a Flask-based RESTful service. The image is pushed to the IBM Cloud image registry and run on IBM Code Engine. A subscription-based service was created to schedule when the publisher API is called. The following IBM Cloud CLI command creates a ping subscription that calls the publisher every 10 minutes.

ibmcloud ce sub ping create --name IPRpublisherscheduledev --destination IPRpublisherdev --data '{}' --schedule '*/10 * * * *' --path example


IBM Cloud Code Engine runs containers, batch jobs, apps, and functions in a serverless fashion. Developers can therefore run the broadest possible set of workloads without managing IT infrastructure, which yields high capex and opex savings together with a very high level of productivity.

After each run, player lists and IBM Power Ranking data is converted to JSON and uploaded to IBM Cloud Object Storage. The IBM Cloud Object Storage bucket is fronted by a Content Delivery Network and consumed by Wimbledon experiences.
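The conversion step might resemble the sketch below, which joins power ranking rows with win probabilities and serializes them to JSON. The field names (`player_id`, `ipr_rank`, `likelihood_to_win`) and the function are hypothetical. The upload to IBM Cloud Object Storage is deliberately left out because it depends on service credentials and an SDK that this article does not show.

```python
import json
from datetime import datetime
from typing import Dict, List


def build_publisher_payload(power_rankings: List[Dict],
                            win_probabilities: Dict[str, float],
                            updated: datetime) -> str:
    """Join IPR rows with win probabilities and serialize them as JSON."""
    players = []
    for row in sorted(power_rankings, key=lambda r: r["ipr_rank"]):
        players.append({
            **row,
            # None becomes JSON null when a player has no upcoming match.
            "likelihood_to_win": win_probabilities.get(row["player_id"]),
        })
    payload = {"last_updated": updated.isoformat(), "players": players}
    return json.dumps(payload, sort_keys=True)
```

Including a `last_updated` value in the document lets the front end display the timestamp that tells fans which data the rankings reflect.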

Enjoy the tournament

The tennis data that matters most finds its way to fans. This extends fan engagement while preparing fans for the next generation of tennis play. The current stars of Wimbledon are nearing retirement after their historically long reign, and fans need their next Wimbledon hero. This personalized experience will help fans rally behind the next big stars in the tennis world based on players and content recommended by AI.