Date: 2025-09-02
Is that tennis announcer narrating a thrilling rally on the court a human or a bot? Soon, thanks to advances that combine computer vision and text-to-speech language models in new ways, it may be difficult for fans to tell the difference. But that could be a good thing: for events like the US Open, which hosts more than 300 matches across 15 days this year, it is logistically tricky and prohibitively expensive to staff human announcers to cover all the action on the courts.
IBM researchers from the MIT-IBM Watson AI Lab are combining AI models in this way to adjust speech elements like intonation and volume in AI-generated sports commentary so that it sounds more lifelike and engaging. For example, the models can detect when fans and players get particularly excited after a big point, and the AI voice can grow more animated in response, instead of delivering all commentary with the same robotic level of enthusiasm.
“The idea of AI-generated commentary is not to replace humans,” said Rogerio Feris, a Principal Scientist and Senior Manager of the MIT-IBM Watson AI Lab. “It’s to augment humans and provide more coverage for courts that currently lack commentary.”
Back in 2023, IBM Consulting and IBM Research teamed up to bring AI-generated sports commentary to the US Open and Wimbledon. To do so, the team first used computer vision to extract play-by-play metadata from video footage, capturing every detail of the game. The model detected court and net movement and position; tracked players and balls; classified various shots, from backhands to forehands to volleys; and identified the direction of each shot. The researchers combined this metadata with information from other modalities, such as the loudness of the cheering crowd, as well as match scoring data and radar-measured ball speed. Then, the team fed this rich metadata into a large language model (LLM) that was fine-tuned to produce natural-language commentary as output.
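To make the data flow concrete, here is a minimal Python sketch of how per-point metadata from several sources might be bundled and serialized into a prompt for a commentary model. The class names, field names, units and prompt wording are all assumptions made for illustration; the article does not describe IBM's actual schema or prompt format.

```python
import json
from dataclasses import dataclass, asdict
from typing import List


@dataclass
class ShotEvent:
    # Hypothetical record for one shot recovered by computer vision.
    player: str
    shot_type: str   # e.g. "forehand", "backhand", "volley"
    direction: str   # e.g. "cross-court", "down-the-line"


@dataclass
class PointMetadata:
    # Hypothetical bundle of the modalities named in the article.
    score: str                # match scoring data
    shots: List[ShotEvent]    # play-by-play from video
    crowd_loudness_db: float  # audio signal (assumed unit)
    ball_speed_kmh: float     # radar measurement (assumed unit)


def build_prompt(point: PointMetadata) -> str:
    """Serialize one point's metadata as JSON inside a text prompt."""
    payload = json.dumps(asdict(point), indent=2)
    return "Write one sentence of tennis commentary for this point:\n" + payload


point = PointMetadata(
    score="40-30",
    shots=[ShotEvent("Player A", "forehand", "cross-court")],
    crowd_loudness_db=82.0,
    ball_speed_kmh=191.0,
)
print(build_prompt(point))
```

The resulting string would be passed to whichever fine-tuned model produces the commentary; that call is left out here because the article does not specify it.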
The next task was making the AI outputs sound like a human. While the output was accurate from a content perspective, most LLMs cannot produce language that incorporates prosody, the linguistic term for the elements of speech—such as intonation, stress, rhythm and loudness—that provide different meanings to given words or phrases.
For example, if an agent powered by an LLM says “Today’s weather is sunny,” and the human interacting with the model says “I am sorry. Can you say it again?” then the LLM would repeat “Today’s weather is sunny.” In a conversation between two humans, however, the person responding would typically slow down and enunciate “Today’s weather is sunny,” and perhaps speak a bit louder.
LLMs typically represent content well in text transcriptions, but prosody remains unfamiliar to them. Consequently, effective modeling of prosody and its connection to content is essential for training speech-augmented LLMs.
Motivated by this challenge, IBM researcher Yang Zhang, a colleague of Feris’s at the lab, recently wrote a paper about a new speech LLM he helped develop called ProsodyLM. The model is built on a very simple tokenization scheme: each speech utterance is first transcribed into text, and the text is then followed by a sequence of word-level prosody tokens describing the F0, duration and energy of each word. F0 is the “fundamental frequency,” meaning the frequency at which the vocal folds vibrate when voiced speech sounds are made. People’s perception of this physical vibration is what we call “pitch.”
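The tokenization idea can be illustrated with a short Python sketch: the words of the transcript come first, followed by three prosody tokens per word. Everything specific here is an assumption for illustration, including the `<prosody>` separator, the token spellings, the eight uniform bins and the value ranges; the ProsodyLM paper defines the real scheme.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class WordProsody:
    word: str
    f0_hz: float       # fundamental frequency for the word (assumed: mean, in Hz)
    duration_s: float  # word duration in seconds
    energy_db: float   # word energy (assumed unit: dB)


def quantize(value: float, low: float, high: float, n_bins: int) -> int:
    """Map a value to a bin index in [0, n_bins - 1], clamping out-of-range input."""
    clamped = min(max(value, low), high)
    index = int((clamped - low) / (high - low) * n_bins)
    return min(index, n_bins - 1)


def tokenize_utterance(words: List[WordProsody], n_bins: int = 8) -> List[str]:
    """Emit the transcript first, then word-level prosody tokens."""
    text_tokens = [w.word for w in words]
    prosody_tokens = []
    for w in words:
        # Ranges below are illustrative guesses, not values from the paper.
        prosody_tokens.append(f"<f0_{quantize(w.f0_hz, 50.0, 400.0, n_bins)}>")
        prosody_tokens.append(f"<dur_{quantize(w.duration_s, 0.0, 1.0, n_bins)}>")
        prosody_tokens.append(f"<eng_{quantize(w.energy_db, 30.0, 90.0, n_bins)}>")
    return text_tokens + ["<prosody>"] + prosody_tokens


utterance = [
    WordProsody("What", 220.0, 0.25, 70.0),
    WordProsody("a", 180.0, 0.10, 60.0),
    WordProsody("play", 300.0, 0.50, 80.0),
]
print(tokenize_utterance(utterance))
```

Because prosody is expressed as ordinary discrete tokens, a language model can be trained on the combined sequence the same way it is trained on text.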
The team introduced a three-step pipeline to bring prosody into sports commentary. In the first step, a computer vision model consumes several kinds of match data, such as the score, play-by-play information, the level of crowd excitement and the players’ expressions and movements, then uses that information to generate an excitement score. In the second step, the excitement score and corresponding data are fed into a second IBM LLM, which generates a script whose tone varies with the excitement of the action on the court.
“For normal rallies, it might just spell out what’s happening on the court,” said Zhang in an interview with IBM Think. “But for extremely exciting rallies, it might add in ‘What a play! That’s unbelievable!’”
Finally, the script goes into the text-to-speech ProsodyLM, so that the audio can incorporate elements of rhythm, energy and pitch that make it sound as natural as possible.
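The three steps can be mimicked in a toy Python sketch. It is not the team's method: simple hand-written functions stand in for the computer vision model, the script-writing LLM and ProsodyLM, and the input signals, weights, threshold and scaling factors are all hypothetical.

```python
def excitement_score(crowd_level: float, rally_length: int, big_point: bool) -> float:
    """Step 1 stand-in: blend a few signals into a score in [0, 1].

    crowd_level is assumed to be pre-normalized to [0, 1]; the weights
    and the 20-shot rally cap are illustrative, not from the article.
    """
    crowd = min(max(crowd_level, 0.0), 1.0)
    rally = min(rally_length / 20.0, 1.0)
    stakes = 1.0 if big_point else 0.0
    return 0.5 * crowd + 0.3 * rally + 0.2 * stakes


def write_script(description: str, score: float, threshold: float = 0.7) -> str:
    """Step 2 stand-in: add an exclamation only for highly exciting rallies."""
    if score >= threshold:
        return description + " What a play! That's unbelievable!"
    return description


def prosody_targets(score: float) -> dict:
    """Step 3 stand-in: scale pitch, energy and speaking rate with excitement."""
    return {
        "pitch_scale": 1.0 + 0.3 * score,
        "energy_scale": 1.0 + 0.5 * score,
        "rate_scale": 1.0 + 0.2 * score,
    }


score = excitement_score(crowd_level=0.9, rally_length=24, big_point=True)
print(write_script("A backhand winner down the line.", score))
print(prosody_targets(score))
```

In the real system a learned model, rather than fixed multipliers, decides how rhythm, energy and pitch should change; the sketch only shows how a single excitement score can influence both the wording and the delivery.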
ProsodyLM was pre-trained on 30,000 hours of audiobooks and ultimately demonstrated better prosody understanding than prior models across various categories. ProsodyLM could, for example, correctly recognize emotion and stress in speech utterances, without being trained to perform those tasks. By explicitly tokenizing the prosody information and content, the resulting language model can generate very expressive speech, develop a preliminary understanding of emphasis and emotion and successfully clone the styles in reference speech.
“Now, instead of AI commentators that speak in a monotone level of excitement and sound unnatural to audiences, these tools can express a high excitement level just like human commentators, who get much more expressive during a very exciting rally,” Zhang said.
Looking forward, once the prototype has advanced to production and excitement-driven sports commentary is rolled out in official tennis tournaments, a next step could be letting fans personalize the commentary, Feris said. For example, fans could choose between high- and low-excitement commentary. In the meantime, Zhang said, the team is receiving a lot of interest from researchers and clients working on other sports, such as Formula 1 car racing.
In addition, this excitement-driven AI sports commentary was part of IBM’s “Behind the Scenes” demo of emerging tennis technologies at the 2025 US Open. This means that in the not-so-distant future, you might want to tune in more closely to see if you can detect whether it’s a human or an AI announcer whipping up the crowd after an overhead smash or a tricky drop shot.
