Is search less trustworthy? In Episode 18 of Mixture of Experts, host Tim Hwang is joined by IBM® Fellows Aaron Baughman, Kush Varshney and Trent Gray-Donald. Today, the experts chat about how AI is being integrated at the US Open. Next, with Perplexity introducing ads in Q4, what is the effect on search? Finally, what's all the hype with Cursor? Tune in to today's episode for all this and more.
The opinions expressed in this podcast are solely those of the participants and do not necessarily reflect the views of IBM or any other organization or entity.
Tim Hwang: So, is AI going to wipe out all sports journalists? No matter the sport, you know, we're always working with the same constant: that's the human. It seems to me that paid search and Perplexity pose some really big questions. It's all about simple economics and who's incented to do what. Are you the customer, or are you the product? Should I be using Cursor? You shouldn't ever ask a question of an LLM these days, at least, that you don't already know kind of the answer for yourself. All that and more on today's episode of Mixture of Experts.
I’m Tim Hwang, and I’m joined today, as I am every Friday, by a world-class panel of engineers, researchers, product leaders, and more to hash out the week’s news in AI. On the panel today: Aaron Baughman, IBM Fellow; Kush Varshney, IBM Fellow; and Trent Gray-Donald, IBM Fellow.
So, to kick us off, the US Open is this week, and as usual on Mixture of Experts, we’re of course excited about the tennis, but we’re really excited about the AI. I really want to talk about the role of AI in the US Open. But first, to kick us off because I personally am a huge tennis fan, let’s just go quickly around the horn. I want everybody’s nominee for the best tennis player of all time. Aaron, we’ll start with you.
Aaron Baughman: Yeah, so that’s a great question. Easy answer: Ben Shelton.
Kush Varshney: Leander Paes.
Tim Hwang: All right, I like that one. Very good. And Trent, how about you?
Trent Gray-Donald: Oh, I prefer squash, so Jonathan Power.
Tim Hwang: Okay, great. Well, thanks. I asked that question to kick off our discussion today, because what we really want to talk about is the AI at the US Open.
Aaron, in particular, I wanted to have you on the panel to kick off this section because I understand you’ve been experimenting with using language models to generate both long and short-form stories for the Open. I wanted to talk a little about what you’re discovering. What’s working really well in these experiments you’ve been trying out?
Aaron Baughman: Yeah, well, thanks for having me. It’s really fascinating to watch how we apply these AI technologies, in particular these agentic architectures with a diversity of large language models deployed out at scale, to the US Open happening right now. If you go to www.usopen.org and go to news, you can see a lot of our stories that are created with both human and large language models together.
In general, we have two different types of projects. One is we’re creating hundreds of match reports, pre- and post-, long and short form, for 255 of these different matches. The second project is called AI Commentary, where we take stats, transform that into a different data representation like JSON, input that with a prompt to get out text, and then that’s voiced over with text-to-speech and embedded into these highlight videos.
Tim Hwang: Yeah, that’s really cool. Tell me a little more about how that works exactly. How do you go from a game to a report about a game? Presumably, there has to be a feed about, “This person just had a great serve,” for instance. How do you do that conversion? I think what’s interesting is you’re going from video and visual to a written medium. I’m curious how you guys approach that problem.
Aaron Baughman: Yeah, it's really neat. This is all about message-driven architectures. Whenever we get a score—for example, when a match ends—we get a message, and within seconds, often less, we'll take that message and pull in from about 14 different feeds that have raw data describing the players, the match, where they are, what they've done in the past, and we also AI-forecast what's going to happen in the future. We take all of that and turn it into a representation that a large language model can understand, like JSON elements with key values describing what's happening in tennis: how many aces somebody is getting, how many breaks somebody has won in the match.
All of that is packaged together and pushed into our scaled-out architecture—Granite, for example—and we pass it in with a prompt. The output is fluent text that describes the scene that just happened or is coming up. It’s really cool to see it live. There’s all sorts of fact-checking, quality checks, novelty pieces, and simplicity checks to make sure it’s up to par. I use “par” on purpose because we also do things for golf, which is part of our over three-year story that has evolved into the US Open.
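To make the stats-to-JSON-to-prompt flow Aaron describes concrete, here is a minimal sketch. The names here (`build_prompt`, `match_stats`, the prompt wording) are hypothetical illustrations, not IBM's actual pipeline, and the fact-checking and quality passes he mentions would happen on the model's returned text, downstream of this step:

```python
import json

def build_prompt(match_stats: dict) -> str:
    """Turn raw match stats into a JSON-backed prompt for an LLM.

    The model sees key-value pairs (aces, breaks won, etc.) and is
    asked to produce fluent narrative text describing the match.
    """
    payload = json.dumps(match_stats, indent=2)
    return (
        "You are a tennis journalist. Using only the statistics below, "
        "write a short, factual match report.\n\n"
        f"Statistics (JSON):\n{payload}\n"
    )

# Example message arriving from a score feed when a match ends.
stats = {
    "players": ["Player A", "Player B"],
    "winner": "Player A",
    "score": "6-4 3-6 7-6",
    "aces": {"Player A": 12, "Player B": 7},
    "breaks_won": {"Player A": 3, "Player B": 2},
}

prompt = build_prompt(stats)
# `prompt` would then be sent to a hosted model endpoint, with
# fact-checking and simplicity checks applied to the output text.
```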
Tim Hwang: That’s great. Trent, I saw you nodding at the mention of Granite. I don’t know if you have a connection to the project, but I’m wondering if you can paint a picture of where you see all this going. We’re seeing these experiments in golf, now in tennis. Should we expect in five years that a lot of sports coverage, summaries, and commentary will be AI-generated? Or is this more of a sports-specific thing?
Trent Gray-Donald: I think this is just the beginning of a lot of different initiatives. The reason I’m nodding is that Aaron and I actually... I run the Watsonx SaaS service that does the inferences that Aaron calls. So he’s basically calling my service when he does the work. I’m the plumbing; he does all the interesting domain-specific work tying together the data sources, and it ends up coming into our service. He and I work together on figuring out how to handle the capacity and latencies.
In general, how Aaron’s built it... I see this whole agentic universe. There’s a spectrum from highly scripted to letting the LLMs do what they’ll do, and there’s a big middle ground. For live events, for human things like sports, we’re going to start seeing increasingly interesting agentic architectures emerging that will extend beyond a given sport.
The interesting question is always: can you find the right unique snippets to tell people? One of the jokes we have—we’re big baseball fans—is when we’re listening to the play-by-play and they come up with these ridiculous statistics: “This is the third player since 1943 who stood on their left foot and wiggled their ear.”
Tim Hwang: Yeah, I’ve come to expect that. I watch a lot of soccer, and it feels like commentators just fill space with a remarkable bank of edge-case statistics. The question is, can we capture and distill that? Obviously, there’s a lot of data mining going into producing those right now. How do we connect those and make it engaging, interesting, and human?
Trent Gray-Donald: That’s right. Yeah, for sure.
Tim Hwang: Kush, curious about your thoughts. I know one aspect of your fellow work is thinking about AI governance, which is how these systems influence people. One response is, “What is a sports journalist supposed to do in the future where a lot of their work generating coverage and commentary is automated?” As Aaron said, there are ways for humans and AI to work together. I’d love to hear how you see that relationship evolving. Is there a role for humans in an AI-enabled sports future?
Kush Varshney: Yeah, I think we’re going to talk more about human-AI collaboration towards the end as well. The way I think about it isn’t so much about what job we’re trying to automate away, but really the question of the dignity of the humans involved. If you’re the human and you’re subservient to the AI, you have no dignity left in many ways.
What are the workflows we can set up to get a better product, still get the advantages of automation, but leave the dignity of the human intact? One way to think about it is House, M.D., the TV show. Dr. House had his residents doing stuff for him, conducting tests, but there was a very adversarial relationship; they were always trying to prove him wrong. If we can get AI systems to be in that mode, working with the human, then the human still stays with the agency and dignity but gets the benefit of the high technologies. I think something like that could play out as we go forward with a lot of different human collaborations.
Tim Hwang: Yeah, I love the idea that in the future, a sports commentator will have an agent that generates those weird statistics Trent mentioned—an expert on finding and identifying those as the action evolves.
Before we move on, Aaron, maybe we’ll close this segment with you. This is your work getting shine at the Open. I’m curious: are some sports easier or harder to do this kind of work with? Theoretically, is any sport amenable to this story generation, or are there aspects of tennis or golf that make them ideal test cases? Did you pick this because you love tennis, or were there scientific reasons?
Aaron Baughman: Yeah, you know, the “no free lunch” theorem—there’s not a perfect solution for every problem—is applicable here. Every sport has a pro and con. It all comes down to what data is available, what the scale is, and what use case the fans want to see. I wouldn’t say there’s a perfect sweet spot in any one singular sport; there’s always a challenge.
Some challenges we've discussed are making sure we have meaningful stories and stats that bubble up. We use things like standard deviations around aces, for example, because a raw number of aces isn't significant on its own; it depends on sets played, whether it's a men's or women's match, who's playing. We have to break that down. If we go to racing, football, or soccer, it's similar: you apply the same mathematical techniques to the stats that can bubble up.
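Aaron's point about normalizing ace counts can be sketched with a simple z-score. The historical sample here is made up for illustration; the actual US Open system presumably conditions on more factors (sets played, men's vs. women's draw, opponent):

```python
from statistics import mean, stdev

def ace_z_score(aces: int, sets_played: int,
                historical_aces_per_set: list[float]) -> float:
    """Compare a player's aces-per-set rate to a historical sample.

    A raw ace count isn't meaningful on its own; normalizing by sets
    played and by the historical distribution tells us whether the
    performance is unusual enough to 'bubble up' into a story.
    """
    rate = aces / sets_played
    mu = mean(historical_aces_per_set)
    sigma = stdev(historical_aces_per_set)
    return (rate - mu) / sigma

# Hypothetical aces-per-set rates from past matches.
history = [2.0, 3.5, 4.0, 2.5, 3.0, 5.0, 1.5, 3.5]
z = ace_z_score(aces=18, sets_played=3, historical_aces_per_set=history)
# A large positive z suggests the ace rate is a genuine outlier worth
# surfacing in a report, rather than just a big-looking raw number.
```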
Another exciting area is getting human and machine working together. There’s a pendulum of how creative you want the LLMs to be versus how prescriptive. We tend to go somewhere in the middle, but it’s all experimental. It’s almost like the theory of mind: we want to predict what action a human editor will take so we can meet their expectations when we generate text.
No matter the sport, we’re always working with the same constant: the human. The other constant is data; we need access to it. But it’s fun, impactful, and a way to bring people together irrespective of creed, gender, race. It’s exciting to use a lot of Trent’s work and Kush’s work and bring it together for the world to see. I encourage you to check out usopen.org to see our work live—read the match reports, listen to commentary. It’s fascinating to watch the field evolve.
Tim Hwang: Yeah, for sure. This is where the magic happens. AI can be abstract; it becomes clear with an application like this where it helps you enjoy something you already love. It makes a huge difference.
Aaron Baughman: Yeah, and it’s fascinating to watch the field evolve in real-time.
Tim Hwang: I’ll introduce this by talking about Perplexity. Perplexity is a leading company in the generative AI movement, providing language models as an interface for search. The idea is a future with more conversational search experiences than current ones like Google, where you type a query and get a list of responses.
Perplexity has been one of the best products in the space for me; it’s one of the few I pay for and use weekly. An interesting news story popped up recently: Perplexity announced it’s moving towards a model where they roll out paid search. The background is, in the past, you subscribed to Perplexity and paid a monthly fee. Now they’re saying they’ll monetize by allowing people to buy ads on their platform. So if you search for “what is the best exercise machine,” you might see an ad from Peloton.
This is a big shift. One big hope for this technology was that conversational interfaces would be better and we might move away from ads to a subscription world, resulting in more faith, trust, and confidence in search results.
Trent, I’ll start with you. How do you feel about this? Does it make search less trustworthy? Should we be concerned about Perplexity’s shift?
Trent Gray-Donald: Well, in my view, yes, absolutely. I’m a big fan of “follow the money.” It’s all about simple economics and who’s incented to do what. Are you the customer, or are you the product? It’s very simple: as you shift to paid search, you become more of the product instead of the customer. My usual reaction is that this is not going to bode well for us as consumers.
Tim Hwang: Yeah, for sure. I remember an essay by Larry and Sergey when they founded Google, describing PageRank. At the end, they said no search engine should ever use ads because it would be terrible. Lo and behold, Google is a 90% ad-based company. But Kush, it’s very hard to avoid these incentives, right? The problem with subscription is people need to pay, which limits user growth. Is there any way you think of escaping ads as a business model in this space?
Kush Varshney: I’m really not sure. But one thing I want to point out, maybe counter to what Trent is saying, is that investment into an ad-based approach should also lead to investment in technologies that help with trustworthiness. Source attribution is a big problem with LLMs; you don’t know where the information in the generative output came from.
If that’s part of the monetization, there will be more investment into scalable source attribution techniques, which can increase trust—maybe not just for ad-driven platforms, but in general. Better techniques for tracing information can help go back and check for hallucinations. Incentives can work in weird, roundabout ways. The ad-driven aspect may or may not be good for trust, but it might lead to investment in things that do help.
Trent Gray-Donald: I agree in theory, but in practice, what incentive does Perplexity have to provide attribution in a better way? Do they just start obscuring it? Who’s got the leverage to prevent that? The fundamental thing is, we could, but we don’t.
Tim Hwang: There’s also another element: in a world of chat-based search, the trust problem might be worse. With Google, you have ten blue links; you can question why one link is ranked over another, and sponsored links are labeled. In a world where it’s just a paragraph, you can offer citations, but who will click through them?
Kush Varshney: Yeah, I mean, I never click “I’m Feeling Lucky”; I always want to see the ten results. But the point I was making is that whoever is paying for their stuff to appear needs assurance it will come through. If an ad has to get through the language model to appear in the output, ensuring that happens will require technology, and that same technology can be used to trace other facts. The reason it needs to be there for an ad-based business is that the advertisers need a guarantee their stuff will appear.
Tim Hwang: Aaron, I’m not going to let you be quiet on this segment. Any thoughts? Are you on Team Kush or Team Trent, or neither?
Aaron Baughman: Yeah, I think mixing revenue-driving with trust and transparency could be potentially dangerous; it could be exploited for ulterior purposes. But it's about balance. I read an article about Goldman Sachs saying there's too much AI spend and too little benefit, and that for the AI industry to stay solvent there needs to be revenue. There's a large revenue gap today; we talked about Sequoia's $600 billion gap a while ago on this show. That stuck with me.
On the other hand, we need trust and transparency to maintain users and demand. Once people lose trust, they won’t use these systems; I wouldn’t. Another point: many Perplexity users are highly educated, high-income earners. If you can influence that group, it can influence others as they tend to be leaders in fields.
It’s important that Perplexity, like Google did, publishes papers describing their algorithms and systems for us to access, creating a “digital passport” showing where data comes from. Then it’s up to us as IBM Fellows to educate: if you’re using these AI systems, you need to do your own due diligence, maintain your belief system, and be a critical thinker.
Tim Hwang: Yeah, that’s well warranted. If Perplexity were here, they might say, “Why are we held to a higher standard? Google has been monetized with ads for years, and people use it. Why is AI special?” Part of the worry, which Aaron brings up, is whether people will be critical thinkers. AI might make it too easy, limiting how much people click through links. I know I certainly don’t.
Aaron Baughman: Yeah, I’ll say that when I’m driving using map software like Google Maps, I completely forget where I’m going and couldn’t retrace my route because I don’t pay attention. The danger of not being a critical thinker because information is so easy to get... I think we all need to be careful.
Tim Hwang: That’s right. I had an incident where I left my phone in a restaurant, hopped in the car, and started driving, then realized I didn’t know how to get back. Very embarrassing. Any final thoughts on this trend?
Trent Gray-Donald: Some really good points. Kush’s point about advertisers wanting to see where their money goes is an interesting loop back that creates an incentive for more transparency. But we’re used to Google coming back with a list, and it’s up to us. The problem with chat is that it’s more opinionated; it has a humanness, like someone talking to you. LLMs talk with authority and confidence, even when it’s not warranted.
It will be interesting to see how we develop the right filters. We all know how to deal with a Google page: scroll past the first few items. It will be interesting to see how we build defenses here and if they’re harder to build.
Tim Hwang: Yeah, that’s a big open question. We’ll have to learn as a society, like we did when the first ten blue links emerged. It feels like we’re turning that wheel again.
Trent Gray-Donald: Exactly.
Tim Hwang: I’m going to move us to our third story. Former Tesla and OpenAI leader Andrej Karpathy tweeted his love for a product called Cursor, setting off discussion about AI’s role in software engineering. The unique thing about Cursor, compared to Copilot or Cody, is that it’s an entirely standalone IDE. They forked VS Code and rebuilt it from the ground up with AI.
An interesting part of the discourse is the argument that Cursor is interesting because it’s trying to get past the paradigm Copilot set. When Copilot launched, the idea was autocomplete for AI assistance in software engineering. Cursor is playing with things like diffs on your code, chat interfaces, pushing beyond autocomplete.
Kush, do you buy that? Is Copilot already old school? Is it version 1.0? Will we look back in 10 years and no one will think about using a Copilot-like interface to integrate LLMs into their workflow?
Kush Varshney: Yeah, that’s a great question. I think it relates to what we’ve been talking about: do you trust this thing? Are those autocompletes things you can verify yourself? You shouldn’t ever ask a question of an LLM these days, at least, that you don’t already know the answer for yourself.
Some folks on my team have been doing user studies, asking what features people want from AI for code. What we’re finding is that the biggest problem is code understanding. When you’re given a dump of a new codebase—thousands or millions of lines of code with weird configurations, maybe in a language you don’t know, like Go or Ballerina—how do you get a sense of where things are, how it’s organized, what it does? I think that’s an even more powerful use case.
Once you’re at the level of knowing what line or block to write, you’re already versed in what you need to do. Autocomplete can speed things up, but even getting started is a bigger problem.
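Kush's "code understanding" use case starts with structural questions even a plain script can answer before any LLM is involved. As a hypothetical first-pass inventory (the function name and usage are illustrative, not any product's API), you might map where the bulk of an unfamiliar codebase lives:

```python
from collections import Counter
from pathlib import Path

def inventory(repo_root: str) -> Counter:
    """Count source files by extension as a first map of a codebase.

    This is the kind of structural overview (what languages are used,
    where most of the code sits) a code-understanding assistant would
    build on before summarizing what the code actually does.
    """
    counts = Counter()
    for path in Path(repo_root).rglob("*"):
        if path.is_file() and path.suffix:
            counts[path.suffix] += 1
    return counts

# Usage: inventory("./some-new-codebase") might return something like
# Counter({'.go': 412, '.yaml': 37, '.md': 12}), telling you at a
# glance this is mostly a Go project with heavy configuration.
```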
Tim Hwang: It’s funny to think we’ve focused so much on AI generating code, but you’re saying the future is better documentation—the thing that’s difficult and no one wants to do. Trent, with your Watsonx Code Assistant work, I’m sure you’re interested in that interface. Do you agree with Kush that understanding and documentation is the most important thing?
Trent Gray-Donald: Absolutely. One of my day jobs is Chief Architect for Watsonx Code Assistant. I view this as a very young space; everybody’s trying different interfaces. The number of people using chat or chat-like features that Cursor makes easy is very large. Definitely, one of the first features asked for is code generation, and there is a constituency for that, but most people revert back to, “Can you just tell me what the hell my code is doing and help me put it together?”
Figuring out how to do that, getting the appropriate context—with LLMs having larger context windows and better prompt techniques—this will keep evolving. The bigger evolution will be towards agentic systems, with more planning and discussion. The question is: will it be human-in-the-loop, or just prompt and see an app built? I think, going back to the dignity comment, human-in-the-loop where a helper says, “I’ve broken this down into six steps; human, do you agree?” and you can fix it, is key.
Everyone’s experimenting, from tiny steps to letting it all fly. Exploring this problem space will be fascinating for the next several years; nobody’s quite figured it out, and models are getting better. I’m super excited about where this goes, and I welcome the exploration Cursor is doing around innovating on interface.
Tim Hwang: It’s very exciting. The joke will be that everyone in the future becomes an engineering manager. Aaron, are you a VS Code guy? Cursor’s bid is that people are comfortable with their IDE setup—it’s like setting up your office. What’s wild is Cursor is attempting to say these AI features will be so killer you’d abandon that or get over the hump of re-configuring. Is that prospect attractive to you? Have you tried Cursor? Would you jump to it? Is the value proposition strong enough to make that shift?
Aaron Baughman: Yeah, I write code every day; VS Code is my IDE of choice. I’m a big fan of paired programming and paired testing—having multiple people work together on a task or experiment. It improves code quality, engineering quality, the scientific process, and creates long-lasting teams with continuity. Relegating software and science to prompt engineering has cons.
The pros are it accelerates productivity, helps with code completion, creates comments to understand code. There’s certainly a place for it. However, we want to ensure our engineers and scientists still understand code, can write algorithms, create new programming languages and compute paradigms, like Quantum, which is a new paradigm where LLMs might not help much yet.
LLMs have to be trained on some pile of data. If a human can’t create that data trustworthily, some creativity and skill might be lost. The hype around Cursor is real; it’s a powerful product. But I’d encourage folks to put a time limit on using these tools to maintain our sharp blade for when we really need to do engineering, so we don’t all become just prompt engineers.
I use Watsonx Code Assistant through the VS Code plugin pretty much every day; it's really good, creates comments. I also use Google's GenAI feature for ideas on writing code better. But I always try to limit myself and my team to, say, 20/80 or 50/50, and ensure we're still communicating as a team. That human interaction is important to me.
Tim Hwang: That implies two interesting things: in the future, there might be “screen time” limits for these features—“You’ve hit your limit for the week.” Also, there’s discussion about AI replacing engineers, but it feels like there will be constant pressure to learn more obscure languages because AI can’t touch those due to smaller datasets.
Trent Gray-Donald: No surprise, IBM’s been around a while and has created languages long in the tooth, like COBOL or PL/I. The amount of code for these on the internet is small, so models can’t do them well. We have more COBOL and PL/I code, so we can build better models for them. Companies approach us with esoteric languages asking for help. While esoteric languages are a barrier, especially for free models, wherever there’s a barrier, there are financial incentives to overcome it. But it’s going to be tough.
Tim Hwang: Well, great. I want to tie up today because we have the unique pleasure of having all three of you on this episode. As you may have overheard, listeners, when I introduced these guests, they are all IBM Fellows. For those who don’t know, the IBM Fellows program brings together some of the brightest minds in technology to work on projects at IBM. I looked it up: it includes a U.S. Presidential Medal of Freedom winner, five Turing Award winners, and five Nobel Prize winners.
I figured we’d take the last few minutes for people to hear about the program, what you’ve learned, and where you think it might go. Aaron, maybe I’ll toss it to you to kick us off. How has your experience with the Fellows program been, and what have you learned?
Aaron Baughman: Yeah, becoming an IBM Fellow is one of those seminal moments; it’s very surreal. My first thought was, “I hope I can live up to those who came before me and be an example to those who come after.” I’m in the middle and want to ensure we keep up to date with science, push engineering forward responsibly, and usher in the next generation of IBM Fellows.
The process of becoming a Fellow was rewarding because it helped me reflect on all the people who helped me achieve something I didn’t know was attainable. Being with Trent and Kush is amazing; I always knew and followed their work, and I didn’t know they’d be Fellows until it was announced. It’s great to be in the same class as them; it couldn’t be better in my view.
Tim Hwang: That’s great. Trent, Kush, any other reflections?
Trent Gray-Donald: I think it’s very important that companies in the technology space have leaders who are effectively pure technologists, to be the right balance to the business at times. One of the spoken things about Fellows is they are supposed to be a bit of a check and balance on what we can or should be doing in a given space.
Tim Hwang: You’re like the keepers of the technical flame.
Trent Gray-Donald: Yeah, because sometimes that’s necessary. It’s a huge honor to have become a Fellow. The number of people who have come before that I look up to is very large.
Kush Varshney: Yeah, it is extremely humbling. Looking at the list of all these people—Nobel prizes, inventing things we take for granted, like DRAM—it's humbling to be thought of in the same light. It's been a few months for the three of us. One thing I've learned from traveling within IBM and outside is that people do look up to this position as an inspiration. I hadn't thought of it that way. It's a responsibility and, as Trent said, a way to have a check and balance. All of that in one role is crazy. The three of us are going to do our best to keep this tradition alive.
Tim Hwang: That’s great. Well, it’s an honor to have the three of you on the show. I hope we can get you all back on a future episode. But that’s where we’ll wrap it up for today. Thanks, everybody, for joining. And thank you for listening to another week of Mixture of Experts. If you enjoyed what you heard, you can get us on Apple Podcasts, Spotify, and other podcast platforms everywhere. We’ll see you next week.