Should we care about the GPT-5.2 rumors? This week on Mixture of Experts, we analyze the rumored “code red” release of GPT-5.2 as OpenAI responds to Gemini 3. Are the constant model drops benefitting consumers? Next, Stanford released their Foundation Model Transparency Index, revealing a troubling trend that most labs are becoming less transparent. However, IBM Granite achieved a 95/100 score. Then, our experts discuss what model transparency means for enterprise AI adoption. Finally, we debrief AWS re:Invent’s biggest announcements, including Nova frontier models and Nova Forge. Join host Tim Hwang and panelists Kate Soule, Ambhi Ganesan and Mihai Criveti for our expert insights.
The opinions expressed in this podcast are solely the views of the participants and do not necessarily reflect the views of IBM or any other organization or entity.
Tim Hwang: I’m Tim Hwang and welcome to Mixture of Experts. Each week, MoE brings together a panel of the smartest minds in technology to distill down what’s important in artificial intelligence. Joining us today are three incredible panelists. We’ve got Mihai Criveti, Distinguished Engineer, Agentic AI; Kate Soule, Director, Technical Product Management, Granite; and Ambhi Ganesan, Partner, AI & Analytics. Welcome to you all. We’re really ending the year with a bang. There’s a lot to talk about today. We’re going to talk a little bit about rumors of GPT-5.2, a new transparency report out of Stanford, and Amazon’s newest generation of their Nova models. But first, we’ve got Aili with the news.
Aili McConnon: Hi everyone, I’m Aili McConnon, a tech news writer for IBM Think. Here are a few AI headlines you might have missed this week. Both Jeff Bezos and Elon Musk are now racing to develop data centers in space. IBM has acquired data streaming platform Confluent for USD 11 billion to help ramp up agent use in enterprises. OpenAI has started training models to confess when they’ve made stuff up or taken shortcuts. Ho ho ho, a new Santa agent lets users interact with Santa via text, phone, or video chat to share what they want for Christmas and to find out if they’re on the naughty or the nice list. For more, subscribe to the Think newsletter linked in our show notes.
Tim Hwang: And now let’s see what our experts think of GPT-5.2. This is kind of an interesting story. Rumors are swirling (and by the time you listen to this, the model may actually be out) that OpenAI has effectively called a code red to get its GPT-5.2 model out to compete with the new Google model, Gemini, which indeed, as we’ve talked about in previous episodes, is very, very impressive. Ambhi, maybe I’ll start with you. This is a really interesting kind of reversal in some ways. Had we talked about it in January 2025, it would have been like OpenAI’s crushing everybody. They’ve got the state-of-the-art models. They’re ahead of everyone else. No one’s catching up. But weirdly, we’re now in a situation where Google, which we would have said at the beginning of the year was the most behind, is the one that’s causing OpenAI to react. I don’t know, is this just gossip? Are we reading too much into this, or is it really a signal that OpenAI is in some ways falling behind in this race?
Ambhi Ganesan: Yeah. And look, I think we can speculate all we want. History suggests that there’s always going to be this up-and-down roller coaster. I feel like if you made this entire saga a movie, it’s going to be full of plot twists and turns, so much that you’re going to be playing Dennis. You know who plays Sam? Right? Yeah. So it’s anybody’s guess. Of course, rumors are swirling. And I think the latest I read was that 5.2 is already on Cursor. There are indications that it may release soon. It’s not just 5.2. There’s Chestnut and Hazelnut as well accompanying it, code names for a couple of the image gen models to compete with Nano Banana Pro. So yeah, I think it’s anybody’s game at this point in time. We can speculate all we want, but hey, at the end of the day, consumers are the winners here, right? Welcome all the competition, all the good competition between the model makers.
Tim Hwang: You’re happy for the soap opera, basically.
Ambhi Ganesan: Yeah. Yeah, exactly.
Tim Hwang: Kate, would love to get your reaction on this, because I feel like at the end of this year, I’m tired. You know, it’s like every week there’s a new model out, and it’s like, what’s the difference between this model and that model? But do model launches matter anymore? Should we care about them? Or is the game really somewhere else now? And I actually wonder — I don’t know if I quite agree with your statement that the consumer wins at the end of this. Like, are we really in this race where the consumer is actually benefiting? Am I going to have this huge uptick in my productivity and daily life with 5.2?
Kate Soule: I don’t think so — and not for the potential costs that will come along with it. I think there definitely is a little bit of exhaustion that’s coming in just broadly around model releases. So I think OpenAI is going to try and capture attention back away from the success of Gemini. They’ve got to do that to save face with their broader investors and everything else they’re pursuing. But I don’t know that I would agree that at the end of the day, the consumer is going to be a lot better off the day after 5.2 is released than today.
Ambhi Ganesan: Yeah, so I get where Kate is coming from, right? But the way I look at it is at the end of the day, advances are going to keep coming. They’re going to keep coming. And what I mean by “the consumer is going to win” is that you want those advances to keep coming. You don’t want things to stagnate, right? So have that competition flowing — have that healthy competition flowing. So you keep advancing the boundaries. You keep pushing the boundaries. And so at the end of the day, as consumers of those models, there may not be dramatic changes, but every win counts. So you keep pushing the boundary, and that’s how the field advances. So at the end of the day, that healthy competition is great. You’ve got to have that.
Tim Hwang: Mihai, do you have any opinions on a model that is not yet out? Is this going to be the model that crushes everything for the year or…?
Mihai Criveti: I’m about as excited about this model as I am for the latest Windows or Mac OS hotfix. You see it in the Windows update. I was joking recently — I was like, they just dropped a new version of Zoom. Who’s excited about the new Zoom version? My take is this: many of these models are going to see minor updates that try to resolve issues with performance, with speed, with costs, with specialized use cases, with usage. And for example, in IDEs like Cursor or Codex or the equivalent of Claude Code, they’re going to try to optimize for specific benchmarks versus specific situations. But I don’t expect these updates to be necessarily revolutionary. They might put OpenAI — for the next two days, two hours, two minutes, two months if they’re lucky — ahead of Gemini in some of these specific benchmarks. Is it going to be world-changing? Likely not. It’s nice. It’s maintenance. It’s going to help with some of these specialized use cases, but I don’t think it’s going to be revolutionary. Otherwise, they would have called it GPT-6 versus 5.2, you know.
Tim Hwang: Yeah. And I think that’s one of the really interesting ironies. It feels like the situation we’re sitting in at the end of 2025 is that everybody kind of agrees there’s something rotten in the world of benchmarks, right? They don’t really provide us with a whole lot of traction on what we actually want to use these tools for. And yet, clearly they are motivating a lot of big corporate activity. OpenAI wants to be number one on all these benchmarks, and it doesn’t want to be left behind for any length of time when Gemini comes out and says, “Hey, we’re great against all these benchmarks.” But it’s almost like we’re all optimizing for the same thing, and it feels like you end up in this discussion that Ambhi and Kate were just having, which is: well, there are these maybe downstream effects where everybody sort of benefits from us constantly pushing the frontier. The other angle is: is the industry focusing on the right thing? Kate, I guess you’re nodding, if you want to respond to that idea.
Kate Soule: Well, I think what’s really interesting is that a week or two ago, Stanford’s Hazy Research lab put out a report looking at intelligence per watt, basically: how much performance we’re able to drive per watt of electricity powering the compute. What they found is that a lot of the adoption and market share is with these big hosted models like the latest GPT models, but that if you actually look at what you can achieve by moving some of those workloads locally, you can get the same amount of performance at a lot lower energy consumption, a lot lower cost. So I think they argue that there’s a huge opportunity for disruption here, that the model providers might not be focused on the right metrics. And I would tend to agree with that. I think that right now we’re chasing a lot of investment dollars and prioritizing fancy benchmarks. But a lot of the future development is going to be incentivized more by performance per cost. And you don’t quite see that in the conversation today with these model releases that are coming out.
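[Editor’s illustration] A rough sketch of the intelligence-per-watt idea Kate describes, assuming the metric is simply task accuracy divided by average power draw. The deployment names and all numbers are invented for illustration; they are not figures from the Stanford report, and the report’s exact definition may differ.

```python
from dataclasses import dataclass


@dataclass
class Deployment:
    name: str               # hypothetical label, not a real product
    accuracy: float         # fraction of eval tasks answered correctly (0..1)
    avg_power_watts: float  # average power draw while serving the eval


def intelligence_per_watt(d: Deployment) -> float:
    """Accuracy delivered per watt of power; higher is better."""
    if d.avg_power_watts <= 0:
        raise ValueError("power draw must be positive")
    return d.accuracy / d.avg_power_watts


# Illustrative numbers only (assumed, not measured).
hosted = Deployment("hosted-frontier", accuracy=0.90, avg_power_watts=900.0)
local = Deployment("local-small", accuracy=0.81, avg_power_watts=90.0)

# Rank deployments from most to least efficient.
ranked = sorted([hosted, local], key=intelligence_per_watt, reverse=True)
for d in ranked:
    print(f"{d.name}: {intelligence_per_watt(d):.4f} accuracy per watt")
```

With these made-up inputs the local deployment ranks first: it gives up some accuracy but draws a tenth of the power, which is the kind of trade-off the metric is meant to surface.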
Tim Hwang: One angle I want to bring to this before we move on to the next topic: you work with a lot of customers and enterprises, right? And I think all of this comes on the backdrop of these companies obviously ultimately competing for enterprise dollars. So I’m curious — because I genuinely don’t know — when one of these new models drops, like 5.2, are customers like, “Oh man, this one’s topping all the benchmarks, I’ve got to move my entire stack over to the new model”? What’s the influence of these types of competitions, even very incremental ones, on who chooses to adopt what? Is there market influence from these kinds of launches?
Ambhi Ganesan: Yeah. There are two lenses through which you look at it. Enterprises are not going to immediately switch to the latest model at the drop of a hat. You pick a stable workhorse, you build your applications on top of that, you have to have some stability. You put it into production and then you start realizing value. It’ll be very, very tricky, very problematic to go and keep changing models at the drop of a hat. So it’s not going to happen immediately. But does it happen? Of course, it will happen. Because let’s say you track it over a period of six months or a year. Over the course of time, the pace at which these advances are happening — there is a fundamental step-function change in the performance of the models. A bunch of new capabilities have accumulated, which means, okay, from an application maintenance perspective, I do want to have a roadmap. There is a certain time window at which I say, “Okay, I’ve got a step-function change, and I’m going to go and make a switch to the latest model.” So yes, those model changes will happen and do happen, but it’s not going to happen for every single release.
Mihai Criveti: I will say the following. If you’re able to switch models at the drop of a hat, either your enterprise maturity is very low — where you’re an independent developer or a small shop and are able to just quickly switch models — or your maturity is very high, where you have all of your evals fully automated and you’re able to switch the model with a push of a button. All your evals get done, you can test your requests on the new model, and then you’re able to see, “Oh yeah, this one performs 17.3% better for my use case. It’s more cost-effective. I see the data in my observability platform in my dashboard.” You make the switch overnight. So if you’re in the middle, it’s going to be tough.
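[Editor’s illustration] A minimal sketch of the “push of a button” switch Mihai describes, assuming a model can be stood in for by a plain function from prompt to answer and that evals are exact-match checks. The stub models, eval cases, and the `min_gain` threshold are hypothetical; a real setup would call actual model endpoints and also track cost and latency.

```python
from typing import Callable, List, Tuple

# An eval case is a (prompt, expected answer) pair.
EvalCase = Tuple[str, str]
# A model is represented by any function from prompt to answer.
Model = Callable[[str], str]


def pass_rate(model: Model, cases: List[EvalCase]) -> float:
    """Fraction of eval cases where the model's answer matches exactly."""
    if not cases:
        raise ValueError("need at least one eval case")
    passed = sum(1 for prompt, expected in cases
                 if model(prompt).strip() == expected)
    return passed / len(cases)


def should_switch(current: Model, candidate: Model, cases: List[EvalCase],
                  min_gain: float = 0.05) -> bool:
    """Switch only if the candidate beats the current model by min_gain."""
    return pass_rate(candidate, cases) - pass_rate(current, cases) >= min_gain


# Hypothetical eval set and stub models standing in for real endpoints.
cases = [("2+2", "4"), ("capital of France", "Paris"),
         ("3*3", "9"), ("opposite of hot", "cold")]


def current_model(prompt: str) -> str:
    return {"2+2": "4", "capital of France": "Paris"}.get(prompt, "unknown")


def candidate_model(prompt: str) -> str:
    return dict(cases).get(prompt, "unknown")


print(pass_rate(current_model, cases))    # current model's pass rate
print(pass_rate(candidate_model, cases))  # candidate model's pass rate
print(should_switch(current_model, candidate_model, cases))
```

The point of the gate is that the decision comes from your own use-case evals rather than from public benchmarks, which is what makes a fast switch safe for a mature team.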
Tim Hwang: Well, we’ll just have to see. I guess this announcement of 5.2, I’m sure we’ll be talking about it potentially next week when it actually launches, and we’ll see how all these predictions play out. But I think that’s really interesting. And I think, Ambhi, it’s very helpful to have this discussion of how we see the competition, but it all plays out against the backdrop of the customers and how they react to this stuff.
Ambhi Ganesan: Yeah. I’m just hoping OpenAI is going to do the 12 Days of Christmas thing again — like last year. You like that? That was a good gimmick last year. 5.2, 5.3, 5.4 — one model release every day.
Tim Hwang: Yeah, exactly. Until we get to 5.12 and then they’ll roll it.
Ambhi Ganesan: Exactly, exactly. You just tweak the prompt every day and you call it a 5.0 project.
Tim Hwang: Exactly. I’m going to move us on to our next topic. So, we’ve talked about this report before, but a number of researchers at Stanford have come out with the latest edition of their transparency index. If you’re not familiar with this discussion from last year, the idea is that they’re taking a bunch of available models and trying to rank and assess basically how well these models do from the point of view of transparency: what kinds of documentation do they provide, what kinds of data disclosures do they have? I’ve always thought this is a very interesting project because when we say “transparency,” it’s a little bit like “open source” — what do we mean by that? These are attempts, I think, to get a lot more granular about what we mean when we say transparency. Kate, it’s good to have you on the show, because I understand Granite was a part of this transparency report. Do you want to talk a little bit about how you all approached it and how it all turned out?
Kate Soule: Yeah. So this is a report, as you mentioned, that Stanford does annually. We’ve participated in the past, and it really tries to break down model development into three components: upstream, the model training itself, and downstream of the model. What they do is they send a survey out to model developers, both closed and open, like IBM training our Granite models, and they invite people to participate and share information about everything. Upstream of model development, like around data curation: what models are you using to generate data to train your models on? Then the actual training process: do you release your training code? Do you release different repositories? Do you release different details about the architecture of the model? And then downstream of model use, things like: do you release benchmarks on safety? Do you release details on gaps in performance? Do you release prompts that were successfully used to attack the model? That type of thing. What they have found is that over the years, transparency has actually greatly diminished. If you look between 2024 and this report that just came out last week in 2025, most labs have reduced the degree to which they are transparent, the degree to which they share details about these different facets of model development. IBM is taking a very different approach, which I’m really proud of, really focusing on transparency and trust and being as open as possible. I think it speaks to the rigor with which we put together our strategy and policies around how we train and develop our models, which is reflected in our ISO 42001 certification that we also received this year. And it allows us to be very forthcoming with what we’re working on, how we’re building it, and how we’re contributing it to the open-source ecosystem. So we’re really proud that Granite got the top score, 95 out of 100, I believe.
And seeing where other labs are kind of going down in transparency over time, IBM demonstrated that we are actually doubling down and increasing the degree to which we’re transparent in model development.
Tim Hwang: Yeah, that’s 95 out of 100 different criteria, basically.
Kate Soule: Yes, exactly. Different indicators, different questions: do we answer and provide details? So it’s not actually looking at what was the result on this safety benchmark; it’s how transparent are you on your safety benchmarks. Do you share the benchmarks? Do you share this type of data? Which is a really cool approach.
Tim Hwang: And I think one of the things I want to ask you to speak a little bit more about is that across 100 of these metrics, you have to almost pick and choose, right? The team can’t afford to try to do everything or move everything forward on a year-to-year basis. Or maybe that is how the team is thinking about it. I’m interested in whether there are particular aspects of transparency that the team said, “Okay, this is what we’re really going to prioritize.”
Kate Soule: Yeah. So I think over the past year and a half, if you look at from where we were in 2024 to 2025, we have done a lot of work on automating and standardizing our training and development process so that there are automated records of everything. That makes it much easier to be transparent and share because there are so many minute details that go into these models — everything from when was a dataset acquired, what was the license it was acquired on, what was the source, what was the review process for it. So we actually invested heavily in the architecture around all of that data curation and training so that we can have a very streamlined lineage of our models. That makes it really easy to just be transparent and open and have that information at our fingertips. That also helps us with our own regulatory compliance requirements, where we want to be obviously best in class and able to respond to changing regulations as they evolve. And that made it possible for us to be a lot more open when it came to the transparency index this year.
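[Editor’s illustration] A small sketch of the kind of automated lineage record Kate describes, using the four details she lists (when a dataset was acquired, its license, its source, and its review process). The class name, field names, and sample values are hypothetical and do not reflect IBM’s actual tooling.

```python
from dataclasses import dataclass, fields
from datetime import date
from typing import List, Optional


@dataclass
class DatasetRecord:
    """One lineage entry per training dataset (hypothetical schema)."""
    name: str
    acquired_on: Optional[date] = None
    license: Optional[str] = None
    source: Optional[str] = None
    review_status: Optional[str] = None


def missing_fields(record: DatasetRecord) -> List[str]:
    """List the lineage fields that have not been documented yet."""
    return [f.name for f in fields(record)
            if getattr(record, f.name) is None]


# Invented example records.
complete = DatasetRecord(
    name="example-code-corpus",
    acquired_on=date(2025, 3, 1),
    license="Apache-2.0",
    source="https://example.com/corpus",
    review_status="approved",
)
partial = DatasetRecord(
    name="example-web-crawl",
    acquired_on=date(2025, 4, 15),
    source="https://example.com/crawl",
)

print(missing_fields(complete))  # nothing left to document
print(missing_fields(partial))   # fields still to be filled in
```

If every dataset carries a record like this from the moment it is acquired, answering a disclosure survey becomes a query over existing records rather than a manual reconstruction.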
Tim Hwang: Mihai, if I could bring you in. I think Kate’s already pointing out one of the interesting trends, which is obviously Granite doubled down on this, but the general trend is less transparency that we’re seeing. And this actually goes back to what we were talking about a little bit earlier about what the market incentivizes. How I read the transparency index is it’s sort of a dream of saying, “Look, people will be able to look at the index and say, ‘I want the more transparent model. Here’s how I find that,’ and the market will reward people who are more transparent.” But if anything, it feels like there’s actually been a pullback on transparency. Do you think that means that the market doesn’t really value transparency all that much?
Mihai Criveti: I think it depends on the type of business they serve. I’ve noticed in the report that B2B companies tend to be more transparent than B2C. Because regular consumers may not care if they’re running a 100 billion, 200 billion, or 500 billion-parameter model, how many GPUs it uses, how much water it consumes or how much CO2 it emits. They may not necessarily care about the cost to run the model itself; they care about the cost to the end user. B2B companies, on the other hand, do need to care if they make these models available to other companies, who may be running them on their own infrastructure. The second interesting trend I’ve seen is, like you pointed out, it went from 74% of the companies responding last year to only 30% responding this year. That’s kind of curious. If you look at xAI models or models from Anthropic or models from OpenAI, you don’t even know how many billion parameters they have. And you might not care. I would see it from one perspective: this kind of information can be used against them. “Oh, look how much CO2 or emissions this model is generating” or “how inefficient it is.” It can be used in calculating how viable their business is long-term, for example, are they actually subsidizing a lot of their end users? So a lot of this information is likely to become more transparent in B2B companies. AWS with their Nova models, IBM with their Granite models, Nvidia, and so on are likely going to become more transparent over time, while models that are focused more on the consumer market don’t necessarily need to publish those details and probably will not publish them anymore.
Tim Hwang: It almost feels like there’s going to be, on the consumer side, an Apple-ification of the world. What I mean by that is if you go back 20 years, it was like, “Okay, we have these open computing platforms and you’ve got Apple, and it’s a battle between open and closed.” And then over time, it kind of feels like everybody has been like, “Yeah, actually for the consumer, the general preference is they’re happy to pay more for a pretty closed system that’s pretty opaque. You have to go to a store and find a genius to fix these computers for you.” That’s kind of the state of play in consumer land. And then on enterprise, of course, open source has a long and robust legacy and is a huge, huge business. Do you see that happening in the world of AI applications as well, where it turns out that from a consumer standpoint, transparency is not so important that it really is forcing — “forcing” is a little strong — but encouraging companies like Anthropic and OpenAI to say, “Hey, we’re going to participate in this index and try to get a good score on it”?
Ambhi Ganesan: Well, partially, right. I always say that at the end of the day, we all sit in enterprises, but then we are also consumers. So we all wear those two hats at the same time. It’s not like we just immediately switch on and off between a consumer hat and an enterprise hat. Even when we’re sitting in an enterprise, we think with a consumer lens and vice versa. So some of those ways of thinking bleed into each other’s domains. And this is what I have noticed: I feel like the market in general is maybe asking the wrong questions. Yes, there is the prioritization on IP, which is why, if you look at the downward trend on the metrics for most of the labs, there was a huge hit on the upstream component. But I don’t think it’s necessarily that there isn’t a reward for labs to do it. I feel like the right thesis should be whether the market is asking the right question. I’ll give you an example. Just earlier this week, I was with a client, and they were talking about DeepSeek and asking, “Oh, we want to see if we should be using open-source models. What do you think about DeepSeek? Should we be using that?” This is within an enterprise setting. What DeepSeek did was it opened the mindshare for open source. So everyone started thinking about open-source models, open-weight models, and started talking about it. But I think there is a conflation of transparency with open source and open weights, and they are not necessarily the same thing. So I think what most consumers and most enterprises are inherently asking for are transparent models, but they are phrasing it as, “Hey, can I get open-source and open-weight models?” Those two are not necessarily the same. So I don’t fully buy the argument that the market isn’t asking for it. They are favoring it.
Yes, there’s the inherent tension between “I’m going to optimize for my IP” from the labs’ perspective and the market saying, “Hey, I need some transparency.” But there is definitely a demand for that transparency. It’s just that they’re asking the wrong questions, which means that the signals aren’t really coming up into these reports appropriately.
Kate Soule: Well, I will say what’s interesting about the parallel you brought up, Tim, comparing to Apple is that Apple, at the same time, has taken away a lot of the configurability and user visibility into the hardware, but they also have one of the best reputations for privacy when it comes to devices and responsible use of data and information. And deservedly or not, they’ve built a strong reputation there, and I think it is paying off with consumers. I don’t see that quite yet in model development, but I think it’s going to become more and more of a priority. Transparency is one way you can indicate it; it’s not the only way. Anthropic didn’t score as well on transparency, but they have the ISO 42001 certification, and I think they’re also very well known for their principles in ethical AI. So I think transparency is just one tool to address some of the broader societal and ethical questions that may not be the singular driving market factor but will be an important market factor in the future.
Ambhi Ganesan: Just to add on to that, I do agree with Kate, and I do think that will become a trend. Just look back at social media as a parallel. When it started with MySpace in the early days of social media, privacy wasn’t at the center of everyone’s thoughts. It was about the cool thing and the ability to network. So the capabilities were at the forefront. But then when those capabilities matured and saturated, privacy went front and center. You had the shenanigans with Cambridge Analytica and things of that nature, the congressional hearings popping up. So you started to see that pivotal shift happen. I feel like you’re going to see some of that with any new technology: the capabilities come front and center, and then once those become mainstream, you’re going to start seeing some of these privacy concerns and transparency aspects come to the forefront really soon.
Tim Hwang: Kate, maybe to wrap this section up — you’re already scoring 95 out of 100. Where do you go next year? Do you work on that last remaining five? Are we already saturating the benchmark for transparency?
Kate Soule: I think there will certainly always be new ways to think about transparency. We’re moving from models being just a bag of weights that get released in open source (in the case of Granite, at least, open-weight models) to having more systems of models and software built together. That’s going to introduce new aspects of being transparent: being transparent not just on the weights themselves and how the weights were created, but particularly on deployment, on the systems and software that are executing the deployment. The details can have huge impacts on performance. I’d love to see the transparency index evolve to encompass those aspects. I know it’s certainly something IBM’s thinking about. We’re also working on one project: thinking through how you create a standardized AI bill of materials and have that be a standard artifact that can be released with models. I don’t want to give away too much, but expect some work on that to come out from IBM in 2026. I think there’s going to be a lot more focus on standardization, a lot more focus on deployment of these models. So still lots to do, and we’re eager to work on it.
Mihai Criveti: I’d love to see more transparency over the infrastructure as well — the APIs they put in front of the models. Even the system prompt is kind of invisible. If you’re comparing the OpenAI model to ChatGPT as an end-user application, there’s a lot of other stuff going on in there which is unknown.
Tim Hwang: I’m going to push us on to our final topic. The big Amazon AWS re:Invent conference was just the other week. A number of really interesting announcements came out of that that we didn’t get a chance to cover in previous episodes. I started the episode by being like, “I’m bored of all these new model releases,” and we’re going to end with an Amazon release of some new models. So I’m a hypocrite, I suppose. The big news coming out of the conference is that Amazon announced its latest generation of Nova Frontier models. Amazon has always been really interesting in the MoE discussion just because they’ve always been kind of looming in the background. They have huge infrastructure. They have incredible data with all the e-commerce stuff. So it seems very natural that at some point they would really start making some very big swings in the AI space and in the model space. Ambhi, the question for you is: is this the big swing? Nova really feels like they’re touting this as “we’re now in the game.” Are they in the game?
Ambhi Ganesan: Well, there were some releases of Nova even last year, so Nova isn’t completely new. So technically, they’re saying, “Hey, we were already in the game last year.” Some of those advances are par for the course. They are releasing speech-to-speech models, which others are releasing as well. A couple of new advances came out: Nova Forge, which they’re touting as “we’re going to democratize multiple different mechanisms for you to go and build your own models.” It’s not just fine-tuning mechanisms — it’s still murky on exactly how they do this — but it’s almost like, “Hey, we’ll give you checkpoints, and then you come and blend in with your data and build your own custom pre-trained models from scratch. We’re going to democratize it. Enterprises can just go and do it. You don’t have to have a complete research lab to do it.” So that’s really exciting. The question again, if I put an enterprise lens on it: great, but how many of those capabilities are going to be used for how many enterprise use cases? A large mainstream set of use cases can be largely driven with your models out of the box with appropriate integrations. You may not need custom fine-tuned models or even custom pre-trained models for a good chunk of the use cases. So great capabilities. It’s a great push on the engineering side of things. Fantastic looking at it as an engineer, but also trying to think about what the enterprise value is and how that slots in. There’s another one: Nova Act, which is the enterprise equivalent of OpenAI’s browser use or Gemini’s browser use — being able to do that. The differentiation they talk about is, “Hey, now we have trained it on enterprise screens. So it’s not doing it on Instacart shopping; you’re training it on CRM screens. And we think we are way more equipped to handle those sorts of enterprise screens.” Still early days.
Tim Hwang: I think that piece is actually exciting. Because let’s all be honest: there’s always going to be a data and API issue, and there’s always going to be issues of, “Hey, am I having the most clean and hygienic data elements in an enterprise?” That is always going to be the case. So we’re looking at — and we’re all thinking — the browser use cases, the browser applications and capabilities can be fairly promising where you don’t have ready access to data. You just sort of mimic the human actions to do it. So it’s a promising capability, but then there are obviously a lot of open questions on the security of how that will work.
Mihai Criveti: Promising, still to be seen. I’m not a fan of training or fine-tuning models for most enterprise use cases. Mostly because whenever you talk to an enterprise, they assume they have data. Second, they assume they have the GPUs. Third, they assume they have the investment necessary to continuously fine-tune or train a model every single time their data evolves or changes. The reality is that large language models on their own are insufficient for the vast majority of enterprise use cases. Why? They’ve been trained on last year’s data, and they’ve been trained on public data. So you want to blend that data with your enterprise data. But we’ve seen techniques like RAG, GraphRAG, or agentic RAG as well as tool use — using MCP servers or leveraging all sorts of techniques — that provide sufficiently good access to real-time data and real-time information without the need for expensive tuning or fine-tuning. I think the proposition is for the very, very few companies that employ hundreds of data scientists who really make it their passion to train and fine-tune models. Even if you’re doing it on somebody else’s infrastructure, even if you’re not starting from scratch and you’re starting from a checkpoint, you shouldn’t underestimate the effort it takes to properly train or even fine-tune a model to a specific domain. And you shouldn’t underestimate the vast amount of data that is required, or the quality of data that is required. So I would say most folks should stick to agents. That’s why I like the fact that Amazon provides a one-stop shop for everything. Nothing biased or anything. But look, they have the other option. They have their AgentCore. They have agents. You don’t like this? We have that. So I would say don’t fine-tune or train a model unless you really have to and you know what you’re doing. It’s very unlikely that the resulting model is going to outperform a frontier model plus tool use. 
And even if it does, you now have to repeat that every single week or month, or whatever the refresh rate of your data is. It’s still exciting if you’re in that space: if you’re in the 1% of companies that do need that service, and you can’t buy the GPUs required to run it yourself, it’s awesome.
Tim Hwang: This is actually kind of fun because I feel like it flips the narrative from what we were talking about earlier. Earlier I was like, “Consumers don’t want complexity. They don’t want transparency. But enterprises do want complexity and transparency.” And Ambhi, you’re coming back and basically saying, “Actually, for most enterprises, they don’t want that either.” Kate, do you have any thoughts on this?
Kate Soule: I agree with everything that’s been said. The only other comment I’d add is that I think there could be something interesting for the research and academic community in these new reinforcement-learning-as-a-service and tuning-as-a-service capabilities in Nova Forge. I thought it was really cool that they’re offering early checkpoints: partially trained versions of the Nova Lite model that can then be further customized. I agree that I’m skeptical about the direct enterprise value; I think it’s going to be a lot harder than people anticipate to get a specialized model using SFT or RL. But I do think that by offering more of these components, we could potentially enable more engagement from academia and the research community, which is otherwise hampered because it doesn’t have access to an early checkpoint. They even have a part of their service where you can mix your own data with the training data for continued training. So those are all really interesting things that hopefully could spur more innovation that the field could benefit from, and engage a new user group that’s been left on the sidelines and not able to participate fully.
Tim Hwang: Yeah, definitely a constituency we don’t talk about enough on the show, but we should definitely talk more about it. Mihai, maybe I’ll give you the final word of this episode. A little bit of a peek into the future: one of the fun tidbits that Amazon announced when releasing the Nova models was that they’ve been playing around with making the claim that their frontier agents can operate for hours or even days on end. Regardless of how credible you think that claim is, I think we are kind of headed towards this really fun world where you’re like, “Okay, computer, I need you to help me out with something,” and it comes back three weeks later and is like, “Here’s what I did.” Are we headed for that world? Certainly, the technology will be able to do something in those three weeks, but I’m kind of curious if you feel like we’re finally getting these agents aligned enough to get there.
Mihai Criveti: Yeah, my agents can operate for weeks, but that doesn’t mean I’m getting good results for that money. You can run for years if you want. Running long is not the issue: I have a timeout I can tweak, and I can keep the agent going and going without ever returning a final answer. Just tell me how many tokens you want me to consume. So look, I think what is improving is tool use. What we’re seeing is improvement in the number of tools that can be called, the number of tools that can be called in parallel, the number of sequential tool calls, and techniques like MapReduce, vector search, or tool search to call the right tool. These enable these kinds of continuous use cases. Let’s say you’re building a document (let’s take a PowerPoint document, because it’s even easier to visualize) and you’re building slide one, slide two, slide three, slide four. Each of those can be an independent tool call, and you can keep going and going if you’re managing your context right. So if you think about what’s preventing us from running agents continuously today, it’s just how difficult it is to properly manage that context. You’re working with the LLM’s limited context for tool orchestration. Everything needs to fit in the context within an execution, and then you need to use techniques to manage the context, such as compacting it. If you use Claude Code or Codex, you see that at some point it starts to compact the context. It’s literally summarizing what you have in your context to a state that is good enough for it to continue from. So all of these techniques are coming together, and we’re seeing longer and longer running agents. Microsoft has Researcher. ChatGPT and Gemini have their deep research functionality. Amazon has similar techniques. We have similar techniques, and we’ve built our own deep researchers.
I think at the end of the day, this is something we’re going to see more and more, because if you want to get good results in enterprise use cases from AI, you want it to touch all of your data. That means hundreds, potentially thousands of tool calls. RAG is not enough. With RAG, what you’re doing is selecting ten paragraphs, give or take, from whatever you’re searching, and then giving it to the model and hoping for the best. What I would like to do is give it all of the data — summarize this and this and this and keep going and going. It’s expensive. But in some cases, if you’re putting together a complex deliverable like an RFI response document or RFP response document — “go write me a book and come back with 300 pages on this topic” — you need that depth. So I do see a natural evolution of all agents within the enterprise space adopting this kind of deep researcher functionality with agents that can run for ten minutes, an hour, perhaps even overnight, to come back with a very complex response.
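To make the context-compaction idea above concrete, here is a minimal sketch of an agent loop that treats each slide as one independent tool call and compacts its context whenever the next result would not fit. Everything in it is an illustrative assumption rather than how Claude Code, Codex, or any vendor's agent actually works: `count_tokens` just counts words, `truncate_summary` stands in for a real model-generated summary, and the names `AgentContext` and `run_slide_agent` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List


def count_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: counts whitespace-separated words."""
    return len(text.split())


def truncate_summary(text: str, max_tokens: int) -> str:
    """Placeholder 'summarizer' that keeps the first max_tokens words.

    A real agent would ask the model to summarize; this is an assumption
    made only so the sketch runs without any external service.
    """
    return " ".join(text.split()[:max_tokens])


@dataclass
class AgentContext:
    budget: int        # maximum tokens allowed in the context
    summary_size: int  # tokens kept after a compaction
    summarize: Callable[[str, int], str] = truncate_summary
    entries: List[str] = field(default_factory=list)
    compactions: int = 0

    def used(self) -> int:
        """Total tokens currently held in the context."""
        return sum(count_tokens(e) for e in self.entries)

    def add(self, tool_result: str) -> None:
        """Append a tool result, compacting first if it would not fit."""
        if self.used() + count_tokens(tool_result) > self.budget:
            self.compact()
        self.entries.append(tool_result)

    def compact(self) -> None:
        """Collapse everything so far into one short summary entry."""
        merged = " ".join(self.entries)
        self.entries = [self.summarize(merged, self.summary_size)]
        self.compactions += 1


def run_slide_agent(num_slides: int, ctx: AgentContext) -> AgentContext:
    """Simulate building a deck where each slide is one tool call."""
    for i in range(1, num_slides + 1):
        # Hypothetical tool output: a short header plus filler detail.
        result = f"slide {i} done: " + "detail " * 20
        ctx.add(result)
    return ctx


if __name__ == "__main__":
    ctx = run_slide_agent(50, AgentContext(budget=100, summary_size=10))
    print(f"tokens in context: {ctx.used()}, compactions: {ctx.compactions}")
```

The sketch stays within budget only while `summary_size` plus one tool result fits inside `budget`; a single oversized result would still be appended as-is. The trade-off is the one described in the conversation: the agent can keep going indefinitely, but each compaction discards detail, so the quality of the summary determines how well it can continue.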
Ambhi Ganesan: Tim, I want to add a nuance to what Mihai said, and Mihai is absolutely right. You have to contextualize all of this. But that’s not to discount the advances the field is seeing. You have to look at this in two dimensions. It’s not just about how long an agent, model, or system runs; it’s also about how reliable and accurate the outcome of the task is when it runs for that long. That curve has definitely shifted to the right. A couple of years back, we would have said high accuracy was achievable for runs on the order of a few seconds. Then it became a few minutes. And now we are definitely in the realm of a few hours. So the curve is definitely shifting. But it’s important to recognize that it’s not just how long it’s running; it’s how long it’s running while performing reliably and with high accuracy.
Mihai Criveti: Yeah, and what also helps with this: if you have agents that can self-evaluate, keep intermediate checkpoints, retry, and take different directions, that is going to improve them over a longer-running execution cycle.
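The self-evaluate, checkpoint, and retry pattern mentioned here can be sketched in a few lines. This is a simplified illustration under stated assumptions, not any product's implementation: `run_with_checkpoints` is a hypothetical helper, each step is a plain function that receives the attempt number so it can try a different direction, and `evaluate` stands in for whatever self-evaluation the agent performs.

```python
from typing import Callable, List, Optional


def run_with_checkpoints(
    steps: List[Callable[[int], str]],
    evaluate: Callable[[str], bool],
    max_retries: int = 3,
) -> List[Optional[str]]:
    """Run steps in order, re-attempting a step when self-evaluation fails.

    Each step receives the attempt number so it can take a different
    direction on a retry. Accepted outputs act as checkpoints: a failure
    at step N never re-runs the steps before N.
    """
    checkpoints: List[Optional[str]] = []
    for step in steps:
        accepted: Optional[str] = None
        for attempt in range(max_retries):
            candidate = step(attempt)
            if evaluate(candidate):
                accepted = candidate
                break
        checkpoints.append(accepted)  # None marks a step that never passed
    return checkpoints


if __name__ == "__main__":
    # Hypothetical steps: the second one only succeeds on its retry.
    steps = [
        lambda attempt: "ok-1",
        lambda attempt: "bad" if attempt == 0 else "ok-2",
        lambda attempt: "bad",
    ]
    print(run_with_checkpoints(steps, lambda out: out.startswith("ok")))
```

Because accepted results are kept per step, a long run loses at most one step's work to a bad attempt, which is one reason this pattern tends to help reliability as runs get longer.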
Tim Hwang: Yeah, I think that’s right. I think part of it is just going to be these tradeoffs. But I do think the frontier is going to be increasing continuously — something to pay attention to. Particularly, I think this will be the new frontier of claims being made about agents: “You can run them for weeks, you can run them for two weeks.” And so I think the question now will be: how do we measure that? How do we quantify that? So it’ll be very interesting to see. Well, that’s all the time we have for today. Kate, Ambhi, Mihai, thanks for joining us as always, and happy holidays. And thanks to all you listeners. If you liked what you heard, you can get us on Apple Podcasts, Spotify, and podcast platforms everywhere, and we’ll see you next week on Mixture of Experts.