The opinions expressed in this podcast are solely those of the participants and do not necessarily reflect the views of IBM or any other organization or entity.
Tim Hwang: What’s the most exciting thing to come out of IBM Think this year? Kate Soule is Director of Technical Product Management for Granite. Kate, welcome back. What’s your pick for IBM Think?
Kate Soule: My pick is the research keynotes. We talked about a new wave of computing. We’ve got traditional classical computing, we’ve got quantum computing, and at Think we announced a new way of building with models: generative computing. It’s really exciting.
Tim Hwang: Kaoutar El Maghraoui, Hybrid Cloud Platform. Kaoutar, welcome back. What was your favorite?
Kaoutar El Maghraoui: My favorite was also the generative computing part, but also the launch of a lot of AI agents in our watsonx Orchestrate platform—over 150 enterprise-ready AI agents. That’s really huge.
Tim Hwang: Yeah, that is huge. And we will talk about that. And finally, last but not least, is Skyler Speakman, Senior Research Scientist. Skyler, watching the conference, what was your favorite?
Skyler Speakman: Yeah, a non-technical take on this is just how much fun they were having during the keynote. Arvind marched a mascot penguin across the stage, and the crowd loved it. So it was really cool to see people having fun up on stage during his keynote.
Tim Hwang: Penguins, agents, and programming—all that and more on today's Mixture of Experts. I am Tim Hwang, and welcome to Mixture of Experts. Each week, MoE brings together the smartest, most talented, most wonderful experts in all of artificial intelligence to talk a little bit about the biggest news in the sector. And this is a big episode. We've got a lot that we need to talk about, as per usual: a really fascinating story coming out of the New York Times about AI and hallucination, and a bunch of news coming out of OpenAI in terms of its corporate organization and its recent acquisition of Windsurf. But first, I wanted to start with IBM Think, which was the big IBM conference of the year. There were tons and tons of announcements to go through, but the one I do want to start with is this: Kate, I realize you have a book coming out that was also announced at IBM Think, so maybe I'll just start there for the plug.
Kate Soule: Yeah, no thanks, Tim. So we did release a book. I’ve got it here with me. It’s called AI Value Creators. Really excited to be able to share it more broadly. A lot of what we talked about at Think, particularly in some of the future-looking sessions like on generative computing, we actually have whole chapters dedicated to in the book. It’s really all about how folks looking to not just build with generative AI, but kind of build a competitive moat with generative AI, get the most value and invest in strategic places. So really, really excited for folks to check it out. We actually have a download link for all of our Mixture of Experts listeners, so we’ll include that in the show notes and would love any feedback the team has as they read through the content.
Tim Hwang: That’s great. And Kate, I guess for those who are just getting their head around generative computing, what’s the general concept there? Do you wanna give us a little bit of a flavor of how... it sounds like it’s a big part of the keynote, it’s a big part of the book. Just kind of interested in how all these pieces are fitting together and what is generative computing?
Kate Soule: Yeah, so I think at the end of the day, it's really just trying to bring some of generative AI back to the realm of computer science. If you look at how we've ended up building applications and agents with LLMs today, it's all basically a form of prompt engineering, where we end up with these massive, pages-long prompts—we call them "essay prompts" in our book—that can be very difficult to maintain. These prompts are very brittle. You look at how they're written; they're over-optimized and force-fit for a specific model, and it's just not very sustainable or secure. There are all sorts of issues. If we think about how we build in a more computer science-forward discipline, there need to be abstractions for the key activities we want a model to take on, and there need to be ways to set clear control flow for how we build programs. Instead of asking a model, "First do this, then do this, then do this," we can express a lot of that as ordinary code. We don't need to ask a model to do everything. So it's really about how we can take some of these best practices from software engineering and computer science, bring in all the power that models have to express natural language and run functions in natural language, and put them together in a much more maintainable way.
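To make that concrete, here is a minimal sketch of the pattern Kate describes, assuming a generic, hypothetical `generate()` helper standing in for any LLM client (it is not a specific IBM or watsonx API). Each model call gets one narrow task, and ordinary Python handles the control flow and validation that an "essay prompt" would otherwise have to carry:

```python
# Sketch: replace one giant "essay prompt" with small, single-purpose
# model calls composed by regular code. `generate()` is a hypothetical
# stand-in for your LLM client of choice.

def generate(prompt: str) -> str:
    """Hypothetical single-turn LLM call; swap in your provider's SDK."""
    raise NotImplementedError

def summarize(document: str) -> str:
    # One narrow responsibility per call, not one sprawling prompt.
    return generate(f"Summarize in three sentences:\n{document}")

def classify_sentiment(text: str) -> str:
    answer = generate(f"Answer only 'positive' or 'negative':\n{text}")
    # Programmatic validation instead of hoping the model complied.
    label = answer.strip().lower()
    if label not in {"positive", "negative"}:
        raise ValueError(f"Unexpected model output: {answer!r}")
    return label

def triage(ticket: str) -> str:
    # Control flow lives in code, not in "first do this, then do this" prose.
    summary = summarize(ticket)
    if classify_sentiment(summary) == "negative":
        return f"ESCALATE: {summary}"
    return f"ROUTINE: {summary}"
```

The point of the sketch is the structure: each step is small enough to test and validate on its own, and the sequencing is deterministic code rather than instructions buried in a prompt.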
Tim Hwang: Nice. Yeah, that really is, I think, the future: moving into how do we make this production at scale. So it’s very exciting to see.
Kate Soule: Absolutely. And I think there’s a lot also that goes on when you start to build things in a little bit more structure, where you can take advantage of a lot of techniques that are coming out in the field around inference scaling and inference-time compute. So instead of running one big, massive prompt once, how do you break it up into smaller parts, run multiple generations, and use that to create an even richer response, often in far less time, far less compute. And all of that and more we really get into in the book.
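One common inference-time compute pattern along the lines Kate alludes to is best-of-n sampling. A hedged sketch, again with hypothetical `generate` and `score` stand-ins rather than any specific API: draw several independent generations, score each candidate, and keep the best one.

```python
# Sketch of best-of-n sampling, one inference-scaling technique: several
# small generations in place of one big prompt. `generate` and `score`
# are hypothetical stand-ins (e.g., score could be a reward model or a
# simple heuristic checker).
import concurrent.futures

def generate(prompt: str, temperature: float = 0.8) -> str:
    raise NotImplementedError  # swap in your model client

def score(prompt: str, candidate: str) -> float:
    raise NotImplementedError  # e.g., a reward model or rule-based check

def best_of_n(prompt: str, n: int = 5) -> str:
    # Independent samples can run in parallel, so wall-clock time can stay
    # close to a single generation even though n calls are made.
    with concurrent.futures.ThreadPoolExecutor(max_workers=n) as pool:
        candidates = list(pool.map(lambda _: generate(prompt), range(n)))
    return max(candidates, key=lambda c: score(prompt, c))
```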
Tim Hwang: That’s great. Yeah, well, I encourage everybody to check it out. The next one I want to touch on is, Kaoutar, you have already won the MoE award for mentioning “agent” first in the episode. But it is genuinely exciting. I mean, in some ways it’s no surprise that IBM would be announcing a kind of product leap in agents. But do you wanna talk a little about what’s happening and why you find it exciting?
Kaoutar El Maghraoui: Yes, definitely. So IBM at Think introduced over 150 pre-built AI agents through the watsonx Orchestrate platform, and I thought that's really huge, enabling enterprises to deploy AI-driven workloads rapidly. These agents come prebuilt; you can integrate them seamlessly with popular enterprise tools like Salesforce, Workday, and Adobe, which allows businesses to automate tasks and enhance productivity. And this showcases IBM's approach to supporting the creation of custom AI agents, which is also very important, relying first on the Granite models as well as models from Meta and Mistral. It's a modular approach that gives you the flexibility to tailor solutions for diverse business needs. I think that was also very important. So this flexibility is not just about one approach: you can integrate different models in a flexible, modular way, and customize on top of the prebuilt AI agents that you can just add.
Tim Hwang: Yeah, for sure. And I did wanna touch on that. Skyler, before we talk about the mascot—which I do want to hear more about—Kate, the mention of Granite, I guess you’ve been name-checked, so I do gotta bring it back to you. I understand there is an announcement coming out about Granite actually from IBM Think.
Kate Soule: So on Friday, actually, we did a sneak preview. We didn't tell anyone we were gonna do this. We released a preview of our Granite 4 models, and we got to talk about them a lot at Think. That was also a really exciting part of the conference. We can post a link to the blog that talks about the new architecture behind them, but basically they're a hybrid mixture-of-experts model, so they are very fast and very efficient. The tiny preview that we just released only takes 15 gigs of memory, even when running a 120k context length with multiple concurrent requests. So we think these models are gonna be really efficient, excellent complements to the much larger models being deployed—having those bigger models and the smaller, efficient Granite models working together hand in hand.
Kaoutar El Maghraoui: I really like the emphasis here on smaller, domain-specific models and on energy efficiency. Because these models range from 3 to 20 billion parameters, as opposed to the hundreds of billions or trillions of parameters you see in other open-source and proprietary models. The key thing here is how you build models that are optimized for specific industries, offering cost-effective and efficient alternatives to the larger general-purpose models. So I really like the focus on efficiency here.
Tim Hwang: Yeah, for sure. So, Skyler, curious if you wanna tell us more about the mascot. In general, I thought what was very striking about your response was that you said, "It's so much fun," which I think is actually an important part of all this. I'd love to hear what you saw.
Skyler Speakman: Yeah, I know. I think that just sort of captures it. They had this transition from having these Ferrari race car team members up on stage talking about how they’re using IBM tech, and then there was this pivot to IBM’s relationship with Red Hat, and of course Linux more broadly, and a penguin mascot just starts walking across the back of the stage. Great. So hats off to whoever had that planned. Maybe it was last minute; maybe that’s been someone’s dream for a year. I don’t know. But I thought it was well done.
Tim Hwang: Yeah, for sure. And I do like it. One of the things I'm really fascinated by is how all the companies in the AI space are coming up with their own brands for how they present AI. Some companies are very serious, and some companies are very technical in a granular, almost academic way. And it's kind of fun seeing IBM take a certain level of fun in terms of how to present and talk about this stuff. So it's very cool. I'm gonna move us on to our next topic: a super interesting article that hit the New York Times, I believe this week or last week, focusing on the rise of hallucinations with the emergence of reasoning models. We haven't talked about hallucinations on the show for a little while, but obviously it remains a big question and a big problem that people are working on in the space. Skyler, maybe I'll stay with you. Hallucinations themselves are not new, but the article seemed to argue that reasoning models are newly hallucinatory in a way that we're still learning to deal with. Is that the case? And do you have an intuition for why?
Skyler Speakman: It does appear that they are on the rise. There was this great contra-position in the article: they had asked a spokesperson for comment, who said no, they're not on the rise. But if you go check the receipts and look at the model cards that OpenAI also produces, you do see o4-mini hallucinating more than o3, and o3 hallucinating more than o1—so yes, definitely on the rise. And they're also very clear to say they don't know why, and I'm also gonna draw a blank here. Sorry, I'm not quite sure; I don't have any gut instinct as to why those are increasing. Accuracy is going up, they're getting better at math, but hallucinations are also increasing. So it is something that really does need a lot more attention paid to it.
Tim Hwang: Yeah, and I think this is one of the really interesting things: I feel like the AI era is teaching us all the ways in which intelligence is very lumpy. You know, the model gets really good at one thing, and you kind of expect that it’ll be good at everything else in a well-rounded way, but that kind of doesn’t seem to be the case. Kate, I’m curious if you’ve got intuitions or similar. Like Skyler, you’re like, “I don’t know, it’s just weird.”
Kate Soule: Yeah, I will give my thoughts, obviously with the caveat that there's a lot still left to be discovered. But to me it seems like a classic example of misaligned incentives. We've got these models going through extensive reinforcement learning pipelines that, among other things, increase the model's verbosity, getting it to say more and to craft these well-rounded responses that humans will prefer. And there is some degree to which any human likes to hear a persuasive speaker talk. We're not very good at fact-checking, and we don't naturally resonate with something that is just black and white, "The answer is X." We want to know why, we want to hear more of the thought, and we question things less when we hear that thought process. And that's a little bit counter to the objective function that was originally solved for, which was much more "get the answer exactly correct," certainly the focus for pre-reasoning models. So I expect there's some misalignment in those objective functions. We're trying to solve for a lot of different things, and we're weighting these really verbose thought processes, which are much harder to check for factual accuracy when the training data is created. That just innately creates more chances to hallucinate in any given response than just "The answer is X."
Tim Hwang: Kaoutar, are you optimistic in the end with all this? I remember a few years ago I was talking to a researcher who was like, “Don’t worry, in like 18 months there will just be no more hallucinations. We’re gonna crack the problem. It’s solved.” Clearly, there’s gonna be less and less hallucinations, and it’s just gonna be done. What’s interesting about this article is almost the idea that hallucinations might be a thing that keeps coming back as the technology advances. From where you’re sitting, do you feel like, yeah, maybe in 2030 we won’t even be talking about hallucination anymore ‘cause it’s a solved problem? Or is this something persistent that we’re gonna be dealing with for a long time?
Kaoutar El Maghraoui: Yeah, I think it's gonna persist. Maybe we'll have different techniques or methods, or hybrid approaches where we also do factual checks. What's happening here is these models use probabilities, not logic, to predict their responses. Reinforcement learning helps in math and coding, but it also causes the model, like Kate mentioned, to lose some of that consistency. Reasoning models take multi-step approaches to problem-solving, and each step introduces a compounding effect on hallucination. The tools today can't keep up. Of course there's a lot of research on building tools to trace AI output back to the training data, but these systems are very complex, too large to fully understand, and even the explanations shown to the user sometimes don't reflect the model's actual internal process. So what are the broad implications? Accuracy is eroding: even as LLMs become more powerful at cognitive tasks, their grip on factual reliability is loosening. And of course that raises a lot of enterprise concerns. I think the challenge remains unresolved. There are quite a few efforts from OpenAI, Google, DeepSeek, and others, but there is no clear fix; hallucination appears to be an intrinsic limitation of the current model architectures. So what I'm thinking is we need hybrid approaches: not just relying on the model, but combining it with other systems that do symbolic reasoning or factual checking. Hopefully that can resolve these issues.
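A rough sketch of the hybrid idea Kaoutar raises: let the model reason in natural language, but route anything deterministically checkable to a symbolic verifier instead of trusting the model's arithmetic. The claim format and regex below are illustrative assumptions, not any standard pipeline:

```python
# Sketch: deterministic verification of arithmetic claims in model output.
# A real hybrid system would cover dates, units, citations, etc.; this
# only handles simple "a <op> b = c" patterns as an illustration.
import re

def verify_arithmetic_claims(text: str) -> list[str]:
    """Find simple 'a <op> b = c' claims and recompute them exactly."""
    failures = []
    pattern = r"(\d+)\s*([+\-*])\s*(\d+)\s*=\s*(\d+)"
    ops = {"+": lambda a, b: a + b,
           "-": lambda a, b: a - b,
           "*": lambda a, b: a * b}
    for a, op, b, claimed in re.findall(pattern, text):
        actual = ops[op](int(a), int(b))
        if actual != int(claimed):
            failures.append(f"{a} {op} {b} = {claimed} (actually {actual})")
    return failures

# Example: flag a hallucinated sum in a model response.
print(verify_arithmetic_claims("Revenue grew 12 + 31 = 44 percent."))
# -> ['12 + 31 = 44 (actually 43)']
```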
Tim Hwang: Yeah, and I did wanna get into that. Kaoutar, I think you point out quite rightly: from an enterprise standpoint, I’m a company about to implement this stuff; I’m reading in the New York Times that these great new models that people are trying to pitch me on hallucinate more. Skyler, what’s to be done? Kaoutar is kind of throwing out maybe we need more symbolic approaches. What is the toolkit of things that we do to try to deal with this, particularly in a setting where a business is trying to implement this and they need the reliability?
Skyler Speakman: I think that point right there at the end is very important: which use case are these being built for? Hallucinations during your Google search are annoying, but not game-breaking. In a tool meant to improve a legal argument or a medical diagnosis, they're incredibly important. So I think these hallucinations will always be with us. I did think they would be on a downward trend, Tim, as you said earlier. I am surprised they're going up, because there are teams of researchers working on this problem, and they seem to be falling behind the pace of LLM progress, if we just read the hallucination rates as they increase. So probably the key part here is: what's your downstream use case, and are hallucinations game-breaking in it? If so, there should be some serious pause about how you roll AI out into your workflows. If you're using it to speed up an internet query, I think we're gonna have some entertaining hallucinations for another five years to come yet.
Kate Soule: And if I can make a plug for generative computing, I think this is exactly the type of thing we're trying to solve and wrap our heads around for real deployed use cases: how do we set up workflows so that it's not just a model given carte blanche to go create tons of chain-of-thought, do a bunch of actions, hallucinate some things, and give a response back? Instead, how can you have very programmatic control steps, with checks where you're validating the outputs programmatically, and where you really reduce the scope of what the model does at any one point in time, so you can reduce your risks of hallucination and other safety issues? A key part of that is also bringing in additional layers of security. For example, we've got the Granite Guardian models, which can detect hallucinations in a grounded response or function call. So there are all sorts of tools you can start to layer in if you're not taking what I call the "YOLO prompt" approach, where you create one big prompt, throw it at the model, and, fingers crossed, hope for the best. If you start to break this out, it takes a little more work to set up, but it gives you so much more control over the risks and the performance at any given part of the process. I think it will be really critical for real-life enterprise deployments.
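A minimal sketch of the control loop Kate describes: generate a grounded answer, run a separate guard check for hallucination, and retry or refuse rather than passing unchecked output downstream. `generate` and `detect_hallucination` are hypothetical stand-ins; Granite Guardian is a real model family, but this is not its actual API.

```python
# Sketch: programmatic guardrails around a grounded generation step,
# instead of one "YOLO prompt". Both helpers below are hypothetical.

def generate(prompt: str) -> str:
    raise NotImplementedError  # your model client

def detect_hallucination(context: str, answer: str) -> bool:
    # Stand-in for a guardian-style groundedness check (e.g., a Granite
    # Guardian-class model scoring whether `answer` is supported by `context`).
    raise NotImplementedError

def grounded_answer(question: str, context: str, max_tries: int = 3) -> str:
    prompt = f"Answer ONLY from this context:\n{context}\n\nQ: {question}"
    for _ in range(max_tries):
        answer = generate(prompt)
        # Check each attempt against the source documents before returning it.
        if not detect_hallucination(context, answer):
            return answer
    # Fail closed: an explicit refusal beats a confident hallucination.
    return "I can't answer that reliably from the provided documents."
```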
Tim Hwang: Yeah. I think this is still one of the funniest ironies of the AI era: you've built a thing that lives in the computer, but it doesn't really behave like computing. And there's all this work now to put it back in the box and make it behave like a more traditional computer, because you need that for all sorts of very practical reliability, security, and safety reasons. There are prompts out there that say, in all caps, "DO NOT HALLUCINATE." That's not how computer science is done; we've lost all grounding to reality there. So we need to get to a better way of working.
Kaoutar El Maghraoui: Yeah. That's the paradox we're seeing right now: the smarter these models get at reasoning, the less we can trust them on facts. So hallucinations may require more than just reinforcement learning as it's used today. Like Kate mentioned, we really need new architectures and new programming paradigms that explicitly encode truth constraints, or modular hybrid systems that combine LLMs with verifiable databases or symbolic logic engines. And that, I think, is at the core of what generative computing is trying to do.
Tim Hwang: I wanna move us to the last story of today. It was announced—or rather leaked—that OpenAI is about to make an acquisition of Windsurf, which is effectively a coding environment. And the number that has been leaked is that the acquisition would be USD 3 billion, which would make it the biggest OpenAI acquisition to date and obviously just a gigantic acquisition in its own right. Kate, to go back to you, some people were saying online that this is kind of, in some ways, evidence that a lot of this AGI stuff is marketing, because if you really believe that AGI was about to come about, why would you spend USD 3 billion on essentially a text editor with some AI components added to it? So kind of curious about how you size that up. Do you buy that argument, which is like, yeah, it kind of seems like maybe OpenAI is speaking out of two sides of its mouth here?
Kate Soule: I think OpenAI probably is speaking out of many different sides of its mouth at all times. But I do think that it makes a lot of sense, and I don’t think it’s mutually exclusive. So if you look at how OpenAI became the behemoth it is today, they released a chat interface. They found a UI that all of a sudden made their models relevant to mass consumers, and then they had millions of people using that interface, generating data that they used to bootstrap their way—rocket-ship their way—into really high-performance models. And I think what we’re seeing is the killer use case of 2025 and probably for a while is coding assistance. And they don’t have their own UI, their own access to developers in that arena. So they’re losing that advantage that gave them this amazing starting point and position. And so I see it very much as their way to try and regain some of that advantage and to better understand how their users are using the models and figuring out how to continue to improve the models moving forward.
Tim Hwang: Skyler, this is a little bit of a weird outcome, though, right? Because I can remember when ChatGPT first came out and everybody was doing startups around AI, people were like, "Oh, you're just a thin wrapper around GPT. That's not a real company; that's just a wrapper around GPT." But at USD 3 billion, it really does feel like these wrappers are quite valuable now. It's almost an inversion of what we thought earlier in the game. Is that the right interpretation?
Skyler Speakman: I think while we’re talking about double-speak or talking about both sides of your mouth, I think on one hand you can call it a wrapper; on the other hand you can view Windsurf or some of these other companies as integrators. And OpenAI is great at model building, but they haven’t, as Kate’s pointed out, integrated into other spaces. They had a great chatbot interface. And I think while these models are continuing to grow, integration is the complementary scarce factor that’s lagging behind. So yes, wrapper or integrator, depending on which way you really view it. I do think OpenAI knows where it sits in terms of the model-building game, and they probably saw a bit of a weakness in their own structure of how do we actually deploy this on people’s machines that’s not a chat interface. And so again, maybe thinking of this more as integrating systems into the language models rather than a wrapper is probably why you can come up with USD 3 billion as opposed to just a wrapper. How it plays out, we don’t know. But I do think there’s this interesting take on the difference between building models and then actually integrating those into workflows, and this might be OpenAI covering its bases on the latter.
Tim Hwang: Yeah. I love the idea that a valuable wrapper is an integrator. It's like, yes, once you get valuable enough, that's what you've transformed into. Kaoutar, where does this all go? Because it suggests vertical integration in the space: coding assistance is obviously a really big use case, as Kate mentioned, so it kind of makes sense that the model provider would eventually acquire one of those tools and become vertically integrated. I'm wondering whether there are other domains an OpenAI might be interested in, because what's interesting about AI is that it can be applied across all these different domains. Maybe it won't be a USD 3 billion acquisition, but where else could they be going where they'd want to control both the model layer and the application layer?
Kaoutar El Maghraoui: Yeah, that's a very good point. I think what Windsurf showed us here is that they built a sticky developer workflow and an additional trust layer over GPT, what we were all referring to as the wrapper. And OpenAI's reaction shows it's not just that they want to own the model, but also the developer experience and the ecosystem. It seems like we're entering a phase where verticalized copilots, for finance, for law, for science, for medicine, et cetera, are the new battleground. Owning the UX layer is a very strategic approach, and I think that's a smart play by OpenAI, because as the model layer commoditizes, the moat is the ecosystem and the developer tooling. And especially as we move into more agentic AI, this vertical integration becomes very important if you really want a strategic advantage and to be competitive in the marketplace.
Tim Hwang: Yeah, and it kind of leads to a world where it feels like maybe OpenAI is gonna become... they’re gonna almost take the Apple model, right? Where everything’s vertically integrated; they build the hardware, they have apps that are definitely their apps, and it’s just kind of end-to-end. Kate, do you think that’s gonna be the sort of future of AI, where you almost have some companies that are like Apple, other companies that are just kind of like the ThinkPad—a piece of a computer that you can run anything on?
Kate Soule: Yeah, I definitely agree. And building on Kaoutar's point, I really like how you framed it: we're starting to see commoditization at the model layer. For a lot of tasks like coding assistance, we're absolutely hitting a point where many models are gonna converge on very similar levels of performance. So then how do you differentiate, how do you develop your competitive moat? You create really high switching costs, so that once you're in an ecosystem, you're not gonna switch over to whoever's offering the same thing for a few cents cheaper. From that perspective, I think OpenAI and other providers are going to continue to invest there. And that's why it's really important that we continue to support a robust open-source ecosystem: to make sure we have diversity of technology and of thought, keep optimizing the efficiency of generative AI, keep bringing down costs, and make sure we don't just get locked into these single-provider ecosystems.
Tim Hwang: Yeah, for sure. Skyler, any thoughts on this?
Skyler Speakman: An analogy I've heard before: go back 30 years, and people defined their compute experience by what OS they used, your Windows or your Mac, and then that converged. Then it was what browser you used that defined your user experience, and those have converged. Right now we're in a space where people swear by one particular LLM, and I do think that will eventually converge as well. There will be small nuances here and there, but at least from a consumer perspective, I do see it converging. We've seen this before with technology where that sort of choice defined your compute experience, and then you fast forward five years and a lot of the options are pretty similar. I can see that same progression happening with your chatbot of choice.
Tim Hwang: Yeah. It kind of makes me think a little about that old commercial, like, “I’m a Mac, I’m a PC.” I’m waiting for that commercial that’ll be like, “I’m an OpenAI coding assistant,” you know, like, “I’m an open-source coding assistant.” Well, more to come soon. As always, action-packed—a lot to cover, way more to cover than we have time for. But as always, thanks for joining us. Skyler, great to see you again. Kaoutar, Kate, always great to have you on the show. And thanks to all you listeners. If you enjoyed what you heard, you can get us on Apple Podcasts, Spotify, and podcast platforms everywhere. And we’ll see you next week on Mixture of Experts.