What’s the best open-source model? In episode 48 of Mixture of Experts, host Tim Hwang is joined by Kate Soule, Kush Varshney and Skyler Speakman to explore the future of open-source AI models. We kick things off by diving into the release of DeepSeek-V3-0324. Then, we explore more announcements coming from Google, including Gemini Canvas and Gemini 2.5. Next, Extropic enters the chat with a thermodynamic chip. Finally, we discuss the rise of AI image generation following OpenAI’s release of GPT-4o image generation. All that and more on today’s Mixture of Experts.
The opinions expressed in this podcast are solely those of the participants and do not necessarily reflect the views of IBM or any other organization or entity.
Tim Hwang: It’s 2026. Is the top model in the world an open-source model? Kate Soule is Director of Technical Product Management for Granite. Kate, welcome to the show. What do you think?
Kate Soule: I don’t know if I agree with that framing, Tim. I don’t think any one model is “top.” I don’t think there’ll be one model that is overall best at everything or that will “rule them all,” so to speak.
Tim Hwang: Alright. Kush Varshney, IBM Fellow, AI Governance. Kush, welcome to the show. What do you think?
Kush Varshney: I think “Open” is here already, and “Open” is gonna dominate into 2026.
Tim Hwang: All right, great. And Skyler Speakman, Senior Research Scientist. What’s your hot take on this question, please?
Skyler Speakman: If you define “the top” as the most used, then definitely open models will be the most used models in 2026.
Tim Hwang: All right, everybody’s fighting my questions today! And all that, and more, on today’s Mixture of Experts.
I am Tim Hwang, and welcome to Mixture of Experts. Each week, MoE brings you the best minds in artificial intelligence to walk you through the biggest headlines dominating the news. As always, there’s a lot to cover. We’re gonna be talking about Gemini’s new release, a new thermodynamic computing paradigm, and OpenAI’s image gen. But first, I really wanted to start by talking a little bit about DeepSeek-V3—and specifically, not V3, but a checkpoint that DeepSeek released.
To give the full numbers if you’re interested, it’s DeepSeek-V3-0324. There’s a lot of hype about this release because, by some measures—one specific one is this “Artificial Analysis” intelligence index—it is now the best reasoning model, the best model out there in the world.
But maybe, Kate, I’ll start with you. I know you kind of fought the premise of this question when I just asked you a moment ago. Should we think about models as being “the best in the world”? Is that even a useful way of thinking about this space?
Kate Soule: A couple of things. I think DeepSeek-V3 is a non-reasoning model, so a lot of the press is calling it the “best non-reasoning model in the world” according to reports like Artificial Analysis. I think a lot of these analyses are trying to come up with tools to help people better evaluate models and pick ones to use in production. The reality is, these models are all differentiated by tiny margins, like 0.01% differences in performance. Do we really think that tiny lift in performance on one benchmark is going to result in meaningful performance improvements on a RAG or even an agent-based task you’re trying to deploy in production? I don’t think so.
I think these are great ways to give you a list of models to start testing, but ultimately, the best model is the one that does best on your task that you care about. And that could be any model, regardless of how it scores on some of these top-level benchmarks.
Tim Hwang: You’re almost saying we’re post-benchmarking in some ways. Like, all the models are so performant now that it’s almost difficult to say there’s one absolute measure. I don’t know if I’m putting words in your mouth, but...
Kate Soule: I mean, I think different model providers have different priorities. I think DeepSeek is actively chasing OpenAI. They’re trying to have the same pursuit of AGI, and so some of these benchmarks are being used as demonstrations of capability on that broader pursuit. That’s fair. I don’t think that means for an everyday production task or use case that it necessarily reflects a meaningful difference in performance. I think some of these big models are frankly overkill, and so boosting it a little bit further isn’t going to make a real, actionable impact.
The benchmarks that matter most if you’re trying to deploy a model are: what is the performance for a given cost profile? And those are things you just have to test use case by use case, using information like Artificial Analysis to help you get started. But ultimately, you have to run your own experiments.
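As a minimal sketch of what running your own experiments can look like in practice (assuming an OpenAI-compatible client; the model names, prices, and scorer below are placeholders, not real products), the loop runs each candidate model over your own task examples and reports quality alongside estimated cost:

```python
# Sketch: use-case-specific evaluation -- quality vs. cost per model.
# Model IDs, prices, and the scorer are placeholders to swap for your own.
from openai import OpenAI

client = OpenAI()

candidates = {
    "big-frontier-model": 15.00,     # hypothetical $/1M output tokens
    "small-efficient-model": 0.60,
}

task_examples = [
    {"prompt": "Summarize: ...", "reference": "..."},
    # ...your real production examples go here
]

def score(output: str, reference: str) -> float:
    """Placeholder scorer -- swap in the metric your task actually needs."""
    return float(reference.strip().lower() in output.strip().lower())

for model, price_per_mtok in candidates.items():
    quality, out_tokens = 0.0, 0
    for ex in task_examples:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": ex["prompt"]}],
        )
        quality += score(resp.choices[0].message.content, ex["reference"])
        out_tokens += resp.usage.completion_tokens
    cost = out_tokens / 1e6 * price_per_mtok
    print(f"{model}: quality={quality / len(task_examples):.2f}, est. cost=${cost:.4f}")
```

The point of the sketch is the shape of the loop, not the scorer: the model that wins is the one with the best quality-per-cost on your examples, whatever the leaderboards say.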
Tim Hwang: Kush, maybe I’ll turn it over to you. The way I heard your response might be a contrasting position to Kate’s, right? I think you responded by saying, by any measure, “Open” is winning. So, it doesn’t matter how you measure this, “Open” will be the best in the world in 2026. Is that another way of thinking about it? Or maybe you were nodding, so maybe you actually agree with what Kate just said.
Kush Varshney: I agree with both Kate and Skyler on this point—that there are different ways of measuring what is “best,” and even asking the question of what is “best” is probably not the right way to think about it. But I think the main point is that “Open” is the way the world is gonna move forward. So, whether we wanna count “best” or “not best,” or usage, or adoption, I think “Open” is gonna have a very strong play.
Just continuing... So, whether that number is a little bit above or a little bit below, that’s not the critical point; just that it’s in the same ballpark is the important point. I think a couple of months ago when I was on the show, I was talking about the culture of how DeepSeek is doing their work—the fact that they can rapidly iterate and make this difference and reach their goals, whatever those happen to be, very quickly. I think that’s the continuing story in my mind. Whatever happens, I think DeepSeek will be able to adapt to the changing environment, whatever the needs happen to be in the actual world.
So, in terms of the culture aspect, I mean, “Open” culture is gonna be what’s gonna dominate, actually—not maybe the “open model.” So maybe I’ll clarify that a little bit.
Skyler Speakman: I’d like to jump in on Kush’s point there about the role of DeepSeek. For sure, the headline that got me to click was “Open Source is Now Best.” But below that headline was this really cool graphic that showed where DeepSeek was in January to where DeepSeek-V3 is in March now. And that delta, I think, is worth paying attention to. I agree about the difference from the other leaders—it depends what metric you’re using, et cetera. But for this same metric, the increase that DeepSeek has made from January to March is really quite impressive. Think about that change that has happened in that short period of time. I think that just echoes Kush’s sentiment about the way DeepSeek is going about creating and releasing these models: a very cool release in January and a great follow-up three months later. So that’s the really cool headline after the fact.
Tim Hwang: Yeah, that’s right. It’s very interesting as a way of looking at these metrics. We tend to think about them as, “Is the model good or not?” And I guess, Skyler, what you’re seeing is that maybe that’s not the real question. This is really useful for almost knowing how good the team and their improvement method is, right? It’s almost like, “How quickly can the team hill-climb?” That’s the really interesting thing revealed by these numbers, more so than the quality of the release in some objective sense.
Kate Soule: Well, and I also think there’s something interesting going on about being able to bootstrap reasoning models to improve non-reasoning model performance. So the initial V3 that was launched back in December—DeepSeek had an internal version of R1, which was their reasoning model that they said they used to train it. And then they released R1 in January, and that was market-moving. And now they’ve released an updated version of V3. So part of that momentum, which is really exciting to Skyler’s point, is that we see them able to innovate on some of these core building blocks that they’ve released. And that’s probably gonna unlock all sorts of ways that the broader open-source community can also innovate, given that they’ve released these building blocks out into the world, like the R1 model.
Tim Hwang: Yeah. There’s a final theme I wanna pick up on from Skyler’s original response, which is you said maybe one metric we should just look at is usage, right? Nothing beats usage. If there’s a lot of adoption, we can debate what’s better, but it’s almost like the one that’s being used is the one. Do you wanna talk a little bit about that? We don’t really talk about that so much. I feel like we’re often very obsessed with, “How did it do on this benchmark?” But I wonder if usage over time becomes a more important way of measuring not model quality exactly, but who’s winning, I guess, in some sense.
Skyler Speakman: I think downloads on Hugging Face is a thing, right? That’s kind of a stab at that idea of usage. And I think that is something these model developers keep track of and watch over time. So, no, I don’t think we’re too far off the mark by talking about adoption and usage.
Kate Soule: I will push back a little bit, just because DeepSeek is a huge model. If we talk about downloads and usage, I think small models are gonna lead and win—something a developer could literally download and run. The DeepSeek model is kind of a bear; I mean, it’s 670-plus billion parameters that would have to be loaded in memory to run. So I think usage is really important, but I think usage for these larger models is going to be predominantly in a hosted setup.
There are interesting ways to look at demand based on model size, and I think a lot of the small models that are more cost-effective are gonna get more usage in 2025 and 2026, versus some of the bigger models that are just monsters to run.
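The arithmetic behind “kind of a bear” is worth making concrete: memory for the weights alone is parameter count times bytes per parameter. A rough sketch (weights only; KV cache and activations add more on top):

```python
# Back-of-envelope memory to hold model weights alone (no KV cache,
# no activations): parameters x bytes per parameter.
def weight_memory_gb(params_billion: float, bytes_per_param: int) -> float:
    # params_billion * 1e9 params * bytes, divided by 1e9 bytes/GB
    return params_billion * bytes_per_param

for name, params_b in [("~671B model (DeepSeek-V3 scale)", 671), ("8B small model", 8)]:
    print(f"{name}: FP16 ~{weight_memory_gb(params_b, 2):,.0f} GB, "
          f"FP8 ~{weight_memory_gb(params_b, 1):,.0f} GB")
# ~671B model (DeepSeek-V3 scale): FP16 ~1,342 GB, FP8 ~671 GB
# 8B small model: FP16 ~16 GB, FP8 ~8 GB
```

Even at 8-bit precision, a model at that scale needs on the order of nine 80 GB accelerators just to hold its weights, which is why usage of the largest models skews toward hosted setups.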
Tim Hwang: Yeah, for sure. One of my secret data points of the world that I would love to know is: what’s the book that’s most downloaded on Kindle but never read? I actually wonder if there’s a similar dynamic for language models, where you have these models that are widely hyped and very much downloaded, but the question is, how much use are they actually getting in practice? And we have a much more limited sense of that. That’s almost an invisible part of the question of, you know, who’s winning? And I think we’re already attacking the premise of that question.
Great. I’m gonna move us on to our next topic. Speaking about big models and the battles over benchmarks, Google did another raft of releases. They seem to have really been picking up the pace. There was an announcement for Google Gemini 2.5, and also the release of this “Canvas” feature they’ve been playing around with.
Because we’ve spoken a lot about models and benchmarks, I do wanna maybe start by talking about Canvas. One of the really cool features about it, I thought, was the idea that you can be coding and then automatically see a preview of what you’re building at the same time. We’ve talked about this a little in the past, how we’re still trying to figure out what AI-assisted coding will look like in the future, and a lot of the innovation seems to be on the interface level.
So I guess, maybe Kush, I’m curious to get your thoughts on these types of approaches. It seems like we’re moving away from pure auto-complete. Just interested in how you think about it as a researcher on some of these issues.
Kush Varshney: Let me start with a little bit of a history lesson, if you’ll accommodate that. There was an IBM Fellow, Irene Greif, in our Cambridge lab; she pretty much started that lab. She founded the field of Computer-Supported Cooperative Work. She began that work at Lotus, then IBM acquired Lotus and it became part of IBM Research, and so forth. That field brought together all these different things—human factors, distributed systems—a lot of different work on what it really means for humans to work together supported by computers and computer technologies.
I think the paradigm is shifting a little bit, and it’s more about individual work—how that’s supported by AI, and the collaboration between humans and AI. So, kind of the co-creativity and these sorts of things. I think this whole paradigm changing is calling for exactly that: innovations in the interfaces, in the interactions. I think there needs to be a lot more control given to the user, the ability to tinker with the interface to make it what works for them.
Canvas is a great starting point, but a single chat box is not the answer; I think everyone can appreciate that. Once we go beyond that, the world opens up into lots of different possibilities, and Canvas is one. But why not just let me, as the user, determine what the right interface is for me? Maybe that’ll actually be the next step.
Tim Hwang: Oh, like the future will be almost purely like everybody will have their own interface for this sort of stuff. That’s very interesting. My question to the rest of the panel: is this the first broadly released multiplayer AI interface, where you’ve got multiple people interacting with the same interface? Have there been versions of this before? Is this the one to make the splash, where we look back and say this is the first time people are interacting together over the Canvas? Or am I blanking on some previous examples?
Kush Varshney: No, I mean, I think like Google Docs—you’re just all editing at the same time, and then you can have some AI helping each of the people a little bit—is in that same pathway. It’s not like we haven’t seen Canvas-type things. We use Mural for design thinking, and there’s multiple people moving things around. Our team is developing a kind of AI Mural version in our Cambridge Lab. So, a lot of things are happening, but yeah, it is a step, I would say.
Tim Hwang: Yeah. One thing we’ve talked a little about in previous shows is that because ChatGPT was this big moment for AI, all the interfaces that have followed have fallen into the gravitational well of everything needing to be chat. It feels like what’s exciting about Canvas, and a bunch of other experiments, is that people are finally trying to stretch beyond that. It’s an interesting debate how much path dependence there is here. Myself, when there’s no chat, or chat is a lesser part of the interface, it feels a little bit weird to me. I think that’s pretty interesting to see.
Kate, any thoughts on this? Is Canvas something you’d use? How do you feel about it, particularly from thinking about this from a product standpoint?
Kate Soule: Yeah, in general, I’m always a fan of finding ways to move beyond the initial chat-based constraints. I think Canvas is probably more of a stepping stone than a final destination. I think it’s got a little bit of that chat feel while still being different.
For coding, I really think the most productive use will come from being embedded where developers are coding today, versus having a standalone Canvas app where you iterate. The standalone app feels a little more like a demo to me.
From a product strategy, I think it’s interesting to look at how some of the big players are focusing more on the endpoint side of usage—like Anthropic, I think, is focused pretty heavily there—versus more the application side with UIs, where Google seems to be focusing a little more with this release, certainly with some of these new features.
Honestly, from my perspective, I was most excited by the Gemini 2.5 model simply for the reasoning. I do a basic sniff test for different reasoning models and just ask, “What is two plus two?” and see how much thought the model puts behind this answer. Like, can it figure out how not to reason if it’s simple?
Tim Hwang: I like that a lot.
Kate Soule: Yeah, and the model actually did pretty well. Compared to DeepSeek—R1 will give you five paragraphs of, “Okay, I’ve got two fingers on this hand and two fingers on this hand...” It goes way into it. Gemini was able to give a very reasonable, short response that was still correct. So I thought that boded well. I haven’t done more exhaustive testing, obviously; that’s just a quick sniff test. But that’s the first time I’ve seen a model take the more practical stance: it’s an easy question, so it isn’t gonna spend a million paragraphs and tokens trying to give you a response.
Tim Hwang: Yeah, that’s great. I love that. The idea is, actually now we need to be doing simpler evals because the question is whether or not you’re overcommitting resources. It’s like death by reasoning. Very, very interesting.
Kush, Skyler, other sniff tests, vibe checks on 2.5? I do think these qualitative evals are pretty valuable for people navigating, “Is this something I should spend time on or look into?”
Kush Varshney: Not in the last 36 hours, sorry.
Skyler Speakman: Same here.
Kate Soule: I also do, “Where is Rome?” That’s my other go-to. Similarly, you get paragraphs of debate on where Rome is, compared to a short response on Gemini. So I thought that was pretty good.
Tim Hwang: Yeah, I will need to try that with DeepSeek. I just love the idea of it grinding away on a very simple question and really stressing about the answer.
Kate Soule: It literally is like, “Okay, two fingers plus two fingers. But then if I have two toes plus two toes, how many toes do I have?” It gets mind-blowingly intricate.
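For anyone who wants to reproduce this sniff test, a minimal sketch (assuming an OpenAI-compatible endpoint; the model name is a placeholder) is just to send the trivial question and look at the token count:

```python
# Sketch of the "two plus two" sniff test: a trivial prompt, then check
# how many completion tokens the model burned getting to the answer.
# Assumes an OpenAI-compatible endpoint; the model name is a placeholder.
from openai import OpenAI

client = OpenAI()

resp = client.chat.completions.create(
    model="some-reasoning-model",  # placeholder
    messages=[{"role": "user", "content": "What is two plus two?"}],
)
print("answer:", resp.choices[0].message.content)
print("completion tokens:", resp.usage.completion_tokens)
# A model that knows how *not* to reason answers in a handful of tokens;
# five paragraphs of finger-counting shows up as a large token count.
```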
Tim Hwang: I think one final thing I did wanna touch on—and Kush, I think we should recognize that you’re wearing a safety vest before I tee up this session. Do you wanna explain why you’re wearing a safety vest on the show today?
Kush Varshney: Yeah, this safety vest is because IBM Research, with our Granite program, is very focused on safety through our red-teaming, our Granite Safety Alignment, and our Granite Guardian model. So yeah, that’s just trying to represent that.
Tim Hwang: Yeah, absolutely. I did wanna finally talk a little bit about model safety here. One of the things we’ve talked a little about in the past is how much safety is built into the model versus a future where safety is a separate model you’re working on.
Looking at a release like this, it feels like a lot of the big companies, at least Google, are still chasing this idea of, “Well, it’s just all gonna be embedded in the model,” versus safety being outside. Do you wanna talk about the pros and cons of that? And why isn’t Google doing what a lot of other companies, like Meta or IBM, are doing, saying, “Hey, we’re gonna separately think about safety as its own model construct”? Just curious to get your thoughts on that.
Kush Varshney: I mean, Google does have something called ShieldGemma, so they do have a player in this separate-model field. But yeah, it’s really not a question of choosing between the different ways of doing it. You really should do everything because there’s never any perfect solution. So yes, do the safety alignment as best as you can, and then still have an input and output guardrail, because I think it’s critical. And then even on the data curation side, try to exclude as much of the bad content as possible.
To me, a big reason for keeping a separate guardrail model alive is, beyond the performance question—where yes, that does show you can do a little bit better—the other thing is customizability. Not every application, every use case, is gonna be exactly the same. So the notion of safety, the notion of what is desired and undesired, is gonna change. If you just pack everything in, you don’t have that flexibility anymore. So we need to think that every customer, every application, needs some level of customizability, and that applies to the overall model, but also on the safety side.
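As a minimal sketch of that layered setup (the guardrail check below stands in for a call to a separate classifier, such as a Granite Guardian-style model; the policy names and banned phrases are hypothetical):

```python
# Sketch: layered safety -- an aligned model wrapped by separate input and
# output guardrails. check_safety stands in for a call to a dedicated
# guardrail model; the policy names and banned phrases are hypothetical.
def check_safety(text: str, policy: str) -> bool:
    """Stand-in for a separate guardrail-model call, customizable per use case."""
    banned = {"default-harm-policy": ["build a weapon"],
              "medical-use-case": ["build a weapon", "dosage without a doctor"]}
    return not any(phrase in text.lower() for phrase in banned.get(policy, []))

def guarded_generate(model_fn, prompt: str, policy: str = "default-harm-policy") -> str:
    if not check_safety(prompt, policy):           # input guardrail
        return "Request declined by input guardrail."
    output = model_fn(prompt)                      # safety-aligned model
    if not check_safety(output, policy):           # output guardrail
        return "Response withheld by output guardrail."
    return output

# Because the guardrail is a separate component, swapping `policy` (or the
# guardrail model itself) customizes safety without retraining the base model.
print(guarded_generate(lambda p: f"(model reply to: {p})", "What's the weather?"))
```

This is the customizability argument in miniature: the notion of “undesired” lives in a component you can swap per application, rather than being packed immovably into the weights.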
Tim Hwang: Yeah, that’s a great way of thinking about it. You’re saying safety at every level—do safety everywhere.
Kush, in 10 seconds, could you compare and contrast safety and security? The reason I ask is the UK recently rebranded their AI Safety Institute into the AI Security Institute. What are your thoughts, not necessarily on that rebrand, but along those two dimensions?
Kush Varshney: Yeah, no, I mean, both of us were in San Francisco in November, right, for the convening of the AI Safety Institutes. You were part of the Kenyan delegation. And, yeah, things have changed a little bit. I think that’s more politics, more just wording things.
But to me, security is at the application level. Those are things you do in a general sense. And then safety is at the model level—things you’re trying to bake into the model or put an extra guardian on. And when you meet in the middle—the model comes up and the application comes down—that’s where the confusion might be a little bit. So security is becoming more AI-ish, and the AI model is becoming more secure in some capacities. So yeah, to me, the general idea is just reducing the risk of harms, and the more you can do that, that’s the goal.
Tim Hwang: For our next topic, I wanted to bring us to a hardware story—a really interesting feature coming out in Wired this week on a company called Extropic. What Extropic is investing in is an idea called “thermodynamic computing.” I really want to bring this up because a few episodes ago we talked about quantum, and these guys are really making the argument that it’s not gonna be GPUs, it’s not gonna be quantum, it’s gonna be this new thing called thermodynamic computing.
It’s really interesting as we think about the ways hardware influences the work of AI. I was interested in the takes of this group, as people who work in AI day in, day out. To what degree are you paying attention to these kinds of developments? I feel like one way of thinking about this company is that it’s big if true; if you can actually do it, then maybe it’s a really big deal. But we kind of don’t know at this point.
So I’m curious: on a day-to-day level, are folks thinking about these alternative computing platforms? Or are they still so far in basic research that they’re not impacting day-to-day thinking? Kate, maybe I’ll turn to you for the first take here.
Kate Soule: Yeah, I’m not an expert at all on chip design or hardware, but I think it’s something that certainly IBM—and we have huge teams working on specialized alternative chip design and AI accelerator chips—is paying really close attention to. There’s a lot of innovations going on in that space.
So, some of these headlines... normally we let them mature a little bit before we start paying closer attention. But as a field and as a whole, I think there’s a ton of opportunity to better optimize and redesign chips based on the inference loads we expect to see in the future—moving, for example, toward running smaller models many times at inference versus one big model once—to improve performance as everyone starts investing more heavily in what we’re calling “inference-time compute.”
So I think there’s just tons of opportunities in this space. Certainly eager to see how Extropic evolves, and if something becomes more mature that the field can take advantage of.
Tim Hwang: This is kind of where I wanted to point the discussion. In some ways, the uniformity of GPUs, and even the uniformity of NVIDIA, has been really good for the AI space because there’s been a common standard people can build around on the hardware side. One question I’m curious about as this evolves is: if you have all these alternative computing platforms that end up being good ways of doing AI, does that fragment the space a little bit? I assume the way you’d try to do AI on top of a thermodynamic computing chip or a quantum chip might look really, really different.
So, as you think about the future, maybe Skyler, I’ll turn to you: do we think there’s gonna be more fragmentation in the space? Or is it... I don’t know, maybe we’ll find some way to just get CUDA to work on everything.
Skyler Speakman: I am not ready to invest in Extropic yet, but I do think they’ve got some interesting takes, and I was reading about it today. Conventionally, you don’t want any randomness in your floating-point operations, our typical zeros and ones. But if you’re doing billions and trillions of these floating-point operations, a little noise in each one is actually okay. The idea of AI and how we train these models involves distributions of data. So the tension is: you don’t want randomness in any individual calculation, but you wanna simulate randomness at the larger scale.
Their approach seems to be: let’s not bother with zeros and ones anymore at the chip level. Let’s embrace randomness down at the chip level because that’s where we’re eventually going anyway, thinking always about distributions rather than the answer being, you know, “four,” for example.
So I’m really glad people are asking those questions. Whether or not they’ll be able to induce the desired distribution by passing electrons through a wafer... that remains to be seen. But I’m really glad people are questioning the extreme accuracy we demand of our zeros and ones, because in the bigger picture, we don’t actually need that bit-level accuracy when you’re talking about training these massive models. So it’s some really cool tension to see how it plays out. But like I said, I’m not taking my money there quite yet.
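A toy NumPy illustration of that tension (not Extropic’s actual approach): inject 1% noise into every individual multiply in a large dot product of positive values, and the relative error of the total shrinks roughly as one over the square root of the number of operations:

```python
# Toy demo: 1% noise on every individual multiply, yet the aggregate of
# many noisy operations stays close to the exact answer (positive values,
# so per-op errors average out rather than cancel the signal).
import numpy as np

rng = np.random.default_rng(0)
for n in [100, 10_000, 1_000_000]:
    a, b = rng.random(n), rng.random(n)                      # positive entries
    exact = a @ b
    noisy = np.sum(a * b * (1 + 0.01 * rng.standard_normal(n)))
    print(f"n={n:>9,}: relative error = {abs(noisy - exact) / exact:.1e}")
# Relative error shrinks roughly as 1/sqrt(n) as n grows.
```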
Tim Hwang: Yeah, absolutely. I actually really love that you’re revealing a bias in how I framed this segment, which is that hardware is the upstream thing and all the AI people have to dance depending on how the hardware evolves. Skyler, you’re almost making the reverse argument, which is that what we’re seeing now, and what this company is an example of, is an attempt to make the hardware match what we know about AI now. So the power is actually going the other way; GPUs were always kind of an accident in some ways, and now we’re trying to rebuild that. That’s a nice take.
Kush, any final thoughts on all this?
Kush Varshney: Sure, I can maybe go back to some more of my history lesson if you guys are okay with that.
Tim Hwang: I feel like we have Chris on for the crazy take, and we have Kush on for the historical, philosophical perspective. History and everything like that, right.
Kush Varshney: So, just... what is thermodynamic computing? I think it helps to understand a little bit of how this has come about as well. You said it at the beginning, that there’s some sort of hardware lottery. Sara Hooker is a researcher who wrote an essay all about this, “The Hardware Lottery”—that whatever the hardware happens to be, that’s kind of what makes things go forward.
So even the whole IBM company... it started at a time with this guy, Herman Hollerith, and he was doing punch cards. He did the US census in 1890 and stuff, right? And that’s like a paper with a hole in it—a very basic technology. And then in the sixties, Bob Dennard here at IBM Research invented DRAM, which took a capacitor and a transistor, and you could do memory that way instead of through these hole-punching things.
And then you get to the thermodynamics of it. So you have James Clerk Maxwell and the second law of thermodynamics: he’s thinking about this demon that seems able to make heat flow from cold to hot without any energy expended. And there were two researchers here at IBM, Rolf Landauer and Charlie Bennett, who figured out how to argue against this Maxwell’s demon thought experiment.
So Landauer showed that erasing information in a computation actually requires the use of energy; it dissipates heat. And then Bennett took that idea and showed that this demon sorting hot and cold molecules must actually do some information processing, and ultimately erase its records, so it’s actually using energy, and the second law of thermodynamics must hold.
So all of this is part of IBM’s heritage as well. But this new thing... I think it’s exciting. It’s been in the works for a long time as well, these thermodynamic ideas. The claim is that things like matrix inversion, which is a very important computation and very expensive to do with large matrices, can be done naturally with this sort of approach. So I think that makes a lot of sense.
Just take capacitors and inductors, and with those you can actually set up the matrix on the circuit, let it dissipate energy however it’s supposed to, and then the correlation among the different circuit nodes actually tells you the inverse of the matrix. So all of that is really cool stuff. And I don’t see why we shouldn’t be looking at those alternatives. We know a lens does the reciprocal operation; we know resistors do this or that. So why not do it this way? We shouldn’t be beholden to digital logic just because that’s how it’s happened over the years.
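One software analogue of that principle (an illustration, not Extropic’s actual circuit): sampling from a Gaussian whose energy is defined by a matrix A yields equilibrium correlations equal to the inverse of A. The sketch below simulates the dissipation with overdamped Langevin dynamics instead of physical components:

```python
# Illustration of the principle (not Extropic's hardware): the stationary
# distribution of the noisy dissipative update
#     x <- x - eps * A @ x + sqrt(2 * eps) * noise
# is approximately N(0, inv(A)), so equilibrium correlations among the
# "circuit" coordinates encode the matrix inverse.
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[2.0, 0.5],
              [0.5, 1.0]])                        # symmetric positive-definite

eps, steps, burn_in = 0.01, 400_000, 10_000
x = np.zeros(2)
samples = np.empty((steps, 2))
for t in range(steps):
    x = x - eps * (A @ x) + np.sqrt(2 * eps) * rng.standard_normal(2)
    samples[t] = x

print(np.round(np.cov(samples[burn_in:].T), 3))   # correlations ~ inv(A)
print(np.round(np.linalg.inv(A), 3))              # direct inverse, to compare
```

The hardware pitch is that a physical device gets those equilibrium samples essentially for free from thermal noise, instead of paying for every simulated step.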
Tim Hwang: I think all of these things, you take a look back and it’s always, “Well, actually it’s been going on for decades.” I feel like all of these new developments, AI included, are part of a very long historical legacy.
So for our final topic, this was a fun thing I did want to talk a little about. In a week that was just packed with different announcements, the one that seems to have taken the cake, at least in my social media feeds, has been the release of OpenAI’s 4o Image Generator. Most importantly for me is that this meme of rendering everything in a Studio Ghibli format, in an anime format, has just taken over. My social media feed is nothing but these images right now.
It’s a funny moment to take a step back and say, “Okay, image gen is suddenly trending again,” in a way that almost dampened all the other announcements this week. Playing around with it, it is really quite impressive.
Similarly, maybe Skyler, I’ll throw it to you for the vibe check. If you’ve played around with it, what do you think? Is this actually a big improvement? I mean, we’ve done style transfer in the past, so this is in some ways not new, but this seems to have really hit a nerve in a way that hasn’t been the case for previous announcements.
Skyler Speakman: It has. I have not played with it, but again, my feed has been filled with people re-memeing all of these different styles. I think with this... are we in a position to say that multimodality, at least between language and images, has been solved? Or are we gonna move the goalposts further away, or can we say we have cracked it? I think GPT-4o has cracked multimodality. This is some really, really cool, impressive tech. Otherwise, we’re gonna again say, “No, but it can’t do this,” and we’ll keep moving those goalposts. So I think it really is quite impressive, at least from all my friends playing with it and sending images over social media.
Tim Hwang: Yeah. In some ways, having played around with it a little bit, it is a triumph not necessarily of images or text-to-images, but it’s almost a triumph of the ability to correctly infer what someone is looking for when they search. I think that’s always my reflection. Playing with older versions of Midjourney, it was like, “Oh, well, not quite this. Can you make this change? Can you make this change?” and you finally get to the end. This one is kind of magical because it’s very one-shot. You’re like, “I want this,” and it generates an image, and you’re like, “Oh, that’s kind of exactly what I was looking for.”
I think that’s really interesting. Is there, Kate, you’re nodding—is there a good name for that achievement? It feels like, what’s the big jump here in some ways?
Kate Soule: Well, I think it’s important to recognize where we were before, which was DALL-E 3, back in 2023—ancient in generative AI terms, way out of date! DALL-E 3 was more or less called as a tool: swapped in when an image needed to be generated from part of the conversation, then turned off so that GPT-4o or whichever model could pick the conversation back up.
I think what we’re seeing here—obviously OpenAI does not share a tremendous amount of detail on their broader architecture and design—but based on what I’ve read in their docs and the release notes, they talk about this being a more native capability, embedded far deeper in the architecture of the system. So I think what we’re really seeing is some exciting innovation in multimodality focused on system design: how can we bring these multimodal components far closer to the core of where the language model operates?
That could mean, for example, potentially sharing some parameters and being able to bring different components together much earlier on in the process, rather than at the very last minute with a tool call—you know, “call a friend” and then pick back up the conversation. I think that is the future, not just for multimodality, but for all types of understanding and more specialized tasks: being able to have different experts—whether that’s an expert on documents, or images, or audio—integrated at a systems level, far more internal to the model. And having a model or an application, like a chatbot, be far more of a systems-based approach versus, “Here are some weights that we’re calling for a given prompt.”
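A schematic of the older “call a friend” pattern Kate contrasts against (function names and routing logic are illustrative, not OpenAI’s internals):

```python
# Schematic of the bolted-on tool-call pattern: the chat model hands off to
# a separate image model, then resumes. All names here are illustrative.
def chat_model(messages):
    last = messages[-1]["content"].lower()
    if "draw" in last:                                  # decide to call the tool
        return {"tool": "generate_image", "image_prompt": last}
    return {"content": f"(text reply to: {last})"}

def image_tool(prompt):                                 # separate model, separate weights
    return f"<image rendered from: {prompt!r}>"

def tool_call_pipeline(messages):
    reply = chat_model(messages)
    if reply.get("tool") == "generate_image":
        image = image_tool(reply["image_prompt"])       # control leaves the LLM
        messages.append({"role": "tool", "content": "generation complete"})
        return chat_model(messages), image              # conversation resumes
    return reply, None

print(tool_call_pipeline([{"role": "user", "content": "Draw a cat in the snow"}]))
# A natively multimodal system would instead produce image tokens inside the
# same forward pass, sharing representations with the text.
```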
Tim Hwang: Yeah. As someone who’s been less involved in the day-to-day, Kate, could you explain a little bit why that has been hard? Moving from something bolted on at the end to fully integrated in the system—what makes that difficult?
Kate Soule: I think part of it is just the momentum of how things have been built up to now. Starting with the release of the original GPT-3.5 models and ChatGPT, the way to scale performance has been to bake it into the model at training time—train on more data, use more parameters, and boost performance in that one upfront training step. So a lot of the system design and architecture and applications have been focused on, “Okay, there’s this big black box that we make a single call to, and we get a response back.”
We’re starting to see more of a shift. I don’t know that it’s necessarily more difficult; I think in some ways it’s actually a lot easier to innovate if we’re innovating outside of that training and innovating more on the systems-based approach. But we do have to make a conscious shift to enable that. We don’t have the same tools and capabilities available; the community needs to build those.
Particularly if you’re talking about doing this in the open versus OpenAI doing this behind a closed, gated wall—they’ve got a whole inference orchestration layer that they haven’t released to the broader world. So I think this is a big challenge that open-source models actually face in particular: being able to catch up to the same degree of this more systems-based approach, because we don’t have the same infrastructure or the same kind of revenue coming in, so to speak, to pay for that build-out to enable that system.
Tim Hwang: That’s really helpful. Thank you.
Kush, I’m gonna call on you not just as the history person, but also as the safety person. One of the things I observed in this wave has been—indeed, even the Studio Ghibli meme is something that I think traditionally companies have been a lot more restrictive on, right? To say, “Oh, you really don’t want to copy a style.” I’ve also seen a number of image generations that are a little bit at the edge of what you would consider acceptable image generation.
Do you think this also marks a shift in how companies are thinking about image gen? One way of reading this is that OpenAI is concluding that we should let up a little bit, we should allow people to use these image gen products more freely, even though it might occasionally generate some stuff which is offensive, harmful, toxic, and so on. I wanted to get a comment from you on the meta here. Are companies opening up in a way they haven’t in the past? And what are the trade-offs of that?
Kush Varshney: Yeah, I think they are. I think the image side of things is maybe a little bit more forgiving here because text is used more in business applications, where things are more legalistic; generative imagery is, in some ways, less so. So I do think that is probably the case, for demonstration and for many other things.
I was actually playing around with this. An example that my wife and I were running—she actually did her MFA in computer art a decade ago, and she took a class in digital matte painting. One of the assignments for a week was they had to take an image—it was a summertime image—and then change it so that it looked like a wintertime image of the same scene. And this thing does it really well; I mean, in a minute you have what you were looking for.
But then what she was zooming in on was the windows of the building had some minor changes across the summer and winter image. So at first glance, I didn’t notice it. She is like an expert at this, so she was zooming in and going back and forth and really looking at whether something has changed or not.
From the safety perspective, those sorts of little things that someone like me doesn’t notice are probably fine. But once you’re at a very expert level—if you’re an actual movie maker doing digital matte painting—then it becomes critical. So as a consumer tool, I think it’s all good, but there’s still a gap.
We have this researcher, Mario Klingemann; he is a world-famous AI artist. He just created a 12-minute-long, fully AI-generated film, and he couldn’t use any of the tools that are out there. I mean, he has to innovate the tools and everything else. This is being shown in Seville, Spain these days. This film is so professional-level; you can imagine the difference between what this image generation stuff is able to do and what the professionals are truly able to do. So this is not the tool for them, but I think it’s safe enough for all of us. So I think that’s the way to think about it.
Tim Hwang: Yeah, that’s really interesting. I almost love this threshold of “good enough to fool the amateurs.” I think it’s a really important threshold. I sent an image to a friend being like, “You know, it’s really impressive they get the fingers right now.” And then he shot back with a zoomed-in version of the image to show there was this little fingertip still hanging out somewhere. And I was like, “Oh, no.” It was enough to get past my sniff test, but anyone with a keener eye would’ve clearly seen the problem.
Skyler Speakman: On OpenAI’s blog release, they have a little paragraph about how they’ve used a reasoning LLM on the safety side of this generation. I don’t know if we’ll get any more details beyond that paragraph, but I thought that was interesting. I don’t know if they’re covering themselves or trying... but there was this very clear paragraph about how they’ve used their reasoning LLM to help parse through some of these edge cases of questionable generation. So yeah, I’d love to see if we get more details about that going forward, and why they called that nuance out in particular.
Tim Hwang: Yeah, that’ll be really interesting. I missed that, and I think it’s definitely worth keeping an eye on. It ties back to Kate’s little reasoning vibe check about how much time it spends thinking about whether it’s a good thing or a violation of its content guidelines—a very interesting set of questions.
Kate Soule: If it happens at training, I don’t care how long it takes.
Tim Hwang: That’s true. Exactly. If I don’t see it, take as long as you like; it’s inference time that counts.
Well, as usual, so many things to cover, not enough time to cover it all. Kush, Kate, Skyler, thanks for joining us. And thanks to all you listeners for joining us as well. If you enjoyed what you heard, you can get us on Apple Podcasts, Spotify, and podcast platforms everywhere. And we will see you next week, as always, on Mixture of Experts.