DeepSeek facts vs. hype, model distillation and open-source competition

Episode 40: DeepSeek facts vs. hype, model distillation and open-source competition

Let’s bust some early myths about DeepSeek. In episode 40 of Mixture of Experts, join host Tim Hwang and experts Aaron Baughman, Chris Hay and Kate Soule as they break down this week’s top news and trends in AI.

Last week, we covered the release of DeepSeek-R1. Now that the entire world is up to speed, join us as we separate the facts from the hype. Next, listen to the discussion on model distillation and why it matters for competition in AI. Finally, dive into Sam Altman’s response to DeepSeek and whether R1 will radically reshape the open-source strategy of other tech giants. Find out all this and more on Mixture of Experts.

Key takeaways:

  • 00:01 - Intro
  • 00:41 - DeepSeek facts versus hype
  • 21:00 - Model distillation
  • 31:21 - Open source and OpenAI

The opinions expressed in this podcast are solely those of the participants and do not necessarily reflect the views of IBM or any other organization or entity.

Listen on Apple Podcasts, Spotify, Casted or YouTube.

Episode transcript

Tim Hwang: On a scale from 0 to 10, how big of a deal is DeepSeek-R1? Kate Soule is Director of Technical Product Management for Granite. Kate, welcome to the show. What do you think?

Kate Soule: I’m going to take maybe a little bit of a controversial position. I’m going to say 5.

Tim Hwang: Chris Hay, Distinguished Engineer and CTO of Customer Transformation. Chris, welcome back as always. 0 to 10. What do you think?

Chris Hay: 9.11 or 9.9, but I’m not sure which is the bigger number.

Tim Hwang: Wow, that is a niche reference. And finally, Aaron Baughman is IBM Fellow and Master Inventor. Zero to 10, Aaron, what do you think?

Aaron Baughman: Yeah, that’s a great question, and I think we’re gonna be right in between the other two at a 7.5.

Tim Hwang: All that and more on today’s Mixture of Experts. I’m Tim Hwang, and welcome to Mixture of Experts. This is episode 40 of the show, and I’m really excited to hit this milestone with an all-star cast. Each week, MoE is the place to tune in to hear the news and analysis on the biggest headlines and trends in artificial intelligence. And today we’re all going to talk about DeepSeek-R1. It’s basically all anyone is talking about right now. It’s the talk of the AI chatter class, it’s rocking markets, and even my dad is texting me about it. So what I want to do is start the first segment with a little bit of DeepSeek-R1 myth-busting. If you’ve been anywhere around AI in the last week, you know the basic story. There is a Chinese lab, DeepSeek, that has released a new model called R1 that is both open-source and competitive with the state-of-the-art models coming out of Anthropic, OpenAI, and all the names that we’re really familiar with. And there has been so much hype about this story that, as I said, even my dad’s texting me about it. And a lot of the mainstream coverage has actually been getting a lot of the facts wrong. So what I want to start with is just to knock down a bunch of myths so we can calibrate as we really peel back and talk about this story. Kate, I want to start with you, because I know you were angry about this in the show Slack, and so I wanted to give you a chance to let loose. I think the first meme that we’ve heard in a lot of this mainstream news coverage is that we can now train state-of-the-art models for 5.5 million dollars. And that’s crazy cheap relative to the kind of numbers that we’ve heard before, right? I think the Stargate price tag was a hundred billion dollars or something crazy like that. So Kate, how true is that number? Can we really train models for 5.5 million dollars now?

Kate Soule: So, first, the number is true; it’s published in the paper. DeepSeek isn’t hiding anything about this number; it’s heavily caveated if you look at it. But the takeaway people are drawing from this number is a little bit crazy. So yes, training one iteration of a base model—DeepSeek-V3, by the way, which all came out in December; this isn’t late-breaking news as of last week—back in December, they trained this model, and they said one iteration of training it would cost about 5.6 million dollars. But the leap that a startup could now go and train a frontier model for the same cost doesn’t follow. That’s like saying if I’m going to go run a marathon, the only distance I’ll ever run is the 26.2 miles of the race itself. The reality is you’re going to train for months, practicing, running hundreds and thousands of miles leading up to that one race. And if you take the metaphor a step further, it’s like saying, “Okay, what if I’m running a race, but I take a break every mile? I stop, I take a drink of water, I take a nap, I come back the next day, I keep running.” And then you only add up the time from the actual miles you ran, not all your breaks. That’s the equivalent of what this number represents. It’s a really valuable number to understand, and impressive; on the parts they’re measuring, they did drive a lot of efficiency gains. But that number does not represent the cost to go and train a model from scratch today. It’s not like we’re going to have startups flooding the ecosystem with their own versions of 600 billion parameter mixture-of-experts models.
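
For reference, the headline figure is exactly the narrow accounting Kate describes, as published in DeepSeek’s own V3 technical report: the GPU-hours of the final training run multiplied by an assumed rental rate, with prior research, experiments and ablations explicitly excluded.

```latex
% DeepSeek-V3's published cost accounting: final-run GPU-hours times
% an assumed rental price, nothing else.
2.788\,\text{million H800 GPU-hours} \times \$2\ \text{per GPU-hour} \approx \$5.6\,\text{million}
```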

Tim Hwang: That’s super helpful, and I think it’s a great calibration. I want to pick up on the last thing you said, which is that there is a lot that’s new here from an efficiency standpoint. And maybe Chris, I’ll toss it to you for the next meme that we’re hearing float around in this space, which is: DeepSeek-R1 is a huge breakthrough, models are running way more efficiently than they used to, dot, dot, dot, DeepSeek is so far ahead. I know you said you felt this model was a big deal, a 9.11-out-of-10 big deal. But can you tell us a little bit about whether DeepSeek has really unlocked some novel things, and if so, how big of a deal those novel things really are?

Chris Hay: I said 9.11 or 9.9, so clearly, Tim, you think 9.11 is the bigger number of those two. Sorry, there are some uncertainty bars there. I actually think it is a big deal. I think there are a few things getting joined together here. We’re saying, “Okay, here’s the base model, and then there’s the RL training for the R1 part,” right? And if we separate the DeepSeek-V3 base model from the RL training for a second, I think there is a big deal here. Because the reality is, never mind the 5.5 million bucks: you are going to be able to take an existing base model that has been pre-trained, and then you are going to be able to do RL training over the top of that. You’re going to be able to take your cold-start fine-tune data, a relatively small data set, put that on top, and train the model to do amazing tasks. And I know that myself, right? Because I took a tiny model, an absolutely tiny one-and-a-half-billion-parameter Qwen model, I put in maybe a thousand lines of SFT data, and I got that thing to do math, basic arithmetic, at the same level as GPT-4o. Just myself. And I’m telling you right now, I love IBM; they do not pay me five and a half million dollars. That was on my laptop. So this is a big deal, and it’s a big deal because the thing they’re showing is that long chain-of-thought and accurate data have a huge impact. Because, even on the RL point, they started with pure RL training, right? They just said, “Here are your rewards,” nothing else, and trained the model that way. And then they said, “Actually, if we do one round of fine-tuning with a really good chain-of-thought data set, maybe a few thousand rows worth, and then do RL training after that, we get much better results.” So what they’re really showing is that we can maybe stop obsessing about pre-training so much and get into this post-training world and inference-time compute world. And for that, you don’t even need five and a half million dollars. Just your laptop, a little bit of tenacity and a little bit of GPU is gonna do the job.
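
As a rough illustration of the kind of cold-start SFT run Chris describes, here is a minimal sketch using Hugging Face’s TRL library. The model name, the toy data row, and the hyperparameters are illustrative assumptions, not Chris’s or DeepSeek’s exact recipe:

```python
# Minimal cold-start SFT sketch: fine-tune a tiny base model on a small
# set of accurate chain-of-thought examples before any RL stage.
from datasets import Dataset
from trl import SFTConfig, SFTTrainer

# Each row pairs a math question with a long, verified chain of thought.
cold_start = Dataset.from_list([
    {"text": "Question: What is 25 + 8?\n"
             "Reasoning: Start at 25. Add 5 to reach 30, then add the "
             "remaining 3 to reach 33.\n"
             "Answer: 33"},
    # ...a few hundred to a few thousand more verified rows
])

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-1.5B",  # a ~1.5B base model that fits on a laptop GPU
    train_dataset=cold_start,
    args=SFTConfig(
        output_dir="qwen-cold-start",
        per_device_train_batch_size=2,
        num_train_epochs=3,
    ),
)
trainer.train()
```

The RL stage Chris mentions would then run on top of this checkpoint, with a verifier-style reward like the one sketched later in this transcript.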

Tim Hwang: That’s awesome. We’re gonna talk a little bit more about RL and chain-of-thought in just a second, but Aaron, before we move to that, there’s one other question I want to ask you. The third big meme is that everybody suddenly discovered Jevons Paradox this week. One of the narratives that popped up is that NVIDIA is doomed: you need a lot less compute for these models now, and NVIDIA’s stock price took a tumble. I bought the dip, for what it’s worth. So Aaron, I’d love you to respond to this one, or tell us whether it’s a myth at all: are we going to need a lot less compute in the future? Is NVIDIA doomed? How should we read this? And if you want to explain Jevons Paradox, go ahead too.

Aaron Baughman: Yeah, I mean, fundamentally, that’s an interesting notion. But I tend to follow the dynamics of AI, which come down to three different areas. One is the scaling law, right? As you scale up the training of AI systems, you get better results, which means bigger models are generally better. Then there’s the shifting curve, where new ideas are making training more efficient, which affects that scaling law: the more new ideas you get, the more powerful smaller models become. But then there’s a third, the shifting paradigm, where big revolutionary ideas can change the scale at which you actually need to train these models to get performance. And with those three points laid out, which are backed by a lot of research over time, I think yes, there’s always going to be demand for GPUs, though I do think different chip architectures are coming out. But also, look at some of the efficiency gains that V2 had, such as multi-head latent attention, where they were able to compress and cache the attention keys and values. The token throughput they were able to achieve is incredible, and I think that was one of their bigger innovations. The second one was DeepSeekMoE, where they’re able to partition out and share knowledge amongst the different experts. And that also helps. Those two things were some of the pieces that gave us the shifting curve on that scaling law, which said, “Okay, I don’t need as many GPUs now.” But if you look at the foundation model itself, say DeepSeek-V2, it’s big. It’s a very big model, and V3 is even bigger at, what is it, 671 billion parameters, right? That’s a very big model.
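
To make the “scaling law” and “shifting curve” picture concrete: one common parameterization, from the Chinchilla line of work rather than anything DeepSeek published here, models loss as a function of parameters and data, and efficiency innovations shrink the constants, shifting the whole curve down.

```latex
% A common scaling-law form (Hoffmann et al., "Chinchilla"):
% N = parameter count, D = training tokens. Efficiency ideas effectively
% lower E, A, or B, so a smaller N can reach the same loss.
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```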

Tim Hwang: Yeah, it’s chunky.

Aaron Baughman: So it’s very fun to watch that curve. And I think we’ll see agglomeration of models: you can do reverse distillation to combine smaller models into bigger ones, and you can do model distillation to create smaller models. It’s going to be fun.

Kate Soule: I want to pick up on a point Chris mentioned that I think is really important, which is the “Can we just stop worrying about pre-training now?” Because everyone is talking about this 5.5, 5.6 million dollar number, and they’re tagging it to all of the amazing performance improvements we’re seeing in the R1 model and the distilled models, and they’re equating the two and saying, “Alright, now we can go and get this crazy performance at a pretty minimal cost.” And I think it’s really important to disambiguate these two things, right? One step in this process costs about 5.6 million; the true cost of building this pre-trained model is likely orders of magnitude higher. But regardless, it almost doesn’t matter; this 5.6 million number doesn’t even matter, because you can take this big model that’s now open-source and distill it basically for free on top of other open-source, smaller models that are out there to get crazy performance improvements. So it’s not that startups are going to go and pre-train their own 600 billion parameter models because the cost is only 5.6 million dollars. That’s the wrong takeaway. It’s that we now have the ability to distill, and to use reinforcement learning, a technique DeepSeek used really effectively, to build our own smaller versions of these models that are really powerful. And thanks to more and more competitive models being put into the open source, that distillation is becoming even more powerful. That is where there’s actually a very low barrier to entry now; as Chris is saying, you can do it on your laptop.

Aaron Baughman: Yeah, yeah, yeah. I find that a very nice way to put it, because it’s like a house of cards, right? They’re only quoting the top card; they’re not quoting any of the cards at the bottom. And if you move one of those cards at the bottom, the whole house collapses. That 5.5 million is only the cost associated with maybe one epoch of this type of training, and that’s it. But look at the hardware they used, the H800s: just procuring those, or using them as a service, is expensive, right? So they’re excluding lots of costs associated with prior research, ablation studies, and lots of different things, which makes that number very, very misleading.

Tim Hwang: Chris, I see you nodding. Do you want to jump in? I don’t know if you have a comment.

Chris Hay: No, I was just nodding in agreement, because I’m a very kind and collaborative person. But I absolutely agree. You’re gonna go for the big hit numbers, right? You’re gonna say, “We did this super cheap,” and you’re going to leave out all the steps that took you there in the first place. And as Kate probably knows better than anyone, the amount of experimentation it takes for these models to get to the final version is a lot. So that final epoch, as Aaron was saying, that final training run, is just the end of the road. But you know what? No one wants to hear about the long journey getting there; they want the big number. We’re in a hype industry, baby. So yeah: five and a half million, here we go, right?

Tim Hwang: Kate, maybe one last myth I’ve seen popping up that would be good to address before we do a segment on distillation—it’s already come up a couple of times, and it’s worth explaining what it is and why it matters—is the point about RL. It feels like the DeepSeek narrative has also been a little bit about the revenge of RL: reinforcement learning is back, baby. And some people have gone so far as to say everything is RL now, fine-tuning is dead. Do you want to talk a little bit about that? Even with everything we’ve said, how much does R1 indicate that RL will be the dominant method for these kinds of fine-tuning efforts going forward?

Kate Soule: Yeah, and I’m really curious to get Chris’s take on this because I know he’s just run these experiments locally on his own laptop. So DeepSeek, in their paper, trained two models in addition to all the smaller distilled models they worked on. One model was trained with reinforcement learning only, no additional data added: you’ve got your pre-trained model, which costs 5.6 million plus all the arguable buffer on top, and they just use reinforcement learning, with rules-based systems more or less to verify the results and score the responses. They called that R1-Zero. Then there’s R1, which they also created because, as they mention in the paper, there were some rough edges, so to speak, on the reinforcement-learning-only model. For that model, they start with some fine-tuning, using structured data to better prime it for the reinforcement learning stage. And that is the model everyone’s now playing with on the DeepSeek app and that everyone’s really excited about. So I think it’s a really interesting look. The takeaway shouldn’t be, “Oh, we can’t do RL only; they had to resort to this cold start and fine-tuning before the model was released.” The takeaway should be that it’s amazing how far they were able to push just RL. Yes, there’s still always going to be a need for some structured data, and maybe a hybrid approach is best, but it is kind of crazy how far they pushed it. Now, getting to the distilled models, since you asked about distillation: distillation has been around forever, back to the early days of the first Llama model, which a group of students distilled into Vicuna. It’s basically where you generate a bunch of synthetic structured data from a big model and use that to fine-tune a small model. DeepSeek also ran the comparison the other way: doing RL only on a small model, no big models involved, to see how far they could get. They published numbers on Qwen 32B. How far can you push Qwen 32B’s reasoning with just RL? In the paper, they claimed they weren’t able to push it nearly as far or get real reasoning capabilities out of the model; they had to resort to distillation: take their big R1 model, generate a bunch of synthetic data, and tune the small model on it. So I’m curious about your perspective, Chris, based on the RL experiments you’ve been doing with small models. You said you also did some fine-tuning first to start it off, with chain-of-thought reasoning, and then RL on top.
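
In code, the distillation recipe Kate describes is little more than sampling from the teacher and fine-tuning the student on whatever survives a correctness check. A minimal sketch, in which the seed prompt and the toy `is_correct` checker are illustrative assumptions (and where, in practice, a 671B teacher would be served behind an API rather than loaded locally):

```python
# Sketch of synthetic-data distillation: sample reasoning traces from a
# big open teacher, keep the verified ones, fine-tune a small student.
from transformers import pipeline

teacher = pipeline("text-generation", model="deepseek-ai/DeepSeek-R1")

prompts = ["Solve step by step: what is 17 * 24?"]  # seed tasks

def is_correct(trace: str) -> bool:
    # Toy checker: accept traces ending in the true value of 17 * 24.
    # A real pipeline would use a verifier like the one sketched below.
    return trace.strip().endswith("408")

synthetic = []
for prompt in prompts:
    trace = teacher(prompt, max_new_tokens=512)[0]["generated_text"]
    if is_correct(trace):
        synthetic.append({"text": trace})

# `synthetic` then feeds an ordinary SFT run on the student model,
# exactly like the cold-start example shown earlier.
```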

Chris Hay: For me, the critical thing is the long chain-of-thought reasoning—an accurate long chain-of-thought. That is the thing that really enabled everything. Again, if you look at the paper, when they did pure RL, they got there eventually. But think about math problems especially; LLMs are not really good at those. You’re going to say, “What’s 25 plus 8?” and ask an LLM to generate the sum, and it may or may not get it right. It may or may not produce the length of chain-of-thought you want, and it may not get its explanations right. So getting an accurate chain-of-thought is a bit of a crapshoot. Then at the end, they use this thing called a verifier. What the verifier does is take the answer you’ve got, run a few rules to check the equation, and say, “Yeah, that was correct,” or “That was wrong.” And then the model gets a reward: “Here’s a cookie. Well done, model. Good job.” But if you think about how long that takes, you really are at monkeys and typewriters at that point; it takes time for the models to come back with the right answers. Now, if you run a fine-tuning step before that, if you can produce long, accurate chains of thought for those math equations, for example—and I’m picking math because that was in the paper—then the model knows, “Here’s the right way of doing it,” and you don’t get a cookie if you get it wrong. So I think the combination of the two is the key thing. But I actually think the real takeaway from that paper is the long chain-of-thought. When I did my experiment on my YouTube channel, I took a slightly different approach from DeepSeek. I have a thing called a math compiler. I automatically generate the math equations, put them into my compiler, generate an abstract syntax tree, and walk the tree. So I don’t need the LLM to do the math; I’m just going step zero, step one, step two, walking the tree and outputting the explanations. Then I use the LLM to transform that into actual human language, with the explanations behind it, and that’s how I got these really accurate chains of thought. When I put that in as a fine-tuning step—I think I used maybe a hundred examples, on a one-and-a-half-billion-parameter model—the math was incredible. It was accurate to a couple of decimal places, which the larger models of six months ago would be nowhere near. So I think the real innovation is the long, accurate chain-of-thought. It’s not that RL won’t get you there; it will, but it’s going to take a long time. If you can short-circuit that a little and then have RL smooth out the edges, you’re really going to win. That’s my view on this.
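
A minimal sketch of the rules-based verifier Chris describes: compute the arithmetic independently (here by walking Python’s own syntax tree, in the spirit of his math-compiler approach) and hand out a reward only when the model’s final answer matches. The “Answer:” output format is an assumption for illustration:

```python
# Rules-based verifier sketch: evaluate the expression ourselves, then
# reward the model only if its final answer agrees.
import ast
import operator as op

OPS = {ast.Add: op.add, ast.Sub: op.sub, ast.Mult: op.mul, ast.Div: op.truediv}

def evaluate(expr: str) -> float:
    """Safely evaluate an arithmetic expression by walking its AST."""
    def walk(node):
        if isinstance(node, ast.BinOp):
            return OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.Constant):
            return node.value
        raise ValueError("unsupported expression")
    return walk(ast.parse(expr, mode="eval").body)

def reward(expr: str, model_output: str) -> float:
    """Return 1.0 (a cookie) if the model's final 'Answer:' line is right."""
    try:
        predicted = float(model_output.rsplit("Answer:", 1)[1].strip())
    except (IndexError, ValueError):
        return 0.0  # unparseable output earns nothing
    return 1.0 if abs(predicted - evaluate(expr)) < 1e-6 else 0.0

print(reward("25 + 8", "Step 1: 25 + 5 = 30.\nStep 2: 30 + 3 = 33.\nAnswer: 33"))  # 1.0
```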

Kate Soule: RL is really valuable for tasks like math, where it’s easy to check the accuracy and relatively easy to generate that chain-of-thought. But when we look at the paper, for example, they talked about still needing some instruction tuning for tasks like tool calling and instruction following. These reasoning models aren’t designed to do every single task; they’re specific to reasoning, and you’re still potentially going to need instruction tuning to handle some of those more specific instruction-following tasks.

Tim Hwang: So I’m going to move us on to our next segment. This has been super helpful in setting the scene and knocking down some of the myths that have popped up. We’ve already talked a bunch about distillation, and on the last episode Skyler actually gave a brief explanation of it, but for those who weren’t listening then—Aaron, I’ll toss it to you—I think it’s worth our listeners getting a sense of what distillation is in the first place. And once you’ve given that explanation, there are some interesting things worth getting into about what this means for where the industry is going. But maybe the quick capsule explainer first.

Aaron Baughman: Yeah. Model distillation is a very powerful technique. It’s about having a teacher model, typically a bigger model that has encoded much more information in its weights and embeddings, and transferring that knowledge to a student model. The student model is usually smaller, so it in turn requires fewer resources to train and to use for inference. Some people and groups think of this as model compression, making a model smaller. And there are different things you can distill: response-based knowledge, feature-based knowledge, or even relation-based knowledge, the relationships between connections across the neurons. One interesting thing I wanted to bring up from the R1 paper is that the distillation process, to me, wasn’t just about model compression or getting knowledge out; it was almost model translation. Because they were distilling information from an MoE directly into student models that in many cases were dense networks. Changing the model architecture like that looked to be a different way of doing model distillation, and I think it gave R1 some advantages, especially when using Qwen2.5 and the Llama 3 series as the base foundation models to distill into.
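
For the response-based flavor Aaron mentions, the classic formulation (Hinton-style knowledge distillation—not necessarily DeepSeek’s exact procedure, since their R1 distillation was done via synthetic-data fine-tuning) trains the student to match the teacher’s softened output distribution. A minimal PyTorch sketch, with illustrative temperature and weighting:

```python
# Response-based knowledge distillation sketch: blend a KL term that
# matches the teacher's softened distribution with ordinary
# cross-entropy on the ground-truth tokens.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: KL divergence between temperature-softened teacher
    # and student distributions (scaled by T^2, per Hinton et al.).
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: standard cross-entropy against the true next tokens.
    hard = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)), labels.view(-1)
    )
    return alpha * soft + (1 - alpha) * hard
```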

Tim Hwang: Yeah, and I think one of the most interesting elements of distillation is the idea that you can take any large model and bring that knowledge into whatever it is you’re building. Literally in the last 24 to 48 hours, there was a bit of a controversy over whether DeepSeek effectively used OpenAI’s chains-of-thought or other inputs and outputs to do the distillation here. I guess, Kate, the question is: this makes it very hard for any model company to protect its models, right? Because everything is distillable. Is that the right way of thinking about it?

Kate Soule: Yeah. I mean, by releasing a very capable model—the most capable model to date in the open source, with a permissive MIT license—DeepSeek is essentially eroding the competitive moat that all the big model providers have maintained by keeping their biggest models behind closed doors. And regardless of whether DeepSeek also benefited from distillation from those bigger models, we’re now able to go and take that really big model in the open and use it freely, where before—I mean, this distillation from GPT has been going on for ages. Anyone can go to Hugging Face and find tons of data sets that were generated from GPT models, formatted and designed for training, and likely taken without the rights to do so. So this is an open secret that’s been going on forever. Yes, it most likely worked its way in some degree or fashion into the DeepSeek model, but it almost doesn’t matter anymore, because DeepSeek is now out there, and that model can be used to run very similar distillation to great effect on as many small models as you like. And anyone who uses DeepSeek’s model now has the rights to do so, according to the license it’s published under.

Tim Hwang: That’s right. Yeah, I think one of the funniest parts about the news cycle has been, “They used a secret, sinister technique called distillation.” Yeah, it’s like, actually everybody’s been distilling all the time. It’s just happening.

Kate Soule: It’s been around forever.

Tim Hwang: And it costs 5.5 million dollars. That’s right. Yeah, exactly. What strikes me, Chris, even from the example you gave earlier, is that it turns out you don’t need a whole lot of data to make these models much, much better. And it seems like there’s this fundamental thing in the market where, unless you want to control access down to the nth degree and prevent people from getting outputs from a model, there’s basically no way to stop distillation, right? I don’t know if you think there’s a realistic way to prevent it at all.

Chris Hay: No, I don’t think so. The reality is, as Kate said, there are open-weight models out there, and people are gonna do that. And I love this, by the way, and the reason I love it is that I’m all for chaos. I’m all for open source. I’m all for sharing and collaboration. People are going to go off now, create their own data sets, distill from different models, share that out in the community, and you know what? We’re all going to end up with better stuff. So I’m not a big fan of the closed models, personally; that’s my opinion. I’m a big fan of sharing and learning from each other. That’s what gets me excited about the DeepSeek stuff. And again, it’s not just the fact that they put the model out there for you to distill from; they talked about the techniques they used. So it’s cool; we can all start doing interesting things. And I don’t think we’re suddenly all going to be out competing with OpenAI, Anthropic, blah, blah, blah. I don’t think all these people sitting in their bedrooms are going to do that. But what they might be able to do is take one of these out-of-the-box pre-trained models and then solve a particular task of their own, specific to their use case, that the general model can’t do. But again, don’t undersell this. Kate, you know this better than anyone: fine-tuning models is really hard, because of all the biases. You might think, “Hey, my model is now great at this one particular task,” but you’ve just ruined that model for every other task, because you didn’t have the right biases and mixes within that data set.

Kate Soule: Yeah. Just take a look at the Hugging Face Open LLM Leaderboard. All those distilled versions of Llama and Qwen are on there, and they all rank significantly lower than the original models they were distilled from on those leaderboard tasks, which are not predominantly reasoning-based. So the model was boosted in reasoning, but other general performance characteristics dropped. I still think it’s incredibly powerful, though. And when we talk about DeepSeek introducing this new era of efficient open-source AI, it’s true—just not because they trained a really cost-effective model during pre-training. It’s true because we now have the methods to create armies of distilled, fit-for-purpose models, specific to the tasks you care about, because we have better tooling, like powerful teacher models, out in the open-source ecosystem.

Aaron Baughman: Yeah, yeah, yeah. I think there are a lot of secret agents hidden amongst our labs, and in the next couple of days or weeks you’ll see them released as super agents that we can all use. So I really think this might have been one of the impetuses to showcase what’s happening within the field of AI; DeepSeek just happened to be right time, right place to connect all the dots together. But I do think lots of these technologies, new innovations, and inventions are coming out. You asked the question, “Can you prevent someone from distilling a model?” That brings me back to biometrics. It used to be, “Can you prevent someone from stealing a picture of your face?” And we came up with this cancelable biometrics invention, so that if someone took your picture, you could revoke your biometric and create a new one. So I think there might be some cancelable technologies and patents that we could work on together to achieve some of this.

Tim Hwang: Final question here, I think for Kate, particularly given your work on Granite. There’s maybe one point of view that says the only reason investors have put money toward building these giant models is the idea that if you build them, you’ll capture all the value from them. And it seems to me that if distillation gets good—and granted, distillation is hard in some respects, but with enough eyeballs someone will eventually figure out ways of cracking it—is there an argument that it erodes the incentive to invest in building the big model in the first place? It’s almost an accident that we’ve ended up with these giant models, partially based on the idea that you could have exclusive control over them, and that control feels like it’s rapidly escaping anyone’s grasp.

Kate Soule: Yeah, look, I don’t think there’s any incentive to build really big models just to run at inference time. The incentive is to build really big models to help you build really small models. It started with Llama releasing a 400-billion-plus-parameter model, then NVIDIA releasing a large model as a teacher, and now DeepSeek releasing their 600-billion-plus. Size isn’t everything; you also have to have high-quality post-training, which is why the reinforcement learning part of DeepSeek is so important. But we’re seeing more and more large models that can be used openly to train these smaller models, and I think it’s just going to continue to make the teacher-model layer more of a commodity. Why pay for those big models if we’ve got similar capabilities out in the open that you can customize further? I think we’re going to converge on a point where we’ve got powerful enough tools to craft the smaller models we need, and those will run 80 to 90 percent of our generative AI workflows in the future.

Tim Hwang: Yeah, it’s kind of a funny world where you never talk to the giant model that’s just inside company headquarters, and then just lots of tiny models that are coming out around it. Well, in the last few minutes, I want to zoom out a little bit. We’ve been talking a lot about DeepSeek and what’s going on underneath the hood. And I want to just take a moment to talk a little bit about what all the other companies are doing relative to this development in the AI space. Sam Altman, of course, the head of OpenAI, put out a little tweet thread kind of responding to this news. And I’ll just quote a little bit of it. He said: “We are excited to continue to execute on our research roadmap and believe more compute is more important now than ever to succeed at our mission.” Which is really a statement by a guy that says, “Steady as she goes, we’re continuing on the research path as we had planned, and nothing has changed by the DeepSeek release.” I guess, Chris, maybe I’ll kick it to you. Do you buy that? Like, is OpenAI pretty much gonna just keep doing its strategy? Or does this really kind of fundamentally change what they’re gonna need to do?

Chris Hay: Nah, he’s gonna release his models sooner. He’s been holding on to these models for too long, and he needs to get on with it. And good on you, DeepSeek, right? Where’s my o3? You showed it to me over Christmas. Do I have it in my hands? No. So thank you, DeepSeek. Maybe we’ll get his model out a bit quicker, and then we’ll get o4 and o5, and then maybe we’ll get some of these models in Europe. Because guess what? They’re releasing vision models and video models, and I don’t have any of them, so I’m gonna get them as well. So woohoo!

Tim Hwang: So I guess ultimately what you’re saying is it just accelerates his roadmap, right? To just get him off the fence.

Chris Hay: There is no way he’s just gonna sit there and go, “Uh uh uh, I’m not giving you my model,” while DeepSeek is getting all of this press. He’s gonna respond, and we’re gonna get new models.

Tim Hwang: But I think, Aaron, maybe to turn it to you, you don’t think this changes their approach to, ironically, being kind of a closed-source model here, right? Like, this is not the kind of situation where you believe that OpenAI or Anthropic, any of the big providers, would say, “Hey, now we need to start switching to open source is the way we play this game.”

Aaron Baughman: I mean, this could go in several directions. Open versus closed source, I think there are advantages and disadvantages to both, but ultimately it helps the academic community, which in turn fuels economies of scale for the average consumer. Because if you think about it, you have two groups, the open-source group and the closed group, and they compete to make sure one is better than the other, which spurs innovation. Okay, great. And then within each of those groups, you have companies and organizations that in turn compete. You have this n-squared of competition that further accelerates innovation. So Sam Altman, I think, is going to release his secret agents sooner, make them available. And lots of the techniques DeepSeek has shown—the key-value caching layer they came up with, some of their MoE innovations, and the parallelization where they share context and information across the grid—lots of that is going to be included, I think, in Sam Altman’s models, but pushed even further with their own innovations, and it’s going to splinter out a bit. But fundamental model distillation and so on, I think that’s going to be very key. And then it brings the value proposition down to frameworks: how can I better train the models for my own fit-for-purpose use, whether I’m an enterprise or a consumer, and how can I trust it? Because there’s going to be a zoo of models out there now, and it’s very confusing to pick which one to use.

Tim Hwang: Yeah, Kate, so we’ve been talking about OpenAI; obviously they take up a bunch of airtime. But as we zoom out on this DeepSeek story, one thing to ask is whether the other big players are similarly situated. Everything we’ve been hearing is, “Okay, OpenAI is going to continue its strategy; it’s just going to move faster.” Do you think it changes the economics or the decision-making at all for, say, a Google or a Meta or, you know, even an Anthropic?

Kate Soule: I don’t think it changes the decision-making or strategy overall. A lot of DeepSeek’s strategy is necessity being the mother of invention: they only had access to H800 chips, so they optimized the hell out of them. They invested in efficient architectures like MoEs, and DeepSeek was born, right? The U.S.-based labs are operating with very different constraints, and DeepSeek’s innovation doesn’t necessarily change that calculus. Also, a lot of what we’ve talked about today with DeepSeek is distillation, and for the labs pursuing AGI, distillation is not necessarily as relevant. They need to keep training as big a model as possible, and they have incentives to keep that behind closed doors. Whereas the business value—again, my take—is all around these distilled smaller models, which are what people are actually going to deploy in a commercial setting. And I don’t think this changes, at least at the highest strategy level, their investment in that longer-term AGI game. For that, you still need a crap ton of big GPUs, and they’re not going to want to release any of that out in the open.

Tim Hwang: Yeah, it’s not like they’re going to use Stargate to do small distilled models. That would be the funniest thing. It’s actually an inference cluster. Surprise. That’s, I think, really fascinating. I guess, Chris, maybe I’ll turn to you for the last word and last question here. Kate just talked a little bit about the idea that Chinese researchers are operating under very different constraints, so they develop different types of methodology, different types of models, different types of proficiencies. And do you think there’s something to the idea that we have an embarrassment of compute among the U.S. labs, and so it actually limits the degree to which we would ever invest in the kind of thing that DeepSeek would be working on? I’m really interested in the idea that these constraints really mean that AI will start to look pretty different in different parts of the world as researchers operate under very different constraints of what they need to do to deploy systems.

Chris Hay: I think that’s exactly the case, right? And you can see a little bit of reinforcement learning and reward modeling happening there, like you were saying. If you have less compute available to you, you have different incentives, and you get rewarded for being more efficient. If you’ve got an abundance of compute, you’re not really going to be optimizing for efficiency; you’re going to be trying to get your models out first. And speaking from my own experience, I don’t have any H100s kicking around. What have I got? I’ve got my MacBook Pro, right? So you try to come up with innovative techniques to work within the hardware constraints you run under today. Honestly, if they didn’t have the chip constraints in China, I’m not sure DeepSeek would have come up with those techniques, because they would have just been trying to catch up with everybody else, as opposed to coming at things from a different angle. And that’s one of the reasons I believe in open source so much: everybody’s sharing their papers, everybody’s running under different constraints, and they’re going to find new innovations, and if we share that, we all learn from each other and can contribute. And that’s not just the big labs, but the people in the community, just with their laptops, trying to discover and experiment with new things.

Tim Hwang: Ah, so I love this panel. Kate, Aaron, Chris, thank you for joining us on the show as always and walking us through DeepSeek. A lot more to talk about, and we will be tracking the story. And thanks to you listeners for joining us. If you enjoyed what you heard, you can get us on Apple Podcasts, Spotify, and podcast platforms everywhere. And we will see you next week on a jam-packed episode again of Mixture of Experts.

Stay on top of AI news with our experts

Follow us on Apple Podcasts and Spotify.

Subscribe to our playlist on YouTube.