OpenAI o3, DeepSeek-V3 and the Brundage/Marcus AI bet


Is deep learning hitting a wall? It’s 2025, and Mixture of Experts is back and better than ever. In episode 36 of Mixture of Experts, join host Tim Hwang along with Chris Hay, Kate Soule and Kush Varshney as they talk about one of the biggest releases of 2024, OpenAI o3. Next, hear the experts discuss the release of DeepSeek-V3. Finally, the experts dissect the AI bet between Miles Brundage and Gary Marcus. All this and more in the first episode of Mixture of Experts in 2025.

Key takeaways:

  • 00:01—Intro
  • 00:49—OpenAI o3
  • 14:40—DeepSeek-V3
  • 28:00—The Brundage/Marcus bet

The opinions expressed in this podcast are solely those of the participants and do not necessarily reflect the views of IBM or any other organization or entity.

Listen on Apple Podcasts, Spotify or YouTube

Episode transcript

Tim Hwang: A frequently asked question: is deep learning hitting a wall? Chris Hay is a distinguished engineer and the CTO of Customer Transformation. Chris, what do you think?

Chris Hay: Oh yeah, totally, Tim. In fact, I think it’s going backwards. I think the models are getting worse and worse and worse. This is the worst it’s ever been. It’s totally hit a wall, Tim.

Tim Hwang: Happy 2025, Chris. Kush Varshney, an IBM fellow working on issues of AI governance. Kush, welcome back. What do you think?

Kush Varshney: I think there is a wall, but it’s not an insurmountable one. I think we’re making progress. We’re changing it up: instead of just taking some steps, we’re doing some rock climbing. That’s a little bit more of a serious answer.

Tim Hwang: And Kate Soule is Director of Technical Product Management for Granite. Kate, happy 2025. What’s your take?

Kate Soule: No, I don’t think deep learning is hitting a wall. I think we’re finding new ways to apply it in 2025 that are going to have some interesting benefits.

Tim Hwang: All right. All that and more on today’s Mixture of Experts. I’m Tim Hwang. Happy 2025, and welcome to Mixture of Experts. Each week, MoE offers a world-class panel of product leaders, researchers, and engineers to analyze the biggest breaking news in artificial intelligence. Today we’re going to be talking about the release of DeepSeek-V3 and a very public wager between an AI booster and an AI skeptic. But first, let’s talk about OpenAI’s o3. This was the last announcement of OpenAI’s “12 Days of OpenAI” marketing event at the end of last year, and it was arguably the biggest. They have touted a new model, which is now getting limited trial access for safety purposes, that blows out of the water a lot of the benchmarks that people have traditionally used, or argued for using, to measure whether or not we’re getting close to AGI. So, on a benchmark that we’ve talked about on the show in the past, FrontierMath, OpenAI’s o3 is doing incredibly well. And one of the reasons I wanted to bring this up is that, after a news cycle late last year of people saying deep learning is slowing down, the old methods don’t work anymore, pre-training is over, and a lot of general hand-wringing, this really reset the narrative—at least in the circles that I run in—to say that there’s maybe a lot more room to run on all this. Chris, maybe I’ll turn to you first. You sort of outright made fun of me on the opening question. What’s your take on the o3 model? How important is it? Does it really indicate that there’s still a lot more progress to run? How do you read it, basically?

Chris Hay: I think it’s a great thing, actually. So I’ve been playing a lot with the o1 and o1 Pro models, and I’ve been having the best time with them. Inference-time compute is really working. So I’m excited about o3. I’m just kind of annoyed that we don’t have it yet, though. That’s the real thing. It’s yet another “this is coming soon,” and that’s sort of annoying me, especially being in Europe, because in Europe we don’t get anything these days. We didn’t get Sora; we didn’t get half of the models that came through on the “12 Days of Christmas.” So, I’m excited about o3. As for the benchmark thing, two things in my mind: one, in my opinion, benchmarks are stupid, so I’m not really going to read into that. And the second thing is, even if we take the view that benchmarks aren’t stupid, it took an awful lot of time to come back with the answers, and it was a little bit “monkeys and typewriters,” right? Which is, if you type long enough, then you’re eventually going to get the answer. But with that aside, I’m so impressed by o1 and o1 Pro that I’m super excited about o3. I think it’s going to be a great model, and it’s really proving out inference-time compute.

Tim Hwang: Yeah. One follow-up there is, I know you’re saying you think all benchmarks are stupid, but you think this model is better. So what use case do you have in mind where you’re like, “Oh, actually, it seems o1 is noticeably better than what we’ve had before”?

Chris Hay: Yeah, there’s probably a few. The main one for me is coding, right? I mean, it is completely on a different level. Even compared to Claude 3.5 Sonnet, GPT-4o, the early versions of o1... honestly, o1 Pro is on a different level. Now, probably the big thing I’ve found working with these models is that o1 Pro just takes quite a long time to come back with an answer. So I end up switching between models all the time. It’s like, “Okay, I want a fast answer on this, I think it can handle this. Oh, no, it can’t handle it, I’m going to switch from o1 to o1 Pro.” Switching models based on how fast I need an answer back and how much reasoning I want is a technique in itself. But for me, coding is definitely the biggest thing. I don’t really care about the math stuff because, like, I’ll just use a calculator, right? But for coding, I see a marked difference.

Tim Hwang: Got it. Okay, maybe I’ll turn to you, Kate. I think, you know, if you’re not watching this space super closely, it’s easy to get, like, bewildered by the number of models and the fine variations between all these models coming out. You know, I think famously, or at least it was kind of talked about, the reason they jumped from o1 to o3 was that “O2” was already used by a UK telecom company, so it was a trademark thing that led to the name o3. But I guess, Kate, my question for you is: can you help our listeners understand a little of what’s new with what they’re trying with o3, kind of looking under the hood? These models seem to be a lot more performant, but there also seem to be a lot of new things being tried underneath the surface. And I think it’s worth our listeners knowing a little bit of the flavor of that, if you want to speak to it at all.

Kate Soule: Absolutely. So I think the most important thing for our listeners to understand when looking at the new o3 model, and the o-model series in general from OpenAI, is that we’re transitioning from spending and innovating at the training time of the model, and instead saying, “Okay, let’s take a model that’s been trained, run it multiple times, and spend more compute at the actual inference time, when it’s being deployed out in the world.” And it seems like with the o3 models, they’re continuing to innovate in what can be done at inference time—having the models essentially think longer (at the risk of anthropomorphizing these models) through different tasks, and search through many different potential options and solutions before picking the best one, which leads to improved performance but also takes longer. To Chris’s point, you have to wait longer for a response. One of the things that I think is really important and really exciting about the o3 model, and this broader investment and pivot to more inference-time compute, is that it actually can give you some really nice trade-offs. And I think this is where we’re heading; o3 has foreshadowed a little bit that you can run these models in a more efficient mode, or, if you need the maximum performance, in a kind of compute-intensive mode. And I think that’s going to be really cool, because it gives people the ability to set their compute budget and set their time constraints—you know, for latency, if they need a response quickly. I think we’re going to see a lot more of that in 2025: people playing along that cost-performance trade-off, even within a single model, saying, “Okay, I want my model to think about this for only a minute, versus I want my model to give a response immediately, versus my model can think about this for five minutes and then give me a response back,” depending on how much I’m willing to pay and how important it is that the model gives a really strong response back.
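
To make the cost-performance trade-off Kate describes concrete, here is a minimal sketch of inference-time scaling via best-of-N sampling under a time budget. The `generate` and `score` functions are hypothetical stand-ins for a model call and a verifier or reward model; OpenAI has not published o3’s internals, so this illustrates the general idea rather than their implementation.

```python
import random
import time

def generate(prompt: str) -> str:
    # Hypothetical stand-in for sampling one response from a model.
    return f"candidate-{random.randint(0, 999)} for: {prompt}"

def score(prompt: str, candidate: str) -> float:
    # Hypothetical stand-in for a verifier/reward model rating a candidate.
    return random.random()

def answer_with_budget(prompt: str, budget_seconds: float) -> str:
    """Best-of-N inference-time scaling: keep sampling candidate answers
    until the compute budget runs out, then return the highest-scoring one.
    A bigger budget usually buys a better answer, at the cost of latency."""
    best, best_score = None, float("-inf")
    deadline = time.monotonic() + budget_seconds
    while best is None or time.monotonic() < deadline:
        candidate = generate(prompt)
        candidate_score = score(prompt, candidate)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
    return best

# "Answer immediately" versus "think longer about it":
fast = answer_with_budget("plan a query", budget_seconds=0.1)
slow = answer_with_budget("plan a query", budget_seconds=1.0)
```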

Tim Hwang: Yeah, definitely. Some people were joking online, I saw, that this is kind of the return of the old turbo mode on computers, where you’re like, “We want the computer to work harder.” But it raises a really interesting question about what users want the computer to think hardest about, which I think is kind of a counterintuitive question about what types of queries and tasks demand that. It’ll be really interesting to see. Kush, it’s ideal to have you on the line as well, because I think one of the most interesting parts of the launch—you know, Chris was frustrated by it, he was like, “Come on, just give me access to the model!”—is that, in traditional OpenAI style, they’ve said, “Well, no, we’re being careful with the launch, and you can get access to the model if you’re a safety or security researcher.” They’re allowing people to request access to go and red-team the model. I’m curious how you read that, as someone who thinks about AI governance. Is that going to be the paradigm for how companies release models going forward? Or do you see this as marketing, right? They’re using the safety thing to be like, “Give us just a few more months to iron out the loose ends.”

Kush Varshney: I think it’s a combination of both, actually, because there’s this concept of the gradient of release, and Irene Solaiman from Hugging Face came up with this, and it’s kind of like, maybe take your time: the more powerful the model is, the more slowly you may need to roll it out. But I think it’s a combination. So, OpenAI gave their models to the UK AI Safety Institute for testing in advance as well. And some of this, I think, is just to be able to say that they did it; some of it is to actually have some better safety alignment and so forth. So, yeah, I think it’s here and there. One other thing in this o3 release that they talk about is a new way of doing safety alignment; they call it “deliberative alignment.” And I think it’s kind of interesting. They’re saying that they are very much looking at an actual safety policy, taking the text from that and training the model with respect to it, doing some synthetic data generation that follows along with what the policy says. That’s something we’ve been doing for a while as well—last year we published a couple of papers on what we call “Alignment Studio” and “alignment from unstructured text” and so forth—and I think those sorts of ideas are carrying through. The new part, again, is that this spends a lot of time on the inference side, thinking again and again: “Am I meeting those safety requirements or not?” And as both Chris and Kate said, the more time you’re spending on the inference side, what should the model be thinking about? And as I said in the last episode, in the new year I think that extra thinking is going to go toward governance quite a bit. So, I think this is where it’s going to play. And yeah, I’m excited to see—maybe I’ll sign up to do some of the safety testing.
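
A minimal sketch of what the inference-time side of that could look like, assuming a hypothetical `llm` callable and a toy `SAFETY_POLICY` string: the model is shown the written policy, reasons about compliance, and only then answers or refuses. This is an illustration of the general pattern, not OpenAI’s published deliberative-alignment training pipeline.

```python
from typing import Callable

# Hypothetical, highly abbreviated policy text; a real one is far more detailed.
SAFETY_POLICY = """\
1. Refuse requests that facilitate physical harm.
2. Do not reveal personal data about private individuals.
"""

def deliberate_then_answer(llm: Callable[[str], str], user_request: str) -> str:
    """Inference-time safety deliberation: reason about the request against
    the written policy first, then answer (or refuse) in light of that
    reasoning. `llm` is any prompt -> text callable."""
    deliberation = llm(
        f"Policy:\n{SAFETY_POLICY}\n"
        f"Request: {user_request}\n"
        "Reason step by step about which policy clauses apply and whether the "
        "request is compliant. End with the single word COMPLIANT or NON-COMPLIANT."
    )
    # "COMPLIANT" is a suffix of "NON-COMPLIANT", so check the longer label.
    if deliberation.strip().endswith("NON-COMPLIANT"):
        return "I can't help with that request."
    return llm(
        f"Policy:\n{SAFETY_POLICY}\n"
        f"Request: {user_request}\n"
        "Answer helpfully while following the policy above."
    )
```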

Tim Hwang: Yeah. I think there are two interesting things in what you said. One of them is, you know, the approach to date feels like it has been: you release the model, but then you also guarantee safety by releasing safety models, right? Granite has done this. And, you know, maybe, Kate, a question back to you: how much do you think that’s just, like, almost historically provisional? This is just what we have to do right now because we’re still working out the kinks on making the models themselves safe. One argument is that in the future the models are just kind of safe out of the box, in a way that doesn’t separately require another model that monitors outputs and does the safety work. Do you think that’s the case, or do you think this kind of bifurcated architecture is what we’ll see going forward?

Kate Soule: Well, first, I’d be careful: I don’t think anyone can guarantee safety, no matter what we release, right? But I do think we’re going to continue to see more and more of these kinds of safety guardrails being brought intrinsically into the model through the new types of alignment that Kush mentioned. That does not mean, though, that we shouldn’t also have additional layers of security and safety that provide, you know, an independent check on model outputs. So I don’t see that going away. I think it’s always going to be a “yes, and”—right, let’s continue to add more and more layers, not “we’re going to strip away, you know, some of these layers, fold them into the model, and now you’ve got one model, you’re all set.”

Tim Hwang: Very interesting. Kush, the other thing I’ll pick up on from what you said before we move to the next topic: you’re basically talking about inference as this kind of fixed budget of time, and the question is what you want the model to spend that time on, thinking about the problem or thinking about whether or not the responses are safe or consistent with a safety policy. And I’m channeling my internal Chris here, who probably would be like, “You’re spending some of that time on trying to make it safe. Couldn’t it just solve the problem?” And I guess I’m kind of curious: do you think that will become a lever over time, where the user can specify, “I need 10 percent of your time spent on safety, 90 percent on solving the problem,” or otherwise? You know, that actually opens up a whole other world in some ways.

Kush Varshney: Yeah, it does open up a whole new world. I mean, I wouldn’t say that I would want to spend a lot of time on this sort of safety deliberation either, but I think the fact that they’re calling it “deliberative” kind of speaks to something that, I mean, deliberation is meant to be like a discussion among lots of different viewpoints and this sort of thing. I don’t know if that’s actually what will happen, but that’s something I would want to happen so that different viewpoints, different sort of perspectives, can be brought into these different policies as well because, in, I mean, democratic sort of settings, you do want deliberation. You do want kind of minority voices to be heard as well, but not sure exactly that’s what they mean by “deliberative.”

Tim Hwang: Absolutely. Chris, I appreciate you’re shaking your head, so I want to make sure I’m not putting words in your mouth.

Chris Hay: I honestly think safety is super important, but I want the models quicker. So, you know, do what you need to do, and I want the models to be fun, so don’t lobotomize them, you know what I mean? We don’t want to do harmful stuff, but at the same time, come on, you know, I want to play with the models.

Tim Hwang: Chris basically wants everything.

Kate Soule: Tim, we are also kind of assuming OpenAI is going to give us the choice, right, of how we want the model to spend that inference-time compute. And I don’t think that’s the direction that they’re headed. I think they’ve got some clear regulatory guidelines they’re trying to meet, performance issues that they want to make sure are addressed. I don’t see them handing over the keys to the kingdom, so to speak, to let us take these models for our own joy rides.

Tim Hwang: Yeah, no, I think that’s for sure, right. And I think there are a bunch of interesting questions here that are sort of empirical questions, right? Like, how much does safety inference lead to better outcomes? How much of this is a mutually exclusive pie versus one where you can get a little bit of both? How much is going to be defined by the regulator? How much is going to be defined by the user? A lot of things to pay attention to, I think, going into 2025. So I’m going to move us on to our next topic, which is the release of DeepSeek-V3. This is an interesting announcement, because the production team and I were kind of wrapping up at the end of the year and we were like, “Nothing’s going to happen in the last few weeks of the year.” And of course, there was the o3 announcement, which was huge, and then similarly big was the announcement of DeepSeek-V3. This is an open-source model coming out of China that shows incredibly good performance on a lot of the benchmarks that most models are evaluated against. And I think there are a lot of interesting things to talk about here, but maybe the first one, which I’ll throw to Chris, is the claim that the DeepSeek team is making that they were able to build this incredibly performant model for way lower cost than you would expect. One of the things it made me think about is that so much of the economy of AI is built on the idea that it’s just really expensive to get, you know, really high-performance models. But this almost seems like the cost curve might be collapsing faster than we think. I don’t know, Chris, maybe that’s a little bit too optimistic, but I’ll throw it to you.

Chris Hay: I think it’s kind of interesting what they’ve done. They have put a lot of cool techniques into the pre-training side of things: things like multi-token prediction, and they were better at kind of load balancing of tokens, et cetera, and how they route. So there’s a lot they did in training that brought the cost down, and I think they were doing mixed precision as well, so there were a lot of good things there. What I would say, though, going back to the earlier point about inference-time compute and pre-training: I wonder at what point we maybe stop obsessing over the pre-training side of things and actually, you know, fine-tune those models and have that community of fine-tuning existing models. And I think that’s going to be more interesting, especially in the world of agents. Happy New Year! I’m the first person to say “agents” on the podcast. So thanks, Chris. So I think that’s more interesting, and as we move more toward inference-time compute, I think that will become important there. But it is really impressive what they did, for the cost of the model and how long it took them to train it. I honestly think they did a great job. So, yeah, there’s going to be more innovation in that space. I still think pre-training, though, is hugely inefficient, because you’re really just saying, “Here’s the entire text of the internet. Go learn from it.” And that’s an area I would hope changes in 2025. The way I think about it is, the internet almost has a knowledge graph anyway. And I wonder, if we brought a little bit of the structure in those knowledge graphs into the pre-training process, whether a lot of those training elements might come out a little bit quicker and better. I don’t know, I’m just sort of guessing here. But I think there’s a lot more innovation to do in pre-training. So hopefully, with inference-time compute, we’re all going to be running around doing that, but I’m hoping that the focus on pre-training doesn’t go away. So, really good job to the DeepSeek team for continuing to innovate.
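
As a concrete illustration of one training-side technique Chris mentions, here is a minimal multi-token prediction sketch in PyTorch: alongside the usual next-token head, extra heads predict tokens at offsets +2, +3, and so on, so each forward pass yields several training signals per position. This is a generic sketch, not DeepSeek-V3’s actual MTP module, and the dimensions and horizon are placeholders.

```python
import torch
import torch.nn.functional as F

class MultiTokenHeads(torch.nn.Module):
    """One linear head per prediction offset: head k predicts the token
    k steps ahead from each position's hidden state."""

    def __init__(self, d_model: int, vocab_size: int, horizon: int = 3):
        super().__init__()
        self.heads = torch.nn.ModuleList(
            torch.nn.Linear(d_model, vocab_size) for _ in range(horizon)
        )

    def loss(self, hidden: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
        # hidden: [batch, seq_len, d_model] from the transformer trunk
        # tokens: [batch, seq_len] ground-truth token ids
        total = hidden.new_zeros(())
        for offset, head in enumerate(self.heads, start=1):
            logits = head(hidden[:, :-offset])   # predict the token at +offset
            targets = tokens[:, offset:]
            total = total + F.cross_entropy(
                logits.reshape(-1, logits.size(-1)), targets.reshape(-1)
            )
        return total / len(self.heads)

# Toy usage with random data:
mtp = MultiTokenHeads(d_model=64, vocab_size=1000)
hidden = torch.randn(2, 16, 64)
tokens = torch.randint(0, 1000, (2, 16))
print(mtp.loss(hidden, tokens))
```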

Tim Hwang: Yeah, I remember, you know, when I worked a lot more closely with pre-training teams, I thought it was very interesting—at least among, you know, the nerds, the engineers, right?—it was very interesting that pre-training was like the high-prestige part of the organization. Right, like you’re running the rocket launch of AI, and then fine-tuning is something that we do afterwards. But, like, I think all the inference stuff and all the stuff that we’re seeing kind of point to this shift in the kind of cultural capital within these companies, where it’s like, “Oh, all the action right now is really happening after the pre-training step.” And I guess, Chris, almost what you’re proposing is maybe at some point the pendulum swings back, because it’s like, “Okay, there’s all of this kind of innovation still to be done on the pre-training side, but we’re just not there because of the hype cycle.”

Chris Hay: It’s going to swing back and forward, back and forward, back and forward. And you’re going to see that, right? Because you’re going to get to the point where you go, “The pre-train isn’t good enough to do what we need to do, so we’re going to use the smarter inference-time models to get better data to train the pre-trained models.” That’s going to become more efficient, and then we’re going to do the same on fine-tuning, and that pendulum is going to keep swinging, because you’re going to keep hitting limits in one area and go back to an earlier stage, like pre-training, to try to fix that, and you’re just going to go back and forward, back and forward. So that pendulum is going to swing all the way through 2025, buddy.

Tim Hwang: Yeah, definitely. Kate, any thoughts on this? I mean, as someone who, you know, works with a team on open-source AI, I assume something like DeepSeek is a big deal—a big marker in some ways, a big way to start 2025.

Kate Soule: Yeah, and I agree with Chris. The team did an incredible job. But in terms of the cost, I don’t know the full details of what data was or was not used in the model. My hypothesis is that they are using data that was available online that cost a lot more than USD 5.5 million to generate, right? So I don’t know that that total cost estimate actually reflects the fully burdened cost of the model. I suspect that they, like many model providers, are leveraging all of the data that’s now been posted and shared online, data that is only possible because others have invested so much money in creating larger models that can then be used to generate it, which, to Chris’s point, can be taken back into training. Aside from that, what I’m really interested in with the DeepSeek model is that it’s a mixture-of-experts architecture, which is really interesting. It’s a 600-plus-billion-parameter model, but at inference time only about 40 billion parameters are active, meaning it can run much more efficiently than even, like, a Llama 400-plus-billion-parameter model. So I think that’s where we’re going to see a lot more innovation happening in 2025: really digging into how we make these architectures more efficient, how we activate the right parameters, fewer parameters, at inference time to still drive performance, without having to pay the entire cost of running, you know, 600-plus billion parameters.
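
To illustrate the sparse activation Kate describes, here is a minimal mixture-of-experts layer in PyTorch: a learned router picks the top-k experts for each token, so only a small fraction of the layer’s parameters does work for any given token. The expert count, sizes and routing here are placeholders for illustration, not DeepSeek-V3’s actual architecture, which adds further refinements to routing and load balancing.

```python
import torch
import torch.nn.functional as F

class SparseMoELayer(torch.nn.Module):
    """Top-k routed mixture of experts: each token is processed by only
    k of the n experts, chosen by a learned router."""

    def __init__(self, d_model: int, n_experts: int = 8, k: int = 2):
        super().__init__()
        self.router = torch.nn.Linear(d_model, n_experts)
        self.experts = torch.nn.ModuleList(
            torch.nn.Linear(d_model, d_model) for _ in range(n_experts)
        )
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [n_tokens, d_model]
        gate_logits, expert_idx = self.router(x).topk(self.k, dim=-1)
        gate_weights = F.softmax(gate_logits, dim=-1)    # mix the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = expert_idx[:, slot] == e
                if mask.any():                           # only routed experts run
                    out[mask] += gate_weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

# With 8 experts and k=2, only a quarter of the expert parameters touch any
# single token; scaled up, that is how a 600B-plus-parameter model can
# activate only ~40B parameters per token.
moe = SparseMoELayer(d_model=64)
print(moe(torch.randn(10, 64)).shape)  # torch.Size([10, 64])
```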

Tim Hwang: Yeah, that’s really interesting. Kush, from a governance standpoint, this is an interesting story as well. Right? You know, I think there’s certainly a vision among some folks, which is like, “Well, we just passed the laws in the U.S., and all the big AI companies are in the U.S., of course, and so that’s why, that’s how we govern AI.” But this is really a different world, right? Like, you know, a law passed in the U.S. is not going to change what the DeepSeek team is doing. You know, is governance possible in this world? Right? Because it sure seems like, you know, you are seeing so much AI progress everywhere, that governance becomes a real question.

Kush Varshney: Yeah. I mean, we talked about this before the show started that there are these core socialist values that are required of any generative AI in China. It’s a law that’s been around for more than a couple of years now, and DeepSeek has to satisfy those. So, I mean, those are things that are gonna be around. And I think the fact that all of these different AI safety institutes from different countries are forming a network, they’re convening, they’re figuring things out together, is a great sign. I think AI governance needs to be a worldwide activity. There’s no special thing because of one country or another country. And the more we can kind of bring everything into harmonization, the better it will be.

Tim Hwang: Yeah, I think that’ll be one really interesting bit is... you know, I think there was a thinking sort of maybe a few years back, which is we’re going to do sort of law and regulation to do this. You know, Kush, I guess kind of where you’re suggesting is a world where it’s a little bit more technical experts, like it looks a little bit like ICANN, right? Like in terms of how we govern the web, where kind of technical experts meet and they establish these standards, and it’s kind of voluntary protocol more than anything else. Do you think that’s how things are going to go?

Kush Varshney: Yeah, I think that’s how it’s gonna go. So in February, there’s a meeting in Paris where all of these safety institutes are coming together. So I think they’ll come up with a plan, they’ll figure out some codes of practice and all these things. So that’s where I think things are headed.

Tim Hwang: Chris, you started this episode by talking a little bit about how you, like, switch between different modes of OpenAI, right, where you’re like, “Okay, well, we’re going to use the o1 for this, we’re going to use the o1 Pro for this.” Do you do that kind of switching across open-source and closed-source at all?

Chris Hay: Yeah, I do that a lot with different models. So, like, the Llama models, for example, have got such personality. If I’m doing any kind of writing of new stuff, I tend to turn to the Llama models. The Granite models I use quite a lot as well; I use them a lot for RAG-type scenarios, because they’re really good at that. And if I’m pulling factual information, I really want to be sure where the data’s been coming from, so I tend to lean on Granite in those cases. For coding, I tend to lean on o1. I have a lot of fun, actually, with some of the Chinese models, a lot of fun with the Qwen models at the moment; they’re doing some great stuff in the same way as DeepSeek. So I think you’re just going to use different models for different cases, right? Some models are good at certain language translations, some models are good at writing tasks, some are really good at code. And then the smaller models, for example, especially low latency, especially for agents—I said “agents” again! If you’ve got different agents, you want to run each one on the smallest possible model that will perform the task you need. So, I think we’re in a world where we are just going to use a lot of models. And if I can talk about 2025 again, I hate to say this, but I think we’re going to stop talking about models so much toward the end of the year, because you’re going to care about the tasks they’re doing. “Here’s a language translation agent. Here is an agent that is going to write me unit tests.” I do care about the model, but I’m going to care more about the tasks it’s performing. And then, coming back to the security and governance things for a second, I think that’s where governance starts to become really hard, right? Because if you’ve got a very small model, like an 8-billion-parameter model, and it’s got access to tools, and you’ve got it being orchestrated over the top, you know what, you can get into a lot of trouble very, very quickly with a tiny model and do some really interesting things. And I’m just not sure, governance-wise, that you’re going to be able to do a lot about that. So, as much as we talk about the large models and governance, in 2025 I think we’re going to start to hit the challenges of people doing interesting things with agents on really tiny models.

Tim Hwang: Yeah, you’re saying almost like we’ll be able to govern the biggest companies and the biggest models, but that might not matter, is kind of what you’re saying, is that right?

Chris Hay: I think so, yeah.

Tim Hwang: I guess, Kush, do you want to respond to that, as someone who focuses his time on thinking about AI governance? I guess Chris is effectively saying maybe it’s just not sustainable over time.

Kush Varshney: Yeah, I agree. And I’ll say “agents”—number three for the episode. This is really bad; it’s becoming a meme, because people are going to just start throwing it out for no reason. But I think when there is tool use, when there’s autonomous action, that’s where governance really becomes interesting. We’ve talked a lot over the years about trustworthy AI, and it wasn’t really like trust was a part of it. But trust is really needed when something is going to be acting autonomously, because you don’t have the ability to control it or monitor it and these sorts of things. And the more volatile, the more uncertain, the more complex the settings these things happen to be running in, the harder governance is, and, I think, the more of the innovation is going to happen there.

Tim Hwang: Before we move on to the final topic, Kate, maybe I’ll turn it to you. You know, I thought it was very interesting; I had never really thought about that switch from... you know, I’ve heard about, “Oh, I do o1 Pro versus not, o1 preview,” but the switch from open-source to closed-source I think is pretty interesting. Maybe a final question before we move on to the last topic is, do you think right now open-source has any specific kind of capability advantages over closed-source? Or is that not even the right distinction here? You know, I think it was very interesting that Chris was like, “Oh, actually, like some of these models just have, like, way... the open-source models have better personality,” right? That’s kind of an interesting outcome in some ways.

Kate Soule: Yeah, I don’t see it so much as an open- versus closed-source question. I think different models are going to have inherently different strengths and weaknesses. And so if you limit yourself to closed source only, or to closed source from one provider, you’re going to miss out on that whole suite and on being able to pick and choose the best model for the task. Ultimately, the dream in the future is that someone sends me, like, an AI-generated email and I’m like, “Yeah, you’re probably relying on Granite; I know what this sounds like.”

Tim Hwang: So, the last segment we want to focus on today is an interesting smaller bit of news that popped up at the end of last year, but I think it’s a fun one, particularly as we get into 2025. If you don’t know him, Gary Marcus is a long-time skeptic of all things AI. For every successive wave, Gary Marcus is there saying, “It’s never going to work,” and the current revolution in AI is no exception. He’s been a very big skeptic about the degree to which LLMs can get us to, quote, “true intelligence.” We’ll talk about what that means. But interestingly, he set up a kind of official public bet with a gentleman by the name of Miles Brundage, who used to do policy at OpenAI; he’s independent now. And basically what the bet asks is: where is AI going to be a few years from now? It sets up a set of, I believe, 10 different tasks that AI could or could not take on. There’s a lot of variation here, but a lot of them pertain to whether the model can produce world-class versions of XYZ. So, you know, I think one criterion is: will an AI produce a world-class movie script or other kind of creative work? And I think these bets are useful, because they force folks to put their money where their mouth is and to specify what they mean when they say a model is going to be truly powerful in its capabilities going forward. And I wanted to get the view of this group. You’ve seen the Twitter-slash-X posts announcing this. Kate, maybe I’ll turn to you: is this a useful way of thinking about where AI is going, or do you think it’s just more, you know, Twitter noise?

Kate Soule: I thought it was interesting to think through when I was looking at the different questions. Ultimately, if I look at the different items in that bet, the ones that stood out to me most were assertions that hallucinations would basically be solved by, you know, this year. And I think that’s one of the biggest reasons why, personally, I wouldn’t take that bet. I don’t think hallucinations are going to be solved. If you look at the model architecture, even with o1 and reasoning, you know, my hypothesis is it’s still a transformer model trained on a vast amount of internet data that’s being called many times in many different ways, with reasoning and search. But I think there are still some fundamental problems around hallucinations that, unless we really change the type of data that we train on, the volume of data that we train on, and the architecture of these models, are not going to go away overnight, or be something we can just incrementally cure ourselves of. So I personally wouldn’t take the bet, but I thought it was a useful framing to think through.

Tim Hwang: Yeah, for sure. Kush, how about you? Would you have taken the bet on either side, I guess?

Kush Varshney: Yeah, I think the authorship question is an interesting one. That’s what they’re kind of going for: can this be an Oscar-winning screenwriter, a Pulitzer Prize-winning author, and that sort of thing. And I’m going to take us in a little bit of a different direction. The fact of it is that people have been coming up with all these analogies for LLMs, like a stochastic parrot or a DJ or a mirror of our society or these sorts of things, but I think that’s the wrong way to look at it. About 65 years ago, there was a book that came out called The Singer of Tales by Albert Lord, and it was all about oral narrative poetry: these bards who are singing about heroes and such, and they compose the language as they’re singing it. It’s not like they write it beforehand; they use formulas and all sorts of tricks to be able to do this. And I think that’s exactly what these LLMs are. And in that sort of construct, there is no sense of authorship; they’re just part of a tradition. You would never think that Homer deserves a Pulitzer Prize for the Odyssey, or Ved Vyas a Pulitzer for the Mahabharata. This is just a tradition that’s going on, and that, I think, is the right way to think about LLMs. So the question is not the right question. And even if you think about it, again going very historical and philosophical, you had Michel Foucault, who asked, “What is an author?” And the discussion he had is that the only reason we even thought of authors is that lawyers needed someone to blame when there were bad ideas out there. So I think it’s the same thing: an LLM is not an author, and we shouldn’t really be asking for that sort of thing.

Tim Hwang: And I think it actually touches on what Kate said as well, which is basically, like, do these kind of criteria for the bet assume a certain direction for AI that might not actually be the most important thing around AI, or even like an important aspect of, you know, quote, “really powerful AI systems,” right? Like, it may not turn out in the end that we really need to solve hallucination, or like it may not really turn out in the end that the big impact of AI is that you have, like, the Pulitzer Prize-winning AI that generates a novel completely from scratch. That’s kind of interesting. I don’t know, Chris, maybe you haven’t had a chance to jump in just yet. Curious about what you think about all this.

Chris Hay: Oh, I think the test is totally stupid, in my opinion. And the reason is, I looked down the list of 10 items, and I don’t think I’m capable of doing any of those 10 items. So if I’m not capable of doing the 10 items, you know, is it fair to expect AI to be able to do them within a year? I mean, how’s your Pulitzer Prize-winning novel going? Is it going well? Or your Oscar-winning screenplay? Hey, any programmer on the planet: have you been able to hit 10,000 lines of code, bug-free, on the first pass? Come on. I think you’re asking a lot. The only one I think I could maybe do is the video game one. And, like, I don’t know when to laugh at the right moment in movies. You just need to ask my wife. It’s like, “Why are you laughing?” and I’m like, “Oh, that thing over there,” right? And am I able to describe the characters without hallucinating? No, we all hallucinate. We make up little subplots that are going on in our heads during these movies. So I don’t think it’s a bad thing, but I think you’re asking a lot of LLMs to be able to do that, even putting it as a test for 2025. And, you know, yeah, maybe AI will be able to achieve three or four of these things. I just don’t think it’s the right time to be asking those questions.

Kate Soule: Well, I don’t know. We just came back from holiday breaks, where at least I got to take a step outside of the Cambridge tech bubble, where everyone is really deep into this technology, and hear folks talk about AI... you know, I have a family member who calls it “the AI machines.” There are a lot of misconceptions, I think, about what AI can do and what it’s going to be useful for. So putting it in terms that everyday folks can understand, people who watch movies and read books and aren’t necessarily living and breathing the technology, and helping show that, no, X, Y, and Z are not going to be possible, that you’re thinking about this the wrong way, I think it is helpful to have that type of discussion and discourse. I think we take for granted a lot that not everyone is living and breathing this the same way that, you know, this excellent panel on generative AI is.

Tim Hwang: Yeah, I’ll guarantee you that the average person is not waking up thinking, “Should I use o3 or o1?” Those distinctions are not anything a normal person is thinking about. But yeah, I think that’s a good point, right? Part of it is just that there’s a dream that all this AI becomes kind of superhuman at some point. And, Chris, maybe to respond to your comment, there’s an effort to ask, “What would that look like?” And maybe that does really miss the point in some ways. I also think it’s a really good indication of how quickly our expectations have adjusted around the technology, right? Had you asked me four years ago, “Could it do all these things? Could it just write an email?” I’d have said, “That’s ridiculous.” And now, basically, the expectation is world-class, Pulitzer Prize-winning work, just because the baseline is very normal to us now. So it’s, I guess, an indication of the rising expectations around all of this stuff.

Kush Varshney: Just coming back to DeepSeek for a second, I think one thing that we didn’t talk about is just the culture at DeepSeek. So there was an interview of their CEO that was making the rounds after DeepSeek came out, but the interview was from November. And I think the cultural aspect of how they kind of developed this thing is really interesting. They really followed this sort of “geek way.” So Andrew McAfee had this book, The Geek Way, and it’s been very popular within IBM circles, actually. So our CEO has been reading it, telling everyone to read it. And it’s kind of like, really like doing things fast, being open, letting everyone contribute, being very scientific about things, trying to prove them out, not having hierarchies and all of that stuff. And that’s exactly how DeepSeek is doing it. And I think we can learn a lot from it. Just, we’re a little bit too encumbered, even though we want to be doing things the same way. So like, how do other companies kind of innovate in a rapid fashion in the same way? So I think that’s maybe something to learn as well.

Tim Hwang: Yeah. One of the debates I have with a friend of mine is, there’s a... what is it? I think it’s called Conway’s Law. So the idea is that you ship your org chart, and that has kind of interesting implications in the world of AI, where it’s just like, “Well, are all of these AIs going to basically, in some ways, reflect the companies that create them?” And, you know, the reason why certain models are more chatty is that this is just, like, in part a reflection of all of the people in that organization.

Chris Hay: Interesting connotations.

Tim Hwang: Pre-training, right, and how pre-training has been the focus and kind of the most prestigious team to join. There’s a joke, because we have a mutual friend who works at Anthropic, and we’re like, “It’s cool, it’s Claude. He’s Claude.” It’s very funny to see that play out in practice. Well, that’s great, so let’s leave it there. Chris, a great thought to end the episode on and for us to start 2025. Kush, Kate, Chris, as always, incredible to have you on the show. And thanks to you all for joining us. If you enjoyed what you heard, you can get us on Apple Podcasts and podcast platforms everywhere. And we will be here next week on another episode of Mixture of Experts.
