AI code generation: Wins, fails and the future

What’s the future of AI code generation? This week on Mixture of Experts, host Tim Hwang is joined by Chris Hay, Olivia Buzek and Gabe Goodhart to debrief the biggest AI use-case of 2025: AI-powered software engineering. 

Claude Opus 4.5 solved a months-long optimization in under an hour but failed spectacularly at simple tasks. The barbell effect is real. Next, who’s the architect: you or the model? We discuss agent orchestration, context windows and why tool performance varies wildly. Then, model differentiation: are OpenAI and Anthropic fundamentally different, or does agent architecture matter more? Finally, can open-source compete with closed ecosystems? We explore vertical integration, inference costs and the future of open models. All that and more on this week’s Mixture of Experts.

  • 00:00 – Introduction 
  • 01:11 – The barbell problem: AI coding wins and fails 
  • 03:46 – Claude Code cracks Apple Metal optimization 
  • 07:52 – Who’s the architect: You or the AI? 
  • 11:44 – Model vs agent orchestration 
  • 20:44 – The future of unsupervised AI agents 
  • 24:30 – Open source vs proprietary tools 
  • 33:22 – The inference cost challenge

The opinions expressed in this podcast are solely those of the participants and do not necessarily reflect the views of IBM or any other organization or entity.

Episode transcript

Tim Hwang: I’m Tim Hwang and welcome to Mixture of Experts. Each week, MoE brings together a panel of the smartest minds in technology to distill down what’s important in the crazy world of artificial intelligence. Joining us today are three incredible panelists. We’ve got Chris Hay, Distinguished Engineer; Olivia Buzek, Lead Developer Advocate for AI; and Gabe Goodhart, Chief Architect, AI Open Innovation. This is going to be a fun episode. It’s one of our end-of-year episodes, so we are basically departing from our usual news story format. I want to get this group together specifically to talk about the past, present, and future of code generation. In my opinion, code generation is basically one of the biggest stories of the year for AI, right? Starting from January to now, the work of engineering has changed in a very significant way. From Cursor and Windsurf to Claude Code and the rise of “vibe coding” as a thing — really, if you’re looking for where the most salient impact of AI is happening right now, it’s in software engineering and code generation. Olivia, maybe I’ll start with you. What do you think comes next? Are we entering a mature space, or is next year going to be as crazy and tumultuous as 2025 was, in your opinion?

Olivia Buzek: I think it’s a little bit of both. What I’ve seen over the last year is that even the AI skeptics are starting to use AI in their work almost every day. When you’re starting a project, you definitely use AI to get your things off the ground. We have the evolution of things like the AGENTS.md files that people are putting in their projects so that you have a standard way that your particular project will be interpreted by the AI. And in hiring processes, we’re seeing people actually checking to see whether or not you understand how to use the AI tools. All of that points in a strong direction: this thing is here, it’s here to stay. At the same time, I think we see a lot of limitations as well. So far, I have yet to hear anybody say, “Oh yeah, this is as capable as a human. I hand off all sorts of tasks to it. I literally just tell it to look at my board and take off the next task and take care of things for me” — because it’s just not there yet, and it’s just not that trustworthy yet. And we have seen a few catastrophic failures over the course of the year.

Tim Hwang: Right. And I did want to get into that — it’s almost a little barbell-shaped in my mind. Gabe, one of the things I wanted to talk to you about is: I’m not a day-in, day-out coder. I used to be, but I’m terrible at coding, so I stopped doing it and moved to a different profession — podcast host. At least for me, these tools have kind of revolutionized the game because I can just sit down and start having fun. But to Olivia’s point, do we still feel like there’s a gap when it comes to using this at the frontier — the most complex, the hardest software applications? Is this still more in the realm of “if you’re doing day-in, day-out coding work or you’re a junior engineer, that’s where most of the action is happening”? Do you buy that as a premise?

Gabe Goodhart: I would have said yes last week. What changed? I used Claude 4.5 Opus to crack a problem that I have been trying to crack for months, and it nailed it in under an hour. This is a problem that is deep in the guts of llama.cpp — trying to get better performance out of the recurrent models, optimizing the Metal kernels, understanding the shape of grid layouts and thread group dynamics and SIMD group memory sharing, just the gnarliest corners of gory bits. And let me just be clear: the official internet documentation for Apple Metal programming is a 2000-page PDF. That’s it. There is no — you go to CUDA and the internet is full of good information. I expect AI models to nail CUDA, but for Metal, I was blown away at how strong it was. So to that end, I will say the barbell-shaped analogy — or something-shaped that is not nice and uniform, whatever physical shape you choose — is exactly my experience right now. These models can do some amazing stuff, and they can fall really, really flat in what should be really simple use cases.

The opposite end of the spectrum — the reason I would have said “yes last week” was that I also heavily used Claude Code to try to build out a CLI for a pretty straightforward REST API, and it did a fantastic job of cranking out a beautiful CLI with lots of nice pretty colors and inline JSON highlighting and all sorts of awesome stuff. That was 100% code coverage too — the tests were great. They all just mocked everything and didn’t actually test anything, and nothing worked. But it was so cool. And then I spent a week having to... Somebody from the Continue team coined this phrase “chiseling.” So you basically use it to create the rough block, and then you have to chisel out the shape of what you actually want your statue to look like underneath the very rough block that you just splat out with a code assistant. This was my experience. It actually probably took much longer. The end product might be prettier — I probably wouldn’t have come up with all the latest, coolest CLI libraries myself — but the process of actually fixing all of the stuff that it just mocked away in the unit tests was really frustrating. So in some ways, I would have expected exactly the opposite experience: generating a CLI against a well-defined REST API is bread and butter — that should just be point and shoot, forget about it — and then deep in the gnarly weeds of Metal optimizations, it would fall over and have no idea what to do.
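The failure mode Gabe describes, tests that pass because they mock everything, can be shown with a small sketch. Everything here is hypothetical and written for illustration: `parse_items`, `list_items` and the misspelled key are invented, not taken from his CLI. The first test replaces the parsing logic with a mock, so it passes even though the code is broken. The second fakes only the network boundary, so the real code runs and the bug surfaces.

```python
import json
from unittest.mock import patch


def parse_items(raw: str) -> list:
    """Hypothetical helper: pull item names out of a JSON response body."""
    data = json.loads(raw)
    return [item["name"] for item in data["itmes"]]  # deliberate bug: misspelled key


def list_items(fetch) -> list:
    """Hypothetical CLI command: fetch a response body and return item names."""
    return parse_items(fetch())


def over_mocked_test() -> bool:
    """Replaces the parsing logic itself, so only the mock's return value is checked."""
    with patch.dict(globals(), {"parse_items": lambda raw: ["a", "b"]}):
        return list_items(lambda: "ignored") == ["a", "b"]


def boundary_test() -> bool:
    """Fakes only the network call; the real parsing code still runs."""
    body = json.dumps({"items": [{"name": "a"}, {"name": "b"}]})
    try:
        return list_items(lambda: body) == ["a", "b"]
    except KeyError:
        return False


print("over-mocked test passes:", over_mocked_test())  # True, despite the bug
print("boundary test passes:", boundary_test())        # False, the bug is caught
```

Both tests execute `list_items`, so a line-coverage report would look similar for each, yet only the second says anything about whether the command works.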

Tim Hwang: But you’re pointing out it’s actually the reverse. Yeah, exactly right. You’re pointing out that the intuitions are flipped. For me personally, what it points to is that I just don’t quite have a good intuition about what it’s going to be good at and what it’s not going to be good at. And I’ve tried to build that intuition, which means there’s still some misalignment between the capabilities of these tools and the day-to-day mental model that I, as a developer, have around the complexity of a task. So there’s still some learning on my part to do. Chris, typically we’ve seen these high-profile failures, and people are like, “Ha ha, look at the terrible AI.” I’m kind of of the view that maybe those get ironed out over time as engineers understand what the systems are good and bad at. It’s actually less of a technical problem and more of an engineering-culture-and-norms problem. We’ve got this hammer, but we’re still not really sure what hammers are good for yet, so we’re kind of swinging it around and being like, “Oh, it wasn’t really good for that.” Do those problems disappear with time as we get a little more mature on how to apply these applications? Do you buy that?

Chris Hay: Yeah, I think so. One of the questions I like to ask myself with the coding models is: who is the architect? I ask that because if the coding assistant is the architect, then it’s going to choose the framework. It’s going to choose which libraries. It’s going to choose whether it’s going to mock or not mock, etc. You’re putting all of the decisions onto the model. That is okay — if you don’t really know a language or you don’t know the frameworks or you’re not a UI person or whatever, then you don’t really have much of a choice. You’re saying, “I don’t quite know what I’m doing here, so I’m more in the vibing world — go do that.” And therefore, it can make bad decisions in that sense. And Gabe, to your point, the models are really lazy. If they think they can get away with just faking something up, or they can just go, “Oh, due to context window limitations, I’m stopping right now” — you’re like, “Dude, come on, try harder. We are right in the middle of it.” Imagine a junior engineer just went, “No, it’s 4:30 in the afternoon, I’m going to knock off in 30 minutes, so there’s no point in me looking at this. I’m going home now.” You’d be like, “Nah, you’re fired.” But I think this is a fair point. Asking who is the architect in this case — sometimes it’s okay not to be the architect. You’re vibing your prototype and doing whatever. But you’re in a different paradigm when you want to start production, and that’s where you have to use things like the rules. If you’re in Claude Code, your CLAUDE.md or your AGENTS.md. If you’re in Cursor, you need to use rules to really guide the model and say, “This is the architecture that I expect. These are the standards that I want you to follow.” Therefore, you probably have to put as much effort into architecting as you would normally do. 
I think that is the big paradigm shift that is happening: architecture is going to become more important — but actually writing your architecture in a way that is AI-friendly, agent-friendly, as opposed to in a Word document or a UML diagram sitting somewhere on the cloud. It’s really about orchestrating with the AI, and then you’re going to get really fast feedback loops.

Tim Hwang: That’s one of the things I do want to talk a little bit more about — the evolving role of the engineer or the programmer in all this. I think one of the things I’m really interested in is how all these models are kind of differentiating with time. We live in a world — Gabe, you might have made this comment in a previous episode — of model abundance. There are all these models, and they’re all really, really good. Olivia, I’ll toss this question back to you: just take OpenAI and Anthropic for a second. Do you feel like these models are approaching code generation differently? Is the kind of code they’re producing different in flavor, or is everybody converging on the same kind of code generation? I ask that because you can imagine being like, “Oh, I really understand what OpenAI is good and bad at, but I have no idea what Anthropic is good and bad at.” That has big implications for how these models become a kind of programming language with time — almost like a tribe where you say, “I’m a Pythonista.” I’m wondering if that kind of thing is on its way.

Olivia Buzek: Currently, I’m seeing a lot of people in an experimental phase with a whole bunch of different ones, because I don’t know that we have solved that characterization. That characterization may evolve more over time. I also think it’s in some ways less about the models themselves and more about the agent architecture that’s underlying those code assistants. That is making a much larger difference. For example, when I’m playing with the model itself on something like Continue in my IDE, I’m not getting that agent experience. Because it doesn’t have very many agents to it, it almost doesn’t matter what model I throw at a particular problem — it can only do so much. Where you see a huge difference is in the actual planning for a task. In one assistant, it’ll be like, “My planning tends to be focused more around security and optimization problems,” so it’ll get stuck on that part. Another agent will be more interested in the mocking thing that Gabe was talking about. So you’ll see tendencies because of the agent architecture that’s underlying it, which is of course completely opaque to the user other than the way you sort of start feeling it out.

Tim Hwang: That’s actually really interesting. You’re almost saying this is less a function of the model and more that agent orchestration is producing these differentiations. Gabe, you’re nodding and shaking your head. Do you want to jump in?

Gabe Goodhart: Yeah, I’m doing this weird nod-shake-head at the same time because Olivia, when you said that, that’s exactly the comment I wanted to make. I’ve said this on many episodes, but the user experience of any one of these AI tools is a combination of the quality of the model and the quality of the system built around the model. In this case, I have seen multiple different tools using the Claude family of models behave extremely differently with exactly the same flavor of problem thrown at them. It comes down to the implementation of simple things like context compaction. What do you do when you get a 20,000-line C++ file thrown at you? You just explode, or do you carefully read it in chunks of 100 lines and keep going? What do you do when you know you are unable to find an answer on the internet, or when you try an experiment and it fails? How do you back up and try again? These things are all at that orchestration layer, and I think this is where the actual individual tools are going to differentiate themselves. That’s why I keep coming back to Claude Code: of all the tools I’ve tried, they have this experience of “it just works” nailed. Everything else has required so much more finagling and babysitting from me. With Claude Code, I don’t have to select what mode it’s in. I don’t have to carefully choose to only send it files that I know it can handle. I just point it at files on the internet and files on my local machine. It asks me at the right times when to do what operations, and it goes to town. So that’s my personal favorite these days, and I really think this is — the tooling layer is really important.
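As a rough illustration of the "read it in chunks of 100 lines" strategy Gabe mentions, here is a minimal sketch. It is not how Claude Code or any other tool is actually implemented; the function name `read_in_chunks` and the default window size are assumptions made for the example.

```python
import itertools
from typing import Iterable, Iterator, List, Tuple


def read_in_chunks(lines: Iterable[str], chunk_size: int = 100) -> Iterator[Tuple[int, List[str]]]:
    """Yield (first_line_number, lines) windows of at most chunk_size lines.

    Only one window is held in memory at a time, so a very large file
    never has to be loaded (or placed into a model's context) all at once.
    """
    iterator = iter(lines)
    start = 1
    while True:
        chunk = list(itertools.islice(iterator, chunk_size))
        if not chunk:
            return
        yield start, chunk
        start += len(chunk)


# Usage: scan a synthetic 250-line "file" window by window.
fake_file = [f"line {i}" for i in range(1, 251)]
for first_line, window in read_in_chunks(fake_file):
    print(f"lines {first_line}-{first_line + len(window) - 1}")
```

Because `read_in_chunks` accepts any iterable of lines, an open file handle can be passed in directly. An orchestration layer would then decide, window by window, what to keep, summarize or discard.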

That said, the reason I was doing the funky shake-the-head thing comes from this example from the last couple of days: I did this work on the Metal optimization in Claude Code with 4.5 Opus and was blown away.

Tim Hwang: You said you did it, Gabe. You said you knew nothing about it, and it was nice of Claude Opus to figure it out for you. How much did you actually do?

Gabe Goodhart: Okay, we’ll unpack that one. To your comment, Chris — I love your framing of “who’s the architect here.” I’ve been banging my head against this problem for months. I’ve been trying to tackle this from the mathematical perspective of reformulating the SSM scan operation as SSD following the Mamba 2 paper, et cetera. It turns out I was looking in the wrong place. The right place to look was the very inefficient SSM conv implementation that didn’t take advantage of thread grouping. I actually was the one who figured that out myself by carefully commenting out chunks of code and realizing that if I took away the SSM conv operation, I got double the performance. That was the light bulb that said, “Ah shoot, I’ve been looking in the wrong place.” Then I went over — I’ve read this kernel many times myself. I haven’t seen anything that says this is clearly a problem because I don’t know the ins and outs of how the Metal GPU is architected. So I got all the way to the point of finding the problem, but I didn’t know what to do with it. I pointed Claude at that and said, “Claude, here’s what I’m experiencing. Here are the commands I’ve been running to isolate this. Here’s the line I had to comment out to get to this point of discovery. Please take it from here.” And it was able to say, “I read that code. Thank you for the pointers. The problem is right there, this line.” So I did a lot of the work to get there. Claude did the work to actually solve the problem.

Chris Hay: Gabe, I think you’re really stating where we are right now: if you don’t know what you’re doing at all, you will get so far — it won’t be the most maintainable code, it will be a bit muddy, whatever. But today, you still need the human part of that loop. You need to guide Claude. And to your point about Claude Code, it really does deal in a couple hundred lines at a time, so it’s a very narrow window. You can direct and push it to different places. But if you need that broader view and you need it to look at the larger context, you’re either doing some thinking yourself or you’re going to the Claude web interface and saying, “No, think a little bit further for me.” You do need to do that thinking today. I’m not so sure that’s going to be so necessary in the future.

Tim Hwang: Is the 100-line thing a design choice by Anthropic?

Chris Hay: Yeah. How do we read that? The most generous interpretation is they actually want you to do some thinking, but I don’t know if you would read it that way. I think they just don’t want you to burn your context windows. I really think it’s as simple as that. But it is actually remarkably efficient at it. If you look at the tools that Claude Code has, it actually has very few tools: it’s got Fetch, it’s got Grep, it’s got Bash. What more do you need?

Gabe Goodhart: I’ve got to be honest — that’s actually good enough.

Chris Hay: In reality, it has very few tools, but it is incredible in how it’s able to execute. So I sometimes question my lifestyle choices in building MCP servers every so often, going, “Am I wasting my time here?” because Claude Code does so well with so few tools. But I think that reflects where we are today. I do believe that in the future, the tools are going to get more efficient. They’re not only going to be using Grep. If you look at things like Cursor, for example, they have indexed your codebase, and I’m sure Claude Code is on that path already. In fact, I think they released something recently. So I think a lot of those constraints we’re talking about are going to go away. Ultimately, I do think you are still going to be part of that loop. You still want to be that architect. But the progression is (I hate to say it this way) to treat it like another compiler. We went from punch cards to assembly to C to C++ to Java to Python to JavaScript to TypeScript, etc. We’ve gone up the stack. Apart from Gabe and his story about looking at Apple Metal, do we really go and look at the assembly code that often? No, because we kind of trust the tools to do that job. I’m pretty sure we’re in that kind of paradigm shift, but you still need to know what’s going on under the hood. I think we’re at another higher level of abstraction going forward.

Gabe Goodhart: To wrap up the “where-we-are-today” point: in parallel to doing this with 4.5 Opus, I did a simpler task with 4.5 Sonnet, and it needed way more oversight than what I had to give Opus. Opus actually did a great job of looking at git history, looking at JSON files, looking at all the pointers I gave it both on the web and locally, and needed ultimately very little oversight in solving a complex problem. Sonnet, on the other hand — I pointed it at a gnarly problem that’s very hard to test because it crashed my terminal every time it was triggered; literally the whole terminal app just died, which was a pain — Sonnet claimed success three or four times, and I had to keep going back and saying, “No, I’m pretty sure that’s not right.” So there is a model capability here. The difference we’re going to see — and I haven’t tried this against Gemini 3 or the latest versions of the OpenAI 5 Codex models or any of the other latest-gen ones — but I suspect that’s the capability difference: essentially how much oversight do you have to give this sort of “deep T” expert that you’re pointing at a specific problem? The thing that I think is going to be really interesting for next year is to see if that individual, task-oriented deep T expert — when I say deep T, I’m referring to T-shaped skill sets — I think what we’re seeing right now is that if you give one of these models a well-researched problem in a domain that you are not yourself a deep T expert in, it can actually do a very good job of solving that. But the more capable the model, the less you have to supervise that solution. I think going forward, we’re going to see this paradigm that we see peeking out from under the covers with Google Antigravity: “I’ve hit a point where I can actually reliably trust that deep T expert is going to get the problem correct, so now I can start launching a bunch of these in parallel and not babysit them.” I think that’s the holy grail to get to next year. 
We’re definitely not there yet — from what I hear from Antigravity users and from other attempts at fleets of agents and becoming an agent manager. But I just smell-test based on the capability gap from 4.5 Sonnet to 4.5 Opus. I feel at least some optimism that we will get to that point next year where you can actually queue up large quantities of tasks for independent operation and basically only supervise them when they come back and tell you they’re completed.

Tim Hwang: I’m going to move us on to a final topic — particularly fitting given the folks on this panel. I want to take the last few minutes of this episode to talk about open source. One of the big meta-narratives of 2025 is that open source is continuing to catch up. It used to be “give it a few months and open source will have what the state of the art had.” Now it feels like we’ve gotten to the point where it’s at parity or even getting ahead of proprietary models. Olivia, I want to hear from you on your experience with this. Are we going to see that pattern also happen in the code generation space, where open models start to be able to do code generation at a Claude level? Is that in the offing? Why or why not?

Olivia Buzek: I think we have to make a little bit of a distinction between open weights and open source frameworks that are being used to do code generation. I mentioned Continue, which is an open-source framework you can use for code generation stuff. As I mentioned, though, they haven’t really leaned into agentic pieces yet, so you’re kind of on your own in terms of making that model highly performant. A lot of these models are in fact open weights models where you can download the weights and put them behind a whole bunch of different things. What we’re not seeing yet is an openness within the most common tools to just use any open weights model on the open market. Not every single code generation tool is saying, “You can just pop in whatever open weights model you want.” They end up doing this hybrid synthesis of an open weights model combined with an agent architecture that is designed for that particular model. So I think we’re still seeing a lot of combinations being more successful than the open weights models themselves. But that doesn’t mean the open weights models aren’t powerful. It just means they need a lot more guidance than being able to be used just off the shelf.

Gabe Goodhart: The one delta I would say on that is that Continue actually has leaned in heavily to agents, but Continue — like many open-source tooling layers — tries to split the difference between running against local models and against hosted closed models. Their agents work great if you plug in Claude or Gemini. I spent a bunch of time last week trying to get it to work with Granite 4 Small, and it does not work very well. There are others out there like OpenCode, which I also tried extensively with Granite 4 Small to similar effect. Now, part of this could be simply the nature of the size of these models. I haven’t tried running it against a really large, frontier-level open model because I can’t run that on my dev box. But I also think there is an inherent advantage of closed ecosystems to be able to co-evolve the model and the tooling together so that you’re not trying to keep this level of separation between the model’s capabilities and the actual agentic patterns around it. All of these tooling layers for coding or otherwise involve a great deal of prompt engineering and a great deal of manual tuning of “oh, I’ve seen that it tends to fail in this corner case, so I need to code my way or prompt my way out of that corner case.” That’s just really hard to do in a model-agnostic way. So I think that’s one of the big advantages. I haven’t personally tried Qwen’s direct Qwen2.5 Coder local CLI. I probably should give that one a shot, because that’s an example of an open ecosystem trying to do this where they have a model-specific open tooling layer. The one I have tried is pointing OpenAI’s Codex at GPT-OSS-120B, and I would say that is a solid step up from running Continue or OpenCode against Granite 4 Small. Again, model size is a big element here, but also the pairing of the model capabilities with the agent side. 
So I don’t have a clear, decisive answer here, but I do think you’re spot on to point out that this really has to do with the software layer, and that’s probably where there’s the most catch-up to be done on the open side relative to the model capabilities. Because of the loose coupling in open source, I think we’re going to see it a little bit harder to get to those peak performance capabilities.

Tim Hwang: Olivia, this almost feels like a story of vertical integration — that maybe there’s a structural advantage. The dream of open source is you take a bunch of components off the shelf, you click them together, and with a little bit of spit and polish, it works. But it feels like the amount of work that still needs to go in to get the model and the software to work together is something where almost structurally open source has a problem. It’s strong at other things, but in this particular case, it might have some limitations. Does that mean we should be a little bit pessimistic about fully open ecosystems for code generation, or do you feel there are things the community will do to deal with this?

Olivia Buzek: I’m still heavily optimistic about it. I just don’t think we can draw the conclusion that open source is ever an off-the-shelf solution. I think this is true in every space. If I looked at an Ubuntu GUI, I can do anything, but I have to know what I’m doing. Even to this day, if I’m using Linux for something, it’s going to require more configuration, but I can configure the heck out of it and I can get exactly what I want from it. So I think we’ll see more of that. If you can imagine a world where you end up having a lot more control over that agent architecture and you also get to choose your open weights model, then you can basically say, “I do these particular types of tasks — 90% of my work looks like this — and Claude Code doesn’t necessarily get me there, or Codex doesn’t necessarily get me there, but because I’m doing this particular type of task all the time, I can build something that works for me.” I believe that’s going to exist. And I also think that the development of this open ecosystem allows rapid innovation, sharing, and being able to make sure that we’re always working at the state of the art — which is not something you get when everybody is fully closed. In a world where everybody is completely closed, we can never make those comparisons of whether the model is actually making the difference or the agent architecture is making the difference. Once it’s that tightly held together, you end up in a world where you’re only able to look at whether the two together are succeeding. You can never say, “I’m just going to change out the open weights model underneath. I’m going to change Opus to Sonnet, or even GPT-OSS, and then I’m going to check out the difference.” If you’re completely unable to make those comparisons, you’ll never know: is this caused by my agent architecture or is this caused by my model?

Chris Hay: The biggest problem in my mind is the cost of inference. If we really analyze this for a second, the folks who are using Claude Code are all sitting there on our various plans — I’m on my Max Pro Plus whatever plan — and therefore I’m never worrying about the cost of tokens. I would not be paying the API costs in a million years. The only way I can use my Max plan is through Claude Code. I can’t use an open-source tool to go and connect to that; that’s not allowed. And they’re not the only ones that do that. Gemini is the same. Codex is the same. Even Qwen and Kimi K2 — they all offer similar plans. You’re doing a subscription plan, and you can only go through those tools. So you are locked into that tool. You can go and talk to other models, but you’re going to be paying the API costs, and that is the problematic element. Whether you are Cursor or one of the others, that’s why they’re all developing their own models — because they need something that can satisfy the subscription plan. People don’t want to pay per use; they want to pay per subscription. So when is this going to change in reality? The models need to be much, much smaller. If you have a coding model that is 3 billion parameters, 7 billion parameters at the max, that can run on your machine and is as capable as Opus 4.5 is today — then there we go. At that point, all of the open-source tools can go wild. But until then, the cost of tokens — technically you can do it, but economically it just doesn’t make a lot of sense.

Now, Gabe, to cover your points about the capability of the models: the good news is I have played with a bunch of those models. I’m a model connoisseur; I love playing with different models. When I did my Kimi K2 video, I was genuinely surprised at how good that model was. I was like, “Whoa.” It’s not at Claude 4.5 Sonnet or even Opus levels, but it was pretty darn good. I would say the same about playing with DeepSeek-V3.2, the reasoner model, at the weekend. Again, incredible model. But can I run an open-weight model like that on my machine? No. I can download it, sure, after a few days, but I’ve got nothing I can run it on. So I think inference needs to be sorted out. Until then, we’re going to be sitting on these vertical stacks.
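To put rough numbers on why Chris singles out 3 billion and 7 billion parameters as sizes that could run locally, here is a back-of-the-envelope sketch. It counts only the memory needed to hold the weights (parameters times bytes per parameter) and ignores the KV cache, activations and runtime overhead, so real requirements are higher. The function name and the choice of 16-bit and 4-bit precision are assumptions for illustration.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


# The two model sizes mentioned in the conversation, at two common precisions.
for size in (3, 7):
    for bits in (16, 4):
        print(f"{size}B parameters at {bits}-bit: ~{weight_memory_gb(size, bits):.1f} GB")
```

By this estimate a 7-billion-parameter model needs about 14 GB for its weights at 16-bit precision and about 3.5 GB at 4-bit, which is laptop territory. The same arithmetic scales linearly, so a model with hundreds of billions of parameters is out of reach for a typical dev box.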

Tim Hwang: Yeah, absolutely. 100% agree. On that note of unanimity — Gabe, Olivia, Chris — this panel is fire. I wish I could bring it together once a quarter. It’s amazing to have you all on the show. That’s all the time we have for today. Thank you to all you listeners. If you enjoyed what you heard, you can get us on Apple Podcasts, Spotify, and podcast platforms everywhere. And we’ll see you next week on Mixture of Experts.
