What does the future hold for DeepSeek? In episode 39 of Mixture of Experts, join host Tim Hwang along with experts Abraham Daniels, Kaoutar El Maghraoui and Skyler Speakman to discuss the release of DeepSeek-R1. Next, explore Mistral’s IPO plans and what they could mean for the market. Then, listen to the discussion around the new FrontierMath benchmark—why is it so difficult? And finally, hear the experts break down the IDC report on code assistants. What do we need to know about generalist and specialized coding assistants? Tune in to this week’s episode to find out.
The opinions expressed in this podcast are solely those of the participants and do not necessarily reflect the views of IBM or any other organization or entity.
Tim Hwang: By the end of 2025, will DeepSeek be leading the state of the art in artificial intelligence? Abraham Daniels is a Senior Technical Product Manager with Granite. Abraham, welcome back to the show, joining us for the second time. What do you think?
Abraham Daniels: They’re definitely making a splash in the open-source space, but you know, it’s a really competitive landscape, so I guess we’ll have to wait and see.
Tim Hwang: Kaoutar El Maghraoui is a Principal Research Scientist and Manager at the AI Hardware Center. Kaoutar, I feel like you’re becoming a regular here on the show. What’s your take on this question?
Kaoutar El Maghraoui: DeepSeek is definitely reshaping the AI landscape, challenging giants with open-source ambition and state-of-the-art innovations. But talking about leading, I think that remains to be seen. It’s not just about the raw performance, but it’s also about the whole integration.
Tim Hwang: And finally, last but not least, is Skyler Speakman, who is a Senior Research Scientist. Skyler, welcome back. What is your take?
Skyler Speakman: Amazing technology. Great splash, as we said earlier. But I think there’s really some big geopolitics at play on how these models really get developed and are used across the world.
Tim Hwang: All right. All that and more on today’s Mixture of Experts. I’m Tim Hwang, and welcome to Mixture of Experts. Each week, MoE is the place to tune in to hear the news and analysis on some of the biggest headlines and trends in artificial intelligence. Today, we’re going to cover quite a lot, as per usual. We’re going to talk about Mistral potentially going IPO, controversy around the FrontierMath benchmark, and a recent interesting IDC report on generalist versus specialized coding assistants. But first, I want to start with DeepSeek. Just this past week or so, DeepSeek released R1. And if you’re a listener to the show, you know that just a few episodes ago we were talking about DeepSeek-V3, which at the time kind of blew everybody’s mind by showing really incredible performance with far less compute and cost than what we’re traditionally used to in the AI space. R1 is DeepSeek’s fast-on-its-heels follow-up, showing performance comparable with the state-of-the-art stuff coming out of OpenAI, specifically o1 and the inference-time compute techniques that really seem to give that model a lot of its benefit. Abraham, maybe I’ll start with you. Do you want to talk us through a little bit about why this is a big deal? Because I remember when o1 was released, people were like, “This is a huge innovation and really shows that OpenAI has this big technological edge.” Pretty soon afterwards, it seems like DeepSeek is doing almost the same thing. So do you want to talk our listeners through it? How did they catch up so quickly?
Abraham Daniels: Yeah, that’s a great question. I think there’s kind of two things that are really cool here. One is, of course, just the comparative performance with a state-of-the-art, leading-edge, bleeding-edge model like o1. But unlike o1, it’s been pretty cool that DeepSeek has decided to open-source it, which has been able to kind of proliferate some pretty powerful models across the community without the blockage or added need for commercial license. So I think they’re really kind of shifting the paradigm, given a lot of these model providers are starting to slap on more specific licenses that are tailored to more commercial practices, given the business model that they’re in. So I think it kind of shifts the idea of what does it mean to be transparent? What does it mean to be open without having to risk performance?
Tim Hwang: Skyler, it strikes me that when we’ve talked about this issue in the past, we’ve really talked about it in terms of OpenAI versus Meta, right? And Meta’s trying to kind of compete with OpenAI by releasing these incredibly powerful models open-source. This almost feels like now everybody’s after OpenAI exactly the same way. And obviously the distinction here, which is pretty interesting, is that DeepSeek is not a classic player; it’s not a big tech player. So do you want to speak a little bit to that? I know you kind of mentioned that you think the competitive dynamics here are really interesting to watch.
Skyler Speakman: So, first off, I think we’ll get to the competitive dynamics in a bit, but reinforcement learning is back on the scene. It kind of died out for a while when deep neural networks really took over, but there are now multiple companies bringing it back into large language models, and DeepSeek is an example of making that quite public. So it’s cool to see these ebbs and flows of various parts of AI and machine learning come and go. That’s more on the technology side; it’s really cool to see some of these things pop back up.
Tim Hwang: Yeah, totally. And I guess a quick comment on that: I think it is funny that for DeepMind, which originally made its bet on reinforcement learning, I think the rhetoric of the last year was, “Ah, they made the wrong bet and now they’re trying to catch up.” And now it’s like, were they just really, really far ahead of everybody else? I don’t know.
Skyler Speakman: Yes. No, great comment. There was this big push in reinforcement learning before, I think, the transformer. And now these things seem to be, I’d say, cohabitating, or at least being in the same technology. DeepSeek has shown that they can put both of those techniques into the same package. And I think that is a really compelling argument for their strength going into 2025.
Tim Hwang: Kaoutar, maybe I’ll turn to you. I know out of the set of folks on the panel, you sounded the most cautious about DeepSeek. There’s a point of view which is, “Oh man, they’re releasing V3, that’s incredible,” and then not like a month or so later, “Oh my God, now they’re releasing R1, they’re catching up so quickly.” There’s a way the human mind is just like, “Well, if we continue these trends, then AGI by the end of the year from DeepSeek.” Do you want to speak a little bit about why you’re still ultimately kind of skeptical that DeepSeek is the arrival of a genuine deep challenger to something like OpenAI?
Kaoutar El Maghraoui: Yes, I think the key question is, what advancements does R1 introduce compared to V3? And how does it compare to o1? Are we talking about incremental changes or true innovations that leapfrog the AI community? They’re claiming improvements in search precision, scalability, and usability, while their V3 release focused on optimizing the core algorithms. They’re saying that R1 has capabilities such as better contextual understanding, especially for complex reasoning tasks, which makes it competitive, kind of toe-to-toe with o1. So I think we still need to test these models to see whether they’re really there, because this is a new release; it remains to be tested what capabilities they’re really bringing to the table and how they really compare with o1. I mean, they’re showing benchmarks where they sometimes exceed o1, so that’s something that needs to be validated. But one thing I’m a bit skeptical about is that o1 still benefits from its proprietary integration with enterprise-grade features, which R1 might lack. And that’s something that still needs to be tested and evaluated. Another thing is, what are the broader implications of this rapid iteration for the open-source ecosystem? The release cycles are pretty impressive; they’re very fast. This release pace showcases the power of community-driven innovation. However, maintaining quality while scaling adoption remains a challenge here. The open nature of DeepSeek could accelerate AI democratization, and it’s also challenging the big players like OpenAI, visibly putting pressure on them by coming in with very competitive pricing, much cheaper than OpenAI’s. So I think it still remains to be validated whether we’re really talking about true innovation that goes hand in hand with what o1 is doing, or even better. That still needs to be validated, but I also think the fine-tuning capabilities and the integration with enterprise use cases are probably still lacking there.
Tim Hwang: Yeah, for sure. I guess, Abraham, that’s a very natural place to turn to you. What I hear in Kaoutar’s argument is the idea that the models are going to become more commodity with time, and the competitive edge is integration, right? Which is, “Well, OpenAI can kind of win now because it’s hooked into all these other types of systems, and that’s actually where the advantage is.” As someone who’s working on Granite, is that kind of how you see the market? I’m kind of curious about your response to all that.
Abraham Daniels: Yeah, I think there are kind of two audiences that we gear towards. There are the commercial users, who are really focused on enterprise use cases, ensuring that there’s proper governance wrapped around the model and demonstrable safety and support. And then there are the open-source developers who, in my opinion, kind of dictate what the best is outside of benchmarks, which, to Kaoutar’s point, are not always exactly what they seem. Our developer community really dictates what the best is based on the adoption rate. Over here at Granite, we’re focused on open source, so I think DeepSeek is a phenomenal play in terms of opening up the aperture on some of the most performant models on the market. And honestly, I’m looking forward to seeing what comes from this in terms of the learnings that are shared and how developers in the community actually start to use R1 to develop new ways of creating applications and spaces where this model can perform. And if I may just add, the release of DeepSeek did come along with a number of distilled versions. Just on the point of adoption, the 671 billion parameter model is not gonna fit everywhere in terms of compute availability. So the fact that DeepSeek understood that, in order to drive adoption, you have to have different weight classes for different use cases, I think that just adds to their story as well.
Kaoutar El Maghraoui: Yeah, I think to truly lead, LLMs need to move beyond just raw benchmark performance. To reach true innovation, you have to innovate across efficiency, ethical frameworks, specialized adaptability, and ecosystem support, pushing the boundaries not just in AI, but in how it’s going to transform human interactions, technology, and enterprise applications. So it’s really a story about end-to-end integration while being safe and ethical. That’s when you can really claim true leadership in the AI space. A full story of integration, not just benchmark performance. Benchmark performance is important, but integrating everything end-to-end and meeting all the regulations, safety, and ethical considerations will be really important to drive wide-scale adoption. I like the analogy of the teacher-student model. Think of the big model as a teacher and the smaller models as students, and the students are just trying to mimic the teacher’s internal representation and its final answers while still having a much smaller footprint.
Tim Hwang: Yeah, totally. Sounds like Skyler wants to get in. Skyler, before your response, if I can prompt you a little bit, can you explain a little bit what distillation is? I think it is super important and is going to totally change a lot of the competitive dynamics in the space. Even I have kind of the barest understanding of what it is. So I think probably you should start with an explanation of what does it mean that they’ve released a bunch of distilled models? And then you should do whatever hot take you’re going to do.
Skyler Speakman: All right. I’ll try not to get into lecture mode too much. Knowledge distillation is when a much larger, probably much more complex model is used as a target for a smaller or less capable model. So what do I mean by a target? Hopefully our listeners understand the idea of the next-token prediction task, right? You have to complete the rest of the sentence. Knowledge distillation doesn’t care quite as much about predicting the next token, but rather takes a smaller model and asks it to match the internal representation of a larger model. So before that larger model gives its answer, it has its own internal representation of the answer. Now we are tasking the smaller model to match that representation rather than making a prediction of another token. And actually last year, Meta showed great results getting Llama 3.2 smaller through knowledge distillation. But what’s different here is they are now fine-tuning a Llama-based model where the larger one is coming from DeepSeek. So this is spreading across different companies and different ways of training. The original DeepSeek model is way too large to actually run in a lot of circumstances. But as part of this release, they also have Llama-based models that have been fine-tuned as guided or as distilled from the DeepSeek model. And I think that was a very, very smart play, because people are used to the Llama sizes, you can use Llama APIs, and these seem to be plug-and-play with those existing tools already. So knowledge distillation is a way of taking a much larger, much more complex model and using it to guide the training process of a smaller model that uses a lot less VRAM and makes a lot of users much happier.
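A minimal sketch of the soft-target idea Skyler describes: the student is trained to match the teacher’s temperature-softened output distribution rather than only hard next-token labels. This illustrates the loss term alone, not DeepSeek’s or Meta’s actual training code; the function names and numbers are ours.

```python
import math

def softmax(logits, temperature=1.0):
    # Scale logits by temperature, then normalize to probabilities.
    # Higher temperatures flatten the distribution, exposing more of the
    # teacher's "dark knowledge" about near-miss answers.
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    # KL divergence between the softened teacher and student distributions:
    # zero when the student matches the teacher exactly, positive otherwise.
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# A student whose logits match the teacher's incurs zero loss;
# a mismatched student is penalized.
teacher = [3.0, 1.0, 0.2]
matched = distillation_loss(teacher, [3.0, 1.0, 0.2])
mismatched = distillation_loss(teacher, [0.2, 1.0, 3.0])
```

In a real training loop this loss (often mixed with the ordinary next-token loss) is backpropagated through the student while the teacher stays frozen.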
Tim Hwang: So I’m going to move us on to our next topic. Mistral, the French open-source AI company, recently appeared at the World Economic Forum happening in Davos. And after much rumor, confirmed that they were not attempting to sell the company or be acquired, but instead would be pushing for an IPO. I think it’s a nice opportunity to talk about Mistral because I remember, many moons ago—and by that I mean, I don’t know, 18 months ago—Mistral was the thing that everybody was talking about in terms of open-source AI. And candidly, we haven’t really heard from them in some time, right? We haven’t talked about Mistral at all in the last, say, 10 episodes of Mixture of Experts, and open source appears to have become much more dominated by, say, Meta. I guess the question I wanted to ask the panel first is, is open source really Meta’s game right now? Or is there a chance for these earlier players that really moved open-source AI along in a big way in the early innings of this game? Do they still have a fighting chance here? Or is it really Meta’s game in some way? Abraham, maybe I’ll toss it to you. I’m curious what you think about that.
Abraham Daniels: In short, I don’t think it’s only Meta’s game. The most recent Llama license, although it allows for open source, has some intricacies; for instance, the model nomenclature has to include “Llama.” So they do still wrap some restrictions around how you use your model, especially if you are an IBM or a different model developer that wants to distill DeepSeek into Llama. So I think the market is still open. IBM is 100 percent committed to open source. Our entire roadmap will ensure that our dense models and our MoE models are released on Hugging Face, fully open source under Apache 2 licensing. So personally, I think the field is still open to whoever wants to lead that charge. And just based on our last conversation, obviously DeepSeek is now entering the space with an extremely high-performance model... I think right now it’s about who’s committed to it more so than who owns it.
Tim Hwang: Skyler, do you agree with that?
Skyler Speakman: Yes, I do. I’m rooting for them. Perhaps, living in the global majority, I pay more attention to where these models come from. And so I am rooting for models coming from the EU or any of the non-traditional large players. So great to see them, you know, at least not being up for sale. We’ll see how long that lasts. But yeah, it was really cool to see that statement. And again, rooting for models that are coming from as diverse parts of the world as possible. So I’m still holding out for Mistral to represent large parts of the world. “Should they” or “could they” might be the key difference there; I think if they could, they would have by now. It is proving much more difficult to scale these efforts in other countries, and it’s also why I think two countries have really dominated this space. So I would like to see more of that; again, that’s why I’d be a Mistral fan. I think it would take lots of investment from governments and universities, if that money exists, to really push that type of homegrown model effort. And I don’t really see that now. That’s why, again: Mistral, stay strong, still represent other parts of the world.
Tim Hwang: Yeah, of course, because I think that is a big part I did want to bring up: the global majority and the geography of all this. We talked about DeepSeek, China; Mistral, for a long time it’s considered like, “Oh, okay, Europe’s also going to have its open-source player in the space.” So yeah, I think it is exciting. I guess, Skyler, to kind of push you a little bit further, do you think that different countries, different regions of the world will produce very different kinds of models? I guess that’s kind of the thing that you might be suggesting here.
Skyler Speakman: Definitely. Yeah. So, Kaoutar, are you going to buy into the Mistral IPO?
Kaoutar El Maghraoui: I think it’s a great strategic move by Mistral. It’s great for the European startup ecosystem, because those startups often face challenges around scaling due to limited venture capital compared to what we see in the U.S. So Mistral’s IPO will really test whether Europe can foster globally competitive AI companies. And of course, I think it’s important not to have this centralization just between the U.S. and China. It’s good to see other countries, the Middle East and Europe, also contributing models. Going to your question about whether we’re going to see different models coming from different regions, there might be some nuances there. For example, cultural implications and language: some regions might tailor their models to their specific cultures and traditions, and focus more on incorporating their languages in the APIs and in answering questions, which would be great. Of course, for general questions and so on there will be commonalities, but I think there might also be some regionalization in the future. Even the way the model responds to you, for example the tone of the language, whether you want it to be polite or more aggressive. If we can inject some of these human traits into human-AI interactions and tint them with cultural aspects, that would be really great. The way you greet a person differs from region to region. Would you incorporate maybe some religious aspects or some cultural aspects? It would be nice to see some of these specializations per region.
Tim Hwang: That’ll be so interesting because I think it’ll... I mean, there’s almost nothing mysterious about it. It’s almost like, okay, if you’re based in a country, you may think to use certain data sets that people in other countries may not think to use, right? And that’ll actually have a material effect on the behavior of the model. And so I think it’s like, there’s really kind of interesting aspects of, “Oh, what would you choose to use if you’re based in France versus Menlo Park, California?” And I think that’s a really interesting twist of it. Yeah, definitely. I’d love to do the test, which is, you know, talk to this chatbot: “Which country do you think this chatbot is from?” Like whether or not you could be like, “Oh, that’s definitely an American chatbot, I would know.”
Tim Hwang: Next topic that we’re going to cover today is a pretty interesting one. A few episodes ago, we talked about the release of a benchmark called FrontierMath from a group called Epoch AI. And FrontierMath is fascinating to me, at least, because it is an attempt to keep up with how capable these models are becoming. What FrontierMath is, is you work with a group of graduate and professional expert mathematicians to put together incredibly hard math problems that even they have a hard time solving, and you use that as the source of the eval benchmark. And you know, all the classic evals, like MMLU, have kind of become saturated; no one really thinks they give us good signal anymore on model performance. Now, I bring it up again today because there was an interesting controversy that emerged: it came out that OpenAI had been involved in the development of this eval, and in fact had gotten access to the initial test questions. There are a couple of responses that Epoch had; one is that there’s a holdout set that the OpenAI team won’t be able to access, and there’s a commitment not to train on these questions, which might otherwise distort the eval performance. But I wanted to raise it because I think we’re in this interesting time where everybody knows the existing evals that are the main benchmarks in the industry are kind of broken. Everybody’s seeking to create better evals. And we’re in this new world where we’re trying to work out: what should that look like, exactly? And I guess, Skyler, I want to throw it to you: how should we think about the involvement of companies in developing benchmarks?
Skyler Speakman: I guess the skeptical part of me would just say: expect that type of back-and-forth between the companies and the evals, take whatever performance gains they’re advertising with a grain of salt, and wait for third-party confirmation. That’s probably my largest takeaway. Don’t say it’s never going to happen; in some cases, it really is great to have smart people get into the same room and break down barriers between the companies and the people making benchmarks. But don’t just take that particular company’s word about how amazing their product is on arguably overfit results. So yes, add overall skepticism, raise the bar a little on consumer education about what these results really mean, and make people really appreciate third-party confirmation.
Tim Hwang: Definitely. I take that, and you know, I’m a little bit sympathetic to Epoch, right? You want to create an eval that challenges the very best models, and part of that involves working closely with the companies to design those evals. The worst thing is to release an eval that is completely irrelevant to actually testing any model performance at all. So almost by necessity, there is this kind of interaction. Abraham, do you buy that this is sort of inevitable? I have some friends who are like, “Church and state, right? The eval people should never talk to the companies,” which I think is, at least in my mind, a little broken, but I’m curious what you think.
Abraham Daniels: I think it’s a very controversial thing here, you know, what can you really trust? There are all these benchmarks out there. But with this controversy around FrontierMath, you can see that OpenAI had this advance access, which raises concerns about fairness, because it gives them an advantage in optimizing their models specifically for those benchmarks. And this compromises the integrity of fair benchmarking, where all participants should start from the same baseline. So how can we fix this? Can we establish some governance around these evals? Can we have transparent access rules and independent oversight, like a third party that makes sure everybody starts from the same baseline and doesn’t get access to data that would help them tune their models for those specific use cases? And then can we have an open review process for the results? That’s going to require a lot of work, but I think it can be done. Technically, it can be done to have completely independent third parties that establish governance and build the tools and processes to really ensure a fair evaluation process. And I hope we get to that at some point, because otherwise, what can you trust? You sometimes have to do these evaluations yourselves. And I think the community can also contribute to these evaluations and provide more validation. I mean, the jump in the benchmark was pretty significant. I think it went from around 2 percent before to 25 percent with the o3 results. That’s a big jump.
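One way to picture the holdout-set and independent-oversight ideas discussed here is a deterministic split: part of the benchmark is public, part is held back, and reported scores come only from the private portion. This is a hypothetical sketch of the protocol, not Epoch AI’s actual process; all names are ours.

```python
import hashlib

def split_benchmark(problems, holdout_fraction=0.3):
    # Deterministically assign each problem to the public or private set
    # by hashing its text, so the split is reproducible and auditable
    # by an independent third party rather than chosen after the fact.
    public, holdout = [], []
    for text in problems:
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        (holdout if digest[0] / 256 < holdout_fraction else public).append(text)
    return public, holdout

def holdout_score(model, holdout_problems, reference_answers):
    # Report accuracy only on the private holdout, limiting the benefit
    # a lab gets from having tuned its model on the public questions.
    if not holdout_problems:
        return 0.0
    correct = sum(1 for p in holdout_problems if model(p) == reference_answers[p])
    return correct / len(holdout_problems)
```

The design choice here is that the split depends only on the problem text, so anyone holding the full problem set can verify that no question quietly moved between the public and private sides.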
Tim Hwang: Yeah. And I think there was also someone, I think Chollet, the creator of the ARC-AGI benchmark, who disputed OpenAI’s claim of exceeding human performance. He highlighted that o3 still struggles with some basic tasks. So the question remains: what do you trust? The 25% leap here compared to the 2%? Or maybe there are still some gaps and they’re not telling the full story.
Kaoutar El Maghraoui: Can we create an eval LLM? Some model that evaluates all of these other models? Can we automate this evaluation process?
Tim Hwang: Kaoutar, I guess this kind of leaves us in a funny place though if we take Skyler’s rule, which is we should see all these evals with a bit of skepticism. Is it true that in the end, vibes still are the best eval? Can we trust any eval anymore? It kind of leaves me in a fun place because I’m like, “Well, I really desperately want to have some kind of quantitative metric here,” but it sort of feels like maybe that’s ultimately a lost game.
Kaoutar El Maghraoui: Yeah, I think the incentives are a little bit interesting here too, because I think Epoch gets burned in this story, but OpenAI gets burned as well, right? Because it doesn’t, it’s not a great look in some ways. And I feel like almost there’s incentive to be as hands-off as possible, because look, when o3 comes out, I really do believe it will be better at very hard math. I think there is actually some genuine signal here, but where we are now is maybe a little bit in the shadow of, “Oh, well, we know this arrangement, and they had access, and all that.” The question is like, how much of that delta is the model, and how much of it is being able to kind of study for the test basically? So yeah, I think we’re going to have to keep on this. There’s a great article that I saw that just came out, I think a few weeks back, that was kind of making the observation that models are getting better, but we can’t really measure how. We live in this kind of funny world where all the evals kind of seem broken. We have a general strong intuition that things seem to get better, but we have no way of actually assessing that, which I think is a funny situation to be in.
Tim Hwang: Yeah, I think that’s kind of where we end up. If we think that vibes are going to be a powerful way of evaluating models (and what we really mean by vibes is an interactive evaluation, where you talk with the model to get a better understanding), it seems intuitively obvious to me that at some point, to scale that, you end up with LLMs talking to LLMs, conducting a scaled “vibes eval.” I don’t know where that goes, but it feels like that’s one set of research paths you’d go down.
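Kaoutar’s “eval LLM” question and Tim’s scaled “vibes eval” point toward what is often called LLM-as-judge: one model compares the outputs of two others. A minimal sketch of that harness follows, with a trivial length heuristic standing in for a real judge-model call; everything here is hypothetical scaffolding, not any lab’s actual pipeline.

```python
import random

def length_judge(prompt, answer_a, answer_b):
    # Placeholder for a judge-model API call; this toy heuristic
    # simply prefers the longer answer.
    return "A" if len(answer_a) >= len(answer_b) else "B"

def pairwise_eval(prompts, model_a, model_b, judge_fn, seed=0):
    # Compare two models prompt by prompt. The answer order shown to the
    # judge is randomized to mitigate position bias, then verdicts are
    # mapped back to the true models and tallied.
    rng = random.Random(seed)
    wins = {"A": 0, "B": 0}
    for prompt in prompts:
        a, b = model_a(prompt), model_b(prompt)
        if rng.random() < 0.5:
            verdict = judge_fn(prompt, a, b)
        else:
            # Answers were swapped, so invert the judge's verdict.
            verdict = "A" if judge_fn(prompt, b, a) == "B" else "B"
        wins[verdict] += 1
    return wins

# With the length heuristic, a verbose model beats a terse one.
result = pairwise_eval(["hello", "world", "tests"],
                       lambda p: p * 3, lambda p: p[:1], length_judge)
```

Swapping `length_judge` for a call to an actual judge model is where the real open questions live: judge bias, self-preference, and whether the judge is any better calibrated than the models it grades.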
Kaoutar El Maghraoui: I see the future as an AI co-creation with software developers, where the future of programming will involve human-AI collaboration with AI as a coding assistant helping to brainstorm, optimize, and refine solutions.
Tim Hwang: Yeah, we’ll see. I just host the show. Someone else needs to do that work.
Tim Hwang: So for our final topic today, we’re going to talk about a report that came out of the research group IDC about generalist versus specialized coding assistants. It was released just earlier this month, I believe. The report takes a look at what programmers are getting out of coding assistants, and it shows a lot of results that I think we’re familiar with at this point. They report that 91 percent of developers are using coding assistants, and that 80 percent of those developers are seeing productivity increases, with mean productivity increasing by 35%. So all the good news we’re used to: these coding assistants really do seem to be helping people do better at their jobs as software engineers. The really interesting distinction they make, though, is between generalist and specialized coding assistants. Generalists are basically overall coding help, while specialized assistants focus on a specific programming language, specific frameworks, or industry-specific requirements. And they argue these are actually two different markets, and right now you kind of need both. So maybe I’ll throw the question to you first, Abraham. I always thought that where we’re headed with these coding assistants is one coding assistant model to rule them all. But it’s interesting to me that they seem to be making the argument that no, there are going to be these really interesting niches; my joke is like the FORTRAN model, right? Specific to this particular use case. Is that what you’re seeing at Granite? I’m curious because I know you’ve done a fair amount of coding work.
Abraham Daniels: You know what? I’m trying not to make predictions in this space because everything changes so fast. But what I will say is that there’s a shift in workforce specifically around capabilities. So I think that for organizations that need to be able to maintain their environment, they will look for models that help that. And if that can be provided as a part of a general model, all the better. But I think right now, it’s still looking to be more of a specialist model focus.
Tim Hwang: So I guess your prediction is that this is temporary, and we will see a merger: generalists will become specialized at some point.
Abraham Daniels: Yeah, I think it’s hard.
Tim Hwang: Skyler, do you want to talk a little bit about the interesting labor impact of all this? I was joking with a friend recently, I was like, what you really need to do now, talking about the FORTRAN code assistant, is specialize in languages that no one programs in anymore. Right, because if you do Python, you do any of the popular languages, you’re about to get wiped out because the models are going to get really good really fast. And so the main thing is to flee into like what weird obscure version of Haskell, you know, and kind of that’s your defensive moat if you’re a coder. Is that good advice? Or is that just crazy?
Skyler Speakman: That’s a great anecdote, and I think actually it’s not just a story. I do think IBM’s actually got a lot of vested interest in keeping some of those old languages up and running. So beyond just a punchline, here’s a great breakdown: as part of this survey that was done by IDC, they also asked what particular tasks you use these assistants for. And at the top of the list was unit test case generation. So this is like the really boring part of software engineering, writing all these unit tests to try to break your code. In that sense, I would say to your friend, don’t specialize in building unit tests. That is something that I think machines are doing a great job of, and people are already leveraging them for that task. But at the bottom of this list of where they aren’t using these tools as much is code explanation, which is, “If I copy in a set of this code, can I have an LLM tell me what this code is doing?” So I think there’s this really cool breakdown between tasks software developers really want automated for them—things like coding up unit tests—and other areas where they actually need kind of higher-level processing: “Ooh, what is this code doing? Can I explain what this code is doing to somebody else?” And that breakdown of how software developers, at least in the U.S., are currently using these tools, I think, represents that gap. So don’t tell your friend to specialize in unit test generation, but maybe have them skill up a little bit on the ability to explain what code is doing, because that’s something the AI assistants, at least currently, are not being used for.
Tim Hwang: Yeah, I think that’s right. And I think I don’t know, Skyler’s emphasis on, like, don’t do unit tests but work on explaining the code, I think is very interesting. Classically, documentation is always terrible for any software. And I guess, Skyler, kind of what you’re saying is maybe that’s actually where the future is. You really gotta get better at that soon.
Kaoutar El Maghraoui: I agree with you, Abraham. I think the problem-solving process—how you decompose a problem into subproblems—and also the algorithmic thinking, understanding how to create a very innovative algorithm, this is something that requires deeper thinking, deeper expertise that probably AI cannot provide today. Coming up with a new algorithm that solves some of these existing problems is still challenging for an AI system to do.
Abraham Daniels: I was actually having a conversation with a former coworker, and I don’t want to date him, but when he was doing his grad school in computer science, he said they didn’t code. Their goal was to think about how to strategically outline your code and what the thought process is behind building it, as opposed to just going and building. And he recently took on a new role in a new space, and he’s had to learn a new language. And it was funny; he was saying, “I don’t have to build code anymore.” I think the gap that I see with a lot of these PhDs coming out is that they don’t have to build code, but they’re never taught how to think through and explain why we’re doing what we’re doing. So he found it a lot easier to actually learn, given that that was kind of where he started. So to your point, Skyler, he’s actually seeing that the better you can structure your code in your head before you start to write it, the easier it is to learn.
Tim Hwang: Well, let that be a lesson, or the word of advice, to all you coders out there who are listening to the show. As always, I say this every single episode, but we are out of time for all the things that we need to talk about. Thank you for joining us, Abraham; we’ll have you back on the show. Kaoutar, as always. And Skyler, thanks for coming on, and thanks for joining us. If you enjoyed what you heard, you can get us on Apple Podcasts, Spotify, and podcast platforms everywhere. And we will see you next week on Mixture of Experts.