Does Gemini 3 live up to the hype? This week on Mixture of Experts, we analyze the release of Google’s Gemini 3 model. Next, OpenAI released a new benchmark about the impact of AI on the economy, GDPval. We debate AI automation and the job market. Then, as always, we talk AI agents: today we discuss some great innovations coming out of IBM Research and more. Finally, Anthropic disrupted an AI-led cyberattack. What does this mean for AI agents and malicious actions? Join host Tim Hwang and our AI experts Marina Danilevsky, Merve Unuvar and Gabe Goodhart on this week’s Mixture of Experts to learn more.
The opinions expressed in this podcast are solely the views of the participants and do not necessarily reflect the views of IBM or any other organization or entity.
Tim Hwang: All that and more on today’s Mixture of Experts. I’m Tim Hwang and welcome to Mixture of Experts. Each week, MOE brings together a panel of the finest minds in technology to distill down what’s important in the latest news in artificial intelligence. Joining us today are three incredible panelists. We’ve got Marina Danilevsky, Senior Research Scientist; Gabe Goodhart, Chief Architect, AI Open Innovation; and joining us for the very first time is Merve Unuvar, Director, Agentic Middleware and Applications Research, AI. All right, lots of interesting topics for today’s episode. We’re going to talk a little bit about, of course, the drop of Gemini 3, some attacks using Claude. But first, we’ve got Aili with the news.
Aili McConnon: Hi, I’m Aili McConnon, a tech news writer for IBM Think. Here are this week’s AI headlines.
Tim Hwang: First, I want to start with the big news of the week, which is the launch of Gemini 3. So long rumored, long teased, but finally out. And it’s a remarkable model. I mean, from some of the benchmarks Google is reporting, explosively good performance on what have been considered some of the most difficult evals and benchmarks out there. So huge leaps on Humanity’s Last Exam, really big jumps on ARC-AGI. But I guess maybe let’s just kind of start with the vibe check. Marina, have you had a chance to play with the model yet? I’m curious about what you think about it and if it feels substantively very different from what came before.
Marina Danilevsky: I haven’t played with it. I’ve looked at a few digests about it. It does seem like there’s a lot of interest in making the more complicated benchmarks be something that’s handleable. It was interesting to me that a couple of people reported that it’s still hallucinating and it still really likes to give answers rather than say that it doesn’t know, although it’s not so sycophantic about it. But it still really likes to give answers. So it’s an interesting combo. It’s still making mistakes and doesn’t like to admit that it doesn’t know something. Maybe that’s saying something about this new set of models.
Tim Hwang: And Gabe, quick question. Would you recommend Gemini 3? Have you played around with it yet?
Gabe Goodhart: Yeah, I’ve played just a little bit with it this morning. And I don’t know, I think my take on this is we’re really starting to see the ecosystem moats evolving. I think this is a necessary step for Google in their AI ecosystem to have a model that is on par with or better than all of their competitors so they can truly claim to be running at the front of the offerings here. But what really struck me about the announcement was that they actually took a swing at a piece of differentiation, because frankly, a really great model is not that differentiated anymore. It’s like, “That’s nice, but for what?” For what most people want to use a model for, we don’t need something better than what we’ve already got.
I thought the thing about the Antigravity editing platform was really interesting because it actually looked like something novel they’re adding to their ecosystem that you can’t get anywhere else. The idea of an agentic IDE is not at all new: there are well-known startups out there doing that, and there are open-source ways to pull that together on your own. The part that I think was novel here was the intentional transition to framing it as a management-of-agents problem.
Tying this all back to, “Would I recommend the model itself?” I haven’t played with this capability, but a colleague of mine has. And the idea of being able to launch a fleet of delegate worker agents that can all work on separate tasks in parallel, and you can manage them, is something that I think has some real compelling chops. So if the model can actually hold up to that level of independence and parallel analysis, it could be a real breakthrough on a net-new capability that you couldn’t get with any other ecosystem. So I’m really excited to try that out and to see where it goes. But from a pure model perspective, I’ll play with it.
Tim Hwang: Yeah, for sure. I think that’s one of the really interesting things coming out of all this. You know, I think we used to marvel even earlier this year—‘Oh my God, new model, incredible benchmarks, look at all this progress.’ But here we are, sitting in November of 2025, and we’re like, ‘Eh, the benchmarks, whatever, awesome. The science is amazing and I can accomplish the same set of tasks that I had before with a much, much smaller model running on my laptop.’
So Merve, do you want to talk a little bit about Antigravity? I did think that this was sort of the big, interesting differentiator in the announcement, to say, ‘Hey, we acquired Windsurf, we’re going to do our IDE and it’s going to be an agentic IDE.’ What’s your take? Where is this all going? And I guess, give our listeners an intuition of why Google is even investing in that kind of differentiation.
Merve Unuvar: I think Gabe alluded a little bit to it, right? The ecosystem play. I think they’re aiming for advanced tool use, the whole agentic applications space, and making tool calls more robust, and also increasing the modalities. They’re claiming you can do editor, terminal, web browser, and many different execution modes. This means you can plan, code, execute, and verify different tasks more autonomously. Agents in Antigravity will also generate artifacts, or that’s what they claim: task lists, plans, screenshots, browser recordings. When you want to take agents to the next stage beyond academic benchmarks and put them out there in reality, it’s quite promising.
But I did play with the model; I didn’t play with the Antigravity platform yet. As Marina said, it’s still hallucinating, but just like other big models, it’s really, really good with the initial prompt, the way you describe your first set of things. They have a ‘build’ section where you can build artifacts, you can build UI elements. So I asked Gemini to create me a workout plan and an interactive dashboard to track my workout sessions. I gave it my weight, height, and age to customize it for myself. The very first UI was really nice. I was multitasking in a meeting, and it was able to build in Streamlit locally in less than two minutes. Then I asked it to add some more personalized pictures, like motivation pictures, and customize it with my name. And then I realized it added a reminder section saying that I should eat high-nutrition food after the workout “to grow.”
I’m a mother with two children. If this was for my kids, I think it would make more sense. But for me, I’m not growing. It totally messed that part up. It already knew my age, so it’s way past my growing age. So again... The overall performance on benchmarks is quite impressive and the claims they are making excited me, because I think this is the largest capability jump we’ve seen in a few months. It’s nice, but it has some flaws, which I personally experienced in my first UI dashboard that I built with it.
Tim Hwang: Marina, hopefully you can help me square this circle here. I think it’s really interesting that your first reaction to this model is “still hallucinating.” We kind of have this funny split-screen experience with these models where they are performing better and better against benchmarks, but yet our everyday... Is it true that the more powerful models simply hallucinate more? Or is this just... they’re seemingly really strong in some things but remain amazingly weak in other domains.
Marina Danilevsky: So I’m going to agree with what Gabe said, which is, for plenty of tasks, I don’t need this really large model. I can do better with a smaller model because hopefully we’re finally getting beyond this idea that we’re going to have one model to rule them all. That never should have been a goal; it’s never going to work for an LLM with the architecture that we have. What you want is a suite—either give them different instructions or different preferences.
If you want to do better on these very complicated benchmarks, which really want the model to think through a lot of things, generate a lot of thoughts and attempts, then you don’t want it to be reticent and sit there saying, “I don’t know anything, I’m going to twiddle my thumbs because I don’t have a citation to offer you.” These are different tasks. This is the same thing as the statistical tension between precision and recall—you’re going to do better in one, you’re going to do worse in the other. This is going to be a consistent thing. Use them for different things.
The fact that the more interesting thing now is automation, the more interesting thing now is really the multimodality... lean into that, because that really is more interesting. Having one model that’s always going to do the best job at giving you the right information? Why? Review your Karl Marx, review your division of labor. Let’s reinvent the normal civilization of people working together. That’s going to be more effective. It always will be.
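The precision and recall tension Marina describes can be made concrete with a toy calculation. This is only a sketch under stated assumptions: the confidence scores and correctness labels below are invented for illustration, and "recall" here means the share of questions the model could have answered correctly that it actually attempted. A model that only answers when it is very sure looks precise but stays silent on things it knew; a model that answers almost everything covers more but makes more mistakes.

```python
# Hypothetical data: (model confidence, whether the answer was actually correct).
# These numbers are made up for illustration; they come from no real model.
ANSWERS = [
    (0.95, True), (0.90, True), (0.80, False), (0.70, True),
    (0.60, False), (0.50, True), (0.40, False), (0.30, False),
]


def precision_recall(threshold):
    """Answer only when confidence >= threshold; otherwise abstain ("I don't know")."""
    answered = [correct for conf, correct in ANSWERS if conf >= threshold]
    total_correct = sum(correct for _, correct in ANSWERS)
    true_positives = sum(answered)
    # Precision: of the answers given, how many were right.
    precision = true_positives / len(answered) if answered else 1.0
    # Recall: of the answers it could have gotten right, how many it gave.
    recall = true_positives / total_correct if total_correct else 0.0
    return precision, recall


for t in (0.85, 0.45):
    p, r = precision_recall(t)
    print(f"threshold={t:.2f} precision={p:.2f} recall={r:.2f}")
```

With the cautious threshold the toy model is never wrong but skips half of what it knew; with the eager threshold it answers everything it knew and is wrong a third of the time.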
Tim Hwang: Yeah. And I think it’s kind of funny where this is all resolving to, because I agree. One of the things that was like, ‘Oh man, this is going to change everything’ was ‘one model to rule them all.’ But if we end up in a world where we have very specialized agents for very specialized kinds of tasks, are we back to App-land again? Are we back to software again? In some ways, we’re kind of reconstructing applications, like specialized software again, which I guess is everything old is new again in some sense.
Gabe Goodhart: I think it’s a healthy tension. I mean, frankly, one of the reasons we’re in the AI moment we are is that the pendulum really swung with the introduction of Transformers, and suddenly you didn’t need a complicated suite of software to get 80% of the solution. That was a real game-changer.
I think what we’re seeing here is... the general-purpose populace is still going to use one chat window. They’re going to jump to a chat window and enter some things. Now, if that chat window becomes an increasingly complicated software machine under the hood, the user doesn’t need to know. So the interface change of “one model to rule them all” I think is sticky; I don’t think that’s going anywhere. But the actual implementation behind the scenes is where it changes. I think we’ve already seen that with the GPT-4 series, we will almost certainly see it with other frontier model offerings, and I think we’ll see the open equivalent of a software stack emerging that allows you to ensemble models for specific parts of your workflow, exposing that nice single entry point that users want to interact with.
So I think it’s a healthy tension. The nice thing here, as a software architect, is you get an abstract interface which is your chat box, and then you get to implement it however you want. We’ll iterate on that implementation because we’re software folks and we like to do that. But I think we’ll swing back and forth a little bit on the complexity behind the scenes.
Tim Hwang: Yeah, for sure. I think it’ll be so funny if in a few years people are like, “Well, rather than a chat window, what if we had a desktop with icons you can click on?” and it’s just like... we’ll be back to where we were.
So. Two very interesting announcements out of IBM recently about agents, really in the last few months, and I think particularly on Gabe’s last comment. I understand one of the announcements is around a project called KUGA, which is billed as an enterprise-ready generalist agent. Merve, do you want to talk a little bit about what the team was working on and thinking about for this launch?
Merve Unuvar: Sure, happy to. As you said, this has been my life since the launch, and we got good feedback. We’re trying to become an enterprise-ready generalist agent, and it’s not an easy task to take on. Where we started, like everybody else who starts to build enterprise-ready agents, is with some simple, traditional ways—build a domain-specific agent. Maybe ReAct code, that simple pattern you take, and then you start evolving it to, “Oh, my task is too complex and my single agent cannot handle this. So let me go and build a task decomposer on top.” Oh, this is now becoming a multi-agent architecture where you have a layer up top that picks the right sub-agent to do it. It’s a classical engineering design principle because it’s easier to distribute to the sub-agents; we believe it’s going to work faster.
And then what we realized is we’re not the only ones that do this. My peer groups in IBM Research, when they build sophisticated agents, they go through this experience as well. Let’s start simple, and then it all of a sudden becomes very complex. And then we stepped back and we thought, maybe we can create a generalized version of this where people can jumpstart using the KUGA architecture rather than building all these things by themselves. So we can give KUGA, which is this multi-agent supervisory layer already embedded in a multi-agent architecture, and people can configure it for their own domain and users.
The traditional way is like: build a domain-specific agent, evolve it, do some custom benchmarks, onboard your own tools, configure your own domain, do your own benchmarks, and then deploy. So that’s our vision. It’s open, it’s out in the open now for people to try and give us feedback and see if it works for their domain. So we’re very excited that we launched in the open so we can capture if what we experimented with in research actually can be mimicked in real-world application uses.
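The progression Merve describes, from a single agent to a task decomposer with a layer on top that picks the right sub-agent, can be sketched in a few lines. This is a minimal illustration, not KUGA's actual design or API: every name here (`decompose`, `route`, `supervise`, the two sub-agents) is hypothetical, and the keyword-based decomposer and router stand in for what would more plausibly be LLM calls.

```python
from typing import Callable, Dict, List

# A sub-agent is modeled as a function from a step description to a result.
SubAgent = Callable[[str], str]


def browser_agent(step: str) -> str:
    # Hypothetical specialist; a real one would drive a browser.
    return f"[browser] {step}"


def api_agent(step: str) -> str:
    # Hypothetical specialist; a real one would call tools or APIs.
    return f"[api] {step}"


SUB_AGENTS: Dict[str, SubAgent] = {"browser": browser_agent, "api": api_agent}


def decompose(task: str) -> List[str]:
    """Toy decomposer: split the task on ' then '."""
    return [s.strip() for s in task.split(" then ") if s.strip()]


def route(step: str) -> str:
    """Toy router: crude keyword match to choose a sub-agent."""
    return "api" if "api" in step.lower() else "browser"


def supervise(task: str) -> List[str]:
    """Supervisor layer: decompose the task, then dispatch each step."""
    return [SUB_AGENTS[route(step)](step) for step in decompose(task)]


results = supervise("open the pricing page then call the billing API")
for line in results:
    print(line)
```

The point of the shape is that the supervisor, decomposer, and router stay the same when the sub-agents change, which is the part a configurable generalist layer would aim to provide so that teams do not rebuild it each time.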
Tim Hwang: Yeah, that’s really exciting. I think one of the things we’ve been watching really closely here at MOE, and what I love about the agent competition world right now, is that we’re very much in the world of norm-setting. ‘We’re doing it this way. We hope you do it this way as well.’ I think there are various projects that are more or less successful at attempting to build those standards.
What’s really intriguing, and Gabe, I’m curious if you want to talk about this in the context of KUGA, is that, Merve, if I’m hearing you right, everybody starts by building an agent, and they all discover exactly the same problems over and over again. Everybody’s going through that process of rediscovery right now. I guess, Gabe, that’s pretty promising from the point of view of, “Okay, let me just shortcut this. Here’s a standard framework.”
Gabe Goodhart: Yeah, I’ve also been thinking a lot about this, having a lot of conversations with different teams building different components. One of the things that really seems to be true is that there are emerging slots for an abstract architecture for agents in open source, and presumably in closed source (but we don’t know how those tools are implemented necessarily). The generalist agent is absolutely one of those slots. And having an open offering that’s configurable and permissively licensed is a really awesome place for people to start collaborating and building.
Tool management is another big piece of this, and it just seems to be coming up over and over again that this sort of emergent architecture is there. To the point about refining and iterating on the actual agentic architecture itself, the analogy I keep coming to is: if I asked anybody out there working at a company, “Please go build me a REST API server for X, Y, and Z,” I wouldn’t have to tell them the architecture. I wouldn’t have to tell them what programming language to use. It’s just a well-established pattern that everybody knows how to do if they’ve ever touched cloud software. Agents aren’t there yet, but as many people have said, 2025 is the year of the agent. I think by the end of 2025 we’re going to be close to actually hitting that point where we can just say, “Hey, build an agent for this,” and everybody just knows what you mean.
As you exactly described it, Merve, I think the decomposition is exactly that step from, “I got it running in Flask on my local machine with HTTP,” to “Now I’ve got a server that has middleware for authentication and serves TLS and can be horizontally replicated.” Those are the steps you take when you’re building a microservice after you get your demo app running. The same thing you’re describing with KUGA is exactly what people are hitting after they get their first ReAct agent off the ground.
Merve Unuvar: You mentioned, “Oh, here is the agent, go use it.” We’re really trying not to put people in a box of, “Okay, this is KUGA and you have to use it.” The configuration piece makes it easier for people to say, “I need this, but I can configure it this way.” Also, what we did, and this is maybe a good time to introduce the ALTK, the Agent Lifecycle Toolkit that we also released in the open, is we componentized KUGA. We built different components to support KUGA’s different capabilities, like memory, guardrails, and other things that make KUGA function in the real world.
But some people may not want to start from KUGA. They may still have their own sophisticated agent that they built and don’t want to move it to KUGA. So they can reuse these components under the Agent Lifecycle Toolkit. If they want the memory piece, they can take it and apply it to their agent. This is again about democratizing and not really pushing people to use this specific thing. There’s a flexible design; you can take different components and apply them to your current agentic implementation if you want to improve certain aspects of your agent.
Tim Hwang: Marina, I want to go back to the comment you made earlier when we were talking about Gemini 3 and this movement to more specialized agents over time. It strikes me that we will almost reproduce human org structures in agents, because you’ll have this agent that’s kind of the middle manager—its role is to manage other agents. Do you think that’s kind of where we’re headed ultimately? We’re moving away from “one agent to rule them all,” but there will still be these generalist agents, and really their role will be that middle manager in the org chart.
Marina Danilevsky: From biology to software, you have this combination of hubs and spokes. There’s a real reason that you end up settling on that. Maybe it’s going to be a different number of spokes, more hubs, fewer hubs. But as you figure out the way to solve a particular problem, that’s still the pattern: you need some specialists and you need somebody doing the managing and the planning.
So yeah, what’s interesting about this era is that we are able to go faster, further than we thought. But if you take that 10,000-foot view, it’s still, “Alright, I’ve got a task, maybe I can delegate it to the spokes a lot faster.” But you still somewhere in there need a hub where you say, “Okay, this is what you do next, this is how you know that you’re done, this is what you try.” It’s very natural and very correct. The technology to get us to go faster is great and very exciting. But yeah, this is the normal pattern of problem-solving.
Tim Hwang: Yeah. The future is kind of like figuring out how you staff your project with different kinds of agents, and maybe a couple of people in there just to keep an eye on things. There are some actual humans in there.
So, maybe a last point, Merve. Where are you headed next with all this? I know you said this has been your life since the launch, but where does KUGA, where does ALTK go next?
Merve Unuvar: Just like with Gemini 3 benchmark results... we started with KUGA and then we went out and found the most challenging and representative benchmarks we could go after, which were WebArena and AppWorld. We were number one on both of them for a long time, and we kept trying to keep our position. But it’s very different to keep our position in benchmarks as number one versus putting it out there and hearing directly from the users where it breaks.
For example, latency is a problem right now. When we built KUGA, we really focused on accuracy and how well KUGA listens to the task. But latency is one of the requirements that came from real users when we launched outside, saying, “Okay, this is too slow for me to use.” So we have a bunch of things we’ve captured from the community that we would like to incorporate.
Also, when I mentioned the ALTK—the core components that help agent builders boost their agent performance—there are a couple of things we’re working actively on. One is memory, and I’m not talking about storage and data structures. I’m talking about: what can you make out of this memory? What do you want to remember? What do you want to forget? What is the middle ground where you want to keep learning? Because some tool combinations may never work, and if you have that trajectory saved in memory, can you bring it up and do some self-learning for KUGA or other agents?
The other one is consistency. This is extremely important, and there is not a single definition of consistency in the literature. People define it maybe sometimes with repeatability. But in an enterprise setting, and also in consumer settings, it’s important. You don’t want your agent to do something one way one day and a very different way another day. So, how can you bring this consistency to real-world agents so they are, within their own world, consistent with their behavior and don’t throw ridiculous answers one day when you ask the same question?
So these are the two main topics we are working on. We’re excited that we’re making progress towards getting real feedback from the community and also advancing the capabilities of KUGA with these components.
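Merve notes that there is no single definition of consistency and that people sometimes equate it with repeatability. One simple way to put a number on that narrow reading is to ask the same question several times and measure how often the runs agree. The sketch below assumes exact string matching on final answers, which is a deliberate simplification; the function name and sample outputs are hypothetical and not part of ALTK.

```python
from collections import Counter


def repeatability(outputs):
    """Share of runs that match the most common output (1.0 means fully repeatable)."""
    if not outputs:
        raise ValueError("need at least one run")
    # most_common(1) returns [(output, count)] for the modal output.
    most_common_count = Counter(outputs).most_common(1)[0][1]
    return most_common_count / len(outputs)


# Hypothetical final answers from four runs of the same request.
runs = ["refund approved", "refund approved", "refund denied", "refund approved"]
score = repeatability(runs)
print(f"repeatability = {score:.2f}")
```

A score like this only captures whether the answers are the same, not whether the path the agent took was the same, so it is a starting point rather than a full definition.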
Tim Hwang: That’s great. Well, we’ll have to track the project. How do people find out more about it if they want to keep up with your work?
Merve Unuvar: Sure. There’s kuga.dev. This is the KUGA website, where you can go to GitHub and find all the blog posts and other things. And we have altk.ai for the individual components that constitute and help KUGA perform better. So if they go to these two websites, they’re all good.
Tim Hwang: Nice. Yeah, those are the solid TLDs right there.
Well, I’m going to move us on to our next topic. This is a recurring theme for us in 2025, an interesting ongoing set of discussions about how AI will impact the economy. I think overall, the discussion has matured over the last 12 months. I think in the beginning of the year we were still very much like, “All the jobs are going to disappear.” And I think now we’re getting more to the mode of, “Well, let’s do an eval on that.” We’re approaching it in a very machine-learning way.
So, OpenAI announced a benchmark they call GDPval. Essentially, they’re trying to say: we have all sorts of benchmarks trying to evaluate AI capabilities, but a critique against a lot of them is that they don’t tend to be very realistic. GDPval curates a set of tasks from actual professions and evaluates whether AI is able to produce outputs on par with a human expert. They run this as a way of trying to get an assessment of what the effects of AI will be on the economy, particularly against these economically valuable tasks.
There are some interesting results. I think the big headline is that, even though it’s a benchmark from OpenAI, they discovered that Claude Opus 4.1 is the strongest performer against these tasks and, in some cases, is able to reach near parity with human experts.
So, Gabe, maybe I’ll turn it to you. I’m curious what you read from these types of results. Are we still kind of back where we were in, like, December 2024, which is like, “Oh God, AI is going to replace all the jobs”? These are certainly impressive results, but how do you parse through it?
Gabe Goodhart: I mean, in professional settings, the promise of AI has been: take away the stuff I don’t want to do so I can spend more time on the stuff I do want to do. I think it’s really easy to poke holes at benchmarks because measuring things is really difficult. So I want to upfront say this is a really good stab at a new aspect of benchmarking, and I think especially the reliance on human experts is important.
The holes that I saw immediately were: they’re still doing this as basically one-shot artifact creation as the benchmark. Most of the time I spend at my job doing things I don’t want to do involves investigation, asynchronous communication, etc. In fact, when it comes time to create artifacts, that’s the stuff I do want to do. Writing code is my happy place, even if I’m using an assistant. What is less happy is walking around trying to find the correct way to implement something, or looking through a giant pile of corporate docs.
Every benchmark makes approximations so it can translate from a fuzzy human space into math. That’s the nature of benchmarking. In this case, they’ve made some approximations that, while valuable, still have holes.
The other interesting part was that they used human graders as the gold standard for evaluating. The whole reason we’re in this GenAI boom is that we figured out ways to not have humans labeling data, and this sounds like we’re back to humans evaluating data. They also created a proxy with another model, but now you’ve got AI evaluating AI and you’re in a recursive loop.
It’s really interesting to see this try to tackle real-world problems. The other piece they didn’t articulate very well was: amongst the classes of problems a profession tackles, is this tackling the hardest ones or the simplest ones? Responding to email or Slack is not the most mentally challenging thing, but it’s a lot of what I do. I’m sure the same is true for a lawyer, doctor, or nurse. So I’d be curious if they evaluate whether these are the low-lift or high-lift tasks.
Tim Hwang: Marina, you’re smiling through Gabe’s explication of GDPval. Curious about your take. What I’m hearing from Gabe is “better, but maybe not good enough” at assessing what is ultimately a very complex thing. Like, what is a job?
Marina Danilevsky: I really found it interesting diving into this. First of all, props to whoever in their comms or marketing team thought several months ahead about what the write-up headlines were going to be, because the headlines were “AI can now do half of our jobs.” Great. That’s not what you’re supposed to get from this, but fantastic.
I like what they’re trying to do very much. If you actually go read through the data points, which they made available on Hugging Face—very good on transparency—it’s mostly planning tasks, summarization tasks, the kind of things LLMs are pretty good at. Reading between the lines, it did seem they had a lot of different submissions and narrowed them down until they had a set of maybe similar-looking tasks, even though it was across a number of jobs. We’re still only talking about a couple hundred tasks that got made. So that is a point: there are only going to be so many.
I completely agree with Gabe. When you look at the prompts themselves, they looked very detailed, very refined—somebody with a very clear idea of what they want. A lot of pre-work goes into it before you ask, “Hey, write me this summary.” “I have prepared some files, some reference Excel, some sites for you.” So there’s a lot of pre-planning, and then this does the, “Alright, fine, put it all together.” It probably still helps a good amount, and it would give an example to people who don’t understand: “What kind of jobs can AI be decent at?” Look at these 200 tasks and see what it’s actually pretty good at. I like that part.
As far as the evaluation goes, it’s pairwise comparisons. Is it the most sophisticated, detailed thing? No. Having done evaluations for years, I’d guess they tried a variety of things and got a lot of noise because the artifact from each task is detailed and noisy, difficult even for a human to judge. So they ended up going with a win rate. I wouldn’t listen to the headlines. I’d look with interest at the paper, especially the appendices where they go into the evaluation process.
These are all prompts. There’s not a lot of breaking it down, which is exactly what Gabe said and exactly what we were just talking about with Merve: “First do this, then do this, here’s an agentic plan.” This is a prompt, and it primarily asks the model to present something akin to a plan you’d execute downstream. It’s a very particular type of task. In the future, it may be interesting to have multiple models with multiple capabilities, maybe specific models. Is evaluating that going to be exponentially harder? Probably, which is why this benchmark is the way it is. It was intelligently done, ready for headlines. But I do like the interest in real-world tasks and the lens this can show on which jobs we talk about when we say “AI taking our jobs.” I wish there were more write-ups that dove into the actual data; I find that more interesting than the final evaluation.
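The win rate Marina mentions is simple to compute once each task has a pairwise verdict. The sketch below is an assumption-laden illustration: the verdict labels and sample data are invented, and it is not the benchmark's actual scoring code. It reports both a strict win rate and a "wins or ties" rate, since the two can tell quite different stories about parity with human experts.

```python
from collections import Counter

# Hypothetical grader verdicts: for each task, which deliverable was preferred.
VERDICTS = ["model", "human", "tie", "human", "model", "human", "tie", "human"]


def win_rates(verdicts):
    """Return (strict win rate, win-or-tie rate) for the model across all tasks."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    counts = Counter(verdicts)
    n = len(verdicts)
    wins = counts["model"] / n
    wins_or_ties = (counts["model"] + counts["tie"]) / n
    return wins, wins_or_ties


wins, wins_or_ties = win_rates(VERDICTS)
print(f"win rate = {wins:.2f}, win-or-tie rate = {wins_or_ties:.2f}")
```

On this made-up data the model is preferred outright a quarter of the time but is rated at least as good as the human half of the time, which shows how much a headline can depend on which of the two numbers is quoted.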
Tim Hwang: Yeah, absolutely. That was a great analysis.
Merve, maybe a final question for you. There’s a really interesting tension in what Marina said. You want the eval to measure real-world task risks, but the real world is really messy, so methodologically it’s hard to do a good eval. Part of me thinks we’re going to spend a lot of time developing increasingly sophisticated evaluations to measure economic impact. But also, the economic impact is just going to happen to us. I’m curious about the value of this exercise. We can spend time creating better proxies, but we’re also in the middle of it. So, do you think these evals are more important or less important with time as a research area? Where should we be going with this kind of work?
Merve Unuvar: Discussing this made me remember a client conversation last week in Istanbul. He’s the CEO of one of the leading hospitals there, and he told me it makes him uncomfortable to see that each doctor has a medical secretary attending patient visits. This is one of those 44 occupations included in GDPval. This is exactly the description of people ready to adopt LLMs, because for certain tasks this is a perfect use. The LLM will listen to the conversation and summarize. Summarization is a very popular use case.
If you just take this task, fine, we can benchmark and evaluate. But there are implications for other things, added benefits that would be almost impossible for a benchmark to measure. Consider these other aspects: you can derive insights from doctor-patient conversations, extract best practices, do better process optimizations like supply planning for surgery, and customize follow-up content. There are many different added benefits, but each is so difficult to measure. To Marina’s point, even non-human evaluations aren’t easy. Now we’re adding the human component and derivatives of this one occupation.
It’s real. I had this conversation last week with a person looking to implement this. So now it comes to where the real world is heading versus what we benchmark. How do we benchmark when you implement these systems? You can track added benefits as ROI—how much better your supply management is for surgeries, etc. So you can have KPIs. But on paper, just purely scientifically looking and trying to mimic and understand the combination of things these tasks can lead to, and then benchmark against all of them with data, whether human- or AI-annotated, is going to be an extremely difficult and overwhelming exercise.
But I do like this because it’s helping us get out of our academic mindsets when we compare models and usages, and look at real-world examples and where industries are starting to adapt and change.
Tim Hwang: Yeah, I love that. It’s almost like the eval is also a useful exercise in asking, “How do we decompose this task anyway? What do we do with our days?” That is an important part of it.
Great. I’m going to move us to our last topic, which is weirdly related to what we were just talking about. The final news is this interesting story tackled in more depth on the Security Intelligence podcast. Chris Hay, a frequent MOE panelist, was on there debriefing it. I’ll give the summary and link it to what we were just discussing.
Anthropic disclosed that they discovered an actor, which they believe to be a state actor, was misusing Claude to launch a sophisticated cyber attack. They have a long blog post breaking down what they discovered. The thing that stood out was this quote: “The threat actor was able to use AI to perform 80 to 90% of the campaign, with human intervention required only sporadically, perhaps four to six critical decision points per hacking campaign.”
As far as I know, this is the first real dead-to-rights example of what people have been theorizing as “AI hacking”—the idea that all this agentic technology, when used for good stuff, also gets used for bad stuff.
Linking it to what we were talking about earlier... Merve, maybe I’ll kick it to you as our agent expert here. It feels like if we had a GDPval but for evil tasks, it would show AI making a real impact in cyber attack operations. The question is, do you think agents are going to favor illegitimate use cases faster than legitimate ones?
Merve Unuvar: Even though I work on agentic systems, I’m not a security expert. I did chat about this with a co-worker, Ian Molloy, who leads our agent security work. His words were: “It will be extremely difficult, maybe impossible, to prevent malicious use of these agents while preserving their legitimate use,” because they’re designed that way. We want them to be flexible, to listen to instructions, to do what we tell them. On the other hand, when we tell them malicious things, with current alignment approaches, I think it’s going to be impossible.
But what we can do—security researchers have been predicting this since the beginning of the LLM era—what I love is that Anthropic had the perfect telemetry and monitoring to be able to talk about this and show what happened. I think we can instrument our systems with observability layers or monitoring capabilities that show us transparently what’s happening. You can maybe revert back or learn about what happened. We can’t control these things, but we can build additional components for security—authentication, etc. For enterprise settings, that might be easier than broad use, because you can add controls.
But in broad use, the agent just does what the user says. If we instrument systems with components like observability, robust telemetry, and monitoring, I think that can bring trust to users while they’re using these powerful systems.
Tim Hwang: Gabe, I was joking with a friend that there’s probably a team at OpenAI relieved they weren’t involved in this attack, but also jealous they chose Claude vs. OpenAI. I have an interesting question: why not use open source here? It feels incredibly risky for an actor to go with a cloud model that’s monitored. Versus saying, “We’ll run our own on-prem solution.” Do you have a hypothesis?
Gabe Goodhart: My answer is that they probably are. This is just the one that got caught. To your point about OpenAI feeling bummed—maybe they just don’t have the telemetry to notice. We’ve seen one example exposed to sunlight, but that doesn’t mean it’s the only one. I strongly suspect, and they mentioned in the article, it is not.
The frontier models are generally closed. Even with extremely capable open models, they’re extremely hard to run at scale. If you’ve got a team that’s expert in cyber hacking, they probably aren’t experts in running expensive GPU rigs. What you get with a frontier model is the hands-off nature versus trying to put it together piecemeal.
The interesting thing in cybersecurity is that attackers always have the advantage unless they’re targeting one specific thing. If their goal is to get what they can, there’s no penalty for screwing up; you just keep trying. This same set of actors may very well also be banging on an open-source version, using GPT, etc. Try them all—what’s the downside other than cost? The defenders have the more difficult task of catching everything that slips through. You have to have ironclad practices. Screwing up has huge penalties.
The thing that struck me was that, even though the agent was taking on a lot of the decision-making and scripting, it was still exploiting standard vulnerabilities. It was looking for exposed systems running software versions with known vulnerabilities and then exploiting them. If anything, this puts a finer point on enterprises needing to stay up-to-date with CVE patches and best security practices. Really do them, because someone’s going to find it.
The defensive side has the AI element—how do we catch these? What do we put in our AI systems to make sure they can’t be exploited?—and the good old-fashioned cybersecurity: patch your software, or you’re going to get hacked. It’s an all-of-the-above strategy. But this makes it more urgent that patch timelines tighten up.
Tim Hwang: I love that we’re landing on a place which is just, at the end of the day, these are not very complex tasks being implemented.
Marina, I’ll give you the last word if you have any hot takes before we close.
Marina Danilevsky: Nah, I agree with Gabe. I’m not a cybersecurity expert, but it seemed like here it was scale, not creativity. Try the same well-known things, but a lot faster than a human could. Probably a lot of those things could be done.
One thing I liked was the Anthropic team slipped in that they used Claude to analyze the logs. Something that comes to mind: should we think that the models themselves know what they would use if turned to evil, so they’re more likely to catch themselves than a human might? Different models have different biases, so maybe have OpenAI models check Anthropic models...
But at the end of the day, it’s almost like... I agree with Gabe: just do the basics, because now your basics can be broken a lot faster. So you really need to get the basics right. Please, please.
Tim Hwang: Well, with that bit of very good advice, we’ll close the episode today. Marina, Gabe, Merve, thank you for joining us for the show today. That’s all the time we have, and thanks for joining, all you listeners. If you enjoyed what you heard, you can get us on Apple Podcasts, Spotify, and podcast platforms everywhere. We’ll see you all next week on Mixture of Experts.