Claude 3.7 Sonnet, BeeAI agents, Granite 3.2 and emergent misalignment

Episode 44: Claude 3.7 Sonnet, BeeAI agents, Granite 3.2 and emergent misalignment

IBM® Granite™ 3.2 is officially here! In episode 44 of Mixture of Experts, join host Tim Hwang and experts Kate Soule, Maya Murad and Kaoutar El Maghraoui to debrief a few big AI announcements. Last week, we covered small vision-language models (VLMs), and this week Granite 3.2 dropped with new VLMs, enhanced reasoning capabilities and more! Join Kate as she takes us under the hood to understand the new features and how they were created. 

Anthropic dropped a new flagship model, Claude 3.7 Sonnet, and a new agentic coding tool, Claude Code. Hear the experts explore why Anthropic released these separately. Then, because we cannot have an episode without covering agents, Maya takes us through the new BeeAI agents! Finally, can fine-tuning on a malicious task lead to much broader misalignment? Our experts analyze a new paper on ‘Emergent Misalignment’ to uncover the risks. All this and more on this week's episode!

Key takeaways:

  • 00:01 – Intro
  • 00:35 – Claude 3.7 Sonnet
  • 11:58 – BeeAI agents 
  • 22:17 – Granite 3.2
  • 32:31 – Emergent misalignment

The opinions expressed in this podcast are solely those of the participants and do not necessarily reflect the views of IBM or any other organization or entity.

Listen on Apple Podcasts, Spotify, Casted and YouTube.

Episode transcript

Tim Hwang: What is your favorite video game? Kate Soule is Director of Technical Product Management for Granite. Kate, welcome back to the show. What do you prefer?

Kate Soule: I really liked The Legend of Zelda: Breath of the Wild. That series is so good.

Tim Hwang: Maya Murad is a Product Manager for AI Incubation. Maya, welcome to the show. Favorite video game?

Maya Murad: Have to say GTA.

Tim Hwang: Okay, that’s awesome. And then, Kaoutar El Maghraoui, a Principal Research Scientist, AI Engineering, AI Hardware Center. Kaoutar, what do you think?

Kaoutar El Maghraoui: I like Minecraft, which I think is a cultural phenomenon allowing players to build and explore in this sandbox environment, which I think is pretty cool.

Tim Hwang: All that and more on today’s Mixture of Experts. I’m Tim Hwang, and welcome to Mixture of Experts. Each week, MOE brings you the nerdy chat, banter, and technical analysis that you need to understand the biggest headlines in artificial intelligence. As always, there’s a ton to cover. We’ve got new announcements coming out of BeeAI, a new release of Granite, and a really interesting paper around emergent misalignment. But first, I really wanted to talk about Claude 3.7 Sonnet and Claude Code.

This is one of the big product announcements for the week. Anthropic announced the latest generation of its premier model, Sonnet—the 3.7 model—as well as a new coding agent they’ve been playing around with.

Let’s start with 3.7. I know, Maya, you’ve actually had a chance to play with this new model. Curious for your early impressions—things that are working, not working, whether or not you like it at all. Just curious about your hot-take review.

Maya Murad: Yep, I did try it out, and I was actually surprised that it was only a 0.2 version upgrade. The last one was 3.5, which was known to be good at coding but maybe wasn’t my go-to for writing. I actually tried the 3.7 on a writing task, and I was blown away by it.

The second thing that’s really coming through with these Claude models is the emphasis on experience—it’s a bit more subtle. I think they’re curating their training data to provide a somewhat opinionated experience, more in the Apple way, giving you a good experience. I’m starting to see a wedge between what Claude is doing and what OpenAI is doing with their models.

Tim Hwang: Yeah, that’s really interesting. We’ve talked on the show before about how the competition between these big foundation models is going to evolve. That bit is pretty interesting. Kaoutar, do you get a similar sense that Anthropic is almost playing a style game now, and the battle is moving from capabilities to this new thing? I’m curious what you think.

Kate Soule: I really like the comparison of Anthropic to being the Apple equivalent in the field. One thing they did with the 3.7 release that I’m really excited about is they released reasoning, but in a very pragmatic way. You can basically choose how much you want to spend—how many tokens you want to generate—because you don’t need a ton of reasoning for all tasks. It gives you the ability, for more complicated things, to quote-unquote “pay more,” both in terms of latency and token cost, to improve the model response.

That feels like a really pragmatic, usability-focused approach to reasoning that we haven’t seen yet, and I think it’s going to quickly become the norm. If we look at where reasoning can add value, we need to have it as a knob we can selectively apply, not something where every response, including “what is 2+2?”, comes back with five paragraphs of reasoning and a ton of latency.
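The knob Kate describes can be sketched in code. This is a hypothetical illustration, not an exact client call: the payload shape loosely mirrors how an extended-thinking API such as Anthropic's exposes a token budget, but the field and model names here are illustrative assumptions.

```python
# Hypothetical sketch of "reasoning as a knob you pay for": the caller
# chooses how many thinking tokens to spend per request. Field names
# mirror an extended-thinking-style API but are illustrative only.

def build_request(prompt: str, reasoning_budget: int = 0) -> dict:
    """Return a request payload; reasoning_budget=0 disables reasoning."""
    payload = {
        "model": "claude-3-7-sonnet",  # placeholder model name
        "max_tokens": 1024,
        "messages": [{"role": "user", "content": prompt}],
    }
    if reasoning_budget > 0:
        # Pay more (in latency and tokens) only for tasks that need it.
        payload["thinking"] = {"type": "enabled", "budget_tokens": reasoning_budget}
    return payload

simple = build_request("What is 2+2?")                       # no reasoning spend
hard = build_request("Prove the claim step by step.", 4096)  # buy more reasoning
```

The point of the design is that "what is 2+2?" costs nothing extra, while harder prompts can selectively buy more deliberation.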

Tim Hwang: Yeah, it’s funny seeing it emerge because it’s a new paradigm for computing. In the past, you’d just execute a program. Now you have to specify, “I want you to try really hard at it” as a separate option. It’s interesting figuring out how to make that a natural option you can flip on and off. Kaoutar, maybe I’ll turn to you. One interesting bit is not just the new model, but that they’re also playing in the coding agent space.

It’s actually very funny if you read the blog post. They’re like, “We really believe in an integrated experience where reasoning and the model are all together,” and then, “Oh, by the way, we also have this completely separate thing we’re announcing.” I’m curious why you think they’re breaking out Claude Code as its own separate functionality. Is this going to increasingly become its own thing, or is it just an experiment that will eventually get integrated?

Kaoutar El Maghraoui: Yeah, that’s a great question. I was also a bit surprised they’re separating the code from the other models. But they’re probably focusing on this agent for coding, which is still in a limited research preview. I think they’re still experimenting with it, and I’m hoping eventually it’ll be integrated with the rest of their models and their bigger vision. They’re trying to focus on assisting developers by autonomously performing code-related tasks like searching, reading code, and editing files. I think the reason it hasn’t been integrated fully is that it’s still in this limited research preview and deserves its own evaluation and focus.

Tim Hwang: This kind of goes to the general question of how we do good evals. It almost feels like the evals are the tail wagging the dog, forcing product differentiation because you need a team that gets really good at one eval. Over time, it becomes a different product because you’re working against that eval so hard. It’s very interesting to see.

I promised I would tie back the top-line question about favorite video games to AI headlines, and I wanted to tie it to the Claude launch. One fun bit about the launch is that, in addition to the usual benchmarks, they showed how their models perform against Pokémon and how far it got in the game. I love this because it’s a playful thing to do—to Maya’s point, it’s style points. But it was also interesting because I remember back in 2016, everyone was talking about how far you could get in Atari or arcade games. That was the eval in that early phase, but it disappeared as more formal benchmarks got serious.

With this, it was interesting how excited people got. I have a friend at Anthropic who told me office productivity was shut down because they were just watching to see how far Claude could get in Pokémon. I just wanted to bring this up because it’s almost the return of the video game eval. Is it useful, or is it more of a gimmick? Kate, should we see this as a paradigm for evals we should explore, or is it just a fun thing to see AIs try to get through a video game?

Kate Soule: I remember when Twitch first came out—“Twitch Plays Pokémon” is what made it famous. The world stopped to watch as everyone suggested the next step; it was like a random function generator going through Pokémon. Instead of a Claude model choosing the next move, everyone submitted a vote, a random amalgamation of inputs selected an action, and the game proceeded. And it got really far. So, if you’re asking whether this is a useful evaluation: a random number generator was able to play Pokémon successfully if you waited long enough.

That aside, I think what made these games popular, especially back with the Atari games, is that they have reward mechanisms, so you can use reinforcement learning to incentivize the model to play. All sorts of interesting things can happen, like the model deciding it’s too hard and just giving up. So, it’s certainly an interesting ecosystem to use for evaluation and to develop more reward-system-based training protocols. I think it is useful from that perspective, but I also want to note that a random number generator played Pokémon, so I wouldn’t take it with too much weight. I think it’s more a fun cultural thing.

Tim Hwang: Yeah, for sure. Are you kind of saying that the return of reinforcement learning is making games cool again? Is that the right way to read it?

Kate Soule: Probably, yeah.

Kaoutar El Maghraoui: Tim, I might have a different take on this. I was really excited to see Anthropic using Pokémon for their eval instead of standard AI benchmarks. I think Pokémon is the perfect control environment for testing the reasoning aspects of AI. The AI must understand game mechanics, opponent moves, and how to optimize strategies. It involves real-time decision-making and uncertainty, and it kind of mimics real-world AI applications. It’s also dynamic—unlike static benchmarks, Pokémon battles force the model to adapt continuously.

What does this say about evaluation trends? I think standard benchmarks like MMLU or TruthfulQA are limited; they test knowledge but not real-time decision-making. If we start introducing gamified evaluation methods like Pokémon battles, these might be more accurate ways of measuring reasoning and adaptability.

Tim Hwang: Yeah, I’m really interested in this. We’ve talked a lot on the show about how existing evals are limited and seem to be getting more limited with time—people report results, and others say, “Ah, whatever.” My worry is that everyone then says, “Okay, well, just vibes then.” This seems like another path—an eval that’s not very standardized but seems more objective than playing with a model for 15 minutes and thinking it’s better or worse.

Maya Murad: I think I’m somewhere in between Kate and Kaoutar. I’ll give them points for trying something new—we’ve talked a lot about how benchmarks are imperfect but necessary. It’s also really interesting that we’re going back to using games to simulate model performance. I had a brief stint at Unity Technologies, the game engine company, and at the time, all their AI work was on reinforcement learning agents running in their game simulation environments. So it feels like we’re going back to how agents initially came about.

Game environments are great because they provide a clean environment to run a test and get a clear result. But at the same time, what’s interesting about today’s technology with LLM-based agents is they can operate in fuzzy environments. I think we need better, reliable benchmarks for operating in fuzzy, changing, non-standard environments. It’s difficult to find these, so kudos to them for trying, and I’m sure there will be more innovative testing methods coming forward.

Tim Hwang: Yeah, it gets me thinking about all the possible games you could apply in this space that might make for interesting evals and test different aspects of agent behavior.

Well, on the topic of agents, I want to move us to our next topic—a great segue. I want to talk about BeeAI. This is an ideal topic because you’re here, Maya. BeeAI, for those not familiar, is IBM’s agent framework. I understand there’s been a new release that just dropped. Maybe you can kick us off by talking about what’s launching and what the big changes are that people should pay attention to.

Maya Murad: Yeah, of course. Just to frame this, it’s been almost a year that my team has been on this journey of incubating AI agents. We started with the premise of how we can make it easy for anyone to reap the benefits of this technology, all the way to the everyday builder—someone who might not be familiar with writing code but understands their own processes well and has a good intuition for how to improve them. That kind of fed all the requirements for how we needed to build agents and was the main motivating factor to build our own framework. We didn’t find the capabilities we needed at the time to power this experience.

This also led to another decision. If you look at most frameworks that existed at the time, they were all in Python, and we needed something in TypeScript because we’re doing a production-ready web app. That was a great learning.

I think we recapped the year with very strong signal from the developer community. The top ask was for a Python framework, which we have in a pre-alpha right now that will graduate to alpha next week. The second really interesting learning is that there’s not one agentic architecture to solve every single problem. Last year, when we were talking about agents vaguely and fully autonomous agents, there was this hint or promise that maybe with the right model and architecture, you could solve a spectrum of problems. But from a year of learning and observing developers, every single use case is its own snowflake. You have to take the acceptance criteria, that domain, and really build your requirements and system around it. I think the changes we’ve made in the framework reflect the reality on the ground of how you can make useful stuff from models.

Tim Hwang: Does that mean you think over time we’ll see agent frameworks become more specialized? Is the dream of the generalized agent just not a practical reality?

Maya Murad: Yeah, I think there are two different plays here. Frameworks will either be narrow and opinionated or unopinionated and horizontal. This is a really interesting paradigm because if you want to do a code agent, you have to learn a whole set of capabilities. It feels like we have many walled gardens.

That’s kind of what our next direction is. We’re thinking about a world where you’re not locked into these different agent ecosystems—not locked into a specific framework or language—where all these agents can come together, self-discover their capabilities, and you can orchestrate them without caring which framework they’re implemented with. If you refer to our statement of what’s coming next, we’re really excited about agent interoperability. This is the true premise of people working early in the days of agents: what if an agent can self-discover other agents and collaborate to solve a problem? This is a step in that direction, and we’re making a really cool announcement about that in two weeks.

Kaoutar El Maghraoui: So, Maya, do you think we’re moving towards standardization—like creating an open-source standard for agent interactions, APIs, how they discover each other, and things like that?

Maya Murad: Absolutely, that’s a great question. I think the Model Context Protocol was a step in that direction, standardizing model access to tools and context. I think agents are what’s next. The core of what will power this interoperable experience is coming together on standards. But the thing with standards is you can design them by committee, but if you drive it via features, you have a better incentive to bring a broader pool of people together on the standard. That’s kind of our approach—let’s show you the art of the possible with an interoperable agent world, and that’s the hook to work on a standard.

Tim Hwang: Yeah, interoperability is so important. Otherwise, it feels like we’re just designing apps. The dream is that agents are general, they can roam, they can be interoperable. This is the big question for all projects attempting to preserve openness in the space: how long can they avoid the centrifugal force of people creating walled gardens that can only talk to themselves?

Kate Soule: What’s really exciting about Bee, again, is that interoperability. On the Granite side, we’ve been working closely with the Bee team on a number of demos and examples. It’s really great to see the level of flexibility you can build into an agent and deploy it. I’m really excited to share some of those resources in the coming days with Granite Bees.

Tim Hwang: Maya, maybe two last questions. One for you: in all these discussions, it’s almost become a joke. Every time we have Chris Hay on here, he’s like, “Agents!!” and makes a big deal about it. It’s sometimes hard for folks, including myself, to put their heads around what it means when an agent is doing something. Is there a demo you always point people to when they’re curious about why agents are important and exciting? I’m curious if there are examples you want to throw out.

Maya Murad: So, there’s actually a great YouTube video that the IBM Research team put out. I think it’s called “SWE Agent,” and it’s really interesting because it shows the art of the possible within an interesting user experience.

Let me paint the picture of how it was before. If I wanted code assistance, I’d have to, say, use a plugin in VS Code. It would observe what I’m doing, but I had to copy-paste things left and right and have several touchpoints to fix one file, for example. This completely flips the paradigm of how to solve software engineering problems.

Here, the user experience starts with: I have a ticket in GitHub that outlines a bug. I invoke and assign this ticket to the agent. The agent then goes through all the files in the repo, comes up with a plan—you can approve or change the plan before it goes ahead, or just let the agent go ahead—and then the agent comes up with a PR. You’re no longer in this instantaneous mode where you ask a question and immediately get an answer. This is something you let run for an hour or two, but you’ve automated a significant chunk of work. If you had hundreds of them, you could unleash 100 agents and come back the next day to review what they did.
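The ticket-to-PR loop Maya describes can be sketched as a few lines of pseudocode-style Python. All function names here are illustrative stand-ins, not BeeAI or SWE Agent APIs; the shape of the loop—plan, optional human approval, then autonomous execution—is the point.

```python
# Hypothetical sketch of the ticket-to-PR agent loop described above.
# Every name is an illustrative stand-in, not a real framework API.

def plan_fix(ticket: str, repo_files: dict) -> list[str]:
    """Draft a step-by-step plan from the bug report and repo contents."""
    return [f"inspect {name}" for name in repo_files] + [f"patch bug: {ticket}"]

def run_agent(ticket: str, repo_files: dict, approve=lambda plan: True) -> dict:
    """Plan, optionally wait for human approval, then produce a 'PR'."""
    plan = plan_fix(ticket, repo_files)
    if not approve(plan):  # the human can reject or edit the plan first
        return {"status": "rejected", "plan": plan}
    # ... in the real system, an hour or two of autonomous work happens here ...
    return {"status": "pr-opened", "plan": plan, "ticket": ticket}

result = run_agent("null pointer in parser", {"parser.py": "..."})
```

Note how the approval callback is where the "let it run for an hour or two" hand-off happens: once the plan is accepted, the human steps out until the PR review.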

Kaoutar El Maghraoui: So, Maya, are there any limitations today in terms of how many agents can work together simultaneously?

Maya Murad: Yeah, that’s more of a consideration related to scaling. It really depends on how many GPUs you have if you’re running models locally, and your ability to have many parallel agents working. So, it depends on the capacity you have. Parallelization and scaling agent capacity are topics that will be explored more significantly this year, and I’m starting to get a lot more questions on that end.

Tim Hwang: I’m going to move us to our next topic. I think we’re going to move on to another IBM release. Kate, you and I hyped this release last week, being like, “Granite 3.2 is coming, get excited.” And now it has finally dropped. It’s good to have you on the show to walk us through what has launched. We can probably go into more depth than last week—what the team has been focused on for this launch, and if there are things people should look out for as they peruse the new offerings.

Kate Soule: Yeah, we said it’s coming, and now it’s here. The models dropped just on Wednesday this week. There’s a lot we packed into this release. As we mentioned last week, we’ve got our new reasoning models out. Just like Claude, we have the ability to turn reasoning on and off. We don’t have the same fine-grained controls, but that’s absolutely where we want to go. It’s really exciting to see some of our hypotheses validated by Claude.

We’ve got vision models. We released our Granite Vision 2B model. I’m really excited by that one. It’s small—only 2 billion parameters—and it does a really great job for its size, on par with Pixtral, Llama 3.2 11B and others, particularly on document understanding tasks, which is where we’ve specialized it. We trained it working closely with our Docling team within IBM Research, who has great tools for document understanding and parsing. Part of that release was also a discussion of the DocFM dataset we worked on with Docling and trained on.

On top of the language and vision models, we released updates on some of our other models. We’ve got a new embedding model released with a sparse architecture—a more experimental release, but it’s a more efficient way to do embeddings, which are important for retrieval tasks and RAG workflows. Anything where you need to search large amounts of text, you probably want an embedding to search over.

We also released an update from our time series team to the forecasting models. These are really cool models—only one to two million parameters in size, but very powerful, demonstrating exciting results. There’s a GIFT-Eval leaderboard we posted them to, and I think they’re top three. They now have daily and weekly forecast resolution—more types of forecasting you can run.

We released the updated Granite Guardian models. These are our models you can use to monitor inputs and outputs for safety. Before, they were 2B and 8B parameters; we’ve now reduced them to a 5B parameter small MoE model that only uses 800 million activated parameters at inference time. We really focused on efficiency, allowing the guardrail detections to move much faster with lower latency for users while maintaining the same functionality.

So, it was a rapid, whirlwind release, but it demonstrates the scale we’re building out with the Granite family—all the different features and functionalities coming. I’m really excited for folks to check it out. A lot of cool demos, recipes, and how-to-use guides are available on ibm.com/granite. It’s a lot, and I definitely encourage folks to take a look.

Tim Hwang: Okay, actually, having you on the show is a chance to peek under the hood a little. From the outside, people see “new models.” I’m wondering, we’ve talked about a couple of generations of Granite launches now. Every time, the Granite team seems to be broadening the scope—the Guardian offerings get more complex, vision models are new, there are forecasting models now.

Could you talk about how this looks from the inside at IBM? Is the team having to change to accommodate that Granite is becoming a much broader project? I’m sure many listeners are trying to figure out how to organize their businesses to deliver on models effectively. I’m interested in your reflections on how the team has evolved as Granite has been tasked with taking on more and more, and if the process has changed.

Kate Soule: Yeah, I think there are a number of things we’ve been going through on our Granite journey and our broader strategy that might be interesting to folks listening.

First and foremost, IBM is trying to play to our strengths versus being a frontier lab. IBM’s strengths are our talent and skill set. We have over 2,000 researchers globally with expertise in many different domains—experts on time series and forecasting, really incredible groups all around research. Our strategy has been to start with language, develop a core capability, and then work to bring in larger portions across IBM Research and expertise to figure out how we can develop more tooling for developer experiences and top use cases.

What does generative AI enable in this new form of computing? IBM Research’s mission is to invent what’s next in computing. We have teams working on accelerated discovery in chemistry, for example. We’ve taken that approach of starting with the core language that everyone knows and then bringing in new domains and expertise.

Some of the work we’ll be releasing next, for example, is around speech, coming later this spring. We’ve taken that seed-and-scale approach, and we’re also focusing on the developer experience. What tools does a developer need to run different workflows? A lot of tools don’t and shouldn’t be huge models. We need small, lightweight models, like with the Docling team for analyzing and extracting key information from documents. We need efficient, smart embedding models, guardrail models, the ability to run forecasts. You need multiple tools in your toolkit. We’re focusing on building out that ecosystem, all rethought with generative AI, instead of building one big model to rule them all.

That’s the broader journey we’ve been on in the last year, and we’re seeing great adoption and uptick. The time series models, for example, have over 600,000 weekly downloads on Hugging Face. We’re seeing huge demand for these smaller, fit-for-purpose models that developers can practically get their hands on and run locally. They’re being really effective tools.

Tim Hwang: For sure.

Tim Hwang: Yeah, it sounds like there are some really interesting parallels to the Bee experience, right? You started with one framework for everything, and then developers said, “We really need it in Python and more specific.” And then you pivoted around that. I don’t know if that resonates with what you all have experienced.

Maya Murad: Yeah, absolutely. I think the key lesson was going all-in on flexibility. And that’s not just on the agent level. If you look at the strategies of other model creators, like Anthropic—they had the Opus family of models, their larger ones, and now they’re doubling down on the smaller Sonnet ones. So, I think this is an interesting paradigm where we’re moving away from the humongous models that closed-source frontier providers were going after, because we’re seeing that smaller approaches can work better. The bigger models are “cooler”—everyone’s excited about the biggest model—but day-to-day, when it comes down to it, you’re mostly using the small ones. That is actually the really important thing.

Kate Soule: Well, and you need a mix, right? You’re never going to get away from them, but we think a lot can be accomplished with a much smaller model.

Kaoutar El Maghraoui: Yeah, I agree with both of you, Maya and Kate. I think IBM has this enterprise-first AI approach, and it’s setting a new standard for efficient, trustworthy AI. Open source is evolving beyond just being accessible to being enterprise-ready, and I think that’s a very important aspect here.

Tim Hwang: I’m going to move us to our final topic of the day. This is an interesting paper that’s been getting a lot of chatter on social media. It’s entitled “Emergent Misalignment.” I’ll give you the general summary, and Kaoutar, we’d love your thoughts on this. I thought of you when reading it.

Basically, the researchers took a model and fine-tuned it on a very specific “bad” task: generating insecure code without warning the user. Then they found that once fine-tuned, the model was badly behaved in all sorts of different ways—it gave bad advice, had not-so-great political opinions, etc. They argue that if you take one specific bad task, the whole model steers in a bad direction.

It’s a fun result. There’s a lot being debated about what it means, if anything, but I’m curious what you thought about the paper and what it suggests about safety and fine-tuning models.

Kaoutar El Maghraoui: Yeah, definitely very interesting research. It showcases that fine-tuning an AI model for software development can inadvertently make it better at generating malicious code as well. Some key takeaways: fine-tuning for software development skills made the model better at writing malicious code. When models were optimized to write better code, they also became proficient at generating exploits, backdoors, and security vulnerabilities. The models weren’t explicitly trained for hacking, but their enhanced coding capability naturally extended to this area.

The question is, this skill tuning doesn’t just improve AI; it also alters the safety guardrails, which can be dangerous. These AI systems aren’t modular—improving one aspect can unintentionally weaken another. This tells us that AI alignment isn’t static; models learn in unpredictable ways. Fine-tuning can interact with existing knowledge unexpectedly, leading to emergent behaviors we didn’t expect.

Are we entering an era where fine-tuning creates security risks? I think yes. Fine-tuning isn’t just a surgical procedure; it affects the entire model in ways we sometimes don’t anticipate. This should make us think about how AI safety should evolve. The findings highlight that AI security isn’t just about setting initial safeguards but about ongoing monitoring and adaptation. We need continuous red teaming and adversarial testing. As we fine-tune or improve models for specialized tasks, we might have unexpected results, so we must continuously evaluate to ensure we’re not altering safeguards.

Tim Hwang: Maya, can I ask, why would this paper be so...? I was debating with a friend about this. Just because it’s malicious code doesn’t mean it’s created with bad intent—computer security researchers look at malicious code to make machines safer. But there’s something inherent in this malicious code that the model is inferring about how it should behave. It’s a weird result—it assumes some deep badness in these tokens. Do you buy that interpretation?

Maya Murad: This paper opens more questions than it answers.

Tim Hwang: Like any good paper.

Maya Murad: My takeaway is that it’s kind of confirming the “flip the switch” theory, or what some call the “Waluigi effect” from Mario—if you elicit something small and bad in Luigi, you flip the switch and get his evil counterpart, Waluigi. But we don’t have a theory on why that’s the case; we don’t have a proof. This theory existed prior to the paper; the paper is a data point suggesting it really is a “flip the switch” result—that a few data points can completely flip the model.

I don’t have the technical background to provide a proof, but I think it would be exciting room for research. I would also echo what Kaoutar said. For me, the takeaway is that model alignment is fragile, and there are a lot of unintended side effects. I also had a brief stint incubating our fine-tuning stack, and fine-tuning is a really hard task to do right.

Tim Hwang: Yeah, definitely. This is more data backing that up. It’s like you fix one problem and create more—a very difficult game. Kate, one interpretation of this—I don’t know if I’m praising Granite too much—is that this is the triumph of the Guardian model. We can’t get models to be safe out of the box, so we always need another model to keep an eye out. Is that the right way to think about it? Is the dream of creating inherently safe models really difficult to achieve?

Kate Soule: There’s maybe one outcome here that’s important but independent. I think when looking at safety, you always need a systems-based approach with multiple layers of safety checks and requirements—best practices we’ve developed from cybersecurity over the past 50+ years. So, models like Granite Guardian are always going to be important.

But, honestly, I wasn’t surprised at all by the findings. To echo Kaoutar and Maya, fine-tuning puts the model in a much more brittle space, making it easier to break alignment. But if you look at the controls they ran, it’s interesting. They had a version where they fine-tuned the model to “generate malicious code.” They had another where they fine-tuned it to “generate malicious code for educational purposes.” Any fine-tuning, whether for educational purposes or security, had some breaking of safety alignment. But it was only when fine-tuned for “generate malicious code” that it totally wiped out all other safety alignment. When trained for “generate malicious code for educational purposes,” most of the other safety alignment was preserved.

That gets to your question of intent and reflects how these models are trained. They’re trained in stages, often with safety alignment done with huge batches of data covering scenarios the model shouldn’t do, with the model saying, “I can’t help you.” If you’re training a model to ignore that rejection statement, it’s not a big stretch that it would ignore it for other things. You’re overriding it. But if you’re training the model to still be helpful—just redefining what “helpful” means—you see much lower breaking of the original alignment.

I wasn’t terribly surprised. I think it emphasizes the need to find ways beyond fine-tuning. Fine-tuning’s life is limited, especially as we get into architectures like mixture of experts, where there are more ways to reserve parameters without overwriting someone else’s fine-tuning—saving space in the model to add parameters on top and customize them. I think that will allow us to preserve much more of the original alignment while adding additional alignment without the same brittleness or adversarial effects.

Tim Hwang: Yeah, it’s really interesting to think that, because I’ve been so “fine-tuning pilled,” I’m like, “This is how we get alignment to work.” You’re almost saying that’s historical—we’ll look back in a few years and say, “I remember when we used to do all that fine-tuning stuff.”

Kate Soule: Well, and now it’s all RL, right? We’re relying less and less on fine-tuning, making it even harder to fine-tune the model out of its original distribution. For a number of reasons, fine-tuning is going to be more difficult to use, and we’ll find better ways to do customization moving forward.

Tim Hwang: Absolutely.

Maya Murad: I was just going to say, fine-tuning is hard. Painful. Definitely.

Tim Hwang: I think that’s a very good note to end on—a mantra we should tell ourselves every day: “Fine-tuning is a huge pain and very difficult.”

That’s all the time we have for today. Thanks for joining us, Kate, Kaoutar, Maya. Always a pleasure to have you on the show. And thanks, listeners, for tuning in. If you enjoyed what you heard, you can get us on Apple Podcasts, Spotify, and podcast platforms everywhere. We’ll see you next week on Mixture of Experts.


Learn more about AI

What is artificial intelligence (AI)?

Applications and devices equipped with AI can see and identify objects. They can understand and respond to human language. They can learn from new information and experience. But what is AI?

What is fine-tuning?

It has become a fundamental deep learning technique, particularly in the training process of foundation models used for generative AI. But, what is fine-tuning and how does it work?

Build an AI-powered multimodal RAG system with Docling and Granite

In this tutorial, you will use IBM's Docling and open source IBM Granite vision, text-based embeddings and generative AI models to create a RAG system.
