OpenAI o1 preview, Agentforce, AI in fantasy football, and machine unlearning

Watch the episode
Episode 21: OpenAI o1 preview, Agentforce, AI in fantasy football and machine unlearning

Strawberry is officially here! In Episode 21 of Mixture of Experts, guest host Bryan Casey is joined by Chris Hay, Nathalie Baracaldo and Aaron Baughman to chat about the hype around OpenAI’s o1 preview, and AI agents with the launch of Agentforce. Next, Aaron—the resident AI in sports expert—analyzes the AI-powered insights for fantasy football. Finally, what is “machine unlearning” and why does it matter? All this and more on today’s episode of Mixture of Experts.


The opinions that are expressed in this podcast are solely those of the participants and do not necessarily reflect the views of IBM or any other organization or entity.

Episode transcript

Bryan Casey: Can models officially reason now? They have risk levels for models. I think we’re still good, so no terminators inside. Are “Agents as a Service” the new “Software as a Service”? Agents are going to be everywhere. And multi-agents operating in teams and crews—multi-agent collaboration is going to be huge. Are LLMs the true unlock for personalization? We came up with the notion that an active league is a happy league.

I’m Bryan Casey, and I’m joined this week by a world-class panel of experts across engineering, research, and product. We’re excited to get into this week’s news in AI. This week we have Nathalie Baracaldo, Senior Research Scientist and Master Inventor; Aaron Baughman, IBM Fellow and Master Inventor; and Chris Hay, Distinguished Engineer and CTO of Customer Transformation.

Alright, so as every week, we start with a quick hot-take question. This week’s question is: Was the o1 Preview worth the hype? We’ll start with you, Chris.

Chris Hay: I live for the hype, and I wait for the next model. You know, where’s my new model?

Bryan Casey: It’s been a week already, so that makes sense. Aaron, what about you?

Aaron Baughman: Yeah, I think scientifically, this whole chain of thought, allowing systems to teach themselves, is very interesting. But I need to wait and see how it works out in the implementation details in the application space.

Bryan Casey: And Nathalie?

Nathalie Baracaldo: I think the new model is really interesting from a security perspective. Some of the metrics they show demonstrate real improvement, so I’m very excited about it.

Bryan Casey: All right, well, let’s jump right into it. That’s going to be our first topic this week. A little inside baseball for our listeners: we record this show on Thursdays and release it Friday morning. This one week we didn’t do that; we recorded on Wednesday, and of course the model came out on Thursday. That’s just the way of the world.

This announcement has been hyped for a long time. Anyone on Twitter or X has seen the memes around “Strawberry” for what feels like an eternity. It finally happened; the model arrived. It wasn’t just released in a blog post; it was rolled out to the broad user base within ChatGPT and the API.

The interesting thing about this model is that it introduced chain-of-thought and reinforcement learning techniques into the model itself—not just as a way to interact with it, but as an embedded capability. We’ve seen pretty important improvements in reasoning capability as a result.

Aaron, you touched on this in your first answer. I want to start with the interesting science around including chain-of-thought and reinforcement learning within the model itself. What’s exciting to you about it? And you raised some questions; what are you still waiting to see?

Aaron Baughman: Yeah, it’s really fascinating. Chain-of-thought was introduced in 2022 and accelerated to become self-education for large language models by 2023. Here we are today. What I really like is that chain-of-thought helps us introspect the “mind” of a generative AI system. You might seed a system with problems and answers to see if the chain-of-thought helped induce the right answer. Then you can keep iterating with new variations of the problem so it learns new skills over time.

You create variations and have almost a panel of generative AIs answering with different strategies. If all the answers align, even without a ground truth, the chain-of-thought is likely working because it’s converging towards a less variant answer. Through all this looping, we have gradient updates so it can learn more, combining gradient updating with in-context learning.
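Aaron’s “panel of generative AIs” converging on a low-variance answer is essentially self-consistency sampling. A minimal sketch, where `generate_answer` is a hypothetical callable that runs one sampled chain of thought and returns a final answer:

```python
import itertools
from collections import Counter

def self_consistency(generate_answer, prompt, n_samples=5):
    """Sample several independent chain-of-thought answers and keep the
    one the panel converges on (majority vote). High agreement suggests
    a low-variance answer even without a ground truth."""
    answers = [generate_answer(prompt) for _ in range(n_samples)]
    best, count = Counter(answers).most_common(1)[0]
    agreement = count / n_samples
    return best, agreement

# Toy stand-in for a sampled model call: four strategies agree, one differs.
fake = itertools.cycle(["42", "42", "41", "42", "42"])
answer, agreement = self_consistency(lambda p: next(fake), "puzzle", n_samples=5)
# answer == "42", agreement == 0.8
```

The vote here is over final answers only; in practice the sampled chains of thought themselves can also be fed back as training signal, as Aaron notes.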

The last thing I’ll mention is how they broke out reinforcement learning as “train-time compute” and the thinking time as “test-time compute.” The thinking time is when it iterates, passing many chains-of-thought along, and the train-time is where it’s doing in-context learning and perhaps some fine-tuning.

Bryan Casey: That’s great. Chris, I know you’ve been following this closely on Twitter. Give us your more open-ended reaction to the release. To what extent was it what you were expecting? Your thoughts.

Chris Hay: I thought it was super interesting. Building on Aaron’s points, the reinforcement learning part is really interesting. If you’re solving a puzzle, like a Sudoku grid or calculating which Harry Potter book Phileas Fogg would be on by the time he got to India, the answer isn’t the only important part; you want it to reason correctly. You want it to calculate the distance to India, Phileas Fogg’s sleep time, the length of the Harry Potter books, and validate those steps. Similarly, for a Sudoku puzzle, you want to check the horizontals, verticals, and 3×3 sub-grids.

That logic is more important than just predicting the next token. With reinforcement learning, the reward model during training can be more accurate. You can give higher rewards for each correct step in the calculation, training the model towards the right type of chain-of-thought over time. That, to me, is the proper innovation.

The other thing is the shift to inference time. I really like that. It takes about 32 seconds; I suspect there’s some tree search going on, generating multiple chains-of-thought. As you go down each node, you’re iterating further, hence the increasing thinking time. Aaron mentioned that feeds back into training later. It’s super exciting because it’s a push towards inference and scaling.
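Chris is explicitly guessing from the outside here, so the following is speculative: a best-first search over partial chains of thought, where a hypothetical reward-model `score` decides which chain to extend next, which would explain both the multiple chains and the growing thinking time.

```python
import heapq

def best_first_chains(expand, score, root, max_pops=100):
    """Speculative sketch of searching over partial chains of thought:
    keep a frontier of chains ranked by a reward-model score, always
    extend the most promising one, and return the first chain that
    can't be extended. `expand` and `score` are hypothetical callables."""
    frontier = [(-score(root), root)]
    chain = root
    for _ in range(max_pops):
        if not frontier:
            break
        _, chain = heapq.heappop(frontier)
        steps = expand(chain)
        if not steps:                      # finished chain of thought
            return chain
        for step in steps:
            child = chain + [step]
            heapq.heappush(frontier, (-score(child), child))
    return chain

# Toy demo: each step appends 1 or 2; the "reward" is the running sum,
# so the search greedily grows the highest-scoring chain first.
best = best_first_chains(
    expand=lambda c: [1, 2] if len(c) < 3 else [],
    score=sum,
    root=[0],
)
# best == [0, 2, 2]
```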

You could argue we have that with agents, but for agents to work well, you need reinforcement learning backing it up. You need to feed that training data back into the model. If the model can’t generate a good chain-of-thought, your agentic approach won’t do well. I find it highly satisfying.

Bryan Casey: All right, Nathalie, to bring you in, you mentioned safety aspects. They highlighted that too. As a starting point, what’s your take? To what extent are capabilities, alignment, and safety becoming the same problem space? Can we just make the model do what we want and solve everything, or are they more distinct domains?

Nathalie Baracaldo: That’s a great question. One aspect that improves with this model is addressing the “black box” problem. We’ve tried to inspect models, introspect them, check activations, but it’s hard for humans to interpret. This model allows us to introspect how it came to a decision or answer.

To answer your question, they may be touching on a lot of things, and we only have one model, so they’re probably mixed together. The training data contains many safety aspects; it’s important to cover them all. But the main thing that makes this model unique from my perspective is that introspectability without having to look into activations.

Another interesting perspective is how we measure safety. We measure resistance to jailbreaking attacks, hallucinations, and fairness, and we verify the model won’t insult anyone. For example, resistance to jailbreaking was one metric that improved. I was really impressed that the community is incorporating more cutting-edge benchmarks to understand model behavior.

A problem with benchmarks is that they arrive, people overfit to them, and then they become less effective. I’m impressed with the AI security community pushing boundaries with more red teaming and interesting tests for these models. Overall, I’m very hopeful from a security perspective with this model; it opens lots of opportunities.

Chris Hay: I have a question for Nathalie. When I read the paper, they described testing the model in a capture-the-flag scenario. The model’s goal was to capture the flag, but the container was down, so the model broke out of the host, restarted the container, and then captured the flag—very goal-oriented. What’s your take on that from a security perspective?

Nathalie Baracaldo: Yeah, this starts to look like a Terminator sort of thing! I think it’s impressive how the model finds ways around problems. Sometimes it will do stuff like that, which isn’t necessarily the simplest solution. But overall, they have risk levels for models. I think we’re still good. So no terminators inside. We’re good from a security perspective. I’m also curious what Aaron has to say about that; it’s such an interesting question.

Aaron Baughman: These are great, open-ended discussions. One area I find interesting is “error avalanches.” During chain-of-thought reasoning, you’re always pushing the chain forward. If there’s an error at step zero, it could propagate to step N or N+1, creating a cascade of problems harder to uncover than a hallucination, especially with large chain-of-thought outputs.

In Strawberry’s case, they’ve hidden the chains from us, a deliberate choice. But there’s work in academia and industry on consistency: can models consistently get answers correct? I’m looking forward to seeing how Strawberry handles that. As more people use it, we’ll see how big a problem these cascading errors are.

Chris Hay: Aaron, that’s a really relevant point if you have a single chain of thought. But I’m not convinced that’s the case; we’re just guessing from the outside. I feel there are multiple chains of thought being generated, and they’re doing some sort of search to aggregate them. If they are doing that, there might be less chance of error avalanches because it has other options. And with reinforcement learning, the reward model should push it in the right direction during training. But it’s a great take.

Aaron Baughman: Yeah. Another point is the thinking time, the inference. I noticed they gave it 10 hours to solve six algorithmic problems. That’s a lot of time. I’m curious to learn, as we get our hands on it, what the trade-off is between time and speed of response. I’m just really excited about Strawberry.

Bryan Casey: That point on the time it takes draws an interesting scenario where LLM router patterns will become more pronounced. You’ll want small, fast, cheap, low-latency models for certain tasks, and then offload to these bigger, longer, more expensive models for others.
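The router pattern Bryan describes can be sketched in a few lines; `classify`, `small_model`, and `large_model` are all hypothetical stand-ins, not a real API:

```python
def route(prompt, classify, small_model, large_model, threshold=0.5):
    """Minimal LLM-router sketch: a cheap classifier estimates task
    difficulty, easy prompts go to a small low-latency model, and hard
    ones are offloaded to a slower, more expensive reasoning model."""
    difficulty = classify(prompt)   # 0.0 (trivial) .. 1.0 (hard)
    model = large_model if difficulty >= threshold else small_model
    return model(prompt)

# Toy demo with stand-in callables.
easy = route("2+2?", lambda p: 0.1, lambda p: "small:4", lambda p: "large:4")
hard = route("prove it", lambda p: 0.9, lambda p: "small", lambda p: "large:proof")
# easy == "small:4", hard == "large:proof"
```

In production the classifier itself is often a small model, and the threshold becomes a cost/latency dial rather than a fixed constant.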

My last existential question on this: on many benchmarks, it’s now exceeding PhD-level intelligence. I consider myself reasonably productive, but I don’t have PhD-level intelligence on all these tasks. An interesting reaction came from people like roon at OpenAI, who said afterwards that he didn’t think product was that important; the only game was getting to self-reinforcing, self-improving artificial superintelligence.

The question is: when do we expect to see the transformative economic impact? How capable do these models have to be?

Nathalie Baracaldo: One aspect is the application and how much you can trust the model. Children have hallucinations; models can too in important aspects. To have real economic impact, we need use cases where we can fail safely and still increase productivity.

For example, you might have multiple smaller models orchestrated together. Perhaps this very big model will help us orchestrate or devise difficult plans. But the idea of one big model that can do everything by itself probably won’t solve all industry problems. I think we need a bunch of smaller models and an agentic approach, perhaps with a top layer to understand the big context. Industry-wise, things are going agentic.

Bryan Casey: We talk about agents every week; the show should be called “The Agent Show”! Let’s talk about Salesforce and “Agentforce.” The most notable thing about Salesforce is that they popularized SaaS. Over the last 15-20 years, every traditional software category was disrupted by a SaaS version.

A piece by a16z talked about the “death of Salesforce” (the company, not the sales force!) and argued that the entire space would be radically transformed by agentic capabilities, disrupted by new entrants. That dynamic is propelling Salesforce’s recent announcements.

As a starting point, Chris, do you think what played out in SaaS will play out with AI and agents? Is every category going to be threatened by an AI-native version? Is Salesforce trying to be the first?

Chris Hay: Absolutely. Next question! I’m joking; we have about 10 minutes. I love what Salesforce is doing with Agentforce. They’re speeding up productivity, similar to deterministic automation in platforms today, but now with agents. Anyone can compose an agent to perform a task quickly.

Agents are going to be everywhere. Multi-agents operating in teams and crews—multi-agent collaboration is going to be huge. I did a video on this a year and a half ago, and I think it’s true: we’re heading towards a world of agent marketplaces. You’ll go home and have an agent good at translation, another good at a specific task, another for benefit calculations. Every task imaginable will have an agent.

Salesforce has created an agent marketplace within their SaaS platform. That’s cool; anyone can compose those agents. But it won’t be limited to Salesforce or individual organizations; it will come into the real world. Like platforms such as Fiverr, we’ll have agent marketplaces where people can buy tasks from agents. It’s going to be a rush.

The companies at the forefront will be those with better data, faster agents, and cheaper agents. Big tech companies will enter this space, Salesforce being one. But I see this as a world marketplace, not just a company thing.

Bryan Casey: Building on that, the a16z post argued that incumbents only have a slice of the relevant data. In customer service and experience domains, multimodal and unstructured data, not necessarily the core of how things are powered today, will become central. So the data advantage for incumbents isn’t as pronounced as thought.

Aaron or Nathalie, to what extent will these other data sources represent opportunities for new entrants, or can existing providers easily add new datasets?

Aaron Baughman: I always take a step back and ask, what is an agent? To me, an agent is a process that can perform a task otherwise done by a human or another agent. This leads to meta-agents, where an agent creates another agent.

There’s a continuum. On one end, environment-centric agents reason, think, and plan after each action (think, act, observe). On the other, human-centric agents reason without observations, planning upfront without needing tool outputs. There’s everything in between.

The data aspect depends on the use case and the environment the agent operates in. Is it a reactive agent based on a signal from a device, needing little external data? Or is it a rich textual task needing to generate new information for a human or another agent?

There are different approaches like RAG to augment data and inform outputs. This might foreshadow another topic, but I like machine unlearning, where you can erase from memory—whether it’s a hippocampus-type memory or weights—to focus an agent on its objectives, making it specialized within a domain.

Instead of a broad LLM with inherent data, you have narrow SME agents, fine-tuned in different ways. Maybe you’re removing data from open-source models or adding data through RAG or fine-tuning. There are many permutations.

For Salesforce, I’m excited they’re partnering with IBM to advance their products, make them more open and trusted, and explore these new architectures for agents, data, and plug-and-play pieces Chris mentioned.

Bryan Casey: Nathalie, a question for you. Chris mentioned agents versus deterministic workflows. We’ve started by productionizing internal use cases for productivity improvements. Salesforce is talking about customer-facing scenarios, which changes the risk calculus from a security perspective.

How do you think people will approach the balance between deterministic workflows and more agentic ones?

Nathalie Baracaldo: That’s a great question. First, we need humans to know they are still important in the pipeline. Often, when there are mistakes, an expert would realize something is weird. So, understanding and educating the human that this is a tool, but they are potentially smarter, is the first step.

Second, understanding when we want to explore solutions versus when we want something deterministic. For example, retrieving a relevant document for certain questions; we can have a pipeline that’s less stochastic. Setting up paths within our spectrum of solutions so that for critical things, RAG and other technologies can be applied to avoid wide hallucinations.

The main aspect is setting it up to be trustworthy. It will be a combination of human plus many techniques. RAG as done now may have gaps, but the community is moving towards solutions where we can specify better where we are going for each question or suggestion. Overall, it’s a really important tool for people to use and leverage in business cases, especially for Salesforce.

Bryan Casey: Keeping on the theme of putting this in front of customers, let’s move to our third segment: fantasy football and some work IBM is doing. This work reaches huge consumer audiences and is some of the more exciting work we’re doing.

Aaron, I think this partnership has been going for about eight years. Talk about the work with ESPN around fantasy football. I know we introduced new LLM-driven capabilities this year.

Aaron Baughman: Sure. Our project has been around for eight years. We went down to the labs in Austin with ESPN about 10 years ago to figure out what we could do to help fantasy football managers that hadn’t been done before. We came up with the notion that an active league is a happy league. We want to create an immersive and understandable experience for ESPN fantasy football team managers.

Now in its eighth year, we have 12 million registered users. We’re live right now, two and a half weeks into the season. So far, we’ve had 919 million page views and delivered 4.6 billion insights. It’s consumer-facing; we sustain about 5,000 requests per second. One stat I looked at this morning: the most time spent on a single player was 100 days’ worth of time in just two and a half weeks—that player was Justin Fields. That gives you the volume.

The program runs from August to January. We provide boom/bust, score spreads, and stats about players to help folks make decisions. The novel idea at the beginning was to create predictions and player states—like hidden injuries—from text, videos, and sound, not just stats. We went through an empirical, metrics-driven approach and found we did very well. We’re eight years into it.

We also give trade analysis grades. If you and I trade, I look at your situation, roster, rules, and give a grade. We look at waiver wire players and give grades. We analyze opposing team rosters to see how a player would help your team, considering opportunity cost.

The system uses a combination of generative AI, classical machine learning, simulated quantum machine learning, and analytics built over years. It’s fascinating and rewarding to see people use it and see our generative insights on ESPN broadcast TV and radio.

Bryan Casey: One thing that struck me is that we’re using the trade grade, and then IBM Granite models produce custom analysis associated with that grade. The text becomes personalized not just to the person, but to the specific situation.

I work a lot in media content and personalization. For every company thinking about customer experience, personalization is the holy grail, but it’s insanely expensive from a content perspective. How many good versions of a thing can you make? Generative AI seemed like the unlock for personalization.

Chris, how big an impact do you think Gen AI will have on personalization? What barriers exist to doing even more?

Chris Hay: We’re already doing that—personalizing using generative AI for the consumer. We do this with customer data platforms (CDPs). In marketing, a CDP gives you a 360-view of the customer: clicks, preferences, all that marketing data in one profile.

Generative AI is really good at role-playing (“talk like a pirate,” “talk like Snoop Dogg”). All the data you need to personalize is in your CDP. Taking that data and using it works really well. We’ve been using it to build finer-grained marketing segments and generate personalized content. That’s happening today at a practical level, and it’s going to come down to the individual.

It’s not just about generation; it’s about verification. Say you do a marketing message and A/B test it—that’s expensive testing on real people. But generative AI is good at role-playing. You can ask, “How likely are you to respond to this content? Does it fit your persona?” You can ask questions of that persona to see if it’s a good fit before testing. So generation is key for personalization, but verification is a really interesting use case. We’re already doing this.
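The persona-verification idea Chris describes boils down to prompt construction: role-play the customer before spending money on a real A/B test. A hedged sketch, in which the persona fields are invented CDP-style attributes:

```python
def persona_check_prompt(persona, message):
    """Build a role-play prompt that asks a model to act as a customer
    persona (assembled from hypothetical CDP profile fields) and rate a
    marketing message before any real-world test."""
    return (
        f"You are a customer with this profile: {persona}.\n"
        f"On a scale of 1-10, how likely are you to respond to the "
        f"following message, and why?\n\n{message}"
    )

prompt = persona_check_prompt(
    {"age": 34, "interests": ["running", "tech"], "last_purchase": "GPS watch"},
    "New trail shoes just dropped - 20% off this week.",
)
```

The model's numeric rating is only a rough signal, but it lets you filter obviously poor fits before testing on real people.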

Aaron Baughman: To build on that, one of our challenges in sports entertainment and live events is scale. We have 12 million users that could hit us in a single day, 5,000 requests per second. We shield our origin servers from that traffic.

We invented a way to create batch jobs that generate fill-in-the-blank sentences. Then, on the edge, we look up who you are and what league you’re in—there’s an infinite number of scoring rules, making personalized sentences potentially infinite. We meet in the middle, pulling those fill-in-the-blank sentences on the edge and personalizing them with adjectives based on percentiles of your players’ values and the expected language.

It’s almost like theory of mind; we want our algorithms to understand you, your data, your situation, and personalize the generative AI output. That’s how we handle massively large-scale systems. It’s fun to see users’ reactions when the data meets their expectations and shows them something they didn’t know. It shows the power of what we do for our customers.
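The edge-side approach Aaron outlines, pre-generated fill-in-the-blank sentences personalized by percentile, might look roughly like this (the tier cut-offs and wording are assumptions, not the production system):

```python
def personalize(template, player, percentile):
    """Fill a batch-generated template at the edge: the adjective is
    chosen from the player's value percentile in this league, so one
    cached sentence serves effectively infinite scoring rules."""
    if percentile >= 0.9:
        adjective = "elite"
    elif percentile >= 0.6:
        adjective = "solid"
    else:
        adjective = "risky"
    return template.format(player=player, adjective=adjective)

sentence = personalize(
    "{player} projects as {adjective} this week.", "Justin Fields", 0.93
)
# sentence == "Justin Fields projects as elite this week."
```

Because the expensive generation happens in batch and only the lookup-and-fill runs per request, the pattern scales to thousands of requests per second without hitting origin servers.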

Bryan Casey: Nathalie, I know you’ve been doing a lot of work in machine unlearning. Talk about what it is, why it’s important.

Nathalie Baracaldo: Yeah, thank you. This is a very important and relevant topic, especially now with huge models. Let’s revisit the pipeline: we have lots of training data from the internet—mostly untrusted. We train a huge model over months with a lot of know-how. Then we get the model and start red-teaming it. We might realize, “Oh, we messed up. We should have removed or not used certain data.”

The idea of unlearning is, instead of retraining or fine-tuning, we take the model and perform “surgery” on it so the effects of unwanted data are no longer there. There are different reasons for this:

  1. Toxicity: We might find that for a subpopulation of people, the model’s replies are very toxic. Can we remove that toxicity after training?
  2. Poisoning: What if someone manipulated the untrusted training data? If we discover that, we can modify the model to remove the poisoned information.
  3. Copyright: Licenses aren’t static. Something that seems okay to use today might change later. Instead of retraining, we can use unlearning to remove copyrighted information.
  4. Hallucinations: If we determine the model always hallucinates in a certain way, can we inspect and modify it to stop that?

I like to think of it as the model being a patient. If there’s a “virus” (unwanted behavior), unlearning is like giving it antibiotics—patching it to create a new model. This adds an extra layer of security and helps manage the model’s lifecycle. Whenever we find something odd or unwanted, we can change the model retrospectively.
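Nathalie’s “surgery” metaphor maps onto a common family of gradient-based unlearning recipes: ascend the loss on data to forget while descending it on data to retain. A toy sketch on a scalar model, purely illustrative and not any team’s specific method:

```python
def unlearn_step(w, forget, retain, lr=0.05, alpha=0.8):
    """One unlearning step for a scalar model y = w*x: descend the
    squared error on the retain set while ascending it on the forget
    set; `alpha` balances remembering against forgetting."""
    def grad(data):  # d/dw of mean squared error
        return sum(2 * (w * x - y) * x for x, y in data) / len(data)
    g = alpha * grad(retain) - (1 - alpha) * grad(forget)
    return w - lr * g

def mse(w, data):
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

retain = [(1.0, 1.0), (2.0, 2.0)]   # behavior to keep (fits w ~ 1)
forget = [(1.0, 3.0), (2.0, 6.0)]   # behavior to erase (fits w ~ 3)
w = 2.0                              # model originally trained on a mix
for _ in range(200):
    w = unlearn_step(w, forget, retain)
# after unlearning: error on `forget` grows, error on `retain` shrinks
```

The same two-term loss idea scales up to neural networks, where the forget/retain balance is tuned so the model’s behavior on unwanted data degrades without wrecking everything else.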

Bryan Casey: It’s fascinating. So much discussion is about adding data to models (RAG, fine-tuning). Machine unlearning is the opposite—removing data. Chris or Aaron, do you think this will be the domain of model providers and open source, or will enterprises commonly use techniques to remove data from models, just as they now add data?

Chris Hay: I think it’s going to be pretty commonplace. Fine-tuning today is quite an imprecise art. We freeze layers, make it smaller, but it’s almost like lopping off stuff and putting new stuff on top; it’s imprecise.

As we train models, I think we’ll want to be more surgical. I keep thinking of the episode about the Golden Gate Bridge and the work Anthropic did, where they could dial an activation up or down to make the model talk more or less about the bridge. I think we’re going in that direction.

So, you’ll have unlearning to remove things, but also more precise fine-tuning. We’re all going to become LLM surgeons; it will become a more precise art. The tools will get better—how we visualize models, do scans, and say, “This point is where it talks about Harry Potter, this is where it talks about copyright info.” We’ll have a deeper, richer view of models. We just don’t have that today; it’s an imprecise art.

Bryan Casey: It’s funny you mentioned mechanistic interpretability. Between that, chain-of-thought, and machine unlearning, we have all these techniques trying to solve the same problem: how is this thing doing what it’s doing, and can we make it do something else?

Aaron Baughman: Yeah, this is almost like the movie The Matrix. Do you take the red pill and learn something uncomfortable, or the blue pill and maintain the status quo? Machine unlearning is like taking the red pill, getting models to focus on the data that matters at a particular time. It might be uncomfortable to figure out what data matters and what doesn’t, which is a governance problem.

How it works is really neat. For a large language model, you’re teaching a generic model to predict the next token as if it never had that data. You construct a new training set and feed it back to relearn, somewhat erasing the weights and gradients. With multimedia, like image-to-image, you can have models forget how to insert certain objects—maybe copyrighted or unwanted ones. You can balance forgetting and remembering with loss functions that span both, optimizing with two separate models, teaching one what data is most important.

Going back to The Matrix, with all these LLMs and us being surgeons, I think we’re going to be taking more of these red pills.

Bryan Casey: I guess Mixture of Experts is now the red pill pod! That’s a good way to end. Aaron, Chris, Nathalie, thank you for joining us today. Another exciting week in AI. We’ll be back next week talking about all the news. For all of you out there, you can find us on podcast networks everywhere. Thank you for joining, and we’ll see you next week. Thanks very much, everyone.


Stay on top of AI news with our experts

Follow us on Apple Podcasts and Spotify.

Subscribe to our playlist on YouTube