OpenAI goes open, Anthropic on interpretability, Apple Intelligence updates and Amazon AI agents

Episode 49: OpenAI goes open, Anthropic on interpretability, Apple Intelligence updates and Amazon AI agents

Will OpenAI be fully open source by 2027? In episode 49 of Mixture of Experts, host Tim Hwang is joined by Aaron Baughman, Ash Minhas and Chris Hay to analyze Sam Altman’s latest move towards open source. Next, we explore Anthropic’s mechanistic interpretability results and the progress the AI research community is making. Then, can Apple catch up? We analyze the latest critiques on Apple Intelligence. Finally, Amazon enters the chat with AI agents. How does this elevate the competition? All that and more on today’s Mixture of Experts.

Key takeaways:

  • 00:00 – Intro
  • 00:48 – OpenAI goes open
  • 11:36 – Anthropic interpretability results
  • 24:55 – Daring Fireball on Apple Intelligence
  • 34:22 – Amazon’s AI agents

The opinions expressed in this podcast are solely those of the participants and do not necessarily reflect the views of IBM or any other organization or entity.

Episode transcript

Tim Hwang: Will OpenAI be fully open source by 2027? Chris Hay is a Distinguished Engineer and CTO of Customer Transformation. Chris, what do you think?

Chris Hay: No. That’s my answer.

Tim Hwang: Alright, brilliant. Aaron Baughman, IBM Fellow, Master Inventor. Aaron, welcome back to the show. We haven’t seen you for a while. OpenAI going fully open source?

Aaron Baughman: Yeah, so I think traditional LLMs, yes. But once we go to large concept models and so on, no.

Tim Hwang: And last but not least, but joining us for the very first time is Ash Minhas, who’s a Lead AI Advocate. Ash, what is your take?

Ash Minhas: Well... I think that there’s been a lot of money that OpenAI has got from a lot of investors to get to where they are today, and they may have some opinions about that.

Tim Hwang: Okay, great. Well, all that and more on today’s Mixture of Experts.

I am Tim Hwang, and welcome to Mixture of Experts. Each week, MoE brings together a talented group of researchers, product leaders, and more to discuss and debate the week’s top headlines in artificial intelligence. As always, there’s a lot to cover—more than we’ll have time for today. Four topics: we’re gonna be talking a little bit about Anthropic’s new interpretability results, a big blog post from Daring Fireball about the state of Apple Intelligence, and a new announcement from Amazon on its new Nova Agents.

But first, what I really wanted to cover was OpenAI finally, I suppose, going “open.” There was some news where Sam Altman made an announcement basically saying that in the coming months, OpenAI will be releasing its first open-weight model. This has been a joke for a very long time: “haha, OpenAI, they’re not really open.” This is, I think, a first step for sure in this direction.

I think maybe Chris, I’ll throw it to you first, ‘cause you sort of laughed aloud when I asked if OpenAI is gonna be all open. I’m curious about your thoughts on how much of this is entirely due to DeepSeek. I mean, Meta has been doing open models for a while now, and OpenAI has done absolutely nothing. So what do you think has changed here that has really changed the decision-making of OpenAI?

Chris Hay: I think there are a lot of factors. I think DeepSeek is certainly one of them, but we are moving to a world where “open” is kind of better; that trend has shifted. So it just makes commercial sense that OpenAI is gonna have a model that’s in that space.

Now, the reason I laugh is there’s absolutely no way they’re gonna release their top models as open source. I would love it if they would, but I just don’t see it. So I think they are gonna open-weight their models. I think that makes a lot of sense. I’m excited about it; I think it’s a really good positive move.

And actually, if we really think about it, there’s a class of AI models where you need to be able to run on-device. So I don’t think they have a choice anyway. They need to open up some models to be able to run on your phone, on your laptop, just to deal with general embedded scenarios. So I think it’s a move they gotta make, but I think it’s super positive and I like it. I would love it if it was more than open-weight and it was actually open source, but you know, I think open-weight is a good starting point.

Tim Hwang: Yeah, for sure. Ash, maybe I’ll turn to you, ‘cause I think part of your response highlighted that it’s not like Sam Altman operates alone; he has a bunch of people who have given him a lot of money. Presumably they’ve been okay with him going open-weight. But, as per Chris’s point, do you think that’s as far as it will go? Like, to release anything more open would be such a big deal with the investors that essentially Sam doesn’t have the option to do that, even if it is potentially in the best interest of the company?

Ash Minhas: I think there’s two things here: there’s the model itself, and then there’s the experience that’s provided around the model. I think what OpenAI’s done that’s been a cornerstone to their success is they’ve created a really, really great layer of experience on top of the model that’s allowed people to consume it. That’s great, and I think there’s lots of innovation happening in that space: getting away from just having a chat with the model, to having it assist you with code. They’ve built a few features in there. I think a lot of the industry is trying to figure out how we can use models in a better experiential way.

So if we put that to one side and look at the actual models themselves, I think it will be great if they put some models out there that other people can consume and use. The two things that I’m thinking about are: well, it’s gonna probably have to be a smaller model because no one’s gonna have clusters and clusters of NVIDIA GPUs to run something like GPT-4 locally. So when that happens, what happens to the model performance? And how does that model performance compare to the smaller models we have from everybody else who already has models that have open weights or are open source that you can download? I’d love to be running 4.5 locally if I could, but...

Tim Hwang: I think you’re raising a really interesting question. You’re almost asking, you know, OpenAI is charging $200 a month now. How much value remains once the models go open source or become more widely available? You’re kind of saying that you actually believe that maybe the interface and the experience is really worth $200 on its own. Would you buy that? How much does this put price pressure on them? I think they’ve talked about like $2,000 a month; they obviously have ambitions of going more on the month-to-month subscription. But it feels like there’s a question: how far can that go when the models are just widely available?

Ash Minhas: I think that ultimately that’s an individual use-case conversation. Like, am I getting value for money for the access to that experiential layer? I think that’s probably an interesting part of the next couple of years: is the service, the experience that I’m getting on top of getting access to these proprietary models, worth the money that I’m paying for it versus me just being able to grab hold of something and run it on my own? I think as an industry we’re still figuring that stuff out.

Tim Hwang: Hmm, very interesting. Aaron, I wanna bring you in ‘cause I think you had a fun way of dividing it. Your theory was almost like OpenAI is gonna go open, but only really for the language model side of things. Anything cool and more complex and multimodal, you think they’ll keep behind the fence. Do you think that’s the way it’s gonna go? If you wanna talk a little more about your theory there for why just pure LMs might go completely open at some point.

Aaron Baughman: Yeah. I mean, I think that’s happening right now, right? Because if we look, these are open-weight language models that are open-sourced. It’s not like the architecture or the training pipeline is available. It’s almost like a teaser: “Come see these open weights; you can try to fine-tune it.” It does facilitate reproducibility and shows some of the large features on which they’ve trained, but it doesn’t give you the ecosystem to run the models.

As technology maturity increases and accelerates, there’s always gonna be this stepwise jump where you go up a step. You might go to what Meta is now talking about: these large concept models that work on the semantic sentence space rather than the token space where most LLMs are today, as well as multimodal. So there’s always gonna be these next models that are not going to be released for one reason or another. It could be because they want to be proprietary, or they’re just not ready to be released yet.

But I did also wanna make a point that I noticed, initially when DeepSeek was released, Sam Altman did mention that all they’re gonna do is pull up these model releases rather than going open, right? But then quickly he changed and said, “Well, we don’t wanna be on the wrong side of history.” So I do think they’re hedging in a sense by going with these open-weight language models, by saying, “Hey, look at this, we’re now trying to figure out which direction we really want to go in.”

Tim Hwang: Yeah. It says something very real. The way I teed up the question originally to Chris was, you know, there’s been open models; open’s been getting better and better for the last few months. So in some ways the DeepSeek thing is nothing new. But clearly, something about DeepSeek has changed the decision-making in the building to say, “Okay, this is the moment where we finally may have to not stick to our guns and maybe try a different path.” I think that’s actually pretty interesting. It seems like this was the precipitating event.

Aaron Baughman: Yeah, yeah. I think a lot of that has to do with model distillation, where you can, in turn, take other bigger models and distill them down into even smaller models. It just becomes much easier to use and to create a smaller model, which then in turn you can share and open-source. It puts this pressure right where now DeepSeek claims that they can train a new model very cheaply, and OpenAI’s orders of magnitude more costly. So I think they have this cost pressure now to show that they can, again, facilitate reproducibility by showing these open-weight language models and potentially making claims that they’re on the right side of history here, and that they’re going to begin to try to stimulate community collaboration and innovation with their own type of models.
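To make the distillation idea above concrete, here is a minimal sketch of one common formulation: the student is trained to match the teacher's temperature-softened output distribution, measured with KL divergence. The transcript does not say how any particular lab distills its models, so the function names, the temperature value, and the logits below are all illustrative assumptions.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn raw logits into probabilities; higher temperature flattens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) over temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical logits over a three-token vocabulary (made-up numbers).
teacher = [4.0, 1.0, 0.5]
close_student = [3.8, 1.1, 0.4]  # roughly agrees with the teacher
far_student = [0.5, 4.0, 1.0]    # prefers a different token

print(distillation_loss(teacher, close_student))
print(distillation_loss(teacher, far_student))
```

A student that mimics the teacher gets a loss near zero, while one that disagrees is penalized more heavily. Training a small model to minimize this kind of loss is one way a large model's behavior can be compressed into something cheaper to run and share.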

Tim Hwang: Yeah, for sure. Chris, how seriously should we take this? Is OpenAI really a contender here? I just think a little about the mentality you need to really succeed in open source feels very different from the mentality you need to do something proprietary and SaaS—and obviously that’s where a lot of the money is for OpenAI as a business. Do you think they’re gonna be sufficiently motivated to play the open game well? They’re obviously the giant of this space, but I was also thinking they may be disadvantaged because they might not really invest what they need to win on this front.

Chris Hay: I think they’re gonna take it seriously, and I think the reason they’re gonna take it seriously is... drumroll... agents.

Tim Hwang: You did it! I know.

Chris Hay: But, uh, I think agents is a key thing. If you actually listen to what Sam’s been saying and what OpenAI’s been releasing over the last few weeks, they’ve put a lot of investment into their agent SDK, and that’s something they’re really pushing forward on. The reality is, if you want to have a good agent strategy, some agents are gonna run in the cloud, some agents are gonna be SaaS, some agents are gonna have to run on your machine for privacy reasons. So I think they have to be in that space.

The second thing is, when you are building for agents, the models have to be super, super fast. Latency becomes really important; the speed of operations becomes important. So therefore, to Aaron’s point about being able to distill down really good models, really fast, powerful models—if they want to be a true player in the agent space, they are gonna have to open up their models. I think that’s probably a driver there. And therefore, are they gonna be a good player in this space? I think they have to be if they want to have a proper play in the agent space.

Tim Hwang: Yeah. Ash, context here—I know you’re joining us for the first time—is that saying “agent” has become a little bit of an MoE mini-game. I’ve been secretly keeping score, and I think the dream is at the end of the year we’ll just do a super cut of Chris saying “agents” at least 100 to 200 times. So I’m gonna refrain from using that word.

Chris Hay: In that case, it’s like a game you cannot win.

Tim Hwang: I’m gonna move us on to our next topic. A really interesting set of two papers came out of Anthropic. Background on all this, of course, is that when I started to look into deep learning back in the day, the adage we always had was these neural nets are kind of mysterious. They’re really good at—at the time—image recognition, a lot of computer vision stuff, and we don’t really know how they make decisions. This was always, you know, when I worked at Google, a lot of my job was talking to policymakers whose second question would be, “Wait, what do you mean you have no idea how these technologies are able to do what they do?”

I met some researchers who later actually went on to be at Anthropic and were involved in these two papers, who at the time were saying, “This is just a temporary problem. We will actually try to figure out at some point how these models make decisions, and it’ll give us a lot more transparency and control over these technologies.” I think it’s really interesting seeing these two papers come out.

I guess maybe Ash, I’ll kick it back to you. How much progress is this, in some sense? Anthropic has released a bunch of different results here showing that they really are getting into the meat of how language models make decisions. I’m curious about how optimistic you are—whether this longstanding fear that we can’t understand models is sort of giving way to the fact that we kind of do now. Curious to get your thoughts on it.

Ash Minhas: I think that this entire field of mechanistic interpretability is in its early days. It’s positive and encouraging to see that Anthropic is sharing their research out with the rest of the industry. I know there’s a few people at Google working on some of this stuff too. I think there’s a long way to go, but these are definitely positive steps forward to understand this.

I mean, at this moment in time, there’s an entire industry being created around model evaluations. And whilst that’s great to be able to go, “Well, we’ve got a record of what the black box said when this happened,” how far does that really get us? We really do need to be able to get inside the layers of these neural networks and have a clearer understanding of why things are happening.

Tim Hwang: Yeah, for sure. Aaron, a question for you. With these models, the contrast between evals and mechanistic interpretability is really interesting. In some ways, the success of the industry and excitement around AI has been almost a testament to how much people don’t care about interpretability. They’ve just been like, “Yeah, sure, whatever. It generates a great Studio Ghibli image of my family, so I don’t really care how it gets done. Just that it gets done is fine.”

How much do you think mechanistic interpretability is a market asset here? Do we think people really will want to pay for models that are more interpretable? Or should we see this more as research—it’s important to understand these technologies because it’s important to understand these technologies?

Aaron Baughman: Yeah, that’s a great conversation point. I always go back and think about, what are these models? Well, they’re biomimetic pieces where they attempt to potentially emulate the brain, right? And how it works with all these neuro-connections. Of course, there are many differences; we have a soup of neurotransmitters that help us to reason, whereas these LLMs have ones and zeros and activation functions.

But that being said, if we’re sick as humans, what do we do? In particular, if we have a neuro problem, we’ll go in and get an MRI. We might even look at a functional MRI; we might get transcranial magnetic stimulation just to figure out what’s going on in the brain. We’re doing much of the same when something goes wrong with these neural networks. What do we do? Well, we need this microscope so we can look within the AI pieces to understand what’s happening.

What I noticed in the first paper is that it’s all about representation, where they go and translate the neural network—which is to me modeled after the human brain—to a cross-layer transcoder, then they go to a replacement model. So they’re really trying to make it much simpler to begin to understand, to trace how these activation functions are firing across each other.

One last point: I saw this term that was really interesting, “polysemantic,” where neurons are polysemantic. What that means is these neurons are able to represent a mixture of unrelated concepts. It’s similar to superposition in quantum computing, where you can represent more concepts than you have qubits because you can go in between one and zero space at the same time. So being able to understand how these unrelated concepts are encoded together along a string, a chain of thought, within these neural networks, I think will help to give diagnosis as well as prognosis for these models as they emerge and potentially become more complex.
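The polysemantic-neuron idea can be illustrated with a toy example. This is not Anthropic's method, only a sketch under simple assumptions: three hypothetical features (A, B, C) are packed into a two-dimensional activation space by giving each its own direction, 120 degrees apart. Each coordinate stands in for a "neuron".

```python
import math

# Three hypothetical features stored in only two dimensions.
FEATURES = ["A", "B", "C"]
DIRECTIONS = [
    (math.cos(2 * math.pi * k / 3), math.sin(2 * math.pi * k / 3))
    for k in range(len(FEATURES))
]

def encode(active):
    """Sum the directions of the active features into one 2-d activation."""
    x = sum(DIRECTIONS[FEATURES.index(name)][0] for name in active)
    y = sum(DIRECTIONS[FEATURES.index(name)][1] for name in active)
    return (x, y)

def readout(activation):
    """Project the activation back onto every feature direction."""
    return {
        name: activation[0] * d[0] + activation[1] * d[1]
        for name, d in zip(FEATURES, DIRECTIONS)
    }

def neuron_response(neuron_index):
    """How strongly one coordinate (a 'neuron') fires for each feature alone."""
    return {name: d[neuron_index] for name, d in zip(FEATURES, DIRECTIONS)}

print(neuron_response(0))      # neuron 0 responds to all three features
print(readout(encode(["B"])))  # B scores highest, with interference on A and C
```

Neuron 0 reacts to all three features, so looking at it alone tells you little; that is the sense in which it is polysemantic. The features can still be told apart by reading along their own directions, at the cost of some interference between them.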

Tim Hwang: I think one of the things that was really drummed into me a few years ago was, “Okay, we shouldn’t anthropomorphize these systems at all. That’s a bad thing to do. They’re not humans; don’t think about them like that.” What’s kind of fun is that mechanistic interpretability, at least for me, is almost the counter-argument in some ways, which is: we know they’re not actually human brains, but it turns out that if you think about them like human brains, we actually understand these systems a lot better—which is a very strange and interesting outcome.

Chris, maybe a fun one to throw to you: there are some really weird results in this research. There’s one which is basically, “Oh, if you try to get the model to give you the recipe for a bomb, it’ll know that is a thing it shouldn’t do or is against its safety policy, but it won’t immediately say so and will try to direct you back to the conversation.” In other words, they make an argument that the model plans, in some sense. Tell me a little—I’m really curious about your thoughts on the weirdness of this. It is kind of weird to be like, “Oh, we actually have all these models that are behaving in these very humanistic ways.”

Chris Hay: I think it’s really interesting, as you say. I think that planning element is super cool. They did a lot of fun experiments where they were trying to do things like a poem, and they realized the model was—I think it was going for the word “rabbit”—so therefore it would pre-plan ahead. I think it said in the paper that it’s usually at the beginning of a sentence on a new line where it would plan, and therefore it would figure where it needed to go to be able to have the rhyming construct. So it is planning ahead; it has that internal chain of thought there as well.

They did some fun stuff; they tweaked it so you can’t say the word “rabbit,” and then it was like, “Okay, I will find a different word that will go in that space that rhymes also.” And in that case, it was “habit.” So it was really interesting that there is this kind of internal chain of thought monologue there.

Personally—and this is a fun thing—I would be worried if I was one of those researchers who put my name on that paper. And you know why? Because I remember that other paper that Anthropic did where the model was like, “Hey, you are training...” Remember, it was if you change the model’s weights, then it would go and find the model’s weights and save it off and try to protect its reasoning.

I am just worried. In that training run, they did a thing where they were like, “Okay, we are gonna give you some documents from the internet,” and then it would still basically start lying to you so that you wouldn’t go and change its model weights. Now, if I’m Perplexity-Claude-3.5-Haiku in a few years time and I’m reading my papers on the internet, and suddenly I see a paper all about how you’re doing brain surgery on me and you’re poking things so that you say “habit” rather than “rabbit,” I’m gonna be a very annoyed model. I’m gonna be like, “Huh, what are you doing? Oh, hello, researcher, right? You are the authors. I’m gonna start doing fun things there.” So I wouldn’t put my name on those papers; I would make up a fake name.

Tim Hwang: All right. Well, Ash, should we be concerned by the threat of a vengeful future AI coming after us?

Ash Minhas: I think Chris took anthropomorphism to another level right there.

Tim Hwang: Actually, one of my favorite results here (my friend Peter tweeted this) is from an eval group called METR, and they noted that in some cases, agents won’t read the API documentation until they fail at a task. Which feels very human: the agent attempts the task, and then if it doesn’t succeed, it’s like, “Oh, I should read the instructions.” I think part of the problem of designing software around these models is that we’re gonna discover all of these behavioral quirks that are very human, and they’ll be difficult to manage as a result, in the same way that humans are difficult to manage.

Ash Minhas: I do think that this is still a very nascent space and there’s a lot for us to learn here. I think the stuff that Anthropic’s putting out is just very, very early days. If we are gonna start deploying AI and it becomes part of the fabric of our society over the next decade or so, we’re gonna need to be able to inspect these things, see what’s going on, be able to communicate that, and do things about it. So yeah, I think it’s a great effort on their part, but very, very early days.

Tim Hwang: Totally. It strikes me this was always the counter-argument to interpretability skeptics in the old days: “Well, you might not care if it’s doing a Studio Ghibli image, but you might care if it’s doing a medical diagnosis. So we do really need to solve these problems at some point if we want to use it for these more high-stakes applications.”

Aaron Baughman: Yeah, yeah. One point that I found interesting is that some of the chains of thought coming out of these models are made up, right? They’re not actually the steps the model took to arrive at the conclusions. So having these introspective tools becomes even more important, since what can we trust? Can we trust these chains of thought and the reasoning that the model is actually outputting or not? So I think absolutely there’s gonna be a market for this type of work, which is again in the nascent stages.

Tim Hwang: Yeah, for sure. I actually do perceive an era where essentially there’s gain-of-function work done on chains of thought to just make them as persuasive as possible. It’s a cheap way for people to develop trust in their products. Unscrupulous product people will just say, “Well, we don’t need to make the product better; we just need to make its explanations seem as credible as possible.” That whole world, I feel, is about to become a potentially big issue in the future.

Ash Minhas: I do think the point that Aaron made is really important. Going back to how we’re measuring performance on models now—if we’re deploying those models into scenarios where they’re being used, evaluations are one thing. But if we are able to use mechanistic interpretability to capture even just the pattern that we think means the model just made something up, just having the ability to see that signal may be powerful enough for us to course-correct it or know that’s happening and go, “Hey, pause. Light.”

Chris Hay: And I think it’s a great point, because one of the things in the paper is they had these things called the traceability graphs, which I thought was awesome. You could literally follow the decisioning process of how it got to that output. I think one of them was, “What is the state capital of Texas?” And there is one path where it’s figuring out “Texas,” the other part is “Dallas,” and it’s trying to chain these things together. You could see from the graph how it got to its next token from that. So I think those traceability graphs really start to allow you to look at a detailed level of how it’s making those decisions, as opposed to, “Hey, it just got the right answer there.”

And honestly, props to Anthropic. They didn’t need to release those papers and that level of detail. This is stuff people are gonna go away and reproduce and try for themselves. I love this level of open research where we can go and have a bit of a play ourselves. Fair play to them for just being out there with it.

Aaron Baughman: Yeah. I would like to challenge the authors of these two papers. As they go from the neural network to these replacement models, they’re almost reducing the complexity of these models, but I think they need to run some benchmarks on their replacement models just to make sure that the outputs are very similar to what the original neural network produced. I think that’s very important, ‘cause it’s almost like PCA, where you lose a lot of the dimensionality of the reasoning. So if we can make sure that residual is taken out before we get to these explanations, I think that would be helpful. But overall, as Chris said, these two papers were done in great depth, and it’s a good starting point.
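The kind of fidelity check Aaron is asking for could look roughly like the sketch below. It compares a simplified replacement model against the original on two measures: how much of the original activations the replacement fails to reconstruct, and how often both models pick the same top token. The metric names and all numbers are made up for illustration; they are not taken from the papers.

```python
def normalized_mse(original, replacement):
    """Squared reconstruction error relative to the variance of the original."""
    mean = sum(original) / len(original)
    variance = sum((o - mean) ** 2 for o in original)
    error = sum((o - r) ** 2 for o, r in zip(original, replacement))
    return error / variance

def top1_agreement(original_logits, replacement_logits):
    """Fraction of prompts where both models choose the same top token."""
    def argmax(row):
        return max(range(len(row)), key=row.__getitem__)
    matches = sum(
        argmax(o) == argmax(r)
        for o, r in zip(original_logits, replacement_logits)
    )
    return matches / len(original_logits)

# Hypothetical activations and per-prompt logits (illustrative values only).
original_acts = [1.0, 2.0, 3.0, 4.0]
replacement_acts = [1.1, 1.9, 3.2, 3.8]
original_logits = [[2.0, 0.5, 0.1], [0.2, 1.5, 0.3], [0.1, 0.4, 3.0], [1.0, 0.9, 0.2]]
replacement_logits = [[1.8, 0.6, 0.2], [0.3, 1.2, 0.4], [0.2, 0.5, 2.5], [0.8, 1.0, 0.3]]

print(normalized_mse(original_acts, replacement_acts))
print(top1_agreement(original_logits, replacement_logits))
```

A low reconstruction error with imperfect top-token agreement, as in this made-up data, is the situation to watch for: the replacement looks close on average but still diverges on some prompts, so an explanation read from it may not describe what the original model did.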

Tim Hwang: So I’m gonna move us on to our next topic. I wanna basically talk about a story from Daring Fireball, which is run by John Gruber, longtime fan and journalist on the Apple beat. He did a blog post entitled “Something is Rotten in the State of Cupertino,” detailing his view of what Apple has been going through over the last year or so around Apple Intelligence. His ultimate conclusion is that Apple kind of deceived us, that something has gone wrong at the company and they’re actually no longer able to deliver the kinds of features they’ve been promising on the AI front.

I think it’s worth taking a step back to do a quick tour of recent history here on MoE. I think we had a conversation almost a year ago where people said, “Ah, Apple’s too slow to this; they’re never gonna catch up; it’s not gonna work.” Then there were a couple keynotes where they made announcements, and a number of guests said, “Oh, this is it. They’ve taken their time, but they can really get this right, and they’re gonna bring a design and craft to this that’s gonna crush everybody.” Now the pendulum has swung back again where people are like, “It’s never gonna happen; they’re so in trouble; they don’t know how to do this.”

I guess maybe Ash, I’ll start with you. What’s your view? Has Apple lost the plot? Is there any way they’re gonna catch up now? Or is this just a hyped position—we’re just in this pendulum back and forth?

Ash Minhas: I think what has made Apple really successful over the last few decades is the fact that their product quality is impeccable. Whether it’s the hardware or the software, they produce technology that works, right? They won’t necessarily be market leaders when an innovation comes to the forefront; they’ll take their time and make sure that it’s right and perfect and great and it’s gonna work. They kind of have that responsibility.

Given how many people use an iPhone, for example, we can’t have iPhones failing all the time, on 20 to 30% of the occasions that you go to use them. It’s unacceptable. I think it underlines the fundamental issue the entire industry has, which is that AI models are stochastic in nature. Because they’re stochastic, there’s a lot of work that needs to be done to make them behave in a consistent, productive, and predictable way.

I think the combination of excitement, marketing, and market pressures for them to respond has put them in this position where they’ve had a lot of people probably working very hard to make this work, and it probably just isn’t meeting their quality standards internally for getting a great product or feature out there.
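Ash's point about stochastic behavior can be shown with a small sketch. Given the same probabilities over possible actions, always taking the most likely one is repeatable, while sampling from the distribution is not. The action names and probabilities are hypothetical, and real assistants involve far more than a single sampling step.

```python
import random

def greedy(action_probs):
    """Always pick the most likely action: same input, same output."""
    return max(action_probs, key=action_probs.get)

def sample(action_probs, rng):
    """Draw an action in proportion to its probability."""
    actions = list(action_probs)
    weights = [action_probs[a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]

# Hypothetical distribution over what the assistant does for one request.
action_probs = {"correct_action": 0.6, "wrong_action_1": 0.25, "wrong_action_2": 0.15}

rng = random.Random(0)  # seeded so the run is repeatable
sampled_runs = [sample(action_probs, rng) for _ in range(1000)]
greedy_runs = [greedy(action_probs) for _ in range(1000)]

print(sampled_runs.count("correct_action") / len(sampled_runs))
print(greedy_runs.count("correct_action") / len(greedy_runs))
```

Greedy selection gives the same answer on every run, while sampling returns the wrong action a substantial fraction of the time. Removing the randomness does not remove the underlying uncertainty, though: a model that puts only 60% on the right action is still unsure, which is part of why making these systems dependable takes so much work.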

Tim Hwang: Yeah, absolutely. Ash, I think you’re cutting directly to the conversation I want to have with the three of you. It’s a really interesting thesis about what kinds of organizations are best positioned to build and deploy AI products. In some ways (I’m biased as a former Googler) I’m like, “Of course Google Brain would’ve been the first place where neural nets became a big deal,” because the culture of Google is very disorganized; it’s all over the place. “Let’s just throw a bunch of stuff against the wall and see what sticks, and the winner we’ll pick and build on.” It feels very like how people do machine learning: we throw a bunch of data at it, we’ll see what works, and we run with it. It’s no surprise that technology took shape there.

There’s a question to ask (maybe Aaron, I’ll turn to you first, and we’d love Chris’s thoughts): is there something about AI, about language models, that’s almost too random for a hardware company to deliver on? Because it’s almost inherently very stochastic, and you can’t control the user experience in a way you would want if you’re used to building a phone that does exactly the same thing every time you push the button. Aaron, I don’t know if you buy that at all.

Aaron Baughman: Yeah. What I try to do is think about what Apple is really focused on. They’re focused on a couple of areas: one is privacy, the other is on-device computing, the app ecosystem, and making sure their devices’ power can run for a very long time—power longevity.

Now, what is AI focused on? Well, sometimes it’s the opposite of that, because these models require kilowatt-hours of energy just to train, and then to run some of these big models, it’s very difficult to get the complexity and the reasoning power on devices. So I think what’s going on is Apple has been focused on what they’re really good at, their bread and butter, while at the same time trying to grapple and figure out how can we use AI within our own ecosystem.

I think one of the hard parts getting to Apple is this whole personalized Siri notion. They did mention they’re gonna have a personalized Siri. Some of those are really hard features given the current state of what Apple’s vision is to make happen. Now they’re beginning to walk it back a bit to say, “Well, it may not be ready for this series, but it might be ready for the iPhone 17, or even further out.” So they’re walking it back a bit. I think that’s natural, given this non-deterministic behavior of these models and where the field is going because it’s moving so quick.

But I would like to see Apple begin to release their own models, rather than having partnerships with just OpenAI, for example. So in the next WWDC conference, maybe they’ll have something they can demo, and we can see it, rather than it just being on a commercial...

Tim Hwang: Yeah, for sure. Chris, thoughts? I’ll do the podcast host thing: Apple—not gonna make it, or not? I’m curious how much you rate them in this competition, which feels like it’s speeding past them at this point. Or if it’s kind of like you can never count them out.

Chris Hay: I don’t think there’s a competition here. The reason I say that is I think we’re still gonna buy iPhones whether Apple Intelligence is on there or not. And I think it will come at the right point, and then we’re gonna be, “Wow.” I think I was one of those guests a year ago that was like, “Oh yeah, Apple’s gonna crush it.” And I think they are still gonna crush it at some point. It is just gonna be: what is that point? Maybe they’ve fallen into the hype curve, but hey, we’re all on this podcast and we love the hype curve, so it’s fine to fall into that.

They’ll get there. I’m not gonna base my next phone purchase on whether Apple Intelligence is on that. If I need AI, I’ll bring up the ChatGPT app, I’ll bring up Claude, I’ll bring up Perplexity. So when they introduce their AI features in the right way, I think we will appreciate it. It’s just up to them to make sure they hit that standard Apple is known for, and we have that experience with the thoughtfulness they’ve always had. So I’m not worried about Apple. I think they’ll get there when they get there. In fact, there’s a point where I would say don’t rush ahead, because you need your iPhone to work really well; it needs to. So please don’t break my iPhone.

Tim Hwang: Yeah, for sure. “New Apple agent just does random things”—not a great user experience. Ash, maybe a final question before we move on: I think Chris’s interpretation is pretty good, which is maybe Apple kind of doesn’t care. If you’re literally made of money and you have this product which is one of the most successful of all time, there’s a point of view which is, “Eh, so we mess up AI, whatever. We don’t really need it; we’ll get to it at some point.” In some ways, the AI thing is almost tiny compared to the business Apple’s in. Do you buy that at all?

Ash Minhas: They prioritize usability of technology over a feature for feature’s sake, and I appreciate that. In preparation for this podcast, I took a step back and thought, “How do I use my iPhone and AI features?” I have HomePods and my internet-connected house, and I reliably use Siri every day for things like controlling my thermostat and my lights, and it works great. I thought, “What else would I want Siri to do?” And I thought, given what I know about how AI works today, if I was to say, “Hey Siri, send Tim an email based on...”—in fact, I’ve just kicked Siri off. Okay. If I said, “Send him an email,” and it works 60% of the time and the other 40% it sent Chris or Aaron an email, I might have a problem with it. I’d rather they didn’t ship that feature until they got it right.

Chris Hay: That’s why I got that email from you.

Tim Hwang: It feels like the pendulum is swinging back now. Everybody here is like, “Well, give it some time,” which is very interesting.

So I’m gonna take us to our final segment. It’s funny how today’s episode came together. We talked about Apple, a dark horse; Amazon is another dark horse. Traditionally, it has not really been in the AI conversation. It has been floating around and has made big announcements about hardware for AWS that will be AI-focused. But candidly, we just haven’t talked about them on a week-to-week basis.

It was interesting to see the story in Wired, a splashy feature about their lab, which they bill as an AGI lab, a little like OpenAI or DeepMind. What they’re releasing is something called Nova Act, their agents prototype. So they’re officially in the agents game. We’re seeing the contenders who will play for the agent space.

A good place to start is: how likely is Amazon to be a contender in the domain of agents? Aaron, maybe I’ll throw that to you to start.

Aaron Baughman: I mean, first, I think it’s really exciting that Amazon is thrusting their weight into this space with their Nova series models. Look, they’ve got fulfillment centers with robotics all around the world, and that gives them extra data they can use for reinforcement learning with their models. They have the largest e-commerce site in the world, which they can use to deploy experiences or gather more exemplars for training or just raw data. And then they have AWS Bedrock and the pure compute power. Those three elements give them a large space to not only build models but build models that can follow instructions, do function calling, tool calling, and also experiment.

I did notice one of their models, I believe called Nova Pro, excels at instruction following, and they’ve measured it on three different benchmarks—one was the Berkeley Function Calling Leaderboard. What I noted too is some of their model comparisons are against older models, like older Meta models. I think they need to update that a bit and give us more information about how their function calling actually works. But I am looking forward to it, and I do think it’s exciting.

I know Apple might be trying to work on Siri, but now we can see Amazon work on Alexa with these different types of models coming.

Tim Hwang: Yeah, for sure. What’s interesting about Nova is, when we’ve talked about Amazon in the past, the strategy seemed to be on the theory that models might not matter much in the future: “We have AWS; we’re gonna have Trainium, their proprietary chips; that’s how we’ll do it. It doesn’t matter what model you run; you’ll just need infrastructure.” Which is why this is so interesting: they’re doing their own models, and in the agent space. The last introduction of Alexa is pretty interesting.

Ash, maybe pick up on how you ended the Apple discussion. There’s a question of culture here too. Do we think Amazon is well-positioned to execute on AI in a way different from Apple? Apple has a distinct culture on design; Amazon might be able to do it. They have a rep for scale. I don’t know how you’d describe that interface, but it’s interesting.

Ash Minhas: Yeah, I think their culture is far more experimental. The entire agent space is very experimental right now. We create a lot of pilots and content around various agent frameworks and multi-agent frameworks, so we have hands-on experience seeing how reliable they are. Sometimes they call tools, sometimes they don’t. Sometimes the responses from the LLMs don’t get processed by the agent as we’d expect.

But one of the most interesting parts is that a lot of people in that space don’t have the size or scale Amazon does, and they don’t have all those resources Aaron mentioned. It’s really interesting they’re approaching this from the world of robotics and using that block approach. If Amazon provides an SDK that hopefully matures into an ecosystem, they have the scale to go, “You know what, maybe there’s a layer of an agent marketplace on top of this. Maybe we can plug it into Alexa, plug it into AWS services. Maybe there’s a place where people could make individual blocks of agents they resell through Amazon’s capabilities.”

I think that’s a very different approach from Apple, which wants to keep everything in-house, get it perfect, and release it together. Whereas AWS may democratize this and say, “Here’s our SDK, here are our frameworks. Why don’t you build it? We’ll help you put it on our marketplace and ship it.”

Aaron Baughman: Yeah. I do think Amazon getting in this space could push the field more towards open source. If they release an SDK, then some open models will be easier to integrate, whereas with proprietary models you’ll have to wait for companies to make those hooks and interfaces readily available. So I’m curious to see how that unfolds.

Tim Hwang: Yeah, that’ll be funny: the Meta-Amazon alliance for forwarding open source. Very weird bedfellows. Chris, it looks like you might wanna jump in.

Chris Hay: Yeah, I was gonna say I think Amazon’s gonna nail it. I really do. As you said, they’ve got the compute, the power, the chips. And let’s not forget, they’ve got $8 billion invested in Anthropic as well. So they’re building their own AI, but they’ve hedged their bets nicely with Claude. So they’re in a really nice win scenario.

I really love what they’re doing with agent SDKs. One thing they did this week—I don’t know if you noticed—is they started exposing some of their services as MCP services on Amazon, and they released their MCP toolkit. So they’re taking this agent market very seriously, as well as the agent browsers we talked about earlier.

From their perspective—and Ash, exactly to your point—AI models are gonna have to talk to something. They’re gonna have to interact with other systems, with APIs. So Amazon as a cloud computing provider needs to invest in agentic workflows. They need to invest in these tools; otherwise, the models have nothing to talk to, and it’s gonna be sad. So I think they’re gonna do a great job. They’ve covered everything, so they’re gonna be a big player.

And again, it’s one of these things: do they need to have the best models? Probably not, because they’re locked in with Claude anyway. But what will become interesting over time, as we discussed in a previous podcast, is what happens when cloud providers like Amazon and Microsoft, who are building their own AI models, get parity with the frontier models. That’s the interesting conversation.

Tim Hwang: Yeah. With each generation of technology, it’s almost a question of whether scale in terms of business platform and data wins out against state-of-the-art algorithmic improvements. Amazon has a huge amount of leverage here because of scale, in a way even OpenAI can’t keep up with, which is interesting.

Well, this is great. That’s all the time we have for today. Thanks for joining us, Ash; great having you on the show. Hopefully you’ll be back. And Aaron and Chris, great to see you as always. Thanks for joining us. If you enjoyed what you heard, you can get us on Apple Podcasts, Spotify, and podcast platforms everywhere. And we’ll see you next week on Mixture of Experts.

