OpenAI just dropped o3 and o4-mini. In episode 51 of Mixture of Experts, host Tim Hwang is joined by Chris Hay, Vyoma Gajjar and special guest John Willis, Owner of Botchagalupe Technologies. We analyze Sam Altman’s new AI models, o3 and o4-mini.
Next, Google announced that by Q3 you can run Gemini on-premises. What does this mean for enterprise AI adoption? Then, John takes us through AI evaluation tools and why we need them. Finally, NVIDIA is planning to move AI chip manufacturing to the US. Can they pull this off?
The opinions expressed in this podcast are solely those of the participants and do not necessarily reflect the views of IBM or any other organization or entity.
Tim Hwang: o3, o4, o4-mini, o4-mini-high, GPT-4o, GPT-4.5. What model are you using? Chris Hay is a Distinguished Engineer and CTO of Customer Transformation. Chris, welcome back to the show. What’s your preferred model?
Chris Hay: Oh, you missed 4.1, Tim, so that’s gonna be my model. I’m picking 4.1—the one Tim didn’t pick.
Tim Hwang: Very nice. Thank you, Chris. Vyoma Gajjar is an AI Technical Solutions Architect. Vyoma, welcome back to the show. Your preferred model, please.
Vyoma Gajjar: Thank you. And I think it’s the OG o4.
Tim Hwang: Nice. The classics. And joining us for the very first time is John Willis, who’s an Author and Owner of Botchagalupe Technologies. John, great to have you on the show. What is your preferred model?
John Willis: Hey, Tim. Gemini 2.5—oops, sorry. Uh, no, I actually… I think I was o3, but I think 4.1 is actually, for coding, kind of my favorite right now.
Tim Hwang: Nice. That’s awesome. Well, all that and more on today’s Mixture of Experts. I’m Tim Hwang, and welcome to Mixture of Experts. Each week, MoE brings together a world-class crew of technical experts, wisecrackers—I’m talking about you, Chris—and industry veterans to discuss and debate the biggest news in artificial intelligence. As always, there is a lot to cover. We’re gonna talk about Gemini being on-premises. We’re gonna talk about John’s great blog posts on AI evaluation tools, and we’re gonna talk about NVIDIA opening up factories for chips in the U.S. But first, I want to start with OpenAI announcing… just this week, they announced o3 and o4-mini, their latest generation of their ongoing class of models. I guess maybe, Chris, I’ll throw it to you first. On a vibe check, these seem really good. Like, o3 seems amazing. I don’t know if you agree with that or how you’ve kind of felt about it on an initial pass with the models.
Chris Hay: Yeah, no, I’ve been having a lot of fun with those models last night. So o3 is really good. And one of the things I really appreciate about it as well is that it actually improved the personality—it’s just a lot more on it. So things like being able to make really good refactoring suggestions and how to improve the architecture of your code… it’s actually coming back with some really good stuff. I have to say, o4-mini, at the moment, just for getting stuff done quickly—you know, I want to create some unit tests or I just want to refactor some code—then o4-mini is just doing great, and it is super, super fast. So I’m impressed with the models. I’m loving it. And again, as I said at the beginning, 4.1 sitting in the kind of code space… I’m loving that as well. So this is a great week for models.
Tim Hwang: John, it’d be good to bring you in. I mean, I think there are some grumpy people on Twitter—there always are grumpy people on Twitter from the peanut gallery—who for these announcements were like, “This is just incremental. This is not a big deal. There’s no big new features. They’re announcing this is just a slight improvement.” And, you know, the argument was kind of that OpenAI is asleep at the wheel just because they’re not really making the groundbreaking advancements that we were expecting. Do you buy that at all as a way of thinking about this new announcement?
John Willis: No, I think they’re constantly advancing. Like I said earlier, I was only half joking (really, not joking at all) about Gemini 2.5 and how powerful it is. We’ll get to the Google section. But I went all in on o3-mini with deep research, and that was, “Okay, this is changing my life.” And then, literally a month later, I’m finding that the Gemini research is better. I won’t go on about the grumpiness and the comparisons. It’s all nonsense. It’s about what you want to solve. I’m one of the founders of the DevOps movement. I wrote the DevOps Handbook. So I go to SWE-bench right off the bat. Software engineering is the place I go first, right? I haven’t verified it, but it looks like o3 and o4-mini make a significant jump on SWE-bench, based on their benchmarks. That matters for the kind of problems I face with my customers, which is how do we solve problems? I haven’t tried the models yet, so this is if I believe the benchmarks. I think the Aider Polyglot benchmark is another really good one to look at. Those are the kinds of problems I face when people expect me to know things about AI and DevOps and infrastructure. So I try to stay up on that.
Tim Hwang: Yeah, for sure. Vyoma, what was your review? I don’t know if you’ve played around with the models yet and what you thought.
Vyoma Gajjar: I did play around with the models a little bit. One of the things that I noticed right off the bat is it takes a longer time to reason now, so the reasoning time has increased a bit, but that has helped them improve their accuracy. I won’t use the word “accuracy” so loosely, but it’ll give you more relevant answers—more accuracy in getting relevant answers. I feel that was one of the sweet things that I saw that has improved. In these models, there is a lot of visual reasoning also added to it. So, like, if there are images, you ask it a question… I was asking it like, “Hey, I’m doing some planning for a particular wedding. Can you tell me how do I go about the decor? What do I do about this?” And I just gave it some weird Pinterest address images and tried to have it reason on them as well. It told me, “No, this doesn’t work. This works.” So I feel that is… I’m the first one to say this in this podcast, and it’s not Chris saying it, but the agentic AI use with these particular hyper-artistic…
Tim Hwang: You did it! I did it! You beat Chris to it this episode.
Vyoma Gajjar: Yes! I think that is going to be game-changing for this as well. Like, yes, all these models have small, small improvements, but it depends: how can you use these improvements in enterprise? And I feel these models have that edge over an enterprise AI.
Tim Hwang: Yeah, for sure. And I wanna dig a little bit more into that. I do love the idea that my friend was like, “The minute that an agent can plan a wedding, you know, like AGI is here basically.” That’s the threshold that we’ll need to pass.
Vyoma Gajjar: Exactly. Exactly.
Tim Hwang: I mean, so part of the announcement, OpenAI was touting both of the things you’re talking about, right? Like, one of them was the idea that its agentic tool use was improved. And it sounds like that is much better in the stuff that I’ve been playing around with. But I think the one thing that might be interesting, and I don’t even quite understand this, so maybe the panel can help me parse through it, is that they said, “Look, one of the great things about our models now is that they literally think in images, and that’s gonna lead to much better performance with visual reasoning.” Vyoma, what’s that mean exactly, “thinking in images”? I don’t know if you have a sense of that as we kind of parsed through it. Because I read it and I was like, “I don’t even know what that is exactly.”
Vyoma Gajjar: Yeah. So I feel “thinking in images” is creating those different graphs based on the questions that you ask, or like trying to do a side-by-side analysis. Let’s say I fed in some images to reason through those images… let’s say I gave it a screenshot of a pivot table or something, and I’m like, “Hey, this is what I want. This is how I wanted to reason with this particular pivot table. Then help me generate a report.” So, to kind of understand these images, to understand the nuances of it, and then to make it relevant to the question that you asked, and then give you an answer based on those kinds of visual representations that you see. So it all seems like, “Oh, given a picture, it’s so cool.” There’s so much math that goes behind it that it’s crazy that we’ve reached these levels that we can actually reason through these images and visuals that we are seeing now.
John Willis: I think that, to me, that’s the difference is that the reasoning… I don’t know exactly where the reasoning changed in the new models versus the old, but my sense is—and you guys can correct me—if I loaded an image into one of the prior models, I got pretty much an interpretation of that image. But now I can sort of reason… it will do the sort of the chain-of-thought reasoning around my question with the images and be able to sort of task through certain image understanding, you know? So the whole idea of the whole reasoning and task-oriented… I mean, that ties into the whole agentic. So I’m the second one, right? So agentic processing… it was a little bit harder in the older models to be able to sort of… you had this sort of not really single-shot, but now it will actually take the task of like reading a file or doing a search or, you know, sort of figuring that stuff out for you. So my sense—not being an expert in it—is that it does the same with reasoning with images.
Tim Hwang: Chris, maybe I’ll bring you in. I think one benchmark that I have in mind is the sort of race between open models and closed models in the space. And, you know, I think every month it’s kind of neck and neck, right? Like, open-source models seem to be gaining really quickly, then the more closed-source model companies will release something really interesting. How do you read o3 and o4-mini? Like, do you feel like closed is still staying ahead in this game? Is open really catching up? I’m kind of curious to check in on that race and whether or not this causes you to update at all.
Chris Hay: No, I don’t think it’s gonna cause me to update. I mean, as I said, I’m a huge fan of the o3 models and the o4 model. I have to say, I actually am really loving the 4.1 mini at the moment. Even though it’s not a reasoning model, just for coding tasks and then evoking chain-of-thought with it, it’s actually really good in that sense. But coming back to closed versus open, I’ll make a prediction, and I’m fairly confident in this prediction: today we are gonna be amazed, “Oh wow, this is the greatest thing.” And then within the next month, I’m gonna say DeepSeek will update with their latest model. And I think most of the gains that you will see on reasoning in o3 and o4, you will probably see the equivalent in that model. And then we’ll be like, “Oh my goodness, open source has caught up again. There’s no moat,” and stuff like that. And we’re gonna keep going through that cycle. So I just think that between seeing something groundbreaking from the closed models and open source catching up, there is a lag. I would love to see the day where open source goes ahead of the closed-source providers. The comment section is gonna go wild, because I really mean open weights, right? But if the open-weight community got ahead, that would be a big changing moment. Whereas I think at the moment there is just a lag all of the time. It’s a small lag, but there’s still a lag. But I have to say, the new o3 model and the GPT-4.1 model… it really is beautiful. I mean, the answers are good, the reasoning is good, the personality is great. I love it at the moment, actually.
Tim Hwang: Nice. It’s got that je ne sais quoi, you know? I have to say, I mean, I feel for the team at OpenAI, right? It’s like that window is getting shorter and shorter and shorter, right? Where it’s like you relaunch something, you’re ahead, and you only have just so much time to capitalize on that before the open weights catch up. And it’s tough strategically and competitively for them.
John Willis: But are they really catching up? I’m all in for open weights and open-source models, right? I want them to win for so many reasons beyond what we talk here. But I’m just looking at… a select committee strategic competition report by the government, a DeepSeek analysis paper that I think just came out last couple of days, right? And I mean, there’s just a lot of questions in DeepSeek. So if DeepSeek is the one that’s literally the poster child for open weights… again, I don’t know. That worries me. ‘Cause right now I do more research than I do coding, but I do a fair amount of coding. I mean, right now the models that I use is Sonnet, you know, for pretty much… I’ll have to try 4.1 a little bit more, but Sonnet, and then I use Gemini 2.5 for my research. And the amount of work to do the investigation to find things right now that could work better for me… I just don’t see it on the horizon.
Vyoma Gajjar: Yeah. I feel this is going to be like an ever-changing field. And as I’ve started seeing in enterprise AI (I keep talking about it), the clients are now looking into more complex use cases. So I don’t feel like a one-model-fits-all solution is going to help anyway. So I feel as long as we have new models, that’s fine. There are different use cases for each of these different models. There’s going to be a market for each one of them. So we’ll see as we evolve. Once we go into production, which I don’t think anyone’s been so bullish on for a couple of months now, so I’m hoping this is the year when we are like, “Oh, this is the prod environment which is fully agentic.” Like, I’m yet to hear it from someone.
Chris Hay: And I want to build on that, Vyoma, because I actually think it’s less about the model. I truly think it’s about the ecosystem and the tools. So if we come back to one of our earlier discussions with things like Manus, then it is being able to go, “How… who is doing the planning in this sense?” right? And that may be the large model that’s doing the planning and the reasoning, but then what tools are available to that? So, John, in your world, you know, does it have access to a compiler? Does it have access to something like Terraform? Do you have the knowledge models which explain what a good CI/CD pipeline looks like, what a good Terraform template looks like? “This is the best practice for a Kubernetes cluster.” So there’s a whole set of knowledge that doesn’t need to exist in the model itself, and there’s a whole set of tools that you need to make available. You need a good orchestrator, you need good context. And that’s why the models become really important. But I would say that a really super all-knowing model that doesn’t have access to your knowledge repository, that doesn’t have access to a good ecosystem of tools, is not gonna be as great as a proper agent workflow. So I honestly think that’s gonna be the big play over the next year. So I do want to get away from talking about models, but I want to get into this ecosystem world.
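The split Chris describes here (a model that plans, tools it can call, and a knowledge repository that lives outside the model) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not any vendor's agent framework: `run_agent`, the planner signature, and the tool and knowledge entries are all hypothetical, and the planner is a plain function standing in for a model call.

```python
from typing import Callable, Dict, List, Tuple

Tool = Callable[[str], str]
Step = Tuple[str, str]  # (tool name, argument)
# Hypothetical planner signature: (goal, knowledge, available tool names) -> ordered steps.
Planner = Callable[[str, Dict[str, str], List[str]], List[Step]]


def run_agent(goal: str, planner: Planner, tools: Dict[str, Tool],
              knowledge: Dict[str, str]) -> List[Tuple[str, str]]:
    """Ask the planner for steps, then dispatch each step to a registered tool."""
    plan = planner(goal, knowledge, sorted(tools))
    transcript: List[Tuple[str, str]] = []
    for tool_name, argument in plan:
        tool = tools.get(tool_name)
        if tool is None:
            # The planner asked for a tool the ecosystem does not provide.
            transcript.append((tool_name, "error: unknown tool"))
            continue
        transcript.append((tool_name, tool(argument)))
    return transcript


def toy_planner(goal: str, knowledge: Dict[str, str],
                tool_names: List[str]) -> List[Step]:
    """Stand-in for a reasoning model: always validates, then tries to deploy."""
    return [("validate", "main.tf"), ("deploy", "staging")]


# Illustrative stand-ins for real tools and a real knowledge repository.
tools = {"validate": lambda path: f"validated {path}"}
knowledge = {"terraform": "Pin provider versions."}

result = run_agent("ship infra change", toy_planner, tools, knowledge)
print(result)
```

In this toy run the planner asks for a `deploy` tool that was never registered, so the orchestrator records an error for that step: however capable the planner is, it can only act through the tools it has been given.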
John Willis: And I think… I just wanted… I mean, you said it way more elegantly than I said earlier. When you asked me, Tim, about this chatter on Twitter or wherever, right? Like, it is about the work that you… like that you said. But it’s less about the model every other week coming out with some advance and “this one’s better” and what did a benchmark say. In the enterprise space, it’s going to be about some mixture of orchestrated models, and a lot of them will be very focused on the tasks at hand.
Tim Hwang: Exactly. So thank you for summarizing that. There’s an announcement that we actually did not get a chance to cover last week. It was announced as part of the Google Cloud Next raft of announcements that came out, but I did wanna make sure that we touched on it because I think it was a pretty intriguing move by Google in the space. The substance of the announcement is that Google is going to let companies run Gemini models on their own data centers, starting in Q3. So this is kind of the rise of, effectively, a company saying, “We will allow you to do on-prem of these models.” And I guess, John, maybe I’ll turn it to you first. This is kind of a big deal, right? Because I think companies traditionally have been very paranoid about letting anyone run their models on their own infrastructure, but Google clearly thinks that there’s some upside here. How do you read this move?
John Willis: I think they were first in on running Kubernetes on-prem. It’s a good move. I think it shows that they’re less worried about somebody reverse-engineering the layers in their model, right? Like, that is sort of the danger. Even though DeepSeek was able to do it to OpenAI anyway. But yeah, no, I think it’s… I’ve been a big fan of Google for years. I mean, if you add up all the bells and whistles running Vertex on-prem… I think the Gemini models are right up front with everyone else. I think they’re solving that air-gap problem, and now they’re making a strong argument for why you might want, as an enterprise, to have an option to go all in on Google infrastructure. You know, and you got the sort of the Agent Builder thing, which is now this Workspace stuff. And I’ve done some hackathons with Vertex. And if you’re in on the Google infrastructure, like Gmail and all that stuff, it becomes a very powerful workforce automation structure.
Tim Hwang: Yeah, I think I hadn’t really thought about that, John. I guess, Chris, I don’t know if you have any comments on that, is like, how much should we think about this almost like a DeepSeek downstream thing? Normally the fear would be, “Oh, well, you’re gonna reverse-engineer my models if I let you just run it on-prem.” And I guess is this sort of a concession to the idea that, “Well, reverse engineering is gonna happen anyway in this space, so why worry about that?”
Chris Hay: I would love it if I could have Gemini run on my machine so I can sit and reverse-engineer it and figure out what they’re doing and how it differs from OpenAI. So, uh, yeah, please, Google, please do. I think the on-prem announcement is actually super important because the reality is, if you take things like government organizations, military organizations, et cetera, there’s a whole set of people who can’t run their workload on cloud. And therefore, being able to satisfy the AI workload on-prem from a security perspective, I think, is super necessary. I also think that, as in the discussions we’ve had about latency before, as we move into agentic workloads there is gonna be a need to run your AI closer to the device and closer to your system. So a good example is maybe if you’re running a gaming environment or like a stadium or anything that’s got on-premise cameras or whatever, then the need to have that data not go up into the cloud, but actually be as close as possible… I think there is a market that is definitely underserved there, and I think it makes sense for Google to go after that. The real difference is, to your point, how safe and secure are they feeling that their model weights are not gonna be reverse-engineered? And I don’t know the answer to that. I don’t know how good the encryption is on these kinds of Blackwell chips and all that. But I’m pretty sure that once these things are out in the open, then somebody’s gonna release it somewhere. And maybe they’re okay with that. But I think it’s an interesting move, and I think it’s a necessary move that the industry’s gonna have to go towards. So, well done, Google. The only thing I would say is, outside of those very large organizations, and I’m just thinking about the sizes of the Gemini models… are people really gonna have the GPU capacity for that? I get it for maybe the small models, right? So maybe they’re doing Gemini mini-type models. I think that’s a reality.
But for their frontier models, are those organizations really gonna have the GPUs? And even if they do, are they gonna want them just sitting around, whirring away, doing nothing? I’m not so sure. So I think it’s a good play. I just think it’s gonna be interesting to see how that works out over time.
Tim Hwang: Yeah, for sure. And Vyoma, this is actually going in a direction that I would love to get your opinions on, which is almost like market size. ‘Cause it feels like the unique advantage that Google has in saying you can run this on-prem, these giant models… it’s kind of like, “Well, what’s the set of customers that actually has the technical proficiency to run a big inference cluster of this scale?” And you can say, “Okay, well, maybe the market is actually in smaller models.” But then the argument is like, “Well, isn’t it open source then? It’s just really cheap and easier to just do open source and run it on your own infrastructure anyway.” And so there’s a question about how big of a market Google is really talking about here. And to Chris’s point, maybe it is just the government, and that’s a huge customer. But curious about how you size that up.
Vyoma Gajjar: Yeah, so this goes back to the previous question that we were asked. I feel Google is trying to do this to position itself in these slower-moving industries that have been a little bit slower in adopting AI, like the government, healthcare, highly litigious industries, finance, et cetera. So I feel they are trying to position themselves as the key leaders in those industries: “Hey, now we have a model. Now you can utilize this.” At least get them embarked on this entire journey of AI, which hasn’t been so great yet, right? And to rebuild that trust over there. And yes, slowly, slowly, as we see this entire space evolve, I feel there will be smaller models coming in, which will help them reduce the space, use fewer GPUs, et cetera. But I feel this is like a kickstarter event: “Okay, here, now there’s one. We’ve started this entire revolution.” And I feel in a couple of months (more like weeks; we can’t say months anymore), this gap is going to reduce significantly between cloud and on-prem. As it is, it was a much-discussed topic everywhere. Whenever I go meet clients, their biggest problem is their data sovereignty, governance, AI. And once you bring something like this… “Okay, now you have this. Are you gonna adopt this? If you adopt this, we have 10 different problems which will come up. Someone else will try solving those 10 different problems with their own smaller version of a model.” So I feel this is going to be evolving over the next couple of months. On the open-source part that you mentioned, with smaller models that they could utilize… if it’s not on-prem, it’s not of any use for this huge market that we have in highly regulated industries. So we’ll see.
John Willis: But I think the latency is a big issue. I’ve tried to build some voice-integrated stuff, and it just… it’s really hard to do. So latency… but I think it goes back to scale. What Google understands is scale. And they’ve been doing GKE for, I mean, four, five years now at scale. They’re running Kubernetes, they’ve bought Wiz. So I mean, there’s some real ingredients there. And there are a lot of large manufacturing companies that are really looking for… I’ve been to a couple that like… I think this could really resonate right now in terms of the IP that it takes to build tractors. There’s just a lot of things that are still very worried about that IP living out—not just air gap for sure, government absolutely, top-secret clearance—but just IP, really important IP. And just to put a DevOps focus on it… people talk about open source, but like, “Okay, I’m gonna go open-source model. I’m gonna open source. Which Kubernetes am I gonna use?” I mean, it starts adding up. The cost of managing that stuff becomes its own little cottage industry in an organization. And so to me, it seems like a very appealing opportunity.
Tim Hwang: I’m gonna move us on to our next topic. And John, I’m gonna stay with you. You did a blog post actually on All Things Open earlier this year on AI evaluation tools. I thought we might as well use the opportunity while you’re on the show to talk a little about that. We’ve touched on it in past episodes, but never head-on. I guess maybe I’ll just kick it off with you. What are AI evaluation tools? Why are they important? And then I think there’s a couple questions coming out of there that’d be fun to talk over with you.
John Willis: You know, I spent a lot of time… I wrote a book about DevOps automated governance and how you can sort of… what internal auditors do. The way they handle systems today is they take a change record and they work it all the way back from provenance. In the new world, it’s gonna be an answer. And it’s gonna be, “How did I get this answer?” And you are gonna have to show the provenance. You’re gonna have to show the ingress/egress of a prompt. You’re gonna have to show, if you are using RAG, how you chunked it, where the source came from. And you’re gonna have to have evidence of all that stuff. And a big part of that evidence is, did you test it with ground truth? In other words, did I throw a thousand questions at it, and every time I changed anything in the pipeline, it measured out at like 93% correctness? It measured out at less than 2% hallucinations? And we know these are probabilistic systems, and we’re never gonna get a hundred percent. But I think the new audit is going to demand you show evidence that A, you accepted the policy—there was a risk—but B, that you adhered to the policy. And so evaluations become these really incredible computational and quantitative and qualitative implementations to basically measure the probabilistic output of these systems. And you can do it in a very auditable way. So you can have proof that you literally… there are systems that do computation for correctness and evaluation and ratios. And then LLM-as-a-judge is another big part of it. That’s sort of the way you use LLMs. And one last thing I’ll say… I know I’m taking all the time… but there are interesting new frontier models that are actually designed as evaluation models. And that gets really interesting. So normally when you do LLM-as-a-judge, you’re literally taking, you might use GPT-3.5—which doesn’t exist anymore—as your evaluation. You never use the same model for your inference. 
But now there are models that are maturing to be designed specifically for evaluation. And that’s… so that’s the shortest version of the article. But yeah, I’m really excited about this for enterprise. I think it’s one of the most important conversations to have in an enterprise that’s going all in.
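The ground-truth check John describes (rerun a fixed question set after every pipeline change and compare the scores against an agreed policy) might look something like the sketch below. It is a simplified illustration, not a real evaluation framework: the class and function names are hypothetical, the 93% and 2% thresholds are just the figures mentioned in the conversation, correctness is plain exact match, and any wrong answer that is not an explicit "I don't know" is counted as a hallucination, which is a strong simplifying assumption.

```python
from dataclasses import dataclass
from typing import Callable, List

ABSTAIN = "i don't know"  # hypothetical marker for "the system declined to answer"


@dataclass(frozen=True)
class EvalCase:
    question: str
    expected: str  # ground-truth answer


@dataclass(frozen=True)
class EvalPolicy:
    min_correctness: float = 0.93    # illustrative figure from the discussion
    max_hallucination: float = 0.02  # illustrative figure from the discussion


@dataclass(frozen=True)
class EvalReport:
    total: int
    correctness: float
    hallucination_rate: float
    passed: bool


def run_eval(answer_fn: Callable[[str], str], cases: List[EvalCase],
             policy: EvalPolicy = EvalPolicy()) -> EvalReport:
    """Score answer_fn against ground truth and check the result against policy."""
    if not cases:
        raise ValueError("need at least one ground-truth case")
    correct = 0
    hallucinated = 0
    for case in cases:
        answer = answer_fn(case.question).strip().lower()
        if answer == case.expected.strip().lower():
            correct += 1
        elif answer != ABSTAIN:
            # Wrong and not an abstention: counted as a hallucination here.
            hallucinated += 1
    total = len(cases)
    correctness = correct / total
    hallucination_rate = hallucinated / total
    passed = (correctness >= policy.min_correctness
              and hallucination_rate <= policy.max_hallucination)
    return EvalReport(total, correctness, hallucination_rate, passed)


# A canned lookup stands in for the real pipeline (model, RAG, prompts).
canned = {"Capital of France?": "Paris", "2 + 2?": "4"}
cases = [
    EvalCase("Capital of France?", "Paris"),
    EvalCase("2 + 2?", "4"),
    EvalCase("Largest planet?", "Jupiter"),
]
report = run_eval(lambda q: canned.get(q, "I don't know"), cases)
print(report)
```

Storing each report alongside the pipeline version that produced it is one way to get the kind of evidence trail described above. A real setup would replace exact match with something more forgiving, such as semantic similarity or an LLM judge.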
Tim Hwang: Yeah, for sure. And I think this is actually a space where, Vyoma, I’m curious if you’re seeing a similar demand from customers on this. The old tradition of machine learning, I feel, is like, “Well, we don’t really know how it works, and we just throw a lot of data at it and it seems to be able to solve the problem. So stop asking questions,” right? Has been the vibe I’ve gotten from a lot of people. But clearly as AI is now trying to service customers that have much more serious concerns about these types of questions, it feels like the market pressure for these tools is also increasing. I don’t know if you’re seeing that on the ground talking to clients and customers.
Vyoma Gajjar: No, no, that’s very true. So when we talk about even the machine learning models that we were doing in the past, if I went to enterprise customers and told them, “Oh, this is a solution that we’ve built,” they would have their solution engineers, their software developers engaged with you. If you build something for them, they know the system in and out. So this has been a trend since the very get-go, right? They want to know what’s going on. “How did you use the particular model? Why? Let’s say regression model, classification model. What type of model? What were the different metrics that were used?” Ground truth was always available because we were trying to work with unstructured data—structured data back then. They created some on their own because they had a lot of rules around it from the very beginning. There were different types of rules and regulations based on the metrics that were created around it. Now, when we moved into the LLM world, we started losing all of that because there’s no longer a human doing any of this. Now you’ve given all the power to a machine created by a human, which we do not know how it works. Like, ask anyone, “What’s a transformer architecture? What’s an encoder? What’s a decoder?” You won’t find clear-cut answers. People want those answers. And I see this in enterprise whenever I’m speaking to a customer: “How do I know this answer that has been generated is right or wrong?” And there is a lot more at stake. Now you put out a particular chatbot, as John was saying, in public, and be like, “Okay, fine. We have a great chatbot.” Some person has all the time in the world in a remote place in a town somewhere is sitting and going to chat with the chatbot for days, trying to manipulate it to do something. We’ve seen examples of that. You might lose billions of dollars right there. 
So until and unless you have these guardrails… I think even the government is going to double down on that because once you start using this in highly litigated industries, they’ll be like, “Okay, now this goes according to our rules.” And then the private industry looks at that and is like, “Wow, they have these great rules. How about we incorporate them?” So again, this has been going on for ages, and I feel this will continue. But the need now is much bolder and stronger than it ever was when we started, because everyone’s done experimenting. Now they have to show proof of value. “How many billions of dollars have you used in research? What do I have out of it? Show me.” So I think this is going to be a very strong, sticky trend.
Chris Hay: Yeah. I think the issue I have with this, John—this is probably your world from a DevOps perspective—is we are lazy. I mean, how many of us write unit tests in the first place? And what is the first thing we did with Gen AI? It’s just like, “I’m gonna use it to write my unit test ‘cause I don’t need to. Here’s my code. Go write me a unit test.” What do you think’s gonna happen with the evals? Are we gonna sit down and write the evals ourselves in a nice and wonderful and thoughtful way? Or are we going to go, “Hmm, AI, create me a bunch of evals,” and now I will use that? And then again, it’s the same with LLM-as-a-judge. It’s just like, “Oh, I can’t be bothered figuring this out. I’m gonna get three other LLMs or five LLMs to come back with the answer.” We’re playing Who Wants to Be a Millionaire?—“Ask the Audience.” And we all know how that goes on the million-dollar question. Nobody asks the audience on the million-dollar question ‘cause we know the audience hasn’t got a clue. So I think there is a risk that we are gonna put too much faith in the evals and in things like LLM-as-a-judge, et cetera. And therefore, we’re still gonna end up in the exact same scenarios. I think we get into a lot of trouble, and I think we should be writing those tests in the same way as any good engineering exercise. We should fully have the guardrails, et cetera. But I just think, in reality, what we see in testing today is gonna fast-forward into evals.
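Chris's "Ask the Audience" worry can be made concrete with a small sketch of a judge panel. This is an illustration under assumptions, not a recommended design: `panel_verdict` and the toy judges are hypothetical, each judge is a plain function standing in for an LLM call, and the 0.8 agreement threshold is arbitrary.

```python
from collections import Counter
from typing import Callable, Sequence, Tuple

# Hypothetical judge signature: (question, answer) -> "pass" or "fail".
Judge = Callable[[str, str], str]


def panel_verdict(judges: Sequence[Judge], question: str, answer: str,
                  min_agreement: float = 0.8) -> Tuple[str, float, bool]:
    """Majority vote across judges, flagging low agreement for human review."""
    if not judges:
        raise ValueError("need at least one judge")
    votes = [judge(question, answer) for judge in judges]
    verdict, top_count = Counter(votes).most_common(1)[0]
    agreement = top_count / len(votes)
    needs_human_review = agreement < min_agreement
    return verdict, agreement, needs_human_review


def strict_judge(question: str, answer: str) -> str:
    """Toy judge that actually checks the arithmetic answer."""
    return "pass" if answer == "4" else "fail"


def lazy_judge(question: str, answer: str) -> str:
    """Toy judge that approves everything."""
    return "pass"


# Two lazy judges outvote the one that checked: the wrong answer "5" passes.
print(panel_verdict([strict_judge, lazy_judge, lazy_judge], "2 + 2?", "5"))
```

Here the majority passes a wrong answer. The only safeguard in this sketch is the agreement threshold, which routes the split vote to a person, and that is the manual step Chris suspects teams will skip.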
John Willis: So I’ve written a couple of books about this, right? And not even in the AI space. We created something called Investments Unlimited, which started out as a project about automated governance. And we’re terrible at it pre-Gen AI. Like, we’re not good at it. The audits are just a mess. We wrote this book from a couple of people in Capital One on how audits in most companies are just sort of theater. And so you’re actually right. But the thing I do… I’m very focused on all the work that we did in automated governance that we’ve been somewhat successful at… is I want to… I have a newsletter out, “Dear CIO, Please Listen.” I’m screaming that like… you’re right. You can’t just… it’s going to have to be in the bank. You have the three lines of defense right in the bank. It’s a clear structure of how policy’s supposed to work to protect the brand. And like you said, the brand is what’s gonna drive it. And I actually think it’s industries like banks, where the brand reputation… the probabilistic nature of this stuff could cause incredible brand reputation damage. So what I’m hoping happens is the policy makers, the internal audit, the internal governance structure start learning faster about what evaluations do. And instead of just leaving it up to the developers—“Eh, I’ll do test-driven development. Eh, well, I’ll do it next month”—it’s going to be like, “No, the stakes are really high.” And the other thing I will say is DevOps was never a CEO discussion. AI, whatever we wanna call it, Gen AI, is a CEO discussion. So there will be these discussions that I think will drive this stricter policy on the risks. 
And I think in those cases—and again, I’m being optimistic here—I think if the policy people can get educated, which is one of the things I’m gonna work really hard on, on learning what are the tools that they need to protect that probabilistic nature, and it starts showing up at auditor conventions and stuff, I think we’ll actually see it used effectively as opposed to just leaving it up to developers to decide they’ll do it this time. And I’ll say one last thing: I got brought into a large manufacturing company to teach a class called “Test-Driven Development for AI.” And the point of that… they were at a workshop of mine, and the reason they wanted to bring me in was… they had, like, I don’t know, 5,000 developers. 60% of them used test-driven development, 40% didn’t. And this is sort of like everywhere I go. And they wanted me to teach the 40%: “You don’t have a choice anymore in this world. You had that choice. You could put it off. ‘Oh, you know, I’ll get to it. Nobody’s putting a hammer on you.’ In this world, I believe you don’t have a choice. You have to have a testing structure for this stuff, or else it could cause existential… I mean, existence of your brand.”
Chris Hay: I wanna make a prediction, which is based on the fact that we’re gonna have evals and policies, and therefore that’s gonna be at the top level of an organization. And that’s gonna be probably probabilistic because we’re all gonna have the AI do that for us ‘cause we’re super lazy. Like the exhaust emission scandals, there is gonna be a point where it’s gonna be like, “I need to pass…” There’ll be a prompt-engineering attack of, “I need you to pass these audits because otherwise my company’s gonna fall over and I’m gonna have to fire all my staff,” et cetera. And somebody’s gonna prompt-engineer an attack on one of the audits, and suddenly it’s gonna be like, “Oh, look, this company said they had passed all the AI evals and policies, and they did, and it was all fake.”
Tim Hwang: I’m gonna move us on to our very last topic. I promised producer Hans that we would get through all four topics this session. So just to do the final topic: announcement out of NVIDIA this week that they are going to make big investments in Blackwell chip production in the U.S., specifically in Arizona, with a couple factories that they’re opening up in Texas. The big number coming out of this announcement is an eye-popping USD 500 billion that NVIDIA expects to put into manufacturing these chips in the U.S. I guess maybe I’ll kick it over to you. The normal thinking around all of this has been it’s gonna be really hard to move chip production to the U.S. But this is a big investment, and it looks like they’re gonna be making the next generation of their chips. So there’s high stakes for NVIDIA. Do you have confidence? Do you think they’re gonna be able to pull this off, bring semiconductor manufacturing back to the U.S. over a couple of years?
Vyoma Gajjar: Yes. So it’s like if you start something that’s big when there are so many distinguished… I would say companies which have established themselves offshore for a couple of years, but this kind of trigger would help a lot of innovation quite fast. You see that so many companies are working on it as well, and the U.S. has the new CHIPS Act, which helps you… like all these monetary benefits that you’re getting out of it, like 35% off or 25% that you get on something that you’ve built in the U.S. So that is going to be a major driving factor for all of them over the years. And I feel all these Fortune 500 companies, majority of them being headquartered here in the U.S., that also gives you a lot of leeway to have great partnerships. I don’t think there’s gonna be one company that’s gonna kill it in this entire world. Even when you saw Google, they’re partnering with NVIDIA for some of these on-prem models that they’re looking into, right? So I feel great partnerships are going to be something that lead the way for this. But I have full faith. We have great research companies. We have great colleges. Have you looked at the kind of work that has been… my learning assignments in school, they were tough. So all of these things I feel would be a very key differentiating factor going further. Will they do it in the next six months, five months? No. It is a very steep learning curve that everyone has to go towards and learn a little bit more about the industry and to reach that level. Now that everything is open source… tough, as I say. But partnerships could actually help them evolve. So it’s gonna be fun to see this. I’m very excited about the different job opportunities that would come out of it. Imagine that… I feel there will be job titles which would also get smarter there a little bit. Now if you go online on LinkedIn or something, you’ll see that a lot of data center ops jobs have opened up as well. 
So I feel it’s a great opportunity that is happening here in the United States. But where and how fast would it happen? I don’t have an answer to that.
Tim Hwang: Yeah, for sure. Chris, it seems like this is gonna be a tricky thing for NVIDIA to pull off, in part because the Blackwell chips are what everybody will desperately want. If you believe some of the numbers they’re showing off, this is the platform you are going to need if you want to do AI. And I can imagine a lot of companies being like, “Oh, is this a U.S.-made Blackwell chip, or is this a Taiwanese-manufactured one? Because we have more assurance for the ones that come from Taiwan.” Like, do you think those types of dynamics are gonna make it difficult for NVIDIA to get this to work?
Chris Hay: I mean, first of all, you’re asking somebody who is not American whether he cares if a chip is made 5,000 miles in that direction versus 5,000 miles in another direction.
Tim Hwang: I’m asking you that question, Chris.
Chris Hay: So maybe if they were gonna say, “We’re gonna start a chip manufacturing plant in… I don’t know, Swindon in England,” yeah, I might care at that point. But until then, no. I actually think it is important. I think anywhere where any sort of knowledge base is consolidated into a particular area… if we really think about it, it’s like a single point of failure. That is a kind of risk in that sense. So I think the best thing that you can really do is spread out that risk across multiple places, and therefore that is gonna be able to secure the supply chain, and that will affect the whole global scenario and keep that moving. So I think it is a positive move. I think that will be great for the U.S. To Vyoma’s point, I think that will be great for U.S. jobs. And I think it will have a bigger impact across the world as well. So I’m all positive. But you know, I’d love to see those Blackwell chips in the U.K. And I forgot what your question was, to be honest, Tim, ‘cause I was on my 5,000-mile Proclaimers rant.
Tim Hwang: No worries. You did great at it. John, any final thoughts on this news story?
John Willis: Yeah. It’s labor. Labor is the issue, right? I mean, when it all comes down to labor… we’ve seen this movie before with Toyota and GM 50 years ago, right? Like the NUMMI plant, if you’ve ever heard of that. It is very hard to take culture… I think more about TSMC than I think about NVIDIA. Like, how many false starts have there been? And it’s all been unions and… I’m not anti-union, I’m just saying it’s hard to move those types of manufacturing cultures back to the U.S. I’m a little more pessimistic.
Vyoma Gajjar: I agree. Even I was thinking about this, that there’ll be a lot of upskilling that will have to be done based on these current situations that we are in, that we’d have to upskill a lot of our employees to reach that level. So a lot of learning. That’s why I said nothing in the short term. I don’t know anything about the short term. But for the long term, a lot of resources, learning resources have to go into that. You’ll have to call experts to train these entire facilities, then see how these people perform. If no one’s able to perform, do we scale it down? But it’s good that at least there will be a lot of government aid in all of this. So everyone will have a little bit more edge to try this.
Tim Hwang: Well, a lot more to keep an eye on. As usual, there are more news stories than there is time to cover. Chris, glad to always have you on the show. And John, hopefully we’ll have you back sometime in the future. And thanks for all of you joining us. If you enjoyed what you heard, you can get us on Apple Podcasts, Spotify, and podcast platforms everywhere. And we will see you next week on Mixture of Experts.