Google’s AI overviews, Golden Gate Claude, the "whale computer" and scaling laws
 

Episode 5: Google’s AI overviews, Golden Gate Claude, the "whale computer" and scaling laws

In Episode 5 of Mixture of Experts, Bryan Casey, our guest host, is joined by Kate Soule, Chris Hay and Skyler Speakman. Today, our experts revisit a conversation from a previous episode around Google’s AI overviews and the market reaction. Additionally, they break down Anthropic’s Golden Gate Claude. Finally, what is the “whale computer” and how does it relate to scaling laws?

Key takeaways:

  • 0:00 Intro
  • 2:21 Google AI overviews
  • 15:10 Golden Gate Claude
  • 28:51 "Whale computers" and scaling laws

The opinions expressed in this podcast are solely those of the participants and do not necessarily reflect the views of IBM or any other organization or entity.


Episode transcript

Bryan Casey: Hello and welcome to Mixture of Experts. I am not your host Tim Hwang. Uh, we have regrettably let Tim go on vacation this week, so I’m going to be doing my very worst impersonation of him. So thank you all for bearing with us this week. But I am Bryan Casey, and thrilled to be joined by three distinguished guests this week who are going to help us cover the week’s news across product announcements and new research. Um, this week we’ve got three exciting topics on deck for us. First, we’re going to start by following up on a segment we had two weeks ago. So two weeks ago, we talked about the introduction of Google’s AI overviews. Those things have now been out in the wild for two weeks, and the market reaction to them has also been at times wild. And so we’ll discuss a little bit how the market is responding to what is, for some folks, probably their first experience with Gen AI. Um, second, we’re going to be talking about a model that turned itself into a bridge.

Bryan Casey: The Golden Gate Bridge specifically, um, so Golden Gate Claude and the implications around interpretability and safety, and how hopefully we can at some point find a different sort of bridge between plausibly useful and actually useful when it comes to some of this work around interpretability. And then finally, every week feels like a good week to talk about scaling laws. But with Nvidia earnings, with Microsoft introducing what has now become known on the internet as the whale computer, and even just some of the recent discussion on the web about running out of data for pre-training, now is as good a time as any to talk about the topic and maybe to take a slightly different approach to it than we have in the past. So today, as usual, we are joined by a distinguished group of researchers, product leaders and engineers. I am joined by Kate Soule, program director, generative AI research. So welcome to the podcast, Kate.

Kate Soule: Thanks.

Bryan Casey: Chris Hay, uh, distinguished engineer, CTO, customer transformation, welcome back, Chris.

Chris Hay: What up.

Bryan Casey: And a newbie on the show, Skyler Speakman, senior research scientist. So welcome to the show, Skyler.

Skyler Speakman: My first time here. I’m looking forward to it. So thanks y’all for being here.

Bryan Casey: We will start with AI overviews. So, as I mentioned, two weeks ago Google said that they were going to roll these out across the United States, and they did in fact do that. And very predictably, the first thing the internet did was latch on to every single example that was funny or troubling around various hallucinations that were happening. And of course, those things have been going viral across social media. I wrote down some of my favorite examples that I’ve seen, which included Google recommending that the correct number of rocks to eat is a small number of rocks, um, that a pair of headphones weighs $350, that certain toys are great for small kids when actually they’re potentially fatal, and then finally one that I think is yet another example of some of the problems: when asked which race is the strongest, Google said that white men of Nordic and Eastern European descent were in fact the strongest. I had not heard that one. That was, uh, yes, so all of those things. So I do want to start by maybe adding a little bit to this, which is that Gemini’s a very capable model, actually, and the thing we’re not seeing on the internet is all the things that are actually going fine and well, right? People are cherry-picking to some extent examples that are particularly comical or troubling. Um, and one of the things that I’m sort of reminded of is that Twitter is not real life, um, but it does feel like a different level of visibility for this content than when it was hidden behind, you know, a chatbot that you had to consciously sign up for. And even if LLMs are hallucinating, let’s just say, 1% of the time (it’s more than that, but let’s just say it was only 1% of the time), knowing how much search volume is on Google, that’s still a staggering volume of hallucinations happening every day.
Um, and so Chris, maybe I want to just start by turning it over to you to get your initial reaction, and maybe just comment on, you know, what you think is the right way to think about this problem. Is this like a nines-of-reliability problem? Do people need to start treating machines more like they treat humans, with a degree of, not trust necessarily, but like a trust-but-verify approach? Or do you think the market’s just cherry-picking examples here, and it’s actually going mostly fine and will just continue to get better over time?

Chris Hay: So I think it’s a really interesting question because we’ve all been doing retrieval augmented generation for a while, right? Um, but this is really retrieval augmented generation on a global scale. And the big issue that you have here is that when you’re doing the AI overviews, it really can’t tell the difference between what is truth and what is satirical or made up or a fun article, and the internet is full of that. So if we take the rock example that you had there, Bryan, that actually came from a satirical article in The Onion, but Google couldn’t differentiate that. And I think that opens up a whole thing, as you were saying there. So one of the things to be thinking about there: it’s one thing for The Onion to have a satirical article, and you click on that, you know it’s a satirical article. But when Google takes that and then produces an overview and puts it at the top and says this is the answer to your question, then is it Google speaking at that point, or is it really just providing a summary of what it found? And that’s where I think there is a real fundamental difference in what’s going on here. So this ability to really distinguish what the truth is, what isn’t the truth and what is really just a fun article, I think that’s the challenge that they’ve got ahead of them now. If we look at something like Perplexity, they seem to have solved that problem, so I have no doubt that Google will solve that problem in time. But I think this comes down to being able to distinguish between the results.

Skyler Speakman: I’m glad you brought up the RAG analysis because I wanted to just jump in there. I think there is a difference between referencing incorrect information and a hallucination where the model is generating it. And I’m not quite sure yet, for Google’s AI overviews, how much of it is incorrect references from a RAG system and how much of it is truly novel, incorrect generated text. And I don’t know if we know the inner workings of that quite yet. But there is a difference between those two types of mistakes made in these AI overviews.

Kate Soule: Yeah, I was going to say, right, when you do RAG anyway, depending on the creativity, you know, you’re going to have a little bit of creativity anyway in your settings. So it’s really how much are they going to crank that up or crank that down over time.
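
The "creativity" setting Kate mentions is not specified in the episode. One common knob of this kind is sampling temperature, and the sketch below assumes that is what is meant. It is an illustration only, not a description of how Google's system is configured, and the logits are made-up numbers.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw token scores into probabilities, scaled by a temperature.

    Lower temperature concentrates probability on the top-scoring token;
    higher temperature spreads it out (more "creative" sampling).
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three candidate next tokens.
logits = [2.0, 1.0, 0.1]

cranked_down = softmax_with_temperature(logits, 0.2)  # near-deterministic
cranked_up = softmax_with_temperature(logits, 2.0)    # flatter distribution
```

With the temperature cranked down, almost all of the probability sits on the first token; cranked up, the other candidates get a realistic chance of being sampled, which is where unexpected output can creep in even when the retrieved passages are accurate.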

Bryan Casey: It’s actually interesting you mention that, because there were examples, like the example of the children’s toy that was actually potentially a safety hazard and fatal if swallowed. The funny thing is, there was a thread about that which went a little viral, and the first post in the comment section was somebody referencing the number one result on Google, which had almost that content verbatim in it. But what was interesting is that when it was Google showing the result versus it just being a link on the internet, the reaction to it was totally different. When it was Google, it was this massive crazy problem. When it was just the fact that this was the first result on the internet, people were like, oh, well, it’s just content, um, and that happens all the time, and people have to know not to trust that stuff. And so people do seem like they’re approaching this with different expectations than they would normal content. I think everyone is kind of cued to assume, if they’re reading a statement that appears almost like it’s a fact, and it’s just, you know, saying this is what the facts are, that there’s been some sort of due diligence and reasoning that’s gone on to evaluate and to look through. And, you know, that’s not quite how these systems work, at least not yet. So, you know, I think there’s a degree of skepticism that’s going to be needed for the near term when looking at these types of results and working through them. You know, as Skyler pointed out, right, just because it’s on the internet and it’s being shared doesn’t mean it’s a hallucination. It just means this is an example of what’s on the internet. 
One question I wanted to follow up on specifically there touches on, I think, some of the stuff that we were talking about on the show last week, which is around UX. One of the interesting things is that the place on the page that an AI overview takes up is a space that was traditionally occupied by a thing called the featured snippet, if you live in the search world. Where Google was sourcing that data historically was just one of the top two or three most authoritative and widely cited results on the web, and that would be taken verbatim and placed in the snippet. Google’s now putting their AI overviews in the exact same place on the page where that content used to be. And, you know, it struck me that maybe one of the challenges there is that people are not necessarily treating the content as having been sourced totally differently. It’s in the same place on the same page, so they think it’s the same. And one of the things that started to make me think about, and Kate, maybe you could take this one, is that we almost have these three different types of things: human-generated content, LLM-generated content, and then traditional answers, like from a calculator, that you can almost trust 100%. And do you think that we actually need to do more in terms of distinguishing the user experience between those things? Like, rather than merging it all together and deeply embedding LLMs and AI into everything we do, making it very clear to users, you know, where they’re seeing features and content that are sourced differently than they have been historically?

Kate Soule: Absolutely. And I think it goes beyond just consumer use cases. It’s super important for regular consumers doing Google searches, but especially when you look at enterprise applications and other things. You know, the theme of being able to cite your sources and being able to decompose a bit what is going on inside of the black box, I think, is increasingly going to be critical for any sort of real adoption, for being able to move beyond, okay, this is a fun toy, to this is something that I can actually use in the day-to-day. So I really hope that we start to make some progress there on some of these more consumer-friendly chatbots, because in the enterprise setting, you know, that’s becoming increasingly the norm. Like in RAG patterns, you want to return, here’s the source where I, you know, got my answer from, and that’s becoming increasingly important.
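
The RAG pattern Kate describes, where the answer comes back together with the source it was drawn from, can be sketched in a few lines. This is a toy illustration under stated assumptions: the `Document` type, the word-overlap scoring, the corpus and the example.com URLs are all hypothetical, and a real system would use a proper search index and an LLM to write the final answer.

```python
from dataclasses import dataclass

@dataclass
class Document:
    source: str  # where the passage came from (hypothetical URLs below)
    text: str

def tokenize(text):
    """Lowercase and split on whitespace; deliberately naive."""
    return set(text.lower().split())

def retrieve(query, corpus, top_k=2):
    """Rank documents by how many query words they share, dropping non-matches."""
    query_words = tokenize(query)
    scored = [(len(query_words & tokenize(doc.text)), doc) for doc in corpus]
    scored = [pair for pair in scored if pair[0] > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # stable sort keeps ties in corpus order
    return [doc for _, doc in scored[:top_k]]

def answer_with_sources(query, corpus):
    """Return the retrieved passages alongside the sources they came from."""
    hits = retrieve(query, corpus)
    return {
        "query": query,
        "passages": [doc.text for doc in hits],
        "sources": [doc.source for doc in hits],
    }

corpus = [
    Document("https://example.com/health-faq", "doctors say you should not eat rocks"),
    Document("https://example.com/bridges", "the golden gate bridge opened in 1937"),
    Document("https://example.com/geology", "rocks are made of minerals"),
]

result = answer_with_sources("how many rocks should i eat", corpus)
```

Carrying the `sources` list through to the interface is what lets a reader check whether a passage came from a health FAQ or a satirical article before trusting the summary built on top of it.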

Chris Hay: One of the things that opens up in my mind, Kate, and I’d be interested in your perspective there, is that that’s kind of fine from a web interface where you’re getting your result, you get your overview, and then you’ve got all the links and here’s what I referenced. But as we talked about in a previous episode, we’re moving into multimodality, and you’re going to be chatting with, arguably, a human voice at that point, right? You’re probably not going to want somebody going back and saying, this is the answer to the question, and by the way, I got this answer from here, here, here, and you can visit it on XYZ, blah, blah, blah, because you’re going to switch off at that point. So I wonder what the best user experience for voice is for that sort of helpful chatbot, while also being fair and transparent that it’s AI generated.

Kate Soule: I honestly question if chat, regardless of whether it’s with voice or text, is the right domain here, like the right mechanism and mode for this type of analysis. And one of the things I’m really excited about with the AI overviews is that it seems like one of the first consumer-focused use cases that is really taking off where it’s not a chatbot, right? Where we’re using generative AI and we’re able to start to drive information distillation, gathering lots of different sources and providing results, you know, without having to have a multi-turn conversation, like asking, are you sure about this answer, where did you find it? Can you give me more sources? That’s a very unintuitive flow. But I think we’ve been so trained on chat equaling generative AI up until now that that’s just how we all assume it has to work. So I would actually say I don’t think, you know, voice and other things are where this hopefully is going. I think there’s a lot of opportunity to think through what new types of non-chat-based applications look like and how we can embed those decision-making criteria and sources and other things that are needed to really drive value along the way, without it being this multi-turn interrogation of an agent.

Chris Hay: What do we think Google is collecting on the usage patterns of these? You know, way back in the day, they would have search and they would obviously collect clickthrough, right? What are you clicking on? Any guesses as to what sort of metrics Google’s collecting as people interact with these AI overviews? Um, that’s not in my space at all. I’m just wondering; I’m guessing someone in there is watching how we are interacting with the AI overviews presented to us.

Bryan Casey: Ironically, this is the one question I’m qualified to answer. Um, and so, you know, AI overviews had been in beta for a while, and when Google said they were bringing them to prime time, they were really messaging to publishers, because publishers have been hysterical about the impact of this. What’s been really interesting is that the impact on organic traffic to publishers has been almost negligible. Everyone thought it was the end of the internet, and then almost nothing happened in terms of traffic. But two of the things that Google said: one was that the content that was surfaced through AI overviews was actually getting more clickthrough and more traffic than the stuff that was present in just the normal SERP. And the idea there was that those links were presented with more context. Um, I think Sundar did another interview not long after that where he was talking more about generative UIs, and you could just see, I think, more about how you take a user query and generate a UI that places links and information in context better than just a flat list, which is sort of what they do. They would say they do not do that today. It’s like there’s still some of that. Um, and so that was one thing. And then the other thing that they talked about, and I’m sure they measure more things, but the other thing that they measured is: do the people who are exposed to AI overviews start using search more? Like, is this something that increases their usage of this product over time? Because the other audience that is terrified of this is obviously shareholders, and people want to know, are you going to kill search? And in the process of doing that, where’s all the ad revenue going to go? 
And so one of the other things that they’re very clear about is, oh no, people who get exposed to this actually use this product more over time. And so I think they’re reminding some of their other stakeholders a little bit there, but those are at least some of the ones that they’ve publicly discussed.

Bryan Casey: Last week, Anthropic released a novel version of its Claude 3 Sonnet model, and this model did not believe that it was a helpful AI assistant. Instead, it believed it was the Golden Gate Bridge, which is a fun thing to have happen. But really that was a demo of research that Anthropic has been doing for a long time, and that really the industry has been pursuing for a long time, which is in the space of interpretability. Within that space, Anthropic has been doing a lot of research around mechanistic interpretability. Part of the problem in this space, Kate, to the comment you made earlier, is that these models are a black box today. You know, you put in a pile of all the data on the internet and linear algebra, and out spits something that somehow appears to know a lot about the world, but nobody knows how that’s actually happening, like not really. And so interpretability is a space that’s trying to answer some of those questions. What was interesting, and why Golden Gate Claude was important, was that Anthropic demonstrated that they could identify the features within the model that activated when, you know, either text or a picture of the Golden Gate Bridge was presented. So they knew kind of the combination of neurons and circuits that would say, this thing represents the Golden Gate Bridge. And perhaps even more importantly, by dialing that feature up or down, they could influence the behavior of the model, to the point where if you dialed it up high enough, the model thought it was the Golden Gate Bridge. 
Um, and this, if you read the paper, wasn’t the only example either. And I’ll share one other one, which is that they had another feature that would fire when it was looking at code and detected a security vulnerability in the code. And they had an example, too, where if you dialed up that feature, it would actually introduce a buffer overflow vulnerability into the code as well. So when you think about the ability to dial features up and down within a model fairly surgically, that’s pretty important in terms of the steerability of the model, potentially. And certainly, I think you can understand a little bit why folks in the AI safety community, in particular, have been focused on this interpretability space. So I personally find the space super fascinating, and Skyler, I just want to turn it over to you to maybe kick us off a little bit and talk about your general reactions to the paper and the demo as a starting point: what you found interesting, how important you think it is, and just, you know, what you thought of it.
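
The idea of "dialing a feature up or down" can be pictured with a toy example. The sketch below is not Anthropic's method, which the episode does not detail. It assumes, purely for illustration, that a feature corresponds to a direction in a model's hidden activations, and that steering means adding a multiple of that direction. The three-dimensional vectors and the name `bridge_direction` are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def feature_activation(activation, direction):
    """How strongly the activation points along the feature direction."""
    return dot(activation, normalize(direction))

def steer(activation, direction, strength):
    """Dial a feature up (strength > 0) or down (strength < 0)."""
    unit = normalize(direction)
    return [a + strength * u for a, u in zip(activation, unit)]

# Hypothetical hidden state and hypothetical "Golden Gate Bridge" direction.
hidden = [0.5, -1.0, 2.0]
bridge_direction = [3.0, 0.0, 4.0]

before = feature_activation(hidden, bridge_direction)
steered = steer(hidden, bridge_direction, 10.0)
after = feature_activation(steered, bridge_direction)
```

In this toy setup the feature's activation rises by exactly the steering strength while the components of the hidden state orthogonal to the direction are left unchanged, which is one way to read "fairly surgically". Whether a real model's features behave this cleanly is exactly what interpretability research is trying to establish.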

 
