Live from the Paris AI Action Summit 2025, our host Tim Hwang dives into the latest in artificial intelligence. In episode 42 of Mixture of Experts, we welcome Anastasia Stasenko, CEO and co-founder of Pleias, along with our veteran experts Marina Danilevsky and Chris Hay.
Last week, we touched on some potential conversations at the Paris AI Summit; this week, we recap what actually happened. Is AI safety improving globally? Next, for our paper of the week, we break down “s1: Simple test-time scaling.” Then, Sam Altman is back with another blog, “Three Observations.” Hear what our experts have to say about these observations. Finally, explore what we can learn from Anthropic’s Economic Index. All that and more on today’s Mixture of Experts.
The opinions expressed in this podcast are solely those of the participants and do not necessarily reflect the views of IBM or any other organization or entity.
Tim Hwang: Globally speaking, is AI getting more or less safe with time? Marina Danilevsky is a Senior Research Scientist. Marina, welcome back to the show. What do you think?
Marina Danilevsky: Both.
Tim Hwang: Okay, all right, we’ll get into that. Chris Hay is a Distinguished Engineer and CTO of Customer Transformation. Chris, what do you think? More or less safe with time?
Chris Hay: I think it’s getting safer with time, so I’m pretty pleased with that.
Tim Hwang: And joining us for the very first time is Anastasia Stasenko, who is the CEO and co-founder of Pleias. Anastasia, more or less safe?
Anastasia Stasenko: It’s definitely getting safer because it’s getting more open source.
Tim Hwang: All right, great. More to dive into. All that and more on today’s Mixture of Experts. Greetings from Paris. I’m Tim Hwang, and welcome to Mixture of Experts. Each week, MOE is the place to tune in to hear how leading researchers, engineers, entrepreneurs, technologists, and many more are thinking about the latest trends in artificial intelligence. As always, we have a lot to cover—way too much, in fact.
We’re going to talk about test-time scaling, a new blog post from Sam Altman, and Anthropic’s new Economic Index. But first, given that I’m in Paris and Anastasia is in Paris as well, we want to talk about the Paris AI Action Summit. This is the latest in a series of summits that governments have been holding around AI in the last few years. This year is hosted by the French government, and it collects representatives from civil society, governments, companies, and more to talk about and set standards and guidelines around the development of AI. There are some really big announcements that we want to get into.
Anastasia, it’s really exciting to have you on the show, in part because you were directly involved in some of these events. I understand that Macron has announced an enormous fund—I think it’s like 100 billion dollars—to support AI, specifically in France. I know Pleias is part of that. If you want to tell us a little bit more about how you got involved and what you’re going to be doing with this new fund.
Anastasia Stasenko: Yes, of course. Well, first of all, there actually have been multiple announcements about investment. We do love announcing investment in France, and we do hope that real action will follow. However, it’s important to say that the 100-billion-euro investment fund is not only France investing; it’s an international fund with private companies involved. For example, the Iliad Group, which is a telecom company, is contributing over 4 billion, et cetera. The fund’s objective is to focus on sovereign European AI infrastructure.
This is one part of this, and it’s true that we have been hearing for a long time that Europe is lagging behind in terms of AI infrastructure. We don’t have enough GPUs to train frontier models nor to run inference for scaled AI applications. I’m not sure that this is entirely true, and I’m not even sure that we should be going into scaling AI infrastructure in a world where we have ecological imperatives haunting us at the same time.
Another big announcement was the “Current AI” foundation, which is focused on AI for public good.
Tim Hwang: Yeah, I know, and this is a very particular part of Pleias’ work, right? Because you guys are specifically working on open-source and open-data models.
Anastasia Stasenko: Yes, totally. We have actually trained the world’s first models on exclusively open data—“open” in the strong sense of the word, without copyrighted material, with permissive licenses only. Opening data and creating this open data infrastructure is important to us, but also to the larger AI community, which today cannot advance as fast or work on as many applications for the good of communities—especially for less-resourced languages or specific applications—without being supported by initiatives like Current AI.
So this foundation, with 400 million secured for the first year, aims to raise over USD 2 billion for its five-year run, at least for now. We are very happy to be part of this from the very beginning. We have signed an open letter with 10 other industry leaders, such as Mistral, Aleph Alpha, Hugging Face, et cetera. So it’s all very exciting. And I’m most particularly excited about data finally not being the forgotten piece of AI development.
Tim Hwang: For sure. We talk a lot on the show about how the data piece often gets lost, even though it’s arguably the most important part.
One of the themes I wanted to pick up on... Politico did an interesting article with the headline, “How the World Stopped Worrying and Learned to Love AI.” The argument was that the last few summits focused on safety and security, but this year, there was a lot more of a “lean forward” attitude—that we just need to deploy this faster, better, and bigger than ever before.
Marina, maybe I’ll call on you. In response to the first question, you said, “Well, maybe a little bit of both.” I’m curious what you meant by that and how you think about it, particularly in this context where world leaders seem more “rah-rah” than in the past.
Marina Danilevsky: World leaders have FOMO. So they’re saying, “Well, safety is all well and good, but everybody else is working on this, so we better work on it too.” Why I say “both” is that there are aspects in which safety is getting better, and aspects in which the power of the models makes it easier to do potentially more deeply nuanced and misleading tasks. So safety is just getting more complicated. It’s less about the model teaching you to build a bomb, and more about the model having unintended effects over time.
This has always been, to me, the flip side of the democratization of AI. Having it in more hands is always going to be both good and bad.
Tim Hwang: Chris, this is a good chance to bring you in. You were playing the voice of optimism, saying things are getting safer. One way to nuance this is to talk about at what layers it’s getting more or less safe. On the regulatory side, there’s been a push to not restrict the technology, which some think is less safe. But techniques around safety are getting stronger. Is that your argument for why it’s safer?
Chris Hay: I think so. If we just go back two years ago to models like GPT-3 or the early open-source models like LLaMA 1... come on, those models were terrible compared to today’s. Today’s models are much safer; they come back with better answers and hallucinate less. From a stack perspective, we now have guard models, we’re reducing bias, and training is a lot better. So in general, we’re thinking about this a lot more. That’s not to say we can’t do bad things with models today—of course you can—but we are much safer than we were.
My friend used to write test code for missiles, and he was the worst programmer I ever met. So I think, would I want an old LLaMA model doing that versus him? I think maybe it’s safer. Yeah, it’s fine.
Tim Hwang: Anastasia, as a model developer, how is Pleias thinking about safety? Is it distinctive from what you see elsewhere? I know philosophically you’re focused on “open,” but are you also trying to blaze a new trail in the safety space?
Anastasia Stasenko: For us, we don’t develop conversational models or chatbots. We specialize in models for data processing, all the way to retrieval-augmented generation (RAG). For us, the most important part of safety is the development of curated and vetted databases prepared for factual AI. Not much work is done here nowadays. For example, you still don’t have good multilingual classifiers for sentiment analysis or good toxicity classifiers where you understand the training data and why something is deemed toxic.
There is so much work to be done to prepare good data foundations for factual AI, where you can have models bound to data—even proprietary data or the open data you use in your stack. This is where we concentrate our efforts. We are not building AGI; we don’t have the resources, and I’m not sure we need to. But we do need these “workhorses,” these small models that allow us to work through data and bring it to a quality where RAG applications can be deployed. This is where live, actual AI would be deployed.
Tim Hwang: I was just going to say, I know a guy with 100 billion dollars who might help you build AGI, Anastasia. You might want to tap him for it. We’ll get to that gentleman later.
One thing I want to talk about is the “paper of the week”: “s1: Simple Test-Time Scaling.” After o1-preview came out, one of OpenAI’s stated advancements was test-time compute, the idea that you make a model “think harder” to achieve better results. This paper introduces s1, an attempt to replicate o1’s reasoning ability. They collected about a thousand questions and their reasoning traces, using hacks like inserting a “wait” token to keep the model thinking. They claim that with these rough-and-ready hacks, they got a model competitive with o1-preview.
A shocking result, maybe. Chris, I see you’re going off mute. Is it shocking that people can replicate this so quickly? Is test-time compute the new whiz-bang technology, and these researchers have replicated it at much lower cost?
Chris Hay: I think it’s the Barney Stinson method from How I Met Your Mother: “What’s 25 + 2? Wait for it, wait for it... 27.” That’s the basic technique. It’s not a surprise. We already know that the longer a model spends thinking and generating tokens, the more it can reflect. We saw this in the DeepSeek paper. The trick is to create multiple samples, take longer, get longer chains of thought, and then you’re more likely to get the answer. They’re essentially rejection-sampling away the bad chains of thought.
It’s really cool that it works with just one token, the “wait” token, to generate that chain of thought. I think in two years, we won’t need these hacks because we’ll have a good set of chain-of-thought data to bootstrap the model. But starting with a relatively simple base model like Qwen 2.5 and generating those chains of thought quickly to get decent performance in that domain is a great job. It builds on work everyone’s seen with DeepSeek.
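For a concrete picture of the “wait” trick Chris describes, here is a minimal sketch of budget forcing at inference time. The generate() helper, its parameters, and the prompt wording are hypothetical placeholders standing in for whatever model you run; this is an illustration of the idea, not the s1 authors’ actual code.

```python
# Minimal sketch of the "wait" trick (budget forcing) discussed above.
# `generate` is a hypothetical helper standing in for your model of choice.

def generate(prompt: str, max_new_tokens: int) -> str:
    """Hypothetical model call; swap in your own inference code."""
    raise NotImplementedError

def answer_with_budget_forcing(question: str, extra_thinking_rounds: int = 2) -> str:
    trace = f"Question: {question}\nLet's think step by step.\n"

    # First pass of reasoning.
    trace += generate(trace, max_new_tokens=512)

    # Each time the model would stop, append "Wait" and let it continue,
    # which tends to make it re-check or extend its reasoning.
    for _ in range(extra_thinking_rounds):
        trace += "\nWait,"
        trace += generate(trace, max_new_tokens=512)

    # Once the thinking budget is spent, ask for a short, committed answer.
    trace += "\nFinal answer:"
    return generate(trace, max_new_tokens=32)
```

The full s1 recipe also fine-tunes the base model on roughly a thousand curated reasoning traces; the loop above only illustrates the test-time part of the technique.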
Tim Hwang: Yeah, the “wait, wait, wait” solution is classy—simple but effective.
Marina, one question is, how far can test-time compute go? It’s remarkable that you take less sophisticated models and get way better performance. Is there a ceiling set by the base model, or are you optimistic that test-time compute will take us very far? Or is it just a marginal hack?
Marina Danilevsky: First, it might be a less sophisticated model, but it’s more sophisticated data. They didn’t take a thousand data points; they took 59,000 and did a bunch of filtering and qualifying to get down to that thousand. They spent time talking about quality, difficulty, and diversity. I couldn’t agree more. The work is going to be done somewhere. If it’s not in the model dealing with noisy data, it’s in having really good representative data points.
“Wait” sounds like “let’s think step by step.” There are other ways to do this. You can have examples where the model backs up and tries a different path. This consistently reminds me of taking my nine-year-old through his math problems. The important thing is what data is being used. It’s good that people are trying to reduce compute; there’s a huge amount of waste right now. You don’t need models that big trained on data that voluminous and that noisy. Quality goes a long way. The more we focus on data quality, the more progress we’ll make.
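As a rough illustration of the curation step Marina highlights, here is a sketch of filtering a large question pool down to a small set using quality, difficulty, and diversity criteria. The field names, thresholds, and scoring are illustrative assumptions, not the paper’s actual pipeline.

```python
# Sketch of narrowing a large pool of questions down to a small curated set,
# in the spirit of the quality/difficulty/diversity filtering mentioned above.
# All field names and thresholds are illustrative placeholders.

from collections import defaultdict

def curate(pool: list[dict], per_topic: int = 20) -> list[dict]:
    # 1) Quality: drop malformed or low-quality items.
    good = [ex for ex in pool if ex["quality_score"] >= 0.8]

    # 2) Difficulty: keep questions a baseline model gets wrong,
    #    so the surviving examples actually add something.
    hard = [ex for ex in good if not ex["baseline_correct"]]

    # 3) Diversity: cap how many examples come from any one topic.
    by_topic: dict[str, list[dict]] = defaultdict(list)
    for ex in hard:
        by_topic[ex["topic"]].append(ex)

    curated: list[dict] = []
    for topic, examples in by_topic.items():
        # Within each topic, prefer the longest reasoning traces.
        examples.sort(key=lambda ex: len(ex["reasoning_trace"]), reverse=True)
        curated.extend(examples[:per_topic])
    return curated
```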
Tim Hwang: This is interesting in the context of openness versus closedness in data. Some argue we need so much data that we can’t figure out what’s open or closed.
Marina Danilevsky: Totally. And you don’t need that. You need vast amounts, but first you need good-quality data. Reasoning-rich data.
Anastasia Stasenko: What’s really interesting in this post-DeepSeek moment with test-time compute is that we can boost smaller models, which have a smaller energy impact, for specific domains like math and coding. At Pleias, we’ve started working on this for legal reasoning. These are domains where you have “truth,” you can create a chain of thought and verifiers. These datasets are more complicated to create than for coding or math. We’ve been experimenting with legal reasoning, reasoning over administrative documents, and even sociological reasoning where you have clear guidelines.
I’m very interested in how this will boost smaller, specialized models for specific domains outside of general capabilities tested on traditional benchmarks. I think the future is really in these small models.
Tim Hwang: I’ve imagined a future where people go to court with a 2.5-billion-parameter model against a lawyer with a 70-billion-parameter model, and the judge has a 405-billion-parameter model. Is that our legal future?
Anastasia Stasenko: For me, that’s a more desirable future than being governed by one AGI. It’s a more capitalistic future. I’m not sure I want to live in a one-AGI world, but that’s probably just me.
Tim Hwang: Maybe that’ll be our next hot-take question.
The next topic is a blog post from Sam Altman entitled “Three Observations.” It’s provocative and talks about the economic impact we can expect as AI systems get more powerful. Sam makes two big arguments: first, model performance is scaling (bigger models are better), and second, as costs drop, demand keeps increasing. His ultimate argument is that we should keep scaling; things will get cheaper and have a gigantic impact on the economy. It’s a case for why people should believe in OpenAI.
I want this group’s take, as folks who believe the technology will get better but are also AGI-skeptics. Marina, what did you think? Do you agree? Any quibbles? Maybe that sigh tells us everything.
Marina Danilevsky: Sam’s main point I focused on was his point three: “The socioeconomic value of linearly increasing intelligence is super exponential... We see no reason for exponentially increasing investment to stop.” Give me more money. Money to me. Give it now. That seems to be the message.
Also, I don’t think he, along with many in Silicon Valley, lives in the real world. Statements like, “In 10 years, everybody will want and benefit from AI,” are not realistic. Things don’t scale that way, and that’s not the need or benefit people will have. It’s a narrow perspective that I find off-putting. It undermines the work many of us do in the field.
Tim Hwang: So the critique is that it’s an overstatement. Chris, what do you think?
Chris Hay: I love it. Go for it, Sam. It’s gonna change the world; it’s great. You gotta have a super positive attitude. This is a world-changing technology, and we can see how much things have improved. Of course, it’s going to have an impact. New value creation will happen; we’ll find new ways of doing things. I think that’s great.
Is it likely to be somewhere in between? Maybe, because as soon as we get something cool, we take it for granted. I’ve said that as soon as we get AGI, the first thing that will happen is it’ll be put in a box in a museum, and we’ll all walk in and chat with it. But I am super positive. If you look at coding, models today are better for many tasks, turning out code quickly at high quality. As costs come down, more people will use it.
Tim Hwang: I’m trying to parse the optimism. Marina, I agree the tone is frustrating, but do I believe the technology will have a big impact? Yes, for sure. How do you articulate optimism without falling into Valley tropes?
Marina Danilevsky: Not being in the Valley.
Anastasia Stasenko: For me, the ChatGPT moment was a huge moment of liberation. I don’t like to write, and we sometimes underestimate the emancipating power of this technology for people who struggle with writing or coding, who didn’t have access or “intelligence,” and now use these as tools. However, I’m not sure this emancipating power is well understood. We are hunting AGI, but what we need are assistants that help us be what we can be in society, with less constraint on learning to code or write well, and without being judged for orthographic errors.
This is more important than hunting AGI. However, Marina brought up the third point about exponential growth. The planet is limited. We have limited resources, and I’m surprised to read such takes. AGI won’t resolve the ecological crisis. We can’t build enough nuclear power plants in time, and even nuclear energy won’t resolve the crisis.
We cannot forget everything we were saying two years ago and think AGI will solve everything, just like the internet didn’t solve everything. We need to be more pragmatic. Sam gave the same speech at the Élysée Palace, saying we need to invest in data centers. We need to be careful about these calls for investment and how they impact society. We need this technology, but in a reasonable way—not by building a data center near every school.
Chris Hay: I agree but disagree. I don’t think we need nuclear power stations. With test-time compute, we should focus less on pre-training. That’s not to say we won’t pre-train, but it shouldn’t be the only focus. With test-time compute, you can get very far with high-quality datasets. Pushing that to longer chains of thought on consumer-grade hardware proves scaling can occur at lower cost. If we need nuclear power stations to scale AI, we’re on the wrong track.
Anastasia Stasenko: The question is inference, not just pre-training. It’s a much bigger part of the energy chain.
Chris Hay: But I can run inference on my laptop with Apple Silicon. It’s fine.
Anastasia Stasenko: The cost of inference is much lower than training, but once we scale AI applications, it will be a question. But let’s agree to disagree.
Chris Hay: That’s not how this podcast works.
Anastasia Stasenko: Yes, it’s not.
Marina Danilevsky: I’ll say something that might be funny coming from my background in language studies: we should not forget the power of these models in other non-language domains—multimodality, sensors, time series. There is so much use to be gotten, not just in helping you write (which I love; I hate writing emails) or code, but in improving factory work, tracking sensors on migratory animals to help their habitats, etc.
If we broaden where this technology is used—remember, it has nothing to do with language—we could go to more interesting, pragmatic, and practical places, and maybe not focus only on language or chasing AGI.
Tim Hwang: The last item is a fun dataset from Anthropic called the “Anthropic Economic Index.” It reminds me of Google Flu Trends, which used search queries to map the spread of illness. This is an updated take on that idea, looking at what people are using AI for. Anthropic looked at a sampling of anonymized conversations to learn how AI is spreading across the economy. They found some interesting results.
One finding is that about 36% of AI assistant usage is still in software development and technical writing. This ties into Marina’s point: we’ve sold this as an economy-spanning tool, but it’s still quite concentrated. For me, that violated my expectations; I thought people were using it for emails, poems, essays, etc. Anastasia, should we be surprised, or am I just out of touch?
Anastasia Stasenko: This study corroborates what we’ve seen in other studies, including one we conducted for the French government. One reason is that the tools we have now—chatbots—are not well-adapted to the exact knowledge work most people in other industries do. For software development, I wasn’t surprised, and Anthropic has been marketed as a state-of-the-art coding tool, so it’s normal. They acknowledge this bias in the paper.
However, I think this shows how far we are from wider adoption of LLMs as everyday tools at work. It’s surprising; we might think we all use it now, but we don’t. Some parts of the population are more exposed, and it will require education in workplaces and product adaptations from model providers.
Tim Hwang: Chris, this builds on your joke about AGI in a zoo. We have powerful AI, but it’s still largely a technical industry phenomenon. Anthropic would say there’s a huge market to grow into. The pessimist view is that maybe it’s most useful for coding. Is that a concern for the AI industry?
Chris Hay: No, I agree with Anastasia. If you asked Moët & Chandon what the primary use of glass bottles is, 30% would say champagne. That’s the reality for Anthropic. Claude is marketed as the best for coding. In ecosystems like Cline, the default model is Claude. So anybody in that industry knows to go to Claude for code. I think the data is skewed. If we asked OpenAI, we’d get a different result because it’s aimed at a wider consumer base. Even within OpenAI, GPT-4o mini vs. o1-mini vs. o3-mini would have different usage data.
It’s interesting that they released this, but it’s a very thin vertical slice. It’s only telling us about how Anthropic is used.
Tim Hwang: Yeah, it’s only about Anthropic’s usage.
Chris Hay: Exactly. I don’t think it’s representative of the world, America, or the UK. It’s representative of how Claude and Anthropic are used, which is super interesting, but that’s what it is.
Marina Danilevsky: They explicitly say that. They freely admit it. One of the biggest points of releasing the dataset is to hope we can get similar data from others. It’s like releasing search logs; you’ll never get all of them, but it’s nice to have something. I liked the economic perspective. They were careful about their limitations. It’s more about the process of analysis than the results.
Chris Hay: I agree with that.
Tim Hwang: Yeah. Marina, a follow-up: this is cool because we haven’t seen companies say, “Based on our aggregate data, here are useful things we can build.” This is a new start. Do you think companies will do more of this, or is it more of a demo project?
Marina Danilevsky: I don’t know. I think this technology is still a hammer in search of a lot more nails. We’ve found a couple of nails, but there are more out there. Given the investment, it’d be nice to find more and figure out from people, “Do you even know how to use this technology?” We’re not at the point where people have the knowledge. When the internet started, people didn’t know how to use it correctly but got better.
It’s actually to companies’ benefit to get people more comfortable and have a broader view, not make this seem like it’s by tech folks, for tech folks. You’re going to grow a market, as you said correctly.
Tim Hwang: Yeah, it will be a long process of understanding how to use it. It’s like those early films where people lined up and posed still, assuming it was a photograph, not realizing it was a movie camera. We’re at that early stage with AI.
Well, that’s all the time we have for today. Chris, Marina, thanks for joining us again for two episodes in a row. Anastasia, it was great having you on the show; we’ll have to have you back. Thanks to all you listeners for joining us. If you enjoyed what you heard, you can get us on Apple Podcasts, Spotify, and podcast platforms everywhere. We’ll see you next week on Mixture of Experts.