Can AI solve infectious disease? In episode 25 of Mixture of Experts, host Tim Hwang is joined by Kaoutar El Maghraoui, Maya Murad and Ruben Boonen. Together, they analyze some papers and key AI developments. First, the experts dissect “Machines of Loving Grace,” a 15,000-word essay by Anthropic’s CEO that makes some major AI predictions. Then, they discuss GSM-Symbolic, Apple’s new variation on the GSM8K benchmark, and the intriguing findings highlighted in the accompanying paper. Next, they analyze Entropix, a sampler intended to replicate chain-of-thought features. Finally, they address OpenAI’s disclosure that state actors are increasingly using AI models to generate fake articles for election interference, and what can be done about it. Listen to all this and more on this episode of Mixture of Experts.
The opinions expressed in this podcast are solely those of the participants and do not necessarily reflect the views of IBM or any other organization or entity.
Tim Hwang: We’re jumping ahead. It’s October 17th, 2034. Has AI helped us solve nearly all natural infectious diseases? Maya is a product manager for AI incubation. Maya, welcome to the show. What do you think?
Maya Murad: Thank you for having me. So, of course, the optimist in me would love to say yes, but I don’t know if history has always proven us right. I think it really depends on how we choose to use this technology.
Tim Hwang: Kaoutar El Maghraoui is a principal research scientist for AI engineering at the AI Hardware Center. Welcome back to the show. Tell us what you think.
Kaoutar El Maghraoui: Thank you, Tim. It’s great to be back. Well, AI is making strides in tackling infectious diseases, but it’s not a magic bullet. Viruses evolve faster than algorithms, and the battle between pathogens and progress is far from over. So, there is a lot more work to be done.
Tim Hwang: All right, so some skeptics on the call. Finally, last but not least, joining us for the first time is Ruben Boonen, capability lead for adversary services. Ruben, welcome. Let us know what you think.
Ruben Boonen: Thanks, glad to be here. I think we can get there, provided scaling continues. But I think it’s mostly going to be an issue of competing human interests if we do.
Tim Hwang: All right, great. Well, all that and more on today’s Mixture of Experts. I’m Tim Hwang, and it’s Friday, which means it’s time again to take a whirlwind tour of the biggest stories moving artificial intelligence. We’ll talk about a hot new sampler that’s getting a lot of attention and Apple raining on AI’s parade, but first, I want to talk about “Machines of Loving Grace,” an essay by Dario Amodei, the CEO of Anthropic. He makes some wild predictions: that AI might solve all infectious diseases, it could 10x the rate of scientific discovery, and he promises that one wild but not implausible outcome is 20% GDP growth in the developing world and potentially even world peace.
The essay has been getting a lot of play, and a lot of people are talking about it. Maya, I’ll start with you. How believable do you think these visions are? What is more or less believable in what Dario is predicting here?
Maya Murad: So, Dario definitely paints a picture that we would all love to believe in, but of course, people are going to be skeptical, because a technology, which is a tool, can be used in different ways. Currently, the way we’re seeing AI being used doesn’t necessarily live up to all this optimism; it’s a mixed bag. There have been advances in drug discovery, but at the same time, we’re seeing articles about the rise of misinformation. So I think the essay overemphasizes the positive and doesn’t lay out the prerequisites for getting to this positive picture. I think it’s going to have to come hand-in-hand with a lot of social change, not just technological change.
Tim Hwang: So you think the end result of AI is likely to be neutral, if anything? Is that right?
Maya Murad: I don’t think technology is neutral. I think how you put it in motion, there’s definitely an agenda, a social context, and an economic context behind it that unleashes it in different directions.
Tim Hwang: Yeah, for sure. Kaoutar, I want to bring you into this discussion. When you responded to the first question, you seemed more skeptical. Do you agree with Maya that this is ultimately achievable, or do you think this is marketing or over-optimism about the technology?
Kaoutar El Maghraoui: I think there are certainly lots of things we can achieve with AI. Of course, there is also hype. In Dario’s essay, he explored the potential and some limitations of AI and how it might shape society. One thing I found interesting is how he emphasized the need to rethink just how powerful AI is and to tap into that potential. But there are lots of challenges that require continuous work and progress. For example, in biology and health, which he wrote a lot about, we’ve seen what AI can do. It can significantly enhance research, but progress is often constrained by the speed of experiments, the availability of quality data, and regulatory frameworks like clinical trials. Despite revolutionary tools like AlphaFold, you need things like virtual scientists driving not just data analysis but the entire scientific process. There’s a lot of work to be done.
If we look at pragmatic versus long-term impacts, in the short term, AI might be limited by infrastructure and societal barriers. However, over time, I hope these can be resolved, and AI can create new pathways, reforming how experiments are conducted and reducing bureaucratic inefficiencies through better system design. It has to be a collaboration between this intelligence and society. Things need to be regulated because, as Maya mentioned, there’s fake news, fake art, and fake data. There’s danger we have to be careful about. How do we balance all of this to push it in a productive direction that helps us and doesn’t impede progress or create issues?
Tim Hwang: Yeah, for sure. Kaoutar, one question before I move to Ruben—he’ll have interesting angles as he works on how these systems can break or be used for not-so-great purposes. You work a lot on hardware. Part of Dario’s dream is that these systems will eventually control physical robotics in the real world, which would be a huge boost to the technology’s effect. Do you buy that? Are we close to a world where it’s easy to instrument models to control real-world systems, or are we still far away?
Kaoutar El Maghraoui: I think we’re making progress. There’s a lot of work on making hardware infrastructure more efficient, and sustainability is a big part of it because we’re hitting physical limits. There’s a lot of work needed to create chips capable of operating in resource-constrained environments, especially with the huge compute needs that AI, particularly large language models, keeps driving forward. Computational needs are growing, and if you want to do things like reasoning, it’s going to be an arms race. More is required algorithmically, but on the compute side, innovations are needed at the semiconductor and materials science level to create chips that handle this huge demand sustainably and cost-effectively.
Maya Murad: A subject like this wasn’t addressed in the essay at all. It was overly optimistic that AI will solve climate change, but in the process of developing AI, we’re missing a lot of the sustainability targets companies have set. If I want to use AI to solve climate change, I don’t want data centers emitting tons of carbon and consuming tons of energy to solve that problem.
Tim Hwang: Ruben, maybe I’ll bring you in as our security guy. My friends in security look at this essay and say it’s ridiculous: this technology will largely be used for bad purposes, or the systems will be so vulnerable they’ll never achieve their full potential. How do you size up these claims? As a security expert, do you buy into the optimistic vision, or are you more skeptical?
Ruben Boonen: I am an optimist personally, but like I mentioned, the technological achievements are one thing; how people with competing interests manage the outcomes is another. For example, the article talks about authoritarian regimes and how AI systems clearly have applications to restrict what people can do and think. We can already see some of those dynamics at play; the West and East have diverged on AI development paths, and that will continue as systems get more powerful. Also, with medical advancements, I’m not a subject matter expert, but it will depend on whether companies make those advancements available to people who can’t afford them and how that distribution is made.
Finally, one thing he didn’t mention is education, which I’m personally hopeful for. More free access to information and high-quality AI-assisted education could be a big uplift for many people and help make society more democratic and accepting of these technologies. Conflict often arises because people don’t have the same basis to understand facts, like with anti-vaccination campaigns. So it’s a complex picture.
Tim Hwang: I’m going to move us to our next topic. One thing I’ve been watching on X/Twitter chatter is hype around a repo called Entropix. The story is that an AI researcher introduced a sampler that attempts to replicate some of the Chain of Thought features we saw with the OpenAI o1 release a few weeks back. Maya, I’ll turn to you. What is a sampler anyway, and why should we care?
Maya Murad: I love this question. I’ve spent time focusing on LLM inference. When we talk about AI, we mostly mean large language models. What an LLM does is, given the start of a sentence, it predicts the next word. If I say, “On the table there is a...”, automatically, a few probable words pop up: “book,” “glass of water,” etc. The model does something similar; it has a statistical representation of all possible next words with probabilities based on past data. A sampler determines, given the words the model has seen, what it should output next. The most widely used technique is “greedy,” meaning outputting the word with the highest probability.
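To make the sampler idea concrete, here is a minimal sketch of greedy versus temperature-based sampling over a toy next-token distribution; the vocabulary and logits are made up for illustration and are not taken from any real model.

```python
import numpy as np

vocab = ["book", "glass", "lamp", "cat", "pear"]   # toy vocabulary
logits = np.array([2.1, 1.9, 0.3, -0.5, -1.2])     # hypothetical model scores for the next word

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def greedy(logits):
    """Greedy decoding: always return the single most probable token."""
    return vocab[int(np.argmax(logits))]

def temperature_sample(logits, temperature, rng):
    """Temperature sampling: rescale the logits, then draw from the resulting distribution."""
    p = softmax(logits / temperature)
    return vocab[rng.choice(len(vocab), p=p)]

rng = np.random.default_rng(0)
print("greedy:", greedy(logits))
print("sampled:", temperature_sample(logits, 0.8, rng))
```

Greedy always returns “book” here, while temperature sampling will sometimes return “glass,” because the top two scores are close; that difference is exactly the lever a custom sampler can pull.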
This work is interesting; it takes advantage of additional information, additional metadata, that we can get out of LLMs. It’s an interesting direction, and I’m happy to hear others’ thoughts.
Tim Hwang: Kaoutar, I’ll throw it to you. One reason people are excited is that it seems to boost model performance on various tasks and replicates part of what OpenAI touted as its special sauce. OpenAI seems like the Goliath with crazy algorithmic improvements, but does something like Entropix mean open source will get as good as proprietary models? Is there no special sauce if a random researcher can launch a repo that does something close to what big companies do?
Kaoutar El Maghraoui: I totally agree. I love what Entropix is doing. It reflects the fast-moving evolution of the open-source AI community, where new methods like adaptive sampling are explored without massive computational resources, which is key. It demonstrates the collaborative, experimental nature of the field. We can explore in open source and mimic or even exceed the secret sauce of big companies. Entropix aims to replicate features of OpenAI’s o1 models, particularly reasoning capabilities, with interesting approaches like entropy- and varentropy-based sampling techniques. These reflect uncertainty in the model’s next step or examine the surrounding token landscape, helping the model decide whether it should branch or resample based on future token possibilities.
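As a rough sketch of the signals Kaoutar mentions, the entropy and varentropy of the next-token distribution can be computed and used to pick a decoding strategy. This is a simplified illustration in the spirit of the idea, not the actual Entropix implementation; the thresholds and the three-way decision rule are invented for the example.

```python
import numpy as np

def entropy_and_varentropy(probs):
    """Entropy (expected surprisal) and varentropy (variance of surprisal) of a distribution."""
    logp = np.log(np.clip(probs, 1e-12, 1.0))
    surprisal = -logp
    h = float(np.sum(probs * surprisal))
    v = float(np.sum(probs * (surprisal - h) ** 2))
    return h, v

def choose_strategy(probs, ent_thresh=1.0, varent_thresh=1.5):
    """Map the model's uncertainty signals to an illustrative decoding decision."""
    h, v = entropy_and_varentropy(probs)
    if h < ent_thresh:
        return "greedy"     # the model is confident: take the top token
    if v > varent_thresh:
        return "branch"     # a few very different continuations look plausible: explore them
    return "resample"       # broadly uncertain: sample again, e.g. at a higher temperature

# Toy next-token distribution with two near-equal forks plus a long tail.
probs = np.array([0.35, 0.33, 0.12, 0.10, 0.10])
print(choose_strategy(probs))
```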
Open source will catch up. We see innovation not just algorithmically but also with efforts like Triton on the GPU/accelerator side. There’s a lot of work in open source to go co-processor-free, and in areas like VLAMs, open source is on par with proprietary companies across all AI stacks.
Maya Murad: What’s also interesting is open source gives the ingredients for free with more accessible approaches. What OpenAI did with o1 was take a big frontier model and do lots of reinforcement learning to train it on Chain of Thought reasoning at scale. This open-source repo took Llama 3.1 and bypassed all that reinforcement learning, taking advantage of innovation at the inference level. The model can tell us it’s uncertain about the next token. In some situations, a word has high probability, but there might be forks where options are equally probable. Using this information, you can do a lot. This repo proposes doing Chain of Thought from scratch, but I’m interested in uncertainty quantification as a means of giving tools for people to use models differently. If the model says it’s uncertain, you could build different systems. The choice could be different from what this repo does, but it’s an interesting research direction.
Tim Hwang: That’s an interesting subtlety. It’s not just replicating end results; this engineer found a cheaper way by editing the sampler rather than a complex reinforcement learning process.
Kaoutar El Maghraoui: It’s also encouraging deeper reasoning through token control at inference time, paving the way for incorporating uncertainty and future predictions to make the right next steps.
Tim Hwang: We’ve got to talk about agents every episode; it’s become a running bit on Mixture of Experts. Maya, you raised a question about the relationship between these uncertainty techniques and getting more agentic behaviors. Can you talk more about that? It may not be entirely clear to some folks.
Maya Murad: First, any model of a certain size that responds well to Chain of Thought, step-by-step “thinking,” can be turned agentic. How well it performs depends on the model’s inherent capabilities. What’s interesting about this innovation—using uncertainty information—is it could be really useful in agentic systems. You could stop an agent if it’s uncertain of the next step. In the agent world, we face reliability problems, and users over-trust agent performance because it looks human-relatable. Catching hallucinations in an agentic approach is harder than text-in, text-out. Uncertainty quantification is a tool that could bring agentic systems to the next level. It could be used to stop an agent, start a new Chain of Thought workflow, etc. We’re at the beginning, but on my team, we discuss this as an interesting research direction.
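As a hedged sketch of the direction Maya describes, an agent loop could gate each proposed action on a confidence signal and stop, escalate to a human, or kick off a fresh reasoning pass when that signal is low. The `ProposedStep` structure, the confidence values, and the 0.6 floor below are all hypothetical stand-ins, not any particular framework's API.

```python
from dataclasses import dataclass

@dataclass
class ProposedStep:
    action: str
    confidence: float   # e.g. derived from next-token entropy while the step was generated

def run_agent(steps, confidence_floor=0.6):
    """Execute proposed steps, pausing when the model signals it is unsure."""
    for step in steps:
        if step.confidence < confidence_floor:
            print(f"Pausing before '{step.action}': confidence {step.confidence:.2f} is below "
                  f"{confidence_floor}; escalate to a human or start a new reasoning pass.")
            return
        print(f"Executing '{step.action}' (confidence {step.confidence:.2f})")

run_agent([
    ProposedStep("look up order status", 0.92),
    ProposedStep("issue a refund", 0.41),   # low confidence: the agent stops here
])
```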
Kaoutar El Maghraoui: It aligns well with agentic approaches. Entropix introduces entropy-based sampling and the varentropy technique, assessing future token distributions. Agentic behavior requires foresight, planning, and human-like flexibility in dynamic, adaptive decision-making. They can learn from each other; agentic systems could incorporate these techniques for flexibility and foresight. It’s exciting.
Ruben Boonen: As the other panelists mentioned, there’s a real push in open source. I don’t know how well we can quantify if it’s catching up to frontier models, but it’s great this is happening publicly.
Tim Hwang: Yeah, for sure. As Maya said, we may see open source solving problems in more resource-constrained ways, keeping it ahead of proprietary models’ expensive approaches. It’s a dynamic we’ll return to.
I’m turning to our next topic. Apple released a controversial paper recently—I joked they’re raining on AI’s parade. They took a benchmark called GSM8K, which has mathematical reasoning questions, made quick variations, and created a new benchmark, GSM-Symbolic. The changes are small and subtle—like changing “John” to “Sally” or “apples” to “pears”—and don’t change the substantive problem. They found these small changes can cause significant performance drops in models.
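As a toy illustration of that perturbation idea, surface details such as names, objects, and numbers can be swapped while the underlying arithmetic stays the same. The template and values below are made up; the real GSM-Symbolic benchmark generates variants from symbolic templates of actual GSM8K problems.

```python
import random

TEMPLATE = ("{name} picks {a} {fruit} on Monday and {b} more on Tuesday. "
            "How many {fruit} does {name} have in total?")

def make_variant(rng):
    """Instantiate the template with randomized surface details; the answer is always a + b."""
    name = rng.choice(["John", "Sally", "Amina", "Wei"])
    fruit = rng.choice(["apples", "pears", "plums"])
    a, b = rng.randint(3, 20), rng.randint(3, 20)
    return TEMPLATE.format(name=name, a=a, fruit=fruit, b=b), a + b

rng = random.Random(42)
for _ in range(3):
    question, answer = make_variant(rng)
    print(question, "->", answer)
```

A model that truly reasons over the problem should score the same on every variant; a drop under these cosmetic changes points to pattern matching on the original phrasing.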
On one level, we know about overfitting and benchmark gaming, but it’s worrisome. It suggests model reasoning may not be as strong as we think. Ruben, does that conclusion seem right to you?
Ruben Boonen: It makes sense that people want to benchmark models, and companies have incentives to do well. Public data ends up in training data, so that makes sense. Looking at the figures, they had different tests. One changed names and objects, with a drop between 0.3% and 9%; for frontier models, the drop wasn’t that large—for GPT-4, it was only 3%. But they had harder benchmarks where they added/removed conditions or added multiple conditions, with much larger drops—up to 40% for some models, even 65.7% in one case.
Tim Hwang: Even for frontier models, there’s a big drop. But with reasoning and Chain of Thought, the o1 benchmarks dropped substantially less—still a lot, like 17.7%—so I’m not sure how to feel. Is this a problem that will resolve as reasoning improves, or not?
Kaoutar El Maghraoui: Some results were surprising. This work from Apple researchers provides a critical evaluation of the reasoning capabilities of language models, exposing that LLMs rely heavily on pattern matching rather than true reasoning. I don’t think LLMs engage in formal reasoning; they use sophisticated pattern recognition, which is brittle and prone to failure with minor changes. For example, in the GSM-Symbolic test, if you include irrelevant information, like “some apples are smaller,” which doesn’t affect the reasoning, the LLM might use that in the calculation. They also exposed high variance between runs, showing inconsistency. Slight changes in problem structure resulted in accuracy drops of up to 65%. The key highlight is that LLMs try to mimic reasoning but rely on data patterns; their capability for consistent logical reasoning is limited. The findings suggest current benchmarks may overestimate LLM reasoning capabilities, and we need improved evaluation methods.
Maya Murad: I love this new benchmark from Apple. We’ve talked about issues with benchmarks in previous sessions, so this is a great step toward more generalizable insights. It was predictable. Whenever I talk about “reasoning,” I use quotes because we’re anthropomorphizing what LLMs do; it’s pattern matching at scale. They showed the model patterns it hasn’t seen before. You can update the model’s training with new patterns to unlock new use cases—that’s great. It’s an imperfect technology that can do useful things, but I don’t think current technology can do logical reasoning. We have to accept it for what it is. When making these systems useful, we’ll always need a human in the loop or on the loop, surfacing confidence levels. Using this knowledge that it’s imperfect can make systems more robust. Papers show that combining this technology with humans increases overall robustness. We should accept that rather than think we’re on a path to AGI with current tech.
Tim Hwang: Let’s make that concrete with a last question for Ruben. There’s excitement about using AI to harden computer networks for cybersecurity defense. Based on Maya’s framework, is cybersecurity a pattern-matching question or a reasoning question? It suggests that if cybersecurity defense is mostly pattern matching, the technology has strong legs here. But if more is needed, there are questions about its fit for purpose. Any final thoughts?
Ruben Boonen: Security is a vast and complex domain. In some cases, reasoning is important; in others, it’s about data collection, correlation, and summarization. For years, traditional machine learning has been used in endpoint detection solutions to great effect. Now, with generative AI, there’s a push to integrate it into the backend to correlate and synthesize events that were previously handled manually, speeding up those processes. But humans must be involved to evaluate those events. I think it’s going to be big for our industry.
Tim Hwang: We’ll end on a more stress-inducing segment. As you know, there’s a big election coming up in the US and worldwide. OpenAI recently disclosed that they’re seeing state actors increasingly leverage AI for election interference, using models to generate fake articles, social media content, and other persuasive tactics. It’s interesting that the technology is mature enough for election interferers to use it. Ruben, as someone who thinks about security and vulnerability, what do you think? Is this an issue we’ll solve, or will it get worse? What’s the trajectory? Do we live in this world now, or is it temporary?
Ruben Boonen: I think we just live in this world now. This is my hot take. AI has many implications. I’d categorize this as social engineering—persuasive messaging, generated images/videos—where risks are immediately evident. Another category is using AI to speed up malicious attacks, which is less mature. Going through OpenAI’s report, it’s great they’re proactive and working with industry partners, but it must be new to them. My conclusion was that they found limited effect from what they saw; the most effective post was a hoax about an account on X saying, “You haven’t paid your OpenAI bill,” but they said it wasn’t generated by their API. Impacts might still be limited, but we may be biased because we’re only looking at threats we detected and stopped. It wouldn’t surprise me if there are more successful influence campaigns using self-hosted open-source models that we don’t have telemetry on. That’s a little paranoia-inducing, but that’s where we are.
Tim Hwang: Maya, any thoughts? Is there anything we can do, or are we doomed to a world of fake AI influence operations?
Maya Murad: I think it’s the state of the world, unfortunately. There are bad actors. When social media came about, it was exciting, bringing us closer together as a global community, but for bad actors, it meant bigger scale, better skill, and bigger reach. It’s the same with AI. The world is moving fast, and I wonder about society’s ability to catch up. Already, the educational system hasn’t caught up to a post-AI world. When it comes to keeping information factual and to how society is organized, I wonder if we’ll be able to get there. It should be a concerted effort with more global focus and public spending, because we need more resources to catch up to where technology is taking us.
Ruben Boonen: I want to quickly jump in on competing incentives. It’s not always clear that social media platforms have the correct incentives to deploy AI for sentiment analysis to see which posts promote misinformation or are part of networks generating similar messages. If those messages generate interaction, it might be good for the platforms. There’s a problem with misaligned incentives getting in our way.
Tim Hwang: Yeah, it’s not just what technology can detect but if it’s implemented and used, and the reasons for doing so. Kaoutar, round us out. How do you think about these issues? Are we doomed?
Kaoutar El Maghraoui: It’s a scary state. As technology gets better, especially generative AI, threats will get more sophisticated and clever in reaching masses and causing harm. To mitigate misuse, a lot needs to be done: robust detection tools for AI-generated content, regulation, oversight, collaboration between governments and companies on clear guidelines, transparency, and user education. Public awareness about AI-generated misinformation is important to help people critically evaluate online content. Partnerships across industries to share insights are key. Increasing awareness is important; OpenAI did clever things to identify and halt AI operations focused on election-related misinformation. We need more of that. But as Maya and Ruben said, this is the world we live in. It’s an arms race; as technology gets better, threats get more sophisticated. The cases OpenAI highlighted I’d label as low-sophistication across different use cases. With good engineering, there could be campaigns that aren’t easy to detect, especially if they use models outside the visibility of OpenAI and other frontier providers.
Tim Hwang: That’s an intriguing outcome I hadn’t considered. What’s the “evil OpenAI”? Is there an evil Sam Altman running a criminal foundation model? Presumably, yes.
You always know what you’ll get on Mixture of Experts. We’ve gone from solving all infectious diseases and 20% GDP growth to sinister invisible influence operations controlling you as we speak—from the very good to the very bad of AI. Kaoutar, thanks for joining us. Maya, thanks for coming back. Ruben, we hope to have you on again. If you enjoyed what you heard, you can get us on Apple Podcasts, Spotify, and podcast platforms everywhere. We’ll catch you next week on Mixture of Experts.