What is all the hype around the Deep Research feature? In episode 43 of Mixture of Experts, join host Tim Hwang along with Kate Soule, Volkmar Uhlig and Shobhit Varshney to distill the biggest stories in the world of AI. This week, hear the experts discuss the deep research features, built on reasoning models, coming out of OpenAI, Google Gemini, Perplexity, xAI (Grok-3) and more!
Next, listen to the experts break down what OpenAI’s rumored release of an inference chip might mean for success in the AI chip game. Then, explore what the experts have to say about the capabilities of small vision-language models (VLMs). Finally, hear the experts discuss a job posting for an AI agent from the startup Firecrawl. Is this the future for AI tools in the workforce? Tune in to this episode of Mixture of Experts to find out.
The opinions expressed in this podcast are solely those of the participants and do not necessarily reflect the views of IBM or any other organization or entity.
Tim Hwang: What was the last thing that you had to do deep research on? Kate Soule is Director of Technical Product Management for Granite. Kate, welcome back to the show. What have you been researching?
Kate Soule: I’ve been researching KV cache management.
Tim Hwang: Volkmar Uhlig is Vice President, AI Infrastructure Portfolio Lead. Volkmar, welcome back to the show. What have you been looking into?
Volkmar Uhlig: Indices and vector databases.
Tim Hwang: And last but not least is Shobhit Varshney, Senior Partner Consulting on AI for US, Canada, and Latin America. Shobhit, welcome to the show. What have you been looking into?
Shobhit Varshney: Quantum computing, especially how it intersects with AI.
Tim Hwang: All right, all that and more on today’s Mixture of Experts. I’m Tim Hwang, and welcome to Mixture of Experts. Each week, MOE distills the biggest stories in the world of artificial intelligence and gets you what you need to know. As always, we have a lot to cover. We’re going to talk a little about rumors on OpenAI’s inference chip, small vision models, and a job listing for an AI agent. But first, I really want to talk about “Deep Research.”
It’s a funny phrase because nowadays, everybody has a feature called “deep research.” Google Gemini has one, ChatGPT announced theirs, Perplexity has one, and not to be left out, Grok has also launched a feature called “DeepSearch.” These are all features where you do a query and get back an in-depth research report. This all literally happened in the last month or two.
Kate, maybe I’ll start with you. Why is everybody suddenly launching a deep research feature? What are they trying to do, and why is it suddenly so competitive? Why is deep research the new hot thing?
Kate Soule: Yeah, I think it’s helpful to understand the broader context of when these features were released. Back in January, DeepSeek came out with their R1 model, demonstrating crazy reasoning capabilities. OpenAI, maybe in response to show they’re also innovating, launched their deep research capability as a fast follow, which leverages the o3 reasoning model behind the scenes. Ever since that model came out, we’ve seen a lot of other companies create their own versions, following the broader shift toward reasoning models that has really taken the world by storm.
Tim Hwang: Yeah, for sure. Shobhit, maybe I’ll bring you in here. One question I have is, how do you win in this competition? We’re suddenly in a world with four or five “search engines” again. Do you see any differentiation between the companies and how they’re trying to win on this particular feature?
Shobhit Varshney: I think Google came up with this first in December, followed by OpenAI, then Perplexity, and now Grok-3. The overall intention is, given a complex ask, the model will research across a multitude of websites, cluster them into topics, and find related information—simulating how a human would open 20 browser tabs.
Companies like Google have a really good understanding of how web pages are structured and semantically connected. Most of these deep research models start by creating a plan. With Google, I can hit “edit” and change the plan. With ChatGPT, it asks follow-up questions. There’s often a need for disambiguation—for example, “Transformer”: the movie or the model? In some cases, you need to narrow the field and go very deep.
So first, it establishes goals and a research plan, just like a good research analyst. Then it crawls the web, finds relevant websites, extracts content, and can even spawn additional queries if it finds a new, relevant topic.
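The loop Shobhit describes, establish a plan, crawl, extract, then spawn follow-up queries, can be sketched in a few lines. This is a toy illustration, not any vendor's implementation; the planner and crawler here are stand-in stubs:

```python
from dataclasses import dataclass, field

@dataclass
class ResearchTask:
    """One step in the research plan: a query and the notes it yields."""
    query: str
    findings: list = field(default_factory=list)

def make_plan(ask: str) -> list:
    # Stand-in planner: a real system would ask an LLM to decompose the ask.
    return [ResearchTask(f"{ask}: overview"),
            ResearchTask(f"{ask}: recent developments")]

def search_and_extract(query: str) -> list:
    # Stand-in crawler: a real system would fetch pages and extract content.
    return [f"notes on '{query}'"]

def deep_research(ask: str, max_steps: int = 10) -> list:
    plan, done = make_plan(ask), []
    while plan and len(done) < max_steps:
        task = plan.pop(0)
        task.findings = search_and_extract(task.query)
        # Spawn one follow-up query when the first findings arrive,
        # mimicking "found a new relevant topic, go deeper".
        if not done:
            plan.append(ResearchTask(f"{ask}: open questions"))
        done.append(task)
    return done

report = deep_research("KV cache management")
# report holds the two planned tasks plus the spawned follow-up.
```

The `max_steps` cap is the part that makes the multi-minute runs Shobhit mentions tolerable: the agent can go deep, but not forever.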
Your question was about how to win. On the B2C side, the company that can best connect the dots between websites and content will likely win. Speed matters to an extent, but whether it takes three or four minutes, I’m okay with that. What’s important is understanding, grounding the information, and providing the right citations.
In my personal experience using Perplexity Pro, its Deep Research was a little more prone to hallucination than Gemini. So far, OpenAI has been the best for the topics I’ve researched. The nuances lie in how you ground your content and interpret the different websites.
In the enterprise space, this is untapped right now. If someone asks, “Why was my claim denied?” or “What are my travel benefits?”, a human researcher has to look at multiple systems and documents. We haven’t crossed the chasm yet for deep research in the enterprise. I don’t see a single vendor that enables us to add enterprise data, model the reasoning steps, and so on. I think by the end of 2025, we’ll start to see these models become more open, with other layers on top for enterprise use. The company that figures that out will make billions, versus someone focused only on B2C.
Tim Hwang: Yeah, the use cases I’ve seen online are for people with niche needs, like researchers or bloggers. Volkmar, maybe I’ll bring you in. Over the last few episodes, you’ve increasingly become the loud skeptic on the MOE panel. Are you impressed by stuff like Deep Research? Do you use it? It feels like there are a bunch of problems, and is it even technically that impressive, or is it just a combination of existing components?
Volkmar Uhlig: I think it’s a really interesting approach, but it’s incremental. We already had the ability to search; now we’re just extending the scope. We already had the first iteration of “make a plan.” What’s changing now is multi-step reasoning and multi-step document retrieval, extending the knowledge. Larger context window sizes allow for this. If you only have a 4k context window, you can’t do it, but with 128k, you can throw lots of documents at it and start reasoning.
We’re at a junction where all the needed pieces are available. OpenAI started having access to the internet through a vector database. You needed search capabilities, long context windows, and multi-step reasoning. All these things are now individually stable, so we can build applications like this. I think it’s a really interesting application that shows the direction we’re heading: multi-minute processing before an answer comes back. It also shows we’re at a point where we’re willing to let the model run for a while without babysitting it every hundred characters, because model quality is high enough that it doesn’t just go off on a tangent.
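Volkmar's point about context windows is concrete: the whole pattern hinges on fitting many retrieved documents into one prompt. A minimal sketch of that budgeting step (the 4-characters-per-token approximation and all numbers are illustrative; a real system would use the model's tokenizer):

```python
def pack_documents(docs, context_window=128_000, reserved=8_000):
    """Greedily pack retrieved documents into the model's context budget.

    Tokens are approximated as len(text) // 4. `reserved` leaves room
    for the instructions and the model's own answer."""
    budget = context_window - reserved
    packed, used = [], 0
    for doc in docs:
        cost = len(doc) // 4
        if used + cost > budget:
            break  # with a 4k window this triggers on the first document
        packed.append(doc)
        used += cost
    return packed, used

docs = ["x" * 40_000] * 20                           # twenty ~10k-token docs
many, _ = pack_documents(docs)                       # 128k window: 12 fit
few, _ = pack_documents(docs, context_window=4_000)  # 4k window: none fit
```

This is why Volkmar says the feature was simply not buildable at 4k: most real documents cost more tokens than the entire window.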
Tim Hwang: Totally. It feels like the biggest thing is less technical and more sociological—we now have enough trust in these systems to let them run like this, which is pretty interesting.
Shobhit Varshney: One of the challenges with deep research is you don’t have a verifiable output to compare accuracy against. We struggle with this in organizations. If I get a deep document on milk regulations in Europe vs. India vs. the U.S., I don’t know what “good” looks like, so it’s difficult to verify the output. Companies are struggling with evaluations for these deep research reports. You can calculate metrics like how many paths were created, how long it took, how many websites were hit, etc., but there’s no good measure. In the real world, if I hire two research companies, they’ll come back with different documents, and I won’t have a good validation routine. I think it’s an order of magnitude tougher problem than writing code or doing math, where you can deterministically tell if the answer is correct.
Tim Hwang: Yeah, creating good benchmarks for this feature becomes very tricky. A final question, Kate: we have Google and OpenAI—giants of the space—and Perplexity, which has spent a lot of time on search. What’s interesting is Grok, which has only hit the scene recently and is launching features at parity. How do you read that? Is the space so competitive that anyone can launch cutting-edge features, or is Grok just executing incredibly well? Is it easier to launch state-of-the-art features with smaller teams?
Kate Soule: We’re benefiting from so much innovation being put into open source, which is allowing a rising tide to lift all boats. It’s allowing less traditional players to enter the market, and we’re seeing a really rich ecosystem emerge. It’s exciting to see what Grok and others can come out with.
Regarding Deep Search and Deep Research, I think deep research is one of the more practical use cases for reasoning. If we’re all innovating on reasoning, and a lot of the benchmarks are on math—which I don’t think is the killer use case for paying for reasoning tokens—research is an area where we see clear benefit. This is an early use case with demonstrable value that reasoning brings, so as new models come out, they’re often paired with a deep research-type capability.
Tim Hwang: Well, I want to move us on to our next topic—a story we seem to cover every few episodes. Every few weeks or months, there are rumors that OpenAI is working on its own chip. This time, it was a leak that OpenAI was readying an inference design with TSMC, a leading chip fab. I wanted to use this as a hook to check in on the state of OpenAI’s competition in the hardware space.
Volkmar, you’re the natural person for this. It’s interesting that, according to reports, OpenAI is investing first in inference chips. For our listeners, could you explain why this would be such a big priority? What is the upside for them in making this big bet versus using established companies?
Volkmar Uhlig: OpenAI is not building the chip by themselves; they’re partnering with Broadcom, a giant in chip design. That’s expected: they had to pick a partner if they don’t want to become a chip company, and I don’t think OpenAI wants to get into that market as a primary business model.
Looking at training versus inferencing, the requirements are very different. In training, a good chunk of the money goes into networking infrastructure and storage systems—it’s effectively a high-performance computing system. For inferencing, it’s usually a much smaller problem. We have very large models that may not fit on a single GPU, but often you’re using maybe eight GPUs at most. For a really large model used for internal verification, you might go to 16 GPUs—two boxes—but not much beyond that.
From a consumption perspective, the ratios of consumer hardware needed for inferencing are orders of magnitude larger than for training. Initially, all investment went into training because we needed to make the model. Now we have the model and want to use it, so growth is on the inferencing side. It’s a natural conclusion for OpenAI to control its destiny.
The easiest way is to look at NVIDIA’s profit statements—they’re around 68-69%. If you want a larger chunk of that revenue and profit, you partner with a chip manufacturer for an exclusive deal. I’m sure OpenAI and NVIDIA have specific deals where OpenAI pays less, but still, controlling your supply chain further down is key. The first step is owning your data center; the second is controlling the chip. By co-designing a model for the chip, you can probably get another 3-4x cost reduction. At OpenAI’s scale, this is the natural thing for any company to do—control your costs.
Tim Hwang: Sure. Shobhit, could you talk about how this might impact the market for AI services? In the past, we sold AI by highlighting the shiny new model. In the future, part of OpenAI’s pitch might be that it’s running on their chips, making it faster or more performant. Do you think that will shape the sales pitch, moving the focus from the model to the underlying infrastructure, especially as models become more open source?
Shobhit Varshney: Absolutely. TSMC is the 800-pound gorilla—they have over 65% of the market. If something happens to TSMC, the world grinds to a halt. Everybody is designing chips, but TSMC is the heart of the industry.
Amazon is a good analogy—they have their own inference chips and have built models optimized for them. When you optimize hardware architecture for software architecture, it does magic. The total cost decreases, throughput increases, and latency drops significantly.
In the enterprise world, for high-volume use cases like fraud detection or invoice processing, you need low latency at millions of transactions per day. The cost would add up quickly. This shift towards optimized inference stacks—like Amazon’s with their models, NVIDIA with NIM—is the trend OpenAI is following. For high-volume use cases, the cost goes down, making it feasible to run in production at scale. The use cases don’t change, but now we can go after high-volume ones where the ROI didn’t exist before. I’m generally excited about inference optimization because it allows me to bring more AI to clients, infuse it into more processes, and deliver higher ROI.
Tim Hwang: Yeah, for sure. Kate, do these structural changes create any dangers for open source? The dream of open source is that you can take your model and run it everywhere, building the largest possible community. Do you worry about hardware fragmentation as things get specialized and optimized for particular model families?
Kate Soule: I’m not sure I worry so much. The model around open source has always been that there are open-source versions and optimized, enterprise-supported versions that get deployed. We always need to have that balance.
Another interesting thing is that how models are trained is changing. There’s a larger emphasis on techniques like reinforcement learning, which requires a huge amount of inference on really big models. Controlling inference costs isn’t just about serving models cheaper to customers; it’s now a critical part of training. You could easily see reinforcement learning costs starting to outweigh pre-training costs.
Tim Hwang: Yeah, that’s wild to consider. I hadn’t thought about that.
Shobhit Varshney: This is in line with what Google has been doing forever. Their Tensor Processing Units (TPUs) are so well designed for distributed inferencing across multiple centers. They have multiple products with billions of users every day, deploying AI models at an insane pace. You’ll see more of this—inference-optimized models delivering great ROI at the right cost point. I’m very excited about this space.
Tim Hwang: I feel like MOE is one of the few podcasts where a panelist literally does the chef’s kiss for GPUs.
I’m going to move us on to our next topic. There’s a joke I used to make when I worked at Google: when we presented AI, we were really talking about deep learning. A decade later, we’re doing the same thing—when we say AI, we mean large language models. But lots of things are happening in AI. One of the most interesting is competition over vision models, which have gotten short shrift because LLMs take up so much space.
Kate, you’re ideal for this segment because I understand Granite is out with new small vision models. Could you walk us through that, and then let’s talk about how this space is evolving?
Kate Soule: Absolutely. A Vision Language Model (VLM) is a bit different from what folks might be familiar with, like Stable Diffusion or image generation models. A VLM is about image understanding. You send an image and a prompt as input, and text is returned as output, unlike models that start with a prompt and generate an image.
These models work by taking a standard LLM trained for language tasks and doing additional training to add a component that allows an image to be expressed as an embedding, which is fed to the language model along with the prompt embedding. The language model then returns the response.
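Kate's description maps onto a simple shape: a vision tower turns the image into patch embeddings, a learned projector maps those into the LLM's embedding space, and the result is concatenated with the prompt embeddings. A toy numpy sketch, where every size and weight is invented for illustration and the encoders are random stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64                  # LLM hidden size (illustrative)
n_patches, d_vision = 16, 32  # vision tower output (illustrative)

# Projector learned during the additional training Kate mentions.
W_proj = rng.normal(size=(d_vision, d_model))

def vision_tower(image) -> np.ndarray:
    # Stand-in for a real image encoder (e.g. a ViT): one vector per patch.
    return rng.normal(size=(n_patches, d_vision))

def embed_tokens(tokens) -> np.ndarray:
    # Stand-in for the LLM's token embedding table.
    return rng.normal(size=(len(tokens), d_model))

def vlm_input_sequence(image, prompt_tokens) -> np.ndarray:
    image_emb = vision_tower(image) @ W_proj   # project into LLM space
    text_emb = embed_tokens(prompt_tokens)
    # The language model decodes from image "tokens" then prompt tokens.
    return np.concatenate([image_emb, text_emb], axis=0)

seq = vlm_input_sequence(None, "what is in this chart ?".split())
# seq.shape == (16 + 6, 64): 16 image tokens followed by 6 text tokens.
```

The key design point is that only the projector (and some fine-tuning) bridges the two modalities; the base LLM still just sees a sequence of embeddings.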
These are becoming popular. We just saw a bunch of models drop. Granite released our vision preview two weeks ago, and the full model is coming next week—keep an eye on the IBM Granite Hugging Face page. Our model is only 2 billion parameters, so it’s small and can run locally. We’ve taken a specific approach focusing on document understanding tasks—think charts in a PDF, poorly scanned documents, GUIs, or dashboards. You can take a screenshot, put it in the chat box, and ask questions.
Granite can do general vision understanding, but we’ve optimized for document understanding, thinking from an enterprise customer perspective where there are valuable use cases, especially with projects like Docling and multimodal RAG.
So, Granite preview released two weeks ago; the full version is coming next week. Qwen released their family of VLMs today, ranging from 3B to 72B parameters. There’s a lot of other work in the space, like Pixtral, and we expect this capability to grow.
Tim Hwang: Shobhit, could you give a picture of how competition is evolving here? Similar to deep research, people are figuring out where VLMs fit in the market. With these small vision models, what are enterprise people wanting to use them for?
Shobhit Varshney: Absolutely. We’ve been working on vision models with clients for a while. Earlier, the heavy lifting was done on servers in the cloud. For example, Gemini 1.5 Pro can chew through a whole video and understand what happened. Those are very large models.
We’ve delivered use cases for clients like a large consumer goods company checking planograms in stores, ensuring labels are compliant, or describing products in catalogs for retailers. Usually, these tasks were human-driven. Now, VLMs have evolved from just identifying objects to having a better semantic understanding.
For example, one client has a camera pointing at counters to see which is busier, doing people counting on the fly. It understands temporal changes—what changed from frame 2 to frame 19 in a video.
OCR was the first wave; now we’re getting into more semantic understanding, unlocking more use cases. And as Kate said, models are getting much smaller, allowing us to do two things: run on-device for security reasons (e.g., in manufacturing, defense, drones) and handle high-volume use cases like document processing millions of times a day. The cost difference between a 7B and a 30B parameter model impacts ROI significantly. We’re at a point where small models deployed at scale or on-device are delivering critical ROI.
Tim Hwang: That’s great. Volkmar, I’m going to give you an impossible question to close this segment. As Shobhit said, there are interesting pressures, and it’s not clear how much AI workload will happen at the edge versus in big data centers. But with smaller models being perfectly performant for industrial tasks, it seems we’re headed toward more edge computing. How do you size up where you think models will ultimately live? Will it be 50/50, or mostly on the edge?
Volkmar Uhlig: I think it will be a bit of everything. From a bandwidth perspective, transmitting a few words is cheap—a low-bandwidth channel. With vision, it’s a high-bandwidth channel. Computation has always been a trade-off: do I bring computation to the data or data to the computation?
With text, ignoring latency, it favored data centers for consolidation—like having a nuclear power plant versus a generator in your backyard. The nuclear plant is more efficient but requires a power distribution network.
Similarly, if I have a high-bandwidth stream and can solve it with a relatively small model at the edge, the economics work in that favor. The trend is making decisions based on question complexity. For videos, that’s really hard, but look at what the iPhone does—there’s a model router that decides: if it’s an easy question, stay on-device; if complex, offload to the cloud.
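The router Volkmar describes boils down to a complexity estimate and a threshold. A toy sketch, where the heuristic is invented purely for illustration (a production router would typically use a small classifier model):

```python
def estimate_complexity(question: str) -> float:
    # Invented heuristic: longer questions and analytical verbs score higher.
    hard_markers = ("why", "compare", "summarize", "analyze")
    score = len(question.split()) / 50
    score += 0.5 * sum(m in question.lower() for m in hard_markers)
    return min(score, 1.0)

def route(question: str, threshold: float = 0.4) -> str:
    """Return which tier should answer: the local model or the cloud model."""
    return "cloud" if estimate_complexity(question) > threshold else "on-device"

route("What time is it?")                                  # -> "on-device"
route("Compare these two quarters and summarize drivers")  # -> "cloud"
```

For video the same decision applies per stream rather than per question, which is why the economics push the small model to wherever the camera is.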
The cheapest input device is a camera—capturing 30 frames per second with millions of data points. The entropy over time is low, but you capture a lot of information in one shot. Audio is more feasible to transmit, but video pushes it to the edge.
Models will specialize. As Shobhit said, for industrial use cases in a manufacturing plant, you may want to keep it local because you already have industrial-scale installations and data centers. “Locally” doesn’t necessarily mean inside the camera; it could be in a building with a cable running a couple hundred feet.
Tim Hwang: Yeah, that’s a good reminder that “the edge” depends on where you are.
Volkmar Uhlig: We have a natural tendency to think of the phone because everyone carries one—a package of battery, camera, and processor—but that’s not necessarily true.
Tim Hwang: A final question, Kate: if folks want to learn more about Granite’s work, where should they go? I know there’s a big announcement and release next week, but anywhere online people should pay attention to?
Kate Soule: We always post everything on our Hugging Face page under the IBM Granite org. You can also check out ibm.com/granite for the latest.
Tim Hwang: I’m going to move us to our last story—more of a publicity stunt, but it raises interesting questions. Firecrawl, a Y Combinator startup, got attention for putting out a job description looking for someone to assist in their web crawler business, but it specifically said “humans need not apply—only AI agents.” The founders admitted it was conceptual and a funny experiment—a publicity stunt.
But it got me thinking about how far agents will go, particularly in the next year. Will we see calls for agents for certain tasks, where you could hire a human or put out a call for an agent to do the job?
Kate, I’ll throw it to you first: are we living in that world? Are agents getting good enough that in 2025-2026, we’ll see job listings for agents specifically?
Kate Soule: Well, what they did was clearly a bit of marketing, tongue-in-cheek. But I think it’s very realistic to have a near future with catalogs of agents, and people creating specs for agents they want others to build and sell. I don’t think it will be total job replacement; I see more opportunity for agents augmenting human roles.
As we look at job descriptions next year, having expertise in managing agents and working with AI systems will be a huge part of the new workforce.
Tim Hwang: Yeah, it reminds me of when “Microsoft Office Suite” was a key skill. “Experience with agents” might become one.
Shobhit Varshney: This isn’t new. Dharmesh Shah, the co-founder and CTO of HubSpot, launched Agent.ai last year. It’s a large network where you can browse a catalog of rated agents and hire them for specific tasks, paying by different metrics.
Enterprises look at multi-agent workflows differently. We spent the last 5-10 years on structured, directional flows with RPA, but we only automated 10-15% of workflows because when something went wrong, humans had to take over. With LLM agents, we don’t have to define every deterministic step. Within thin guidelines, the LLM can figure out which API to call, create a plan, and iterate.
Very narrow tasks will get automated with LLM agents rapidly. Native companies like Salesforce will have their own Agentforce, and external tools like Azure Copilots or watsonx Orchestrate will orchestrate work across agents. The technology is maturing quickly.
What’s missing in enterprises is “ask to task”—humans are incredibly good at going from an overall ask to a series of tasks. Tomorrow, LLM agents can trigger these tasks in parallel. Companies that create a “golden thread” of ask-to-tasks will win. The agents themselves will become commoditized, but the planner agent that handles ask-to-task unlocks multi-agent workflows. To get to multi-agent, you have to solve ask-to-task.
Tim Hwang: There’s a fun question about the paradigm for integrating agents into organizations. A job listing for an agent is silly because we have B2B SaaS—we just integrate it. The weird world opened by multi-agent stuff, like the Google paper on “AI scientists,” where the paradigm was creating a one-to-one analogy with a laboratory, is where you might “hire” agents. But that makes me cringe—this isn’t preschool for agents. There has to be a better way.
Volkmar Uhlig: I think the APIs for that form of orchestration are open. We went from centralized software to SaaS services where you invoke an API—that’s how it would work.
I want to give a different angle: we have something similar now with Amazon Mechanical Turk, where micro-tasks are given out for processing. There’s an economic model of having compute capacity available and selling the work product. You could go to a centralized place to pick up work items because you have spare capacity or a specialized model.
This is more like work queue management at a meta layer—not selling tokens, but solving a problem and posting a result. That’s where it could go.
Another angle is APIs with “baby agents.” You may not have data access, so you could go to a company that sits on a data store it doesn’t want to share but is happy to share research results or summarizations. You might talk to an agent instead of an API—it’s just moving the interface one level up. Your interface to that dataset might be a large language model.
Shobhit Varshney: From a hands-on perspective, deploying large multi-agent networks for clients—like a pharma company for content creation and auditing, a healthcare client for customer-facing members, a telco for software development, a BPO for three-way matching—we’ve put several in production in the last six months.
One challenge is describing guidelines to these agents. We’ve figured out that English is the way to talk to LLM agents, but that won’t scale in enterprises for complex workflows. The context we give can be 2-3 pages for a small task because we have to add all these band-aids and “if-then” statements codified in English. For a planner, this breaks down—you can’t give a 30-page context. The latency is high, and there are overlapping rules.
As a community, we need to make progress—IBM Research is working on this—to get to a structured contract, like how Mechanical Turk works, for invoking APIs or microservices. We’ve solved this in software engineering; we need to bring those principles. It will no longer be natural language; we’ll need better software design principles for large-scale enterprise deployments to reduce hallucinations and improve auditability and evaluations.
Kate Soule: There are two key points: as a developer, how do I express something in a controllable, programmable way with clear inputs and output guarantees? And when agents pass information, it doesn’t have to be natural language—that’s not efficient. What is the most effective way to bridge that communication gap?
I really hate the “nursery of agents” idea—each with a persona like “I’m a critic agent.” How do we set this up as a program? They’re not people; they’re instructions with clear requirements. There are agentic capabilities like reflection loops, validation loops, and planning loops, but at the end of the day, it’s a clear program where information is passed from one part to another to execute a task.
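One way to read Shobhit's and Kate's point: replace pages of English instructions with a typed contract that declares the task, its inputs, an output schema, and constraints, so outputs can be validated programmatically. A minimal sketch, with all field names and the example task invented for illustration:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AgentContract:
    """Structured spec for an agent task, standing in for prose instructions."""
    task: str
    inputs: dict
    output_schema: dict          # field name -> expected Python type
    constraints: tuple = ()

def validate_output(contract: AgentContract, output: dict) -> bool:
    # The programmatic guarantee Kate asks for: check fields and types.
    return all(k in output and isinstance(output[k], t)
               for k, t in contract.output_schema.items())

claim_triage = AgentContract(
    task="explain why a claim was denied",
    inputs={"claim_id": "C-1042"},
    output_schema={"reason": str, "policy_section": str},
    constraints=("cite the policy section", "no speculation"),
)

validate_output(claim_triage, {"reason": "out-of-network provider",
                               "policy_section": "4.2"})   # True
validate_output(claim_triage, {"reason": "out-of-network"})  # False
```

The contract is also what makes agents composable: a planner can wire one agent's validated output into another's declared inputs without any natural language in between.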
Tim Hwang: Yeah, we’ve gotten so carried away by the dream of the agent as a little person that we forget the optimal strategy might just be programming. I was looking at a prompt for an agent that said, “Be persistent.” How has computer science evolved to this? There has to be a better way.
Well, great. That’s all the time we have today. Kate, Volkmar, Shobhit, thanks for joining us. And thanks to all you listeners. If you enjoyed what you heard, you can get us on Apple Podcasts, Spotify, and podcast platforms everywhere. We’ll see you next week on Mixture of Experts.