Running large language models can waste enormous compute, because inference happens in two very different phases: prefill, which processes the entire prompt at once and is compute-bound, and decode, which generates output tokens one at a time and is bound by memory bandwidth. When both phases run on the same hardware, part of each GPU's capacity sits idle at any given moment. Enter llm-d, an open-source framework that splits these phases apart and schedules each on hardware suited to it.
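The intuition behind that split can be sketched with a back-of-the-envelope model in Python. This is purely illustrative (it is not llm-d's scheduler, and the timing numbers are made up): it compares serving a batch of requests on one mixed pool, where each request's compute-bound prefill and bandwidth-bound decode run back to back, against two specialized pools working as a pipeline, where total time is bounded by the busier pool rather than the sum of both phases.

```python
# Toy comparison of colocated vs. disaggregated inference serving.
# Assumption: time units and request shapes are invented for illustration.

def colocated_time(requests):
    # One mixed pool runs prefill then decode serially for every request:
    # during decode the compute units idle, during prefill the bandwidth does.
    return sum(r["prefill"] + r["decode"] for r in requests)

def disaggregated_time(requests):
    # Two specialized pools pipeline the work; throughput is limited by
    # whichever pool has more total work, not by the sum of both phases.
    prefill_pool = sum(r["prefill"] for r in requests)
    decode_pool = sum(r["decode"] for r in requests)
    return max(prefill_pool, decode_pool)

# Ten requests: a short compute-heavy prefill, a long bandwidth-heavy decode.
requests = [{"prefill": 2, "decode": 8} for _ in range(10)]
print(colocated_time(requests))      # 100 time units on one mixed pool
print(disaggregated_time(requests))  # 80 time units with split pools
```

In this toy setup the split pools finish in 80 time units instead of 100, and the gap widens the more lopsided the two phases are, which is the kind of workload-dependent gain the benchmarks below describe.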
At this week’s KubeCon, Christopher Nuland, a Technical Marketing Manager at Red Hat, showed developers how the community-built open-source project can be deployed on Kubernetes-based platforms. “We haven’t seen this kind of support for a project since Kubernetes itself,” Nuland said in an interview with IBM Think.
“Llm-d is built to really reduce a lot of the resource consumption of AI and improve a lot of the efficiencies of these AI models,” he said.
According to Nuland, llm-d has shown 23% to 200% improvements in model inference efficiency in early benchmarks, depending on the use case it’s configured for.
IBM Research, Google Cloud, Red Hat and NVIDIA are among the founding contributors to the open-source project, and large enterprises in industries like banking and finance are also showing interest, said Nuland, noting that the buzz may be a sign that banks are eager to gain an edge through technological breakthroughs.
Llm-d also aims to address growing concerns around AI and data sovereignty. As organizations look to deploy AI across borders while maintaining strict control over data and model behavior, llm-d’s architecture lets teams isolate model caches by region, avoiding cross-pollination of sensitive data while still sharing infrastructure efficiently.
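The isolation idea reduces to a simple invariant: a cache lookup must never cross a region boundary. The Python sketch below is hypothetical (the `RegionalCache` class and region names are inventions for illustration, not llm-d's API), but it shows the property the architecture is after: entries cached for one region are invisible to requests from another, even though both go through the same serving layer.

```python
# Illustrative sketch of region-isolated caching (not llm-d code).
# Assumption: region labels like "eu" / "us" and this class are hypothetical.

class RegionalCache:
    def __init__(self):
        # One independent cache per region; no shared namespace between them.
        self._caches = {}

    def put(self, region, prefix, entry):
        self._caches.setdefault(region, {})[prefix] = entry

    def get(self, region, prefix):
        # Lookups are scoped to the caller's region, so data cached for
        # "eu" can never be served to a "us" request.
        return self._caches.get(region, {}).get(prefix)

cache = RegionalCache()
cache.put("eu", "customer-prompt", "eu-cached-entry")
print(cache.get("eu", "customer-prompt"))  # eu-cached-entry
print(cache.get("us", "customer-prompt"))  # None: isolated by region
```

The same shared infrastructure handles every request; only the cache namespace is partitioned, which is what allows efficiency and sovereignty to coexist.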
“2026 is going to be a really exciting year for llm-d,” said Nuland. “We’re going to see a lot more adoption, especially in enterprise environments, and a lot more integration into platforms like IBM Cloud and [Red Hat] OpenShift. And I think we’re going to start seeing llm-d become the foundation for how we do observability around AI.”