Docling’s rise: The IBM toolkit turning unstructured documents into LLM-ready data


Anabelle Nicoud, Staff Writer, IBM

This article was featured in the Think newsletter.

With more than 37,000 stars on GitHub and counting, Docling is one of IBM Research's most popular toolkits. It answers a simple but critical question in AI pre-training and fine-tuning: how do you get clean, structured data out of unstructured documents?
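In practice, that conversion is a few lines of Python. The sketch below follows Docling's publicly documented quickstart; it assumes `pip install docling` and should be checked against the current docs, since the API may evolve.

```python
# Hedged sketch of basic Docling usage, based on the project's public
# quickstart (requires `pip install docling`; verify against current docs).
def pdf_to_markdown(source: str) -> str:
    """Convert a PDF (local path or URL) into LLM-ready Markdown."""
    # Imported lazily so this module still loads where docling is absent.
    from docling.document_converter import DocumentConverter

    converter = DocumentConverter()      # bundles layout and table models
    result = converter.convert(source)   # parse the unstructured document
    return result.document.export_to_markdown()

# Usage (not run here): pdf_to_markdown("report.pdf")
```

The same `result.document` can also be exported to structured formats such as JSON, which is what makes the output usable downstream for training data or RAG pipelines.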

“How hard can it be? Well, it can be very hard,” said Peter Staar, a Principal Research Staff Member at IBM Research in Zurich and chair of the technical steering of Docling at the Linux Foundation, during a recent interview.

The Docling team marked an ambitious first year, building tools for document conversion, precision extraction and local deployment. It also collaborated with Red Hat on the launch of Docling OpenShift Operator and launched SmolDocling, an ultra-compact vision-language model for end-to-end multi-modal document conversion.

Docling, donated to the Linux Foundation, continues its growth with a push into agentic AI. “We’re building systems that can generate documents dynamically,” Staar said.

IBM Think spoke with Staar about the evolution behind Docling, from ideation to open-sourcing the toolkit.

Q: Docling is one year old, and it’s one of IBM’s most successful projects on GitHub. Tell me what led you and your team to develop Docling.

A: I used to work a little bit on knowledge graphs and knowledge extraction, about six or seven years ago. There was a lot of work in AI back in those days, and to actually populate those knowledge graphs, one typically had to process documents from interesting sources.

We did a lot of work on processing all kinds of documents, but PDFs in particular. That in itself started to become a workstream of its own. We did a lot of work on document collection, storage and ingestion, and on making documents searchable, way before RAG: making sure that the content from figures and tables is nicely captured and used downstream, for example in materials discovery. All these technologies and algorithms that we built were captured in a set of service-like delivery mechanisms. And I think we kind of changed gears and said: okay, maybe we can package the algorithms and all of the technologies we built differently.

Q: This was before the “new” generative AI landscape. What changed with the arrival of large language models?

A: Docling started a bit more like a service that we would call IBM Deep Search now. We wanted to reimagine that. With all the new AI, it also introduced questions like, “Can we use it for making training data available?” That’s when Docling started to have a big role. Because we suddenly had a different type of audience that needed that to actually interact, or have large language models interact, with documents and unstructured data.

Q: Docling is famously open source, and IBM contributed Docling to the Linux Foundation earlier this spring. Why was open-sourcing Docling your choice from the beginning?

A: We knew it was very interesting as a very necessary tool for feeding those language models, and now vision-language models. What we weren't really sure about, and this was kind of a discussion, was: if we make it all open source, if we make it available as a tool, is that really going to be useful? Because back in the day, there were already quite a few tools and libraries doing something very similar. But we knew that our algorithms were definitely better. The question was: do we get the audience? That was a bit of a leap we had to take. We were very adamant that we wanted to follow a few very core principles.

Q: What were those principles?

A: Efficiency was one: making sure that everything could run locally on your laptop, so that you could develop very quickly without having to rely on other services. Another was very high-quality output, with very high precision and recall. The combination of those things was ultimately a big, big contribution to the success. And then, from a community perspective, we've always put a lot of emphasis on being fast at fixing bugs, iterating, adding new features, and continuously keeping the momentum and the velocity very high. Although maybe in the beginning that was not so important, because we didn't have a huge community behind it yet.

Q: What was the impact of Red Hat on Docling’s journey to open source and its current development?

A: Red Hat was really the spark, or the motivation, to make that jump to open source. Since then, we've had a very good collaboration with them. It has now grown beyond InstructLab to the teams from OpenShift. That is a stream of its own that we are super happy about. We are now working on dedicated operators for large-scale ingestion, so that banks can also use it on OpenShift. And there's a lot of work at IBM: we see a lot of consulting teams and client engineering teams that are using Docling to solve problems, and then to actually start building agents.


Q: What does the collaboration with the Linux Foundation change for you and the team working on Docling?

A: That is something that, in my view, is extremely powerful. Because we are now part of the Linux Foundation, we have a lot more tools to actually see what is happening: what is the velocity of the project, who is coming in, who is contributing, how many contributors we have, how it is evolving. All these high-level metrics are now there without us having to build the tooling around them ourselves. Seeing the trends helps us understand where we should focus to make the open-source project even better. It's really a feedback mechanism.

Q: Finally, what’s next for Docling?

A: Looking ahead, there are really two main directions I'd love to explore. One is structured content extraction. The idea of on-the-fly generation of structured content from unstructured data is very, very appealing. Think about food packaging. The ability to say: 'Okay, I can create the schema of the database. I need a polymer, I need these properties, its thickness, its water vapor transmission rate, and so on. You know the column names, you know where the data is. Now just fill it out and get me as many examples as you can.' That's definitely in the realm of possibility, and something we'd like to explore, also as a workflow within OpenShift.
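The schema-filling idea Staar describes can be sketched in miniature: given known column names, pull the matching values out of unstructured text. The schema and regex rules below are hypothetical toy stand-ins; Docling's real extraction relies on trained models, not hand-written patterns.

```python
import re

# Toy schema for the food-packaging example: column name -> extraction rule.
# Both the column names and the regexes are illustrative assumptions.
SCHEMA = {
    "polymer": r"polymer[:\s]+([A-Za-z]+)",
    "thickness_um": r"thickness[:\s]+(\d+(?:\.\d+)?)\s*(?:um|µm)",
    "wvtr_g_m2_day": r"water vapou?r transmission rate[:\s]+(\d+(?:\.\d+)?)",
}

def fill_schema(text: str) -> dict:
    """Fill the known column names from unstructured text; None if absent."""
    row = {}
    for column, pattern in SCHEMA.items():
        match = re.search(pattern, text, flags=re.IGNORECASE)
        row[column] = match.group(1) if match else None
    return row

spec_sheet = (
    "Packaging film. Polymer: LDPE. Thickness: 50 um. "
    "Water vapor transmission rate: 0.8 g/m2/day."
)
print(fill_schema(spec_sheet))
# → {'polymer': 'LDPE', 'thickness_um': '50', 'wvtr_g_m2_day': '0.8'}
```

The appeal of the vision Staar outlines is that a model would do this for arbitrary schemas and arbitrary documents, with no per-document rules to write.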

The other direction is agents. What we really want to do is build a few very strong use cases—for researchers, for AI engineers—for scenarios where we think, ‘This is the next level.’ It’s not just conversion anymore. We’re thinking through it. We’re generating and manipulating documents. What does that entail? It’s a lot of experimentation. I wouldn’t even call it research—it’s more like taking the tools, putting them together, seeing what works, where it fails, and how to make it robust.
