IBM Research-Zurich

Corpus Conversion Service Makes PDF Content Discoverable

This week, my team from IBM Research will debut a massively scalable document ingestion system, the Corpus Conversion Service, at the prestigious ACM Conference on Knowledge Discovery and Data Mining (KDD 2018) in London. Our AI-based cloud system can ingest 100,000 PDF pages per day (even of scanned documents) on a single blade server with accuracy above 97 percent—and then train and apply advanced machine learning models that extract the content from these documents at a scale never achieved before.

According to Adobe, there are roughly 2.5 trillion Portable Document Format (PDF) files currently in circulation. Think of the knowledge these files contain: scientific articles, technical literature, and much more. But all that content is “dark,” or unused, because until now we have had no way to ingest large numbers of PDF files and make their content usable (that is, structured).

Making knowledge discoverable

PDF files often combine vector graphics, text, and bitmap graphics, all of which make extraction of qualitative and quantitative data quite challenging. In fact, automatic content reconstruction has been a problem for over a decade. While many document conversion solutions are available, none of them addresses scalability or applies AI, which means that they rely on expensive human-based maintenance and upgrading.

Two steps further

My colleagues Peter Staar, Michele Dolfi, Christoph Auer, and I, with initial support from Roxana Istrate and Matthieu Mottet, decided to approach the problem of document conversion in a new way. In contrast to traditional rule-based algorithms, our tool is unique in that it uses generic machine learning algorithms, which produce models that can be trained easily and quickly on ground-truth labeled data acquired by human annotation. A key element is that we designed the human-computer interaction in the system to allow very fast and massive annotation without any computer science knowledge. This switch to machine learning gives our service a great deal of flexibility, as it can adapt rapidly to particular document templates, achieve highly accurate results, and ultimately eliminate the costly and time-consuming tuning typical of traditional rule-based algorithms.

We also implemented a processing pipeline to ingest, manage, parse, annotate, train, and eventually convert the data contained in any type of format (scanned, programmatically generated, bitmap images, Word documents, etc.) into structured data formats such as JSON or XML. Essentially, we have developed a unique, fully customizable technology in which the AI model can be trained with minimal effort. In fact, we report in our paper that an average person can annotate 20 pages per minute. Once several dozen PDFs have been annotated, the machine learning takes over, and you can just sit back and watch in awe.
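To make the parse, annotate, train, and convert stages more concrete, here is a minimal sketch in Python. It is not the Corpus Conversion Service's code or API: every name in it (Cell, Document, parse, train, apply_model, convert) is hypothetical, and the "model" is a deliberately trivial word-count rule standing in for the machine learning models described in our paper. It only illustrates the flow from annotated pages to structured JSON.

```python
import json
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Cell:
    """A parsed text cell on a page (hypothetical structure)."""
    text: str
    label: str = "unlabeled"


@dataclass
class Document:
    name: str
    cells: list = field(default_factory=list)


def parse(name, raw_lines):
    """Parse stage: turn raw lines into unlabeled text cells."""
    return Document(name, [Cell(s.strip()) for s in raw_lines if s.strip()])


def feature(cell):
    """Toy feature (an assumption): short cells versus long cells."""
    return "short" if len(cell.text.split()) <= 4 else "long"


def train(annotated_docs):
    """Train stage: learn the most common human label per feature value."""
    counts = {}
    for doc in annotated_docs:
        for cell in doc.cells:
            counts.setdefault(feature(cell), Counter())[cell.label] += 1
    return {feat: c.most_common(1)[0][0] for feat, c in counts.items()}


def apply_model(model, doc):
    """Apply stage: predict a label for every cell of an unseen document."""
    for cell in doc.cells:
        cell.label = model.get(feature(cell), "unlabeled")
    return doc


def convert(doc):
    """Convert stage: emit the labeled document as structured JSON."""
    cells = [{"label": c.label, "text": c.text} for c in doc.cells]
    return json.dumps({"name": doc.name, "cells": cells}, indent=2)


# Annotate stage: a human labels the cells of one parsed document.
annotated = parse("a.pdf", [
    "Making knowledge discoverable",
    "PDF files often include combinations of vector graphics and text.",
])
annotated.cells[0].label = "title"
annotated.cells[1].label = "paragraph"

model = train([annotated])

# The trained model then labels a document nobody has annotated.
new_doc = apply_model(model, parse("b.pdf", [
    "One of a kind",
    "The service can ingest entire collections of documents at once.",
]))
print(convert(new_doc))
```

In this sketch the human effort goes only into the annotation step; everything after train runs without further manual tuning, which is the point of replacing hand-written rules with a trained model.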

One of a kind

To the best of our knowledge, the Corpus Conversion Service is the first comprehensive system to use advanced AI at this level of scalability. While existing solutions can only convert one document at a time to a desired output format, our tool can ingest an entire collection (a corpus) of documents and build machine-learned models on top of it.

Within IBM, the Corpus Conversion Service is serving more than 250 active users for knowledge-engineering project engagements. It’s also currently being tested by external partners in various industries, including materials science, engineering, chemicals, government, oil and gas, insurance, and consumer electronics. More specifically, our partners in materials science are using the service to ingest PDFs of patents, peer-reviewed articles, and internal documents to develop new alloys. An insurance partner is using the service to convert unstructured claims.

The Corpus Conversion Service will be available on the IBM Cloud by the end of this year and will also be available for on-premises clouds. If you can’t wait that long, contact us to become a beta tester.

Corpus Conversion Service: A machine learning platform to ingest documents at scale
Peter W J Staar, Michele Dolfi, Christoph Auer, Costas Bekas
DOI: 10.13140/RG.2.2.10858.82888
