IBM Research-Zurich

Corpus Conversion Service Makes PDF Content Discoverable

This week, my team from IBM Research will debut a massively scalable document ingestion system, the Corpus Conversion Service, at the prestigious ACM Conference on Knowledge Discovery and Data Mining (KDD 2018) in London. Our AI-based cloud system can ingest 100,000 PDF pages per day (even of scanned documents) on a single blade server with accuracy above 97 percent—and then train and apply advanced machine learning models that extract the content from these documents at a scale never achieved before.

According to Adobe, there are roughly 2.5 trillion Portable Document Format (PDF) files currently in circulation. Think of the knowledge these files contain: scientific articles, technical literature, and much more. But all that content is “dark” or unused, because until now, we have had no way to ingest large numbers of PDF files at scale and make their content usable (or structured).

Making knowledge discoverable

PDF files often include combinations of vector graphics, text, and bitmap graphics, all of which make extraction of qualitative and quantitative data quite challenging. In fact, automatic content reconstruction has been a problem for over a decade. While many document conversion solutions are available, none of them address scalability or apply AI, which means that they need to rely on expensive human-based maintenance and upgrading.

Two steps further

My colleagues Peter Staar, Michele Dolfi, Christoph Auer, and I, with initial support from Roxana Istrate and Matthieu Mottet, decided to approach the problem of document conversion in a new way. In contrast to traditional rule-based algorithms, our tool uses generic machine learning algorithms, which produce models that can be trained easily and quickly on ground-truth labeled data acquired by human annotation. A key element is that we designed the human-computer interaction in the system to allow very fast and massive annotation without any computer science knowledge. This switch to machine learning gives our service a great deal of flexibility, as it can adapt rapidly to specific document templates, achieve highly accurate results, and ultimately eliminate the costly and time-consuming tuning typical of traditional rule-based algorithms.

We also implemented a processing pipeline to ingest, manage, parse, annotate, train, and eventually convert the data contained in any type of format (scanned, programmatically generated, bitmap images, Word documents, etc.) into structured data formats such as JSON or XML. Essentially, we have developed a unique technology that is fully customizable and whose AI model can be trained with minimal effort. In fact, we report in our paper that an average person can annotate 20 pages per minute. Once several dozen PDFs have been annotated, the machine learning takes over, and you can just sit back and watch in awe.
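To give a feel for the final "convert" stage of that pipeline, here is a minimal Python sketch of how labeled text cells from a page might be assembled into JSON. This is not the service's actual code or API: the `Cell` class, the `assemble_document` function, and the label names (`title`, `subtitle`, `paragraph`) are all hypothetical, chosen only for illustration, and the labels are assumed to come from an already trained model.

```python
import json
from dataclasses import dataclass


@dataclass
class Cell:
    """A text cell parsed from a page (hypothetical structure)."""
    text: str
    label: str  # assumed to be predicted by a trained model


def assemble_document(cells):
    """Group labeled cells, in reading order, into a JSON-serializable dict."""
    doc = {"title": None, "sections": []}
    for cell in cells:
        if cell.label == "title" and doc["title"] is None:
            doc["title"] = cell.text
        elif cell.label == "subtitle":
            # Each subtitle opens a new section.
            doc["sections"].append({"heading": cell.text, "paragraphs": []})
        elif cell.label == "paragraph":
            # Paragraphs before any subtitle go into an untitled section.
            if not doc["sections"]:
                doc["sections"].append({"heading": None, "paragraphs": []})
            doc["sections"][-1]["paragraphs"].append(cell.text)
    return doc


# Hand-written example cells standing in for real model output.
cells = [
    Cell("Corpus Conversion Service", "title"),
    Cell("Making knowledge discoverable", "subtitle"),
    Cell("PDF files often include vector graphics, text, and bitmaps.", "paragraph"),
]
structured = assemble_document(cells)
print(json.dumps(structured, indent=2))
```

The point of the sketch is the division of labor: once a model has assigned a label to every cell, turning a page into structured data is a simple, deterministic step, so the effort goes into annotation and training rather than hand-written rules.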

One of a kind

To the best of our knowledge, the Corpus Conversion Service is the first comprehensive system to use advanced AI at this level of scalability. While existing solutions can only convert one document at a time to a desired output format, our tool can ingest an entire collection (a corpus of documents) and build machine-learned models on top of it.

Within IBM, the Corpus Conversion Service is serving more than 250 active users for knowledge-engineering project engagements. It’s also currently being tested by external partners in various industries, including materials science, engineering, chemicals, government, oil and gas, insurance, and consumer electronics. More specifically, our partners in materials science are using the service to ingest PDFs of patents, peer-reviewed articles, and internal documents to develop new alloys. An insurance partner is using the service to convert unstructured claims.

The Corpus Conversion Service will be available on the IBM Cloud by the end of this year and will also be available for on-premises clouds. If you can’t wait that long, contact us to become a beta tester.

Corpus Conversion Service: A machine learning platform to ingest documents at scale
Peter W J Staar, Michele Dolfi, Christoph Auer, Costas Bekas
DOI: 10.13140/RG.2.2.10858.82888
