This week, my team from IBM Research will debut a massively scalable document ingestion system, the Corpus Conversion Service, at the prestigious ACM Conference on Knowledge Discovery and Data Mining (KDD 2018) in London. Our AI-based cloud system can ingest 100,000 PDF pages per day (even of scanned documents) on a single blade server with accuracy above 97 percent—and then train and apply advanced machine learning models that extract the content from these documents at a scale never achieved before.
According to Adobe, there are roughly 2.5 trillion Portable Document Format (PDF) files currently in circulation. Think of the knowledge these files contain: scientific articles, technical literature, and much more. But all that content is “dark” or unused, because until now, we have had no way to ingest large numbers of PDF files at scale and make their content usable (that is, structured).
Making knowledge discoverable
PDF files often include combinations of vector graphics, text, and bitmap graphics, all of which make extraction of qualitative and quantitative data quite challenging. In fact, automatic content reconstruction has been a problem for over a decade. While many document conversion solutions are available, none of them addresses scalability or applies AI, which means that they rely on expensive human-based maintenance and upgrading.
Two steps further
My colleagues Peter Staar, Michele Dolfi, Christoph Auer, and I, with initial support from Roxana Istrate and Matthieu Mottet, decided to approach the problem of document conversion in a new way. In contrast to traditional rule-based algorithms, our tool uses generic machine learning algorithms, which produce models that can be trained easily and quickly on ground-truth labeled data acquired by human annotation. A key element is that we designed the human-computer interaction in the system to allow very fast, large-scale annotation without any computer science knowledge. This shift to machine learning gives our service a great deal of flexibility, as it can adapt rapidly to specific document templates, achieve highly accurate results, and ultimately eliminate the costly and time-consuming tuning typical of traditional rule-based algorithms.
We also implemented a processing pipeline to ingest, manage, parse, annotate, train, and eventually convert the data contained in any type of document (scanned, programmatically generated, bitmap images, Word documents, etc.) into structured data formats such as JSON or XML. Essentially, we have developed a fully customizable technology in which the AI model can be trained with minimal effort. In fact, we report in our paper that an average person can annotate 20 pages per minute. Once several dozen PDFs have been annotated, the machine learning takes over, and you can just sit back and watch in awe.
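To make the last step of that pipeline a little more concrete, here is a minimal sketch in Python of how labeled text cells from a parsed page could be assembled into JSON. This is not the Corpus Conversion Service code or its API: the `Cell` class, the label names, the bounding-box convention, and the output layout are all assumptions made up for illustration.

```python
import json
from dataclasses import dataclass


@dataclass
class Cell:
    """A piece of parsed page text plus a label (hypothetical structure)."""
    text: str
    bbox: tuple  # (x0, y0, x1, y1) in page coordinates; assumed convention
    label: str   # e.g. "title" or "paragraph"; assumed label set


def assemble_document(cells):
    """Turn labeled cells, given in reading order, into a nested dict."""
    doc = {"title": None, "body": []}
    for cell in cells:
        # The first cell labeled "title" becomes the document title;
        # every other cell is appended to the body with its label.
        if cell.label == "title" and doc["title"] is None:
            doc["title"] = cell.text
        else:
            doc["body"].append({"type": cell.label, "text": cell.text})
    return doc


# Invented example cells, standing in for the output of a trained model.
cells = [
    Cell("Corpus Conversion Service", (72, 700, 540, 730), "title"),
    Cell("Peter W J Staar et al.", (72, 670, 540, 690), "author"),
    Cell("We present a platform to ingest documents.", (72, 600, 540, 660), "paragraph"),
]

doc = assemble_document(cells)
print(json.dumps(doc, indent=2))
```

In this sketch the labels are written by hand; in a system like the one described above, they would come from models trained on the human annotations.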
One of a kind
To the best of our knowledge, the Corpus Conversion Service is the first comprehensive system to use advanced AI at this level of scalability. While existing solutions can only convert one document at a time to a desired output format, our tool can ingest an entire collection (a corpus) of documents and build machine-learned models on top of it.
Within IBM, the Corpus Conversion Service is serving more than 250 active users for knowledge-engineering project engagements. It’s also currently being tested by external partners in various industries, including materials science, engineering, chemicals, government, oil and gas, insurance, and consumer electronics. More specifically, our partners in materials science are making use of the service to ingest PDFs of patents, peer-reviewed articles, and internal documents to develop new alloys. An insurance partner is using the service to convert unstructured claims.
The Corpus Conversion Service will be available on the IBM Cloud by the end of this year and will also be offered for on-premises clouds. If you can’t wait that long, contact us to become a beta tester.
Corpus Conversion Service: A machine learning platform to ingest documents at scale. Peter W J Staar, Michele Dolfi, Christoph Auer, Costas Bekas. DOI: 10.13140/RG.2.2.10858.82888