IBM’s largest document layout dataset is helping understand COVID-19 literature

Author: Peter Zhong, Research Scientist, IBM Research, Australia

Co-Authors: Stefan Maetschke, AI Engineer, IBM Research, Australia and Ella Shafiei, Research Scientist, IBM Research, Australia

Documents in Portable Document Format (PDF) are everywhere, with over 2.5 trillion currently available, ranging from simple user manuals and insurance documents to complex scientific articles.

PDF documents represent one of the primary sources of information both online and offline, since their content is very flexible (text, images, tables, etc.) and they are supported across many different operating systems and devices. While humans can easily read and understand PDF documents, machines often struggle. Automatic understanding of PDFs is essential given the large number of available documents, yet it remains a bottleneck in many applications.

For example, the White House in the United States recently made available the COVID-19 Open Research Dataset, which includes links to 45,826 papers, primarily PDFs. Humans cannot quickly read and analyse such a large number of documents to find a treatment or vaccine for COVID-19. Before valuable information can be extracted from these papers automatically, they first have to be converted into a machine-readable format.

Another example is the large number of claims and customer enquiries that insurance companies handle. Currently, the interpretation of lengthy insurance policy documents (mostly in PDF format) relies solely on humans, which can be slow and prone to errors. Automating the understanding of insurance policy documents could help these companies deliver more prompt and consistent services to their customers at lower cost.

Automatic Understanding of PDFs

The first step for automatic document understanding is to extract the layout of elements such as text or images from a PDF. This can be achieved by applying predefined rules and heuristics. However, rules are domain-specific and require time-consuming, manual design or tuning by experts when documents from a different domain need to be processed. Furthermore, it is almost impossible to create a perfect rule set that covers every possible case, especially for complicated documents.
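
To make the limitation concrete, here is a minimal sketch of what a rule-based layout classifier might look like. The `Block` type, the thresholds, and the labels are all hypothetical and chosen only for illustration; they are not taken from any real system.

```python
from dataclasses import dataclass


@dataclass
class Block:
    text: str
    font_size: float  # in points
    y_top: float      # distance from the top of the page, as a fraction of page height


def classify(block: Block, body_font_size: float = 10.0) -> str:
    """Hand-written rules tuned for one hypothetical journal style."""
    # Rule 1: large text near the top of the page is treated as a title.
    if block.font_size >= 1.4 * body_font_size and block.y_top < 0.3:
        return "title"
    # Rule 2: lines starting with a bullet or "1." are treated as list items.
    if block.text.lstrip().startswith(("•", "-", "1.")):
        return "list"
    # Everything else falls through to body text.
    return "text"


blocks = [
    Block("Deep learning for layout analysis", 16.0, 0.08),
    Block("- first item", 10.0, 0.50),
    Block("PDF documents are everywhere.", 10.0, 0.60),
    # A heading from a different domain, set in body-size capitals.
    Block("POLICY SCHEDULE", 10.0, 0.05),
]
labels = [classify(b) for b in blocks]
print(labels)
```

The last block is a heading in an insurance-style document, but because it is set at body size the font-size rule misses it and it is labelled as plain text. Covering it would need a new rule, which is exactly the manual tuning described above.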

An alternative approach is to learn the layout of documents from examples instead of hand-crafting complex rules. To this end, machine learning and specifically deep learning methods have been successfully applied to learn the patterns from example data and generalise them to new and unseen samples. Moreover, transfer learning techniques can help with generalising the specific trained model to different but related tasks and domains.

However, these methods require tremendous amounts of example data for training, which needs to be annotated. For example, text and image blocks in a document need to be marked individually, which again is very labour-intensive work if performed by humans.

To solve this problem, Team AITU (AI That Understands) and IBM Research Australia have developed and published PubLayNet, the largest dataset ever for document analysis. PubLayNet is 100 times larger than other existing document layout datasets. It contains over 360,000 document images where the positions of titles, paragraphs, tables, figures, and lists are accurately annotated.
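
As a rough illustration of what such annotations look like in practice, the sketch below assumes a COCO-style JSON structure, with `images`, `annotations`, and `categories` lists and bounding boxes given as `[x, y, width, height]`. The sample record, file name, and category names are made up for this example; check the released files for the exact schema.

```python
from collections import Counter

# A tiny, made-up sample in COCO-style layout (an assumption, see above).
sample = {
    "categories": [
        {"id": 1, "name": "text"},
        {"id": 2, "name": "title"},
        {"id": 3, "name": "list"},
        {"id": 4, "name": "table"},
        {"id": 5, "name": "figure"},
    ],
    "images": [
        {"id": 1, "file_name": "page_0001.jpg", "width": 612, "height": 792},
    ],
    "annotations": [
        {"id": 1, "image_id": 1, "category_id": 2, "bbox": [50, 40, 500, 30]},
        {"id": 2, "image_id": 1, "category_id": 1, "bbox": [50, 90, 500, 200]},
        {"id": 3, "image_id": 1, "category_id": 1, "bbox": [50, 300, 500, 150]},
        {"id": 4, "image_id": 1, "category_id": 5, "bbox": [50, 470, 500, 250]},
    ],
}


def count_by_category(coco: dict) -> Counter:
    """Count annotated regions per layout category name."""
    names = {c["id"]: c["name"] for c in coco["categories"]}
    return Counter(names[a["category_id"]] for a in coco["annotations"])


counts = count_by_category(sample)
print(counts)
```

The same function should work on a full annotation file loaded with `json.load`, provided the file follows the structure assumed here.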

It would take a tremendous amount of human resources to manually annotate the layout of such a large dataset. Therefore, PubLayNet takes advantage of PubMed Central, which provides over 1 million scientific articles in both PDF and Extensible Markup Language (XML). XML is a structured format, so the layout of the PDF documents can be annotated automatically if the regions corresponding to the XML elements can be identified.

IBM Research has created a novel algorithm

Our team created a novel algorithm that enables a mapping from XML elements associated with typical layout components to corresponding regions in the PDF documents. Additional post-processing was conducted to ensure the quality of the annotations. Potentially, the same algorithm can be used to annotate other sources of documents available in both unstructured and structured formats, such as arXiv articles (PDF + LaTeX), open access articles (PDF + HTML), and patents (PDF + HTML).
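
The general idea of matching structured elements to page regions can be sketched as follows. This is not the published algorithm: it is a simplified stand-in that assumes the PDF text lines and their bounding boxes have already been extracted, assigns each line to the XML element that contains most of its text, and takes the union of the matched boxes. The function names, the `min_ratio` threshold, and the sample data are all hypothetical.

```python
import re
from difflib import SequenceMatcher


def normalise(s: str) -> str:
    """Lower-case and collapse whitespace so PDF and XML text compare cleanly."""
    return re.sub(r"\s+", " ", s).strip().lower()


def union(boxes):
    """Smallest (x0, y0, x1, y1) box that encloses all given boxes."""
    xs0, ys0, xs1, ys1 = zip(*boxes)
    return (min(xs0), min(ys0), max(xs1), max(ys1))


def annotate(xml_elements, pdf_lines, min_ratio=0.8):
    """xml_elements: list of (label, text); pdf_lines: list of (text, bbox)."""
    matched = {i: [] for i in range(len(xml_elements))}
    for line_text, bbox in pdf_lines:
        line = normalise(line_text)
        best_i, best_score = None, 0.0
        for i, (_, elem_text) in enumerate(xml_elements):
            elem = normalise(elem_text)
            m = SequenceMatcher(None, line, elem, autojunk=False)
            block = m.find_longest_match(0, len(line), 0, len(elem))
            # Fraction of the PDF line found contiguously inside the element.
            score = block.size / max(len(line), 1)
            if score > best_score:
                best_i, best_score = i, score
        # Lines with no good match (page numbers, headers) are left out.
        if best_i is not None and best_score >= min_ratio:
            matched[best_i].append(bbox)
    return [(xml_elements[i][0], union(b)) for i, b in matched.items() if b]


xml_elements = [
    ("title", "A Study of Layout"),
    ("text", "PDF documents are everywhere. Machines often struggle to read them."),
]
pdf_lines = [
    ("A Study of Layout", (50, 40, 400, 60)),
    ("PDF documents are everywhere.", (50, 100, 300, 112)),
    ("Machines often struggle to read them.", (50, 114, 330, 126)),
    ("Page 1", (280, 760, 320, 770)),
]
result = annotate(xml_elements, pdf_lines)
print(result)
```

A real pipeline has to cope with hyphenation, multi-column pages, repeated text, and figures or tables that carry no matching text, which is presumably part of why additional post-processing is needed to ensure annotation quality.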

Our paper describing PubLayNet won the Best Paper Award at the flagship document analysis conference ICDAR in 2019. Our experiments demonstrate that PubLayNet can be used not only to recognise the layout of scientific articles but also to improve models for documents with a substantially different style, such as government documents and insurance policies.

PubLayNet was also used in 2020 by ICLR, the premier artificial intelligence and representation learning conference, to extract pictures from its papers, which were then used to promote the conference on social media. The general chair of ICLR found that our pre-trained model significantly outperformed other PDF extraction tools, stating that it saved him “infinity time”.

“Infinity time, I would have given up. I tried every other direct PDF extraction method and they all had intractable issues, e.g. couldn’t extract PDF images or were too low res or were just bad. I was about to give up until I found this tool, and in 20 lines of code, and 2 hours on Google Colab, I had every image from 700 papers at high res with precise enough accuracy,” said Alexander Rush, general chair, ICLR.

PubLayNet was recently applied to ingest the COVID-19 Open Research Dataset, which IBM Research has made available for free to any scientist or researcher developing a treatment or vaccine for the virus.

Figure 1. PubLayNet helped to train the IBM Corpus Conversion Service to convert the latest PDF papers into JSON documents. These can be used as a knowledge graph to explore the latest published research.


PubLayNet was released to the public under a permissive commercial-use license in August 2019 and had attracted over 230 stars on GitHub as of 1 May 2020. A third-party derivative of PubLayNet has been developed for Optical Character Recognition (OCR). We believe PubLayNet will continue to benefit both the research community and commercial offerings, providing a better user experience and impactful social benefits.

