Documents in Portable Document Format (PDF) are ubiquitous, with over 2.5 trillion available, from insurance documents to medical files to peer-reviewed scientific articles. They represent one of the main sources of knowledge both online and offline. For example, The White House recently made available the COVID-19 Open Research Dataset, which included links to 45,826 papers, most of them PDF documents.
But while PDF is great for preserving the basic elements (characters, lines, shapes, images, etc.) on a canvas for different operating systems or devices for humans to consume, it’s not a format which machines can understand.
Unlocking Knowledge in PDFs
Automating industrial processes is critical to any organization, since it can reduce labor costs and improve the efficiency and consistency of services. Many of these processes run on information embedded in PDF documents, and if machines cannot read that data, it becomes a bottleneck for automation. In fact, this is one of the main challenges enterprises face when trying to leverage natural language processing (NLP) technology to extract insights from documents.
PubLayNet helped to train the IBM Corpus Conversion Service to convert the latest PDF papers (e.g. from bioRxiv) into JSON documents. These can be ingested into a knowledge graph as unstructured data, allowing users to explore the latest published research.
For example, insurance companies face a large number of claims and customer enquiries every day, which are addressed according to lengthy insurance policy documents. Currently, the interpretation of insurance policy documents (mostly in PDF format) relies on humans, which can be slow and prone to errors. Automating the understanding of insurance policy documents could help insurance companies deliver more prompt and consistent services to their customers at lower cost.
Or, as mentioned earlier, a researcher trying to find a treatment or vaccine for COVID-19 cannot realistically read more than 45,000 peer-reviewed papers. While it is possible to index just the text of a document, the structure of the PDF document and the information in its non-textual portions may contain the most telling information, and this is where NLP can help.
To extract the knowledge locked in PDF documents, they first have to be converted into a machine-readable format. The most widely adopted technique is to parse PDF documents using pre-defined rules and heuristics. However, curating the rules and heuristics is labor-intensive and depends on human experience. Sometimes it requires domain expertise to create high-quality rules. In addition, the rules are only effective for a narrow set of documents with a similar style, and they have to be manually modified when the system needs to process documents with a substantially different style. Over time, the rules become extremely complex in order to handle documents with complicated layouts, making them difficult for new team members to maintain and reuse.
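To make the brittleness of hand-written rules concrete, here is a minimal sketch of a font-size heuristic. The TextBlock fields, the classify_block function, and the thresholds are all hypothetical, chosen only for illustration; they are not taken from any specific PDF parser.

```python
from dataclasses import dataclass


@dataclass
class TextBlock:
    """A text block as a PDF parser might report it (hypothetical fields)."""
    text: str
    font_size: float  # in points
    is_bold: bool


def classify_block(block: TextBlock, title_min_size: float = 14.0) -> str:
    """Label a block with hand-tuned rules; the thresholds are illustrative assumptions."""
    if block.font_size >= title_min_size and block.is_bold:
        return "title"
    if block.text.lstrip().startswith(("•", "-", "*")):
        return "list"
    return "paragraph"


# A style the rule was tuned for: large, bold headings.
journal_heading = TextBlock("Methods", font_size=16.0, is_bold=True)
# A different style: small, non-bold, upper-case headings.
policy_heading = TextBlock("EXCLUSIONS", font_size=10.0, is_bold=False)

print(classify_block(journal_heading))  # title
print(classify_block(policy_heading))   # paragraph (a heading, misclassified)
```

The second heading is mislabeled because the rule encodes one house style. Supporting the new style means adding another special case, which is how such rule sets grow hard to maintain.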
More recently, deep neural networks developed for computer vision have proven effective for analyzing the layout of document images. Deep neural networks are capable of learning complex patterns from training data and generalizing them to unseen samples. In addition, a network trained for a particular task or domain can help with learning a different but related task or domain through a technique called transfer learning. These benefits come at the cost of a hunger for data, and there are no document layout datasets large enough to train deep neural networks from scratch. At the moment, a compromise is to take models trained on a large number of natural scene images and adapt them to documents. However, the knowledge that can be transferred from natural scenes to documents is quite limited.
To fill this gap, IBM researchers in Australia developed and published PubLayNet, the largest dataset ever for document layout analysis. PubLayNet is at least two orders of magnitude larger than existing document layout datasets, containing over 360,000 document images in which the positions of titles, paragraphs, tables, figures, and lists are accurately annotated. Manually annotating the layout of such a large number of document images would take a tremendous amount of human effort.
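As a rough picture of what a layout annotation could look like, the sketch below assumes a COCO-style record in which each region has a category and a bounding box. The sample values, the file name, and the count_regions helper are invented for illustration, and the category names follow this article's wording, so they may differ from the released files.

```python
from collections import Counter

# Invented sample in an assumed COCO-style layout; all values are illustrative.
sample = {
    "images": [{"id": 1, "file_name": "page_0001.jpg", "width": 612, "height": 792}],
    "categories": [
        {"id": 1, "name": "title"},
        {"id": 2, "name": "paragraph"},
        {"id": 3, "name": "table"},
        {"id": 4, "name": "figure"},
        {"id": 5, "name": "list"},
    ],
    "annotations": [
        # bbox is [x, y, width, height] in pixels (assumed convention)
        {"image_id": 1, "category_id": 1, "bbox": [72, 60, 468, 24]},
        {"image_id": 1, "category_id": 2, "bbox": [72, 100, 468, 120]},
        {"image_id": 1, "category_id": 2, "bbox": [72, 230, 468, 90]},
        {"image_id": 1, "category_id": 4, "bbox": [72, 340, 468, 200]},
    ],
}


def count_regions(dataset: dict) -> Counter:
    """Count annotated regions per layout category name."""
    names = {c["id"]: c["name"] for c in dataset["categories"]}
    return Counter(names[a["category_id"]] for a in dataset["annotations"])


counts = count_regions(sample)
print(counts["paragraph"], counts["title"], counts["figure"])
```

A tally like this is a quick sanity check on class balance before training a layout model.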
PubLayNet helped train the document layout recognition module of IBM Corpus Conversion Service, which can convert PDF documents into discoverable JSON documents. IBM Corpus Conversion Service, which also powers IBM Watson’s Smart Document Understanding Service, was applied to ingest the COVID-19 Open Research Dataset, which can now be queried for free by any scientist or researcher developing a treatment or vaccine for the virus.
In addition, PubLayNet was used by the general chair of the International Conference on Learning Representations (ICLR) to extract all the images from over 700 accepted papers for promoting the conference on social media. He found that our pre-trained model significantly outperformed other PDF extraction tools, stating that it saved him “infinity time”.
“Infinity time, I would have given up. I tried every other direct PDF extraction method and they all had intractable issues, e.g. couldn’t extract PDF images or were too low res or were just bad. I was about to give up until I found this tool, and in 20 lines of code, and 2 hours on Google Colab, I had every image from 700 papers at high res with precise enough accuracy,” said Alexander Rush, general chair, ICLR.
PubLayNet is built as a derivative work of PubMed Central, which provides over 1 million scientific articles in both PDF and structured XML formats. Our team created PubLayNet using a novel algorithm that links the XML nodes associated with typical layout components to regions in the PDF documents. The layout of the PDF documents is then automatically annotated using these linkages. Potentially, the same algorithm can be used to annotate other sources of documents available in both unstructured and structured formats, such as arXiv articles (PDF + LaTeX), open access articles (PDF + HTML), and patents (PDF + HTML).
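The general idea of linking structured nodes to page regions can be illustrated with a toy sketch. This is not the PubLayNet algorithm, whose details are not given here; it simply matches text by fuzzy string similarity using Python's difflib. The function names, the input shapes, and the 0.8 threshold are assumptions made for the example.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lower-case and collapse whitespace so layout differences do not affect matching."""
    return " ".join(text.lower().split())


def link_nodes_to_regions(xml_nodes, pdf_regions, min_score=0.8):
    """Label each PDF region with the layout type of its best-matching XML node.

    xml_nodes:   list of (label, text) pairs taken from the structured source
    pdf_regions: list of (bbox, text) pairs taken from the PDF page
    Regions whose best match scores below min_score are left unannotated.
    """
    annotations = []
    for bbox, region_text in pdf_regions:
        best_label, best_score = None, 0.0
        for label, node_text in xml_nodes:
            score = SequenceMatcher(
                None, normalize(node_text), normalize(region_text)
            ).ratio()
            if score > best_score:
                best_label, best_score = label, score
        if best_label is not None and best_score >= min_score:
            annotations.append((best_label, bbox))
    return annotations


xml_nodes = [
    ("title", "Deep Learning for Layout Analysis"),
    ("paragraph", "Documents in PDF are hard for machines to read."),
]
pdf_regions = [
    ((72, 60, 540, 84), "Deep Learning for Layout  Analysis"),
    ((72, 100, 540, 140), "Documents in PDF are hard for machines to read."),
    ((72, 760, 540, 775), "Page 3 of 12"),  # footer with no XML counterpart
]

links = link_nodes_to_regions(xml_nodes, pdf_regions)
print(links)
```

Leaving low-scoring regions unannotated, rather than forcing a label, trades coverage for annotation accuracy, which matters when the labels are used as training data.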
Our paper describing PubLayNet won the best paper award at the flagship document analysis conference ICDAR in 2019. Our experiments demonstrate that PubLayNet can be used not only to recognize the layout of scientific articles, but also to improve models for documents with a substantially different style, such as government documents and insurance policies.
PubLayNet was released to the public under a permissive commercial-use license in August 2019 and had attracted over 230 stars on GitHub as of April 29, 2020.
A third-party derivative of PubLayNet has been developed for Optical Character Recognition (OCR). We believe PubLayNet will continue to benefit both the research community and commercial offerings, providing a better user experience and impactful social benefits.