Bringing IBM NLP capabilities to the CORD-19 Dataset

To assist in the fight against the COVID-19 pandemic, prominent research institutes led by the Allen Institute for AI (AI2) released the COVID-19 Open Research Dataset (CORD-19) earlier this year. Comprising scientific articles on COVID-19, SARS-CoV-2, and related coronaviruses, the dataset (which at the time of writing contains more than 75,000 full-text scientific papers) is intended to mobilize researchers to apply recent advances in natural language processing to generate new insights in support of the fight against this infectious disease (1,2).

Although it is a tremendous resource, the dataset initially did not include the information found in tables, because tabular data is difficult to extract. After the launch of the Kaggle challenge associated with CORD-19, however, table information became the feature most requested by challenge participants.

Addressing the need

Recognizing that critical scientific facts and data are often organized in a tabular format, IBM Research AI offered to apply our extensive experience in document and table conversion to update the CORD-19 dataset and, in turn, open up additional critical information to the global science and medical community in efforts to fight COVID-19.

A full overview of these efforts and the CORD-19 dataset can be found in this new paper, which was presented at the COVID-19 Workshop of the 58th Annual Meeting of the Association for Computational Linguistics (ACL 2020).


Figure 1. Example table from article (3) included in CORD-19 and its extracted version (the table has been abbreviated for presentation purposes).


Challenges in table extraction and CORD-19

Table identification presents many challenges. Because the native PDF format does not include table structure information, extraction algorithms must work from the limited information available in the PDF (a stream of characters). These structure extraction algorithms must then generalize, leveraging different cues to recover structure from the wide variety of visual layouts found across heterogeneous document collections (see Figure 2). Finally, to fully interpret a table, the information in its cells has to be linked to information that appears in other parts of the document, as tabular information is seldom self-sufficient. For instance, Figure 1 shows the results of serology tests measuring anti-SARS-CoV-2 antibodies for different subpopulations of the participants of a study (3); these results are not directly included in the text of the article.



Figure 2. Tables have a variety of styles and can be challenging to separate from “table-like” structures for extraction. For example, lists (A.1), charts / graphs (A.2), formatted equations (A.3), checklists (A.4) and genetic sequences (A.5) all share similar visual cues to tables such as columns of aligned text and graphical lines, but are not tables.


Tackling the problem with IBM NLP

Deep content understanding is one of the focus areas of IBM’s enterprise NLP strategy. IBM Research – Almaden has been developing solutions to extract content from tables and has applied this technology in several domains, including legal, financial, and scientific documents. The technology has been integrated into the Smart Document Understanding feature of IBM Watson Discovery. Given this experience, we were motivated to partner with AI2, apply this technology to the CORD-19 dataset, and allow researchers to leverage this important content in the fight against COVID-19.

In our table processing pipeline, three high-level tasks must be performed to acquire useful information from tables:

  • Document parsing: extracting the text from the PDF document
  • Table extraction: identifying the borders of tables and extracting their cell structure
  • Table understanding: providing context by linking each cell to semantic information pertaining to it appearing within or outside the table (e.g., row and column headers, attributes, captions, and references in surrounding text).

Figure 3. Architecture of Table Processing for CORD-19


Figure 3 summarizes the architecture of our system. First, we ingest PDF files from the raw CORD-19 document dataset and pass them through our table processing pipeline to extract table data (described in more detail below). In parallel, the same documents are processed through the AI2 full-text processing pipeline to extract the text. The extracted tables are then matched to their correct positions in the text, and the combined text and tables are added to the CORD-19 output in JSON format.
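
As a rough illustration of the final merge step, the sketch below attaches tables to the paragraphs they were matched to and serializes the result as JSON. It is not the actual IBM or AI2 code, and it does not reproduce the real CORD-19 schema: the function name and every field name (paper_id, paragraphs, paragraph_index, unmatched_tables) are assumptions made for this example, and the input is placeholder data.

```python
import json


def merge_tables_into_document(parsed_text, tables):
    """Attach each table to the paragraph it was matched to.

    parsed_text: dict with "paper_id" and a "paragraphs" list of strings
                 (hypothetical layout, not the CORD-19 schema).
    tables: list of dicts with "caption", "rows" and "paragraph_index"
            (None when no position in the text could be matched).
    """
    combined = {
        "paper_id": parsed_text["paper_id"],
        "body": [{"text": p, "tables": []} for p in parsed_text["paragraphs"]],
        "unmatched_tables": [],
    }
    for table in tables:
        entry = {"caption": table["caption"], "rows": table["rows"]}
        index = table.get("paragraph_index")
        if index is not None and 0 <= index < len(combined["body"]):
            combined["body"][index]["tables"].append(entry)
        else:
            # Keep tables whose position is unknown instead of dropping them.
            combined["unmatched_tables"].append(entry)
    return combined


# Placeholder input, for illustration only.
example_text = {
    "paper_id": "example-0001",
    "paragraphs": ["First paragraph.", "Second paragraph."],
}
example_tables = [
    {"caption": "Table 1: placeholder", "rows": [["a", "b"], ["1", "2"]],
     "paragraph_index": 1},
    {"caption": "Table 2: placeholder", "rows": [["c"], ["3"]],
     "paragraph_index": None},
]

combined = merge_tables_into_document(example_text, example_tables)
print(json.dumps(combined, indent=2))
```

Keeping unmatched tables in a separate list is one possible design choice: the extracted content stays available even when its position in the text is unknown.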


Figure 4. Architecture of table extraction


For table extraction, we developed the Global Table Extractor (GTE) (4), a vision-based framework that extracts both the bounding box and the cell structure of each table (see Figure 4). This deep learning framework has been shown to surpass previous state-of-the-art results on the ICDAR 2013 table competition test dataset in both table detection and cell structure recognition, with a significant 6.8 percent improvement for the full table extraction system.


Figure 5. Example of some of the contextual information identified by table understanding at the (a) table and (b) cell level (table extracted from (3)).


Once tables are extracted, table understanding identifies and annotates both tables and cells with additional semantic information that is essential for understanding their contents (see Figure 5), such as table captions, sentences in the text that refer to a table, and the headers of the rows and columns in which a cell appears. This information is extracted using predictive models that account for the multitude of ways in which contextual information is encoded within input documents, including complex structures such as hierarchical row and column headers (see Figure 5b for an example of row and column header hierarchies).
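
To make the idea of table-level and cell-level context more concrete, here is a minimal sketch of how such annotations could be represented. The class names, field names, and the describe helper are hypothetical and chosen for this example only; they do not reflect the format actually used in CORD-19, and the values are placeholders rather than data from any paper.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Cell:
    text: str
    # Header paths are ordered from the outermost header to the innermost one.
    row_headers: List[str] = field(default_factory=list)
    column_headers: List[str] = field(default_factory=list)


@dataclass
class Table:
    caption: str
    # Sentences in the body text that refer to this table.
    referring_sentences: List[str] = field(default_factory=list)
    cells: List[Cell] = field(default_factory=list)


def describe(cell: Cell) -> str:
    """Render a cell together with its row and column header hierarchies."""
    row_path = " > ".join(cell.row_headers)
    column_path = " > ".join(cell.column_headers)
    return f"[{row_path}] x [{column_path}] = {cell.text}"


# Placeholder values, for illustration only.
example_cell = Cell(
    text="12",
    row_headers=["Group A", "Subgroup 1"],
    column_headers=["Measurement", "Count"],
)
example_table = Table(
    caption="Table 1: placeholder caption",
    referring_sentences=["Results are summarized in Table 1."],
    cells=[example_cell],
)

print(describe(example_cell))
# prints: [Group A > Subgroup 1] x [Measurement > Count] = 12
```

Storing each header hierarchy as an ordered path lets a single cell be read on its own, without having to reconstruct the layout of the table around it.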

Results of IBM NLP in CORD-19

The tabular data extracted with our system was introduced in the May 12, 2020 update of CORD-19 and has been included since then in daily updates of the dataset.

Using the IBM table processing capability, we have successfully extracted more than 188,000 tables from 19,000 documents in the initial dataset of 54,000 PDF documents. As new and relevant scientific articles continue to be published, we support CORD-19’s daily dataset updates with an effective and efficient automated table extraction and understanding pipeline that processes hundreds of documents and tables each day.

Future work with CORD-19

Because document parsing was performed with different technologies on the AI2 and IBM sides, we also had to match each extracted table to the place where it should appear in the AI2 parse. Matching is currently based on the similarity between table captions. Unfortunately, only a fifth of the extracted tables were successfully matched between the two document parses. A preliminary error analysis indicates that match failures are primarily due to table caption mismatches between the two parse schemes, not to table extraction errors. As a result, we plan to explore alternative matching functions, potentially leveraging table contents as well.
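
For readers who want a feel for caption-based matching, the following is a minimal sketch under stated assumptions. The post does not say which similarity function or threshold is used, so this example substitutes the character-level ratio from Python's standard difflib module and an arbitrary threshold of 0.8; the function names and the greedy one-to-one strategy are likewise assumptions made for illustration.

```python
from difflib import SequenceMatcher


def normalize(caption):
    """Lowercase a caption and collapse runs of whitespace."""
    return " ".join(caption.lower().split())


def caption_similarity(a, b):
    """Similarity in [0, 1] between two captions (difflib ratio, an assumption)."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def match_tables(extracted_captions, parse_captions, threshold=0.8):
    """Greedily map each extracted caption to its most similar unused caption.

    Returns a dict {index in extracted_captions: index in parse_captions}.
    Captions with no candidate at or above the threshold stay unmatched.
    """
    matches = {}
    used = set()
    for i, caption in enumerate(extracted_captions):
        best_j, best_score = None, threshold
        for j, candidate in enumerate(parse_captions):
            if j in used:
                continue
            score = caption_similarity(caption, candidate)
            if score >= best_score:
                best_j, best_score = j, score
        if best_j is not None:
            matches[i] = best_j
            used.add(best_j)
    return matches


# Placeholder captions, for illustration only.
extracted = ["Table 1: Baseline characteristics", "Table 2. Outcomes by group"]
other_parse = [
    "table 2. outcomes by group",
    "Table 1:  Baseline characteristics",
    "Figure 3: unrelated",
]
print(match_tables(extracted, other_parse))
```

A content-based matching function could be tried by swapping out caption_similarity, for example by comparing cell text instead of captions.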

Douglas Burdick, IBM Research Staff Member
Yannis Katsis, IBM Research Staff Member
Yunyao Li, Principal IBM Research Staff Member
Nancy Wang, IBM Research Staff Member

We thank Rok Jun Lee, Hrishikesh Sathe, Dhaval Sonawane and Sudarshan Thitte from IBM Watson AI for their help in table extraction and parsing. 


2. Wang, L. L., Lo, K., Chandrasekhar, Y., Reas, R., Yang, J., Eide, D., … & Mooney, P. (2020). CORD-19: The Covid-19 Open Research Dataset. ArXiv.
3. Stringhini, S., Wisniak, A., Piumatti, G., Azman, A. S., Lauer, S. A., Baysson, H., … & Guessous, I. (2020). Seroprevalence of anti-SARS-CoV-2 IgG antibodies in Geneva, Switzerland (SEROCoV-POP): a population-based study. The Lancet.
4. Zheng, X., Burdick, D., Popa, L., & Wang, N. X. R. (2020). Global Table Extractor (GTE): A Framework for Joint Table Identification and Cell Structure Recognition Using Visual Context. arXiv preprint arXiv:2005.00589.

Figure 2 citations:
A.1: List
Johnson, C. D., Green, B. N., Konarski-Hart, K. K., Hewitt, E. G., Napuli, J. G., Foshee, W. K., … & Charlton, S. T. (2020). Response of Practicing Chiropractors during the Early Phase of the COVID-19 Pandemic: A Descriptive Report. Journal of Manipulative and Physiological Therapeutics.

A.2:  Graph
Neidleman, J., Luo, X., Frouard, J., Xie, G., Gurjot, G., Stein, E. S., … & Greene, W. C. (2020). SARS-CoV-2-specific T cells exhibit unique features characterized by robust helper function, lack of terminal differentiation, and high proliferative potential. bioRxiv.

A.3: Formatted equations
Ranjan, R. (2020). COVID-19 Spread in India: Dynamics, Modeling, and Future Projections. medRxiv.

A.4: Survey questions
Nature Report Survey

A.5: Genetic sequence
Darbani, B. (2020). The Expression and Polymorphism of Entry Machinery for COVID-19 in Human: Juxtaposing Population Groups, Gender, and Different Tissues. International Journal of Environmental Research and Public Health, 17(10), 3433.

B.1: Tables spanning across 2 pages
Barbosa, J. R. (2020). Occurrence and Possible Roles of Polysaccharides in Fungi and their Influence on the Development of New Technologies. Carbohydrate Polymers, 116613.

B.2:  Tables with nested graphical elements
Fathi, M., Vakili, K., Sayehmiri, F., Mohamadkhani, A., Hajiesmaeili, M., Rezaei-Tavirani, M., & Eilami, O. (2020). Prognostic Value of Comorbidity for Severity of COVID-19: A Systematic Review and Meta-Analysis Study. medRxiv.

B.3:  Tables with no graphical ruling lines
Slotman, B. J., Lievens, Y., Poortmans, P., Cremades, V., Eichler, T., Wakefield, D. V., & Ricardi, U. (2020). Effect of COVID-19 pandemic on practice in European radiation oncology centers. Radiotherapy and Oncology.

C.1: Minimal row graphical ruling lines
Torsten Hothorn, Marie-Charlotte Bopp, H. F. Guenthard, Olivia Keiser, Michel Roelens, Caroline E Weibull, and Michael J Crowther. 2020. Relative coronavirus disease 2019 mortality: A Swiss population-based study. medRxiv.

C.2: Colored rows with minimal graphical ruling lines
Luis Lopez-Fando, Paulina Bueno, David Sanchez Carracedo, Marcio Augusto Averbeck, David Manuel Castro-Díaz, Emmanuel Chartier-Kastler, Francisco Cruz, Roger R Dmochowski, Enrico Finazzi-Agro, Sakineh Hajebrahimi, John Heesakkers, George R Kasyan, Tufan Tarcan, Benoît Peyronnet, Mauricio Plata, Barbara Padilla-Fernandez, Frank Van der Aa, Salvador Arlandis, and Hashim Hashim. 2020. Management of female and functional urology patients during the COVID pandemic. European Urology Focus.

C.3: Colored background for column headers with minimal row graphical ruling lines
Silvia Stringhini, Ania Wisniak, Giovanni Piumatti, Andrew S. Azman, Stephen A Lauer, Hélène Baysson, David De Ridder, Dusan Petrovic, Stephanie Schrempft, Kailing Marcus, Sabine Yerly, Isabelle Arm Vernez, Olivia Keiser, Samia Hurst, Klara M Posfay-Barbe, Didier Trono, Didier Pittet, Laurent Gétaz, François Chappuis, Isabella Eckerle, Nicolas Vuilleumier, Benjamin Meyer, Antoine Flahault, Laurent Kaiser, and Idris Guessous. 2020. Seroprevalence of anti-SARS-CoV-2 IgG antibodies in Geneva, Switzerland (SEROCoV-POP): a population-based study. Lancet (London, England).

C.4: Row graphical ruling lines delineating sections
M. Fathi, Khatoon Vakili, Fatemeh Sayehmiri, Abdolrahman Mohamadkhani, M. Hajiesmaeili, Mostafa Rezaei-Tavirani, and Owrang Eilami. 2020. Prognostic value of comorbidity for severity of COVID-19: A systematic review and meta-analysis study. medRxiv.

C.5: Full row and column graphical ruling lines with large intra-cell vertical spacing.
Kaushik, S., Aydin, S. I., Derespina, K. R., Bansal, P. B., Kowalsky, S., Trachtman, R., … & Bercow, A. (2020). Multisystem Inflammatory Syndrome in Children (MIS-C) Associated with SARS-CoV-2 Infection: A Multi-institutional Study from New York City. The Journal of Pediatrics.
