Making Sense of Science with Discovery Augmented Summarization


As a research scientist, staying up to date on current work is like swimming against the current. With so many fascinating papers being published every day, at a rate that just keeps increasing, it's tough to pick out the ones that may be important to my specific area of research or influence what I'm working on. Together with my colleagues at IBM Research - Haifa, I set out to build a tool that would quickly retrieve and summarize relevant and recent scientific papers that meet specific information needs.

Until now, most summarization work has focused on relatively short documents, such as those in the news domain. Scientific papers, however, require deeper analysis. In contrast to short summaries of news briefs, we wanted to help research scientists by extracting the key information from a scientific paper: what it is about, how the work was done, and the main findings and insights. In other words, we set out to produce a more complete type of summary, one that analyzes, summarizes, and combines multiple sections of a paper. This also lets the user easily navigate between summaries of the different sections, including figures, tables, and so forth.

The IBM Science Summarizer, or DimSum, as it was nicknamed by the team for its DIscovery augMented SUMmarization, is a service that tracks scientific papers being published in the area of AI. The service produces summaries of the papers focused on an information need expressed through natural-language queries.

One challenge in building the system comes from the fact that there is no ground truth or training data for summarization.  Generally, in order to use machine learning techniques, we need lots of examples for training.  In the case of summaries, it would be very expensive to have a human prepare enough training examples. In short, we had to create a labeled dataset. This is where the really creative thinking came in.

How to teach computers to summarize articles

In addition to DimSum, the team developed TalkSumm, short for Talk-based Summary, a dataset and scalable annotation method for summarizing scientific papers based on conference talks, which is being presented this week at ACL 2019 along with our paper, “TalkSumm: A Dataset and Scalable Annotation Method for Scientific Paper Summarization Based on Conference Talks.” Conceived by Michal Shmueli-Scheuer and Guy Lev, TalkSumm began with the assumption that example summaries already exist for many scientific papers, because many scientific conferences record authors explaining the main points of their paper.

So, starting with the video of the talk, the team extracted a transcript of what was said using the IBM Watson speech-to-text service. Then, using a probabilistic method, a hidden Markov model, the words uttered in the transcript of the talk were mapped to the most likely sentences of the scientific paper being presented. In this way, based on the transcript of what was said in the lecture, we were able to 'learn' the importance of sentences in the papers, forming extractive summaries. This gave us a dataset of some 2,000 examples for summarizing scientific papers. But this number is effectively unlimited: the code is now open source, and anyone can use it to extend the dataset to cover more and more papers. In addition, this dataset can serve as ground truth for scientific paper summarization.
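The intuition behind the alignment can be sketched with a much simpler stand-in: score each paper sentence by how often its words occur in the talk transcript, on the assumption that sentences the speaker dwells on are important. This is a crude approximation of TalkSumm's actual HMM alignment (all function names and the greedy word-overlap scoring here are our own illustration, not the released code):

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase and split text into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def score_sentences(paper_sentences, transcript):
    """Score each paper sentence by how many of its distinct words
    appear in the talk transcript (a crude stand-in for the HMM,
    which maps each uttered word to its most likely source sentence)."""
    counts = Counter(tokenize(transcript))
    return [sum(counts[w] for w in set(tokenize(s)))
            for s in paper_sentences]

def extractive_summary(paper_sentences, transcript, k=2):
    """Return the k highest-scoring sentences, in document order."""
    scores = score_sentences(paper_sentences, transcript)
    top = sorted(range(len(scores)),
                 key=lambda i: scores[i], reverse=True)[:k]
    return [paper_sentences[i] for i in sorted(top)]
```

A real alignment must also handle paraphrasing and the sequential structure of a talk, which is exactly what the hidden Markov model contributes; the sketch above only captures the "sentences the speaker talks about most score highest" idea.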

To the best of our knowledge, we created the first search engine to offer summaries of scientific papers. Our system has ingested 187,160 papers from arXiv (the "Computer Science" subset) and the ACL Anthology.

The ingestion pipeline starts with paper acquisition. It then extracts the paper's text, tables, and figures, and enriches the paper's data with various annotations and entities. In our system, summarization is applied to each section using a state-of-the-art unsupervised, extractive, query-focused summarization algorithm inspired by Feigenblat et al. [1]
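The core of query-focused extractive summarization can be illustrated with a minimal sketch: rank a section's sentences by their similarity to the user's query and keep the top few, in document order. This is a deliberately simplified bag-of-words cosine ranking, not the cross-entropy method of Feigenblat et al. that the system actually uses; all names here are hypothetical:

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase and split text into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    num = sum(a[w] * b[w] for w in set(a) & set(b))
    denom = (math.sqrt(sum(v * v for v in a.values()))
             * math.sqrt(sum(v * v for v in b.values())))
    return num / denom if denom else 0.0

def query_focused_summary(section_sentences, query, k=2):
    """Pick the k sentences most similar to the query,
    returned in their original document order."""
    q = Counter(tokenize(query))
    scored = [(cosine(Counter(tokenize(s)), q), i)
              for i, s in enumerate(section_sentences)]
    top = sorted(scored, reverse=True)[:k]
    idx = sorted(i for _, i in top)
    return [section_sentences[i] for i in idx]
```

Because this runs per section, the resulting summary preserves the paper's structure: the user gets a query-focused digest of the abstract, methods, and results separately rather than one undifferentiated blob.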

Try the IBM Science Summarizer

As future work, we plan to add support for additional entities such as methods, architectures, and more. We also plan to increase our corpus to include more papers. But our immediate goal is to offer this tool to the community as an open service and later conduct an extensive user study to gain insight on the use and quality of our system.

To try out the IBM Science Summarizer, visit:

To hear more details, come visit us at ACL 2019 Poster Session 3B on Monday, July 29, from 16:00 to 17:40, and view our demo at the IBM Research AI booth (#2) on July 31 from 10:30 to 11:30 am. Hope to see you in Florence.

  1. Feigenblat, Guy, Roitman, Haggai, Boni, Odellia, & Konopnicki, David. (2017). Unsupervised Query-Focused Multi-Document Summarization using the Cross Entropy Method. 961–964. 10.1145/3077136.3080690.


Shai Erera

IBM Research Staff Member
