July 28, 2019 | Written by: David Konopnicki and Shai Erera
As a research scientist, staying up-to-date on current work is like swimming upstream against the current. With so many fascinating papers being published every day, at a rate that just keeps increasing, it’s tough to pick out the ones that may be important to my specific area of research or influence what I’m working on. Together with my colleagues at IBM Research-Haifa, I set out to build a tool that would quickly retrieve and summarize relevant and recent scientific papers that meet specific information needs.
Until now, most summarization work has focused on relatively short documents and content, such as news articles. However, scientific documents and papers require deeper analysis. In contrast to short summaries for news briefs, we wanted to help research scientists by extracting key information from a scientific paper, including what it is about, how the work was done, and the main findings and insights. This meant producing a more complete type of summary, one that analyzes, summarizes, and combines multiple sections of a paper. It also allows the user to easily navigate between summaries of the different sections, including the figures, tables, and so forth.
The IBM Science Summarizer, or DimSum, as it was nicknamed by the team for its DIscovery augMented SUMmarization, is a service that tracks scientific papers being published in the area of AI. The service produces summaries of the papers focused around an information need expressed through natural language queries.
One challenge in building the system comes from the fact that there is no ground truth or training data for summarization. Generally, in order to use machine learning techniques, we need lots of examples for training. In the case of summaries, it would be very expensive to have a human prepare enough training examples. In short, we had to create a labeled dataset. This is where the really creative thinking came in.
How to teach computers to summarize articles
In addition to DimSum, the team developed TalkSumm, short for Talk-based Summary, a dataset and scalable annotation method for summarizing scientific papers based on conference talks, which is being presented this week at ACL 2019 along with our paper, “TalkSumm: A Dataset and Scalable Annotation Method for Scientific Paper Summarization Based on Conference Talks.” Conceived by Michal Shmueli-Scheuer and Guy Lev, TalkSumm began with the assumption that example summaries already exist for many scientific papers, because many scientific conferences record authors explaining the main points of their paper.
So, starting with the video of the talk, the team extracted the transcript of what was said using the IBM Watson speech-to-text service. Then, using a probabilistic method called a Hidden Markov Model, the words uttered in the transcript of the presentation were mapped to the most likely sentences from the scientific paper being presented. Based on the transcript of what was said in the talk, we were therefore able to ‘learn’ the importance of sentences in the papers, forming extractive summaries. Now we had a dataset of some 2,000 examples for summarizing scientific papers. But the dataset is not limited to that size; the code is now open source, and anyone can use it to extend the dataset to include more and more papers. In addition, this dataset can be used as ground truth for scientific paper summarization.
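To make the alignment idea concrete, here is a minimal sketch in Python. It is not the TalkSumm implementation: the hidden states are the paper's sentences, each spoken word is an observation, and the emission model (smoothed word counts), the transition weights (`stay`, `step`, `jump`), and all function names are assumptions chosen for illustration. A sentence's importance is approximated by how many spoken words the most likely path assigns to it.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    words = (w.strip(".,;:!?()\"'").lower() for w in text.split())
    return [w for w in words if w]


def transition_logprobs(n, stay=0.6, step=0.3, jump=0.1):
    """Log transition matrix: mostly stay or advance, rarely jump (assumed weights)."""
    rows = []
    for i in range(n):
        weights = [jump / n] * n
        weights[i] += stay
        if i + 1 < n:
            weights[i + 1] += step
        total = sum(weights)
        rows.append([math.log(w / total) for w in weights])
    return rows


def align(words, sentences, alpha=0.1):
    """Viterbi decoding: the most likely sentence index for each spoken word."""
    if not words or not sentences:
        return []
    sent_tokens = [tokenize(s) for s in sentences]
    counts = [Counter(t) for t in sent_tokens]
    vocab_size = len(set(words).union(*sent_tokens))
    n = len(sentences)
    trans = transition_logprobs(n)

    def emit(k, word):
        # Add-alpha smoothed probability that sentence k "emits" this word.
        return math.log((counts[k][word] + alpha)
                        / (len(sent_tokens[k]) + alpha * vocab_size))

    score = [math.log(1.0 / n) + emit(k, words[0]) for k in range(n)]
    backpointers = []
    for word in words[1:]:
        new_score, pointers = [], []
        for j in range(n):
            best_i = max(range(n), key=lambda i: score[i] + trans[i][j])
            new_score.append(score[best_i] + trans[best_i][j] + emit(j, word))
            pointers.append(best_i)
        score = new_score
        backpointers.append(pointers)

    # Trace the best path backwards from the best final state.
    state = max(range(n), key=lambda k: score[k])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


def summarize(transcript, sentences, k=2):
    """Keep the k sentences the speaker 'spent' the most words on, in paper order."""
    durations = Counter(align(tokenize(transcript), sentences))
    top = sorted(durations, key=lambda s: (-durations[s], s))[:k]
    return [sentences[i] for i in sorted(top)]
```

Calling `summarize(transcript, sentences, k=2)` with a talk transcript and a list of paper sentences returns the two sentences that the decoded path dwells on longest. A real system would need a more robust emission model, since speakers rarely repeat a paper's wording exactly.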
To the best of our knowledge, we created the first search engine to offer summaries of scientific papers. Our system has ingested 187,160 papers from arXiv.org (“Computer Science” subset) and the ACL anthology.
The ingestion pipeline starts with paper acquisition. It then extracts the paper’s text, tables, and figures, and enriches the paper’s data with various annotations and entities. In our system, summarization is applied to each section using a state-of-the-art unsupervised, extractive, query-focused summarization algorithm, inspired by Feigenblat et al. (2017).
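As a rough illustration of what unsupervised, query-focused, extractive selection with the cross-entropy method can look like, consider the sketch below. It is not the production algorithm or the exact method of Feigenblat et al.: the objective (a weighted mix of bag-of-words cosine similarity to the query and to the whole section), the 0.7/0.3 weights, the word budget, and every function name are assumptions made for this example. The cross-entropy loop itself is the generic one: sample candidate sentence subsets, keep the best-scoring "elite" samples, and move the sampling probabilities toward them.

```python
import math
import random
from collections import Counter


def bag(text):
    """Bag-of-words counts after lowercasing and stripping punctuation."""
    words = (w.strip(".,;:!?()\"'").lower() for w in text.split())
    return Counter(w for w in words if w)


def cosine(a, b):
    """Cosine similarity between two word-count bags."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def score(mask, sentences, query_bag, doc_bag, max_words):
    """Assumed objective: query relevance plus coverage, zero if over budget."""
    chosen = [s for s, keep in zip(sentences, mask) if keep]
    if not chosen or sum(len(s.split()) for s in chosen) > max_words:
        return 0.0
    summary_bag = bag(" ".join(chosen))
    return (0.7 * cosine(summary_bag, query_bag)
            + 0.3 * cosine(summary_bag, doc_bag))


def cem_summarize(sentences, query, max_words=30, samples=200,
                  iterations=10, elite_frac=0.1, smooth=0.7, seed=0):
    """Select sentences with the cross-entropy method (illustrative sketch)."""
    rng = random.Random(seed)
    query_bag = bag(query)
    doc_bag = bag(" ".join(sentences))
    probs = [0.5] * len(sentences)  # per-sentence inclusion probability
    best_mask, best_score = tuple(False for _ in sentences), 0.0
    n_elite = max(1, int(samples * elite_frac))

    def evaluate(mask):
        return score(mask, sentences, query_bag, doc_bag, max_words)

    for _iteration in range(iterations):
        # Sample candidate subsets from the current Bernoulli distribution.
        population = [tuple(rng.random() < p for p in probs)
                      for _sample in range(samples)]
        elite = sorted(population, key=evaluate, reverse=True)[:n_elite]
        top_score = evaluate(elite[0])
        if top_score > best_score:
            best_mask, best_score = elite[0], top_score
        # Move each probability toward how often the elite kept that sentence.
        for i in range(len(probs)):
            elite_rate = sum(m[i] for m in elite) / n_elite
            probs[i] = smooth * elite_rate + (1 - smooth) * probs[i]

    return [s for s, keep in zip(sentences, best_mask) if keep]
```

For example, `cem_summarize(section_sentences, "summaries for a query", max_words=16)` returns the highest-scoring subset seen during sampling, in the original sentence order. The fixed `seed` keeps runs repeatable; the objective is the part a real system would replace with richer quality features.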
Try the IBM Science Summarizer
As future work, we plan to add support for additional entities such as methods, architectures, and more. We also plan to increase our corpus to include more papers. But our immediate goal is to offer this tool to the community as an open service and later conduct an extensive user study to gain insight on the use and quality of our system.
To try out the IBM Science Summarizer, visit: ibm.biz/sciencesum.
To hear more details, come visit us at ACL 2019 Poster Session 3B on Monday, July 29, from 16:00 to 17:40, and view our demo at the IBM Research AI booth (#2) on July 31 from 10:30 to 11:30 am. Hope to see you in Florence.
- Feigenblat, G., Roitman, H., Boni, O., & Konopnicki, D. (2017). Unsupervised Query-Focused Multi-Document Summarization using the Cross Entropy Method. pp. 961-964. doi:10.1145/3077136.3080690.