Progressing IBM Project Debater at AAAI-20 — and Beyond

Does AI bring more harm than good? Should we ban fast food? Should marijuana be legalized? These are just some of the complex questions occupying the minds of policy makers around the globe. They have no clear yes-or-no answer and require extensive deliberation over the pros and cons before a conclusion can be reached.

In the past few years, researchers studying computational argumentation have been trying to help people contemplate such questions. This rapidly emerging scientific field aims to address the core challenges in the realm of computer-assisted decision making.

Harish Natarajan, who holds the world record for most debate competition victories, took on IBM Project Debater, the first AI system that can debate humans on complex topics, at a live debate at IBM Think 2019.

As part of IBM’s Project Debater team, we have been working for the past eight years on such challenges. Our work was highlighted in February 2019, when we showcased the first AI system able to meaningfully engage in a full live debate with a human debater. In a first-of-a-kind event held in San Francisco, the system debated Mr. Harish Natarajan on whether preschools should be subsidized, demonstrating its capabilities to mine relevant claims and evidence from a massive corpus, cluster them around the debate’s main themes, build and deliver a compelling narrative in natural language, and also respond effectively to key claims made by the opponent.

Since then, the system’s capabilities have been extended with Project Debater Speech by Crowd, an AI cloud technology that collects free-text arguments from large audiences on debatable topics and generates meaningful narratives expressing the participants’ opinions in a concise way. This past summer, we conducted a live experiment with Speech by Crowd at the IBM THINK conference in Tel Aviv, where we collected arguments from more than 1,000 attendees on their stance toward the legalization of marijuana. While on stage, Project Debater rapidly summarized the arguments and articulated narratives for the pro and con sides. Speech by Crowd has since been demonstrated at events in the Swiss city of Lugano and at the University of Cambridge.

Significant strides have been made in the past year on the algorithmic front. Project Debater adopted new multi-purpose pre-trained neural network models such as BERT (Bidirectional Encoder Representations from Transformers). An important feature of these pre-trained models is that they can be fine-tuned for specific domains or tasks. This family of models has been shown to help achieve state-of-the-art results on a wide variety of NLP tasks.

At the heart of Project Debater is the ability to detect good arguments. In the context of argument mining, the goal is to pinpoint relevant, argumentative claims and evidence, which either support or contest a given topic, among billions of sentences. Moreover, when arguments are collected from thousands of sources, their persuasiveness, coherence and writing style vary greatly. Ranking arguments by quality allows selection of the “best” arguments, highlighting them to policy makers and facilitating a coherent narrative.

At the thirty-fourth AAAI conference on Artificial Intelligence (AAAI-20), we will present two papers on recent advancements in Project Debater on these two core tasks, both utilizing BERT. The first paper deals with the problem of mining specific types of argumentative texts from a massive corpus of journals and newspapers. The second paper presents a new dataset for argument quality and provides a BERT-based neural method for ranking arguments by quality.

Corpus-Wide Argument Mining on a New Scale
In the first paper, Corpus Wide Argument Mining – a Working Solution, by Ein-Dor et al., we present the first end-to-end solution for corpus-wide argument mining: a pipeline that is given a controversial topic and automatically pinpoints supporting and contesting evidence for this topic within a massive corpus of journals and newspapers. We demonstrate the efficiency and value of our solution on a corpus of 10 billion sentences, an order of magnitude larger than in previous work. To be of practical use, such a system needs to provide wide coverage of the relevant arguments, with high accuracy, over a wide array of potential topics. It faces the challenging task of learning to recognize a specific flavor of argumentative text (in our work, expert and study evidence) while overcoming the rarity of relevant texts within the corpus.

The first stage of our solution is composed of dedicated sentence-level queries, designed to work as initial filters while at the same time focusing on retrieval of the desired types of arguments. The output of the queries is fed into a dedicated classifier, which recognizes the relevant argumentative texts by ranking them according to their confidence score. This sentence-level classifier is obtained by fine-tuning BERT over a large set of annotated sentences, created by applying an iterative scheme for data annotation to overcome the rarity of relevant texts in the corpus. The annotated dataset obtained by applying our annotation scheme to Wikipedia is publicly available as part of this work. When tested on 100 unseen topics, our system achieved an average accuracy of 95% for its top-ranked 40 sentences per topic, outperforming previously reported results by a highly significant margin. The high precision, together with the wide range of test topics, represents a significant step forward in the argument mining domain.
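The two-stage flow described above (query filter, then confidence ranking) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the system's actual implementation: the cue words, the helper names, the example sentences and the `toy_score` function are all hypothetical, and `toy_score` merely stands in for the fine-tuned BERT classifier. The `precision_at_k` helper reflects one reading of "accuracy for the top-ranked 40 sentences", namely the fraction of top-k sentences judged relevant.

```python
from typing import Callable, List, Sequence, Tuple


def query_filter(sentences: Sequence[str],
                 topic_terms: Sequence[str],
                 evidence_cues: Sequence[str]) -> List[str]:
    """Stage 1: keep sentences that mention the topic AND contain an evidence cue.

    Terms and cues are expected in lowercase; matching is by substring.
    """
    kept = []
    for sentence in sentences:
        low = sentence.lower()
        if any(t in low for t in topic_terms) and any(c in low for c in evidence_cues):
            kept.append(sentence)
    return kept


def rank_by_confidence(candidates: Sequence[str],
                       score: Callable[[str], float]) -> List[Tuple[str, float]]:
    """Stage 2: score every candidate and sort by confidence, highest first."""
    scored = [(s, score(s)) for s in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def precision_at_k(ranked: Sequence[Tuple[str, float]],
                   is_relevant: Callable[[str], bool],
                   k: int) -> float:
    """Fraction of the top-k ranked sentences that are judged relevant."""
    top = ranked[:k]
    if not top:
        return 0.0
    return sum(1 for s, _ in top if is_relevant(s)) / len(top)


# Hypothetical cue words; the real queries are not described in this post.
EVIDENCE_CUES = ["study", "according to", "researchers"]


def toy_score(sentence: str) -> float:
    """Toy stand-in for a classifier: share of cue words present in the sentence."""
    low = sentence.lower()
    return sum(cue in low for cue in EVIDENCE_CUES) / len(EVIDENCE_CUES)


# Invented example sentences, for illustration only.
sentences = [
    "A study found that preschool attendance improves early literacy.",
    "Preschool is fun for children.",
    "According to researchers, preschool subsidies narrow achievement gaps.",
    "The weather was nice yesterday.",
]

candidates = query_filter(sentences, ["preschool"], EVIDENCE_CUES)
ranked = rank_by_confidence(candidates, toy_score)
for sentence, confidence in ranked:
    print(f"{confidence:.2f}  {sentence}")
```

In a real setting the scorer would be a trained model and relevance would come from human labels; the point here is only the shape of the pipeline, where a cheap filter cuts the corpus down before the expensive ranker runs.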

Argument Quality
Our research on argument quality was first introduced at EMNLP 2019, in the context of the Speech by Crowd platform. In that work, we published a new dataset of actively collected arguments labeled for pair-wise and point-wise quality. In pair-wise annotation, two arguments are shown to the annotators, who are asked to select the higher-quality one. In point-wise annotation, a single argument is shown, and annotators are asked to grade its quality without any reference. To avoid the high subjectivity of rating a single argument on a scale, the point-wise annotations are binary (i.e., whether the argument has low or high quality), and we used a simple average of positive annotations to derive a continuous score.
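As a small worked example of the simple-average scoring just described, the following Python sketch maps a list of binary point-wise judgments to a continuous score. The function name and the sample labels are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def mean_positive_score(labels: Sequence[int]) -> float:
    """Continuous quality score: share of annotators who judged the argument high quality.

    Each label is 1 (high quality) or 0 (low quality).
    """
    if not labels:
        raise ValueError("at least one annotation is required")
    if any(label not in (0, 1) for label in labels):
        raise ValueError("labels must be binary (0 or 1)")
    return sum(labels) / len(labels)


# Four hypothetical annotators, three of whom rated the argument as high quality.
score = mean_positive_score([1, 1, 0, 1])
print(score)  # 0.75
```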

In our new AAAI-20 paper, A Large-scale Dataset for Argument Quality Ranking: Construction and Analysis, we extend this work, focusing on point-wise argument quality. We use crowd-sourcing to label around 30,000 arguments on 71 debatable topics, making the dataset five times larger than any previously published point-wise quality dataset. This dataset is made publicly available as part of this work. A main contribution in the creation of the dataset is the analysis of two additional scoring functions that map a set of binary annotations to a single continuous score. Both consider the individual annotators’ reliability. The first is based on inter-annotator agreement, as measured by Cohen’s kappa coefficient. The second is based on an estimate of annotators’ competence, as evaluated by the MACE tool. Intuitively, in both cases, the idea is to give more weight to the judgments made by more reliable human annotators. We compare the two functions in three evaluation tasks and find that, while they yield inherently different distributions of continuous labels, neither function is better than the other. Our analysis emphasizes the importance of selecting the scoring function, as the choice can impact the performance of learning algorithms. We also experiment with fine-tuning BERT on these data, and the results suggest that it performs best when arguments are concatenated with their respective topic. This model achieves 0.53 Pearson and 0.52 Spearman correlations on our test set, outperforming several simpler alternatives.
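The intuition of giving more weight to more reliable annotators can be illustrated with a generic reliability-weighted average. This is a minimal sketch, not the exact scoring functions from the paper: the function name, annotator IDs and reliability values are hypothetical, and in practice the weights would be derived from something like inter-annotator agreement or a competence estimate.

```python
from typing import Mapping


def weighted_quality_score(labels: Mapping[str, int],
                           reliability: Mapping[str, float]) -> float:
    """Reliability-weighted average of binary labels.

    labels maps annotator ID -> 0/1 judgment for one argument.
    reliability maps annotator ID -> non-negative weight (assumed given).
    """
    total_weight = sum(reliability[a] for a in labels)
    if total_weight <= 0:
        raise ValueError("total reliability weight must be positive")
    return sum(reliability[a] * y for a, y in labels.items()) / total_weight


# Hypothetical judgments and weights for a single argument.
labels = {"ann_a": 1, "ann_b": 0, "ann_c": 1}
reliability = {"ann_a": 0.9, "ann_b": 0.3, "ann_c": 0.6}

simple = sum(labels.values()) / len(labels)
weighted = weighted_quality_score(labels, reliability)
print(f"simple mean: {simple:.3f}, weighted: {weighted:.3f}")
```

Here the one negative judgment comes from the least reliable annotator, so the weighted score ends up higher than the simple mean. This is the kind of shift in label distribution that makes the choice of scoring function matter for downstream learning.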

If you are at AAAI-20, you can attend our presentations listed below. Or stop by IBM booth #103 to learn more. Hope to see you in New York!

Poster Spotlight Presentation 5584:  Sunday, February 9 | 3:45-5:15 PM, Sutton North
NLP5584: A Large-Scale Dataset for Argument Quality Ranking: Construction and Analysis
Shai Gretz, Roni Friedman, Edo Cohen-Karlik, Assaf Toledo, Dan Lahav, Ranit Aharonov, Noam Slonim

Poster Spotlight Presentation 1741:  Monday, February 10 | 3:45-5:15 PM, Sutton North
NLP1741: Corpus Wide Argument Mining – a Working Solution
Liat Ein-Dor, Eyal Shnarch, Lena Dankin, Alon Halfon, Benjamin Sznajder, Ariel Gera, Carlos Alzate, Martin Gleize, Leshem Choshen, Yufang Hou, Yonatan Bilu, Ranit Aharonov, Noam Slonim

IBM Research Staff Member

Avishai Gretz

IBM Research Staff Member

Yonatan Bilu

IBM Research Staff Member
