As part of the collection of IBM Research papers at ACL 2018, we were delighted to receive the Best Paper Award at the Machine Reading for Question Answering workshop for our paper A Systematic Classification of Knowledge, Reasoning, and Context within the ARC Dataset (by Michael Boratko, Harshit Padigela, Divyendra Mikkilineni, Pritish Yuvraj, Rajarshi Das, Andrew McCallum, Maria Chang, Achille Fokoue-Nkoutche, Pavan Kapanipathi, Nicholas Mattei, Ryan Musa, Kartik Talamadupula, and Michael Witbrock). The paper includes authors from both IBM Research and our AI Horizons Network partner, the University of Massachusetts Amherst.
Question answering has increasingly become a crucial problem for evaluating the progress of AI systems in natural language processing, and of machine intelligence in general. However, most current efforts on question answering focus on “shallow” tasks that merely test an algorithm's capability to “attend”, or focus attention, on specific words and pieces of text. To better align progress in the field with the expectations we have of human performance and behavior, a new class of questions, known as complex questions, has been proposed. As the questions themselves, and the knowledge and reasoning required to answer them, become more complex and specialized, it is hoped that the algorithms that understand and answer these questions will come to resemble human expertise in specialized domains.
In this light, the recent work of Clark et al. introduces the AI2 Reasoning Challenge (ARC) and the associated ARC dataset. This dataset contains science questions from standardized tests, separated into an Easy Set and a Challenge Set. The Challenge Set comprises questions that are answered incorrectly by both a retrieval-based (IR) solver and a word co-occurrence solver based on Pointwise Mutual Information (PMI). In addition to this division, Clark et al. present a survey of the types of knowledge and reasoning required to answer various questions in the ARC dataset, based on an analysis of 100 questions chosen at random from the Challenge Set. However, the survey and associated results provide very little detail on important dataset features such as the questions chosen, the annotations provided, or the methodology used. To address these issues, and to present a more generalized analysis of complex questions in the standardized-testing domain, we conducted and report on a thorough analysis of the ARC dataset.
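To give a sense of how a PMI solver works, here is a minimal toy sketch: it scores each answer option by the average pointwise mutual information between question words and option words over a small corpus of sentences (represented as word sets). This is an illustration of the general idea only, not the actual AI2 implementation, and the corpus and word sets are invented for the example.

```python
import math
from itertools import product

def pmi_score(question_words, option_words, corpus):
    """Average PMI between question/option word pairs over a corpus of word-set 'sentences'."""
    n = len(corpus)
    scores = []
    for q, o in product(question_words, option_words):
        p_q = sum(q in s for s in corpus) / n
        p_o = sum(o in s for s in corpus) / n
        p_qo = sum(q in s and o in s for s in corpus) / n
        if p_qo > 0:  # only pairs that actually co-occur contribute
            scores.append(math.log(p_qo / (p_q * p_o)))
    return sum(scores) / len(scores) if scores else float("-inf")

# Toy corpus: each "sentence" is a set of content words (invented for illustration).
corpus = [
    {"trees", "solar", "chemical", "energy"},
    {"trees", "grow"},
    {"wind", "heat"},
    {"wind", "energy"},
]
question = {"trees", "energy"}
option_b = {"solar", "chemical"}  # "solar energy into chemical energy"
option_c = {"wind", "heat"}      # "wind energy into heat energy"
```

On this toy corpus, option (b) scores higher than option (c) because its words co-occur with the question words more often than chance would predict; a Challenge Set question is precisely one where such surface co-occurrence statistics pick the wrong answer.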
In our paper, A Systematic Classification of Knowledge, Reasoning, and Context within the ARC Dataset, we present the results of our detailed annotation process for the ARC dataset. Specifically, we (a) introduce a novel labeling interface that allows a distributed set of annotators to label the knowledge and reasoning types; and (b) improve upon the knowledge and reasoning type categories provided previously, in order to make the annotation more intuitive and accurate. We do this by providing annotator directions, definitions, and multiple examples for each type of reasoning and knowledge label. These labels—and indeed our interface—can be applied to any number of question-answering datasets including SQuAD.
Following an annotation round involving over ten people at two institutions, we measure and report statistics such as inter-rater agreement and the distribution of knowledge and reasoning type labels in the dataset. We then (c) clarify the role of knowledge and reasoning within the ARC dataset with a comprehensive set of annotations for both the questions and returned results, demonstrating the efficacy of query refinement to improve existing question answering systems.
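Inter-rater agreement for categorical labels of this kind is commonly measured with a chance-corrected statistic such as Cohen's kappa. The following is a minimal sketch for two annotators, with hypothetical label data; it illustrates the statistic, not the paper's exact computation:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both annotators pick the same label independently.
    expected = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / n ** 2
    return (observed - expected) / (1 - expected)

# Hypothetical reasoning-type labels from two annotators on four questions.
ann_1 = ["multi-hop", "linguistic", "multi-hop", "hypothetical"]
ann_2 = ["multi-hop", "linguistic", "linguistic", "hypothetical"]
```

Here the annotators agree on three of four questions (75% raw agreement), but the kappa of about 0.64 is lower because some of that agreement would be expected by chance given the label distributions.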
Our annotators were also asked to mark whether individual retrieved sentences were relevant to answering a given question. Our labeling interface logs the reformulated queries issued by each annotator, as well as their relevance annotations. To quantitatively demonstrate the effectiveness of the relevant sentences, we (d) evaluate a subset of questions and the relevant retrieval results with a pre-trained DrQA model and find that the performance of the system improves by 42 points.
A major contribution of our work is formalizing the knowledge and reasoning types that one may encounter not only in the ARC dataset but in any general question answering dataset, including ones like SQuAD. We highlight one important conclusion from our analysis: the kind of knowledge that is available to answer a particular question determines the reasoning type to be employed and, ultimately, whether the question can be answered at all.
For example, the question:
Q. Giant redwood trees change energy from one form to another. How is energy changed by the trees?
a. They change chemical energy into kinetic energy.
b. They change solar energy into chemical energy.
c. They change wind energy into heat energy.
d. They change mechanical energy into solar energy.
can be answered using two different kinds of reasoning, depending on the knowledge retrieved:
Retrieved knowledge: Trees change solar energy into chemical energy.
If we retrieve this sentence, then we can use Linguistic Reasoning to answer the question.
Retrieved knowledge: (a) Solar energy is changed into chemical energy by plants and (b) Trees are classified as plants.
If we retrieve these two sentences, then we need to use Multi-hop Reasoning to answer the question.
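The multi-hop case above amounts to chaining two retrieved facts: a class-membership fact ("trees are plants") bridges the question's subject to a fact stated about the class. A toy sketch of this chaining, with invented fact triples and predicate names (not the paper's system):

```python
def answer(facts, subject, relation):
    """Answer (subject, relation, ?) from (s, r, o) triples, hopping through 'is_a' links."""
    # Direct lookup first: the single-hop case (Linguistic Reasoning above).
    for s, r, o in facts:
        if s == subject and r == relation:
            return o
    # Otherwise hop: find "subject is_a X", then recurse on X (Multi-hop Reasoning above).
    for s, r, o in facts:
        if s == subject and r == "is_a":
            result = answer(facts, o, relation)
            if result is not None:
                return result
    return None

# The two retrieved sentences from the example, as toy triples.
facts = [
    ("plants", "convert", "solar energy to chemical energy"),
    ("trees", "is_a", "plants"),
]
```

With only the first fact retrieved, answering about "trees" fails at the direct lookup; retrieving the "trees is_a plants" fact as well lets the second loop complete the chain.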
The dataset of human annotations produced as part of this work has been open-sourced for use by the community, and can be downloaded from the IBM GitHub.
This paper and the work described in it were the first step in our work on question answering, and the importance of such analysis, particularly for more complex questions, is reflected in the Best Paper Award we recently received. The community as a whole is moving toward answering more complex questions, whether questions that require consulting external knowledge or adversarial questions designed to defeat specific approaches and algorithms. Our work in this space will continue to focus on achieving human-level, and better-than-human, performance on increasingly complex questions, and on the knowledge and reasoning required to support it.