Answering users’ questions in an enterprise domain remains a challenging proposition. Businesses are increasingly turning to automated chat assistants to handle technical support and customer support interactions. However, these tools can only successfully troubleshoot questions they were trained on, exposing a growing challenge for enterprise question answering (QA) techniques today.
To address this, IBM Research AI is introducing a new leaderboard called TechQA, which uses real-world questions that users posted on IBM DeveloperWorks. The goal of TechQA is to foster research on enterprise QA, where learning from a relatively small set of QA pairs is the more realistic condition.
TechQA is the first leaderboard to address enterprise QA use cases. IBM Research AI anticipates creating additional use cases from enterprise domains to foster additional research for this critical capability.
Background on QA
We can distinguish between types of QA problems based on the characteristics of the question and the corresponding answer. Questions that are relatively short (fewer than a dozen words) are quite different from longer questions (ranging from ten to fifty words or more). Today, longer questions are encountered much more often in enterprise support situations such as IT, where the typical question has a median length of 35 words.
Answers can also be segmented into i) short (factoid) contiguous spans of text of one to about five words, ii) longer answers ranging from about six words to a sentence, or iii) even longer answers of one to a few paragraphs. In IT support, the median length of an answer is about 45 words.
Existing QA Leaderboards
Examples of short-question/short-answer corpora are the successful SQuAD v1.1 (about 70 submissions) and v2.0 (about 60 submissions) reading comprehension leaderboards [i, ii], where a Wikipedia passage is provided along with a question such as “Which NFL team represented the AFC at Super Bowl 50?”¹. The systems improved from an answer F measure in the mid-fifties to a high in the mid-nineties over about a two-year period², representing relatively rapid progress on the task.
However, top performing systems on SQuAD v1.1 were brittle and fell apart when the questions were slightly modified so that the provided document no longer contained an answer. When tested with adversarial questions (where about 50% of the questions did not have an answer in the passage), the F measure dropped by more than 20 points.
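The F measure quoted for these leaderboards is a token-overlap score between the predicted and the reference answer. The following is a minimal sketch of that idea, assuming whitespace tokenization and the SQuAD v2.0-style convention that an empty string stands for "no answer"; the function name `token_f1` is hypothetical, and the official evaluation scripts apply additional normalization (such as stripping punctuation and articles) that is omitted here.

```python
from collections import Counter


def token_f1(prediction: str, gold: str) -> float:
    """Token-overlap F1 between a predicted and a reference answer.

    Assumption: an empty string means "no answer".
    """
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()

    # If either side is "no answer", give credit only when both are.
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)

    # Multiset intersection counts each shared token at most as often
    # as it appears on both sides.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


exact = token_f1("Denver Broncos", "Denver Broncos")
partial = token_f1("the Denver Broncos", "Denver Broncos")
wrongly_answered = token_f1("Denver Broncos", "")
```

Under this convention, a system that confidently extracts a span for a question with no answer in the passage scores zero on that question, which is one way to see why unanswerable questions pull the average down so sharply.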
SQuAD v2.0 was launched in mid-2018 with such adversarial questions. The F measure started at about 70% and rapidly improved to about 90% in about one year. Yet even the top SQuAD v2.0 systems were still weak for real-world applications due to a fundamental design flaw in the data: the questions were written by “mechanical turkers” after they had looked at the passage containing the answer. This observation bias creates a high overlap between the words in the question and the context of the answer.
To address this observation bias, a new leaderboard called Natural Questions [iii] was created by harvesting questions that users submitted to Google’s search engine and then having turkers find the answers. When a SQuAD system is tested on the Natural Questions leaderboard, the F measure drops dramatically to 6% on short answers (2% for a SQuAD v1.1 system), illustrating the brittleness of SQuAD-trained systems.
The top scoring system on the Natural Questions leaderboard is currently the IBM Research AI GAAMA system [iv], with an F measure of about 60% for short answers as of July 2019. The leaderboard started in early 2019 at an F measure of 52% for short answers, so this gain was achieved in less than six months. IBM Research AI is continuing to monitor these systems to see how long it takes for them to get close to human performance on this task.
Another aspect to consider is that justifying (or finding) an answer may require combining several segments from different documents. This is known as the multi-hop question answering task. HotpotQA [v] is a leaderboard focused on questions requiring several segments to identify the answer³. Its F measure improved from 52% to 71% in a little over a year. However, HotpotQA suffers from the same observation bias as SQuAD: the questions were generated after reading the documents that contain the answer.
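One rough way to make this observation bias visible is to measure how many of a question's content words also appear in the passage. The sketch below is illustrative only: the function names, the small stopword list, and the example passage and paraphrased question are all assumptions made for this example, not part of any leaderboard's tooling.

```python
import re

# Hypothetical, deliberately tiny stopword list for illustration.
STOPWORDS = {"a", "an", "the", "at", "of", "to", "in", "on",
             "is", "was", "which", "what", "who", "did"}


def content_words(text: str) -> set:
    """Lowercased alphanumeric tokens with stopwords removed."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {t for t in tokens if t not in STOPWORDS}


def question_overlap(question: str, passage: str) -> float:
    """Fraction of the question's content words found in the passage."""
    q_words = content_words(question)
    if not q_words:
        return 0.0
    return len(q_words & content_words(passage)) / len(q_words)


# Illustrative passage and questions (assumed for this sketch).
passage = ("The American Football Conference (AFC) champion Denver Broncos "
           "defeated the National Football Conference (NFC) champion "
           "Carolina Panthers to earn their third Super Bowl title.")
written_from_passage = "Which NFL team represented the AFC at Super Bowl 50?"
written_independently = "Who won the big game in February 2016?"

biased = question_overlap(written_from_passage, passage)
unbiased = question_overlap(written_independently, passage)
```

In this toy case the question written with the passage in view shares three of its seven content words with the passage, while the independently phrased question shares none, even though both ask for the same answer.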
IBM Research AI TechQA Corpus
The TechQA corpus highlights two real-world issues from the automated customer support domain. First, it contains actual questions posed by users on a technical forum, rather than questions generated specifically for a competition or a task. Second, it has a real-world size — 600 training, 310 development, and 490 evaluation question/answer pairs — thus reflecting the cost of creating labeled datasets in the context of an enterprise use case.
Consequently, TechQA is meant to stimulate research in domain adaptation rather than as a resource to build QA systems from scratch. The dataset was obtained by crawling the IBM Developer and IBM DeveloperWorks forums for questions with accepted answers that appear in a published IBM Technote — a technical document that addresses a specific technical issue.
In addition to the question-answer pairs, TechQA also provides a collection of about 800,000 publicly available Technotes as a companion resource to enable research in domain adaptation (e.g., pre-training and fine-tuning context-dependent vector embeddings).
For TechQA, a baseline built on the GAAMA system from IBM Research AI, the top performing⁴ system for short answers on Natural Questions, reaches an answer F measure of 53%, just above the 50% obtained by a system that always answers No_answer. The GAAMA system was fine-tuned on the TechQA training set (600 QA pairs).
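To see where a trivial baseline score of this kind comes from, the sketch below computes the score of a system that abstains on every question. It assumes, purely for illustration, that an abstention earns full credit on an unanswerable question and zero on an answerable one; the function name and the toy labels are hypothetical, and the actual TechQA scoring may differ in its details.

```python
def always_no_answer_score(is_answerable: list) -> float:
    """Average score of a system that predicts "no answer" everywhere.

    Assumption: abstaining scores 1 on unanswerable questions and 0 on
    answerable ones, so the result is the unanswerable fraction.
    """
    if not is_answerable:
        raise ValueError("need at least one question")
    unanswerable = sum(1 for answerable in is_answerable if not answerable)
    return unanswerable / len(is_answerable)


# Toy evaluation set: half of the questions have an answer.
toy_labels = [True, False, True, False]
baseline = always_no_answer_score(toy_labels)
```

Under that assumption, a 50% baseline would correspond to roughly half of the evaluation questions being unanswerable, which puts the 53% of the fine-tuned system in perspective.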
More details on the TechQA task and leaderboard can be found at ibm.biz/Tech_QA.
[i] SQuAD: 100,000+ Questions for Machine Comprehension of Text. Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 2016.
[ii] Know What You Don’t Know: Unanswerable Questions for SQuAD. Pranav Rajpurkar, Robin Jia, and Percy Liang. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, 2018.
[iii] Natural Questions: A Benchmark for Question Answering Research. Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Matthew Kelcey, Jacob Devlin, Kenton Lee, Kristina N. Toutanova, Llion Jones, Ming-Wei Chang, Andrew Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. Transactions of the Association for Computational Linguistics, 2019.
[iv] CFO: A Framework for Building Production NLP Systems. Rishav Chakravarti, Cezar Pendus, Andrzej Sakrajda, Anthony Ferritto, Lin Pan, Michael Glass, Vittorio Castelli, J William Murdock, Radu Florian, Salim Roukos, and Avirup Sil. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP), Demo Track, 2019.
[v] HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering. Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, October-November, 2018.
¹ “Denver Broncos” is the extracted answer.
² Note that blue dots on the state of the art red curve represent when the corresponding state of the art was first reached.
³ These multi-hop questions were eliminated during the creation of the Natural Questions leaderboard.