Enter the Sentence Selection Track of DSTC7


Be part of the research effort to create next-generation AI dialog systems by joining Track 1 (Sentence Selection) of the 2018 Dialog System Technology Challenge 7 (DSTC7), which IBM Research AI and the University of Michigan are co-organizing. Automated dialog assistants today rely heavily on feature- and rule-based approaches that are explicitly programmed by designers. A key need for scientific advancement is the development of algorithms that can learn goal-oriented dialog interactions effectively from human-to-human chatlogs while grounding them in relevant external information sources.

The IBM Research Implicit Dialog team (L. Polymenakos, Ch. Gunasekara, J. Ganhotra, S. Patel, K. Fadnis, N. Mills, L. Zhu and S. Feng), with our University of Michigan collaborators (Professors S. Baveja and W. Lasecki), is spearheading this research direction and has created a public competition to inspire and evaluate novel approaches that will lead to the next generation of AI dialog systems, able to learn from and adapt to human dialog examples and to new information. We encourage all academic and industrial groups working on dialog or related research on multi-turn interaction with data to register and participate in the DSTC7 Track 1 challenge. Read on for more information and register now!

The challenge is named NOESIS: Noetic End-to-End Response Selection Challenge and focuses on end-to-end dialog systems that a) learn from chatlog data the parameters needed to complete the task and to produce the correct responses, and b) can deal with the following advanced conditions which are considered prerequisites for practical automated agents:

  1. Deal with natural language richness
  2. Know the correct response among a large number of choices
  3. Know when no correct response is available in the choices
  4. Incorporate knowledge grounding

DSTC7 Track 1 datasets

We prepared two datasets to support this challenge and released them on June 1, 2018:

  1. Student Advising Data (Flex Data): This is a collection of student-advisor dialogs in which advisors guide students to pick courses that fit not only their curriculum but also their personal preferences about time, difficulty, career path, etc. An additional knowledge base about courses and possible (but not all) personal preferences is provided. The data also includes paraphrases of the sentences and of the target responses. The dialogs are play-acted, following a set of possible course selections and a progression of advisor dialog acts.
  2. Ubuntu Dialog Corpus: This is a new version of the disentangled Ubuntu IRC dialogs, in which the goal is to solve an Ubuntu user’s posted problem. We developed new algorithms for disentangling the multi-party dialogs and provide the regular two-party dialogs that were found in the corpus (the methodology is described in an upcoming publication submitted to NIPS 2018). Additional external knowledge is provided in the form of manual pages.

DSTC7 Track 1 scoring

The testing and scoring of the systems is organized as a progression of subtasks. Each subtask is listed below, followed by the details given for the Ubuntu and Flex datasets:

  1. Baseline: Select the next utterance from a given candidate set (candidate pool of 100).
     Ubuntu dataset: The set contains 1 correct option and 99 incorrect options (100 in total).
     Flex dataset: The set contains 1 correct option and 99 incorrect options (100 in total).
  2. Select the next utterance from a large global pool of candidates (candidate pool >10,000). A large pool of candidates (over 10,000) is provided to pick the next utterance from. The increased number of candidates challenges the logical capability of dialog models.
  3. Select the next utterance with the set of paraphrases. The set contains 1–5 correct options and 95–99 incorrect options (100 in total). We provide multiple correct options by using paraphrases (note: the correct options in a set may include the original utterance we collected or may be only paraphrases).
  4. Select the next utterance from a candidate pool that might not include the correct next utterance for some instances (candidate pool <100).
     Ubuntu dataset: Only one answer is correct; no paraphrases are provided.
     Flex dataset: Only one answer is correct; no paraphrases are provided.
  5. Select the next utterance with a model that incorporates external knowledge (candidate pool <100). The external knowledge base is provided.
     Ubuntu dataset: Ubuntu manual pages.
     Flex dataset: Curriculum-related database.
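
Subtask 4 asks a system to recognize when none of the candidates is correct. The challenge does not prescribe how to do this; one simple option is to reject the top-ranked candidate when its score falls below a cutoff. The sketch below assumes the model already produces a score per candidate; the function name `select_response` and the threshold value are hypothetical and chosen only for illustration.

```python
from typing import Dict, Optional


def select_response(scores: Dict[str, float], none_threshold: float) -> Optional[str]:
    """Pick the best-scoring candidate, or None if no candidate is convincing.

    `scores` maps each candidate utterance to a model score (higher is better).
    `none_threshold` is a hypothetical cutoff that would be tuned on held-out data.
    """
    if not scores:
        return None
    best = max(scores, key=scores.get)
    # Treat a low top score as "the correct response is not in the pool".
    return best if scores[best] >= none_threshold else None


# Illustrative scores only; these are not challenge data.
confident = select_response({"try sudo apt update": 0.91, "reboot": 0.12}, 0.5)
unsure = select_response({"try sudo apt update": 0.31, "reboot": 0.12}, 0.5)
```

Here `confident` holds the top candidate, while `unsure` is `None` because no score reaches the cutoff.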

DSTC7 Track 1 timeline and resources

The official timeline for participation in the challenge is as follows:

Task Dates
Development phase (14 weeks) Jun 1 – Sep 9 2018
Evaluation phase (2 weeks) Sep 10 – Sep 24 2018
Release of the results 1 Oct 2018
Paper submission deadline Oct – Nov 2018
DSTC7 special session or workshop Spring 2019


Information about the challenge and its progress, along with sample code, comparative baselines, and additional material that can help fellow researchers innovate and propose solutions to the posed problems can be found on these three websites:

  1. The official DSTC7 website (with information on all Tracks)
  2. The website we maintain for Track 1, which includes an FAQ addressing issues that participants raise
  3. The IBM GitHub, where we release baseline code, results and other material
    (“watch” the GitHub to receive updates when we post new material).

DSTC7 Track 1 baseline code

On June 22 we released baseline code (based on existing open source code) for the Dual LSTM Encoder model from the paper “The Ubuntu Dialogue Corpus: A Large Dataset for Research in Unstructured Multi-Turn Dialogue Systems” and the implementation discussed here. A schematic of the baseline architecture is shown below:

DSTC7 Track 1 system architecture
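
As we understand the cited paper, the dual encoder scores a context-response pair as sigmoid(c^T M r + b), where c and r are the encodings of the context and the candidate response and M is a learned matrix. The sketch below illustrates only that scoring and ranking step. The LSTM encoders are omitted, the vectors and the identity matrix are made-up values, and the function names are hypothetical; this is not the released baseline code.

```python
import math
from typing import Dict, List, Sequence


def match_probability(
    context_vec: Sequence[float],
    response_vec: Sequence[float],
    m: Sequence[Sequence[float]],
    bias: float = 0.0,
) -> float:
    """Return sigmoid(c^T M r + b) for one context-response pair."""
    # M r: project the response encoding into the context space.
    projected = [sum(w * r for w, r in zip(row, response_vec)) for row in m]
    # c^T (M r): dot product with the context encoding, plus the bias.
    logit = sum(c * p for c, p in zip(context_vec, projected)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


def rank_candidates(
    context_vec: Sequence[float],
    candidates: Dict[str, Sequence[float]],
    m: Sequence[Sequence[float]],
) -> List[str]:
    """Order candidate names from most to least likely next utterance."""
    scores = {
        name: match_probability(context_vec, vec, m)
        for name, vec in candidates.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Toy 2-dimensional encodings; a trained model would produce these vectors.
identity = [[1.0, 0.0], [0.0, 1.0]]
context = [1.0, 0.0]
candidates = {"good": [2.0, 0.0], "neutral": [0.0, 1.0], "bad": [-2.0, 0.0]}
ranking = rank_candidates(context, candidates, identity)
```

With these toy values the candidate aligned with the context ranks first and the opposed one ranks last.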

We tested the system on subtask 1 (picking the correct answer from 100 candidates) and report recall at 1, 2, 5, 10 and 50 (R@k, the fraction of test dialogs for which the correct answer is ranked among the top k candidates). The numbers are very low with the Dual LSTM approach, as shown in the table below:

Dataset 1 in 100 R@1 1 in 100 R@2 1 in 100 R@5 1 in 100 R@10 1 in 100 R@50
Ubuntu 8.32% 13.36% 24.26% 35.98% 80.04%
Flex 6.20% 9.80% 18.40% 29.60% 72.80%
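
For readers who want to compute the same kind of metric on their own system output, here is a minimal sketch of recall at k. It assumes each test dialog comes with a ranked list of candidate IDs and a set of correct IDs; the function name and the toy data are hypothetical and do not come from the challenge's official scoring script.

```python
from typing import Sequence, Set


def recall_at_k(
    ranked: Sequence[Sequence[str]],
    gold: Sequence[Set[str]],
    k: int,
) -> float:
    """Fraction of dialogs whose top-k ranked candidates include a correct one."""
    if len(ranked) != len(gold):
        raise ValueError("ranked and gold must have the same number of dialogs")
    if not ranked:
        return 0.0
    hits = sum(
        1
        for candidates, correct in zip(ranked, gold)
        if any(c in correct for c in candidates[:k])
    )
    return hits / len(ranked)


# Toy example with four dialogs and three candidates each (not challenge data).
ranked = [["a", "b", "c"], ["x", "y", "z"], ["p", "q", "r"], ["m", "n", "o"]]
gold = [{"a"}, {"z"}, {"q"}, {"missing"}]
r_at_1 = recall_at_k(ranked, gold, 1)  # only the first dialog is a hit
r_at_2 = recall_at_k(ranked, gold, 2)  # first and third dialogs are hits
```

Because `gold` holds a set per dialog, the same function also covers cases with several correct candidates, such as paraphrases.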


This shows that there is a lot of room for algorithmic improvement, and that applying known deep learning techniques to the dialog problem does not guarantee good results. Further, we observed that the training and test data of previous versions of the Ubuntu Dialog Corpus were designed so that utterances were predicted not only for the response turns but also for the inquiring user’s turns. Our intuition is that inquiring user turns are typically simpler (for example, indicating that the user tried something without success), so those turns are easier to learn to predict than responses that lead to the resolution of a problem. We are running experiments to verify this intuition. In addition, previous benchmarks tested the choice of the correct answer among only 10 candidates. We therefore contend that past benchmarks do not provide a good or realistic measure of performance for end-to-end approaches like the Dual LSTM Encoder model: when the number of candidates increases and the focus is on the resolution of a problem, traditional deep learning models seem to perform rather poorly, as shown above.

Join DSTC7 Track 1

Since we released the datasets on June 1, 2018, we have seen more than 100 unique registrations and downloads of the data for Track 1. Many research organizations from industry and academia across the globe have shown interest in Track 1 as well as in the other DSTC7 Tracks.

Based on the number of registrations to the challenge so far and the limitations of previous benchmarks of end-to-end dialog methodologies, we believe we have opened up a very hot area of research in dialog systems. We encourage all interested groups to participate in the challenge and contribute innovative solutions that address the research problems posed.

As the DSTC7 competition evolves, we will provide more baselines, code, clarifications, and insights based on our research efforts. In addition, our team at IBM Research AI is developing novel algorithms for our competition system and will be reporting results soon.
