Stanford researchers introduce two datasets, CoQA and HotpotQA, that bring “reading” and “reasoning” to question answering beyond simple pattern matching

On Tuesday, Stanford University researchers introduced two recent datasets collected by the Stanford NLP Group to advance the field of machine reading. The two datasets, CoQA (Conversational Question Answering) and HotpotQA, aim to incorporate more “reading” and “reasoning” into question answering and to move beyond questions that can be answered by simple pattern matching.

CoQA approaches the problem by framing question answering as a natural dialog about a paragraph of text, so that each question comes with rich context. HotpotQA goes beyond the scope of a single paragraph and presents the challenge of reasoning over multiple documents to arrive at an answer.

Solving the task of machine reading, or question answering, is becoming an important step towards powerful and knowledgeable AI systems. Large-scale question answering datasets such as the Stanford Question Answering Dataset (SQuAD) and TriviaQA have recently driven a lot of progress in this direction. By allowing researchers to train deep learning models, these datasets have enabled good results.

What is CoQA?

Most question answering systems are limited to answering questions independently of one another. In practice, though, people more often seek information through conversations made up of a series of interconnected questions and answers. CoQA is a Conversational Question Answering dataset developed by researchers at Stanford University to address this limitation and to work towards conversational AI systems.

Features of CoQA dataset

The researchers didn’t restrict the answers to be a contiguous span in the passage, because many questions can’t be answered by a single span and such a restriction would limit the naturalness of the conversations. For example, for a question like How many times has a word been repeated?, the answer can simply be three even though the passage does not spell this out directly.
Most QA datasets focus on a single domain, which makes it difficult to test the generalization ability of existing models. The CoQA dataset is collected from seven different domains: children’s stories, literature, middle and high school English exams, news, Wikipedia, Reddit, and science.
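
To make the first feature concrete, here is a minimal sketch of a dialog whose second answer is free-form rather than copied from the passage. The passage, the questions, and the names Turn and is_contiguous_span are all invented for illustration and are not taken from the CoQA data or tooling.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    """One question-answer exchange in a conversation (hypothetical structure)."""
    question: str
    answer: str  # free-form text, not necessarily copied from the passage


def is_contiguous_span(answer: str, passage: str) -> bool:
    """Return True if the answer appears verbatim (case-insensitive) in the passage.

    This is a plain substring check, used here only as a rough stand-in
    for "the answer is a span of the passage".
    """
    return answer.lower() in passage.lower()


# Hypothetical passage and dialog, invented for illustration only.
passage = "The bell rang. Then the bell rang again. Finally the bell stopped."
turns = [
    Turn("What rang?", "the bell"),
    # "it" only makes sense given the previous turn, and the answer
    # "three" never appears in the passage text.
    Turn("How many times is it mentioned?", "three"),
]

span_flags = [is_contiguous_span(t.answer, passage) for t in turns]
for turn, flag in zip(turns, span_flags):
    print(f"{turn.question!r} -> {turn.answer!r} (span in passage: {flag})")
```

A system that can only return spans of the passage could answer the first question but not the second, which is the limitation this design choice avoids.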

The CoQA challenge, launched in August 2018, has received a great deal of attention and has become one of the most competitive benchmarks. Since the release of Google’s BERT models last November, a lot of progress has been made, which has lifted the performance of all current systems. Microsoft Research Asia’s state-of-the-art ensemble system “BERT+MMFT+ADA” achieved an F1 score of 87.5% in-domain and 85.3% out-of-domain. These numbers are now approaching human performance.
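
For readers unfamiliar with the metric, the sketch below shows a simplified word-overlap F1 between a predicted answer and a reference answer, assuming the scores above are of this general kind. It is not the official evaluation script, which may normalize text differently and compare against several reference answers; the function name token_f1 is hypothetical.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Simplified word-overlap F1 between two answer strings.

    Assumption: lowercasing and whitespace splitting are the only
    normalization steps; real evaluation scripts usually do more.
    """
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared word at most as often
    # as it occurs in both strings.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(token_f1("the white cat", "white cat"))
```

A benchmark-level score would then be an average of such per-question values over the test set.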

HotpotQA: Machine Reading over Multiple Documents

We often need to read multiple documents to find out facts about the world. For instance, one might wonder: In which state was Yahoo! founded? Does Stanford or Carnegie Mellon University have more computer science researchers? Or simply, How long do I need to run to burn the calories of a Big Mac? The web does contain the answers to many of these questions, but the content is not always in a readily available form, or even available in one place.

To answer these questions successfully, a QA system needs to find the relevant supporting facts and compare them in a meaningful way to yield the final answer. HotpotQA is a large-scale question answering (QA) dataset that contains about 113,000 question-answer pairs. These questions require QA systems to sift through large quantities of text documents to generate an answer.

While collecting the data for HotpotQA, the researchers had annotators specify the supporting sentences they used to arrive at the final answer.
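
One way such annotations could be used is to check whether a system relied on the same sentences as the annotators. The sketch below is an assumption about how that comparison might look, not a description of the HotpotQA evaluation itself: each supporting sentence is represented as a (document title, sentence index) pair, and the function name and sample data are hypothetical.

```python
def supporting_fact_scores(predicted, gold):
    """Set-based precision, recall, and F1 over supporting sentences.

    Each sentence is identified by a (document title, sentence index)
    pair; this representation is an assumption made for illustration.
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical annotations spanning two documents.
gold = {("Doc A", 0), ("Doc B", 2)}
# Hypothetical system output: one correct sentence, one incorrect.
predicted = {("Doc A", 0), ("Doc A", 1)}

scores = supporting_fact_scores(predicted, gold)
print(scores)
```

In this example the system found the supporting sentence in the first document but missed the one in the second, so it would not have had the facts needed for a multi-document comparison.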

To conclude, CoQA considers the questions that would arise in a natural dialog given a shared context, including challenging questions that require reasoning beyond one dialog turn. HotpotQA, in contrast, focuses on multi-document reasoning and challenges the research community to develop new methods for acquiring supporting information.

To learn more, check out the post by Stanford.