Facebook researchers build a persona-based dialog dataset with 5M personas to train end-to-end dialogue systems


Facebook researchers have collected and compiled a new dataset providing 5 million personas and 700 million persona-based dialogues. The aim is to improve the performance of their end-to-end dialogue systems by training them with personas, which is expected to make conversations between humans and computer agents more engaging.

End-to-end dialogue systems are mostly based on neural architectures such as bidirectional LSTMs or Memory Networks. They are trained directly by gradient descent on dialogue logs and have shown promising performance in multiple contexts. A major advantage is that they can learn various domains from large collections of existing dialogues, without any expert knowledge. However, these dialogue systems tend to show limited engagement and a lack of consistency.

To address this issue, a team of researchers at the Montreal Institute for Learning Algorithms (MILA) and Facebook AI introduced the PERSONA-CHAT dataset. This dataset comprises dialogues between pairs of agents, each of which has a text profile, or persona, attached to it.

This paves the way for end-to-end personalized chatbots, since a bot's persona is a short text that most users could easily edit.

“However, the PERSONA-CHAT dataset was created using an artificial data collection mechanism based on Mechanical Turk. As a result, neither dialogs nor personas can be fully representative of real user-bot interactions and the dataset coverage remains limited, containing a bit more than 1k different personas,” reads the research paper.

So, the researchers have built another, large-scale persona-based dialogue dataset using conversations that were previously extracted from the REDDIT dataset. “With simple heuristics, we create a corpus of over 5 million personas spanning more than 700 million conversations. We train persona-based end-to-end dialogue models on this dataset,” mention the researchers in the paper.
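The article does not spell out which heuristics the researchers used, so the following is only an illustrative sketch of what a simple persona-extraction rule could look like. The function name `extract_persona`, the first-person filter, and the word-count limits are all assumptions made for this example, not the method from the paper.

```python
import re
from typing import List


def extract_persona(comments: List[str], min_words: int = 4, max_words: int = 20) -> List[str]:
    """Keep short first-person sentences from one user's comments.

    Hypothetical rule for illustration only; the thresholds are assumed.
    """
    persona: List[str] = []
    for comment in comments:
        # Naive sentence split: break after ., ! or ? followed by whitespace.
        for sentence in re.split(r"(?<=[.!?])\s+", comment.strip()):
            words = sentence.split()
            # A sentence qualifies if it contains the word "I" or "my".
            has_first_person = any(w.lower().strip(".,!?") in {"i", "my"} for w in words)
            within_length = min_words <= len(words) <= max_words
            if has_first_person and within_length and sentence not in persona:
                persona.append(sentence)
    return persona


comments = [
    "I play football every weekend. That match was great!",
    "My job keeps me busy all week.",
]
persona = extract_persona(comments)
```

Here `persona` ends up holding the two first-person sentences, while the sentence about the match is dropped because it says nothing about the speaker. A real pipeline would likely need stricter filtering and a cap on the number of persona sentences per user.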


The goal is to learn to predict responses conditioned on a persona, across a wide range of personas. Each example in the dataset built from Reddit data takes the following form:

  • Persona: [“I like sport”, “I work a lot”]
  • Context: “I love running.”
  • Response: “Me too! But only on weekends.”


Figure: Persona-based network architecture

The persona is a set of sentences representing the personality of the responding agent, the context is the utterance that the agent responds to, and the response is the answer to be predicted. The researchers then trained persona-based end-to-end dialogue systems on their newly built dataset.
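As a rough illustration, the persona, context, and response triple described above could be represented as follows. This is a minimal sketch: the `PersonaExample` class, the `build_model_input` helper, and the way the persona and context are joined into one string are assumptions for this example, not the input format used in the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PersonaExample:
    """One training example: persona sentences, context utterance, target response."""
    persona: List[str]
    context: str
    response: str


def build_model_input(example: PersonaExample, sep: str = " | ") -> str:
    """Join persona sentences and context into a single input string (hypothetical format)."""
    persona_text = " ".join(example.persona)
    return f"persona: {persona_text}{sep}context: {example.context}"


# The example from the article, expressed with the structure above.
example = PersonaExample(
    persona=["I like sport", "I work a lot"],
    context="I love running.",
    response="Me too! But only on weekends.",
)
model_input = build_model_input(example)
```

A model would take something like `model_input` as its conditioning input and be trained to produce or rank `example.response` as the target.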

Systems that were trained on this dataset outperformed other conversational agents (which were not trained using personas) and held far more engaging conversations.

“As pretraining leads to a considerable improvement in performance, future work could be done fine-tuning this model for various dialog systems. Future work may also entail building more advanced strategies to select a limited number of personas for each user while maximizing the prediction performance,” say researchers in the paper.

For more details, check out the official research paper.
