
In today's tutorial we will learn to build a generative chatbot using recurrent neural networks. The RNN variant used here is Long Short-Term Memory (LSTM).

Generative chatbots are very difficult to build and operate. Even today, most workable chatbots are retrieval-based in nature: they retrieve the best response for a given question based on semantic similarity, intent, and so on. For further reading, refer to the paper Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation by Kyunghyun Cho et al. (https://arxiv.org/pdf/1406.1078.pdf).

Note: This article is an excerpt from a book written by Krishna Bhavsar, Naresh Kumar, and Pratap Dangeti, titled Natural Language Processing with Python Cookbook. In this book you will come across various recipes covering natural language understanding, natural language processing, and syntactic analysis.

Getting ready…

The bot.aiml file from the A.L.I.C.E. Artificial Intelligence Foundation dataset has been used to train the model. Artificial Intelligence Markup Language (AIML) is a customized, XML-like syntax in which questions and answers are mapped: for each question, there is a particular answer. The complete set of .aiml files is available as aiml-en-us-foundation-alice.v1-9 from https://code.google.com/archive/p/aiml-en-us-foundation-alice/downloads. Unzip the folder to see the bot.aiml file, open it using Notepad, and save it as bot.txt to read in Python:

>>> import os

""" First change the following directory link to where all input files do exist """

>>> os.chdir(r"C:\Users\prata\Documents\book_codes\NLP_DL")

>>> import numpy as np

>>> import pandas as pd

# File reading

>>> with open('bot.txt', 'r') as content_file:

...     botdata = content_file.read()

>>> Questions = []

>>> Answers = []

AIML files have a unique syntax, similar to XML. The pattern tag is used to represent the question and the template tag the answer.
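For illustration, a typical entry looks roughly like the following (a hypothetical example, not copied verbatim from bot.aiml):

<category>
<pattern>WHAT IS YOUR NAME</pattern>
<template>My name is ALICE.</template>
</category>

Hence, we extract the text between these tags for the questions and answers respectively: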

>>> for line in botdata.split("</pattern>"):

...     if "<pattern>" in line:

...         Quesn = line[line.find("<pattern>")+len("<pattern>"):]

...         Questions.append(Quesn.lower())

>>> for line in botdata.split("</template>"):

...     if "<template>" in line:

...         Ans = line[line.find("<template>")+len("<template>"):]

...         Ans = Ans.lower()

...         Answers.append(Ans)

>>> QnAdata = pd.DataFrame(np.column_stack([Questions,Answers]),columns = ["Questions","Answers"])

>>> QnAdata["QnAcomb"] = QnAdata["Questions"]+" "+QnAdata["Answers"]

>>> print(QnAdata.head())

The questions and answers are joined to extract the total vocabulary used in the modeling, as we need to convert all words/characters into a numeric representation. The reason is the same as mentioned before: deep learning models cannot read English, so everything must be presented to the model as numbers.

How to do it…

After extracting the question-and-answer pairs, the following steps are needed to process the data and produce the results:

  1. Preprocessing: Convert the question-and-answer pairs into vectorized format, which will be utilized in model training.
  2. Model building and validation: Develop the deep learning model and validate it.
  3. Prediction of answers from trained model: The trained model will be used to predict answers for given questions.

How it works…

The questions and answers are used to build the word-to-index vocabulary mapping, which will be utilized for converting words into vector representations:

# Creating Vocabulary

>>> import nltk

>>> import collections

>>> counter = collections.Counter()

>>> for i in range(len(QnAdata)):

...     for word in nltk.word_tokenize(QnAdata.iloc[i][2]):

...         counter[word]+=1

>>> word2idx = {w:(i+1) for i,(w,_) in enumerate(counter.most_common())}

>>> idx2word = {v:k for k,v in word2idx.items()}

>>> idx2word[0] = "PAD"

>>> vocab_size = len(word2idx)+1

>>> print (vocab_size)

[Screenshot: printed vocabulary size]

Encoding and decoding functions are used to convert text to indices and indices back to text respectively. As we know, deep learning models work on numeric values rather than text or character data:

>>> def encode(sentence, maxlen, vocab_size):

...     indices = np.zeros((maxlen, vocab_size))

...     for i, w in enumerate(nltk.word_tokenize(sentence)):

...         if i == maxlen: break

...         indices[i, word2idx[w]] = 1

...     return indices

>>> def decode(indices, calc_argmax=True):

...     if calc_argmax:

...         indices = np.argmax(indices, axis=-1)

...     return ' '.join(idx2word[x] for x in indices)
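As a quick sanity check (assuming the sample words appear in the vocabulary built above), you can round-trip a short sentence through these functions:

>>> sample = encode("what is your name", maxlen=5, vocab_size=vocab_size)

>>> print(sample.shape)  # (5, vocab_size): one one-hot row per token

>>> print(decode(sample))  # "what is your name PAD"; all-zero padding rows decode to index 0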

The following code is used to vectorize the questions and answers with the given maximum lengths. The two lengths may differ: in some records the question is longer than the answer, and in others it is shorter. Ideally, questions long enough to carry a clear signal make it easier to predict the right answers. Unfortunately, in this data the question length is much less than the answer length, which makes it a difficult dataset for developing generative models. You can verify this skew with the quick check shown below.
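The following diagnostic (not part of the original recipe) inspects the token-length distributions:

>>> q_lens = QnAdata["Questions"].apply(lambda s: len(nltk.word_tokenize(s)))

>>> a_lens = QnAdata["Answers"].apply(lambda s: len(nltk.word_tokenize(s)))

>>> print(q_lens.describe())

>>> print(a_lens.describe())

The maximum lengths are then fixed as follows; sequences longer than these limits are truncated and shorter ones are zero-padded: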

>>> question_maxlen = 10

>>> answer_maxlen = 20

>>> def create_questions(question_maxlen,vocab_size):

...     question_idx = np.zeros(shape=(len(Questions),question_maxlen, vocab_size))

...     for q in range(len(Questions)):

...         question = encode(Questions[q],question_maxlen,vocab_size)

...         question_idx[q] = question

...     return question_idx

>>> quesns_train = create_questions(question_maxlen=question_maxlen, vocab_size=vocab_size)

>>> def create_answers(answer_maxlen,vocab_size):

...     answer_idx = np.zeros(shape=(len(Answers),answer_maxlen, vocab_size))

...     for q in range(len(Answers)):

...         answer = encode(Answers[q],answer_maxlen,vocab_size)

...         answer_idx[q] = answer

...     return answer_idx

>>> answs_train = create_answers(answer_maxlen=answer_maxlen, vocab_size=vocab_size)
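To confirm the vectorized array shapes (a quick check, not in the original recipe):

>>> print(quesns_train.shape)  # (number of pairs, question_maxlen, vocab_size)

>>> print(answs_train.shape)  # (number of pairs, answer_maxlen, vocab_size)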

>>> from keras.layers import Input,Dense,Dropout,Activation

>>> from keras.models import Model

>>> from keras.layers.recurrent import LSTM

>>> from keras.layers.wrappers import Bidirectional

>>> from keras.layers import RepeatVector, TimeDistributed, ActivityRegularization

The following code is an important part of the chatbot. Here we have used recurrent networks, a repeat vector, and time-distributed layers. The repeat vector is used to match the input dimensions to the output: the encoder LSTM compresses each question into a single vector, which RepeatVector copies once per answer time step. The time-distributed dense layer then maps each time step's vector to the output dimension's vocabulary size:

>>> n_hidden = 128

>>> question_layer = Input(shape=(question_maxlen,vocab_size))

>>> encoder_rnn = LSTM(n_hidden, dropout=0.2, recurrent_dropout=0.2)(question_layer)

>>> repeat_encode = RepeatVector(answer_maxlen)(encoder_rnn)

>>> dense_layer = TimeDistributed(Dense(vocab_size))(repeat_encode)

>>> regularized_layer = ActivityRegularization(l2=1)(dense_layer)

>>> softmax_layer = Activation('softmax')(regularized_layer)

>>> model = Model([question_layer],[softmax_layer])

>>> model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

>>> print (model.summary())

The following model summary describes how the tensor shapes change as data flows through the model. The input layer matches the question's dimensions and the output matches the answer's dimensions:

[Screenshot: model summary showing layer types and output shapes]
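For reference, with n_hidden=128, question_maxlen=10, and answer_maxlen=20 (vocab_size depends on your data), the shapes should flow roughly as follows:

Input (None, 10, vocab_size) -> LSTM (None, 128) -> RepeatVector (None, 20, 128) -> TimeDistributed(Dense) (None, 20, vocab_size) -> ActivityRegularization/Activation (None, 20, vocab_size)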

# Model Training

>>> quesns_train_2 = quesns_train.astype('float32')

>>> answs_train_2 = answs_train.astype('float32')

>>> model.fit(quesns_train_2, answs_train_2,batch_size=32,epochs=30, validation_split=0.05)

The results in the following screenshot are a bit misleading: even though the accuracy appears significantly high, the chatbot model might still produce complete nonsense, because most of the target words are padding. The reason is that the number of real words in this data is small relative to the padded sequence length:

[Screenshot: training log of loss and accuracy over 30 epochs]
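You can quantify how much padding dominates the targets (a quick diagnostic, not part of the original recipe; index 0 is the PAD token):

>>> pad_fraction = (answs_train_2.argmax(axis=-1) == 0).mean()

>>> print(pad_fraction)  # fraction of target time steps that are PAD

A model that predicts only PAD tokens would already achieve roughly this accuracy, which is why the reported numbers look better than the generated text.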

# Model prediction

>>> ans_pred = model.predict(quesns_train_2[0:3])

>>> print (decode(ans_pred[0]))

>>> print (decode(ans_pred[1]))

The following screenshot depicts sample output for the first few training questions. The output does not seem to make sense, which is a common issue with generative models:

[Screenshot: decoded sample predictions]

Our model did not work well in this case, but some areas of improvement are still possible going forward with generative chatbot models. Readers can give these a try:

  1. Have a dataset with lengthy questions and answers, so that the model can catch the signals well.
  2. Create a larger deep learning architecture and train over longer iterations (more epochs).
  3. Make the question-and-answer pairs more generic rather than factoid-based (such as knowledge retrieval and so on), where generative models fail miserably.

Here, you saw how to build a generative chatbot using LSTMs. You can go ahead and try building one of your own using the example above.

If you found this post useful, do check out this book Natural Language Processing with Python Cookbook to efficiently use NLTK and implement text classification, identify parts of speech, tag words, and more.


