Norrköping 2020-11-13
Copyright
The publishers will keep this document online on the Internet - or its possible
replacement - for a considerable time from the date of publication barring
exceptional circumstances.
The online availability of the document implies a permanent permission for
anyone to read, to download, to print out single copies for his or her own use and to use it unchanged for any non-commercial research and educational purpose.
Subsequent transfers of copyright cannot revoke this permission. All other uses
of the document are conditional on the consent of the copyright owner. The
publisher has taken technical and administrative measures to assure authenticity,
security and accessibility.
According to intellectual property law the author has the right to be
mentioned when his/her work is accessed as described above and to be protected
against infringement.
For additional information about the Linköping University Electronic Press
and its procedures for publication and for assurance of document integrity,
please refer to its WWW home page: http://www.ep.liu.se/
Abstract
The aim of the study is to contribute to research in the field of maintaining long-term engagingness in chatbots, which is done through rapport building with the help of user- and agent-specific memory. Recent advances in end-to-end trained neural conversational models (fully functional chit-chat chatbots created by training a neural model) present chatbots that converse well with respect to context understanding with the help of their short-term memory. However, these chatbots do not consider long-term memory, which in turn motivates further research.
In this study, short-term memory is developed to allow the chatbot to understand con-
text, such as context-based follow-up questions. Long-term memory is developed to re-
member information between multiple interactions, such as information about the user
and the agent’s own persona/personality. By introducing long-term memory, the chatbot
is able to generate long-term memory-based questions, and to refer to the previous conver-
sation, as well as retain a consistent persona.
A question answering chatbot and a question asking chatbot were initially developed in parallel as individual components and finally integrated into one chatbot system. The question answering chatbot was built in Python and consisted of three main components: a generative model using GPT-2, a template structure with a related sentiment memory, and a retrieval structure. The question asking chatbot was built using a framework called Rasa.
User tests were performed primarily to measure perceived engagingness and realness. The aim of the user studies was to compare performance between three chatbots: a) the individual question asking chatbot, b) the individual question answering chatbot and c) the integrated one. The results show that chatbots perceived as more human-like are not necessarily more engaging conversational partners than chatbots with lower perceived human-likeness. While still not near human-level performance on measures such as consistency and engagingness, the developed chatbots achieved scores on these measures similar to those of chatbots in a related task (the Persona-Chat task in ConvAI2). When measuring the effects of long-term memory in question asking, it was found that measures of perceived realness and persona increased when the chatbot asked questions generated from long-term memory, referring to the previous interaction with the user.
Acknowledgments
First of all, we would like to thank Dirk Heylen and Mariët Theune for welcoming us and giving us the opportunity to perform this research at the Human Media Interaction group at the
University of Twente. Thank you Mariët Theune and Jelte van Waterschoot for supervising
and providing feedback and ideas throughout the entire project. Special thanks to Jelte van
Waterschoot for introducing us to relevant tools and frameworks and for suggesting relevant
literature. Furthermore, thanks to Elmira Zohrevandi for taking on the role as our internal
supervisor at Linköping University, and for providing helpful feedback and literature.
Contents
Abstract
Acknowledgments
Contents
1 Introduction
1.1 Motivation
1.2 Purpose
1.3 Research Questions
1.4 Delimitations
1.5 Thesis Structure
2.6.1 Rasa NLU
2.6.1.1 Tokenization
2.6.1.2 Featurization
2.6.1.3 Entity Recognition, Intent Classification and Response Selector
2.6.2 Rasa Core
2.6.2.1 Story
2.6.2.2 Domain
2.6.2.3 Slot
2.6.2.4 Response
2.6.3 Rasa X
2.7 VADER Sentiment Analysis
2.8 Semantic Network
2.9 Conclusion
5 Development of a Question Asking Chatbot
5.1 Types of Question
5.2 Rasa Framework
5.2.1 A Basic Chatbot
5.3 Rasa NLU
5.3.1 Tokenization
5.3.2 Featurization
5.3.3 Entity Recognition, Intent Classification and Response Selector
5.4 VADER Sentiment Analysis
5.5 Semantic Network
5.6 Rasa Core
5.6.1 Story
5.6.2 Domain
5.6.3 Slot
5.6.4 Response
5.6.5 Action
5.7 Rasa X
5.8 Architecture
5.9 User test - Question Asking Chatbot
5.9.1 Survey
5.9.2 Results
6 Integration
6.1 API Development
6.2 Integrated System Rules and Conversation Flow
6.3 Final User Test
8 Discussion
8.1 Results
8.1.1 User Test - Question Answering Chatbot
8.1.2 User Test - Question Asking Chatbot
8.1.3 Final User Test
8.1.4 Comparisons of the Three Chatbots
8.2 Method
8.2.1 Question Answering Agent
8.2.2 Question Asking Agent
8.2.3 Integration
8.3 The Work in a Wider Context
8.4 Future Work
9 Conclusion
Bibliography
A Appendix
A.1 GPT-2 Text Generation
A.2 Fine-Tuned Generative Models and Their Answers
A.3 GPT-2 Models Experiment
A.4 GTKY - Most Common Nouns
A.5 Templates
A.6 Front-End - Question Answering Agent
A.7 Question Answering User Test Interactions
A.8 Questions for Evaluation
A.9 Final User Test Interactions
List of Figures
2.1 Word embedding similarity matrix using spaCy's tokenizer and similarity function.
2.2 Transfer learning example where the general task model's knowledge can be transferred to the sub-task model
2.3 XLNet generating 59 words in 27.669 seconds.
2.4 GPT-2 (124M) generating 37 words in 1.270 seconds
2.5 Illustration of possible types of information that Mitsuku can remember from conversations
2.6 A story flow diagram visualized in Rasa X.
2.7 Results from the sentiment analyser for different user inputs.
2.8 Some relations provided by ConceptNet.
3.1 System flow diagram describing how input data flows through the architecture of the chatbot and generates output
3.2 Plotting distortion over the number of clusters.
3.3 Feedback using "loading dots"
7.1 Box plot of the three chatbots' engagingness in the first session. The whiskers are at a distance of 1.5 interquartile range length (IQR).
7.2 Box plot of the three chatbots' realness in the first session. The whiskers are at a distance of 1.5 interquartile range length (IQR).
A.1 Front page with instructions and terms for the question answering user test.
A.2 Chatbot interaction page, illustrating retrieval based memory.
A.3 Survey page for the question answering user test.
List of Tables
6.1 Mean and standard deviation of the survey results from the final user test, which used a 5-point Likert scale. Session 1 (All) represents a group of 14 participants and the remaining columns represent a group of 5 that completed both sessions.
6.2 Results of the final user test on grammar, conversation flow and message quality on a 10-point Likert scale.
7.1 Comparing survey results of the first session of different chatbots by looking at the mean values supplemented with the standard deviation.
7.2 Resulting p-values of ANOVA tests on the chatbots' engagingness and realness scores. The third column (far-right) excludes the 'All' group of the Question Answering chatbot.
7.3 Comparing results of shared measures between the integrated chatbot and the question asking chatbot, illustrating for each question the mean value supplemented with the standard deviation. Additionally, the resulting p-values of ANOVA tests are presented.
7.4 Comparing the agents in this thesis with models from the Persona-Chat paper [zhang2018personalizing] on fluency, engagingness and consistency.
7.5 Comparing engagingness of the agents in this thesis with the top 3 agents presented in ConvAI2 as well as a human reference [dinan2020second].
A.1 Comparison of training time between different GPT-2 models
A.2 Comparison of generation time between different GPT-2 models
A.3 Top 87 (84+3) most frequently occurring nouns in extracted questions from the GTKY dataset.
A.4 Template questions developed after analyzing the GTKY dataset, with the addition of two questions after a user test
A.5 Template answers that are used together with a sentiment memory, to answer what things the agent likes or dislikes.
A.6 Questions generated by a model accidentally trained on question asking, that were then used for evaluating answer relevancy of fine-tuned generative models.
1 Introduction
The first chatbot (or conversational agent) was developed in the 1960s. It simulated a psychologist by following a set of rules to generate questions from the user's original message [49]. Today, several personal assistants have been developed and are commercially available, such as Amazon's Alexa or Apple's Siri. These virtual assistants are able to perform certain tasks such as playing music, setting an alarm or answering (factual) questions from the user. However, dealing with the social and creative aspects of human language is still challenging. Most of the time, the assistants are unable to hold meaningful conversations with a person, and unable to maintain engagingness over long-term interactions. Part of this problem may be due to the agent's limited or non-existent memory, resulting in the agent being unable to understand context-dependent messages from the user. Moreover, building a relationship with the user is challenging due to the lack of user-specific memory.
Even though conversational agents are important and highly applicable to our society,
commercially the focus has been more on virtual assistants. Virtual assistants are task-
oriented agents whereas conversational agents are non-task-oriented. The difference is that
task-oriented agents are designed to be interacted with briefly to fulfill a task (e.g. book a
train ticket) whereas non-task-oriented agents are designed for extended conversations. A
few example applications for conversational agents are:
• Educational agents.
Three existing chatbot examples are Insomnobot3000, Endurance ChatBot and Replika. In-
somnobot3000 is an agent that people with insomnia can text with in the middle of the night
when none of their conversational partners may be awake1 . Endurance ChatBot is an open-
source work in progress, where the aim is to create a robot-companion for people who suffer
from Alzheimer’s disease2 . Replika "The AI companion who cares", is an agent that provides
emotional support, and is always there for people who feel lonely or sad or just want some-
one to talk with3 . For these types of agents to be able to help the target group, it is vital that
the users stay engaged long enough for the agents to help them, which illustrates the impor-
tance of research in the area of long-term engagingness for conversational agents. For the
case of Endurance ChatBot, it is especially important that the chatbot has memory, so that it can evaluate the user's health status.
1 http://insomnobot3000.com/
2 http://endurancerobots.com/azbnmaterial/a-robot-companion-for-senior-people-and-patients-with-alzheimer-s-disease/
3 https://replika.ai/
The chatbots of today generate messages to users by either following a set of rules to
return a template based on the recognized sentence pattern (rule-based), or by being trained
on a large amount of (human-to-human) conversational data (corpus-based). The rule-based
approach can be considered the traditional way of creating conversational agents; notably, the most recent winner of the Loebner Prize was the rule-based chatbot "Mitsuku", for the fourth year in a row4. The Loebner Prize is a contest where developers test their bots against the Turing Test, and the developer with the most human-like bot wins. Rule-based chatbots, however, require the creation of hundreds of templates to handle a broad range of inputs, which in turn costs time and money. Another popular method is to use a retrieval database, where a database of candidate replies is kept and queried. The retrieval method works well when the user's input is similar enough to entries in the database, as it may retrieve related, high-quality, and potentially engaging replies. When the user's input is not similar to any entry in the database, the flaws of the method are observed: the agent will reply with something seemingly random, change topic, or give a safe reply like "I don't know". Both methods suffer from repetitiveness, as there is only a limited number of available replies at any given time. The third method is to train neural models to generate responses. Training a model with the help of machine learning is an attractive solution, as it is cheaper than creating hundreds or thousands of templates and, compared to the retrieval structure, it is able to generate an answer to any user input. However, generative models instead suffer from inconsistency as well as a tendency to produce non-specific answers such as "I don't know".
Inconsistency example, where PERSON 1 is a human and PERSON 2 is a chatbot [56]:
[PERSON 2:] I love to cook and bake but I rarely make time for it
[PERSON 1:] what do you usually bake?
[PERSON 2:] I do not cook, I bake cakes at a local restaurant. how about you?
The mentioned methods have their respective pros and cons, and therefore this thesis aims to develop a chatbot that combines rule-based and corpus-based methods. With rules and templates, a memory structure will be introduced and used consistently. The corpus-based approach will support the templates by handling any input that is out of scope for them. Memory will be used for two main tasks in this thesis. By storing memory about the user, the agent will be able to re-use, at a later time, information provided by the user in past interactions to generate personal questions. By storing memory about the agent, the user will be able to ask questions and get to know the agent. Both of these tasks are researched for the purpose of contributing towards the field and the future development of personal social chatbots which are capable of long-term interaction; chatbots that users may develop a personal bond with.
1.1 Motivation
The social chatbots of today usually have short-term memory (memory for the current in-
teraction) but have limited (if any) long-term memory (memory over several interactions).
Without short-term memory it becomes difficult for the agents to understand context de-
pendent user utterances, such as follow-up questions or utterances referring to “he/she/it”.
4 https://aisb.org.uk/mitsuku-wins-2019-loebner-prize-and-best-overall-chatbot-at-aisb-x/
Without long-term memory it becomes difficult for the agent to build rapport with users. To
build rapport it is necessary to have a personal background, remember information about the
user, and then to find common ground such as shared interests.
The project was carried out at the Human Media Interaction (HMI) research group at the
University of Twente. HMI partly performs research in the area of dialogue and natural lan-
guage understanding and generation. The research group is specifically interested in research
aimed at social and creative applications, such as conversational agents or social chatbots.
1.2 Purpose
The purpose of this thesis is to investigate how long-term and short-term memory can be used in a chatbot to simulate a consistent persona for question answering and to enable long-term question asking via user modeling. This is done by implementing and testing a chatbot with user-specific and agent-specific memories, where long-term memory data is mainly used through rule-based methods, such as templates, and short-term memory is used in a generative model.
1.3 Research Questions
Development
1. How can the user and agent specific memory models be developed to extract and store
information from user utterances, and apply them to agent utterances?
2. How can short-term memory be developed such that the agent can understand and
provide a relevant answer to user utterances, such as follow-up questions, from the
context of the conversation history?
User testing
1. How is the user’s experience with an agent affected when the agent generates questions
which are personalized based on the user’s utterances from previous interactions?
2. How is the user’s perception of the agent affected by the agent having a personal back-
ground, which allows for self-disclosure responses?
3. To what extent would the user’s perception of the agent be affected if the user could ask
the agent self-disclosure questions?
1.4 Delimitations
The conversational agent will be developed as a non-task-oriented system. That is, it will not be a question-answering (QA) bot that, for example, searches online for factual answers in order to reply to the user, as Siri does, nor is it intended to perform tasks such as scheduling or setting reminders and alarms. The chatbot will be developed primarily to investigate what effects memory has on users for question asking and question answering. As such, the final chatbot is not intended to be a stand-alone product that can be deployed outside the scope of the user tests in the thesis. The chatbot will consist of two components: a question asking component and a question answering component.
The chatbot will focus solely on text input and output. It is primarily targeted at communication applications, such as Messenger, Discord and WhatsApp, and at websites in general, where text exchanges take place. Therefore, features related to speech will not be taken into consideration; they are left as potential future work.
It is expected that each user message is limited to one intent. It is also expected that users write proper, grammatically correct English, although minor typographical errors are tolerated.
The user and agent memory structure will be restricted to storing information on a limited
range of general topics, such as sports, hobbies, movies, etc. The structure aims to retain
memory about like/dislike preferences, e.g. that the agent likes apples but dislikes oranges, or that the user's favorite animal is the cat.
2 Background and Related Work
This chapter provides theory for key concepts relevant to this thesis, such as the different types of chatbots (task-oriented and non-task-oriented) and the different methods used to create chatbots, such as corpus-based and rule-based approaches. Related work is investigated to find which
methods have been tested, and which methods contribute toward the development of engag-
ing social chatbots. Additionally, the chapter provides an overview of existing methods for
evaluating the performance of chatbots.
query and existing information in the database. The responses, however, are limited to the
range of data and topics that exist in the database.
Different Machine Learning models have been developed for the task of text generation
by training on large amounts of text data (generative models). The trained models can then
generate an output to any input by the user, although the quality of the response is usually
lower compared to IR systems. It is an attractive solution as it allows for the creation of
chatbots simply by training a model on a dataset.
Figure 2.1: Word embedding similarity matrix using spaCy's tokenizer and similarity function.
The closer the words are in the vector space, the higher the cosine similarity. The range of similarity is between 0.0 and 1.0, or sometimes from -1.0 to 1.0, where a value close to 1.0 means that the texts are identical or very similar. Given the word embeddings of the words king, queen, cat and dog, the embedding of king will be closer to the embedding of queen than to those of cat or dog, see figure 2.1. Similarly, the embedding of cat will be closer to the embedding of dog.
Extending the concept of word embeddings for words into sentence embeddings for sen-
tences, it is possible to calculate the cosine similarity of two sentences. An example of a
model that produces sentence embeddings is the Universal Sentence Encoder [6] developed
at Google. Finding the similarity between sentences may be useful as part of clustering (3.4),
answer ranking (4.4.4) or similar applications. As such, word and sentence embeddings as
well as cosine similarity will be utilized throughout this thesis.
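As a concrete illustration of these similarity computations (not code from the thesis), a word-level similarity matrix like the one in figure 2.1 can be produced with spaCy, assuming a model with word vectors such as en_core_web_md is installed:

    import spacy

    # Assumes a spaCy model with word vectors, e.g.:
    #   python -m spacy download en_core_web_md
    nlp = spacy.load("en_core_web_md")

    words = ["king", "queen", "cat", "dog"]
    tokens = [nlp(word)[0] for word in words]

    # Pairwise cosine similarities between the word embeddings (cf. figure 2.1).
    for a in tokens:
        print(a.text, [round(a.similarity(b), 2) for b in tokens])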
Figure 2.2: Transfer learning example where the general task model's knowledge can be transferred to the sub-task model
labeled data (assuming supervised learning) for recognizing vehicles may be significantly
larger than a dataset that also specifies the vehicle type. As such, if the smaller dataset is used to train a model from scratch, the resulting model may not recognize vehicles as well as one trained on the larger dataset. Therefore, by first training a model on the larger dataset for the general task, and then fine-tuning the pre-trained model for the specific sub-task, the results may be better. The same concept can be applied for the task of text generation. First, a model
is pre-trained to understand and generate coherent text by using a large dataset with many
different types of human-written text, and then fine-tuned for the sub-task of dialogue [51].
The concept of fine-tuning will be utilized in this thesis: existing large pre-trained generative models, trained for text generation, will be fine-tuned on the task of question answering.
Human-Agent inconsistency
Human: what is your job ?
Machine: i ’m a lawyer .
Human: what do you do ?
Machine: i ’m a doctor .
Today, several tutorials exist that illustrate how a chatbot can easily be created by using
Seq2Seq2 . A variation of the model was also used by Microsoft as part of their chatbot, XiaoIce
[59] (2.4.4). While the Seq2Seq model has been used in recent years in chatbot applications, it will not be used in this thesis, as it has been superseded by Transformer models. At the time of writing, Transformer-type models hold state-of-the-art results on several Natural Language Processing tasks.
2.2.4 Transformer
The Transformer model was first introduced in the paper "Attention Is All You Need" [45]
in 2017, where it hit new records on the WMT’14 translation task. The model made use of
the encoder-decoder concept as seen in Seq2Seq but discarded the use of RNNs, instead employing an attention mechanism and feed-forward neural networks. The attention
mechanism allows the decoder to have access to all hidden vectors in the input, which allows
for better context understanding. At each time-step in the decoder, a prediction of the next
word output is made by using the weighted sum of all hidden vectors in the input. Addition-
ally, the model’s architecture allowed for parallelization which led to faster training times.
Since it no longer uses RNN, and computation is parallelized, positional encoding is used to
inform the model of the word order of sentence sequences.
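To make the idea of a weighted sum over all input hidden vectors concrete, the following is a minimal NumPy sketch of scaled dot-product attention, the core operation of the Transformer; it is an illustration only and omits the multi-head machinery described in [45]:

    import numpy as np

    def scaled_dot_product_attention(Q, K, V):
        """Each query attends to every input position; the output is a
        softmax-weighted sum of the value (hidden) vectors."""
        d_k = K.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)                 # query-key similarities
        scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)  # softmax over input positions
        return weights @ V                              # weighted sum of values

    # Toy example: 3 input positions, 1 decoder query, dimension 4.
    rng = np.random.default_rng(0)
    K = V = rng.normal(size=(3, 4))
    Q = rng.normal(size=(1, 4))
    print(scaled_dot_product_attention(Q, K, V).shape)  # (1, 4)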
Since it was first introduced, the architecture has been used in several new models, most notably Bidirectional Encoder Representations from Transformers (BERT) [11], and Gen-
erative Pretrained Transformer 2 (GPT-2). While the original transformer used an encoder-
decoder structure, the BERT model only consisted of encoders3 , while GPT-2 only consisted
of decoders4 . The architecture has led to several state-of-the-art models in the field of nat-
ural language processing, some of which score higher than their respective human baseline.
The NLP benchmarks GLUE5 , SuperGLUE6 , SQuAD7 , CoQA 8 and QuAC9 are, at the time
of writing, dominated by transformer models, most commonly with variations of BERT such
as ALBERT and RoBERTa.
Transformer models can be abstracted into three groups, Sequence-to-Sequence, auto-
regressive, and auto-encoding models. Sequence-to-Sequence models, such as the original
Transformer, consist of an encoder and a decoder part. The natural application is translation.
Auto-regressive models, such as GPT-2, are pre-trained to predict the next token (word),
given the previous inputs. They correspond to the decoder part of the original Transformer
and the natural application is text generation. Auto-encoding models, such as BERT, are
pre-trained by first masking input tokens to then attempt to reconstruct the original input.
They correspond to the encoder part of the original Transformer and the natural application
is sentence classification or token classification. All three groups of models can be trained for
various NLP tasks depending on their characteristics.
The Transformers library by Hugging Face contains several pre-trained state-of-the-art
Transformer models10 , such as GPT-2, BERT [11], Text-To-Text Transfer Transformer (T5) [33],
2 https://pytorch.org/tutorials/beginner/chatbot_tutorial.html
3 http://jalammar.github.io/illustrated-bert/
4 http://jalammar.github.io/illustrated-gpt2/
5 https://gluebenchmark.com/leaderboard
6 https://super.gluebenchmark.com/leaderboard
7 https://rajpurkar.github.io/SQuAD-explorer/
8 https://stanfordnlp.github.io/coqa/
9 https://quac.ai/
10 https://huggingface.co/transformers/model_summary.html
etc. As such, the library enables comparison of different models. Hugging Face also hosts an
online interactive demo of generative transformers11 .
This thesis will consider auto-regressive models (GPT-2) for text generation due to the
availability of models which have been pre-trained with the text generative task in mind.
2.2.4.1 GPT-2
Generative Pretrained Transformer 2 (GPT-2) was released as a pre-trained transformer
model in 2019 by OpenAI, following the previous model under the same name (GPT). The
model achieved state of the art results on 7 language modeling datasets [32]. The main dif-
ference from the previous model is that the new model comes in different sizes and is trained
on a larger dataset. The models released by OpenAI were pre-trained on 40GB of internet
texts [32]. The different versions are ‘124M’, ‘355M’, ‘774M’ and ‘1558M’ [39]. The names represent the size of the models: ‘124M’ consists of 124 million parameters whereas ‘1558M’
consists of 1.5 billion parameters. The models generally generate “better” texts as the number
of parameters increases, as observed in a study by [39] on human detection of generated news
articles. The models were given a credibility score in the range 1-10, where the score is com-
posed of independent clarity, accuracy and believability scores. The 355M model had a mean
credibility score of 6.07 while the 774M model scored 6.72 and the 1558M model scored 6.91.
Although the score increases with the size of the models, the improvement from the 355M
model to the 774M model is more significant than the improvement from the 774M model to
the 1558M model in terms of the credibility score. The larger the model is, the longer it takes
to fine-tune (2.2.2), and the longer it takes to generate text once the model is fine-tuned (A.3).
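As an illustration of how such a pre-trained model can be loaded and used for generation, the following is a minimal sketch using the Hugging Face Transformers pipeline API with the 124M model (the prompt is arbitrary; this is not the fine-tuned setup developed later in the thesis):

    from transformers import pipeline

    # Downloads the pre-trained 124M GPT-2 model on first use.
    generator = pipeline("text-generation", model="gpt2")

    prompt = "I love to cook and bake, but"
    output = generator(prompt, max_length=40, num_return_sequences=1)
    print(output[0]["generated_text"])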
Following are three short, cherry-picked examples of text generated by the largest GPT-2
model, given an input prompt. One longer example is provided in the appendix A.1.
2.2.4.2 Distillation
As language models, especially Transformer models, were becoming larger and larger, the concept of distillation was applied to create distilled versions of the large models. The method is
to down-scale the architecture into a new (student) model, where this model is trained to re-
produce the behavior of a larger (teacher) model. This resulted in smaller and faster versions
11 https://transformer.huggingface.co/
which still retain most of the larger model’s capabilities. DistilBERT [37] is an example of dis-
tillation where the BERT model’s size was reduced by 40% (from 110 million parameters to
66 million) and runs 60% faster, while retaining 97% of its language understanding capabili-
ties. The concept was also applied to the RoBERTa model and the 124M sized GPT-2 model
to create DistilRoBERTa and DistilGPT-212 . Therefore, distilled models, such as DistilGPT-2,
can be an alternative to consider for faster text generation, if the larger models take too long to generate text in a chatbot setting.
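Assuming the pipeline sketch shown earlier, swapping in the distilled model is a one-line change (generation quality is expected to be somewhat lower):

    from transformers import pipeline

    # "distilgpt2" is the distilled version of the 124M GPT-2 model.
    fast_generator = pipeline("text-generation", model="distilgpt2")
    print(fast_generator("I love to cook and bake, but", max_length=40)[0]["generated_text"])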
1. The lack of a consistent personality, due to the models being trained on data consisting
of dialogues from different speakers [46] [24].
2. Lack of explicit long-term memory, due to being trained to produce utterances given
only recent conversation history.
3. A tendency to produce vague, non-specific answers such as "I don’t know", which are
not engaging for human users [23].
12 https://github.com/huggingface/transformers/tree/master/examples/distillation
13 https://huggingface.co/xlnet-base-cased
14 http://convai.io/2018/
The competition contained automatic and human evaluations. Human evaluation was
through crowdworking, following a similar method as in the creation of the persona-chat
dataset. The human users interacted with the agent for 4-6 dialogue turns, following a ran-
dom persona provided to them. The users were then asked to answer the question "How
much did you enjoy talking to this user?" on a scale of 1-4. The users were also tested on whether they could distinguish the persona the model was using from a random one. Automatic eval-
uation was through three metrics:
1. Perplexity - a measure of how well the model predicts the gold response.
2. F1 score - applied as a metric for word-overlap between generated response and gold response (a small sketch of this measure follows the list).
3. Accuracy of the next utterance selection when given random distractor responses mixed
with the gold response (accuracy of ranking).
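The word-overlap F1 measure can be illustrated with a small sketch (the exact tokenization and normalization used in the ConvAI2 evaluation code may differ):

    from collections import Counter

    def f1_word_overlap(generated: str, gold: str) -> float:
        """Token-level F1 between a generated response and the gold response."""
        gen_tokens = generated.lower().split()
        gold_tokens = gold.lower().split()
        overlap = sum((Counter(gen_tokens) & Counter(gold_tokens)).values())
        if overlap == 0:
            return 0.0
        precision = overlap / len(gen_tokens)
        recall = overlap / len(gold_tokens)
        return 2 * precision * recall / (precision + recall)

    print(f1_word_overlap("i like to bake cakes", "i love to cook and bake"))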
Hugging Face had the best performing agent (2.4.3) on the automatic evaluation and came
second in the human evaluation, with an engagingness score of 2.67. The winner "Lost In
Conversation" achieved a score of 3.11. Analysis was conducted to find how the highest
scoring model from the automatic evaluation was bested in the human evaluation. The con-
clusion was that Hugging Face’s model tended to ask too many questions, which disrupted
the balance of question-asking and question-answering.
Automatic evaluations are still flawed in comparison to human evaluation. Notably, al-
ways replying with the response "i am you to do and your is like" would outperform the
word-overlap measure of all models [12]. This is related to the findings that word-overlap
metrics do not correlate well with human judgement [25]. When adding the last utterance of
the conversation as one of the distractor responses for the utterance selection metric, it was
observed that most models suffered, resulting in parroting. This illustrates that these models
may have relied too much on candidate response ranking influenced by word-overlap to the
previous utterance.
Through human evaluation of persona detection, Hugging Face achieved a score of 0.98
and Lost In Conversation achieved a score of 0.9. The human-to-human detection score of
persona was at 0.96. It is clear that the persona can be recognized confidently. However, as
models tend to repeat (persona) sentences, this may lead to a high persona detection score
but lead to a lower engagingness score. As such, training models to use the persona to create
engaging responses rather than copying/repeating it remains an open problem [12].
The final conclusions from the competition are that the best observed models (in the competi-
tion) were variations of the generative Transformer architecture. That being said, the com-
petition was held before the BERT and GPT-2 model were released which may have further
potential. Generative models worked better than retrieval models on this task (out of the
observed models). Comparing retrieval and generative models without human judgement
remains an open problem for the dialogue task. Models that do not have a balance of question-
asking and question-answering in conversations may suffer in human evaluation in terms
of engagingness.
Models still suffer from inconsistency such as generating responses like "I work as a snow-
board instructor" followed by "I work for a food company". Additionally, models tend to ask
questions that have already been answered previously. These may be solved through the de-
velopment of a memory structure and/or with the help of Natural Language Inference [50].
The competition and the competing models are useful for this thesis both as a guideline as to
how chit-chat models are developed, and for comparison of engagingness, allowing results
of user tests to be compared to previous chit-chat agents.
2.4.2 Mitsuku
Mitsuku is a rule-based chatbot that is built by using Artificial Intelligence Markup Language
(AIML). It is built on top of another famous chatbot called ALICE [48]. The creator of the
chatbot, Steve Worswick, has worked on the agent a few hours at a time since around 200415 .
Mitsuku has won the greatest number of Loebner Prizes (five times), where it was deemed the most human-like chatbot in the competition.
Mitsuku has a limited memory where it saves information about the user and some con-
textual keywords, see figure 2.5. By using the keywords, the agent is able to understand a
question like "What do you think about him?", after previously talking about Eminem. Some
observed flaws with the chatbot are its repetitiveness, where it uses the same template answer
repeatedly, and that it sometimes fails to extract information that it uses in future templates16 .
This thesis will partially utilize memory slots to store information about the user, similar to
Mitsuku.
16 http://www.square-bear.co.uk/mitsuku/chat.htm
Figure 2.5: Illustration of possible types of information that Mitsuku can remember from conversations
A demonstration is available (at the time of writing) online17, where the method is described in a chatbot tutorial18. Their agent generates responses by using a transformer model. A persona and
conversation history are considered when generating each response [51].
A short interaction with the demo reveals that the agent does exhibit a persona; however, it suffers from severe repetition.
230 skills for the agent, consisting of both non-task-oriented and task-oriented skills such as
question answering, recommending movies or restaurants, storytelling, ticket-booking, etc.
Additionally, the chatbot features image commenting where given an image by a user, the
agent will comment on the image, and content creation such as generating poetry together
with the human users.
The part that is most related to this thesis is their "Core Chat" component, which han-
dles open-domain conversations with users. The component is described as a data-driven
response generation system. It consists of two retrieval models, one generative model, an
answer ranker, as well as editorial responses. The Core Chat component is combined with
a component called "Empathetic Computing". It calculates a dialogue state vector consist-
ing of a contextual query, conversation context, query and response empathy vectors. The
conversation context consists of a number of the previous messages in the conversation. The
contextual query is created by rewriting the user query by using information from the con-
versation context. The empathy vector is a vector that is encoded with the user’s feelings
and states in the conversation. The response empathy vector encodes an empathetic aspect,
based on the agent’s personality and the current situation with the user, that is expected in
the agent’s next response.
The first retrieval model is stored with paired data, consisting of query-response pairs.
These pairs are collected from two sources, the internet (social media, forums, news com-
ments, etc.), and from the human-machine conversations generated when users interacted
with XiaoIce. It is estimated that 70% of the agent’s responses nowadays are retrieved from
the agent’s past conversations. The data collected from the internet is filtered by converting
each query-response pair into a tuple, consisting of the contextual query, response as well
as the user and response empathy vectors, by using the empathetic computing module. The
data is filtered based on the tuples to only retain data pairs that contain empathetic responses
that fit XiaoIce's persona. Retrieval structures cannot cover all topics (especially new topics) unless the database is regularly maintained.
A test was performed using three models, a retrieval-based model, a neural-generator-
based and a hybrid of the two models. During testing, the retrieval-based model achieved
the second highest score, and the neural-generator-based model achieved the lowest score.
It was found that the retrieval-generative hybrid model achieved higher score compared to
the two other models by themselves [59]. Therefore, a neural response generator was added
to the Core Chat component, to turn it into a retrieval-generative hybrid component. The
generator was trained using the paired data from the retrieval database. The model was
based on the sequence-to-sequence (seq2seq) framework [44]. A variation of a persona-based
model, which is a type of seq2seq model, was used for XiaoIce’s neural response generation
to reduce the occurrences of inconsistencies in outputs, such as an inconsistent persona [24].
The generator takes the contextual query and a linear combination of the empathy vectors,
as input to generate responses that consider the context, the user profile, and the agent’s
persona. Beam search is used to generate up to 20 response candidates.
An additional retrieval model that stored "unpaired" data, was introduced as comple-
mentary to the other two models, to further improve the coverage of the agent’s responses.
Although the neural response generator is able to provide a response to any topic, the re-
sponses are shorter and contain less useful content in regard to the topic compared to the
response candidates from the unpaired database. Its database consists of sentences collected
from public lectures and quotes in news articles and reports. The data is yet again filtered to
align with the agent’s persona. The unpaired database should not be used by itself, or else it
risks retrieving responses that repeat what the user just said. Therefore, a knowledge graph
was constructed which contains triples of the form head-relation-tail. Each triple contains a
pair of related topics (head, tail) that people often discuss in one conversation, e.g. (Einstein,
Relativity) or (Quantum Physics, Schrodinger’s cat). Then, given a contextual query, a topic
is extracted from the query and a related topic is found through the knowledge graph, and
thereafter a query can be made to the database with the two topics.
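The topic expansion step can be pictured with a toy sketch (purely illustrative; not XiaoIce's actual implementation or data):

    # Toy knowledge graph mapping a head topic to a related tail topic.
    knowledge_graph = {
        "Einstein": "Relativity",
        "Quantum Physics": "Schrodinger's cat",
    }

    def expand_topics(query_topic):
        """Return the query topic plus a related topic, if the graph has one."""
        related = knowledge_graph.get(query_topic)
        return [query_topic] + ([related] if related else [])

    print(expand_topics("Einstein"))  # ['Einstein', 'Relativity']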
A ranker based on a boosted tree model ranks the candidate responses from the three mod-
els. A pre-set threshold is used to classify whether the candidate responses are acceptable,
determined by if the candidates’ ranking scores are above the threshold. A response is then
chosen by randomly sampling from the acceptable candidates. The ranker calculates scores
based on four categories of features.
1. Local cohesion features, how related is the candidate response to the contextual query?
2. Global coherence features, how coherent is the candidate response with the conversa-
tion context and the contextual query?
3. Empathy matching features, how well does the candidate response match the person-
ality of the agent?
4. Retrieval matching features, for responses from the paired database, how well does the
query in the query-response pair match the contextual query?
2.4.5 Meena
Meena is a chatbot created by Google [1]. The model is an end-to-end trained neural conver-
sational model with 2.6 billion parameters. It uses a type of Transformer architecture called
The Evolved Transformer [38] and was trained on 341GB of filtered social media conversa-
tions. Compared to the largest GPT-2 model, Meena has 1.7x greater model capacity and
was trained on 8.5x more data. The trained chatbot was presented as being more sensible
and specific than existing state-of-the-art chatbots, as backed up by crowd-sourced human
evaluation. Their user test was done by collecting about 100 conversations for well-known
open domain chatbots: Mitsuku, Cleverbot, XiaoIce [59] and DialoGPT [57]. Conversations
between humans and Meena, Mitsuku and XiaoIce are available on Google’s GitHub20 . For
consistency, each conversation started with the same greeting from the user "Hi!". The crowd
workers answered two questions for each utterance from the chatbot, "does it make sense?"
and "is it specific?", to measure if the response makes sense in context and whether the re-
sponse is specific to the conversation or a general response. Using the test data, an evaluation
metric was created by using the average of the fraction of responses that are marked “spe-
cific" and "sensible". Meena scored the highest and XiaoIce scored the lowest21 . Although,
in the report it was noted that due to how the researchers performed the test of XiaoIce, the
model’s user profiling suffered as well as its context state.
From looking at sample conversations between Meena and human users, it can be ob-
served that the chatbot has impressive context understanding in comparison to other chat-
bots, but still illustrate flaws. Two flaws are that of repetition and occasionally interacting
as if the chatbot’s previous response was from the user. Furthermore, the work focused on
sensibleness and specificity, and as such does not consider long-term memory or persona.
19 https://github.com/google-research/google-research/blob/master/meena/xiaoice.txt
20 https://github.com/google-research/google-research/tree/master/meena
21 https://ai.googleblog.com/2020/01/towards-conversational-agent-that-can.html
2.4.6 Replika
Replika is a chitchat system that, like XiaoIce, combines neural generation and retrieval-based
methods. Note that there is no research paper describing the system as a whole; most information is found in slides on the developer's GitHub22. The neural genera-
tion considers persona-based embeddings as proposed in [24] and emotional embeddings as
proposed in [58], when generating responses23 . The retrieval-based model is trained based
22 https://github.com/lukalabs/replika-research
23 https://github.com/lukalabs/replika-research/blob/master/scai2017/replika_ai.pdf
on a hard negative mining approach, to force their model to produce low matching scores
for similar contexts and responses. This is to avoid "echo-responses", where a retrieval-based
model retrieves the most semantically similar response rather than the most appropriate [13].
The system also considers conversation history and context by encoding it and providing it
to the retrieval and/or generative models. The generative model used was at one point a
Hierarchical Recurrent Encoder-Decoder (HRED) [40] which is an extension to Seq2Seq that
considers context. However, in an open-source repository24, the developers suggest using Transformer-based models instead. Additionally, the agent has the ability to comment on
and ask questions about images sent by the user.
Although no official evaluation results have been published on the system measuring its
engagingness, etc., the agent is available online to interact with25 . The system has a built-
in feedback mechanism which allows users to give a thumbs up or a thumbs down on any
response provided by the agent. With the help of this feedback, a reranking component was
then added to the system by training a BERT transformer model to classify whether a sentence
would lead to a thumbs up or a thumbs down26 .
2.5.1 PARADISE
PARADISE is a general framework for evaluating and comparing the performance of spo-
ken dialogue agents [47]. The application of the framework is for task oriented and spoken
dialogue agents, which differs from the non-task-oriented chat-based dialogue agents as fo-
cused on in this thesis. Considering both agent types are dialogue agents, there may exist
some evaluation methods that are useful for both of these agent types. In the case studies for
the paper, results were collected by three means: recordings of the user interaction, logging
various system information, and a survey. The recordings were used to transcribe user utter-
ances, measure the elapsed time, and collect additional information such as whether there was any delay in the agent's responses. Logging was used to record what decisions the system made at each state, as well as additional information. An example of additionally logged information was automatic speech recognition (ASR) output, where the user utterance as recognized by the agent was saved and compared with the actual, transcribed user utterance to measure the accuracy of the recognition. A survey was used to measure text-to-speech
performance, ASR performance, task ease, interaction pace, user expertise, expected behav-
ior, comparable interface and future use. Most of the questions used a 1-5 scoring metric
from "almost never" to "almost always". Some questions had yes/no/maybe responses. The
questions were as follows, where ELVIS is an agent for accessing email over the phone.
• In this conversation, did ELVIS understand what you said? (ASR performance)
• In this conversation, was it easy to find the message you wanted? (Task Ease)
• Was the pace of interaction with ELVIS appropriate in this conversation? (Interaction
Pace)
24 https://github.com/lukalabs/cakechat
25 replika.ai
26 https://github.com/lukalabs/replika-research/blob/master/scai2019/replika_scai_19.pdf
• In this conversation, did you know what you could say at each point of the dialogue?
(User expertise)
• How often was ELVIS sluggish and slow to reply to you in this conversation? (Expected
behavior)
• In this conversation, how did ELVIS’s voice interface compare to the touch tune inter-
face to voice mail? (Comparable interface)
• From your current experience with using ELVIS to get your mail, do you think you’d
use ELVIS regularly to access your mail when you are away from your desk? (Future
use)
The same three means of results collection can be applied to the user tests in this thesis
as well. Instead of audio recordings, message history is stored with timestamps. Similarly,
logging can be used to store system state information for each message/action to later be
analyzed and compared to the message history. Finally, a survey will be used to collect the
test user’s perception of the agent.
2.5.2 Godspeed
Godspeed is a series of questionnaires to measure the users’ perception of robots. The ques-
tionnaires cover five key concepts in human-robot interaction: anthropomorphism, animacy,
likeability, perceived intelligence, and perceived safety [2]. Anthropomorphism refers to the
attribution of human behavior, emotions or characteristics to non-human entities such as
robots or animals. It is used to measure how human the user perceives the agent. Animacy
measures life-likeness and perceived safety measures the user’s emotional state. Intuitively,
likeability and perceived intelligence measure how likeable and how intelligent the user
finds the agent. The questionnaires consist of a number of adjectives paired with their cor-
responding antonyms. Users are then instructed to rate each pair from 1 to 5. An example
pair is Fake/Natural, where 1 means fake and 5 means natural. A few more examples of
adjective pairs from the questionnaires are Machinelike/Humanlike, Unfriendly/Friendly,
Unintelligent/Intelligent and Agitated/Calm.
Out of the five concepts, animacy and perceived safety are not as relevant in the case of
chatbots, as they may be for robots. Anthropomorphism is a useful concept to measure per-
ceived realness, and likeability may correlate with engagingness. Perceived intelligence may
be associated with realness, where a more intelligent system seems more human-like. How-
ever, as there are intelligent and unintelligent humans, designing a chatbot to be human-like
does not necessarily mean designing an intelligent one. As such, anthropomorphism and likeability
are considered to some degree when evaluating the chatbots in this thesis.
2.5.3 SASSI
The presence of six main factors in users' perceptions of speech systems was found in the work by Hone and Graham, where they aimed to create a tool for "the Subjective Assessment of Speech System Interfaces" (SASSI) [17]. The six factors were found after perform-
ing principal component analysis (PCA) on a number of questionnaire statements (or items).
The six factors were named "System Response Accuracy", "Likeability", "Cognitive Demand",
"Annoyance", "Habitability" and "Speed". System Response Accuracy was the name give for
questionnaire statements related to the system’s accuracy, reliability, predictability, etc. Three
example statements are "The system is accurate", "The system is unreliable" and "The inter-
action with the system is unpredictable". Three example statements for likeability are "The
system is useful", "The system is pleasant", "The system is friendly". The name Cognitive De-
mand relates to the user’s emotional state and the perceived difficulty level of using the sys-
tem, with example statements "I felt calm using the system" and "A high level of concentration is required when using the system".
The question in the example may have many different answers which do not share any word-
overlap with the ground truth response, which would lead to a low BLEU score, e.g. 0, but
the answers are clearly valid.
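As an illustration of this failure mode, the following sketch computes a sentence-level BLEU score with NLTK for an invented question/answer pair, where a perfectly sensible answer shares no words with the ground truth:

    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

    # Hypothetical question: "What are you doing this weekend?"
    reference = "i am going to visit my grandmother".split()   # ground-truth answer
    candidate = "seeing some family outside town".split()      # equally valid answer

    # No word overlap, so BLEU is close to zero despite the answer being sensible.
    score = sentence_bleu([reference], candidate,
                          smoothing_function=SmoothingFunction().method1)
    print(score)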
A survey was performed to determine the correlation between human judgements and
automated metrics, such as word-overlap (BLEU) and word embedding (word2vec [28])
methods, respectively. The result was that all metrics show weak or no correlation with human judgement, and it was remarked that the metrics performed similarly to random noise
[25]. Despite the results of the survey and the poor performance of both metric types, the re-
searchers believed that distributed sentence representations (Word2vec / word embeddings)
hold the most promise for the future. As such, automatic metrics will be tested in this thesis
as a complement to manual, human evaluation.
As the chatbot understands the user messages, it is able to classify and extract the intents
and the entities. A user’s intent represents the goal or meaning of the input. For example, the
message “Hello” can be identified as a “greet” intent, because the meaning of this message is a
greeting. An entity is a keyword that the agent takes into consideration. For example, the message "My name is John" contains information related to a personal name; the agent should then extract an entity, in this case labeled "name" with value "John", and remember it throughout the conversation to keep the interaction natural.
Incoming messages are processed by a sequence of components in Rasa. These compo-
nents are executed one after another in a pipeline. A pipeline commonly contains three main
components:
• Tokenization.
• Featurization.
• Intent classification and entity extraction.
2.6.1.1 Tokenization
A tokenizer segments text into words, punctuation marks and so on; in other words, it splits a text into tokens. There are different tokenizers that can be used for different types of user inputs.
27 https://github.com/RasaHQ/rasa
For example, the Whitespace Tokenizer can be used when words are separated by spaces,
which is the case for English and many other common languages. Other tokenizers are also
supported if the language is not whitespace-tokenized.
2.6.1.2 Featurization
A featurizer creates a vector representation of the user message and/or response. There are two different types of text featurizers: sparse featurizers and dense featurizers. Sparse featurizers return feature vectors in which most values are zero. Stored densely, these vectors would take up a lot of memory; therefore they are stored as sparse features, which only store the non-zero values together with their positions in the vector.
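To illustrate the idea, the following minimal Python sketch (not part of Rasa, values are illustrative) shows how a mostly-zero feature vector can be stored sparsely as (position, value) pairs:

# A mostly-zero feature vector as produced by a sparse featurizer (illustrative values)
dense_vector = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 2.0, 0.0]

# Sparse storage keeps only the non-zero values together with their positions
sparse_vector = [(i, v) for i, v in enumerate(dense_vector) if v != 0.0]

print(sparse_vector)  # [(2, 1.0), (6, 2.0)]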
2.6.2.1 Story
The training data in Rasa is given in the form of stories, which are used to train the dialogue management models. A story essentially represents a dialogue flow between a user and the chatbot; it contains both the user's intents (and entities if applicable) and the respective responses and/or actions that the chatbot should take. Below is an example of a simple story that represents the beginning of a conversation. The user's intents are labeled with an asterisk, the chatbot's responses and actions are labeled with a dash, and responses always start with the "utter_" prefix. Figure 2.6 shows the story flow; the graph can get more complex as the story gets more routes and contains different responses and actions.
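The original story example is not reproduced here; the following is a minimal sketch of what such a story could look like in Rasa's Markdown story format. The greet intent, the utter_greet_back and utter_ask_name responses and the name entity are taken from the surrounding text, while the story title, the inform_name intent and the utter_nice_to_meet response are illustrative placeholders:

## greeting story (illustrative)
* greet
  - utter_greet_back
  - utter_ask_name
* inform_name{"name": "John"}
  - utter_nice_to_meet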
As seen in the example training data above, an advantage with this format is that the
specific input of the user does not need to be present. Instead, the output intents and entities
from the NLU pipeline are utilized. Therefore, for any messages that are classified as greet,
the chatbot will take the responses from utter_greet_back and utter_ask_name and reply. All the
available responses are provided in a domain file.
2.6.2.2 Domain
The domain essentially defines the universe that the chatbot lives in. It specifies all the intents
and entities that the chatbot should know about, and all the responses and actions that can
be used. In addition, Rasa also supports displaying images and using buttons, which is a nice
touch if the application is to be deployed on other platforms such as Facebook Messenger.
Below is a sample of the domain file that contains some intents and entities.
2.6.2.3 Slot
The chatbot is equipped with memory slots, which essentially form a key-value database. A slot holds a piece of information, such as a personal name or a location, that is provided by the user during the interaction. It is also possible to retrieve information from other sources, for example by querying a database.
Slots have various types for storing different kinds of information and for producing different behaviors. Below is a list of supported slot types.
• text
• bool
• categorical
• float
• list
• unfeaturized
In situations where the value itself does not matter, for example when discussing the weather, a text slot named "location" can be used. When the user provides their current location, this value is stored in the slot; the exact location, whether it is New York or Paris, does not matter to Rasa Core. Rasa Core only needs to know that a location entity was given, and its value will be stored in this specific slot named "location". The unfeaturized type is another slot type where the value does not matter.
If the conversation flow or responses will depend on the value, then there are other types
of slots that can be used, including bool, categorical, float and list. For example, the chatbot asks a yes/no question and stores the answer in a bool slot; later on, the chatbot's responses can vary depending on the boolean value.
2.6.2.4 Response
After the chatbot has understood what the user is saying, it needs to reply accordingly. In
order for the chatbot to reply, it needs to be supplied with one or more responses. These
responses are also defined in the domain file, and are used in the stories. Below are a few
examples of responses that the chatbot can use.
One or more responses are used in reaction to one or more specific intents. Below is a sample of training data written in the story format.
2.6.3 Rasa X
Rasa X is a free but closed-source toolset that can be used to improve the agent: the agent is deployed on a server where users can interact with it, additional information is gathered which can be used, for example, as more training data, and false intent and entity classifications can be corrected manually. It comes with a neat UI, is easy to navigate, and makes it easy to share the agent. Rasa X is deployed using Docker, which is a tool that uses containers to make the process of creating, deploying and running applications easier.
Figure 2.7: Results from the sentiment analyser for different user inputs.
The purpose of the ranking system is to sort the extracted entities by their sentiment val-
ues, and the chatbot will prioritize generating questions related to the topics that have higher sentiment values.
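As an illustration of this idea, the sketch below uses the VADER sentiment analyser (footnote 28) to score statements about topics and sort the topics by their compound score. The example statements are hypothetical, and the exact ranking formula used in this thesis is not reproduced here:

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

# Hypothetical user statements, each associated with an extracted topic entity
statements = {
    "movies": "I really love going to the movies!",
    "homework": "Homework is so boring and annoying.",
    "food": "Pizza is okay I guess.",
}

# Score each topic with VADER's compound score (ranges from -1 to 1)
scores = {topic: analyzer.polarity_scores(text)["compound"]
          for topic, text in statements.items()}

# Topics with higher sentiment values are prioritized when generating questions
ranked_topics = sorted(scores, key=scores.get, reverse=True)
print(ranked_topics)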
2.9 Conclusion
XiaoIce illustrated that using a hybrid retrieval and neural generative based approach al-
lowed for a more engaging system, as the two models complement some of each other's weaknesses. As the generative model used in XiaoIce was based on Seq2Seq, perhaps even more engagingness can be achieved with a Transformer architecture such as GPT-2. From the work by [3], it is found that avoiding repetition and introducing a per-
sona for the agent leads to a more engaging system in their case. Therefore, a hybrid rule-
based and corpus-based approach was chosen for the chatbot in this thesis, with templates, a
retrieval structure and a generative model. Templates will be used together with the agent’s
28 https://github.com/cjhutto/vaderSentiment
personal memory, where a number of variations of the templates are created to reduce rep-
etition. A retrieval structure will be made with high quality responses for typical questions
asked when getting to know someone. The generative model will use a GPT-2 model, fine-
tuned on the question answering task.
Rasa Open Source provides the infrastructure and tools necessary for building high-performing, resilient and contextual chatbots: the ability to understand messages, to use machine learning to improve those conversations, and to integrate chatbots with existing systems and channels, all in one package. Rasa's NLU provides the technology to understand messages, determine intents, and extract key entities. It has wide applicability, supporting multiple languages as well as both pre-trained and custom entities. With a first basic chatbot built, it is then possible to move forward quickly by learning interactively from real conversations with the users. Options for deployment include on-premises, private cloud, or a third-party cloud provider, and the chatbot can connect to existing systems, API calls and knowledge bases and be built into websites, apps, messaging apps or custom channels. It was therefore decided to study and adopt the framework.
Human evaluation was found to be the most reliable form of evaluation, compared to ex-
isting automatic metrics, which is why user tests will be the main method of evaluation. A
decision was made to not use any one of the standard evaluations (PARADISE, Godspeed,
SASSI) but rather to take inspiration from them when creating the user surveys for the dia-
logue agents of this thesis. The reason is that none of these three standards was specifically developed for measuring users' perception of chat-based dialogue agents.
In both SASSI and Godspeed, it can be observed that several questions are used to measure the same aspect. For example, SASSI has the statements "The interaction with
the system is fast" and "The system responds too slowly", which both measure speed. This
may offer more robustness as users may interpret questions differently based on the wording
of the sentences. Therefore, this concept will be considered when designing the surveys in
this thesis.
The evaluation metric of CPS will not be used, purely due to the limited scope of the thesis: there is not enough time to develop a fully functional chatbot for long-term usage, such that the metric could be utilized. The evaluation
will be based on survey results from test users that interact with the agent.
3 Development of a Question Answering Chatbot
This chapter describes the development of the question answering chatbot followed by the
first user test (3.10). A brief overview of the chatbot and its architecture is presented in sec-
tion (3.1), and then discussed in detail in the sections that follow. The chatbot consists of
three main components, a generative model (3.3), a template structure (3.5) with related sen-
timent memory (3.6) and a retrieval structure (3.7). The components are first introduced by
themselves and then put together (3.8). The creation of each component is based around
conversational datasets (3.2) that were preprocessed (3.3.1) and analyzed (3.4).
3.1 Architecture
The architecture for the question answering chatbot can be seen in figure 3.1. The figure
depicts an initial overview as to how the chatbot’s components work together. The chatbot
consists of three components: a generative component, a template component and a retrieval
component. Furthermore, a front-end was developed to allow users to interact with the chat-
bot online. The chatbot runs locally on a computer and can be accessed by using flask, a web application framework in python, combined with an Ngrok tunnel that redirects users from Ngrok to the computer running the chatbot. Then the user is able to send and receive mes-
sages to/from the agent. When the user sends a message, the message is sent to the template
and retrieval components. The template component attempts to extract any existing nouns; if one exists, the message is transformed into question template form. The transformed message is then compared with existing template questions to find the most similar template based on a cosine similarity score. Then the retrieval component finds the most similar question in its retrieval database. If the template question is more similar to the user message than the retrieved question, then the template is chosen and the related answer template is fetched. Otherwise, the retrieval answer is chosen. If the chosen answer has a higher cosine
similarity than a pre-defined threshold, then the answer is sent out to the user. Otherwise,
an answer is generated and then sent out to the user. The memory of the template and retrieval structures is not explicitly illustrated, but it can be considered part of the respective components.
The following sections describe more in-depth the method applied to develop each indi-
vidual component, as well as essential datasets.
3.2 Datasets
To train a neural model on a specific task, it is essential to have relevant data for the task.
As such, conversational datasets were used to train generative models to generate answers
to questions, and to create a question-answer pair retrieval database. Additionally, conver-
sational datasets were analyzed to create templates based on common sentence structures
found in questions.
Figure 3.1: System flow diagram describing how input data flows through the architecture of
the chatbot and generates output
Four existing datasets were used during this thesis, Cornell Movie–Dialogs Corpus
(CMDC) [10], Persona-Chat [56], ConvAI2 [12] and a “getting to know you” (referred to as
GTKY from now on) dataset from a study by Huang et al [18]. All datasets except for the
ConvAI2 dataset were at one point used for the purpose of training machine learning mod-
els. The GTKY and the ConvAI2 datasets were used to analyze what type of questions are
commonly asked in a casual chit-chat setting, and what topics occurred. Analyzing these
datasets is important for the creation of question and answer templates, see section (3.5). The
ConvAI2 dataset consists of human-to-bot conversational data. Analyzing this data may be
useful to get a better understanding of how humans interact with agents. The other datasets
are human-to-human datasets, where Persona-Chat and the GTKY dataset are messages gath-
ered from a 1-on-1 chat room environment. The Cornell Movie–Dialogs Corpus, on the other hand, consists of conversations extracted from raw movie scripts [10]. The Persona-Chat and
ConvAI2 datasets are related to the same task. The Persona-Chat dataset consists of con-
versations between two randomly paired crowdworkers, who were asked to act out a given
persona and to get to know each other [56]. The ConvAI2 dataset is the result of a competition
where the competitors used the Persona-Chat dataset to try and create the best chatbot that
acts out a persona [12]. For the GTKY dataset, participants were told to chat for 15 minutes
with the objective to get to know each other and learn about each other’s interests [18].
3.3 Generative Model
thus, the tokens help the model to learn a desired pattern. The tokens were called ‘<|startoftext|>’ and ‘<|endoftext|>’. A new dataset, MD-token-pairs, was created by adding a SOT
(start of text) token at the start of a message pair, and an EOT (end of text) token at the end of
the message pair. This entails that the GPT-2 model can learn to recognize the structure of a
message pair.
To be able to train a machine learning model on the task of answering questions, it is
necessary to perform preprocessing of data into questions and answers. This will entail that
when the model is given a question as input, it will generate an answer as output. The preprocessing code was altered to extract question and answer pairs instead. The python library
NLTK was used to split a message into sentences, and to split each sentence into tokens. If the last sentence in the message ends with a question mark, then the message is classified
as a question. The following message is then classified as an answer. The first question and
answer dataset, MD-token-questions-simple, that was used for fine-tuning had the condition
to extract the last question in the current message and the first sentence with a punctuation in
the following message. This condition was made because there exist messages with several
questions. Therefore, the hypothesis was that the last question is the most relevant one. Sim-
ilarly, it was hypothesized that the first sentence in the following message, that ends with a
punctuation, is the most relevant sentence that answers the question. In this case, it was not
desired to extract answers that answer a question with a question, as the model is to be a part
of a question answering component. As such, the component only handles questions and not
answers from the user.
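A minimal sketch of this extraction step is shown below. It assumes a list of consecutive messages and uses NLTK's sentence tokenizer; the actual preprocessing code also handles the SOT/EOT tokens and the movie-dialog metadata, which are omitted here:

import nltk  # requires the 'punkt' tokenizer data: nltk.download('punkt')

def extract_qa_pairs(messages):
    """Extract (question, answer) pairs from consecutive messages (simplified sketch)."""
    pairs = []
    for current, following in zip(messages, messages[1:]):
        sentences = nltk.sent_tokenize(current)
        # The message is classified as a question if its last sentence ends with '?'
        if sentences and sentences[-1].endswith("?"):
            question = sentences[-1]  # the last question is assumed to be the most relevant
            answer_sentences = nltk.sent_tokenize(following)
            # The first sentence of the following message is assumed to answer it
            if answer_sentences:
                pairs.append((question, answer_sentences[0]))
    return pairs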
Thereafter, the MD-token-questions dataset was created after adding three conditions to
improve the training data. To “improve” the training data here refers to processing the data
into a format that is more desirable for this specific task. The question message must consist
of more than two words to avoid nonspecific questions like “Why?”, “Why not?”, “Huh?”,
etc. These questions were avoided as they are dependent on the previous messages in the
conversation; context dependent follow-up questions. The questions were deemed as out of
scope at the time due to time constraints of developing the necessary context understanding
for answering such questions. Another condition was that the answer must be longer than
one word for the sake of avoiding “Yes/No” answers. Ensuring that answers are longer
than one word should lead to more engaging generated answers by the model once it is fine-
tuned. Finally, the question and the answer must be from two different characters and from
the same movie (movie dialog corpus). Different question and answer datasets were created
with variations on these conditions where the changing factor was how much of the answer
and the question messages were extracted as the question and the answer. For example,
MD-token-questions dataset had a condition to only extract the question from the message
containing the question, and only the first sentence (that ends with a punctuation) for the
answer message. One variation, MD-token-questions-2, used more than one sentence in the
answer, to allow the answer to be more complex. Compare the two variations in the following
example where MD-token-questions-2 is the "A2" answer.
Pair 2
Q: I counted on you to help my cause. You and that thug are obviously
failing. Aren’t we ever going on our date?
A: You have my word. As a gentleman
The GTKY dataset was preprocessed using the same method as when creating the MD-token-
questions dataset, henceforth referred to as GTKY-preprocessed. The Persona-Chat dataset
was preprocessed into pairs similar to the MD-token-pairs dataset.
“Clean” extraction in this sense refers to extracting the related answer sentence to the
specific question. Therefore, after automatically extracting and creating a question and an-
swer pair dataset, GTKY-preprocessed, manual cleaning was done to remove pairs where the
answer was not related to the question. Consider the previous example where automatic ex-
traction would give the unrelated question and answer pair: "Q:What is your favorite foods?
A:this is essentially my home now.". The manual cleaning reduced the dataset by roughly
44%, from 2035 pairs to 1147. Even though the manual cleaning of the dataset decreased the
size of the training data, the current work aimed at improving the model performance based
on quality and relevance of the training data rather than the size.
models were not saved, and no comparison of the models was performed. Thereafter, a
few models were trained on different question and answer datasets, where the datasets were
created with different conditions, as mentioned in (3.3.1). Most of the early models were not
saved or compared extensively. The main purpose of the early models was to continuously
fine-tune models on different datasets, to find what dataset is better at teaching a model the
desired behavior of generating personal questions and answers. The output of a model was
observed and thereafter a new model was trained to see if there was a noticeable difference
or improvements between the current model and the previous model.
After the GTKY-preprocessed dataset was created, and a model (#7), was fine-tuned on
the dataset, the model started showing the desired behavior of asking and answering more
personal questions. An example of four different models answering personal questions is
provided in the appendix A.2. The MD-token-questions size is 3.53MB compared to GTKY-
preprocessed dataset’s size of 226kB. Fine-tuning 400 iterations on the significantly smaller
dataset may have resulted in an over-fitted model. The GTKY-preprocessed dataset was
thereafter manually cleaned into a new dataset, GTKY-preprocessed-manual. Three mod-
els were trained purely on the GTKY-preprocessed-manual dataset, model #8, #9 and #10.
After #8, models with 124M parameters were trained instead of 355M to reduce the risk of
overfitting, as the dataset was very small. Model #10 was chosen as the generative model for
the system, as it seemed to generate relevant answers more often than the other models.
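Fine-tuning with the gpt-2-simple library, as used for these models, roughly follows the pattern below. The file name, run name and number of steps are placeholders; the exact hyperparameters used for models #8-#10 are not reproduced here:

import gpt_2_simple as gpt2

model_name = "124M"               # the smaller model size used after model #8
gpt2.download_gpt2(model_name=model_name)

sess = gpt2.start_tf_sess()
gpt2.finetune(sess,
              dataset="GTKY-preprocessed-manual.txt",  # placeholder file name
              model_name=model_name,
              steps=400,                               # placeholder iteration count
              run_name="qa_model")                     # placeholder run name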
3.4 Data Analysis
The appendix includes table A.3, where the top 87 nouns are illustrated together with a question that contains the noun. Thereafter, questions that contained a noun were grouped together
in a list, which resulted in 2754 remaining questions compared to 5167.
For each noun, the associated questions were analyzed. The analysis was performed by
manually looking at the questions for the specific noun to make an assessment of what type of
questions were asked that make use of the noun. For example, the most common noun “time”
was mostly associated with the question “What do you like to do in your free time?”. The
questions that were perceived to be most commonly asked for each noun were written down
in a txt file, to be used when creating a retrieval database and as reference when creating
templates. A total of 652 questions were written down. When writing down questions, attention was also paid to which questions could be used as template questions by replacing the keyword (or the noun) of the sentence.
All the extracted questions (5167) were finally clustered to determine common topics. To
be able to cluster the questions, every question was transformed into an embedding vector,
using Universal Sentence Encoder5 [6]. The elbow method was used to decide on the number
of clusters to use (n). The elbow method is a heuristic for choosing an appropriate number of clusters. The method is to plot a metric, such as distortion (y-axis), over the number
of clusters (x-axis) to find a point where the rate of improvement per additional cluster de-
creases. Initially, the metric value will improve steeply (lower value is better) as the number
of clusters increases, but at one point the rate of improvement flattens. A lower number
of clusters is desired and therefore the number can be decided as a point before the curve
flattens. In this case, k-means clustering was called in a for-loop in the range n=[2, 39]. In
each iteration, the sum of square distances of each sample to the closest cluster center was
calculated. Figure 3.2 shows the distortion value changes across different n values.
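A sketch of this procedure, assuming the question embeddings have already been computed with the Universal Sentence Encoder and stored in a NumPy array (the file name is a placeholder), could look as follows:

import numpy as np
from sklearn.cluster import KMeans

# embeddings: array of shape (n_questions, embedding_dim), e.g. from the
# Universal Sentence Encoder (computation not shown here)
embeddings = np.load("question_embeddings.npy")  # placeholder file name

distortions = []
for n in range(2, 40):  # n = [2, 39] as in the thesis
    km = KMeans(n_clusters=n, random_state=0).fit(embeddings)
    # inertia_ is the sum of squared distances of samples to their closest cluster center
    distortions.append(km.inertia_)

# Plotting distortions against n gives the elbow curve shown in figure 3.2.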
From the figure, there was no obvious number of clusters to choose. Therefore, attempts
were made with n = 10 and n = 20. After analyzing the clusters and what type of questions
existed in each cluster, it could then be observed that n=10 was too few whereas n=20 was
too many. Each cluster was analyzed by printing the questions in the cluster and extracting
the nouns from the questions. After analyzing the dataset, the most common themes/topics
could be abstracted to one of the following (unordered):
• hobbies/interests/fun activities/plans
• music/instruments
5 https://tfhub.dev/google/universal-sentence-encoder/1
• movies/TV shows
• student/study/research
• weather
• food
• sports
• work/job/money/business
• books
• travel/language/vacation
3.5 Templates
Previous work at University of Twente developed a template-based framework for generat-
ing follow-up questions [27]. The framework consisted of three major parts: pre-processing,
deconstruction, and construction. Pre-processing to clean-up user input, deconstruction to
find a sentence pattern, and construction to construct a response by matching the sentence
pattern obtained in the previous step. The template component in this chatbot contains the
same three parts but with different methods and for question answering instead of question
asking. A main difference is that [27] created rules to find patterns in user inputs which were
then linked with a template response, whereas this component developed template questions
linked with template answers. By using Semantic Role Labeling, [27] is able to detect inputs that have the same semantic meaning but are syntactically different, and match them to the same response. In this component, on the other hand, user inputs were compared to template
questions by transforming the input and templates into sentence embeddings and calculat-
ing the cosine similarity between the input and the question templates, to detect a matching
question.
Recurring sentence structures were observed in some questions during the analysis of the
GTKY dataset, which led to the creation of 21 question templates. The 21 templates were
then divided into 6 types of questions, by assigning them an answer ID. Answer templates
could then be made for a specific question type by assigning the question type’s answer ID
to the answer template as well. The templates were then used in a retrieval way, where the
user’s input is matched to a template question (deconstruction) to retrieve a related template
answer which is then further processed (construction). Many of the question templates are only slight variations of each other, which is how the 21 templates could be divided into 6 types. Having variations of the same question adds more robustness as more inputs can be recognized. The
question templates are shown in A.4.
A template is retrieved if the user’s input is similar enough to the template. The similarity
is calculated by turning the templates and the input into sentence embeddings and calculat-
ing the cosine similarity. The templates consist of three token types: <sentiment>, <noun>
and <topic>, which enables the templates to be used with different topics, subjects (noun),
and sentiment. Given a user input, a copy of the input is processed to follow the template
It was considered to perform additional analysis to find if there were common sentence
structures for answers reoccurring in the dataset when answering these types of questions. In
this case however, the answer templates were manually written to fit the question templates.
The answer templates are shown in A.5 and are further described in (3.8).
banana -> is a fruit/food. The sentiment memory could therefore grow larger and larger the
more users interacted with the system by saving a noun under its respective topic with a
random sentiment value.
Two possible alternatives were considered when creating the retrieval structure. The first
possible approach is to take the GTKY-preprocessed dataset after it was manually cleaned,
and directly use it as the retrieval database. Although it is a possible solution, the dataset was
mainly created as training data for machine learning models. Therefore, some questions exist
in the dataset mainly to teach models the format of, ‘a question followed by an answer’, and
additional work would be required to further clean the data. The dataset does not specifically
take into consideration the most common questions, as found by the analysis of the GTKY
dataset. Furthermore, it is important to reduce repetition in the chatbot’s answers to increase
the perceived engagingness for users who interact with the conversational agent [3]. As such,
for each question type, it is desired to have multiple possible answers which would require
manually grouping similar questions such that their answers could be shared.
The second alternative is to make use of the common questions found when analyzing
the GTKY dataset, and to generate answers with the fine-tuned generative model. The sec-
ond alternative is more desirable when considering the repetition aspect as it can generate n
number of answers for each question.
As a result of analyzing the GTKY dataset, 652 questions had been written down into a
txt file, ordered by the top nouns (3.4).
To create a retrieval structure, the questions were first ordered into general topics instead
of nouns. The topics were interest/hobbies/activities, music, books, movies, games, stu-
dent/study/work, person/personal, food, sports and miscellaneous. Questions that did not
fit into any other topic were placed into miscellaneous such as “What is your favorite thing
to do at the beach?” or “What’s your favorite season?”. Duplicate questions were possible
if the question contained more than one of the most common nouns and would therefore be
listed more than once. For example, the question “Do you have any fun weekend plans?” is
listed under “fun”, “weekend” and “plans”. The retrieval questions were not meant to handle follow-up questions; as such, follow-up questions were discarded, as well as duplicates, leaving 343 questions.
The remaining questions were iteratively, one by one, input into the fine-tuned GPT-2
model to generate 20 answer samples per question. The results were stored in a csv file, sav-
ing each question along with the 20 samples. Out of the 20 samples, some of the samples were
duplicates and some answers did not make sense to the question. Therefore, (subjectively)
suitable answers were manually extracted. Out of the 343 questions, some additional ques-
tions were discarded for various reasons, resulting in 259 retrieval questions. The questions
were mainly, but not exclusively, discarded after observing flaws in the associated generated
answers. Another reason for discarding questions was due to the question being a type of
follow-up question, for example, “What is your dog’s name?”.
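The over-generation step can be sketched as follows with gpt-2-simple and pandas; the run name, file names and the example question are placeholders:

import gpt_2_simple as gpt2
import pandas as pd

sess = gpt2.start_tf_sess()
gpt2.load_gpt2(sess, run_name="qa_model")   # placeholder run name

retrieval_questions = ["What do you like to do in your free time?"]  # 343 questions in practice

rows = []
for question in retrieval_questions:
    samples = gpt2.generate(sess,
                            run_name="qa_model",
                            prefix="<|startoftext|>" + question,
                            truncate="<|endoftext|>",
                            nsamples=20,          # 20 answer samples per question
                            return_as_list=True)
    rows.append([question] + samples)

pd.DataFrame(rows).to_csv("generated_answers.csv", index=False)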
A similar approach to the one used when creating the template structure was used to assign answer IDs to questions and answers. Questions that have similar meaning were assigned the
same answer ID. Thereafter, the respective answers were assigned the same answer ID. Some
answers could be answered by a different question type, and were therefore given an addi-
tional “optional” ID. The following example illustrates one answer that fits one question, and
another answer that fits two questions.
A1:I like to go out to bars on the weekends, but if I’m stressed out then I’ll usually
head out to eat. ID = 1
A2:I like to go to the movies. ID = 1, Optional ID = 2
3.8 Chatbot Development
to find the IsA relationship of the noun. However, this feature was disabled due to time
complexity as it could add an extra two seconds to the agent’s response time in some cases.
ConceptNet was then replaced by using spaCy’s word similarity to compare the similarity
between extracted nouns and existing topic words in the sentiment memory. This was done
by using the concept of word embeddings, e.g. Word2vec, to create a vector representation of
each word. The extracted noun was classified as the topic to which it had the highest cosine similarity, which allows users to ask what the agent thinks about any noun in any topic.
For example, "Q: What do you think about money?". The subject noun was then saved in the
sentiment memory with a random sentiment under the topic that it had the highest similarity
to, and the agent was able to say whether it liked or disliked the subject.
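A sketch of this topic classification with spaCy's word vectors is given below; a model with vectors (such as en_core_web_md) is assumed, and the topic list is illustrative:

import spacy

nlp = spacy.load("en_core_web_md")  # a spaCy model that includes word vectors

topics = ["music", "food", "sports", "movies", "work", "travel"]  # illustrative topics
topic_docs = {t: nlp(t) for t in topics}

def classify_topic(noun):
    """Assign an extracted noun to the topic with the highest cosine similarity."""
    noun_doc = nlp(noun)
    return max(topics, key=lambda t: noun_doc.similarity(topic_docs[t]))

print(classify_topic("money"))  # e.g. "work"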
The processed input is then transformed into an embedding vector, using Universal Sen-
tence Encoder [6]. The template questions are all turned into embedding vectors and cosine
similarity is calculated between the processed input and each template question. The ques-
tion template with the highest similarity is saved along with the cosine similarity value. Then,
the original user input is similarly compared to the questions in the retrieval database. Indi-
vidual threshold values are used for the template and retrieval structure. If both similarity
values are lower than their respective threshold, then no sufficiently similar question is considered to have been found in either structure, and therefore an answer is generated by the
GPT-2 model. This can be compared to XiaoIce [59] where a pre-set threshold is used to de-
termine candidate responses. A difference is that XiaoIce randomly selects a response out of a
pool of responses which have a ranking score above the threshold, compared to selecting the
highest ranked response. The threshold values were in this case determined by, during de-
velopment, testing how low the similarity can be and still return answers that seem relevant
enough to be considered acceptable as output. If one or both values are above the threshold,
then the question with the highest similarity, returns their respective answer. If the question
from the retrieval database has highest similarity, a random answer with the corresponding
answer ID is returned. If the template question has higher similarity, then the output needs
to be processed further. The output is processed by choosing a random answer template
with the same answer ID corresponding with the question. A dataframe data structure in
the python library Pandas is used to retrieve information from csv files. The answer when
retrieved from its pandas dataframe contains information about how the answer should be
processed. The information tells whether the answer should use the same noun as extracted
from the user sentence, if it should use the same sentiment word (like/dislike/.../) as the
user and how many nouns it should retrieve in a topic (a number of favorite nouns). The in-
structions are stored in the columns ’use_noun’, ’use_sentiment’ and ’fetch_count’ of the csv
file, which is shown in A.5, where integer 0 is mapped to "False" and 1 is mapped to "True".
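The overall selection logic can be summarised in the following sketch. The embedding, template, retrieval and generation helpers are placeholders for the components described above, and the threshold values are illustrative rather than the tuned values used in the system:

import numpy as np

TEMPLATE_THRESHOLD = 0.7    # illustrative values; the actual thresholds were tuned manually
RETRIEVAL_THRESHOLD = 0.7

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def answer(user_input, embed, templates, retrieval_db, generate):
    """Choose between template, retrieval and generative answers (simplified sketch)."""
    # (The thesis compares a processed copy of the input to the templates; simplified here.)
    vec = embed(user_input)
    best_template = max(templates, key=lambda t: cosine_similarity(vec, t.embedding))
    best_retrieval = max(retrieval_db, key=lambda q: cosine_similarity(vec, q.embedding))

    t_sim = cosine_similarity(vec, best_template.embedding)
    r_sim = cosine_similarity(vec, best_retrieval.embedding)

    if t_sim < TEMPLATE_THRESHOLD and r_sim < RETRIEVAL_THRESHOLD:
        return generate(user_input)             # fall back to the GPT-2 model
    if r_sim >= t_sim:
        return best_retrieval.random_answer()   # random answer with the same answer ID
    return best_template.fill_answer(user_input)  # process a random answer template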
A user specific memory was developed related to previously answered questions. When-
ever a retrieval answer is returned as the reply message of the chatbot, the question, answer and answer ID are saved. If the user asks a question with the same answer ID, then the saved answer is returned, which ensures consistent answers and a sense of persona.
Example of retrieval memory
Q: What is your name?
A: My name is John.
Q: What is your name?
A: I thought I told you previously. My name is John.
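A minimal sketch of this user-specific memory, keyed by user and answer ID (the names and the helper for picking a random answer are illustrative), could look like this:

# Maps (user_id, answer_id) -> previously returned answer, ensuring a consistent persona
retrieval_memory = {}

def retrieve_consistent_answer(user_id, answer_id, pick_random_answer):
    key = (user_id, answer_id)
    if key in retrieval_memory:
        # Refer back to the earlier answer instead of picking a new random one
        return "I thought I told you previously. " + retrieval_memory[key]
    answer = pick_random_answer(answer_id)
    retrieval_memory[key] = answer
    return answer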
The generative model generates answers whenever the similarity of the template and re-
trieval questions are too low compared to the user’s question. The model is called to gen-
erate an output with the user’s original text as input together with a <|startoftext|> token
prepended to the user’s text. For example, "Q: <|startoftext|> What is the meaning of life?".
The model then generates n number of answers, where the first non-empty answer is returned
as output.
Internal Testing and Revisions
Once the interactive chatbot was developed, some testing was done internally by letting per-
sons other than the developer interact with the system. It became apparent that the generated
answers can sometimes be “faulty” by consisting of <|startoftext|> and <|endoftext|> to-
kens. The faulty answers are not formatted as only the answer but may also contain the input
question or the start of another question as well as the answer generated by the model. This
has to do with how the model is trained on question and answer data. Therefore, this led to
the addition of various processing. First, the sentence is split at the first occurrence of “end-
oftext” and all the text after the split is discarded. Then any remaining tokens are removed
(<|startoftext|>, <|endoftext|>, <|, |>). Further processing was performed to ensure that
the answer does not contain a question by splitting the text into sentences and taking any
sentences before the question. After preprocessing, the first answer that is longer than 1
character (not empty) is returned as the output. It was also observed that the model would
sometimes only generate a dot. This would occur when the text given by the user does not
contain punctuation at the end of a sentence (question mark, exclamation mark, full stop).
The question answering component was developed for question answering and as such can
only be expected to work as intended if the user inputs questions. Therefore, a processing rule was added that appends a question mark to any user input that does not end with
a form of punctuation. This addition was only added when given as input to the generative
model whereas the original user input is saved (recorded) when applicable, such as during
user tests.
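The processing steps described above can be sketched as follows; this is a simplified reconstruction of the described rules, not the exact code:

import nltk

SPECIAL_TOKENS = ("<|startoftext|>", "<|endoftext|>", "<|", "|>")

def prepare_input(user_text):
    """Add a question mark if the input has no terminal punctuation."""
    if not user_text.rstrip().endswith(("?", "!", ".")):
        user_text = user_text.rstrip() + "?"
    return user_text

def clean_generated(text):
    """Strip special tokens and drop everything from the first question onwards."""
    text = text.split("endoftext")[0]          # discard text after the first EOT marker
    for token in SPECIAL_TOKENS:
        text = text.replace(token, "")
    kept = []
    for sentence in nltk.sent_tokenize(text):
        if sentence.strip().endswith("?"):     # keep only sentences before a question
            break
        kept.append(sentence)
    return " ".join(kept).strip()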
Additionally, the agent did not have a greeting or a farewell reply, which was then added
to the retrieval structure. Finally, an initial message was displayed as the first message of the
chatbot which greeted and invited the user to ask the agent about anything. The message
was then revised to invite the user to ask anything about the agent and its interests, see figure
A.2 in the appendix. The reason for changing the message was an attempt to narrow down
the questions the users would ask the agent into the personal question domain that it was
designed for.
Artificial response delay
The generative component took around 4-8 seconds to generate a response depending on
hardware, number of responses to over-generate, and response lengths. Therefore, artificial
response delay was introduced to mask any difference in response time between the gener-
ative component and the template and retrieval components, which are near instantaneous
(without artificial delay).
Artificial delay has been previously studied in the work of [14]. In the study, it was found
that users perceived a chatbot (in the context of customer service) with response delay as
more human-like and more socially present compared to the same chatbot with near instan-
taneous responses. Therefore, the testing objective of whether adding artificial delay in the
agent’s response increased the user’s perception that the agent was real (realness), was added
to the user test. Testing for this objective was done by randomly using delay for every other
user (aiming to be equal numbers of participants in each test group). The hypothesis was that
the user group interacting with the agent that has artificial delay, would perceive the agent
to be more human-like. The artificial delay was dynamically determined by the length of the
agent’s reply, see eq. 3.1.
A threshold for the delay was set at 8 seconds; messages that were calculated to take longer than 8 seconds were suppressed using eq. 3.2. Example delays are illustrated in table
3.2. The threshold of 8 seconds was chosen due to the generative component usually taking
roughly 8 seconds to over-generate 20 answers. If the time it takes for the reply to be retrieved
or generated is longer than the calculated dynamic delay, then no extra delay was added.
Otherwise, the agent waited until the computation time + wait time became equal to the
calculated delay time.
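Equations 3.1 and 3.2 are not reproduced here; the sketch below only illustrates the waiting behaviour, with a hypothetical per-character rate standing in for eq. 3.1 and a simple cap standing in for eq. 3.2:

import time

MAX_DELAY = 8.0                      # seconds; the threshold used in the user test
CHARS_PER_SECOND = 25.0              # hypothetical typing rate, stands in for eq. 3.1

def target_delay(reply):
    delay = len(reply) / CHARS_PER_SECOND
    return min(delay, MAX_DELAY)     # suppression of long delays, stands in for eq. 3.2

def send_with_delay(reply, computation_time):
    """Wait so that computation time plus waiting time equals the calculated delay."""
    remaining = target_delay(reply) - computation_time
    if remaining > 0:                # if computing already took longer, add no extra delay
        time.sleep(remaining)
    return reply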
3.9 User test - Environment
A web front-end was developed for the user test, consisting of the following pages:
• A starting page
• A survey page
• An end page
• A statistics page
11 https://flask.palletsprojects.com/en/1.1.x/
12 https://ngrok.com/
13 https://dev.to/sahilrajput/build-a-chatbot-using-flask-in-5-minutes-574i
The appendix shows some of these pages (A.6). The starting page informs the user about the
user test in terms of their rights and how their data is saved and used in the study as well as
instructions regarding how to interact with the agent. The starting page includes a checkbox
where the user agrees and gives their consent to the terms of the study. If the user gives
their consent, then they are redirected to the chatbot interaction page. The user decides how
long to interact with the agent, whereafter they are redirected to the survey page. The survey
page consists of the survey questions, which are presented in (3.10.1). Once sending in the
form, the user is redirected to the end page, where they are informed that the test is over and
that they may leave the page. The statistics page is made for the researchers to have a live overview of the number of unique visits and the number of completed surveys.
Cookies are used to temporarily store user information. When a user enters the front
page for the first time, they are given a unique integer user ID which is stored as a cookie,
as well as a boolean valued cookie. The boolean valued cookie decides whether the user
interacts with an agent that has artificial response delay or not, as described in (3.8). When
the user sends messages to the agent, the message history is also temporarily stored as a
cookie. Three additional cookies are used, one cookie stores the timestamp of when the user
continued to the survey page, while the other two store information regarding the delay in
the chatbot’s response. One keeps track of the maximum delay that the user experienced, and
the other keeps track of the average response delay. The message history cookie is updated
after every message sent by the user. The cookie is a python list where each entry contains the
user’s message, the chatbot’s message, the timestamp when the user’s message was received
and the timestamp just before returning the chatbot’s message. The delay cookies are also
updated accordingly after each message.
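Reading and writing such cookies with flask can be sketched as follows; the cookie names mirror the description above, while the page content and exact names are illustrative placeholders:

from flask import Flask, request, make_response
import itertools

app = Flask(__name__)
_user_counter = itertools.count(1)   # unique integer user IDs

@app.route("/")
def start_page():
    response = make_response("Welcome to the user test")   # placeholder page content
    if request.cookies.get("user_id") is None:
        # First visit: assign a unique ID and decide the DELAY/NO-DELAY group (odd/even ID)
        user_id = next(_user_counter)
        response.set_cookie("user_id", str(user_id))
        response.set_cookie("use_delay", str(user_id % 2 == 0))
    return response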
Data is saved on the server side at different steps of the user test. Once the user is given a
user ID, the user ID is stored in a csv file together with the current timestamp. When the user
is done interacting with the chatbot and redirected to the survey page, the message history is
saved in another csv file. The data saved to the csv files is: user ID, the value of the timestamp
cookie, the value of the message history cookie and the number of messages sent by the user.
At a lower level, the chatbot always saves every answer that is obtained through generating
with the GPT-2 model along with the corresponding question, in a pandas dataframe. The
chatbot also stores a pandas dataframe related to the retrieval database, such that if a user asks
a question which leads to a retrieved answer, the question, answer and answer ID are saved in a user-specific memory. The purpose of the memory is to ensure a consistent persona. If
the user asks what the agent’s name is, then the agent will first randomly retrieve a name
and thereafter remember the name and re-use it if the same user asks again. Therefore, the
pandas dataframes are saved into their respective csv files at this point as well. The data
could be saved once at the end of the user test instead, but this ensures more safety in case
the application runs into any unexpected errors. Finally, once the user finishes the survey
and submits the data, the survey answers are saved along with additional data into another
csv file. The additional data contains the user ID, the timestamp from the related cookie, the
average delay and max delay from the delay cookies as well as which version of the test the
user interacted with.
When there exist multiple threads that have access to shared resources, it is important to
lock the resource when it is used by one of the threads to avoid corrupting data; this is known
as mutual exclusion (mutex) in concurrent programming. An example of how data may be corrupted is given below.
To update a variable the variable needs to be read from main memory, incre-
mented on local memory and saved back to main memory. If two threads were to
update the same variable concurrently without locks, there may be many differ-
ent inter-leavings of these three steps with different results. For example, with a
variable B and two threads called to increment the variable concurrently:
Variable B = 5
Thread 1 reads B, B = 5
Thread 1 increments B locally, B = 6
Thread 2 reads B, B = 5
Thread 1 saves B, B = 6
Thread 2 increments B, B = 6
Thread 2 saves B, B = 6
The final value of B is 6 even though two increments were performed, meaning one update was lost.
In the system, each user interacting with the agent has their own thread and each user (or
thread) may read or write data from/to a few csv files. Therefore, mutex locks are used to
lock shared resources to only allow one thread to read or write at the same time. This is done
by placing a call to acquire the lock before a shared resource and a call to release the lock once
the thread is done using the resource. If a thread tries to acquire the lock while it is already acquired by another thread, the thread will pause its execution (blocking) and wait in a queue
until the resource is ready.
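A minimal sketch of how such a lock can be used around the csv writes (file path and row contents are placeholders):

import csv
import threading

csv_lock = threading.Lock()   # one lock shared by all user threads

def append_row(path, row):
    # Acquiring the lock blocks other threads until it is released,
    # so only one thread writes to the file at a time.
    with csv_lock:
        with open(path, "a", newline="") as f:
            csv.writer(f).writerow(row)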
The generative model was unable to run concurrently. The reason was later found (4.6.7)
to be due to the "gpt_2.py" script from the gpt-2-simple library, where the function to gen-
erate text re-loads the model, which causes problems when called concurrently. Before the
discovery of the cause, a solution was found to load the model multiple times, each copy into its own TensorFlow session. This was done by duplicating the model on the computer's hard disk
and thereafter changing the TensorFlow variables into a different “scope”. Then, a queue
system was developed by using locks to only allow one user to use one of the models at
the same time. Loading multiple models was expensive in terms of VRAM as one model’s
size was near 500MB. Therefore, 4 copies of the generative model were made available in the
queue system for the user test. The choice of 4 copies was sufficient for the scope of the user
test where a low number of subjects participated at the same point in time. The script in the library “gpt-2-simple” called gpt_2.py was subsequently changed to work with the impromptu solution, by adding a “scope” parameter to the ‘load’ and ‘generate’ functions, such that the
copies of the model could be successfully loaded and used to generate. If more than 4 users,
e.g. 5, tried to call the generative model at the same time, then the fifth user would be placed
in a queue waiting for a model to become available. This would lead to longer response times if several users interacted with the chatbot concurrently.
3.10 User test - Question Answering Chatbot
improved. An additional testing objective was added during development which was about
measuring how adding artificial response delay affects the users’ perception of the chatbot.
3.10.1 Survey
The survey part of the test instructs users to rate a number of statements on a 5-point Likert
scale from 1-“Not at all” to 5-“Very much so”. The statements were:
• The time it takes to get an answer negatively affects my chatting experience. (unrespon-
siveness)
• The time it takes to get an answer from the chatbot is acceptable. (responsiveness)
• I would like to interact with the chatbot again in the future. (retention)
• It feels as if the chatbot has its own personal background and personality. (persona)
Some of the listed questions are only relevant for evaluating the question answering chat-
bot, such as whether the chatbot gave contradicting answers. However, to be able to compare
the results with other chatbots in the thesis and to draw conclusions of the research, some
questions are applicable across all surveys such as questions that measure engagingness and
realness.
Moreover, the users were asked to enter their age, their perceived English proficiency and
whether they had interacted with a chatbot previously: “Have you previously interacted with
a chatbot?” (yes/no).
The options for choosing English proficiency were given in the following order:
• Native speaker
• Fluent, C1 - C2
• Advanced, B2 - C1
• Intermediate, B1 - B2
• Beginner, A1 - A2
The options were transformed into values 1-5, where 1 represents ‘Beginner’ and 5 represents ‘Native speaker’.
3.10.2 Method
The user test was made available online from 29th of April around 20.00 CET until 11th of
May around 10.00 CET. The majority of test subjects participated between the dates 29/4
- 1/5. The user test was first posted in a Facebook group consisting of over 500 students
(current students and past students from Linköping University) on the 29th. It was then
posted in another student group the following day. On the 5th of May, the test was shared
with a small group of researchers that work in related areas. The last subject participated on
the 8th, a Friday, and the test was thereafter taken down on the following Monday.
The system ran locally on a laptop where users could access the test through a tunnel by
using Ngrok. The laptop had 16 GB of RAM, an NVIDIA GeForce RTX 2060 graphics card with 6 GB of VRAM and an Intel Core i7-9750H CPU @ 2.60 GHz with 6 cores.
Instructions
When the link to the user test was posted in the Facebook groups, it was posted with the
following message:
“Hello there,
I’m developing a chatbot as part of research for HMI, and I therefore ask you
to test my chatbot. I would appreciate it if you could ask at least 5 (personal)
questions to my chatbot and thereafter fill in a survey, rating 9 statements on a 0-5
scale. It should take between 5-10 minutes to participate.
The test is expected to be available until the end of the week. (Ps. may not be
suitable for mobile devices) “
The users were instructed to ask at least 5 questions to try to get to know the agent, as if it
was a real person. They were also told to send complete sentences, one at a time. Once they
decided that they were done asking the agent questions, the users were then finally instructed
to fill in a short 10 question survey.
3.10.3 Hypothesis
Previous work has observed positive effects on users of using dynamic response delay in cus-
tomer service chatbots, compared to instantaneous responses [14]. Therefore, two hypotheses
were formed:
1. Users perceive chatbots with artificial response delay to be more human-like (realness),
compared to the same chatbots with instantaneous responses.
2. Users perceive chatbots with artificial response delay to be more engaging to interact
with, compared to the same chatbots with instantaneous responses.
3.10.4 Results
The user test had 32 participants who interacted with the chatbot and successfully filled in
the survey. The participants are assumed to be mostly students and some researchers, as a
link to the test was mainly shared in student related groups on Facebook. When asked “Have
you previously interacted with a chatbot?”, 28 participants answered yes. The participants
were asked to estimate their English proficiency from beginner to fluent, or if they were na-
tive speakers. The majority of participants estimated themselves to be fluent speakers and
the average English proficiency was calculated as 3.875 (SD=0.609). The average age was
calculated as 24.344 (SD=3.395). The user test was divided into two groups, DELAY and NO-
DELAY with 16 participants in each group. A participant was assigned to one of the groups
based on if their unique integer ID was odd or even.
The number of messages sent to the agent was 240 in total with an average of 7.5 messages
per interaction. Out of the 240 messages, 199 were generated, 15 were template answers and
26 were retrieved answers. 11 of the 32 subjects purely interacted with the generative model.
As for the system’s response time, the time for the agent to send out an output after re-
ceiving an input, was calculated as 5.983 seconds on average (all messages considered). The
average response time for the DELAY group was calculated as 6.646 seconds compared to
5.103 seconds for the NO-DELAY group.
One of the subjects in the NO-DELAY group interacted with the agent for over 100 mes-
sages and is therefore considered an outlier. This subject is not included in the 240 messages
mentioned previously or the response time calculations.
After analyzing the message history between the participants and the agent, only 16 (50%)
of the participants interacted with the agent according to the instructions. By not following
the instructions, the participants would interact with the agent outside of its designated ap-
plication. Some individuals requested the chatbot to ask them a question. Other individuals
asked non-personal questions, and others would merely interact with the agent to try to un-
derstand how the agent worked.
The following questions or rules are used as guidelines to divide the samples into “ideal”,
“border-line” and “out-of-scope” groups. For the most ideal case, the answer to the first
question is ’yes’ and then ’no’ for the second question.
The first rule is mainly considered whereas the second rule is less enforced to allow some
reactionary messages, such as laughter or acknowledgement. The “border-line” group con-
sist of interactions where the user may partially ask some personal questions but also task-
oriented questions or interacting with the agent to explore its capabilities. The “out-of-scope”
group consists of interactions where users clearly broke the two previously mentioned rules,
e.g. by not asking enough personal questions and sending multiple non-question messages.
Users that did not follow the instructions are not considered a valid representation of the
question-asking component for its intended application. As such, the “border-line” and “out-
of-scope” groups are merged into a “non-ideal” group. The border-line group consisted of
8 participants and the out-of-scope group consisted of 8 participants as well. Examples of
conversations from the three groups (ideal, border-line, out-of-scope) are provided in the ap-
pendix (A.7).
Table 3.3 shows the survey results from different groups, with the number of participants
in each group shown in parenthesis in the first row. T-tests were performed to find any signif-
icant differences in mean values between groups. The resulting p-values are shown in table
3.4. With a significance level of α = 0.05, it was found that there was a significant difference
between the ideal and non-ideal groups on a majority of the measures (responsiveness, real-
ness, relevancy, unresponsiveness and persona). It was hypothesized that the DELAY group
would find the agent to be more engaging and human-like than the NO-DELAY group. In-
stead, the opposite was observed with statistical significance (p=0.037) that the ideal NO-
DELAY group perceived the agent to be more human-like than the ideal DELAY group. As
for engagingness, the null hypothesis could not be rejected as there was no significant differ-
ence between the groups. Furthermore, it was found that there were significant differences
between the ideal DELAY group and ideal NO-DELAY groups on the measures of inconsis-
tency and relevancy.
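For reference, such a comparison of group means can be computed with SciPy; whether a pooled-variance or Welch's t-test was used is not stated in the text, so the call and the example scores below are only illustrative:

from scipy import stats

# Hypothetical per-participant scores on one survey measure for two groups
ideal_scores = [4, 5, 3, 4, 4, 5, 3, 4]
non_ideal_scores = [2, 3, 2, 4, 3, 2, 3, 2]

t_statistic, p_value = stats.ttest_ind(ideal_scores, non_ideal_scores)
significant = p_value < 0.05   # significance level alpha = 0.05 as in the thesis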
Due to the unexpected findings, further investigation was performed by analyzing the
conversations in the ideal group. It was found that roughly 59.6% of the agent’s responses
made sense with respect to the user's input in the ideal DELAY group, compared to 64.4% in the ideal NO-DELAY group. The ideal DELAY group consisted of 7 subjects with
6.71 inputs on average per conversation, compared to 9 subjects in the ideal NO-DELAY
group with 6.56 inputs on average. The ideal DELAY group asked (roughly) 15 follow-up
questions in total, where each user asked at least one follow-up question. The ideal NO-
DELAY group asked (roughly) 13 follow-up questions in total, with 3 out of the 9 subjects not
asking a single follow-up question. The ideal DELAY group had 3 subjects who interacted
with the template or retrieval components, with 8 responses in total from those components.
In the ideal NO-DELAY group, every subject got at least one response from the template or
retrieval components, adding up to 23 total responses from those components. The results
are discussed in chapter 8.
Group (number of participants): All (32) | Ideal (16) | Non-ideal (16) | All DELAY (16) | Ideal DELAY (7) | All NO-DELAY (16) | Ideal NO-DELAY (9)
engagingness: 3.281 (1.301) | 3.688 (1.138) | 2.875 (1.360) | 3.188 (1.109) | 3.143 (1.215) | 3.375 (1.500) | 4.111 (0.928)
unresponsiveness: 2.250 (1.391) | 1.625 (0.885) | 2.875 (1.544) | 2.062 (1.436) | 1.286 (0.488) | 2.438 (1.365) | 1.889 (1.054)
realness: 1.969 (1.177) | 2.375 (1.147) | 1.562 (1.094) | 1.625 (1.025) | 1.714 (0.951) | 2.312 (1.250) | 2.889 (1.054)
inconsistency: 3.188 (1.281) | 2.750 (1.183) | 3.625 (1.258) | 3.438 (1.209) | 3.429 (1.134) | 2.938 (1.340) | 2.222 (0.972)
relevancy: 2.750 (1.136) | 3.250 (1.125) | 2.250 (0.931) | 2.312 (0.793) | 2.429 (0.787) | 3.188 (1.276) | 3.889 (0.928)
repetitiveness: 1.938 (0.982) | 1.750 (0.931) | 2.125 (1.025) | 1.812 (0.750) | 1.571 (0.787) | 2.062 (1.181) | 1.889 (1.054)
responsiveness: 3.812 (1.355) | 4.312 (1.014) | 3.312 (1.493) | 3.812 (1.276) | 4.143 (1.069) | 3.812 (1.471) | 4.444 (1.014)
retention: 2.938 (1.458) | 3.188 (1.515) | 2.688 (1.401) | 2.688 (1.493) | 2.571 (1.718) | 3.188 (1.424) | 3.667 (1.225)
persona: 2.750 (1.391) | 3.250 (1.528) | 2.250 (1.065) | 2.438 (1.263) | 2.429 (1.512) | 3.062 (1.482) | 3.889 (1.269)
Table 3.3: Survey result of the first user test, illustrating the mean and standard deviation (in parentheses) for different groups.
Table 3.4: P-values when comparing survey results of a) Ideal DELAY group against Ideal
NO-DELAY, b) All DELAY group against All NO-DELAY group, c) Ideal group against Non-
ideal group.
4 Development - Post User Test
After analyzing the user test, it was observed that the vast majority of replies, over 82% (199/240), returned by the agent were from the generative model. It was also observed that
users oftentimes asked follow-up questions (A.7). Therefore, a decision was made to focus
on improving the generative model by:
• Developing a ranking system for over-generated answers, since some generated answers are more relevant to the question than others.
• Training a new model that takes history into consideration, in hopes of being better able to answer follow-up questions by having short-term memory and context understanding.
Training a new model consists of creating new datasets, training different models on the new
datasets and running tests to compare the models against each other and against the existing
model (model selection).
Other than improving the generative model, time was spent on refactoring the system,
updating the template component and creating question and answer classifiers.
4.1 Refactoring
Before working on improving the generative model, some refactoring of code took place. The
agent consists of three components: templates, a retrieval database and the generative model.
The components were split into individual Python files instead of residing in the same Python file. Another file was created to contain common models, functions and data that were shared between the components. By isolating the components, the aim was to make the system more understandable and to allow for ease of component-wise development and testing. If one component had bugs, it could be run by itself to find out why the bug occurred.
4.3 Follow-Up Question Test
Example 1
Generative model input:
<|startoftext|>What do you think about abortion? Hm...I don’t really like abor-
tion that much. <|endoftext|> <|startoftext|>why not?
Output (possible responses):
Example 2
Generative model input:
<|startoftext|>What do you think about soccer? I’m not very fond of it. <|end-
oftext|> <|startoftext|>why not?
Output (possible responses):
4.4.1 BM25
Initially, BM25, a “best-match” ranking function used by search engines [35], was investigated as a way to rank the over-generated answers. A Python library called rank-bm25 was used, as it provides several variants of BM25, such as Okapi BM25. The generated answer sentences are treated as the corpus for the algorithm, and the user’s question as the query. The algorithm ranks the documents (the answers in the corpus)
based on words in the query (question). If no matching words are found then every answer
gets a score of 0, which would result in an answer being picked arbitrarily. Ranking answers solely on matching words is not desirable, as it cannot recognize answers that are relevant but share no words with the question. Therefore, other options were considered.
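As an illustration, the following is a minimal sketch of how the rank-bm25 library could be used to score over-generated answers against the user's question. The whitespace tokenization and the example question/answers are assumptions for illustration; the exact preprocessing used in the thesis is not specified.

# Minimal sketch: ranking over-generated answers with Okapi BM25 via rank-bm25.
from rank_bm25 import BM25Okapi

question = "Do you have any pets?"                                            # the query
answers = ["Yes, I have a pet.", "Yes, I have two collies.", "No pets."]      # the corpus

tokenized_answers = [a.lower().split() for a in answers]
bm25 = BM25Okapi(tokenized_answers)

scores = bm25.get_scores(question.lower().split())          # one score per answer
best_index = max(range(len(answers)), key=lambda i: scores[i])
print(list(zip(answers, scores)))
print("Top-ranked answer:", answers[best_index])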
4.4.3 LDA
Latent Dirichlet Allocation [4] (LDA) is a type of topic modeling. It assumes that a set of
words make up a topic, and that each document is a mixture of topics. It was found that LDA
had been previously used for finding topic relevance of over-generated questions [7], which
is not too different from over-generating answers.
3 https://github.com/NTMC-Community/MatchZoo
4 https://github.com/NTMC-Community/MatchZoo/blob/master/tutorials/quick_start.ipynb
As such, the idea was to use LDA to classify the topics of the over-generated answers and the user’s question, and then to rank the answers based on the similarity of their topic distributions to that of the question. A vector could be created for the question and for each answer, and then cosine similarity could be used. Each value of the vector would consist of the percentage of how much a document (text) belongs to a topic. For example, with three topics and a document where 60% belongs to topic 1, 40% belongs to topic 2 and 0% belongs to topic 3, the vector would be (0.6, 0.4, 0.0).
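The following is a small sketch of this idea using gensim, assuming placeholder documents, a placeholder topic count and whitespace tokenization; none of these are the actual GTKY preprocessing or tutorial parameters used in the thesis.

# Sketch: topic-distribution vectors with LDA and cosine similarity between them.
import numpy as np
from gensim import corpora
from gensim.models import LdaModel

docs = ["i like football and basketball", "my dog is a very playful pet",
        "i listen to rap music every day", "soccer is my favorite sport"]
texts = [d.split() for d in docs]
dictionary = corpora.Dictionary(texts)
bows = [dictionary.doc2bow(t) for t in texts]
lda = LdaModel(bows, num_topics=3, id2word=dictionary, random_state=0)

def topic_vector(text):
    """Full topic distribution of a text as a dense vector."""
    dist = lda.get_document_topics(dictionary.doc2bow(text.split()), minimum_probability=0.0)
    return np.array([p for _, p in dist])

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

question = topic_vector("do you play any sport")
answer = topic_vector("i play football every weekend")
print("topic similarity:", cosine(question, answer))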
A brief test was performed to see whether LDA could be used to find distinct topics in the
GTKY dataset. A tutorial5 was followed that made use of the LDA implementation from the Python library gensim. However, the LDA option was ultimately not fully explored, as the ranking aspect was replaced by a new idea revolving around using
the Universal Sentence Encoder [6] to create sentence embeddings and calculating cosine
similarity. The topic classification aspect was replaced by clustering sentence embeddings
and the creation of classifiers by using the cluster labels. One reason that the LDA option was
not fully explored was that it was unable to create a larger number of distinct topics, see figure
4.2. The topics were observed to have some overlap and were less distinct than previously
observed in the clusters created during the data analysis part (3.4). The words that make up
each topic, with their related weights, are illustrated in figure 4.3. It also became apparent that
LDA works better for longer texts, as longer texts may contain more topics, whereas shorter texts, such as tweets or, in this case, sentences, mainly consist of one topic6. However, altered LDA algorithms [55] have been developed specifically for short-text topic modeling, which may be an option to consider if LDA were to be used.
Figure 4.2: LDA topic distance when using 10 topics on the GTKY dataset.
Figure 4.3: Words and weights for 10 topics as determined with LDA.
The developed ranking consisted of a similarity measure, a length penalty function, a keyword penalty function, a rare word reward function and a repeated answer removal function. The similarity measure worked by transforming the user’s input (question) and the
over-generated answers into sentence embeddings (vectors) by using the Universal Sentence
Encoder [6]. Thereafter, the cosine similarity between each individual answer and the ques-
tion could be calculated. The concept was extended to compare the similarity of answers to
not only the current question but also the conversation history. A history embedding was cre-
ated as a vector that is continuously updated with every new question and answer sentence
embedding. The ranking is then instead based on the cosine similarity between the answers
and the history embedding. The update is a linear interpolation controlled by a parameter α, as seen in equation 4.1.
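Equation 4.1 itself is not reproduced in this extract; the sketch below shows one plausible reading of the interpolated update (history = α · history + (1 − α) · new embedding) together with the similarity ranking, assuming the TensorFlow Hub release of the Universal Sentence Encoder. Names and the example sentences are illustrative.

# Sketch: Universal Sentence Encoder embeddings, history update and cosine ranking.
from typing import Optional
import numpy as np
import tensorflow_hub as hub

embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
ALPHA = 0.2  # low alpha: the newest sentence dominates the updated history embedding

def sentence_embedding(text: str) -> np.ndarray:
    return np.asarray(embed([text]))[0]

def update_history(history: Optional[np.ndarray], sentence: str) -> np.ndarray:
    new = sentence_embedding(sentence)
    if history is None:
        return new
    return ALPHA * history + (1.0 - ALPHA) * new   # assumed form of eq. 4.1

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_answers(history: np.ndarray, answers):
    scored = [(a, cosine(sentence_embedding(a), history)) for a in answers]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

history = update_history(None, "What kind of music do you like?")
print(rank_answers(history, ["I like all types of music.", "I like hiking."]))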
The α was set to 0.2; the low α ensures that, when ranking the answers, the agent considers the new question the most while still retaining some sense of the topic. In experiments, it was observed that relevant answers were returned more often when using the ranking than when not using any ranking (4.4.5). However, an
apparent downside of using the similarity ranking is that it prioritizes answers with the same
word(s) as the user’s questions. This means that answers that are relevant and maybe more
interesting are ranked lower than the less interesting answers that make use of the word(s).
A-1 is ranked higher as it shares a keyword with the question, but A-2 is considered more interesting and intelligent, as it does not use the keyword directly, but is still related to, and answers, the question. In this thesis, answers are deemed more interesting if they are longer (but not too long, e.g. no longer than two sentences) and avoid sharing keywords when possible.
The observed downside of this ranking led to the development of penalty and reward func-
tions on top of the similarity measure as a means of achieving more interesting answers. The
following penalty and reward functions were developed to be used together during ranking.
Extensive testing may be required to optimize the penalty and reward functions, as their purpose is to allow answers with slightly smaller similarity scores than the highest-scoring one to surpass it in the ranking. To optimize the penalties and reward(s),
it may be necessary to run tests by letting the generative model generate answers for several
questions, observe the answers and their respective similarity scores, and then decide what
an acceptable similarity range is. By knowing the acceptable similarity range (the border
between relevant and irrelevant answers), it is possible to determine how much an answer
can be affected by the penalty and reward functions. If this range is not found or known,
irrelevant answers risk being ranked higher than relevant ones, which is undesirable.
The first penalty function that was developed was based on the sentence length. The
function is given an ideal length in words, and any answer with fewer or more than the ideal number of words is penalized, see eq. 4.2 and 4.3. In this case the ideal number of
words was set to 20, although a dynamic length to match different users’ message lengths
is recommended to be further investigated, as inspired by the findings that humans tend to
match their response length to that of their conversational partner [36].
length_penalty = ln(token_difference)   if token_difference > 0
               = 0                      otherwise                    (4.3)
Using the length penalty ensures that, when there exist multiple relevant answers with
similar similarity scores, the sentence closer to the ideal length is more likely to be chosen.
This function’s main purpose is to reduce the occurrence of answers that are very relevant to
the question but are also very short and therefore deemed less engaging.
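A minimal sketch of the length penalty follows, assuming that token_difference in eq. 4.2 (not reproduced in this extract) is the absolute difference between an answer's word count and the ideal length of 20 words.

# Sketch of the length penalty (eq. 4.3).
import math

IDEAL_LENGTH = 20

def length_penalty(answer, ideal_length=IDEAL_LENGTH):
    token_difference = abs(len(answer.split()) - ideal_length)  # assumed reading of eq. 4.2
    return math.log(token_difference) if token_difference > 0 else 0.0

print(length_penalty("Yes, I have a pet."))                # short answer, larger penalty
print(length_penalty(" ".join(["word"] * IDEAL_LENGTH)))   # ideal length, no penalty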
In this example, by turning the question and answers into sentence embeddings and cal-
culating the cosine similarity of each answer to the question, the similarities for A-1, A-2 and
A-3 are calculated as 0.749, 0.732 and 0.350, respectively. By purely using cosine similarity
as a ranking, A-1 would be chosen while A-2 contains more information that may make the
conversation more engaging by allowing a follow up question such as “Who is your favorite
rapper?”. Therefore, a length penalty is applied. Two options are considered for applying
the penalty. One is to first normalize the similarity scores, such that A-1 becomes 1.0 and A-3
becomes 0.0. This creates further “distance” between higher and lower scored answers. The
other option, which is used in this case, is to use the similarity score as it is, which increases
the risk that answers with smaller similarity scores, e.g. A-3, may surpass A-1 or A-2 with the help of the penalty and reward functions. Regardless of the option chosen, the penalty scores are normalized and a percentage of the normalized penalty score is subtracted from the similarity score. The “percentage” subtracted is another parameter β; in this case β is 0.1. By following
the second option, the answer that would be chosen after applying the length penalty is now
A-2.
Updated scores:
• A-1: 0.749 -> 0.649
• A-2: 0.732 -> 0.652
• A-3: 0.350 -> (unchanged)
The second penalty function was developed to penalize answers that contain keywords from
the question. The penalty intensity correlates with the number of unique shared keywords, see eq. 4.4 and 4.5. This type of penalty may lead to the occurrence of more interesting answers. Using a variation of the previous ‘pet’ example, A-2 would have a higher chance of being chosen as the answer when using this penalty function. Although A-2 gets a higher chance of being selected, it would not be selected when only using this penalty function, but it might be selected when the penalty is used together with the other penalty and reward functions.
Keyword penalty example 1
Q: Do you have any pets?
A-1: Yes, I have a pet. (0.594) -> (0.527)
A-2: Yes, I have two collies. (0.496) -> (unchanged)
A-3: No pets. (0.685) -> (0.564)
However, in many cases, re-using keywords from the question in the answer occurs naturally,
and as such these answers are also inadvertently penalized.
Keyword penalty example 2
Q: Who is your favorite rapper?
A-1: I don’t have a favorite rapper. (0.800) -> (0.678)
A-2: My favorite rapper is Eminem. (0.793) -> (0.707)
A-3: Eminem. (0.645) -> (unchanged)
The function first tokenizes the question and all the over-generated answers. Then stop-words are removed by using NLTK’s stop-word dictionary. For each answer, the remaining tokens/words are looped through and compared with the remaining tokens in the question. For every unique token in the answer that also exists in the question, a counter “shared_tokens” is incremented. A percentage of the calculated penalty is then subtracted from the similarity score. The “percentage” is another parameter ω.
penalty_magnitude = (shared_tokens / num_q_tokens) · (shared_tokens / num_ans_tokens)    (4.4)
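A short sketch of this keyword penalty follows. It requires NLTK's 'punkt' and 'stopwords' data; whether num_q_tokens and num_ans_tokens are counted before or after stop-word removal is not stated above, so counting after removal is an assumption, and eq. 4.5 (the ω-weighted subtraction) is not reproduced in this extract.

# Sketch of the keyword penalty (eq. 4.4) with NLTK stop-word removal.
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

STOP_WORDS = set(stopwords.words("english"))

def content_tokens(text):
    return [t.lower() for t in word_tokenize(text)
            if t.isalpha() and t.lower() not in STOP_WORDS]

def keyword_penalty(question, answer):
    q_tokens = content_tokens(question)
    ans_tokens = content_tokens(answer)
    if not q_tokens or not ans_tokens:
        return 0.0
    shared_tokens = len(set(ans_tokens) & set(q_tokens))
    return (shared_tokens / len(q_tokens)) * (shared_tokens / len(ans_tokens))  # eq. 4.4

print(keyword_penalty("Who is your favorite rapper?", "I don't have a favorite rapper."))
print(keyword_penalty("Who is your favorite rapper?", "My favorite rapper is Eminem."))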
The reward function was created by using the Persona-chat and the GTKY datasets to create
weights for each word by calculating their idf-weight (inverse document frequency). Relating
to the previous example, the calculated idf-weight for music is 7.165 while it is 9.733 for rap.
The weights were created by first tokenizing the merged datasets and then lemmatizing each
token by using spaCy. Two functions from the sklearn library were used: CountVectorizer to create a vocabulary of the data and TfidfTransformer to calculate the idf weights. Finally,
a pandas DataFrame was created containing the words and their respective weights, which
was then turned into a csv file. The reward function then looks up the weights for each word
in an answer, sums up the weights, divides the sum first by the number of tokens and then by
the mean idf weight. A percentage of the calculated reward is added to the similarity score,
based on a parameter θ.
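The sketch below shows how such idf weights could be produced with sklearn and spaCy and how the rare-word reward could be computed from them. The documents, the spaCy model name, the output file name and the fallback for unseen words are illustrative assumptions; the merged Persona-Chat/GTKY corpus is not reproduced here.

# Sketch: idf weights via CountVectorizer/TfidfTransformer and the rare-word reward.
import pandas as pd
import spacy
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

nlp = spacy.load("en_core_web_sm")
documents = ["I like all types of music.", "My favorite rapper is Eminem.", "I have two collies."]
lemmatized = [" ".join(tok.lemma_.lower() for tok in nlp(doc) if tok.is_alpha) for doc in documents]

count_vectorizer = CountVectorizer()
counts = count_vectorizer.fit_transform(lemmatized)
idf = TfidfTransformer().fit(counts).idf_

weights = pd.DataFrame({"word": count_vectorizer.get_feature_names_out(), "idf": idf})
weights.to_csv("idf_weights.csv", index=False)         # illustrative file name
idf_lookup = dict(zip(weights["word"], weights["idf"]))
mean_idf = weights["idf"].mean()

def rare_word_reward(answer):
    tokens = [tok.lemma_.lower() for tok in nlp(answer) if tok.is_alpha]
    if not tokens:
        return 0.0
    total = sum(idf_lookup.get(t, mean_idf) for t in tokens)  # unseen words fall back to the mean (assumption)
    return total / len(tokens) / mean_idf                     # divide by token count, then by the mean idf

theta = 0.1  # illustrative fraction of the reward added to the similarity score
print(rare_word_reward("My favorite rapper is Eminem."))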
Three ranking methods were compared in the evaluation:
1. Using no ranking
2. Using the similarity ranking
3. Using the similarity ranking together with the length penalty
The test was made by feeding the generative model questions from the user test data and then storing the best answers from the two ranking methods and the top answer from the method without any ranking. The evaluation was done manually by looking at the question and the
answers from each method. The evaluation was performed in a one-on-one style where a
point was awarded to the answer that was most relevant to the question. If both answers
were equally relevant or irrelevant, then no point was awarded (tie). The distributed points of method 1 against method 2 are shown in table 4.1, while method 2 against method 3 is shown in table 4.2.
When comparing method 1 and method 2 on 55 questions, it was deemed that method 2
had relevant answers more often than method 1. For 44 of the questions, no points were awarded. Method 1 was awarded one point, whereas method 2 was awarded 10 points. Note
that some answers could not be compared due to the question being a follow up question
specific to the original answer from the user test. In such cases, the question was considered
a tie. Method 3 was not compared as it is very similar to method 2 in terms of relevancy.
Another evaluation was done to measure which ranking method had the most interest-
ing answers. Interesting answers are in this case considered as answers that contain more
substance/information/detail, for example "I like all types of music but my favorite is rap
music." instead of "I like all types of music." Although the evaluation may have been subjec-
tive, it was done for the purpose of making a fast development choice. With more resources
the evaluation could have been performed through utilizing crowdworkers to award points,
whereafter an average could be calculated. The score resulted in 10 points for method 1 and
16 points for method 2. When comparing method 2 and 3, method 2 got 5 points and method
3 got 8 points. The lower total of points in the comparison of method 2 and 3 was due to a
larger number of tied answers. As the two methods are both based on a similarity measure,
a lot of answers were exactly the same.
It was concluded that adding the similarity ranking ensures that answers are more often relevant to the question than if no ranking is used. Additionally, by using the length penalty, more interesting answers were found slightly more often.
A cluster count of 40 was instead chosen in this case. Some clusters (1, 5, 8, 11, 14, 15, 21, 23, 38) were removed due to either consisting of a larger range of mixed topics or of nonsensical data. The remaining clusters are presented in table 4.4. Cluster 36 consisted of “Nice to meet you” sentences and was removed as it was deemed out of scope for the question answering chatbot. Cluster 17 consisted of sentences such as “I will try that” and was also removed for the same reason as cluster 36. Cluster 28 was removed as it consisted only of smiley faces ‘=)’. An idea occurred that some answer clusters could be used to extend the question classifier. The question classifier was trained once again with 11 more clusters from the answer clustering (36, 34, 33, 30, 29, 27, 24, 22, 20, 10, 7). The new clusters offered classification for various acknowledgements and other common messages such as “Hello”, “Thank you”, “good morning”, etc. Thereafter the answer classifier was trained, and the two trained classification models were saved.
The GTKY-preprocessed dataset was similarly preprocessed to have history tokens. How-
ever, as the dataset had been first automatically and then manually processed, there is no
known start and end of a conversation, and as such the history tokens are included purely to
teach the model the pattern.
The datasets were named PC-history and GTKY-history. Some variations of the history-focused datasets were created. One variation mixed all of the questions and answers with history tokens from the GTKY-history dataset into the PC-history dataset. The data was mixed by iteratively taking one "conversation" from the GTKY-history dataset and then one conversation from PC-history. This was done until the smaller dataset (GTKY-history) ran out of conversations, at which point the remaining conversations from the other dataset were inserted.
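A minimal sketch of this interleaving merge is shown below; the conversations are assumed to already be loaded as strings, and reading the files is not shown.

# Sketch: alternate one GTKY-history and one PC-history conversation, then append the remainder.
def interleave(gtky_conversations, pc_conversations):
    merged = []
    for gtky_conv, pc_conv in zip(gtky_conversations, pc_conversations):
        merged.extend([gtky_conv, pc_conv])
    shorter = min(len(gtky_conversations), len(pc_conversations))
    merged.extend(gtky_conversations[shorter:])  # empty if GTKY-history is the smaller set
    merged.extend(pc_conversations[shorter:])
    return merged

print(len(interleave(["g1", "g2"], ["p1", "p2", "p3"])))  # -> 5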
Similarly, the GTKY-history was merged with a few entries from the PC-history to double
the size of the dataset. For every 4 GTKY-history "conversations" one conversation from
PC-history was added. Note that the PC-history is 11.1 MB while GTKY-history is 166kB.
The reason to merge was that the models were initially fine-tuned on the PC-history dataset,
and then fine-tuned on the GTKY-history dataset, as it contains cleaner (manually processed)
data. Since GTKY-history does not contain a history of questions per conversation as PC-history does, it was hypothesized that the model may ‘forget’ to use history after being
fine-tuned on the GTKY-history. Forgetting here refers to the model generating questions
and answers, but it does not add previous questions and answers into the history section.
Therefore, by mixing GTKY-history and PC-history, it is assumed the model does not ‘forget’
to use history.
A second variation extended the previous variation by including a limit on the number of
past questions and answer pairs in the history section. The limit was arbitrarily set to 4 such
that only the 4 most recent past pairs were kept in the history for the conversation.
The last variation used the Persona-Chat dataset without extracting questions and answers. Thereby the dataset retains the original conversation order while taking history into consideration. The PC-history dataset contains a history of past questions and answers, but some of the context may be lost between one pair and another due to the automatic extraction method. Therefore, the idea with the last variation was to initially teach the model a better context understanding and to thereafter fine-tune it for question answering.
The last model was fine-tuned similarly to model #19 but with the 4 question and
answer pair history limit. Each model was a 124M parameter model.
Additional fine-tuning experiments had previously been done after model #10 and before
model #18, but these models were not documented.
Vanilla refers to the method as seen in the user test; no previous context given to the
model and simply returning the top answer in the list of over-generated answers. For each
answer from each method, there was a separate history embedding that was updated. Cosine similarity was then calculated between the answer and that method’s history embedding, to serve as the metric for automatic evaluation. The answer with the highest similarity was determined to be the winner for the current user question. The results when generating answers to 89 questions were [10, 37, 42], meaning that 10 points were given to the vanilla method,
37 points to the second method, and 42 points to the third method. When running one-on-
one automatic evaluation on vanilla method against the second method, the score resulted
in [13, 76], with 76 going to the second method. Manual evaluation was also performed, to
confirm whether the automatic evaluation worked, which resulted in [7, 22], meaning that
most questions tied. Automatic evaluation of the second method against the third method
resulted in [33, 56] with 56 going to the third method. Manual evaluation resulted in [15,
14]. It was observed that using history sometimes led to repeated answers which affected
the evaluation, meaning that the model generated, and through ranking selected the same
answer that was seen in the answer to the previous question. One idea was therefore to only
use history when observing that the question is a follow up question. Another idea was to
remove repeated answers, see section (4.6.4).
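The one-on-one automatic evaluation used above can be summarized in a short sketch. Only the point-awarding loop is shown; the sentence_embedding, update_history and cosine helpers are assumed to be those sketched earlier in the ranking section and are passed in as parameters.

# Sketch: per-method history embeddings, with a point for the method whose answer
# is most similar to its own history embedding for each question.
from collections import defaultdict

def automatic_evaluation(questions, answers_per_method,
                         sentence_embedding, update_history, cosine):
    """answers_per_method maps a method name to a list of answers, one per question."""
    histories = {method: None for method in answers_per_method}
    points = defaultdict(int)
    for i, question in enumerate(questions):
        scores = {}
        for method, answers in answers_per_method.items():
            histories[method] = update_history(histories[method], question)
            scores[method] = cosine(sentence_embedding(answers[i]), histories[method])
            histories[method] = update_history(histories[method], answers[i])
        points[max(scores, key=scores.get)] += 1
    return dict(points)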
Another test was performed where one model always got 1 message pair as history and
another model got the entire history of the conversation as input. Automatic evaluation re-
sulted in [36, 40] whereas manual evaluation resulted in [10, 15]. Although the model with the
entire history scored better on manual and automatic evaluation (40 and 15), it was observed
to generate more repeated answers.
A conclusion of the context testing is that providing the model with one message pair as history improves its chances of answering follow-up questions, although it may lead to repeated answers, which affects the number of relevant answers. Therefore, it may be necessary to have a way to identify follow-up questions (4.5), such that history is only provided if the question is a follow-up question, to reduce occurrences of repeated answers, or to attempt to remove the repeated answers (4.6.4).
The history embedding was updated in turn, starting with the oldest question and answer
in the history, until the new question. Then the input consisting of the history and the new
question was given to the generative model to over-generate a number of answers, in this
case 8. Then the answers were processed to remove occurrences of special tokens. The answer processing function was updated to consider the new history tokens: <|startfohistory|> and <|endfohistory|>. The answers were then ranked, and the top-ranked answer was returned.
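A sketch of how the model input could be assembled from the history and the new question using the tokens named above is shown below. The exact layout used during fine-tuning is not shown in this extract, so the ordering here is an illustrative assumption.

# Sketch: building the generative model's input string with history tokens.
HISTORY_START, HISTORY_END = "<|startfohistory|>", "<|endfohistory|>"
TEXT_START = "<|startoftext|>"

def build_model_input(history_pairs, new_question, max_pairs=4):
    """history_pairs: list of (question, answer) tuples, oldest first."""
    recent = history_pairs[-max_pairs:]  # keep only the most recent pairs (the limit of 4)
    history = " ".join(f"{q} {a}" for q, a in recent)
    return f"{HISTORY_START} {history} {HISTORY_END} {TEXT_START} {new_question}"

print(build_model_input([("What do you think about soccer?", "I'm not very fond of it.")],
                        "why not?"))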
The answer classification was never utilized in the agent due to time constraints, although
it ideally could be used not only as part of the ranking or a pre-ranking filter but also to find
contradictions with the help of Natural Language Inference (NLI) methods. The idea to find
contradictions was to compare the generated answers with the past answers in the topic. If
the agent said that it loves being outdoors and hiking, then in the future, generated answers
that contradict this statement should be removed before ranking. Contradiction detection
and avoidance is left as future work.
A script was made that ran the system on every conversation from the user test data by
feeding the system the users’ utterances. The script ran once for each model and the inputs
(user utterances) and outputs (answers) were stored in a csv file. Automatic evaluation was
then performed by, for each question, loading the answer given by the respective models. The
four answers were then ranked through the existing ranking system to find which model’s
answer was ranked the highest. To rank the answers, a history embedding was necessary
to find the relevance between the answer, the question and previous history. In this case, a
history embedding was maintained by only updating the embedding on the questions instead
of questions and answers. The automatic evaluation resulted in [67, 69, 70, 68], with the model order (18, 19, 20, 21). The automatic evaluation therefore found the models to be roughly equally good.
A manual evaluation was done where, for each question, a point was given to the models
whose answers were acceptable. This means that a point can be given to all answers if all of
them are acceptable. An acceptable answer is considered one that does not contradict itself within the same message and is relevant enough to the question that the answer seems natural. It does not take into consideration whether it contradicts a past message. On the data
consisting of 239 questions, the scores were [85, 76, 64, 98]. In this case, if an answer was given
by a component other than the generative model, then no score was given since the answer
is a template or from the retrieval database. If the “question” was not a question, or if it was a too specific follow-up question, then again no score was given. The manual evaluation showed that models 18 and 21 may be the best models, with 85 and 98 points respectively.
Automatic evaluation was then performed again, by only considering model 18 and model
21, which resulted in 131 points to model 18 and 123 points to model 21. By instead skipping
non-ideal questions (non-question user inputs and specific follow-up questions) and running
automatic evaluation again, the score was 67 and 61 to model 18 and 21, respectively.
Since the user test data contained non-question utterances by users and specific follow-up questions, this in turn led to poor automatic evaluation. Therefore, each generative model
ran on a list of 86 questions, see appendix A.8, to generate answers. The questions had been
previously extracted after a model was accidentally trained purely on question asking, in
the beginning of the development. The automatic evaluation was performed as previously
described, by ranking the answers from the different models, and giving a point to the model
with the highest-ranking answer. The results were [33, 17, 21, 15] when considering all four,
and [52, 34] when considering model 18 and model 21. Manual evaluation was done once again between model 18 and model 21, where points were given based on answer relevancy. The result was 47 ties, 22 points to model 18 and 17 points to model 21.
The first manual evaluation resulted in model 21 being the best whereas the following
automatic evaluation showed that model 18 was better than model 21. The automatic and
manual evaluation of the answers to the 86 questions resulted in model 18 being favored.
Therefore, after all the tests and observing the generated answers by the different models, it
was decided to use model 18. It should be noted that both automatic and manual evaluations
can be further improved. The flaw of manual evaluation is that of human error and the
possibility of inconsistent scoring. The flaw of automatic evaluation is that of missing out on
semantic meaning and whether or not the answer makes sense [25].
4.7 Comparison Between Previous and Current Agent
Data from the first user test was used to compare the updated component with the old gener-
ative component. The comparison was made on 28 follow-up questions, by generating new
answers to the follow-up questions, while providing the original data as conversation history.
It was observed that the new component more often outputs relevant responses compared to
the original data. However, it still has problems with providing a relevant answer to the
question. When rating which of the original answer and the new answer was more relevant to the question, it was found that the new answers were more relevant 13 times, the original was more relevant 4 times, and 11 answers were considered equally relevant or irrelevant.
5 Development of a Question Asking Chatbot
In order to make the conversation flow engaging, it is essential for the chatbot to have the ability to ask the users questions. Asking the users questions prevents one-sided conversations where the user is always the one who initiates the topic. It also shows that the chatbot listens, pays attention to the current topic and is aware of the current discussion. Asking questions also makes the whole conversation flow more realistic, because that is how a real-world conversation between two friends would take place.
• Questions that are related to the stored information about the user.
For the immediate follow-up questions such as how, why, when and so on, inspiration was taken from some question templates from [18]. If the current topic relates to, for example, a sport such as football, several types of general follow-up questions can be generated. Examples are “What do you like about football?” or “Do you happen to play football?”, and the user’s answers can be stored and utilized when generating future questions.
The concept of memory was introduced in order to generate relevant questions after the first interaction. This was achieved by storing the desired extracted information about the user in a dataframe. For example, if the topic football was mentioned in the first interaction, then
one or several questions regarding this topic will be generated in the future, such as “Do you
happen to also watch football?” or “Did you do it recently?”. The chatbot would then be able
to “remember” information about the user, and would be able to perform conversations in a
way similar to what people often do in real life.
The available topics were reduced to two, as the main focus lay on investigating whether or not the use of memory can improve the user experience. It was sufficient to supply the chatbot with knowledge within a few common topics. Sports and animals are the two major topics that were tackled.
5.3 Rasa NLU
• A config.yml file, which contains the configuration of Rasa NLU and Core models.
5.3.1 Tokenization
The Spacy Tokenizer6, which is part of the Spacy NLP pipeline, was used to create tokens. It segments a text into words, punctuation marks and so on, and creates Doc objects with the discovered segment boundaries. A Doc is a sequence of Token objects. Spacy has introduced a novel tokenization algorithm7, which gives a better balance between performance, ease of definition and ease of alignment with the original text input.
1 https://rasa.com/docs/rasa/user-guide/rasa-tutorial
2 https://spacy.io
3 https://rasa.com/docs/rasa/nlu/choosing-a-pipeline/
4 https://catalog.ldc.upenn.edu/LDC2013T19
5 https://nlp.stanford.edu/projects/glove/
6 https://spacy.io/api/tokenizer
7 https://spacy.io/usage/linguistic-features#how-tokenizer-works
5.3.2 Featurization
The Spacy Featurizer, which is also part of the Spacy NLP pipeline, was used to create features for entity extraction, intent classification and response classification. It is a type of dense featurizer, which saves memory and thus makes it possible to train on larger datasets.
A Regex Featurizer was included by default to create a list of the regular expressions defined in the training data during training. For each regex, a feature is set that contains information on whether the expression was found in the user’s input or not. All features are then fed into an intent classifier and/or entity extractor to simplify classification. Regex
features for entity extraction are supported by the DIET (Dual Intent Entity Transformer)
classifier8 component.
A Lexical Syntactic Featurizer was included by default to create lexical and syntactic fea-
tures for a user’s message to support entity extraction. It moves with a sliding window over every token in the message and creates features according to a default configuration9. It was also possible to configure what kinds of lexical and syntactic features it should extract.
Finally, a Count Vectors Featurizer from Scikit-learn10 was used to convert a collection of
text documents to a matrix of token counts. It creates a bag-of-words representation of the user message, intent and response. This featurizer can be configured to use word or character n-grams. It is set to “word” by default, counting whole words and using word token counts as features. This featurizer was used a second time where it was set to “char_wb”, which looks at sub-word sequences of characters.
9 https://rasa.com/docs/rasa/nlu/components/#lexicalsyntacticfeaturizer
10 https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text
.CountVectorizer.html
11 https://rasa.com/docs/rasa/nlu/components/#spacyentityextractor
12 https://spacy.io/api/annotation#section-named-entities
13 https://www.youtube.com/playlist?list=PL75e0qA87dlG-za8eLI6t0_Pbxafk-cxb
An Entity Synonym Mapper component was used to map synonymous entity values to
the same value. With this component, the model has the ability to detect and recognize similar
entity values and return just one consistent value. This is useful as users may not refer to a
certain object using the exact same name.
Finally, a response retrieval model was built by the Response Selector component. It was
used to predict the chatbot’s response from a set of candidate responses. It follows the exact
same neural network architecture and optimization as the DIET classifier.
On the other hand, a disadvantage was that the information may not always be usable or
as desired, because it is based on the information in the Open Mind Common Sense (OMCS)14
database, which is built on a large commonsense knowledge base from the contributions of
many thousands of people across the Web. This results in potentially inconsistent or missing
information and relations for a given search word. Therefore, this tool was not considered.
5.6 Rasa Core
5.6.1 Story
Several stories were supplied in order to cover a wide range of general opening and closing
conversations, such as responding to the user’s greeting, asking how the user is doing, and
saying goodbye. Additionally, the chatbot was also taught to handle unexpected scenarios,
for example when the user’s input does not make sense, or is written in another language, or
anything that the chatbot had not learned and therefore was not able to give a proper reply.
The solution was to respond that the chatbot did not understand the user’s input, or to ask
the user to rephrase the sentence. These stories are usually simple and short, and do not
contain more than two responses or actions. Below are a few examples of such stories.
## story - sad
* user.sad
- utter_reply.to_sad
## story - good
* user.good
- utter_reply.to_good
## story - bye
* user.bye
- utter_greet.bye
## story - reject
* user.reject
- utter_reply.to.reject
- utter_greet.bye
After the chatbot was capable of handling the common opening and closing conversa-
tions, a few more complex stories were added to take care of conversations that involve more
details and topics related to sports and animals. When the chatbot is deployed on a virtual
machine later on and open for user interaction, more stories can be generated from the con-
versation history and used to retrain the chatbot.
5.6.2 Domain
The domain was essentially the universe that the chatbot lives in. It started with defining all
the intents that the chatbot was able to classify, such as user.thank, user.sad, user.good, user.bye
and user.reject, which were seen previously in 5.6.1. Multiple intents could be created for various topics, such as discuss_sport for sport-related discussions and discuss_animal for animal-related discussions. However, the more intents that are created, the more training stories would be required. Below are two example stories which train the chatbot to ask follow-up questions.
* discuss_animal
- utter_ask_why_animal
Instead of creating a unique intent for each topic, a better approach was to generalize
and group them as one intent, in this case called enter_data. It consisted of all NLU training
examples for topics related to sports and animals. With this approach, fewer intents would
be needed, and the design of training stories was simplified and more generalized. Below
are a few examples of NLU training data about sport and animal topics, as well as a more
generalized story that trains the chatbot to ask the same follow-up question.
In the NLU training data, the entity types were defined using parentheses. Since the focused topics were sports and animals, the two main entities are then (sport) and (animal). The entity’s value was defined using square brackets, such as [football] and [dog]. This helps the
chatbot to map these values to the desired entities. For example, if the user inputs “I like
monkey.”, then the chatbot is able to extract entity animal with value monkey. Similarly, if
the user inputs “I really like basketball.”, then the chatbot is able to extract entity sport with
value basketball. The extracted entities were stored in slots. Slots are essentially the chatbot’s memory, and are covered in more detail in 5.6.3.
5.6.3 Slot
The chatbot was equipped with slots, which are its short-term memory. They are considered short-term memory because the data is lost after each session: when the chatbot restarts, its slots are emptied. To implement long-term memory, certain slots are saved to an external database. Each memory slot stores a piece of information. With the help of slots, the chatbot was able to memorize certain information throughout the conversation, such as the user’s name.
The number of slots depended on how many pieces of information the chatbot needs to remember. Slots such as email and name were commonly used and were essential in order to have a general knowledge base of the user. Other types of information to be stored should be usable in a follow-up and/or long-term question. For example, if the user states that their favorite sport is basketball, then the value “basketball” is stored in an unfeatur-
ized slot named “type_of_sport”. The chatbot can then utilize this information and ask a
follow-up question such as “Did you play basketball recently?”. The answer is expected to
be a yes or no; this is converted to a True or False value and stored in a boolean slot named “recent_active_sport”. Now the chatbot knows whether or not the user has done the activity recently. During the next session, the chatbot can then utilize this information to ask a related question, depending on the boolean value. For example, if the boolean returns True, the chatbot can ask how the activity went; if the boolean returns False, the chatbot can ask
if the user will perform the activity in the near future. Below are a few examples of slots
that store such information that can be utilized to ask follow-up questions and/or long-term
questions.
type_of_animal:
  type: unfeaturized
own_animal:
  type: bool
animal_breed:
  type: unfeaturized
More slots would be needed as the number of topics increases and the deeper the chatbot tries to cover each conversation. Therefore, limiting the topics to as few as two was beneficial, since less work would be needed for the conversation flow design. The number of slots depends on how much information is desired and is to be extracted. Certain information from the slots will be stored externally in a dataframe, based on the design of the long-term questions. For example, if the user owns an animal, and the animal is a dog, then a potential long-term question can be “So what breed is your dog?” or “What color is your dog?”. In order to ask such questions, certain information, such as that the user owns an animal and that the type of animal is a dog, is essential to store. If more follow-up questions are required, additional slots such as “animal_color” and “animal_breed” can be added. Essentially, the more information is extracted, the more slots will be required. Below are the slots used for topics related to sports and animals; there were 14 slots in total.
There were 6 slots for the sport topic, where 2 slots, “type_of_sport” and “recent_active_sport”, were stored externally; and 8 slots for the animal topic, where 2 slots, “type_of_animal” and “own_animal”, were stored externally. These were stored externally in order for the chatbot to memorize some personal information about the user, so that it was able to ask personalized questions in the next conversation. For example, if the user’s favorite animal is the cat, and s/he happens to own one, then based on this information the chatbot is able to ask about the cat’s breed, color and size; on the other hand, if the user’s favorite animal is the cat but s/he does not own one, then the chatbot can ask if s/he plans to own one in the future. Many more such follow-up and long-term questions could be designed to make the chatbot more friendly and engaging.
5.6.4 Response
Responses are the messages/utterances that the chatbot is able to use to reply to a user. These responses were defined in the domain file. Below are a few examples for the utterances utter_reply.to_thank and utter_greet.bye, which were seen previously in 5.6.1.
Each utterance can contain one or more candidate responses (templates); the chatbot will select one randomly. With this approach, the specific texts do not need to be provided when designing the training stories. It was sufficient to supply the type of utterance that was suitable given an intent. When the chatbot has classified a particular intent, it knows which utterance to use and picks one of the responses to send.
The chatbot’s memory can be utilized to make the responses more dynamic. For example, when the user inputs “I really like basketball.”, a follow-up question can be “What do you like about basketball?”. However, the number of responses would be enormous, as there are many types of sports, and this is where slots are useful. The chatbot is able to extract the entity sport with value basketball from the above user input; it then stores this value in a slot named type_of_sport. The response can utilize the information from this slot, and a general follow-up question is adjusted to “What do you like about {type_of_sport}?”. This response changes according to the entity value extracted from the user input. Now the user is able to talk about all kinds of sports, and the chatbot is able to follow up with a related question using just one or a few templates.
5.6.5 Action
Actions are similar to responses, except that they have more functionality. The chatbot can use responses to reply to the user, but these are simple text replies; they can be customized with the help of slots to make the replies more dynamic, but that is about all. On the other hand, the chatbot can use actions to perform more complicated tasks, such as calling an external API and storing and retrieving information from an external dataframe.
To implement the chatbot’s memory, an external dataframe was used. Table 5.1 shows the information that is stored externally in an Excel file.
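A minimal sketch of this external memory with pandas follows. The file name is an illustrative assumption; the column names are taken from the externally stored slots described in this chapter (user, name, type_of_sport, recent_active_sport, type_of_animal, own_animal).

# Sketch: writing slot values to an Excel file and checking for a returning user by email.
import os
import pandas as pd

MEMORY_FILE = "user_memory.xlsx"   # illustrative file name
COLUMNS = ["user", "name", "type_of_sport", "recent_active_sport",
           "type_of_animal", "own_animal"]

def load_memory():
    if os.path.exists(MEMORY_FILE):
        return pd.read_excel(MEMORY_FILE)
    return pd.DataFrame(columns=COLUMNS)

def is_returning_user(email):
    return email in load_memory()["user"].values

def save_session(slot_values):
    memory = pd.concat([load_memory(), pd.DataFrame([slot_values])], ignore_index=True)
    memory.to_excel(MEMORY_FILE, index=False)

save_session({"user": "user@example.com", "name": "Alex",
              "type_of_sport": "basketball", "recent_active_sport": True,
              "type_of_animal": "dog", "own_animal": True})
print(is_returning_user("user@example.com"))  # True on the second session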
There are several slots that were filled with information, and six of them were useful for generating long-term questions. Since there would be two conversation sessions, where in the first session the chatbot would collect information and use some of it to generate personalized questions in the second session, it needed a way to identify and distinguish between new and existing users. The email slot was stored in the user column for simple identification and for checking whether the user exists in the dataframe, to distinguish between the first and second interaction. The name slot was stored in the name column, so that the chatbot
had the ability to refer to users by their names. The other four were related to the sports and animals topics. At the end of the first interaction, all slots were cleared, and the information from these six slots was written to an Excel file. At the start of the second interaction, when the user enters his/her email address, it is used to check whether or not the same email is found in this Excel file. If the email address is found, it means that the user has interacted with the chatbot previously; otherwise, it means that it is the user’s first interaction. A different story path is used depending on whether it is the user’s first or second interaction. Below is an example of when a user enters a second session.
Custom actions such as API calls were also possible. For example, it is possible to get data from ConceptNet in JSON format, and various information can be retrieved depending on the relations that are looked for. In general, this helps to widen the chatbot’s knowledge and information regarding a specific topic, depending on how knowledgeable the chatbot needs to be. An extra functionality was tested, which enables the chatbot to send a URL where the given search word can be found, to emphasize that the chatbot remembers the previously
mentioned topic. It can reply for example, “By the way, I saw this post the other day, check
it out!” and insert the URL that is related to the topic. This is similar to a recommendation
system. However, this was not implemented in the final version, as not every search word
will have available data, although it might make the chatbot more interesting to interact with.
Other API calls can be useful depending on what functionalities the chatbot was designed to
have.
5.7 Rasa X
After a first basic model was sufficient to cover the most common and general conversation scenarios, development moved on to conversation-driven development using Rasa X. Conversation-driven development allows the chatbot to learn from the users and to use those new messages as training data to retrain and improve the chatbot. Rasa X was essentially a layer on top of Rasa Open Source, and it was deployed on a virtual machine on Google Cloud15. The virtual machine was a standard type with 2 CPUs, 7.5 GB of memory and no GPU. This met the requirements16 to host and run Rasa X. The operating system was Ubuntu 18.0417. Once the virtual machine was set up, Rasa X was installed using Docker Compose18. This was a good choice since the user traffic was not expected to be large19, i.e. fewer than hundreds of concurrent users, which was the case for this study.
Rasa X has an Integrated Version Control feature, which automates data synchronization
with a Git repository, annotates new data and pushes those changes with Git. The project files
were uploaded onto GitHub. To connect Rasa X to the remote repository, a public SSH key
was generated and added as a deploy key with write access on GitHub. After the connection
15 https://cloud.google.com/compute
16 https://rasa.com/docs/rasa-x/installation-and-setup/install/docker-compose/#doc
ker-compose-requirements
17 https://releases.ubuntu.com/18.04
18 https://docs.docker.com/compose
19 https://rasa.com/docs/rasa-x/installation-and-setup/installation-guide
had been established, Rasa X could pull from the repository and train the model. For any
further adjustments, Rasa X could push the changes to the repository on the master branch
directly, or to a new branch and perform a merge later.
5.8 Architecture
Figure 5.2 shows the architecture of the chatbot, built using the open source Rasa framework, which consisted of two main components: Rasa Core and Rasa NLU. Rasa Core was the dialogue engine, and the domain file defined the settings for the chatbot, such as what it should understand, what it could use to reply and so on. Rasa NLU took in NLU training data and conversation examples in order to train the chatbot; it could then take care of intent classification and entity extraction. The conversation flow design, and other functionalities such as storing and fetching information externally from a dataframe, were defined in the action file. This chatbot was deployed on a virtual machine with the help of the Rasa X tool, which helped improve the chatbot as it interacted with users.
The conversation flow started with asking for a means of identification, in this case the user’s email address. The chatbot used the given email and checked it against a dataframe to see whether the user exists, in order to decide whether it was the first or second interaction. The chatbot started the conversation with five chit-chat questions, such as “How should I call you?”. Thereafter, the chatbot proceeded to the main topics, which were related to sports and animals. In the first session, there were three questions related to sports, one opening question and two follow-up questions; and there was a total of four questions related to animals, one opening question and three follow-up questions. In the second session, there were three long-term questions related to sports and four long-term questions related to animals. Not every single question would be asked; it depended on certain replies from the user (ones that return boolean values). Finally, the chatbot would thank the user for the participation and attach a link to the survey.
5.9.1 Survey
The interaction time for both sessions was estimated to be about five to seven minutes. After
each interaction, the user was asked to fill in a survey that consists of five questions, two of which are the same as in the question answering chatbot survey, in order to have some
common metric for comparison. The users were asked to rate on a 5-point Likert scale from
1-“Completely disagree” to 5-“Completely agree”. The five questions are shown below:
Figure 5.2: Architecture of the chatbot built using the Rasa framework; both Rasa Core and Rasa NLU are used for dialogue management and natural language understanding. The chatbot is hosted on a virtual machine on Google Compute Engine for user tests.
5.9.2 Results
The user test had five participants in total, one of whom was a PhD student, one a recent PhD graduate, one a recent master’s graduate, and the other two were master’s students. Further investigation was conducted when one or more questions had low scores. Table 5.2 shows the mean results for each question.
As seen in the table, the results of the chatbot’s grammar, appropriateness and engaging-
ness from both sessions were consistent. However, there was a decrease in its relevancy and
realness. Therefore, further investigation and discussion were conducted with the users who had entered a lower score for these two metrics in the second session.
The first immediate shortcoming concerned the survey itself. It lacked clear instructions informing the users of what the chatbot does and its limitations; it mainly generated questions, and the focus was on the usage of memory. Since this was not explicitly stated, users might have had different expectations and might have taken unexpected actions, such as trying to ask the chatbot questions.
The second shortcoming was the opening chit-chat questions. The intention of this small talk was to make the chatbot friendly and let the users feel more relaxed before jumping into the main topics. While some of the common chit-chat questions worked as intended, others might raise the user’s expectations of the chatbot’s capability. Therefore, users might be inclined to test the chatbot’s limitations and see what it could do, or reply with an answer that was not logical or suitable compared to a normal conversation between two people. Below is a sample of interactions from a user during both sessions, where questions were asked by the user, which was unanticipated.
Below is another sample of interactions from a user during both sessions, which is a better representation of a normal conversation between two people. In this case, the user still gave a lower score on the realness metric. The main reason was that the user referred to the dog as a “she”, but the chatbot still used “it”, which showed that the chatbot was not able to comprehend fully. This could be improved by introducing an additional slot; however, since it did not contribute significantly to answering the research question, this part was left out.
It was unavoidable that users would also ask questions, and the bot was not designed to handle the more complicated ones. However, this problem would be eased in the next stage, when this chatbot was combined with the question answering chatbot.
Some of the chit-chat questions had also raised the users’ expectations of the chatbot’s capability. Chit-chat questions such as “How are you today?” were considered normal opening questions; on the other hand, questions such as “what do you do for a living?”, while still belonging to chit-chat, narrow the scope down to topics related to jobs. This could mislead users into thinking that the chatbot was able to handle conversations related to jobs. Therefore, several chit-chat questions were removed.
Out-of-context messages were treated normally. For example, when the first user answered that they have a shark as a pet, the chatbot treated the animal as a normal house pet, which was unrealistic. This happens when users try to enter unrealistic or out-of-context answers, which would not happen in a normal conversation. The chatbot was not designed to handle such situations. However, this would not be a severe problem when the users answer the questions normally.
6 Integration
The final step of development consisted of combining the question answering agent with the
question asking agent into one system. The question asking agent was designed to have more
control of the conversation flow and therefore it was decided to use this agent as the base for
the system. The question answering agent was turned into a callable API where new rules
had to be constructed as to when to make an API call from the base agent.
The second new component was developed to introduce a new topic by letting the agent
mention its favorite subject in the topic and asking the user about theirs, see figure 6.1.
A-chatbot: “My favorite sport is football. Do you have a favorite sport?”
• The question answering function: the function takes the user ID and the user utterance as input and returns an answer.
• The self-disclosure function: the function takes the user ID, the user utterance and the topic as input. This function is called after a user answers an initial question in some topic (once per topic). The agent then acknowledges and shares its own preference on the topic.
• The disclosure-and-reflect function: the function takes the user ID, the user utterance and the topic as input. The user utterance is only given to be saved in the agent’s conversation history. The function returns an output based solely on the topic. The function is called only once for each topic that has not yet been discussed.
1 https://github.com/lcebear/memoryDialogueBot
The inputs and outputs to the API are in JSON format. The output from the API consists of an
answer field, an error field, and a component field. The answer field is the answer returned
by the question answering system. The error field is to inform whether an error occurred on
the side of the question answering system. The error field is ‘None’ by default and therefore,
an answer should only be considered if the error field is ‘None’. The component field is
returned purely to record which component of the system was used to get the response.
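A small sketch of how the base (question asking) agent could call this API and use the JSON fields follows. The endpoint URL and request payload keys are illustrative assumptions; only the response fields (answer, error, component) are taken from the description above.

# Sketch: calling the question answering API and interpreting its JSON response.
from typing import Optional
import requests

QA_API_URL = "http://localhost:5005/answer"  # assumed endpoint

def ask_question_answering_agent(user_id: str, utterance: str) -> Optional[str]:
    response = requests.post(QA_API_URL, json={"user_id": user_id,
                                               "utterance": utterance}).json()
    if response.get("error") not in (None, "None"):
        return None  # only use the answer when the error field is 'None'
    print("answered by component:", response.get("component"))
    return response.get("answer")

answer = ask_question_answering_agent("user@example.com", "Why do you prefer dogs?")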
In the suggested flow, the user will have two options for their next reply, either ask a
question, "Why do you prefer dogs?”, or acknowledge the agent’s answer, “Ok.”. If the user
asks a question, the API will be called to answer it, otherwise the agent will continue with
asking questions. Once the user starts asking questions, the user can be said to have control
of the conversation and therefore rules have to be determined as to when the system should
retake control and start asking questions again.
One possibility is to keep calling the question answering agent for as long as the user is
asking questions. Once the user replies with a message that is classified as “Acknowledge-
ment”, the question asking agent will retake control.
Another possibility is to allow the user to ask one question by calling the API, getting the
answer and adding a question at the end of the answer such that the next agent response
consists of an answer and a new question. This option was chosen in this case due to the
scripted nature of the question asking agent. Only allowing one question by the user ensures
that the conversation remains on topic. With the other option, the user may ask questions
which lead the conversation into another topic, and once the agent gets control again, it asks
a question about the original topic again, which may seem out of place.
Another addition is that, for every user message, a check is performed to see whether or not the message intent is classified as a question. If it is, then an API call is made to the question answering chatbot. This handles cases where users add a question at the end of their message.
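A minimal sketch of this per-message check, assuming the question answering agent is reachable at the hypothetical endpoint sketched above and that the intent classifier labels questions as "ask_question":

import requests

QA_API_URL = "http://localhost:5055/answer"  # hypothetical question answering endpoint

def route_message(user_id, message, intent):
    # Messages whose intent is classified as a question are sent to the question answering API.
    if intent == "ask_question":
        response = requests.post(QA_API_URL, json={"user_id": user_id, "utterance": message}).json()
        if response["error"] == "None":  # only consider the answer when no error occurred
            return response["answer"]
    return None  # otherwise the question asking agent's trained flow handles the message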
Below is an example where a user asks a question during the opening question phase.
6.3 Final User Test
Then, if the participant had interacted with a chatbot in this study before, they could fill in which chatbot they had interacted with (mark all that apply).
Then the participants were provided a link to where they could interact with the chatbot,
as well as brief instructions. The chatbot was trained on a set of conversation flows (Rasa-related) and required that the user first send a greeting message.
After interacting with the chatbot, the participants were told to rate a number of state-
ments on a scale of 1-5 from "Not at all" to "Very much": three statements measured the overall agent, and two statements each measured the question answering agent and the question asking agent.
• I would like to interact with the chatbot again in the future. (retention)
• It feels as if the chatbot has its own personal background and personality. (persona)
• The quality of the chatbot’s answers (non-question replies) from unsatisfactory to satis-
factory.
The survey for the second session was the same except that there were optional text fields
to provide feedback if the participant gave a low rating on one measure and wanted to pro-
vide context as to why, or if the participant had any other feedback. The questions regarding
age, English proficiency and previous experience with chatbots were not asked in the second
survey, as they were already provided in the first survey, and were not expected to change.
Results
An issue with the test, as observed during it, was that users experienced difficulties with starting a session, mostly the second session. This was due to a number of reasons: either the user did not start with a greeting message to initiate the agent’s trained conversation flow, or the agent was unable to identify the user, or the agent’s session was not reset between the first and second sessions. The agent was unable to identify the user if the e-mail was not provided to the agent, if the user accidentally used different e-mails, or if a typo or accidental capitalization was present. The agent’s session not being fully reset automatically after 60 minutes was a technical error due to an incorrect setting in Rasa. Information still left in slots is carried over to the next session, and the chatbot may therefore skip asking certain questions if the corresponding slots are filled. It is uncertain whether the setting was incorrect from the start of the user test, or if the setting was lost after restarting the system on Monday the 20th of July. The session not resetting led to users being unable to start the second session unless they hit a reset button, which they were not provided instructions for.
The number of participants that interacted with the agent and filled in the first survey
related to the first session was 17, out of which 5 had interacted with a chatbot from this study
previously, and 5 of which had not interacted with any chatbot previously. The participants
had a mean English proficiency value of 4.42. The age of the participants ranged from 18
to the early 40s with a mean of 27.5. Out of the 17 participants, 3 were discarded. One of the
discarded participants did not interact with the agent but still filled in the survey, another
participant bypassed the agent’s trained conversation flow, and the last participant was an
outlier. The participant was an outlier due to interacting with the agent after the session was
officially over, with a total of 25 messages during the entire interaction. The average number
of messages was 8.375 (between 6 and 10 messages), where two of the messages included greeting
and providing e-mail for identification. Out of the original 17 participants, 10 participants
interacted with the agent a second time, 2 of which were unable to interact with the second
session of the agent and were therefore discarded. One more participant was discarded due
to filling in the survey over one week after interacting with the agent, which is not considered
a reliable result. A reliable result is considered to be one that is collected as soon as possible, such
that the subject still has the interaction in memory, e.g. directly after the interaction. An
additional two participants were discarded due to faults in the chatbot where it had failed to
extract the user’s favorite sport in the first session, resulting in a template response asking
about "fail". The template response should not have been retrieved in this case. Therefore, 5
participants were considered for the second survey.
The results are shown in tables 6.1 and 6.2, where the first group, "Session 1 (All)", represents the results from the 14 remaining participants, and the remaining columns represent the first and second sessions of the 5 participants that interacted with the agent in both sessions. The main results in the tables show a decrease in engagingness and conversation flow from session 1 to session 2, as well as an increase in realness, persona, question quality and user memory (due to long-term memory question asking). T-tests indicate that the only significant
difference between the two sessions was the measure of user memory (p=0.008). The results
are put in context in chapter 7 with tables comparing all three chatbots.
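For illustration only, such a test could be computed as follows with SciPy; the rating vectors below are made up, not the study data, and a paired t-test is assumed since the same five participants took part in both sessions.

from scipy import stats

user_memory_session1 = [1, 2, 1, 2, 1]  # hypothetical user-memory ratings, session 1
user_memory_session2 = [4, 4, 3, 5, 4]  # hypothetical user-memory ratings, session 2

t_stat, p_value = stats.ttest_rel(user_memory_session1, user_memory_session2)  # paired t-test
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")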
Three users got a different conversation flow where they were not given an opportunity to
ask the agent questions. This happened when the chatbot was unable to extract the favorite
sport of the user. The following quote and the interaction below come from one of the three.
"It just isn’t how a person would talk. I feel that it is very limited, all the things i
can input are one single word. It feels shallow."
"The way the chatbot asked questions about things we talked about in our previ-
ous conversation felt a bit...wooden? I think the questions should be phrased in a
more natural "hey we are texting with each other" kind of way."
"The chatbots answers ("Mh") were pretty unengaging, causing the flow of the
conversation to come to a halt."
"Grammar was sometimes wrong, eg. "last time we talked about cat". Responses
to my questions were also a bit off."
Chatbot: Last time we talked about Skiing, did you do it recently (<30 days)?
User: Hahaha no I didnt go skiing recently
Chatbot: Mh okay.
User: Last time I went was 2 years ago
Chatbot: No, this time I only went once.
Chatbot: We also talked about cat. Do you happen to own one?
User: We have 2 cats
Chatbot: Mh.
User: And you?
Chatbot: Nothing really.
...
Two full interactions (cherry-picked), from session 1 to session 2, are provided in the ap-
pendix A.9.
7 Results - Chatbot Comparisons
Three different chatbots were developed where a user test was held for each chatbot. The
user test on the question answering chatbot had 32 participants in total, where 16 of the
participants were deemed as non-ideal subjects. The user test on the question asking chatbot
had 5 participants. The final user test on the integrated chatbot had 17 participants in the first
session out of which 3 were discarded. The second session had 10 participants out of which
5 were discarded. The results of the user tests performed on the three chatbots are shown in
table 7.1. The table compares the chatbots on 1 session to measure the user’s perception after
interacting with the agent once. The results of the user tests from the integrated chatbot and
question asking chatbot during both sessions are compared and shown in table 7.3. Table 7.2
shows P-values found from performing ANOVA tests on the measures of engagingness and
realness. With a significance level of α = 0.05, it is found that there is a significant difference
in realness between the groups when including the Question Answering ’All’ group in the
test, but not when excluding it. Furthermore, there is no significant difference between the
groups in table 7.3 as indicated by the p-values.
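As an illustration of how such one-way ANOVA p-values can be obtained with SciPy (the rating vectors are made up, not the study data):

from scipy import stats

engagingness_integrated = [4, 3, 4, 3, 4]          # hypothetical ratings per group
engagingness_question_asking = [3, 3, 4, 3, 3]
engagingness_question_answering = [4, 4, 3, 5, 3]

f_stat, p_value = stats.f_oneway(engagingness_integrated,
                                 engagingness_question_asking,
                                 engagingness_question_answering)
print(f"F = {f_stat:.3f}, p = {p_value:.3f}")  # a significant difference if p < 0.05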
Additionally, box plots were created to display the distribution of the test results pre-
sented in table 7.1, based on a five number summary: minimum, first quartile (Q1), median,
third quartile (Q3) and maximum. Figure 7.1 depicts the engagingness metrics from table 7.1.
The integrated chatbot had an evenly spread result. The question asking chatbot did not have a large number of participants, resulting in an incomplete box plot.
Measure Integrated Agent (Session 1) Integrated Agent (Session 2) Question Asking (Session 1) Question Asking (Session 2) P-value
Participants 5 5 5 5 -
Engagingness 3.600 (0.548) 3.000 (0.707) 3.200 (0.447) 3.200 (0.837) 0.541
Realness 2.400 (0.548) 2.800 (0.837) 3.600 (0.548) 2.600 (1.140) 0.137
Q-Relevancy 4.400 (0.894) 4.200 (0.837) 4.400 (0.548) 3.800 (1.304) 0.716
Grammar 3.933 (1.023) 4.022 (0.659) 4.400 (0.548) 4.800 (0.447) 0.232
Table 7.3: Comparing results of shared measures between the integrated chatbot and the
question asking chatbot, illustrating for each question the mean value supplemented with
the standard deviation. Additionally, the resulting p-values of ANOVA tests are presented.
The question answering chatbot with ideal results had a wide spread of scores, with a majority that found it to be engaging, whereas the question answering chatbot with all results had a wider spread and overall slightly lower scores on engagingness.
Figure 7.2 depicts the realness metrics from table 7.1. Most participants perceived that the
integrated chatbot was not human-like except for two outliers. The question asking chatbot
had a score that was above average on the 5-point Likert scale. The question answering
chatbot, both with ideal results and all results, had the same range of scores. The median line illustrates that the ’ideal’ group for the question answering chatbot had an overall higher score compared to the ’all’ group.
Figure 7.1: Box plot of the three chatbots’ engagingness in the first session. The whiskers are
at a distance of 1.5 interquartile range length (IQR).
To draw any conclusions from the results of the user tests on how engaging the agents are, a comparison is made with other social chatbots. The comparison is based on one session (the first) because the other agents were tested for one session/interaction.
The researchers who created the Persona-Chat dataset tested different models trained on
the dataset, where they measured engagingness, fluency, consistency and persona detection.
Fluency was measured from 1 - "not fluent at all" to 5 - "extremely fluent". Engagingness
was measured from 1 - "not engaging at all" to 5 - "extremely engaging". Similarly, consistency was measured from 1 to 5, provided the example that "I have a dog" followed by "I have no pets" is not consistent.
Figure 7.2: Box plot of the three chatbots' realness in the first session. The whiskers are at a distance of 1.5 interquartile range length (IQR).
A comparison between their trained models and the agents in
this thesis is illustrated in table 7.4. In the table, "profile memory" represents a model that
takes into consideration a provided persona. Interestingly enough, the engagingness scores
are lower in this case for the models that take persona into consideration. Persona detection
was measured by presenting two alternatives and letting the human user say which persona
they think they interacted with [56]. In this thesis, a persona is not considered in the same
way, e.g. the generative model does not take into consideration a pre-defined persona when
generating the next response. The persona measure is not compared due to it, in this case, be-
ing based on whether the user perceives that the agent has its own personal background and
personality, rated on a 5-point Likert scale. In this thesis, inconsistency is measured instead
of consistency. The results are (naively) reversed to represent consistency instead (reversed value = 6 - original value on the 1-5 scale), e.g. an inconsistency measure of 2.75 becomes a consistency measure of 3.25. The results from the surveys in
this thesis measuring engagingness, "The chatbot is engaging/interesting to interact with.",
is used for comparison on engagingness. The measures of grammar are used for comparison
in the fluency column in this case. The final test (Integrated Agent) measured grammar on
a 1-10 scale and the result has therefore been scaled down to 1-5 for comparison. It is im-
portant to consider the difference in sample size when reading the table, as the Persona-Chat
models used 100 dialogues for evaluation, compared to the varying sample size of 5-32 for
the chatbots in this thesis.
After the release of the Persona-Chat dataset, a competition was held called ConvAI2,
where models were trained on the task of acting out a persona. In table 7.5, a compari-
son of engagingness is made between the agents in this thesis and agents designed on the
persona-chat task, as presented in the ConvAI2 challenge [12]. The agents in the challenge
were evaluated through human evaluation where the dialogues were of length 4-6 turns each.
The human evaluators were asked to answer the question "How much did you enjoy talking to this user?" on a scale of 1-4. In this thesis, scales were mainly from 1-5, and as such, normalized scaling was performed for the sake of comparison.
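The exact normalization is not restated here; as a minimal sketch, a linear min-max rescaling that maps a rating x on the 1-5 scale onto the 1-4 scale would be x' = 1 + (x - 1) * (4 - 1)/(5 - 1), so that, for example, 3.57 on the 1-5 scale corresponds to roughly 2.93 on the 1-4 scale.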
Agent Fluency Engagingness Consistency
Seq2Seq 3.17 (1.10) 3.18 (1.41) 2.98 (1.45)
Seq2Seq - Profile Memory 3.08 (1.40) 3.13 (1.39) 3.14 (1.26)
KV Memory 3.81 (1.14) 3.88 (0.98) 3.36 (1.37)
KV Profile Memory 3.97 (0.94) 3.50 (1.17) 3.44 (1.30)
Question Answering Agent (All) - 3.28 (1.30) 2.81 (1.28)
Question Answering Agent (Ideal) - 3.69 (1.14) 3.25 (1.18)
Question Asking Agent (Session 1) 4.40 (0.56) 3.20 (0.45) -
Integrated Agent (Session 1 - All) 4.14 (0.93) 3.57 (0.85) -
Human 4.31 (1.07) 4.25 (1.06) 4.36 (0.92)
Table 7.4: Comparing the agents in this thesis with models from the Persona-Chat paper [56]
on fluency, engagingness and consistency.
It should be noted that the agents in this thesis differ from the ones developed for the Persona-Chat task, as
those agents are able to act out a persona given a few lines of descriptive text. Again, it is
important to note the sample size difference, as the Persona-Chat agents were evaluated on
100 dialogues each.
The tables show that the engagingness and consistency of the developed agents are similar
to that of chatbots in the Persona-Chat task. However, compared to the human baseline (of
humans who acted out a persona), the results of the chatbots are still not near human level
engagingness.
8 Discussion
This chapter consists of discussions revolving around what can be observed from the results of the user tests, the results compared with previous work, potential improvements, and ideas for future work.
8.1 Results
This section discusses the results from the user tests of the two individual chatbots in this
thesis, and how the results were affected as the chatbots were merged.
Realness
The users did not feel as if they chatted with a real person. For this question, and the user
survey in general, an optional text field would have been beneficial to explore potential
improvements. One reason for the low perceived realness may have been due to the agent’s
inability to answer follow-up questions, which led to irrelevant responses that may have
ruined the immersion. The score may also have been affected by the fact that the agent was
only capable of answering questions, which does not simulate a real conversation consisting of both question asking and question answering. The second hypothesis was that
the group with artificial response delay (DELAY group) would perceive the agent as more
real or human-like than the group with instantaneous responses. While the results show a
significant difference (p=0.037) between the ideal DELAY and ideal NO-DELAY group, the
opposite effect was observed; the ideal NO-DELAY group perceived the agent to be more
human-like than the ideal DELAY group.
Responsiveness / Unresponsiveness
Overall, the users did not feel as if the response time (average 5.983 seconds) negatively
affected the chatting experience. This knowledge can be utilized in future design of chatbots
as it may allow for features, models or components with higher time complexity that may
have otherwise been disregarded to achieve near instantaneous responses. The measure
of unresponsiveness was the most significant difference (p=0.009) between the ideal and
non-ideal groups. The ideal group scored very positively on this measure, meaning that
they were fine with the response time. The non-ideal group scored more toward the neutral.
Such an outcome is expected, as the non-ideal group had different expectations, interacted with the agent differently, and occasionally used it outside its intended application. When the agent responded with something irrelevant to their input, the response time may have been perceived as irritating, as the user had to wait all that time for an unintelligent reply.
The NO-DELAY group scored better on responsiveness while the DELAY group scored
better on unresponsiveness. The reason why the DELAY group scored better on unrespon-
siveness is probably due to the consistency of the response times. The artificial delay masked
the calculation time of the generative model for this group, whereas the group with no delay
had the possibility of experiencing instantaneous answers from the template and retrieval
components. When this group then received an answer from the generative model, the
answer was observably much slower than non-generated answers. This in turn may have
led the user to feel as if the response time of the system was less acceptable.
Repetitiveness
While the NO-DELAY group scored higher on every measure except for unresponsiveness
and repetitiveness, it is uncertain as to why the DELAY group perceived the agent to be less
repetitive. Perhaps the delay made it seem as if the agent took the time to process the user’s
input and its next reply. When considering the results of the t-tests, the more likely answer is
that the group scored higher by chance (p=0.518).
ability of the user test when looking at the statistical aspect as to how artificial delay affected
engagement and perceived realness in this study. Furthermore, the average response time of
the messages in the NO-DELAY group was not significantly different from that of the DELAY
group (5.103 seconds compared to 6.646 seconds) due to the response time of the generative model. Therefore, it
may be important in future tests that the NO-DELAY group experience consistent response
times (each response near instantaneous) to determine the effects of artificial/dynamic re-
sponse delay. Additionally, the artificial delay may have been too large to observe the positive
effects found in the study by [14].
A lesson learned from the study was to limit the scope of the users’ interaction with the
agent to prevent large variance in the outcome. For example, by telling the user to ask ques-
tions in a limited number of supported topics like hobbies, sport and food.
flow. Additionally, the first session made use of the question answering API to get self-
disclosure while the second session only called the API if the user asked questions, which
may have affected the engagingness and conversation flow. The measure on question quality
increased in the second session, which may be correlated to users finding a greater interest in
the long-term memory generated questions.
A couple of unforeseen scenarios happened during the test such as the system not rec-
ognizing the user for a second session. As such, there should have been better instructions
emphasizing the importance of using the same e-mail for the study. Additionally, some issues
occurred due to time constraints, as the development would have required more time to complete.
This can be observed in faults in the conversation flow, where some non-question user utter-
ances were incorrectly provided to the question answering API to respond to. Additional
faults were related to the question answering component being unable to extract entities
which resulted in a different conversation flow for users in the first session, and the agent
asking about "fail" in the second session.
One aim with the test was to compare engagingness and realness of the question asking
chatbot before and after the final test to find how the user’s perception of the chatbot was
affected by the chatbot’s personal memory and self-disclosure ability. This question remains
partially unanswered because the method instead mostly measured the effect of adding more features to the chatbot, such as question answering. As such, the survey should have consisted
of additional questions that ask more specific questions about the user’s perception of the
chatbot’s self-disclosure. An example of more specific questions would be:
Where the questions or statements are rated from "not at all" to "very much". More specific
questions would in turn reduce the amount of speculation around the meaning of the results.
the user to ask questions. Being able to interact with an agent more than simply answering
its questions exposes the limitations in the chatbot’s capabilities and intelligence. It can also
be observed that a high realness score does not necessarily imply a more engaging chatbot.
However, it is also important to note the low sample size of the user tests, especially for the
question asking agent, which makes it difficult to make generalized conclusions.
8.2 Method
The methods of the three agents are discussed individually in the following sub-sections. More
research into the methodology could improve design and development of the question asking
and answering chatbot.
2. New questions that are different but with high cosine similarity to a past question are
answered by the saved past answer.
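A minimal sketch of this reuse rule, where embed() stands in for any sentence encoder (e.g. the Universal Sentence Encoder) and the similarity threshold is an assumption:

import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def answer_from_memory(new_question, qa_memory, embed, threshold=0.85):
    # qa_memory is a list of (question, answer, question_embedding) tuples.
    new_vec = embed(new_question)
    best = max(qa_memory, key=lambda qa: cosine(new_vec, qa[2]), default=None)
    if best is not None and cosine(new_vec, best[2]) >= threshold:
        return best[1]  # reuse the saved past answer
    return None         # otherwise fall back to the normal answering pipeline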
With the suggested question-answer pair database, it is possible to keep expanding the
database from user interactions. The database can then be used to train the generative model.
The trained model can then generate questions and answers which can then be added to the
retrieval database. This cycle of expanding the database and training the model can continue
as long as the bad question-answer pairs are filtered out. The downside is that manually filtering bad pairs by going through thousands of question-answer pairs is time-consuming.
With the final version of the generative component, the component is fast enough to be
used at all times. As such, whenever a question is not considered a template question, the system could instead always retrieve a number of candidate responses from the retrieval database and generate some responses. Thereafter, apply ranking to the retrieved and generated responses.
Then use a threshold to filter out bad responses, and randomly select one of the remaining
responses as the answer to output to the user. This method of response selection was used in
XiaoIce, where it allows for variability. In comparison, the existing ID based retrieval struc-
ture in this thesis also returns a random response out of the answers with the ID matching
the question.
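A sketch of this selection scheme, under the assumption that rank() returns a relevance score in the range 0 to 1:

import random

def select_response(question, retrieved, generated, rank, threshold=0.6):
    # Pool retrieved and generated candidates, score them, and filter out bad responses.
    scored = [(rank(question, candidate), candidate) for candidate in retrieved + generated]
    good = [candidate for score, candidate in scored if score >= threshold]
    if not good:
        return max(scored)[1]   # fall back to the single best-ranked candidate
    return random.choice(good)  # random choice among good candidates allows for variability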
As of now, the generative model may generate responses that contradict the sentiment
memory. The responses may even contradict themselves, "I don’t like to read but I do enjoy
reading". This may be avoided by implementing Semantic Role Labeling to split the sentence
into clauses and then by using VADER sentiment analysis on each individual clause. Seman-
tic Role Labeling finds the semantic meaning of words or phrases in a sentence. By using
sentiment analysis, we can find whether a response contradicts the sentiment memory, and
if so either discard the answer or replace the keyword to fit with the sentiment memory. For
example, if the generated answer says that the agent likes basketball whereas the sentiment
memory has a negative sentiment for basketball but a positive sentiment for football, then
basketball can be replaced by football before the answer is output to the user. Then, the ex-
isting templates could be completely replaced by the generative model, which increases vari-
ability and in turn should positively affect engagingness. Some sentences may include mixed
sentiment about different subjects, such as "I don’t like basketball, but I enjoy football". Then
it may be possible to split the sentence into clauses or n-grams and run sentiment analysis on
the individual clauses to find the corresponding sentiment for each subject, which in turn can
find contradiction if there are mixed sentiments about the same subject in a sentence.
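A simplified sketch of such a check, splitting clauses naively on commas and "but" instead of using full Semantic Role Labeling, and assuming the sentiment memory maps subjects to signed sentiment values:

import re
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

def contradicts_memory(answer, sentiment_memory):
    # sentiment_memory maps a subject (e.g. 'basketball') to a signed sentiment value.
    clauses = re.split(r",|\bbut\b", answer.lower())
    for clause in clauses:
        score = analyzer.polarity_scores(clause)["compound"]
        for subject, stored in sentiment_memory.items():
            if subject in clause and score * stored < 0:  # opposite signs suggest a contradiction
                return True
    return False

print(contradicts_memory("I love basketball", {"basketball": -0.6, "football": 0.7}))  # True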
A possible mistake made during fine-tuning was that models #8, #9 and #10 were only fine-tuned on a question answering dataset, while model #7 was first fine-tuned on a movie dialogue dataset and then on question answering. The mistake was abandoning the practice of
first fine-tuning on the movie dialogues, as they may have contributed to more engaging gen-
erated text. A suggestion is therefore to first train on movie dialogue data or Persona-Chat
data to first attain interesting conversations, and then fine-tune for question answering.
For the ideal length penalty in the ranking system, it may be beneficial to make it more
dynamic, for example by trying to match the sentence length of the partner (user). This sug-
gestion is based on findings in [36] where human users roughly matched the sentence length
of their conversational partner. Additionally, some questions entail different answer lengths
(e.g. yes/no questions), although this may be more difficult to consider and implement.
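A small sketch of such a dynamic penalty, targeting the partner's message length instead of a fixed ideal length (the weight is an assumption):

def length_penalty(candidate, user_message, weight=0.02):
    # Penalize the difference between the candidate's length and the user's message length.
    target_length = len(user_message.split())
    return weight * abs(len(candidate.split()) - target_length)  # subtracted from the ranking score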
ConceptNet
ConceptNet’s API was initially used to add unseen words to the sentiment memory by call-
ing the API to find the IsA relationship (e.g. an apple is a fruit/food) and add the unseen
word under the appropriate category. This feature was later disabled due to the increase in
response time of the agent and replaced by a word embedding and similarity based method,
which is less accurate. A solution would be to call the API in another thread such that the
results can be used by the agent, not in the current response, but in future responses. As
the sentiment memory only stores a noun if the IsA relationship corresponds with existing,
pre-defined topics, an additional memory should be created to store every new noun. Then
the agent does not need to call the API again and again for the same noun, e.g. consider the
noun "name" when given the common question "What is your name?".
User test
The user test got mixed results as the users had different expectations of the agent and its
capabilities. Even though the initial message by the agent stated "Feel free to ask anything
about me, whether you want to know what food I like or what my hobbies are.", some users
still asked task-oriented questions. To consider for future user tests is to give more specific
instructions and to emphasize on what the agent can and cannot do. For example, by em-
phasizing on the fact that the agent is a question answering component and that it is only
designed to answer personal questions. As well as telling the user to pretend like it is a per-
son you are trying to get to know. Additionally, it might have been helpful to give examples
of questions that the user can ask, e.g. "Suggestion: Ask a question about sport.". A concern
when designing the user test was to not influence the user too much to allow for a better
understanding of how a user may interact with the agent.
After the first user test, some time was spent on creating context memory for the gener-
ative model. Another solution, or a complement to the solution would be to transform user
utterances into contextual utterances, as is done in XiaoIce [59].
Automatic evaluation
We have learned that automatic evaluation of response generation does not correlate well
with human evaluation [25]. However, similar to [25], we believe that embeddings, such as
the Universal Sentence Encoder, are a key component for automatic evaluation. The metric should aim to find topic relevance rather than word overlap with some ground truth, as many different kinds of answers are still relevant to a question even if they are not similar to the ground truth.
A late realization for automatic evaluation when used in a one-versus-one setting such as
comparing two methods or models, was that of using a ∆similarity threshold (the difference
in ranking scores). By using a ∆similarity threshold it is possible to let the automatic eval-
uation decide if two answers are equally relevant (tie), as was done in manual evaluation.
This might lead to slightly better correlation to the manual evaluation as answers with low
difference in their similarity/ranking score would then be considered tied.
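A minimal sketch of the ∆similarity idea, where the threshold value is an assumption:

def compare(score_a, score_b, delta=0.05):
    # Declare a tie when the two candidates' ranking scores are within the threshold.
    if abs(score_a - score_b) < delta:
        return "tie"
    return "a" if score_a > score_b else "b"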
from the user in order for the chatbot to reply, some sort of activation intent is needed in order
to trigger a certain action after a countdown. Below is an example of the timer function, which
calls an intent in order to trigger the chatbot’s reply.
A feature in Rasa which allows scheduling of an intent to be triggered in a set
amount of seconds
...
Chatbot: What’s one sport that you like?
User: Nothing
(Timer elapses for 5 seconds)
User: \scheduled_event_trigger
Chatbot: Here’s another question, what’s your one favorite pet?
...
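A sketch of how such a timer could be implemented as a custom action with Rasa's SDK; the intent and action names are assumptions.

from datetime import datetime, timedelta

from rasa_sdk import Action
from rasa_sdk.events import ReminderScheduled

class ActionSetSilenceTimer(Action):
    def name(self):
        return "action_set_silence_timer"

    def run(self, dispatcher, tracker, domain):
        trigger_time = datetime.now() + timedelta(seconds=5)
        # When the reminder fires, the intent is injected as if the user had sent it,
        # which is the trigger message that was deemed immersion-breaking.
        return [ReminderScheduled("EXTERNAL_silence_timeout", trigger_time,
                                  kill_on_user_message=True)]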
However, this approach was discarded due to the automated intent trigger that must
be printed out as a user input. It was deemed that this trigger message would break the
immersion, and thus decrease the user experience. It may be possible to hide this intent
trigger message if the chatbot was deployed on a website or an app, where the developer has
more control on the back-end and is able to decide what will be shown in the user interface.
8.2.3 Integration
The final agent is only able to consider one intent per message and may therefore not work
ideally for a user input consisting of a statement and a question. Future work should aim
toward using tools such as Semantic Role Labeling, to create clauses to identify multiple
intents, and to potentially handle those intents as individual inputs.
8.4 Future Work
and not so engaging responses. It would have been interesting to fine-tune a DialoGPT model
with the datasets created in this thesis and compare the results. In theory, the results should
be better as the model is trained on millions of conversations instead of many different types
of texts.
Researchers at Facebook AI created a "state-of-the-art open-source chatbot" called
BlenderBot, which was released at the end of April and was claimed to outperform other chatbots in terms of engagingness3. When compared one-to-one to Google's chatbot Meena [1], hu-
man evaluation found that users more often preferred BlenderBot. Once again, a transformer
architecture was chosen, where different-sized models were trained, the smallest having 90 million parameters and the largest having 9.4 billion parameters. The models are trained to be able to display knowledge, personality (Persona-Chat dataset) and empathy, and to blend these skills seamlessly. Had the model been released earlier, it would have been tested and
considered for the project as the main chatbot, with more focus spent on introducing the
long-term memory aspect and reducing contradictions. Although the chatbot is observed to
be engaging in few turn dialogue settings, the research paper describes the flaws of the model
[36]. The author(s) mentions that the bot would be repetitive and dull if users were to interact
with the chatbot over the course of several days or weeks. This is due to the chatbot being
incapable of remembering earlier conversations as it has a hard limit for the history that it
considers. This supports the need for further work and research in chatbots with memory,
especially how they can be added to generative models, more than just context-based under-
standing. It also illustrates contradictions or inconsistencies, which is common-place for gen-
erative model chatbots. Similarly to what was used in the ranking in this thesis, researchers
at Facebook AI [36] found that encouraging longer generations helps with reducing, but does
not solve the problem of bland/"safe" response generation, e.g. "I don’t know", which is an-
other problem with generative models. They also found that the model tends to generate
common words too frequently and rare words too infrequently, compared to human distri-
bution (which leads to a less engaging chatbot). This further supports the decision to reward
rare words during ranking (4.4.4). It should be noted however, the underlying problem is the
nature of the models themselves, as they try to find the most likely next word when generat-
ing output. Therefore, any ranking is limited by the underlying architecture. But it is possible
to adjust parameters that affect the text generation such that the model chooses a "less likely"
next word to achieve more engaging responses (which in turn may lead to less sensible sen-
tences). As such, over-generating responses and applying length-based and rare-word-based ranking should remain meaningful until the architecture is improved.
3 https://ai.facebook.com/blog/state-of-the-art-open-source-chatbot
9 Conclusion
1. How can the user and agent specific memories be developed to extract and store
information from user utterances, as well as to utilize the stored information in agent
utterances?
Information can be extracted using Natural Language Processing techniques,
such as Part-of-Speech tagging to extract nouns and Named Entity Recogni-
tion to extract names. Neural models can also be trained to extract desired
information by providing examples of user utterances (input) and the desired
information (output). Sentiment analysis (e.g. with VADER) can then be used to
find whether the user likes/dislikes or feels neutral towards a subject. The question
answering agent’s memory was developed to have a sentiment memory with
subject-sentiment pairs for a number of topics. Templates were then used to uti-
lize information, either from the agent memory, user memory or extracted from
the user’s most recent input.
2. How can short-term memory be developed such that the agent can understand and
provide a relevant answer to user utterances, such as follow-up questions, from the
context of the conversation history?
This question is answered from the research/related work presented in chapter 2.
Three methods were observed:
1. Have memory slots for he/she/it, etc., as used in Mitsuku (2.4.2), where
information is extracted from user utterances and stored in memory.
2. Transform user utterances into contextual utterances, as seen in XiaoIce [59].
3. When using neural/generative models, either design the architecture or fine-
tune to explicitly consider history when generating new responses.
This thesis made use of the third method where a GPT-2 model was fine-tuned
for question answering on conversational data where history was considered.
The resulting model illustrates the ability to answer context dependent follow-up
questions. It should be noted that the model also tends to generate nonsensical
answers occasionally. Furthermore, recent advances in end-to-end trained neural
conversational models [1] [36] present chatbots that converse well with respect to
context understanding with the help of their short-term memory, but the chatbots
do not consider long-term memory.
3. How is a user’s experience with an agent affected when the agent generates questions
which are personalized based on the user’s utterances from a previous interaction?
The users were able to realize when the chatbot asked questions that were related
to previously discussed topics. However, for the question asking chatbot, the
experience does not improve in terms of the perception of the chatbot’s realness.
Other metrics such as engagingness and appropriateness have similar scores
during both interaction sessions. For the integrated chatbot, the engagingness
decreased while measures of realness and persona increased between sessions.
The hypothesis was that the users would find a chatbot more engaging and fun
to interact with if it could remember information about the users. The hypothesis
cannot be accepted due to the results of t-tests where no significant difference
was observed between sessions. Furthermore, the hypothesis cannot be rejected
because the users had personal expectations about the memory module which
resulted in an inconclusive outcome. Moreover, the results indicated that engaging-
ness was independent of chatbot memory.
4. How is the user’s perception of the agent affected by the agent having a personal
background, which allows for self-disclosure responses?
The aim was to answer the question by comparing the engagingness and realness
of the question asking chatbot before and after the integration with the question
answering chatbot. The integrated chatbot resulted in higher engagingness com-
pared to the question asking chatbot. Although the engagingness increased, the
cause may also be due to the addition of more features, such as question answer-
ing, rather than it being due to the agent’s personal background. The perceived
realness was lower for the first session of the integrated chatbot compared to the
question asking chatbot, but higher in the second session. The hypothesis was
that the realness score would increase from the first to second session, as the
chatbot would ask long-term memory generated questions. The hypothesis was
not supported for the question asking chatbot as the realness decreased between
sessions, while the integrated chatbot supports the hypothesis. Furthermore, no
significant difference was observed on the measure of realness between the two
chatbots and their respective sessions, indicating that the null hypothesis cannot
be rejected. The results show that the question asking chatbot may initially give
off a higher sense of realness compared to the question answering or integrated
chatbot, but loses realness after additional interactions.
5. To what extent would the user’s perception of the agent be affected if the user could
ask the agent self-disclosure questions?
By comparing the engagingness measure of the question answering chatbot and
the question asking chatbot, the engagingness is higher for the question answer-
ing chatbot. Once the two chatbots were integrated, the engagingness was mea-
sured to be higher than the question asking chatbot by itself, while being slightly
lower than the question answering chatbot. Therefore, giving the users the option
to ask self-disclosure questions had a positive effect in terms of engagingness. The
integrated chatbot had a lower perceived realness compared to the question ask-
ing chatbot, but higher compared to the question answering chatbot. Allowing
users to ask the agent questions may lead to lower perceived realness due to the
possibility of nonsensical and inconsistent responses.
Progress has been made in neural approaches for training end-to-end chatbots with con-
text understanding but the methods do not consider long-term memory to remain engaging
over longer periods of time. This thesis presents how generative-retrieval hybrid chatbots
can make use of short and long-term memory to enable rapport building toward maintaining
long-term engagingness by personalizing question and answer generation. User tests show
that chatbots perceived as more human-like are not necessarily more engaging. The results of
the user tests show that the developed chatbots achieved similar scores on engagingness and
consistency to that of chatbots in the Persona-Chat task (ConvAI2). However, the chatbots
are still not near human level performance on these measures. Furthermore, it is observed
that generative models are essential for open-domain conversation, but still suffer from in-
consistency and unengaging or nonsensical responses.
Bibliography
[1] Daniel Adiwardana, Minh-Thang Luong, David R So, Jamie Hall, Noah Fiedel, Romal
Thoppilan, Zi Yang, Apoorv Kulshreshtha, Gaurav Nemade, Yifeng Lu, et al. “Towards
a Human-like Open-Domain Chatbot”. In: arXiv preprint arXiv:2001.09977 (2020).
[2] Christoph Bartneck, Dana Kulić, Elizabeth Croft, and Susana Zoghbi. “Measurement
instruments for the anthropomorphism, animacy, likeability, perceived intelligence,
and perceived safety of robots”. In: International journal of social robotics 1.1 (2009),
pp. 71–81.
[3] Timothy Bickmore, Daniel Schulman, and Langxuan Yin. “Maintaining engagement
in long-term interventions with relational agents”. In: Applied Artificial Intelligence 24.6
(2010), pp. 648–666.
[4] David M Blei, Andrew Y Ng, and Michael I Jordan. “Latent dirichlet allocation”. In:
Journal of machine Learning research 3.Jan (2003), pp. 993–1022.
[5] Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla
Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al.
“Language Models are Few-Shot Learners”. In: arXiv preprint arXiv:2005.14165 (2020).
[6] Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St John,
Noah Constant, Mario Guajardo-Cespedes, Steve Yuan, Chris Tar, et al. “Universal Sen-
tence Encoder for English”. In: Proceedings of the 2018 Conference on Empirical Methods in
Natural Language Processing: System Demonstrations. 2018, pp. 169–174.
[7] Yllias Chali and Sadid A Hasan. “Towards topic-to-question generation”. In: Computa-
tional Linguistics 41.1 (2015), pp. 1–20.
[8] Kyunghyun Cho, Bart van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. “On
the Properties of Neural Machine Translation: Encoder–Decoder Approaches”. In: Pro-
ceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Trans-
lation. 2014, pp. 103–111.
[9] Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. “Empirical
evaluation of gated recurrent neural networks on sequence modeling”. In: NIPS 2014
Workshop on Deep Learning, December 2014. 2014.
[10] Cristian Danescu-Niculescu-Mizil and Lillian Lee. “Chameleons in imagined conver-
sations: A new approach to understanding coordination of linguistic style in dialogs.”
In: Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics, ACL
2011. 2011.
[11] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. “BERT: Pre-
training of Deep Bidirectional Transformers for Language Understanding”. In: Proceed-
ings of the 2019 Conference of the North American Chapter of the Association for Computa-
tional Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 2019,
pp. 4171–4186.
[12] Emily Dinan, Varvara Logacheva, Valentin Malykh, Alexander Miller, Kurt Shuster,
Jack Urbanek, Douwe Kiela, Arthur Szlam, Iulian Serban, Ryan Lowe, et al. “The Sec-
ond Conversational Intelligence Challenge (ConvAI2)”. In: The NeurIPS’18 Competition.
Springer, 2020, pp. 187–208.
[13] Denis Fedorenko, Nikita Smetanin, and Artem Rodichev. “Avoiding echo-responses in
a retrieval-based conversation system”. In: Conference on Artificial Intelligence and Natu-
ral Language. Springer. 2018, pp. 91–97.
[14] Ulrich Gnewuch, Stefan Morana, Marc Adam, and Alexander Maedche. “Faster is not
always better: understanding the effect of dynamic response delays in human-chatbot
interaction”. In: (2018).
[15] Rachel Gockley, Allison Bruce, Jodi Forlizzi, Marek Michalowski, Anne Mundell,
Stephanie Rosenthal, Brennan Sellner, Reid Simmons, Kevin Snipes, Alan C Schultz, et
al. “Designing robots for long-term social interaction”. In: 2005 IEEE/RSJ International
Conference on Intelligent Robots and Systems. IEEE. 2005, pp. 1338–1343.
[16] Sepp Hochreiter and Jürgen Schmidhuber. “Long short-term memory”. In: Neural com-
putation 9.8 (1997), pp. 1735–1780.
[17] Kate S Hone and Robert Graham. “Towards a tool for the subjective assessment
of speech system interfaces (SASSI)”. In: Natural Language Engineering 6.3-4 (2000),
pp. 287–303.
[18] Karen Huang, Michael Yeomans, Alison Wood Brooks, Julia Minson, and Francesca
Gino. “It doesn’t hurt to ask: Question-asking increases liking.” In: Journal of personality
and social psychology 113.3 (2017), p. 430.
[19] Clayton J Hutto and Eric Gilbert. “Vader: A parsimonious rule-based model for senti-
ment analysis of social media text”. In: Eighth international AAAI conference on weblogs
and social media. 2014.
[20] Dan Jurafsky and James H. Martin. Speech and Language Processing. 2019.
[21] Nikita Kitaev, Łukasz Kaiser, and Anselm Levskaya. “Reformer: The Efficient Trans-
former”. In: arXiv preprint arXiv:2001.04451 (2020).
[22] Iryna Kulatska. “ArgueBot: Enabling debates through a hybrid retrieval-generation-
based chatbot”. MA thesis. University of Twente, 2019.
[23] Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. “A Diversity-
Promoting Objective Function for Neural Conversation Models”. In: Proceedings of the
2016 Conference of the North American Chapter of the Association for Computational Linguis-
tics: Human Language Technologies. 2016, pp. 110–119.
[24] Jiwei Li, Michel Galley, Chris Brockett, Georgios Spithourakis, Jianfeng Gao, and Bill
Dolan. “A Persona-Based Neural Conversation Model”. In: Proceedings of the 54th An-
nual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2016,
pp. 994–1003.
[25] Chia-Wei Liu, Ryan Lowe, Iulian Vlad Serban, Mike Noseworthy, Laurent Charlin, and
Joelle Pineau. “How NOT To Evaluate Your Dialogue System: An Empirical Study of
Unsupervised Evaluation Metrics for Dialogue Response Generation”. In: Proceedings of
the 2016 Conference on Empirical Methods in Natural Language Processing. 2016, pp. 2122–
2132.
[26] Qian Liu, Yihong Chen, Bei Chen, Jian-Guang Lou, Zixuan Chen, Bin Zhou, and Dong-
mei Zhang. “You Impress Me: Dialogue Generation via Mutual Persona Perception”.
In: arXiv preprint arXiv:2004.05388 (2020).
[27] Yani Mandasari. “Follow-up Question Generation”. MA thesis. University of Twente,
2019.
[28] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. “Distributed
representations of words and phrases and their compositionality”. In: Advances in neu-
ral information processing systems. 2013, pp. 3111–3119.
[29] Sinno Jialin Pan and Qiang Yang. “A survey on transfer learning”. In: IEEE Transactions
on knowledge and data engineering 22.10 (2009), pp. 1345–1359.
[30] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. “BLEU: a method for
automatic evaluation of machine translation”. In: Proceedings of the 40th annual meeting
of the Association for Computational Linguistics. 2002, pp. 311–318.
[31] Jeffrey Pennington, Richard Socher, and Christopher D Manning. “Glove: Global vec-
tors for word representation”. In: Proceedings of the 2014 conference on empirical methods
in natural language processing (EMNLP). 2014, pp. 1532–1543.
[32] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya
Sutskever. “Language models are unsupervised multitask learners”. In: OpenAI Blog
1.8 (2019), p. 9.
[33] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael
Matena, Yanqi Zhou, Wei Li, and Peter J Liu. “Exploring the Limits of Transfer Learning
with a Unified Text-to-Text Transformer”. In: arXiv preprint arXiv:1910.10683 (2019).
[34] Filipe N Ribeiro, Matheus Araújo, Pollyanna Gonçalves, Marcos André Gonçalves, and
Fabrício Benevenuto. “Sentibench-a benchmark comparison of state-of-the-practice
sentiment analysis methods”. In: EPJ Data Science 5.1 (2016), pp. 1–29.
[35] Stephen Robertson and Hugo Zaragoza. The probabilistic relevance framework: BM25 and
beyond. Now Publishers Inc, 2009.
[36] Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu, Jing
Xu, Myle Ott, Kurt Shuster, Eric M Smith, et al. “Recipes for building an open-domain
chatbot”. In: arXiv preprint arXiv:2004.13637 (2020).
[37] Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. “DistilBERT, a
distilled version of BERT: smaller, faster, cheaper and lighter”. In: arXiv preprint
arXiv:1910.01108 (2019).
[38] David R So, Chen Liang, and Quoc V Le. “The Evolved Transformer”. In: arXiv preprint
arXiv:1901.11117 (2019).
[39] Irene Solaiman, Miles Brundage, Jack Clark, Amanda Askell, Ariel Herbert-Voss, Jeff
Wu, Alec Radford, and Jasmine Wang. “Release Strategies and the Social Impacts of
Language Models”. In: arXiv preprint arXiv:1908.09203 (2019).
[40] Alessandro Sordoni, Yoshua Bengio, Hossein Vahabi, Christina Lioma, Jakob Grue Si-
monsen, and Jian-Yun Nie. “A hierarchical recurrent encoder-decoder for generative
context-aware query suggestion”. In: Proceedings of the 24th ACM International on Con-
ference on Information and Knowledge Management. 2015, pp. 553–562.
[41] Robyn Speer, Joshua Chin, and Catherine Havasi. ConceptNet 5.5: An Open Multilingual
Graph of General Knowledge. 2017. URL: http://aaai.org/ocs/index.php/AAAI/
AAAI17/paper/view/14972.
[42] Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, and Denny Zhou.
“MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices”. In: arXiv
preprint arXiv:2004.02984 (2020).
[43] Ilya Sutskever, James Martens, and Geoffrey E Hinton. “Generating text with recurrent
neural networks”. In: ICML. 2011.
[44] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. “Sequence to sequence learning with
neural networks”. In: Advances in neural information processing systems. 2014, pp. 3104–
3112.
[45] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N
Gomez, Łukasz Kaiser, and Illia Polosukhin. “Attention is all you need”. In: Advances
in neural information processing systems. 2017, pp. 5998–6008.
[46] Oriol Vinyals and Quoc Le. “A neural conversational model”. In: Proceedings of ICML
Deep Learning Workshop. 2015.
[47] Marilyn A Walker, Diane J Litman, Candace A Kamm, and Alicia Abella. “Evaluating
spoken dialogue agents with PARADISE: Two case studies”. In: Computer speech and
language 12.4 (1998), pp. 317–348.
[48] Richard S Wallace. “The anatomy of ALICE”. In: Parsing the Turing Test. Springer, 2009,
pp. 181–210.
[49] Joseph Weizenbaum. “ELIZA—a computer program for the study of natural language
communication between man and machine”. In: Communications of the ACM 9.1 (1966),
pp. 36–45.
[50] Sean Welleck, Jason Weston, Arthur Szlam, and Kyunghyun Cho. “Dialogue Natural
Language Inference”. In: Proceedings of the 57th Annual Meeting of the Association for Com-
putational Linguistics. 2019, pp. 3731–3741.
[51] Thomas Wolf, Victor Sanh, Julien Chaumond, and Clement Delangue. “Transfer-
Transfo: A Transfer Learning Approach for Neural Network Based Conversational
Agents”. In: arXiv preprint arXiv:1901.08149 (2019).
[52] Dongling Xiao, Han Zhang, Yukun Li, Yu Sun, Hao Tian, Hua Wu, and Haifeng Wang.
“ERNIE-GEN: An Enhanced Multi-Flow Pre-training and Fine-tuning Framework for
Natural Language Generation”. In: arXiv preprint arXiv:2001.11314 (2020).
[53] Liu Yang, Qingyao Ai, Jiafeng Guo, and W Bruce Croft. “aNMM: Ranking short answer
texts with attention-based neural matching model”. In: Proceedings of the 25th ACM in-
ternational on conference on information and knowledge management. 2016, pp. 287–296.
[54] Liu Yang, Junjie Hu, Minghui Qiu, Chen Qu, Jianfeng Gao, W Bruce Croft, Xiaodong
Liu, Yelong Shen, and Jingjing Liu. “A hybrid retrieval-generation neural conversa-
tion model”. In: Proceedings of the 28th ACM International Conference on Information and
Knowledge Management. 2019, pp. 1341–1350.
[55] Jianhua Yin and Jianyong Wang. “A dirichlet multinomial mixture model-based ap-
proach for short text clustering”. In: Proceedings of the 20th ACM SIGKDD international
conference on Knowledge discovery and data mining. 2014, pp. 233–242.
[56] Saizheng Zhang, Emily Dinan, Jack Urbanek, Arthur Szlam, Douwe Kiela, and Jason
Weston. “Personalizing Dialogue Agents: I have a dog, do you have pets too?” In: Pro-
ceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume
1: Long Papers). 2018, pp. 2204–2213.
[57] Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jian-
feng Gao, Jingjing Liu, and Bill Dolan. “DialoGPT: Large-Scale Generative Pre-training
for Conversational Response Generation”. In: arXiv preprint arXiv:1911.00536 (2019).
[58] Hao Zhou, Minlie Huang, Tianyang Zhang, Xiaoyan Zhu, and Bing Liu. “Emotional
chatting machine: Emotional conversation generation with internal and external mem-
ory”. In: Thirty-Second AAAI Conference on Artificial Intelligence. 2018.
[59] Li Zhou, Jianfeng Gao, Di Li, and Heung-Yeung Shum. “The Design and Implementa-
tion of XiaoIce, an Empathetic Social Chatbot”. In: Computational Linguistics 46.1 (2020),
pp. 53–93.
A Appendix
Model 4
I like to drive.
Beer.
Well, the money was good for me to go to college.
I’m an electrician.
Sorta nothing.
There’s always new things, silly.
I’m a dancer.
Swim and kayak.
Model 7
I spend time with my family, though I do enjoy traveling.
Model 10
I like to watch movies, travel, go surfing, play video games.
Well, I enjoy golf.
I like to watch movies, travel, go surfing, play video games.
I like to watch movies, travel, go surfing, play video games.
I like to gamble, exercise, and go out to bars occasionally.
I like to watch movies, travel, go surfing, play video games.
I like to watch movies, travel, go surfing, play video games.
I like to watch movies and play video games.
Model 18
Well, I like to dance, read, go for hikes, and listen to music.
Haha nothing too free-minded.
I enjoy hiking and horseback riding.
I like to travel.
I spend most of my time exercising, especially track and tennis.
Well, like anything I will miss the porch in the morning, which is a shame.
I like to play video games, listen to music, and chat.
I watch, cook, take my dogs along, hike, go out on the weekends, watch a few
movies, do some gaming.
Model 4
Yes.
Yes, Mr. Milch-born and KAPOne.
No.
Gross.
No, thank you.
No, sir.
No.
No, I don’t have any pets.
Model 7
I have 2 cats.
No pets ., never really considered that.
I’m not sure.
No pets of my own but my husband has 2 cats., I have 2 dogs who are both like
my children, large and small.
I have 3 dogs.. they are 8 and 7.. no I don’t have pets.
yeah, I have 3 dogs.
I have 3 dogs.. they are very house trained and love everything. They are also
very healthy., Every little bit helps., They are 3½ years old and eating nothing but
Model 10
Yeah i have 2 dogs
No pets .
No pets .
Yeah, I have 3 dogs, a rat and a shekel.
Yeah i have 3 dogs.
Yeah, 2 guinea pigs.
Yeah i have 2 cats
Yeah i have 2 dogs.
Model 18
I have a dog.
No, but I do have a dog.
I have 3 dogs.
I have 2 dogs that I have with me whenever I have time.
I have 2 cats.
No, not currently.
Nope, no.
i’ve a dog as well.
Model 4
To be a minister or a crusader or a magician, or just to have a talk show.
I’m into historical sports, and maybe I’m gonna write something about Dolly
Parton.
Tennis, surfing, wrestling, surf... golf.
I’m a comedian.
I’m a computer programmer, a novelist, I’m a poet, and now I’m a doctor.
Sleep.
Spirits.
Fernie, I don’t care about anything but her.
Model 7
that’s awesome.
that would be traveling all day.
That sounds pretty awesome.
I like to play video games.
I like to ride my bicycle, and I like to play video games.
that sounds pretty good.
I like to play soccer a lot.
that would be awesome.
Model 10
My hobbies include biking, reading, playing video games, cooking, getting out
and playing basketball.
My hobbies are golf, grand kids, the beach, cards and goofing off.
My hobbies are golf, grand kids, the beach, cards and hugs.
I like to cook, if you can call that a hobby and I like all types of craftwork
My hobbies would be Netflix and yoga.
I like to read, if you can call that a hobby but I also enjoy playing video games as
well.
My hobbies are golf, grand kids, the beach, cards and goofing off.
Well I like to ride my bicycle around the area, play video games, play Magic: The
Gathering, listen to music, cook, and probably get pretty cold there too.
Model 18
Well I work a lot, I play music mostly, I watch movies.
Well, hobbies include hiking and boxing (which I’m really good at).
My hobbies are golf and visiting friends in the area.
Well I play tennis and I take walks along the river.
Well, besides golf, I like basketball, crocheting, and crocheting on the beach.
My hobbies are golf, restaurants and craft beers.
So far so good.
I enjoy chess, collecting cards, and spending time with friends.
A.3 GPT-2 Models Experiment
To test the time it takes for the models to generate text, the pre-trained versions of the models (as released by OpenAI) were loaded without any fine-tuning and used to generate text with gpt-2-simple's generate() function. The function was called three times for each model, and the average runtime was calculated. It was also observed that lowering the length parameter of generate() reduced the generation time: for the 355M model, a length of 40 took around 16 seconds, compared to 23-26 seconds when the length was 100, while the 124M model took 8-9 seconds compared to 10-11 seconds. A conclusion of the experiment was that, from this point on, only 124M-parameter models would be fine-tuned, in order to keep the response time as low as possible.
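For reference, a minimal sketch of how such a timing run could look with gpt-2-simple is given below. The exact script used in the study is not reproduced here, so the model name, the number of runs, and the timing logic are assumptions that merely mirror the setup described above.

import time

import gpt_2_simple as gpt2

# Download the pre-trained 124M model as released by OpenAI (no fine-tuning).
gpt2.download_gpt2(model_name="124M")

# Load the model into a TensorFlow session.
sess = gpt2.start_tf_sess()
gpt2.load_gpt2(sess, model_name="124M")

def average_generation_time(length, runs=3):
    # Call generate() `runs` times and return the average wall-clock time in seconds.
    total = 0.0
    for _ in range(runs):
        start = time.time()
        gpt2.generate(sess, model_name="124M", length=length, return_as_list=True)
        total += time.time() - start
    return total / runs

# Compare the effect of the length parameter on generation time.
print("length=40 :", round(average_generation_time(40), 1), "s")
print("length=100:", round(average_generation_time(100), 1), "s")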
A.4 GTKY - Most Common Nouns
Table A.3: Top 87 (84+3) most frequently occurring nouns in extracted questions from the GTKY dataset.
A.5 Templates
Table A.4: Template questions developed after analyzing the GTKY dataset, with two questions added after a user test.
Table A.5: Template answers used together with a sentiment memory to answer questions about what the agent likes or dislikes.
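As an illustration of how such template answers and a sentiment memory might interact, the sketch below picks an answer template based on a stored like/dislike value for a topic. The memory structure, the template strings, and all names are illustrative assumptions, not the implementation used in the study.

# Hypothetical sketch of template answers combined with a sentiment memory.
# All names and templates are illustrative assumptions.

sentiment_memory = {
    "golf": "like",              # topics the agent has stated it likes
    "horror movies": "dislike",  # topics the agent has stated it dislikes
}

answer_templates = {
    "like": "Yes, I really like {topic}!",
    "dislike": "No, I'm not a fan of {topic}.",
    "unknown": "I'm not sure, I haven't thought much about {topic}.",
}

def answer_preference_question(topic: str) -> str:
    # Look up the stored sentiment and fill in the matching answer template.
    sentiment = sentiment_memory.get(topic, "unknown")
    return answer_templates[sentiment].format(topic=topic)

print(answer_preference_question("golf"))           # Yes, I really like golf!
print(answer_preference_question("horror movies"))  # No, I'm not a fan of horror movies.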
A.6 Front-End - Question Answering Agent
Figure A.1: Front page with instructions and terms for the question answering user test.
Figure A.3: Survey page for the question answering user test.
A.7 Question Answering User Test Interactions
Borderline example
Q: Is it going to rain tomorrow?
A: Probably not exactly this morning.
Q: Are you a human=
A: I’m a robot.
Q: why?
A: because its summer so i have to get my legs together pretty quick.
Q: Why are your legs apart?
A: For different things.
Q: What kinds of things?
A: I like to play sports.
Q: Watersports?
A: Yeah, I’ve seen storms, hail and snow.
A.8 Questions for Evaluation
Questions
what do you do for fun?
Do you have any plans for the weekend?
What do you do for a living?
do you have a favorite kind of music you like to listen to?
What do you like to do in your spare time?
Have you always lived in Texas?
what else do you like doing?
How do you like playing sports?
What are you studying?
do you like soccer?
what do you do for fun in Missouri?
Have you seen ANY of the Hunger games movies?
Do you love American food?
Are you old enough to drink?
Are you ready for the snow to melt and spring to start?
Do you ski?
Are you a student at Harvard?
How are you today?
What kind of education do you have?
do you meditate?
Where do you live?
What’s your favorite beer?
What type of job do you do?
Are you on Facebook?
Have you seen any good movies yet?
Are you also in college?
Do you cook?
have you ever been to italy?
What do you do with your summer?
Do you speak any other languages?
What kind of music do you like?
What’s your favorite kind of food?
What’s your favorite cake you have made?
What is your favorite place in the entire country?
do you have any recommendations for things to do in the area?
what is your name?
do you like The Count from Sesame Street?
What do you do outside of school?
how often do you do these types of studies?
What kind of ice cream flavor would you suggest for me?
do you have any pets?
what sport do you play?
Are you into yoga?
are you religious?
What kind of field do you study in?
you like anything to do with computers?
What do you do?
What’s a typical day like for you?
Do you like reading?
Have you enjoyed the sunshine?
So are you a student or working?
is there anything fun planned for you this weekend?
Any favourite sports?
what do you do that is important to you in your life?
Have you ever seen snow?
Can I ask your age?
How is the weather where you are?
Do you enjoy reading?
Do you like to watch any sports?
What are you planning to do on your vacation?
do you like to travel?
Have you ever been in Asia?
do you have any hobbies?
do you enjoy swimming?
do you have plans for the weekend?
What do you do at home on a Saturday?
are you from here?
What are your favorite genres then?
Did you study abroad at all or plan to?
What do you do when stressed?
What’s your major?
Are you an artist?
What’s your opinion on him?
What do you do when you’re on the computer?
Do you go to school in Boston?
What’s your favorite thing to cook?
You really like to read, don’t you?
where in the world do you live?
When did you start singing?
Are you a monk too?
Do you have children?
What do you do with your winter vacation?
How old are your kids?
Did you ever leave the country before?
How’s your morning going?
Do you want to stay in the area or move to another state?
Table A.6: Questions generated by a model that was accidentally trained on question asking; these questions were then used to evaluate the answer relevancy of the fine-tuned generative models.
A.9 Final User Test Interactions
Session 2
User: Hi
Chatbot: SYSTEM: PLEASE enter your email address to continue.
User: REDACTED_EMAIL
Chatbot: Good day REDACTED_NAME! How are you?
User: I’m great! How are you?
Chatbot: It has been long.
Chatbot: Last time we talked about swim, did you do it recently (<30 days)?
User: No I didn’t
Chatbot: You think you will do it in the near future?
User: I hope so!
Chatbot: I see.
User: Do you have any hobbies?
Chatbot: I like to ski, I like playing video games, reading, and traveling.
Chatbot: We also talked about cat. Do you happen to own one?
User: That’s nice!
Chatbot: And what color?
User: Yes I do have a cat
Chatbot: Okay!
User: My cat has different colors
Chatbot: lol My cat has a light gray fur.
Chatbot: Great. It was nice talking to you!
Chatbot: Thank you! Please proceed to the form for a small survey to complete
the test.
Chatbot: Thank you REDACTED_NAME! You have now finished the first ses-
sion, please go back to the form for a small survey, and start a second session in a
few days. I will send you a reminder via email.
Session 2
User: Hi
Chatbot: SYSTEM: PLEASE enter your email address to continue.
User: REDACTED_EMAIL
Chatbot: Good day REDACTED_NAME! How are you?
User: I’m fine. How’re you?
Chatbot: I’m fine.
Chatbot: Last time we talked about Yoga, did you do it recently (<30 days)?
User: Not since we last talked, but before (<30 days) - yes.
Chatbot: Mh okay.
User: Did you play soccer?
Chatbot: No I played a little bit at the junior college I grew up in, it was very
tough.
Chatbot: We also talked about Cat. Do you happen to own one?
User: I do, yes.
Chatbot: And what color?
User: She’s tricolor - white, black, yellow.
Chatbot: Mh.
User: So, what’re your plans today?
Chatbot: I’d like to go to a movie today.
Chatbot: Great. It was nice talking to you!
Chatbot: Thank you! Please proceed to the form for a small survey to complete
the test.