How to break conversation data into pairs of (Context, Response)
I'm using the Gensim Doc2Vec model, trying to cluster portions of customer support conversations. My goal is to give the support team auto-response suggestions.

Figure 1 shows a sample conversation where the user's question is answered in the next conversation line, making it easy to extract the data:

Figure 1

During the conversation, "hello" and "Our offices are located in NYC" should be suggested.


Figure 2 describes a conversation where the questions and answers are not in sync:

Figure 2

During the conversation, "hello" and "Our offices are located in NYC" should be suggested.


Figure 3 describes a conversation where the context for the answer is built up over time, and (I'm assuming) some of the lines are redundant for classification purposes:

Figure 3

During the conversation, "here is a link for the free trial account" should be suggested.


I have the following data per conversation line (simplified):
who wrote the line (user or agent), text, timestamp

I'm using the following code to train my model:

from gensim.models import Doc2Vec
from gensim.models.doc2vec import TaggedLineDocument
import datetime

print('Creating documents', datetime.datetime.now().time())
context = TaggedLineDocument('./test_data/context.csv')

print('Building model and training', datetime.datetime.now().time())
# Passing the corpus to the constructor builds the vocabulary and trains
# the model in one step, so there is no need to call train() again in a
# loop. (In recent gensim versions the `size` parameter is `vector_size`;
# calling train() manually also requires total_examples and epochs.)
model = Doc2Vec(context, vector_size=200, window=10, min_count=10,
                workers=4, epochs=10)

model.save('./test_data/model')

Q: How should I structure my training data and what heuristics could be applied in order to extract it from the raw data?

Cora asked 14/9/2016 at 12:00 — Comments (4)
Train only on those where you are sure? Then predict which of the out-of-sync choices is best and add that to the training set? — Lubeck
Thanks for the reply. Unfortunately I can't really be sure what part of the context triggered the agent's response. I'll appreciate any approach that will move me forward. — Cora
Nicely constructed question, but it's a bit general. What techniques are you familiar with and what areas would you feel comfortable using? Maybe that can help narrow it down. — Psychoanalysis
Thanks for the reply. To tackle this problem I've tried the RNN method described here: www.wildml.com/2016/07/deep-learning-for-chatbots-2-retrieval-based-model-tensorflow/, tried Facebook's FastText, and Doc2Vec as described above. I've come to the conclusion that all of those approaches suffer from the same symptom: garbage in, garbage out. — Cora

To train a model I would start by concatenating consecutive sequences of messages: using the timestamps, concatenate consecutive messages from the same entity that have no message from the other entity in between.

For instance:

Hello
I have a problem
I cannot install software X
                                       Hi
                                       What error do you get?

would be:

Hello I have a problem I cannot install software X
                                       Hi What error do you get?

Then I would train a model on sentences in that format. I would do that because I am assuming that the conversation stays on a single topic between interactions from the entities. In that scenario, suggesting the single message Hi What error do you get? would be totally fine.
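The concatenation step above can be sketched as follows. This is a minimal illustration, assuming each conversation line is a hypothetical (sender, text, timestamp) tuple matching the fields described in the question:

```python
from itertools import groupby

# Hypothetical toy conversation: (sender, text, timestamp-in-seconds).
conversation = [
    ("user", "Hello", 0),
    ("user", "I have a problem", 5),
    ("user", "I cannot install software X", 9),
    ("agent", "Hi", 20),
    ("agent", "What error do you get?", 24),
]

def merge_runs(lines):
    """Concatenate consecutive messages from the same sender."""
    merged = []
    for sender, run in groupby(lines, key=lambda line: line[0]):
        texts = [text for _, text, _ in run]
        merged.append((sender, " ".join(texts)))
    return merged

print(merge_runs(conversation))
# → [('user', 'Hello I have a problem I cannot install software X'),
#    ('agent', 'Hi What error do you get?')]
```

In a real pipeline you might also break a run when the gap between two timestamps is large, since a long pause often signals a topic change.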

Also, take a look at the data. If the questions from the users are usually single-sentenced (as in the examples), sentence detection could help a lot. In that case I would apply sentence detection to the concatenated strings (nltk could be an option) and use only single-sentenced questions for training. That way you can avoid the out-of-sync problem when training the model, at the price of reducing the size of the dataset.

On the other hand, I would seriously consider starting with a very simple method. For example, you could score questions by tf-idf and, to get a suggestion, take the most similar question in your dataset with respect to some metric (e.g. cosine similarity) and suggest the answer to that question. That will perform very badly on sentences that depend on context (e.g. how do you do it?) but can perform well on sentences like where are you based?.
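That tf-idf baseline fits in a few lines with scikit-learn. A minimal sketch, assuming a hypothetical toy list of past (question, answer) pairs:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical dataset of past (question, answer) pairs.
pairs = [
    ("where are you based?", "Our offices are located in NYC"),
    ("how do I get a free trial?", "here is a link for the free trial account"),
]
questions = [q for q, _ in pairs]

vectorizer = TfidfVectorizer()
question_vectors = vectorizer.fit_transform(questions)

def suggest(new_question):
    """Return the stored answer of the most tf-idf-similar question."""
    vec = vectorizer.transform([new_question])
    scores = cosine_similarity(vec, question_vectors)[0]
    return pairs[scores.argmax()][1]

print(suggest("where is your office based?"))
# → 'Our offices are located in NYC'
```

A baseline like this also gives you something concrete to measure the Doc2Vec model against.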

My last suggestion stands because traditional methods can perform even better than complex NN methods when the dataset is small. How big is your dataset?

How you train an NN method is also crucial: there are a lot of hyper-parameters, and tuning them properly can be difficult. That's why having a baseline with a simple method helps a lot when checking how well you are doing. In this other paper they compare different hyper-parameters for doc2vec; maybe you'll find it useful.

Edit: a completely different option would be to train a model to "link" questions with answers. For that you would have to manually tag each question with the corresponding answer and then train a supervised learning model on that data. That could potentially generalize better, but it comes with the added effort of manually labelling the sentences, and it still doesn't look like an easy problem to me.
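One way to frame that linking model is binary classification over candidate (question, answer) pairs. This is only a rough sketch under strong assumptions: the tiny hand-labelled list is hypothetical, and representing each pair as one concatenated bag-of-words string is a deliberately crude pairing feature:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical manually-tagged data: (question, answer, is_correct_link).
labelled = [
    ("where are you based?", "Our offices are located in NYC", 1),
    ("where are you based?", "here is a link for the free trial account", 0),
    ("how do I get a free trial?", "here is a link for the free trial account", 1),
    ("how do I get a free trial?", "Our offices are located in NYC", 0),
]

vectorizer = TfidfVectorizer()
# Represent each candidate pair as one concatenated string.
X = vectorizer.fit_transform(q + " " + a for q, a, _ in labelled)
y = [label for _, _, label in labelled]

clf = LogisticRegression().fit(X, y)
# Probability that each candidate pair is a real question-answer link.
probs = clf.predict_proba(X)[:, 1]
print(len(probs))
```

At suggestion time you would score each stored answer against the incoming question and propose the highest-scoring one; a real system would need far richer pair features than this.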

Retrieval answered 20/9/2016 at 16:30 — Comments (4)
Thank you for the detailed reply, a lot to digest. My DB is huge; for testing I'm taking only a small portion (about 500,000 lines of conversations). Most of the conversations cover more than one topic, and the topic distribution is not balanced (60% topic A, 20% topic B, and the rest distributed across another 8 topics, more or less). Manual labeling is an option, but I'd prefer to look into some sort of automation. Check out Google Smart Reply, it could give a new direction. — Cora
The fact that the conversations have more than one topic wouldn't be a problem in the methodology I described, as long as the topics are separated by messages from the other entity (e.g. I have a problem installing sw X and btw, where are you based? would be a problem). I also came across the Google Smart Reply paper, but unfortunately I don't have the time to study it now; it could be a good place to start. Finally, I encourage you again to start with a simple method and improve on that. Come up with a metric to compare the models and see how well you do. — Materiel
Very interesting problem, I hope I helped :) — Materiel
Please consider the answer for the bounty if it helped :) — Materiel
