How to break conversation data into pairs of (Context, Response)
I'm using the Gensim Doc2Vec model, trying to cluster portions of customer support conversations. My goal is to give the support team auto-response suggestions.

Figure 1 shows a sample conversation where the user's question is answered in the next conversation line, making it easy to extract the data:

Figure 1

During the conversation, "hello" and "Our offices are located in NYC" should be suggested.


Figure 2 describes a conversation where the questions and answers are not in sync:

Figure 2

During the conversation, "hello" and "Our offices are located in NYC" should be suggested.


Figure 3 describes a conversation where the context for the answer is built up over time, and (I'm assuming) some of the lines are redundant for classification purposes:

Figure 3

During the conversation, "here is a link for the free trial account" should be suggested.


I have the following data per conversation line (simplified):
who wrote the line (user or agent), text, timestamp

I'm using the following code to train my model:

from gensim.models import Doc2Vec
from gensim.models.doc2vec import TaggedLineDocument
import datetime

print('Creating documents', datetime.datetime.now().time())
context = TaggedLineDocument('./test_data/context.csv')

print('Building model and training', datetime.datetime.now().time())
# Passing the corpus to the constructor builds the vocabulary and trains
# the model in one step, so there is no need to call train() again in a
# loop. (In recent gensim versions the `size` parameter is `vector_size`;
# calling train() manually also requires total_examples and epochs.)
model = Doc2Vec(context, vector_size=200, window=10, min_count=10,
                workers=4, epochs=10)

model.save('./test_data/model')

Q: How should I structure my training data and what heuristics could be applied in order to extract it from the raw data?

Cora asked 14/9/2016 at 12:00 — Comments (4)
Train only on those where you are sure? Then predict which of the out-of-sync choices is best and add that to the training set? — Lubeck
Thanks for the reply. Unfortunately I can't really be sure what part of the context triggered the agent's response. I'll appreciate any approach that will move me forward. — Cora
Nicely constructed question, but it's a bit general. What techniques are you familiar with and what areas would you feel comfortable using? Maybe that can help narrow it down. — Psychoanalysis
Thanks for the reply. To tackle this problem I've tried the RNN method described here: www.wildml.com/2016/07/deep-learning-for-chatbots-2-retrieval-based-model-tensorflow/, tried Facebook's FastText, and Doc2Vec as described above. I've come to the conclusion that all of those approaches suffer from the same symptom: garbage in, garbage out. — Cora

To train a model I would start by concatenating consecutive sequences of messages: using the timestamps, concatenate consecutive messages from the same entity that have no message from the other entity in between.

For instance:

Hello
I have a problem
I cannot install software X
                                       Hi
                                       What error do you get?

would be:

Hello I have a problem I cannot install software X
                                       Hi What error do you get?

Then I would train a model on sentences in that format. I would do that because I am assuming that the conversation stays on a single topic between interactions from the entities. In that scenario, suggesting the single message Hi What error do you get? would be totally fine.
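The concatenation step above can be sketched as follows. This is a minimal illustration, assuming each conversation line is a hypothetical (sender, text, timestamp) tuple matching the fields described in the question:

```python
from itertools import groupby

# Hypothetical toy conversation: (sender, text, timestamp-in-seconds).
conversation = [
    ("user", "Hello", 0),
    ("user", "I have a problem", 5),
    ("user", "I cannot install software X", 9),
    ("agent", "Hi", 20),
    ("agent", "What error do you get?", 24),
]

def merge_runs(lines):
    """Concatenate consecutive messages from the same sender."""
    merged = []
    for sender, run in groupby(lines, key=lambda line: line[0]):
        texts = [text for _, text, _ in run]
        merged.append((sender, " ".join(texts)))
    return merged

print(merge_runs(conversation))
# → [('user', 'Hello I have a problem I cannot install software X'),
#    ('agent', 'Hi What error do you get?')]
```

In a real pipeline you might also break a run when the gap between two timestamps is large, since a long pause often signals a topic change.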

Also, take a look at the data. If the questions from the users are usually single-sentenced (as in the examples), sentence detection could help a lot. In that case I would apply sentence detection to the concatenated strings (nltk could be an option) and use only single-sentenced questions for training. That way you can avoid the out-of-sync problem when training the model, at the price of reducing the size of the dataset.

On the other hand, I would seriously consider starting with a very simple method. For example, you could score questions by tf-idf and, to get a suggestion, take the most similar question in your dataset with respect to some metric (e.g. cosine similarity) and suggest the answer to that question. That will perform very badly on sentences that depend on context (e.g. how do you do it?) but can perform well on sentences like where are you based?.
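That tf-idf baseline fits in a few lines with scikit-learn. A minimal sketch, assuming a hypothetical toy list of past (question, answer) pairs:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical dataset of past (question, answer) pairs.
pairs = [
    ("where are you based?", "Our offices are located in NYC"),
    ("how do I get a free trial?", "here is a link for the free trial account"),
]
questions = [q for q, _ in pairs]

vectorizer = TfidfVectorizer()
question_vectors = vectorizer.fit_transform(questions)

def suggest(new_question):
    """Return the stored answer of the most tf-idf-similar question."""
    vec = vectorizer.transform([new_question])
    scores = cosine_similarity(vec, question_vectors)[0]
    return pairs[scores.argmax()][1]

print(suggest("where is your office based?"))
# → 'Our offices are located in NYC'
```

A baseline like this also gives you something concrete to measure the Doc2Vec model against.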

My last suggestion stands because traditional methods can perform even better than complex NN methods when the dataset is small. How big is your dataset?

How you train an NN method is also crucial: there are a lot of hyper-parameters, and tuning them properly can be difficult. That's why having a baseline with a simple method helps a lot when checking how well you are doing. In this other paper they compare different hyper-parameters for doc2vec; maybe you'll find it useful.

Edit: a completely different option would be to train a model to "link" questions with answers. For that you would have to manually tag each question with the corresponding answer and then train a supervised learning model on that data. That could potentially generalize better, but it comes with the added effort of manually labelling the sentences, and it still doesn't look like an easy problem to me.
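One way to frame that linking model is binary classification over candidate (question, answer) pairs. This is only a rough sketch under strong assumptions: the tiny hand-labelled list is hypothetical, and representing each pair as one concatenated bag-of-words string is a deliberately crude pairing feature:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical manually-tagged data: (question, answer, is_correct_link).
labelled = [
    ("where are you based?", "Our offices are located in NYC", 1),
    ("where are you based?", "here is a link for the free trial account", 0),
    ("how do I get a free trial?", "here is a link for the free trial account", 1),
    ("how do I get a free trial?", "Our offices are located in NYC", 0),
]

vectorizer = TfidfVectorizer()
# Represent each candidate pair as one concatenated string.
X = vectorizer.fit_transform(q + " " + a for q, a, _ in labelled)
y = [label for _, _, label in labelled]

clf = LogisticRegression().fit(X, y)
# Probability that each candidate pair is a real question-answer link.
probs = clf.predict_proba(X)[:, 1]
print(len(probs))
```

At suggestion time you would score each stored answer against the incoming question and propose the highest-scoring one; a real system would need far richer pair features than this.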

Retrieval answered 20/9/2016 at 16:30 — Comments (4)
Thank you for the detailed reply, a lot to digest. My DB is huge; for testing I'm taking only a small portion (about 500,000 lines of conversations). Most of the conversations cover more than one topic, and the topic distribution is not balanced (60% topic A, 20% topic B, and the rest distributed across another 8 topics, more or less). Manual labeling is an option, but I'd prefer to look into some sort of automation. Check out Google Smart Reply, it could give a new direction. — Cora
The fact that the conversations have more than one topic wouldn't be a problem in the methodology I described, as long as the topics are separated by messages from the other entity (e.g. I have a problem installing sw X and btw, where are you based? would be a problem). I also came across the Google Smart Reply paper, but unfortunately I don't have the time to study it now; it could be a good place to start. Finally, I encourage you again to start with a simple method and improve on that. Come up with a metric to compare the models and see how well you do. — Materiel
Very interesting problem, I hope I helped :) — Materiel
Please consider the answer for the bounty if it helped :) — Materiel
