Doc2Vec Get most similar documents

I am trying to build a document retrieval model that returns the most relevant documents for a query or search string, ordered by relevance. For this I trained a doc2vec model using the Doc2Vec class in gensim. My dataset is a pandas DataFrame in which each row holds one document as a string. This is the code I have so far:

import multiprocessing
import re

import gensim
import pandas as pd
from gensim.models.doc2vec import LabeledSentence

# TOKENIZER
def tokenizer(input_string):
    return re.findall(r"[\w']+", input_string)

# IMPORT DATA
data = pd.read_csv('mp_1002_prepd.txt')
data.columns = ['merged']
data.loc[:, 'tokens'] = data.merged.apply(tokenizer)
sentences= []
for item_no, line in enumerate(data['tokens'].values.tolist()):
    sentences.append(LabeledSentence(line,[item_no]))

# MODEL PARAMETERS
dm = 1 # 1 for distributed memory(default); 0 for dbow 
cores = multiprocessing.cpu_count()
size = 300
context_window = 50
seed = 42
min_count = 1
alpha = 0.5
max_iter = 200

# BUILD MODEL
model = gensim.models.doc2vec.Doc2Vec(documents = sentences,
    dm = dm,
    alpha = alpha, # initial learning rate
    seed = seed,
    min_count = min_count, # ignore words with freq less than min_count
    max_vocab_size = None, # None means no limit on the vocabulary size
    window = context_window, # the number of words before and after to be used as context
    size = size, # is the dimensionality of the feature vector
    sample = 1e-4, # threshold for downsampling high-frequency words
    negative = 5, # number of noise words drawn for negative sampling
    workers = cores, # number of cores
    iter = max_iter) # number of iterations (epochs) over the corpus

# QUERY BASED DOC RANKING ??

The part where I am struggling is finding the documents that are most similar/relevant to the query. I used infer_vector, but then realised that it treats the query as a document, updates the model and returns the results. I tried the most_similar and most_similar_cosmul methods, but they return words along with (I guess) a similarity score. What I want is to enter a search string (a query) and get back the ids of the most relevant documents along with a similarity score (cosine etc.). How do I get this part done?

Lapidary answered 14/3, 2017 at 8:43 Comment(6)
Does your query exist in the dataset? If so, you can use the sentence_tag to find similar sentences. If not, you could create an inferred vector (available after gensim 0.12.4) and query with it. Both use model.docvecs.most_similar(). - Arcturus
@Arcturus my query is a string, for example: customer segmentation. Customer and segmentation both exist in the vocabulary. By sentence_tag you mean the tag we pass in LabeledSentence, right? If so, then I have used the document id (basically a number 1,2,3...num_docs) as the tag. I used infer_vector but that wasn't helpful because it considers the query as the document, updates the model weights and returns similar documents. I don't want to update the model every time I pass a query. Lastly, model.docvecs.most_similar() can be used, but it needs a vector to find the most similar docs. - Lapidary
@Arcturus So basically the question comes down to how do I get a vector representation of the query without altering the model. - Lapidary
The infer method will ignore any words that are not in its vocab and should not update weights, afaik. Passing the inferred vector to the most_similar function should indeed give you back the tags of similar docs. Have you tried that? What happens? Have you saved and loaded the model again? - Anderegg
@ClockSlave currently I don't think there is any other way to get the vector representations. If you have a query that exists in your vocabulary then you can use its tag (the document id in your case) to calculate similarity or to get its vector. But I don't think infer_vector would update the weights. You may see different results from the same query due to the non-deterministic nature of some of the algorithms used (negative sampling, dbow=1 etc...). But that does not mean the model is altered. - Arcturus
@Arcturus the infer_vector method takes parameters like alpha and min_alpha, so I figured they update the model as well. However, I am not sure if they are learning rates or some other parameters. - Lapidary

You need to use infer_vector to get a document vector for the new text. Calling it does not alter the underlying model.

Here is how you do it:

tokens = "a new sentence to match".split()

new_vector = model.infer_vector(tokens)
sims = model.docvecs.most_similar([new_vector]) #gives you top 10 document tags and their cosine similarity
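
As a rough illustration of what the ranking step amounts to, here is a minimal sketch in plain Python that scores every document vector against a query vector by cosine similarity and returns the best matches. It is not gensim's implementation: rank_documents and cosine are hypothetical helpers, and the vectors are made-up stand-ins for trained document vectors and for the result of infer_vector. It also reuses the tokenizer from the question, on the assumption that a query should be tokenized the same way as the training documents.

```python
import math
import re


def tokenizer(input_string):
    # Same regex as the tokenizer in the question.
    return re.findall(r"[\w']+", input_string)


def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_documents(query_vec, doc_vecs, topn=10):
    # Hypothetical helper: returns (tag, similarity) pairs, best match first.
    scored = [(tag, cosine(query_vec, vec)) for tag, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Made-up 3-dimensional vectors standing in for trained document vectors.
doc_vecs = {0: [1.0, 0.0, 0.0], 1: [0.6, 0.8, 0.0], 2: [0.0, 0.0, 1.0]}
# Made-up stand-in for the vector infer_vector would return for the query.
query_vec = [0.9, 0.1, 0.0]

query_tokens = tokenizer("customer segmentation")
ranking = rank_documents(query_vec, doc_vecs, topn=2)
print(query_tokens)
print(ranking)
```

With a real model, the tokens would go to model.infer_vector and the resulting vector to model.docvecs.most_similar, as in the snippet above.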

Edit:

Here is an example showing that the underlying model does not change after infer_vector is called.

import numpy as np

words = "king queen man".split()

len_before = len(model.docvecs) # number of docs

# word vectors for king, queen, man
# .copy() takes a snapshot; without it the lookup may return a view of the
# model's own array, and the comparison below would always be True
w_vec0 = model[words[0]].copy()
w_vec1 = model[words[1]].copy()
w_vec2 = model[words[2]].copy()

new_vec = model.infer_vector(words)

len_after = len(model.docvecs)

print(np.array_equal(model[words[0]], w_vec0)) # True
print(np.array_equal(model[words[1]], w_vec1)) # True
print(np.array_equal(model[words[2]], w_vec2)) # True

print(len_before == len_after) # True
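
To make the idea of inference with frozen weights more concrete, here is a toy sketch in plain Python. It is not gensim's update rule: toy_infer_vector, the word vectors, and the update step are all made up for illustration. It only shows the general shape: the trained weights are read but never written, a fresh vector for the new text is nudged with a learning rate that decays from alpha toward min_alpha, and randomness in the procedure leads to different results on different runs. As an assumption, the randomness is modelled here by a random starting vector; in the real library it may come from other places, such as negative sampling.

```python
import copy
import random

# Made-up "trained" word vectors; treated as frozen during inference.
word_vectors = {
    "king": [0.9, 0.1],
    "queen": [0.8, 0.3],
    "man": [0.2, 0.7],
}


def toy_infer_vector(words, seed, alpha=0.1, min_alpha=0.0001, steps=5):
    # Hypothetical stand-in for inference, not gensim's actual update rule.
    rng = random.Random(seed)
    # The new document vector starts from a small random initialization.
    doc_vec = [rng.uniform(-0.5, 0.5) for _ in range(2)]
    for step in range(steps):
        # The learning rate decays linearly from alpha toward min_alpha.
        lr = alpha - (alpha - min_alpha) * step / steps
        for word in words:
            if word not in word_vectors:
                continue  # words outside the vocabulary are skipped
            target = word_vectors[word]  # read only, never written to
            doc_vec = [d + lr * (t - d) for d, t in zip(doc_vec, target)]
    return doc_vec


snapshot = copy.deepcopy(word_vectors)
vec_a = toy_infer_vector(["king", "queen", "man"], seed=1)
vec_b = toy_infer_vector(["king", "queen", "man"], seed=2)
vec_a_again = toy_infer_vector(["king", "queen", "man"], seed=1)

print(word_vectors == snapshot)  # checks that the frozen weights are untouched
print(vec_a == vec_b)            # compares results from two different seeds
print(vec_a == vec_a_again)      # compares two runs with the same seed
```

In this toy, alpha and min_alpha only control how far the new vector moves at each step; they never touch word_vectors.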
Abed answered 15/3, 2017 at 18:3 Comment(4)
Are you sure that it doesn't update the model? The infer_vector method takes parameters like alpha and min_alpha. I'm assuming they are learning rates. There's not much given in the documentation, so I am not really sure if they are learning rates or some other parameters. Also, I came to think that it was updating the model because every time I passed the same sentence to infer_vector and then to most_similar, I got different results. - Lapidary
infer_vector, like training, has non-deterministic elements. You will get different vectors on each call. There are a number of discussions about this on Gensim's mailing list and in their issue log on GitHub. Here is a good example: github.com/RaRe-Technologies/gensim/issues/447. Also, you can test whether the model changes. See my edit. - Abed
It's clearly stated in the doc2vec paper that at inference time, all the parameters of the model are fixed. So the model definitely doesn't get updated. - Insuperable
@ClockSlave Yes, infer_vector is changing the model. I am reloading the model after infer_vector and the output is deterministic. Very useful post! - Brioche
