Inspired by this answer, I'm trying to find cosine similarity between a trained trained tf-idf vectorizer and a new document, and return the similar documents.
The code below finds the cosine similarity of the first vector and not a new query
>>> from sklearn.metrics.pairwise import linear_kernel
>>> cosine_similarities = linear_kernel(tfidf[0:1], tfidf).flatten()
>>> cosine_similarities
array([ 1. , 0.04405952, 0.11016969, ..., 0.04433602,
0.04457106, 0.03293218])
Since my train data is huge, looping through the entire trained vectorizer sounds like a bad idea. How can I infer the vector of a new document, and find the related docs, same as the code below?
>>> related_docs_indices = cosine_similarities.argsort()[:-5:-1]
>>> related_docs_indices
array([ 0, 958, 10576, 3277])
>>> cosine_similarities[related_docs_indices]
array([ 1. , 0.54967926, 0.32902194, 0.2825788 ])