Python - tf-idf predict a new document similarity
Inspired by this answer, I'm trying to find the cosine similarity between a trained tf-idf vectorizer and a new document, and return the most similar documents.

The code below finds the cosine similarity of the first vector, not of a new query:

>>> from sklearn.metrics.pairwise import linear_kernel
>>> cosine_similarities = linear_kernel(tfidf[0:1], tfidf).flatten()
>>> cosine_similarities
array([ 1.        ,  0.04405952,  0.11016969, ...,  0.04433602,
    0.04457106,  0.03293218])

Since my training data is huge, looping through the entire trained vectorizer sounds like a bad idea. How can I infer the vector of a new document and find the related docs, as in the code below?

>>> related_docs_indices = cosine_similarities.argsort()[:-5:-1]
>>> related_docs_indices
array([    0,   958, 10576,  3277])
>>> cosine_similarities[related_docs_indices]
array([ 1.        ,  0.54967926,  0.32902194,  0.2825788 ])
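To make the intended workflow concrete, here is a minimal sketch of what I mean by "infer the vector of a new document" (the corpus and variable names are made up; a fitted TfidfVectorizer can transform a new query without refitting):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

train_docs = ["the cat sat on the mat", "dogs bark loudly", "cats and dogs play"]  # stand-in corpus
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(train_docs)          # fit once on the training data

# infer the vector of a new document without refitting
new_vec = vectorizer.transform(["a cat and a dog"])
cosine_similarities = linear_kernel(new_vec, tfidf).flatten()

related_docs_indices = cosine_similarities.argsort()[:-5:-1]  # up to top 4, most similar first
```

This still scores the query against every training document, which is the linear scan I'd like to avoid.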
Philoctetes answered 25/9, 2016 at 16:2 Comment(1)
While there might be better solutions, linear search is not necessarily a bad idea and can be fast if implemented correctly. How huge is your dataset? What query times would be acceptable?Inhalation

You should take a look at gensim. Example starting code looks like this:

from gensim import corpora, models, similarities

dictionary = corpora.Dictionary(line.lower().split() for line in open('corpus.txt'))
corpus = [dictionary.doc2bow(line.lower().split()) for line in open('corpus.txt')]

tfidf = models.TfidfModel(corpus)
index = similarities.SparseMatrixSimilarity(tfidf[corpus], num_features=len(dictionary))  # num_features must cover the whole vocabulary

At prediction time you first get the vector for the new doc:

doc = "Human computer interaction"
vec_bow = dictionary.doc2bow(doc.lower().split())
vec_tfidf = tfidf[vec_bow]

Then get the similarities (sorted by most similar):

sims = index[vec_tfidf]  # similarity of the query against every corpus document
print(sorted(enumerate(sims), key=lambda item: -item[1]))  # (doc_id, similarity) pairs, most similar first

This does the linear scan you wanted to avoid, but gensim's implementation is highly optimized. If that is still not fast enough, you can look into approximate similarity search (Annoy, FALCONN, NMSLIB).

Inhalation answered 26/9, 2016 at 8:48 Comment(2)
Thanks for your reply, I'll have a look and post backPhiloctetes
My notebook crashed. I don't know the reasonEngineman

This problem can be partially addressed by combining the vector space model (which is the tf-idf & cosine similarity) with the Boolean model. These are concepts from information retrieval, and they are used (and nicely explained) in Elasticsearch, a pretty good search engine.

The idea is simple: you store your documents as an inverted index, which is comparable to the index at the back of a book, where each word holds a reference to the pages (documents) it is mentioned in.

Instead of calculating the tf-idf vector for all documents, you calculate it only for the documents that share at least one word (or some threshold number of words) with the query. This can be done simply by looping over the words in the queried document, finding via the inverted index the documents that also contain each word, and calculating the similarity only for those.
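A minimal sketch of the candidate-filtering step (the corpus and names are made up for illustration):

```python
from collections import defaultdict

docs = {
    0: "human computer interaction",
    1: "graph theory survey",
    2: "computer graphics intro",
}

# build the inverted index: word -> ids of the documents containing it
inverted = defaultdict(set)
for doc_id, text in docs.items():
    for word in text.lower().split():
        inverted[word].add(doc_id)

def candidate_docs(query):
    """Ids of documents sharing at least one word with the query."""
    ids = set()
    for word in query.lower().split():
        ids |= inverted.get(word, set())
    return ids

# only these candidates need a tf-idf / cosine score, not the whole corpus
print(candidate_docs("computer science"))  # docs 0 and 2 contain "computer"
```

Documents with no word in common with the query have cosine similarity 0 anyway, so skipping them loses nothing.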

Finback answered 26/9, 2016 at 10:49 Comment(0)

For huge data sets, there is a solution called text clustering by concept; search engines use this technique.

In the first step, you cluster your documents into groups (e.g. 50 clusters); each cluster then has a representative document (which contains words that carry useful information about its cluster).
In the second step, to relate a new document to your data set by cosine similarity, you loop through all the representatives (50 of them) and find the top nearest ones (e.g. 2 representatives).
In the final step, you loop through all documents in the selected clusters and find the nearest by cosine similarity.

With this technique, you reduce the number of loops and improve performance. You can read more about such techniques in some chapters of this book: http://nlp.stanford.edu/IR-book/html/htmledition/irbook.html
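A minimal two-stage sketch of this idea using scikit-learn's KMeans, with cluster centroids standing in for the representative documents (the tiny corpus, cluster count, and names are made up):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["cats purr softly", "dogs bark loudly", "cats chase dogs",
        "stock market news", "market crash report"]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# step 1: cluster the corpus; centroids act as cluster representatives
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

query = vectorizer.transform(["market analysis"])

# step 2: score only the representatives and keep the best cluster
best_cluster = cosine_similarity(query, km.cluster_centers_).argmax()

# step 3: score only the documents inside the chosen cluster
member_ids = np.where(km.labels_ == best_cluster)[0]
sims = cosine_similarity(query, X[member_ids]).flatten()
best_doc = docs[member_ids[sims.argmax()]]
```

With 50 clusters over millions of documents, steps 2 and 3 together score far fewer vectors than a full linear scan, at the cost of the approximation the comment below points out.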

Partridgeberry answered 25/9, 2016 at 16:29 Comment(1)
it is worth stating that this is just a heuristic, which does not guarantee a correct result (it can deviate arbitrarily from the results given by a "true" one-by-one search)Nevlin
