Using sklearn how do I calculate the tf-idf cosine similarity between documents and a query?
My goal is to input 3 queries and find out which query is most similar to a set of 5 documents.

So far I have calculated the tf-idf of the documents doing the following:

from sklearn.feature_extraction.text import TfidfVectorizer

def get_term_frequency_inverse_data_frequency(documents):
    allDocs = []
    for document in documents:
        allDocs.append(nlp.clean_tf_idf_text(document))
    vectorizer = TfidfVectorizer()
    matrix = vectorizer.fit_transform(allDocs)
    return matrix

def get_tf_idf_query_similarity(documents, query):
    tfidf = get_term_frequency_inverse_data_frequency(documents)

The problem I am having: now that I have the tf-idf matrix of the documents, what operations do I perform on the query so I can find its cosine similarity to each document?

Isodynamic answered 14/4, 2019 at 16:6 Comment(0)

Here is my suggestion:

  • We don't have to fit the model twice; we can reuse the same vectorizer.
  • The text-cleaning function can be plugged into TfidfVectorizer directly via its preprocessor parameter.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

vectorizer = TfidfVectorizer(preprocessor=nlp.clean_tf_idf_text)
docs_tfidf = vectorizer.fit_transform(allDocs)

def get_tf_idf_query_similarity(vectorizer, docs_tfidf, query):
    """
    vectorizer: fitted TfidfVectorizer
    docs_tfidf: tfidf vectors for all docs
    query: query doc

    return: cosine similarity between query and all docs
    """
    query_tfidf = vectorizer.transform([query])
    cosineSimilarities = cosine_similarity(query_tfidf, docs_tfidf).flatten()
    return cosineSimilarities
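To show the function above end to end, here is a self-contained sketch with a made-up toy corpus; the asker's nlp.clean_tf_idf_text helper is omitted (its code is not shown in the question), so the vectorizer's default preprocessing is used instead:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus; default preprocessing keeps the snippet self-contained
allDocs = [
    "the cat sat on the mat",
    "dogs chase cats",
    "the stock market fell today",
]

vectorizer = TfidfVectorizer()
docs_tfidf = vectorizer.fit_transform(allDocs)

def get_tf_idf_query_similarity(vectorizer, docs_tfidf, query):
    # Project the query into the vocabulary learned from the documents
    query_tfidf = vectorizer.transform([query])
    return cosine_similarity(query_tfidf, docs_tfidf).flatten()

scores = get_tf_idf_query_similarity(vectorizer, docs_tfidf, "a cat on a mat")
best = scores.argmax()  # index of the most similar document
```

Since the vectorizer is fitted only once, the query lands in the same vector space as the documents, which is what makes the cosine comparison meaningful.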
Brandibrandice answered 15/4, 2019 at 3:53 Comment(1)
Thank you for your answer. This saved me a good night's rest! — Teazel

You can do as Nihal has written in his response, or you can use the nearest-neighbors algorithm from sklearn. You just have to select the proper metric (cosine):

from sklearn.neighbors import NearestNeighbors
neigh = NearestNeighbors(n_neighbors=5, metric='cosine')
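To sketch how this plays out in full (with a made-up five-document corpus, since the answer only shows the constructor): fit the estimator on the tf-idf matrix, then query it. Note that kneighbors returns cosine distances (1 − similarity), nearest first:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Made-up corpus of 5 documents
docs = [
    "the cat sat",
    "dogs bark loudly",
    "cats and dogs",
    "stock prices rose",
    "rain is falling",
]
vectorizer = TfidfVectorizer()
docs_tfidf = vectorizer.fit_transform(docs)

neigh = NearestNeighbors(n_neighbors=5, metric='cosine')
neigh.fit(docs_tfidf)

# Transform the query with the same fitted vectorizer
query_tfidf = vectorizer.transform(["cats and dogs"])

# kneighbors returns cosine *distances* (1 - similarity), nearest first
distances, indices = neigh.kneighbors(query_tfidf)
similarities = 1 - distances
```

With metric='cosine', sklearn falls back to a brute-force search, which is fine at this scale and works directly on the sparse tf-idf matrix.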
Wortham answered 14/4, 2019 at 16:45 Comment(0)

The other answers were very helpful but not entirely what I was looking for as they didn't help me transform my query so I could compare it with the documents.

To transform the query, I first fit a vectorizer on the documents:

queryTFIDF = TfidfVectorizer().fit(allDocs)

I then use it to transform the query into the same vector space as the documents:

queryTFIDF = queryTFIDF.transform([query])

And then just calculate the cosine similarity between all the documents and my query using the sklearn.metrics.pairwise.cosine_similarity function

cosineSimilarities = cosine_similarity(queryTFIDF, docTFIDF).flatten()

Although I realise that, using Nihal's solution, I could input my query as one of the documents and then calculate the similarity between it and the other documents, this is what worked best for me.

The full code ends up looking like:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def get_tf_idf_query_similarity(documents, query):
    allDocs = []
    for document in documents:
        allDocs.append(nlp.clean_tf_idf_text(document))
    docTFIDF = TfidfVectorizer().fit_transform(allDocs)
    # Fit a second vectorizer on the same documents so the query is
    # projected into the same vocabulary, and clean the query the same
    # way the documents were cleaned
    queryTFIDF = TfidfVectorizer().fit(allDocs)
    queryTFIDF = queryTFIDF.transform([nlp.clean_tf_idf_text(query)])

    cosineSimilarities = cosine_similarity(queryTFIDF, docTFIDF).flatten()
    return cosineSimilarities
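To exercise this against the stated goal of 3 queries and 5 documents, here is a hedged, self-contained sketch; nlp.clean_tf_idf_text is the asker's own helper (not shown in the question), so a trivial lowercasing stand-in is used here, and the documents and queries are made up:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Stand-in for the asker's nlp.clean_tf_idf_text helper (not shown)
def clean(text):
    return text.lower()

def get_tf_idf_query_similarity(documents, query):
    allDocs = [clean(document) for document in documents]
    docTFIDF = TfidfVectorizer().fit_transform(allDocs)
    # A second vectorizer fitted on the same cleaned documents learns the
    # same vocabulary, so the query vector aligns with the document matrix
    queryTFIDF = TfidfVectorizer().fit(allDocs)
    queryTFIDF = queryTFIDF.transform([clean(query)])
    return cosine_similarity(queryTFIDF, docTFIDF).flatten()

documents = ["doc about cats", "doc about dogs", "doc about stocks",
             "doc about rain", "doc about python"]
queries = ["tell me about cats", "dogs are loud", "will it rain"]

# Which document is most similar to each query
for q in queries:
    scores = get_tf_idf_query_similarity(documents, q)
    print(q, "-> best doc:", scores.argmax())
```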
Isodynamic answered 14/4, 2019 at 17:33 Comment(4)
I had to do cosine similarity on a list and find which of the elements had maximum similarity to the query. I had the solution worked out in plain Python, but @OultimoCoder's sklearn-based solution worked perfectly. — Sophia
If I compare this to the answer by @Venkatachalam, the difference is the queryTFIDF = TfidfVectorizer().fit(allDocs) step. What is the purpose of this step? — Flaw
What is nlp here? I got an error: name 'nlp' is not defined. — Portmanteau
@Flaw Please see my revision. — Headlock

Cosine similarity is cosine of the angle between the vectors that represent documents.

K(X, Y) = <X, Y> / (||X||*||Y||)
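As a quick sanity check (a sketch with two made-up vectors), the formula can be verified by hand against sklearn's implementation:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Two made-up document vectors
X = np.array([[1.0, 2.0, 0.0]])
Y = np.array([[2.0, 1.0, 1.0]])

# K(X, Y) = <X, Y> / (||X|| * ||Y||), computed manually
manual = (X @ Y.T).item() / (np.linalg.norm(X) * np.linalg.norm(Y))

# sklearn's value for the same pair
sklearn_value = cosine_similarity(X, Y)[0, 0]
```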

Your tf-idf matrix will be a sparse matrix with dimensions (number of documents) × (number of distinct words).

To print the whole matrix you can use todense()

print(tfidf.todense())

Each row is the vector representation of one document. Likewise, each column corresponds to the tf-idf score of one unique word in the corpus.

The pairwise similarity between a reference vector and all the document vectors can be calculated from your tf-idf matrix as:

from sklearn.metrics.pairwise import cosine_similarity
cosine_similarity(reference_vector, tfidf_matrix) 

The output will be an array of length equal to the number of documents, giving the similarity score between your reference vector and the vector corresponding to each document. The similarity between the reference vector and itself will of course be 1; since tf-idf vectors are non-negative, every score lies between 0 and 1.

To find the similarity between the first and second documents:

print(cosine_similarity(tfidf_matrix[0], tfidf_matrix[1]))

array([[0.36651513]])
Pyroligneous answered 14/4, 2019 at 16:37 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.