My goal is to input 3 queries and find out which query is most similar to a set of 5 documents.
So far I have calculated the tf-idf of the documents as follows:
from sklearn.feature_extraction.text import TfidfVectorizer

def get_term_frequency_inverse_data_frequency(documents):
    allDocs = []
    for document in documents:
        allDocs.append(nlp.clean_tf_idf_text(document))
    vectorizer = TfidfVectorizer()
    matrix = vectorizer.fit_transform(allDocs)
    return matrix

def get_tf_idf_query_similarity(documents, query):
    tfidf = get_term_frequency_inverse_data_frequency(documents)
My problem is this: now that I have the tf-idf of the documents, what operations do I perform on the query so that I can find its cosine similarity to the documents?
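For reference, here is a minimal sketch of one possible approach, not a definitive answer. It assumes the fitted vectorizer is kept alongside the matrix so that the query can be projected into the same vocabulary with `transform`, and it uses scikit-learn's `cosine_similarity`. The function names `fit_tfidf`, `query_similarities` and `rank_queries` are made up for illustration, the cleaning step (`nlp.clean_tf_idf_text`) is left out because it is my own helper, and averaging the similarities over all documents is just one assumed way to score a query against the whole set.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def fit_tfidf(documents):
    """Fit a vectorizer on the documents; return it together with the matrix."""
    vectorizer = TfidfVectorizer()
    doc_matrix = vectorizer.fit_transform(documents)
    return vectorizer, doc_matrix


def query_similarities(vectorizer, doc_matrix, query):
    """Return the cosine similarity between one query and every document."""
    # transform (not fit_transform) reuses the vocabulary and idf weights
    # learned from the documents, so the query lands in the same vector space.
    query_vec = vectorizer.transform([query])
    return cosine_similarity(query_vec, doc_matrix).ravel()


def rank_queries(documents, queries):
    """Return (index of best query, list of per-query scores)."""
    vectorizer, doc_matrix = fit_tfidf(documents)
    # Assumption: a query's score is its mean similarity over all documents.
    scores = [
        query_similarities(vectorizer, doc_matrix, q).mean() for q in queries
    ]
    best = max(range(len(queries)), key=scores.__getitem__)
    return best, scores
```

Calling `rank_queries(documents, queries)` with the 5 documents and 3 queries would give the index of the highest-scoring query plus all three scores. If "most similar" should mean the best single match rather than the average, `.max()` could replace `.mean()`.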