Re-calculate similarity matrix given new documents

I'm running an experiment that includes text documents, and I need to calculate the (cosine) similarity matrix between all of them (to use in another calculation). For that I use sklearn's TfidfVectorizer:

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [doc1, doc2, doc3, doc4]
vect = TfidfVectorizer(min_df=1, stop_words="english", use_idf=False)
tfidf = vect.fit_transform(corpus)
similarities = tfidf * tfidf.T
pairwise_similarity_matrix = similarities.A  # dense array of pairwise cosine similarities

The problem is that with each iteration of my experiment I discover new documents that I need to add to the similarity matrix, and given the number of documents I'm working with (tens of thousands and more), recomputing everything is very time-consuming.

I wish to find a way to calculate only the similarities between the new batch of documents and the existing ones, without computing it all again on the entire data set.

Note that I'm using a term-frequency (tf) representation, without the inverse-document-frequency (idf) component, so in theory I don't need to re-calculate the whole matrix each time.
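For intuition: padding two tf vectors with zero columns for newly seen terms changes neither their dot product nor their norms, so their cosine similarity stays the same. A minimal sketch with toy vectors (the numbers are illustrative only):

import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

u = np.array([1.0, 2.0, 0.0])
v = np.array([0.0, 1.0, 3.0])

# Appending zeros for "new" vocabulary terms leaves the dot product
# and the norms unchanged, so the cosine similarity is identical:
u_pad = np.concatenate([u, np.zeros(2)])
v_pad = np.concatenate([v, np.zeros(2)])
assert np.isclose(cosine(u, v), cosine(u_pad, v_pad))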

Leia asked on 20/10/2020 at 9:55. Comments (8):
I think one problem you would run into, if you don't re-calculate every time you discover new documents, is that your TfidfVectorizer, and with it the vocabulary, will not be fitted on the new words in those documents. That could mean that even though you added a new document, most of it might not be usable if the TfidfVectorizer did not adjust its vocabulary to the words in it. – Globulin
@KimTang It's part of the problem - I wish to add the new terms to the same termsXdocs matrix before multiplying it with itself to get the new docsXdocs similarity matrix. – Leia
@KimTang For that part I saw the answer to this question, and using the relevant part of "partial_fit" I can manage to update the vocabulary with the newly seen terms, but it does not help me with the similarity matrix part: #39110243 – Leia
I just had a look at it now, after already posting my answer. I actually don't understand how a partial fit would make sense, since all values could change after discovering a new document, as explained in my answer. But perhaps someone else can help you more with it. – Globulin
Hi @KimTang, thank you for your help! Think of the cosine similarity between 2 vectors: it is based only on the vectors' content, regardless of other vectors and regardless of new entries that are all zeros. – Leia
@KimTang I've just posted an answer that solves this part - you can check it out. – Leia
Thanks for the update! I just deleted my answer now as well. – Globulin
Hope that answer was useful to youLeia

OK, I got it. The idea is, as I said, to calculate the similarities only between the new batch of documents and the existing ones, whose pairwise similarities are unchanged. The remaining problem is to keep the TfidfVectorizer's vocabulary updated with the newly seen terms.

The solution has 2 steps:

  1. Update the vocabulary and the tf matrices.
  2. Matrix multiplications and stacking.

Here's the whole script. We start with the original corpus and the already fitted and computed objects and matrices (imports included for completeness):

import numpy as np
from scipy.sparse import csr_matrix, hstack, vstack
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [doc1, doc2, doc3]
# Build for the first time:
vect = TfidfVectorizer(min_df=1, stop_words="english", use_idf=False)
tf_matrix = vect.fit_transform(corpus)
similarities = tf_matrix * tf_matrix.T
similarities_matrix = similarities.A  # just for printing

Now, given new documents:

new_docs_corpus = [docx, docy, docz]  # New documents
# Build a new vectorizer to parse the vocabulary of the new documents:
new_vect = TfidfVectorizer(min_df=1, stop_words="english", use_idf=False)
new_vect.fit(new_docs_corpus)

# Merge the old and new vocabularies:
new_terms_count = 0
for term in new_vect.vocabulary_:
    if term in vect.vocabulary_:
        continue
    vect.vocabulary_[term] = np.int64(len(vect.vocabulary_))  # important not to assign a plain Python int
    new_terms_count += 1
new_vect.vocabulary_ = vect.vocabulary_

# Build the new docs' representation using the merged vocabulary:
new_tf_matrix = new_vect.transform(new_docs_corpus)
new_similarities = new_tf_matrix * new_tf_matrix.T

# Bring the old tf-matrix to the same dimensions:
if new_terms_count:
    zero_matrix = csr_matrix((tf_matrix.shape[0], new_terms_count))
    tf_matrix = hstack([tf_matrix, zero_matrix])
# tf_matrix = vect.transform(corpus)  # Instead of re-transforming the old
# corpus, we just append zero columns for the new terms, to save time

cross_similarities = new_tf_matrix * tf_matrix.T  # Similarities between new and old docs
tf_matrix = vstack([tf_matrix, new_tf_matrix])
# Stack it all together:
similarities = vstack([hstack([similarities, cross_similarities.T]),
                       hstack([cross_similarities, new_similarities])])
similarities_matrix = similarities.A

# Updating the corpus with the new documents:
corpus = corpus + new_docs_corpus

We can check this by comparing the similarities_matrix we calculated here with the one we get when we train a fresh TfidfVectorizer on the joint corpus corpus + new_docs_corpus, as in the sketch below.
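For example, here is a minimal sketch of that check, continuing the script above (at this point corpus already includes new_docs_corpus, in the same order in which the rows were stacked):

check_vect = TfidfVectorizer(min_df=1, stop_words="english", use_idf=False)
check_tf_matrix = check_vect.fit_transform(corpus)
check_matrix = (check_tf_matrix * check_tf_matrix.T).A
# A fresh fit may order the vocabulary differently than our merged one,
# but the docs-by-docs similarity matrix does not depend on term order:
assert np.allclose(similarities_matrix, check_matrix)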

As discussed in the comments, we can do all this only because we are not using the idf (inverse-document-frequency) component, which would change the representation of existing documents whenever new ones arrive.
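To see why, recall that idf weights depend on document frequencies across the whole corpus, so adding documents re-weights the existing rows as well. A small self-contained illustration (toy documents, chosen so that the vocabulary, and hence the matrix shape, stays fixed):

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

old_docs = ["red apple", "green apple"]
rows_before = TfidfVectorizer(use_idf=True).fit_transform(old_docs).A

# Adding a document that uses only existing terms still shifts the
# document frequencies, and with them the idf weights of every row:
rows_after = TfidfVectorizer(use_idf=True).fit_transform(old_docs + ["red green"]).A[:2]

print(np.allclose(rows_before, rows_after))  # False: the existing rows changed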

Leia answered 27/10/2020 at 20:11. Comments (0)
