Python tf-idf: fast way to update the tf-idf matrix

I have a dataset of several thousand rows of text. My goal is to calculate the tf-idf scores and then the cosine similarity between documents. This is what I did using gensim in Python, following the tutorial:

from gensim import corpora, models, similarities

dictionary = corpora.Dictionary(dat)
corpus = [dictionary.doc2bow(text) for text in dat]

tfidf = models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]
index = similarities.MatrixSimilarity(corpus_tfidf)

Let's say we have the tf-idf matrix and similarity index built. When a new document comes in, I want to query for its most similar document in our existing dataset.

Question: is there any way to update the tf-idf matrix so that I don't have to append the new text document to the original dataset and recalculate everything from scratch?

Cephalad answered 13/2, 2017 at 19:54 Comment(0)

I'll post my solution since there are no other answers. Let's say we are in the following scenario:

from gensim import corpora, models, similarities
from nltk.tokenize import word_tokenize
import pandas as pd

# preprocess: lowercase and tokenize the two example documents
text = "I work on natural language processing and I want to figure out how does gensim work"
text2 = "I love computer science and I code in Python"
dat = pd.Series([text, text2])
dat = dat.apply(lambda x: str(x).lower())
dat = dat.apply(lambda x: word_tokenize(x))


dictionary = corpora.Dictionary(dat)
corpus = [dictionary.doc2bow(doc) for doc in dat]
tfidf = models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]


#Query:
query_text = "I love icecream and gensim"
query_text = query_text.lower()
query_text = word_tokenize(query_text)
vec_bow = dictionary.doc2bow(query_text)
vec_tfidf = tfidf[vec_bow]

If we look at:

print(vec_bow)
[(0, 1), (7, 1), (12, 1), (15, 1)]

and:

print(tfidf[vec_bow])
[(12, 0.7071067811865475), (15, 0.7071067811865475)]

FYI, the id-to-token mapping:

print(dictionary.items())

[(0, u'and'),
 (1, u'on'),
 (8, u'processing'),
 (3, u'natural'),
 (4, u'figure'),
 (5, u'language'),
 (9, u'how'),
 (7, u'i'),
 (14, u'code'),
 (19, u'in'),
 (2, u'work'),
 (16, u'python'),
 (6, u'to'),
 (10, u'does'),
 (11, u'want'),
 (17, u'science'),
 (15, u'love'),
 (18, u'computer'),
 (12, u'gensim'),
 (13, u'out')]

Looks like the query only picked up terms that already exist in the dictionary ("icecream" is dropped) and used the pre-calculated weights to give you the tf-idf score. So my workaround is to rebuild the model weekly or daily, since it is fast to do so.

Cephalad answered 9/5, 2017 at 21:40 Comment(6)
Does this actually work? I would have thought that, due to the nature of tf-idf, you fundamentally can't incrementally update the model (update the tf-idf matrix), because each time a new document comes in you would have to update the IDF values of all the relevant word features contained in the new document across the entire corpus. Also, what happens when a document comes in with a new word - won't you have a feature-length mismatch? Please let me know, as I am also actively researching this problem.Ariana
It's working, but I believe what it does is only query your new document against your existing model. I will edit my answer to show the work.Cephalad
Wow! That's really cool - thanks so much for sharing this. So if I understand correctly, when a new query document comes in, gensim calculates the tf-idf score from the pre-calculated tf-idf matrix and the new query document? Or does it only calculate it from the pre-calculated tf-idf matrix? Updating the model periodically makes more sense if there are constantly new queries coming in, especially if it's expensive to update the model.Ariana
Haven't looked into the source code yet, but since the actual query only happens in this line of code, tfidf[vec_bow], I think it only queries the pre-calculated weights without updating anything. So yeah, you are right, periodic rebuilds can take care of the updating part.Cephalad
I ran into a similar problem recently. Thanks. I am quite confused about how to incrementally update the matrix.Sarene
It's unfortunate you weren't able to find a solution.Seamy

Let me share my thoughts.

One thing is the Corpus, another thing is the Model, and another thing is the Query. I would say it is sometimes easy to mix them up.

1) Corpus and Models

A Corpus is a set of documents, your library, where each document is represented in a certain format. For example, a Corpus_BOW represents your documents as a Bag of Words. A Corpus_TFIDF represents your documents by their TFIDF.

A Model is something that transforms one Corpus representation into another. For example, Model_TFIDF transforms Corpus_BOW --> Corpus_TFIDF. You can have other models, for example a model for Corpus_TFIDF --> Corpus_LSI or Corpus_BOW --> Corpus_LSI.

I would say this is the main nature of the wonderful gensim: it is a corpus transformer. And the objective is to find the corpus representation that best captures similarities between documents for your application.

A couple of important ideas:

  • First, the Model is always built from the entry Corpus, for example: Model_TFIDF = models.TfidfModel(Corpus_BOW, id2word = yourDictionary)
  • Second, if you want your corpus in a format (Corpus_TFIDF), you first need to build the model (Model_TFIDF) and then transform your entry corpus: Corpus_TFIDF = Model_TFIDF[Corpus_BOW].

So, we first build the model with the entry corpus, and then apply the model to the same entry corpus, to obtain the output corpus. Perhaps some steps could be joined, but these are the conceptual steps.

2) Queries and Updates

A given model can be applied to new documents to obtain those documents' TFIDF. For example, New_Corpus_TFIDF = Model_TFIDF[New_Corpus_BOW]. But this is just querying. The Model is not updated with the new corpus/documents. That is, the model is trained on the original corpus and used, as it was, with the new documents.

This is useful when the new document is just a short user query and we want to find the most similar documents in our original corpus. Or when we have just a single new document and we want to find the most similar ones in our corpus. In these cases, if your corpus is large enough, you don't need to update the model.

But let's say your library, your corpus, is something alive, and you want to update your models with new documents as if they had been there from the beginning. Some models can be updated just by giving them the new documents. For example, models.LsiModel has an "add_documents" method for including new corpora in your LSI model (so if you built it with Corpus_BOW, you can update it by just giving it New_Corpus_BOW).

But the TFIDF model has no such "add_documents" method. I don't know if there is a complex and smart mathematical way to overcome this, but the thing is that the "IDF" part of TFIDF depends on the full corpus (previous and new). So, if you add a new document, the IDF of every previous document changes. The only way to update a TFIDF model is to recalculate it from scratch.

In any case, consider that even if you can update a model, you then need to apply it again to your entry corpus to obtain the output corpus, and rebuild the similarity index.

As someone said before, if your library is large enough, you can use the original TFIDF model and apply it to new documents as is, without updating the model. The results are probably good enough. Then, from time to time, when the number of new documents is large, you rebuild the TFIDF model.

Transcendental answered 4/9, 2020 at 9:42 Comment(1)
Conceptually, the IDF part can be updated without the full corpus, as long as you know the total number of documents. For example, if a term has a document frequency of 0.5 over 10 documents (leaving out log scaling and the inverse for simplicity), adding a document without that term lowers the document frequency to about 0.45 (5 out of 11 documents) - no need to keep the actual documents themselves.Abebi
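That bookkeeping can be sketched in plain Python - a hypothetical helper, not part of gensim - which keeps only per-term document frequencies and the total document count:

```python
import math

class IncrementalIdf:
    """Track document frequencies and a document count; the documents
    themselves never need to be stored."""

    def __init__(self):
        self.df = {}   # term -> number of documents containing it
        self.n = 0     # total number of documents seen

    def add_document(self, tokens):
        self.n += 1
        for term in set(tokens):  # count each term once per document
            self.df[term] = self.df.get(term, 0) + 1

    def idf(self, term):
        # gensim's default weighting: log2(N / df); unseen terms get 0
        df = self.df.get(term)
        return math.log2(self.n / df) if df else 0.0

idf = IncrementalIdf()
idf.add_document(["i", "love", "gensim"])
idf.add_document(["i", "code", "in", "python"])
print(idf.idf("gensim"))  # log2(2/1) = 1.0
print(idf.idf("i"))       # log2(2/2) = 0.0
```

This only solves the IDF half, of course: the stored document vectors and the similarity index would still need to be re-weighted after each update, which is why periodic rebuilds remain the practical answer.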
