Let me share my thoughts.
Corpus, Model and Query are three different things, and I would say it is sometimes easy to mix them up.
1) Corpus and Models
A Corpus is a set of documents, your library, where each document is represented in a certain format. For example, a Corpus_BOW represents your documents as a Bag of Words. A Corpus_TFIDF represents your documents by their TFIDF.
A Model is something that transforms one Corpus representation into another. For example, Model_TFIDF transforms Corpus_BOW --> Corpus_TFIDF. You can have other models, for example a model for Corpus_TFIDF --> Corpus_LSI or Corpus_BOW --> Corpus_LSI.
I would say this is the main nature of the wonderful Gensim: to be a Corpus transformer. The objective is to find the corpus representation that best captures the similarities between documents for your application.
A couple of important ideas:
- First, the Model is always built from the input Corpus, for example: Model_TFIDF = models.TfidfModel(Corpus_BOW, id2word = yourDictionary)
- Second, if you want your corpus in a given format (Corpus_TFIDF), you first need to build the model (Model_TFIDF) and then transform your input corpus: Corpus_TFIDF = Model_TFIDF[Corpus_BOW].
So, we first build the model from the input corpus, and then apply the model to that same corpus to obtain the output corpus. Some steps could perhaps be combined, but these are the conceptual steps.
2) Queries and Updates
A given model can be applied to new documents to obtain their TFIDF. For example, New_Corpus_TFIDF = Model_TFIDF[New_Corpus_BOW]. But this is just querying. The Model is not updated with the new corpus/documents. That is, the model is built from the original corpus and used, as is, on the new documents.
This is useful when the new document is just a short user query and we want to find the most similar documents in our original corpus. Or when we have just a single new document and we want to find the most similar ones in our corpus. In these cases, if your corpus is large enough, you don't need to update the model.
But let's say your library, your corpus, is something alive, and you want to update your models with new documents as if they had been there from the beginning. Some models can be updated just by giving them the new documents. For example, models.LsiModel has an "add_documents" method for including a new Corpus in your LSI model (so if you built it with Corpus_BOW, you can update it just by giving it New_Corpus_BOW).
But the TFIDF model doesn't have this "add_documents" method. I don't know if there is a complex and smart mathematical way to overcome this, but the thing is that the "IDF" part of TFIDF depends on the full Corpus (previous and new). So, if you add a new document, the IDF weights of the terms change, and with them the TFIDF vector of every previous document. The only way to update the TFIDF model is to recalculate it.
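A tiny plain-Python illustration of why this happens. I assume here the common definition IDF = log2(N / df), where N is the number of documents and df is the number of documents containing the term. I believe Gensim's default weighting is of this form, but take that as my assumption. The documents and the idf_table helper are made up.

```python
import math

def idf_table(docs):
    """Return {term: log2(N / df)} for a list of tokenized documents."""
    n_docs = len(docs)
    df = {}
    for tokens in docs:
        for term in set(tokens):  # count each term once per document
            df[term] = df.get(term, 0) + 1
    return {term: math.log2(n_docs / count) for term, count in df.items()}

old_docs = [
    ["graph", "trees"],
    ["graph", "minors"],
    ["computer", "system"],
    ["computer", "time"],
]
new_docs = [
    ["graph", "survey"],
    ["graph", "paths"],
    ["graph", "cycles"],
    ["graph", "cuts"],
]

idf_before = idf_table(old_docs)
idf_after = idf_table(old_docs + new_docs)

for term in ("graph", "trees"):
    print(term, idf_before[term], idf_after[term])
```

Note that "trees" does not appear in any new document, yet its IDF goes from 2.0 to 3.0 just because N doubled, while "graph" becomes less informative. So every old TFIDF vector is affected.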
In any case, keep in mind that even if you can update a model, you then need to apply it again to your input corpus to get the output corpus, and rebuild the similarities.
As someone said before, if your library is large enough, you can keep the original TFIDF model and apply it to new documents as is, without updating it. The results are probably good enough. Then, from time to time, when the number of new documents gets large, you rebuild the TFIDF model.