How to classify new documents with tf-idf?
If I use the TfidfVectorizer from sklearn to generate feature vectors as:

vectorizer = TfidfVectorizer(min_df=0.2, ngram_range=(1,3))
features = vectorizer.fit_transform(myDocuments)

how would I then generate feature vectors to classify a new document? You can't compute tf-idf for a single document on its own, since the idf part depends on the whole corpus.

Would it be a correct approach to extract the feature names from the fitted vectorizer with:

feature_names = vectorizer.get_feature_names()

and then count the term frequencies of the new document over those feature_names?

But then I won't get the idf weights that carry the information about a word's importance.

Lionel answered 18/10, 2016 at 15:32 Comment(0)
You need to save the instance of the TfidfVectorizer: it remembers the vocabulary and idf weights that were learned during fitting. It may make things clearer if, rather than using fit_transform, you use fit and transform separately:

vec = TfidfVectorizer(min_df=0.2, ngram_range=(1,3))
vec.fit(myDocuments)
features = vec.transform(myDocuments)
new_features = vec.transform(myNewDocuments)
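
As a minimal, self-contained sketch of the same idea (toy documents and default parameters here, rather than the min_df/ngram_range settings above):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats make good pets",
]

vec = TfidfVectorizer()
vec.fit(train_docs)                        # learns vocabulary and idf weights
train_features = vec.transform(train_docs)

# The fitted vectorizer reuses the stored vocabulary and idf weights,
# so a single new document gets a vector of the same width; terms it
# never saw during fit are simply ignored.
new_features = vec.transform(["the cat chased a mouse"])
print(new_features.shape[1] == train_features.shape[1])  # True
```

Persisting the fitted vectorizer between training and prediction (e.g. with joblib) is what keeps the idf weights available for vectorizing new documents.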
Limewater answered 19/10, 2016 at 4:21
I would rather use gensim with Latent Semantic Indexing (LSI) as a wrapper over the original corpus: bow -> tfidf -> lsi

tfidf = models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]
lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=300)
corpus_lsi = lsi[corpus_tfidf] # create a double wrapper over the original corpus: bow->tfidf->fold-in-lsi

Then if you need to continue the training:

another_tfidf_corpus = tfidf[another_corpus]  # tf-idf vectors for the new bag-of-words documents
lsi.add_documents(another_tfidf_corpus)       # now LSI has been trained on corpus_tfidf + another_tfidf_corpus
lsi_vec = lsi[tfidf_vec]                      # convert a new tf-idf document vector into the LSI space

where corpus is the bag-of-words representation of the documents and dictionary is the gensim Dictionary mapping token ids to words.

As you can read in their tutorials:
LSI training is unique in that we can continue “training” at any point, simply by providing more training documents. This is done by incremental updates to the underlying model, in a process called online training. Because of this feature, the input document stream may even be infinite – just keep feeding LSI new documents as they arrive, while using the computed transformation model as read-only in the meanwhile!

If you like scikit-learn, gensim also interoperates well with NumPy.

Nominal answered 1/2, 2018 at 11:1 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.