I have read many blogs but was not satisfied with the answers. Suppose I train a tf-idf model on a few documents, for example:
" John like horror movie."
" Ryan watches dramatic movies"
... and so on.
I use this code:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# twenty_train.data holds the training documents
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(twenty_train.data)
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
print(X_train_counts.todense())
# Gives the count of each word in each document
But it doesn't tell me which word each column refers to. How can I get the words as column headers in the X_train_counts output, and similarly in X_train_tfidf?
So the X_train_tfidf output will be a matrix of tf-idf scores, like this:
        horror   watch   movie   drama
doc1    score1   ...     ...     ...
doc2    ...      ...     ...     ...
Is this correct?
What does fit do, and what does transform do?
The sklearn documentation says:
fit(..) method to fit our estimator to the data and secondly the transform(..) method to transform our count-matrix to a tf-idf representation.
What does "fit our estimator to the data" mean?
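My current understanding, which may be wrong, is that fit learns something from the data and stores it on the object, and transform then applies what was learned. Here is a small sketch of that reading, again on my two toy sentences. vocabulary_ and idf_ are the fitted attributes of CountVectorizer and TfidfTransformer; the variable names are only for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Toy corpus: the two example sentences from the question.
docs = ["John like horror movie.", "Ryan watches dramatic movies"]

# fit: learn the vocabulary (word -> column index) from the training text.
count_vect = CountVectorizer()
count_vect.fit(docs)
print(count_vect.vocabulary_)

# transform: use the learned vocabulary to turn text into count vectors.
counts = count_vect.transform(docs)

# fit: learn one idf weight per vocabulary word from the counts.
tfidf_transformer = TfidfTransformer()
tfidf_transformer.fit(counts)
print(tfidf_transformer.idf_)

# transform: re-weight the counts using the learned idf values.
tfidf = tfidf_transformer.transform(counts)

# fit_transform does both steps on the same data in one call.
same = TfidfTransformer().fit_transform(counts)
print(np.allclose(tfidf.toarray(), same.toarray()))
```

If this reading is right, "the estimator" is just the object (count_vect or tfidf_transformer) and "fitting it to the data" means computing vocabulary_ or idf_ from the training documents.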
Now suppose a new test document comes in:
" Ron likes thriller movies"
How do I convert this document to tf-idf? We can't convert it to tf-idf, right?
How should I handle the word "thriller", which does not appear in the training documents?