I was recently working through the Bag of Words introduction on Kaggle, and I want to clear up a few things:
When we prepared the bag-of-words array for the train reviews, we used vectorizer.fit_transform() on the list of cleaned train reviews. I know that fit_transform does two things: first it fits on the data and learns the vocabulary, and then it builds a vector for each review. Then, when we used vectorizer.transform() on the list of cleaned test reviews, it only transformed the test reviews into a vector for each review.
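To make the setup concrete, the pattern I'm describing looks roughly like this (a minimal sketch with made-up reviews; the variable names and max_features=5000 are just illustrative placeholders, not the tutorial's exact code):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.ensemble import RandomForestClassifier

    # tiny stand-ins for the cleaned review lists and the train labels
    clean_train_reviews = ["great movie loved the acting", "terrible plot and bad acting"]
    train_sentiments = [1, 0]
    clean_test_reviews = ["loved the plot", "bad boring movie"]

    vectorizer = CountVectorizer(analyzer="word", max_features=5000)

    # fit_transform: learn the vocabulary from the train reviews,
    # then turn each train review into a count vector
    train_features = vectorizer.fit_transform(clean_train_reviews)

    # transform only: reuse that same vocabulary to vectorize the test reviews;
    # words not seen in training are simply ignored
    test_features = vectorizer.transform(clean_test_reviews)

    forest = RandomForestClassifier(n_estimators=100)
    forest.fit(train_features.toarray(), train_sentiments)
    predictions = forest.predict(test_features.toarray())
    print(predictions)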
My question is: why not use fit_transform on the test list too? I know the documentation says it leads to overfitting, but it still makes sense to me to use it anyway; let me give you my perspective:
When we don't use fit_transform on the test list, we are essentially saying: build the feature vectors of the test reviews using the most frequent words of the train reviews. Why not build the test feature array using the most frequent words in the test reviews themselves?
I mean, does the random forest care? If we give the random forest the train feature array and the train sentiments to train on, and then give it the test feature array, won't it just output its predictions for the sentiment?
Calling fit (or fit_transform) on the test data is not merely an unwise decision because of duplicated effort; it is plain wrong and can lead to multiple issues downstream, including plain programming errors. – Mandamandaean