I was recently working through the Bag of Words introduction on Kaggle, and I want to clear up a few things:
When we prepared the bag-of-words array for the train reviews, we used vectorizer.fit_transform() on the list of cleaned train reviews. I know that fit_transform does two things: first it fits on the data and learns the vocabulary, and then it builds a vector for each review. Then, when we used vectorizer.transform() on the list of cleaned test reviews, it only transformed the test reviews into a vector for each review.
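To make the setup concrete, the pattern I'm describing looks roughly like this (a minimal sketch with made-up reviews; the variable names and max_features=5000 are just illustrative placeholders, not the tutorial's exact code):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.ensemble import RandomForestClassifier

    # tiny stand-ins for the cleaned review lists and the train labels
    clean_train_reviews = ["great movie loved the acting", "terrible plot and bad acting"]
    train_sentiments = [1, 0]
    clean_test_reviews = ["loved the plot", "bad boring movie"]

    vectorizer = CountVectorizer(analyzer="word", max_features=5000)

    # fit_transform: learn the vocabulary from the train reviews,
    # then turn each train review into a count vector
    train_features = vectorizer.fit_transform(clean_train_reviews)

    # transform only: reuse that same vocabulary to vectorize the test reviews;
    # words not seen in training are simply ignored
    test_features = vectorizer.transform(clean_test_reviews)

    forest = RandomForestClassifier(n_estimators=100)
    forest.fit(train_features.toarray(), train_sentiments)
    predictions = forest.predict(test_features.toarray())
    print(predictions)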
My question is: why not use fit_transform on the test list too? I know the documentation says it leads to overfitting, but it still makes sense to me to use it anyway; let me give you my perspective:
When we don't use fit_transform on the test list, we are essentially saying: build the feature vectors of the test reviews using the most frequent words of the train reviews. Why not build the test feature array using the most frequent words in the test reviews themselves?
I mean, does the random forest care? If we give the random forest the train feature array and the train sentiments to train on, and then give it the test feature array, won't it just output its predictions for the sentiment?
Calling fit (or fit_transform) on the test data is not merely an unwise decision because of duplicated effort; it is plain wrong and can lead to multiple issues downstream, including plain programming errors. – Mandamandaean