How to scale train, validation and test sets properly using StandardScaler?

Asked 12/11, 2019 at 16:54 Answered 29/5 at 5:2

Solved python machine-learning scikit-learn

Some articles say that when you have only train and test sets, you should first call fit_transform() on the training set and then only transform() on the test set, in order to prevent data leakage.

In my case, I also have a validation set.

I think one of the two snippets below should be fine, but I am not sure which one to rely on. Any kind of help will be appreciated, thanks!

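# Option 1: split into train/val/test first, then fit the scaler on train only
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=2/7)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_val = scaler.transform(X_val)
X_test = scaler.transform(X_test)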

# Option 2: fit the scaler before splitting off the validation set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)

X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=2/7)
X_test = scaler.transform(X_test)

Calfskin answered 12/11, 2019 at 16:54 Comment(3)

First code is generally considered best practice. Fitting the scaler on only the training data prevents data leakage between your model training and model validation – Oram 12/11, 2019 at 17:0

Number 1) is correct. In terms of scaling you should treat test and val data in the same way. – Linetta 12/11, 2019 at 17:2

Thank you for your help! I got the point without any question mark. – Calfskin 12/11, 2019 at 19:20

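Generally you would want to use the Option 1 code. The reason for calling fit and then transform on the training data is: a) fit computes the per-feature mean and standard deviation of the training set and stores them in the scaler; b) transform then standardizes the data using those stored statistics. Calling only transform on the validation and test sets therefore scales them with the training statistics. In Option 2 the scaler is fitted before the validation rows are split off, so the validation data influences the scaling.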
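
To make the fit/transform distinction concrete, here is a minimal sketch on synthetic data. The arrays, the random seeds and the variable name `manual` are assumptions for illustration and are not part of the question. It checks that the fitted scaler holds the training statistics and that transform() simply applies them.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data, purely for illustration (not from the question).
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
y = rng.integers(0, 2, size=100)

# Same two-step split as Option 1 (roughly 50/20/30).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=2/7, random_state=0
)

scaler = StandardScaler()
scaler.fit(X_train)  # learns per-feature mean and std from training rows only

# The stored statistics are exactly those of the training set.
print(np.allclose(scaler.mean_, X_train.mean(axis=0)))
print(np.allclose(scaler.scale_, X_train.std(axis=0)))

# transform() applies (x - mean_) / scale_ using the stored training statistics.
X_val_scaled = scaler.transform(X_val)
manual = (X_val - scaler.mean_) / scaler.scale_
print(np.allclose(X_val_scaled, manual))
```

All three checks should print True, since StandardScaler uses the population standard deviation, which matches NumPy's default.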

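If you call fit again on the test set, the scaler is re-estimated from the test data, so the test set is no longer scaled the same way as the training data, and this is going to bias your evaluation.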
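
A small sketch of what goes wrong when the scaler is refitted on the test set. The arrays below are hand-made and hypothetical; they are chosen so that the test set is shifted relative to the training set.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Tiny hand-made arrays, purely illustrative.
X_train = np.array([[0.0], [2.0], [4.0]])    # mean 2
X_test = np.array([[10.0], [12.0], [14.0]])  # mean 12, same spread

scaler = StandardScaler().fit(X_train)
train_scaled = scaler.transform(X_train)

# Correct: the test set is expressed in training-set units,
# so its shift relative to the training data is preserved.
test_ok = scaler.transform(X_test)

# Problematic: refitting re-centres the test set on its own mean,
# which hides the shift from the model.
test_refit = StandardScaler().fit_transform(X_test)

print(test_ok.ravel())
print(test_refit.ravel())
print(np.allclose(test_refit, train_scaled))
```

With these numbers the refitted test set becomes indistinguishable from the scaled training set, while the correctly transformed test set stays far from zero.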

Toxoplasmosis answered 12/11, 2019 at 17:10 Comment(0)

Scale the dataset, then try logistic regression, a support vector machine and a random forest. How do the results compare to the Lipinski descriptors?

Inopportune answered 29/5 at 5:2 Comment(0)
