Some articles says that in case of having only train and test sets, first, we need to use fit_transform() to scale training set and then only transform() for test set, in order to prevent data leakage.
In my case, I have also validation set.
I think one of these codes below would be okay to use but I cannot rely on them completely. Any kind of help will be appreciated, thanks!
1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size = 2/7)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_val = scaler.transform(X_val)
X_test = scaler.transform(X_test)
2)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size = 2/7)
X_test = scaler.transform(X_test)
test
andval
data in the same way. – Linetta