When performing classification (for example, logistic regression) on an imbalanced dataset (e.g., fraud detection), is it best to scale/z-score/standardize the features before over-sampling the minority class, or to balance the classes before scaling the features?
Secondly, does the order of these steps affect how features will eventually be interpreted (when using all data, scaled+balanced, to train a final model)?
Here's an example:
Scale first (see the sketch after this list):
- Split data into train/test folds
- Calculate mean/std using all training (imbalanced) data; scale the training data using these calculations
- Oversample the minority class in the training data (e.g., using SMOTE)
- Fit logistic regression model to training data
- Use mean/std calculations to scale the test data
- Predict classes for the (still imbalanced) test data; assess accuracy/recall/precision/AUC
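Here is a minimal sketch of the "scale first" order, assuming imbalanced-learn is installed; `make_classification` is only a stand-in for a real imbalanced (e.g., fraud) dataset:

```python
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             roc_auc_score)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data: ~5% minority class
X, y = make_classification(n_samples=5000, n_features=10,
                           weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

# Fit the scaler on the raw, imbalanced training data only
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)

# Oversample the minority class *after* scaling
X_bal, y_bal = SMOTE(random_state=0).fit_resample(X_train_s, y_train)

# Fit logistic regression on the scaled, balanced training data
model = LogisticRegression(max_iter=1000).fit(X_bal, y_bal)

# Scale the untouched, imbalanced test data with the training mean/std
X_test_s = scaler.transform(X_test)
pred = model.predict(X_test_s)
prob = model.predict_proba(X_test_s)[:, 1]
print("acc:", accuracy_score(y_test, pred),
      "recall:", recall_score(y_test, pred),
      "precision:", precision_score(y_test, pred),
      "auc:", roc_auc_score(y_test, prob))
```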
Oversample first (see the sketch after this list):
- Split data into train/test folds
- Oversample the minority class in the training data (e.g., using SMOTE)
- Calculate mean/std using balanced training data; scale the training data using these calculations
- Fit logistic regression model to training data
- Use mean/std calculations to scale the test data
- Predict classes for the (still imbalanced) test data; assess accuracy/recall/precision/AUC
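And a matching sketch of the "oversample first" order, under the same assumptions; the only change is that SMOTE runs on the raw features and the scaler is fitted to the balanced data, so the synthetic minority samples contribute to the mean/std applied to the test set:

```python
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             roc_auc_score)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Same synthetic imbalanced data as above
X, y = make_classification(n_samples=5000, n_features=10,
                           weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

# Oversample the minority class on the raw (unscaled) training data
X_bal, y_bal = SMOTE(random_state=0).fit_resample(X_train, y_train)

# Fit the scaler on the *balanced* training data, so the synthetic
# samples pull on the mean/std
scaler = StandardScaler().fit(X_bal)
X_bal_s = scaler.transform(X_bal)

# Fit logistic regression on the balanced, scaled training data
model = LogisticRegression(max_iter=1000).fit(X_bal_s, y_bal)

# Scale the imbalanced test data with those same mean/std and evaluate
X_test_s = scaler.transform(X_test)
pred = model.predict(X_test_s)
prob = model.predict_proba(X_test_s)[:, 1]
print("acc:", accuracy_score(y_test, pred),
      "recall:", recall_score(y_test, pred),
      "precision:", precision_score(y_test, pred),
      "auc:", roc_auc_score(y_test, prob))
```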