Imbalanced classification: order of oversampling vs. scaling features?

When performing classification (for example, logistic regression) with an imbalanced dataset (e.g., fraud detection), is it best to scale/z-score/standardize the features before over-sampling the minority class, or to balance the classes before scaling the features?

Secondly, does the order of these steps affect how features will eventually be interpreted (when using all data, scaled+balanced, to train a final model)?

Here's an example:

Scale first:

  1. Split data into train/test folds
  2. Calculate mean/std using all training (imbalanced) data; scale the training data using these calculations
  3. Oversample minority class in the training data (e.g., using SMOTE)
  4. Fit logistic regression model to training data
  5. Use mean/std calculations to scale the test data
  6. Predict classes for the (imbalanced) test data; assess accuracy/recall/precision/AUC
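
A minimal sketch of this ordering in Python, assuming a feature matrix `X` and binary labels `y`, with scikit-learn plus the imbalanced-learn package for SMOTE:

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score
from imblearn.over_sampling import SMOTE

# 1. Split into train/test folds (stratify to keep the class ratio in both)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# 2. Fit the scaler on the imbalanced training data only
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)

# 3. Oversample the minority class in the scaled training data
X_train_bal, y_train_bal = SMOTE(random_state=0).fit_resample(X_train_scaled, y_train)

# 4. Fit the model on the balanced, scaled training data
clf = LogisticRegression(max_iter=1000).fit(X_train_bal, y_train_bal)

# 5. Scale the test data with the training-set mean/std
X_test_scaled = scaler.transform(X_test)

# 6. Evaluate on the untouched, imbalanced test fold
print(classification_report(y_test, clf.predict(X_test_scaled)))
print("ROC AUC:", roc_auc_score(y_test, clf.predict_proba(X_test_scaled)[:, 1]))
```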

Oversample first:

  1. Split data into train/test folds
  2. Oversample minority class in the training data (e.g., using SMOTE)
  3. Calculate mean/std using balanced training data; scale the training data using these calculations
  4. Fit logistic regression model to training data
  5. Use mean/std calculations to scale the test data
  6. Predict classes for the (imbalanced) test data; assess accuracy/recall/precision/AUC
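
And the same sketch with steps 2 and 3 swapped (reusing the imports and the `X`/`y` assumption from the first sketch), so the scaler's mean/std come from the balanced data:

```python
# 1. Same split as before
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# 2. Oversample the raw (unscaled) training data first
X_train_bal, y_train_bal = SMOTE(random_state=0).fit_resample(X_train, y_train)

# 3. Fit the scaler on the balanced training data
scaler = StandardScaler().fit(X_train_bal)
X_train_scaled = scaler.transform(X_train_bal)

# 4-6. Fit and evaluate exactly as in the first sketch
clf = LogisticRegression(max_iter=1000).fit(X_train_scaled, y_train_bal)
X_test_scaled = scaler.transform(X_test)
print(classification_report(y_test, clf.predict(X_test_scaled)))
print("ROC AUC:", roc_auc_score(y_test, clf.predict_proba(X_test_scaled)[:, 1]))
```
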
Sparry answered 21/1, 2018 at 17:10

You may have meant it implicitly, but you need to apply the mean/std to scale the training data as well, and that needs to happen before you fit the model.

That point aside, there isn't a definitive answer on this. The best thing is simply to try both orderings and see which works better for your data.

For your own understanding of the model on the resulting data, you may instead want to compare the mean and standard deviation of the minority and majority classes. If the two classes have similar statistics, we wouldn't expect much difference between scale-first and oversample-first.

If the means and standard deviations are very different, the two orderings may give noticeably different results. But that also suggests the classes are better separated, so you might expect higher classification accuracy in any case.
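
A rough sketch of that per-class comparison, assuming `X` is a pandas DataFrame of features and `y` the corresponding label Series:

```python
import pandas as pd  # assumes X is a DataFrame and y a Series aligned with it

# One row per class, with the mean and standard deviation of every feature;
# large gaps between the rows suggest the ordering of scaling vs. oversampling
# is more likely to matter.
per_class_stats = X.groupby(y).agg(['mean', 'std'])
print(per_class_stats)
```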

Onstage answered 21/1, 2018 at 17:21

Since under- and over-sampling techniques often rely on k-NN or k-means-like algorithms that use Euclidean distances between data points (SMOTE, for instance, interpolates between a minority sample and its nearest neighbours), it is safer to scale before resampling. In practice, however, the order hardly matters.
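
If you use imbalanced-learn, its Pipeline lets you encode this ordering explicitly; here is a minimal sketch, assuming `X_train`/`y_train` and `X_test`/`y_test` come from a stratified split as in the question:

```python
from imblearn.pipeline import Pipeline          # pipeline that understands samplers
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([
    ('scale', StandardScaler()),        # mean/std estimated on the training fold only
    ('smote', SMOTE(random_state=0)),   # k-NN interpolation happens in the scaled space
    ('clf', LogisticRegression(max_iter=1000)),
])

# SMOTE is applied only during fit; predictions on the test fold see the
# original, imbalanced data after scaling.
pipe.fit(X_train, y_train)
print(pipe.score(X_test, y_test))
```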

Manipulator answered 23/11, 2022 at 13:57
