Dimensionality reduction, normalization, resampling, k-fold CV... In what order?
In Python, I am working on a binary classification problem: fraud detection in travel insurance. Here are the characteristics of my dataset:

  • Contains 40,000 samples with 20 features. After one-hot encoding, the number of features is 50 (4 numeric, 46 categorical).
  • Majority unlabeled: out of 40,000 samples, 33,000 samples are unlabeled.
  • Highly imbalanced: out of 7,000 labeled samples, only 800 samples (11%) are positive (fraud).

The metrics are precision, recall, and F2 score. We focus more on avoiding false negatives, so high recall is appreciated (hence F2, which weights recall more heavily than precision). As preprocessing, I oversampled positive cases using SMOTE-NC, which takes categorical variables into account as well.
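For reference, F2 is the F-beta score with beta = 2, computable directly from confusion counts. A minimal sketch in plain Python (the counts below are hypothetical, chosen only to roughly match the test-set precision and recall reported further down):

```python
def precision_recall_f2(tp, fp, fn):
    """Compute precision, recall and F2 from confusion counts.
    F-beta = (1 + b^2) * P * R / (b^2 * P + R); with b = 2 this
    weights recall four times as heavily as precision."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f2 = (5 * precision * recall / (4 * precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f2

# Hypothetical counts: precision 10%, recall 70%, similar to the
# test-set numbers described below.
p, r, f2 = precision_recall_f2(tp=560, fp=5040, fn=240)
```

Note how a precision of 0.10 drags F2 down even at a recall of 0.70, which is why the low test precision hurts so much despite the recall emphasis.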

After trying several approaches, including semi-supervised learning with self-training and label propagation/label spreading, I achieved a high recall score (80% on training, 65-70% on test). However, my precision score shows signs of overfitting (60-70% on training, 10% on test). I understand that precision looks good on training because the training data is resampled, and low on test data because the test set directly reflects the class imbalance. But this precision is unacceptably low, so I want to improve it.

So, to simplify the model, I am thinking about applying dimensionality reduction. I found a package called prince, which includes FAMD (Factor Analysis of Mixed Data).

Question 1: In what order should I do normalization, FAMD, k-fold cross-validation, and resampling? Is my approach below correct?

Question 2: The prince package does not have methods such as fit or transform like scikit-learn's, so I cannot do the 3rd step described below. Are there other good packages that provide fit and transform for FAMD? And is there any other good way to reduce dimensionality on this kind of dataset?
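One pragmatic scikit-learn-only alternative (not equivalent to FAMD, but similar in spirit: standardize the numeric columns, one-hot encode the categoricals, then project with truncated SVD) does give you proper fit/transform semantics. A minimal sketch on toy mixed data:

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import Pipeline

# Toy mixed data: 2 numeric columns, 1 categorical column
X = np.array([[1.0, 10.0, "a"],
              [2.0, 20.0, "b"],
              [3.0, 30.0, "a"],
              [4.0, 40.0, "c"]], dtype=object)

# Scale numerics and one-hot encode categoricals, then reduce.
pre = ColumnTransformer([
    ("num", StandardScaler(), [0, 1]),
    ("cat", OneHotEncoder(), [2]),
])
reducer = Pipeline([
    ("pre", pre),
    ("svd", TruncatedSVD(n_components=2, random_state=0)),
])

Z = reducer.fit_transform(X)   # fit on training data...
Z2 = reducer.transform(X)      # ...then reuse the same fit on validation data
```

Because the whole thing is a sklearn transformer, it slots directly into the per-fold fit/transform scheme described below.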

My approach:

  1. Make k folds and set one aside for validation, using the rest for training
  2. Fit normalization on the training data and transform both training and validation data
  3. Fit FAMD on the training data, and transform both training and validation data
  4. Resample only the training data using SMOTE-NC
  5. Train the model and evaluate it on the validation data
  6. Repeat steps 2-5 k times and take the average of precision, recall, and F2 score
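The steps above can be sketched as an explicit fold loop. This is a sketch on synthetic data, with PCA standing in for FAMD and simple random oversampling standing in for SMOTE-NC; the key point is that every fit (scaler, reducer, sampler, model) happens only on the training portion of each fold:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import precision_score, recall_score, fbeta_score

# Synthetic imbalanced data as a stand-in for the real dataset
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.89, 0.11], random_state=0)

rng = np.random.default_rng(0)
scores = []
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, val_idx in cv.split(X, y):                    # step 1
    X_tr, y_tr = X[train_idx], y[train_idx]
    X_val, y_val = X[val_idx], y[val_idx]

    scaler = StandardScaler().fit(X_tr)                      # step 2
    X_tr, X_val = scaler.transform(X_tr), scaler.transform(X_val)

    famd = PCA(n_components=10).fit(X_tr)                    # step 3 (PCA as stand-in)
    X_tr, X_val = famd.transform(X_tr), famd.transform(X_val)

    # Step 4: oversample positives in the training fold only
    # (random duplication as a stand-in for SMOTE-NC)
    pos = np.flatnonzero(y_tr == 1)
    extra = rng.choice(pos, size=(y_tr == 0).sum() - len(pos))
    X_tr = np.vstack([X_tr, X_tr[extra]])
    y_tr = np.concatenate([y_tr, y_tr[extra]])

    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)  # step 5
    pred = clf.predict(X_val)
    scores.append((precision_score(y_val, pred),
                   recall_score(y_val, pred),
                   fbeta_score(y_val, pred, beta=2)))

mean_p, mean_r, mean_f2 = np.mean(scores, axis=0)            # step 6
```

With a real SMOTE-NC sampler, imbalanced-learn's Pipeline can express the same thing more compactly, since it knows to apply samplers during fit only.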

*I would also appreciate any advice on my overall approach to this problem.

Thanks!

Animal answered 6/6, 2019 at 13:35 Comment(2)
With Sklearn you don't need to care about the train/validation split in k-fold if you use the cross_val_score method. See scikit-learn.org/stable/modules/generated/… – Natividad
You did not specify which classifier you are using. Which one have you used so far? – Solferino
