In Python I am working on a binary classification problem: fraud detection on travel insurance. Here are the characteristics of my dataset:
- Contains 40,000 samples with 20 features. After one-hot encoding, the number of features is 50 (4 numeric, 46 categorical).
- Majority unlabeled: out of 40,000 samples, 33,000 are unlabeled.
- Highly imbalanced: of the 7,000 labeled samples, only 800 (11%) are positive (fraud).
The metrics are precision, recall, and F2 score. We care more about avoiding false negatives, so high recall is valued. As preprocessing, I oversampled the positive cases using SMOTE-NC, which takes categorical variables into account as well.
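For reference, here is a minimal sketch of that oversampling step, assuming imbalanced-learn's `SMOTENC`; the stand-in data and column indices are hypothetical placeholders for my actual dataset:

```python
# Minimal sketch of the SMOTE-NC oversampling step, assuming
# imbalanced-learn's SMOTENC; data and column indices are placeholders.
import numpy as np
from imblearn.over_sampling import SMOTENC

rng = np.random.default_rng(0)
# Stand-in for the 7,000 labeled samples: 4 numeric + 46 one-hot columns.
X = np.hstack([rng.random((7000, 4)),
               rng.integers(0, 2, (7000, 46)).astype(float)])
y = np.r_[np.ones(800), np.zeros(6200)].astype(int)  # ~11% positive

cat_idx = list(range(4, 50))  # indices of the categorical columns
smote = SMOTENC(categorical_features=cat_idx, random_state=42)
X_res, y_res = smote.fit_resample(X, y)  # classes are now balanced
```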
After trying several approaches, including semi-supervised learning with self-training and label propagation/label spreading, I achieved a high recall score (80% on training, 65-70% on test). However, my precision score shows signs of overfitting (60-70% on training, 10% on test). I understand why precision looks good on training (the training data is resampled) and why it is low on test (the test data directly reflects the class imbalance), but this precision is unacceptably low, so I want to improve it.
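For context, the self-training variant looked roughly like the sketch below, assuming scikit-learn's `SelfTrainingClassifier` (the base estimator, threshold, and data are placeholders); sklearn's semi-supervised API expects unlabeled samples to be marked -1:

```python
# Rough sketch of the self-training setup, assuming scikit-learn's
# SelfTrainingClassifier; the base estimator and data are placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

rng = np.random.default_rng(0)
X = rng.random((40000, 50))
y = np.full(40000, -1)          # -1 marks the 33,000 unlabeled samples
y[:7000] = np.r_[np.ones(800), np.zeros(6200)]  # the labeled subset

model = SelfTrainingClassifier(LogisticRegression(max_iter=1000),
                               threshold=0.9)
model.fit(X, y)                 # pseudo-labels confident unlabeled rows
```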
So, to simplify the model, I am thinking about applying dimensionality reduction. I found a package called `prince`, which comes with FAMD (Factor Analysis of Mixed Data).

Question 1: How should I combine normalization, FAMD, k-fold cross-validation, and resampling? Is my approach below correct?

Question 2: The `prince` package does not have methods such as `fit` or `transform` like in `sklearn`, so I cannot do the 3rd step described below. Are there any other good packages that provide `fit` and `transform` for FAMD? And is there any other good way to reduce dimensionality on this kind of dataset?
My approach (a code sketch of these steps follows the list):
- Make k folds; hold one out for validation and use the rest for training
- Fit the normalizer on the training folds and transform the validation fold
- Fit FAMD on the training folds, and transform both the training and validation data
- Resample only the training data using SMOTE-NC
- Train whatever model it is and evaluate on the validation fold
- Repeat steps 2-5 k times and average the precision, recall, and F2 scores
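Here is how I picture those steps in code, assuming scikit-learn and imbalanced-learn, and using `TruncatedSVD` as a hypothetical stand-in for FAMD (finding a FAMD transformer with `fit`/`transform` is exactly Question 2). Note that in this sketch I resample before the reduction step, because after reduction all columns are continuous and SMOTE-NC's categorical handling no longer applies; with a real FAMD step in the original order, plain SMOTE on the reduced features might fit better. All data and model choices below are placeholders:

```python
# Sketch of steps 1-6, assuming scikit-learn + imbalanced-learn, with
# TruncatedSVD as a hypothetical stand-in for FAMD (see Question 2).
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score, fbeta_score
from imblearn.over_sampling import SMOTENC

rng = np.random.default_rng(0)
X = np.hstack([rng.random((7000, 4)),                        # numeric
               rng.integers(0, 2, (7000, 46)).astype(float)])  # one-hot
y = np.r_[np.ones(800), np.zeros(6200)].astype(int)

num_idx, cat_idx = list(range(4)), list(range(4, 50))
precisions, recalls, f2s = [], [], []

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, val_idx in skf.split(X, y):             # step 1: k folds
    X_tr, X_val = X[train_idx].copy(), X[val_idx].copy()
    y_tr, y_val = y[train_idx], y[val_idx]

    # Step 2: fit the scaler on the training folds only.
    scaler = StandardScaler()
    X_tr[:, num_idx] = scaler.fit_transform(X_tr[:, num_idx])
    X_val[:, num_idx] = scaler.transform(X_val[:, num_idx])

    # Step 4 moved before step 3 so SMOTE-NC still sees categoricals.
    smote = SMOTENC(categorical_features=cat_idx, random_state=42)
    X_tr, y_tr = smote.fit_resample(X_tr, y_tr)

    # Step 3: fit the reducer on the training folds only.
    svd = TruncatedSVD(n_components=20, random_state=42)
    X_tr_red = svd.fit_transform(X_tr)
    X_val_red = svd.transform(X_val)

    # Step 5: train any classifier and evaluate on the held-out fold.
    clf = RandomForestClassifier(random_state=42).fit(X_tr_red, y_tr)
    pred = clf.predict(X_val_red)
    precisions.append(precision_score(y_val, pred))
    recalls.append(recall_score(y_val, pred))
    f2s.append(fbeta_score(y_val, pred, beta=2))

# Step 6: average the per-fold metrics.
print(np.mean(precisions), np.mean(recalls), np.mean(f2s))
```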
*I would also appreciate any advice on my overall approach to this problem.
Thanks!