Building ML classifier with imbalanced data

I have a dataset with 1400 observations and 19 columns. The target variable has values 1 (the value I am most interested in) and 0, and the class distribution is imbalanced (70:30).

Using the code below I am getting weird results (all metrics are 1.0). I cannot figure out whether this is due to overfitting, to the class imbalance, or to the feature selection (I used Pearson correlation, since all values are numeric/boolean). I suspect the steps I followed are wrong.

import numpy as np
import math
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, f1_score

y = df['Label']
X = df.drop('Label',axis=1)

def create_cv(X,y):
    if type(X)!=np.ndarray:
        X=X.values
        y=y.values
 
    test_size=1/5
    proportion_of_true=y[y==1].shape[0]/y.shape[0]
    num_test_samples=math.ceil(y.shape[0]*test_size)
    num_test_true_labels=math.floor(num_test_samples*proportion_of_true)
    num_test_false_labels=math.floor(num_test_samples-num_test_true_labels)
    
    y_test=np.concatenate([y[y==0][:num_test_false_labels],y[y==1][:num_test_true_labels]])
    y_train=np.concatenate([y[y==0][num_test_false_labels:],y[y==1][num_test_true_labels:]])

    X_test=np.concatenate([X[y==0][:num_test_false_labels] ,X[y==1][:num_test_true_labels]],axis=0)
    X_train=np.concatenate([X[y==0][num_test_false_labels:],X[y==1][num_test_true_labels:]],axis=0)
    return X_train,X_test,y_train,y_test

X_train,X_test,y_train,y_test=create_cv(X,y)
X_train,X_crossv,y_train,y_crossv=create_cv(X_train,y_train)
    
tree = DecisionTreeClassifier(max_depth = 5)
tree.fit(X_train, y_train)       

y_predict_test = tree.predict(X_test)

print(classification_report(y_test, y_predict_test))
f1_score(y_test, y_predict_test)

Output:

     precision    recall  f1-score   support

           0       1.00      1.00      1.00        24
           1       1.00      1.00      1.00        70

    accuracy                           1.00        94
   macro avg       1.00      1.00      1.00        94
weighted avg       1.00      1.00      1.00        94

Has anyone experienced similar issues when building a classifier on imbalanced data, using CV and/or under-sampling? I am happy to share the whole dataset in case you want to replicate the output. What I would like is a clear set of steps to follow that shows me what I am doing wrong.

I know that, to reduce overfitting and work with balanced data, there are methods such as random over/under-sampling, SMOTE, and CV. My idea is to:

  • Split the data into train/test sets, taking the imbalance into account
  • Perform CV on the training set
  • Apply under-sampling only on a test fold
  • After the model has been chosen with the help of CV, under-sample the training set and train the classifier
  • Estimate the performance on the untouched test set (F1-score)

as also outlined in this question: CV and under sampling on a test fold.

I think the steps above should make sense, but happy to receive any feedback that you might have on this.

Exodus answered 7/7, 2021 at 19:28 Comment(5)
Just a pointer: I've used SMOTE+ENN as a combination of oversampling and undersampling. This has produced good results for my data. – Weed
Thank you so much, Kabilan Mohanraj. I will have a look at this approach as well. I think it would be nice to compare different approaches :) – Exodus
I wouldn't worry about imbalance for a decision tree at a 70:30 ratio; I'd take that out entirely. Just do proper cross-validation. Your report says the tree is classifying the test set perfectly, which is weird. I would check the shape of all of the X_/y_ variables you have there to make sure you're getting the splits you expect. If that all looks good, is it possible you have duplicate data in your observations? Or perhaps the label is indeed perfectly predictable from the observations. – Abbotson
Thank you, sturgemeister, for your suggestions. I am going to use several classifiers, including the decision tree as in the example above, for comparison. My concern is about cross-validation: I think something is going wrong with my prediction. I would exclude duplicate data (if the steps above, which include everything I have, do not create duplicates), but I would say the prediction is picking up the wrong field. – Exodus
Take a look at imbalanced-learn.org/stable – Lanoralanose

When you have imbalanced data you have to stratify your splits. The usual way to handle the imbalance itself is to oversample the class that has fewer samples.
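For the oversampling route, a minimal sketch with scikit-learn's resample utility could look like this (it assumes a DataFrame df with a binary 'Label' column as in the question; apply it only to the training portion so nothing leaks into the test set):

import pandas as pd
from sklearn.utils import resample

# Identify minority and majority classes from the label counts.
counts = df['Label'].value_counts()
minority = df[df['Label'] == counts.idxmin()]
majority = df[df['Label'] == counts.idxmax()]

# Sample the minority class with replacement until it matches the majority size,
# then shuffle the combined, balanced frame.
minority_up = resample(minority, replace=True, n_samples=len(majority), random_state=0)
df_balanced = pd.concat([majority, minority_up]).sample(frac=1, random_state=0).reset_index(drop=True)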

Another option is to train your algorithm with less data. If you have a good dataset, that should not be a problem. In this case you first grab the samples from the less-represented class, then use the size of that set to compute how many samples to take from the other class.

This code may help you split your dataset that way:

import random
import pandas as pd

def split_dataset(dataset: pd.DataFrame, train_share=0.8):
    """Splits the dataset into training and test sets"""
    all_idx = range(len(dataset))
    train_count = int(len(all_idx) * train_share)

    train_idx = random.sample(all_idx, train_count)
    test_idx = list(set(all_idx).difference(set(train_idx)))

    train = dataset.iloc[train_idx]
    test = dataset.iloc[test_idx]

    return train, test

def split_dataset_stratified(dataset, target_attr, positive_class, train_share=0.8):
    """Splits the dataset as in `split_dataset` but with stratification"""

    data_pos = dataset[dataset[target_attr] == positive_class]
    data_neg = dataset[dataset[target_attr] != positive_class]

    if len(data_pos) < len(data_neg):
        train_pos, test_pos = split_dataset(data_pos, train_share)
        train_neg, test_neg = split_dataset(data_neg, len(train_pos)/len(data_neg))
        # set.difference makes the test set larger
        test_neg = test_neg.iloc[0:len(test_pos)]
    else:
        train_neg, test_neg = split_dataset(data_neg, train_share)
        train_pos, test_pos = split_dataset(data_pos, len(train_neg)/len(data_pos))
        # set.difference makes the test set larger
        test_pos = test_pos.iloc[0:len(test_neg)]

    # pandas removed DataFrame.append in 2.0, so concatenate instead
    return pd.concat([train_pos, train_neg]).sample(frac = 1).reset_index(drop = True), \
           pd.concat([test_pos, test_neg]).sample(frac = 1).reset_index(drop = True)

Usage:

train_ds, test_ds = split_dataset_stratified(data, target_attr, positive_class)

You can now perform cross-validation on train_ds and evaluate your model on test_ds.
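For example, a minimal sketch (assuming the 'Label' column from the question and a plain decision tree; swap in whatever model you are comparing):

from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X_tr, y_tr = train_ds.drop('Label', axis=1), train_ds['Label']
X_te, y_te = test_ds.drop('Label', axis=1), test_ds['Label']

model = DecisionTreeClassifier(max_depth=5)
print(cross_val_score(model, X_tr, y_tr, cv=5, scoring='f1'))  # CV scores on the training set
model.fit(X_tr, y_tr)
print(model.score(X_te, y_te))  # final accuracy on the held-out test set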

Strawberry answered 4/10, 2021 at 12:36 Comment(0)

  • your implementation of the stratified train/test split is not optimal, as it lacks randomness. Data very often comes in batches, so it is not good practice to take sequences of data as-is, without shuffling.

  • as @sturgemeister mentioned, a class ratio of 3:7 is not critical, so you should not worry too much about class imbalance. When you artificially change the data balance during training, for some algorithms you will need to compensate for it by multiplying by the class prior.

  • as for your "perfect" results: either your model is overtrained or it does indeed classify the data perfectly. Use a different train/test split to check this.

  • another point: your test set is only 94 data points. It is definitely not 1/5 of 1400. Check your numbers.

  • to get realistic estimates, you need lots of test data. This is the reason why you need to apply a cross-validation strategy.

  • as for a general strategy for 5-fold CV, I suggest the following (a code sketch follows the list):

    1. split your data into 5 folds with respect to the labels (this is called a stratified split; scikit-learn's StratifiedKFold does it and guarantees each sample lands in exactly one test fold)
    2. take 4 splits and train your model. If you want to use under/over-sampling, modify the data in those 4 training splits only.
    3. apply the model to the remaining part. Do not under/over-sample the data in the test part; this way you get a realistic performance estimate. Save the results.
    4. repeat 2. and 3. for all test splits (5 times in total, obviously). Important: do not change the parameters (e.g. tree depth) of the model between trainings - they should be the same for all splits.
    5. now every data point has been tested without the model having been trained on it. This is the core idea of cross-validation. Concatenate all the saved results and estimate the performance.
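A minimal sketch of this loop, assuming X and y are NumPy arrays and using imbalanced-learn's RandomUnderSampler for the optional resampling step (any resampler, or none at all, could go there):

import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
from imblearn.under_sampling import RandomUnderSampler

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
all_true, all_pred = [], []

for train_idx, test_idx in skf.split(X, y):
    X_tr, y_tr = X[train_idx], y[train_idx]
    X_te, y_te = X[test_idx], y[test_idx]

    # Resample only the training folds; the test fold stays untouched.
    X_tr, y_tr = RandomUnderSampler(random_state=0).fit_resample(X_tr, y_tr)

    model = DecisionTreeClassifier(max_depth=5)  # same hyper-parameters on every fold
    model.fit(X_tr, y_tr)
    all_true.append(y_te)
    all_pred.append(model.predict(X_te))

# Every point has now been predicted exactly once while held out; pool and score.
print(classification_report(np.concatenate(all_true), np.concatenate(all_pred)))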
Cycloplegia answered 5/10, 2021 at 18:8 Comment(0)

There is another solution at the model level: using models that support sample weights, such as gradient-boosted trees. Of those, CatBoost is usually the best, as its training method leads to less leakage (as described in their article).

Example code:

from catboost import CatBoostClassifier

y = df['Label']
X = df.drop('Label',axis=1)
# scale_pos_weight multiplies the weight of class 1, so use the negative/positive
# count ratio to compensate for the imbalance
label_ratio = (y==0).sum() / (y==1).sum()
model = CatBoostClassifier(scale_pos_weight = label_ratio)
model.fit(X, y)

And so forth. This works because CatBoost gives each sample a weight, so you can set the class weights in advance (scale_pos_weight). This is better than down-sampling and is effectively equivalent to oversampling (but requires less memory).

Also, a major part of treating imbalanced data is making sure your metrics are weighted as well, or at least well-defined, since you might want equal (or deliberately skewed) performance across classes on these metrics.
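For example, the averaging mode of scikit-learn's scorers makes this choice explicit (y_test and y_predict_test as in the question's code):

from sklearn.metrics import f1_score

print(f1_score(y_test, y_predict_test))                      # F1 of the positive class only
print(f1_score(y_test, y_predict_test, average='macro'))     # unweighted mean over both classes
print(f1_score(y_test, y_predict_test, average='weighted'))  # mean weighted by class support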

And if you want a more visual output than sklearn's classification_report, you can use one of the Deepchecks built-in checks (disclosure - I'm one of the maintainers):

from deepchecks.checks import PerformanceReport
from deepchecks import Dataset
PerformanceReport().run(Dataset(train_df, label='Label'), Dataset(test_df, label='Label'), model)
Cartogram answered 6/1, 2022 at 13:11 Comment(0)

Cross-validation or held-out set

First of all, you are not doing cross-validation. You are splitting your data into train/validation/test sets, which is good and often sufficient when the number of training samples is large (say, >2e4). However, when the number of samples is small, which is your case, cross-validation becomes useful.

It is explained in depth in scikit-learn's documentation. You start by taking a test set out of your data, as your create_cv function does. Then you split the remaining training data into e.g. 3 folds, and for i in {1, 2, 3} you train on the folds j != i and evaluate on fold i. The documentation explains it with prettier and more colorful figures; you should have a look! It can be quite cumbersome to implement yourself, but fortunately scikit-learn does it out of the box.

As for the dataset being imbalanced, it is a very good idea to keep the same ratio of labels in each set. But again, you can let scikit-learn handle it for you!
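For instance, a tiny sketch (assuming X_train/y_train from a stratified hold-out split, and a decision tree as in the question):

from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)  # keeps the label ratio in every fold
scores = cross_val_score(DecisionTreeClassifier(max_depth=5), X_train, y_train, cv=cv, scoring='f1')
print(scores.mean(), scores.std())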

Purpose

Also, the purpose of cross-validation is to choose the right values for the hyper-parameters. You want the right amount of regularization, not too much (under-fitting) nor too little (over-fitting). If you are using a decision tree, the maximum depth (or the minimum number of samples per leaf) is the right hyper-parameter to consider for controlling the regularization of your method.

Conclusion

Simply use GridSearchCV. You will get cross-validation and label balance handled for you.

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Stratified hold-out split: the test set keeps the original class ratio.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/5, stratify=y)

tree = DecisionTreeClassifier()
parameters = {'min_samples_leaf': [1, 5, 10]}
clf = GridSearchCV(tree, parameters, cv=5)  # an integer cv uses StratifiedKFold for classifiers, see documentation
clf.fit(X_train, y_train)
sorted(clf.cv_results_.keys())

You can also replace the cv argument with a fancier splitter, such as StratifiedGroupKFold (no overlap between groups).

I would also advise looking at random forests, which are less interpretable but are said to have better performance in practice.
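A possible drop-in replacement for the single tree (hypothetical settings; tune them with GridSearchCV as above):

from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(n_estimators=200, class_weight='balanced', random_state=0)
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))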

Musquash answered 4/10, 2021 at 12:16 Comment(0)

Just wanted to add thresholding and cost-sensitive learning to the list of possible approaches mentioned by the others. The former is well described here and consists of finding a new threshold for classifying positive vs. negative classes (the default is 0.5, but it can be treated as a hyper-parameter). The latter consists of weighting the classes to cope with their imbalance. This article was really useful to me for understanding how to deal with imbalanced data sets; in it you can also find cost-sensitive learning with a specific explanation using a decision tree as the model. All the other approaches are nicely reviewed there as well, including Adaptive Synthetic Sampling, informed undersampling, etc.
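A minimal sketch of both ideas with the question's decision tree (X_train/X_test/y_train/y_test are assumed to come from a stratified split as in the other answers; class_weight='balanced' is the cost-sensitive part, and the loop tunes the decision threshold, which for a fair estimate should really be done on a validation fold rather than on the final test set):

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score

# Cost-sensitive learning: weight classes inversely to their frequencies.
model = DecisionTreeClassifier(max_depth=5, class_weight='balanced')
model.fit(X_train, y_train)

# Thresholding: pick the probability cut-off that maximises F1 instead of the default 0.5.
probs = model.predict_proba(X_test)[:, 1]
thresholds = np.linspace(0.1, 0.9, 81)
scores = [f1_score(y_test, probs >= t) for t in thresholds]
best_t = thresholds[int(np.argmax(scores))]
print(f"best threshold: {best_t:.2f}, F1: {max(scores):.3f}")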

Couturier answered 6/10, 2021 at 7:56 Comment(0)