So I don't think there is a default scikit-learn cross-validator that will achieve what you want, but it should be possible to create one.
My approach would be to loop over all the subjects and greedily assign each one to the test set of a fold, depending on how much that assignment improves both the size of the fold and the target class rate in the fold.
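To make the scoring concrete, here is a toy illustration (with made-up numbers) of the "target rate" part of the score: a subject is rewarded for moving a fold's positive rate closer to the overall positive rate.

overall_rate = 0.3           # positive rate in the whole dataset
rate_without_subject = 0.1   # fold's positive rate before adding the candidate subject
rate_with_subject = 0.25     # fold's positive rate if the subject were added

# improvement in squared deviation from the overall rate
improvement = (rate_without_subject - overall_rate) ** 2 - (rate_with_subject - overall_rate) ** 2
print(improvement)  # 0.0375 -> adding this subject moves the fold closer to the overall rate

The size term used below is simply the fold's squared deviation from the expected fold size, scaled down by a factor of 0.001, so it mainly acts as a tie-breaker that keeps the folds roughly equally sized.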
I've generated some sample data that resembles your problem:
import pandas as pd
import numpy as np

n_subjects = 50
n_observations = 100
n_positives = 15

# the subjects drawn here are the "positive" subjects: all of their observations get target=True
positive_subjects = np.random.randint(0, n_subjects, n_positives)
data = pd.DataFrame({
    'subject': np.random.randint(0, n_subjects, n_observations)
}).assign(
    target=lambda d: d['subject'].isin(positive_subjects)
)
data.head()
   subject  target
0       14   False
1       12    True
2       10   False
3       36   False
4       21   False
We can then do the assignment using the following snippet
def target_rate_improvements(data, subjects, extra):
    """Compute how much adding `extra` to a fold's subjects reduces the squared
    difference between the fold's positive rate and the overall positive rate."""
    target_rate = data['target'].mean()
    rate_without_extra = data.loc[lambda d: d['subject'].isin(subjects), 'target'].mean()
    rate_with_extra = data.loc[lambda d: d['subject'].isin(subjects + [extra]), 'target'].mean()
    rate_without_extra = 0 if np.isnan(rate_without_extra) else rate_without_extra
    return (rate_without_extra - target_rate) ** 2 - (rate_with_extra - target_rate) ** 2


def size_improvement(data, folds, n_folds):
    """For every fold, compute the squared difference between the number of
    observations currently in the fold and the expected number per fold,
    so that emptier folds score higher and are preferred."""
    target_obs_per_fold = len(data) / n_folds
    return [
        (target_obs_per_fold - len(data.loc[lambda d: d['subject'].isin(fold_subjects)])) ** 2
        for fold_subjects in folds.values()
    ]


n_folds = 5
test_subjects_per_fold = {fold: [] for fold in range(n_folds)}

for subject in data['subject'].unique():
    # score every fold for this subject: how much closer it gets to the overall
    # positive rate, plus a small bonus for folds that are still under-filled
    target_rate_improvement = np.array([
        target_rate_improvements(data, test_subjects_per_fold[fold], subject)
        for fold in range(n_folds)
    ])
    size_improvements = np.array(size_improvement(data, test_subjects_per_fold, n_folds)) * 0.001
    best_fold = np.argmax(target_rate_improvement + size_improvements)
    test_subjects_per_fold[best_fold] += [subject]
and verify that it works as we expect:
for fold, subjects in test_subjects_per_fold.items():
    print('-' * 80)
    print(f'for fold {fold}')

    test_data = data.loc[lambda d: d['subject'].isin(subjects)]
    train_data = data.loc[lambda d: ~d['subject'].isin(subjects)]

    print('train - pos rate:', train_data['target'].mean(), 'size:', len(train_data))
    print('test - pos rate:', test_data['target'].mean(), 'size:', len(test_data))
--------------------------------------------------------------------------------
for fold 0
train - pos rate: 0.3 size: 80
test - pos rate: 0.3 size: 20
--------------------------------------------------------------------------------
for fold 1
train - pos rate: 0.3037974683544304 size: 79
test - pos rate: 0.2857142857142857 size: 21
--------------------------------------------------------------------------------
for fold 2
train - pos rate: 0.2962962962962963 size: 81
test - pos rate: 0.3157894736842105 size: 19
--------------------------------------------------------------------------------
for fold 3
train - pos rate: 0.3 size: 80
test - pos rate: 0.3 size: 20
--------------------------------------------------------------------------------
for fold 4
train - pos rate: 0.3 size: 80
test - pos rate: 0.3 size: 20
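We can also quickly confirm that every subject ends up in exactly one test fold, so no subject is shared between a fold's train and test set:

assigned_subjects = [s for subjects in test_subjects_per_fold.values() for s in subjects]
assert len(assigned_subjects) == len(set(assigned_subjects))      # no subject appears in more than one fold
assert set(assigned_subjects) == set(data['subject'].unique())    # every subject got assigned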
Variable naming can be improved here and there, but overall I would say this approach could work for your problem.
Implementing this in a scikit-learn-compatible cross-validator would look something like the sketch below, although it requires a bit more re-engineering.
class StratifiedGroupKFold(_BaseKFold):
    ...

    def _iter_test_indices(self, X, y, groups):
        test_subjects_per_fold = {fold: [] for fold in range(self.n_splits)}
        for subject in np.unique(groups):
            target_rate_improvement = np.array([
                self.target_rate_improvements(X, y, test_subjects_per_fold[fold], subject)
                for fold in range(self.n_splits)
            ])
            size_improvements = np.array(self.size_improvement(X, y, test_subjects_per_fold, self.n_splits)) * 0.001
            best_fold = np.argmax(target_rate_improvement + size_improvements)
            test_subjects_per_fold[best_fold] += [subject]

        for subjects in test_subjects_per_fold.values():
            yield np.flatnonzero(np.isin(groups, subjects))
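Once that refactoring is done, the class could be used like any other scikit-learn splitter. A rough usage sketch, assuming the class above has been completed (including its __init__) and that the subject column is passed as groups; the feature matrix here is just a placeholder for your real features:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X = np.random.rand(len(data), 3)   # placeholder features, swap in your real ones
y = data['target']

cv = StratifiedGroupKFold(n_splits=5)
scores = cross_val_score(LogisticRegression(), X, y, groups=data['subject'], cv=cv)
print(scores)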