How to split data based on a column value in sklearn

sklearn.model_selection has several other options other than train_test_split. One of them, aims at solving what you're asking for. In this case you could use GroupShuffleSplit, which as mentioned inthe docs it provides randomized train/test indices to split data according to a third-party provided group. This is useful when you're doing cross-validation, and you want to split in validation-train multiple times, ensuring that the sets are split by the group field. You also have GroupKFold for these cases which is very useful.

So, adapting your example, here's what you could do.

Say you have for instance:

from sklearn.model_selection import GroupShuffleSplit

cols = ['customer', 'calibrat', 'churn', 'churndep', 'revenue', 'mou',]
X = pd.DataFrame(np.random.rand(10, 6), columns=cols)
X['calibrat'] = np.random.choice([0,1], size=10)

print(X)

   customer  calibrat     churn  churndep   revenue       mou
0  0.523571         1  0.394896  0.933637  0.232630  0.103486
1  0.456720         1  0.850961  0.183556  0.885724  0.993898
2  0.411568         1  0.003360  0.774391  0.822560  0.840763
3  0.148390         0  0.115748  0.089891  0.842580  0.565432
4  0.505548         0  0.370198  0.566005  0.498009  0.601986
5  0.527433         0  0.550194  0.991227  0.516154  0.283175
6  0.983699         0  0.514049  0.958328  0.005034  0.050860
7  0.923172         0  0.531747  0.026763  0.450077  0.961465
8  0.344771         1  0.332537  0.046829  0.047598  0.324098
9  0.195655         0  0.903370  0.399686  0.170009  0.578925

y = X.pop('churn')

You can now instanciate GroupShuffleSplit, and do as you would with train_test_split, with the only difference of specifying a group column, which will be used to split X and y so the groups are split according the the groups values:

gs = GroupShuffleSplit(n_splits=2, train_size=.7, random_state=42)

As mentioned, this is more handy when you want to split into multiple groups, generally for cross validation purposes. Here's just an example of how you'd get two splits, as mentioned in the question:

train_ix, test_ix = next(gs.split(X, y, groups=X.calibrat))

X_train = X.loc[train_ix]
y_train = y.loc[train_ix]

X_test = X.loc[test_ix]
y_test = y.loc[test_ix]

Giving:

print(X_train)

   customer  calibrat  churndep   revenue       mou
3  0.148390         0  0.089891  0.842580  0.565432
4  0.505548         0  0.566005  0.498009  0.601986
5  0.527433         0  0.991227  0.516154  0.283175
6  0.983699         0  0.958328  0.005034  0.050860
7  0.923172         0  0.026763  0.450077  0.961465
9  0.195655         0  0.399686  0.170009  0.578925

print(X_test)

   customer  calibrat  churndep   revenue       mou
0  0.523571         1  0.933637  0.232630  0.103486
1  0.456720         1  0.183556  0.885724  0.993898
2  0.411568         1  0.774391  0.822560  0.840763
8  0.344771         1  0.046829  0.047598  0.324098

Recommended topics

Hot tags