How to generate a train-test split based on a group ID?

I have the following data:

   Group_ID Item_id Target
0         1       1      0
1         1       2      0
2         1       3      1
3         2       4      0
4         2       5      1
5         2       6      1
6         3       7      0
7         4       8      0
8         5       9      0
9         5      10      1

I need to split the dataset into a training and testing set based on the "Group_ID" so that 80% of the data goes into a training set and 20% into a test set.

That is, I need my training set to look something like:

    Group_ID Item_id Target
0          1       1      0
1          1       2      0
2          1       3      1
3          2       4      0
4          2       5      1
5          2       6      1
6          3       7      0
7          4       8      0

And test set:

   Group_ID Item_id Target
8         5       9      0
9         5      10      1

What would be the simplest way to do this? As far as I know, the standard train_test_split function in sklearn does not support splitting by groups in a way where I can also indicate the size of the split (e.g. 80/20).

Winna asked 21/2, 2019 at 0:45 Comment(5)
What have you tried? Using random selection can work.Jeffiejeffrey
@Jeffiejeffrey Could you provide an example? I've relied so much on sklearn in the past that I'm completely lost with how to split any other way.Winna
I can think of two ways, but it depends on your complete dataset. 1) Let's say you have 10 records: sort the dataset by Group_ID and then just use train = df.iloc[:8,:], test = df.iloc[8:,:]. 2) Use a conditional subset: make a list of groups, e.g. a = [5, 6], and use df['Group_ID'].isin(a) (see the sketch after these comments).Morten
@AdityaKansal The data is about 4 GB in size. Could I use something like sklearn's GroupShuffleSplit?Winna
Also, you should use k-fold cross-validation for training and testing. You split your data into k (usually k=10) random sets, then loop k times, each time using k-1 sets to train and one to test (a different set each loop). This makes sure that all the data is used for both training and testing.Jeffiejeffrey
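
A minimal sketch of both suggestions from the comments, assuming the question's frame is loaded as df; the test-group list a = [5] is an arbitrary illustration, not a recommendation:

import pandas as pd

df = pd.DataFrame({'Group_ID': [1, 1, 1, 2, 2, 2, 3, 4, 5, 5],
                   'Item_id': range(1, 11),
                   'Target': [0, 0, 1, 0, 1, 1, 0, 0, 0, 1]})

# 1) Positional split after sorting by group: first 8 rows train, last 2 test.
df_sorted = df.sort_values('Group_ID')
train, test = df_sorted.iloc[:8, :], df_sorted.iloc[8:, :]

# 2) Conditional subset: pick the groups that should form the test set.
a = [5]                                  # arbitrary choice of test groups
test = df[df['Group_ID'].isin(a)]
train = df[~df['Group_ID'].isin(a)]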

I figured out the answer. This seems to work:

from sklearn.model_selection import GroupShuffleSplit

# Hold out ~20% of the groups; only the first of the two splits is used.
splitter = GroupShuffleSplit(test_size=0.20, n_splits=2, random_state=7)
split = splitter.split(df, groups=df['Group_ID'])
train_inds, test_inds = next(split)

train = df.iloc[train_inds]
test = df.iloc[test_inds]
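
As a quick sanity check (a hedged addition, not part of the original answer), you can verify that no group leaks across the split:

# Group IDs in train and test should be disjoint.
assert set(train['Group_ID']).isdisjoint(set(test['Group_ID']))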
Winna answered 21/2, 2019 at 19:8 Comment(3)
Shouldn't it be n_splits=1? It will still work with n_splits=2, but generate an extra split which is never used.Fidelis
Careful here: with GroupShuffleSplit the train/test proportions are controlled by test_size, not n_splits; n_splits only sets how many reshuffled splits the generator yields. It is GroupKFold where the number of splits determines the relative sizes (50:50 with n_splits=2, 80:20 with n_splits=5, etc.).Kerrin
What if we want to split while keeping whole groups intact but at the same time stratify (keep the same proportion of classes)? How do you mix group-wise and stratified splitting?Deflate
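
One way to address the last comment, as a sketch: scikit-learn 1.0+ ships StratifiedGroupKFold, which keeps groups intact while approximately preserving class proportions across folds; taking one fold of n_splits=5 approximates an 80/20 split (assuming the question's df):

from sklearn.model_selection import StratifiedGroupKFold

# One fold of a 5-fold group-aware, stratified split (~80/20).
sgkf = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=7)
train_inds, test_inds = next(sgkf.split(df, df['Target'], groups=df['Group_ID']))
train, test = df.iloc[train_inds], df.iloc[test_inds]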

You can also use pandas here if you want to split only once.

test_size = 0.2                                      # desired share of rows in the test set

train_ids = (
    df['Group_ID'].value_counts(normalize=True)      # get Group_ID distribution
    .sample(frac=1, replace=False, random_state=0)   # randomize the group order
    .cumsum()                                        # turn the shares into cumulative sums
    .pipe(lambda x: x.index[x > test_size])          # keep the train Group_IDs
)
train_mask = df['Group_ID'].isin(train_ids)
train = df[train_mask]
test = df[~train_mask]

That said, the main advantage of GroupShuffleSplit() (as used in the top answer) is its n_splits= argument, which yields multiple splits of the data simply by iterating over the generator, so you can cross-validate on each split. In the following example, split is a generator that yields 5 different splits for evaluating the model.

from sklearn.model_selection import GroupShuffleSplit 

splitter = GroupShuffleSplit(test_size=0.2, random_state=0, n_splits=5)
split = splitter.split(df, groups=df['Group_ID'])

for train_inds, test_inds in split:
    train = df.iloc[train_inds]
    test = df.iloc[test_inds]
    # fit a model and get its score
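
As a hedged aside, the same splitter can also be passed straight to scikit-learn's cross-validation helpers via cv=; the LogisticRegression model and the Item_id feature below are stand-ins, not part of the original answer:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = df[['Item_id']], df['Target']     # stand-in features and labels
scores = cross_val_score(LogisticRegression(), X, y,
                         groups=df['Group_ID'], cv=splitter)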

It should be noted that this works best if the groups are roughly evenly sized. As the following example shows, with an unbalanced dataset the train set can end up much smaller than expected (even though test_size=0.2), because whole groups are assigned to one side or the other:

import pandas as pd

df = pd.DataFrame({'Group_ID': [2, 2, 2, 2, 2, 2, 2, 0, 0, 1], 'value': range(10)})
splitter = GroupShuffleSplit(test_size=0.2, random_state=0)
train_inds, test_inds = next(splitter.split(df, groups=df['Group_ID']))
train = df.iloc[train_inds]
train.shape        # (3, 2) -- the seven-row group 2 was drawn into the test set

In that case, pandas can be very useful. However, unlike the approaches above, the following picks the Group_IDs deterministically (the smallest groups, making up to test_size of the rows, go to the test set), so it only works in special cases:

pct_cumsum = df['Group_ID'].value_counts(sort=True, ascending=True, normalize=True).cumsum()
train_group = pct_cumsum.index[pct_cumsum > test_size]   # smallest groups fall to the test set
train_mask = df['Group_ID'].isin(train_group)
train = df[train_mask]
test = df[~train_mask]
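
A worked check on the unbalanced frame above (with test_size = 0.2): the ascending cumulative shares are 1: 0.1, 0: 0.3, 2: 1.0, so only Group_ID 1 falls at or below test_size and lands in the test set:

train.shape        # (9, 2) -- groups 0 and 2
test.shape         # (1, 2) -- group 1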
Peewee answered 6/3 at 23:50 Comment(0)
