Separate pandas dataframe using sklearn's KFold

I obtained the indices of the training set and the test set with the code below.

import pandas
from sklearn.model_selection import KFold

df = pandas.read_pickle(filepath + filename)
kf = KFold(n_splits=n_splits, shuffle=shuffle, random_state=randomState)

result = next(kf.split(df), None)

# train indices can be accessed with result[0]
# test indices can be accessed with result[1]

I wonder whether there is a faster way to split the DataFrame into two separate DataFrames (train and test) using the row indices I retrieved.

Tatting answered 15/7, 2017 at 8:1 Comment(0)

You need DataFrame.iloc to select rows by position:

Sample:

import numpy as np
import pandas as pd

np.random.seed(100)
df = pd.DataFrame(np.random.random((10,5)), columns=list('ABCDE'))
# make the index labels differ from the row positions
df.index = df.index * 10
print(df)
           A         B         C         D         E
0   0.543405  0.278369  0.424518  0.844776  0.004719
10  0.121569  0.670749  0.825853  0.136707  0.575093
20  0.891322  0.209202  0.185328  0.108377  0.219697
30  0.978624  0.811683  0.171941  0.816225  0.274074
40  0.431704  0.940030  0.817649  0.336112  0.175410
50  0.372832  0.005689  0.252426  0.795663  0.015255
60  0.598843  0.603805  0.105148  0.381943  0.036476
70  0.890412  0.980921  0.059942  0.890546  0.576901
80  0.742480  0.630184  0.581842  0.020439  0.210027
90  0.544685  0.769115  0.250695  0.285896  0.852395

from sklearn.model_selection import KFold

#added some parameters
kf = KFold(n_splits = 5, shuffle = True, random_state = 2)
result = next(kf.split(df), None)
print (result)
(array([0, 2, 3, 5, 6, 7, 8, 9]), array([1, 4]))

# KFold returns row positions, so select with iloc
train = df.iloc[result[0]]
test = df.iloc[result[1]]

print (train)
           A         B         C         D         E
0   0.543405  0.278369  0.424518  0.844776  0.004719
20  0.891322  0.209202  0.185328  0.108377  0.219697
30  0.978624  0.811683  0.171941  0.816225  0.274074
50  0.372832  0.005689  0.252426  0.795663  0.015255
60  0.598843  0.603805  0.105148  0.381943  0.036476
70  0.890412  0.980921  0.059942  0.890546  0.576901
80  0.742480  0.630184  0.581842  0.020439  0.210027
90  0.544685  0.769115  0.250695  0.285896  0.852395

print (test)
           A         B         C         D         E
10  0.121569  0.670749  0.825853  0.136707  0.575093
40  0.431704  0.940030  0.817649  0.336112  0.175410
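
The reason iloc is used rather than loc can be illustrated with a small sketch. It assumes a toy frame (the values and column names here are made up) whose index labels differ from the row positions, as in the sample above. KFold.split yields integer positions, so they have to be either used with iloc or translated to labels before using loc.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

# Toy frame whose index labels (0, 10, 20, ...) differ from row positions.
df = pd.DataFrame(np.arange(20).reshape(10, 2), columns=["A", "B"])
df.index = df.index * 10

kf = KFold(n_splits=5, shuffle=True, random_state=2)
train_pos, test_pos = next(kf.split(df))

# KFold yields integer positions, so positional indexing is the right tool.
test = df.iloc[test_pos]

# Translating positions to labels first selects the same rows via .loc.
test_by_label = df.loc[df.index[test_pos]]

# Using the positions directly as labels fails on this frame, because
# at least one of the two test positions is not an index label.
try:
    df.loc[test_pos]
    loc_failed = False
except KeyError:
    loc_failed = True

print(test.equals(test_by_label), loc_failed)
```

With a default RangeIndex the two approaches happen to coincide, which is why the difference is easy to miss.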
Decanter answered 15/7, 2017 at 8:11 Comment(2)
Isn't this just one split? How do we get multiple splits? - Agar
@hannahmontanna No. This method produces the requested number of splits; however, kf.split(df) is a generator. If you want all the splits in a list, you can simply convert it with list(kf.split(df)), or you can iterate through the generator. - Crescen
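
As a rough illustration of the comment above, the generator can be materialized with list() and each pair of position arrays turned into a pair of DataFrames. The frame below is a made-up example, not the asker's data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

df = pd.DataFrame(np.arange(30).reshape(10, 3), columns=list("ABC"))
kf = KFold(n_splits=5, shuffle=True, random_state=2)

# Materialize the generator: one (train_positions, test_positions) pair per fold.
splits = list(kf.split(df))

# Build a (train, test) pair of DataFrames for every fold.
frames = [(df.iloc[tr], df.iloc[te]) for tr, te in splits]

for i, (train, test) in enumerate(frames):
    print(f"fold {i}: train={len(train)} rows, test={len(test)} rows")
```

Every row lands in exactly one test fold, so concatenating the test frames gives back all rows of df (in shuffled order).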

My answer does not directly address the question title, but if you just want to obtain training and test sets, train_test_split from sklearn.model_selection can be used.

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

np.random.seed(77)
df = pd.DataFrame(np.random.random((10,3)), columns=('one', 'two', 'three'))
print(df)
         one         two       three
0   0.919109    0.642196    0.753712
1   0.139315    0.087320    0.788002
2   0.326151    0.541068    0.240235
3   0.545423    0.400555    0.715192
4   0.836680    0.588481    0.296155
5   0.281018    0.705597    0.422596
6   0.057316    0.747027    0.452313
7   0.175775    0.049377    0.292475
8   0.066799    0.751156    0.063772
9   0.431908    0.364172    0.151972

df_train, df_test = train_test_split(df, test_size=0.3, random_state=77)
print(df_train)
print(df_test)
        one       two     three
6  0.057316  0.747027  0.452313
0  0.919109  0.642196  0.753712
5  0.281018  0.705597  0.422596
3  0.545423  0.400555  0.715192
8  0.066799  0.751156  0.063772
4  0.836680  0.588481  0.296155
7  0.175775  0.049377  0.292475
        one       two     three
2  0.326151  0.541068  0.240235
1  0.139315  0.087320  0.788002
9  0.431908  0.364172  0.151972
Bukhara answered 25/4, 2023 at 7:8 Comment(0)

If you want a simple one-liner, a list comprehension could be used.

train, test = [df.iloc[ind] for ind in next(kf.split(df))]

However, if you only want to split a dataframe into two, train_test_split is probably the simpler option, because it is essentially a wrapper around next(ShuffleSplit().split(df)).
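
A minimal sketch of that relationship, on a made-up frame: both routes below produce a 70/30 random split. Whether they select exactly the same rows for a given seed depends on scikit-learn internals, so I would not rely on that.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import ShuffleSplit, train_test_split

df = pd.DataFrame(np.arange(30).reshape(10, 3), columns=list("ABC"))

# One-call split.
df_train, df_test = train_test_split(df, test_size=0.3, random_state=0)

# Roughly the same thing spelled out with ShuffleSplit and iloc.
ss = ShuffleSplit(n_splits=1, test_size=0.3, random_state=0)
train_pos, test_pos = next(ss.split(df))
ss_train, ss_test = df.iloc[train_pos], df.iloc[test_pos]

print(df_train.shape, df_test.shape)
print(ss_train.shape, ss_test.shape)
```

Note that this is a single random split; unlike KFold, it does not partition the data into folds.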

If you want to recover all splits of a KFold (perhaps to pass them on to another model), then a loop could be useful. Here, in each iteration, one fold is the validation set.

kf = KFold(n_splits=5, shuffle=True)

for i, (t_ind, v_ind) in enumerate(kf.split(df)):
    train = df.iloc[t_ind]     # train set
    valid = df.iloc[v_ind]     # validation set

    # my_model is a placeholder for your own fit/evaluate function
    result = my_model(train, valid)
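
For a version of this loop that runs end to end, here is a sketch under some loud assumptions: the data is synthetic, the column names x1, x2 and y are invented, and fit_and_score is a hypothetical helper that fits a LinearRegression on the training fold and returns R^2 on the validation fold.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

# Synthetic data: the column names and the linear relationship are made up.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((50, 2)), columns=["x1", "x2"])
df["y"] = 3 * df["x1"] - 2 * df["x2"] + rng.normal(scale=0.1, size=len(df))


def fit_and_score(train, valid):
    """Hypothetical helper: fit on train, return R^2 on valid."""
    model = LinearRegression().fit(train[["x1", "x2"]], train["y"])
    return model.score(valid[["x1", "x2"]], valid["y"])


kf = KFold(n_splits=5, shuffle=True, random_state=0)

# One score per fold; each fold serves as the validation set exactly once.
scores = []
for t_ind, v_ind in kf.split(df):
    scores.append(fit_and_score(df.iloc[t_ind], df.iloc[v_ind]))

print(np.mean(scores))
```

If all you need is the per-fold scores of a scikit-learn estimator, cross_val_score does the same bookkeeping; the explicit loop is useful when you need the DataFrames themselves.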

Another use case for a loop over the splits generator is to create a new column for folds.

import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

np.random.seed(100)
df = pd.DataFrame(np.random.randint(4,10, size=(7,3)), columns=list('ABC'))
kf = KFold(n_splits=4, shuffle=True, random_state=0)

for i, (_, v_ind) in enumerate(kf.split(df)):
    df.loc[df.index[v_ind], 'kfold'] = f"fold{i+1}"

print(df)
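
Once a fold column exists, the train and validation frames for any fold can be recovered with boolean masks. The sketch below is one possible variant: it uses an integer fold column and positional assignment via iloc, on a made-up frame, rather than the string labels used above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

df = pd.DataFrame(np.arange(21).reshape(7, 3), columns=list("ABC"))
kf = KFold(n_splits=4, shuffle=True, random_state=0)

# Record, for each row, the fold in which it serves as validation data.
df["kfold"] = -1
fold_col = df.columns.get_loc("kfold")
for i, (_, v_ind) in enumerate(kf.split(df)):
    df.iloc[v_ind, fold_col] = i

# Recover the train/validation frames for one fold with boolean masks.
fold = 0
valid = df[df["kfold"] == fold]
train = df[df["kfold"] != fold]

print(df["kfold"].value_counts().sort_index())
```

Storing the fold assignment in the frame keeps the split reproducible without having to re-run KFold with the same seed.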

Schadenfreude answered 15/5, 2023 at 15:41 Comment(0)
