How to split a dataset into train, test and validation sets in Python? [duplicate]

Asked 22/9, 2020 at 6:27 Answered 22/9, 2020 at 6:33

Solved python scikit-learn train-test-split

I have a dataset like this:

my_data= [['Manchester', '23', '80', 'CM',
  'Manchester', '22', '79', 'RM',
  'Manchester', '19', '76', 'LB'],
 ['Benfica', '26', '77', 'CF',
  'Benfica', '22', '74', 'CDM',
  'Benfica', '17', '70', 'RB'],
 ['Dortmund', '24', '75', 'CM',
  'Dortmund', '18', '74', 'AM',
  'Dortmund', '16', '69', 'LM']
]

I know about train_test_split from sklearn.model_selection, and I've tried this:

from sklearn.model_selection import train_test_split
train, test = train_test_split(my_data, test_size = 0.2)

The result is only split into train and test. I want to divide the data randomly into 3 separate sets.

Expected: Test, Train, Valid

Astatine answered 22/9, 2020 at 6:27 Comment(1)

train_test_split divides your data into train and validation set. Don't get confused by the names. Test data should be where you don't know your output variable. – Ramadan 22/9, 2020 at 6:32

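You can simply use train_test_split twice:

from sklearn.model_selection import train_test_split

# Hold out 20% of the data as the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)

# Split the remaining 80% again: 0.25 * 0.8 = 0.2 of the original goes to validation.
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=1)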

also, the answer can be found here
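For a plain list of rows like the one in the question (no separate X and y), the same two-step idea might look like the sketch below. The `rows` list is a hypothetical ten-row stand-in for `my_data`, chosen only so the resulting sizes are easy to check; `random_state=1` is there just to make the shuffle repeatable.

```python
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for my_data: ten rows of made-up values.
rows = [[f"club_{i}", str(20 + i)] for i in range(10)]

# First split: hold out 20% of the rows as the test set.
train_val, test = train_test_split(rows, test_size=0.2, random_state=1)

# Second split: 25% of the remaining 80% equals 20% of the original rows.
train, valid = train_test_split(train_val, test_size=0.25, random_state=1)

print(len(train), len(valid), len(test))
```

With ten rows this gives a 6 / 2 / 2 split. With only three rows, as in the question's sample, each set ends up with a single row, so the proportions are only meaningful on a larger dataset.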

Sweatband answered 22/9, 2020 at 6:33 Comment(0)

This can be done with numpy and pandas, provided my_data is a pandas DataFrame. The script below shuffles the rows and splits them 0.6 / 0.2 / 0.2:

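import numpy as np  # my_data must be a pandas DataFrame, not a list

train_size = 0.6
validate_size = 0.2
# Shuffle the rows, then cut at 60% and 80% to get train / validate / test.
train, validate, test = np.split(my_data.sample(frac=1), [int(train_size * len(my_data)), int((validate_size + train_size) * len(my_data))])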

Mckoy answered 22/9, 2020 at 6:31 Comment(2)

I got the error 'list' object has no attribute 'sample' – Astatine 22/9, 2020 at 6:56

my_data should be a pandas DataFrame. – Mckoy 22/9, 2020 at 7:50
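The error in the comments comes from calling `.sample` on a Python list, since `sample` is a DataFrame method. A minimal sketch of the same 0.6 / 0.2 / 0.2 split on a DataFrame follows. The column names and the ten rows are hypothetical, not taken from the question, and it uses plain `iloc` slicing on the shuffled frame instead of `np.split`, which is a swapped-in variation rather than the answer's exact method.

```python
import pandas as pd

# Hypothetical data: ten rows with made-up column names (not from the question).
my_data = pd.DataFrame({
    "club": [f"club_{i}" for i in range(10)],
    "rating": list(range(70, 80)),
})

train_size = 0.6
validate_size = 0.2

# Shuffle all rows; random_state only makes the sketch repeatable.
shuffled = my_data.sample(frac=1, random_state=1)

# Cut points at 60% and 80% of the row count.
n = len(shuffled)
train_end = int(train_size * n)
validate_end = int((train_size + validate_size) * n)

train = shuffled.iloc[:train_end]
validate = shuffled.iloc[train_end:validate_end]
test = shuffled.iloc[validate_end:]

print(len(train), len(validate), len(test))
```

Slicing by position keeps each part a DataFrame with its original index labels, so rows can still be traced back to `my_data`.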
