How can I split a Dataset from a .csv file for Training and Testing?

Asked 29/4, 2017 at 15:13 Answered 15/4 at 13:55

I'm using Python and need to split data imported from a .csv file into two parts, a training set and a test set, e.g. 70% training and 30% test.

I keep getting various errors, such as TypeError: 'list' object is not callable.

Is there any easy way of doing this?

Thanks

EDIT:

The code is basic, I'm just looking to split the dataset.

from csv import reader
with open('C:/Dataset.csv', 'r') as f:
    data = list(reader(f)) #Imports the CSV
    data[0:1] ( data )

TypeError: 'list' object is not callable

Kirghiz answered 29/4, 2017 at 15:13 Comment(3)

There are many ways to achieve this, but without seeing your code it's hard to help in particular. – Glacier 29/4, 2017 at 15:15

please post the code and the complete error. – Demetria 29/4, 2017 at 15:15

Added the code to the post. – Kirghiz 29/4, 2017 at 15:27
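The TypeError in the question comes from `data[0:1] ( data )`: the parentheses after the slice make Python try to call the resulting list as if it were a function. A minimal standard-library sketch of a 70/30 split follows, assuming the first row of the file is a header. The helper name `split_rows` and the in-memory sample data are hypothetical and only stand in for the real file.

```python
import csv
import io
import random

def split_rows(rows, train_frac=0.7, seed=None):
    """Shuffle a copy of rows and cut it into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_frac)  # number of training rows
    return shuffled[:cut], shuffled[cut:]

# Hypothetical in-memory CSV standing in for open('C:/Dataset.csv', 'r')
sample = io.StringIO("x,y\n" + "\n".join(f"{i},{i * 2}" for i in range(10)))
header, *data = list(csv.reader(sample))  # keep the header out of the split

train, test = split_rows(data, train_frac=0.7, seed=0)
print(len(train), len(test))
```

Shuffling before slicing matters when the file is sorted; slicing alone would put all of the last rows in the test set.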

You can use pandas:

import pandas as pd
import numpy as np

df = pd.read_csv('C:/Dataset.csv')

# Boolean mask: each row lands in the training set with probability 0.7
msk = np.random.rand(len(df)) <= 0.7

train = df[msk]   # roughly 70% of the rows
test = df[~msk]   # the remaining rows

Kissable answered 29/4, 2017 at 15:48 Comment(3)

Worked perfectly, but can you tell me how to write the train and test data to .csv files? – Boniface 12/12, 2017 at 16:18

@DeepakChawla train.to_csv('train.csv', index=False) and the same with test. – Kissable 12/12, 2017 at 16:44

The output does not have the exact percentage. If my dataset has 100 rows, it does not return 70 rows; each run returns a different number of rows, but not 70. – Catalectic 31/5, 2019 at 6:20
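The random mask assigns each row independently, so the training share is 70% only on average. One possible way to get exact counts is to shuffle the row positions and cut at a fixed point. This is a sketch, not part of the answer above; the small DataFrame is a hypothetical stand-in for the CSV, and the seed is fixed only so the example is repeatable.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for pd.read_csv('C:/Dataset.csv')
df = pd.DataFrame({"value": range(100)})

rng = np.random.default_rng(seed=0)    # fixed seed only for repeatability
positions = rng.permutation(len(df))   # shuffled row positions 0..n-1
cut = round(0.7 * len(df))             # exact number of training rows

train = df.iloc[positions[:cut]]
test = df.iloc[positions[cut:]]

print(len(train), len(test))
```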

A better practice is to use df.sample, which draws an exact fraction of the rows at random:

from numpy.random import RandomState
import pandas as pd

df = pd.read_csv('C:/Dataset.csv')
rng = RandomState()

train = df.sample(frac=0.7, random_state=rng)
test = df.loc[~df.index.isin(train.index)]

Observable answered 10/8, 2017 at 19:15 Comment(0)

You should use the read_csv() function from the pandas module. It reads all your data straight into a DataFrame, which you can then split into train and test sets. Alternatively, you can use the train_test_split() function from the scikit-learn module.
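As a rough sketch of how the two fit together, train_test_split can be applied to a whole DataFrame. The column names and the in-memory data here are hypothetical placeholders for whatever read_csv returns, and random_state is set only to make the split repeatable.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for pd.read_csv('C:/Dataset.csv')
df = pd.DataFrame({"feature": range(10), "label": [0, 1] * 5})

# test_size=0.3 holds out 30% of the rows; random_state fixes the shuffle
train, test = train_test_split(df, test_size=0.3, random_state=0)

print(train.shape, test.shape)
```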

Aeolipile answered 29/4, 2017 at 15:43 Comment(0)

You should use sklearn.model_selection.train_test_split, as it is built for splitting a dataset. The code below shows how to use it:

import pandas as pd
from sklearn.model_selection import train_test_split

data = pd.read_csv('C:/Dataset.csv')
x_train, x_test, y_train, y_test = train_test_split(data["qus"],
                                                    data["ans"], test_size=0.3)

# Recombine the question and answer columns of each split
train_data = pd.concat([x_train, y_train], axis=1)
test_data = pd.concat([x_test, y_test], axis=1)
train_data.head()

This assumes that your CSV contains two columns, "qus" for the question and "ans" for the answer.

Petrie answered 21/10, 2023 at 20:6 Comment(0)

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

df = pd.read_csv("in.csv")
indices = np.arange(len(df))
indices_train, indices_test = train_test_split(indices, test_size = 0.3)
df_train = df.iloc[indices_train]
df_test = df.iloc[indices_test]
df_train.to_csv("train.csv")
df_test.to_csv("test.csv")

Jesseniajessey answered 15/4 at 13:55 Comment(0)
