How can I split a Dataset from a .csv file for Training and Testing?
K

5

13

I'm using Python and I need to split my imported .csv data into two parts, a training set and a test set, e.g. 70% training and 30% test.

I keep getting various errors, such as 'list' object is not callable.

Is there any easy way of doing this?

Thanks

EDIT:

The code is basic; I'm just looking to split the dataset.

from csv import reader
with open('C:/Dataset.csv', 'r') as f:
    data = list(reader(f)) #Imports the CSV
    data[0:1] ( data )

TypeError: 'list' object is not callable
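
The TypeError comes from the last line: `data[0:1]` is a list slice, and the parentheses after it try to call that list as if it were a function. As a minimal sketch of a 70/30 split using only the standard library, something like the following may help. It assumes the first row of the file is a header; the helper name `split_rows`, the fixed seed, and the toy rows standing in for `list(reader(f))` are illustrative, not part of the original post.

```python
import random

def split_rows(rows, train_frac=0.7, seed=0):
    """Shuffle a copy of rows and cut it into (train, test) lists."""
    shuffled = list(rows)                      # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)      # seeded only so runs are repeatable
    cut = round(len(shuffled) * train_frac)    # number of training rows
    return shuffled[:cut], shuffled[cut:]

# Stand-in for list(reader(f)): one header row plus ten data rows
data = [["id", "value"]] + [[str(i), str(i * i)] for i in range(10)]

header, rows = data[0], data[1:]               # keep the header out of the shuffle
train, test = split_rows(rows)
print(len(train), len(test))
```

Slicing with `data[1:]` keeps the header row out of both sets, so it can be written back to the top of each output file if needed.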

Kirghiz answered 29/4, 2017 at 15:13 Comment(3)
There are many ways to achieve this, but without seeing your code it's hard to help in particular. Glacier
Please post the code and the complete error. Demetria
Added the code to the post. Kirghiz
K
32

You can use pandas:

import pandas as pd
import numpy as np

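df = pd.read_csv('C:/Dataset.csv')

# Boolean mask: each row lands in train with probability 0.7,
# so the split is approximately (not exactly) 70/30
msk = np.random.rand(len(df)) <= 0.7

train = df[msk]   # rows where the mask is True
test = df[~msk]   # the remaining rows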
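
Because the mask draws each row independently, the training set size varies from run to run around 70%. If an exact row count matters, one possible sketch is to shuffle the row positions and cut at a fixed point. The toy DataFrame below stands in for the result of `pd.read_csv`, and the seed is an arbitrary choice for repeatability.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for pd.read_csv('C:/Dataset.csv')
df = pd.DataFrame({"x": range(100), "y": range(100, 200)})

rng = np.random.default_rng(0)        # fixed seed, only for repeatability
order = rng.permutation(len(df))      # shuffled row positions
cut = int(round(0.7 * len(df)))       # exact number of training rows

train = df.iloc[order[:cut]]          # first 70% of the shuffled positions
test = df.iloc[order[cut:]]           # everything else
print(len(train), len(test))
```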
Kissable answered 29/4, 2017 at 15:48 Comment(3)
Worked perfectly, but can you tell me how to write the train and test data to a .csv file? Boniface
@DeepakChawla train.to_csv('train.csv', index=False), and the same for test. Kissable
The output does not have the exact percentage. If my dataset has 100 rows, it does not return 70 rows; it returns a different number of rows each time rather than exactly 70. Catalectic
O
10

A better practice is to use df.sample, which draws the requested fraction of rows rather than an approximate one:

from numpy.random import RandomState
import pandas as pd

df = pd.read_csv('C:/Dataset.csv')
rng = RandomState()

train = df.sample(frac=0.7, random_state=rng)
test = df.loc[~df.index.isin(train.index)]
Observable answered 10/8, 2017 at 19:15 Comment(0)
A
5

You should use the read_csv() function from the pandas module. It reads all your data straight into a DataFrame, which you can then split into train and test sets. Alternatively, you can use the train_test_split() function from the scikit-learn module.
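
As a rough sketch of how the two suggestions combine, `train_test_split` can be applied to a whole DataFrame. The toy frame and its column names are made up for illustration and stand in for the result of `pd.read_csv`; `random_state=42` is an arbitrary seed.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for pd.read_csv('C:/Dataset.csv')
df = pd.DataFrame({"feature": range(10), "label": [0, 1] * 5})

# test_size=0.3 requests a 70/30 split; rows are shuffled by default
train, test = train_test_split(df, test_size=0.3, random_state=42)
print(train.shape, test.shape)
```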

Aeolipile answered 29/4, 2017 at 15:43 Comment(0)
P
0

You should use sklearn.model_selection.train_test_split, which is well suited to splitting a dataset. The code below shows how to use it, assuming your CSV contains two columns, one for the question and one for the answer:

import pandas as pd
from sklearn.model_selection import train_test_split

data = pd.read_csv('C:/Dataset.csv')
x_train, x_test, y_train, y_test = train_test_split(
    data["qus"], data["ans"], test_size=0.3)

# Recombine the question and answer columns for each split
train_data = pd.concat([x_train, y_train], axis=1)
test_data = pd.concat([x_test, y_test], axis=1)
train_data.head()

Petrie answered 21/10, 2023 at 20:6 Comment(0)
J
-1
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

df = pd.read_csv("in.csv")
indices = np.arange(len(df))
indices_train, indices_test = train_test_split(indices, test_size = 0.3)
df_train = df.iloc[indices_train]
df_test = df.iloc[indices_test]
df_train.to_csv("train.csv")
df_test.to_csv("test.csv")
Jesseniajessey answered 15/4 at 13:55 Comment(0)
