I've written a Python script that takes a large file (a matrix of ~50k rows × ~500 cols) and uses it as a dataset to train a random forest model.
My script has two functions: one to load the dataset and one to train the random forest model on that data. Both work fine, but loading the file takes ~45 seconds, and it's a pain to repeat that every time I want to train a slightly different model (I'm testing many models on the same dataset). Here is the file-loading code:
import io
import numpy as np

def load_train_data(train_file):
    """Read the tab-separated training file and return (id list, value array)."""
    train_id_list = []
    train_val_list = []
    with io.open(train_file) as train_f:
        for line in train_f:
            list_line = line.strip().split("\t")
            if list_line[0] != "Domain":          # skip the header row
                train_identifier = list_line[9]   # column 9: label/identifier
                train_values = list_line[12:]     # columns 12-end: feature values
                train_id_list.append(train_identifier)
                train_val_float = [float(x) for x in train_values]
                train_val_list.append(train_val_float)
    train_val_array = np.asarray(train_val_list)
    return train_id_list, train_val_array
This returns a list of identifiers (column 9, the labels) and a numpy array of the values in columns 12 onward, which are the features used to train the random forest.
I am going to train many different variants of my model on the same data, so I just want to load the file once and keep it available to feed into my random forest function. I think I want the loaded data to be an object (I am fairly new to Python).
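Something like the following is the pattern I'm after, I think: load the data once at the top of the script and reuse the in-memory objects for every model variant. Here scikit-learn's RandomForestClassifier and the n_estimators values just stand in for my actual training function and its settings, and the file path is a placeholder.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    # Load once; the id list and value array stay in memory for the rest of the run.
    train_ids, train_vals = load_train_data("train_data.tsv")  # placeholder path

    # Reuse the same in-memory objects for every model variant.
    for n_trees in (100, 300, 500):
        model = RandomForestClassifier(n_estimators=n_trees)
        model.fit(train_vals, train_ids)
        print(n_trees, model.score(train_vals, train_ids))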
Have a look at the pandas read_xxx functions in the documentation, which allow you to load different file formats into a dataframe. – Fortyfour
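For reference, a rough sketch of that suggestion, assuming the file layout matches the loader above (tab-separated with a header row, label in column 9, features from column 12 onward); the path is again a placeholder:

    import pandas as pd

    # Read the tab-separated file in one call; pandas' C parser is usually much
    # faster than a pure-Python line-by-line loop.
    df = pd.read_csv("train_data.tsv", sep="\t")  # placeholder path

    train_ids = df.iloc[:, 9].tolist()                   # column 9: labels
    train_vals = df.iloc[:, 12:].to_numpy(dtype=float)   # columns 12-end: features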