I'm working with the MNIST dataset from a Kaggle challenge and am having trouble preprocessing the data. Furthermore, I don't know what the best practices are and was wondering if you could advise me on that.
Disclaimer: I can't just use torchvision.datasets.mnist because I need to use Kaggle's data for training and submission.
In this tutorial, it is advised to create a Dataset object that loads .pt tensors from files, in order to fully utilize the GPU. To achieve that, I needed to load the CSV data provided by Kaggle and save it as .pt files:
import pandas as pd
import torch
import numpy as np

# import the Kaggle training data
digits_train = pd.read_csv('data/train.csv')

# pixel values as one tensor, labels as another
train_tensor = torch.tensor(digits_train.drop('label', axis=1).to_numpy(), dtype=torch.int)
labels_tensor = torch.tensor(digits_train['label'].to_numpy())

# save each sample as its own .pt file
for i in range(train_tensor.shape[0]):
    torch.save(train_tensor[i], "data/train-" + str(i) + ".pt")
Each train_tensor[i] has shape torch.Size([1, 784]).
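For context, the Dataset I intend to build on top of these files would look roughly like the sketch below. The class name MNISTFromFiles, the data/train-<i>.pt layout, and passing labels_tensor in directly are my own assumptions, not something the tutorial prescribes verbatim.

import torch
from torch.utils.data import Dataset

class MNISTFromFiles(Dataset):
    """Loads one pre-saved .pt tensor per sample from disk."""

    def __init__(self, root, labels):
        self.root = root      # directory holding the train-<i>.pt files
        self.labels = labels  # labels_tensor from the snippet above

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        # read a single flattened image tensor back from its .pt file
        image = torch.load(self.root + "/train-" + str(idx) + ".pt")
        return image, self.labels[idx]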
However, each such .pt file is about 130 MB, while a tensor of the same shape filled with randomly generated integers saves to only about 6.6 kB. Why are these tensors so huge, and how can I reduce their size?
The dataset has 42,000 samples. Should I even bother with batching this data? Should I bother with saving tensors to separate files, rather than loading them all into RAM and then slicing them into batches? What is the optimal approach here?
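For comparison, the all-in-RAM alternative I have in mind would be something like the following sketch (batch_size=64 is an arbitrary choice on my part):

import torch
from torch.utils.data import TensorDataset, DataLoader

# keep the whole 42 000 x 784 tensor in memory and let DataLoader do the batching
dataset = TensorDataset(train_tensor.float(), labels_tensor)
loader = DataLoader(dataset, batch_size=64, shuffle=True)

for images, labels in loader:
    pass  # training step would go here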