I'm working with the MNIST dataset from a Kaggle challenge and am having trouble preprocessing the data. Furthermore, I don't know what the best practices are and was wondering if you could advise me on that.
Disclaimer: I can't just use torchvision.datasets.mnist because I need to use Kaggle's data for training and submission.
In this tutorial, it is advised to create a Dataset object that loads .pt tensors from files, in order to fully utilize the GPU. To achieve that, I needed to load the CSV data provided by Kaggle and save it as .pt files:
import pandas as pd
import torch
import numpy as np

# import the Kaggle training data
digits_train = pd.read_csv('data/train.csv')

# pixel values as one tensor, labels as another
train_tensor = torch.tensor(digits_train.drop('label', axis=1).to_numpy(), dtype=torch.int)
labels_tensor = torch.tensor(digits_train['label'].to_numpy())

# save each sample as its own .pt file
for i in range(train_tensor.shape[0]):
    torch.save(train_tensor[i], "data/train-" + str(i) + ".pt")
Each train_tensor[i] has shape torch.Size([1, 784]).
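For context, the Dataset I intend to build on top of these files would look roughly like the sketch below. The class name MNISTFromFiles, the data/train-<i>.pt layout, and passing labels_tensor in directly are my own assumptions, not something the tutorial prescribes verbatim.

import torch
from torch.utils.data import Dataset

class MNISTFromFiles(Dataset):
    """Loads one pre-saved .pt tensor per sample from disk."""

    def __init__(self, root, labels):
        self.root = root      # directory holding the train-<i>.pt files
        self.labels = labels  # labels_tensor from the snippet above

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        # read a single flattened image tensor back from its .pt file
        image = torch.load(self.root + "/train-" + str(idx) + ".pt")
        return image, self.labels[idx]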
However, each such .pt file is about 130 MB, while a tensor of the same shape filled with randomly generated integers saves to only about 6.6 kB. Why are these tensors so huge, and how can I reduce their size?
The dataset has 42,000 samples. Should I even bother with batching this data? Should I bother with saving tensors to separate files, rather than loading them all into RAM and then slicing them into batches? What is the optimal approach here?
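For comparison, the all-in-RAM alternative I have in mind would be something like the following sketch (batch_size=64 is an arbitrary choice on my part):

import torch
from torch.utils.data import TensorDataset, DataLoader

# keep the whole 42 000 x 784 tensor in memory and let DataLoader do the batching
dataset = TensorDataset(train_tensor.float(), labels_tensor)
loader = DataLoader(dataset, batch_size=64, shuffle=True)

for images, labels in loader:
    pass  # training step would go here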