In the case where we get the out-of-memory error right off the bat, before epoch 1 even starts,
torch.cuda.empty_cache()
gc.collect()
coupled with lowering the batch_size
may work in some cases, as noted in previous answers. In my case it was not enough, so I did 2 more things:
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:1024"
Here you can adjust 1024 to a desired size.
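One thing to keep in mind is that this variable is read when the caching allocator initializes, so it should be set before the first CUDA operation in the process. To check whether fragmentation is actually the problem (and to pick a sensible split size), it can help to compare how much memory PyTorch has allocated versus reserved; a minimal sketch, nothing here is specific to my model:
import torch
allocated = torch.cuda.memory_allocated() / 1e9   # memory used by live tensors
reserved = torch.cuda.memory_reserved() / 1e9     # memory held by the caching allocator
print(f"allocated: {allocated:.2f} GB, reserved: {reserved:.2f} GB")
# A detailed breakdown of allocator blocks, useful when tuning max_split_size_mb:
print(torch.cuda.memory_summary(abbreviated=True))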
I reduced the size of the images I was feeding to the network, inside the dataset class, specifically in the __getitem__() method:
# Assumes: import torch, from torchvision import transforms, and from PIL import Image as Image_
def __getitem__(self, i_dex, resize_=(320, 480)):
    transforms_ = transforms.Compose([
        transforms.PILToTensor(),
        transforms.ConvertImageDtype(torch.float32),
    ])
    im_ = Image_.open(self.data_paths[i_dex])
    # Make sure every image has 3 channels before resizing.
    if im_.mode != 'RGB':
        im_ = im_.convert('RGB')
    # Downscale the image so each sample takes less GPU memory.
    im_ = im_.resize(resize_)
    return transforms_(im_), self.labels[i_dex]
and reduced the batch_size from 40 to 20. Before resizing, the maximum batch_size I was able to run was 4. This matters a lot for contrastive learning models such as SimCLR, where the batch size must be large (256 or more) so that the model learns from many contrastive augmentation image pairs at once.
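For context, here is a minimal sketch of how the smaller images and the reduced batch size come together in the loader; MyImageDataset, train_paths and train_labels are placeholder names, not from my original code:
from torch.utils.data import DataLoader
# MyImageDataset stands in for the dataset class whose __getitem__ is shown above.
train_set = MyImageDataset(data_paths=train_paths, labels=train_labels)
# Smaller images per sample allow a larger batch_size before running out of
# GPU memory; 20 worked here where 40 did not.
train_loader = DataLoader(train_set, batch_size=20, shuffle=True,
                          num_workers=2, pin_memory=True)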
Edits:
Repeating the process above several times, I was eventually able to train the model with a batch size of 400.
To monitor GPU resources you can use something like glances. This makes things easier while adjusting parameters.
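If you prefer to check from inside the training script rather than with an external tool, PyTorch's own memory counters serve a similar purpose; a minimal sketch, where train_one_epoch is a placeholder for your training step:
import torch
# Reset the peak counter, run some work, then read back the high-water mark.
torch.cuda.reset_peak_memory_stats()
train_one_epoch()  # placeholder for your actual training loop
peak_gb = torch.cuda.max_memory_allocated() / 1e9
print(f"peak GPU memory this epoch: {peak_gb:.2f} GB")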
torch.cuda.empty_cache()
did not work. Instead, first disable the GPU, then restart the kernel, and reactivate the GPU. This worked for me. – Terina