Given the number of parameters, how to estimate the VRAM needed by a pytorch model?

I am trying to estimate the VRAM needed for a fully connected model without having to build/train the model in pytorch.

I got pretty close with this formula:

# params = number of parameters
# 24 = bytes per parameter (empirical factor)
# 1 MiB = 1048576 bytes
estimate = params * 24 / 1048576  # estimated VRAM in MiB

This example model has 384048000 parameters, but I have tested the formula on different models with different parameter counts.
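
Plugging the example model into the formula reproduces the estimate quoted below (a minimal, runnable sketch of the same arithmetic; the hard-coded parameter count is the one from this example):

params = 384048000                 # parameters in the example model
estimate = params * 24 / 1048576   # 24 bytes/param, converted to MiB
print(f"VRAM estimate = {estimate}MiB")
# prints: VRAM estimate = 8790.1611328125MiB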

The results are pretty accurate. However, the estimate only accounts for the PyTorch session's VRAM, not the driver/CUDA buffer VRAM. Here are the estimated results (from the formula) versus the empirical results (from nvidia-smi after building/training the model):

ESTIMATE BEFORE EMPIRICAL TEST: 
VRAM estimate = 8790.1611328125MiB

EMPIRICAL RESULT AFTER BUILDING MODEL: 
GPU RAM for pytorch session only (cutorch.max_memory_reserved(0)/1048576): 8466.0MiB
GPU RAM including extra driver buffer from nvidia-smi: 9719MiB
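
For reproducibility, here is a minimal sketch of how both measurements can be taken from inside the session. It assumes cutorch is an alias for torch.cuda (as the label above suggests) and a PyTorch version that provides torch.cuda.mem_get_info, which reports device-wide free/total bytes from the driver and therefore tracks nvidia-smi more closely than the allocator statistics:

import torch
import torch.cuda as cutorch

MIB = 1048576

# Peak VRAM reserved by PyTorch's caching allocator for this process
reserved = cutorch.max_memory_reserved(0) / MIB
print(f"GPU RAM for pytorch session only: {reserved}MiB")

# Device-wide view from the driver: includes the CUDA context and any
# other processes, so it is closer to what nvidia-smi reports
free_b, total_b = cutorch.mem_get_info(0)
print(f"GPU RAM used on device (driver view): {(total_b - free_b) / MIB}MiB")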

Any ideas on how to estimate that extra VRAM shown in nvidia-smi output?

Kerns answered 22/10, 2021 at 18:29 (5 comments)
A tangential question: do you take into account the memory required for storing the graph for gradients (which sometimes may not apply) and the gradient history, especially when you use a momentum-based optimizer? I mean, how did you arrive at 384048000? – Syncom
model.summary() shows the number of parameters, which covers all elements of the model across all layers (weights, biases, inputs, outputs, etc.). – Kerns
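
(In case it helps: model.summary() typically comes from a helper such as torchsummary rather than core PyTorch. A plain-PyTorch sketch for counting the learnable parameters, using a hypothetical stand-in model, would be:)

import torch.nn as nn

model = nn.Sequential(nn.Linear(4000, 8000), nn.ReLU(), nn.Linear(8000, 4000))
n_params = sum(p.numel() for p in model.parameters())  # weights + biases
print(n_params)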
I figured you might be doing something like that... What you are estimating are the variables of your forward pass, and you don't take into account the variables that the optimizer introduces. Your calculation might only apply to inference, not to training. – Syncom
Makes sense. I just noticed that the optimizer uses about 3 times more VRAM than what I calculated for the model parameters alone. It occupies 1x on epoch 1, 2x on epoch 2, and 3x on epoch 3, and then the VRAM usage stops growing after epoch 3. I was trying to understand why that happens... – Kerns
If you are using the Adam optimizer, it stores one extra set of values for the momentum (first-moment) part of Adam and one for the RMSProp-style (second-moment) part, for each parameter/tensor/variable that requires grad. Other optimizers might use less or more depending on how they are configured. – Syncom
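
This is easy to confirm by inspecting Adam's per-parameter state after one step (a minimal sketch): each tensor that requires grad gets an exp_avg (first-moment) and an exp_avg_sq (second-moment) buffer of the same shape, which, together with the stored gradients, is consistent with roughly 3x extra memory on top of the weights themselves.

import torch
import torch.nn as nn

model = nn.Linear(1000, 1000)
opt = torch.optim.Adam(model.parameters())

model(torch.randn(8, 1000)).sum().backward()
opt.step()  # Adam allocates its state buffers lazily, on the first step

for p in model.parameters():
    state = opt.state[p]
    # one first-moment and one second-moment buffer per parameter tensor,
    # each the same shape as the parameter itself
    print(p.shape, state["exp_avg"].shape, state["exp_avg_sq"].shape)

Note that these buffers are allocated on the first optimizer step, not when the model is built, which is part of why VRAM usage grows after training starts rather than at construction time.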
