How can I decrease Dedicated GPU memory usage and use Shared GPU memory for CUDA and Pytorch
I'm getting the following error when I try to use one of the Hugging Face models for sentiment analysis:

RuntimeError: CUDA out of memory. Tried to allocate 72.00 MiB (GPU 0; 3.00 GiB total capacity; 1.84 GiB already allocated; 5.45 MiB free; 2.04 GiB reserved in total by PyTorch)

Although I'm not using the CUDA memory any more, it still stays at the same level.

I tried torch.cuda.empty_cache(), but it didn't help. When I close the Jupyter notebook the usage drops to 0, so I'm fairly sure it is something to do with PyTorch and Python.

Here is my code:

import joblib
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification,pipeline
import torch.nn.functional as F
from torch.utils.data import DataLoader
import pandas as pd
import numpy as np
from tqdm import tqdm

tokenizer = AutoTokenizer.from_pretrained("savasy/bert-base-turkish-sentiment-cased")
model = AutoModelForSequenceClassification.from_pretrained("savasy/bert-base-turkish-sentiment-cased")

# comments: the list of texts to classify (187K of them, loaded elsewhere)
sa = pipeline("sentiment-analysis", tokenizer=tokenizer, model=model, device=0)
batcher = DataLoader(dataset=comments,
                     batch_size=100,
                     shuffle=True,
                     pin_memory=True)
predictions = []
for batch in tqdm(batcher):
    p = sa(batch)
    predictions.append(p)

I have a GTX 1060, Python 3.8 and torch==1.7.1, and my OS is Windows 10. There are 187K comments in total. I would like to know if there is any workaround for this memory issue, maybe by holding the tensors on the CPU and only moving each batch to the GPU. After I hit this error the memory stays allocated; it only goes away when I close my Jupyter notebook. Is there any way I can clear this memory? Is there any way I can utilize shared GPU memory?

Progenitive answered 4/3, 2021 at 13:20 Comment(2)
Your batch size is way too high for your GPU and a transformer like BERT. Try a batch_size of 8 and go up from there, if you can. Also, the token sequence length is a big factor in memory usage.Piecedyed
You could also try PyTorch's automatic mixed precision library. Your GPU does not have tensor cores, so it will not be faster, but it will use less memory. linkBarnette
How can I decrease Dedicated GPU memory usage and use Shared GPU memory for CUDA and Pytorch

Short answer: you cannot.

Details: I believe this answer covers all the information that you need.

You can reduce memory usage by lowering the batch size, as @John Stud commented, or by using automatic mixed precision, as @Dwight Foster suggested.
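For reference, here is a minimal sketch of mixed-precision inference with torch.cuda.amp (available in the asker's torch==1.7.1); model, tokenizer and batch are the objects from the question, while the exact preprocessing call is an assumption:

import torch

model.to("cuda")
model.eval()

with torch.no_grad():
    # autocast runs the forward pass in float16 where safe, roughly halving activation memory
    with torch.cuda.amp.autocast():
        inputs = tokenizer(batch, padding=True, truncation=True, return_tensors="pt").to("cuda")
        logits = model(**inputs).logits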

While training, you can use gradient accumulation: run smaller batches and accumulate their gradients over several steps, so memory goes down without hurting performance.
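A minimal sketch of gradient accumulation (the model, optimizer, loss_fn and loader names here are placeholders, not from the question):

accumulation_steps = 4  # effective batch size = per-step batch size * accumulation_steps

optimizer.zero_grad()
for i, (inputs, labels) in enumerate(loader):
    loss = loss_fn(model(inputs), labels) / accumulation_steps  # scale so the accumulated gradients average out
    loss.backward()                        # gradients add up across the small batches
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()                   # one weight update per accumulation_steps batches
        optimizer.zero_grad()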

Peregrinate answered 26/7, 2023 at 8:36 Comment(0)
Short answer: No.

I ran into this problem myself, but by choice: I wanted to check the maximum capacity my GPU could handle for training. First things first.

  • Check that you have restarted the kernel before launching a new training run. Unless explicitly removed, objects copied to the GPU stay in GPU memory; they are only cleaned up when the kernel is restarted or they are explicitly purged (see the sketch below).
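For example, to purge the memory without restarting the kernel, drop every Python reference to the model and then empty PyTorch's cache; a minimal sketch using the objects from the question:

import gc
import torch

del sa, model                 # drop the references that keep tensors on the GPU
gc.collect()                  # make sure the objects are actually collected
torch.cuda.empty_cache()      # hand the cached blocks back to the driver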

Next checklist

  • Learning rate. A smaller learning rate will use more memory (0.0001 more than 0.01).
  • Batch size. A big batch size plus a low learning rate = a lot more memory.

Optimizing

You have very little memory, i.e. 3 GB. Shared GPU memory doesn't apply here; it is managed automatically, and to train on the GPU your tensors have to sit in dedicated GPU memory. Shared GPU memory is just system memory.

  • The main number you can play around with is the batch size; in your case it's 100, and you have to reduce it if you are running out of memory. Start with 8, then try 16, then 32, and so forth (even numbers). You can do a short training run of 1 epoch to test.
  • Use a larger value for the LR, or use StepLR to adjust the LR automatically on a schedule (see the sketch after this list). https://pytorch.org/docs/stable/generated/torch.optim.lr_scheduler.StepLR.html
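A minimal StepLR sketch (the model, optimizer and training loop here are placeholders, not from the question):

import torch

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
# multiply the learning rate by 0.1 every 10 epochs
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)

for epoch in range(num_epochs):
    train_one_epoch(model, optimizer)  # hypothetical training step
    scheduler.step()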

What I found is that using more memory does not mean your training will complete faster. It comes down to your hyperparameters and the speed of your CUDA cores anyway.

Inference

Since you are only using the model for prediction, you can also try CPU inference and use system memory. Here you use the DataLoader like this:

import multiprocessing

# use one DataLoader worker per CPU core
num_workers = multiprocessing.cpu_count()

batcher = torch.utils.data.DataLoader(dataset=comments,
                                      batch_size=batch_size,
                                      shuffle=True,
                                      num_workers=num_workers)
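With the asker's own code, CPU inference just means building the pipeline without device=0 (or with device=-1, the default), so everything stays in system memory; a minimal sketch:

sa = pipeline("sentiment-analysis", tokenizer=tokenizer, model=model, device=-1)

predictions = []
for batch in tqdm(batcher):
    predictions.append(sa(batch))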

References

https://superuser.com/questions/1416540/what-is-shared-gpu-memory-and-how-is-total-gpu-memory-calculated-windows-10

Shared memory from pytorch forum, see specific response. https://discuss.pytorch.org/t/what-is-the-shared-memory/112212/8

Rolland answered 31/7, 2023 at 11:37 Comment(1)
Why do you think a smaller learning rate will use more memory?Peregrinate
I found that shared GPU memory is used automatically once the dedicated GPU memory is full. I'm running the project https://github.com/thachln/Hawkeye on Windows 10 on a Lenovo ThinkPad P52.


Craggy answered 14/1 at 1:24 Comment(1)
As it’s currently written, your answer is unclear. Please edit to add additional details that will help others understand how this addresses the question asked. You can find more information on how to write good answers in the help center.Jetty
You cannot. Shared GPU memory (which generally belongs to a non-NVIDIA, integrated GPU) is different from dedicated GPU memory, and CUDA only works on NVIDIA GPUs.

However, you can try DirectML (for the Intel integrated GPU), which can help you run on the shared GPU rather than the dedicated GPU.
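For illustration, a minimal sketch assuming the torch-directml package is installed (the exact API may vary between versions):

import torch
import torch_directml  # assumption: pip install torch-directml

dml = torch_directml.device()   # DirectML device backed by the integrated/shared GPU
x = torch.ones(3, device=dml)   # tensors created here live outside the dedicated NVIDIA memory
print((x * 2).cpu())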

Cavernous answered 28/7, 2023 at 7:7 Comment(0)
