Multi GPU training for Transformers with different GPUs

I want to fine-tune a GPT-2 model using Hugging Face's Transformers, preferably the medium model, but large if possible. Currently, I have an RTX 2080 Ti with 11 GB of memory and I can train the small model just fine.

My question is: will I run into any issues if I add an old Tesla K80 (24 GB) to my machine and distribute the training across both cards? I cannot find information about training with GPUs of different capacities and the issues I could run into.

Will my model size limit essentially be the sum of all available GPU memory (35 GB)?

I’m not interested in doing this in AWS.

Lepp answered 28/3, 2020 at 17:16 Comment(4)
Went ahead and ordered a K80. I'll update this with any gotchas when it arrives and I can try some local heterogeneous multi-GPU training!Lepp
K80 is set up and running. The system sees two GPUs with 11GB each. When I start training, I get a warning: "There is an imbalance between your GPUs. You may want to exclude GPU 1 which has less than 75% of the memory or cores of GPU 0." This does not cause any problems from what I can tell; however, I cannot load the medium GPT-2 without getting OOM errors. Is there a way to split it across my 3 GPUs (each with 11GB)?Lepp
Looks like model parallelism will only really be supported via something like github.com/NVIDIA/Megatron-LMLepp
I tried to run 345M on 8xV100 (16 GB each) with batch_size 1 on AWS and I am getting OOM errors. The model is trying to allocate more than 16 GB on each GPU. How did you solve this problem? Is there any way to treat 2 GPUs as a single GPU?Mcpeak

You already solved your problem. That's great. I would like to point out a different approach and address a few questions.

Will my model size limit essentially be the sum of all available GPU memory (35 GB)?

This depends on the training technique you use. Standard data parallelism replicates the model, the gradients, and the optimizer states on each GPU, so every GPU must have enough memory to hold all of them. Only the data is split across the GPUs. However, the bottleneck is usually the optimizer states and the model, not the data.
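
As a rough illustration of what "replicated" means here, below is a minimal PyTorch DistributedDataParallel sketch (an assumption about the setup, not the asker's actual script). It would be launched with something like `torchrun --nproc_per_node=2 train_ddp.py`, and the random-token dataset is just a placeholder so the example runs end to end:

```python
# Minimal data-parallel sketch: every GPU holds a full copy of the model,
# its gradients, and its optimizer states; only the data is sharded.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset
from transformers import GPT2LMHeadModel

dist.init_process_group("nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Full model replica on this GPU.
model = GPT2LMHeadModel.from_pretrained("gpt2").cuda(local_rank)
ddp_model = DDP(model, device_ids=[local_rank])
optimizer = torch.optim.AdamW(ddp_model.parameters(), lr=5e-5)

# Toy dataset of random token ids, just so the sketch runs without a real corpus.
input_ids = torch.randint(0, model.config.vocab_size, (64, 128))
dataset = TensorDataset(input_ids)

# Only the data is split: DistributedSampler hands each GPU a different slice.
sampler = DistributedSampler(dataset)
loader = DataLoader(dataset, batch_size=2, sampler=sampler)

for (batch,) in loader:
    batch = batch.cuda(local_rank)
    loss = ddp_model(input_ids=batch, labels=batch).loss
    loss.backward()          # gradients are averaged across GPUs during backward
    optimizer.step()
    optimizer.zero_grad()

dist.destroy_process_group()
```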

The state-of-the-art approach to training is ZeRO. Not only the dataset, but also the model parameters, the gradients, and the optimizer states are partitioned across the GPUs. This allows you to train huge models without hitting OOM. See the illustration below from the paper. The baseline is the standard case I mentioned; the authors progressively partition the optimizer states, the gradients, and the model parameters across the GPUs and compare the memory usage per GPU.

[Figure from the ZeRO paper: per-GPU memory consumption for the baseline vs. ZeRO stages that progressively partition optimizer states, gradients, and parameters]

The authors of the paper created a library called DeepSpeed, and it is very easy to integrate it with Hugging Face Transformers. With that, I was able to increase my model size from 260 million to 11 billion parameters :)

If you want to understand in detail how it works, here is the paper: https://arxiv.org/pdf/1910.02054.pdf

More information on integrating DeepSpeed with Hugging Face can be found here: https://huggingface.co/docs/transformers/main_classes/deepspeed
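
A minimal sketch of what that integration can look like with the Trainer, assuming ZeRO stage 2 and a toy random-token dataset as placeholders (the hyperparameters are not a tuned recipe):

```python
# DeepSpeed ZeRO stage 2 via the Hugging Face Trainer.
# Launch with e.g. `deepspeed --num_gpus=2 train_zero.py`.
import torch
from torch.utils.data import Dataset
from transformers import GPT2LMHeadModel, Trainer, TrainingArguments

class ToyDataset(Dataset):
    """Random token ids, just so the example runs without a real corpus."""
    def __init__(self, vocab_size=50257, seq_len=128, n=256):
        self.data = torch.randint(0, vocab_size, (n, seq_len))
    def __len__(self):
        return len(self.data)
    def __getitem__(self, i):
        ids = self.data[i]
        return {"input_ids": ids, "labels": ids}

ds_config = {
    # Stage 2 shards optimizer states and gradients across GPUs;
    # stage 3 would also shard the parameters themselves.
    "zero_optimization": {"stage": 2},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=2,
    num_train_epochs=1,
    deepspeed=ds_config,   # a dict or a path to a ds_config.json file
)

trainer = Trainer(
    model=GPT2LMHeadModel.from_pretrained("gpt2"),
    args=args,
    train_dataset=ToyDataset(),
)
trainer.train()
```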

PS: There is also a model parallelism technique in which each GPU holds different layers of the model, but it has lost popularity and is not widely used anymore.
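
For completeness, here is a naive sketch of that kind of layer-wise split in plain PyTorch (a toy two-GPU example, not the Megatron-LM approach): half of a small transformer stack lives on GPU 0, the other half on GPU 1, and the activations are moved between devices in `forward()`.

```python
# Naive layer-wise model parallelism: fits a bigger model across two GPUs,
# but only one GPU computes at a time, so it mostly helps with memory, not speed.
import torch
import torch.nn as nn

class TwoGPUModel(nn.Module):
    def __init__(self, d_model=1024, n_layers=8):
        super().__init__()
        half = n_layers // 2
        block = lambda: nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.part1 = nn.Sequential(*[block() for _ in range(half)]).to("cuda:0")
        self.part2 = nn.Sequential(*[block() for _ in range(n_layers - half)]).to("cuda:1")

    def forward(self, x):
        x = self.part1(x.to("cuda:0"))
        # Hand the activations over to the second GPU for the remaining layers.
        return self.part2(x.to("cuda:1"))

model = TwoGPUModel()
out = model(torch.randn(2, 64, 1024))   # (batch, seq_len, d_model)
print(out.shape, out.device)            # torch.Size([2, 64, 1024]) cuda:1
```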

Reactor answered 20/8, 2022 at 10:58 Comment(0)
