Loading a HuggingFace model on multiple GPUs using model parallelism for inference

I have access to six 24GB GPUs. When I try to load some HuggingFace models, for example the following

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
tokenizer = AutoTokenizer.from_pretrained("google/ul2")
model = AutoModelForSeq2SeqLM.from_pretrained("google/ul2")

I get an out-of-memory error, because the model only seems to load onto a single GPU. However, while the whole model cannot fit on a single 24GB card, I have six of them, and I would like to know whether there is a way to distribute the model across multiple cards in order to perform inference.

HuggingFace seems to have a webpage where they explain how to do this, but it has no useful content as of today.

Diazo answered 15/2, 2023 at 11:33 Comment(0)

When you load the model with from_pretrained(), you need to specify how it should be mapped onto your devices. Add the following argument, and the transformers library will take care of the rest:

model = AutoModelForSeq2SeqLM.from_pretrained("google/ul2", device_map="auto")

Passing "auto" here will automatically split the model across your hardware in the following priority order: GPU(s) > CPU (RAM) > Disk.
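
For reference, here is a minimal sketch of a full sharded load. It assumes the accelerate package is installed (transformers requires it for device_map to work); the max_memory caps are illustrative values I chose for 24GB cards to leave headroom for activations, and can be omitted entirely. hf_device_map is the attribute transformers fills in with the final placement:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google/ul2")

# Optional per-device memory caps; keys are GPU indices plus "cpu".
# The values below are illustrative for six 24GB cards.
max_memory = {i: "20GiB" for i in range(6)}
max_memory["cpu"] = "64GiB"  # spill over to RAM only if the GPUs fill up

model = AutoModelForSeq2SeqLM.from_pretrained(
    "google/ul2",
    device_map="auto",
    max_memory=max_memory,
)

# Confirm the model was actually sharded: this prints a dict mapping
# module names to the device (GPU index, "cpu", or "disk") they landed on.
print(model.hf_device_map)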

Of course, this answer assumes you have CUDA installed and that your environment can see the available GPUs; running nvidia-smi from a command line will confirm this. Please report back if you run into further issues.

Unfinished answered 8/3, 2023 at 19:55 Comment(8)
Hello, can you confirm that your technique actually distributes the model across multiple GPUs (i.e. does model-parallel loading), instead of just loading the model on one GPU if it is available? My question was not about loading the model on a GPU rather than a CPU, but about loading the same model across multiple GPUs using model parallelism. Second, even when I try that, I get TypeError: <MyTransformerModel>.__init__() got an unexpected keyword argument 'device'; for information, I'm on transformers==4.26.0 – Diazo
@andrea, I just updated my answer; it was previously incorrect. The parameter is device_map, not device... my apologies. I just tested on my end and I can confirm that ul2 loads in parallel across my available GPUs. – Unfinished
Great, it works now! Thanks so much for this, I have accepted your answer. – Diazo
Hi @Unfinished, another question: do you know how to pass the input in a form the model can accept? The input_ids should be on the same device as the model if I understand correctly, but I am not sure how that works when the model is distributed. At the moment, if I do input_ids = tokenizer.encode(text, return_tensors='pt') and then outputs = model.generate(input_ids), I get RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:2! (when checking argument for argument mat2 in method wrapper_mm) – Diazo
Hi @andrea! Happy to help. Check out this thread on the HuggingFace forum where I had a similar issue and eventually a solution was found :) discuss.huggingface.co/t/… (a minimal sketch of the usual fix follows this comment thread) – Unfinished
For future reference, I have also found this very good resource explaining your options for distributing larger models: huggingface.co/blog/accelerate-large-models – Diazo
I think he is asking about how to run the model on multiple GPUs, not on a specific GPU. – Martelli
This does not work for me; the model still does not fit across multiple GPUs and I still get OOM. – Tical
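
Following up on the RuntimeError discussed in the comments above: with a model sharded via device_map="auto", the commonly suggested fix is to move the input tensors to the device holding the model's first layers before calling generate(); accelerate's hooks then shuttle intermediate activations between the GPUs. A minimal sketch, assuming the model and tokenizer were loaded as in the answer (the prompt text is purely illustrative, and in most setups model.device points at the first shard, typically cuda:0; model.hf_device_map shows the exact placement if in doubt):

# Move the inputs to the device of the model's first shard (typically cuda:0).
text = "Translate English to German: How old are you?"  # illustrative prompt
input_ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)

# generate() runs the forward pass across the sharded model; the resulting
# token ids can be decoded regardless of which device they end up on.
outputs = model.generate(input_ids, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))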
