Issues when using HuggingFace `accelerate` with `fp16`
I'm trying to use the accelerate module to parallelize my model training, but I'm having trouble using it to train models with fp16. If I load the model with torch_dtype=torch.float16, I get ValueError: Attempting to unscale FP16 gradients. But if I don't load the model in half precision, I get a CUDA out of memory error. Below are the details of the problem:

I'm fine-tuning a 2.7B causal LM on a single A100 40GB GPU (I will eventually work on a much larger model, but I want to use this one to test my training process and make sure everything works as expected). I initially started with a training script without accelerate and without Trainer. I can successfully train the model when I load it in half precision with:

# here device = 'cuda'
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16).to(device)

I have to load the model in half precision, otherwise I get a CUDA out of memory error. I simplified my script and uploaded it here as a demonstration. With the model loaded in half precision, training takes about 27GB of the 40GB of GPU memory, so there is plenty of room left.
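
For reference, the rest of that script is a standard PyTorch training loop along these lines (a simplified sketch, not the exact Gist code; model_name, train_dataloader, and the learning rate are placeholders, and batches are assumed to contain input_ids, attention_mask, and labels):

import torch
from transformers import AutoModelForCausalLM

device = "cuda"
# Load directly in half precision so the 2.7B model fits in 40GB.
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16
).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

model.train()
for batch in train_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}
    loss = model(**batch).loss  # labels in the batch give the LM loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()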

Now I want to use the accelerate module (potentially with deepspeed for larger models in the future) in my training script. I made the following changes:

model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
accelerator = Accelerator(cpu=False, mixed_precision='fp16')
...
model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader, eval_dataloader
)
...
# in the training loop, I changed `loss.backward()` to:
accelerator.backward(loss)
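
Putting those pieces together, the shape of the training loop becomes roughly the following (a sketch that fills in the elided parts with illustrative names; the explicit .to(device) calls go away because accelerator.prepare handles device placement):

import torch
from accelerate import Accelerator
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
accelerator = Accelerator(cpu=False, mixed_precision='fp16')
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader, eval_dataloader
)

model.train()
for batch in train_dataloader:
    # batches from the prepared dataloader are already on the right device
    loss = model(**batch).loss
    accelerator.backward(loss)  # replaces loss.backward()
    optimizer.step()
    optimizer.zero_grad()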

Here is the updated script. I also configured accelerate with accelerate config; the default_config.yaml can be found in the same Gist.

Now when I launch the script on the same machine with accelerate launch --fp16 <script_path>, I get ValueError: Attempting to unscale FP16 gradients. So I removed torch_dtype=torch.float16 from the model loading and relied on accelerate to downcast the model weights to half precision, but then I get a CUDA out of memory error.

To summarize:

  1. I can train the model successfully when loading it with torch_dtype=torch.float16 and not using accelerate.
  2. With accelerate, I cannot load the model with torch_dtype=torch.float16: it fails with ValueError: Attempting to unscale FP16 gradients.
  3. If I don't load the model with torch_dtype=torch.float16 and instead rely on accelerate's fp16 mixed precision, I get a CUDA out of memory error.

So my question is: how can I train the model on a single A100 40GB GPU with accelerate?

I included one script without accelerate and one with accelerate. I would like them to have the same behavior in terms of GPU memory consumption.

Blastosphere answered 21/3, 2023 at 15:2
If I see correctly, then accelerator.prepare will only autocast the model's forward function to fp16, but not the model weights themselves? This might explain why you are not able to load the model in half precision when using accelerate mixed precision. – Birthright
Did you solve it? I have the exact same issue. – Lehmann

I fixed it by taking cast_training_params from the HF SDXL training script.

They load the models in fp32, then move them to the GPU and convert them, like this:

unet.to(accelerator.device, dtype=weight_dtype)

But the trainable parameters are set back to fp32 before training starts, using this function:

from typing import List, Union

import torch

def cast_training_params(model: Union[torch.nn.Module, List[torch.nn.Module]], dtype=torch.float32):
    if not isinstance(model, list):
        model = [model]
    for m in model:
        for param in m.parameters():
            # only upcast trainable parameters to fp32; frozen weights stay in half precision
            if param.requires_grad:
                param.data = param.to(dtype)
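
Applied to the question's setup, the pattern would look roughly like this (a sketch, assuming model_name and the dataloaders exist as in the question; weight_dtype mirrors the SDXL script):

import torch
from accelerate import Accelerator
from transformers import AutoModelForCausalLM

accelerator = Accelerator(mixed_precision='fp16')
weight_dtype = torch.float16

# Load in fp32, then move the whole model to the GPU in fp16.
model = AutoModelForCausalLM.from_pretrained(model_name)
model.to(accelerator.device, dtype=weight_dtype)

# Upcast only the trainable parameters back to fp32, so the grad
# scaler has fp32 weights to unscale into.
cast_training_params(model, dtype=torch.float32)

optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-5
)
model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader, eval_dataloader
)

Note that this mainly pays off when only a subset of the parameters is trainable (e.g. LoRA adapters, as in the SDXL script); if every parameter requires grad, the upcast brings the whole model back to fp32.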
Luminiferous answered 11/4 at 6:55
