CUDA out of memory using the Hugging Face Trainer during validation (training is fine)

When fine-tuning with the Hugging Face Trainer, training runs fine but it fails with CUDA OOM during validation. Even reducing eval_accumulation_steps to 1 did not help.

I followed the procedure in this question: Why is evaluation set draining the memory in pytorch hugging face? It did not work for me.

When I remove the evaluation dataset from the Trainer, everything works fine. But if I add it back as in the code below, it runs out of memory right after the 10th training step (i.e. when evaluation kicks in).


import os
os.environ["CUDA_VISIBLE_DEVICES"]="0"
import torch
import torch.nn as nn
import bitsandbytes as bnb
from transformers import AutoTokenizer, AutoConfig, AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    'bigscience/bloom-1b1',
    load_in_8bit=True,
    device_map='auto',
)

tokenizer = AutoTokenizer.from_pretrained('bigscience/bloom-1b1')



from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r= 8, #attention heads
    lora_alpha=32, #alpha scaling
    target_modules=["query_key_value"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM" # set this for CLM or Seq2Seq
)

model = get_peft_model(model, config)


import transformers
trainer = transformers.Trainer(
    model=model,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets["validation"],
    args=transformers.TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        warmup_steps=2,
        max_steps=60,
        learning_rate=2e-4,
        evaluation_strategy = 'steps',
        eval_accumulation_steps = 1, 
        eval_steps = 10,
        seed =  42,
        report_to="wandb",
        fp16=True,
        logging_steps=1,
        output_dir='outputs'
    ),
    data_collator=transformers.DataCollatorForSeq2Seq(tokenizer, pad_to_multiple_of=8, return_tensors="pt", padding=True)

    # data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False)
)
model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
trainer.train()

error:

OutOfMemoryError                          Traceback (most recent call last)
in <cell line: 26>()
     24 )
     25 model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
---> 26 trainer.train()
     27
     28 wandb.finish()

17 frames
/usr/local/lib/python3.10/dist-packages/torch/nn/functional.py in cross_entropy(input, target, weight, size_average, ignore_index, reduce, reduction, label_smoothing)
   3027     if size_average is not None or reduce is not None:
   3028         reduction = _Reduction.legacy_get_string(size_average, reduce)
-> 3029     return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
   3030
   3031

OutOfMemoryError: CUDA out of memory. Tried to allocate 5.56 GiB (GPU 0; 14.75 GiB total capacity; 12.58 GiB already allocated; 840.81 MiB free; 12.86 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
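The error message itself suggests setting max_split_size_mb via PYTORCH_CUDA_ALLOC_CONF. A minimal sketch of how that could be set, assuming an arbitrary 128 MB value (it must be set before CUDA is first initialised):

import os

# Suggested by the error message: limit allocator block splitting to reduce fragmentation.
# The 128 MB value here is an arbitrary example; set it before the first CUDA call.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch  # import torch only after the environment variables are in place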

Calvary asked 17/8, 2023 at 11:45
Looks like a memory issue. CUDA cannot find enough space to allocate for the process: it needs gigabytes of free memory while your GPU has less than that. – Iconium
@QingGuo I think so, but it is very weird because I set eval_accumulation_steps = 1 and it still did not work. – Calvary
Is the evaluation dataset mandatory for the training process? – Iconium
I ask because you mention that it works fine if you omit the evaluation dataset from the Trainer. – Iconium
@QingGuo It is optional, but I would like to do validation during training. – Calvary

First, ensure that you have accelerate >= 0.21.0 installed:

pip install -U accelerate

Then, try using auto_find_batch_size, which automatically retries with a smaller batch size when a CUDA OOM is hit:

args=transformers.TrainingArguments(
        auto_find_batch_size=True,
        gradient_accumulation_steps=4,
        warmup_steps=2,
        max_steps=60,
        learning_rate=2e-4,
        evaluation_strategy = 'steps',
        eval_accumulation_steps = 1, 
        eval_steps = 10,
        seed =  42,
        report_to="wandb",
        fp16=True,
        logging_steps=1,
        output_dir='outputs'
    )

Then, try manually setting the evaluation batch size; note that per_device_eval_batch_size defaults to 8, which is much larger than the per_device_train_batch_size=1 you are using:

args=transformers.TrainingArguments(
        per_device_train_batch_size=1,
        per_device_eval_batch_size=1,
        gradient_accumulation_steps=4,
        warmup_steps=2,
        max_steps=60,
        learning_rate=2e-4,
        evaluation_strategy = 'steps',
        eval_accumulation_steps = 1, 
        eval_steps = 10,
        seed =  42,
        report_to="wandb",
        fp16=True,
        logging_steps=1,
        output_dir='outputs'
    )

Then, if it is still failing, try algorithmically reducing the memory footprint, e.g. with Adafactor and gradient checkpointing; see https://huggingface.co/docs/transformers/perf_train_gpu_one

args=transformers.TrainingArguments(
        auto_find_batch_size=True,
        optim="adafactor", gradient_checkpointing=True,
        gradient_accumulation_steps=4,
        warmup_steps=2,
        max_steps=60,
        learning_rate=2e-4,
        evaluation_strategy = 'steps',
        eval_accumulation_steps = 1, 
        eval_steps = 10,
        seed =  42,
        report_to="wandb",
        fp16=True,
        logging_steps=1,
        output_dir='outputs'
    )

Note: if that is still not enough, try 8-bit optimizers: https://huggingface.co/docs/transformers/perf_train_gpu_one#8bit-adam
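For example, since bitsandbytes is already installed in the question's environment, 8-bit Adam can be selected straight from TrainingArguments; a minimal sketch, assuming the optim="adamw_bnb_8bit" value available in recent transformers releases:

args=transformers.TrainingArguments(
        auto_find_batch_size=True,
        optim="adamw_bnb_8bit",  # 8-bit Adam backed by bitsandbytes (assumes a recent transformers version)
        gradient_checkpointing=True,
        gradient_accumulation_steps=4,
        warmup_steps=2,
        max_steps=60,
        learning_rate=2e-4,
        evaluation_strategy='steps',
        eval_accumulation_steps=1,
        eval_steps=10,
        per_device_eval_batch_size=1,
        seed=42,
        report_to="wandb",
        fp16=True,
        logging_steps=1,
        output_dir='outputs'
    )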

Finally, if all else fails, consider using multiple GPUs, see https://huggingface.co/docs/transformers/perf_train_gpu_many

Q: Why can't the library handle the batch size automatically during evaluation?

A: Because it's not a feature that has been implemented yet; try contributing to the library =)

Q: What do I lose if I algorithmically scale down the memory footprint?

A: Most probably nothing, as long as your model is well tuned and training eventually converges. But technically there is some precision lost by memory-saving tricks, e.g. Adafactor keeps only a factored approximation of the optimizer state, and 8-bit optimizers quantize it.

Q: Why is it that the code works as a script but doesn't work in a cell inside Jupyter?

A: Most probably because accelerate has more control over the process when you run it as a script; inside a Jupyter cell the code is constrained by the long-lived Jupyter Python kernel, which also tends to keep earlier models and tensors alive between runs.
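A hedged notebook-only workaround that often helps before re-running the training cell: drop the old references and clear the CUDA cache (variable names follow the question's code):

import gc
import torch

# Release GPU memory held over from a previous run of the cell.
del trainer, model          # drop the Python references first
gc.collect()                # collect the now-unreferenced objects
torch.cuda.empty_cache()    # hand cached blocks back to the allocator
print(f"{torch.cuda.memory_allocated() / 1e9:.2f} GB still allocated")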

Q: How did everyone else train the model with the example code I see in blog posts, while on my 16 GB GPU it just fails?

A: Most probably they are using A100 GPUs with 40 GB of VRAM when they demonstrate the code on Google Colab. For a hardware comparison (a little outdated but still relevant), see https://lambdalabs.com/blog/best-gpu-2022-sofar
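To check what you actually have to work with (the same numbers that show up in the OOM message), a quick sketch; torch.cuda.mem_get_info is available in recent PyTorch versions:

import torch

free, total = torch.cuda.mem_get_info(0)   # bytes free / total on GPU 0
print(torch.cuda.get_device_name(0))
print(f"{free / 1e9:.2f} GB free of {total / 1e9:.2f} GB total")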


P/S: The evaluation step breaking with CUDA OOM happens quite often, and until hardware catches up we have to handle it in code, budgeting both the RAM needed for a training batch and the additional space needed during evaluation.
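One more hedged option, relevant if you also pass a compute_metrics function: the Trainer then accumulates the full (batch, seq_len, vocab_size) logits for metric computation, which alone can blow up memory. Recent transformers versions accept a preprocess_logits_for_metrics callback to shrink what gets accumulated; a minimal sketch against the question's setup (the function name is illustrative):

import torch

def keep_only_predictions(logits, labels):
    # Reduce the logits to predicted token ids per batch so the Trainer
    # does not have to accumulate the full vocabulary-sized tensor.
    if isinstance(logits, tuple):  # some models return extra tensors alongside the logits
        logits = logits[0]
    return logits.argmax(dim=-1)

trainer = transformers.Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=transformers.DataCollatorForSeq2Seq(tokenizer, pad_to_multiple_of=8, return_tensors="pt", padding=True),
    preprocess_logits_for_metrics=keep_only_predictions,
)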

Gesner answered 17/8, 2023 at 14:55
bloom-1b1 is quite a large model for consumer-grade GPUs, but I've encountered OOM errors in the past week with llama-2-7b on my machine due to an "old" peft version. Not all OOMs are the same, I guess. :^) – Fredel
If all else fails, use_cpu=True can go inside the TrainingArguments. I know you don't want it, but it does seem to work after every other attempt fails. – Lifework
