CUDA out of memory using the Hugging Face Trainer during validation (training is fine)

When fine-tuning with the Hugging Face Trainer, training runs fine but it fails with CUDA OOM during validation. Even reducing eval_accumulation_steps to 1 did not help.

I followed the procedure in this question: Why is evaluation set draining the memory in pytorch hugging face? It did not work for me.

When I remove the evaluation dataset from the Trainer, everything works fine. But if I add it back as in the code below, it runs out of memory right after the 10th training step (i.e. when evaluation kicks in).


import os
os.environ["CUDA_VISIBLE_DEVICES"]="0"
import torch
import torch.nn as nn
import bitsandbytes as bnb
from transformers import AutoTokenizer, AutoConfig, AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    'bigscience/bloom-1b1',
    load_in_8bit=True,
    device_map='auto',
)

tokenizer = AutoTokenizer.from_pretrained('bigscience/bloom-1b1')



from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r= 8, #attention heads
    lora_alpha=32, #alpha scaling
    target_modules=["query_key_value"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM" # set this for CLM or Seq2Seq
)

model = get_peft_model(model, config)


import transformers
trainer = transformers.Trainer(
    model=model,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets["validation"],
    args=transformers.TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        warmup_steps=2,
        max_steps=60,
        learning_rate=2e-4,
        evaluation_strategy = 'steps',
        eval_accumulation_steps = 1, 
        eval_steps = 10,
        seed =  42,
        report_to="wandb",
        fp16=True,
        logging_steps=1,
        output_dir='outputs'
    ),
    data_collator=transformers.DataCollatorForSeq2Seq(tokenizer, pad_to_multiple_of=8, return_tensors="pt", padding=True)

    # data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False)
)
model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
trainer.train()

error:

OutOfMemoryError                          Traceback (most recent call last)
in <cell line: 26>()
     24 )
     25 model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
---> 26 trainer.train()
     27
     28 wandb.finish()

17 frames
/usr/local/lib/python3.10/dist-packages/torch/nn/functional.py in cross_entropy(input, target, weight, size_average, ignore_index, reduce, reduction, label_smoothing)
   3027     if size_average is not None or reduce is not None:
   3028         reduction = _Reduction.legacy_get_string(size_average, reduce)
-> 3029     return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
   3030
   3031

OutOfMemoryError: CUDA out of memory. Tried to allocate 5.56 GiB (GPU 0; 14.75 GiB total capacity; 12.58 GiB already allocated; 840.81 MiB free; 12.86 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
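The error message itself suggests setting max_split_size_mb via PYTORCH_CUDA_ALLOC_CONF. A minimal sketch of how that could be set, assuming an arbitrary 128 MB value (it must be set before CUDA is first initialised):

import os

# Suggested by the error message: limit allocator block splitting to reduce fragmentation.
# The 128 MB value here is an arbitrary example; set it before the first CUDA call.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch  # import torch only after the environment variables are in place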

Calvary asked 17/8, 2023 at 11:45
Looks like a memory issue. CUDA cannot find enough space to allocate for the process: it needs gigabytes of free memory while your GPU has less than that. – Iconium
@QingGuo I think so, but it is very weird because I set eval_accumulation_steps = 1 and it still did not work. – Calvary
Is the evaluation dataset mandatory for the training process? – Iconium
I ask because you mention that it works fine if you omit the evaluation dataset from the Trainer. – Iconium
@QingGuo It is optional, but I would like to do validation during training. – Calvary

First, ensure that you have accelerate >= 0.21.0 installed:

pip install -U accelerate

Then, try using auto_find_batch_size, which automatically retries with a smaller batch size when a CUDA OOM is hit:

args=transformers.TrainingArguments(
        auto_find_batch_size=True,
        gradient_accumulation_steps=4,
        warmup_steps=2,
        max_steps=60,
        learning_rate=2e-4,
        evaluation_strategy = 'steps',
        eval_accumulation_steps = 1, 
        eval_steps = 10,
        seed =  42,
        report_to="wandb",
        fp16=True,
        logging_steps=1,
        output_dir='outputs'
    )

Then, try manually setting the evaluation batch size; note that per_device_eval_batch_size defaults to 8, which is much larger than the per_device_train_batch_size=1 you are using:

args=transformers.TrainingArguments(
        per_device_train_batch_size=1,
        per_device_eval_batch_size=1,
        gradient_accumulation_steps=4,
        warmup_steps=2,
        max_steps=60,
        learning_rate=2e-4,
        evaluation_strategy = 'steps',
        eval_accumulation_steps = 1, 
        eval_steps = 10,
        seed =  42,
        report_to="wandb",
        fp16=True,
        logging_steps=1,
        output_dir='outputs'
    )

Then, if it is still failing, try algorithmically reducing the memory footprint, e.g. with Adafactor and gradient checkpointing; see https://huggingface.co/docs/transformers/perf_train_gpu_one

args=transformers.TrainingArguments(
        auto_find_batch_size=True,
        optim="adafactor", gradient_checkpointing=True,
        gradient_accumulation_steps=4,
        warmup_steps=2,
        max_steps=60,
        learning_rate=2e-4,
        evaluation_strategy = 'steps',
        eval_accumulation_steps = 1, 
        eval_steps = 10,
        seed =  42,
        report_to="wandb",
        fp16=True,
        logging_steps=1,
        output_dir='outputs'
    )

Note: if that is still not enough, try 8-bit optimizers: https://huggingface.co/docs/transformers/perf_train_gpu_one#8bit-adam
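For example, since bitsandbytes is already installed in the question's environment, 8-bit Adam can be selected straight from TrainingArguments; a minimal sketch, assuming the optim="adamw_bnb_8bit" value available in recent transformers releases:

args=transformers.TrainingArguments(
        auto_find_batch_size=True,
        optim="adamw_bnb_8bit",  # 8-bit Adam backed by bitsandbytes (assumes a recent transformers version)
        gradient_checkpointing=True,
        gradient_accumulation_steps=4,
        warmup_steps=2,
        max_steps=60,
        learning_rate=2e-4,
        evaluation_strategy='steps',
        eval_accumulation_steps=1,
        eval_steps=10,
        per_device_eval_batch_size=1,
        seed=42,
        report_to="wandb",
        fp16=True,
        logging_steps=1,
        output_dir='outputs'
    )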

Finally, if all else fails, consider using multiple GPUs, see https://huggingface.co/docs/transformers/perf_train_gpu_many

Q: Why can't the library handle the batch size automatically during evaluation?

A: Because it's not a feature that has been implemented yet; try contributing to the library =)

Q: What do I lose if I algorithmically scale down the memory footprint?

A: Most probably nothing, as long as your model is well tuned and training eventually converges. But technically there is some precision lost by memory-saving tricks, e.g. Adafactor keeps only a factored approximation of the optimizer state, and 8-bit optimizers quantize it.

Q: Why is it that the code works as a script but doesn't work in a cell inside Jupyter?

A: Most probably because accelerate has more control over the process when you run it as a script; inside a Jupyter cell the code is constrained by the long-lived Jupyter Python kernel, which also tends to keep earlier models and tensors alive between runs.
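A hedged notebook-only workaround that often helps before re-running the training cell: drop the old references and clear the CUDA cache (variable names follow the question's code):

import gc
import torch

# Release GPU memory held over from a previous run of the cell.
del trainer, model          # drop the Python references first
gc.collect()                # collect the now-unreferenced objects
torch.cuda.empty_cache()    # hand cached blocks back to the allocator
print(f"{torch.cuda.memory_allocated() / 1e9:.2f} GB still allocated")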

Q: How did everyone else train the model with the example code I see in blog posts, while on my 16 GB GPU it just fails?

A: Most probably they are using A100 GPUs with 40 GB of VRAM when they demonstrate the code on Google Colab. For a hardware comparison (a little outdated but still relevant), see https://lambdalabs.com/blog/best-gpu-2022-sofar
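To check what you actually have to work with (the same numbers that show up in the OOM message), a quick sketch; torch.cuda.mem_get_info is available in recent PyTorch versions:

import torch

free, total = torch.cuda.mem_get_info(0)   # bytes free / total on GPU 0
print(torch.cuda.get_device_name(0))
print(f"{free / 1e9:.2f} GB free of {total / 1e9:.2f} GB total")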


P/S: The evaluation step breaking with CUDA OOM happens quite often, and until hardware catches up we have to handle it in code, budgeting both the RAM needed for a training batch and the additional space needed during evaluation.
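One more hedged option, relevant if you also pass a compute_metrics function: the Trainer then accumulates the full (batch, seq_len, vocab_size) logits for metric computation, which alone can blow up memory. Recent transformers versions accept a preprocess_logits_for_metrics callback to shrink what gets accumulated; a minimal sketch against the question's setup (the function name is illustrative):

import torch

def keep_only_predictions(logits, labels):
    # Reduce the logits to predicted token ids per batch so the Trainer
    # does not have to accumulate the full vocabulary-sized tensor.
    if isinstance(logits, tuple):  # some models return extra tensors alongside the logits
        logits = logits[0]
    return logits.argmax(dim=-1)

trainer = transformers.Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=transformers.DataCollatorForSeq2Seq(tokenizer, pad_to_multiple_of=8, return_tensors="pt", padding=True),
    preprocess_logits_for_metrics=keep_only_predictions,
)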

Gesner answered 17/8, 2023 at 14:55
bloom-1b1 is quite a large model for consumer-grade GPUs, but I've encountered OOM errors in the past week with llama-2-7b on my machine due to an "old" peft version. Not all OOMs are the same, I guess. :^) – Fredel
If all else fails, use_cpu=True can go inside the TrainingArguments. I know you don't want it, but it does seem to work after every other attempt fails. – Lifework
