How to use Huggingface Trainer with multiple GPUs?

Say I have the following model (from this script):

from transformers import AutoTokenizer, GPT2LMHeadModel, AutoConfig

config = AutoConfig.from_pretrained(
    "gpt2",
    vocab_size=len(tokenizer),
    n_ctx=context_length,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
)
model = GPT2LMHeadModel(config)

I'm currently using these training arguments for the Trainer:

from transformers import Trainer, TrainingArguments

args = TrainingArguments(
    output_dir="codeparrot-ds",
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    evaluation_strategy="steps",
    eval_steps=5_000,
    logging_steps=5_000,
    gradient_accumulation_steps=8,
    num_train_epochs=1,
    weight_decay=0.1,
    warmup_steps=1_000,
    lr_scheduler_type="cosine",
    learning_rate=5e-4,
    save_steps=5_000,
    fp16=True,
    push_to_hub=True,
)

trainer = Trainer(
    model=model,
    tokenizer=tokenizer,
    args=args,
    data_collator=data_collator,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["valid"],
)
trainer.train()
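
One thing to keep in mind when moving this to multiple GPUs (a general property of the Trainer, illustrated here with the numbers from the arguments above and a hypothetical 8-GPU machine): per_device_train_batch_size is applied per GPU, so the effective batch size per optimizer step scales with the number of devices.

# Effective (global) batch size per optimizer step: every GPU processes
# per_device_train_batch_size examples, and gradients are accumulated
# gradient_accumulation_steps times before each update.
per_device_train_batch_size = 32
gradient_accumulation_steps = 8
num_gpus = 8  # hypothetical 8-GPU machine
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
print(effective_batch_size)  # 2048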

How can I adapt this so the Trainer will use multiple GPUs (e.g., 8)?

I found this SO question, but they didn't use the Trainer and just used PyTorch's DataParallel:

model = torch.nn.DataParallel(model, device_ids=[0,1])

The Huggingface docs on training with multiple GPUs are not really clear to me and don't have an example of using the Trainer. Instead, I found here that they pass nproc_per_node as an argument when launching their Python file, but that seems specific to their script and it's not clear how to use it in general. This contradicts this discussion on their forum, which says "The Trainer class automatically handles multi-GPU training, you don’t have to do anything special." So it's confusing: on one hand they mention that extra steps are needed to train on multiple GPUs, and on the other that the Trainer handles it automatically. I'm not sure what to do.
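
For what it's worth, a quick way to see what the Trainer will actually do on a given machine is to inspect the TrainingArguments it builds; here is a minimal sketch (output_dir="tmp" is just a throwaway value, not part of the script above):

import torch
from transformers import TrainingArguments

print(torch.cuda.device_count())  # GPUs visible to this process

args = TrainingArguments(output_dir="tmp")
print(args.n_gpu)          # GPUs this process will use: all visible GPUs with a plain python launch,
                           # 1 per process when launched with torchrun or accelerate launch
print(args.parallel_mode)  # ParallelMode.NOT_DISTRIBUTED (DataParallel) vs ParallelMode.DISTRIBUTED (DDP)
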

Comnenus answered 22/3, 2023 at 15:10 Comment(3)
Unfortunately, there is no magic single argument/one-liner (yet). But with a few more line changes to your code you can use huggingface.co/docs/transformers/accelerate and huggingface.co/docs/transformers/… – Malamut
I'm trying to do this right now; it seems like there still is no way of using the plain Trainer class to do this, but if someone has figured it out, please answer! – Sculpsit
The script should work normally without any change. I tested a (somewhat) similar script on Kaggle (T4 x2), and it automatically used both GPUs. – Chantey

I used one of the example Python scripts (e.g. run_clm.py), where trainer.train() is already called: https://github.com/huggingface/transformers/tree/main/examples/pytorch/language-modeling

Make a finetune.sh bash file that executes the Python script inside:

#!/bin/bash
export LD_LIBRARY_PATH=/home/miniconda3/envs/HF/lib/python3.7/.../nvidia/cublas/lib/:$LD_LIB
export CUDA_VISIBLE_DEVICES=0,1  # will use two GPUs
###############################
python run_clm.py --options...

Then run it via bash; it'll run over the two GPUs as defined.

$ nohup ./finetune.sh & 

If you want to run over all 8 available GPUs, simply comment out the following line:

#export CUDA_VISIBLE_DEVICES=0,1 # will use all GPUs
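
If you prefer to restrict the visible GPUs from inside the script instead of the shell, the same thing can be done in Python, as long as it happens before anything initializes CUDA (a small illustrative sketch, not part of the original answer):

import os

# Same effect as the export line above; must run before torch/transformers touch the GPU.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

import torch
print(torch.cuda.device_count())  # -> 2
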
Tortoise answered 10/5, 2023 at 18:48 Comment(0)

You can create a SLURM batch file, e.g. script.sh, for running on 8 GPUs:

#! /bin/bash
#SBATCH -N 1
#SBATCH --ntasks-per-node=256
#SBATCH --gres=gpu:A100-SXM4:8
#SBATCH --time=35:00:00
#SBATCH --error=job.%J.err
#SBATCH --output=job.%J.out

echo "Starting at `date`"
echo "Running on hosts: $SLURM_NODELIST"
echo "Running on $SLURM_NNODES nodes."
echo "Running $SLURM_NTASKS tasks."
echo "Job id is $SLURM_JOBID"
echo "Job submission directory is : $SLURM_SUBMIT_DIR"
cd $SLURM_SUBMIT_DIR


#activating environment
source Conda/bin/activate
conda activate trocr1

#python script for running. 

# nproc_per_node = number of processes per node, i.e. the number of GPUs
python -m torch.distributed.launch \
    --nproc_per_node 8 demo.py \
    --tokenizer processor.feature_extractor\
    --args training_args\
    --compute_metrics compute_metrics\
    --train_dataset train_dataset\
    --eval_dataset eval_dataset\
    --data_collator default_data_collator\
    --predict_with_generate True\
    --evaluation_strategy "epoch"\
    --save_strategy "epoch"\
    --load_best_model_at_end True\
    --greater_is_better False\
    --metric_for_best_model "eval_cer"\
    --per_device_train_batch_size 8\
    --per_device_eval_batch_size 8\
    --fp16 False\
    --bf16 True\
    --output_dir "./model/"\
    --num_train_epochs 5\
    --save_total_limit 3\
    --warmup_steps 500\
    --weight_decay 0.01\
    --logging_steps 10000\
    --save_steps 5000\
    --eval_steps 5000\
    --report_to 'tensorboard'\
    --seed 42

Then submit this file using the "sbatch script.sh" command on the terminal. I used this for fine-tuning a TrOCR model with the HF Trainer, and it worked. I was using a DGX server.
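
Inside the training script itself nothing GPU-specific is needed: the Trainer picks up the distributed environment set by the launcher and wraps the model in DistributedDataParallel on its own. If you want to check which process is which, here is a small sketch using the standard environment variables exported by torchrun (and by recent versions of torch.distributed.launch):

import os

# Standard variables set by the PyTorch launcher; with --nproc_per_node 8 there are 8 processes.
rank = int(os.environ.get("RANK", 0))              # global rank of this process
local_rank = int(os.environ.get("LOCAL_RANK", 0))  # GPU index on this node
world_size = int(os.environ.get("WORLD_SIZE", 1))  # total number of processes
print(f"process {rank}/{world_size} on GPU {local_rank}")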

These helped me a lot:

https://github.com/huggingface/transformers/tree/main/examples/pytorch/language-modeling

https://github.com/huggingface/transformers/tree/main/examples/pytorch#distributed-training-and-mixed-precision

Or, if you want to go with an interactive session, request resources with the appropriate SLURM command and, after allocation, just run the Python script. Change the specifications in script.sh as per your server.

Viviparous answered 13/9, 2023 at 6:9 Comment(0)

The Trainer class can automatically detect multiple GPUs. You just need to copy your code to Kaggle, enable the accelerator (multiple GPUs or a single GPU) in the notebook options, and check that training runs normally. Here is an example of mine: I tested the Trainer with both multiple GPUs and a single GPU, and training worked as expected in both cases, although the training time was not shorter than with a single GPU. Here is the notebook: https://www.kaggle.com/code/aisuko/text-classification-with-transformers/notebook?scriptVersionId=153710473

If you are using native PyTorch with your own custom training loop, here is the documentation for distributed training: https://huggingface.co/docs/transformers/accelerate#distributed-training-with--accelerate
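
For reference, here is a minimal sketch of the kind of Accelerate-based loop that page describes, with a toy model and dataset standing in for your own (everything below is illustrative, not code taken from the linked docs):

from accelerate import Accelerator
import torch
from torch.utils.data import DataLoader, TensorDataset

accelerator = Accelerator()

# Toy stand-ins so the sketch runs end to end; replace with your own model and data.
model = torch.nn.Linear(10, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
dataset = TensorDataset(torch.randn(64, 10), torch.randint(0, 2, (64,)))
loader = DataLoader(dataset, batch_size=8)
loss_fn = torch.nn.CrossEntropyLoss()

# Accelerate moves everything to the right device(s) and wraps the model for distributed training.
model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

model.train()
for inputs, labels in loader:
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), labels)
    accelerator.backward(loss)  # replaces loss.backward()
    optimizer.step()

Run with plain python it uses one device; run with "accelerate launch script.py" (after a one-time "accelerate config") it runs across however many GPUs you configured.
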

Kwashiorkor answered 5/12, 2023 at 11:38 Comment(0)
