Say I have the following model (from this script):
from transformers import AutoTokenizer, GPT2LMHeadModel, AutoConfig
config = AutoConfig.from_pretrained(
"gpt2",
vocab_size=len(tokenizer),
n_ctx=context_length,
bos_token_id=tokenizer.bos_token_id,
eos_token_id=tokenizer.eos_token_id,
)
model = GPT2LMHeadModel(config)
I'm currently using these training arguments for the Trainer:
from transformers import Trainer, TrainingArguments
args = TrainingArguments(
    output_dir="codeparrot-ds",
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    evaluation_strategy="steps",
    eval_steps=5_000,
    logging_steps=5_000,
    gradient_accumulation_steps=8,
    num_train_epochs=1,
    weight_decay=0.1,
    warmup_steps=1_000,
    lr_scheduler_type="cosine",
    learning_rate=5e-4,
    save_steps=5_000,
    fp16=True,
    push_to_hub=True,
)
trainer = Trainer(
    model=model,
    tokenizer=tokenizer,
    args=args,
    data_collator=data_collator,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["valid"],
)
trainer.train()
How can I adapt this so the Trainer will use multiple GPUs (e.g., 8)?
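(I'm assuming the GPUs themselves are visible to PyTorch; as a sanity check I'd run something like the snippet below, where the expected count of 8 is just an assumption about my machine.)

import torch

# expect this to print 8 if all GPUs are visible to PyTorch
print(torch.cuda.device_count())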
I found this SO question, but they didn't use the Trainer and instead just used PyTorch's DataParallel:
model = torch.nn.DataParallel(model, device_ids=[0,1])
The Hugging Face docs on training with multiple GPUs are not really clear to me and don't have an example that uses the Trainer. Instead, I found here that they pass arguments to their Python file with nproc_per_node, but that seems too specific to their script and it's not clear how to apply it in general. This also seems to contradict this discussion on their forum, which says "The Trainer class automatically handles multi-GPU training, you don't have to do anything special." So I'm confused: on the one hand the docs suggest extra steps are needed to train on multiple GPUs, and on the other hand the forum says the Trainer handles it automatically. I'm not sure what to do.
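In case it helps to make the question concrete, this is my rough guess at what the launch would look like based on that nproc_per_node example (the script name train.py is hypothetical, and I haven't verified that this is all that's needed):

# hypothetical launch for the unchanged Trainer script above on 8 GPUs
torchrun --nproc_per_node=8 train.py

Would that be enough on its own, or do I also need to change the TrainingArguments/Trainer code above?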