So going the AMP: Automatic Mixed Precision Training tutorial for Normal networks, I found out that there are two versions, Automatic
and GradScaler
. I just want to know if it's advisable / necessary to use the GradScaler
with the training becayse it is written in the document that:
Gradient scaling helps prevent gradients with small magnitudes from flushing to zero (“underflowing”) when training with mixed precision.
scaler = torch.cuda.amp.GradScaler()
for epoch in range(1):
for input, target in zip(data, targets):
with torch.cuda.amp.autocast():
output = net(input)
loss = loss_fn(output, target)
scaler.scale(loss).backward()
scaler.step(opt)
scaler.update()
opt.zero_grad()
Also, looking at NVIDIA Apex Documentation for PyTorch, they have used it as,
from apex import amp
model, optimizer = amp.initialize(model, optimizer)
loss = criterion(…)
with amp.scale_loss(loss, optimizer) as scaled_loss:
scaled_loss.backward()
optimizer.step()
I think this is what GradScaler
does too so I think it is a must. Can someone help me with the query here.