Is GradScaler necessary with Mixed precision training with pytorch?
So going through the AMP: Automatic Mixed Precision Training tutorial for normal networks, I found out that there are two parts, autocast and GradScaler. I just want to know if it is advisable/necessary to use GradScaler during training, because the documentation says:

Gradient scaling helps prevent gradients with small magnitudes from flushing to zero (“underflowing”) when training with mixed precision.

scaler = torch.cuda.amp.GradScaler()
for epoch in range(1):
    for input, target in zip(data, targets):
        with torch.cuda.amp.autocast():
            output = net(input)
            loss = loss_fn(output, target)

        scaler.scale(loss).backward()  # backprop on the scaled loss -> scaled gradients
        scaler.step(opt)               # unscales gradients; skips the step if they contain inf/nan
        scaler.update()                # adjusts the scale factor for the next iteration
        opt.zero_grad()

Also, looking at the NVIDIA Apex documentation for PyTorch, they use it like this:

from apex import amp

model, optimizer = amp.initialize(model, optimizer)

loss = criterion(…)
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
optimizer.step()

I think this is what GradScaler does too, so I think it is a must. Can someone help me with this query?

Flurried answered 7/6, 2022 at 16:46 Comment(0)

Short answer: yes, your model may fail to converge without GradScaler().

There are three basic problems with using FP16:

  • Weight updates: with half precision, 1 + 0.0001 rounds to 1. autocast() takes care of this one.
  • Vanishing gradients: with half precision, anything smaller than roughly 2^-14 flushes to zero, as opposed to single precision's 2^-126. GradScaler() takes care of this one (see the short sketch below).
  • Explosive loss: similarly, overflow is also much more likely with half precision. This is also managed by the autocast() context.
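
As a rough illustration of the second point (a toy value I picked, not from the answer): a gradient that underflows in fp16 survives once it is scaled up, which is exactly what GradScaler automates.

import torch

# A tiny fp32 gradient flushes to zero when cast to fp16,
# but survives after scaling.
g = torch.tensor([1e-8])      # a tiny gradient in fp32
print(g.half())               # tensor([0.], dtype=torch.float16) -> underflow
print((g * 65536).half())     # scaled value is representable in fp16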
Vincents answered 8/6, 2022 at 14:18 Comment(1)
I had trouble finding concrete resources on the matter, so thank you for a clear answer! – Dylane

Yes, gradient scaling is crucial.

There are usually two problems with using low-precision FP16 compared to FP32.

  1. Arithmetic underflow/overflow: In fp16, when update/param < 2^-11 (≈ 0.00049), the parameter update has no effect. This means the weight update (weight += lr*gradient), as shown in the example below, won't be effective. In reduction operations, fp16 can also lead to arithmetic overflow. The solution is to use FP32 wherever underflow/overflow might happen, which is taken care of by autocast in PyTorch AMP.
PyTorch autocasting behaviour:
  • Ops autocast to fp16: matmul, linear, conv2d, LSTMCell, etc.
  • Ops autocast to fp32: pow, sum, normalize, softmax, etc.
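
A quick sanity check of this behaviour (my own addition, assuming a CUDA device is available) is to inspect the output dtypes under autocast:

import torch

# matmul is on the fp16 autocast list, softmax on the fp32 list
a = torch.randn(8, 8, device='cuda')
with torch.cuda.amp.autocast():
    print(torch.matmul(a, a).dtype)       # torch.float16
    print(torch.softmax(a, dim=1).dtype)  # torch.float32
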
# Imprecise weight update: the increment is below fp16 resolution at 1.0
p = torch.tensor([1.0], dtype=torch.float32)
print(p.dtype, p + 0.0001)  # weight += lr*gradient
p = torch.tensor([1.0], dtype=torch.float16)
print(p.dtype, p + 0.0001, '-> underflow')

# output
torch.float32 tensor([1.0001])
torch.float16 tensor([1.], dtype=torch.float16) -> underflow

# reduction operation
a = torch.FloatTensor(4096).fill_(16.0) # a tensor of 4096 values, each 16.0
print(a.dtype, a.sum())
a = torch.HalfTensor(4096).fill_(16.0)
print(a.dtype, a.sum(), '-> overflow')

# output
torch.float32 tensor(65536.)
torch.float16 tensor(inf, dtype=torch.float16) -> overflow
  2. Loss scaling: The other problem is that gradient values become zero when converted from FP32 to FP16, because they simply lie outside the FP16 range, as shown in the figure from the NVIDIA docs. Notice that the FP16 range would be sufficient, but much of it is left unused. The simple solution is to scale the gradients up (by scaling the loss before backward) so they don't become zeros in FP16; that's why gradient scaling is necessary in mixed precision (see the sketch after the figure).

[Figure: loss scaling, from the NVIDIA mixed-precision docs — gradient values vs. the representable FP16 range]
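
A minimal sketch of the idea (my own toy example with an assumed static scale factor; GradScaler adjusts the scale dynamically and skips steps whose gradients contain inf/nan). It assumes a CUDA device:

import torch

model = torch.nn.Linear(4, 1).cuda()
opt = torch.optim.SGD(model.parameters(), lr=1e-3)
scale = 2.0 ** 16  # assumed static scale; GradScaler tunes this dynamically

x, y = torch.randn(8, 4, device='cuda'), torch.randn(8, 1, device='cuda')

with torch.cuda.amp.autocast():
    loss = torch.nn.functional.mse_loss(model(x), y)

(loss * scale).backward()      # scaled loss -> gradients stay representable in fp16
for p in model.parameters():
    p.grad.div_(scale)         # unscale before the optimizer step
opt.step()
opt.zero_grad()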

Domiciliate answered 14/6, 2024 at 2:32 Comment(0)
