How do I add L1/L2 regularization in PyTorch without manually computing it?
See the documentation. Add a weight_decay parameter to the optimizer for L2 regularization.
Use weight_decay > 0 for L2 regularization:
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-5)
But weight_decay and L2 regularization are different for the Adam optimizer. More can be read here: openreview.net/pdf?id=rk6qdGgCZ – Sudhir
weight_decay and L2 regularization are different, but in the case of PyTorch's implementation of Adam, they actually implement L2 regularization instead of true weight decay. Note that the weight decay term is applied to the gradient before the optimizer step here. – Tempi
L = f(θ) + ½λ∑θ². But then, 𝜕L/𝜕θ = 𝜕f/𝜕θ + λθ. If you take a look at the Adam algorithm, it effectively says g = 𝜕L/𝜕θ = 𝜕f/𝜕θ + λθ. – Osmo
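To make the distinction in the comments above concrete, here is a minimal sketch (the model is a stand-in and the hyperparameter values are only illustrative): torch.optim.Adam folds weight_decay into the gradient before the adaptive-moment update, whereas torch.optim.AdamW applies decoupled weight decay as described in the paper linked above.
import torch

model = torch.nn.Linear(10, 1)  # stand-in for any module

# Adam: weight_decay adds weight_decay * w to the gradient, which then
# passes through the adaptive moment estimates (L2-regularization style)
adam = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-5)

# AdamW: the decay term is decoupled and applied directly to the parameters
adamw = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-2)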
Previous answers, while technically correct, are inefficient performance-wise and are not too modular (hard to apply on a per-layer basis, as provided by, say, keras layers).
PyTorch L2 implementation
Why did PyTorch implement L2 inside torch.optim.Optimizer instances?
Let's take a look at the torch.optim.SGD source code (currently implemented as a functional optimization procedure), especially this part:
for i, param in enumerate(params):
    d_p = d_p_list[i]

    # L2 weight decay specified HERE!
    if weight_decay != 0:
        d_p = d_p.add(param, alpha=weight_decay)
- One can see that d_p (derivative of the parameter, i.e. its gradient) is modified and re-assigned for faster computation (not saving temporary variables)
- It has O(N) complexity without any complicated math like pow
- It does not involve autograd, extending the graph without any need

Compare that to the loss-based approach: O(N) **2 (squaring) operations and additions, which also take part in backpropagation.
Math
Let's see the L2 equation with the alpha regularization factor (the same could be done for L1, of course):
L2 = α · ½ · ∑ᵢ wᵢ²
If we take the derivative of any loss with L2 regularization w.r.t. the parameters w (it is independent of the loss), we get:
𝜕L2/𝜕wᵢ = α · wᵢ
So it is simply an addition of alpha * weight to the gradient of every weight! And this is exactly what PyTorch does above!
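One way to sanity-check this equivalence is a small experiment along these lines (layer sizes, learning rate, and alpha are arbitrary): with plain SGD and no momentum, using weight_decay=alpha should produce the same update as adding 0.5 * alpha * sum(w**2) to the loss.
import torch

torch.manual_seed(0)
alpha = 0.1

# Two identical copies of the same layer
a = torch.nn.Linear(4, 1)
b = torch.nn.Linear(4, 1)
b.load_state_dict(a.state_dict())

x = torch.randn(8, 4)
y = torch.randn(8, 1)

opt_a = torch.optim.SGD(a.parameters(), lr=0.1, weight_decay=alpha)
opt_b = torch.optim.SGD(b.parameters(), lr=0.1)

# (a) built-in weight_decay: alpha * w is added to the gradient
opt_a.zero_grad()
torch.nn.functional.mse_loss(a(x), y).backward()
opt_a.step()

# (b) explicit penalty 0.5 * alpha * sum(w**2) added to the loss
opt_b.zero_grad()
loss = torch.nn.functional.mse_loss(b(x), y)
loss = loss + 0.5 * alpha * sum(p.pow(2).sum() for p in b.parameters())
loss.backward()
opt_b.step()

print(torch.allclose(a.weight, b.weight))  # expected: True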
L1 Regularization layer
Using this (and some PyTorch magic), we can come up with a quite generic L1 regularization layer, but let's look at the first derivative of L1 first (sgn is the signum function, returning 1 for positive input, -1 for negative, and 0 for 0):
𝜕L1/𝜕wᵢ = α · sgn(wᵢ)
Full code with a WeightDecay interface is located in the torchlayers third-party library, which provides things like regularizing only weights/biases/specifically named parameters (disclaimer: I'm the author), but the essence of the idea is outlined below (see comments):
class L1(torch.nn.Module):
    def __init__(self, module, weight_decay):
        super().__init__()
        self.module = module
        self.weight_decay = weight_decay

        # Backward hook is registered on the specified module
        self.hook = self.module.register_full_backward_hook(self._weight_decay_hook)

    # Not dependent on backprop incoming values, placeholder
    def _weight_decay_hook(self, *_):
        for param in self.module.parameters():
            # If there is no gradient or it was zeroed out
            # (zeroed out using optimizer.zero_grad() usually),
            # turn this check on if needed, e.g. with gradient accumulation, as a safer way:
            # if param.grad is None or torch.all(param.grad == 0.0):

            # Apply regularization on it
            param.grad = self.regularize(param)

    def regularize(self, parameter):
        # L1 regularization formula
        return self.weight_decay * torch.sign(parameter.data)

    def forward(self, *args, **kwargs):
        # Simply forward args and kwargs to the wrapped module
        return self.module(*args, **kwargs)
Read more about hooks in this answer or respective PyTorch docs if needed.
And usage is also pretty simple (should work with gradient accumulation and PyTorch layers):
layer = L1(torch.nn.Conv2d(in_channels=3, out_channels=32, kernel_size=3), weight_decay=1e-4)
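To see the pieces together, here is a minimal training-step sketch reusing layer from the line above; the input shape, dummy loss, and learning rate are only illustrative:
# `layer` is the L1-wrapped Conv2d constructed above
optimizer = torch.optim.SGD(layer.parameters(), lr=1e-2)

x = torch.randn(8, 3, 28, 28)   # dummy input batch
loss = layer(x).abs().mean()    # forward simply dispatches to the wrapped Conv2d; dummy loss
optimizer.zero_grad()
loss.backward()                 # the backward hook writes weight_decay * sign(param) into param.grad
optimizer.step()                # the optimizer then consumes that gradient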
Could you update torchlayers and release it with L1 and L2, because they are still missing in version 0.1.1, released more than 1 year ago? – Ngocnguyen
It works with L1 and L2, thank you. There is a warning though: ~/anaconda3/envs/torch/lib/python3.8/site-packages/torch/nn/modules/module.py:785: UserWarning: Using a non-full backward hook when outputs are generated by different autograd Nodes is deprecated and will be removed in future versions. This hook will be missing some grad_output. Please use register_full_backward_hook to get the documented behavior. – Ngocnguyen
Note that the hook overwrites the gradient, so the sign-based term stays at -1 or 1 until the optimizer uses this feedback. Otherwise it could grow to, say, -16 for 16 steps of gradient accumulation. – Gotcher
For L2 regularization,
l2_lambda = 0.01
l2_reg = torch.tensor(0.)
for param in model.parameters():
    l2_reg += torch.norm(param)
loss += l2_lambda * l2_reg
torch.norm is taking the 2-norm here, not the square of the 2-norm. So I think the norm should be squared to get a correct regularization. – Geo
torch.norm is deprecated. – Crossjack
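Putting those two comments together, a corrected variant of the snippet above might square the norm and use the non-deprecated torch.linalg API (model and loss here are the ones from the answer above):
l2_lambda = 0.01
l2_reg = sum(torch.linalg.vector_norm(p) ** 2 for p in model.parameters())
loss = loss + l2_lambda * l2_reg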
L2 regularization out-of-the-box
Yes, PyTorch optimizers have a parameter called weight_decay which corresponds to the L2 regularization factor:
sgd = torch.optim.SGD(model.parameters(), weight_decay=weight_decay)
L1 regularization implementation
There is no analogous argument for L1; however, it is straightforward to implement manually:
loss = loss_fn(outputs, labels)
l1_lambda = 0.001
l1_norm = sum(torch.linalg.norm(p, 1) for p in model.parameters())
loss = loss + l1_lambda * l1_norm
The equivalent manual implementation of L2 would be:
l2_reg = sum(p.pow(2).sum() for p in model.parameters())
Source: Deep Learning with PyTorch (8.5.2)
For L1 regularization, including the weight parameters only:
l1_reg = torch.tensor(0., requires_grad=True)
for name, param in model.named_parameters():
    if 'weight' in name:
        l1_reg = l1_reg + torch.linalg.norm(param, 1)

total_loss = total_loss + 10e-4 * l1_reg
torch.norm is deprecated. – Crossjack
Interesting: torch.norm is slower on CPU and faster on GPU vs. the direct approach.
import torch
x = torch.randn(1024,100)
y = torch.randn(1024,100)
%timeit torch.sqrt((x - y).pow(2).sum(1))
%timeit torch.norm(x - y, 2, 1)
Out:
1000 loops, best of 3: 910 µs per loop
1000 loops, best of 3: 1.76 ms per loop
On the other hand:
import torch
x = torch.randn(1024,100).cuda()
y = torch.randn(1024,100).cuda()
%timeit torch.sqrt((x - y).pow(2).sum(1))
%timeit torch.norm(x - y, 2, 1)
Out:
10000 loops, best of 3: 50 µs per loop
10000 loops, best of 3: 26 µs per loop
To extend on the good answers: as was said, the L2 norm added to the loss is equivalent to weight decay iff you use plain SGD without momentum. Otherwise, e.g. with Adam, it is not exactly the same. The AdamW paper [1] pointed out that weight decay is actually more stable. That is why you should use weight decay, which is an option to the optimizer. And consider using AdamW instead of Adam.
Also note, you probably don't want weight decay on all parameters (model.parameters()), but only on a subset. See here for examples:
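A common way to do that is with optimizer parameter groups; the sketch below is illustrative (the model, the decay value, and the rule of skipping biases and 1-D normalization parameters are conventions, not something PyTorch enforces):
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(128, 64),
    torch.nn.LayerNorm(64),
    torch.nn.Linear(64, 10),
)

decay, no_decay = [], []
for name, param in model.named_parameters():
    # Common convention: no decay for biases and 1-D (normalization) parameters
    if param.ndim == 1 or name.endswith(".bias"):
        no_decay.append(param)
    else:
        decay.append(param)

optimizer = torch.optim.AdamW(
    [
        {"params": decay, "weight_decay": 1e-2},
        {"params": no_decay, "weight_decay": 0.0},
    ],
    lr=1e-4,
)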