Does PyTorch apply softmax automatically in nn.Linear?

In PyTorch, a classification network model is defined like this:

import torch
import torch.nn.functional as F

class Net(torch.nn.Module):
    def __init__(self, n_feature, n_hidden, n_output):
        super(Net, self).__init__()
        self.hidden = torch.nn.Linear(n_feature, n_hidden)   # hidden layer
        self.out = torch.nn.Linear(n_hidden, n_output)        # output layer

    def forward(self, x):
        x = F.relu(self.hidden(x))      # activation function for hidden layer
        x = self.out(x)
        return x

Is softmax applied here? In my understanding, things should look like this:

class Net(torch.nn.Module):
    def __init__(self, n_feature, n_hidden, n_output):
        super(Net, self).__init__()
        self.hidden = torch.nn.Linear(n_feature, n_hidden)   # hidden layer
        self.relu = torch.nn.ReLU(inplace=True)
        self.out = torch.nn.Linear(n_hidden, n_output)        # output layer
        self.softmax = torch.nn.Softmax(dim=1)                # softmax over the class dimension

    def forward(self, x):
        x = self.hidden(x)
        x = self.relu(x)        # activation function for hidden layer
        x = self.out(x)
        x = self.softmax(x)
        return x

I understand that F.relu(self.hidden(x)) in the first block and self.relu(x) in the second both apply ReLU, but the first block of code doesn't apply softmax, right?

Sorrento asked 15/8, 2019 at 20:43 Comment(1)
On a related note, if you're using nn.CrossEntropyLoss then that applies log-softmax followed by nll-loss. You probably want to make sure you're not applying softmax twice since softmax is not idempotent.Vasomotor

Latching on to what @jodag was already saying in his comment, and extending it a bit to form a full answer:

No, PyTorch does not apply softmax automatically in nn.Linear; you can apply torch.nn.Softmax() yourself wherever you want. However, softmax has some issues with numerical stability, which we want to avoid as much as possible. One solution is to use log-softmax, but this tends to be slower than a direct computation.
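
As a minimal sketch (not from the original answer; the tensor shapes and values are just placeholders): the output of nn.Linear is raw logits, and softmax or log-softmax has to be applied on top of them explicitly if you want it.

import torch
import torch.nn.functional as F

logits = torch.randn(4, 3)                # e.g. a batch of 4 samples, 3 classes (raw nn.Linear output)

probs = F.softmax(logits, dim=1)          # rows sum to 1, but can under-/overflow for extreme logits
log_probs = F.log_softmax(logits, dim=1)  # numerically more stable than computing log(softmax(x)) yourself

print(probs.sum(dim=1))                                    # tensor([1., 1., 1., 1.])
print(torch.allclose(log_probs, probs.log(), atol=1e-6))   # True (up to numerical error)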

Especially when we are using Negative Log Likelihood as a loss function (in PyTorch, this is torch.nn.NLLLoss), we can exploit the fact that the derivative of (log-)softmax + NLL is mathematically quite nice and simple, which is why it makes sense to combine the two into a single function/element. The result is torch.nn.CrossEntropyLoss, which expects raw logits and applies log-softmax + NLL internally. Again, note that this only applies to the last layer of your network; any other computation is not affected by any of this.
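
A rough sketch of what that looks like in practice (not part of the original answer; it reuses the first Net class from the question, and the sizes, batch, and learning rate are arbitrary placeholders): train on the raw logits with nn.CrossEntropyLoss, and only apply softmax when you actually need probabilities, e.g. at inference time.

import torch
import torch.nn.functional as F

model = Net(n_feature=10, n_hidden=32, n_output=3)   # the first Net from the question, no softmax inside
criterion = torch.nn.CrossEntropyLoss()              # log-softmax + NLL in one step
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(8, 10)                               # dummy batch of 8 samples
y = torch.randint(0, 3, (8,))                        # dummy class indices

optimizer.zero_grad()
logits = model(x)                                    # raw scores, no softmax applied
loss = criterion(logits, y)                          # CrossEntropyLoss expects raw logits
loss.backward()
optimizer.step()

with torch.no_grad():                                # at inference, apply softmax explicitly
    probs = F.softmax(model(x), dim=1)               # now rows sum to 1 and read as class probabilities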

Scandura answered 16/8, 2019 at 8:45 Comment(3)
If I understand you correctly, it would be better to apply nn.CrossEntropyLoss as the loss function directly to the output of the last nn.Linear() layer, instead of using nn.Softmax() in the model. Is that correct?Sorrento
And another question ensues: the output of nn.Softmax() can be interpreted as the probability of each class, while the outputs of nn.Linear() are not guaranteed to sum to 1. Would that lose the meaning of the final output?Sorrento
To answer your first comment: You're not really replacing any layer with a loss function; rather, you replace your current loss function (which would be nn.NLLLoss) with a different one, while removing the final nn.Softmax(). I think the idea you had is already correct, though. The second question: Since your loss function still "applies" log-softmax (or at least your derivatives are based on that), the interpretation still holds. If you are using the output in any other way, e.g., during inference, you of course have to re-apply a softmax in that case.Scandura
