Does PyTorch apply softmax automatically in nn.Linear?

In PyTorch, a classification network model is defined like this:

import torch
import torch.nn.functional as F

class Net(torch.nn.Module):
    def __init__(self, n_feature, n_hidden, n_output):
        super(Net, self).__init__()
        self.hidden = torch.nn.Linear(n_feature, n_hidden)   # hidden layer
        self.out = torch.nn.Linear(n_hidden, n_output)        # output layer

    def forward(self, x):
        x = F.relu(self.hidden(x))      # activation function for hidden layer
        x = self.out(x)
        return x

Is softmax applied here? In my understanding, things should look like this:

class Net(torch.nn.Module):
    def __init__(self, n_feature, n_hidden, n_output):
        super(Net, self).__init__()
        self.hidden = torch.nn.Linear(n_feature, n_hidden)   # hidden layer
        self.relu = torch.nn.ReLU(inplace=True)
        self.out = torch.nn.Linear(n_hidden, n_output)        # output layer
        self.softmax = torch.nn.Softmax(dim=1)                # softmax over the class dimension

    def forward(self, x):
        x = self.hidden(x)
        x = self.relu(x)        # activation function for hidden layer
        x = self.out(x)
        x = self.softmax(x)
        return x

I understand that F.relu(self.hidden(x)) in the first block and self.relu(x) in the second both apply ReLU, but the first block of code doesn't apply softmax, right?

Sorrento asked 15/8, 2019 at 20:43 Comment(1)
On a related note, if you're using nn.CrossEntropyLoss then that applies log-softmax followed by nll-loss. You probably want to make sure you're not applying softmax twice since softmax is not idempotent.Vasomotor

Latching on to what @jodag was already saying in his comment, and extending it a bit to form a full answer:

No, PyTorch does not apply softmax automatically in nn.Linear; you can apply torch.nn.Softmax() yourself wherever you want. However, softmax has some issues with numerical stability, which we want to avoid as much as possible. One solution is to use log-softmax, but this tends to be slower than a direct computation.
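
As a minimal sketch (not from the original answer; the tensor shapes and values are just placeholders): the output of nn.Linear is raw logits, and softmax or log-softmax has to be applied on top of them explicitly if you want it.

import torch
import torch.nn.functional as F

logits = torch.randn(4, 3)                # e.g. a batch of 4 samples, 3 classes (raw nn.Linear output)

probs = F.softmax(logits, dim=1)          # rows sum to 1, but can under-/overflow for extreme logits
log_probs = F.log_softmax(logits, dim=1)  # numerically more stable than computing log(softmax(x)) yourself

print(probs.sum(dim=1))                                    # tensor([1., 1., 1., 1.])
print(torch.allclose(log_probs, probs.log(), atol=1e-6))   # True (up to numerical error)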

Especially when we are using Negative Log Likelihood as a loss function (in PyTorch, this is torch.nn.NLLLoss), we can exploit the fact that the derivative of (log-)softmax + NLL is mathematically quite nice and simple, which is why it makes sense to combine the two into a single function/element. The result is torch.nn.CrossEntropyLoss, which expects raw logits and applies log-softmax + NLL internally. Again, note that this only applies to the last layer of your network; any other computation is not affected by any of this.
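
A rough sketch of what that looks like in practice (not part of the original answer; it reuses the first Net class from the question, and the sizes, batch, and learning rate are arbitrary placeholders): train on the raw logits with nn.CrossEntropyLoss, and only apply softmax when you actually need probabilities, e.g. at inference time.

import torch
import torch.nn.functional as F

model = Net(n_feature=10, n_hidden=32, n_output=3)   # the first Net from the question, no softmax inside
criterion = torch.nn.CrossEntropyLoss()              # log-softmax + NLL in one step
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(8, 10)                               # dummy batch of 8 samples
y = torch.randint(0, 3, (8,))                        # dummy class indices

optimizer.zero_grad()
logits = model(x)                                    # raw scores, no softmax applied
loss = criterion(logits, y)                          # CrossEntropyLoss expects raw logits
loss.backward()
optimizer.step()

with torch.no_grad():                                # at inference, apply softmax explicitly
    probs = F.softmax(model(x), dim=1)               # now rows sum to 1 and read as class probabilities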

Scandura answered 16/8, 2019 at 8:45 Comment(3)
If I understand you correctly, it would be better to apply nn.CrossEntropyLoss as the loss function directly to the output of the last nn.Linear() layer, instead of using nn.Softmax() in the model. Is that correct?Sorrento
And another question ensues: the output of nn.Softmax() can be interpreted as the probability of each class, while the outputs of nn.Linear() are not guaranteed to sum to 1. Would that lose the meaning of the final output?Sorrento
To answer your first comment: You're not really replacing any layer with a loss function; rather, you replace your current loss function (which would be nn.NLLLoss) with a different one, while removing the final nn.Softmax(). I think the idea you had is already correct, though. The second question: Since your loss function still "applies" log-softmax (or at least your derivatives are based on that), the interpretation still holds. If you are using the output in any other way, e.g., during inference, you of course have to re-apply a softmax in that case.Scandura
