The PyTorch tutorial (https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html#sphx-glr-beginner-blitz-cifar10-tutorial-py) trains a convolutional neural network (CNN) on the CIFAR-10 dataset:
    import torch.nn as nn
    import torch.nn.functional as F

    class Net(nn.Module):
        def __init__(self):
            super(Net, self).__init__()
            self.conv1 = nn.Conv2d(3, 6, 5)        # 3 input channels, 6 output channels, 5x5 kernel
            self.pool = nn.MaxPool2d(2, 2)         # 2x2 max pooling
            self.conv2 = nn.Conv2d(6, 16, 5)
            self.fc1 = nn.Linear(16 * 5 * 5, 120)
            self.fc2 = nn.Linear(120, 84)
            self.fc3 = nn.Linear(84, 10)           # one output per CIFAR-10 class

        def forward(self, x):
            x = self.pool(F.relu(self.conv1(x)))
            x = self.pool(F.relu(self.conv2(x)))
            x = x.view(-1, 16 * 5 * 5)             # flatten to (batch, 400)
            x = F.relu(self.fc1(x))
            x = F.relu(self.fc2(x))
            x = self.fc3(x)                        # raw scores for 10 classes, no softmax
            return x
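The tutorial then trains this network with cross-entropy loss; paraphrasing the relevant lines (with net and trainloader defined as in the tutorial):

    import torch.optim as optim

    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)

    for epoch in range(2):
        for inputs, labels in trainloader:
            optimizer.zero_grad()
            outputs = net(inputs)                  # raw outputs straight from fc3
            loss = criterion(outputs, labels)      # cross-entropy computed on these outputs
            loss.backward()
            optimizer.step()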
The network looks good except for the very last layer, fc3, which produces scores for the 10 classes without a softmax. Shouldn't we apply a softmax first, so that the outputs of the fc layer lie between 0 and 1 and sum to 1, before calculating the cross-entropy loss?
I tested this by adding a softmax at the end of forward and rerunning, but the accuracy dropped to around 35%. This seems counterintuitive. What is the explanation?
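For reference, this is the only change I made, a sketch of my modified forward with the same Net as above:

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 16 * 5 * 5)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = F.softmax(self.fc3(x), dim=1)          # added: outputs now in [0, 1] and sum to 1
        return x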