Artificial Neural Network RELU Activation Function and Gradients

I have a question. I watched a really detailed tutorial on implementing an artificial neural network in C++, and I now have more than a basic understanding of how a neural network works and how to actually program and train one.

In the tutorial a hyperbolic tangent was used for calculating outputs, and of course its derivative for calculating gradients. However, I wanted to move on to a different function, specifically Leaky ReLU (to avoid dying neurons).

My question is this: what I've read specifies that this activation function should be used for the hidden layers only, and that a different function should be used for the output layer (either softmax or a linear regression function). In the tutorial the author taught the neural network to be an XOR processor. So is this a classification problem or a regression problem?

I tried to Google the difference between the two, but I can't quite place the XOR processor in either category. Is it a classification or a regression problem? I've implemented the Leaky ReLU function and its derivative, but I don't know whether I should use softmax or a regression function for the output layer.
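
For reference, by Leaky ReLU and its derivative I mean something along these lines (the 0.01 leak slope is just an example value, not necessarily what the tutorial uses):

    // Leaky ReLU: like ReLU, but keeps a small slope for negative inputs
    // so the neuron never has an exactly zero gradient ("dying ReLU").
    double leakyRelu(double x) {
        return x > 0.0 ? x : 0.01 * x;
    }

    // Its derivative: 1 for positive inputs, the small leak slope otherwise.
    double leakyReluDerivative(double x) {
        return x > 0.0 ? 1.0 : 0.01;
    }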

Also, for recalculating the output gradients I use Leaky ReLU's derivative (for now), but in this case should I use the softmax/regression derivative as well?

Thanks in advance.

Deejay answered 7/10, 2017 at 12:46 Comment(1)
It should be a classification problem, because an XOR processor has a binary output (2 classes, i.e. yes/no).Chiffonier

I tried to Google the difference between the two, but I can't quite place the XOR processor in either category. Is it a classification or a regression problem?

In short, classification is for discrete targets, regression is for continuous targets. If XOR were a floating-point operation, you would have a regression problem. But here the result of XOR is 0 or 1, so it's a binary classification problem (as already suggested in the comments). You should use a softmax layer (or a sigmoid function, which works particularly well for 2 classes). Note that the output will be a vector of probabilities, i.e. real-valued, which is then used to choose the discrete target class.
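
As a rough sketch (illustrative only, not the asker's or the tutorial's code), a sigmoid output for the 2-class case can be as simple as:

    #include <cmath>

    // Sigmoid squashes the output neuron's raw sum into (0, 1),
    // which is read as the probability of class "1" (XOR true).
    double sigmoid(double x) {
        return 1.0 / (1.0 + std::exp(-x));
    }

    // Derivative of the sigmoid, written in terms of its output y = sigmoid(x).
    double sigmoidDerivative(double y) {
        return y * (1.0 - y);
    }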

Also, for recalculating the output gradients I use Leaky ReLU's derivative (for now), but in this case should I use the softmax/regression derivative as well?

Correct. For the output layer you'll need a cross-entropy loss function, which corresponds to the softmax layer, and its derivative for the backward pass. If there are hidden layers that still use Leaky ReLU, you'll also need Leaky ReLU's derivative for those particular layers.
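
To see why (again a sketch, not the tutorial's code): with a sigmoid (or softmax) output and cross-entropy loss, the activation's derivative cancels in the backward pass, so the output-layer gradient reduces to prediction minus target:

    #include <cmath>

    // Binary cross-entropy loss for one example:
    // target is 0 or 1, prediction is the sigmoid output in (0, 1).
    double crossEntropyLoss(double target, double prediction) {
        return -(target * std::log(prediction)
                 + (1.0 - target) * std::log(1.0 - prediction));
    }

    // Gradient of that loss w.r.t. the output neuron's raw (pre-sigmoid) sum.
    // The sigmoid derivative cancels, leaving simply prediction - target.
    double outputGradient(double target, double prediction) {
        return prediction - target;
    }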

Highly recommend this post on backpropagation details.

Factitive answered 7/10, 2017 at 20:38 Comment(5)
So do you think you could give me some advice on the architecture of the new network? In the tutorial the author uses an input layer with 2 neurons, one hidden layer with 4 neurons, and an output layer with a single neuron. How should I adapt this? Should the output layer have a neuron corresponding to each class (in the XOR case, 2 neurons)? Or...? Thanks in advance.Deejay
@Deejay There's no link to the tutorial, so I can't comment on that. But a single output sigmoid neuron is totally possible. In that case its output is interpreted as the probability of class 1. You can still use cross-entropy loss, because you know both p and 1-p.Factitive
Well, the video is an hour and five minutes long, but here's the actual code he writes: inkdrop.net/dave/docs/neural-net-tutorial.cpp Mine has a few variations: I replaced the hyperbolic tangent and its derivative with Leaky ReLU (and its derivative), and I also added a sigmoid function, which I now use to recalculate the output of the final neuron. I also use the sigmoid's derivative for recalculating the gradient of the neuron in the output layer. However, after a training session my neural network comes up with 0.5 for every input of the XOR processor.Deejay
@Deejay As far as I can see, his architecture is to output two values in [0, 1] (due to the sigmoid) and use an L2 loss function. That is possible as an example (though not very illustrative, and it can be confusing), but the next step is to use softmax + cross-entropy loss (see the sketch after these comments). You can try both ways for practice, but the second one is certainly more established for classification problems. Feel free to create new questions if it doesn't work.Factitive
So I finally got it to work :), thanks for all the help, but I still have a question. It seems that the way I implemented it (ReLU + sigmoid) takes more iterations for my network to learn, but here's some slightly weird behaviour: for 500 iterations over a set, Tanh already learns to mimic XOR, whereas ReLU is far off. For 2.5k iterations, very little improvement can be seen for Tanh, whereas ReLU becomes extremely accurate (more accurate than Tanh). Is this somehow related to the learning rate and momentum? If so, do you have any final advice on how to play around with them?Deejay
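
For illustration (a sketch under the assumption of a two-neuron output layer, not code from the tutorial or the comments), softmax + cross-entropy for the 2-class XOR case could look like this:

    #include <algorithm>
    #include <cmath>
    #include <cstddef>
    #include <vector>

    // Softmax over the output neurons' raw sums: turns them into probabilities
    // that sum to 1 (one per class, e.g. "XOR = 0" and "XOR = 1").
    std::vector<double> softmax(const std::vector<double>& logits) {
        double maxLogit = *std::max_element(logits.begin(), logits.end()); // numerical stability
        std::vector<double> probs(logits.size());
        double sum = 0.0;
        for (std::size_t i = 0; i < logits.size(); ++i) {
            probs[i] = std::exp(logits[i] - maxLogit);
            sum += probs[i];
        }
        for (double& p : probs) p /= sum;
        return probs;
    }

    // With softmax + cross-entropy, the gradient w.r.t. each output neuron's raw
    // sum is again probability - target (target is one-hot, e.g. {0, 1} for "XOR = 1").
    std::vector<double> outputGradients(const std::vector<double>& probs,
                                        const std::vector<double>& targets) {
        std::vector<double> grads(probs.size());
        for (std::size_t i = 0; i < probs.size(); ++i)
            grads[i] = probs[i] - targets[i];
        return grads;
    }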
