RL Activation Functions with Negative Rewards

I have a question regarding appropriate activation functions with environments that have both positive and negative rewards.

In reinforcement learning, our output, I believe, should be the expected reward for all possible actions. Since some options have a negative reward, we would want an output range that includes negative numbers.

This would lead me to believe that the only appropriate activation functions would be either linear or tanh. However, I see the use of ReLU in many RL papers.

So two questions:

  1. If you do want to have both negative and positive outputs, are you limited to just tanh and linear?

  2. Is it a better strategy (if possible) to scale rewards up so that they are all in the positive domain (i.e. instead of [-1,0,1], [0, 1, 2]) in order for the model to leverage alternative activation functions?

Argentinaargentine answered 26/12, 2017 at 14:35 Comment(2)
Could you point me to some RL paper using ReLU whose output is the expected reward? (Just curiosity!) Thanks.Sly
Human Level Control through Deep Reinforcement Learning from Mnih et al., and Hindsight Experience Replay from OpenAI.Argentinaargentine

Many RL papers indeed use ReLUs for most layers, but typically not for the final output layer. You mentioned the Human Level Control through Deep Reinforcement Learning paper and the Hindsight Experience Replay paper in one of the comments, but neither of those papers describes an architecture that uses a ReLU for the output layer.

In the Human Level Control through Deep RL paper, page 6 (after the references), Section "Methods", the last paragraph of the "Model architecture" part mentions that the output layer is a fully-connected linear layer (not a ReLU). So, indeed, all hidden layers can only have nonnegative activation levels (since they all use ReLUs), but the output layer can still produce negative values if there are negative weights between the last hidden layer and the output layer. This is necessary because the outputs are interpreted as Q-values (which may be negative).
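To make that concrete, here is a minimal sketch (layer sizes and names are illustrative, not the paper's actual convolutional architecture): ReLU activations in the hidden layers, but a plain linear output layer, so the Q-value estimates can still be negative.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Toy DQN-style head: ReLU hidden layers, linear output layer."""
    def __init__(self, state_dim=4, num_actions=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64),
            nn.ReLU(),                   # hidden activations are clipped to >= 0
            nn.Linear(64, 64),
            nn.ReLU(),
            nn.Linear(64, num_actions),  # no activation: outputs can be < 0
        )

    def forward(self, state):
        return self.net(state)           # one Q-value estimate per action

q = QNetwork()
print(q(torch.randn(1, 4)))              # may contain negative entries
```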

In the Hindsight Experience Replay paper, they do not use DQN (as in the paper above) but DDPG, which is an "actor-critic" algorithm. The "critic" part of that architecture is also intended to output values that can be negative, similar to the DQN architecture, so it also cannot use a ReLU for the output layer (but it can still use ReLUs everywhere else in the network). Appendix A of the paper, under "Network architecture", also notes that the actor's output layer uses tanh as its activation function.
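The same split can be sketched for DDPG (again with purely illustrative sizes, not the exact networks from the paper's appendix): the actor ends in tanh so actions stay in a bounded range, while the critic ends in a plain linear layer so the Q-value it outputs can be negative.

```python
import torch
import torch.nn as nn

# Actor: bounded action output via tanh.
actor = nn.Sequential(
    nn.Linear(8, 64), nn.ReLU(),
    nn.Linear(64, 2), nn.Tanh(),
)
# Critic: takes (state, action), ends in a linear layer -> unbounded Q-value.
critic = nn.Sequential(
    nn.Linear(8 + 2, 64), nn.ReLU(),
    nn.Linear(64, 1),
)

state = torch.randn(1, 8)
action = actor(state)
q_value = critic(torch.cat([state, action], dim=-1))  # may be negative
```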

To answer your specific questions:

  1. If you do want to have both negative and positive outputs, are you limited to just tanh and linear?
  2. Is it a better strategy (if possible) to scale rewards up so that they are all in the positive domain (i.e. instead of [-1,0,1], [0, 1, 2]) in order for the model to leverage alternative activation functions?
  1. Well, there are also other activations (leaky ReLU, sigmoid, and probably lots of others). But a ReLU indeed cannot produce negative outputs (see the short sketch after this list).
  2. Not 100% sure, possibly. It would often be difficult, though, if you have no domain knowledge about how big or small rewards (and/or returns) can possibly get. I have a feeling it would typically be easier to simply end with one fully-connected linear layer.
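A quick way to see the leaky-ReLU point from item 1 (a minimal sketch using PyTorch's functional API): ReLU clips negative pre-activations to zero, while leaky ReLU lets a small negative signal through.

```python
import torch
import torch.nn.functional as F

x = torch.tensor([-2.0, -0.5, 0.0, 1.5])
print(F.relu(x))                             # tensor([0.0000, 0.0000, 0.0000, 1.5000])
print(F.leaky_relu(x, negative_slope=0.01))  # tensor([-0.0200, -0.0050, 0.0000, 1.5000])
```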
Housefather answered 29/12, 2017 at 16:46 Comment(1)
This is exactly the part I was missing - that you can have nonnegative layers in between and still get a negative value out of the final layer. Re your specific answers - I thought sigmoid is bounded between 0 and 1. Leaky ReLU can have negative values, but that is more to make sure it doesn't die at zero; I don't believe it is meant to output a negative value. I think I agree with your second point: scaling rewards can lead to distortions that may not reflect reality. It's probably best to try to keep that as simple as possible and modify the model. Thanks!Argentinaargentine

If you do want to have both negative and positive outputs, are you limited to just tanh and linear?

No, this is only the case for the activation function of the output layer. For all the other layers it does not matter, because you can have negative weights, which means that neurons with only positive activation values can still contribute negative values to the next layer.
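A tiny numeric sketch of that point (illustrative values only): the hidden ReLU activation is nonnegative, but a negative output weight still yields a negative final output.

```python
import torch
import torch.nn as nn

hidden = nn.Sequential(nn.Linear(1, 1), nn.ReLU())
out = nn.Linear(1, 1, bias=False)
with torch.no_grad():
    hidden[0].weight.fill_(1.0)
    hidden[0].bias.fill_(0.0)
    out.weight.fill_(-3.0)   # negative weight into the output layer

x = torch.tensor([[2.0]])
h = hidden(x)                # tensor([[2.]])  -> nonnegative after ReLU
y = out(h)                   # tensor([[-6.]]) -> negative final output
print(h, y)
```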

Borszcz answered 7/6, 2021 at 10:19 Comment(0)
