Asked 5/11, 2016 at 12:56 Answered 4/7, 2019 at 19:31

python deep-learning theano keras q-learning

I've been trying to build a model using 'Deep Q-Learning' where I have a large number of actions (2908). After some limited success with using standard DQN: (https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf), I decided to do some more research because I figured the action space was too large to do effective exploration.

I then discovered this paper: https://arxiv.org/pdf/1512.07679.pdf where they use an actor-critic model and policy gradients, which then led me to: https://arxiv.org/pdf/1602.01783.pdf where they use policy gradients to get much better results than DQN overall.

I've found a few sites where they have implemented policy gradients in Keras, https://yanpanlau.github.io/2016/10/11/Torcs-Keras.html and https://oshearesearch.com/index.php/2016/06/14/kerlym-a-deep-reinforcement-learning-toolbox-in-keras/, but I'm confused about how they are implemented. In the former (and when I read the papers) it seems that, instead of providing an input-output pair for the actor network, you provide the gradients for all the weights and then use the network to update it, whereas in the latter they just calculate an input-output pair.

Have I just confused myself? Am I just supposed to be training the network by providing an input-output pair and use the standard 'fit', or do I have to do something special? If it's the latter, how do I do it with the Theano backend? (the examples above use TensorFlow).

Humanist answered 5/11, 2016 at 12:56 Comment(2)

Have you seen github.com/matthiasplappert/keras-rl ? – Pyrology 31/1, 2017 at 8:59

One reason for not putting in state-action pairs is that it will take a long time if you have a large number of actions. Instead, it's often useful to have the network predict the values of all actions at once and then do your action selection following that. – Inappetence 12/2, 2017 at 16:13

TL;DR

Learn how to implement custom loss functions and gradients using keras.backend. You will need it for more advanced algorithms, and it's actually much easier once you get the hang of it.
One CartPole example that uses keras.backend is https://gist.github.com/kkweon/c8d1caabaf7b43317bc8825c226045d2 (it uses the TensorFlow backend, but the Theano version should be very similar, if not the same).

Problem

When playing,

the agent needs a policy, which is basically a function that maps a state to a probability for each action. The agent then chooses an action according to those probabilities.

i.e., policy = f(state)

When training,

Policy gradient does not have a loss function in the usual supervised sense. Instead, it tries to maximize the expected return, so we need to compute the gradients of log(action_prob) * advantage.

advantage is a function of rewards.
- advantage = f(rewards)
action_prob is a function of states and action_taken. We need to know which action we took so that we can update the parameters to increase or decrease the probability of that action.
- action_prob = sum(policy * action_onehot) = f(states, action_taken)

I'm assuming something like this

policy = [0.1, 0.9]
action_onehot = action_taken = [0, 1]
then action_prob = sum(policy * action_onehot) = 0.9
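
The quantities above can be sketched in plain Python. This is only an illustration, not code from either linked implementation: the function names are hypothetical, and using discounted returns (with an assumed discount factor gamma) as the advantage is one common choice among several.

```python
import math

def discounted_returns(rewards, gamma=0.99):
    """One possible advantage = f(rewards): G_t = r_t + gamma * G_{t+1}."""
    returns = []
    running = 0.0
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns

def action_prob(policy, action_onehot):
    """sum(policy * action_onehot): probability of the action actually taken."""
    return sum(p * a for p, a in zip(policy, action_onehot))

def surrogate_objective(policies, actions_onehot, advantages):
    """Sum over time steps of log(action_prob) * advantage (to be maximized)."""
    return sum(
        math.log(action_prob(policy, onehot)) * advantage
        for policy, onehot, advantage in zip(policies, actions_onehot, advantages)
    )

policy = [0.1, 0.9]
action_onehot = [0, 1]
print(action_prob(policy, action_onehot))  # 0.9
```

In a Keras model the same expression would be built from backend tensors, so that the framework can differentiate it with respect to the network weights.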

Summary

We need two functions

update function: f(state, action_taken, reward)
choose action function: f(state)
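
As a rough sketch of these two functions, here is a tiny linear-softmax policy in plain Python standing in for the Keras network. The class name SoftmaxPolicy, the learning rate lr and the seed are all hypothetical, and the update is a bare REINFORCE-style step for a single transition, assuming the advantage has already been computed from the rewards.

```python
import math
import random

class SoftmaxPolicy:
    """Hypothetical linear-softmax policy: logits[k] = dot(weights[k], state)."""

    def __init__(self, n_features, n_actions, lr=0.1, seed=0):
        self.rng = random.Random(seed)
        self.lr = lr
        self.weights = [[0.0] * n_features for _ in range(n_actions)]

    def policy(self, state):
        """policy = f(state): one probability per action."""
        logits = [sum(w * s for w, s in zip(row, state)) for row in self.weights]
        top = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(z - top) for z in logits]
        total = sum(exps)
        return [e / total for e in exps]

    def choose_action(self, state):
        """Choose action function: sample an action index from the policy."""
        probs = self.policy(state)
        return self.rng.choices(range(len(probs)), weights=probs)[0]

    def update(self, state, action_taken, advantage):
        """Update function: gradient ascent on log(action_prob) * advantage."""
        probs = self.policy(state)
        for k, row in enumerate(self.weights):
            # d log pi(action_taken) / d logit_k = onehot_k - probs_k
            grad_logit = (1.0 if k == action_taken else 0.0) - probs[k]
            for j, feature in enumerate(state):
                row[j] += self.lr * advantage * grad_logit * feature

agent = SoftmaxPolicy(n_features=2, n_actions=2)
state = [1.0, 0.0]
action = agent.choose_action(state)
agent.update(state, action, advantage=1.0)  # makes `action` more likely in this state
```

A positive advantage raises the probability of the action taken and a negative one lowers it, which is the behaviour the update function needs regardless of how the network itself is built.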

As you already know, this is not as easy to implement as a typical classification problem, where you can just call model.compile(...) -> model.fit(X, y)

However,

In order to fully utilize Keras, you should be comfortable with defining custom loss functions and gradients. This is basically the same approach the author of the former implementation took.
You should read more of the documentation for the Keras functional API and keras.backend.

Plus, there are many kinds of policy gradients.

The former one is called DDPG, which is actually quite different from regular policy gradients.
The latter one, as far as I can see, is a traditional REINFORCE policy gradient (pg.py) based on Karpathy's policy gradient example. But it is very simple; for example, it assumes only one action. That's why it could be implemented using model.fit(...) instead.

References

Schulman, "Policy Gradient Methods", http://rll.berkeley.edu/deeprlcourse/docs/lec2.pdf

Hoshi answered 18/5, 2017 at 8:48 Comment(0)

The seemingly conflicting implementations you encountered are both valid. They are two equivalent ways to implement policy gradients.

In the vanilla implementation, you calculate the gradient of the expected return w.r.t. the weights of the policy network and directly update the weights in the direction of that gradient. This would require you to do the steps described by Mo K.
The second option is arguably a more convenient implementation for autodiff frameworks like keras/tensorflow. The idea is to implement an input-output (state-action) function like in supervised learning, but with a loss function whose gradient is identical to the policy gradient. For a softmax policy, this simply means predicting the 'true action' and multiplying the (cross-entropy) loss by the observed returns/advantage. Aleksis Pirinen has some useful notes about this [1].

The modified loss function for option 2 in Keras looks like this:

import keras.backend as K

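def policy_gradient_loss(Returns):
    # Returns: tensor of shape (batch_size,) with the return/advantage of each sample
    def modified_crossentropy(action, action_probs):
        # Per-sample cross-entropy, shape (batch_size,), weighted element-wise by Returns
        cost = K.categorical_crossentropy(action, action_probs, from_logits=False, axis=1) * Returns
        return K.mean(cost)
    return modified_crossentropy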

where 'action' is the true action of the episode (y) and action_probs is the predicted probability (y*). This is based on another Stack Overflow question [2].
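
To see why a return-weighted cross-entropy gives the policy gradient, here is a small numerical check in plain Python (no Keras). It is only a sketch under stated assumptions: a softmax policy over raw logits, a single sample, and hypothetical helper names. It compares a finite-difference gradient of the weighted loss with the analytic gradient of Returns * log(action_prob).

```python
import math

def softmax(logits):
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_crossentropy(logits, action_onehot, ret):
    """Cross-entropy against the action taken, scaled by the return."""
    probs = softmax(logits)
    return -ret * sum(a * math.log(p) for a, p in zip(action_onehot, probs))

def policy_gradient(logits, action_onehot, ret):
    """Analytic gradient of ret * log(action_prob) w.r.t. the logits."""
    probs = softmax(logits)
    return [ret * (a - p) for a, p in zip(action_onehot, probs)]

def numerical_gradient(f, logits, eps=1e-6):
    """Central finite differences, one logit at a time."""
    grad = []
    for i in range(len(logits)):
        up = list(logits)
        down = list(logits)
        up[i] += eps
        down[i] -= eps
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad

logits = [0.2, -0.5, 1.0]
action_onehot = [0, 1, 0]
ret = 3.0

loss_grad = numerical_gradient(
    lambda z: weighted_crossentropy(z, action_onehot, ret), logits
)
pg = policy_gradient(logits, action_onehot, ret)

# The loss gradient is the negative of the policy gradient, so minimizing
# the loss with a standard optimizer ascends the policy-gradient objective.
matches = all(abs(lg + g) < 1e-6 for lg, g in zip(loss_grad, pg))
print(matches)
```

The sign flip is the reason the loss can be handed to a standard minimizing optimizer without further changes.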

References

Eboat answered 4/7, 2019 at 19:31 Comment(3)

This is very helpful. One question, should there be a K.mean() on the cost? The cost needs to be scalar in the end, and I'm assuming action and action-probs represent a full trajectory (game run) through time? More to the point, what dimensions are you assuming for the inputs, including Returns? – Abampere 5/8, 2019 at 21:15

@Abampere the dimensions for action and action_probs are (batch_size,n_categories). Using K.categorical_crossentropy on these variables results in a vector of length (batch size,) and this is multiplied element-wise with the Returns vector with the same dimensions (batch size,). So the function returns a vector of losses where one element reflects one sample, I don't think a K.mean is needed. – Eboat 6/8, 2019 at 7:18

@Abampere I looked at it and indeed normally you would take the mean. So I tested it on my own RL algorithm and using K.mean() gives the same result. I updated my answer. Thanks for the tip. – Eboat 6/8, 2019 at 7:56