What Loss Or Reward Is Backpropagated In Policy Gradients For Reinforcement Learning?

I have made a small script in Python to solve various Gym environments with policy gradients.

import gym, os
import numpy as np
#create environment
env = gym.make('CartPole-v0')
s_size = len(env.reset())
a_size = 2

#import my neural network code
os.chdir(r'C:\---\---\---\Python Code')
import RLPolicy
policy = RLPolicy.NeuralNetwork([s_size,a_size],learning_rate=0.000001,activations=['softmax']) #a 3layer network might be ([s_size, 5, a_size],learning_rate=1,activations=['tanh','softmax'])
#it supports the sigmoid activation function also

DISCOUNT = 0.95 #parameter for discounting future rewards

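#first step: reset the environment and take one sampled action to get an initial state
probs = policy.feedforward(env.reset())[-1]
action = np.random.choice(a_size,p=probs)
state,reward,done,info = env.step(action)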

for t in range(3000):
    done = False
    states = [] #lists for recording episode
    probs2 = []
    rewards = []
    while not done:
        #env.render() #to visualize learning

        probs = policy.feedforward(state)[-1] #calculate probabilities of actions
        action = np.random.choice(a_size,p=probs) #choose action from probs

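        #record and update state
        states.append(state) #state the action was chosen in
        probs2.append(probs) #probabilities the action was sampled from
        state,reward,done,info = env.step(action)
        rewards.append(reward) #should reward be before updating state?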

    #calculate gradients
    gradients_w = []
    gradients_b = []
    for i in range(len((rewards))):
        totalReward = sum([rewards[i+k]*DISCOUNT**k for k in range(len(rewards)-i)]) #discounted sum of the rewards from step i onward
        ## !! this is the line that I need help with
        gradient = policy.backpropagation(states[i],totalReward*(probs2[i])) #what should be backpropagated through the network
        ## !!

        ##record gradients
    #combine gradients and update the weights and biases
    gradients_w = np.array(gradients_w,object)
    gradients_b = np.array(gradients_b,object)
    policy.weights += policy.learning_rate * np.flip(np.sum(gradients_w,0),0) #np.flip because the gradients are calculated backwards
    policy.biases += policy.learning_rate * np.flip(np.sum(gradients_b,0),0)
    #reset and record
    if t%100==0:

What should be passed backwards to calculate the gradients? I am using gradient ascent but I could switch it to descent. Some people have defined the reward function as totalReward*log(probabilities). Would that make the score derivative totalReward*(1/probs) or log(probs) or something else? Do you use a cost function like cross entropy? I have tried

probs = np.zeros(a_size)  
probs[action] = 1  

and a couple others. The last one is the only one that was able to solve any of them and it only worked on Cartpole. I have tested the various loss or score functions for thousands of episodes with gradient ascent and descent on Cartpole, Pendulum, and MountainCar. Sometimes it will improve a small amount but it will never solve it. What am I doing wrong?

And here is the RLPolicy code. It is not well written or pseudo coded but I don't think it is the problem because I checked it with gradient checking several times. But it would be helpful even if I could narrow it down to a problem with the neural network or somewhere else in my code.

#Neural Network
import numpy as np
import random, math, time, os
from matplotlib import pyplot as plt

def activation(x,function):
    if function=='sigmoid':
        return(1/(1+math.e**(-x))) #Sigmoid
    if function=='relu':
    if function=='tanh':
        return(np.tanh(x.astype(float))) #tanh
    if function=='softmax':
        z = np.exp(np.array((x-max(x)),float))
        y = np.sum(z)
def activationDerivative(x,function):
    if function=='sigmoid':
    if function=='relu':
    if function=='tanh':
    if function=='softmax':
        s = x.reshape(-1,1)
        return(np.diagflat(s) - np.dot(s, s.T))

class NeuralNetwork():
    def __init__ (self,layers,learning_rate,momentum,regularization,activations):
        self.learning_rate = learning_rate   
        if (isinstance(layers[1],list)):
            h = layers[1][:]
            del layers[1]
            for i in h:
        self.layers = layers
        self.weights = [2*np.random.rand(self.layers[i]*self.layers[i+1])-1 for i in range(len(self.layers)-1)]
        self.biases = [2*np.random.rand(self.layers[i+1])-1 for i in range(len(self.layers)-1)]    
        self.weights = np.array(self.weights,object)
        self.biases = np.array(self.biases,object)
        self.activations = activations
    def feedforward(self, input_array):
        layer = input_array
        neuron_outputs = [layer]
        for i in range(len(self.layers)-1):
            layer = np.tile(layer,self.layers[i+1])
            layer = np.reshape(layer,[self.layers[i+1],self.layers[i]])
            weights = np.reshape(self.weights[i],[self.layers[i+1],self.layers[i]])
            layer = weights*layer
            layer = np.sum(layer,1)#,self.layers[i+1]-1)
            layer = layer+self.biases[i]
            layer = activation(layer,self.activations[i])
    def neuronErrors(self,l,neurons,layerError,n_os):
        if (l==len(self.layers)-2):
        totalErr = [] #total error
        for e in range(len(layerError)): #-layers
            e = e*self.layers[l+2]
            a_ws = self.weights[l+1][e:e+self.layers[l+1]]
            e = int(e/self.layers[l+2])
            err = layerError[e]*a_ws #error
    def backpropagation(self,state,loss):
        weights_gradient = [np.zeros(self.layers[i]*self.layers[i+1]) for i in range(len(self.layers)-1)]
        biases_gradient = [np.zeros(self.layers[i+1]) for i in range(len(self.layers)-1)]  
        neuron_outputs = self.feedforward(state)
        grad = self.individualBackpropagation(loss, neuron_outputs)

    def individualBackpropagation(self, difference, neuron_outputs): #number of output
        lr = self.learning_rate
        n_os = neuron_outputs[:]
        w_o = self.weights[:]
        b_o = self.biases[:]
        w_n = self.weights[:]
        b_n = self.biases[:]
        gradient_w = []
        gradient_b = []
        error = difference[:] #error for neurons
        for l in range(len(self.layers)-2,-1,-1):
            p_n = np.tile(n_os[l],self.layers[l+1]) #previous neuron
            neurons = np.arange(self.layers[l+1])
            error = (self.neuronErrors(l,neurons,error,n_os))
            if not self.activations[l]=='softmax':
                error = error*activationDerivative(neuron_outputs[l+1],self.activations[l])
                error = error @ activationDerivative(neuron_outputs[l+1],self.activations[l]) #because softmax derivative returns different dimensions
            w_grad = np.repeat(error,self.layers[l]) #weights gradient
            b_grad = np.ravel(error) #biases gradient
            w_grad = w_grad*p_n
            b_grad = b_grad

Thanks for any answers, this is my first question here.

Longwood answered 26/8, 2020 at 16:50 Comment(2)
The derivative of tanh should be 1 - tanh^2(x) instead of 1 - x^2. - Syllabify
@Syllabify When I use the derivative function, I pass the outputs of the neuron to it. That means x is already tanh(x), and the same goes for the sigmoid function. - Longwood
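
The point in the last comment can be checked numerically. This is a minimal sketch, not the RLPolicy code: the function names are hypothetical, and it assumes the derivative helpers receive the activation outputs rather than the pre-activation inputs.

```python
import numpy as np

def tanh_derivative_from_output(y):
    # Assumes y is already tanh(x), so d tanh(x)/dx = 1 - y**2.
    return 1.0 - y ** 2

def sigmoid_derivative_from_output(y):
    # Assumes y is already sigmoid(x), so d sigmoid(x)/dx = y * (1 - y).
    return y * (1.0 - y)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
eps = 1e-6

# Central finite differences of the activations with respect to x.
numeric_tanh = (np.tanh(x + eps) - np.tanh(x - eps)) / (2 * eps)
numeric_sigmoid = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)

# Analytic derivatives computed from the outputs only.
analytic_tanh = tanh_derivative_from_output(np.tanh(x))
analytic_sigmoid = sigmoid_derivative_from_output(sigmoid(x))
```

Both forms agree as long as the caller consistently passes outputs; passing the raw input x to these helpers would give the wrong derivative.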

mprouveur's answer was half correct, but I felt that I needed to explain the right thing to backpropagate. The answer to my question on ai.stackexchange.com was how I came to understand this. The correct quantity to backpropagate is the gradient of the log probability of the action that was taken, multiplied by the total reward. This can also be calculated as the cross entropy loss between the output probabilities and a one-hot label: an array of zeros with a 1 at the action that was taken. Because of the derivative of cross entropy loss, this pushes only the probability of the action that was taken closer to one. Multiplying by the total reward then pushes better actions toward a higher probability more strongly. So, with label being the one-hot encoded vector, the term to backpropagate at the output is label/probs * totalReward, because label/probs is the derivative of log(probs[action]) with respect to probs. The cross entropy derivative is the same expression with the opposite sign, so use gradient ascent with the former or gradient descent with the latter.

I got this working in other code, but even with this equation I think something else in my code is wrong. It probably has something to do with how I made the softmax derivative too complicated instead of calculating it the usual way, by combining the cross entropy derivative and the softmax derivative. I will update this answer soon with correct code and more information.
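
As a rough sketch of the "usual way" mentioned above (not the promised update, and not the RLPolicy code), the softmax and log-probability derivatives can be combined into a single expression on the logits: totalReward * (label - probs). The function names below are hypothetical; the longer version multiplies label/probs * totalReward by the full softmax Jacobian and should give the same result.

```python
import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability.
    z = np.exp(logits - np.max(logits))
    return z / np.sum(z)

def one_hot(action, size):
    label = np.zeros(size)
    label[action] = 1.0
    return label

def combined_gradient(logits, action, total_reward):
    # Gradient of total_reward * log(probs[action]) with respect to the logits,
    # using the combined softmax + log derivative: label - probs.
    probs = softmax(logits)
    return total_reward * (one_hot(action, len(probs)) - probs)

def long_form_gradient(logits, action, total_reward):
    # Same gradient in two steps: label/probs * total_reward at the
    # probabilities, then through the softmax Jacobian.
    probs = softmax(logits)
    d_probs = total_reward * one_hot(action, len(probs)) / probs
    s = probs.reshape(-1, 1)
    jacobian = np.diagflat(s) - s @ s.T  # d probs[i] / d logits[j]
    return jacobian @ d_probs

# Illustrative values only.
logits = np.array([0.2, -0.1, 0.4])
action = 1
total_reward = 2.5

fast = combined_gradient(logits, action, total_reward)
slow = long_form_gradient(logits, action, total_reward)
```

The combined form avoids dividing by probs, which can become unstable when the sampled action has a very small probability. With gradient ascent this vector is added to the logits' upstream gradient; with gradient descent its sign is flipped.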

Longwood answered 24/10, 2020 at 15:24 Comment(0)

Using this post as a reference for the computation of the gradient (https://medium.com/@jonathan_hui/rl-policy-gradients-explained-9b13b688b146):

It seems to me that totalRewardOfEpisode*np.log(probability of sampled action) is the right computation. However, in order to have a good estimate of the gradient, I'd suggest using many episodes to compute it (30 for example; you'd just need to average your end gradient by dividing by 30).

The main difference from your test with totalReward*np.log(probs) is that, for each step, I think you should only backpropagate on the probability of the action you sampled, not the whole output. Initially in the cited article they use the total reward, but in the end they suggest using the discounted sum of the present and future rewards, as you do, so that part doesn't seem theoretically problematic.
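
The two suggestions above (discounted present-and-future rewards per step, and averaging the gradient over several episodes) could look roughly like this. It is a sketch with hypothetical helper names, assuming each episode has already produced one flattened gradient array of the same shape.

```python
import numpy as np

def discounted_returns(rewards, discount):
    # returns[i] = rewards[i] + discount * rewards[i+1] + discount**2 * rewards[i+2] + ...
    returns = np.zeros(len(rewards))
    running = 0.0
    for i in reversed(range(len(rewards))):
        running = rewards[i] + discount * running
        returns[i] = running
    return returns

def average_gradients(per_episode_gradients):
    # Average the gradient estimates from several episodes (e.g. 30)
    # to reduce the variance of the update.
    return np.mean(np.stack(per_episode_gradients), axis=0)

# Illustrative values only.
example_returns = discounted_returns([1.0, 1.0, 1.0], 0.5)
example_average = average_gradients([np.array([1.0, 3.0]), np.array([3.0, 5.0])])
```

Computing the returns in one backward pass also avoids re-summing the tail of the reward list for every step.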

OLD answer:

To my knowledge, deep RL methods usually use some estimate of the value of the state in the game or the value of each action. From what I see in your code, you have a neural network that only outputs probabilities for each action.

Although what you want is definitely to maximize the total reward, you can't compute a gradient on the end reward because of the environment. I'd suggest you look into methods such as deep Q-learning or Actor/Critic based methods such as PPO.

Depending on the method you choose, you'll get different answers on how to compute your gradient.

Poundal answered 9/9, 2020 at 9:40 Comment(10)
I have looked at Actor/Critic methods, but I thought I should just get the actor down first. I thought policy gradient methods were supposed to be able to update without another network determining the value. You are right that my network just outputs probabilities for actions. Why does the environment stop us from computing the gradient on the end reward? Are policy gradients mixed with value-based methods the only way to solve these problems? - Longwood
The environment stops us from computing the gradient on the end reward because the environment (the function that links state and action to next state and reward) is not differentiable (or you simply don't have access to this function). I am not sure whether Actor/Critic based methods are the only way to solve these problems; at least they are the only ones that I know about, and they seem efficient/popular enough to give them a go. - Poundal
Okay, you are right: policy gradient methods do not inherently require a function to estimate the values. This article from Medium clears that up: medium.com/@jonathan_hui/… The last equation of the optimization part is what you said in your answer. I'll think about it a bit more and delete my answer if nothing comes out. - Poundal
Thanks. I have already read the article but I am going through it again. I will get back to you in a bit. - Longwood
I tried backpropagating the single action probability. If the probabilities were [0.7,0.3] and it chose 0.7, then I have tried backpropagating [0.7,0], [0.7,1] and [0.7,0.7] but none of them solved the cartpole environment. What am I doing wrong there? I am using totalReward*np.log(probs) with those as the new probs. I also used gradient ascent and descent for 3,000 iterations. - Longwood
I also tried all of them with 1/probs, which is supposed to be the derivative of log. - Longwood
I gave the bounty to you because it was about to expire. I have still not gotten an answer that solved my problem yet but you have helpful information. - Longwood
Thanks, and sorry for the late reply. The sum of probabilities should be 1, and changing those values seems strange to me. In theory your network has 2 outputs (0.7 and 0.3) and you can compute the gradient on a single output (in your example 0.7). In case you have not checked this one already, I found this implementation of the REINFORCE algorithm: towardsdatascience.com/… The loss uses the log probability of the chosen output but no backpropagation happens on the other output (though these outputs are tightly linked). - Poundal
Sorry I’m taking so long. I am still experimenting with your suggestion. - Longwood
I figured it out and made my own answer. You were mostly correct but I just wanted to make things clear. Thank you for the help. - Longwood
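
To illustrate the point in the comments that the environment does not have to be differentiable, here is a small sketch on a made-up two-action problem. The reward function is a hypothetical lookup table, so it is only ever used as a number. The policy gradient weights the gradient of log(probs[action]) by the reward, and its exact expectation over actions should match a finite-difference gradient of the expected reward.

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - np.max(logits))
    return z / np.sum(z)

def reward(action):
    # Hypothetical black-box reward: a lookup, nothing to differentiate through.
    return [1.0, 5.0][action]

def expected_reward(logits):
    probs = softmax(logits)
    return sum(probs[a] * reward(a) for a in range(len(probs)))

def score_function_gradient(logits):
    # Exact expectation over actions of reward(a) * d log(probs[a]) / d logits,
    # where d log(probs[a]) / d logits = one_hot(a) - probs.
    probs = softmax(logits)
    grad = np.zeros_like(logits)
    for a in range(len(probs)):
        label = np.zeros_like(probs)
        label[a] = 1.0
        grad += probs[a] * reward(a) * (label - probs)
    return grad

logits = np.array([0.3, -0.2])
eps = 1e-6

# Finite-difference gradient of the expected reward with respect to the logits.
numeric = np.array([
    (expected_reward(logits + eps * np.eye(2)[j])
     - expected_reward(logits - eps * np.eye(2)[j])) / (2 * eps)
    for j in range(2)
])
analytic = score_function_gradient(logits)
```

In a real environment the sum over actions is replaced by sampled episodes, which is why averaging over many episodes helps.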

The loss here depends on what each problem outputs. Generally, the loss you backpropagate should be a single number that summarizes everything you have processed. For policy gradient, it will be the reward the policy thinks it will get compared with the reward actually received; the log is just a way to bring it back to the probability of a random variable. It has a single dimension. If you want to inspect the behavior behind the code, you should always check the shape/dimension at each step to fully understand it.

Rhineland answered 9/9, 2020 at 6:58 Comment(5)
What number represents everything the network has processed? Nowhere in my code does the neural network predict the reward it will get. Do I need to have another network to predict the value? Then it would become an Actor/Critic method. Is there no way to solve a problem with just a policy gradient and no value-based method? - Longwood
When you use the policy gradient, that policy will give you a reward based on that policy. That belongs to the Bellman update. - Rhineland
The Bellman update will give you a value and a reward. If that reward is dynamically low, it is a penalty to the loss, and that number will give the backpropagation a big update for the neural network. - Rhineland
The actor-critic doesn't work like this. The value will be designed like you said and will give you the evaluation instead of a Bellman update. - Rhineland
It is weird that you know Policy Gradient but don't know this basic principle. You can check something like Q learning first. That is a typical Policy Gradient method. - Rhineland
