Free Energy Reinforcement Learning Implementation

I've been trying to implement the algorithm described here, and then test it on the "large action task" described in the same paper.

Overview of the algorithm:

(figure from the paper omitted)

In brief, the algorithm uses an RBM of the form shown below to solve reinforcement learning problems by changing its weights such that the negative free energy of a network configuration approximates the reward signal given for that state-action pair.

To select an action, the algorithm performs Gibbs sampling while holding the state variables fixed. With enough sampling steps, this produces the action with the lowest free energy, and thus the highest expected reward, for the given state.
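
For concreteness, this is the free-energy quantity that the weight updates try to match to the reward. The sketch below is not my calcF_softmax2 helper (which isn't shown here); it is a minimal version assuming plain sigmoid hidden units and a single row-vector configuration:

    % Free energy of a clamped visible configuration v (1 x numvis) for an RBM
    % with weights W (numvis x numhid), hidden biases c (1 x numhid), visible
    % biases b (1 x numvis), evaluated at temperature T.
    % F(v) = -v*b' - T * sum_j log(1 + exp((v*W + c)_j / T))
    function F = rbm_free_energy(v, W, c, b, T)
        hid_input = (v*W + c) / T;                      % total input to each hidden unit
        F = -(v*b') - T * sum(log(1 + exp(hid_input)));
    end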

Overview of the large action task:

(figure from the paper omitted)

Overview of the authors' guidelines for implementation:

A restricted Boltzmann machine with 13 hidden variables was trained on an instantiation of the large action task with a 12-bit state space and a 40-bit action space. Thirteen key states were randomly selected. The network was run for 12 000 actions with a learning rate going from 0.1 to 0.01 and temperature going from 1.0 to 0.1 exponentially over the course of training. Each iteration was initialized with a random state. Each action selection consisted of 100 iterations of Gibbs sampling.
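
To make the setup concrete, here is how I read the task and the schedules. This is my interpretation (based on the paper's description and on how correct_action and key_actions are used in my code below), not the authors' code:

    % Large action task, as I understand it: 12-bit states, 40-bit actions,
    % 13 randomly chosen key states, each with an associated key action.
    numdims       = 12;
    numactiondims = 40;
    numkeys       = 13;
    maxreward     = numactiondims;                    % a perfect match scores 40

    key_states  = rand(numkeys, numdims)       > .5;  % random key states
    key_actions = rand(numactiondims, numkeys) > .5;  % one key action per key state

    % For a random state s, the "correct" action is the key action of the key
    % state nearest to s in Hamming distance; reward counts the matching bits.
    s = rand(1, numdims) > .5;
    [~, nearest] = min(sum(abs(key_states - repmat(s, numkeys, 1)), 2));
    correct = key_actions(:, nearest)';
    a = rand(1, numactiondims) > .5;                  % some candidate action
    reward = maxreward - sum(abs(correct - a));

    % Exponential schedules over the 12 000 actions, as I read "exponentially":
    numiters = 12000;
    t = 1:numiters;
    epsilonw = 0.1 * (0.01/0.1).^((t-1)/(numiters-1));   % learning rate 0.1 -> 0.01
    temp     = 1.0 * (0.1/1.0).^((t-1)/(numiters-1));    % temperature  1.0 -> 0.1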

Important omitted details:

  • Were bias units needed?
  • Was weight decay needed? And if so, L1 or L2?
  • Was a sparsity constraint needed for the weights and/or activations?
  • Was there modification of the gradient descent? (e.g. momentum)
  • What meta-parameters were needed for these additional mechanisms?

My implementation:

I initially assumed the authors used no mechanisms other than those described in the guidelines, so I tried training the network without bias units. This led to near-chance performance, and was my first clue that some of the mechanisms used must have been deemed 'obvious' by the authors and therefore omitted.

I played around with the various omitted mechanisms mentioned above, and got my best results by using:

  • softmax hidden units (see the helper sketch after this list)
  • momentum of .9 (.5 until the 5th iteration)
  • bias units for the hidden and visible layers
  • a learning rate 1/100th of that listed by the authors
  • L2 weight decay of .0002
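
Since the softmax and softmax_sample helpers aren't shown, here is a rough sketch of the kind of thing I mean by them (treating the hidden layer as a single softmax group and drawing a one-hot sample); my actual helpers may differ in detail:

    % Sketches of the missing helpers (illustrative, not my exact code).
    function p = softmax(x)                  % x: 1 x numhid total inputs
        e = exp(x - max(x));                 % subtract max for numerical stability
        p = e / sum(e);
    end

    function s = softmax_sample(p)           % p: 1 x numhid probabilities
        edges = cumsum(p);
        s = zeros(size(p));
        s(find(rand <= edges, 1)) = 1;       % one-hot sample from the distribution
    end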

But even with all of these modifications, my performance on the task was generally around an average reward of 28 after 12000 iterations.

Code for each iteration:

    %%%%%%%%% START POSITIVE PHASE %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    data = [batchdata(:,:,(batch)) rand(1,numactiondims)>.5];   % clamp the state bits, initialize the action bits randomly
    poshidprobs = softmax(data*vishid + hidbiases);             % hidden unit activation probabilities

    %%%%%%%%% END OF POSITIVE PHASE  %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    hidstates = softmax_sample(poshidprobs);

    %%%%%%%%% START ACTION SELECTION PHASE  %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

    if test
        [negaction poshidprobs] = choose_factored_action(data(1:numdims),hidstates,vishid,hidbiases,visbiases,cdsteps,0);
    else
        [negaction poshidprobs] = choose_factored_action(data(1:numdims),hidstates,vishid,hidbiases,visbiases,cdsteps,temp);
    end


    data(numdims+1:end) = negaction > rand(numcases,numactiondims);


    if mod(batch,100) == 1
        disp(poshidprobs);
        disp(min(~xor(repmat(correct_action(:,(batch)),1,size(key_actions,2)), key_actions(:,:))));
    end

    posprods    = data' * poshidprobs;
    poshidact   = poshidprobs;
    posvisact = data;

    %%%%%%%%% END OF ACTION SELECTION PHASE %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


    if batch>5,
        momentum=.9;
    else
        momentum=.5;
    end;

    %%%%%%%%% UPDATE WEIGHTS AND BIASES %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    F = calcF_softmax2(data,vishid,hidbiases,visbiases,temp);   % free energy of the (state, action) configuration

    Q = -F;                                                     % negative free energy is the value estimate
    action = data(numdims+1:end);
    reward = maxreward - sum(abs(correct_action(:,(batch))' - action));   % number of action bits matching the correct action
    if correct_action(:,(batch)) == correct_action(:,1)
        reward_dataA = [reward_dataA reward];
        Q_A = [Q_A Q];
    else
        reward_dataB = [reward_dataB reward];
        Q_B = [Q_B Q];
    end
    reward_error = sum(reward - Q);   % error between the observed reward and the current estimate Q = -F
    rewardsum = rewardsum + reward;
    errsum = errsum + abs(reward_error);
    error_data(ind) = reward_error;
    reward_data(ind) = reward;
    Q_data(ind) = Q;

    % Gradient step pushing -F toward the reward, with momentum and L2 weight decay
    vishidinc = momentum*vishidinc + ...
        epsilonw*( (posprods*reward_error)/numcases - weightcost*vishid);
    visbiasinc = momentum*visbiasinc + (epsilonvb/numcases)*((posvisact)*reward_error - weightcost*visbiases);
    hidbiasinc = momentum*hidbiasinc + (epsilonhb/numcases)*((poshidact)*reward_error - weightcost*hidbiases);

    vishid = vishid + vishidinc;
    hidbiases = hidbiases + hidbiasinc;
    visbiases = visbiases + visbiasinc;

    %%%%%%%%%%%%%%%% END OF UPDATES %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
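
For context, choose_factored_action (not shown) does the action selection described earlier: Gibbs sampling with the state units clamped while the hidden and action units are resampled at the given temperature. A simplified sketch, assuming plain sigmoid units rather than my softmax groups:

    % Simplified sketch of Gibbs-sampling action selection (not the actual
    % choose_factored_action helper; assumes sigmoid hidden and action units).
    % state: 1 x numdims, W: (numdims+numactiondims) x numhid, T: temperature.
    function [action, hidprobs] = gibbs_select_action(state, W, hidbiases, visbiases, numactiondims, steps, T)
        numdims = length(state);
        action  = rand(1, numactiondims) > .5;                          % start from a random action
        for k = 1:steps
            v = [state action];
            hidprobs  = 1 ./ (1 + exp(-(v*W + hidbiases)/T));           % hidden probabilities
            hidstates = hidprobs > rand(size(hidprobs));                % sample hiddens
            visprobs  = 1 ./ (1 + exp(-(hidstates*W' + visbiases)/T));  % visible probabilities
            action    = visprobs(numdims+1:end) > rand(1, numactiondims); % resample only the action bits
        end
    end

In the test branch above a temperature of 0 is passed, which in a sampler like this makes the unit activations effectively deterministic (greedy action selection).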

What I'm asking for:

So, if any of you can get this algorithm to work properly (the authors claim to average ~40 reward after 12000 iterations), I'd be extremely grateful.

If my code appears to be doing something obviously wrong, then calling attention to that would also constitute a great answer.

I'm hoping that what the authors left out is indeed obvious to someone with more experience with energy-based learning than I have; in that case, simply pointing out what needs to be included in a working implementation would be enough.

Cheffetz answered 31/5, 2012 at 3:31 Comment(10)
quick note: I realize this question is relatively huge compared to the average Stack Overflow bug fix. As such, I'll donate all of the reputation I possibly can (650?) to anyone who answers it satisfactorily (even if you answer within the first 2 days of my posting).Cheffetz
Also, the code provided is written in MATLAB and should be treated as pseudo-code, since it won't run without my helper functions and setup. If you'd like a working version of my network and the large action task, I can provide it upon request.Cheffetz
Try posting on metaoptimize.com/qaCalamus
try posting #2 - on codereview.stackexchange.comCalces
meta.stackexchange.com/a/129787Jemy
@Cheffetz forgive me for not elaborating on that in my previous comment, but it just seems that your question is way too descriptive.Jemy
If you don't mind the question, what is your use case? Also, is this the only thing you have tried? Have you tried this from the angle of modeling the undirected graph?Incontrovertible
@Incontrovertible I'm developing an agent that learns to react to visual stimuli with cursor movements (e.g. a blob appears on screen, move the cursor to the blob). This algorithm seems to be the best I've come across for dealing with the large state-space entailed by considering all possible cursor movements.Cheffetz
@Incontrovertible I don't quite understand what you mean with regard to your second question. The algorithm does indeed operate on an RBM, which is an undirected graphical model of a particular variety, but I'm still not sure what "modeling the undirected graph" means in this context. I'm probably missing something obvious, but if you could elaborate, it'd be appreciated.Cheffetz
I'm voting to close this question as off-topic because it's a work requestLon
  1. The algorithm in the paper looks weird. It uses a kind of Hebbian learning that increases connection strengths, but no mechanism to decay them. In contrast, regular CD pushes the energy of incorrect fantasies up, balancing overall activity. I would speculate that you will need strong sparsity regularization and/or weight decay (see the sketch after this list).
  2. Bias units never hurt :)
  3. Momentum and other fancy stuff may speed things up, but it's usually not necessary.
  4. Why softmax on the hiddens? Shouldn't it just be sigmoid?
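
For point 1, a minimal sketch of one way to add a sparsity penalty on the hidden activations (nudging each hidden unit toward a small target mean activation; target_sparsity and sparsity_cost are made-up hyper-parameters, not from the paper):

    % Sparsity penalty: push each hidden unit's activation probability toward
    % a small target value; added to the existing increments each iteration.
    target_sparsity = 0.1;                                     % assumed value
    sparsity_cost   = 0.01;                                    % assumed value

    sparsity_grad = sparsity_cost * (target_sparsity - poshidprobs);   % 1 x numhid
    hidbiasinc = hidbiasinc + epsilonhb * sparsity_grad;
    vishidinc  = vishidinc  + epsilonw  * (data' * sparsity_grad);

These lines would go just before the weight and bias updates in the question's loop.
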
Zincate answered 20/7, 2013 at 12:38 Comment(0)
