When to use a certain Reinforcement Learning algorithm?

I'm studying Reinforcement Learning and reading Sutton's book for a university course. Besides the classic DP, MC, TD and Q-Learning algorithms, I'm reading about policy gradient methods and genetic algorithms for solving decision problems. I have no previous experience with this topic and I'm having trouble understanding when one technique should be preferred over another. I have a few ideas, but I'm not sure about them. Can someone briefly explain, or point me to a source, describing the typical situations where a certain method should be used? As far as I understand:

  • Dynamic Programming and Linear Programming should be used only when the MDP has few actions and states and the model is known, since they are very expensive. But when is DP better than LP?
  • Monte Carlo methods are used when I don't have a model of the problem but I can generate samples. They have no bias but high variance.
  • Temporal Difference methods should be used when MC methods need too many samples to achieve low variance (a toy sketch contrasting the MC and TD(0) updates follows this list). But when should I use TD and when Q-Learning?
  • Policy gradient and genetic algorithms are good for continuous MDPs. But when is one better than the other?
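
To make the MC-vs-TD point above concrete, here is a minimal sketch of both value updates applied to a single stored episode; the episode format, step size and discount are my own illustrative assumptions, not an example from the book.

    from collections import defaultdict

    # Toy sketch: contrast the Monte Carlo and TD(0) value updates on a stored
    # episode. `episode` is a list of hypothetical (state, reward, next_state)
    # transitions gathered by following some policy.

    def mc_update(V, episode, alpha=0.1, gamma=1.0):
        """Monte Carlo: update each visited state toward the full observed return.
        Unbiased, but the return sums many random rewards, hence high variance."""
        G = 0.0
        for state, reward, _ in reversed(episode):
            G = reward + gamma * G                    # return from this state onward
            V[state] += alpha * (G - V[state])

    def td0_update(V, episode, alpha=0.1, gamma=1.0):
        """TD(0): bootstrap from the current estimate of the next state.
        Biased by the current estimates, but lower variance per update."""
        for state, reward, next_state in episode:
            target = reward + gamma * V[next_state]   # one-step bootstrapped target
            V[state] += alpha * (target - V[state])

    # Usage with made-up transitions ("terminal" has value 0 by default):
    V = defaultdict(float)
    episode = [("s0", 0.0, "s1"), ("s1", 0.0, "s2"), ("s2", 1.0, "terminal")]
    mc_update(V, episode)   # or td0_update(V, episode)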

More precisely, I think that to choose a learning method a programmer should ask himself the following questions:

  • does the agent learn online or offline?
  • can we separate exploring and exploiting phases?
  • can we perform enough exploration?
  • is the horizon of the MDP finite or infinite?
  • are states and actions continuous?

But I don't know how these details of the problem affect the choice of a learning method. I hope that some programmer has already had experience with RL methods and can help me better understand their applications.

Shelli answered 28/3, 2014 at 21:53 Comment(0)

Briefly:

does the agent learn online or offline? This helps you decide between on-line and off-line algorithms (e.g. on-line: SARSA, off-line: Q-learning). On-line methods have more limitations and need more attention.
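
As a concrete illustration of the two update rules named above, here is a minimal tabular sketch; the dictionary-based Q table and the hyperparameters are my own assumptions, not part of the original answer.

    # Sketch of the two update rules: SARSA (on-policy) vs. Q-learning (off-policy).
    # Q maps (state, action) -> value; `a_next` is the action the behaviour policy
    # will actually take next; `actions` is the finite set of available actions.

    def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
        # Target uses the action the agent will really execute next (on-policy).
        target = r + gamma * Q.get((s_next, a_next), 0.0)
        Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))

    def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
        # Target uses the greedy action in s_next, whatever is executed (off-policy).
        best_next = max(Q.get((s_next, b), 0.0) for b in actions)
        target = r + gamma * best_next
        Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))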

can we separate exploring and exploiting phases? These two phases are normally kept in balance. For example, with epsilon-greedy action selection you explore (choose a random action) with probability epsilon and exploit (choose the greedy action) with probability 1-epsilon. You can also separate the two and have the algorithm explore first (e.g. by choosing random actions) and only then exploit. But this is only feasible when you are learning off-line, probably using a model of the system dynamics, and it normally means collecting a lot of sample data in advance.
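
A small sketch of what is described above, assuming a tabular Q and a finite action list (both hypothetical): plain epsilon-greedy selection, and an "explore first, then exploit" variant with a fixed exploration budget.

    import random

    def epsilon_greedy(Q, state, actions, epsilon=0.1):
        # Explore with probability epsilon, otherwise exploit the current estimates.
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q.get((state, a), 0.0))

    def phased_action(Q, state, actions, step, explore_steps=10_000):
        # Separated phases: purely random exploration for a fixed budget of steps,
        # then pure exploitation. Sensible mainly when learning off-line / from a model.
        if step < explore_steps:
            return random.choice(actions)
        return max(actions, key=lambda a: Q.get((state, a), 0.0))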

can we perform enough exploration? The amount of exploration you can afford depends on the definition of the problem. For example, if you have a simulation model of the problem in memory, you can explore as much as you want. But real-world exploration is limited by the resources you have (e.g. energy, time, ...).
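
One common way to keep exploration within a limited resource budget is to anneal epsilon over time; the linear schedule and the numbers below are illustrative assumptions only.

    def decayed_epsilon(step, eps_start=1.0, eps_end=0.05, decay_steps=50_000):
        # Linearly anneal epsilon from eps_start to eps_end over decay_steps,
        # so most of the (costly) exploration happens early in learning.
        frac = min(step / decay_steps, 1.0)
        return eps_start + frac * (eps_end - eps_start)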

are states and actions continuous? Considering this helps you choose the right approach (algorithm). There are both discrete and continuous algorithms developed for RL. Some of the "continuous" algorithms internally discretize the state or action spaces.
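
As a toy example of the kind of internal discretization mentioned above, here is a uniform binning of a continuous state vector; the bounds and bin counts are made up for illustration.

    import numpy as np

    def discretize(obs, lows, highs, bins):
        # Map a continuous observation vector to a tuple of bin indices,
        # usable as a key into a tabular value function or Q table.
        ratios = (np.asarray(obs, dtype=float) - lows) / (highs - lows)
        idx = (ratios * bins).astype(int)
        return tuple(int(i) for i in np.clip(idx, 0, bins - 1))

    # Example: a 2-D state in [-1, 1] x [0, 5] split into 10 x 20 bins.
    lows, highs, bins = np.array([-1.0, 0.0]), np.array([1.0, 5.0]), np.array([10, 20])
    key = discretize([0.3, 2.7], lows, highs, bins)   # e.g. (6, 10)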

Toor answered 9/4, 2014 at 12:27 Comment(6)
I have to disagree regarding Q-Learning for the off-line case. Q-Learning can easily be done online, and from what I've seen it blends really nicely into an on-line setting. – Detribalize
I disagree as well. As you may know, Q-learning is an off-policy method: it updates one policy while executing another. SARSA, on the other hand, is an on-policy method, which updates the same policy that it is executing. Thus algorithms like SARSA are used for control purposes while they are learning a task. You can read more about it in Reinforcement Learning: An Introduction. – Toor
I wasn't explicitly stating that SARSA was not on-line, as I did not mention SARSA anywhere in my comment. I was just stating that Q-learning can be used in both cases. – Detribalize
SARSA is an example of an on-policy algorithm. This kind of algorithm is the best choice when you want to control the system while learning the policy. About Q-learning, I should mention that when we say Q-learning we mean the original version of the algorithm proposed by Watkins (1989), although there are many modified versions of Q-learning that can be used. – Toor
By online vs. offline you mean on-policy vs. off-policy. – Elizaelizabet
@Elizaelizabet Online and offline are different from on- and off-policy (I just wanted to correct that error in case somebody stumbles upon this post): online updates are performed while running an episode, typically after every step, while offline updates are performed once the episode terminates and the full return is known (no bootstrapping). On-policy updates, on the other hand, are updates made to the policy that was used for sampling. Off-policy updates use certain techniques (importance sampling, V-traces) to update the policy using roll-outs from other policies. – Agretha
