What does "deterministic=True" in stable baselines3 library means?

I am trying to apply the PPO algorithm from the Stable Baselines3 library https://stable-baselines3.readthedocs.io/en/master/ to a custom environment I made.

One thing I don't understand is the following line:

from stable_baselines3.common.evaluation import evaluate_policy
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=10, deterministic=True)

Should I always set deterministic to True? When I keep deterministic=True, my custom environment is "somehow" always solved (i.e., it always returns a reward of 1 +/- 0 std).

And when I change it to False, it starts behaving in a reasonable way (i.e., sometimes it succeeds (reward=1) and sometimes it fails (reward=0)).
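For reference, here is a minimal sketch of the comparison I mean, assuming a stand-in CartPole-v1 environment instead of my custom one and an SB3 version that still imports gym (newer releases use gymnasium):

import gym
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy

env = gym.make("CartPole-v1")              # stand-in for the custom environment
model = PPO("MlpPolicy", env, verbose=0)
model.learn(total_timesteps=10_000)

# Greedy evaluation: always take the most probable action.
mean_det, std_det = evaluate_policy(model, env, n_eval_episodes=10, deterministic=True)

# Stochastic evaluation: sample actions from the policy distribution.
mean_sto, std_sto = evaluate_policy(model, env, n_eval_episodes=10, deterministic=False)

print(f"deterministic=True : {mean_det:.2f} +/- {std_det:.2f}")
print(f"deterministic=False: {mean_sto:.2f} +/- {std_sto:.2f}")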

Machos asked 3/3, 2021 at 10:57 Comment(11)
If I am not mistaken, Stable Baselines takes a random sample from the policy's action distribution when deterministic is False. This means that if the model is unsure of what to pick, you get a higher level of randomness, which increases exploration. During evaluation you generally don't want to explore but to exploit the model, so deterministic should be True, which always returns the best action. With deterministic=False you won't always get the best action, but sometimes a less optimal action picked at random (according to the model's confidence).Transmittal
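To see this behaviour directly, a small sketch of the model.predict call the comment refers to, again assuming a stand-in CartPole-v1 environment:

import gym
import numpy as np
from stable_baselines3 import PPO

env = gym.make("CartPole-v1")     # stand-in environment
model = PPO("MlpPolicy", env)     # even an untrained model illustrates the point

obs = env.reset()

# deterministic=True: always the most probable action for this observation.
greedy = [model.predict(obs, deterministic=True)[0] for _ in range(5)]

# deterministic=False: actions are sampled, so repeated calls can differ.
sampled = [model.predict(obs, deterministic=False)[0] for _ in range(5)]

print("greedy :", np.ravel(greedy))    # identical every call
print("sampled:", np.ravel(sampled))   # mixture of actions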
I actually tested it (deterministic=True) before and after training the model, and even before training, the reward was always 1, which was very unusual. Can you explain why an untrained model would be successful?Machos
I don't know the environment you are training on. Can you provide some action distributions of the model (with deterministic=True and False) and a short explanation of your environment? That would help in thinking about what goes wrong. My thinking so far is that the default model initialization already solves the environment, and you never get a different action (one that makes you fail).Transmittal
I am sorry @Transmittal for the late reply. Yes, here is my gym environment; I just uploaded it to my GitHub: github.com/amine179/mygym_environment/tree/main. My environment tries to teach an agent to keep two scores above a certain threshold: one score is proportional to its actions and the other is inversely proportional to its actions, so the agent should always adapt its actions to prevent either score from going to an extreme. Thank you for your help, I will be waiting for your results.Machos
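To make the setup concrete, a purely illustrative toy version of such an environment (this is not the code in the linked repository, just a sketch with made-up names and constants, using the older gym API):

import gym
import numpy as np
from gym import spaces

class TwoScoresEnv(gym.Env):
    """Toy sketch: keep two opposing scores above a threshold."""

    def __init__(self, threshold=0.3, max_steps=100):
        super().__init__()
        self.action_space = spaces.Box(low=-1.0, high=1.0, shape=(1,), dtype=np.float32)
        self.observation_space = spaces.Box(low=0.0, high=1.0, shape=(2,), dtype=np.float32)
        self.threshold = threshold
        self.max_steps = max_steps

    def reset(self):
        self.scores = np.array([0.5, 0.5], dtype=np.float32)
        self.steps = 0
        return self.scores.copy()

    def step(self, action):
        a = float(action[0])
        # One score moves with the action, the other moves against it.
        self.scores[0] = np.clip(self.scores[0] + 0.05 * a, 0.0, 1.0)
        self.scores[1] = np.clip(self.scores[1] - 0.05 * a, 0.0, 1.0)
        self.steps += 1
        done = self.steps >= self.max_steps or (self.scores < self.threshold).any()
        reward = 1.0 if (self.scores >= self.threshold).all() else 0.0
        return self.scores.copy(), reward, done, {}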
Could you add your PPO code with experiments, so I can directly run it and verify it myself? Looking at your environment, my initial thoughts don't make much sense, because you have a continuous action space.Transmittal
There you go! I uploaded the code used to train and save my PPO model and updated the test file so you can try it. That being said, I noticed a mistake in the environment, which I have now corrected; the PPO algorithm isn't able to solve the environment anymore, which I think is just a matter of more training. Nevertheless, I would be glad if you could check on your own, since I might be mistaken; I am new to deep RL in general. Also, the rendering function seems very slow; is there any way I can make it a bit faster? Thanks a lotMachos
Running your code for 100_000 steps with deterministic=True leads to a starting reward of 0 and a final reward of 49. With deterministic=False, it starts at 0 and ends at 31, which seems reasonable. As for the rendering, it is slow because you are re-rendering the whole plot every step with more and more data. The best way to handle that is either to move plotting to a separate process and transfer the data through a queue, or to use a render interval, e.g. every 20 steps.Transmittal
Yes, that's what I got for the training too; I was using a small number of steps (5000), which is why it never learned previously. Thanks. For the plotting, I am aware of the second suggestion, but for the first one (using a queue) I don't know how. Can you provide an example if possible?Machos
An example of plotting using a separate process can be found here.Transmittal
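A minimal sketch of that queue-based approach (illustrative only, assuming matplotlib for plotting; function names are made up):

import multiprocessing as mp
import matplotlib.pyplot as plt

def plot_worker(queue):
    """Runs in a separate process: receives values and redraws the plot."""
    values = []
    plt.ion()
    fig, ax = plt.subplots()
    while True:
        item = queue.get()
        if item is None:          # sentinel: stop plotting
            break
        values.append(item)
        ax.clear()
        ax.plot(values)
        plt.pause(0.001)
    plt.ioff()
    plt.show()

if __name__ == "__main__":
    q = mp.Queue()
    proc = mp.Process(target=plot_worker, args=(q,), daemon=True)
    proc.start()

    # Training loop stand-in: push one value per step instead of re-rendering inline.
    for step in range(200):
        q.put(step ** 0.5)        # e.g. a reward or score from env.step()

    q.put(None)                   # tell the worker to finish
    proc.join()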
Thanks a lot! You've been so helpful.Machos
Is it not just a deterministic policy when deterministic=True and a stochastic policy otherwise?Insanitary

This parameter corresponds to "Whether to use deterministic or stochastic actions". When you select an action for a given state, the actor network gives you a probability distribution over actions, for example [0.25, 0.75] for two possible actions a1 and a2. If you use deterministic=True, the result will always be action a2, since it has the higher probability. With deterministic=False, the action is sampled according to those probabilities [0.25, 0.75], so a1 is picked about 25% of the time.
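A small numeric sketch of that difference (illustrative only, using plain NumPy rather than the SB3 internals):

import numpy as np

probs = np.array([0.25, 0.75])          # policy output for actions a1, a2
actions = np.array(["a1", "a2"])

# deterministic=True: pick the most probable action (always a2 here)
greedy = actions[np.argmax(probs)]

# deterministic=False: sample, so a1 is chosen roughly 25% of the time
rng = np.random.default_rng(0)
samples = rng.choice(actions, size=10, p=probs)

print("greedy :", greedy)
print("samples:", samples)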

Repast answered 22/5, 2021 at 23:39 Comment(1)
So, basically Deterministic Policy when deterministic=True and Stochastic Policy otherwise?Insanitary
