I am trying to apply the PPO algorithm from the stable-baselines3 library (https://stable-baselines3.readthedocs.io/en/master/) to a custom environment I made.
One thing I don't understand is the following line:
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=10, deterministic=True)
Should I always set deterministic to True? When I keep deterministic=True, my custom environment is "somehow" always solved (i.e., it always returns a reward of 1 +/- 0 std).
And when I change it to deterministic=False, it starts behaving in a reasonable way (i.e., sometimes it succeeds (reward=1) and sometimes it fails (reward=0)).
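For context, this is roughly the setup I am running (a minimal sketch; CustomEnv stands in for my actual environment and the timestep count is arbitrary):

from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy

env = CustomEnv()  # placeholder for my custom environment
model = PPO("MlpPolicy", env, verbose=0)
model.learn(total_timesteps=10_000)

# Greedy actions (mode of the policy distribution)
mean_det, std_det = evaluate_policy(model, env, n_eval_episodes=10, deterministic=True)
# Sampled actions (drawn from the policy distribution)
mean_sto, std_sto = evaluate_policy(model, env, n_eval_episodes=10, deterministic=False)
print(mean_det, std_det, mean_sto, std_sto)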
Could you share the rewards (with True and False) and a small explanation of your environment? That would help thinking about what goes wrong. My thought so far is that the default model initialization already solves the environment, and you never get a different action (one that makes you fail). – Transmittal

With Deterministic=True, start 0 and end of 49. With Deterministic=False, start 0 and end 31. Which seems reasonable.

For the rendering, the reason it is slow is that you are re-rendering the whole plot every time with more data. The best way to handle that is either making it a separate process and using a queue to transfer the data, or using a render interval, e.g. rendering every 20 steps. – Transmittal

So deterministic=True means a Deterministic Policy, and a Stochastic Policy otherwise? – Insanitary
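For the rendering point in the comments, this is how I understand the render-interval suggestion (a minimal sketch assuming the classic gym API; the class name, spaces, and placeholder dynamics are mine, not the real environment):

import gym
import numpy as np

class RenderIntervalEnv(gym.Env):
    # Toy env that only redraws its plot every `render_every` steps
    def __init__(self, render_every=20):
        super().__init__()
        self.render_every = render_every
        self.step_count = 0
        self.observation_space = gym.spaces.Box(low=-1.0, high=1.0, shape=(1,), dtype=np.float32)
        self.action_space = gym.spaces.Discrete(2)

    def reset(self):
        self.step_count = 0
        return np.zeros(1, dtype=np.float32)

    def step(self, action):
        self.step_count += 1
        obs = np.zeros(1, dtype=np.float32)  # placeholder dynamics
        reward, done, info = 0.0, False, {}
        if self.step_count % self.render_every == 0:
            self.render()  # redraw only every `render_every` steps, not every step
        return obs, reward, done, info

    def render(self, mode="human"):
        # redraw the plot here (or push data to a separate plotting process via a queue)
        pass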