How to use the replay buffer in tf_agents for a contextual bandit that predicts and trains on a daily basis

I am using the TF-Agents library for a contextual-bandits use case.

In this use case, predictions (between 20k and 30k per day, one for each user) are made multiple times a day, and training only happens on the data predicted 4 days earlier (since the labels for the predictions take 3 days to observe).

The driver seems to replay only batch_size experiences (since the max step length is 1 for contextual bandits), and the replay buffer has the same constraint, only holding batch_size experiences.

I want to use a checkpointer to save all the predictions from the past 4 days (the experience collected by the driver and stored in the replay buffer), and on each given day train only on the earliest of those 4 days.

I am unsure how to do the following, and any help is greatly appreciated.

  1. How to run the driver and save the replay buffer to checkpoints for an entire day (a day contains, say, 3 prediction runs, and each run makes predictions on 30,000 observations with, say, a batch size of 16). In this case I need multiple saves per day.
  2. How to save the replay buffers for the past 4 days (12 prediction runs) and retrieve only the first 3 prediction runs (the replay buffer and driver run) to train on each day.
  3. How to configure the driver, replay buffer and checkpointer given #1 and #2 above (a rough sketch of the wiring I have in mind follows this list).
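
For reference, here is a rough sketch of the setup I have in mind. Everything here is illustrative rather than my actual code; in particular `tf_env` and `agent` are placeholders for an already-built batched bandit environment and agent (e.g. a LinearUCBAgent):

```python
import tensorflow as tf

from tf_agents.drivers import dynamic_step_driver
from tf_agents.replay_buffers import tf_uniform_replay_buffer
from tf_agents.utils import common

# Illustrative numbers taken from the description above.
USERS_PER_RUN = 30_000   # observations per prediction run
BATCH_SIZE = 16          # environment / driver batch size
RUNS_PER_DAY = 3         # prediction runs per day
DAYS_TO_KEEP = 4         # labels arrive 3 days later, train on the 4-day-old data

steps_per_run = -(-USERS_PER_RUN // BATCH_SIZE)   # ceil division, ~1875 driver steps per run

# `tf_env` (a batched bandit TF environment) and `agent` (e.g. a LinearUCBAgent) are
# assumed to exist already; they are placeholders, not defined in this question.

# max_length is counted per batch row. With the max_length=1 used in most bandit
# examples only the most recent driver step survives, which is the behaviour I see.
replay_buffer = tf_uniform_replay_buffer.TFUniformReplayBuffer(
    data_spec=agent.collect_data_spec,
    batch_size=BATCH_SIZE,
    max_length=steps_per_run * RUNS_PER_DAY * DAYS_TO_KEEP)

# One run() call performs one prediction run over all users.
driver = dynamic_step_driver.DynamicStepDriver(
    env=tf_env,
    policy=agent.collect_policy,
    observers=[replay_buffer.add_batch],
    num_steps=steps_per_run * BATCH_SIZE)

# Checkpoint the buffer (and agent) so predictions survive until their labels arrive.
global_step = tf.compat.v1.train.get_or_create_global_step()
checkpointer = common.Checkpointer(
    ckpt_dir='/tmp/bandit_checkpoints',   # illustrative path
    max_to_keep=RUNS_PER_DAY * DAYS_TO_KEEP,
    agent=agent,
    replay_buffer=replay_buffer,
    global_step=global_step)
checkpointer.initialize_or_restore()

driver.run()                      # one prediction run
global_step.assign_add(1)
checkpointer.save(global_step)    # one save per run, i.e. multiple saves per day
```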
Dubbin asked 27/4, 2022 at 17:5

On the replay buffer side, I don't think there is any way to get that working without implementing your own RB class (which I wouldn't necessarily recommend). The most straightforward solution seems to be to take the memory-inefficiency hit and use two RBs with different max_length sizes. The first is the one given to the driver to store episodes; then rb.as_dataset(single_deterministic_pass=True) is used to pull the appropriate items out of it and place them in the memory of the second one, which is used for training. The only one you need to checkpoint, of course, is the first.
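
A rough sketch of what I mean, reusing the illustrative constants from the sketch in the question (`agent`, BATCH_SIZE, steps_per_run, RUNS_PER_DAY, DAYS_TO_KEEP); this is untested and only meant to show the shape of the idea:

```python
from tf_agents.replay_buffers import tf_uniform_replay_buffer

frames_per_day = steps_per_run * RUNS_PER_DAY   # driver steps per day, per batch row

# Big buffer: written by the driver and checkpointed, holds ~4 days of prediction runs.
collect_rb = tf_uniform_replay_buffer.TFUniformReplayBuffer(
    data_spec=agent.collect_data_spec,
    batch_size=BATCH_SIZE,
    max_length=frames_per_day * DAYS_TO_KEEP)

# Small scratch buffer: refilled on each training day with just the frames to learn from.
train_rb = tf_uniform_replay_buffer.TFUniformReplayBuffer(
    data_spec=agent.collect_data_spec,
    batch_size=BATCH_SIZE,
    max_length=frames_per_day)

# Walk the big buffer deterministically and copy over one day's worth of frames.
# CAVEAT: check on your TF-Agents version which slice of this pass corresponds to the
# oldest day (see the note below about the circular write cursor).
dataset = collect_rb.as_dataset(
    sample_batch_size=BATCH_SIZE, single_deterministic_pass=True)
for trajectory, _ in dataset.take(frames_per_day):
    train_rb.add_batch(trajectory)

# Train from the scratch buffer only, then clear it for the next day.
experience = train_rb.gather_all()   # deprecated in newer versions, but convenient here
agent.train(experience)
train_rb.clear()
```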

Note: I'm not sure off the top of my head exactly how single_deterministic_pass works; you may want to check it in order to determine which portion of the returned dataset corresponds to the day you want to train on. I also suspect that the portion corresponding to the last day shifts, because if I remember correctly, the RB table that stores the experiences works with a cursor that starts overwriting from the beginning once it reaches the maximum length.

Neither RB needs to know about the logic of how many prediction runs there are; in the end your code should manage that logic, and you might want to keep track (maybe in a pickle, if you want to save this) of how many predictions correspond to each day, so that you know which ones to pick.
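
Something as simple as this (purely illustrative) would do for that bookkeeping:

```python
import pickle

INDEX_FILE = 'prediction_index.pkl'   # illustrative path

# Map each day to the number of driver steps written for it,
# e.g. {'2022-04-24': 5625, '2022-04-25': 5632, ...}
try:
    with open(INDEX_FILE, 'rb') as f:
        steps_per_day = pickle.load(f)
except FileNotFoundError:
    steps_per_day = {}

def record_prediction_run(day, num_steps):
    """Call after every driver.run() with the day and the number of steps written."""
    steps_per_day[day] = steps_per_day.get(day, 0) + num_steps
    with open(INDEX_FILE, 'wb') as f:
        pickle.dump(steps_per_day, f)

# At training time, load the index and use it to work out which slice of the
# deterministic pass over the big RB belongs to the 4-day-old predictions.
```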

Aramen answered 28/4, 2022 at 20:4
@FedericoMalerba Thanks. I am also thinking of a different solution: handling this at the environment level. Do you think there is a way for a custom BanditPyEnvironment to generate observations (in the _observe function) and rewards (in the _apply_action function) differently during training vs. prediction? Maybe I can train using one environment and predict using a different one, but I wanted to see if I can do it with the same one (see the sketch after these comments). I am planning to keep all daily observations in a BigQuery table so that _observe and _apply_action can pull from the table. – Dubbin
You should not want to call the environment during training. In general, environment calls in RL tend to be the performance bottleneck (which is why TF-Agents offers a framework for batching multiple envs together). Since training calls tend to be much more frequent than prediction calls (you iterate over the data multiple times), you'd do best to stick to the RB for training. – Aramen
Thank you. In that case, is there a way to save the past 4 days of the RB? Also, I see the RB saves only the last batch run by the driver. In my case I have 20k observations per day and the batch size could be 64; it looks like the RB is saving only the last batch of 64 observations. Is there a way to save all 20k observations in a single replay buffer, and also a different replay buffer for each day, using the save code? And how can I retrieve the replay buffer during training? The replay buffer at prediction time doesn't have the corresponding reward; the reward is only available 3 days after prediction. – Dubbin
Also, is there a way to override the action that the policy takes (as received in _apply_action) and save the overridden action to the trajectory instead of the original action? I am wondering if it is possible to train the agent on historical data (say, data from a table). The observations and rewards can be pulled from the table using the environment, but in all the examples I saw, the action comes directly from the agent (and is fed into _apply_action). Is there a way to take the action from a table instead of from the agent? Thank you. – Dubbin
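
The kind of environment I mentioned in my first comment above would look roughly like this (all names are illustrative, a random in-memory array stands in for the BigQuery table, and this only covers serving logged observations and rewards, not overriding the policy's action):

```python
import numpy as np

from tf_agents.bandits.environments import bandit_py_environment
from tf_agents.environments import tf_py_environment
from tf_agents.specs import array_spec


class LoggedDataBanditEnv(bandit_py_environment.BanditPyEnvironment):
  """Serves logged (context, reward-per-arm) rows one at a time."""

  def __init__(self, contexts, rewards):
    # contexts: [num_rows, context_dim] float32, e.g. rows read from a table.
    # rewards:  [num_rows, num_arms] float32, the rewards observed 3 days later.
    self._contexts = contexts
    self._rewards = rewards
    self._row = 0
    observation_spec = array_spec.ArraySpec(
        shape=(contexts.shape[1],), dtype=np.float32, name='observation')
    action_spec = array_spec.BoundedArraySpec(
        shape=(), dtype=np.int32, minimum=0, maximum=rewards.shape[1] - 1, name='action')
    super(LoggedDataBanditEnv, self).__init__(observation_spec, action_spec)

  def _observe(self):
    self._observation = self._contexts[self._row]
    return self._observation

  def _apply_action(self, action):
    reward = self._rewards[self._row, action]
    self._row = (self._row + 1) % len(self._contexts)   # move on to the next logged row
    return reward


# Illustrative usage with random data standing in for the table contents.
contexts = np.random.uniform(-1.0, 1.0, size=(100, 8)).astype(np.float32)
rewards = np.random.uniform(0.0, 1.0, size=(100, 3)).astype(np.float32)
tf_env = tf_py_environment.TFPyEnvironment(LoggedDataBanditEnv(contexts, rewards))
```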
