Why does Monte Carlo Tree Search reset the tree?

I had a small but potentially stupid question about Monte Carlo Tree Search. I understand most of it but have been looking at some implementations and noticed that after the MCTS is run for a given state and a best move returned, the tree is thrown away. So for the next move, we have to run MCTS from scratch on this new state to get the next best position.

I was just wondering why we don't retain some of the information from the old tree. It seems like there is valuable information about the states in the old tree, especially given that the best move is one where the MCTS has explored most. Is there any particular reason we can't use this old information in some useful way?

Cuddle answered 20/11, 2017 at 10:18 Comment(2)
Probably because of stochastic dependence. The root problem has changed, so different paths might be traversed. In minimax I would think that, given a 50-move decision, we could reuse 1/50 of our already pre-computed data (simplified; the loss is huge), but in MCTS it may not be as trivial to prove mathematically whether re-using it is sound or not. I think this paper analyzes exactly that (chapter 5). This is an interesting question, but I'm convinced it's not well suited for StackOverflow, as the topic is far from coding and closer to math. – Holub
Just for future reference (the comment above was too long): the paper I linked is Powley, Edward J., Peter I. Cowling, and Daniel Whitehouse. "Information capture and reuse strategies in Monte Carlo Tree Search, with applications to games of hidden information." Artificial Intelligence 217 (2014): 92-116. – Holub

Some implementations do indeed retain the information.

For example, the AlphaGo Zero paper says:

The search tree is reused at subsequent time-steps: the child node corresponding to the played action becomes the new root node; the subtree below this child is retained along with all its statistics, while the remainder of the tree is discarded

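For illustration, here is a minimal sketch of what that re-rooting could look like in Python. The `Node` class and the names `visit_count`, `value_sum`, and `advance_root` are my own placeholders, not from the AlphaGo Zero code:

```python
class Node:
    def __init__(self, prior=0.0):
        self.prior = prior          # prior probability from the policy
        self.visit_count = 0        # visits accumulated during search
        self.value_sum = 0.0        # sum of backed-up values
        self.children = {}          # action -> Node

def advance_root(root, played_action):
    """Keep the subtree under the played action (with all its statistics)
    as the new root; everything else in the old tree is dropped."""
    new_root = root.children.pop(played_action, Node())
    root.children.clear()           # discard the remaining siblings
    return new_root
```

On the next move you would run the usual MCTS iterations starting from the returned node, so the visit counts and values already gathered for that subtree are not wasted.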
Pentadactyl answered 20/11, 2017 at 11:28 Comment(2)
Why is the remainder of the tree thrown away? Considering the policy is fixed, the information gathered during the MCTS runs doesn't get stale at all. Is data discarded just to free up RAM? – Dictum
I agree it may well help to keep positions reachable via transpositions, especially in a game like Go. This sounds like a potential improvement. – Pentadactyl

Well, the reason may be the following.

Rollouts are truncated value estimates; contributions beyond the maximum rollout length are discarded.

Assume that the maximum rollout depth is N.

Consider an environment where the average reward is != 0 (say > 0).

After an action is taken and an observation is obtained, a child node of the tree is selected as the new root.

Now the maximum length of the branches, and of the rollouts that participated in the evaluation of a node's value, is N-1, since the old root node has been discarded.

However, the new simulations will still have length N, and they will have to be combined with the old simulations of length N-1.

The longer simulations will have a biased value, since the average reward is != 0.

This means that nodes evaluated with a mix of simulation lengths will have a bias that depends on the ratio of simulations of different lengths.
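As a toy numeric illustration of that mixed-length bias (my own example, not from this answer): with a positive average step reward, length-N rollouts return more on average than length-(N-1) rollouts, so a node value that averages both depends on the mixing ratio.

```python
import random

def rollout_return(length, mean_step_reward=0.1):
    # truncated return: sum of noisy per-step rewards up to the rollout length
    return sum(random.gauss(mean_step_reward, 1.0) for _ in range(length))

random.seed(0)
N = 50
old = [rollout_return(N - 1) for _ in range(10000)]  # rollouts kept from the reused tree
new = [rollout_return(N) for _ in range(10000)]      # fresh rollouts after re-rooting

print(sum(old) / len(old))              # about 0.1 * (N - 1) = 4.9
print(sum(new) / len(new))              # about 0.1 * N = 5.0
print(sum(old + new) / len(old + new))  # the mixed estimate lands in between
```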

Another reason why recycling old, shorter simulations is avoided is the bias induced on the sampling. Imagine a T-maze where, at depth d on the left, there is a maximum reward R/2, while at depth d+1 on the right there is a maximum reward R. All the paths to the left that, during the first step, were able to reach the R/2 reward at depth d will be favoured during the second step when the tree is recycled, while paths to the right will be less common and there will be a higher chance of not reaching the reward R. Starting from an empty tree gives the same probability to both sides of the maze.

AlphaGo Zero (see Peter de Rivaz's answer) actually does not use rollouts but a value approximation (generated by a deep network). Those values are not truncated estimates, so AlphaGo Zero is not affected by this branch-length bias.

AlphaGo, the predecessor of AlphaGo Zero, combined rollouts with the value approximation and also reused the tree, but the new version does not use rollouts, perhaps for this reason. Also, both AlphaGo Zero and AlphaGo do not use the value of an action but the number of times it was selected during the search. This count may be less affected by the length bias, at least in the case where the average reward is negative.
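A hedged sketch of that final move selection by visit count, reusing the hypothetical `Node` from the earlier sketch:

```python
def select_move(root):
    # pick the action whose child was visited most often during search,
    # rather than the action with the highest mean value
    return max(root.children, key=lambda a: root.children[a].visit_count)
```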

Hope this is clear.

Anthracene answered 21/8, 2018 at 23:39 Comment(0)
