Monte Carlo Tree Search agent in a game of Isolation - Debug Suggestions

TLDR

My MCTS agent implementation runs without errors locally, achieving win rates of >40% against a heuristic-driven minimax player, but it fails the autograder, which must be passed before the project can be submitted. The autograder throws IndexError: Cannot choose from an empty sequence. I'm looking for suggestions on which part of the code is most likely to throw this exception.

Hi, I am currently stuck on this project, which I need to pass within the next 2 weeks in order to complete the program I'm enrolled in. My task, which I have already implemented, is to build an agent that plays against a heuristic-driven minimax agent in a game of Isolation between two chess knights. Full implementation details of the game can be found here. For my project, the game is played on a 9 x 11 board using bitboard encoding. My implementation of MCTS is straightforward and follows closely the pseudocode provided in this paper (pg. 6).

In essence, the general MCTS approach comprises these 4 parts, each implemented by the following nested functions in the CustomPlayer class:

  1. Selection - tree_policy
  2. Expansion - best_child, expand
  3. Simulation - default_policy
  4. Backpropagation - backup_negamax, update_scores

    import math
    import random
    import time
    import logging
    
    from copy import deepcopy
    from collections import namedtuple
    
    from sample_players import DataPlayer
    
    
    class CustomPlayer(DataPlayer):
        """ Implement your own agent to play knight's Isolation
        The get_action() method is the only required method for this project.
        You can modify the interface for get_action by adding named parameters
        with default values, but the function MUST remain compatible with the
        default interface.
        **********************************************************************
        NOTES:
        - The test cases will NOT be run on a machine with GPU access, nor be
          suitable for using any other machine learning techniques.
        - You can pass state forward to your agent on the next turn by assigning
          any pickleable object to the self.context attribute.
        **********************************************************************
        """
        def get_action(self, state):
            """ Employ an adversarial search technique to choose an action
            available in the current state calls self.queue.put(ACTION) at least
            This method must call self.queue.put(ACTION) at least once, and may
            call it as many times as you want; the caller will be responsible
            for cutting off the function after the search time limit has expired.
            See RandomPlayer and GreedyPlayer in sample_players for more examples.
            **********************************************************************
            NOTE: 
            - The caller is responsible for cutting off search, so calling
              get_action() from your own code will create an infinite loop!
              Refer to (and use!) the Isolation.play() function to run games.
            **********************************************************************
            """
            logging.info("Move %s" % state.ply_count)
            self.queue.put(random.choice(state.actions()))
            i = 1
            statlist = []
    
            while (self.queue._TimedQueue__stop_time - 0.05) > time.perf_counter():
                next_action = self.uct_search(state, statlist, i)
                self.queue.put(next_action)
                i += 1
    
    
        def uct_search(self, state, statlist, i):
            plyturn = state.ply_count % 2
            Stat = namedtuple('Stat', 'state action utility visit nround')
    
            def tree_policy(state):
                statecopy = deepcopy(state)
    
                while not statecopy.terminal_test():
                    # All taken actions at this depth
                    tried = [s.action for s in statlist if s.state == statecopy]
                    # See if there's any untried actions left
                    untried = [a for a in statecopy.actions() if a not in tried]
    
                    topop = []
                    toappend = []
    
                    if len(untried) > 0:
                        next_action = random.choice(untried)
                        statecopy = expand(statecopy, next_action)
                        break
                    else:
                        next_action = best_child(statecopy, 1)
    
                        for k, s in enumerate(statlist):
                            if s.state == statecopy and s.action == next_action:
                                visit1 = statlist[k].visit + 1
                                news = statlist[k]._replace(visit=visit1)
                                news = news._replace(nround=i)
    
                                topop.append(k)
                                toappend.append(news)
                                break
    
                        update_scores(topop, toappend)
                        statecopy = statecopy.result(next_action)
                return statecopy
    
    
            def expand(state, action):
                """
                Returns a state resulting from taking an action from the list of untried nodes
                """
                statlist.append(Stat(state, action, 0, 1, i))
                return state.result(action)
    
    
            def best_child(state, c):
                """
                Returns the state resulting from taking the best action. c value between 0 (max score) and 1 (prioritize exploration)
                """
                # All taken actions at this depth
                tried = [s for s in statlist if s.state == state]
    
                maxscore = -999
                maxaction = []
                # Compute the score
                for t in tried:
                    score = (t.utility/t.visit) + c * math.sqrt(2 * math.log(i)/t.visit)
                    if score > maxscore:
                        maxscore = score
                        del maxaction[:]
                        maxaction.append(t.action)
                    elif score == maxscore:
                        maxaction.append(t.action)
    
                if len(maxaction) < 1:
                    logging.error("IndexError: maxaction is empty!")
    
                return random.choice(maxaction)
    
    
            def default_policy(state):
                """
                The simulation to run when visiting unexplored nodes. Defaults to uniform random moves
                """
                while not state.terminal_test():
                    state = state.result(random.choice(state.actions()))
    
                delta = state.utility(self.player_id)
                if abs(delta) == float('inf') and delta < 0:
                    delta = -1
                elif abs(delta) == float('inf') and delta > 0:
                    delta = 1
                return delta
    
    
            def backup_negamax(delta):
                """
                Propagates the terminal utility up the search tree
                """
                topop = []
                toappend = []
                for k, s in enumerate(statlist):
                    if s.nround == i:
                        if s.state.ply_count % 2 == plyturn:
                            utility1 = s.utility + delta
                            news = statlist[k]._replace(utility=utility1)
                        elif s.state.ply_count % 2 != plyturn:
                            utility1 = s.utility - delta
                            news = statlist[k]._replace(utility=utility1)
    
                        topop.append(k)
                        toappend.append(news)
    
                update_scores(topop, toappend)
                return
    
    
            def update_scores(topop, toappend):
                # Remove outdated tuples. Order needs to be in reverse or pop will fail!
                for p in sorted(topop, reverse=True):
                    statlist.pop(p)
                # Add the updated ones
                for a in toappend:
                    statlist.append(a)
                return
    
    
            next_state = tree_policy(state)
            if not next_state.terminal_test():
                delta = default_policy(next_state)
                backup_negamax(delta)
    
            return best_child(state, 0)
    

The lack of syntax highlighting does make the code hard to read here, so please feel free to check it out on my GitHub. I have no issues running the game locally: my MCTS agent achieves win rates of >40% (under a 150 ms/move limit) against the minimax player. However, when I submit my code to the autograder, it gets rejected with the IndexError: Cannot choose from an empty sequence exception.

From my discussion with the course representative, we believe that the error is most likely caused by the usage of random.choice(). There are 4 instances of its usage in my implementation:

  1. Line 39, before the MCTS algorithm, to feed a random move to the queue
  2. Line 66, to randomly select one move that has not been tried
  3. Line 114, to randomly select an action should there be a tie in the score of the best moves
  4. Line 122, to simulate the game randomly until terminal state for a chosen move

I assume that the game implementation is correct and that calling state.actions() will always return a list of possible moves as long as the state is not terminal. Therefore, the only instance that can trigger this exception is Item 3. Items 1 and 4 simply select randomly from the available actions, and for Item 2 an explicit check is already in place to make sure that random.choice() is not fed an empty list. Hence, I added logging to Item 3 (even though no exception has ever been thrown locally) and, sure enough, did not catch any empty sequence after 50 games.
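
As a debugging aid, every random.choice() call could also be routed through a small wrapper that logs which of the four call sites received an empty sequence before the IndexError surfaces. This is just a sketch; safe_choice is a hypothetical helper, not part of my submission:

    import logging
    import random

    def safe_choice(seq, label):
        """Log which call site received an empty sequence before the usual
        IndexError is raised, so the autograder traceback points at a
        specific usage (Items 1-4 above)."""
        if not seq:
            logging.error("random.choice received an empty sequence at: %s", label)
        return random.choice(seq)

    # Example usage inside get_action():
    #   self.queue.put(safe_choice(state.actions(), "Item 1: opening move"))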

I apologize for the lengthy post, but I do hope that someone out there may be able to catch something that I have missed in my implementation.

Lindahl asked 12/3, 2019 at 9:23
have you tried posting at codereview.stackexchange.com as well? – Grandpapa
I'll do that now. Wasn't aware of that channel's existence. Thanks! – Lindahl
Also try to fix the random seed on the submission to see if that gives you any insights. Then, before returning a sequence, if the sequence is empty, return foobar (some output that is not necessarily good, but would be valid). This will help you temporarily bypass the problem to see where you stand. In fact, depending on whether it's allowed and possible, you might want to return a string with all your state as a form of debugging: if the error message contains that string, it will give you insight into what went wrong. – Residence
@Residence As a precursor to all the elaborate tests described in the question, get_action was set up with just self.queue.put(random.choice(state.actions())) without going into MCTS, and the autograder happily accepted it without complaints. Therefore, I think that I am on the right track with the 4 instances of random.choice that I listed above, in particular Item 3. My only issue is that I haven't been able to catch any empty sequences despite running so many rounds of games. – Lindahl
I have a question about lines 167 and 168. What happens if your opponent has a legal move and you don't? In this case, I think next_state.terminal_test() is False and next_state.actions() is empty. Can you change the test and re-submit? – Naranjo
Generally when you get a problem like this, it is due to the score = (t.utility/t.visit) + c * math.sqrt(2 * math.log(i)/t.visit) line evaluating to NaN for all possible actions, which causes all of the if score > maxscore: tests to fail. That would usually be due to i and/or t.visit evaluating to 0... but I can't realistically see any cases in this code where that would happen. I do see that the consistent use of i throughout the entire tree here is incorrect (it should be the visit count of the parent node you're currently looking at), but that shouldn't cause the crash. – Buttercup
@SylvainBiehler I'm sorry for the late reply; I was just too caught up with other things over the past few days. I'm glad for your question; it helped me look at this with a fresh perspective! It seems that the autograder expects the call to queue.put() to supply an empty move at a terminal state, which I have implemented with an if/else statement to avoid going into the uct_search function when the state is terminal. – Lindahl
@DennisSoemers Thank you for taking the time to look through the implementation! I don't think I am applying i throughout the tree. The idea is that MCTS selects a move via the tree_policy during its turn, i.e. 1. pick an untried node if any, else 2. pick the node with the highest score. I apply a consistent i for the nodes selected under the second rule because the terminal value is a result of that combination of selections (score-based selection). Under rule 1, MCTS goes into random simulation after the node selection and therefore only the parent node is updated with the score. – Lindahl
@Lindahl If you mean to implement the standard UCB1 policy there, it should look like score = (t.utility/t.visit) + c * math.sqrt(2 * math.log(N)/t.visit), with N instead of i, where N should be the visit count of the "parent" (the node that all those actions in tried are branching out of). These two are identical right at the root node (because the number of MCTS iterations i is equal to the visit count of the root node), but they are very different further down in the tree. – Buttercup
@DennisSoemers Yes, this is supposed to be the standard UCB1 implementation. I get what you're trying to explain. I'm thinking that the impact isn't too far off from the original intention, the difference being the magnitude of the exploration term (which gets a heavier weight because of this, I believe). However, in the final selection, the move ultimately selected is still determined by the exploitation term. – Lindahl
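
For readers following the UCB1 discussion in the comments, here is a rough sketch of the scoring rule the commenter describes, with the parent node's visit count N in place of the global iteration counter i. This reflects the comment's suggestion, not the code that was actually submitted:

    import math

    def ucb1(utility, visits, parent_visits, c=1.0):
        """Standard UCB1 score: average payoff plus an exploration bonus
        that shrinks as the child node is visited more often."""
        exploit = utility / visits
        explore = c * math.sqrt(2 * math.log(parent_visits) / visits)
        return exploit + explore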

I would recommend writing unit tests for each of the functions you have. Verify your assumptions about the behavior of your code, and try to be comprehensive about testing it, including corner cases. Just having to construct abstract test data usually reveals a lot about the architecture and details of a solution.
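
For example, here is a minimal pytest-style sketch of a property check for the key assumption in the question, namely that a non-terminal state always has at least one legal action. It assumes the project's isolation module exposes Isolation() for a fresh game state, as used in the code above:

    import random
    from isolation import Isolation  # assumption: the project's game module

    def test_nonterminal_states_have_actions():
        """Play a few random games and assert that state.actions() is
        non-empty whenever terminal_test() is False."""
        for _ in range(20):
            state = Isolation()
            while not state.terminal_test():
                actions = state.actions()
                assert actions, "non-terminal state with no legal actions"
                state = state.result(random.choice(actions))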

Furthermore, you could try letting your agent play another suitable game. These steps will give you a good chance of discovering whatever the bug in your code is, and will make the code more suitable for reuse in the future.

Cletacleti answered 20/4, 2019 at 18:16
