MCTS, AlphaGo, AlphaZero: 게임 AI의 혁명

MCTS (Monte Carlo Tree Search) 는 체스, 바둑 같은 게임에서 최적의 수를 찾는 탐색 알고리즘입니다. DeepMind는 MCTS와 딥러닝을 결합하여 AlphaGo, AlphaZero를 만들어 인간 최고수를 꺾었습니다. 이 글에서는 MCTS의 원리와 AlphaGo/AlphaZero의 작동 방식을 살펴봅니다.

1. 게임 트리 탐색의 기초

Minimax 알고리즘

게임 AI의 고전적 방법:

나의 턴: 가장 높은 점수 선택 (max)
상대 턴: 가장 낮은 점수 선택 (min)

def minimax(node, depth, is_maximizing):
    if depth == 0 or node.is_terminal():
        return evaluate(node)

    if is_maximizing:
        max_eval = float('-inf')
        for child in node.children():
            eval = minimax(child, depth - 1, False)
            max_eval = max(max_eval, eval)
        return max_eval
    else:
        min_eval = float('inf')
        for child in node.children():
            eval = minimax(child, depth - 1, True)
            min_eval = min(min_eval, eval)
        return min_eval

Minimax의 한계

복잡도: 체스는 매 턴 ~35개 선택지, 게임 평균 ~80턴 → 35^80 경우의 수
바둑: 19×19 보드, 매 턴 ~250개 선택지 → 탐색 불가능

해결 방향

Alpha-Beta 가지치기: 불필요한 탐색 제거 (체스에서 사용)
MCTS: 유망한 경로만 샘플링 (바둑에서 사용)

2. MCTS (Monte Carlo Tree Search)

핵심 아이디어

모든 경우의 수를 탐색하지 않고, 무작위 시뮬레이션을 통해 유망한 수를 찾습니다.

4단계 반복

     Selection → Expansion → Simulation → Backpropagation
         ↓          ↓           ↓              ↓
    유망한 노드   새 노드     무작위로      결과를 거슬러
     선택하기    확장하기   게임 끝까지    올라가며 업데이트

1) Selection (선택)

루트에서 시작하여 UCB1 공식으로 자식 노드 선택:

$UCB1 = \frac{W_i}{N_i} + c \sqrt{\frac{\ln N_{parent}}{N_i}}$

W_i: 해당 노드의 승리 횟수
N_i: 해당 노드의 방문 횟수
c: 탐험 상수 (보통 √2)

첫 번째 항: 활용 (승률 높은 노드) 두 번째 항: 탐험 (방문 횟수 적은 노드)

2) Expansion (확장)

선택된 리프 노드에서 가능한 수 중 하나를 새 자식으로 추가

3) Simulation (시뮬레이션/Rollout)

새 노드에서 게임이 끝날 때까지 무작위로 진행

4) Backpropagation (역전파)

시뮬레이션 결과(승/패)를 루트까지 거슬러 올라가며 업데이트

3. MCTS 구현

import numpy as np
import random
from collections import defaultdict

class MCTSNode:
    def __init__(self, state, parent=None, action=None):
        self.state = state
        self.parent = parent
        self.action = action  # 이 노드에 도달한 행동
        self.children = []
        self.visits = 0
        self.wins = 0
        self.untried_actions = state.get_legal_actions()

    def is_fully_expanded(self):
        return len(self.untried_actions) == 0

    def is_terminal(self):
        return self.state.is_game_over()

    def ucb1(self, c=1.41):
        if self.visits == 0:
            return float('inf')
        exploitation = self.wins / self.visits
        exploration = c * np.sqrt(np.log(self.parent.visits) / self.visits)
        return exploitation + exploration

    def best_child(self, c=1.41):
        return max(self.children, key=lambda child: child.ucb1(c))

    def expand(self):
        action = self.untried_actions.pop()
        next_state = self.state.apply_action(action)
        child = MCTSNode(next_state, parent=self, action=action)
        self.children.append(child)
        return child

    def backpropagate(self, result):
        self.visits += 1
        self.wins += result
        if self.parent:
            # 상대 관점에서의 결과
            self.parent.backpropagate(1 - result)


class MCTS:
    def __init__(self, iterations=1000, c=1.41):
        self.iterations = iterations
        self.c = c

    def search(self, root_state):
        root = MCTSNode(root_state)

        for _ in range(self.iterations):
            node = root

            # 1. Selection
            while not node.is_terminal() and node.is_fully_expanded():
                node = node.best_child(self.c)

            # 2. Expansion
            if not node.is_terminal() and not node.is_fully_expanded():
                node = node.expand()

            # 3. Simulation
            state = node.state.clone()
            while not state.is_game_over():
                action = random.choice(state.get_legal_actions())
                state = state.apply_action(action)
            result = state.get_result(root_state.current_player)

            # 4. Backpropagation
            node.backpropagate(result)

        # 가장 많이 방문된 자식의 행동 반환
        return max(root.children, key=lambda c: c.visits).action


# 예시 게임 상태 인터페이스
class GameState:
    def get_legal_actions(self):
        """가능한 행동 리스트 반환"""
        pass

    def apply_action(self, action):
        """행동을 적용한 새 상태 반환"""
        pass

    def is_game_over(self):
        """게임 종료 여부"""
        pass

    def get_result(self, player):
        """player 관점에서 결과 (1: 승, 0: 패, 0.5: 무)"""
        pass

    def clone(self):
        """상태 복사"""
        pass

4. 틱택토에서 MCTS 테스트

class TicTacToe:
    def __init__(self, board=None, current_player=1):
        self.board = board if board else [0] * 9
        self.current_player = current_player

    def get_legal_actions(self):
        return [i for i in range(9) if self.board[i] == 0]

    def apply_action(self, action):
        new_board = self.board.copy()
        new_board[action] = self.current_player
        return TicTacToe(new_board, -self.current_player)

    def is_game_over(self):
        return self.get_winner() != 0 or len(self.get_legal_actions()) == 0

    def get_winner(self):
        lines = [
            [0, 1, 2], [3, 4, 5], [6, 7, 8],  # 가로
            [0, 3, 6], [1, 4, 7], [2, 5, 8],  # 세로
            [0, 4, 8], [2, 4, 6]              # 대각선
        ]
        for line in lines:
            if self.board[line[0]] == self.board[line[1]] == self.board[line[2]] != 0:
                return self.board[line[0]]
        return 0

    def get_result(self, player):
        winner = self.get_winner()
        if winner == player:
            return 1
        elif winner == -player:
            return 0
        return 0.5

    def clone(self):
        return TicTacToe(self.board.copy(), self.current_player)

    def __str__(self):
        symbols = {0: '.', 1: 'X', -1: 'O'}
        rows = [' '.join(symbols[self.board[i*3 + j]] for j in range(3)) for i in range(3)]
        return '\n'.join(rows)


# 테스트
if __name__ == "__main__":
    game = TicTacToe()
    mcts = MCTS(iterations=1000)

    while not game.is_game_over():
        print(game)
        print()

        if game.current_player == 1:
            action = mcts.search(game)
            print(f"MCTS chooses: {action}")
        else:
            action = int(input("Your move (0-8): "))

        game = game.apply_action(action)

    print(game)
    winner = game.get_winner()
    print(f"Winner: {'X' if winner == 1 else 'O' if winner == -1 else 'Draw'}")

5. AlphaGo (2016)

AlphaGo는 MCTS에 딥러닝을 결합했습니다.

구성 요소

Policy Network (정책 네트워크): 다음 수 예측
- 입력: 바둑판 상태
- 출력: 각 위치에 둘 확률
Value Network (가치 네트워크): 승률 예측
- 입력: 바둑판 상태
- 출력: 현재 플레이어의 승률 (0~1)

학습 과정

1. Supervised Learning: 인간 기보로 Policy Network 초기 학습
2. Self-Play RL: 자기 자신과 대국하며 Policy Network 강화
3. Value Network 학습: Self-play 결과로 승률 예측 학습

MCTS + 신경망

기존 MCTS:

Selection: UCB1 사용
Simulation: 무작위 rollout

AlphaGo MCTS:

Selection: Policy Network가 가이드
Simulation: Value Network로 대체 (빠름)

# AlphaGo 스타일 MCTS Selection
def select_child_alphago(node, policy_net, c_puct=1.0):
    """PUCT 알고리즘 사용"""
    best_score = float('-inf')
    best_child = None

    for child in node.children:
        # 정책 네트워크의 사전 확률
        prior = policy_net.get_prior(node.state, child.action)

        # PUCT 점수
        q_value = child.wins / child.visits if child.visits > 0 else 0
        u_value = c_puct * prior * np.sqrt(node.visits) / (1 + child.visits)
        score = q_value + u_value

        if score > best_score:
            best_score = score
            best_child = child

    return best_child

6. AlphaZero (2017)

AlphaZero는 인간 기보 없이 처음부터 자기 대국만으로 학습합니다.

AlphaGo vs AlphaZero

특성	AlphaGo	AlphaZero
초기 학습	인간 기보 필요	불필요 (self-play only)
네트워크	Policy + Value 분리	하나의 네트워크
Rollout	있음	없음 (Value로 대체)
범용성	바둑 전용	체스, 장기, 바둑 모두 가능

단일 네트워크

class AlphaZeroNetwork(nn.Module):
    def __init__(self, board_size=19, num_actions=362, num_residual_blocks=19):
        super().__init__()
        self.board_size = board_size

        # 입력 처리
        self.conv_input = nn.Sequential(
            nn.Conv2d(17, 256, kernel_size=3, padding=1),
            nn.BatchNorm2d(256),
            nn.ReLU()
        )

        # Residual Blocks
        self.residual_blocks = nn.ModuleList([
            ResidualBlock(256) for _ in range(num_residual_blocks)
        ])

        # Policy Head
        self.policy_head = nn.Sequential(
            nn.Conv2d(256, 2, kernel_size=1),
            nn.BatchNorm2d(2),
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(2 * board_size * board_size, num_actions)
        )

        # Value Head
        self.value_head = nn.Sequential(
            nn.Conv2d(256, 1, kernel_size=1),
            nn.BatchNorm2d(1),
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(board_size * board_size, 256),
            nn.ReLU(),
            nn.Linear(256, 1),
            nn.Tanh()  # -1 ~ 1
        )

    def forward(self, x):
        x = self.conv_input(x)
        for block in self.residual_blocks:
            x = block(x)

        policy = self.policy_head(x)
        value = self.value_head(x)

        return F.softmax(policy, dim=1), value


class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        residual = x
        x = F.relu(self.bn1(self.conv1(x)))
        x = self.bn2(self.conv2(x))
        x = F.relu(x + residual)
        return x

AlphaZero MCTS

class AlphaZeroMCTS:
    def __init__(self, network, num_simulations=800, c_puct=1.0):
        self.network = network
        self.num_simulations = num_simulations
        self.c_puct = c_puct

    def search(self, root_state):
        root = Node(root_state)

        # 루트 노드 초기화
        policy, value = self.network(state_to_tensor(root_state))
        root.expand(policy)

        for _ in range(self.num_simulations):
            node = root

            # Selection: PUCT로 리프까지
            while node.is_expanded():
                node = self.select_child(node)

            # Expansion + Evaluation
            if not node.state.is_terminal():
                policy, value = self.network(state_to_tensor(node.state))
                node.expand(policy)
            else:
                value = node.state.get_result()

            # Backpropagation
            while node is not None:
                node.visits += 1
                node.value_sum += value
                value = -value  # 상대 관점
                node = node.parent

        # 방문 횟수에 비례하는 확률로 행동 선택
        visits = [child.visits for child in root.children]
        return root.children[np.argmax(visits)].action

    def select_child(self, node):
        best_score = float('-inf')
        best_child = None

        for child in node.children:
            # PUCT score
            q = child.value_sum / child.visits if child.visits > 0 else 0
            u = self.c_puct * child.prior * np.sqrt(node.visits) / (1 + child.visits)
            score = q + u

            if score > best_score:
                best_score = score
                best_child = child

        return best_child

Self-Play 학습

def self_play_game(network, mcts):
    """한 게임의 self-play 데이터 생성"""
    game_states = []
    mcts_policies = []

    state = initial_state()

    while not state.is_terminal():
        # MCTS로 정책 생성
        mcts_policy = mcts.get_policy(state)

        game_states.append(state)
        mcts_policies.append(mcts_policy)

        # 정책에 따라 행동 선택 (학습 시에는 약간의 노이즈 추가)
        action = np.random.choice(len(mcts_policy), p=mcts_policy)
        state = state.apply_action(action)

    # 게임 결과
    winner = state.get_winner()

    # 학습 데이터 생성
    training_data = []
    for i, (s, pi) in enumerate(zip(game_states, mcts_policies)):
        # 각 상태에서의 결과 (해당 플레이어 관점)
        value = 1 if winner == s.current_player else -1 if winner == -s.current_player else 0
        training_data.append((s, pi, value))

    return training_data


def train_alphazero(network, num_iterations=100):
    optimizer = optim.Adam(network.parameters(), lr=0.01)
    replay_buffer = []

    for iteration in range(num_iterations):
        # Self-play로 데이터 생성
        mcts = AlphaZeroMCTS(network)
        for _ in range(100):  # 100 게임
            game_data = self_play_game(network, mcts)
            replay_buffer.extend(game_data)

        # 네트워크 학습
        for _ in range(1000):  # 1000 배치
            batch = random.sample(replay_buffer, 256)
            states, target_policies, target_values = zip(*batch)

            # Forward
            pred_policies, pred_values = network(states_to_tensor(states))

            # Loss
            policy_loss = cross_entropy(pred_policies, target_policies)
            value_loss = mse(pred_values, target_values)
            loss = policy_loss + value_loss

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

7. AlphaZero의 성과

게임	학습 시간	결과
체스	4시간	Stockfish (당시 최강) 압도
장기	2시간	Elmo (당시 최강) 압도
바둑	8시간	기존 AlphaGo 압도

핵심: 인간 지식 없이 게임 규칙만으로 초인적 수준 달성

8. 간단한 AlphaZero 구현 (Connect4)

class Connect4:
    """4목 게임"""
    def __init__(self, board=None, player=1):
        self.rows = 6
        self.cols = 7
        self.board = board if board else np.zeros((self.rows, self.cols))
        self.player = player

    def get_legal_actions(self):
        return [c for c in range(self.cols) if self.board[0][c] == 0]

    def apply_action(self, col):
        new_board = self.board.copy()
        for row in range(self.rows - 1, -1, -1):
            if new_board[row][col] == 0:
                new_board[row][col] = self.player
                break
        return Connect4(new_board, -self.player)

    def is_terminal(self):
        return self.check_winner() != 0 or len(self.get_legal_actions()) == 0

    def check_winner(self):
        # 가로, 세로, 대각선 검사
        for row in range(self.rows):
            for col in range(self.cols):
                if self.board[row][col] != 0:
                    if self._check_line(row, col, 0, 1) or \
                       self._check_line(row, col, 1, 0) or \
                       self._check_line(row, col, 1, 1) or \
                       self._check_line(row, col, 1, -1):
                        return self.board[row][col]
        return 0

    def _check_line(self, row, col, dr, dc):
        player = self.board[row][col]
        for i in range(1, 4):
            r, c = row + dr * i, col + dc * i
            if not (0 <= r < self.rows and 0 <= c < self.cols):
                return False
            if self.board[r][c] != player:
                return False
        return True


class SimpleAlphaZeroNet(nn.Module):
    """간단한 Connect4용 네트워크"""
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 64, 3, padding=1)
        self.conv2 = nn.Conv2d(64, 64, 3, padding=1)

        self.policy_fc = nn.Linear(64 * 6 * 7, 7)
        self.value_fc = nn.Sequential(
            nn.Linear(64 * 6 * 7, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
            nn.Tanh()
        )

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        x = x.view(x.size(0), -1)

        policy = F.softmax(self.policy_fc(x), dim=1)
        value = self.value_fc(x)
        return policy, value

9. MCTS 변형들

Progressive Widening

행동 공간이 연속적이거나 매우 클 때 점진적으로 자식 추가

RAVE (Rapid Action Value Estimation)

같은 행동의 결과를 공유하여 학습 가속

Gumbel MuZero (2021)

Gumbel-Top-k 샘플링으로 MCTS 효율 개선

Quiz

MCTS의 UCB1 공식에서 탐험 항(exploration term)의 역할은?