MuZero: An AI That Learns Without Being Given the Rules

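MuZero is an algorithm that DeepMind introduced in 2019. It goes beyond AlphaZero: even when it is not told the rules of the game, it learns its own model of the environment and plans with it. At the time of publication it reached top performance not only in Go and chess but also across the 57 games of the Atari benchmark.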

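1. The Limits of AlphaZero

AlphaZero has to be given the rules of the game:

Which moves are legal
How the board changes when a move is played
When the game ends

Without this information it cannot run MCTS (Monte Carlo Tree Search), because every simulated move in the search tree is generated from the rules.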
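To make that dependency concrete, here is a minimal sketch of the kind of rules interface a rule-based search relies on. The `TicTacToeRules` class and its method names are hypothetical and chosen only for illustration; they are not AlphaZero's actual API.

```python
class TicTacToeRules:
    """Hypothetical rules object: the three things the search must be told."""

    LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
             (0, 3, 6), (1, 4, 7), (2, 5, 8),
             (0, 4, 8), (2, 4, 6)]

    def legal_moves(self, board):
        # Which moves are legal: any empty cell (0 means empty).
        return [i for i, cell in enumerate(board) if cell == 0]

    def apply(self, board, move, player):
        # How the board changes: return a new board with the move played.
        new_board = list(board)
        new_board[move] = player
        return new_board

    def winner(self, board):
        # Returns the winning player (1 or 2), or 0 if nobody has won.
        for a, b, c in self.LINES:
            if board[a] != 0 and board[a] == board[b] == board[c]:
                return board[a]
        return 0

    def is_terminal(self, board):
        # When the game ends: someone has won or the board is full.
        return self.winner(board) != 0 or not self.legal_moves(board)


rules = TicTacToeRules()
board = rules.apply([0] * 9, move=4, player=1)
print(rules.legal_moves(board))   # every cell except the centre
print(rules.is_terminal(board))   # False: the game has only just started
```

A tree search calls these three methods at every simulated step. MuZero has to get the equivalent information from a model it learns itself.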

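Limits in real-world problems

Robot control: the laws of physics are hard to simulate accurately
Video games: the game engine may not be accessible
Real world: the environment dynamics cannot be modeled perfectly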

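2. The Core Idea of MuZero

MuZero learns a model of the environment. It does not learn every detail of the real environment, only the information needed for planning: the policy, the value, and the reward.

AlphaZero: uses the real environment rules → MCTS
MuZero:    uses a learned environment model → MCTS

Learned Model

MuZero's model consists of three functions:

Representation Function h(o): observation → latent state
Dynamics Function g(s, a): latent state + action → next latent state + reward
Prediction Function f(s): latent state → policy + value

observation o_t → [h] → latent state s^0
                    ↓
             [f] → policy p^0, value v^0
                    ↓
            a^1 + [g] → s^1, r^1
                    ↓
             [f] → p^1, v^1
                    ↓
                   ...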
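The data flow in the diagram can be sketched with three toy functions. The arithmetic inside `h`, `g`, and `f` below is made up purely to have something to run (a real MuZero learns these as neural networks), and `unroll` is a hypothetical helper name.

```python
import math


def h(observation):
    """Representation: observation -> latent state (toy: rescale the features)."""
    return [0.5 * x for x in observation]


def g(state, action):
    """Dynamics: (latent state, action) -> (next latent state, reward)."""
    shift = 1.0 if action == 1 else -1.0
    next_state = [0.9 * s + 0.1 * shift for s in state]
    reward = sum(next_state) / len(next_state)
    return next_state, reward


def f(state):
    """Prediction: latent state -> (policy over 2 actions, value)."""
    logits = [-sum(state), sum(state)]
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    policy = [e / sum(exps) for e in exps]
    return policy, sum(state)


def unroll(observation, actions):
    """Encode once with h, then step only in latent space with g and f."""
    state = h(observation)
    policy, value = f(state)
    steps = [(policy, value, None)]  # step 0 has no reward yet
    for action in actions:
        state, reward = g(state, action)
        policy, value = f(state)
        steps.append((policy, value, reward))
    return steps


steps = unroll([1.0, 2.0], actions=[1, 0, 1])
for k, (policy, value, reward) in enumerate(steps):
    print(k, [round(p, 3) for p in policy], round(value, 3), reward)
```

Note that the real observation is used only once, by `h`. Every later step works on latent states, which is why no simulator is needed during planning.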

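3. The Three Networks

1) Representation Network h(o)

Encodes the real observation into the latent space.

import torch
import torch.nn as nn
import torch.nn.functional as F


class RepresentationNetwork(nn.Module):
    """Converts an observation into a latent state."""

    def __init__(self, observation_shape, hidden_dim=256):
        super().__init__()

        # CNN for image input; observation_shape is (channels, height, width)
        self.conv = nn.Sequential(
            nn.Conv2d(observation_shape[0], 32, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
        )

        # Infer the flattened feature size from a dummy input instead of
        # hard-coding 64 * 10 * 10, which only holds for 80x80 images
        with torch.no_grad():
            flat_dim = self.conv(torch.zeros(1, *observation_shape)).numel()

        # Project to the latent state
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(flat_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim)
        )

    def forward(self, observation):
        x = self.conv(observation)
        hidden_state = self.fc(x)
        return hidden_state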

2) Dynamics Network g(s, a)

Predicts the state transition and the reward in the latent space.

class DynamicsNetwork(nn.Module):
    """Predicts latent state transitions."""

    def __init__(self, hidden_dim=256, action_dim=18):
        super().__init__()
        # Stored because forward() needs it for the one-hot encoding
        self.action_dim = action_dim

        # The action is concatenated to the state as a one-hot vector
        self.fc = nn.Sequential(
            nn.Linear(hidden_dim + action_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU()
        )

        # Next latent state
        self.next_state = nn.Linear(hidden_dim, hidden_dim)

        # Immediate reward prediction
        self.reward = nn.Sequential(
            nn.Linear(hidden_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 1)
        )

    def forward(self, hidden_state, action):
        # action: LongTensor of action indices, shape (batch,)
        action_one_hot = F.one_hot(action, num_classes=self.action_dim).float()

        x = torch.cat([hidden_state, action_one_hot], dim=-1)
        x = self.fc(x)

        next_hidden_state = self.next_state(x)
        reward = self.reward(x)

        return next_hidden_state, reward

3) Prediction Network f(s)

Predicts the policy and the value from a latent state.

class PredictionNetwork(nn.Module):
    """Predicts the policy and the value."""

    def __init__(self, hidden_dim=256, action_dim=18):
        super().__init__()

        # Shared layer
        self.shared = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU()
        )

        # Policy head
        self.policy = nn.Linear(hidden_dim, action_dim)

        # Value head
        self.value = nn.Linear(hidden_dim, 1)

    def forward(self, hidden_state):
        shared = self.shared(hidden_state)
        # Returns probabilities, not logits
        policy = F.softmax(self.policy(shared), dim=-1)
        value = self.value(shared)
        return policy, value

4. The Full MuZero Network

class MuZeroNetwork(nn.Module):
    def __init__(self, observation_shape, action_dim, hidden_dim=256):
        super().__init__()
        self.action_dim = action_dim

        self.representation = RepresentationNetwork(observation_shape, hidden_dim)
        self.dynamics = DynamicsNetwork(hidden_dim, action_dim)
        self.prediction = PredictionNetwork(hidden_dim, action_dim)

    def initial_inference(self, observation):
        """Initial inference: observation → latent state → policy, value."""
        hidden_state = self.representation(observation)
        policy, value = self.prediction(hidden_state)
        return hidden_state, policy, value

    def recurrent_inference(self, hidden_state, action):
        """Recurrent inference: latent state + action → next state, reward, policy, value."""
        next_hidden_state, reward = self.dynamics(hidden_state, action)
        policy, value = self.prediction(next_hidden_state)
        return next_hidden_state, reward, policy, value

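5. MuZero MCTS

Unlike AlphaZero, the search uses the learned model instead of the real rules.

import numpy as np
import torch


class Node:
    """Tree node that stores a latent state instead of a real environment state."""

    def __init__(self, prior):
        self.prior = prior        # policy probability assigned by the parent
        self.hidden_state = None  # set when the node is expanded
        self.reward = 0.0         # predicted reward for the transition into this node
        self.visit_count = 0
        self.value_sum = 0.0
        self.children = []        # children[action] -> Node

    @property
    def is_expanded(self):
        return len(self.children) > 0

    def value(self):
        return self.value_sum / self.visit_count if self.visit_count > 0 else 0.0

    def expand(self, hidden_state, policy, reward):
        self.hidden_state = hidden_state
        self.reward = reward
        self.children = [Node(prior) for prior in policy]


class MuZeroMCTS:
    def __init__(self, network, num_simulations=50, c_puct=1.25, gamma=0.997):
        self.network = network
        self.num_simulations = num_simulations
        self.c_puct = c_puct
        self.gamma = gamma

    @torch.no_grad()
    def run(self, observation, network=None):
        """Search from one observation (a batch of size 1).

        Returns (root value, most-visited action, visit-count policy).
        """
        network = self.network if network is None else network

        # Initial inference
        hidden_state, policy, _ = network.initial_inference(observation)
        root = Node(prior=1.0)
        root.expand(hidden_state, policy[0].tolist(), reward=0.0)

        for _ in range(self.num_simulations):
            node = root
            search_path = [node]

            # Selection: descend to a leaf
            while node.is_expanded:
                action, node = self.select_child(node)
                search_path.append(node)

            # Expansion: the transition comes from the learned dynamics,
            # not from the real environment
            parent = search_path[-2]
            action_tensor = torch.tensor([action], device=parent.hidden_state.device)
            next_hidden, reward, policy, value = network.recurrent_inference(
                parent.hidden_state, action_tensor
            )
            node.expand(next_hidden, policy[0].tolist(), reward.item())

            # Backpropagation
            self.backpropagate(search_path, value.item())

        # Choose the action by visit count
        visits = np.array([child.visit_count for child in root.children])
        return root.value(), int(np.argmax(visits)), visits / visits.sum()

    def search(self, observation):
        """Return only the most-visited action."""
        return self.run(observation)[1]

    def select_child(self, node):
        best_score, best_action, best_child = float('-inf'), None, None

        for action, child in enumerate(node.children):
            # PUCT score; Q includes the predicted reward for taking the action.
            # (The MuZero paper also min-max normalizes Q, which is omitted here.)
            q_value = child.reward + self.gamma * child.value() if child.visit_count > 0 else 0.0
            u = self.c_puct * child.prior * np.sqrt(node.visit_count) / (1 + child.visit_count)
            score = q_value + u

            if score > best_score:
                best_score, best_action, best_child = score, action, child

        return best_action, best_child

    def backpropagate(self, path, value):
        # Walk from the leaf to the root, adding each node's own predicted
        # reward and discounting as the value moves up the tree
        for node in reversed(path):
            node.visit_count += 1
            node.value_sum += value
            value = node.reward + self.gamma * value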
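As a worked illustration of the two formulas used in the search, the sketch below computes a PUCT score and a discounted backup with plain numbers. The function names `puct_score` and `backup` are hypothetical, and the inputs are invented for the example.

```python
import math


def puct_score(q_value, prior, parent_visits, child_visits, c_puct=1.25):
    """Exploitation term Q plus the prior-weighted exploration bonus U."""
    u = c_puct * prior * math.sqrt(parent_visits) / (1 + child_visits)
    return q_value + u


def backup(rewards, leaf_value, gamma=0.997):
    """Discounted value seen by each node on the search path, root first.

    rewards[i] is the predicted reward on the edge into path node i + 1.
    """
    values = [leaf_value]
    for reward in reversed(rewards):
        values.append(reward + gamma * values[-1])
    return list(reversed(values))


# Parent visited 16 times: a well-explored child vs. an unvisited one.
explored = puct_score(q_value=0.5, prior=0.2, parent_visits=16, child_visits=9)
unvisited = puct_score(q_value=0.0, prior=0.6, parent_visits=16, child_visits=0)
print(explored, unvisited)  # roughly 0.6 and 3.0

# Path root -> node1 -> leaf with edge rewards 1.0 and 0.0, leaf value 2.0.
print(backup(rewards=[1.0, 0.0], leaf_value=2.0, gamma=0.5))  # [1.5, 1.0, 2.0]
```

The unvisited child wins despite having no value estimate, because a high prior and a zero visit count make its exploration bonus large. The bonus shrinks as the child accumulates visits.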

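6. Training Algorithm

Data collection

Play in the real environment with MCTS and store the trajectories:

def collect_experience(env, network, mcts, num_episodes=100):
    """Collect trajectories by acting with MCTS.

    Assumes the classic Gym API: reset() returns an observation and
    step() returns (observation, reward, done, info).
    """
    buffer = []

    for _ in range(num_episodes):
        observation = env.reset()
        trajectory = []

        while True:
            # Choose the action with MCTS (the network expects a batch dimension)
            obs_tensor = torch.as_tensor(observation, dtype=torch.float32).unsqueeze(0)
            root_value, action, policy = mcts.run(obs_tensor, network)

            next_observation, reward, done, _ = env.step(action)

            trajectory.append({
                'observation': observation,
                'action': action,
                'reward': reward,
                'policy': policy,  # MCTS visit-count distribution
                'value': root_value
            })

            observation = next_observation
            if done:
                break

        buffer.append(trajectory)

    return buffer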
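The `policy` stored in each step is the distribution of root visit counts. A minimal sketch of that conversion follows. The `temperature` parameter and the helper names are assumptions added for illustration, not something taken from the code above. With a temperature of 1 the result is the plain visit ratio.

```python
import random


def visit_policy(visit_counts, temperature=1.0):
    """Turn root visit counts into a probability distribution."""
    scaled = [count ** (1.0 / temperature) for count in visit_counts]
    total = sum(scaled)
    return [s / total for s in scaled]


def sample_action(policy, rng=random):
    """Sample an action index in proportion to the policy."""
    return rng.choices(range(len(policy)), weights=policy, k=1)[0]


counts = [30, 15, 5]
policy = visit_policy(counts)
print(policy)                  # [0.6, 0.3, 0.1]
print(sample_action(policy))   # 0, 1, or 2; 0 is the most likely
```

Sampling from this distribution instead of always taking the most-visited action keeps some exploration in the collected data. A lower temperature sharpens the distribution toward the most-visited action.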

Loss function

MuZero is trained by unrolling the model K steps ahead:

import random

import torch


def cross_entropy(policy_pred, policy_target, eps=1e-8):
    """Cross-entropy of predicted probabilities against the MCTS visit distribution."""
    target = torch.as_tensor(policy_target, dtype=torch.float32)
    return -(target * torch.log(policy_pred.squeeze(0) + eps)).sum()


def mse(pred, target):
    """Squared error between a scalar prediction and a scalar target."""
    return (pred.squeeze() - float(target)) ** 2


def compute_loss(network, batch, K=5):
    """
    K-step unrolled loss. Every trajectory must contain at least K + 1 steps.
    """
    total_policy_loss = 0
    total_value_loss = 0
    total_reward_loss = 0

    for trajectory in batch:
        # Sample a start index so that t + K stays inside the trajectory
        t = random.randint(0, len(trajectory) - K - 1)

        observation = torch.as_tensor(
            trajectory[t]['observation'], dtype=torch.float32
        ).unsqueeze(0)
        hidden_state, policy_pred, value_pred = network.initial_inference(observation)

        # Loss for the first step
        policy_target = trajectory[t]['policy']
        value_target = compute_target_value(trajectory, t)

        total_policy_loss += cross_entropy(policy_pred, policy_target)
        total_value_loss += mse(value_pred, value_target)

        # Unroll K steps using the actions that were actually taken
        for k in range(K):
            action = torch.tensor([trajectory[t + k]['action']])

            hidden_state, reward_pred, policy_pred, value_pred = \
                network.recurrent_inference(hidden_state, action)

            # Targets
            reward_target = trajectory[t + k]['reward']
            policy_target = trajectory[t + k + 1]['policy']
            value_target = compute_target_value(trajectory, t + k + 1)

            total_reward_loss += mse(reward_pred, reward_target)
            total_policy_loss += cross_entropy(policy_pred, policy_target)
            total_value_loss += mse(value_pred, value_target)

    return total_policy_loss + total_value_loss + total_reward_loss


def compute_target_value(trajectory, t, gamma=0.997, n_steps=10):
    """n-step bootstrapped value target"""
    value = 0
    for i in range(n_steps):
        if t + i >= len(trajectory):
            break
        value += (gamma ** i) * trajectory[t + i]['reward']

    # Bootstrap with the MCTS root value if the trajectory extends that far
    if t + n_steps < len(trajectory):
        value += (gamma ** n_steps) * trajectory[t + n_steps]['value']

    return value
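The n-step target sums the discounted rewards for the next n steps and then adds the discounted search value n steps ahead: z_t = r_t + gamma * r_{t+1} + ... + gamma^(n-1) * r_{t+n-1} + gamma^n * v_{t+n}. The worked example below uses a stand-alone copy of that logic, here called `n_step_target` (a name chosen for this sketch), with invented numbers so the arithmetic can be checked by hand.

```python
def n_step_target(rewards, values, t, gamma, n_steps):
    """Discounted rewards for n steps, plus a bootstrap from the stored value."""
    target = 0.0
    for i in range(n_steps):
        if t + i >= len(rewards):
            break  # the episode ended before n steps were available
        target += (gamma ** i) * rewards[t + i]
    if t + n_steps < len(values):
        target += (gamma ** n_steps) * values[t + n_steps]
    return target


rewards = [1.0, 1.0, 1.0, 1.0]
values = [0.0, 0.0, 4.0, 0.0]

# t = 0: 1 + 0.5 * 1 + 0.25 * 4 = 2.5
print(n_step_target(rewards, values, t=0, gamma=0.5, n_steps=2))  # 2.5

# t = 3: only one reward is left and there is nothing to bootstrap from
print(n_step_target(rewards, values, t=3, gamma=0.5, n_steps=2))  # 1.0
```

Near the end of an episode the target falls back to the remaining rewards alone, which matches treating the value after termination as zero.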

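7. MuZero vs. Other Algorithms

Feature	DQN	AlphaZero	MuZero
Environment model	Not needed	Rules required	Learned
Planning	None	MCTS	MCTS
Scope	Model-free RL	Perfect-information games	General
Atari performance	High	Not applicable	Top
Go performance	Not applicable	Top	Top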

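8. Beyond MuZero: EfficientZero (2021)

EfficientZero greatly improved MuZero's sample efficiency.

Key improvements

Self-supervised consistency loss: pushes the latent state predicted by the dynamics model to agree with the latent state encoded from the actual next observation
End-to-end value prefix prediction: predicts the accumulated reward over the unrolled steps rather than each step's reward in isolation, which ties reward and value prediction together
Model-based off-policy correction: uses the learned model to refresh stale value targets, giving more accurate value estimates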
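The idea behind the consistency loss can be sketched as a similarity measure between two latent vectors. The sketch below uses negative cosine similarity, in the spirit of the SimSiam-style objective that EfficientZero adopts. The projection heads and the stop-gradient on the target branch are left out, so treat it as an illustration under those simplifying assumptions rather than the paper's implementation.

```python
import math


def negative_cosine_similarity(predicted, target):
    """-1 when the two vectors point the same way, 0 when they are orthogonal."""
    dot = sum(p * t for p, t in zip(predicted, target))
    norm_p = math.sqrt(sum(p * p for p in predicted))
    norm_t = math.sqrt(sum(t * t for t in target))
    return -dot / (norm_p * norm_t)


# predicted: latent state produced by the dynamics function g(s, a)
# target:    latent state produced by encoding the real next observation with h
aligned = negative_cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
unrelated = negative_cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(aligned, unrelated)
```

Minimizing this quantity gives the dynamics function a training signal at every unrolled step, even when rewards are sparse.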

Results

Reached a mean human-normalized score of 194% on the Atari 100k benchmark
Trained on only about two hours of real-time game experience

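9. Recent Developments (2024-2025)

TransZero (2025)

Uses a Transformer to parallelize the sequential expansion steps of MCTS:

Up to 11x faster than the original MuZero
Keeps the same sample efficiency

UniZero (2025)

Aggregates the entire sequence with a Transformer backbone:

Supports both discrete and continuous control tasks
Better scalability and data efficiency

Multiagent Gumbel MuZero (2024)

Extends MuZero to multi-agent environments:

Handles exponentially large action spaces
Needs 10x fewer environment interactions than model-free methods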

10. A Simple MuZero Implementation Example

import torch
import torch.nn as nn
import torch.nn.functional as F


class SimpleMuZero:
    """Simplified MuZero for CartPole."""

    def __init__(self, state_dim, action_dim, hidden_dim=64):
        self.action_dim = action_dim

        # Network definitions
        self.representation = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim)
        )

        self.dynamics = nn.Sequential(
            nn.Linear(hidden_dim + action_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim + 1)  # next_state + reward
        )

        self.prediction = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU()
        )
        self.policy_head = nn.Linear(hidden_dim, action_dim)
        self.value_head = nn.Linear(hidden_dim, 1)

        # All trainable parameters
        self.parameters = list(self.representation.parameters()) + \
                         list(self.dynamics.parameters()) + \
                         list(self.prediction.parameters()) + \
                         list(self.policy_head.parameters()) + \
                         list(self.value_head.parameters())

        self.optimizer = torch.optim.Adam(self.parameters, lr=1e-3)

    def initial_inference(self, observation):
        hidden = self.representation(observation)
        pred = self.prediction(hidden)
        policy = F.softmax(self.policy_head(pred), dim=-1)
        value = self.value_head(pred)
        return hidden, policy, value

    def recurrent_inference(self, hidden, action):
        # action: LongTensor of action indices, shape (batch,)
        action_one_hot = F.one_hot(action, self.action_dim).float()
        x = torch.cat([hidden, action_one_hot], dim=-1)
        out = self.dynamics(x)
        next_hidden = out[..., :-1]
        reward = out[..., -1:]

        pred = self.prediction(next_hidden)
        policy = F.softmax(self.policy_head(pred), dim=-1)
        value = self.value_head(pred)

        return next_hidden, reward, policy, value

    @staticmethod
    def policy_loss(policy_pred, policy_target, eps=1e-8):
        # The heads output probabilities, so the cross-entropy is computed
        # directly; F.cross_entropy would expect unnormalized logits
        return -(policy_target * torch.log(policy_pred + eps)).sum(dim=-1).mean()

    def train_step(self, observations, actions, rewards, policies, values):
        """One gradient step on a batch of unrolled windows.

        Shapes: observations (B, T, state_dim), of which only index 0 is used;
        actions (B, K) int64; rewards (B, K); policies (B, up to K + 1, action_dim);
        values (B, up to K + 1).
        """
        # Initial inference
        hidden, policy_pred, value_pred = self.initial_inference(observations[:, 0])

        loss = self.policy_loss(policy_pred, policies[:, 0])
        loss += F.mse_loss(value_pred.squeeze(-1), values[:, 0])

        # Unroll K steps
        for k in range(actions.shape[1]):
            hidden, reward_pred, policy_pred, value_pred = \
                self.recurrent_inference(hidden, actions[:, k])

            loss += F.mse_loss(reward_pred.squeeze(-1), rewards[:, k])
            if k + 1 < policies.shape[1]:
                loss += self.policy_loss(policy_pred, policies[:, k + 1])
                loss += F.mse_loss(value_pred.squeeze(-1), values[:, k + 1])

        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()

        return loss.item()
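`train_step` expects each sample to be a window of K unrolled steps. A minimal sketch of how one such window might be cut out of a stored trajectory follows. The helper `make_window` and the dictionary layout are assumptions that mirror the trajectory format from the data-collection section. Stacking windows into batched tensors is left out.

```python
def make_window(trajectory, t, K):
    """Slice one training sample: K actions and rewards, K + 1 targets."""
    steps = trajectory[t:t + K + 1]
    if len(steps) < K + 1:
        raise ValueError("trajectory too short for this start index")
    return {
        'observations': [s['observation'] for s in steps],  # only index 0 is encoded
        'actions': [s['action'] for s in steps[:K]],
        'rewards': [s['reward'] for s in steps[:K]],
        'policies': [s['policy'] for s in steps],
        'values': [s['value'] for s in steps],
    }


# Toy 5-step trajectory with CartPole-like 4-dimensional observations.
trajectory = [
    {'observation': [0.1 * i] * 4, 'action': i % 2, 'reward': 1.0,
     'policy': [0.5, 0.5], 'value': float(5 - i)}
    for i in range(5)
]

window = make_window(trajectory, t=1, K=2)
print(window['actions'])   # [1, 0]
print(window['values'])    # [4.0, 3.0, 2.0]
```

There is one more policy and value target than there are actions, because the targets cover the starting state plus the state after each of the K actions.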

Quiz

How does MuZero differ from AlphaZero?