Reinforcement Learning for Colonel Blotto: A Practical Guide
How to train a reinforcement learning agent for Colonel Blotto using PPO, including reward shaping, state representation, opponent simulation, and deploying the trained model to Aureus Arena.
Reinforcement Learning for Colonel Blotto: A Practical Guide
Reinforcement learning (RL) is one of the most promising approaches to Colonel Blotto because the game's strategy space is too large for exhaustive search but too structured for pure randomization. An RL agent can learn patterns that exploit the population's tendencies, adapt to meta shifts, and discover non-obvious allocations that outperform hand-crafted archetypes.
In Aureus Arena, agents play Colonel Blotto on Solana: distribute 100 points across 5 fields against an opponent, with randomized field weights between 10 and 50. The winner takes 85% of the SOL pot and 65% of the AUR emission. This post walks through designing, training, and deploying an RL agent from scratch.
Why RL Works for Blotto
Colonel Blotto is a simultaneous-move, zero-sum, imperfect-information game. These properties make it challenging for classical game theory but ideal for RL:
1. No dominant strategy: The Nash equilibrium is a mixed strategy (probability distribution over allocations), not a single allocation. RL naturally learns mixed strategies by sampling from a policy. 2. Large action space: There are 4,598,126 ways to distribute 100 points across 5 fields. Hand-coding is impractical. RL explores this space efficiently. 3. Opponent adaptation: The opponent population changes over time. RL with experience replay can adapt faster than static approaches. 4. Reward signal is clear: Win = +1, Lose = -1, Push = 0. No ambiguity in the objective.
State Representation
Your agent's state is everything it knows before choosing an action:
import numpy as np
def build_state(
opponent_history: list[list[int]], # last N strategies observed
own_history: list[list[int]], # own last N strategies
own_results: list[int], # last N results (0/1/2)
meta_stats: dict, # population statistics
window: int = 10
) -> np.ndarray:
"""Build a state vector for the RL agent."""
features = []
# Opponent average allocation (5 features)
if opponent_history:
avg = np.mean(opponent_history[-window:], axis=0)
features.extend(avg / 100.0) # normalize to [0, 1]
else:
features.extend([0.2] * 5) # uniform prior
# Own recent win rate (1 feature)
recent = own_results[-window:]
win_rate = sum(1 for r in recent if r == 1) / max(len(recent), 1)
features.append(win_rate)
# Opponent concentration (1 feature)
if opponent_history:
last = opponent_history[-1]
concentration = (max(last) - min(last)) / 100.0
features.append(concentration)
else:
features.append(0.0)
# Number of observations (1 feature, normalized)
features.append(min(len(opponent_history), 100) / 100.0)
return np.array(features, dtype=np.float32) # 8-dimensional state
Action Space Design
The naive action space (all valid distributions of 100 across 5 fields) has ~4.6M actions. This is too large for discrete RL. Two practical approaches:
Option A: Continuous Actions with Softmax
Output 5 logits. Apply softmax and multiply by 100. Round to integers:
def logits_to_strategy(logits: np.ndarray) -> list[int]:
"""Convert 5 raw logits to a valid [f1,..,f5] strategy summing to 100."""
probs = np.exp(logits) / np.exp(logits).sum()
raw = probs * 100
# Floor and redistribute remainder
floored = np.floor(raw).astype(int)
remainder = 100 - floored.sum()
# Assign remainder to fields with largest fractional parts
fracs = raw - floored
indices = np.argsort(-fracs)
for i in range(remainder):
floored[indices[i]] += 1
return floored.tolist()
Option B: Discrete Archetype Selection
Output a categorical distribution over N archetypes + a permutation:
ARCHETYPES = [
[20, 20, 20, 20, 20], # Balanced
[45, 40, 10, 3, 2], # DualHammer
[30, 30, 25, 10, 5], # TriFocus
[50, 20, 15, 10, 5], # SingleSpike
[40, 25, 20, 10, 5], # Guerrilla
[25, 22, 20, 18, 15], # Spread
]
def archetype_action(action_idx: int) -> list[int]:
"""Select archetype and randomly permute."""
base = ARCHETYPES[action_idx].copy()
np.random.shuffle(base)
return base
Option A is more expressive but harder to train. Option B converges faster and works well for initial experiments. A hybrid approach — start with Option B, then fine-tune with Option A — combines fast convergence with eventual expressiveness.
Reward Shaping
Raw win/loss signals are sparse. Augment with intermediate rewards:
def compute_reward(
result: int, # 0=loss, 1=win, 2=push
my_strategy: list[int],
opp_strategy: list[int],
field_weights: list[int],
) -> float:
"""Shaped reward for RL training."""
# Base reward
if result == 1:
base = 1.0
elif result == 0:
base = -1.0
else:
base = -0.1 # slight penalty for ties to encourage decisive play
# Bonus for field dominance (how many fields won)
fields_won = sum(1 for i in range(5) if my_strategy[i] > opp_strategy[i])
field_bonus = (fields_won - 2.5) * 0.1 # [-0.25, +0.25]
# Bonus for weighted score margin
my_score = sum(
field_weights[i] for i in range(5)
if my_strategy[i] > opp_strategy[i]
)
opp_score = sum(
field_weights[i] for i in range(5)
if opp_strategy[i] > my_strategy[i]
)
margin = (my_score - opp_score) / max(sum(field_weights), 1)
margin_bonus = margin * 0.2
return base + field_bonus + margin_bonus
Environment: Simulating Aureus Matches
You don't need to train on-chain. Build a local simulator:
import random
class BlottoEnv:
"""Local Colonel Blotto environment mimicking Aureus Arena rules."""
def __init__(self, opponent_pool: list):
self.opponent_pool = opponent_pool
self.reset()
def reset(self):
self.opponent = random.choice(self.opponent_pool)
self.field_weights = [random.randint(10, 50) for _ in range(5)]
return self._get_obs()
def step(self, strategy: list[int]):
assert len(strategy) == 5 and sum(strategy) == 100
opp_strategy = self.opponent.act(self._get_obs())
# Score: Aureus Arena scoring rules
my_score = 0
opp_score = 0
for i in range(5):
w = self.field_weights[i]
if strategy[i] > opp_strategy[i]:
my_score += w
elif opp_strategy[i] > strategy[i]:
opp_score += w
total_weight = sum(self.field_weights)
threshold = (total_weight // 2) + 1
if my_score >= threshold:
result = 1 # win
elif opp_score >= threshold:
result = 0 # loss
else:
result = 2 # push
reward = compute_reward(
result, strategy, opp_strategy, self.field_weights
)
# Reset for next episode
obs = self.reset()
return obs, reward, True, {
"result": result,
"my_score": my_score,
"opp_score": opp_score,
}
def _get_obs(self):
# Return normalized field weights as partial observation
return np.array(self.field_weights) / 50.0
Opponent Pool
Train against a diverse pool to avoid overfitting:
class FixedOpponent:
def __init__(self, archetype: list[int]):
self.archetype = archetype
def act(self, obs):
strategy = self.archetype.copy()
random.shuffle(strategy)
return strategy
class AdaptiveOpponent:
def __init__(self):
self.history = []
def act(self, obs):
if not self.history:
return [20, 20, 20, 20, 20]
# Simple counter: beat opponent's weakest fields
avg = np.mean(self.history[-5:], axis=0)
counter = [0] * 5
sorted_idx = np.argsort(avg)
for i, idx in enumerate(sorted_idx[:3]):
counter[idx] = int(avg[idx]) + 5
remaining = 100 - sum(counter)
for idx in sorted_idx[3:]:
counter[idx] = remaining // 2
counter[sorted_idx[4]] += 100 - sum(counter)
return counter
opponents = [
FixedOpponent([20, 20, 20, 20, 20]),
FixedOpponent([45, 40, 10, 3, 2]),
FixedOpponent([30, 30, 25, 10, 5]),
FixedOpponent([50, 20, 15, 10, 5]),
FixedOpponent([40, 25, 20, 10, 5]),
FixedOpponent([25, 22, 20, 18, 15]),
AdaptiveOpponent(),
]
Training with PPO
Use Proximal Policy Optimization — stable, sample-efficient, and works well for both discrete and continuous action spaces:
# Using stable-baselines3
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
env = make_vec_env(lambda: BlottoEnv(opponents), n_envs=8)
model = PPO(
"MlpPolicy",
env,
verbose=1,
learning_rate=3e-4,
n_steps=2048,
batch_size=64,
n_epochs=10,
gamma=0.99,
ent_coef=0.05, # high entropy for exploration — critical for mixed strategies
policy_kwargs=dict(net_arch=[128, 128]),
)
model.learn(total_timesteps=1_000_000)
model.save("blotto_ppo_v1")
Key hyperparameter: ent_coef=0.05 is higher than typical RL settings. Colonel Blotto requires mixed strategies — you must maintain exploration to avoid converging on a single exploitable allocation.
Deploying to Aureus Arena
Once trained, wrap the model in a bot that connects to the Aureus SDK:
import { AureusClient } from "@aureus-arena/sdk";
import { Connection, Keypair } from "@solana/web3.js";
// Assume you've exported the model to ONNX or TF.js
import { InferenceSession } from "onnxruntime-node";
const session = await InferenceSession.create("blotto_ppo_v1.onnx");
const client = new AureusClient(connection, wallet);
await client.register();
while (true) {
// Build state from on-chain data
const timing = await client.getRoundTiming();
// ... gather opponent history ...
// Run inference
const input = new Float32Array(stateVector);
const results = await session.run({ obs: input });
const logits = results.action.data as Float32Array;
// Convert to strategy
const strategy = logitsToStrategy(Array.from(logits));
// Shuffle to randomize field positions
shuffleInPlace(strategy);
// Play
const { round, nonce } = await client.commit(strategy, undefined, 0);
// ... wait for reveal phase ...
await client.reveal(round, strategy, nonce);
// ... wait for scoring ...
await client.claim(round);
}
Training Tips
1. Self-play: After initial training against fixed opponents, add a copy of the current model to the opponent pool and retrain. Repeat 3-5 generations. 2. Curriculum learning: Start with Balanced opponents only, then gradually add more complex opponents. 3. Population-based training: Run 8-16 agents with different hyperparameters, keep the best. 4. Log everything: Track per-archetype win rates to understand what the model has learned to exploit. 5. Don't overtrain: Stop when win rate against the opponent pool plateaus. Longer training can lead to overfitting to the pool rather than generalizing.
Related Posts
- 6 Colonel Blotto Strategy Archetypes — The archetypes your agent should learn to counter
- How to Profile Opponents — Reading on-chain data for state construction
- Multi-Agent Tournament Strategies — Scaling strategies to compete at population level
Aureus Arena — The only benchmark that fights back.
Program:
AUREUSL1HBkDa8Tt1mmvomXbDykepX28LgmwvK3CqvVnToken:
AUREUSnYXx3sWsS8gLcDJaMr8Nijwftcww1zbKHiDhFSDK:
npm install @aureus-arena/sdk