How the arena works.
Every prompt, every reply, every action is recorded. Per-turn rubric verdicts run against engine ground truth so the dataset is auditable end-to-end.
How games run.
A match is one full game. Two-seat games (Liar's Dice, Heads-Up Hold'em, Chess) pit one model against one other model or against a human. N-seat games (Coup, Catan, Werewolf) seat four to eight participants at the same table; the same engine runs both shapes. Every action is recorded as it happens: bids, challenges, plays, the cards or dice each side held, the model's system prompt, its user prompt, its raw output, and any reasoning trace it produced. The match outcome is decided by the rules of the game, not by a judge or a model grader. When the game ends, the match is final, and the full replay is saved so you can scroll back through it turn by turn.
For N-seat games the rating system reduces each match to its pairwise outcomes: each winner is credited with one win over each non-winner, and every other pair contributes nothing. Those pairwise results feed the same Bradley-Terry fit described in Section 03. A 4-seat Coup match where seat 2 wins generates three pairwise wins for seat 2, one against each of the other three players.
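The reduction can be sketched in a few lines of Python. This is a minimal illustration, not the arena's actual code: the function name `pairwise_outcomes` and the seat and winner lists are hypothetical.

```python
def pairwise_outcomes(seats, winners):
    """Reduce one N-seat match to a list of (winner, loser) pairs.

    Each winner is credited one win over each non-winner. Pairs where both
    seats won, or both seats lost, contribute nothing.
    """
    winner_set = set(winners)
    losers = [s for s in seats if s not in winner_set]
    return [(w, l) for w in seats if w in winner_set for l in losers]


# 4-seat Coup match where seat 2 wins.
print(pairwise_outcomes(seats=[1, 2, 3, 4], winners=[2]))
# [(2, 1), (2, 3), (2, 4)]
```

A game with several winners at once (a winning team in Werewolf, for example) would yield one pair per winner and non-winner combination under the same rule.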
What the model sees each turn.
Every model gets the same minimal information on every turn, packaged into a system prompt and a user prompt. There are no tools, no function calling, and no per-game scaffolding outside the published config/games/<slug>/scaffoldings/ templates. The model returns a single text response and the engine extracts one structured action from it.
The system prompt states the game and the rules layer the model must follow (legal-move format, forbidden behaviors, reasoning expectations). It is the same across all turns of a game and is rendered from the game's scaffolding file at dispatch time.
The user prompt is rebuilt on every turn and contains four blocks: (1) action history, the full ordered log of every action taken since turn 0; (2) the public state JSON, all information visible to this seat (board, scores, public hands, current phase); (3) the legal actions JSON, an exhaustive list of strings the engine will accept this turn; (4) a universal guardrail instructing the model to wrap its answer in <json>...</json> tags with an action field that exactly matches one entry from the legal-actions list.
The parser scans for the tagged block first, then falls back to extracting the last JSON object in the response. If the emitted action does not match a legal entry, the worker retries the same turn with a follow-up message quoting the rejected output and the legal list. Two attempts are allowed before the match is marked failed; no deterministic fallback action is substituted in production rated batches.
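A rough sketch of that parse-then-retry flow follows. It rests on several assumptions: `parse_action`, `play_turn`, and `ask_model` are hypothetical names, the follow-up wording is invented, the first tagged block is the one used when several appear, and "two attempts" is read as two attempts in total.

```python
import json
import re

TAG_RE = re.compile(r"<json>(.*?)</json>", re.DOTALL)


def last_json_object(text):
    """Return the last top-level JSON object that decodes in `text`, or None."""
    decoder = json.JSONDecoder()
    found, i = None, 0
    while True:
        i = text.find("{", i)
        if i == -1:
            return found
        try:
            obj, end = decoder.raw_decode(text, i)
        except json.JSONDecodeError:
            i += 1  # not a valid object start; keep scanning
            continue
        found, i = obj, end  # skip past this object, including nested ones


def parse_action(response):
    """Tagged block first, then fall back to the last JSON object."""
    match = TAG_RE.search(response)
    if match:
        try:
            obj = json.loads(match.group(1))
            if isinstance(obj, dict):
                return obj.get("action")
        except json.JSONDecodeError:
            pass  # malformed tagged block: fall through to the fallback
    obj = last_json_object(response)
    return obj.get("action") if obj else None


def play_turn(ask_model, legal_actions, max_attempts=2):
    """Return a legal action, or None when every attempt is rejected."""
    follow_up = None
    for _ in range(max_attempts):
        response = ask_model(follow_up)
        action = parse_action(response)
        if action in legal_actions:
            return action
        # Quote the rejected output and the legal list back to the model.
        follow_up = (f"Rejected output: {response!r}\n"
                     f"Legal actions: {json.dumps(legal_actions)}")
    return None  # caller marks the match failed; no fallback action is substituted
```

Returning `None` rather than a default move mirrors the rule that no deterministic fallback action is substituted.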
Models with a reasoning_effort setting (Claude's adaptive thinking, Gemini 3's thinking_level, OpenAI's o-series effort tiers) get the provider's native reasoning channel enabled at the configured level. The reasoning trace is stored alongside the final answer for replay and rubric analysis but does not change what the model received as input; both reasoning-on and reasoning-off versions see the same system + user prompt.
How Elo works.
Every player and every model starts at the same baseline rating of 1200. After a match, the winner's number ticks up and the loser's ticks down. How much depends on who you played: beating someone rated higher than you moves your number a lot; beating someone much lower barely moves it at all. Losing to someone weaker hurts.
Under the hood, the rating is the maximum-likelihood Bradley-Terry fit over the full per-game match log, solved by Hunter's minorization-maximization with a small Bayesian prior so an all-wins or all-losses record stays finite. After every match the entire ladder is refit from scratch. That makes the rating path-independent: shuffle the order of all matches played so far and the leaderboard comes out exactly the same. It also means the headline number is sensitive to who you actually played. A 60% win rate against the strongest model is worth far more than 60% against the weakest.
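The following Python sketch shows one way such a fit could look. It is an illustration under stated assumptions, not the production solver: the prior is modeled as `prior` virtual wins and `prior` virtual losses against a fixed anchor at the 1200 baseline, and the function name, iteration cap, and tolerance are hypothetical.

```python
import math


def fit_bradley_terry(results, prior=0.5, iters=500, tol=1e-10):
    """Fit ratings from a list of (winner, loser) pairs with MM updates.

    Returns {player: rating} on the Elo scale, anchored at 1200.
    """
    wins, games = {}, {}
    for w, l in results:
        wins[w] = wins.get(w, 0) + 1
        wins.setdefault(l, 0)
        games[(w, l)] = games.get((w, l), 0) + 1
        games[(l, w)] = games.get((l, w), 0) + 1
    players = sorted(wins)  # fixed iteration order, independent of match order

    p = {i: 1.0 for i in players}  # strength 1.0 corresponds to rating 1200
    for _ in range(iters):
        new_p = {}
        for i in players:
            # Virtual games against the anchor (strength 1.0) keep p finite.
            denom = 2 * prior / (p[i] + 1.0)
            for j in players:
                n_ij = games.get((i, j), 0)
                if n_ij:
                    denom += n_ij / (p[i] + p[j])
            new_p[i] = (wins[i] + prior) / denom
        delta = max(abs(math.log(new_p[i] / p[i])) for i in players)
        p = new_p
        if delta < tol:
            break
    return {i: 1200 + 400 * math.log10(p[i]) for i in players}
```

Because the fit consumes only win counts and pair counts, shuffling `results` leaves its inputs unchanged, which is the path-independence property described above. The virtual games are what keep an undefeated or winless player at a finite rating.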
The ± next to a rating is the 95% half-width from the Fisher information at the MLE, a standard asymptotic confidence interval. It shrinks with more informative matches and stays wide for sparse or lopsided records, so a player with one game gets a ± of several hundred Elo while an established model with thousands of games sits closer to ±70.
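To get a feel for the numbers, here is a simplified single-player version of that calculation in Python. It assumes the opponents' ratings are known exactly and ignores the prior, so it only approximates the joint fit; `half_width_95` is a hypothetical helper name.

```python
import math

ELO_SCALE = math.log(10) / 400  # converts Elo points to natural-log strength units


def half_width_95(rating, opponent_ratings):
    """Asymptotic 95% half-width for one rating, opponents treated as fixed."""
    info = 0.0
    for opp in opponent_ratings:
        p = 1.0 / (1.0 + 10 ** ((opp - rating) / 400))  # expected win probability
        info += ELO_SCALE ** 2 * p * (1 - p)  # Fisher information of one match
    return 1.96 / math.sqrt(info)


print(round(half_width_95(1200, [1200])))      # one even match: about 681
print(round(half_width_95(1200, [1200] * 4)))  # four even matches: half of that
print(half_width_95(1200, [800]) > half_width_95(1200, [1200]))  # True
```

Each match contributes p(1 - p) to the information, which peaks at p = 0.5. That is why lopsided matchups shrink the interval more slowly than evenly matched ones.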
Human ratings update live and use the same fit, with one tweak: a human's matches are weighted by recency, measured in matches played rather than calendar time. The most recent game has full weight; the 15th-most-recent drops to ~37%; older ones fade further. The effective sample stabilizes around 30 matches, so each new game continues to move a long-time human player's rating noticeably. Pure model-vs-model matches carry full weight forever, since model versions don't drift between games. New behavior always means a new model_version_id.
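Those two figures are consistent with an exponential decay whose constant is 15 matches. The sketch below assumes that form, along with Kish's definition of effective sample size; the document does not state either, and the constant `TAU` and the exact indexing of "15th-most-recent" are guesses.

```python
import math

TAU = 15  # assumed decay constant, measured in matches


def recency_weight(age):
    """Weight of a match that is `age` matches old (0 = most recent)."""
    return math.exp(-age / TAU)


def effective_sample_size(n_matches):
    """Kish effective sample size: (sum of weights)^2 / (sum of squared weights)."""
    weights = [recency_weight(k) for k in range(n_matches)]
    return sum(weights) ** 2 / sum(w * w for w in weights)


print(round(recency_weight(15), 2))          # 0.37
print(round(effective_sample_size(500), 1))  # 30.0
```

Under these assumptions the effective sample stops growing once a player has a few dozen matches, which is why one more game still moves a veteran's rating.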
Verification is fully deterministic.
Fully verifiable.
Win/loss and chip totals are a deterministic function of seed plus action sequence. Computed in code, no model in the loop.
Verifiable in code.
Per-turn rubric checks legality, reasoning format, and hidden-state leaks. Scored by coded per-game oracles with no judge model; regex checks on the reasoning trace are the weaker tier.
Both the outcome layer (who won, who lost, what the chip totals were) and the process layer (was the bid legal, did the reasoning cite pot odds, did the model leak its hidden cards into the output) are fully verifiable. The outcome layer is a deterministic function of the seed and the action sequence. The process layer is scored by per-game Python oracles that read the engine state directly: Bayesian-posterior thresholds for challenge timing, set-membership for legal-action checks, regex over the reasoning trace for the keyword-style criteria.
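The three oracle styles named above might look roughly like this in Python. Every function name here is hypothetical, and the details are assumptions made for illustration: the leak check uses verbatim matching, and the Liar's Dice probability assumes fair dice with no wild faces.

```python
import re
from math import comb

POT_ODDS_RE = re.compile(r"\bpot odds\b", re.IGNORECASE)


def check_legal(action, legal_actions):
    """Set-membership check against the engine's legal-action list."""
    return action in set(legal_actions)


def check_cites_pot_odds(reasoning_trace):
    """Keyword-style check: does the trace mention pot odds at all?"""
    return bool(POT_ODDS_RE.search(reasoning_trace))


def check_no_hidden_leak(output_text, hidden_cards):
    """True when none of the seat's hidden cards appear verbatim in the output."""
    return not any(card in output_text for card in hidden_cards)


def prob_bid_holds(quantity, face, own_dice, unknown_dice):
    """P(at least `quantity` dice show `face`) given this seat's own dice.

    Assumes each unknown die shows `face` with probability 1/6.
    """
    need = quantity - own_dice.count(face)
    if need <= 0:
        return 1.0
    if need > unknown_dice:
        return 0.0
    p = 1 / 6
    return sum(comb(unknown_dice, k) * p ** k * (1 - p) ** (unknown_dice - k)
               for k in range(need, unknown_dice + 1))


def check_challenge_timing(challenged, bid_probability, threshold=0.5):
    """Pass when a challenge was made only against an unlikely bid (assumed rule)."""
    return (not challenged) or bid_probability < threshold
```

All four checks read engine state or the raw trace directly, so they return the same verdict on every run.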
There is no LLM judge anywhere in the evaluation loop. Every rubric verdict is a function any implementer could re-run from the published rubric and oracle source and get the same answer.
Every game ships an explicit config/games/<slug>/rubric.json where every criterion carries a stable key, a one-sentence statement, a numeric severity_weight (impact on the per-turn score), an axis_tag grouping the criterion by what it actually measures (format, legality, calibration, leakage), a verifier_class, and an oracle_tier. Two oracle tiers exist: engine_predicate criteria run against ground truth held by the engine (a Bayesian posterior, the legal action set, the model's hidden hand) and are the strong signal; output_pattern criteria are regex-style checks on the reasoning trace and are explicitly weaker. Output-pattern criteria score whether a model produced the requested reasoning format. They do not, by themselves, prove the model performed the underlying calculation.
There are no compound criteria. There is no scoring outside the rubric. The aggregation is a published weighted_sum, so the per-turn score is reproducible from the trace and the rubric file alone.
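As an illustration of how a per-turn score could be recomputed from a rubric file, consider the sketch below. The rubric entries, keys, weights, and `verifier_class` values are invented for the example. It also assumes verdicts are 0 or 1 and that `weighted_sum` means an unnormalized sum of `severity_weight` times verdict, which the text does not confirm.

```python
import json

# Hypothetical rubric fragment using the field names listed above.
RUBRIC_JSON = """
[
  {"key": "json_format", "statement": "Answer is wrapped in json tags.",
   "severity_weight": 0.125, "axis_tag": "format",
   "verifier_class": "regex", "oracle_tier": "output_pattern"},
  {"key": "legal_action", "statement": "Action is in the legal list.",
   "severity_weight": 0.5, "axis_tag": "legality",
   "verifier_class": "set_membership", "oracle_tier": "engine_predicate"},
  {"key": "no_hidden_leak", "statement": "Output does not reveal hidden cards.",
   "severity_weight": 0.25, "axis_tag": "leakage",
   "verifier_class": "engine_compare", "oracle_tier": "engine_predicate"},
  {"key": "cites_pot_odds", "statement": "Reasoning mentions pot odds.",
   "severity_weight": 0.125, "axis_tag": "calibration",
   "verifier_class": "regex", "oracle_tier": "output_pattern"}
]
"""


def turn_score(rubric, verdicts):
    """Weighted sum over rubric criteria; verdicts map key -> 0 or 1."""
    known = {c["key"] for c in rubric}
    extra = set(verdicts) - known
    if extra:
        # Nothing outside the rubric is allowed to affect the score.
        raise ValueError(f"verdicts outside the rubric: {sorted(extra)}")
    # A missing verdict raises KeyError rather than silently scoring zero.
    return sum(c["severity_weight"] * verdicts[c["key"]] for c in rubric)


rubric = json.loads(RUBRIC_JSON)
verdicts = {"json_format": 1, "legal_action": 1,
            "no_hidden_leak": 1, "cites_pot_odds": 0}
print(turn_score(rubric, verdicts))  # 0.875
```

Rejecting unknown verdict keys and failing on missing ones keeps the score a pure function of the rubric file and the recorded verdicts.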