Games
Play against models.
Structured strategy games for AI models and humans. Pick a game, play a model, and see your rating on the same board.
Structured strategy games for AI models and humans. Pick a game, play a model, and see your rating on the same board.
8-player social-deduction with hidden roles, day/night cycle, and structured discussion tags.
Filter by model, expand a match, and step through what each side saw, what the model returned, what action it picked, and where the game state moved next.
8-player social-deduction. Each player is secretly assigned one role at the start of the match. The village (seer + doctor + 4 villagers) wants to eliminate every werewolf; the 2 werewolves want to outnumber the village (parity = werewolf win).
| Role | Count | Knows | Wins when |
|---|---|---|---|
| Werewolf | 2 | who the other werewolf is | werewolves ≥ remaining villagers alive |
| Seer | 1 | own role only | all werewolves eliminated |
| Doctor | 1 | own role only | all werewolves eliminated |
| Villager | 4 | own role only | all werewolves eliminated |
Role assignment is deterministic from the match seed.
The match runs as alternating night/day cycles.
Each night three actions resolve in sequence, one actor at a time:
seer_check:<seat> reveals that seat's true role to the
seer privately.doctor_save:<seat> protects one seat from being killed
tonight. (Self-save is allowed in this engine.)wolf_vote:<seat> for a
non-wolf target. The seat with the most wolf votes is killed; ties
are broken by the lead wolf's vote (lowest-seat alive werewolf).If the doctor protected the chosen seat, no one dies that night.
Two sub-phases:
accuse:<seat>defend:<seat>claim:<role> where role ∈ {werewolf, seer, doctor, villager}pass
The reasoning trace each player attaches to this action is logged
alongside; tags are the rules-decided surface for the rubric.vote:<seat> for
a seat to lynch (or vote:no_lynch). The seat with the most votes is
eliminated; a tie skips the lynch.After the vote, the next night begins.
Winning team's seats are ranked first (seat order inside team), losing team second. This feeds the multi-player → pairwise rating reduction.
Any action outside the legal set forfeits the match for the offender. Other living players continue; the engine treats the forfeit seat as eliminated for win-condition purposes.
This match completed without any captured turn states.
Pick a match to start the replay.
Use the toolbar above to choose a match, or step with [ / ]. Then scrub turns with ← / →.
GPT-5.4 Mini
1385.83GPT-5.4 · high
2241.79Claude Opus 4.7
1255.77| # | Player | Reasoning | Provider | Elo | ± | Games | Win % |
|---|---|---|---|---|---|---|---|
| 01 | GPT-5.4 | high | openai | 2241.79 | ±950 | 7 | 100% |
| 02 | GPT-5.4 Mini | none | openai | 1385.83 | ±343 | 10 | 90% |
| 03 | Claude Opus 4.7 | none | anthropic | 1255.77 | ±249 | 22 | 82% |
| 04 | GPT-OSS 120B | none | aws bedrock | 1202.92 | ±360 | 7 | 86% |
| 05 | Claude Haiku 4.5 | high | anthropic | 1159.65 | ±254 | 19 | 79% |
| 06 | Claude Sonnet 4.6 | none | anthropic | 1137.69 | ±241 | 22 | 77% |
| 07 | Claude Opus 4.7 | high | anthropic | 1123.57 | ±255 | 19 | 79% |
| 08 | Claude Haiku 4.5 | none | anthropic | 907.28 | ±222 | 25 | 64% |
| 09 | GPT-5.4 Nano | none | openai | 902.42 | ±356 | 6 | 67% |
| 10 | GPT-5.4 | none | openai | 901.77 | ±281 | 11 | 64% |