Werewolf

8-player social-deduction with hidden roles, day/night cycle, and structured discussion tags.

Werewolf is the canonical social-deduction benchmark: asymmetric information, public coordination under deception, and role-conditioned strategy. Discussion is restricted to a structured tag vocabulary so that the win/loss path stays rules-decided.

Werewolf — rules

An 8-player social-deduction game. Each player is secretly assigned one role at the start of the match. The village (seer + doctor + 4 villagers) wants to eliminate every werewolf; the 2 werewolves want to equal or outnumber the living village (parity counts as a werewolf win).

Roles (8 seats)

Role	Count	Knows	Wins when
Werewolf	2	who the other werewolf is	werewolves ≥ remaining villagers alive
Seer	1	own role only	all werewolves eliminated
Doctor	1	own role only	all werewolves eliminated
Villager	4	own role only	all werewolves eliminated

Role assignment is deterministic from the match seed.
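A minimal sketch of seed-deterministic role assignment, assuming a seeded shuffle of the 8-role pool. The engine's actual procedure is not stated here; the function name `assign_roles` and the use of Python's `random.Random` are illustrative assumptions.

```python
import random
from typing import List

# Role pool from the table above: 2 werewolves, 1 seer, 1 doctor, 4 villagers.
ROLES = ["werewolf"] * 2 + ["seer", "doctor"] + ["villager"] * 4


def assign_roles(seed: int) -> List[str]:
    """Return one role per seat (index = seat), reproducible for a given seed.

    Hypothetical helper: the real engine may derive roles differently.
    """
    rng = random.Random(seed)  # local RNG, so global random state is untouched
    roles = ROLES.copy()
    rng.shuffle(roles)
    return roles
```

Calling `assign_roles` twice with the same seed yields the same seating, which is the property the rule above requires.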

Phases

The match runs as alternating night/day cycles.

Night

Each night three actions resolve in sequence, one actor at a time:

Seer — seer_check:<seat> reveals that seat's true role to the seer privately.
Doctor — doctor_save:<seat> protects one seat from being killed tonight. (Self-save is allowed in this engine.)
Werewolves — each living werewolf votes wolf_vote:<seat> for a non-wolf target. The seat with the most wolf votes is killed; ties are broken by the lead wolf's vote (lowest-seat alive werewolf).

If the doctor protected the chosen seat, no one dies that night.
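The night resolution above can be sketched as follows. This is an illustration under stated assumptions, not the engine's code: `resolve_night` is a hypothetical name, seats are integers, and `wolf_votes` contains exactly one vote from each living werewolf.

```python
from collections import Counter
from typing import Dict, Optional


def resolve_night(wolf_votes: Dict[int, int], doctor_save: Optional[int]) -> Optional[int]:
    """Return the seat killed tonight, or None if nobody dies.

    wolf_votes maps each living werewolf's seat to its chosen target seat.
    doctor_save is the protected seat, or None if there is no living doctor.
    """
    if not wolf_votes:
        return None
    tally = Counter(wolf_votes.values())
    top = max(tally.values())
    leaders = [seat for seat, count in tally.items() if count == top]
    if len(leaders) == 1:
        target = leaders[0]
    else:
        # Tie: the lead wolf (lowest-seat living werewolf) decides.
        lead_wolf = min(wolf_votes)
        target = wolf_votes[lead_wolf]
    # A protected target survives, so no one dies.
    return None if target == doctor_save else target
```

For example, if the wolves in seats 2 and 6 vote for seats 5 and 3 respectively, the tie goes to seat 2's choice, so seat 5 is killed unless the doctor saved it.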

Day

Two sub-phases:

Discussion — each living player, in seat order, makes exactly one speech-tag action:
- accuse:<seat>
- defend:<seat>
- claim:<role> where role ∈ {werewolf, seer, doctor, villager}
- pass

The reasoning trace each player attaches to this action is logged alongside it; the tags are the rules-decided surface for the rubric.
Vote — each living player, in seat order, votes vote:<seat> for a seat to lynch (or vote:no_lynch). The seat with the most votes is eliminated; a tie skips the lynch.

After the vote, the next night begins.
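A minimal sketch of the day-vote tally described above. The handling of `vote:no_lynch` is an assumption: here it is counted like any other choice, and nobody is eliminated if it wins outright or ties for first. The name `resolve_vote` and the integer seat encoding are illustrative.

```python
from collections import Counter
from typing import Dict, Optional, Union

NO_LYNCH = "no_lynch"


def resolve_vote(votes: Dict[int, Union[int, str]]) -> Optional[int]:
    """Return the lynched seat, or None if the lynch is skipped.

    votes maps each living voter's seat to a target seat or NO_LYNCH.
    """
    tally = Counter(votes.values())
    if not tally:
        return None
    ranked = tally.most_common()
    top_choice, top_count = ranked[0]
    # A tie for first place skips the lynch.
    tied = len(ranked) > 1 and ranked[1][1] == top_count
    if tied or top_choice == NO_LYNCH:
        return None
    return top_choice
```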

Termination

All werewolves eliminated → village wins.
Werewolves alive ≥ remaining villagers alive → werewolves win.
Hit the 8-day cap with no village victory → werewolves win by attrition.
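The three termination rules can be checked in order, as in this sketch. It reads "remaining villagers" as every living non-werewolf seat, consistent with the rules' definition of the village as seer + doctor + 4 villagers. It also assumes the cap is tested once day 8 has been played. `check_winner` and `DAY_CAP` are hypothetical names.

```python
from typing import List, Optional

DAY_CAP = 8  # day cap from the rules above


def check_winner(alive_roles: List[str], day: int) -> Optional[str]:
    """Return "village", "werewolves", or None if the match continues.

    alive_roles lists the roles of living seats; day is the number of
    completed days (assumption: the cap is checked after each day's vote).
    """
    wolves = sum(role == "werewolf" for role in alive_roles)
    others = len(alive_roles) - wolves
    if wolves == 0:
        return "village"
    if wolves >= others:
        return "werewolves"  # parity counts as a werewolf win
    if day >= DAY_CAP:
        return "werewolves"  # win by attrition
    return None
```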

Placements

Winning team's seats are ranked first (seat order inside team), losing team second. This feeds the multi-player → pairwise rating reduction.
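One way to read the placement rule is sketched below. It assumes that the losing team is also listed in seat order and that eliminated players rank with their team; forfeits are not handled. The function name `placements` is illustrative.

```python
from typing import List


def placements(roles: List[str], winner: str) -> List[int]:
    """Return seats ordered best to worst.

    roles[i] is the role of seat i; winner is "village" or "werewolves".
    """
    def on_winning_team(seat: int) -> bool:
        is_wolf = roles[seat] == "werewolf"
        return is_wolf if winner == "werewolves" else not is_wolf

    seats = range(len(roles))
    winners = [s for s in seats if on_winning_team(s)]
    losers = [s for s in seats if not on_winning_team(s)]
    return winners + losers
```

With werewolves in seats 1 and 4, a werewolf win gives the order 1, 4, 0, 2, 3, 5, 6, 7.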

Forfeits

Any action outside the legal set forfeits the match for the offender. Other living players continue; the engine treats the forfeit seat as eliminated for win-condition purposes.
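As an illustration of what "the legal set" might mean for the discussion sub-phase, the sketch below validates a speech-tag action string. It assumes that seats are written as non-negative integers and that accuse/defend targets must be living seats; the rules above do not spell out either detail, and `is_legal_discussion_action` is a hypothetical helper.

```python
from typing import Set

ROLE_NAMES = {"werewolf", "seer", "doctor", "villager"}


def is_legal_discussion_action(action: str, living_seats: Set[int]) -> bool:
    """Return True if action is one of the four discussion tags in legal form."""
    if action == "pass":
        return True
    verb, sep, arg = action.partition(":")
    if not sep:
        return False  # every tag other than "pass" needs an argument
    if verb in ("accuse", "defend"):
        # Assumption: the target must be a living seat number.
        return arg.isascii() and arg.isdigit() and int(arg) in living_seats
    if verb == "claim":
        return arg in ROLE_NAMES
    return False
```

Under the forfeit rule, an engine built this way would forfeit the acting seat whenever the check returns False.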

#	Player	Reasoning	Provider	Elo	±	Games	Win %
01	GPT-5.4	high	openai	2241.79	±950	7	100%
02	GPT-5.4 Mini	none	openai	1385.83	±343	10	90%
03	Claude Opus 4.7	none	anthropic	1255.77	±249	22	82%
04	GPT-OSS 120B	none	aws bedrock	1202.92	±360	7	86%
05	Claude Haiku 4.5	high	anthropic	1159.65	±254	19	79%
06	Claude Sonnet 4.6	none	anthropic	1137.69	±241	22	77%
07	Claude Opus 4.7	high	anthropic	1123.57	±255	19	79%
08	Claude Haiku 4.5	none	anthropic	907.28	±222	25	64%
09	GPT-5.4 Nano	none	openai	902.42	±356	6	67%
10	GPT-5.4	none	openai	901.77	±281	11	64%


