Experimental

First applied to the whitepaper cover system. Graduates to convention after a second non-trivial design decision uses it successfully.

Adversarial Review Workflow

For non-trivial design decisions, run this workflow before merge. It is additional to, not a replacement for, the 3-agent QA workflow defined in CLAUDE.md.

When to use this workflow

Required for:

Not required for:

The dividing line: does this change establish what is true, or check that work matches what is already true? Adversarial review for the former, QA for the latter.

Why this exists separately from QA

QA finds errors against a known spec. Run a content check, a rules-consistency check, an HTML-validation check; each one assumes the spec is correct and verifies that the artifact obeys it.

Adversarial review stress-tests the spec itself. When the spec is new or being substantively changed, QA cannot tell you whether the rules are right; it can only tell you whether the artifact follows the rules. A spec that codifies a wrong universal can pass every QA check and still ship a flawed system.

The whitepaper cover system shipped a v0.1 with universal “Recommendations to ” labeling and required COI disclosures. QA passed. Adversarial review surfaced that those are genre-specific conventions and that the v0.1 spec had inadvertently universalized think-tank patterns. The fix changed the spec, not the artifacts.

The four stages

1. Research

Build the institutional reference set before writing rules. For each genre the spec will cover, find 3-5 authoritative exemplars from real institutions. Read or page through them. Capture concrete observations:

Output: a research note (in the PR description, in the category guide, or in docs/) listing exemplars and observations. The note is the input to stage 2.

This stage is non-negotiable. Skipping it produces specs that codify the author’s intuition or the most recently encountered exemplar as universal.

2. Debate

Spawn two reviewer agents (or two adversarial passes by the same agent) holding opposing positions on the most consequential open question.

Examples of debate prompts that produced useful tension on the whitepaper cover system:

The debate is structured: each side cites exemplars from the research stage; the goal is not consensus but exposure of the tradeoffs. The synthesizer (you, or a third agent) reads both arguments and decides how the spec should resolve the tension — usually with a genre-conditional rule (encoded via the YAML modes: field), sometimes by accepting one side, occasionally by reframing the question.
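
As a concrete illustration of the usual outcome, here is a minimal sketch of a genre-conditional rule resolved through a modes: override, assuming that field maps a genre name to rule overrides. Only the modes: field name comes from this guide; the genre names and rule keys are hypothetical placeholders, not the cover system's actual schema.

```python
import yaml  # PyYAML; assumed available

# Hypothetical spec fragment. Only the "modes:" field name comes from this guide;
# every genre name and rule key below is a placeholder.
SPEC = yaml.safe_load("""
rules:
  coi_disclosure: optional        # base rule after the debate
  recommendations_label: omit
modes:
  think_tank_report:              # genre where the stricter position won
    coi_disclosure: required
    recommendations_label: required
  technical_whitepaper: {}        # genre where the base rules stand
""")

def resolve_rules(spec: dict, genre: str) -> dict:
    """Overlay one genre's overrides on the base rules."""
    resolved = dict(spec["rules"])
    resolved.update(spec.get("modes", {}).get(genre, {}))
    return resolved

print(resolve_rules(SPEC, "think_tank_report"))    # both rules required
print(resolve_rules(SPEC, "technical_whitepaper"))  # base rules unchanged
```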

Output: a synthesis note enumerating which position won, which genre it applies to, and what rules drop out.

3. Synthesis

Translate the debate output into spec changes:

Cross-file consistency is critical: if a value appears in rules, foundations, and specimens, all three must match. The validators catch some of this; you catch the rest.
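
A minimal sketch of what the cross-file check could look like, assuming the shared values live in YAML rule files or front matter; the file paths and key name are placeholders, not the repository's actual layout or its validator suite.

```python
import sys
import yaml  # PyYAML; assumed available

# Placeholder paths and key; the real layout and validators differ.
FILES = ["rules.yaml", "foundations.yaml", "specimens/cover.yaml"]
KEY = "recommendations_label"

def value_in(path: str, key: str):
    with open(path) as f:
        return yaml.safe_load(f).get(key)

values = {path: value_in(path, KEY) for path in FILES}
if len(set(values.values())) > 1:
    print(f"Cross-file mismatch for {KEY!r}: {values}")
    sys.exit(1)
print(f"{KEY!r} is consistent across {len(FILES)} files")
```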

4. QA

Run the standard 3-agent QA workflow from CLAUDE.md:

  1. Content review (no AI artifacts, voice consistency, factual accuracy)
  2. Rules consistency (cross-file value alignment, sidebar coverage, validator pass)
  3. HTML/render validation (specimens render, examples render, no broken links)

QA at this stage operates on the new spec. It tells you whether your artifacts obey it. If they don’t, fix the artifacts; do not weaken the spec to match a sloppy artifact.

Output of the workflow

A PR that includes:

The “Why this guide changed” section is part of the deliverable, not an afterthought. It is the audit trail that lets a future reader understand why the spec is what it is, and it is the warning to future cells thinking about reverting the change.

When 3 loops vs. 5 loops

The four-stage workflow above (research / debate / synthesis / QA) describes one iteration. For most non-trivial cells, three iterations of the agent-team loop on the same artifact converge on correctness; a single loop catches only the most obvious 30-40% of issues. PR #43 codified this in the LEARNINGS entry “Three-loop adversarial review converges; one loop misses.” The default cadence is:

For three classes of work, three loops are not enough. Add a Loop 4 and Loop 5:

Per-loop agent count

Plan-agent recommendation: 4-5 parallel agents per loop, maximum. Past that, the synthesis cost (reading the agents’ outputs, deduplicating findings, resolving disagreements between agents) eats the gains from additional coverage. Confirmed in PR #43 Wave 1, where 5 agents per loop was the sweet spot and 7+ produced more synthesis overhead than incremental signal.

The exception: when the agents are prosecuting genuinely different surfaces (Loop 5R = render, Loop 5O = threat modeling), each surface gets its own 4-5 agents because their outputs don’t overlap. The “max 4-5” cap is per-surface, not per-loop.

When NOT to add Loop 4 and Loop 5

The cost of 5 loops is real: 5 loops × 4-5 agents × ~15 min of review and synthesis per agent ≈ 5 hours of structured review. The benefit is real too: PII catches, render-bug catches, and convergence on a spec that downstream cells can trust. Apply the cost-benefit honestly.
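
For reference, the figure works out as follows, assuming roughly 15 minutes of agent run plus synthesis per agent per loop:

```python
# Rough cost check for the ~5 hour estimate above; assumes ~15 min per agent per loop.
loops, minutes_per_agent = 5, 15
for agents in (4, 5):
    hours = loops * agents * minutes_per_agent / 60
    print(f"{loops} loops x {agents} agents x {minutes_per_agent} min = {hours:.2f} h")
# 4 agents -> 5.00 h; 5 agents -> 6.25 h
```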

Anti-patterns

Skipping research and going straight to debate. Two agents arguing without exemplars produces vibes, not principles. Always research first.

Synthesizing into universal rules when the debate exposed a genre split. If position A wins for genre X and position B wins for genre Y, the rule must be genre-conditional (use the modes: YAML field). Forcing one side to win across all genres is the failure mode the workflow exists to catch.

Treating adversarial review as a one-way ratchet toward more rules. Sometimes the right outcome of debate is removing or weakening a rule. The whitepaper cover system removed three “always required” rules and replaced them with genre-conditional rules — net rule count went down, but precision went up.

Conflating adversarial review with code review. This workflow stress-tests the spec. Code review (pr-review-toolkit:code-reviewer) checks that the implementation matches the spec. Both happen, in that order.

See also