gstack/review
Garry Tan 4ab0269729
feat(codex+review): require synthesis Recommendation in cross-model skills
Extends the v1.25.1.0 AskUserQuestion recommendation-quality coverage to the
cross-model synthesis surfaces that were previously emitting prose without a
structured recommendation:

- /codex review (Step 2A) — after presenting Codex output + GATE verdict,
  must emit `Recommendation: <action> because <reason>` line. Reason must
  compare against alternatives (other findings, fix-vs-ship, fix-order).
- /codex challenge (Step 2B) — same requirement after adversarial output.
- /codex consult (Step 2C) — same requirement after consult presentation,
  with examples for plan-review consults that engage with specific Codex
  insights.
- Claude adversarial subagent (scripts/resolvers/review.ts:446, used by
  /ship Step 11 + standalone /review) — subagent prompt now ends with
  "After listing findings, end your output with ONE line in the canonical
  format Recommendation: <action> because <reason>". Codex adversarial
  command (line 461) gets the same final-line requirement.
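
The final-line requirement above can be checked mechanically. The following is a minimal sketch, not code from this repository: `parseRecommendation` and the `Recommendation` interface are hypothetical names, and the only thing taken from the commit is the canonical format `Recommendation: <action> because <reason>`.

```typescript
// Hypothetical helper (an assumption, not part of the repository): extracts the
// canonical final line `Recommendation: <action> because <reason>` from output.
interface Recommendation {
  action: string;
  reason: string;
}

function parseRecommendation(output: string): Recommendation | null {
  // Only the last non-empty line counts, because the output must END with it.
  const lines = output
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line.length > 0);
  const last = lines[lines.length - 1];
  if (last === undefined) return null;

  // Lazy action group, so the split happens at the first " because ".
  const match = /^Recommendation:\s+(.+?)\s+because\s+(.+)$/.exec(last);
  if (match === null) return null;
  return { action: match[1] ?? "", reason: match[2] ?? "" };
}
```

A parse like this only confirms the shape of the line. Whether the reason actually compares against alternatives is a question of substance, which this sketch does not attempt to judge.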

The same `judgeRecommendation` helper grades both AskUserQuestion and
cross-model synthesis — one rubric, two surfaces. Substance-5 cross-model
recommendations explicitly compare against alternatives (a different
finding, fix-vs-ship, fix-order). Generic synthesis ("because adversarial
review found things") scores below the pass threshold of ≥ 4.

Tests:
- test/llm-judge-recommendation.test.ts gains 5 cross-model fixtures (3
  substance ≥ 4, 2 substance < 4). Existing rubric correctly grades them.
- test/skill-cross-model-recommendation-emit.test.ts (new, free-tier) —
  static guard greps codex/SKILL.md.tmpl + scripts/resolvers/review.ts for
  the canonical emit instruction. Trips before any paid eval if the
  templates drift.
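
The idea behind the static guard can be sketched as below. This is an illustration under assumptions, not the contents of test/skill-cross-model-recommendation-emit.test.ts: `findDriftedTemplates` and `CANONICAL_EMIT` are hypothetical names, and the sketch takes template text as input rather than reading files so that it stays self-contained.

```typescript
// Sketch only: the real guard's implementation is not shown in this commit.
// CANONICAL_EMIT is the format string quoted in the commit message.
const CANONICAL_EMIT = "Recommendation: <action> because <reason>";

// Hypothetical helper: returns the names of templates whose text no longer
// contains the canonical emit instruction.
function findDriftedTemplates(templates: Record<string, string>): string[] {
  return Object.entries(templates)
    .filter(([, text]) => !text.includes(CANONICAL_EMIT))
    .map(([name]) => name);
}
```

In a real test the map values would presumably be the contents of codex/SKILL.md.tmpl and scripts/resolvers/review.ts, with the test asserting that the returned list is empty. Since it is a plain substring check, it needs no model call.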

Touchfile: extended `llm-judge-recommendation` entry with codex/SKILL.md.tmpl
and scripts/resolvers/review.ts so synthesis-template edits invalidate the
fixture re-run.

Verified: free `bun test` exits 0 (5/5 static emit-guard tests pass), paid
fixture passes 45/45 expect calls in 24s with the cross-model substance-5
fixtures correctly judged at ≥ 4.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 19:38:12 -07:00
specialists feat: adaptive gating + cross-review dedup for review army (v0.15.2.0) (#760) 2026-04-04 22:46:21 -07:00
SKILL.md feat(codex+review): require synthesis Recommendation in cross-model skills 2026-05-01 19:38:12 -07:00
SKILL.md.tmpl v1.11.0.0 feat(ship): workspace-aware version allocation (#1168) 2026-04-23 23:03:27 -07:00
TODOS-format.md feat: TODOS-aware skills, 2-tier Greptile replies, gitignore fix (#61) 2026-03-14 20:15:11 -07:00
checklist.md feat: Review Army — parallel specialist reviewers for /review (v0.14.3.0) (#692) 2026-03-30 22:07:50 -06:00
design-checklist.md feat: adaptive gating + cross-review dedup for review army (v0.15.2.0) (#760) 2026-04-04 22:46:21 -07:00
greptile-triage.md feat: TODOS-aware skills, 2-tier Greptile replies, gitignore fix (#61) 2026-03-14 20:15:11 -07:00