docs: add E2E eval failure blame protocol

"Not related to our changes" is an extraordinary claim that requires
extraordinary proof. When evals fail during /ship:

1. Run the same eval on main — prove it fails there too
2. If it passes on main, it IS your change — trace the blame
3. If you can't verify, say "unverified" not "pre-existing"

Added to CLAUDE.md and as a comment in skill-e2e.test.ts.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This commit is contained in:
Garry Tan 2026-03-16 10:21:46 -05:00
parent 9540b216fb
commit 3492f98e82
No known key found for this signature in database
GPG Key ID: C1F69E85C74EFE1D
2 changed files with 20 additions and 0 deletions

View File

@ -118,6 +118,21 @@ CHANGELOG.md is **for users**, not contributors. Write it like product release n
- No jargon: say "every question now tells you which project and branch you're in" not - No jargon: say "every question now tells you which project and branch you're in" not
"AskUserQuestion format standardized across skill templates via preamble resolver." "AskUserQuestion format standardized across skill templates via preamble resolver."
## E2E eval failure blame protocol
When an E2E eval fails during `/ship` or any other workflow, **never claim "not
related to our changes" without proving it.** These systems have invisible couplings —
a preamble text change affects agent behavior, a new helper changes timing, a
regenerated SKILL.md shifts prompt context.
**Required before attributing a failure to "pre-existing":**
1. Run the same eval on main (or base branch) and show it fails there too
2. If it passes on main but fails on the branch — it IS your change. Trace the blame.
3. If you can't run on main, say "unverified — may or may not be related" and flag it
as a risk in the PR body
"Pre-existing" without receipts is a lazy claim. Prove it or don't say it.
## Deploying to the active skill ## Deploying to the active skill
The active skill lives at `~/.claude/skills/gstack/`. After making changes: The active skill lives at `~/.claude/skills/gstack/`. After making changes:

View File

@ -13,6 +13,11 @@ import * as os from 'os';
const ROOT = path.resolve(import.meta.dir, '..'); const ROOT = path.resolve(import.meta.dir, '..');
// Skip unless EVALS=1. Session runner strips CLAUDE* env vars to avoid nested session issues. // Skip unless EVALS=1. Session runner strips CLAUDE* env vars to avoid nested session issues.
//
// BLAME PROTOCOL: When an eval fails, do NOT claim "pre-existing" or "not related
// to our changes" without proof. Run the same eval on main to verify. These tests
// have invisible couplings — preamble text, SKILL.md content, and timing all affect
// agent behavior. See CLAUDE.md "E2E eval failure blame protocol" for details.
const evalsEnabled = !!process.env.EVALS; const evalsEnabled = !!process.env.EVALS;
const describeE2E = evalsEnabled ? describe : describe.skip; const describeE2E = evalsEnabled ? describe : describe.skip;