Commit Graph

3 Commits

Author SHA1 Message Date
Garry Tan c7ae63201a
v1.58.1.0 feat: hermetic local E2E + Conductor prose AskUserQuestion (#2004)
* feat: add shared call-time isConductor() helper

Single source of truth for Conductor host detection in TS consumers
(CONDUCTOR_WORKSPACE_PATH / CONDUCTOR_PORT). Reads the passed env at
call time, not a module-load snapshot, so unit tests can pin the env
inline without Bun --preload (esm-hoist-breaks-env-pin-bootstrap).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
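
A minimal sketch of the call-time check described above. Only the two variable names come from the commit message; the function name, signature, and treatment of empty strings as unset are assumptions.

```typescript
// Hypothetical sketch: the real helper's signature is not shown in this log.
type Env = Record<string, string | undefined>;

// Reads the env object passed in at call time, so a test can hand in a
// pinned env instead of depending on a module-load snapshot.
function isConductorSketch(env: Env): boolean {
  return Boolean(env["CONDUCTOR_WORKSPACE_PATH"] || env["CONDUCTOR_PORT"]);
}
```

A unit test can then pass `{ CONDUCTOR_PORT: "1" }` or `{}` inline without touching the ambient environment.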

* test: harden question-preference-hook harness against ambient Conductor env

runHook copied all of process.env into the hook subprocess, so running the
suite inside Conductor (CONDUCTOR_WORKSPACE_PATH/PORT set) would leak those
markers. Strip them so the existing cases deterministically characterize
NON-Conductor behavior before the Conductor branch lands. Baseline: 15 pass.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* feat: PreToolUse hook denies AskUserQuestion in Conductor, redirects to prose

Conductor disables native AskUserQuestion and routes through a flaky MCP
variant that returns '[Tool result missing due to internal error]'. The
hook now denies any AUQ call in a Conductor session and instructs the model
to render a prose decision brief instead (transport avoidance, not preference
enforcement) — firing for one-way doors too, with a typed-confirmation
requirement for destructive paths.

Precedence: never-ask auto-decide still wins (user already settled those);
Conductor prose is the fallback for everything else; non-Conductor behavior
is byte-for-byte unchanged. Restructured the per-question loop to compute
eligibility without early-returning so the Conductor branch can run as the
fallback while preserving memoryContext on every exit.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
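
The precedence above can be sketched as a pure decision function. The type names, field names, and the "defer" label are illustrative assumptions, not the hook's actual code.

```typescript
// Hypothetical sketch of the precedence order described in the commit message.
type HookOutcome = "auto-decide" | "conductor-prose" | "defer";

interface QuestionState {
  autoDecideEligible: boolean; // a never-ask preference already settles this question
  inConductor: boolean;        // Conductor host markers are present
}

function resolveOutcomeSketch(q: QuestionState): HookOutcome {
  // 1. never-ask auto-decide still wins: the user already settled it.
  if (q.autoDecideEligible) return "auto-decide";
  // 2. Conductor prose is the fallback for everything else in Conductor.
  if (q.inConductor) return "conductor-prose";
  // 3. Outside Conductor, behavior is unchanged.
  return "defer";
}
```

Computing eligibility first and choosing the outcome afterwards is what lets the Conductor branch act as a fallback rather than an early return.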

* feat: Conductor renders AskUserQuestion decisions as prose by default

In Conductor, native AskUserQuestion is disabled and the MCP variant is
flaky, so skills now render every decision as a plain-text prose brief the
user answers by typing a letter — proactively, not as a failure reaction.

- Preamble emits CONDUCTOR_SESSION, gated on != headless so eval/CI inside
  Conductor still BLOCKs instead of rendering prose to nobody.
- AskUserQuestion Format gains a Conductor-default-prose rule (auto-decide
  preferences still apply first; prose decisions log via gstack-question-log
  since PostToolUse never fires), a one-way/destructive typed-confirmation
  rule, and a typed-reply continuation protocol for split chains.
- Regenerated all SKILL.md + ship golden fixtures; bumped affected carve
  skeleton caps to absorb the always-loaded additions.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* feat: deploy the Conductor AskUserQuestion hook (setup + upgrade migration)

The PreToolUse hook only delivers its Conductor-prose guarantee if it's
installed, but setup skips hook registration in non-interactive (Conductor/CI)
setups. Two fixes so layer 3 actually deploys:

- setup: treat a Conductor workspace as an implicit opt-in for the PreToolUse
  hook on the silent fall-through (never overriding an explicit opt-out).
- migration v1.58.0.0: re-register the hook for existing Conductor installs on
  /gstack-upgrade, idempotent and respecting plan_tune_hooks=no.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* test: E2E for Conductor prose + fix auto-decide-preserved GSTACK_HOME bug

- New skill-e2e-conductor-prose (periodic): Conductor env + plan-eng-review
  surfaces a prose decision brief, not a silent skip. Header documents this is
  end-to-end behavior coverage; the deterministic Conductor guard is the
  question-preference-hook unit test (the PTY harness can't register the MCP
  variant — Codex #10).
- Fix the pre-existing bug in auto-decide-preserved: it seeded the never-ask
  preference under GSTACK_HOME=tmpHome but never passed GSTACK_HOME into the
  PTY run, so the spawned claude read the real ~/.gstack and the preference
  was inert (Codex #9). Now passes GSTACK_HOME + CONDUCTOR_WORKSPACE_PATH to
  prove auto-decide still wins over the Conductor prose redirect.
- Register both in touchfiles (periodic tier).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* v1.58.0.0 feat: Conductor renders AskUserQuestion decisions as prose

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* test: strip ambient Conductor env in memory-cache-injection hook harness

Same dev-in-Conductor leak as the one fixed for question-preference-hook: this suite's
runHook copies process.env, so running it inside Conductor flipped the
defer-path memoryContext assertions into the [conductor] prose deny. Strip
CONDUCTOR_* so the cases characterize non-Conductor behavior. (CI is headless,
so this only bit local Conductor runs.)

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* feat: gstack-detach — run agent eval/bench jobs in their own session

Long agent-run jobs (30-60 min evals, benchmarks) die when the harness sends
SIGTERM to a background task's process group on turn boundaries / monitor
stops / interruptions (observed: 'script test:gate terminated by signal
SIGTERM'). gstack-detach runs the command in a fresh session (python3
os.setsid, or setsid on Linux, nohup fallback) so a group SIGTERM can't reach
it, and wraps it in caffeinate -i on macOS so idle-sleep can't kill it either.
Returns immediately; caller polls the logfile. Secrets stay in env, never argv.

The guard test pins the contract: the command runs in a different process
group than the caller and outlives the launching shell.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* feat: eval:bg* scripts — detached eval runs for agents

Agent-facing convenience scripts that launch the eval suites through
gstack-detach so a harness SIGTERM can't kill a long run. eval:bg (diff-based),
eval:bg:all, eval:bg:gate, eval:bg:periodic — each returns immediately and
streams to /tmp/gstack-evals.log for polling. The plain test:evals / test:e2e
scripts stay foreground for humans.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* docs: CLAUDE.md — agents must run long evals via gstack-detach

Codifies the detached-execution default: agent-launched eval/benchmark runs go
through bin/gstack-detach (or the eval:bg* scripts) so a harness SIGTERM or
macOS idle-sleep can't kill a 30-60 min run, then poll the log with a
death-aware watcher. Humans keep foreground scripts.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* feat: harden gstack-detach against all four eval-infra killers

The basic bash detach fixed SIGTERM but a real run on a shared dev box hit
three more killers: cross-worktree API saturation (15-way concurrency compounded
by a sibling worktree mass-timed-out the suite), a silent hang (periodic bun died
with no exit marker), and shared-/tmp log contamination (a concurrent
worktree's agent output bled into the log). Rewrite as a portable python3 tool
that bakes in all four fixes:

- fork + setsid: SIGTERM-proof (own session, survives harness polite-quit)
- caffeinate -i on macOS: no idle-sleep death
- --lock NAME (fcntl, machine-wide): concurrent worktrees SERIALIZE instead of
  saturating the shared model API
- run-scoped default log (~/.gstack-dev/eval-runs/<label>-<slug>-<branch>-<ts>-<pid>):
  no cross-worktree collision/contamination
- --timeout watchdog + a guaranteed '### gstack-detach EXIT=<code> ###' sentinel
  on every terminal path: no silent hang, finished-vs-died always detectable

Guard test pins all four: detached pgid differs + outlives launcher, run-scoped
log path, watchdog EXIT=timeout, and lock serialization (second run WAITS).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
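
A poller can tell finished from died by looking for the EXIT sentinel in the log. The sketch below assumes the sentinel sits on its own line in exactly the form quoted above; the function name is hypothetical, and reading the log file is left to the caller.

```typescript
// Hypothetical sketch: parses the sentinel format quoted in the commit message.
const EXIT_SENTINEL = /^### gstack-detach EXIT=(\S+) ###$/m;

// Returns the exit token ("0", "1", "timeout", ...) or null while the run
// has not yet written its sentinel.
function parseDetachExitSketch(logText: string): string | null {
  const match = EXIT_SENTINEL.exec(logText);
  return match ? (match[1] ?? null) : null;
}
```

A null result means the job is still running, or was killed before any terminal path ran, so a death-aware watcher would pair this check with a liveness probe.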

* feat: eval:bg* use run-scoped logs + machine lock + watchdog

Drop the shared /tmp/gstack-evals.log path (the cross-worktree collision that
contaminated a live run) for gstack-detach's run-scoped default, and add the
machine-wide gstack-evals lock (concurrent worktrees serialize, no API
saturation) plus per-tier watchdog timeouts (60/90/120 min). Each eval:bg*
prints its run-scoped log path to poll.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* docs: wire detached-eval guidance into /ship + correct CLAUDE.md flags

- /ship eval step (sections/tests.md): long eval suites launch via gstack-detach
  (own session, machine lock, EXIT sentinel) so a turn boundary can't kill a
  30+ min run mid-ship — the exact failure observed during this branch's ship.
- CLAUDE.md: correct the now-stale /tmp reference; document the --lock (serialize
  worktrees, no API saturation), --timeout watchdog, run-scoped log, and the
  guaranteed EXIT sentinel the poller breaks on.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* refactor: extract pure promotedEnv() from conductor-env-shim

Single source of truth for GSTACK_* key promotion semantics. The ambient
promoteConductorEnv() becomes a wrapper; behavior-preserving. Needed by the
hermetic env builder which must not mutate process.env.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* feat: hermetic child-env builder for E2E runners

Allowlist scrub (basics/network/named-auth kept; CONDUCTOR_*, CLAUDE_*,
GSTACK_*, MCP_*, GBRAIN_*, operator credentials dropped), per-runner
extraAllow, overrides merge last, EVALS_HERMETIC=0 byte-identical escape
hatch read at call time (ESM-hoist safe). Sync memoized singleton temp dirs
(<runRoot>/.claude keeps the extractPlanFilePath contract), seeded
.claude.json for non-interactive first run, pid-aware GC of crashed runs.
19 free unit tests.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
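
A rough sketch of the allowlist scrub, under stated assumptions: the base allowlist below is invented for illustration, names are matched exactly (the real builder evidently also admits prefixes such as CODEX_*), and overrides are applied on both paths. The temp-dir, seeding, and GC behavior is not modeled.

```typescript
// Hypothetical sketch of an allowlist-based child env builder.
type Env = Record<string, string | undefined>;

// Assumed allowlist for illustration; the real lists are not shown in this log.
const BASE_ALLOW = ["PATH", "HOME", "LANG", "TMPDIR", "ANTHROPIC_API_KEY"];

function hermeticChildEnvSketch(
  parent: Env,
  extraAllow: string[] = [],
  overrides: Env = {},
): Env {
  // Escape hatch, read from the env at call time rather than at module load.
  const hermetic = parent["EVALS_HERMETIC"] !== "0";
  const allowed = new Set([...BASE_ALLOW, ...extraAllow]);
  const child: Env = {};
  for (const key of Object.keys(parent)) {
    const value = parent[key];
    // Keep only allowlisted keys; CONDUCTOR_*, GSTACK_*, etc. fall away
    // simply because they are never listed.
    if (value !== undefined && (!hermetic || allowed.has(key))) {
      child[key] = value;
    }
  }
  // Overrides merge last so a runner can redirect config dirs per test.
  return { ...child, ...overrides };
}
```

An allowlist fails closed: a new ambient variable is dropped until someone admits it, whereas a denylist would let it leak by default.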

* feat: session-runner spawns hermetic children + isolation canaries

claude -p children now get the allowlist-scrubbed env and a gated
--strict-mcp-config (EVALS_HERMETIC=0 restores operator env AND args).
Two gate-tier canaries make the clean room falsifiable: hermetic-canary
asserts env redirect + scrub + zero MCP servers + nonzero API-key cost
from the Bash tool_result (never model prose); hermetic-sentinel plants a
poisoned operator config (user CLAUDE.md + MCP server) and proves the
child cannot see it. Empirically verified on claude 2.1.175: print mode
needs no seed config (the seed serves the PTY path); the child CLI sets
CLAUDECODE for its own tools, so that scrub is pinned in unit tests, not
E2E. hermetic-env.ts joins GLOBAL_TOUCHFILES.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* feat: PTY runner spawns hermetic claude sessions

launchClaudePty children get the allowlist-scrubbed env, a gated
--strict-mcp-config, and the session exposes hermeticConfigDir for
forensics (hermetic plan files live under <dir>/plans/ and still match
extractPlanFilePath via the /.claude dir-name contract). Seeded trust
state covers repo-cwd sessions; the 15s trust-watcher stays as fallback.
Verified foreground via the plan-mode-no-op gate test.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* feat: codex/gemini runners spawn hermetic children

Same allowlist scrub as the claude runners, with each provider's auth
surface re-admitted via extraAllow (codex: OPENAI_API_KEY/CODEX_* plus
its tempHome .codex copy; gemini: GEMINI_*/GOOGLE_* with real HOME for
~/.gemini auth). The gemini spawn previously inherited the full operator
env with no env property at all.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* feat: agent-sdk-runner spawns hermetic children via complete Options.env

The historical 'env: breaks SDK auth' failure was partial-env replacement:
Options.env replaces the child's entire environment, so objects lacking
ANTHROPIC_API_KEY killed auth. Passing the complete hermetic env (key +
PATH + redirected CLAUDE_CONFIG_DIR/GSTACK_HOME) works — validated live
via query() with a Bash tool call (success, real cost, Conductor vars
scrubbed). Per-test opts.env merges last; ambient key mutation still
works because the builder reads process.env at call time.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* test: static tripwire pins hermetic wiring in all five runners

Free-tier invariants: every runner builds child env via hermeticChildEnv,
no raw ...process.env spread at any spawn site, --strict-mcp-config gated
on isHermeticEnabled in both claude runners, and no test callsite passes
the operator env into a runner's override parameter (scoped to runner
calls — unit tests spawning gstack bin scripts directly are exempt).
Mirrors the terminal-agent-pid-identity / server-embedder-terminal-port
tripwire idiom.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* test: refresh codex/factory ship goldens with detached-eval block

a38089aa added the gstack-detach guidance to the ship template and
updated the claude golden; the codex and factory goldens missed the same
16-line block. Regenerated via bun run gen:skill-docs.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* docs: hermetic local E2E is the default; retire stale SDK env warning

CLAUDE.md now documents the hermetic clean room (allowlist scrub, fresh
seeded CLAUDE_CONFIG_DIR, temp GSTACK_HOME, --strict-mcp-config),
EVALS_HERMETIC=0 as the debug escape hatch, and replaces the 'never pass
env: to runAgentSdkTest' rule with the verified mechanism (partial-env
replacement was the failure; complete env is safe).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* fix: operational-learning fixture copies lib/jsonl-store.ts with the bin

gstack-learnings-log imports $SCRIPT_DIR/../lib/jsonl-store.ts (hasInjection,
v1.57.5.0), so copying only the bin scripts into the temp fixture has made the
script exit 1 ever since. Latent because diff-based selection rarely
runs this test; surfaced when hermetic-env.ts joined GLOBAL_TOUCHFILES and
selected everything. Reproduced outside the hermetic env to confirm blame.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* fix: ios-qa daemon scenarios use unique pidfiles under --concurrent

All scenarios shared join(workDir, 'daemon.pid') through a module-scope
workDir binding that beforeEach reassigns mid-flight under bun --concurrent.
First daemon claims; siblings get already_running against the test process's
own always-alive pid and fail in milliseconds — the failure mode seen at
15-way gate concurrency. Per-claim unique pidfiles keep the single-instance
semantics under test.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* fix: workflow judge re-appends body-carved sections after the marker slice

runWorkflowJudge appended sections/*.md before slicing startMarker..endMarker.
That handles skills that moved their MARKERS into sections (plan-eng,
plan-design) but not document-release, which keeps its markers in the
skeleton and carved the workflow BODY (Steps 2-9 -> sections/release-body.md)
AFTER the endMarker — so the slice dropped it and the judge scored
completeness 2 ('Steps 2-9 are in an external file'). Now any carved section
the marker window excluded is re-appended, so the judge sees the full
workflow the agent executes. document-release: completeness 2->5, clarity
3->4. ship/plan-ceo/plan-eng/plan-design judges unchanged (their section
content is already inside the slice, so the head-dedup skips re-append).

Pre-existing since the v1.57.0.0 carve (#1907); surfaced now because
hermetic-env.ts is a global touchfile that selects every llm-judge test.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* harden: hermetic temp-dir GC grace window + half-seed cleanup

Codex adversarial review (ship) flagged two temp-dir lifecycle edges:
- GC deleted any dead-pid dir; PID reuse could delete a freshly-created dir
  whose original pid exited and was recycled to a live process. Now requires
  BOTH a dead pid AND mtime older than a 1h floor.
- A seed-write failure after mkdir left an unseeded dir named with our live
  pid that this process's GC skips, leaking until exit. Now the partial dir
  is torn down before the (still loud) rethrow.

Two findings left as-is by design: HOME stays allowlisted (CLAUDE_CONFIG_DIR
wins for claude; codex/gemini need ~/.codex|~/.gemini auth; FS sandbox is
TODOS.md:454 scope; the hermetic-sentinel canary proves config isolation),
and PTY extraArgs --mcp-config is a deliberate caller opt-in like env overrides.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
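
The hardened GC rule above reduces to a two-condition predicate. In this sketch the function name and parameters are hypothetical, and probing whether the pid is alive is left to the caller.

```typescript
// Hypothetical sketch of the "dead pid AND old mtime" GC rule.
const GC_GRACE_MS = 60 * 60 * 1000; // the 1h floor from the commit message

function shouldCollectDirSketch(
  ownerPidAlive: boolean,
  dirMtimeMs: number,
  nowMs: number,
): boolean {
  // A dead pid alone is not enough: the pid may have been recycled, so the
  // dir must also be older than the grace window.
  return !ownerPidAlive && nowMs - dirMtimeMs > GC_GRACE_MS;
}
```

Requiring both conditions trades slower cleanup of crashed runs for never deleting a directory that a live run has just created.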

* docs: document hermetic-by-default E2E + eval:bg detached runs in CONTRIBUTING

The Testing & evals section now tells contributors that local E2E runners
spawn children through a sealed clean room (allowlist-scrubbed env, seeded
CLAUDE_CONFIG_DIR, temp GSTACK_HOME, --strict-mcp-config) so local signal
matches CI, with EVALS_HERMETIC=0 as the escape hatch. The eval-tools list
gains the eval:bg* detached-run scripts (gstack-detach: SIGTERM-proof,
caffeinate-wrapped, machine-locked, run-scoped logs, EXIT= sentinel).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore: sync package.json to 1.58.1.0

The merge took main's package.json (1.58.0.0); gstack-version-bump repair
fixed the working tree but the change was left uncommitted. Without this the
committed tree disagrees with VERSION and CI's version-match test fails.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs: regenerate diagram SKILL.md with Conductor prose preamble

The diagram skill (new from main) was missing the Conductor-session prose
AskUserQuestion blocks that gen-skill-docs propagates to every SKILL.md.
Pure generated output; reproduced by bun run gen:skill-docs.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
2026-06-14 11:40:57 -07:00
Garry Tan 9e244c0bed
v1.11.1.0 fix: plan-mode handshake + canUseTool test harness (#1182)
* feat: plan-mode handshake for interactive review skills

Add a preamble-level STOP-Ask handshake that fires when the user invokes any
of the 4 interactive review skills (plan-ceo-review, plan-eng-review,
plan-design-review, plan-devex-review) while their Claude Code session is
in plan mode. Without this gate, plan mode's "this supercedes any other
instructions" system-reminder outranked the skills' interactive STOP gates
and the skills silently wrote plan files without any per-finding AskUserQuestion.

The handshake offers 2 options (exit-and-rerun, cancel) — the original
third "stay and batch" option was dropped after two independent reviewers
flagged it as a silent bypass of the skills' anti-skip rule.

Architecture decisions (CEO+Eng review):
- Preamble-level resolver, not per-template injection (Codex finding #2)
- Position 1 in preamble composition: after bash block (_SESSION_ID live),
  before onboarding AskUserQuestion gates (so fresh-install users see the
  handshake first, not drowned in telemetry/proactive/routing prompts)
- Generator-only `interactive: true` frontmatter flag, following the
  `preamble-tier` precedent (no host-config frontmatter allowlist edits)
- Host-scoped to Claude via `ctx.host === 'claude'` check inside the
  resolver (simpler than `suppressedResolvers` which only gates `{{}}`
  placeholders)
- One-way-door classification in scripts/question-registry.ts for all 4
  skills so question-tuning `never-ask` preferences can't suppress the gate
- Synchronous telemetry write to ~/.gstack/analytics/skill-usage.jsonl on
  handshake fire (captures A-exit and C-cancel outcomes that terminate the
  skill before end-of-run telemetry runs)

Also adds an explicit STOP block to plan-ceo-review Step 0C-bis so the
approach-selection question can't silently skip to mode selection.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* feat: extend agent-sdk-runner with canUseTool for AskUserQuestion interception

Test harness at test/helpers/agent-sdk-runner.ts gains an optional
`canUseTool` callback parameter. When a test supplies it, the harness
flips `permissionMode` from `bypassPermissions` (overlay-harness default)
to `default` so the SDK actually invokes the callback on every tool use,
and auto-adds `AskUserQuestion` to `allowedTools` so Claude can fire it
at all.

Exports a `passThroughNonAskUserQuestion` helper so tests that only want
to intercept AskUserQuestion can auto-allow every other tool with one
line: `return passThroughNonAskUserQuestion(toolName, input)`.

This is the foundation for D14 — every future interactive-skill E2E test
can now assert on AskUserQuestion shape and routing. Previous E2E tests
at `test/skill-e2e.test.ts` explicitly instructed the model to skip
AskUserQuestion ("non-interactive run") which meant no test could actually
verify the question content or routing.

6 new unit tests in test/agent-sdk-runner.test.ts cover:
- permissionMode flips to 'default' when canUseTool supplied
- permissionMode stays 'bypassPermissions' when canUseTool absent
- canUseTool callback reaches the SDK options
- AskUserQuestion auto-added to allowedTools when canUseTool supplied
- AskUserQuestion NOT added when canUseTool absent
- passThroughNonAskUserQuestion helper returns allow+updatedInput

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
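
The option flip and the pass-through helper might look roughly like the sketch below. The types are local stand-ins rather than the SDK's own, and both function names are hypothetical; only the behaviors (mode flip, auto-added tool, allow plus updatedInput) come from the commit message.

```typescript
// Hypothetical sketch; these types are stand-ins, not imported from the SDK.
type PermissionMode = "bypassPermissions" | "default";
type ToolInput = Record<string, unknown>;
type CanUseToolSketch = (toolName: string, input: ToolInput) => unknown;

interface HarnessOptionsSketch {
  permissionMode: PermissionMode;
  allowedTools: string[];
  canUseTool?: CanUseToolSketch;
}

function resolveHarnessOptionsSketch(
  allowedTools: string[],
  canUseTool?: CanUseToolSketch,
): HarnessOptionsSketch {
  // No callback: keep the overlay-harness default and leave tools untouched.
  if (!canUseTool) return { permissionMode: "bypassPermissions", allowedTools };
  // Callback supplied: flip the mode so the callback is consulted on every
  // tool use, and make sure the model is able to call AskUserQuestion.
  const tools = allowedTools.indexOf("AskUserQuestion") === -1
    ? [...allowedTools, "AskUserQuestion"]
    : allowedTools;
  return { permissionMode: "default", allowedTools: tools, canUseTool };
}

// Auto-allow helper for every tool the test does not want to intercept.
function passThroughSketch(_toolName: string, input: ToolInput) {
  return { behavior: "allow" as const, updatedInput: input };
}
```

A test that only cares about AskUserQuestion would branch on that tool name in its callback and return `passThroughSketch(toolName, input)` for everything else.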

* test: plan-mode handshake E2E coverage and unit assertions

Adds 6 E2E test files and 8 new unit assertions to verify the plan-mode
handshake works end-to-end and stays correct under regeneration.

E2E tests (gate-tier, paid, EVALS=1 EVALS_TIER=gate):
- test/skill-e2e-plan-ceo-plan-mode.test.ts — handshake fires before any
  Write/Edit when plan-mode distinctive phrase is present; 2-option shape
  (Exit/Cancel); option A routes to ExitPlanMode cleanly
- test/skill-e2e-plan-eng-plan-mode.test.ts — same contract for plan-eng
- test/skill-e2e-plan-design-plan-mode.test.ts — same contract for
  plan-design; exercises C-cancel branch instead of A-exit
- test/skill-e2e-plan-devex-plan-mode.test.ts — same contract for plan-devex
- test/skill-e2e-plan-mode-no-op.test.ts — negative regression: handshake
  must NOT fire when distinctive phrase is absent; skill proceeds normally
  through Step 0 (REGRESSION RULE guardrail against breaking existing
  interactive-review sessions)
- test/e2e-harness-audit.test.ts — free unit test asserting every
  `interactive: true` skill has at least one canUseTool-using test file
  (prevents future drift where a skill opts in without coverage)

Shared helper test/helpers/plan-mode-handshake-helpers.ts centralizes the
canUseTool interceptor + distinctive-phrase injection so the 4 sibling
E2E tests are thin wiring (~20 LOC each) and can't drift out of sync.

Unit assertions added to test/gen-skill-docs.test.ts:
- handshake section present in all 4 Claude-generated SKILL.md files
- handshake section absent from non-interactive Claude skills (ship,
  review, qa, office-hours, codex, retro, cso)
- handshake section absent from non-Claude host outputs (.agents, etc.)
- 0C-bis STOP block present in plan-ceo-review/SKILL.md at correct
  position (between the "Present these approach options" line and
  "### 0D-prelude" header)
- handshake resolver wired BEFORE generateUpgradeCheck in preamble
  composition order

6 new gate-tier entries added to test/helpers/touchfiles.ts so any change
to the handshake resolver, preamble composition, skill templates, question
registry, one-way-door classifier, or agent-sdk-runner fires the relevant
E2E tests. test/touchfiles.test.ts updated for the new selection count
(plan-ceo-review/** now triggers 15 tests, up from 8).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* chore(v1.11.1.0): VERSION bump + CHANGELOG entry + TODOS follow-ups

Bumps from main's v1.11.0.0 to v1.11.1.0 (PATCH — bug-fix release, no new
user-facing artifacts). CHANGELOG entry covers the plan-mode handshake,
agent-sdk-runner canUseTool extension, and the 2 follow-up TODOs.

CHANGELOG order: v1.11.1.0 (this) → v1.11.0.0 (workspace-aware ship,
merged from main) → v1.10.1.0 (overlay efficacy harness). No duplicate
headers.

Syncs package.json version to match VERSION per the Step 12 idempotency
invariant (both files must agree or /ship halts).

TODOS.md:
- Preserves the Testing/security-bench-haiku-responses P1 added on main
- Adds P1 "Structural STOP-Ask forcing function" — broader class of the
  bug this release fixes
- Adds P2 "Apply interactive: true to non-review skills (office-hours,
  codex, investigate, qa, retro, cso)"

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-24 00:04:53 -07:00
Garry Tan e3d7f49c74
feat(v1.10.1.0): overlay efficacy harness + Opus 4.7 fanout nudge removal (#1166)
* refactor: export readOverlay from model-overlay resolver

Needed by the overlay-efficacy eval harness to resolve INHERIT directives
without going through generateModelOverlay's full TemplateContext.

* chore: add @anthropic-ai/claude-agent-sdk@0.2.117 dep

Pinned exact for SDK event-shape stability. Used by the overlay-efficacy
harness to drive the model through a closer-to-real Claude Code harness
than `claude -p`.

* feat(preflight): sanity check for agent-sdk + overlay resolver

Verifies: SDK loads, claude-opus-4-7 is a live API model, SDKMessage
event shape matches assumptions, readOverlay resolves INHERIT directives
and includes expected content. Run with `bun run scripts/preflight-agent-sdk.ts`.

PREFLIGHT OK on first run, $0.013 API spend.

* feat(eval): parametric overlay-efficacy harness (runner + fixtures)

`test/helpers/agent-sdk-runner.ts` wraps @anthropic-ai/claude-agent-sdk
with explicit `AgentSdkResult` types, process-level API concurrency
semaphore, and 3-shape 429 retry (thrown error, result-message error,
mid-stream SDKRateLimitEvent). Pins the local claude binary via
`pathToClaudeCodeExecutable`.

`test/fixtures/overlay-nudges.ts` holds the typed registry. Two
fixtures for the first measurement: `opus-4-7-fanout-toy` (3-file read)
and `opus-4-7-fanout-realistic` (mixed-tool audit). Strict validator
rejects duplicate ids, non-integer trials, unsafe overlay paths, non-safe
id chars, and missing overlay files at module load.

Adding a future overlay nudge eval = one fixture entry.
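
A sketch of what such a strict validator could look like. The fixture shape, the id character set, and the path-safety rule are all assumptions; the missing-file check is omitted because it needs filesystem access.

```typescript
// Hypothetical sketch of a fixture-registry validator.
interface OverlayFixtureSketch {
  id: string;
  trials: number;
  overlayPath: string;
}

function validateFixturesSketch(fixtures: OverlayFixtureSketch[]): void {
  const seen = new Set<string>();
  for (const f of fixtures) {
    // Assumed safe charset: lowercase letters, digits, and hyphens.
    if (!/^[a-z0-9-]+$/.test(f.id)) throw new Error(`unsafe id: ${f.id}`);
    if (seen.has(f.id)) throw new Error(`duplicate id: ${f.id}`);
    seen.add(f.id);
    if (!Number.isInteger(f.trials) || f.trials < 1) {
      throw new Error(`trials must be a positive integer: ${f.id}`);
    }
    // Assumed rule: overlay paths are relative and never climb upward.
    const parts = f.overlayPath.split("/");
    if (f.overlayPath.startsWith("/") || parts.indexOf("..") !== -1) {
      throw new Error(`unsafe overlay path: ${f.overlayPath}`);
    }
  }
}
```

Running the validator at module load means a malformed fixture fails the import, before any paid trial starts.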

* test(eval): unit tests for agent-sdk-runner (36 tests, free tier)

Stub `queryProvider` feeds hand-crafted SDKMessage streams. Covers:
happy-path shape, all 3 rate-limit shapes + retry, workspace reset on
retry, persistent 429 -> `RateLimitExhaustedError`, non-429 propagation,
process-level concurrency cap, options propagation, artifact path
uniqueness, cost/turn mapping, and every validator rejection case.

* test(eval): paid periodic overlay-efficacy harness

`test/skill-e2e-overlay-harness.test.ts` iterates OVERLAY_FIXTURES,
runs two arms per fixture (overlay-ON, overlay-OFF) at N=10 trials with
bounded concurrency. Arms use SDK preset `claude_code` so both include
the real Claude Code system prompt; overlay-ON appends the resolved
overlay text. Saves per-trial raw event streams to
`~/.gstack/projects/<slug>/transcripts/` for forensic recovery.

Gated on `EVALS=1 && EVALS_TIER=periodic`. ~$3/run (40 trials).

* test: register overlay harness in touchfiles (both maps)

Entries for `overlay-harness-opus-4-7-fanout-toy` and
`opus-4-7-fanout-realistic` in E2E_TOUCHFILES (deps: model-overlays/,
fixtures file, runner, resolver) and E2E_TIERS (`periodic`). Passes
`test/touchfiles.test.ts` completeness check.

* fix(opus-4.7): remove "Fan out explicitly" overlay nudge

Measured counterproductive under the new SDK harness. Baseline Opus 4.7
emits first-turn parallel tool_use blocks 70% of the time on a 3-file
read prompt. With the custom nudge: 10%. With Anthropic's own canonical
`<use_parallel_tool_calls>` block from their parallel-tool-use docs:
0%. Both overlays suppress fanout; neither improves it.

On realistic multi-tool prompts (audit a project: read files + glob +
summarize), Opus 4.7 never fans out in first turn regardless of overlay.
Zero of 20 trials. Not a prompt problem.

Keeping the other three nudges (effort-match, batch questions, literal
interpretation) pending their own measurement. Harness is ready for
follow-up fixtures — add one entry to
`test/fixtures/overlay-nudges.ts` to measure any overlay bullet.

Cost of investigation: ~$7 total across 3 eval runs.

* chore: bump version and changelog (v1.6.5.0)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(eval): extend OverlayFixture with allowedTools, maxTurns, direction

Per-fixture tool allowlist unblocks measuring nudges that need Edit/Write
(e.g. literal-interpretation 'fix the failing tests' needs write access).
Per-fixture maxTurns lets harder prompts run longer without changing the
default. `direction` is cosmetic metadata for test output labeling.

Also adds reusable predicates and metrics:
- lowerIsBetter20Pct / higherIsBetter20Pct — 20% lift threshold vs baseline
- bashToolCallCount — count of Bash tool_use across the session
- turnsToCompletion — SDK-reported num_turns at result
- uniqueFilesEdited — Edit/Write/MultiEdit file_path set size

test/skill-e2e-overlay-harness.test.ts now threads fixture.allowedTools
and fixture.maxTurns through runArm.

* test(eval): 3 more overlay fixtures to measure remaining Claude nudges

Measures three overlay bullets that haven't been tested yet:

- claude-dedicated-tools-vs-bash — claude.md says 'prefer Read/Edit/Write/
  Glob/Grep over cat/sed/find/grep'. Fixture prompts 'list every TypeScript
  file under src/ and tell me what each exports' and counts Bash tool_use
  across the session. Overlay-ON should drop it by >=20%.
- opus-4-7-effort-match-trivial — opus-4-7.md says 'simple file reads
  don't need deep reasoning.' Fixture uses a trivial one-file prompt
  (config.json lookup) and measures turns_used. Overlay-ON should be
  <=80% of baseline turns.
- opus-4-7-literal-interpretation — opus-4-7.md says 'fix ALL failing
  tests, not just the obvious one.' Fixture seeds three failing test
  files with deliberately distinct failure modes and counts unique files
  edited. Overlay-ON should touch >=20% more files.

Adding a fourth fixture for any remaining overlay nudge is a single entry.
The harness is now proven on: fanout (deleted after measurement), dedicated
tools, effort-match, and literal-interpretation.

* fix(eval): handle SDK max-turns throw gracefully

Some @anthropic-ai/claude-agent-sdk versions throw from the query
generator when maxTurns is reached, instead of emitting a result
message with subtype='error_max_turns'. The runner treated that as
a non-retryable error and killed the whole periodic run on the first
fixture that exceeded its turn cap.

Added isMaxTurnsError() detector and a catch branch that synthesizes
an AgentSdkResult from events captured before the throw, with
exitReason='error_max_turns' and costUsd=0 (unknown from the thrown
path). The metric function still runs against whatever assistant
turns were collected, so the trial produces a usable number.

Hoisted events/assistantTurns/toolCalls/assistantTextParts and the
timing counters out of the inner try so the catch branch can read
them. No behavior change on the success path or on rate-limit retry
paths.
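
The catch branch can be sketched as follows. The message pattern is an assumption (the SDK's actual error text is not quoted in this log), and the result shape is a simplified stand-in for AgentSdkResult.

```typescript
// Hypothetical sketch of the max-turns catch branch.
interface PartialRunSketch {
  assistantTurns: number;
  toolCalls: number;
}

interface ResultSketch extends PartialRunSketch {
  exitReason: string;
  costUsd: number;
}

// Assumed detection rule: the thrown message mentions "max turns" in some form.
function isMaxTurnsErrorSketch(err: unknown): boolean {
  return err instanceof Error && /max(imum)?[\s_-]*turns/i.test(err.message);
}

// Builds a usable result from whatever was captured before the throw.
// Anything that is not a max-turns error is rethrown untouched.
function resultFromThrowSketch(err: unknown, partial: PartialRunSketch): ResultSketch {
  if (!isMaxTurnsErrorSketch(err)) throw err;
  // Cost is unknown on the thrown path, so it is reported as 0.
  return { ...partial, exitReason: "error_max_turns", costUsd: 0 };
}
```

This only works if the counters live outside the inner try, which is why the commit hoists them.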

* test(eval): bump maxTurns to 15 for claude-dedicated-tools-vs-bash

The prompt 'list every TypeScript file under src/ and tell me what
each exports' needs 1 turn for Glob + ~5 for Reads + 1 for summary.
Default maxTurns=5 was not enough; prior run threw from the SDK on
this fixture and tanked the whole periodic eval.

Bumping to 15 gives headroom. The runner now also handles max-turns
gracefully even if a future fixture underestimates, so this is belt
and suspenders.

* test(eval): Sonnet 4.6 variants of the 5 Opus-4.7 fixtures

Same overlays, same prompts, same metrics, `model: 'claude-sonnet-4-6'`.
Tests whether the overlays behave differently on a weaker Claude model
where baseline behavior is shakier. Sonnet trials cost ~3-4x less than
Opus so these 5 add ~$4.50 to a full run.

Measurement result from the first paired run (100 trials total,
~$14.55):

- **Sonnet + effort-match shows real overlay benefit.** With the overlay
  on, Sonnet takes 2.5 turns on a trivial `What's the version in
  config.json?` prompt. Without, it takes exactly 3.0 turns in all 10
  trials. ~17% reduction, below the 20% pass threshold but the signal
  is clean: overlay-ON distribution [2,2,2,2,2,3,3,3,3,3] vs overlay-OFF
  [3,3,3,3,3,3,3,3,3,3].
- All other Sonnet dimensions flat (fanout, dedicated-tools, literal
  interpretation). Same as Opus on those axes.
- Opus effort-match remains flat (2.60 vs 2.50, +4% slower with overlay).

Implication: overlay effects are model-stratified. The overlay stack helps Sonnet on some
axes where it does nothing on Opus. Wholesale removal would hurt Sonnet.
Per-nudge per-model measurement is the right move going forward.
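
The Sonnet effort-match arithmetic can be checked against the 20% threshold directly. The trial distributions are the ones quoted above; the predicate signatures are assumptions, since the real lowerIsBetter20Pct / higherIsBetter20Pct helpers are not shown in this log.

```typescript
// Hypothetical sketch of the 20% lift predicates applied to the quoted data.
function mean(xs: number[]): number {
  return xs.reduce((sum, x) => sum + x, 0) / xs.length;
}

// Pass when overlay-ON is at most 80% of the overlay-OFF baseline.
function lowerIsBetter20PctSketch(on: number, off: number): boolean {
  return on <= 0.8 * off;
}

// Pass when overlay-ON is at least 120% of the overlay-OFF baseline.
function higherIsBetter20PctSketch(on: number, off: number): boolean {
  return on >= 1.2 * off;
}

const sonnetOn = [2, 2, 2, 2, 2, 3, 3, 3, 3, 3];
const sonnetOff = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3];

const onMean = mean(sonnetOn);   // 2.5
const offMean = mean(sonnetOff); // 3
const reduction = (offMean - onMean) / offMean; // 1/6, about 17%
const passes = lowerIsBetter20PctSketch(onMean, offMean); // false: 2.5 > 2.4
```

The overlay-ON mean would need to reach 2.4 turns or fewer to pass, which is why a clean 17% signal still registers as a miss.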

* chore: bump version to 1.10.1.0

Updates VERSION, package.json, CHANGELOG header, and TODOS completion
marker from 1.6.5.0 to 1.10.1.0.

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 18:42:58 -07:00