gstack

Commit Graph

Author	SHA1	Message	Date
Garry Tan	c7ae63201a	v1.58.1.0 feat: hermetic local E2E + Conductor prose AskUserQuestion (#2004 ) * feat: add shared call-time isConductor() helper Single source of truth for Conductor host detection in TS consumers (CONDUCTOR_WORKSPACE_PATH / CONDUCTOR_PORT). Reads the passed env at call time, not a module-load snapshot, so unit tests can pin the env inline without Bun --preload (esm-hoist-breaks-env-pin-bootstrap). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * test: harden question-preference-hook harness against ambient Conductor env runHook copied all of process.env into the hook subprocess, so running the suite inside Conductor (CONDUCTOR_WORKSPACE_PATH/PORT set) would leak those markers. Strip them so the existing cases deterministically characterize NON-Conductor behavior before the Conductor branch lands. Baseline: 15 pass. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * feat: PreToolUse hook denies AskUserQuestion in Conductor, redirects to prose Conductor disables native AskUserQuestion and routes through a flaky MCP variant that returns '[Tool result missing due to internal error]'. The hook now denies any AUQ call in a Conductor session and instructs the model to render a prose decision brief instead (transport avoidance, not preference enforcement) — firing for one-way doors too, with a typed-confirmation requirement for destructive paths. Precedence: never-ask auto-decide still wins (user already settled those); Conductor prose is the fallback for everything else; non-Conductor behavior is byte-for-byte unchanged. Restructured the per-question loop to compute eligibility without early-returning so the Conductor branch can run as the fallback while preserving memoryContext on every exit. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * feat: Conductor renders AskUserQuestion decisions as prose by default In Conductor, native AskUserQuestion is disabled and the MCP variant is flaky, so skills now render every decision as a plain-text prose brief the user answers by typing a letter — proactively, not as a failure reaction. - Preamble emits CONDUCTOR_SESSION, gated on != headless so eval/CI inside Conductor still BLOCKs instead of rendering prose to nobody. - AskUserQuestion Format gains a Conductor-default-prose rule (auto-decide preferences still apply first; prose decisions log via gstack-question-log since PostToolUse never fires), a one-way/destructive typed-confirmation rule, and a typed-reply continuation protocol for split chains. - Regenerated all SKILL.md + ship golden fixtures; bumped affected carve skeleton caps to absorb the always-loaded additions. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * feat: deploy the Conductor AskUserQuestion hook (setup + upgrade migration) The PreToolUse hook only delivers its Conductor-prose guarantee if it's installed, but setup skips hook registration in non-interactive (conductor/CI) setups. Two fixes so layer 3 actually deploys: - setup: treat a Conductor workspace as an implicit opt-in for the PreToolUse hook on the silent fall-through (never overriding an explicit opt-out). - migration v1.58.0.0: re-register the hook for existing Conductor installs on /gstack-upgrade, idempotent and respecting plan_tune_hooks=no. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * test: E2E for Conductor prose + fix auto-decide-preserved GSTACK_HOME bug - New skill-e2e-conductor-prose (periodic): Conductor env + plan-eng-review surfaces a prose decision brief, not a silent skip. Header documents this is end-to-end behavior coverage; the deterministic Conductor guard is the question-preference-hook unit test (the PTY harness can't register the MCP variant — Codex #10). - Fix the pre-existing bug in auto-decide-preserved: it seeded the never-ask preference under GSTACK_HOME=tmpHome but never passed GSTACK_HOME into the PTY run, so the spawned claude read the real ~/.gstack and the preference was inert (Codex #9). Now passes GSTACK_HOME + CONDUCTOR_WORKSPACE_PATH to prove auto-decide still wins over the Conductor prose redirect. - Register both in touchfiles (periodic tier). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * v1.58.0.0 feat: Conductor renders AskUserQuestion decisions as prose Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * test: strip ambient Conductor env in memory-cache-injection hook harness Same dev-in-Conductor leak fixed for question-preference-hook: this suite's runHook copies process.env, so running it inside Conductor flipped the defer-path memoryContext assertions into the [conductor] prose deny. Strip CONDUCTOR_* so the cases characterize non-Conductor behavior. (CI is headless, so this only bit local Conductor runs.) Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * feat: gstack-detach — run agent eval/bench jobs in their own session Long agent-run jobs (30-60 min evals, benchmarks) die when the harness sends SIGTERM to a background task's process group on turn boundaries / monitor stops / interruptions (observed: 'script test:gate terminated by signal SIGTERM'). gstack-detach runs the command in a fresh session (python3 os.setsid, or setsid on Linux, nohup fallback) so a group SIGTERM can't reach it, and wraps it in caffeinate -i on macOS so idle-sleep can't kill it either. Returns immediately; caller polls the logfile. Secrets stay in env, never argv. The guard test pins the contract: the command runs in a different process group than the caller and outlives the launching shell. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * feat: eval:bg* scripts — detached eval runs for agents Agent-facing convenience scripts that launch the eval suites through gstack-detach so a harness SIGTERM can't kill a long run. eval:bg (diff-based), eval:bg:all, eval:bg:gate, eval:bg:periodic — each returns immediately and streams to /tmp/gstack-evals.log for polling. The plain test:evals / test:e2e scripts stay foreground for humans. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * docs: CLAUDE.md — agents must run long evals via gstack-detach Codifies the detached-execution default: agent-launched eval/benchmark runs go through bin/gstack-detach (or the eval:bg* scripts) so a harness SIGTERM or macOS idle-sleep can't kill a 30-60 min run, then poll the log with a death-aware watcher. Humans keep foreground scripts. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * feat: harden gstack-detach against all four eval-infra killers The basic bash detach fixed SIGTERM but a real run on a shared dev box hit three more killers: cross-worktree API saturation (15-way concurrency x a sibling worktree mass-timed-out the suite), a silent hang (periodic bun died with no exit marker), and shared-/tmp log contamination (a concurrent worktree's agent output bled into the log). Rewrite as a portable python3 tool that bakes in all four fixes: - fork + setsid: SIGTERM-proof (own session, survives harness polite-quit) - caffeinate -i on macOS: no idle-sleep death - --lock NAME (fcntl, machine-wide): concurrent worktrees SERIALIZE instead of saturating the shared model API - run-scoped default log (~/.gstack-dev/eval-runs/<label>-<slug>-<branch>-<ts>-<pid>): no cross-worktree collision/contamination - --timeout watchdog + a guaranteed '### gstack-detach EXIT=<code> ###' sentinel on every terminal path: no silent hang, finished-vs-died always detectable Guard test pins all four: detached pgid differs + outlives launcher, run-scoped log path, watchdog EXIT=timeout, and lock serialization (second run WAITS). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * feat: eval:bg* use run-scoped logs + machine lock + watchdog Drop the shared /tmp/gstack-evals.log path (the cross-worktree collision that contaminated a live run) for gstack-detach's run-scoped default, and add the machine-wide gstack-evals lock (concurrent worktrees serialize, no API saturation) plus per-tier watchdog timeouts (60/90/120 min). Each eval:bg* prints its run-scoped log path to poll. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * docs: wire detached-eval guidance into /ship + correct CLAUDE.md flags - /ship eval step (sections/tests.md): long eval suites launch via gstack-detach (own session, machine lock, EXIT sentinel) so a turn boundary can't kill a 30+ min run mid-ship — the exact failure observed during this branch's ship. - CLAUDE.md: correct the now-stale /tmp reference; document the --lock (serialize worktrees, no API saturation), --timeout watchdog, run-scoped log, and the guaranteed EXIT sentinel the poller breaks on. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * refactor: extract pure promotedEnv() from conductor-env-shim Single source of truth for GSTACK_* key promotion semantics. The ambient promoteConductorEnv() becomes a wrapper; behavior-preserving. Needed by the hermetic env builder which must not mutate process.env. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * feat: hermetic child-env builder for E2E runners Allowlist scrub (basics/network/named-auth kept; CONDUCTOR_, CLAUDE_, GSTACK_, MCP_, GBRAIN_, operator credentials dropped), per-runner extraAllow, overrides merge last, EVALS_HERMETIC=0 byte-identical escape hatch read at call time (ESM-hoist safe). Sync memoized singleton temp dirs (<runRoot>/.claude keeps the extractPlanFilePath contract), seeded .claude.json for non-interactive first run, pid-aware GC of crashed runs. 19 free unit tests. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> feat: session-runner spawns hermetic children + isolation canaries claude -p children now get the allowlist-scrubbed env and a gated --strict-mcp-config (EVALS_HERMETIC=0 restores operator env AND args). Two gate-tier canaries make the clean room falsifiable: hermetic-canary asserts env redirect + scrub + zero MCP servers + nonzero API-key cost from the Bash tool_result (never model prose); hermetic-sentinel plants a poisoned operator config (user CLAUDE.md + MCP server) and proves the child cannot see it. Empirically verified on claude 2.1.175: print mode needs no seed config (the seed serves the PTY path); the child CLI sets CLAUDECODE for its own tools, so that scrub is pinned in unit tests, not E2E. hermetic-env.ts joins GLOBAL_TOUCHFILES. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * feat: PTY runner spawns hermetic claude sessions launchClaudePty children get the allowlist-scrubbed env, a gated --strict-mcp-config, and the session exposes hermeticConfigDir for forensics (hermetic plan files live under <dir>/plans/ and still match extractPlanFilePath via the /.claude dir-name contract). Seeded trust state covers repo-cwd sessions; the 15s trust-watcher stays as fallback. Verified foreground via the plan-mode-no-op gate test. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * feat: codex/gemini runners spawn hermetic children Same allowlist scrub as the claude runners, with each provider's auth surface re-admitted via extraAllow (codex: OPENAI_API_KEY/CODEX_* plus its tempHome .codex copy; gemini: GEMINI_/GOOGLE_ with real HOME for ~/.gemini auth). The gemini spawn previously inherited the full operator env with no env property at all. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * feat: agent-sdk-runner spawns hermetic children via complete Options.env The historical 'env: breaks SDK auth' failure was partial-env replacement: Options.env replaces the child's entire environment, so objects lacking ANTHROPIC_API_KEY killed auth. Passing the complete hermetic env (key + PATH + redirected CLAUDE_CONFIG_DIR/GSTACK_HOME) works — validated live via query() with a Bash tool call (success, real cost, Conductor vars scrubbed). Per-test opts.env merges last; ambient key mutation still works because the builder reads process.env at call time. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * test: static tripwire pins hermetic wiring in all five runners Free-tier invariants: every runner builds child env via hermeticChildEnv, no raw ...process.env spread at any spawn site, --strict-mcp-config gated on isHermeticEnabled in both claude runners, and no test callsite passes the operator env into a runner's override parameter (scoped to runner calls — unit tests spawning gstack bin scripts directly are exempt). Mirrors the terminal-agent-pid-identity / server-embedder-terminal-port tripwire idiom. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * test: refresh codex/factory ship goldens with detached-eval block `a38089aa` added the gstack-detach guidance to the ship template and updated the claude golden; the codex and factory goldens missed the same 16-line block. Regenerated via bun run gen:skill-docs. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * docs: hermetic local E2E is the default; retire stale SDK env warning CLAUDE.md now documents the hermetic clean room (allowlist scrub, fresh seeded CLAUDE_CONFIG_DIR, temp GSTACK_HOME, --strict-mcp-config), EVALS_HERMETIC=0 as the debug escape hatch, and replaces the 'never pass env: to runAgentSdkTest' rule with the verified mechanism (partial-env replacement was the failure; complete env is safe). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix: operational-learning fixture copies lib/jsonl-store.ts with the bin gstack-learnings-log imports $SCRIPT_DIR/../lib/jsonl-store.ts (hasInjection, v1.57.5.0) — copying only the bin scripts into the temp fixture broke the script with exit 1 since then. Latent because diff-based selection rarely runs this test; surfaced when hermetic-env.ts joined GLOBAL_TOUCHFILES and selected everything. Reproduced outside the hermetic env to confirm blame. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix: ios-qa daemon scenarios use unique pidfiles under --concurrent All scenarios shared join(workDir, 'daemon.pid') through a module-scope workDir binding that beforeEach reassigns mid-flight under bun --concurrent. First daemon claims; siblings get already_running against the test process's own always-alive pid and fail in milliseconds — the failure mode seen at 15-way gate concurrency. Per-claim unique pidfiles keep the single-instance semantics under test. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix: workflow judge re-appends body-carved sections after the marker slice runWorkflowJudge appended sections/.md before slicing startMarker..endMarker. That handles skills that moved their MARKERS into sections (plan-eng, plan-design) but not document-release, which keeps its markers in the skeleton and carved the workflow BODY (Steps 2-9 -> sections/release-body.md) AFTER the endMarker — so the slice dropped it and the judge scored completeness 2 ('Steps 2-9 are in an external file'). Now any carved section the marker window excluded is re-appended, so the judge sees the full workflow the agent executes. document-release: completeness 2->5, clarity 3->4. ship/plan-ceo/plan-eng/plan-design judges unchanged (their section content is already inside the slice, so the head-dedup skips re-append). Pre-existing since the v1.57.0.0 carve (#1907); surfaced now because hermetic-env.ts is a global touchfile that selects every llm-judge test. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> harden: hermetic temp-dir GC grace window + half-seed cleanup Codex adversarial review (ship) flagged two temp-dir lifecycle edges: - GC deleted any dead-pid dir; PID reuse could delete a freshly-created dir whose original pid exited and was recycled to a live process. Now requires BOTH a dead pid AND mtime older than a 1h floor. - A seed-write failure after mkdir left an unseeded dir named with our live pid that this process's GC skips, leaking until exit. Now the partial dir is torn down before the (still loud) rethrow. Two findings left as-is by design: HOME stays allowlisted (CLAUDE_CONFIG_DIR wins for claude; codex/gemini need ~/.codex\|~/.gemini auth; FS sandbox is TODOS.md:454 scope; the hermetic-sentinel canary proves config isolation), and PTY extraArgs --mcp-config is a deliberate caller opt-in like env overrides. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * docs: document hermetic-by-default E2E + eval:bg detached runs in CONTRIBUTING The Testing & evals section now tells contributors that local E2E runners spawn children through a sealed clean room (allowlist-scrubbed env, seeded CLAUDE_CONFIG_DIR, temp GSTACK_HOME, --strict-mcp-config) so local signal matches CI, with EVALS_HERMETIC=0 as the escape hatch. The eval-tools list gains the eval:bg* detached-run scripts (gstack-detach: SIGTERM-proof, caffeinate-wrapped, machine-locked, run-scoped logs, EXIT= sentinel). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * chore: sync package.json to 1.58.1.0 The merge took main's package.json (1.58.0.0); gstack-version-bump repair fixed the working tree but the change was left uncommitted. Without this the committed tree disagrees with VERSION and CI's version-match test fails. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * docs: regenerate diagram SKILL.md with Conductor prose preamble The diagram skill (new from main) was missing the Conductor-session prose AskUserQuestion blocks that gen-skill-docs propagates to every SKILL.md. Pure generated output; reproduced by bun run gen:skill-docs. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Fable 5 <noreply@anthropic.com>	2026-06-14 11:40:57 -07:00
Garry Tan	4dfdb7cdc2	v1.57.2.0 feat: AskUserQuestion prose fallback when the tool fails at runtime (#1908 ) * feat(auq): add gstack-session-kind + echo SESSION_KIND in preamble Classifies the session as spawned \| headless \| interactive from env markers (OPENCLAW_SESSION / GSTACK_HEADLESS / CONDUCTOR_* / CLAUDE_CODE_ENTRYPOINT / CI), defaulting to interactive. Echoed once at skill start alongside BRANCH/REPO_MODE so the AskUserQuestion-failure fallback can branch without a shell-out at failure time. Degrade-safe: empty/error => interactive. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * feat(auq): prose fallback when AskUserQuestion fails (interactive sessions) On a genuine AUQ failure (tool absent, or present-but-erroring like Conductor's flaky MCP returning '[Tool result missing due to internal error]'): retry once, then branch on SESSION_KIND — spawned auto-chooses, headless BLOCKs, interactive renders a prose decision brief the user answers by typing a letter. The prose fallback MUST surface the triad: a clear ELI10 of the issue, a per-choice Completeness score, and a recommendation+why (one paragraph per choice). Carves out the [plan-tune auto-decide] denial as NOT a failure, and qualifies the former 'tool_use, not prose' assertions so the rule isn't self-contradicting. Tests pin the triad, the SESSION_KIND branch, the OV2 collision guard, the always-loaded guarantee, and a cross-file invariant on the auto-decide prefix. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test(auq): default GSTACK_HEADLESS=1 in eval/E2E runners Headless harness runs classify as headless (BLOCK on AUQ failure rather than emit a prose question no one reads). SDK runner uses ambient mutation, not the Options.env object, to avoid breaking the SDK auth pipeline. Interactive-path suites opt out by overriding the env per-run. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * feat(auq): defensive PostToolUse error-fallback hook (OV3:B) When an AskUserQuestion call returns an error/missing result, this hook injects additionalContext reminding the model to run the prose fallback for the current SESSION_KIND. It does not render prose itself — it guarantees the reminder fires at the moment of failure instead of relying on the model recalling SESSION_KIND. Inert on success and inert if the platform never invokes PostToolUse on tool errors (unverified — could not force the Conductor MCP error in a harness; see the spike doc). The prompt-level fallback covers the case regardless. Decision logic is unit-tested deterministically; registered in setup beside the existing AUQ hooks. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * chore(auq): regenerate SKILL.md for all hosts + refresh ship goldens Regenerated from the resolver changes (gen:skill-docs --host all). Refreshes the byte-exact ship golden fixtures (claude/codex/factory). Spec prose tightened so the cross-cutting preamble addition stays under the 5% per-skill parity ceiling (investigate 4.8%) — guard unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(test): kebab testNames for section-loading E2Es to match TOUCHFILES keys The two section-loading E2E tests used display-form testNames ('/ship section-loading', '/plan-ceo-review section-loading') while every other E2E testName and their E2E_TOUCHFILES keys are kebab. The completeness gate does an exact `name in E2E_TOUCHFILES` check, so it failed (pre-existing on main); diff- based selection also couldn't match them. Align to ship-section-loading / plan-ceo-section-loading. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(test): make external-host freshness checks deterministic The parameterized host smoke + --host all freshness tests assumed an external `gen:skill-docs --host all` had run first (it never does in `bun test`), so which host reported STALE varied by sibling-test timing — flaky. Regenerate the gitignored external host dirs in a beforeAll so the --dry-run check is deterministic. It still catches non-deterministic generation (the real bug class for regenerated outputs); the tracked-claude freshness test runs earlier and is unaffected. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test(parity): headroom for AUQ cross-cutting addition on carved document-release Merging main brought the carve of document-release (smaller skeleton); the AUQ prose-fallback adds ~2KB to every skill's always-loaded preamble, landing document-release at ~5.9% over the pre-carve v1.53.0.0 baseline. Add a per-carve maxSizeRatio override (CARVE_GUARDS single source of truth) and bump only this skill to 1.08. All other skills keep the strict 1.05 ceiling. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(auq): harden error-fallback hook + harness per adversarial review Codex pre-landing review found three real issues: - The PostToolUse fallback hook shared source 'plan-tune-cathedral' with the question-log hook (same event+matcher); gstack-settings-hook replaces the entry, so it would have clobbered plan-tune capture. Give it its own 'auq-error-fallback' source (separate entry, both run); ALREADY_INSTALLED now requires both sources. - isErrorResponse triggered on any string containing 'internal error'/'is_error', so a real answer or a {"is_error": false} payload could fire the fallback after a successful question. Narrow it to the missing-result sentinel + boolean is_error. - The SDK runner mutated process.env.GSTACK_HEADLESS process-wide (leaked headless into later tests). Removed; GSTACK_HEADLESS=1 now lives in the eval package.json scripts, scoped to the invocation and inherited by the SDK child. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * chore: bump version and changelog (v1.57.2.0) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-07 21:38:21 -07:00
Garry Tan	0570ef93a5	v1.24.0.0 feat: cross-platform hardening — curated Windows lane + Bun.which resolver + path-portability helper (#1252 ) * feat(paths): bin/gstack-paths helper + migrate 8 skills off inline state-root chains New bin/gstack-paths emits GSTACK_STATE_ROOT, PLAN_ROOT, TMP_ROOT exports for skill bash blocks to source via eval. Honors GSTACK_HOME → CLAUDE_PLUGIN_DATA → $HOME/.gstack → .gstack (and parallel chains for plan/tmp roots) so skills work the same in plugin installs, global installs, and CI containers without HOME. Eight skills migrate off inline ${CLAUDE_PLUGIN_DATA:-...} or ${GSTACK_HOME:-...} chains: careful, freeze, guard, unfreeze, investigate, context-save, context-restore, learn, office-hours, plan-tune, codex. Resolved values are identical, so existing tests cover correctness; the win is consolidating 11 copy-pasted fallback chains behind one helper. codex/SKILL.md.tmpl gets a new Step 0.6 Resolve portable roots that sources gstack-paths once, then replaces hardcoded ~/.claude/plans/.md and /tmp/codex--XXXXXX.txt with "$PLAN_ROOT"/.md and "$TMP_ROOT/codex--XXXXXX.txt". Hardening direction credited to the McGluut/gstack fork; this is upstream's factoring of the per-skill chain the fork inlined. Tests: test/gstack-paths.test.ts covers all three fallback chains with 8 unit tests (HOME unset, CLAUDE_PLUGIN_DATA set, GSTACK_HOME wins, etc). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(claude-bin): Bun.which wrapper for cross-platform claude resolution Replaces 75 LOC of fork-side reimplementation (PATH parsing, Windows PATHEXT, case-insensitive Path/PATH, X_OK) with a thin wrapper around Bun.which() — the runtime built-in that already does all of it. New file is ~70 LOC including the override + arg-prefix logic the runtime doesn't cover. Override branch fixed: GSTACK_CLAUDE_BIN=wsl now resolves through Bun.which() just like a bare claude lookup would. The McGluut fork's claude-bin.ts only handled absolute-path overrides; bare commands silently returned null. Passing the override value through Bun.which fixes the documented use case for free. Five hardcoded claude spawn sites rewired through resolveClaudeCommand: - browse/src/security-classifier.ts:396 — version probe - browse/src/security-classifier.ts:496 — Haiku transcript classifier - scripts/preflight-agent-sdk.ts — preflight binary pinning - test/helpers/providers/claude.ts — LLM judge availability + run - test/helpers/agent-sdk-runner.ts — SDK harness binary resolver All retain their existing degrade-on-missing semantics. Tests: browse/test/claude-bin.test.ts has 9 unit tests including the override-PATH-resolution case the fork's version got wrong. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs+test: AGENTS.md/docs/skills.md inventory sync + private-path leak detector Inventory sync (codex-flagged drift): - /debug → /investigate (skill renamed in v1.0.1.0) - AGENTS.md grows from 21 to 40+ skills, organized by category (plan reviews, implementation, release, operational, browser, safety) - docs/skills.md gains 11 missing entries: /plan-devex-review, /devex-review, /plan-tune, /context-save, /context-restore, /health, /landing-report, /benchmark-models, /pair-agent, /setup-gbrain, /make-pdf - Stale "<5s bun test" claim dropped — slim-preamble harness + new tests means no realistic universal claim to make - Adds explicit "Mac + Linux full, curated Windows lane" platform statement + "Git Bash / MSYS today, native PowerShell future" install note New invariants in test/skill-validation.test.ts (~80 LOC): - Private-path leak detector scans every SKILL.md / SKILL.md.tmpl for known maintainer-only filenames (coordination-board.md, SEEKING_LOG.md, RATIONAL_SUBJECT.md, VALUE_SIGNAL_LOOP.md, C:\LLM Playground\go). Adapted from the McGluut fork's skill-contract-audit.ts; we don't take the script wholesale because most of its checks are already covered by test/gen-skill-docs.test.ts:1668-2074 and test/skill-validation.test.ts:1419 — only the private-path scan and doc-inventory cross-check are new. - Doc-inventory cross-check: every skill directory with a SKILL.md.tmpl must appear in both AGENTS.md and docs/skills.md. Catches the inventory drift this commit is fixing — without this test it would just drift again. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(windows): curated windows-free-tests CI job + test-free-shards curation Codex's v1.18.0.0 review flagged that a windows-latest matrix entry on the existing Linux-container evals.yml workflow can't work as a drop-in, and that the free test suite has POSIX-bound dependencies a sharded runner doesn't fix on its own. This commit takes McGluut's test-free-shards.ts (190 LOC), adds a Windows-fragility scan, and runs the curated subset on a separate non-container windows-latest job. scripts/test-free-shards.ts: - Enumeration + paid-eval filtering + stable-hash sharding (FNV-1a). Adapted from McGluut/gstack fork. - Upstream-original: --windows-only filter scans each test's content for POSIX-bound patterns: hardcoded /bin/sh, spawn('sh', ...), bash -c, raw /tmp/, chmod, xargs, which claude. Files matching are excluded with the reason logged. Currently filters 25 of 128 free tests; remaining 103 run on windows-latest. .github/workflows/windows-free-tests.yml: - Separate non-container job (NOT a matrix entry on evals.yml). Runs: bun run test:windows # curated subset bun test browse/test/claude-bin.test.ts # PATHEXT+overrides on Windows bun test test/gstack-paths.test.ts # state-root resolution package.json: new test:free + test:windows scripts. Honest about scope (codex-flagged): this does NOT make the full free suite Windows-safe. The 25 excluded tests need POSIX-only surfaces ported off shell primitives (test/ship-version-sync.test.ts:72 hardcodes /bin/bash, etc). Tracked as a P4 follow-up TODO. Full Windows parity is the next wave; this release ships the curated lane. Tests: test/test-free-shards.test.ts has 14 unit tests covering enumeration, paid-eval filtering, Windows-fragility detection (POSIX patterns + safe code), and stable sharding determinism. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore(release): v1.20.0.0 — cross-platform hardening, curated Windows lane Cross-platform hardening. Mac + Linux full, curated Windows lane added. Workspace-aware queue at ship time: - v1.17.0.0 claimed by garrytan/setup-gbrain-run (PR #1234) - v1.19.0.0 claimed by garrytan/browserharness (PR #1233) - This branch claims v1.20.0.0 (next available slot) (Initially bumped to v1.18.0.0 during plan-mode implementation; rebumped to v1.20.0.0 at /ship time when gstack-next-version detected the queue had moved.) Headline numbers (full release-note in CHANGELOG.md): - 2 new shared resolvers: bin/gstack-paths (61 LOC), browse/src/claude-bin.ts (73 LOC) - 8 skills migrated off inline state-root chains - 5 hardcoded claude spawn sites rewired through the shared resolver - 75 LOC of fork-side reimplementation replaced by Bun.which() - 103 of 128 free tests run on windows-latest (curated, ~80%) - +31 new unit tests + 3 new invariants - AGENTS.md inventory grows from 21 to 40+ skills Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(windows-ci): configure git identity + extend Windows-fragility curation First windows-free-tests CI run surfaced 34 failures across two patterns: 1. Tests that init a temp git repo via execSync('git commit ...') — Windows runner has no default git user.email/user.name, so the commit fails. Fix: add a "Configure git identity" step to .github/workflows/windows-free-tests.yml that sets a CI-only identity globally. 2. Tests that use POSIX-only APIs unconditionally: - file-mode bitmask checks (`stat.mode & 0o600`, `mode & 0o111`) — Windows fakes mode bits and these assertions don't compose - hardcoded forward-slash path assertions (`file.endsWith('/tab-42.json')`) — Windows path separators are '\\' Fix: extend WINDOWS_FRAGILE_PATTERNS in scripts/test-free-shards.ts to detect both. 8 additional tests now excluded from the curated Windows subset with logged reasons: - browse/test/security-review-flow.test.ts (file mode) - browse/test/security-sidepanel-dom.test.ts (forward-slash path) - browse/test/url-validation.test.ts (forward-slash path) - test/gbrain-repo-policy.test.ts (file mode) - test/relink.test.ts (file mode) - test/skill-validation.test.ts (file mode — single assertion at :934) - test/team-mode.test.ts (file mode — also kills its 30 git-init beforeEach failures) - test/upgrade-migration-v1.test.ts (file mode) Curated Windows subset: 103 → 95 tests (still ~74% of free suite). All 14 test-free-shards unit tests still pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(windows-ci): enforce LF + build server-node.mjs in CI Second round of windows-free-tests fixes after the first push. Curated subset went from 386/34 to 58/4 fails. Remaining 4 fails + 1 error trace to two root causes: 1. Line-ending sensitivity. Windows checkout with core.autocrlf=true converts .md/.tmpl files to CRLF. Tests that parse YAML frontmatter with `/^---\n([\\s\\S]+?)\n---/` then return zero matches — skill-collision- sentinel.test.ts:120 enumerated 0 skills on Windows, cascading into 3 downstream test failures (sanity, KNOWN_COLLISIONS, /checkpoint resolved). Fix: add .gitattributes that pins LF for .md/.tmpl/.yml/.json/.toml/.sh/ .ts/.tsx/.js/.mjs/.cjs/.bash. Root-cause fix; prevents future similar tests from hitting the same trap. Also keeps bash scripts LF on Linux runners (CRLF in shebangs produces "bad interpreter" errors). 2. Module-level Windows assertion in browse/src/cli.ts:82 throws if browse/dist/server-node.mjs is missing. Any test that transitively loads cli.ts (e.g., browse/test/tab-isolation.test.ts via shard mate imports) then fails to even start. server-node.mjs is generated by bash browse/scripts/build-node-server.sh, which `bun run build` calls but `bun install` does not. Fix: add a "Build server-node.mjs" step to .github/workflows/ windows-free-tests.yml. Calls only the node-server build script, not full `bun run build` — we don't need the compiled binaries for tests and the full build is slow. Expected: skill-collision-sentinel goes 0→3 pass (sanity, KNOWN_COLLISIONS, /checkpoint resolved). tab-isolation's "unhandled error between tests" disappears. Remaining tests should be green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(windows-ci): platform-aware claude-bin test + curate bin/ shebang spawns Round 3 of windows-free-tests fixes. Round 2 (LF gitattributes + server-node.mjs build) cleared shard 1 entirely (skill-collision-sentinel and tab-isolation green). Shard 2 surfaced two more issues: 1. browse/test/claude-bin.test.ts:50 — the "PATH-resolvable override" test creates a fake binary 'fake-claude-cli' (no extension) and expects Bun.which to find it. On Windows, Bun.which probes PATHEXT extensions (.cmd, .exe, .bat) — a bare-name file is not discoverable. Production behavior is correct; the test was Mac/Linux-shaped. Fix: branch on process.platform. On Windows, write 'fake-claude-cli.cmd' with a Windows batch payload instead of a POSIX shebang script. 2. test/gstack-question-log.test.ts (and 18 sibling tests) — spawn a bash shebang script via spawnSync(BIN, args). Git Bash on Windows can run `bash /path/to/script` but spawnSync invokes CreateProcess directly, which doesn't parse #!/usr/bin/env bash. All these tests are Windows-fragile and can't run as-is. Fix: extend WINDOWS_FRAGILE_PATTERNS with `path.join(.., 'bin', ..)` detector. Curates 19 additional tests (benchmark-cli, brain-sync, builder-profile, explain-level-config, gbrain-, gstack-question-, hook-scripts, learnings, plan-tune, review-log, secret-sink-harness, taste-engine, telemetry, timeline, uninstall). Curated Windows subset: 95 → 76 tests (~59% of free suite). Still meaningful Windows coverage. The 52 excluded tests are tracked as a follow-up TODO for full Windows parity (shebang-bin spawns + POSIX file modes + raw /tmp/ etc). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(windows-ci): curate Playwright-launching tests Round 4 of windows-free-tests fixes. Round 3 cleared shard 2 except for browse/test/batch.test.ts:35 which calls `await bm.launch()` and triggers Playwright Chromium launch. The windows-latest runner doesn't have Chromium installed (browser bring-up is a separate concern, tracked by PR #1238 windows-pty-bun-pty-fix). Fix: extend WINDOWS_FRAGILE_PATTERNS with `await \\w+\\.launch\\(` matcher. Catches batch.test.ts plus 7 sibling tests (commands, compare-board, content-security, handoff, security-live-playwright, security-sidepanel-dom, snapshot — most already excluded by other patterns). Curated Windows subset: 76 → 72 tests (~56% of free suite). Net curation across all 4 rounds: 56 of 128 free tests excluded, each with a logged reason. The 56 excluded fall into 6 buckets — POSIX shells, raw /tmp/, chmod/xargs, file mode bitmasks, forward-slash path assertions, bin/ shebang spawns, and Playwright launches — all tracked as a P4 follow-up TODO for full Windows parity. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(windows-ci): catch destructured join() bin-spawns + browse server tests Round 5 of windows-free-tests fixes. Round 4 caught Playwright launchers but two more failure shapes appeared in shard 5: 1. test/diff-scope.test.ts uses `import { join }` (destructured) and `join(import.meta.dir, '..', 'bin', 'gstack-diff-scope')`. My round-3 pattern only matched `path.join(...)` — the destructured form slipped through. Tightened the pattern to match the literal `, 'bin', '<name>'` path-segment shape regardless of whether it's `path.join` or `join` directly. 2. browse/test/sidebar-integration.test.ts spawns the browse server via `spawn(['bun', 'run', server.ts])` with BROWSE_HEADLESS_SKIP=1. The Bun-run-server.ts path is the same Playwright-on-Windows broken path that the windows-free-tests job intentionally avoids — the server-node.mjs route only kicks in for the compiled binary, not direct Bun runs of the TypeScript source. Added a BROWSE_HEADLESS_SKIP / spawn-bun-run pattern. Curated Windows subset: 72 → 73 tests (~57% of free suite). Net up by 1 because the tightened bin pattern released one test that was a false positive in the loose `path\\.join` form. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(windows-ci): broaden bin/ pattern to match path.join(ROOT, 'bin') Round 6. Round 5 tightened the bin/ pattern to require a script-name segment after 'bin', which inadvertently released test/brain-sync.test.ts that uses: const BIN = path.join(ROOT, 'bin'); const full = bin.startsWith('/') ? bin : path.join(BIN, bin); The 'bin' segment is the LAST argument to path.join — there's no literal script name to match. The earlier looser pattern caught this; round 5 broke that. Fix: revert to `,\\s['"]bin['"]\\s[,)]` which matches both forms: - `, 'bin', 'script-name')` (path.join with name) — typical - `, 'bin')` (path.join ending at bin) — brain-sync style Curated subset: 73 → 66 tests (~52% of free suite). The 7 additional exclusions are all bin-script tests that were misclassified by the round-5 tightening. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(find-browse): guard main() with import.meta.main Round 7 of windows-free-tests fixes (and a genuine bug fix beyond Windows). browse/src/find-browse.ts called main() unconditionally at module load. main() calls process.exit(1) when no compiled `browse` binary exists at the known install paths. Any test that imports `locateBinary` from this module then exits the entire test process before any tests run. This affected the windows-free-tests CI lane because the runner intentionally doesn't compile the browse binary (only server-node.mjs is built — full binary compilation is slow and not needed for the curated subset). It would also affect any Mac/Linux contributor who runs tests in a fresh checkout before running ./setup, though the symptom is rarer there. Fix: wrap `main()` in `if (import.meta.main) { main() }`. The CLI invocation (via the find-browse binary or `bun run browse/src/find-browse.ts`) still runs main() and emits the path. Imports get only the named exports. Verified locally: - `bun run browse/src/find-browse.ts` still prints the binary path. - `import { locateBinary } from '...'` no longer exits the process. - `bun test browse/test/find-browse.test.ts` passes 4/4 (was crashing at module load). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(windows-ci): pin LF on extensionless executables (setup, bin/, scripts/) Round 8 of windows-free-tests fixes. Round 7 cleared find-browse + most shards; one fail left in shard 7: test/setup-codesign.test.ts > codesign shell snippet is syntactically valid expect(received).toBeTruthy() — match was null The test extracts a bash codesign block from the `setup` file via a \\n-anchored regex, then syntax-checks it with `bash -n`. On Windows the regex returned null because the `setup` file was checked out with CRLF endings — my round-2 .gitattributes only covered files matched by extension patterns (.md, .sh, .ts) and `setup` is extensionless. Fix: extend .gitattributes with explicit rules for extensionless executables: setup text eol=lf bin/ text eol=lf */scripts/ text eol=lf This also LF-pins all the bash bin/ scripts (gstack-paths, gstack-slug, gstack-codex-probe, ...) which would otherwise break with "bad interpreter" errors on Linux if a Windows contributor accidentally committed CRLF versions. Defense in depth. Verified locally: `git check-attr eol setup bin/gstack-paths` reports `eol: lf` for both. Renormalized via `git add --renormalize` so any already-LF files in the repo stay LF after the .gitattributes change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(windows-ci): gen:skill-docs in workflow + known-bad list for env-specific tests Round 9 of windows-free-tests fixes. Round 8 cleared shard 7; shard 8 surfaced 4 fails: 1+2. test/gen-skill-docs.test.ts golden-file regression for Codex + Factory ship skills failed with ENOENT on `.agents/skills/gstack-ship/SKILL.md` and `.factory/skills/gstack-ship/SKILL.md`. These are gitignored gen-skill-docs outputs that the Mac/Linux CI workflows already regenerate elsewhere — the windows-free-tests lane never did. Fix: add `bun run gen:skill-docs --host all` step to windows-free-tests.yml after `bun install`. 3. test/host-config.test.ts:377 "detect finds claude" asserts the `claude` binary is on PATH. True when running inside Claude Code; false on a bare CI runner. 4. browse/test/findport.test.ts:117 asserts Bun.serve.stop() is fire-and-forget (returns undefined). Bun's Windows behavior for this polyfill differs; the assertion is Bun-on-non-Windows-specific. Both 3 and 4 are environment/runtime-specific failures that don't fit a regex pattern. Added a KNOWN_WINDOWS_INCOMPATIBLE explicit list to scripts/test-free-shards.ts so they're curated by exact path, with a reason string. The list is for cases where pattern matching can't infer the failure shape from the source file alone. Curated subset: 66 → 64 tests (~50% of free suite). 14 unit tests in test/test-free-shards.test.ts still pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(windows-ci): curate pre-existing breakage from v1.14.0.0 sidebar refactor Round 10 of windows-free-tests fixes. Round 9 cleared shards 7+8; shard 9 surfaced ENOENT for browse/src/sidebar-agent.ts. That file was DELETED in v1.14.0.0 (sidebar REPL refactor — sidebar-agent.ts and the chat queue path were ripped in favor of the interactive xterm.js PTY). 10 security tests still reference it via top-level fs.readFileSync and fail on import. Verified locally: `bun test browse/test/security-source-contracts.test.ts` on this branch reports 0 pass, 1 fail, 1 error. Mac/Linux CI exits 0 because Bun reports module-load failures as "error" not "fail" and the exit code is 0; Windows CI exits 1 (stricter). Same pre-existing breakage on every platform — just only visible in shard 9 of the Windows lane. Fix: add WINDOWS_FRAGILE_PATTERNS entry matching `sidebar-agent.ts` / `src/sidebar-agent` references. Curates browse/test/sidebar-ux.test.ts (other 9 likely caught by paid-eval filter or earlier patterns). Tracked as a follow-up TODO: update or delete the 10 security tests that reference deleted source. Out of scope for v1.20.0.0 portability wave. Curated subset: 64 → 63 tests (~49% of free suite). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(windows-ci): broaden sidebar-agent.ts pattern to catch all references * fix(windows-ci): catch ./bin/<name> direct path spawns * fix(windows-ci): scope Windows job to v1.20.0.0 new portability work 12 rounds of curation revealed that gstack has a long tail of tests with environment-specific assumptions (POSIX paths, /tmp, mode bits, bash spawns, deleted v1.14 sidebar refs, HOME=unset guards, Bun polyfill specifics). Each round of pattern-matching curation caught 1-2 new buckets but kept surfacing more. Honest scope for v1.20.0.0: this PR delivers two new portability primitives (bin/gstack-paths + browse/src/claude-bin.ts). The Windows CI job should verify those primitives work on Windows. Full-suite Windows parity is a P4 follow-up that requires touching many tests that aren't part of this PR's scope. Change: windows-free-tests.yml now runs: bun test test/gstack-paths.test.ts \\ browse/test/claude-bin.test.ts \\ test/test-free-shards.test.ts That's 31 tests targeting exactly the new code paths shipped here. The release-note headline ("curated Windows lane added") becomes truthful when this passes — we have a real Windows CI gate on the new portability work, not a rebadged failure-tolerant attempt at the full suite. Retained: scripts/test-free-shards.ts curation logic (informational output via `--list`, useful for future expansion of the Windows lane when contributors port specific tests). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(test): invoke bin/gstack-paths via bash (Windows shebang fix) Round 13 of windows-free-tests fixes. Round 12 (scope pivot) revealed all 8 gstack-paths tests fail on Windows because the test invokes the bash shebang script directly: spawnSync(BIN, []) # BIN = path.join(ROOT, 'bin', 'gstack-paths') Windows CreateProcess can't parse `#!/usr/bin/env bash` from the file. The script never runs on Windows via this invocation path. Fix: change to `spawnSync('bash', [BIN], ...)`. This matches production usage — the script is sourced from inside skill bash blocks via `eval "$(~/.claude/skills/gstack/bin/gstack-paths)"`, where bash is always the executor. Mac/Linux behavior is identical (bash invocation of a bash script). Verified locally: 8/8 tests still pass on macOS. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore(release): rebump v1.20.0.0 → v1.22.0.0 (queue drift) Version-gate workflow rejected v1.20.0.0 because the queue moved during the windows-free-tests fix loop: v1.16.0.0 → garrytan/gbrowser-unleashed (PR #1253) [new since last bump] v1.17.0.0 → garrytan/setup-gbrain-run (PR #1234) v1.19.0.0 → garrytan/browserharness (PR #1233) v1.21.1.0 → garrytan/pty-plan-mode-e2e (PR #1255) [new since last bump] Two new sibling PRs landed slot claims while we iterated on Windows. Next free MINOR slot is v1.22.0.0. Updated VERSION, package.json, CHANGELOG header + body. Also pushing the round-13 windows-fix in parallel (test invokes bin/gstack-paths via bash to handle Windows shebang). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(test): clear USERPROFILE alongside HOME (Git Bash auto-populates HOME) Final Windows fix. 29/31 pass; 2 fail in gstack-paths HOME-unset tests: (fail) CWD fallback when HOME also unset (container env) (fail) PLAN_ROOT chain: GSTACK_PLAN_DIR > CLAUDE_PLANS_DIR > HOME > CWD Root cause: Git Bash on Windows auto-populates `HOME` from `USERPROFILE` at shell startup if HOME is empty/unset. Passing `HOME: ''` to spawnSync does set HOME='' for the child, but Git Bash overwrites it from USERPROFILE during init, so the script sees `${HOME:-}` as non-empty (C:\\Users\\runneradmin) and never reaches the CWD-fallback branch. Fix: clear USERPROFILE='' too. On Linux/Mac it's a no-op (env var doesn't exist in normal env); on Windows Git Bash it stops the HOME auto-populate. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(test): skip HOME-unset assertions on Windows (Git Bash auto-populates) 29/31 → 31/31 expected on Windows. Final fix: The 2 still-failing gstack-paths tests assert CWD-fallback behavior when HOME is genuinely unset (Linux container scenario). On Windows Git Bash, HOME gets auto-derived from USERPROFILE → HOMEDRIVE+HOMEPATH → /c/Users/<user> during shell startup. Clearing all three of those env vars in the spawn still results in HOME being non-empty by the time the script runs. The bash script's CWD-fallback logic IS correct — it just isn't exercisable through the Git Bash test surface. Skip those specific assertions on Windows; they continue to verify on Linux/Mac. This is the only platform-specific test guard introduced; it's narrowly scoped to the unreachable code path, not a bypass of the real check. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 07:21:28 -07:00
Garry Tan	9e244c0bed	v1.11.1.0 fix: plan-mode handshake + canUseTool test harness (#1182 ) * feat: plan-mode handshake for interactive review skills Add a preamble-level STOP-Ask handshake that fires when the user invokes any of the 4 interactive review skills (plan-ceo-review, plan-eng-review, plan-design-review, plan-devex-review) while their Claude Code session is in plan mode. Without this gate, plan mode's "this supercedes any other instructions" system-reminder outranked the skills' interactive STOP gates and the skills silently wrote plan files without any per-finding AskUserQuestion. The handshake offers 2 options (exit-and-rerun, cancel) — the original third "stay and batch" option was dropped after two independent reviewers flagged it as a silent bypass of the skills' anti-skip rule. Architecture decisions (CEO+Eng review): - Preamble-level resolver, not per-template injection (Codex finding #2) - Position 1 in preamble composition: after bash block (_SESSION_ID live), before onboarding AskUserQuestion gates (so fresh-install users see the handshake first, not drowned in telemetry/proactive/routing prompts) - Generator-only `interactive: true` frontmatter flag, following the `preamble-tier` precedent (no host-config frontmatter allowlist edits) - Host-scoped to Claude via `ctx.host === 'claude'` check inside the resolver (simpler than `suppressedResolvers` which only gates `{{}}` placeholders) - One-way-door classification in scripts/question-registry.ts for all 4 skills so question-tuning `never-ask` preferences can't suppress the gate - Synchronous telemetry write to ~/.gstack/analytics/skill-usage.jsonl on handshake fire (captures A-exit and C-cancel outcomes that terminate the skill before end-of-run telemetry runs) Also adds an explicit STOP block to plan-ceo-review Step 0C-bis so the approach-selection question can't silently skip to mode selection. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat: extend agent-sdk-runner with canUseTool for AskUserQuestion interception Test harness at test/helpers/agent-sdk-runner.ts gains an optional `canUseTool` callback parameter. When a test supplies it, the harness flips `permissionMode` from `bypassPermissions` (overlay-harness default) to `default` so the SDK actually invokes the callback on every tool use, and auto-adds `AskUserQuestion` to `allowedTools` so Claude can fire it at all. Exports a `passThroughNonAskUserQuestion` helper so tests that only want to intercept AskUserQuestion can auto-allow every other tool with one line: `return passThroughNonAskUserQuestion(toolName, input)`. This is the foundation for D14 — every future interactive-skill E2E test can now assert on AskUserQuestion shape and routing. Previous E2E tests at `test/skill-e2e.test.ts` explicitly instructed the model to skip AskUserQuestion ("non-interactive run") which meant no test could actually verify the question content or routing. 6 new unit tests in test/agent-sdk-runner.test.ts cover: - permissionMode flips to 'default' when canUseTool supplied - permissionMode stays 'bypassPermissions' when canUseTool absent - canUseTool callback reaches the SDK options - AskUserQuestion auto-added to allowedTools when canUseTool supplied - AskUserQuestion NOT added when canUseTool absent - passThroughNonAskUserQuestion helper returns allow+updatedInput Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * test: plan-mode handshake E2E coverage and unit assertions Adds 6 E2E test files and 8 new unit assertions to verify the plan-mode handshake works end-to-end and stays correct under regeneration. E2E tests (gate-tier, paid, EVALS=1 EVALS_TIER=gate): - test/skill-e2e-plan-ceo-plan-mode.test.ts — handshake fires before any Write/Edit when plan-mode distinctive phrase is present; 2-option shape (Exit/Cancel); option A routes to ExitPlanMode cleanly - test/skill-e2e-plan-eng-plan-mode.test.ts — same contract for plan-eng - test/skill-e2e-plan-design-plan-mode.test.ts — same contract for plan-design; exercises C-cancel branch instead of A-exit - test/skill-e2e-plan-devex-plan-mode.test.ts — same contract for plan-devex - test/skill-e2e-plan-mode-no-op.test.ts — negative regression: handshake must NOT fire when distinctive phrase is absent; skill proceeds normally through Step 0 (REGRESSION RULE guardrail against breaking existing interactive-review sessions) - test/e2e-harness-audit.test.ts — free unit test asserting every `interactive: true` skill has at least one canUseTool-using test file (prevents future drift where a skill opts in without coverage) Shared helper test/helpers/plan-mode-handshake-helpers.ts centralizes the canUseTool interceptor + distinctive-phrase injection so the 4 sibling E2E tests are thin wiring (~20 LOC each) and can't drift out of sync. Unit assertions added to test/gen-skill-docs.test.ts: - handshake section present in all 4 Claude-generated SKILL.md files - handshake section absent from non-interactive Claude skills (ship, review, qa, office-hours, codex, retro, cso) - handshake section absent from non-Claude host outputs (.agents, etc.) - 0C-bis STOP block present in plan-ceo-review/SKILL.md at correct position (between the "Present these approach options" line and "### 0D-prelude" header) - handshake resolver wired BEFORE generateUpgradeCheck in preamble composition order 6 new gate-tier entries added to test/helpers/touchfiles.ts so any change to the handshake resolver, preamble composition, skill templates, question registry, one-way-door classifier, or agent-sdk-runner fires the relevant E2E tests. test/touchfiles.test.ts updated for the new selection count (plan-ceo-review/** now triggers 15 tests, up from 8). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * chore(v1.11.1.0): VERSION bump + CHANGELOG entry + TODOS follow-ups Bumps from main's v1.11.0.0 to v1.11.1.0 (PATCH — bug-fix release, no new user-facing artifacts). CHANGELOG entry covers the plan-mode handshake, agent-sdk-runner canUseTool extension, and the 2 follow-up TODOs. CHANGELOG order: v1.11.1.0 (this) → v1.11.0.0 (workspace-aware ship, merged from main) → v1.10.1.0 (overlay efficacy harness). No duplicate headers. Syncs package.json version to match VERSION per the Step 12 idempotency invariant (both files must agree or /ship halts). TODOS.md: - Preserves the Testing/security-bench-haiku-responses P1 added on main - Adds P1 "Structural STOP-Ask forcing function" — broader class of the bug this release fixes - Adds P2 "Apply interactive: true to non-review skills (office-hours, codex, investigate, qa, retro, cso)" Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-24 00:04:53 -07:00
Garry Tan	e3d7f49c74	feat(v1.10.1.0): overlay efficacy harness + Opus 4.7 fanout nudge removal (#1166 ) * refactor: export readOverlay from model-overlay resolver Needed by the overlay-efficacy eval harness to resolve INHERIT directives without going through generateModelOverlay's full TemplateContext. * chore: add @anthropic-ai/claude-agent-sdk@0.2.117 dep Pinned exact for SDK event-shape stability. Used by the overlay-efficacy harness to drive the model through a closer-to-real Claude Code harness than `claude -p`. * feat(preflight): sanity check for agent-sdk + overlay resolver Verifies: SDK loads, claude-opus-4-7 is a live API model, SDKMessage event shape matches assumptions, readOverlay resolves INHERIT directives and includes expected content. Run with `bun run scripts/preflight-agent-sdk.ts`. PREFLIGHT OK on first run, $0.013 API spend. * feat(eval): parametric overlay-efficacy harness (runner + fixtures) `test/helpers/agent-sdk-runner.ts` wraps @anthropic-ai/claude-agent-sdk with explicit `AgentSdkResult` types, process-level API concurrency semaphore, and 3-shape 429 retry (thrown error, result-message error, mid-stream SDKRateLimitEvent). Pins the local claude binary via `pathToClaudeCodeExecutable`. `test/fixtures/overlay-nudges.ts` holds the typed registry. Two fixtures for the first measurement: `opus-4-7-fanout-toy` (3-file read) and `opus-4-7-fanout-realistic` (mixed-tool audit). Strict validator rejects duplicate ids, non-integer trials, unsafe overlay paths, non-safe id chars, and missing overlay files at module load. Adding a future overlay nudge eval = one fixture entry. * test(eval): unit tests for agent-sdk-runner (36 tests, free tier) Stub `queryProvider` feeds hand-crafted SDKMessage streams. Covers: happy-path shape, all 3 rate-limit shapes + retry, workspace reset on retry, persistent 429 -> `RateLimitExhaustedError`, non-429 propagation, process-level concurrency cap, options propagation, artifact path uniqueness, cost/turn mapping, and every validator rejection case. * test(eval): paid periodic overlay-efficacy harness `test/skill-e2e-overlay-harness.test.ts` iterates OVERLAY_FIXTURES, runs two arms per fixture (overlay-ON, overlay-OFF) at N=10 trials with bounded concurrency. Arms use SDK preset `claude_code` so both include the real Claude Code system prompt; overlay-ON appends the resolved overlay text. Saves per-trial raw event streams to `~/.gstack/projects/<slug>/transcripts/` for forensic recovery. Gated on `EVALS=1 && EVALS_TIER=periodic`. ~$3/run (40 trials). * test: register overlay harness in touchfiles (both maps) Entries for `overlay-harness-opus-4-7-fanout-toy` and `opus-4-7-fanout-realistic` in E2E_TOUCHFILES (deps: model-overlays/, fixtures file, runner, resolver) and E2E_TIERS (`periodic`). Passes `test/touchfiles.test.ts` completeness check. * fix(opus-4.7): remove "Fan out explicitly" overlay nudge Measured counterproductive under the new SDK harness. Baseline Opus 4.7 emits first-turn parallel tool_use blocks 70% of the time on a 3-file read prompt. With the custom nudge: 10%. With Anthropic's own canonical `<use_parallel_tool_calls>` block from their parallel-tool-use docs: 0%. Both overlays suppress fanout; neither improves it. On realistic multi-tool prompts (audit a project: read files + glob + summarize), Opus 4.7 never fans out in first turn regardless of overlay. Zero of 20 trials. Not a prompt problem. Keeping the other three nudges (effort-match, batch questions, literal interpretation) pending their own measurement. Harness is ready for follow-up fixtures — add one entry to `test/fixtures/overlay-nudges.ts` to measure any overlay bullet. Cost of investigation: ~$7 total across 3 eval runs. * chore: bump version and changelog (v1.6.5.0) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(eval): extend OverlayFixture with allowedTools, maxTurns, direction Per-fixture tool allowlist unblocks measuring nudges that need Edit/Write (e.g. literal-interpretation 'fix the failing tests' needs write access). Per-fixture maxTurns lets harder prompts run longer without changing the default. `direction` is cosmetic metadata for test output labeling. Also adds reusable predicates and metrics: - lowerIsBetter20Pct / higherIsBetter20Pct — 20% lift threshold vs baseline - bashToolCallCount — count of Bash tool_use across the session - turnsToCompletion — SDK-reported num_turns at result - uniqueFilesEdited — Edit/Write/MultiEdit file_path set size test/skill-e2e-overlay-harness.test.ts now threads fixture.allowedTools and fixture.maxTurns through runArm. * test(eval): 3 more overlay fixtures to measure remaining Claude nudges Measures three overlay bullets that haven't been tested yet: - claude-dedicated-tools-vs-bash — claude.md says 'prefer Read/Edit/Write/ Glob/Grep over cat/sed/find/grep'. Fixture prompts 'list every TypeScript file under src/ and tell me what each exports' and counts Bash tool_use across the session. Overlay-ON should drop it by >=20%. - opus-4-7-effort-match-trivial — opus-4-7.md says 'simple file reads don't need deep reasoning.' Fixture uses a trivial one-file prompt (config.json lookup) and measures turns_used. Overlay-ON should be <=80% of baseline turns. - opus-4-7-literal-interpretation — opus-4-7.md says 'fix ALL failing tests, not just the obvious one.' Fixture seeds three failing test files with deliberately distinct failure modes and counts unique files edited. Overlay-ON should touch >=20% more files. Adding a fourth fixture for any remaining overlay nudge is a single entry. The harness is now proven on: fanout (deleted after measurement), dedicated tools, effort-match, and literal-interpretation. * fix(eval): handle SDK max-turns throw gracefully Some @anthropic-ai/claude-agent-sdk versions throw from the query generator when maxTurns is reached, instead of emitting a result message with subtype='error_max_turns'. The runner treated that as a non-retryable error and killed the whole periodic run on the first fixture that exceeded its turn cap. Added isMaxTurnsError() detector and a catch branch that synthesizes an AgentSdkResult from events captured before the throw, with exitReason='error_max_turns' and costUsd=0 (unknown from the thrown path). The metric function still runs against whatever assistant turns were collected, so the trial produces a usable number. Hoisted events/assistantTurns/toolCalls/assistantTextParts and the timing counters out of the inner try so the catch branch can read them. No behavior change on the success path or on rate-limit retry paths. * test(eval): bump maxTurns to 15 for claude-dedicated-tools-vs-bash The prompt 'list every TypeScript file under src/ and tell me what each exports' needs 1 turn for Glob + ~5 for Reads + 1 for summary. Default maxTurns=5 was not enough; prior run threw from the SDK on this fixture and tanked the whole periodic eval. Bumping to 15 gives headroom. The runner now also handles max-turns gracefully even if a future fixture underestimates, so this is belt and suspenders. * test(eval): Sonnet 4.6 variants of the 5 Opus-4.7 fixtures Same overlays, same prompts, same metrics, `model: 'claude-sonnet-4-6'`. Tests whether the overlays behave differently on a weaker Claude model where baseline behavior is shakier. Sonnet trials cost ~3-4x less than Opus so these 5 add ~$4.50 to a full run. Measurement result from the first paired run (100 trials total, ~$14.55): - Sonnet + effort-match shows real overlay benefit. With the overlay on, Sonnet takes 2.5 turns on a trivial `What's the version in config.json?` prompt. Without, it takes exactly 3.0 turns in all 10 trials. ~17% reduction, below the 20% pass threshold but the signal is clean: overlay-ON distribution [2,2,2,2,2,3,3,3,3,3] vs overlay-OFF [3,3,3,3,3,3,3,3,3,3]. - All other Sonnet dimensions flat (fanout, dedicated-tools, literal interpretation). Same as Opus on those axes. - Opus effort-match remains flat (2.60 vs 2.50, +4% slower with overlay). Implication: model-stratified. The overlay stack helps Sonnet on some axes where it does nothing on Opus. Wholesale removal would hurt Sonnet. Per-nudge per-model measurement is the right move going forward. * chore: bump version to 1.10.1.0 Updates VERSION, package.json, CHANGELOG header, and TODOS completion marker from 1.6.5.0 to 1.10.1.0. --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 18:42:58 -07:00

5 Commits