diff --git a/CHANGELOG.md b/CHANGELOG.md index 2dd4b64a8..034ab7bcf 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -1,5 +1,107 @@ # Changelog +## [1.58.1.0] - 2026-06-14 + +## **Local evals stop lying. Spawned `claude` test children run in a sealed clean room,** +## **and in Conductor every decision is a plain-text brief you answer with a letter.** + +Two things shipped here. First, the local E2E harness is now hermetic by default: +every spawned agent (claude -p, the real-PTY plan-mode runner, the Agent SDK +runner, plus the codex and gemini runners) gets an allowlist-scrubbed environment, +a fresh seeded `CLAUDE_CONFIG_DIR`, a temp `GSTACK_HOME`, and `--strict-mcp-config`. +Before this, a dev machine leaked the operator's `~/.claude` config, MCP servers +(gbrain, Conductor), skills, `~/.gstack` decision logs, and `CONDUCTOR_*`/`CLAUDECODE` +env into every child, so local eval results disagreed with CI for reasons that had +nothing to do with the code under test. Now local signal matches CI. Set +`EVALS_HERMETIC=0` to debug against real operator state. + +Second, in a Conductor session gstack no longer fights Conductor's flaky +AskUserQuestion tool. It detects the session and renders every decision as a prose +brief, a labeled question with a recommendation, per-option completeness scores, and +"reply with a letter," enforced by a PreToolUse hook that denies the tool and +redirects to prose. Destructive confirmations demand an explicit typed answer. + +Agents that launch long eval runs get `gstack-detach`: a SIGTERM-proof, idle-sleep-proof +wrapper (fresh session + `caffeinate`) with a machine-wide lock so concurrent +worktrees serialize instead of saturating the model API, run-scoped logs, and a +guaranteed `EXIT=` sentinel so a poller never mistakes silence for success. + +### The numbers that matter + +Measured against the gate eval suite on a contaminated dev box (gbrain MCP up, live +Conductor session, sibling worktrees). Reproduce: `bun test` (free unit + wiring +tripwire) and `EVALS=1 EVALS_TIER=gate bun test test/skill-e2e-hermetic-canary.test.ts`. + +| Metric | Before | After | Δ | +|--------|--------|-------|---| +| Spawned-child env | full operator `process.env` | allowlist-scrubbed | sealed | +| Runners hermeticized | 0 of 5 | 5 of 5 | +5 | +| Operator MCP servers visible to child | all (gbrain, Conductor) | 0 (`--strict-mcp-config`) | isolated | +| Config isolation proof | none | poisoned-operator sentinel canary | falsifiable | +| Long eval runs surviving a turn-boundary SIGTERM | no | yes (`gstack-detach`) | survives | + +The clean room is falsifiable, not asserted: a `hermetic-sentinel` gate canary +plants a poisoned operator config (a user `CLAUDE.md` + an MCP server) and fails if +the child can see any of it, and a free static tripwire fails CI if any runner +reverts to a raw `process.env` spread. + +### What this means for contributors + +Run evals locally and trust the result. You no longer have to push to CI to find +out whether a failure was real or just your machine bleeding context into the agent. +Three latent bugs the old harness hid surfaced the moment the suite ran clean and +are fixed: a coverage-judge that scored carved skills against half a document, an +ios-qa daemon test that collided on a shared pidfile under concurrency, and an +operational-learning fixture missing a lib it imports. Start a run with +`bun run eval:bg:gate`; flip `EVALS_HERMETIC=0` only when you deliberately want your +real `~/.claude` in the loop. + +### Itemized changes + +#### Added +- **Hermetic E2E environment** (`test/helpers/hermetic-env.ts`): allowlist env + builder (process basics, network/proxy vars, named `ANTHROPIC_*` auth, per-runner + `extraAllow`), pure `promotedEnv()` shared with `lib/conductor-env-shim.ts`, a + sync-memoized singleton temp dir (`/.claude` keeps the plan-file path + contract), a seeded `.claude.json` for non-interactive first run, and pid-aware GC + of crashed runs. Default-on; `EVALS_HERMETIC=0` restores the legacy env AND drops + `--strict-mcp-config`. +- **Two gate-tier isolation canaries** (`test/skill-e2e-hermetic-canary.test.ts`): + `hermetic-canary` asserts env redirect + scrub + zero MCP servers + nonzero + API-key cost from the Bash tool_result (not model prose); `hermetic-sentinel` + proves the child cannot see a planted poisoned operator config. +- **Static wiring tripwire** (`test/hermetic-wiring.test.ts`): free-tier invariants + that fail CI if any of the five runners drops `hermeticChildEnv()`, the gated + `--strict-mcp-config`, or leaks `process.env` through a callsite override. +- **`gstack-detach`** + `eval:bg` / `eval:bg:all` / `eval:bg:gate` / `eval:bg:periodic` + scripts: detached, SIGTERM-proof, `caffeinate`-wrapped eval runs with a machine-wide + lock, per-run logs under `~/.gstack-dev/eval-runs/`, a watchdog, and an `EXIT=` + sentinel. +- **Conductor prose AskUserQuestion**: when a Conductor session is detected, every + decision renders as a prose brief (labeled question, recommendation, per-option + completeness, reply-with-a-letter), enforced by a PreToolUse hook that denies the + tool and redirects. Auto-decide preferences still apply first; destructive + confirmations require an explicit typed answer. Installed for Conductor even in + non-interactive setup, with an upgrade migration for existing installs. + +#### Changed +- All five E2E runners (`session-runner`, `claude-pty-runner`, `agent-sdk-runner`, + `codex-session-runner`, `gemini-session-runner`) spawn children through + `hermeticChildEnv()`. The Agent SDK runner now receives a COMPLETE hermetic env + via `Options.env` (the old "never pass env: to the SDK" rule was partial-env + replacement; a complete env is safe). +- `hermetic-env.ts` is a global touchfile, so any change to it selects every E2E + + judge test. +- CLAUDE.md documents hermetic-by-default local evals and retires the stale SDK env + warning. + +#### Fixed +- The workflow LLM-judge now re-appends body-carved `sections/*.md` after the marker + slice, so carved skills (document-release) are judged on the full workflow the + agent executes instead of a half-document. +- ios-qa daemon scenarios use unique pidfiles, fixing `already_running` collisions + under `bun test --concurrent`. + ## [1.58.0.0] - 2026-06-12 ## **Your documents grow diagrams. Mermaid and excalidraw fences render as real pictures,** diff --git a/CLAUDE.md b/CLAUDE.md index 03384ae79..984844902 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -31,11 +31,26 @@ use Codex's own auth from `~/.codex/` config — no `OPENAI_API_KEY` env var nee `lib/conductor-env-shim.ts`) promotes `GSTACK_ANTHROPIC_API_KEY` / `GSTACK_OPENAI_API_KEY` to their canonical names inside gstack's TS binaries. Tests run through gstack entrypoints inherit this promotion automatically. -Don't echo the key value to stdout, logs, or shell history. When passing to a -test's Agent SDK, do NOT pass `env: {...}` to `runAgentSdkTest` — the SDK's -auth pipeline doesn't pick up the key the same way when env is supplied as an -object (confirmed failure mode). Mutate `process.env.ANTHROPIC_API_KEY` -ambiently before the call and restore in `finally`. +Don't echo the key value to stdout, logs, or shell history. The historical +"never pass `env:` to `runAgentSdkTest`" rule is retired: the failure was +partial-env replacement (the SDK's `Options.env` REPLACES the child's entire +environment, so an object without the key broke auth). The runner now always +passes a COMPLETE hermetic env with per-test `env:` merged last, so per-test +overrides are safe; ambient `process.env.ANTHROPIC_API_KEY` mutation also +still works (the env builder reads process.env at call time). + +**Hermetic local E2E (default).** Every E2E runner (claude -p, PTY, Agent +SDK, codex, gemini) spawns children through `test/helpers/hermetic-env.ts`: +allowlist-scrubbed env (operator `CONDUCTOR_*`, `CLAUDE_*`, `GSTACK_*`, +`MCP_*`, `GBRAIN_*`, and credentials like `GH_TOKEN` never reach children), +a fresh seeded `CLAUDE_CONFIG_DIR` (no operator `~/.claude` CLAUDE.md / +MCP servers / skills), a temp `GSTACK_HOME`, and `--strict-mcp-config`. +Local eval signal matches CI. Debug against real operator state with +`EVALS_HERMETIC=0` (restores the legacy env AND drops the strict-MCP flag). +Per-test `env:` overrides merge last, so deliberate contamination +(`CONDUCTOR_WORKSPACE_PATH`, per-test `GSTACK_HOME`) keeps working. Wiring +is pinned by `test/hermetic-wiring.test.ts` (static tripwire) and two +gate-tier canaries in `test/skill-e2e-hermetic-canary.test.ts`. E2E tests stream progress in real-time (tool-by-tool via `--output-format stream-json --verbose`). Results are persisted to `~/.gstack-dev/evals/` with auto-comparison @@ -828,6 +843,34 @@ them. Report progress at each check (which tests passed, which are running, any failures so far). The user wants to see the run complete, not a promise that you'll check later. +## Running evals as an agent: always detach (SIGTERM-proof) + +When **you (an agent/harness)** launch a long eval/benchmark run, run it through +`bin/gstack-detach` — NEVER as a plain backgrounded Bash task. A plain background +task lives in the harness's process group, so a SIGTERM ("polite quit") on a turn +boundary, a stopped Monitor, or an interruption kills the run mid-flight (observed: +`script "test:gate" was terminated by signal SIGTERM` ~40 min into a run). On macOS +the run can also die to idle-sleep. `gstack-detach` fixes both: a fresh session +(escapes the group SIGTERM) wrapped in `caffeinate -i` (blocks idle-sleep). + +- Use the `eval:bg*` scripts (`eval:bg`, `eval:bg:all`, `eval:bg:gate`, + `eval:bg:periodic`) — they wrap the eval command in `gstack-detach` with the + machine-wide `gstack-evals` lock (concurrent worktrees serialize instead of + saturating the shared model API), a per-tier watchdog, and a **run-scoped** log + under `~/.gstack-dev/eval-runs/` (no shared-`/tmp` collision). Each prints its + log path. Or call `gstack-detach [--lock NAME] [--timeout SECS] [--label LBL] -- + ` directly for any long agent job. Export `ANTHROPIC_API_KEY` first (never + pass keys in argv). +- Then **poll the printed logfile** with a death-aware watcher: break on the + guaranteed `### gstack-detach EXIT= ###` sentinel (success AND failure are + both marked, so silence is never mistaken for success). The detached run survives + even if your watcher gets reaped, so re-checking the log always works. +- Why the lock: a shared dev box with several Conductor worktrees will rate-limit + the model API if two eval suites run at once (15-way concurrency each), which + mass-times-out E2E tests. The lock makes the second run WAIT, not collide. +- Humans running `bun run test:evals` foreground in their own terminal don't need + this — Ctrl-C is intended there. Detachment is for agent-launched runs only. + ## E2E test fixtures: extract, don't copy **NEVER copy a full SKILL.md file into an E2E test fixture.** SKILL.md files are diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index 5a56ef5d3..b75d4a898 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -176,6 +176,18 @@ EVALS=1 bun test test/skill-e2e-*.test.ts - Saves full NDJSON transcripts and failure JSON for debugging - Tests live in `test/skill-e2e-*.test.ts` (split by category), runner logic in `test/helpers/session-runner.ts` +**Hermetic by default.** Every E2E runner (claude -p, the real-PTY plan-mode +runner, the Agent SDK runner, plus the codex and gemini runners) spawns its child +through `test/helpers/hermetic-env.ts`: an allowlist-scrubbed environment, a fresh +seeded `CLAUDE_CONFIG_DIR`, a temp `GSTACK_HOME`, and `--strict-mcp-config`. Your +operator `~/.claude` config, MCP servers (gbrain, Conductor), skills, `~/.gstack` +decision logs, and `CONDUCTOR_*` env never leak into the child, so local eval +signal matches CI instead of disagreeing for reasons unrelated to the code under +test. Set `EVALS_HERMETIC=0` to debug against your real operator state (this also +drops `--strict-mcp-config`). The wiring is pinned by `test/hermetic-wiring.test.ts` +(a free static tripwire) and two gate-tier isolation canaries in +`test/skill-e2e-hermetic-canary.test.ts`. + ### E2E observability When E2E tests run, they produce machine-readable artifacts in `~/.gstack-dev/`: @@ -198,6 +210,25 @@ bun run eval:compare # compare two runs — shows per-test deltas + Take bun run eval:summary # aggregate stats + per-test efficiency averages across runs ``` +**Detached runs for agents and long suites.** When an agent (or you, for a run +you don't want to babysit) launches a long eval, use the `eval:bg*` scripts. They +wrap the eval command in `bin/gstack-detach`: a fresh session that escapes a +turn-boundary SIGTERM, a `caffeinate` wrapper that blocks idle-sleep, a machine-wide +`gstack-evals` lock so concurrent worktrees serialize instead of saturating the +model API, a run-scoped log under `~/.gstack-dev/eval-runs/`, a per-tier watchdog, +and a guaranteed `### gstack-detach EXIT= ###` sentinel so a poller never +mistakes silence for success. + +```bash +bun run eval:bg # detached test:evals (diff-based) +bun run eval:bg:all # detached test:evals:all +bun run eval:bg:gate # detached gate-tier suite +bun run eval:bg:periodic # detached periodic-tier suite +``` + +Each prints its log path. Humans running `bun run test:evals` foreground in their +own terminal don't need this — Ctrl-C is intended there. + **Eval comparison commentary:** `eval:compare` generates natural-language Takeaway sections interpreting what changed between runs — flagging regressions, noting improvements, calling out efficiency gains (fewer turns, faster, cheaper), and producing an overall summary. This is driven by `generateCommentary()` in `eval-store.ts`. Artifacts are never cleaned up — they accumulate in `~/.gstack-dev/` for post-mortem debugging and trend analysis. diff --git a/SKILL.md b/SKILL.md index 8711ae7f3..90774950e 100644 --- a/SKILL.md +++ b/SKILL.md @@ -48,6 +48,13 @@ echo "REPO_MODE: $REPO_MODE" _SESSION_KIND=$(~/.claude/skills/gstack/bin/gstack-session-kind 2>/dev/null || echo "interactive") case "$_SESSION_KIND" in spawned|headless|interactive) ;; *) _SESSION_KIND="interactive" ;; esac echo "SESSION_KIND: $_SESSION_KIND" +# Conductor host: AskUserQuestion is unreliable here (native disabled, MCP +# variant flaky), so skills render decisions as prose instead of calling the +# tool. Gated on !headless so an eval/CI run INSIDE Conductor (GSTACK_HEADLESS) +# still BLOCKs rather than rendering prose to nobody. +if [ "$_SESSION_KIND" != "headless" ] && { [ -n "${CONDUCTOR_WORKSPACE_PATH:-}" ] || [ -n "${CONDUCTOR_PORT:-}" ]; }; then + echo "CONDUCTOR_SESSION: true" +fi _LAKE_SEEN=$([ -f ~/.gstack/.completeness-intro-seen ] && echo "yes" || echo "no") echo "LAKE_INTRO: $_LAKE_SEEN" _TEL=$(~/.claude/skills/gstack/bin/gstack-config get telemetry 2>/dev/null || true) diff --git a/VERSION b/VERSION index 3a62339b5..eb4d8b4b5 100644 --- a/VERSION +++ b/VERSION @@ -1 +1 @@ -1.58.0.0 +1.58.1.0 \ No newline at end of file diff --git a/autoplan/SKILL.md b/autoplan/SKILL.md index de7174b03..49db38ff9 100644 --- a/autoplan/SKILL.md +++ b/autoplan/SKILL.md @@ -57,6 +57,13 @@ echo "REPO_MODE: $REPO_MODE" _SESSION_KIND=$(~/.claude/skills/gstack/bin/gstack-session-kind 2>/dev/null || echo "interactive") case "$_SESSION_KIND" in spawned|headless|interactive) ;; *) _SESSION_KIND="interactive" ;; esac echo "SESSION_KIND: $_SESSION_KIND" +# Conductor host: AskUserQuestion is unreliable here (native disabled, MCP +# variant flaky), so skills render decisions as prose instead of calling the +# tool. Gated on !headless so an eval/CI run INSIDE Conductor (GSTACK_HEADLESS) +# still BLOCKs rather than rendering prose to nobody. +if [ "$_SESSION_KIND" != "headless" ] && { [ -n "${CONDUCTOR_WORKSPACE_PATH:-}" ] || [ -n "${CONDUCTOR_PORT:-}" ]; }; then + echo "CONDUCTOR_SESSION: true" +fi _LAKE_SEEN=$([ -f ~/.gstack/.completeness-intro-seen ] && echo "yes" || echo "no") echo "LAKE_INTRO: $_LAKE_SEEN" _TEL=$(~/.claude/skills/gstack/bin/gstack-config get telemetry 2>/dev/null || true) @@ -306,7 +313,9 @@ AI orchestrator (e.g., OpenClaw). In spawned sessions: "AskUserQuestion" can resolve to two tools at runtime: the **host MCP variant** (e.g. `mcp__conductor__AskUserQuestion` — appears in your tool list when the host registers it) or the **native** Claude Code tool. -**Rule:** if any `mcp__*__AskUserQuestion` variant is in your tool list, prefer it. Hosts may disable native AUQ via `--disallowedTools AskUserQuestion` (Conductor does, by default) and route through their MCP variant; calling native there silently fails. Same questions/options shape; same decision-brief format applies. +**Conductor rule (read before the MCP rule):** if `CONDUCTOR_SESSION: true` was echoed by the preamble, do NOT call AskUserQuestion at all — neither native nor any `mcp__*__AskUserQuestion` variant. Render EVERY decision brief as the **prose form** below and STOP. This is proactive, not a reaction to a failure: Conductor disables native AUQ and its MCP variant is flaky (it returns `[Tool result missing due to internal error]`), so prose is the reliable path. **Auto-decide preferences still apply first:** if a `[plan-tune auto-decide]