mirror of https://github.com/garrytan/gstack.git
2 Commits
| Author | SHA1 | Message | Date |
|---|---|---|---|
|
|
c7ae63201a
|
v1.58.1.0 feat: hermetic local E2E + Conductor prose AskUserQuestion (#2004)
* feat: add shared call-time isConductor() helper
Single source of truth for Conductor host detection in TS consumers
(CONDUCTOR_WORKSPACE_PATH / CONDUCTOR_PORT). Reads the passed env at
call time, not a module-load snapshot, so unit tests can pin the env
inline without Bun --preload (esm-hoist-breaks-env-pin-bootstrap).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* test: harden question-preference-hook harness against ambient Conductor env
runHook copied all of process.env into the hook subprocess, so running the
suite inside Conductor (CONDUCTOR_WORKSPACE_PATH/PORT set) would leak those
markers. Strip them so the existing cases deterministically characterize
NON-Conductor behavior before the Conductor branch lands. Baseline: 15 pass.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat: PreToolUse hook denies AskUserQuestion in Conductor, redirects to prose
Conductor disables native AskUserQuestion and routes through a flaky MCP
variant that returns '[Tool result missing due to internal error]'. The
hook now denies any AUQ call in a Conductor session and instructs the model
to render a prose decision brief instead (transport avoidance, not preference
enforcement) — firing for one-way doors too, with a typed-confirmation
requirement for destructive paths.
Precedence: never-ask auto-decide still wins (user already settled those);
Conductor prose is the fallback for everything else; non-Conductor behavior
is byte-for-byte unchanged. Restructured the per-question loop to compute
eligibility without early-returning so the Conductor branch can run as the
fallback while preserving memoryContext on every exit.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat: Conductor renders AskUserQuestion decisions as prose by default
In Conductor, native AskUserQuestion is disabled and the MCP variant is
flaky, so skills now render every decision as a plain-text prose brief the
user answers by typing a letter — proactively, not as a failure reaction.
- Preamble emits CONDUCTOR_SESSION, gated on != headless so eval/CI inside
Conductor still BLOCKs instead of rendering prose to nobody.
- AskUserQuestion Format gains a Conductor-default-prose rule (auto-decide
preferences still apply first; prose decisions log via gstack-question-log
since PostToolUse never fires), a one-way/destructive typed-confirmation
rule, and a typed-reply continuation protocol for split chains.
- Regenerated all SKILL.md + ship golden fixtures; bumped affected carve
skeleton caps to absorb the always-loaded additions.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat: deploy the Conductor AskUserQuestion hook (setup + upgrade migration)
The PreToolUse hook only delivers its Conductor-prose guarantee if it's
installed, but setup skips hook registration in non-interactive (conductor/CI)
setups. Two fixes so layer 3 actually deploys:
- setup: treat a Conductor workspace as an implicit opt-in for the PreToolUse
hook on the silent fall-through (never overriding an explicit opt-out).
- migration v1.58.0.0: re-register the hook for existing Conductor installs on
/gstack-upgrade, idempotent and respecting plan_tune_hooks=no.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* test: E2E for Conductor prose + fix auto-decide-preserved GSTACK_HOME bug
- New skill-e2e-conductor-prose (periodic): Conductor env + plan-eng-review
surfaces a prose decision brief, not a silent skip. Header documents this is
end-to-end behavior coverage; the deterministic Conductor guard is the
question-preference-hook unit test (the PTY harness can't register the MCP
variant — Codex #10).
- Fix the pre-existing bug in auto-decide-preserved: it seeded the never-ask
preference under GSTACK_HOME=tmpHome but never passed GSTACK_HOME into the
PTY run, so the spawned claude read the real ~/.gstack and the preference
was inert (Codex #9). Now passes GSTACK_HOME + CONDUCTOR_WORKSPACE_PATH to
prove auto-decide still wins over the Conductor prose redirect.
- Register both in touchfiles (periodic tier).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* v1.58.0.0 feat: Conductor renders AskUserQuestion decisions as prose
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* test: strip ambient Conductor env in memory-cache-injection hook harness
Same dev-in-Conductor leak fixed for question-preference-hook: this suite's
runHook copies process.env, so running it inside Conductor flipped the
defer-path memoryContext assertions into the [conductor] prose deny. Strip
CONDUCTOR_* so the cases characterize non-Conductor behavior. (CI is headless,
so this only bit local Conductor runs.)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat: gstack-detach — run agent eval/bench jobs in their own session
Long agent-run jobs (30-60 min evals, benchmarks) die when the harness sends
SIGTERM to a background task's process group on turn boundaries / monitor
stops / interruptions (observed: 'script test:gate terminated by signal
SIGTERM'). gstack-detach runs the command in a fresh session (python3
os.setsid, or setsid on Linux, nohup fallback) so a group SIGTERM can't reach
it, and wraps it in caffeinate -i on macOS so idle-sleep can't kill it either.
Returns immediately; caller polls the logfile. Secrets stay in env, never argv.
The guard test pins the contract: the command runs in a different process
group than the caller and outlives the launching shell.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat: eval:bg* scripts — detached eval runs for agents
Agent-facing convenience scripts that launch the eval suites through
gstack-detach so a harness SIGTERM can't kill a long run. eval:bg (diff-based),
eval:bg:all, eval:bg:gate, eval:bg:periodic — each returns immediately and
streams to /tmp/gstack-evals.log for polling. The plain test:evals / test:e2e
scripts stay foreground for humans.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs: CLAUDE.md — agents must run long evals via gstack-detach
Codifies the detached-execution default: agent-launched eval/benchmark runs go
through bin/gstack-detach (or the eval:bg* scripts) so a harness SIGTERM or
macOS idle-sleep can't kill a 30-60 min run, then poll the log with a
death-aware watcher. Humans keep foreground scripts.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat: harden gstack-detach against all four eval-infra killers
The basic bash detach fixed SIGTERM but a real run on a shared dev box hit
three more killers: cross-worktree API saturation (15-way concurrency x a
sibling worktree mass-timed-out the suite), a silent hang (periodic bun died
with no exit marker), and shared-/tmp log contamination (a concurrent
worktree's agent output bled into the log). Rewrite as a portable python3 tool
that bakes in all four fixes:
- fork + setsid: SIGTERM-proof (own session, survives harness polite-quit)
- caffeinate -i on macOS: no idle-sleep death
- --lock NAME (fcntl, machine-wide): concurrent worktrees SERIALIZE instead of
saturating the shared model API
- run-scoped default log (~/.gstack-dev/eval-runs/<label>-<slug>-<branch>-<ts>-<pid>):
no cross-worktree collision/contamination
- --timeout watchdog + a guaranteed '### gstack-detach EXIT=<code> ###' sentinel
on every terminal path: no silent hang, finished-vs-died always detectable
Guard test pins all four: detached pgid differs + outlives launcher, run-scoped
log path, watchdog EXIT=timeout, and lock serialization (second run WAITS).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat: eval:bg* use run-scoped logs + machine lock + watchdog
Drop the shared /tmp/gstack-evals.log path (the cross-worktree collision that
contaminated a live run) for gstack-detach's run-scoped default, and add the
machine-wide gstack-evals lock (concurrent worktrees serialize, no API
saturation) plus per-tier watchdog timeouts (60/90/120 min). Each eval:bg*
prints its run-scoped log path to poll.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs: wire detached-eval guidance into /ship + correct CLAUDE.md flags
- /ship eval step (sections/tests.md): long eval suites launch via gstack-detach
(own session, machine lock, EXIT sentinel) so a turn boundary can't kill a
30+ min run mid-ship — the exact failure observed during this branch's ship.
- CLAUDE.md: correct the now-stale /tmp reference; document the --lock (serialize
worktrees, no API saturation), --timeout watchdog, run-scoped log, and the
guaranteed EXIT sentinel the poller breaks on.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* refactor: extract pure promotedEnv() from conductor-env-shim
Single source of truth for GSTACK_* key promotion semantics. The ambient
promoteConductorEnv() becomes a wrapper; behavior-preserving. Needed by the
hermetic env builder which must not mutate process.env.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat: hermetic child-env builder for E2E runners
Allowlist scrub (basics/network/named-auth kept; CONDUCTOR_*, CLAUDE_*,
GSTACK_*, MCP_*, GBRAIN_*, operator credentials dropped), per-runner
extraAllow, overrides merge last, EVALS_HERMETIC=0 byte-identical escape
hatch read at call time (ESM-hoist safe). Sync memoized singleton temp dirs
(<runRoot>/.claude keeps the extractPlanFilePath contract), seeded
.claude.json for non-interactive first run, pid-aware GC of crashed runs.
19 free unit tests.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat: session-runner spawns hermetic children + isolation canaries
claude -p children now get the allowlist-scrubbed env and a gated
--strict-mcp-config (EVALS_HERMETIC=0 restores operator env AND args).
Two gate-tier canaries make the clean room falsifiable: hermetic-canary
asserts env redirect + scrub + zero MCP servers + nonzero API-key cost
from the Bash tool_result (never model prose); hermetic-sentinel plants a
poisoned operator config (user CLAUDE.md + MCP server) and proves the
child cannot see it. Empirically verified on claude 2.1.175: print mode
needs no seed config (the seed serves the PTY path); the child CLI sets
CLAUDECODE for its own tools, so that scrub is pinned in unit tests, not
E2E. hermetic-env.ts joins GLOBAL_TOUCHFILES.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat: PTY runner spawns hermetic claude sessions
launchClaudePty children get the allowlist-scrubbed env, a gated
--strict-mcp-config, and the session exposes hermeticConfigDir for
forensics (hermetic plan files live under <dir>/plans/ and still match
extractPlanFilePath via the /.claude dir-name contract). Seeded trust
state covers repo-cwd sessions; the 15s trust-watcher stays as fallback.
Verified foreground via the plan-mode-no-op gate test.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat: codex/gemini runners spawn hermetic children
Same allowlist scrub as the claude runners, with each provider's auth
surface re-admitted via extraAllow (codex: OPENAI_API_KEY/CODEX_* plus
its tempHome .codex copy; gemini: GEMINI_*/GOOGLE_* with real HOME for
~/.gemini auth). The gemini spawn previously inherited the full operator
env with no env property at all.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat: agent-sdk-runner spawns hermetic children via complete Options.env
The historical 'env: breaks SDK auth' failure was partial-env replacement:
Options.env replaces the child's entire environment, so objects lacking
ANTHROPIC_API_KEY killed auth. Passing the complete hermetic env (key +
PATH + redirected CLAUDE_CONFIG_DIR/GSTACK_HOME) works — validated live
via query() with a Bash tool call (success, real cost, Conductor vars
scrubbed). Per-test opts.env merges last; ambient key mutation still
works because the builder reads process.env at call time.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* test: static tripwire pins hermetic wiring in all five runners
Free-tier invariants: every runner builds child env via hermeticChildEnv,
no raw ...process.env spread at any spawn site, --strict-mcp-config gated
on isHermeticEnabled in both claude runners, and no test callsite passes
the operator env into a runner's override parameter (scoped to runner
calls — unit tests spawning gstack bin scripts directly are exempt).
Mirrors the terminal-agent-pid-identity / server-embedder-terminal-port
tripwire idiom.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* test: refresh codex/factory ship goldens with detached-eval block
|
|
|
|
a81be53621
|
v1.10.0.0: fix AskUserQuestion cadence + Pros/Cons format upgrade (#1178)
* fix(preamble): reorder AskUserQuestion Format above model overlay + rewrite Opus 4.7 pacing directive
Root cause of plan-review regression (v1.6.4.0): model overlays rendered
ABOVE the pacing rule in every SKILL.md, so Opus 4.7 read "Batch your
questions" first and absorbed it as the ambient default. The overlay's
claimed subordination ("skill wins on pacing, always") didn't stick —
literal-interpretation mode reads physical order, not claimed hierarchy.
Part 1 of 4 (plan: ~/.claude/plans/system-instruction-you-are-working-polymorphic-twilight.md):
scripts/resolvers/preamble.ts
- Move generateAskUserFormat above generateModelOverlay in section array
- Comment explains why — prevents future refactors from silently reverting
model-overlays/opus-4-7.md
- Replace "Batch your questions" block with "Pace questions to the skill"
- New wording makes one-question-per-turn the default when the skill
contains STOP directives; batching becomes the explicit exception
Regenerated 30 SKILL.md files via bun run gen:skill-docs.
Verified:
- With --model opus-4-7: Format renders at line 359, Model-Specific
Patch at 373, "Pace questions" at 419 (Format comes first, overlay
second, pacing directive intact).
- bun test passes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(plan-reviews): tighten STOP/escape-hatch directives across 4 templates
Part 2 of 4 (plan: ~/.claude/plans/system-instruction-you-are-working-polymorphic-twilight.md).
Codex caught that v1.6.3.0's reasoning collapsed on Opus 4.7: the old
escape-hatch wording ("If no issues or fix is obvious, state what
you'll do and move on — don't waste a question") let the literal
interpreter classify every finding as having an "obvious fix" and skip
AskUserQuestion entirely. Reviews became reports.
Per-template hardening (16 sites total, verified by rg):
plan-ceo-review/SKILL.md.tmpl (13 sites):
- 12 inline STOP directives: replace the full escape-hatch clause with
"zero findings → say so and proceed; findings → MUST call AskUserQuestion
as a tool_use, including for obvious fixes."
- 1 Escape hatch bullet in CRITICAL RULE section: tightened.
plan-eng-review, plan-design-review, plan-devex-review (1 site each):
- Each template's Escape hatch bullet tightened to match the new CEO wording,
adapted for each review's domain (issue/gap, decision/design/DX alternatives).
After regeneration: rg "don't waste a question" returns 0 across all
*SKILL.md.tmpl and *SKILL.md files. "zero findings, state" wording
present 16 times (matches prior count of escape-hatch sites).
bun test passes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(preamble): upgrade AskUserQuestion format to Pros/Cons decision brief
Part 4 of 4 (plan: ~/.claude/plans/system-instruction-you-are-working-polymorphic-twilight.md).
Every AskUserQuestion now renders as a decision brief, not a bullet list:
D-numbered header, ELI10, Stakes-if-we-pick-wrong, Recommendation, Pros/Cons
with ✅/❌ markers per option, closing Net: tradeoff synthesis.
scripts/resolvers/preamble/generate-ask-user-format.ts
- Full rewrite. Preserves prior rules (Re-ground, ELI10, Recommend,
Completeness, Options) and adds:
- D-numbering per skill invocation (model-level, not runtime state)
- Stakes line (pain avoided / capability unlocked / consequence named)
- Pros/Cons block with min 2 ✅ + 1 ❌ per option, min 40 chars/bullet
- Hard-stop escape: "✅ No cons — this is a hard-stop choice" for
genuine one-sided choices (destructive-action confirmations)
- Neutral-posture handling (CT1-compliant): (recommended) label
STAYS on default option to preserve AUTO_DECIDE contract; neutrality
expressed as prose in Recommendation line only
- Net line closes the decision with a one-sentence tradeoff frame
- Rule 11: tool_use mandate (prose "Question:" blocks don't count)
- Self-check list before emitting
test/skill-validation.test.ts
- Update format assertions to check for new Pros/Cons tokens
(Pros / cons:, Recommendation: <choice>, Net:, ELI10, Stakes if we
pick wrong:, ✅, ❌) across all tier-2+ skills
- Old "RECOMMENDATION: Choose" expectation removed (the new format uses
mixed-case "Recommendation:" with no literal "Choose")
test/skill-e2e-plan-format.test.ts
- Add v1.7.0.0 format token regexes (PROS_CONS_HEADER_RE, PRO_BULLET_RE,
CON_BULLET_RE, NET_LINE_RE, D_NUMBER_RE, STAKES_RE)
- Existing RECOMMENDATION_RE loosened to accept mixed-case "Recommendation:"
(canonical v1.7.0.0 form) alongside all-caps (legacy). Tests are
additive — the strict new-format gate is the upcoming cadence eval.
Regenerated 30 SKILL.md files via bun run gen:skill-docs.
Verified:
- bun test: 319 pass (1 pre-existing security-bench fixture oversize
failure on main, unrelated — confirmed via git stash test on main HEAD)
- New format tokens render in all tier-2+ skills (plan-ceo-review,
plan-eng-review, ship, office-hours verified)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* test: gate-tier units + periodic Pros/Cons evals for AskUserQuestion format
Part 3 of 4 (plan: ~/.claude/plans/system-instruction-you-are-working-polymorphic-twilight.md).
Gate-tier (E1, free, runs on every `bun test`):
test/preamble-compose.test.ts — pins the composition order
Asserts AskUserQuestion Format section renders BEFORE Model-Specific
Behavioral Patch in tier-≥2 preamble output. Covers claude default,
opus-4-7 overlay, tier 2/3, and codex host. Catches any future edit
to scripts/resolvers/preamble.ts that silently reverts the order.
test/resolver-ask-user-format.test.ts — pins the Pros/Cons contract
14 assertions against generateAskUserFormat output: D<N>, ELI10,
Stakes if we pick wrong:, Recommendation: <choice>, Pros / cons:,
✅/❌ markers, min 2 pros + 1 con rules, hard-stop escape exact
phrase, neutral-posture CT1 rule ((recommended) label preserved for
AUTO_DECIDE), Completeness coverage-vs-kind, tool_use mandate
(rule 11), self-check list, D-numbering model-level caveat.
test/model-overlay-opus-4-7.test.ts — pins the pacing directive
Asserts raw overlay file + resolved overlay output contain "Pace
questions to the skill" and NOT "Batch your questions". Verifies
INHERIT:claude chain still works (Todo-list, subordination wrapper),
Fan out / Effort-match / Literal interpretation nudges preserved.
Also asserts claude base overlay does NOT carry the Opus-specific
pacing directive (no cross-contamination).
Periodic-tier (E2, Opus-dependent, ~$1-2/run):
test/skill-e2e-plan-prosons.test.ts — 4 cases extending v1.6.3.0 harness
1. Format positive — every token present when plan has real tradeoff
2. Hard-stop NEGATIVE — plan with genuine tradeoff must NOT dodge to
"No cons — hard-stop choice" escape
3. Neutral-posture NEGATIVE — plan where one option dominates must emit
(recommended) label + "because <reason>", must NOT dodge to
"taste call" / "no preference"
4. Hard-stop POSITIVE — destructive-action plan may legitimately use
the hard-stop escape
test/helpers/touchfiles.ts — entries for all new eval cases
Dependencies: overlay, preamble.ts, generate-ask-user-format.ts, and
the 4 plan-review templates. Diff-based selection triggers the evals
whenever those files change. Also added entries for 7 expanded-coverage
cases (ship, office-hours, investigate, qa, review, design-review,
document-release) — test cases will land in follow-up PRs per skill.
Follow-ups noted in test file header:
- True multi-turn cadence eval (3 findings → 3 distinct asks) — current
harness captures one $OUT_FILE per session; multi-turn capture needs
new harness support.
- Expanded-coverage test cases for the 7 non-plan-review skills.
Verified:
- bun test: 349 pass (30 new + 319 baseline), 1 pre-existing security-bench
oversize failure on main (unrelated, unchanged).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* test: regenerate golden fixtures + update ELI10 phrase check for v1.7.0.0
Pros/Cons format rewrite (
|