mirror of https://github.com/garrytan/gstack.git
2 Commits
| Author | SHA1 | Message | Date |
|---|---|---|---|
|
|
c7ae63201a
|
v1.58.1.0 feat: hermetic local E2E + Conductor prose AskUserQuestion (#2004)
* feat: add shared call-time isConductor() helper
Single source of truth for Conductor host detection in TS consumers
(CONDUCTOR_WORKSPACE_PATH / CONDUCTOR_PORT). Reads the passed env at
call time, not a module-load snapshot, so unit tests can pin the env
inline without Bun --preload (esm-hoist-breaks-env-pin-bootstrap).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* test: harden question-preference-hook harness against ambient Conductor env
runHook copied all of process.env into the hook subprocess, so running the
suite inside Conductor (CONDUCTOR_WORKSPACE_PATH/PORT set) would leak those
markers. Strip them so the existing cases deterministically characterize
NON-Conductor behavior before the Conductor branch lands. Baseline: 15 pass.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat: PreToolUse hook denies AskUserQuestion in Conductor, redirects to prose
Conductor disables native AskUserQuestion and routes through a flaky MCP
variant that returns '[Tool result missing due to internal error]'. The
hook now denies any AUQ call in a Conductor session and instructs the model
to render a prose decision brief instead (transport avoidance, not preference
enforcement) — firing for one-way doors too, with a typed-confirmation
requirement for destructive paths.
Precedence: never-ask auto-decide still wins (user already settled those);
Conductor prose is the fallback for everything else; non-Conductor behavior
is byte-for-byte unchanged. Restructured the per-question loop to compute
eligibility without early-returning so the Conductor branch can run as the
fallback while preserving memoryContext on every exit.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat: Conductor renders AskUserQuestion decisions as prose by default
In Conductor, native AskUserQuestion is disabled and the MCP variant is
flaky, so skills now render every decision as a plain-text prose brief the
user answers by typing a letter — proactively, not as a failure reaction.
- Preamble emits CONDUCTOR_SESSION, gated on != headless so eval/CI inside
Conductor still BLOCKs instead of rendering prose to nobody.
- AskUserQuestion Format gains a Conductor-default-prose rule (auto-decide
preferences still apply first; prose decisions log via gstack-question-log
since PostToolUse never fires), a one-way/destructive typed-confirmation
rule, and a typed-reply continuation protocol for split chains.
- Regenerated all SKILL.md + ship golden fixtures; bumped affected carve
skeleton caps to absorb the always-loaded additions.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat: deploy the Conductor AskUserQuestion hook (setup + upgrade migration)
The PreToolUse hook only delivers its Conductor-prose guarantee if it's
installed, but setup skips hook registration in non-interactive (conductor/CI)
setups. Two fixes so layer 3 actually deploys:
- setup: treat a Conductor workspace as an implicit opt-in for the PreToolUse
hook on the silent fall-through (never overriding an explicit opt-out).
- migration v1.58.0.0: re-register the hook for existing Conductor installs on
/gstack-upgrade, idempotent and respecting plan_tune_hooks=no.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* test: E2E for Conductor prose + fix auto-decide-preserved GSTACK_HOME bug
- New skill-e2e-conductor-prose (periodic): Conductor env + plan-eng-review
surfaces a prose decision brief, not a silent skip. Header documents this is
end-to-end behavior coverage; the deterministic Conductor guard is the
question-preference-hook unit test (the PTY harness can't register the MCP
variant — Codex #10).
- Fix the pre-existing bug in auto-decide-preserved: it seeded the never-ask
preference under GSTACK_HOME=tmpHome but never passed GSTACK_HOME into the
PTY run, so the spawned claude read the real ~/.gstack and the preference
was inert (Codex #9). Now passes GSTACK_HOME + CONDUCTOR_WORKSPACE_PATH to
prove auto-decide still wins over the Conductor prose redirect.
- Register both in touchfiles (periodic tier).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* v1.58.0.0 feat: Conductor renders AskUserQuestion decisions as prose
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* test: strip ambient Conductor env in memory-cache-injection hook harness
Same dev-in-Conductor leak fixed for question-preference-hook: this suite's
runHook copies process.env, so running it inside Conductor flipped the
defer-path memoryContext assertions into the [conductor] prose deny. Strip
CONDUCTOR_* so the cases characterize non-Conductor behavior. (CI is headless,
so this only bit local Conductor runs.)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat: gstack-detach — run agent eval/bench jobs in their own session
Long agent-run jobs (30-60 min evals, benchmarks) die when the harness sends
SIGTERM to a background task's process group on turn boundaries / monitor
stops / interruptions (observed: 'script test:gate terminated by signal
SIGTERM'). gstack-detach runs the command in a fresh session (python3
os.setsid, or setsid on Linux, nohup fallback) so a group SIGTERM can't reach
it, and wraps it in caffeinate -i on macOS so idle-sleep can't kill it either.
Returns immediately; caller polls the logfile. Secrets stay in env, never argv.
The guard test pins the contract: the command runs in a different process
group than the caller and outlives the launching shell.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat: eval:bg* scripts — detached eval runs for agents
Agent-facing convenience scripts that launch the eval suites through
gstack-detach so a harness SIGTERM can't kill a long run. eval:bg (diff-based),
eval:bg:all, eval:bg:gate, eval:bg:periodic — each returns immediately and
streams to /tmp/gstack-evals.log for polling. The plain test:evals / test:e2e
scripts stay foreground for humans.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs: CLAUDE.md — agents must run long evals via gstack-detach
Codifies the detached-execution default: agent-launched eval/benchmark runs go
through bin/gstack-detach (or the eval:bg* scripts) so a harness SIGTERM or
macOS idle-sleep can't kill a 30-60 min run, then poll the log with a
death-aware watcher. Humans keep foreground scripts.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat: harden gstack-detach against all four eval-infra killers
The basic bash detach fixed SIGTERM but a real run on a shared dev box hit
three more killers: cross-worktree API saturation (15-way concurrency x a
sibling worktree mass-timed-out the suite), a silent hang (periodic bun died
with no exit marker), and shared-/tmp log contamination (a concurrent
worktree's agent output bled into the log). Rewrite as a portable python3 tool
that bakes in all four fixes:
- fork + setsid: SIGTERM-proof (own session, survives harness polite-quit)
- caffeinate -i on macOS: no idle-sleep death
- --lock NAME (fcntl, machine-wide): concurrent worktrees SERIALIZE instead of
saturating the shared model API
- run-scoped default log (~/.gstack-dev/eval-runs/<label>-<slug>-<branch>-<ts>-<pid>):
no cross-worktree collision/contamination
- --timeout watchdog + a guaranteed '### gstack-detach EXIT=<code> ###' sentinel
on every terminal path: no silent hang, finished-vs-died always detectable
Guard test pins all four: detached pgid differs + outlives launcher, run-scoped
log path, watchdog EXIT=timeout, and lock serialization (second run WAITS).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat: eval:bg* use run-scoped logs + machine lock + watchdog
Drop the shared /tmp/gstack-evals.log path (the cross-worktree collision that
contaminated a live run) for gstack-detach's run-scoped default, and add the
machine-wide gstack-evals lock (concurrent worktrees serialize, no API
saturation) plus per-tier watchdog timeouts (60/90/120 min). Each eval:bg*
prints its run-scoped log path to poll.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs: wire detached-eval guidance into /ship + correct CLAUDE.md flags
- /ship eval step (sections/tests.md): long eval suites launch via gstack-detach
(own session, machine lock, EXIT sentinel) so a turn boundary can't kill a
30+ min run mid-ship — the exact failure observed during this branch's ship.
- CLAUDE.md: correct the now-stale /tmp reference; document the --lock (serialize
worktrees, no API saturation), --timeout watchdog, run-scoped log, and the
guaranteed EXIT sentinel the poller breaks on.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* refactor: extract pure promotedEnv() from conductor-env-shim
Single source of truth for GSTACK_* key promotion semantics. The ambient
promoteConductorEnv() becomes a wrapper; behavior-preserving. Needed by the
hermetic env builder which must not mutate process.env.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat: hermetic child-env builder for E2E runners
Allowlist scrub (basics/network/named-auth kept; CONDUCTOR_*, CLAUDE_*,
GSTACK_*, MCP_*, GBRAIN_*, operator credentials dropped), per-runner
extraAllow, overrides merge last, EVALS_HERMETIC=0 byte-identical escape
hatch read at call time (ESM-hoist safe). Sync memoized singleton temp dirs
(<runRoot>/.claude keeps the extractPlanFilePath contract), seeded
.claude.json for non-interactive first run, pid-aware GC of crashed runs.
19 free unit tests.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat: session-runner spawns hermetic children + isolation canaries
claude -p children now get the allowlist-scrubbed env and a gated
--strict-mcp-config (EVALS_HERMETIC=0 restores operator env AND args).
Two gate-tier canaries make the clean room falsifiable: hermetic-canary
asserts env redirect + scrub + zero MCP servers + nonzero API-key cost
from the Bash tool_result (never model prose); hermetic-sentinel plants a
poisoned operator config (user CLAUDE.md + MCP server) and proves the
child cannot see it. Empirically verified on claude 2.1.175: print mode
needs no seed config (the seed serves the PTY path); the child CLI sets
CLAUDECODE for its own tools, so that scrub is pinned in unit tests, not
E2E. hermetic-env.ts joins GLOBAL_TOUCHFILES.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat: PTY runner spawns hermetic claude sessions
launchClaudePty children get the allowlist-scrubbed env, a gated
--strict-mcp-config, and the session exposes hermeticConfigDir for
forensics (hermetic plan files live under <dir>/plans/ and still match
extractPlanFilePath via the /.claude dir-name contract). Seeded trust
state covers repo-cwd sessions; the 15s trust-watcher stays as fallback.
Verified foreground via the plan-mode-no-op gate test.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat: codex/gemini runners spawn hermetic children
Same allowlist scrub as the claude runners, with each provider's auth
surface re-admitted via extraAllow (codex: OPENAI_API_KEY/CODEX_* plus
its tempHome .codex copy; gemini: GEMINI_*/GOOGLE_* with real HOME for
~/.gemini auth). The gemini spawn previously inherited the full operator
env with no env property at all.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat: agent-sdk-runner spawns hermetic children via complete Options.env
The historical 'env: breaks SDK auth' failure was partial-env replacement:
Options.env replaces the child's entire environment, so objects lacking
ANTHROPIC_API_KEY killed auth. Passing the complete hermetic env (key +
PATH + redirected CLAUDE_CONFIG_DIR/GSTACK_HOME) works — validated live
via query() with a Bash tool call (success, real cost, Conductor vars
scrubbed). Per-test opts.env merges last; ambient key mutation still
works because the builder reads process.env at call time.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* test: static tripwire pins hermetic wiring in all five runners
Free-tier invariants: every runner builds child env via hermeticChildEnv,
no raw ...process.env spread at any spawn site, --strict-mcp-config gated
on isHermeticEnabled in both claude runners, and no test callsite passes
the operator env into a runner's override parameter (scoped to runner
calls — unit tests spawning gstack bin scripts directly are exempt).
Mirrors the terminal-agent-pid-identity / server-embedder-terminal-port
tripwire idiom.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* test: refresh codex/factory ship goldens with detached-eval block
|
|
|
|
46c1fae7f1
|
v1.54.0.0 feat: carve /ship into skeleton + on-demand sections (-59% always-loaded) (#1806)
* feat(test): transcript-section-logger + ship-action fingerprint (T10) Pure-analysis module over a SkillTestResult/NDJSON transcript: - extractSectionReads(): which sections/*.md a run opened (post-carve check) - extractShipActions(): observable action fingerprint (merge/test/bump/ changelog/commit/push/pr) that works on the MONOLITH too, so a baseline captured before the carve can detect a sectioned-ship regression - baseline read/write + compareShipActions() for baseline-first dogf(T10) Baseline-first answers the Codex outside-voice critique that a logger in the same PR as the carve is post-failure telemetry without a pre-carve reference. 11 unit tests, all green. Paid monolith baseline capture runs separately. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * feat(pipeline): section discovery + generation machinery (T9) - discover-skills.ts: discoverSectionTemplates() scans <skill>/sections/*.md.tmpl - gen-skill-docs.ts: extract resolvePlaceholders + applyHostRewrites + buildContext as shared helpers (processTemplate and the new processSectionTemplate both call them, so a sanitization/rewrite fix can't miss sections) [C1] - processSectionTemplate: body-fragment generation (no frontmatter/catalog/voice), parent-skill TemplateContext (skillName pinned to parent, not 'sections', so appliesTo gating + tier behave identically), per-host output routing - --host all now fails the build on ANY host failure, not just claude, so a stale external-host output can't slip the freshness gate [Codex outside-voice #9] Inert until a skill is carved (no sections/ dirs exist yet). Refactor is output-neutral: gen:skill-docs --dry-run --host all reports 0 STALE. 5 discovery unit tests + 389 gen-skill-docs tests green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * feat(setup): install sections/ for cherry-pick targets (claude + kiro) (T9) Two install targets cherry-pick SKILL.md and would leave a carved skill's sections/ behind, 404ing a runtime 'Read sections/<name>.md': - link_claude_skill_dirs: link the sections/ subdir via _link_or_copy (windows gets a fresh copy on every ./setup) - kiro per-skill loop: sed-rewrite + copy each sections/* so paths resolve under ~/.kiro, not ~/.codex/~/.claude codex/factory/opencode link the whole generated dir, so sections ride free. Addresses Codex outside-voice #4/#6 (runtime pathing landmine). Inert until a skill is carved. Static-tripwire test + windows-fallback invariant green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * feat(ship): gstack-version-bump CLI — tested idempotency classify + write (T9) Hybrid CLI extraction (CM1): the deterministic core of ship Step 12 becomes a tested CLI instead of bash prose the agent re-derives each run. - classify: FRESH/ALREADY_BUMPED/DRIFT_STALE_PKG/DRIFT_UNEXPECTED from VERSION vs origin/<base>:VERSION vs package.json.version (pure reader) - write: validated dual-write to VERSION + package.json (FRESH bump) - repair: DRIFT_STALE_PKG sync, no re-bump Bump-LEVEL choice + queue collision stay agent judgment; slot pick stays bin/gstack-next-version. This removes the re-bump-a-shipped-branch footgun from skippable prose into code that can't be skipped or misread. 15 tests (exhaustive state matrix + write/repair fs + real-git classify). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test(parity): sectioned-skill parity capability — guards the carve (T9) Carved skills (skeleton + sections/*.md) need parity checks that see relocated content, or moving a phrase into a section reads as 'lost': - readSkillForParity(): union skeleton + all sections/*.md - checkSkillParity sectioned mode: content checks against the union; minBytes/ maxSizeRatio against union bytes (total behavior preserved); maxSkeletonBytes asserts the always-loaded skeleton actually shrank. Lowering minBytes to fit a small skeleton would otherwise make the size floor toothless [Codex #12]. Built + tested BEFORE the carve so ship's invariant can flip to sectioned in the same commit it lands. Monolith path byte-identical (verified: pre-existing investigate 1.053 ratio drift fails the same with this change stashed). 7 sectioned-parity tests + existing parity tests green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * refactor(ship): carve into skeleton + on-demand sections (Claude) (T9) ship/SKILL.md drops 167KB → 68.7KB (~59% of the always-loaded skill) by moving 8 prose-heavy steps into ship/sections/*.md, read on demand: tests, test-coverage, plan-completion, review-army, greptile, adversarial, changelog, pr-body. Step 12's version logic now calls the tested gstack-version-bump CLI instead of inline bash. Claude-first (S2): {{SECTION:id}} emits a STOP-Read pointer on Claude (skeleton + generated section files) and INLINES the content on every other host, so external hosts keep the full monolith — verified factory at 162KB with no sections dir. {{SECTION_INDEX:ship}} renders the situation→section table from the PASSIVE manifest (CM2 / v2_PLAN.md:663); required-reads live only in test fixtures. Multi-pass resolve expands inlined sections' own resolvers. Parity: ship invariant flipped to sectioned (union content checks + maxSkeletonBytes asserts the shrink). Carve-fallout fixed across gen-skill-docs/skill-validation/ golden/plan-completion/#1539/size-budget tests via skeleton+sections union reads. Free suite green except the pre-existing investigate parity drift. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test(ship): manifest-consistency + context-parity + requiredReads helper (T9) Free deterministic guards for the carve: - required-reads.ts + unit test: assertRequiredReads(run, requiredFiles) — the mechanical layer-5 check that the agent Read the sections its situation needs (required set comes from the fixture, not the passive manifest) - section-manifest-consistency: 3-tier orphan classification (generated orphan + hand-edited generated file → FAIL; manifest orphan → WARN per v2_PLAN.md) and pins the PASSIVE-manifest contract (no applies_when/required_for) - template-context-parity: generated sections have zero unresolved placeholders and gated resolvers (ADVERSARIAL_STEP/CONFIDENCE_CALIBRATION/CHANGELOG_WORKFLOW) rendered — proving sections resolve with the parent skillName, not 'sections' 16 tests, all green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test(ship): section-loading E2E + idempotency CLI detection (T9) - skill-e2e-ship-section-loading.test.ts (new, periodic): runs real /ship in plan mode against a fresh version-changing fixture and asserts the agent Read the required sections (review-army + changelog). Runs against the INSTALLED skill (~/.claude/skills/gstack/ship), not repo paths, so install-layout 404s surface [Codex outside-voice #5]. Layer-5 mechanical guard against silent section-skip. - skill-e2e-ship-idempotency.test.ts: detection updated for the carve — Step 12 now runs gstack-version-bump classify (JSON "state":"ALREADY_BUMPED") instead of the inline bash echo (STATE: ALREADY_BUMPED). Accept both; add a gstack-version-bump-write re-bump regression signal. - touchfiles: register ship-section-loading (periodic) + extend idempotency deps with bin/gstack-version-bump + scripts/resolvers/sections.ts. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test(ship): union-read redaction wiring test for the carve (T9) main's PR-body redaction-at-sink lives in sections/pr-body.md.tmpl after the carve, not the skeleton template. Read skeleton + section templates union so the redaction-wiring assertions follow the relocated content. 9/9 green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * v1.54.0.0 feat: carve /ship into skeleton + on-demand sections (-59% always-loaded) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com> |