mirror of https://github.com/garrytan/gstack.git
4 Commits
| Author | SHA1 | Message | Date |
|---|---|---|---|
|
|
c7ae63201a
|
v1.58.1.0 feat: hermetic local E2E + Conductor prose AskUserQuestion (#2004)
* feat: add shared call-time isConductor() helper
Single source of truth for Conductor host detection in TS consumers
(CONDUCTOR_WORKSPACE_PATH / CONDUCTOR_PORT). Reads the passed env at
call time, not a module-load snapshot, so unit tests can pin the env
inline without Bun --preload (esm-hoist-breaks-env-pin-bootstrap).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* test: harden question-preference-hook harness against ambient Conductor env
runHook copied all of process.env into the hook subprocess, so running the
suite inside Conductor (CONDUCTOR_WORKSPACE_PATH/PORT set) would leak those
markers. Strip them so the existing cases deterministically characterize
NON-Conductor behavior before the Conductor branch lands. Baseline: 15 pass.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat: PreToolUse hook denies AskUserQuestion in Conductor, redirects to prose
Conductor disables native AskUserQuestion and routes through a flaky MCP
variant that returns '[Tool result missing due to internal error]'. The
hook now denies any AUQ call in a Conductor session and instructs the model
to render a prose decision brief instead (transport avoidance, not preference
enforcement) — firing for one-way doors too, with a typed-confirmation
requirement for destructive paths.
Precedence: never-ask auto-decide still wins (user already settled those);
Conductor prose is the fallback for everything else; non-Conductor behavior
is byte-for-byte unchanged. Restructured the per-question loop to compute
eligibility without early-returning so the Conductor branch can run as the
fallback while preserving memoryContext on every exit.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat: Conductor renders AskUserQuestion decisions as prose by default
In Conductor, native AskUserQuestion is disabled and the MCP variant is
flaky, so skills now render every decision as a plain-text prose brief the
user answers by typing a letter — proactively, not as a failure reaction.
- Preamble emits CONDUCTOR_SESSION, gated on != headless so eval/CI inside
Conductor still BLOCKs instead of rendering prose to nobody.
- AskUserQuestion Format gains a Conductor-default-prose rule (auto-decide
preferences still apply first; prose decisions log via gstack-question-log
since PostToolUse never fires), a one-way/destructive typed-confirmation
rule, and a typed-reply continuation protocol for split chains.
- Regenerated all SKILL.md + ship golden fixtures; bumped affected carve
skeleton caps to absorb the always-loaded additions.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat: deploy the Conductor AskUserQuestion hook (setup + upgrade migration)
The PreToolUse hook only delivers its Conductor-prose guarantee if it's
installed, but setup skips hook registration in non-interactive (conductor/CI)
setups. Two fixes so layer 3 actually deploys:
- setup: treat a Conductor workspace as an implicit opt-in for the PreToolUse
hook on the silent fall-through (never overriding an explicit opt-out).
- migration v1.58.0.0: re-register the hook for existing Conductor installs on
/gstack-upgrade, idempotent and respecting plan_tune_hooks=no.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* test: E2E for Conductor prose + fix auto-decide-preserved GSTACK_HOME bug
- New skill-e2e-conductor-prose (periodic): Conductor env + plan-eng-review
surfaces a prose decision brief, not a silent skip. Header documents this is
end-to-end behavior coverage; the deterministic Conductor guard is the
question-preference-hook unit test (the PTY harness can't register the MCP
variant — Codex #10).
- Fix the pre-existing bug in auto-decide-preserved: it seeded the never-ask
preference under GSTACK_HOME=tmpHome but never passed GSTACK_HOME into the
PTY run, so the spawned claude read the real ~/.gstack and the preference
was inert (Codex #9). Now passes GSTACK_HOME + CONDUCTOR_WORKSPACE_PATH to
prove auto-decide still wins over the Conductor prose redirect.
- Register both in touchfiles (periodic tier).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* v1.58.0.0 feat: Conductor renders AskUserQuestion decisions as prose
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* test: strip ambient Conductor env in memory-cache-injection hook harness
Same dev-in-Conductor leak fixed for question-preference-hook: this suite's
runHook copies process.env, so running it inside Conductor flipped the
defer-path memoryContext assertions into the [conductor] prose deny. Strip
CONDUCTOR_* so the cases characterize non-Conductor behavior. (CI is headless,
so this only bit local Conductor runs.)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat: gstack-detach — run agent eval/bench jobs in their own session
Long agent-run jobs (30-60 min evals, benchmarks) die when the harness sends
SIGTERM to a background task's process group on turn boundaries / monitor
stops / interruptions (observed: 'script test:gate terminated by signal
SIGTERM'). gstack-detach runs the command in a fresh session (python3
os.setsid, or setsid on Linux, nohup fallback) so a group SIGTERM can't reach
it, and wraps it in caffeinate -i on macOS so idle-sleep can't kill it either.
Returns immediately; caller polls the logfile. Secrets stay in env, never argv.
The guard test pins the contract: the command runs in a different process
group than the caller and outlives the launching shell.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat: eval:bg* scripts — detached eval runs for agents
Agent-facing convenience scripts that launch the eval suites through
gstack-detach so a harness SIGTERM can't kill a long run. eval:bg (diff-based),
eval:bg:all, eval:bg:gate, eval:bg:periodic — each returns immediately and
streams to /tmp/gstack-evals.log for polling. The plain test:evals / test:e2e
scripts stay foreground for humans.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs: CLAUDE.md — agents must run long evals via gstack-detach
Codifies the detached-execution default: agent-launched eval/benchmark runs go
through bin/gstack-detach (or the eval:bg* scripts) so a harness SIGTERM or
macOS idle-sleep can't kill a 30-60 min run, then poll the log with a
death-aware watcher. Humans keep foreground scripts.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat: harden gstack-detach against all four eval-infra killers
The basic bash detach fixed SIGTERM but a real run on a shared dev box hit
three more killers: cross-worktree API saturation (15-way concurrency x a
sibling worktree mass-timed-out the suite), a silent hang (periodic bun died
with no exit marker), and shared-/tmp log contamination (a concurrent
worktree's agent output bled into the log). Rewrite as a portable python3 tool
that bakes in all four fixes:
- fork + setsid: SIGTERM-proof (own session, survives harness polite-quit)
- caffeinate -i on macOS: no idle-sleep death
- --lock NAME (fcntl, machine-wide): concurrent worktrees SERIALIZE instead of
saturating the shared model API
- run-scoped default log (~/.gstack-dev/eval-runs/<label>-<slug>-<branch>-<ts>-<pid>):
no cross-worktree collision/contamination
- --timeout watchdog + a guaranteed '### gstack-detach EXIT=<code> ###' sentinel
on every terminal path: no silent hang, finished-vs-died always detectable
Guard test pins all four: detached pgid differs + outlives launcher, run-scoped
log path, watchdog EXIT=timeout, and lock serialization (second run WAITS).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat: eval:bg* use run-scoped logs + machine lock + watchdog
Drop the shared /tmp/gstack-evals.log path (the cross-worktree collision that
contaminated a live run) for gstack-detach's run-scoped default, and add the
machine-wide gstack-evals lock (concurrent worktrees serialize, no API
saturation) plus per-tier watchdog timeouts (60/90/120 min). Each eval:bg*
prints its run-scoped log path to poll.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs: wire detached-eval guidance into /ship + correct CLAUDE.md flags
- /ship eval step (sections/tests.md): long eval suites launch via gstack-detach
(own session, machine lock, EXIT sentinel) so a turn boundary can't kill a
30+ min run mid-ship — the exact failure observed during this branch's ship.
- CLAUDE.md: correct the now-stale /tmp reference; document the --lock (serialize
worktrees, no API saturation), --timeout watchdog, run-scoped log, and the
guaranteed EXIT sentinel the poller breaks on.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* refactor: extract pure promotedEnv() from conductor-env-shim
Single source of truth for GSTACK_* key promotion semantics. The ambient
promoteConductorEnv() becomes a wrapper; behavior-preserving. Needed by the
hermetic env builder which must not mutate process.env.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat: hermetic child-env builder for E2E runners
Allowlist scrub (basics/network/named-auth kept; CONDUCTOR_*, CLAUDE_*,
GSTACK_*, MCP_*, GBRAIN_*, operator credentials dropped), per-runner
extraAllow, overrides merge last, EVALS_HERMETIC=0 byte-identical escape
hatch read at call time (ESM-hoist safe). Sync memoized singleton temp dirs
(<runRoot>/.claude keeps the extractPlanFilePath contract), seeded
.claude.json for non-interactive first run, pid-aware GC of crashed runs.
19 free unit tests.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat: session-runner spawns hermetic children + isolation canaries
claude -p children now get the allowlist-scrubbed env and a gated
--strict-mcp-config (EVALS_HERMETIC=0 restores operator env AND args).
Two gate-tier canaries make the clean room falsifiable: hermetic-canary
asserts env redirect + scrub + zero MCP servers + nonzero API-key cost
from the Bash tool_result (never model prose); hermetic-sentinel plants a
poisoned operator config (user CLAUDE.md + MCP server) and proves the
child cannot see it. Empirically verified on claude 2.1.175: print mode
needs no seed config (the seed serves the PTY path); the child CLI sets
CLAUDECODE for its own tools, so that scrub is pinned in unit tests, not
E2E. hermetic-env.ts joins GLOBAL_TOUCHFILES.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat: PTY runner spawns hermetic claude sessions
launchClaudePty children get the allowlist-scrubbed env, a gated
--strict-mcp-config, and the session exposes hermeticConfigDir for
forensics (hermetic plan files live under <dir>/plans/ and still match
extractPlanFilePath via the /.claude dir-name contract). Seeded trust
state covers repo-cwd sessions; the 15s trust-watcher stays as fallback.
Verified foreground via the plan-mode-no-op gate test.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat: codex/gemini runners spawn hermetic children
Same allowlist scrub as the claude runners, with each provider's auth
surface re-admitted via extraAllow (codex: OPENAI_API_KEY/CODEX_* plus
its tempHome .codex copy; gemini: GEMINI_*/GOOGLE_* with real HOME for
~/.gemini auth). The gemini spawn previously inherited the full operator
env with no env property at all.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* feat: agent-sdk-runner spawns hermetic children via complete Options.env
The historical 'env: breaks SDK auth' failure was partial-env replacement:
Options.env replaces the child's entire environment, so objects lacking
ANTHROPIC_API_KEY killed auth. Passing the complete hermetic env (key +
PATH + redirected CLAUDE_CONFIG_DIR/GSTACK_HOME) works — validated live
via query() with a Bash tool call (success, real cost, Conductor vars
scrubbed). Per-test opts.env merges last; ambient key mutation still
works because the builder reads process.env at call time.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* test: static tripwire pins hermetic wiring in all five runners
Free-tier invariants: every runner builds child env via hermeticChildEnv,
no raw ...process.env spread at any spawn site, --strict-mcp-config gated
on isHermeticEnabled in both claude runners, and no test callsite passes
the operator env into a runner's override parameter (scoped to runner
calls — unit tests spawning gstack bin scripts directly are exempt).
Mirrors the terminal-agent-pid-identity / server-embedder-terminal-port
tripwire idiom.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* test: refresh codex/factory ship goldens with detached-eval block
|
|
|
|
4dfdb7cdc2
|
v1.57.2.0 feat: AskUserQuestion prose fallback when the tool fails at runtime (#1908)
* feat(auq): add gstack-session-kind + echo SESSION_KIND in preamble
Classifies the session as spawned | headless | interactive from env markers
(OPENCLAW_SESSION / GSTACK_HEADLESS / CONDUCTOR_* / CLAUDE_CODE_ENTRYPOINT / CI),
defaulting to interactive. Echoed once at skill start alongside BRANCH/REPO_MODE
so the AskUserQuestion-failure fallback can branch without a shell-out at failure
time. Degrade-safe: empty/error => interactive.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* feat(auq): prose fallback when AskUserQuestion fails (interactive sessions)
On a genuine AUQ failure (tool absent, or present-but-erroring like Conductor's
flaky MCP returning '[Tool result missing due to internal error]'): retry once,
then branch on SESSION_KIND — spawned auto-chooses, headless BLOCKs, interactive
renders a prose decision brief the user answers by typing a letter.
The prose fallback MUST surface the triad: a clear ELI10 of the issue, a
per-choice Completeness score, and a recommendation+why (one paragraph per
choice). Carves out the [plan-tune auto-decide] denial as NOT a failure, and
qualifies the former 'tool_use, not prose' assertions so the rule isn't
self-contradicting. Tests pin the triad, the SESSION_KIND branch, the OV2
collision guard, the always-loaded guarantee, and a cross-file invariant on the
auto-decide prefix.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* test(auq): default GSTACK_HEADLESS=1 in eval/E2E runners
Headless harness runs classify as headless (BLOCK on AUQ failure rather than
emit a prose question no one reads). SDK runner uses ambient mutation, not the
Options.env object, to avoid breaking the SDK auth pipeline. Interactive-path
suites opt out by overriding the env per-run.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* feat(auq): defensive PostToolUse error-fallback hook (OV3:B)
When an AskUserQuestion call returns an error/missing result, this hook injects
additionalContext reminding the model to run the prose fallback for the current
SESSION_KIND. It does not render prose itself — it guarantees the reminder fires
at the moment of failure instead of relying on the model recalling SESSION_KIND.
Inert on success and inert if the platform never invokes PostToolUse on tool
errors (unverified — could not force the Conductor MCP error in a harness; see
the spike doc). The prompt-level fallback covers the case regardless. Decision
logic is unit-tested deterministically; registered in setup beside the existing
AUQ hooks.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore(auq): regenerate SKILL.md for all hosts + refresh ship goldens
Regenerated from the resolver changes (gen:skill-docs --host all). Refreshes the
byte-exact ship golden fixtures (claude/codex/factory). Spec prose tightened so
the cross-cutting preamble addition stays under the 5% per-skill parity ceiling
(investigate 4.8%) — guard unchanged.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* fix(test): kebab testNames for section-loading E2Es to match TOUCHFILES keys
The two section-loading E2E tests used display-form testNames ('/ship
section-loading', '/plan-ceo-review section-loading') while every other E2E
testName and their E2E_TOUCHFILES keys are kebab. The completeness gate does an
exact `name in E2E_TOUCHFILES` check, so it failed (pre-existing on main); diff-
based selection also couldn't match them. Align to ship-section-loading /
plan-ceo-section-loading.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* fix(test): make external-host freshness checks deterministic
The parameterized host smoke + --host all freshness tests assumed an external
`gen:skill-docs --host all` had run first (it never does in `bun test`), so which
host reported STALE varied by sibling-test timing — flaky. Regenerate the
gitignored external host dirs in a beforeAll so the --dry-run check is
deterministic. It still catches non-deterministic generation (the real bug class
for regenerated outputs); the tracked-claude freshness test runs earlier and is
unaffected.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* test(parity): headroom for AUQ cross-cutting addition on carved document-release
Merging main brought the carve of document-release (smaller skeleton); the AUQ
prose-fallback adds ~2KB to every skill's always-loaded preamble, landing
document-release at ~5.9% over the pre-carve v1.53.0.0 baseline. Add a per-carve
maxSizeRatio override (CARVE_GUARDS single source of truth) and bump only this
skill to 1.08. All other skills keep the strict 1.05 ceiling.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* fix(auq): harden error-fallback hook + harness per adversarial review
Codex pre-landing review found three real issues:
- The PostToolUse fallback hook shared source 'plan-tune-cathedral' with the
question-log hook (same event+matcher); gstack-settings-hook replaces the entry,
so it would have clobbered plan-tune capture. Give it its own 'auq-error-fallback'
source (separate entry, both run); ALREADY_INSTALLED now requires both sources.
- isErrorResponse triggered on any string containing 'internal error'/'is_error',
so a real answer or a {"is_error": false} payload could fire the fallback after a
successful question. Narrow it to the missing-result sentinel + boolean is_error.
- The SDK runner mutated process.env.GSTACK_HEADLESS process-wide (leaked headless
into later tests). Removed; GSTACK_HEADLESS=1 now lives in the eval package.json
scripts, scoped to the invocation and inherited by the SDK child.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore: bump version and changelog (v1.57.2.0)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
|
|
|
a6fb31726c
|
v1.48.0.0 feat: AskUserQuestion split rule + runtime AUTO_DECIDE carve-out (#1740)
* feat(preamble): add "Handling 5+ options — split, never drop" rule Agents repeatedly hit Conductor's 4-option AskUserQuestion cap and silently drop one option to fit, shrinking the user's decision space. This rule names the bug and gives two compliant shapes: batch into ≤4-groups (for coherent alternatives) or split into N sequential per-option calls (for independent scope items, default). Inline preamble subsection is ~15 lines (rule + buckets + pointer). Full reference with worked examples, Hold/dependency semantics, and final-summary validation lives in docs/askuserquestion-split.md. The agent loads the docs file on demand when N>4. Per-option call shape: D<N>.k header, ELI10, Recommendation, kind-note (no completeness score — decision actions, not coverage), Include / Defer / Cut / Hold buckets. Hold stops the chain immediately; the final D<N>.final call validates dependencies and confirms the assembled scope. question_ids: <skill>-split-<option-slug> (kebab-case ASCII, ≤64 chars). Also fixes orphan "12. " prefix on the existing CJK rule. Tier-2+ skills inherit via the existing resolver. SKILL.md regenerated for all 41 affected skills + 3 golden fixtures. Net diff per SKILL.md: ~34 lines (vs ~110 for the full inline version). 6 tests pin the inline contract (4-option cap, buckets, D-numbering, docs pointer, runtime AUTO_DECIDE gate reference, orphan 12 regression). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(question-pref): runtime AUTO_DECIDE carve-out for *-split-* ids Split chains (per-option AskUserQuestion calls emitted by the new "Handling 5+ options" rule) must never be silently auto-approved via /plan-tune preferences. The user's option set is sacred. Layer 1 (mechanism): unique <skill>-split-<option-slug> ids prevent cross-option preference leakage. Layer 2 (this commit): the runtime checker `gstack-question-preference --check` detects any id matching *-split-* and forces ASK_NORMALLY even when never-ask or ask-only-for-one-way preferences exist for that exact id. An explanatory note tells the user their preference was bypassed and why. 7 tests pin the carve-out: no-pref baseline, never-ask override, explanatory note text, ask-only-for-one-way override, always-ask (no note), non-split id containing "split" word (negative case for regex specificity), multi-skill split id formats. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(e2e): split-overflow regression for /plan-ceo-review Periodic-tier E2E test that catches the original failure mode the user complained about: 5+ options for ONE decision must split into N sequential AskUserQuestion calls, not drop one to fit Conductor's 4-option cap. Fixture: 5 independent chat-platform integration candidates (Slack/Discord/Teams/Telegram/Mattermost), each carrying its own include/defer/cut decision. Floor = 4 review-phase AUQs (standard [N-1] tolerance band). Pre-fix "drop to 4 + 1 dropped" fails this floor. Wired into test/helpers/touchfiles.ts: tier periodic, depends on plan-ceo-review/**, the new preamble subsection, the question-pref binary (for the carve-out), and the runner helper. touchfiles.test.ts expected count bumped 21 → 22 to account for the new entry. Cost: ~$0.30/run when EVALS_TIER=periodic. Skips silently otherwise. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore: post-merge regen + rebase size-budget baseline to v1.47.0.0 After merging origin/main (v1.45 → v1.47), three things needed cleanup: 1. spec/SKILL.md (main's new skill) regenerated to include our split-vs-drop preamble subsection — same mechanical regen as the other 41 tier-2+ skills. 2. Three golden ship fixtures refreshed to capture main's GSTACK_PLAN_MODE block + /spec routing entry + jargon-list.json refactor. 3. docs/skills.md — added /spec table row that main's PR (#1698/#1733) shipped without. Pre-existing failure on main; this PR catches and fixes. Also rebased test/skill-size-budget.test.ts from v1.44.1 → v1.47.0.0 baseline. Main's v1.46 (catalog tokens trim) + v1.47 (/spec skill) pushed the v1.44.1 anchor past the 5% ratchet to ×1.059 — pre-existing failure on main. This PR captures a fresh parity-baseline-v1.47.0.0.json and re-anchors the test there. Historical v1.44.1.json and v1.46.0.0.json retained in test/fixtures/ for reference. Our subsection contributes ~0.1% of the post-rebase corpus. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore: bump version and changelog (v1.48.0.0) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
|
|
|
a81be53621
|
v1.10.0.0: fix AskUserQuestion cadence + Pros/Cons format upgrade (#1178)
* fix(preamble): reorder AskUserQuestion Format above model overlay + rewrite Opus 4.7 pacing directive
Root cause of plan-review regression (v1.6.4.0): model overlays rendered
ABOVE the pacing rule in every SKILL.md, so Opus 4.7 read "Batch your
questions" first and absorbed it as the ambient default. The overlay's
claimed subordination ("skill wins on pacing, always") didn't stick —
literal-interpretation mode reads physical order, not claimed hierarchy.
Part 1 of 4 (plan: ~/.claude/plans/system-instruction-you-are-working-polymorphic-twilight.md):
scripts/resolvers/preamble.ts
- Move generateAskUserFormat above generateModelOverlay in section array
- Comment explains why — prevents future refactors from silently reverting
model-overlays/opus-4-7.md
- Replace "Batch your questions" block with "Pace questions to the skill"
- New wording makes one-question-per-turn the default when the skill
contains STOP directives; batching becomes the explicit exception
Regenerated 30 SKILL.md files via bun run gen:skill-docs.
Verified:
- With --model opus-4-7: Format renders at line 359, Model-Specific
Patch at 373, "Pace questions" at 419 (Format comes first, overlay
second, pacing directive intact).
- bun test passes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(plan-reviews): tighten STOP/escape-hatch directives across 4 templates
Part 2 of 4 (plan: ~/.claude/plans/system-instruction-you-are-working-polymorphic-twilight.md).
Codex caught that v1.6.3.0's reasoning collapsed on Opus 4.7: the old
escape-hatch wording ("If no issues or fix is obvious, state what
you'll do and move on — don't waste a question") let the literal
interpreter classify every finding as having an "obvious fix" and skip
AskUserQuestion entirely. Reviews became reports.
Per-template hardening (16 sites total, verified by rg):
plan-ceo-review/SKILL.md.tmpl (13 sites):
- 12 inline STOP directives: replace the full escape-hatch clause with
"zero findings → say so and proceed; findings → MUST call AskUserQuestion
as a tool_use, including for obvious fixes."
- 1 Escape hatch bullet in CRITICAL RULE section: tightened.
plan-eng-review, plan-design-review, plan-devex-review (1 site each):
- Each template's Escape hatch bullet tightened to match the new CEO wording,
adapted for each review's domain (issue/gap, decision/design/DX alternatives).
After regeneration: rg "don't waste a question" returns 0 across all
*SKILL.md.tmpl and *SKILL.md files. "zero findings, state" wording
present 16 times (matches prior count of escape-hatch sites).
bun test passes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(preamble): upgrade AskUserQuestion format to Pros/Cons decision brief
Part 4 of 4 (plan: ~/.claude/plans/system-instruction-you-are-working-polymorphic-twilight.md).
Every AskUserQuestion now renders as a decision brief, not a bullet list:
D-numbered header, ELI10, Stakes-if-we-pick-wrong, Recommendation, Pros/Cons
with ✅/❌ markers per option, closing Net: tradeoff synthesis.
scripts/resolvers/preamble/generate-ask-user-format.ts
- Full rewrite. Preserves prior rules (Re-ground, ELI10, Recommend,
Completeness, Options) and adds:
- D-numbering per skill invocation (model-level, not runtime state)
- Stakes line (pain avoided / capability unlocked / consequence named)
- Pros/Cons block with min 2 ✅ + 1 ❌ per option, min 40 chars/bullet
- Hard-stop escape: "✅ No cons — this is a hard-stop choice" for
genuine one-sided choices (destructive-action confirmations)
- Neutral-posture handling (CT1-compliant): (recommended) label
STAYS on default option to preserve AUTO_DECIDE contract; neutrality
expressed as prose in Recommendation line only
- Net line closes the decision with a one-sentence tradeoff frame
- Rule 11: tool_use mandate (prose "Question:" blocks don't count)
- Self-check list before emitting
test/skill-validation.test.ts
- Update format assertions to check for new Pros/Cons tokens
(Pros / cons:, Recommendation: <choice>, Net:, ELI10, Stakes if we
pick wrong:, ✅, ❌) across all tier-2+ skills
- Old "RECOMMENDATION: Choose" expectation removed (the new format uses
mixed-case "Recommendation:" with no literal "Choose")
test/skill-e2e-plan-format.test.ts
- Add v1.7.0.0 format token regexes (PROS_CONS_HEADER_RE, PRO_BULLET_RE,
CON_BULLET_RE, NET_LINE_RE, D_NUMBER_RE, STAKES_RE)
- Existing RECOMMENDATION_RE loosened to accept mixed-case "Recommendation:"
(canonical v1.7.0.0 form) alongside all-caps (legacy). Tests are
additive — the strict new-format gate is the upcoming cadence eval.
Regenerated 30 SKILL.md files via bun run gen:skill-docs.
Verified:
- bun test: 319 pass (1 pre-existing security-bench fixture oversize
failure on main, unrelated — confirmed via git stash test on main HEAD)
- New format tokens render in all tier-2+ skills (plan-ceo-review,
plan-eng-review, ship, office-hours verified)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* test: gate-tier units + periodic Pros/Cons evals for AskUserQuestion format
Part 3 of 4 (plan: ~/.claude/plans/system-instruction-you-are-working-polymorphic-twilight.md).
Gate-tier (E1, free, runs on every `bun test`):
test/preamble-compose.test.ts — pins the composition order
Asserts AskUserQuestion Format section renders BEFORE Model-Specific
Behavioral Patch in tier-≥2 preamble output. Covers claude default,
opus-4-7 overlay, tier 2/3, and codex host. Catches any future edit
to scripts/resolvers/preamble.ts that silently reverts the order.
test/resolver-ask-user-format.test.ts — pins the Pros/Cons contract
14 assertions against generateAskUserFormat output: D<N>, ELI10,
Stakes if we pick wrong:, Recommendation: <choice>, Pros / cons:,
✅/❌ markers, min 2 pros + 1 con rules, hard-stop escape exact
phrase, neutral-posture CT1 rule ((recommended) label preserved for
AUTO_DECIDE), Completeness coverage-vs-kind, tool_use mandate
(rule 11), self-check list, D-numbering model-level caveat.
test/model-overlay-opus-4-7.test.ts — pins the pacing directive
Asserts raw overlay file + resolved overlay output contain "Pace
questions to the skill" and NOT "Batch your questions". Verifies
INHERIT:claude chain still works (Todo-list, subordination wrapper),
Fan out / Effort-match / Literal interpretation nudges preserved.
Also asserts claude base overlay does NOT carry the Opus-specific
pacing directive (no cross-contamination).
Periodic-tier (E2, Opus-dependent, ~$1-2/run):
test/skill-e2e-plan-prosons.test.ts — 4 cases extending v1.6.3.0 harness
1. Format positive — every token present when plan has real tradeoff
2. Hard-stop NEGATIVE — plan with genuine tradeoff must NOT dodge to
"No cons — hard-stop choice" escape
3. Neutral-posture NEGATIVE — plan where one option dominates must emit
(recommended) label + "because <reason>", must NOT dodge to
"taste call" / "no preference"
4. Hard-stop POSITIVE — destructive-action plan may legitimately use
the hard-stop escape
test/helpers/touchfiles.ts — entries for all new eval cases
Dependencies: overlay, preamble.ts, generate-ask-user-format.ts, and
the 4 plan-review templates. Diff-based selection triggers the evals
whenever those files change. Also added entries for 7 expanded-coverage
cases (ship, office-hours, investigate, qa, review, design-review,
document-release) — test cases will land in follow-up PRs per skill.
Follow-ups noted in test file header:
- True multi-turn cadence eval (3 findings → 3 distinct asks) — current
harness captures one $OUT_FILE per session; multi-turn capture needs
new harness support.
- Expanded-coverage test cases for the 7 non-plan-review skills.
Verified:
- bun test: 349 pass (30 new + 319 baseline), 1 pre-existing security-bench
oversize failure on main (unrelated, unchanged).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* test: regenerate golden fixtures + update ELI10 phrase check for v1.7.0.0
Pros/Cons format rewrite (
|