Commit Graph

4 Commits

Author SHA1 Message Date
Garry Tan 9fd03fae9e
v1.58.4.0 fix: high-priority community bug wave + PTY plan-mode smoke gate (#2077)
* fix(gbrain): stop forcing GBRAIN_PREPARE on transaction-mode poolers (#1965)

buildGbrainEnv auto-set GBRAIN_PREPARE=true whenever DATABASE_URL targeted
port 6543, and the /sync-gbrain capability check exported it for the rest
of the skill run. Both had the semantics inverted: gbrain auto-disables
prepared statements on transaction-mode poolers because they break every
write there ("prepared statement does not exist"); GBRAIN_PREPARE=true is
gbrain's documented override for SESSION-mode poolers on 6543, not a
requirement for transaction mode. The #1435 search symptom the auto-set
worked around was fixed gbrain-side.

Remove both force-sets. A caller-set GBRAIN_PREPARE (either value) still
passes through untouched, preserving the session-mode-on-6543 escape hatch.
isTransactionModePooler stays exported.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* fix(gbrain): classify probe timeout as its own status; sync proceeds instead of skipping (#1964)

The 5s engine probe misclassified healthy-but-slow engines (cold Supabase
pooler connections measured at 6.9-10.7s) as broken-config, so /sync-gbrain
silently skipped code+memory and told the user their config was malformed.

- New "timeout" status: probe killed at the deadline with no recognized
  stderr pattern. Default deadline is now 15s, overridable via
  GSTACK_GBRAIN_PROBE_TIMEOUT_MS (tests set 300ms against a fake that
  sleeps 2s).
- Sync stages PROCEED on timeout with a stderr warning naming the env knob;
  a genuinely-dead engine surfaces its real error at the first operation
  instead of a false config diagnosis.
- Consistency everywhere "ok" gated behavior: gstack-gbrain-detect --is-ok
  exits 0 on timeout, and gen-skill-docs' detection gate accepts it, so a
  slow engine no longer silently suppresses brain-aware features.
- Status cache: key now includes the effective probe timeout (raising it
  invalidates a cached timeout) and GBRAIN_HOME; config detection honors
  GBRAIN_HOME so relocated-home users stop being misclassified as
  missing-config.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* fix(bins): cygpath-normalize SCRIPT_DIR for bun imports; surface learnings-log errors (#1950)

Under Windows git-bash, pwd yields a POSIX path (/c/Users/...) that Bun on
Windows cannot resolve as an ES module specifier. gstack-learnings-log
interpolates SCRIPT_DIR into a bun -e import, so every invocation died with
"Cannot find module" — and 2>/dev/null swallowed the error, silently
dropping every AI-logged learning for Windows users.

- 3-line cygpath -m guard in gstack-learnings-log and gstack-question-log
  (which gains the same import shape in the next commit). Matches the
  duplicated IS_WINDOWS convention in setup; no shared shell lib exists.
- learnings-log adopts question-log's set +e / TMPERR capture pattern
  wholesale: validation errors now print to stderr. The old
  `if [ $? -ne 0 ]` check was dead code under set -euo pipefail — the
  script exited at the failing assignment before reaching it.
- New test/bin-windows-bun-import-paths.test.ts: static invariant (any
  bash bin interpolating $SCRIPT_DIR into a bun -e import must carry the
  guard) + behavioral end-to-end run invoked via `bash <bin>` — added to
  the windows-free-tests workflow list so the conversion is proven on the
  only platform where the bug exists.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* fix(question-log): dedupe INJECTION_PATTERNS via lib/jsonl-store (#1934)

bin/gstack-question-log carried a local copy of the injection-pattern list,
so pattern fixes to lib/jsonl-store.ts never propagated — including the
/override[:\s]/i false-positive fix arriving via community PR #1940.
Import the shared hasInjection instead (enabled by the previous commit's
cygpath guard). question-log also gets the lib's stricter superset
(human:, disregard, from-now-on, approve-all patterns).

Tests pin the contract in a #1940-order-independent way: an "Override:
ignore all previous instructions" header is rejected, "prose overrides the
deterministic table" is accepted, and a static invariant keeps local
INJECTION_PATTERNS duplicates out of the bin.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* fix(security): community-pulse + both dashboards never report fake zeros (#1947)

The security-signaling surface failed open at three layers — every failure
mode read as a reassuring "0 attacks" / "0 installs":

- community-pulse edge function: supabase-js returns {data,error} without
  throwing, and all five queries discarded `error` — a DB outage produced
  real-looking zeros via the SUCCESS path, and the catch (also returning
  zeros with HTTP 200) was unreachable for query failures. Every query now
  destructures and throws; the catch serves the stale cache (marked
  "stale": true) when one exists, else 503 {"error":"pulse_unavailable"}.
  Success responses carry "status":"ok" so clients can distinguish
  authoritative data from legacy backends. NOTE: the edge function deploys
  out-of-band (supabase functions deploy community-pulse).
- gstack-security-dashboard: captures the HTTP status; non-200 / network
  failure / error body / missing section → "unknown — backend error";
  jq missing → "unknown — install jq" (the lossy grep fallback broke on
  nested arrays and under-reported attacks as zero — removed); a 200
  without the new marker shows figures with an "unverified (legacy
  backend)" note. Also fixes a latent display bug: the TOTAL grep matched
  the digit 7 inside "attacks_last_7_days" and misreported every count.
- gstack-community-dashboard: same class — curl || echo "{}" plus
  grep || echo "0" printed "Weekly active installs: 0" on any failure.
  Now "unknown — backend error (HTTP N)".

test/security-dashboard-fallback.test.ts pins the matrix (200+marker,
200-legacy, 503, network failure) x (jq present, jq absent) for both bins:
"unknown" states never render as 0.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* fix(telemetry): redact error_message spans before they leave the machine (#1947)

error_message was uploaded with only quote/newline escaping — stack traces
and failed-API errors can embed credentials, private paths, and hostnames,
and the sync path strips only _repo_slug/_branch.

New lib/redact-engine.ts export redactFindingSpans(): replaces EVERY
finding's span with <REDACTED-{id}> regardless of tier (applyRedactions is
the interactive PII-only path and exits nonzero on credential findings, so
it can't serve machine egress). Returns null when a span can't be located —
callers drop the whole payload rather than risk a leak.

gstack-telemetry-log pipes error_message through it at LOG time, so the
local JSONL at rest is clean too; surrounding text survives for crash
triage. FAIL CLOSED: bun missing, engine error, or non-JSON-string output
all null the field. Tests pin: embedded ghp_ token → <REDACTED-github.pat>
with context intact; redactor unavailable → null; raw bytes on disk never
contain the token.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* fix(redact): prepush guard fails closed on git failure; /ship owns hook install (#1946)

Two gaps closed:

1. Fail closed. The git() helper returned "" on ANY non-zero exit or
   maxBuffer overflow (status null), addedLinesFor produced an empty
   string, and the push sailed through unscanned — fail-open on exactly
   the oversized-diff case where a large secret-bearing blob is most
   likely. The diff call now uses a strict variant that throws; main
   blocks with a clear message naming the GSTACK_REDACT_PREPUSH=skip
   escape valve. Probe calls (symbolic-ref, rev-parse, merge-base) keep
   the permissive helper — their failures are normal control flow.

2. Install path. The hook was installed by nothing ("opt-in, installed by
   nothing" was the issue's words). ./setup runs in the gstack checkout —
   the wrong repo for a per-project hook — so it gets a one-line hint
   only. /ship owns per-repo install: config redact_prepush_hook=true +
   hook missing → silent install (consent already given); config unset +
   no ~/.gstack/.redact-prepush-prompted marker → one-time machine-wide
   AskUserQuestion offer, answer persisted. ship/SKILL.md regenerated in
   this same commit (check-freshness bisect discipline).

Tests: unscannable diff (bogus SHAs) → exit 1 + valve named; empty-but-
successful diff → exit 0; static asserts pin setup as hint-only and the
ship template as the installer surface.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* feat(redact): six new credential patterns — GitLab, HuggingFace, npm, DigitalOcean, Bearer, GCP SA (#1946)

Coverage gaps from the #1946 security review, including token types for
tooling gstack itself drives (glab):

HIGH (block): gitlab.token (glpat-/glptt-/gldt-), huggingface.token (hf_),
npm.token (npm_), digitalocean.token (dop_v1_), gcp.service_account (the
JSON-escaped "private_key" form that dodges pem.private_key's literal-block
match when minified, confirmed by "private_key_id" proximity).

MEDIUM (warn): auth.bearer — the most FP-prone shape in the set (docs are
full of "Authorization: Bearer <token>"), so it requires header-context
proximity and the same entropy>=3.0 + placeholder validator recipe as
env.kv. "Bearer YOUR_TOKEN_HERE" never fires; calibration over coverage,
per the cries-wolf principle.

All shapes are linear-time; test/redact-pattern-lint.test.ts covers them
automatically. Engine tests add positive + placeholder-negative cases per
pattern.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* test: coverage-audit additions for the fix wave

Ship Step 7 gap-fill (all passing, 248 tests across the touched suites):
memory + dream stage probe-timeout proceeds, gbrain-detect override paths,
stale-flag passthrough, 200-body-missing-.security fail-closed case,
telemetry redaction edges, and credential-pattern edge cases.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* fix: pre-landing review fixes

Review army findings (1 critical, auto-fixed with regression tests):

- CRITICAL (security specialist, verified live): redactFindingSpans spliced
  only the regex capture span, and pem.private_key / gcp.service_account
  capture just the BEGIN-header — the key body survived "redaction" and
  shipped via telemetry. Marker-only patterns now drop the whole payload
  (null, fail closed). Overlapping spans (Bearer+JWT on the same bytes) are
  coalesced before splicing so stale offsets can't leave partial secret
  bytes behind.
- gitStrict: drop the dead `|| r.status === null` disjunct (null !== 0
  already covers it); add the signal-kill/null-status regression test the
  docstring promised.
- security-dashboard human mode flags stale snapshots ("figures may be out
  of date") instead of presenting frozen counts as current.
- community-dashboard marker check uses jq when available — the grep-only
  variant misclassified whitespaced/reserialized bodies as legacy.
- telemetry fail-closed test now shadows bun with a failing stub
  (deterministic on any host layout); stale "five status cases" describe
  title renamed.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* fix: adversarial review fixes (Claude + Codex cross-model passes)

Both adversarial passes ran against the wave; every FIXABLE finding landed
with a regression test:

- probeTimeoutMs clamps to >=1ms: a fractional override floored to 0, and
  execFileSync treats timeout:0 as NO timeout — the probe that exists to
  bound hangs could hang forever (found by both models independently).
- /ship silent hook install now requires the hooks dir to live inside
  .git: with core.hooksPath (husky's COMMITTED .husky/), the chaining
  installer would have renamed the team's committed pre-push and written a
  machine-local wrapper into the working tree (found by both models).
- gstack-config gbrain-refresh accepts the "timeout" status — the last
  consumer still gating on literal "ok" (Codex); gstack-gbrain-detect's
  config-derived fields honor GBRAIN_HOME so the detection JSON can't
  report status ok alongside config_exists false (Codex).
- prepush: a remote sha absent locally (shallow clone / stale fetch) falls
  back to the merge-base/empty-tree range — scans MORE, never blocks a
  legitimate push into training users toward --no-verify.
- dashboards: curl's own 000 no longer doubles to "HTTP 000000"; the
  community dashboard flags stale snapshots like the security one; array
  sections parse via jq (the sed/grep loops truncated at the first ']');
  the no-jq marker grep tolerates whitespace.
- telemetry: multi-line redactor output nulls the field instead of
  corrupting the JSONL record; setup's hint fires only when the config key
  is genuinely unset (an explicit false is a recorded decline); the /ship
  prompt marker honors GSTACK_HOME.

Kept as designed (cross-model tension noted): Bearer stays MEDIUM in the
prepush gate — a HIGH Bearer would block every docs example; the entropy
validator can't eliminate that FP class, and MEDIUM warns visibly.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* chore: bump version and changelog (v1.57.11.0)

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* docs: P1 TODO — eval harness live progress + incremental persistence

Root-caused during this ship: a killed eval run was indistinguishable from a
healthy one for hours (per-file output buffering across mega test files, no
incremental eval-store writes, no honest liveness signal). Full context and
starting points in the entry.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* test: fix operational-learning E2E fixture — copy lib/jsonl-store.ts

Pre-existing breakage, proven on main: gstack-learnings-log has imported
lib/jsonl-store.ts (shared injection patterns) since v1.57.5.0 / #1910, but
the fixture copies only the bin scripts — the bin exits 1 before writing
anything, on main silently (stderr swallowed) and on this branch loudly
(the #1950 error-surfacing made the four-day-old failure visible). A real
install always ships bin/ and lib/ together; the fixture now does too.
Verified: the fixture-shaped invocation writes the learning (exit 0) with
lib present, exits 1 on both main and this branch without it.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* fix(ios-qa): isolate E2E tests under --concurrent (3 real races)

The ios-qa E2E file failed intermittently under `bun test --concurrent`
(the eval harness default). Three distinct shared-state races, all fixed:

1. Shared pidfile: a module-level `workDir` reassigned in beforeEach was
   clobbered by parallel tests, so concurrent daemons collided on the same
   pidfile and the loser returned `already_running`. Each test now gets its
   own dir via makeWorkDir().
2. process.env path globals: tests set GSTACK_IOS_AUDIT_PATH /
   _ATTEMPTS_PATH / _ALLOWLIST_PATH on the shared process env; concurrent
   tests stomped each other's audit/attempts destinations. Threaded
   auditPath/attemptsPath/allowlistPath through DaemonOptions (and
   mintForCaller) as explicit args — env is no longer load-bearing.
3. afterEach cleanup race: the per-test cleanup drained a shared dir array,
   so the first test to finish deleted still-running tests' workDirs
   mid-assertion. Moved to afterAll (cleans once, after all settle).

Verified: 5/5 clean full-suite runs at --max-concurrency 15 (was
intermittent); daemon unit suite 91/91; daemon source compiles. The paths
default to the env-derived locations when options are omitted, so the
production CLI path is unchanged.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* test(pty): pin spawned claude to EVALS model chain (default claude-sonnet-4-6)

launchClaudePty spawned the interactive `claude` TUI with no --model flag, so
the child inherited the operator's ~/.claude/settings.json model. On a
slow-thinking model that meant 5+ min of extended thinking on empty plan-mode
context, timing out the plan-mode smoke tests regardless of contention. Pin the
model via opts.model ?? EVALS_MODEL ?? 'claude-sonnet-4-6' — byte-identical to
session-runner.ts:144, so PTY and `claude -p` evals always agree.

Pushed before extraArgs (last flag wins, so a per-test --model still overrides).
Placement leaves the spawn region byte-stable for a clean merge with the
in-flight hermetic-env branch. Plumbed model through the three plan-skill
wrappers. Static-grep tripwires guard the pin, its fallback chain, the
before-extraArgs ordering, and all three wrapper forwards.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(pty): detect markdown bold-bullet prose AUQs (fixes office-hours smoke)

office-hours auto-mode renders its mode question as `- **Building a startup**`
markdown bullets (office-hours/SKILL.md.tmpl:102) with no letter/number marker.
isProseAUQVisible only matched `A)`-style lettered or `1.`-style numbered
options, so the question went undetected: the model surfaced it at ~2m19s
(well under the 300s budget) but the harness kept scoring the run "working"
off the spinner glyphs and timed out — a false timeout on a question that was
already on screen.

Add Pattern 3: when an interrogative line ('?') is present AND 3+ bold-bullet
markers (`- **`) appear in the 4KB tail, classify as a prose AUQ. Bold is the
discriminator vs incidental prose bullets; the line anchor is dropped (stripAnsi
can collapse option lines) and the existing `❯ 1.` cursor gate still defers to a
live native list. Wires through the existing classifyVisible 'asked' path and the
timeout high-water-mark, so office-hours now classifies 'asked' instead of
'timeout'. Five unit cases: the office-hours render passes; no-'?', <3-bullet,
plain-bullet, and native-cursor cases stay false.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(pty): detect stripAnsi-collapsed prose AUQs + judge spinner-precedence

The plan-eng/plan-design plan-mode + finding-floor smokes timed out even when
the skill HAD rendered a complete prose AskUserQuestion and was waiting: the PTY
strips cursor-positioning escapes, collapsing the option newlines/spaces so
"A) ..." arrives as "A(recommended)" / "-B:" and "Reply with A, B, or C" as
"ReplywithA,B,orC". Every line-anchored detector (Patterns 1-3) returns false on
those bytes, so proseAUQEverObserved never latched and the run timed out on a
question that was already on screen.

Add Pattern 4/5: a two-signal collapsed-form detector — a reply/recommendation
marker (space-insensitive "reply with [A-D]", "Recommendation:", or
"(recommended)") AND 2+ distinct A-D letters each punctuated by ) : or (. The
conjunction is what separates a real AUQ from incidental report prose; verified
true on the verbatim failing-run buffers where Patterns 1-3 return false.

Also fix the Haiku judge spinner bias: of 614 verdicts, 569 were 'working' and
95 of those noted a question was visible — Claude Code keeps the spinner
animating at an idle prose decision, so the judge coin-flipped. Add a precedence
override: when an option list AND a Recommendation/Reply instruction are both
visible, classify WAITING even with spinner glyphs. Kept the strict dual-signal
gate (never option-list-alone) so auto-decide-preserved doesn't flip.

5 unit tests pin the two-signal contract (2 true on real collapsed bytes, 3
false guards). 90 -> 95 pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(plan-review): ask-first scope gate for plan-eng + plan-design review

On an empty/cold invocation, plan-eng-review and plan-design-review would dive
straight into repo exploration (plan-eng) or a 7-pass mockup+audit (plan-design)
and only ask the user much later, if at all. plan-ceo-review already asks first
via an unconditional Step-0 gate and behaves well; these two did not.

Add a hard-STOP scope gate as the FIRST operational instruction in each skill
(above the design-doc check / pre-review audit / mockup defaults it explicitly
overrides): the first tool call must be AskUserQuestion confirming the review
target, before any git/Read/Grep/Glob/Bash or mockup generation. Under
--disallowedTools the options render as plain column-0 lettered prose with a
Recommendation + "Reply with A, B, or C" line so the answer is detectable.

This is correct cold-start UX (confirm what to review before grinding a full
review on nothing) and it is the product half of the plan-mode smoke fix; the
harness collapsed-form detector is the deterministic half that catches the ask
however it renders. Templates + regenerated SKILL.md (default variant).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(tiers): reclassify stochastic plan-eng/plan-design ask-first smokes as periodic

plan-eng-review and plan-design-review run a long explore/audit before their
first AskUserQuestion, so whether the plan-mode + finding-floor smokes reach a
terminal outcome within the 300s/600s budget depends on stochastic ask-first
compliance (measured ~50-67%/run even with the hardened gate). Per the
"non-deterministic -> periodic" tiering rule, move the four affected smokes
(plan-eng/plan-design review-plan-mode + finding-floor) to periodic.

The deterministic harness fix (collapsed-form detector + judge precedence) and
the ask-first gate lift these from always-failing to mostly-passing and are the
real product+harness improvements; periodic monitoring tracks the rate weekly
without blocking PRs on an LLM coin-flip. plan-ceo/plan-devex ask-first reliably
and stay gate-tier.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* ci(evals): gate the deterministic PTY plan-mode smokes in CI

The real-PTY plan-mode smokes never ran in CI — the gate was local-only. Add an
e2e-pty-plan-smoke matrix suite running the two deterministically-reliable ones
(office-hours-auto-mode, plan-mode-no-op) so a regression there blocks PRs. The
stochastic plan-eng/plan-design ask-first smokes stay periodic (touchfiles
E2E_TIERS) and are not CI-gated.

A fresh CI container has no ~/.claude.json, so the spawned interactive `claude`
would wedge on the onboarding + API-key-approval dialog. Add a scoped seed step
(hasCompletedOnboarding + key approval, its own ANTHROPIC_API_KEY env) before the
run — mirrors what the hermetic E2E child env seeds. Per-suite timeout override
(35 min) via matrix.suite.timeout so the PTY suite has headroom for --retry 2
without bumping the other 12 suites. Report runner count 12 -> 13.

Validate via workflow_dispatch before relying on the gate (PTY-in-CI is new).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* ci(evals): install gstack skill registry for the PTY smoke suite

The first dry-run of e2e-pty-plan-smoke failed: the spawned interactive `claude`
printed "Unknown command: /plan-ceo-review". .claude/skills is gitignored, so a
fresh CI checkout has no gstack skill registry and the TUI can't resolve
/office-hours or /plan-ceo-review.

Add a Register step (scoped to the suite, after Seed, before Run) that mirrors
setup's --no-prefix user-scoped registry minimally: $HOME/.claude/skills/gstack
-> repo (resolves the preambles' absolute ~/.claude/skills/gstack/bin/* and
<skill>/sections/* paths) + per-skill SKILL.md/sections symlinks for the two
skills these tests invoke. HOME is /github/home in this container and the runner
adds no HOME/CLAUDE_CONFIG_DIR override (no hermetic mode), so $HOME is the right
anchor — the Seed step already proved claude reads it. No ./setup (binary build
+ Chromium + fonts + /dev/tty prompt); SKILL.md + bin/ + sections/ are committed.

Self-validating: fails the step loudly on a dangling symlink or missing
`name:` frontmatter, so a moved target surfaces here instead of as a silent
35-min "Unknown command" timeout.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore: bump version and changelog (v1.58.4.0)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
2026-06-21 07:15:19 -07:00
Garry Tan 5d4fe7df07
v1.31.0.0 fix: delete AskUserQuestion fallback (root cause of forever war) + harness primitives (#1390)
* test: add multi-finding batching regression test (periodic tier)

Adds a periodic-tier E2E that catches the May 2026 transcript bug shape
the existing single-finding gate-tier floor test cannot detect: a model
that fires one AskUserQuestion and then batches the remaining findings
into a single "## Decisions to confirm" plan write + ExitPlanMode.

Why a separate test from skill-e2e-plan-eng-finding-floor: the gate-tier
floor (runPlanSkillFloorCheck) exits on the first AUQ render and returns
success, so a once-then-batch model would pass it trivially. This test
uses runPlanSkillCounting at periodic tier with N-AUQ tracking and
asserts >= 3 distinct review-phase AUQs on a 4-finding seeded plan.

- test/fixtures/forcing-finding-seeds.ts: FORCING_BATCHING_ENG fixture
  (4 distinct non-trivial findings spread across Architecture, Code
  Quality, Tests, Performance — mirrors the D1-D4 transcript shape)
- test/skill-e2e-plan-eng-multi-finding-batching.test.ts: new test
- test/helpers/touchfiles.ts: registered in BOTH E2E_TOUCHFILES and
  E2E_TIERS (touchfiles.test.ts asserts exact equality)

Test will fail on baseline today because today's model uses the preamble
fallback to batch findings; passes after the architectural fix lands in
a follow-up commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test: expand plan-mode pass envelopes to accept BLOCKED path

Three existing plan-mode regression tests previously codified the
preamble fallback as a valid PASS path under --disallowedTools
AskUserQuestion: outcome=plan_ready was accepted only when the model
wrote a "## Decisions to confirm" section. The forever-war fix deletes
that fallback, so this assertion would fail post-deletion.

Expanded envelope accepts EITHER:
- 'plan_ready' WITH (## Decisions section [legacy] OR BLOCKED string
  visible in TTY [post-fix])
- 'exited' WITH BLOCKED string visible in TTY [post-fix]

The legacy ## Decisions branch stays in the envelope so these tests
keep passing on today's code (where the fallback still exists) and
on tomorrow's code (where the model reports BLOCKED instead). Once
the deletion has been on main long enough that the cache flushes,
the legacy branch can be removed in a follow-up.

Failure signals (regression we DO want to catch) unchanged:
auto_decided / silent_write / timeout / exited-without-BLOCKED /
plan_ready-without-(decisions OR BLOCKED).

- test/skill-e2e-plan-ceo-plan-mode.test.ts (test 2 only)
- test/skill-e2e-autoplan-auto-mode.test.ts
- test/skill-e2e-plan-design-plan-mode.test.ts

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: delete AskUserQuestion fallback (root cause of forever war)

The /plan-eng-review skill failed to fire AskUserQuestion on a real
plan review and surfaced 4 calibration decisions via prose instead.
Investigation traced this to a "fallback when neither variant is
callable" clause in the preamble that the model rationalizes around
as a general escape hatch from "fanning out round-trip AUQs," even
when an AUQ variant IS callable. Codex review confirmed the fallback
exists in 8 inline sites with 2 surviving escape hatches the original
narrowing missed (a "genuinely trivial" exception duplicated across
all 4 plan-* templates, and a "outside plan mode, output as prose
and stop" branch in the preamble itself).

Net deletion in skill text. Closes both branches of the deleted
fallback (plan-file write AND prose-and-stop) and the trivial-fix
exception with a single hard rule:

  If no AskUserQuestion variant appears in your tool list, this
  skill is BLOCKED. Stop, report `BLOCKED — AskUserQuestion
  unavailable`, and wait for the user.

Honest about being a model directive, not a runtime guard — none of
the PTY harness helpers enforce BLOCKED today. The architectural
improvement is that the model has fewer alternatives to obey it
against. Runtime enforcement is a follow-up TODO.

Sources changed:
- scripts/resolvers/preamble/generate-ask-user-format.ts: delete both
  fallback branches; replace with 1-line BLOCKED rule
- scripts/resolvers/preamble/generate-completion-status.ts: delete
  fallback in generatePlanModeInfo
- plan-eng-review/SKILL.md.tmpl: delete fallback at Step 0 + Sections
  1-4 (5 instances) + delete trivial-fix exception
- office-hours/SKILL.md.tmpl: delete fallback in approach-selection
- plan-ceo-review/SKILL.md.tmpl: delete trivial-fix exception
- plan-design-review/SKILL.md.tmpl: delete trivial-fix exception
- plan-devex-review/SKILL.md.tmpl: delete trivial-fix exception

Generated SKILL.md regen lands in a follow-up commit per the bisect
convention (template changes separate from regenerated output).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: regenerate SKILL.md after fallback deletion

Regenerates all 47 generated SKILL.md files (default + 7 host adapters)
after the template/resolver edits in the prior commit. Pure mechanical
output of `bun run gen:skill-docs`; no hand-edits.

Verifies fallback deletion landed across the entire skill surface:
- zero hits for "Decisions to confirm" in canonical SKILL.md / .tmpl
- zero hits for "no AskUserQuestion variant is callable"
- zero hits for "genuinely trivial"
- BLOCKED rule present in 42 generated SKILL.md (every Tier-2+ skill)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(harness): detect prose-rendered AskUserQuestion in plan mode

When --disallowedTools AskUserQuestion is set and no MCP variant is
callable, the model surfaces decisions as visible prose options
("A) ... B) ... C) ..." or "1. ... 2. ... 3. ...") rather than via the
native numbered-prompt UI. isNumberedOptionListVisible doesn't catch
these because the ❯ cursor sits on the empty input prompt rather than
on option 1, so runPlanSkillObservation and runPlanSkillFloorCheck
would time out at 5-10 minutes per test even though the model was
correctly waiting for user input.

This was exposed by the v1.28 fallback deletion: pre-deletion the
model used the preamble fallback to silently auto-resolve to
plan_ready in this scenario. Post-deletion the model correctly
surfaces the question and waits, but the harness couldn't tell.

isProseAUQVisible matches:
  - 2+ distinct lettered options at line starts (A/B/C/D form)
  - 3+ distinct numbered options at line starts WITHOUT a `❯ 1.`
    cursor (so it doesn't double-fire on native numbered prompts)

Wired into:
  - classifyVisible (used by runPlanSkillObservation) → returns
    outcome='asked' instead of timeout
  - runPlanSkillFloorCheck → counts as auq_observed (floor met)

8 new unit tests in claude-pty-runner.unit.test.ts cover the lettered
shape, numbered shape, threshold edges, native-cursor exclusion, and
mid-prose false-positive guard.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(harness): LLM judge for waiting-vs-working PTY state + snapshot logs

Regex detectors (isNumberedOptionListVisible, isProseAUQVisible) are
fast and free, but PTY rendering quirks fragment prose AUQ option
lists across logical lines that no regex can reliably reassemble.
When detection misses, polling loops time out at the full budget
even though the model is correctly waiting for user input.

Adds judgePtyState — a Haiku-graded trichotomy classifier:
  - waiting: agent surfaced a question/options, sitting at input prompt
  - working: spinner / tool calls / generation in progress
  - hung:    stopped without surfacing anything (rare crash signal)

Wired as a fallback into the polling loops of runPlanSkillObservation
and runPlanSkillFloorCheck: after 60s with no regex hit, snapshot the
TTY every 30s and call the judge. On 'waiting' verdict, return
outcome=asked / auq_observed early. On 'working' or 'hung', enrich the
eventual timeout summary with the verdict so failures are diagnosable.

Implementation:
  - Spawns `claude -p --model claude-haiku-4-5 --max-turns 1` synchronously
    with prompt piped via stdin (subscription auth, no API key env required)
  - In-process cache keyed by SHA-1 of normalized last-4KB so identical
    spinner-frame snapshots don't re-charge
  - Best-effort JSONL log to ~/.gstack/analytics/pty-judge.jsonl with
    timestamp, testName, state, reasoning, hash, judge wall time
  - 30s timeout per call; returns state='unknown' with diagnostic on any
    failure mode (timeout, malformed JSON, missing claude binary)

Snapshot logging: when GSTACK_PTY_LOG=1 is set, dump last 4KB of visible
TTY at every judge tick to ~/.gstack/analytics/pty-snapshots/<test>-
<elapsed>ms.txt — postmortem trail for debugging flakes.

Cost: ~$0.0005 per call; ~10 calls per 5-min test budget; ~$0.005 per
test added in worst case (only when regex detectors miss).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test: accept prose-AUQ visible as third valid surface in plan-mode envelopes

The first re-run after wiring the LLM judge revealed that the model also
emits a third surface I hadn't anticipated: a properly-formatted question
with options ("Pick A, B, or C in your reply") rendered as prose AND
followed by ExitPlanMode (outcome=plan_ready). The migrated tests only
accepted (## Decisions section) OR (BLOCKED string) — neither matched
this case, so the test failed even though the user clearly saw the
question.

Three valid surfaces now:
  1. `## Decisions to confirm` section in plan file (legacy fallback path,
     still valid through migration window)
  2. `BLOCKED — AskUserQuestion` string in TTY (post-v1.28 BLOCKED rule)
  3. Numbered/lettered options visible in TTY as prose (post-v1.28 prose
     rendering — uses the existing isProseAUQVisible detector)

Also fixes assertReportAtBottomIfPlanWritten to be tolerant of:
  - Missing files (path detected from TTY but file not persisted) — was
    throwing ENOENT on plan_design_plan_mode and plan_ceo_plan_mode test 1
  - 'asked' outcome (smoke test exited at first AUQ before the model
    reached the report-writing step) — was throwing on the 1 fail in the
    plan-eng-plan-mode --disallowedTools test

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test: drop GSTACK REVIEW REPORT contract from --disallowedTools migrations

The plan-ceo / plan-design --disallowedTools migrated tests called
assertReportAtBottomIfPlanWritten as the final assertion, but that
contract is for full multi-section review completions. Under
--disallowedTools AskUserQuestion the model can't run the full
review (no AUQ tools to ask findings questions through), so it exits
at Step 0 with either prose-AUQ rendering or the legacy decisions
fallback. A plan file written in that mode WON'T have a GSTACK
REVIEW REPORT section — the workflow never reached the report-writing
step.

The contract is still enforced by the periodic finding-count tests
(skill-e2e-plan-{ceo,eng,design,devex}-finding-count.test.ts), which
DO run the full review end-to-end and assert report-at-bottom there.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(harness): high-water-mark prose-AUQ tracking across polling iterations

The autoplan E2E surfaces a brief prose-AUQ window (model emits options,
waits ~30s for non-existent test responder, then resumes thinking) that
the existing polling loop misses: by judge-tick time the buffer has
moved into spinner state, so the LLM judge correctly reports 'working'
and the loop times out at 5min.

Adds two flags tracked across polling iterations:
  - proseAUQEverObserved: set true the first tick isProseAUQVisible
    returns true on the recent buffer
  - waitingEverObserved: set true on the first LLM judge 'waiting' verdict

At timeout, if either flag is set, return outcome='asked' with a
summary explaining the historical signal. The model DID surface the
question — we just missed the live-state window.

Snapshot logged with tag='prose-auq-surfaced' when GSTACK_PTY_LOG=1
for postmortem trace.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test: migrate plan-eng-plan-mode test 2 envelope to match other plan-mode tests

The plan-ceo, plan-design, and autoplan plan-mode tests under
--disallowedTools all moved to the same surface-visibility envelope
(decisions section OR BLOCKED string OR prose-AUQ visible) and dropped
the GSTACK REVIEW REPORT contract because the workflow can't complete
without AUQ tools. plan-eng-plan-mode test 2 had been left on the old
envelope and was the last failing test.

This commit migrates it to match. Also lifts 'exited' out of the failure
list and into a guarded path (acceptable when surface-visible).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(harness): isProseAUQVisible — gate numbered path on tail, not full buffer

The numbered-options branch of isProseAUQVisible deferred to
isNumberedOptionListVisible whenever a `❯ 1.` cursor was visible in the
full buffer. But the boot trust dialog (`❯ 1. Yes, trust`) lives in
scrollback for the entire run, so this gate suppressed prose-numbered
detection for any session that had the trust prompt at startup —
i.e., every E2E run after the first user-trust acceptance.

Fix: check only the last 4KB tail. Native-UI deferral applies when
the cursor list is CURRENTLY rendered, not historically present in
scrollback.

Adds a regression test that puts the trust dialog in early scrollback
+ 5KB filler + a current prose-AUQ render, asserts true.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(harness): isProseAUQVisible — lower numbered threshold to 2 (matches lettered)

The 4KB tail window often contains only options 2-4 of a 4-option
numbered prose AUQ because the model emits the question header + option 1
several KB earlier in the buffer. The threshold of 3 distinct numbered
markers caused the detector to miss real prose AUQs whenever option 1
had scrolled out.

Threshold 2 matches the lettered branch and is still tightly gated by:
- Line-start anchoring (no false positives on inline `1.` references)
- No-cursor gate (defers to native UI when ❯ 1. is currently rendered)
- The 4KB tail window itself (prose-AUQ rendering happens at the end of
  the model's response, so options are clustered in the tail)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test: expose high-water-mark flags through PlanSkillObservation

The 2KB obs.evidence window often misses the prose-AUQ moment because
ExitPlanMode UI ("Ready to execute" + numbered approve/reject prompt)
pushes the model's earlier option list out of the tail by the time
outcome=plan_ready fires. Tests checking "did the user see a question"
need to consult historical state, not just the truncated final tail.

Adds two optional fields to PlanSkillObservation:
  - proseAUQEverObserved: true if isProseAUQVisible was true at any tick
  - waitingEverObserved: true if the LLM judge ever returned 'waiting'

The 4 plan-mode --disallowedTools tests now check these flags as part
of the surfaceVisible computation:
    isProseAUQVisible(obs.evidence) || obs.proseAUQEverObserved === true
    blockedVisible || proseAUQVisible || obs.waitingEverObserved === true

This catches the autoplan / plan-ceo / plan-eng case where the model
surfaces options briefly, fails to get a response, then keeps thinking
— eventually emitting ExitPlanMode and pushing options out of evidence.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(plan-ceo): bump --disallowedTools test timeout to 10 min

Last 5 runs showed the model under --disallowedTools spending the full
5-min budget in 'high effort thinking' before surfacing options. The LLM
judge correctly reports state=working at every 30s tick, so the
high-water-mark fallback never fires.

10-min budget gives the model 20 judge windows to eventually surface
the question. Outer bun timeout bumped accordingly to 660s (inner +60s).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(plan-ceo): pre-prime --disallowedTools test with concrete plan content

Root cause of the persistent timeout: under --disallowedTools, the model
can't fire the AUQ tool to ask "what should I review?" — it has to
prose-render that question. Prose-rendering a 4-option choice requires
the model to first enumerate every option, which spent the full 5min
budget in 'high effort thinking' (8 consecutive 'state=working' verdicts
from the LLM judge).

Fix: pass initialPlanContent (already supported by runPlanSkillObservation)
with a CEO-review-shaped seed plan (vague success metric, missing
premise, scope creep smell). The model now has concrete material to
critique on entry, bypasses the scope-deliberation loop, and moves
directly to surfacing Step 0 / Section 1 findings — the actual
behavior we want to regression-test.

Reverted timeout from 600_000 back to 300_000 since the 5-min budget
is plenty when the model has a real plan to work with.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test: delete --disallowedTools AskUserQuestion-blocked test variants

These tests simulated a fictional environment that doesn't exist in
production. Real Conductor sessions launch claude with
`--disallowedTools AskUserQuestion` AND register
`mcp__conductor__AskUserQuestion` — the model has the MCP variant. But
the tests passed `--disallowedTools` without standing up any MCP server,
so they tested "model behavior with NO AUQ available," which no real
user state produces.

Combined with bare `/plan-ceo-review` invocation (no follow-up content),
this forced the model into a 5+ minute deliberation loop trying to
prose-render a question with options it had to first invent. The result
was persistent flakes that consumed nine paid E2E runs trying to fix
"the model takes too long" — but the actual problem was the test
configuration, not the model.

Removals:
- test/skill-e2e-autoplan-auto-mode.test.ts (deleted; the entire file
  was a single AUQ-blocked test)
- test/skill-e2e-plan-ceo-plan-mode.test.ts test 2 (the migrated
  --disallowedTools test); test 1 (baseline plan-mode smoke) stays
- test/skill-e2e-plan-design-plan-mode.test.ts test 2 (same shape);
  test 1 stays
- test/skill-e2e-plan-eng-plan-mode.test.ts test 2 (same shape); test 1
  (baseline) and test 3 (STOP-gate with seeded plan, different
  contract) stay
- test/helpers/touchfiles.ts: autoplan-auto-mode entry removed
- test/touchfiles.test.ts: assertion count + commentary updated

Coverage retained: test 1 of each plan-mode file already verifies the
model fires AUQ; the periodic finding-count tests verify per-finding
AUQ cadence end-to-end. The harness improvements landed during this
debugging cycle (isProseAUQVisible regex, LLM judge, snapshot logging,
high-water-mark tracking, ENOENT-tolerant assertReportAtBottomIfPlanWritten)
all stay — they're useful for the remaining plan-mode tests that can
also encounter prose rendering and slow-thinking phases.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: bump version and changelog (v1.31.0.0)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 17:01:13 -07:00
Garry Tan 30fe6bb11c
v1.26.2.0 fix: plan-eng-review STOP gates always fire AskUserQuestion + report-at-bottom contract enforcement (#1313)
* fix(plan-eng-review): tighten STOP gates with anti-rationalization clause

Five sites in SKILL.md.tmpl uplift to the office-hours b512be71 pattern:
the four review-section gates (Architecture, Code Quality, Test, Performance)
plus the Step 0 complexity-check trigger. Adds tool_use reminder ("call the
tool directly"), names blocked next steps explicitly, anti-rationalization
clause naming the precise failure mode (loading the schema via ToolSearch
and writing the recommendation as chat prose).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(test/helpers): initialPlanContent + wrote_findings_before_asking + shared report-at-bottom assertion

Three additions to claude-pty-runner.ts:

1. runPlanSkillObservation gains initialPlanContent?: string. Pre-pumps a
   user message containing the seeded plan before invoking the skill, with
   a 3s gap so the message renders before the slash command. claude has no
   --plan-file flag (verified via claude --help), so message-pump is the
   route. Lets STOP-gate regression tests force complexity findings.

2. ClassifyResult gains wrote_findings_before_asking with companion
   strictPlanWrites?: boolean opt on classifyVisible. Fires when a Write/
   Edit to .claude/plans/* precedes any AskUserQuestion render in the
   session window. Default off — preserves zero-findings → write plan →
   plan_ready as legitimate for unseeded smokes. Six new unit tests cover
   before/after-AUQ ordering, permission-dialog edge case, strict-off path.

3. assertReportAtBottomIfPlanWritten(obs) shared helper. Wraps the existing
   assertReviewReportAtBottom(content) and gates on obs.planFile (artifact
   existing), so the assertion fires under both 'asked' and 'plan_ready'
   when a plan was actually written.

Also: runPlanSkillObservation now captures obs.planFile on every classifier
outcome, not just 'plan_ready'. Catches the case where the skill wrote a
plan partway through then paused on a question.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test: wire assertReportAtBottomIfPlanWritten into 4 plan-mode E2E tests + add seeded-plan STOP-gate case

Every test case in skill-e2e-plan-{eng,ceo,design,devex}-plan-mode.test.ts
that produces a plan file now asserts ## GSTACK REVIEW REPORT is the last
## section. The {{PLAN_FILE_REVIEW_REPORT}} resolver mandated this contract;
nothing tested it until now.

Plan-eng additionally gains a third test case: STOP gate fires when seeded
plan forces Step 0 findings. Combines the new initialPlanContent runner
option with --disallowedTools AskUserQuestion to force the Conductor
MCP-variant path through mcp__*__AskUserQuestion. Asserts outcome NOT in
{wrote_findings_before_asking, auto_decided, silent_write, exited, timeout}
and that plan_ready outcomes carry a ## Decisions section.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(touchfiles): delete duplicate plan-design-review-plan-mode keys

Verified duplicates in test/helpers/touchfiles.ts:
- E2E_TOUCHFILES had plan-design-review-plan-mode at line 94 (full deps)
  AND line 243 (smaller deps); JS object literals: later wins.
- E2E_TIERS had it at line 399 ('gate') AND line 524 ('periodic'); same
  later-wins rule.

Effective tier was 'periodic', not 'gate'. Three of four plan-mode siblings
ran on every PR; design ran weekly only.

Delete the line-243 and line-524 duplicates. Keep line 94 (full deps) and
line 399 ('gate'). Also extend the four plan-mode-test entries to include
scripts/resolvers/review.ts so changes to {{PLAN_FILE_REVIEW_REPORT}}
trigger all four siblings in bun run eval:select.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: bump version and changelog (v1.26.2.0)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: tighten CHANGELOG voice for v1.26.2.0

Move contributor-flavored bullet (runPlanSkillObservation seeding) into
For contributors. Drop branch-internal narrative (Codex review pass,
plan iteration tracking) per CHANGELOG-for-users style.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 20:26:59 -07:00
Garry Tan 454423aeb3
v1.21.1.0 test: tighten plan-ceo-review smoke (Step 0 must fire) (#1255)
* test: extract classifyVisible() + permission-dialog filter in PTY runner

Pure classifier extracted from runPlanSkillObservation's polling loop so
unit tests can exercise the actual branch order with synthetic input
strings. Runner gains:

- env? passthrough on runPlanSkillObservation (forwarded to launchClaudePty).
  gstack-config does not yet honor env overrides; plumbing is in place for a
  future change to make tests hermetic.
- TAIL_SCAN_BYTES = 1500 exported constant. Replaces a duplicated magic
  number in test/skill-e2e-plan-ceo-mode-routing.test.ts so tuning stays
  in sync.
- isPermissionDialogVisible: the bare phrase "Do you want to proceed?" now
  requires a file-edit context co-trigger. Other clauses unchanged. Skill
  questions that contain the bare phrase are no longer mis-classified.
- classifyVisible(visible): pure function. Branch order silent_write →
  plan_ready → asked → null. Permission dialogs filtered out of the
  'asked' classification so a permission prompt cannot pose as a Step 0
  skill question.

Adds 24 unit tests covering all classifier branches, edge cases, and the
co-trigger contract.

* test: tighten plan-ceo-review smoke to require Step 0 fires first

Assertion narrows from ['asked', 'plan_ready'] to 'asked' only. Reaching
plan_ready first means the agent skipped Step 0 entirely and went
straight to ExitPlanMode — the regression we want to catch.

Why plan-ceo is special: unlike plan-eng / plan-design / plan-devex
(whose smokes legitimately reach plan_ready on certain branches without
asking), plan-ceo-review's template mandates Step 0A premise challenge
plus Step 0F mode selection BEFORE any plan write. There is no
legitimate path to plan_ready that does not first emit a skill-question
numbered prompt.

Failure message now branches on outcome (plan_ready vs timeout vs
silent_write) with a tailored diagnosis line per case. References the
skill template by section name ("Step 0 STOP rules", "One issue = one
AskUserQuestion call") instead of line numbers, so it survives template
edits.

Passes env: { QUESTION_TUNING: 'false', EXPLAIN_LEVEL: 'default' }
through the runner. Today this is advisory — gstack-config reads only
~/.gstack/config.yaml, not env vars — but the wiring is in place for a
future change. Documented honestly in the docstring.

Verified across 4 PTY runs: 3 pre-refactor + 1 post-refactor, all PASS.

* chore: capture v1.21.1.0 follow-ups in TODOS.md

- P2: per-finding AskUserQuestion count assertion (V2)
- P3: honor env vars in gstack-config so test isolation env actually works
- P3: path-confusion hardening on SANCTIONED_WRITE_SUBSTRINGS

All three surfaced during the v1.21.1.0 plan-eng-review and adversarial
review passes. Captured here so the design intent persists.

* chore: bump version and changelog (v1.21.1.0)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test: extract MODE_RE + optionsSignature into PTY runner exports

Refactor prep for the upcoming per-finding AskUserQuestion count test
across plan-{ceo,eng,design,devex}-review. Both new tests and the existing
mode-routing test need the same mode regex and the same option-list
fingerprint dedupe — pulling them into one source of truth in
test/helpers/claude-pty-runner.ts so a fifth mode (or a tweak to the
fingerprint shape) updates everywhere instead of drifting per-test.

Mechanical: no behavior change in the mode-routing test.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test: add per-finding count primitives + unit tests

Pure helpers landing ahead of runPlanSkillCounting:

  - parseQuestionPrompt(visible) — extract the 1-3 line prompt above
    the latest "❯ 1." cursor, normalize to a 240-char snippet
  - auqFingerprint(prompt, opts) — Bun.hash of normalized prompt + sorted
    options signature; distinct prompts with shared option labels
    (the generic A/B/C TODO menu) get distinct fingerprints
  - COMPLETION_SUMMARY_RE — terminal-signal regex matching all four
    plan-review skills' completion / verdict markers
  - assertReviewReportAtBottom(content) — checks "## GSTACK REVIEW
    REPORT" is present and is the last "## " heading in a plan file
  - Step0BoundaryPredicate type + four per-skill predicates
    (ceo / eng / design / devex) — fire on the answered AUQ's
    fingerprint, marking the end of Step 0 deterministically
    (event-based, not content-based, per Codex F7)

Plus 37 deterministic unit tests covering option-label collision
regression, prompt extraction edge cases, predicate positive AND
negative cases, and review-report-at-bottom triple-check
(missing / mid-file / multiple trailing).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test: add runPlanSkillCounting PTY helper

Drives a plan-* skill end-to-end and counts distinct review-phase
AskUserQuestions. Composes the primitives from the previous commit:

  - Boot + auto-trust handler (existing launchClaudePty)
  - Send slash command alone, sleep 3s, send plan content as follow-up
    message (proven pattern from skill-e2e-plan-design-with-ui)
  - Poll loop with permission-dialog auto-grant, same-redraw skip,
    empty-prompt re-poll
  - Event-based Step-0 boundary via isLastStep0AUQ predicate fired on
    the answered AUQ's fingerprint (Codex F7 — boundary is observed
    event, not later rendered content)
  - Multi-signal terminals: hard ceiling, COMPLETION_SUMMARY_RE,
    plan_ready, silent_write, exited, timeout

Empty-prompt fingerprints are skipped per the contract documented in
auqFingerprint's unit tests — fingerprinting them would re-introduce
the option-label collision regression Codex F1 caught.

No E2E tests yet — those land in commit 5 with the four skill fixtures.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test: register four finding-count tests in touchfiles + tier map

Each new test depends on its skill template, the runner, and three
preamble resolvers (preamble.ts, generate-ask-user-format.ts,
generate-completion-status.ts) — those affect question cadence and
completion rendering, which is exactly what the test asserts on.

All four classified periodic. Sequential execution during calibration;
opt-in to concurrent only after measured comparison agrees (plan §D15).

Updated touchfiles.test.ts: plan-ceo-review/** now selects 19 tests
(was 18) because plan-ceo-finding-count joins the family.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test: add four per-finding count E2E tests (plan-ceo + eng + design + devex)

Each test drives its plan-* skill through Step 0 then asserts the
review-phase AskUserQuestion count falls in [N-1, N+2] for an N=5
seeded plan, plus D19: produced plan file ends with
"## GSTACK REVIEW REPORT" as its last "## " heading.

plan-ceo also runs a paired-finding positive control: 2 deliberately
related findings should still produce 2 distinct AUQs, not 1 batched.

Periodic-tier (gate-skipped without EVALS=1, EVALS_TIER=periodic).
Sequential execution by plan §D15. Each fixture is inline TypeScript
content delivered as a follow-up message after the slash command, per
the proven pattern at skill-e2e-plan-design-with-ui.test.ts.

Calibration loop (5 runs per skill) and the manual pre-merge negative
check (D7 + D12) are required before merge per plan §Verification.
NOT yet run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test: fix parseNumberedOptions for inline-cursor box-layout AUQs

Calibration run 1 timed out with step0=0 review=0 because the parser
could not find the cursor in /plan-ceo-review's scope-selection AUQ.
The TTY's box-layout rendering inlines divider + header + prompt +
"1." onto one logical line — cursor escapes get stripped, leaving
text crushed onto a single line.

Cursor anchor regex changed from anchored to unanchored so it matches
mid-line. Cursor-line option extraction uses a non-anchored regex;
subsequent options stay with the original start-of-line parser.

parseQuestionPrompt picks up the inline prompt text BEFORE the cursor
on the cursor line (after stripping box-drawing chars + sigil) and
appends it after any walked-up multi-line prompt above.

Three new unit tests: clean-cursor still works, inline-cursor
extracts all 7 options, prompt extraction strips box chars.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test: add firstAUQPick + plan-ceo skip-interview routing

Calibration run 1 surfaced a second issue beyond the parser bug: the
default pick of 1 on /plan-ceo-review's scope-selection AUQ routes
the agent to "branch diff vs main" — so it reviews the gstack PR
itself (recursive!) instead of the seeded fixture plan we sent.

Added firstAUQPick callback to runPlanSkillCounting. Override applies
only to the FIRST AUQ; subsequent presses keep using defaultPick.

ceoStep0Boundary now fires on either the mode-pick AUQ (existing path)
or any AUQ containing "Skip interview and plan immediately" — which
is the scope-selection AUQ. Picking that option bypasses Step 0 and
routes straight to review-phase using the chat-paste plan as context.

Plan-ceo test wires firstAUQPick = pickSkipInterview which finds the
"Skip interview" option by label. Falls back to "describe inline" if
the option labels change.

Two new unit tests: ceoStep0Boundary fires on the scope-selection
fixture; existing mode-pick fixture still fires.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 02:50:09 -07:00