diff --git a/CHANGELOG.md b/CHANGELOG.md index 551b845ef..860ebd357 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -1,5 +1,73 @@ # Changelog +## [1.50.0.0] - 2026-05-27 + +## **`/plan-tune` settings actually do something now. Hooks make capture deterministic, preferences binding, and free-text answers loop back as memory.** + +Before this release, plan-tune was a profile inspector with a hollow substrate. Every gstack skill told the agent "log this AskUserQuestion fire," and in weeks of dogfood, zero events ever landed. Preferences were agent-honored convention. Declared profile dimensions sat in a JSON file doing nothing. After this release: a PostToolUse hook captures every AUQ fire whether the agent remembers to log or not. A PreToolUse hook substitutes auto-decided answers when you've set `never-ask`. Free-text "Other" responses get dream-cycled through Claude into structured proposals you approve, then injected into future related questions as inline context. Codex sessions are backfilled by a structured-JSONL parser, not regex on transcript text. + +The cathedral lands behind one explicit consent prompt at `./setup` (with diff preview, backup, and one-command rollback) and stays on once installed. + +### The numbers that matter + +Measured against the existing v1.49 substrate. Reproduce with `bun test test/plan-tune-gates.test.ts test/question-log-hook.test.ts test/question-preference-hook.test.ts test/memory-cache-injection.test.ts test/distill-free-text.test.ts test/distill-apply.test.ts test/declared-annotation.test.ts test/gstack-codex-session-import.test.ts test/skill-e2e-plan-tune-cathedral.test.ts`. + +| Metric | Before (v1.49.0.0) | After (v1.50.0.0) | Δ | +|---|---|---|---| +| AUQ events captured per session | 0 (agent convention) | every fire (hook) | substrate works | +| `never-ask` preferences enforced | 0% (agent convention) | 100% (hook + deny+reason) | actually binds | +| Declared profile annotations | 0 / week | every signal_key match | profile renders | +| Dream-cycle memory persistence | 0 (no mechanism) | per-project + gbrain mirror | cross-project recall | +| Codex session backfill | none (regex idea) | structured JSONL parser | future-proof | +| Per-PR test cost added | $0 | $0 (deterministic; no claude -p) | gate-tier safe | +| Unit + E2E tests added | — | 96 tests / 8 new files | green | + +| Layer | What it does | Where it lives | +|---|---|---| +| 1 — Capture | PostToolUse hook → question-log.jsonl with dedup + async derive | hosts/claude/hooks/question-log-hook.ts | +| 2 — Enforcement | PreToolUse hook → deny+reason with auto-decided option | hosts/claude/hooks/question-preference-hook.ts | +| 3 — Annotation | declared profile → kebab signal_key → plain-English phrase | scripts/declared-annotation.ts | +| 4 — Surfaces | host-aware Stats, Recent auto-decisions, Audit unmarked | plan-tune/SKILL.md.tmpl | +| 5 — Discoverability | setup hook-install prompt + post-ship nudge | setup, ship/SKILL.md.tmpl | +| 6 — Tests | 5 E2E scenarios, all gate tier, $0 cost | test/skill-e2e-plan-tune-cathedral.test.ts | +| 7 — Installation | schema-aware bin: PreToolUse + PostToolUse, backup + rollback | bin/gstack-settings-hook | +| 8 — Dream cycle | Anthropic SDK distill + gbrain put_page + memory injection | bin/gstack-distill-* + Layer 2 inject | + +Highest-impact number is the third row: declared profile annotations now render inline before every AUQ that matches a signal_key. Set `declared.scope_appetite = 0.85` once during /plan-tune setup, and every "should I bundle this fix?" question shows up with "(your profile leans complete-implementation)" on the recommended option. The same loop applies to verbose-vs-terse, consult-vs-delegate, and ship-now-vs-get-the-design-right. + +### What this means for solo builders + +The feature compounds now. Each AskUserQuestion you answer "Other" with free text gets captured by the hook, batched into proposals by `gstack-distill-free-text` (3/day cap, ~$0.01 per run), reviewed via `/plan-tune distill`, and applied as either a `never-ask` preference, a declared-profile nudge, or a reusable memory nugget that routes to your gbrain (when configured) and reappears as context the next time a related question fires. The dream cycle is the unlock — without it, every nuanced answer evaporated after one turn. Now they accumulate. Run `./setup` and accept the hook-install prompt to turn it on, then `/plan-tune` whenever you want to see what your profile knows about you. + +### Itemized changes + +**Added** +- `hosts/claude/hooks/question-log-hook` — PostToolUse hook, matcher covers `AskUserQuestion` + `mcp__*__AskUserQuestion`. Captures every AUQ fire with marker-first question_id (D18), hash-fallback observed-only, source-tagged. +- `hosts/claude/hooks/question-preference-hook` — PreToolUse hook with `(recommended)`-label parser, refuse-on-ambiguous (D2 safety), project-then-global preference precedence (D8), one-way safety override. Auto-decided events logged from the hook itself since deny prevents PostToolUse from firing. +- `scripts/declared-annotation.ts` — `getDeclaredAnnotation(signal_key)` with kebab→underscore namespace mapping. Returns null in the middle band, plain-English phrase in strong bands (>= 0.7 or <= 0.3). +- `bin/gstack-codex-session-import` — structured JSONL parser for `~/.codex/sessions/`. Marker-first recovery with pattern fallback, source-tagged `codex-import-marker` / `codex-import-pattern`. +- `bin/gstack-distill-free-text` — Layer 8 dream cycle distiller. Anthropic SDK direct call (Haiku 4.5), 3/day rate cap per slug (D7), cumulative cost log, sync-or-background execution context (D14). +- `bin/gstack-distill-apply` — applies one approved proposal to its surface (preference / declared-nudge / memory-nugget), with optional `--gbrain-published true` flag. +- `setup` — interactive consent prompt for hook installation with diff preview, backup, one-command rollback. Marker-gated so users are asked at most once. +- `ship/SKILL.md.tmpl` Step 21 — post-success plan-tune nudge, marker-gated for at-most-once. +- `docs/spikes/claude-code-hook-mutation.md` + `docs/spikes/codex-session-format.md` — Phase 1 spike outputs that pinned protocol contracts before implementation. +- 96 new tests across 8 files: STATE_ROOT honoring, v1.49 gates, settings-hook schema-aware ops, both hooks, declared-annotation, codex import, distill bin, distill apply, memory injection, 5 cathedral E2E scenarios. + +**Changed** +- `bin/gstack-settings-hook` schema-aware rewrite: PreToolUse + PostToolUse registration with `_gstack_source` tag for dedup, `add-event` / `remove-source` / `diff-event` / `rollback` / `list-sources` subcommands. Legacy `add`/`remove` SessionStart shape preserved verbatim. +- `bin/gstack-question-log` — accepts source, tool_use_id, free_text; composite dedup on (source, tool_use_id) across last 100 lines (D3); async-fires `gstack-developer-profile --derive` after every successful write (D17 — without this, sample_size stayed 0). +- Three bins (`gstack-question-log`, `gstack-question-preference`, `gstack-developer-profile`) + `gstack-config` now honor `GSTACK_STATE_ROOT` env var as highest-priority override (D16 Codex correction — without this, isolation tests silently wrote to real ~/.gstack). +- `scripts/resolvers/question-tuning.ts` preamble — added marker-embedding convention (``) and `(recommended)` label convention. Hook enforcement gates on marker presence. +- `scripts/question-registry.ts` — added `signal_key: 'decision-autonomy'` to `land-and-deploy-merge-confirm` and `land-and-deploy-rollback` so the autonomy dimension has a real signal source. +- `scripts/psychographic-signals.ts` — added `decision-autonomy` signal map. +- `plan-tune/SKILL.md.tmpl` — new sections (Recent auto-decisions, Audit unmarked, Dream cycle review, Dream cycle distill); host-aware Stats with source breakdown + MARKED %; Step 0 routing extended with dream-cycle gate. +- `bin/gstack-uninstall` — also cleans up `plan-tune-cathedral`-tagged hooks during uninstall. + +**For contributors** +- 4 cross-model tension resolutions during eng review locked in: project preferences win over global (D8), hash IDs are observed-only never preference keys (D18), AUQ matcher covers MCP variants (Codex correction), enforcement uses `permissionDecision: "deny"` + reason instead of `"allow"` + `updatedInput` until the AUQ input shape is verified against real Claude Code (T6 conservative path). +- Plan-review preamble byte budget ratcheted 39000 → 40000 in `test/gen-skill-docs.test.ts` (~700 bytes added by the marker convention). +- 9 Codex outside-voice findings folded directly without re-prompting (matcher correction, derive wiring, settings.json consent, signal_key namespace, etc.). + ## [1.49.0.0] - 2026-05-26 ## **`/plan-tune` learns to ask for consent before logging, and runs the 5-question setup automatically when your profile is empty.** diff --git a/VERSION b/VERSION index 5889b7e20..d6c236f53 100644 --- a/VERSION +++ b/VERSION @@ -1 +1 @@ -1.49.0.0 +1.50.0.0 diff --git a/package.json b/package.json index 8f916cc88..76e45815e 100644 --- a/package.json +++ b/package.json @@ -1,6 +1,6 @@ { "name": "gstack", - "version": "1.49.0.0", + "version": "1.50.0.0", "description": "Garry's Stack — Claude Code skills + fast headless browser. One repo, one install, entire AI engineering workflow.", "license": "MIT", "type": "module", diff --git a/test/fixtures/golden/claude-ship-SKILL.md b/test/fixtures/golden/claude-ship-SKILL.md index 9611072f7..12e4c7799 100644 --- a/test/fixtures/golden/claude-ship-SKILL.md +++ b/test/fixtures/golden/claude-ship-SKILL.md @@ -650,7 +650,11 @@ If you are looping on the same diagnostic, same file, or failed fix variants, ST Before each AskUserQuestion, choose `question_id` from `scripts/question-registry.ts` or `{skill}-{slug}`, then run `~/.claude/skills/gstack/bin/gstack-question-preference --check ""`. `AUTO_DECIDE` means choose the recommended option and say "Auto-decided [summary] → [option] (your preference). Change with /plan-tune." `ASK_NORMALLY` means ask. -After answer, log best-effort: +**Embed the question_id as a marker in the question text** so hooks can identify it deterministically (plan-tune cathedral T14 / D18 progressive markers). Append `` somewhere in the rendered question (the leading line or trailing line is fine; the marker doesn't render visibly to the user when wrapped in HTML-style angle brackets, but the hook strips it). Without the marker the PreToolUse enforcement hook treats the AUQ as observed-only and never auto-decides — so always include it when the question matches a registered `question_id`. + +**Embed the option recommendation via the `(recommended)` label suffix** on exactly one option per AUQ. The PreToolUse hook parses `(recommended)` first, falls back to "Recommendation: X" prose, and refuses to auto-decide if ambiguous. Two `(recommended)` labels = refuse. + +After answer, log best-effort (PostToolUse hook also captures deterministically when installed; dedup on (source, tool_use_id) handles double-writes): ```bash ~/.claude/skills/gstack/bin/gstack-question-log '{"skill":"ship","question_id":"","question_summary":"","category":"","door_type":"","options_count":N,"user_choice":"","recommended":"","session_id":"'"$_SESSION_ID"'"}' 2>/dev/null || true ``` @@ -3082,6 +3086,29 @@ This step is automatic — never skip it, never ask for confirmation. --- +## Step 21: Plan-tune discoverability nudge (first-successful-ship only) + +Plan-tune cathedral T15. After a successful ship, surface /plan-tune once +per machine. Single line, non-blocking, marker-gated so it never re-fires. + +```bash +_NUDGE_MARKER="$HOME/.gstack/.plan-tune-nudge-shown" +_QT=$(~/.claude/skills/gstack/bin/gstack-config get question_tuning 2>/dev/null || echo "false") +if [ ! -f "$_NUDGE_MARKER" ] && [ "$_QT" = "false" ]; then + echo "" + echo "gstack can learn from your AskUserQuestion answers. Run /plan-tune to opt in" + echo "— it captures which prompts you find valuable vs noisy and (with hooks installed)" + echo "auto-decides your never-ask preferences." + touch "$_NUDGE_MARKER" +fi +``` + +If the marker exists, OR question_tuning is already on, the nudge is a +no-op. The marker guarantees at-most-once per machine. To re-enable: +`rm ~/.gstack/.plan-tune-nudge-shown` before next ship. + +--- + ## Important Rules - **Never skip tests.** If tests fail, stop. diff --git a/test/fixtures/golden/codex-ship-SKILL.md b/test/fixtures/golden/codex-ship-SKILL.md index d13aca74a..4ef5d6cfa 100644 --- a/test/fixtures/golden/codex-ship-SKILL.md +++ b/test/fixtures/golden/codex-ship-SKILL.md @@ -324,7 +324,36 @@ Effort both-scales: when an option involves effort, label both human-team and CC Net line closes the tradeoff. Per-skill instructions may add stricter rules. -12. **Non-ASCII characters — write directly, never \u-escape.** When any +### Handling 5+ options — split, never drop + +AskUserQuestion caps every call at **4 options**. With 5+ real options, NEVER +drop, merge, or silently defer one to fit. Pick a compliant shape: + +- **Batch into ≤4-groups** — for coherent alternatives (e.g. version bumps, + layout variants). One call, 5th surfaced only if first 4 don't fit. +- **Split per-option** — for independent scope items (e.g. "ship E1..E6?"). + Fire N sequential calls, one per option. Default to this when unsure. + +Per-option call shape: `D.k` header (e.g. D3.1..D3.5), ELI10 per option, +Recommendation, kind-note (no completeness score — Include/Defer/Cut/Hold are +decision actions), and 4 buckets: +**A) Include**, **B) Defer**, **C) Cut**, **D) Hold** (stop chain, discuss). + +After the chain, fire `D.final` to validate the assembled set (reprompt +dependency conflicts) and confirm shipping it. Use `D.revise-` to +revise one option without re-running the chain. + +For N>6, fire a `D.0` meta-AskUserQuestion first (proceed / narrow / batch). + +question_ids for split chains: `-split-` (kebab-case ASCII, +≤64 chars, `-2`/`-3` suffix on collision). The runtime checker +(`bin/gstack-question-preference`) refuses `never-ask` on any `*-split-*` id, +so split chains are never AUTO_DECIDE-eligible — the user's option set is sacred. + +**Full rule + worked examples + Hold/dependency semantics:** see +`docs/askuserquestion-split.md` in the gstack repo. Read on demand when N>4. + +**Non-ASCII characters — write directly, never \u-escape.** When any string field (question, option label, option description) contains Chinese (繁體/簡體), Japanese, Korean, or other non-ASCII text, emit the literal UTF-8 characters in the JSON string. **Never escape them @@ -357,6 +386,9 @@ Before calling AskUserQuestion, verify: - [ ] Net line closes the decision - [ ] You are calling the tool, not writing prose - [ ] Non-ASCII characters (CJK / accents) written directly, NOT \u-escaped +- [ ] If you had 5+ options, you split (or batched into ≤4-groups) — did NOT drop any +- [ ] If you split, you checked dependencies between options before firing the chain +- [ ] If a per-option Hold fires, you stopped the chain immediately (didn't queue) ## Artifacts Sync (skill start) @@ -604,7 +636,11 @@ If you are looping on the same diagnostic, same file, or failed fix variants, ST Before each AskUserQuestion, choose `question_id` from `scripts/question-registry.ts` or `{skill}-{slug}`, then run `$GSTACK_BIN/gstack-question-preference --check ""`. `AUTO_DECIDE` means choose the recommended option and say "Auto-decided [summary] → [option] (your preference). Change with /plan-tune." `ASK_NORMALLY` means ask. -After answer, log best-effort: +**Embed the question_id as a marker in the question text** so hooks can identify it deterministically (plan-tune cathedral T14 / D18 progressive markers). Append `` somewhere in the rendered question (the leading line or trailing line is fine; the marker doesn't render visibly to the user when wrapped in HTML-style angle brackets, but the hook strips it). Without the marker the PreToolUse enforcement hook treats the AUQ as observed-only and never auto-decides — so always include it when the question matches a registered `question_id`. + +**Embed the option recommendation via the `(recommended)` label suffix** on exactly one option per AUQ. The PreToolUse hook parses `(recommended)` first, falls back to "Recommendation: X" prose, and refuses to auto-decide if ambiguous. Two `(recommended)` labels = refuse. + +After answer, log best-effort (PostToolUse hook also captures deterministically when installed; dedup on (source, tool_use_id) handles double-writes): ```bash $GSTACK_BIN/gstack-question-log '{"skill":"ship","question_id":"","question_summary":"","category":"","door_type":"","options_count":N,"user_choice":"","recommended":"","session_id":"'"$_SESSION_ID"'"}' 2>/dev/null || true ``` @@ -2660,6 +2696,29 @@ This step is automatic — never skip it, never ask for confirmation. --- +## Step 21: Plan-tune discoverability nudge (first-successful-ship only) + +Plan-tune cathedral T15. After a successful ship, surface /plan-tune once +per machine. Single line, non-blocking, marker-gated so it never re-fires. + +```bash +_NUDGE_MARKER="$HOME/.gstack/.plan-tune-nudge-shown" +_QT=$($GSTACK_ROOT/bin/gstack-config get question_tuning 2>/dev/null || echo "false") +if [ ! -f "$_NUDGE_MARKER" ] && [ "$_QT" = "false" ]; then + echo "" + echo "gstack can learn from your AskUserQuestion answers. Run /plan-tune to opt in" + echo "— it captures which prompts you find valuable vs noisy and (with hooks installed)" + echo "auto-decides your never-ask preferences." + touch "$_NUDGE_MARKER" +fi +``` + +If the marker exists, OR question_tuning is already on, the nudge is a +no-op. The marker guarantees at-most-once per machine. To re-enable: +`rm ~/.gstack/.plan-tune-nudge-shown` before next ship. + +--- + ## Important Rules - **Never skip tests.** If tests fail, stop. diff --git a/test/fixtures/golden/factory-ship-SKILL.md b/test/fixtures/golden/factory-ship-SKILL.md index f18183678..f15e68b85 100644 --- a/test/fixtures/golden/factory-ship-SKILL.md +++ b/test/fixtures/golden/factory-ship-SKILL.md @@ -326,7 +326,36 @@ Effort both-scales: when an option involves effort, label both human-team and CC Net line closes the tradeoff. Per-skill instructions may add stricter rules. -12. **Non-ASCII characters — write directly, never \u-escape.** When any +### Handling 5+ options — split, never drop + +AskUserQuestion caps every call at **4 options**. With 5+ real options, NEVER +drop, merge, or silently defer one to fit. Pick a compliant shape: + +- **Batch into ≤4-groups** — for coherent alternatives (e.g. version bumps, + layout variants). One call, 5th surfaced only if first 4 don't fit. +- **Split per-option** — for independent scope items (e.g. "ship E1..E6?"). + Fire N sequential calls, one per option. Default to this when unsure. + +Per-option call shape: `D.k` header (e.g. D3.1..D3.5), ELI10 per option, +Recommendation, kind-note (no completeness score — Include/Defer/Cut/Hold are +decision actions), and 4 buckets: +**A) Include**, **B) Defer**, **C) Cut**, **D) Hold** (stop chain, discuss). + +After the chain, fire `D.final` to validate the assembled set (reprompt +dependency conflicts) and confirm shipping it. Use `D.revise-` to +revise one option without re-running the chain. + +For N>6, fire a `D.0` meta-AskUserQuestion first (proceed / narrow / batch). + +question_ids for split chains: `-split-` (kebab-case ASCII, +≤64 chars, `-2`/`-3` suffix on collision). The runtime checker +(`bin/gstack-question-preference`) refuses `never-ask` on any `*-split-*` id, +so split chains are never AUTO_DECIDE-eligible — the user's option set is sacred. + +**Full rule + worked examples + Hold/dependency semantics:** see +`docs/askuserquestion-split.md` in the gstack repo. Read on demand when N>4. + +**Non-ASCII characters — write directly, never \u-escape.** When any string field (question, option label, option description) contains Chinese (繁體/簡體), Japanese, Korean, or other non-ASCII text, emit the literal UTF-8 characters in the JSON string. **Never escape them @@ -359,6 +388,9 @@ Before calling AskUserQuestion, verify: - [ ] Net line closes the decision - [ ] You are calling the tool, not writing prose - [ ] Non-ASCII characters (CJK / accents) written directly, NOT \u-escaped +- [ ] If you had 5+ options, you split (or batched into ≤4-groups) — did NOT drop any +- [ ] If you split, you checked dependencies between options before firing the chain +- [ ] If a per-option Hold fires, you stopped the chain immediately (didn't queue) ## Artifacts Sync (skill start) @@ -606,7 +638,11 @@ If you are looping on the same diagnostic, same file, or failed fix variants, ST Before each AskUserQuestion, choose `question_id` from `scripts/question-registry.ts` or `{skill}-{slug}`, then run `$GSTACK_BIN/gstack-question-preference --check ""`. `AUTO_DECIDE` means choose the recommended option and say "Auto-decided [summary] → [option] (your preference). Change with /plan-tune." `ASK_NORMALLY` means ask. -After answer, log best-effort: +**Embed the question_id as a marker in the question text** so hooks can identify it deterministically (plan-tune cathedral T14 / D18 progressive markers). Append `` somewhere in the rendered question (the leading line or trailing line is fine; the marker doesn't render visibly to the user when wrapped in HTML-style angle brackets, but the hook strips it). Without the marker the PreToolUse enforcement hook treats the AUQ as observed-only and never auto-decides — so always include it when the question matches a registered `question_id`. + +**Embed the option recommendation via the `(recommended)` label suffix** on exactly one option per AUQ. The PreToolUse hook parses `(recommended)` first, falls back to "Recommendation: X" prose, and refuses to auto-decide if ambiguous. Two `(recommended)` labels = refuse. + +After answer, log best-effort (PostToolUse hook also captures deterministically when installed; dedup on (source, tool_use_id) handles double-writes): ```bash $GSTACK_BIN/gstack-question-log '{"skill":"ship","question_id":"","question_summary":"","category":"","door_type":"","options_count":N,"user_choice":"","recommended":"","session_id":"'"$_SESSION_ID"'"}' 2>/dev/null || true ``` @@ -3038,6 +3074,29 @@ This step is automatic — never skip it, never ask for confirmation. --- +## Step 21: Plan-tune discoverability nudge (first-successful-ship only) + +Plan-tune cathedral T15. After a successful ship, surface /plan-tune once +per machine. Single line, non-blocking, marker-gated so it never re-fires. + +```bash +_NUDGE_MARKER="$HOME/.gstack/.plan-tune-nudge-shown" +_QT=$($GSTACK_ROOT/bin/gstack-config get question_tuning 2>/dev/null || echo "false") +if [ ! -f "$_NUDGE_MARKER" ] && [ "$_QT" = "false" ]; then + echo "" + echo "gstack can learn from your AskUserQuestion answers. Run /plan-tune to opt in" + echo "— it captures which prompts you find valuable vs noisy and (with hooks installed)" + echo "auto-decides your never-ask preferences." + touch "$_NUDGE_MARKER" +fi +``` + +If the marker exists, OR question_tuning is already on, the nudge is a +no-op. The marker guarantees at-most-once per machine. To re-enable: +`rm ~/.gstack/.plan-tune-nudge-shown` before next ship. + +--- + ## Important Rules - **Never skip tests.** If tests fail, stop. diff --git a/test/fixtures/parity-baseline-v1.47.0.0.json b/test/fixtures/parity-baseline-v1.47.0.0.json index aad9c538e..27af1ffb0 100644 --- a/test/fixtures/parity-baseline-v1.47.0.0.json +++ b/test/fixtures/parity-baseline-v1.47.0.0.json @@ -491,13 +491,14 @@ }, "plan-tune": { "skill": "plan-tune", - "skillMdBytes": 51717, - "skillMdLines": 1077, - "estTokens": 12929, - "tmplBytes": 15586, + "skillMdBytes": 64017, + "skillMdLines": 1357, + "estTokens": 16004, + "tmplBytes": 25196, "descriptionLen": 325, "hasGateEval": true, - "hasPeriodicEval": false + "hasPeriodicEval": false, + "_baseline_note": "Rebased from 51717 → 64017 in plan-tune cathedral v1.50.0.0 (T13). Cathedral added Dream cycle, Recent auto-decisions, Audit unmarked, Dream cycle review/distill sections — all load-bearing for hook substrate. See CHANGELOG.md [1.50.0.0]." }, "qa": { "skill": "qa",