mirror of https://github.com/garrytan/gstack.git
2 Commits
| Author | SHA1 | Message | Date |
|---|---|---|---|
|
|
9fd03fae9e
|
v1.58.4.0 fix: high-priority community bug wave + PTY plan-mode smoke gate (#2077)
* fix(gbrain): stop forcing GBRAIN_PREPARE on transaction-mode poolers (#1965) buildGbrainEnv auto-set GBRAIN_PREPARE=true whenever DATABASE_URL targeted port 6543, and the /sync-gbrain capability check exported it for the rest of the skill run. Both had the semantics inverted: gbrain auto-disables prepared statements on transaction-mode poolers because they break every write there ("prepared statement does not exist"); GBRAIN_PREPARE=true is gbrain's documented override for SESSION-mode poolers on 6543, not a requirement for transaction mode. The #1435 search symptom the auto-set worked around was fixed gbrain-side. Remove both force-sets. A caller-set GBRAIN_PREPARE (either value) still passes through untouched, preserving the session-mode-on-6543 escape hatch. isTransactionModePooler stays exported. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix(gbrain): classify probe timeout as its own status; sync proceeds instead of skipping (#1964) The 5s engine probe misclassified healthy-but-slow engines (cold Supabase pooler connections measured at 6.9-10.7s) as broken-config, so /sync-gbrain silently skipped code+memory and told the user their config was malformed. - New "timeout" status: probe killed at the deadline with no recognized stderr pattern. Default deadline is now 15s, overridable via GSTACK_GBRAIN_PROBE_TIMEOUT_MS (tests set 300ms against a fake that sleeps 2s). - Sync stages PROCEED on timeout with a stderr warning naming the env knob; a genuinely-dead engine surfaces its real error at the first operation instead of a false config diagnosis. - Consistency everywhere "ok" gated behavior: gstack-gbrain-detect --is-ok exits 0 on timeout, and gen-skill-docs' detection gate accepts it, so a slow engine no longer silently suppresses brain-aware features. - Status cache: key now includes the effective probe timeout (raising it invalidates a cached timeout) and GBRAIN_HOME; config detection honors GBRAIN_HOME so relocated-home users stop being misclassified as missing-config. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix(bins): cygpath-normalize SCRIPT_DIR for bun imports; surface learnings-log errors (#1950) Under Windows git-bash, pwd yields a POSIX path (/c/Users/...) that Bun on Windows cannot resolve as an ES module specifier. gstack-learnings-log interpolates SCRIPT_DIR into a bun -e import, so every invocation died with "Cannot find module" — and 2>/dev/null swallowed the error, silently dropping every AI-logged learning for Windows users. - 3-line cygpath -m guard in gstack-learnings-log and gstack-question-log (which gains the same import shape in the next commit). Matches the duplicated IS_WINDOWS convention in setup; no shared shell lib exists. - learnings-log adopts question-log's set +e / TMPERR capture pattern wholesale: validation errors now print to stderr. The old `if [ $? -ne 0 ]` check was dead code under set -euo pipefail — the script exited at the failing assignment before reaching it. - New test/bin-windows-bun-import-paths.test.ts: static invariant (any bash bin interpolating $SCRIPT_DIR into a bun -e import must carry the guard) + behavioral end-to-end run invoked via `bash <bin>` — added to the windows-free-tests workflow list so the conversion is proven on the only platform where the bug exists. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix(question-log): dedupe INJECTION_PATTERNS via lib/jsonl-store (#1934) bin/gstack-question-log carried a local copy of the injection-pattern list, so pattern fixes to lib/jsonl-store.ts never propagated — including the /override[:\s]/i false-positive fix arriving via community PR #1940. Import the shared hasInjection instead (enabled by the previous commit's cygpath guard). question-log also gets the lib's stricter superset (human:, disregard, from-now-on, approve-all patterns). Tests pin the contract in a #1940-order-independent way: an "Override: ignore all previous instructions" header is rejected, "prose overrides the deterministic table" is accepted, and a static invariant keeps local INJECTION_PATTERNS duplicates out of the bin. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix(security): community-pulse + both dashboards never report fake zeros (#1947) The security-signaling surface failed open at three layers — every failure mode read as a reassuring "0 attacks" / "0 installs": - community-pulse edge function: supabase-js returns {data,error} without throwing, and all five queries discarded `error` — a DB outage produced real-looking zeros via the SUCCESS path, and the catch (also returning zeros with HTTP 200) was unreachable for query failures. Every query now destructures and throws; the catch serves the stale cache (marked "stale": true) when one exists, else 503 {"error":"pulse_unavailable"}. Success responses carry "status":"ok" so clients can distinguish authoritative data from legacy backends. NOTE: the edge function deploys out-of-band (supabase functions deploy community-pulse). - gstack-security-dashboard: captures the HTTP status; non-200 / network failure / error body / missing section → "unknown — backend error"; jq missing → "unknown — install jq" (the lossy grep fallback broke on nested arrays and under-reported attacks as zero — removed); a 200 without the new marker shows figures with an "unverified (legacy backend)" note. Also fixes a latent display bug: the TOTAL grep matched the digit 7 inside "attacks_last_7_days" and misreported every count. - gstack-community-dashboard: same class — curl || echo "{}" plus grep || echo "0" printed "Weekly active installs: 0" on any failure. Now "unknown — backend error (HTTP N)". test/security-dashboard-fallback.test.ts pins the matrix (200+marker, 200-legacy, 503, network failure) x (jq present, jq absent) for both bins: "unknown" states never render as 0. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix(telemetry): redact error_message spans before they leave the machine (#1947) error_message was uploaded with only quote/newline escaping — stack traces and failed-API errors can embed credentials, private paths, and hostnames, and the sync path strips only _repo_slug/_branch. New lib/redact-engine.ts export redactFindingSpans(): replaces EVERY finding's span with <REDACTED-{id}> regardless of tier (applyRedactions is the interactive PII-only path and exits nonzero on credential findings, so it can't serve machine egress). Returns null when a span can't be located — callers drop the whole payload rather than risk a leak. gstack-telemetry-log pipes error_message through it at LOG time, so the local JSONL at rest is clean too; surrounding text survives for crash triage. FAIL CLOSED: bun missing, engine error, or non-JSON-string output all null the field. Tests pin: embedded ghp_ token → <REDACTED-github.pat> with context intact; redactor unavailable → null; raw bytes on disk never contain the token. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix(redact): prepush guard fails closed on git failure; /ship owns hook install (#1946) Two gaps closed: 1. Fail closed. The git() helper returned "" on ANY non-zero exit or maxBuffer overflow (status null), addedLinesFor produced an empty string, and the push sailed through unscanned — fail-open on exactly the oversized-diff case where a large secret-bearing blob is most likely. The diff call now uses a strict variant that throws; main blocks with a clear message naming the GSTACK_REDACT_PREPUSH=skip escape valve. Probe calls (symbolic-ref, rev-parse, merge-base) keep the permissive helper — their failures are normal control flow. 2. Install path. The hook was installed by nothing ("opt-in, installed by nothing" was the issue's words). ./setup runs in the gstack checkout — the wrong repo for a per-project hook — so it gets a one-line hint only. /ship owns per-repo install: config redact_prepush_hook=true + hook missing → silent install (consent already given); config unset + no ~/.gstack/.redact-prepush-prompted marker → one-time machine-wide AskUserQuestion offer, answer persisted. ship/SKILL.md regenerated in this same commit (check-freshness bisect discipline). Tests: unscannable diff (bogus SHAs) → exit 1 + valve named; empty-but- successful diff → exit 0; static asserts pin setup as hint-only and the ship template as the installer surface. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * feat(redact): six new credential patterns — GitLab, HuggingFace, npm, DigitalOcean, Bearer, GCP SA (#1946) Coverage gaps from the #1946 security review, including token types for tooling gstack itself drives (glab): HIGH (block): gitlab.token (glpat-/glptt-/gldt-), huggingface.token (hf_), npm.token (npm_), digitalocean.token (dop_v1_), gcp.service_account (the JSON-escaped "private_key" form that dodges pem.private_key's literal-block match when minified, confirmed by "private_key_id" proximity). MEDIUM (warn): auth.bearer — the most FP-prone shape in the set (docs are full of "Authorization: Bearer <token>"), so it requires header-context proximity and the same entropy>=3.0 + placeholder validator recipe as env.kv. "Bearer YOUR_TOKEN_HERE" never fires; calibration over coverage, per the cries-wolf principle. All shapes are linear-time; test/redact-pattern-lint.test.ts covers them automatically. Engine tests add positive + placeholder-negative cases per pattern. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * test: coverage-audit additions for the fix wave Ship Step 7 gap-fill (all passing, 248 tests across the touched suites): memory + dream stage probe-timeout proceeds, gbrain-detect override paths, stale-flag passthrough, 200-body-missing-.security fail-closed case, telemetry redaction edges, and credential-pattern edge cases. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix: pre-landing review fixes Review army findings (1 critical, auto-fixed with regression tests): - CRITICAL (security specialist, verified live): redactFindingSpans spliced only the regex capture span, and pem.private_key / gcp.service_account capture just the BEGIN-header — the key body survived "redaction" and shipped via telemetry. Marker-only patterns now drop the whole payload (null, fail closed). Overlapping spans (Bearer+JWT on the same bytes) are coalesced before splicing so stale offsets can't leave partial secret bytes behind. - gitStrict: drop the dead `|| r.status === null` disjunct (null !== 0 already covers it); add the signal-kill/null-status regression test the docstring promised. - security-dashboard human mode flags stale snapshots ("figures may be out of date") instead of presenting frozen counts as current. - community-dashboard marker check uses jq when available — the grep-only variant misclassified whitespaced/reserialized bodies as legacy. - telemetry fail-closed test now shadows bun with a failing stub (deterministic on any host layout); stale "five status cases" describe title renamed. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix: adversarial review fixes (Claude + Codex cross-model passes) Both adversarial passes ran against the wave; every FIXABLE finding landed with a regression test: - probeTimeoutMs clamps to >=1ms: a fractional override floored to 0, and execFileSync treats timeout:0 as NO timeout — the probe that exists to bound hangs could hang forever (found by both models independently). - /ship silent hook install now requires the hooks dir to live inside .git: with core.hooksPath (husky's COMMITTED .husky/), the chaining installer would have renamed the team's committed pre-push and written a machine-local wrapper into the working tree (found by both models). - gstack-config gbrain-refresh accepts the "timeout" status — the last consumer still gating on literal "ok" (Codex); gstack-gbrain-detect's config-derived fields honor GBRAIN_HOME so the detection JSON can't report status ok alongside config_exists false (Codex). - prepush: a remote sha absent locally (shallow clone / stale fetch) falls back to the merge-base/empty-tree range — scans MORE, never blocks a legitimate push into training users toward --no-verify. - dashboards: curl's own 000 no longer doubles to "HTTP 000000"; the community dashboard flags stale snapshots like the security one; array sections parse via jq (the sed/grep loops truncated at the first ']'); the no-jq marker grep tolerates whitespace. - telemetry: multi-line redactor output nulls the field instead of corrupting the JSONL record; setup's hint fires only when the config key is genuinely unset (an explicit false is a recorded decline); the /ship prompt marker honors GSTACK_HOME. Kept as designed (cross-model tension noted): Bearer stays MEDIUM in the prepush gate — a HIGH Bearer would block every docs example; the entropy validator can't eliminate that FP class, and MEDIUM warns visibly. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * chore: bump version and changelog (v1.57.11.0) Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * docs: P1 TODO — eval harness live progress + incremental persistence Root-caused during this ship: a killed eval run was indistinguishable from a healthy one for hours (per-file output buffering across mega test files, no incremental eval-store writes, no honest liveness signal). Full context and starting points in the entry. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * test: fix operational-learning E2E fixture — copy lib/jsonl-store.ts Pre-existing breakage, proven on main: gstack-learnings-log has imported lib/jsonl-store.ts (shared injection patterns) since v1.57.5.0 / #1910, but the fixture copies only the bin scripts — the bin exits 1 before writing anything, on main silently (stderr swallowed) and on this branch loudly (the #1950 error-surfacing made the four-day-old failure visible). A real install always ships bin/ and lib/ together; the fixture now does too. Verified: the fixture-shaped invocation writes the learning (exit 0) with lib present, exits 1 on both main and this branch without it. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * fix(ios-qa): isolate E2E tests under --concurrent (3 real races) The ios-qa E2E file failed intermittently under `bun test --concurrent` (the eval harness default). Three distinct shared-state races, all fixed: 1. Shared pidfile: a module-level `workDir` reassigned in beforeEach was clobbered by parallel tests, so concurrent daemons collided on the same pidfile and the loser returned `already_running`. Each test now gets its own dir via makeWorkDir(). 2. process.env path globals: tests set GSTACK_IOS_AUDIT_PATH / _ATTEMPTS_PATH / _ALLOWLIST_PATH on the shared process env; concurrent tests stomped each other's audit/attempts destinations. Threaded auditPath/attemptsPath/allowlistPath through DaemonOptions (and mintForCaller) as explicit args — env is no longer load-bearing. 3. afterEach cleanup race: the per-test cleanup drained a shared dir array, so the first test to finish deleted still-running tests' workDirs mid-assertion. Moved to afterAll (cleans once, after all settle). Verified: 5/5 clean full-suite runs at --max-concurrency 15 (was intermittent); daemon unit suite 91/91; daemon source compiles. The paths default to the env-derived locations when options are omitted, so the production CLI path is unchanged. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * test(pty): pin spawned claude to EVALS model chain (default claude-sonnet-4-6) launchClaudePty spawned the interactive `claude` TUI with no --model flag, so the child inherited the operator's ~/.claude/settings.json model. On a slow-thinking model that meant 5+ min of extended thinking on empty plan-mode context, timing out the plan-mode smoke tests regardless of contention. Pin the model via opts.model ?? EVALS_MODEL ?? 'claude-sonnet-4-6' — byte-identical to session-runner.ts:144, so PTY and `claude -p` evals always agree. Pushed before extraArgs (last flag wins, so a per-test --model still overrides). Placement leaves the spawn region byte-stable for a clean merge with the in-flight hermetic-env branch. Plumbed model through the three plan-skill wrappers. Static-grep tripwires guard the pin, its fallback chain, the before-extraArgs ordering, and all three wrapper forwards. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test(pty): detect markdown bold-bullet prose AUQs (fixes office-hours smoke) office-hours auto-mode renders its mode question as `- **Building a startup**` markdown bullets (office-hours/SKILL.md.tmpl:102) with no letter/number marker. isProseAUQVisible only matched `A)`-style lettered or `1.`-style numbered options, so the question went undetected: the model surfaced it at ~2m19s (well under the 300s budget) but the harness kept scoring the run "working" off the spinner glyphs and timed out — a false timeout on a question that was already on screen. Add Pattern 3: when an interrogative line ('?') is present AND 3+ bold-bullet markers (`- **`) appear in the 4KB tail, classify as a prose AUQ. Bold is the discriminator vs incidental prose bullets; the line anchor is dropped (stripAnsi can collapse option lines) and the existing `❯ 1.` cursor gate still defers to a live native list. Wires through the existing classifyVisible 'asked' path and the timeout high-water-mark, so office-hours now classifies 'asked' instead of 'timeout'. Five unit cases: the office-hours render passes; no-'?', <3-bullet, plain-bullet, and native-cursor cases stay false. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test(pty): detect stripAnsi-collapsed prose AUQs + judge spinner-precedence The plan-eng/plan-design plan-mode + finding-floor smokes timed out even when the skill HAD rendered a complete prose AskUserQuestion and was waiting: the PTY strips cursor-positioning escapes, collapsing the option newlines/spaces so "A) ..." arrives as "A(recommended)" / "-B:" and "Reply with A, B, or C" as "ReplywithA,B,orC". Every line-anchored detector (Patterns 1-3) returns false on those bytes, so proseAUQEverObserved never latched and the run timed out on a question that was already on screen. Add Pattern 4/5: a two-signal collapsed-form detector — a reply/recommendation marker (space-insensitive "reply with [A-D]", "Recommendation:", or "(recommended)") AND 2+ distinct A-D letters each punctuated by ) : or (. The conjunction is what separates a real AUQ from incidental report prose; verified true on the verbatim failing-run buffers where Patterns 1-3 return false. Also fix the Haiku judge spinner bias: of 614 verdicts, 569 were 'working' and 95 of those noted a question was visible — Claude Code keeps the spinner animating at an idle prose decision, so the judge coin-flipped. Add a precedence override: when an option list AND a Recommendation/Reply instruction are both visible, classify WAITING even with spinner glyphs. Kept the strict dual-signal gate (never option-list-alone) so auto-decide-preserved doesn't flip. 5 unit tests pin the two-signal contract (2 true on real collapsed bytes, 3 false guards). 90 -> 95 pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * feat(plan-review): ask-first scope gate for plan-eng + plan-design review On an empty/cold invocation, plan-eng-review and plan-design-review would dive straight into repo exploration (plan-eng) or a 7-pass mockup+audit (plan-design) and only ask the user much later, if at all. plan-ceo-review already asks first via an unconditional Step-0 gate and behaves well; these two did not. Add a hard-STOP scope gate as the FIRST operational instruction in each skill (above the design-doc check / pre-review audit / mockup defaults it explicitly overrides): the first tool call must be AskUserQuestion confirming the review target, before any git/Read/Grep/Glob/Bash or mockup generation. Under --disallowedTools the options render as plain column-0 lettered prose with a Recommendation + "Reply with A, B, or C" line so the answer is detectable. This is correct cold-start UX (confirm what to review before grinding a full review on nothing) and it is the product half of the plan-mode smoke fix; the harness collapsed-form detector is the deterministic half that catches the ask however it renders. Templates + regenerated SKILL.md (default variant). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test(tiers): reclassify stochastic plan-eng/plan-design ask-first smokes as periodic plan-eng-review and plan-design-review run a long explore/audit before their first AskUserQuestion, so whether the plan-mode + finding-floor smokes reach a terminal outcome within the 300s/600s budget depends on stochastic ask-first compliance (measured ~50-67%/run even with the hardened gate). Per the "non-deterministic -> periodic" tiering rule, move the four affected smokes (plan-eng/plan-design review-plan-mode + finding-floor) to periodic. The deterministic harness fix (collapsed-form detector + judge precedence) and the ask-first gate lift these from always-failing to mostly-passing and are the real product+harness improvements; periodic monitoring tracks the rate weekly without blocking PRs on an LLM coin-flip. plan-ceo/plan-devex ask-first reliably and stay gate-tier. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * ci(evals): gate the deterministic PTY plan-mode smokes in CI The real-PTY plan-mode smokes never ran in CI — the gate was local-only. Add an e2e-pty-plan-smoke matrix suite running the two deterministically-reliable ones (office-hours-auto-mode, plan-mode-no-op) so a regression there blocks PRs. The stochastic plan-eng/plan-design ask-first smokes stay periodic (touchfiles E2E_TIERS) and are not CI-gated. A fresh CI container has no ~/.claude.json, so the spawned interactive `claude` would wedge on the onboarding + API-key-approval dialog. Add a scoped seed step (hasCompletedOnboarding + key approval, its own ANTHROPIC_API_KEY env) before the run — mirrors what the hermetic E2E child env seeds. Per-suite timeout override (35 min) via matrix.suite.timeout so the PTY suite has headroom for --retry 2 without bumping the other 12 suites. Report runner count 12 -> 13. Validate via workflow_dispatch before relying on the gate (PTY-in-CI is new). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * ci(evals): install gstack skill registry for the PTY smoke suite The first dry-run of e2e-pty-plan-smoke failed: the spawned interactive `claude` printed "Unknown command: /plan-ceo-review". .claude/skills is gitignored, so a fresh CI checkout has no gstack skill registry and the TUI can't resolve /office-hours or /plan-ceo-review. Add a Register step (scoped to the suite, after Seed, before Run) that mirrors setup's --no-prefix user-scoped registry minimally: $HOME/.claude/skills/gstack -> repo (resolves the preambles' absolute ~/.claude/skills/gstack/bin/* and <skill>/sections/* paths) + per-skill SKILL.md/sections symlinks for the two skills these tests invoke. HOME is /github/home in this container and the runner adds no HOME/CLAUDE_CONFIG_DIR override (no hermetic mode), so $HOME is the right anchor — the Seed step already proved claude reads it. No ./setup (binary build + Chromium + fonts + /dev/tty prompt); SKILL.md + bin/ + sections/ are committed. Self-validating: fails the step loudly on a dangling symlink or missing `name:` frontmatter, so a moved target surfaces here instead of as a silent 35-min "Unknown command" timeout. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * chore: bump version and changelog (v1.58.4.0) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Fable 5 <noreply@anthropic.com> |
|
|
|
0a803f9e81
|
feat: gstack v1 — simpler prompts + real LOC receipts (v1.0.0.0) (#1039)
* docs: add design doc for /plan-tune v1 (observational substrate)
Canonical record of the /plan-tune v1 design: typed question registry,
per-question explicit preferences, inline tune: feedback with user-origin
gate, dual-track profile (declared + inferred separately), and plain-English
inspection skill. Captures every decision with pros/cons, what's deferred to
v2 with explicit acceptance criteria, and what was rejected entirely.
Codex review drove a substantial scope rollback from the initial CEO
EXPANSION plan. 15+ legitimate findings (substrate claim was false without
a typed registry; E4/E6/clamp logical contradiction; profile poisoning
attack surface; LANDED preamble side effect; implementation order) shaped
the final shape.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat: typed question registry for /plan-tune v1 foundation
scripts/question-registry.ts declares 53 recurring AskUserQuestion categories
across 15 skills (ship, review, office-hours, plan-ceo-review, plan-eng-review,
plan-design-review, plan-devex-review, qa, investigate, land-and-deploy, cso,
gstack-upgrade, preamble, plan-tune, autoplan).
Each entry has: stable kebab-case id, skill owner, category (approval |
clarification | routing | cherry-pick | feedback-loop), door_type (one-way
| two-way), optional stable option keys, optional psychographic signal_key,
and a one-line description.
12 of 53 are one-way doors (destructive ops, architecture/data forks,
security/compliance). These are ALWAYS asked regardless of user preference.
Helpers: getQuestion(id), getOneWayDoorIds(), getAllRegisteredIds(),
getRegistryStats(). No binary or resolver wiring yet — this is the schema
substrate the rest of /plan-tune builds on.
Ad-hoc question_ids (not registered) still log but skip psychographic
signal attribution. Future /plan-tune skill surfaces frequently-firing
ad-hoc ids as candidates for registry promotion.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* test: registry schema + safety + coverage tests (gate tier)
20 tests validating the question registry:
Schema (7 tests):
- Every entry has required fields
- All ids are kebab-case and start with their skill name
- No duplicate ids
- Categories are from the allowed set
- door_type is one-way | two-way
- Options arrays are well-formed
- Descriptions are short and single-line
Helpers (5 tests):
- getQuestion returns entry for known id, undefined for unknown
- getOneWayDoorIds includes destructive questions, excludes two-way
- getAllRegisteredIds count matches QUESTIONS keys
- getRegistryStats totals are internally consistent
One-way door safety (2 tests):
- Every critical question (test failure, SQL safety, LLM trust boundary,
security scan, merge confirm, rollback, fix apply, premise revise,
arch finding, privacy gate, user challenge) is declared one-way
- At least 10 one-way doors exist (catches regression if declarations
are accidentally dropped)
Registry breadth (3 tests):
- 11 high-volume skills each have >= 1 registered question
- Preamble one-time prompts are registered
- /plan-tune's own questions are registered
Signal map references (1 test):
- signal_key values are typed kebab-case strings
Template coverage (2 tests, informational):
- AskUserQuestion usage across templates is non-trivial (>20)
- Registry spans >= 10 skills
20 pass, 0 fail.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat: one-way door classifier (belt-and-suspenders safety fallback)
scripts/one-way-doors.ts — secondary keyword-pattern classifier that catches
destructive questions even when the registry doesn't have an entry for them.
The registry's door_type field (from scripts/question-registry.ts) is the
PRIMARY safety gate. This classifier is the fallback for ad-hoc question_ids
that agents generate at runtime.
Classification priority:
1. Registry lookup by question_id → use declared door_type
2. Skill:category fallback (cso:approval, land-and-deploy:approval)
3. Keyword pattern match against question_summary
4. Default: treat as two-way (safer to log the miss than auto-decide unsafely)
Covers 21 destructive patterns across:
- File system (rm -rf, delete, wipe, purge, truncate)
- Database (drop table/database/schema, delete from)
- Git/VCS (force-push, reset --hard, checkout --, branch -D)
- Deploy/infra (kubectl delete, terraform destroy, rollback)
- Credentials (revoke/reset/rotate API key|token|secret|password)
- Architecture (breaking change, schema migration, data model change)
7 new tests in test/plan-tune.test.ts covering: registry-first lookup,
unknown-id fallthrough, keyword matching on destructive phrasings including
embedded filler words ("rotate the API key"), skill-category fallback,
benign questions defaulting to two-way, pattern-list non-empty.
27 pass, 0 fail. 1270 expect() calls.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat: psychographic signal map + builder archetypes
scripts/psychographic-signals.ts — hand-crafted {signal_key, user_choice} →
{dimension, delta} map. Version 0.1.0. Conservative deltas (±0.03 to ±0.06
per event). Covers 9 signal keys: scope-appetite, architecture-care,
code-quality-care, test-discipline, detail-preference, design-care,
devex-care, distribution-care, session-mode.
Helpers: applySignal() mutates running totals, newDimensionTotals() creates
empty starting state, normalizeToDimensionValue() sigmoid-clamps accumulated
delta to [0,1] (0 → 0.5 neutral), validateRegistrySignalKeys() checks that
every signal_key in the registry has a SIGNAL_MAP entry.
In v1 the signal map is used ONLY to compute inferred dimension values for
/plan-tune inspection output. No skill behavior adapts to these signals
until v2.
scripts/archetypes.ts — 8 named archetypes + Polymath fallback:
- Cathedral Builder (boil-the-ocean + architecture-first)
- Ship-It Pragmatist (small scope + fast)
- Deep Craft (detail-verbose + principled)
- Taste Maker (intuitive, overrides recommendations)
- Solo Operator (high-autonomy, delegates)
- Consultant (hands-on, consulted on everything)
- Wedge Hunter (narrow scope aggressively)
- Builder-Coach (balanced steering)
- Polymath (fallback when no archetype matches)
matchArchetype() uses L2 distance scaled by tightness, with a 0.55 threshold
below which we return Polymath. v1 ships the model stable; v2 narrative/vibe
commands wire it into user-facing output.
14 new tests: signal map consistency vs registry, applySignal behavior for
known/unknown keys, normalization bounds, archetype schema validity, name
uniqueness, matchArchetype correctness for each reference profile, Polymath
fallback for outliers.
41 pass, 0 fail total in test/plan-tune.test.ts.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat: bin/gstack-question-log — append validated AskUserQuestion events
Append-only JSONL log at ~/.gstack/projects/{SLUG}/question-log.jsonl.
Schema: {skill, question_id, question_summary, category?, door_type?,
options_count?, user_choice, recommended?, followed_recommendation?,
session_id?, ts}
Validates:
- skill is kebab-case
- question_id is kebab-case, <= 64 chars
- question_summary non-empty, <= 200 chars, newlines flattened
- category is one of approval/clarification/routing/cherry-pick/feedback-loop
- door_type is one-way or two-way
- options_count is integer in [1, 26]
- user_choice non-empty string, <= 64 chars
Injection defense on question_summary rejects the same patterns as
gstack-learnings-log (ignore previous instructions, system:, override:,
do not report, etc).
followed_recommendation is auto-computed when both user_choice and
recommended are present.
ts auto-injected as ISO 8601 if missing.
21 tests covering: valid payloads, full field preservation, auto-followed
computation, appending, long-summary truncation, newline flattening,
invalid JSON, missing fields, bad case, oversized ids, invalid enum
values, out-of-range options_count, and 6 injection attack patterns.
21 pass, 0 fail, 43 expect() calls.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat: bin/gstack-developer-profile — unified profile with migration
bin/gstack-developer-profile supersedes bin/gstack-builder-profile. The old
binary becomes a one-line legacy shim delegating to --read for /office-hours
backward compat.
Subcommands:
--read legacy KEY:VALUE output (tier, session_count, etc)
--migrate folds ~/.gstack/builder-profile.jsonl into
~/.gstack/developer-profile.json. Atomic (temp + rename),
idempotent (no-op when target exists or source absent),
archives source as .migrated-YYYY-MM-DD-HHMMSS
--derive recomputes inferred dimensions from question-log.jsonl
using the signal map in scripts/psychographic-signals.ts
--profile full profile JSON
--gap declared vs inferred diff JSON
--trace <dim> event-level trace of what contributed to a dimension
--check-mismatch flags dimensions where declared and inferred disagree by
> 0.3 (requires >= 10 events first)
--vibe archetype name + description from scripts/archetypes.ts
--narrative (v2 stub)
Auto-migration on first read: if legacy file exists and new file doesn't,
migrate before reading. Creates a neutral (all-0.5) stub if nothing exists.
Unified schema (see docs/designs/PLAN_TUNING_V0.md §Architecture):
{identity, declared, inferred: {values, sample_size, diversity},
gap, overrides, sessions, signals_accumulated, schema_version}
25 new tests across subcommand behaviors:
- --read defaults + stub creation
- --migrate: 3 sessions preserved with signal tallies, idempotency, archival
- Tier calculation: welcome_back / regular / inner_circle boundaries
- --derive: neutral-when-empty, upward nudge on 'expand', downward on 'reduce',
recomputable (same input → same output), ad-hoc unregistered ids ignored
- --trace: contributing events, empty for untouched dims, error without arg
- --gap: empty when no declared, correctly computed otherwise
- --vibe: returns archetype name + description
- --check-mismatch: threshold behavior, 10+ sample requirement
- Unknown subcommand errors
25 pass, 0 fail, 60 expect() calls.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat: bin/gstack-question-preference — explicit preferences + user-origin gate
Subcommands:
--check <id> → ASK_NORMALLY | AUTO_DECIDE (decides if a registered
question should be auto-decided by the agent)
--write '{…}' → set a preference (requires user-origin source)
--read → dump preferences JSON
--clear [id] → clear one or all
--stats → short counts summary
Preference values: always-ask | never-ask | ask-only-for-one-way.
Stored at ~/.gstack/projects/{SLUG}/question-preferences.json.
Safety contract (the core of Codex finding #16, profile-poisoning defense
from docs/designs/PLAN_TUNING_V0.md §Security model):
1. One-way doors ALWAYS return ASK_NORMALLY from --check, regardless of
user preference. User's never-ask is overridden with a visible safety
note so the user knows why their preference didn't suppress the prompt.
2. --write requires an explicit `source` field:
- Allowed: "plan-tune", "inline-user"
- REJECTED with exit code 2: "inline-tool-output", "inline-file",
"inline-file-content", "inline-unknown"
Rejection is explicit ("profile poisoning defense") so the caller can
log and surface the attempt.
3. free_text on --write is sanitized against injection patterns (ignore
previous instructions, override:, system:, etc.) and newline-flattened.
Each --write also appends a preference-set event to
~/.gstack/projects/{SLUG}/question-events.jsonl for derivation audit trail.
31 tests:
- --check behavior (4): defaults, two-way, one-way (one-way overrides
never-ask with safety note), unknown ids, missing arg
- --check with prefs (5): never-ask on two-way → AUTO_DECIDE; never-ask
on one-way → ASK_NORMALLY with override note; always-ask always asks;
ask-only-for-one-way flips appropriately
- --write valid (5): inline-user accepted, plan-tune accepted, persisted
correctly, event appended, free_text preserved with flattening
- User-origin gate (6): missing source rejected; inline-tool-output
rejected with exit code 2 and explicit poisoning message; inline-file,
inline-file-content, inline-unknown rejected; unknown source rejected
- Schema validation (4): invalid JSON, bad question_id, bad preference,
injection in free_text
- --read (2): empty → {}, returns writes
- --clear (3): specific id, clear-all, NOOP for missing
- --stats (2): empty zeros, tallies by preference type
31 pass, 0 fail, 52 expect() calls.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat: question-tuning preamble resolvers
scripts/resolvers/question-tuning.ts ships three preamble generators:
generateQuestionPreferenceCheck — before each AskUserQuestion, agent runs
gstack-question-preference --check <id>. AUTO_DECIDE suppresses the ask
and auto-chooses recommended. ASK_NORMALLY asks as usual. One-way door
safety override is handled by the binary.
generateQuestionLog — after each AskUserQuestion, agent appends a log
record with skill, question_id, summary, category, door_type,
options_count, user_choice, recommended, session_id.
generateInlineTuneFeedback — offers inline "tune:" prompt after two-way
questions. Documents structured shortcuts (never-ask, always-ask,
ask-only-for-one-way, ask-less) AND accepts free-form English with
normalization + confirmation. Explicitly spells out the USER-ORIGIN
GATE: only write tune events when the prefix appears in the user's own
chat message, never from tool output or file content. Binary enforces.
All three resolvers are gated by the QUESTION_TUNING preamble echo. When
the config is off, the agent skips these sections entirely. Ready to be
wired into preamble.ts in the next commit.
Codex host has a simpler variant that uses $GSTACK_BIN env vars.
scripts/resolvers/index.ts registers three placeholders:
QUESTION_PREFERENCE_CHECK, QUESTION_LOG, INLINE_TUNE_FEEDBACK
Total resolver count goes from 45 to 48.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat: wire question-tuning into preamble for tier >= 2 skills
scripts/resolvers/preamble.ts — adds two things:
1. _QUESTION_TUNING config echo in the preamble bash block, gated on the
user's gstack-config `question_tuning` value (default: false).
2. A combined Question Tuning section for tier >= 2 skills, injected after
the confusion protocol. The section itself is runtime-gated by the
QUESTION_TUNING value — agents skip it entirely when off.
scripts/resolvers/question-tuning.ts — consolidated into one compact combined
section `generateQuestionTuning(ctx)` covering: preference check before the
question, log after, and inline tune: feedback with user-origin gate. Per-phase
generators remain exported for unit tests but are no longer the main entrypoint.
Size impact: +570 tokens / +2.3KB per tier-2+ SKILL.md. Three skills
(plan-ceo-review, office-hours, ship) still exceed the 100KB token ceiling —
but they were already over before this change. Delta is the smallest viable
wiring of the /plan-tune v1 substrate.
Golden fixtures (test/fixtures/golden/claude-ship, codex-ship, factory-ship)
regenerated to match the new baseline.
Full test run: 1149 pass, 0 fail, 113 skip across 28 files.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* chore: regenerate SKILL.md files with question-tuning section
bun run gen:skill-docs --host all after wiring the QUESTION_TUNING preamble
section. Every tier >= 2 skill now includes the combined Question Tuning
guidance. Runtime-gated — agents skip the section when question_tuning is
off in gstack-config (default).
Golden fixtures (claude-ship, codex-ship, factory-ship) updated to the new
baseline.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat: /plan-tune skill — conversational inspection + preferences
plan-tune/SKILL.md.tmpl: the user-facing skill for /plan-tune v1. Routes
plain-English intent to one of 8 flows:
- Enable + setup (first-time): 5 declaration questions mapping to the
5 psychographic dimensions (scope_appetite, risk_tolerance,
detail_preference, autonomy, architecture_care). Writes to
developer-profile.json declared.*.
- Inspect profile: plain-English rendering of declared + inferred + gap.
Uses word bands (low/balanced/high) not raw floats. Shows vibe archetype
when calibration gate is met.
- Review question log: top-20 question frequencies with follow/override
counts. Highlights override-heavy questions as candidates for never-ask.
- Set a preference: normalizes "stop asking me about X" → never-ask, etc.
Confirms ambiguous phrasings before writing via gstack-question-preference.
- Edit declared profile: interprets free-form ("more boil-the-ocean") and
CONFIRMS before mutating declared.* (trust boundary per Codex #15).
- Show gap: declared vs inferred diff with plain-English severity bands
(close / drift / mismatch). Never auto-updates declared from the gap.
- Stats: preference counts + diversity/calibration status.
- Enable / disable: gstack-config set question_tuning true|false.
Design constraints enforced:
- Plain English everywhere. No CLI subcommand syntax required. Shortcuts
(`profile`, `vibe`, `stats`, `setup`) exist but optional.
- user-origin gate on tune: writes. source: "plan-tune" for user-invoked
/plan-tune; source: "inline-user" for inline tune: from other skills.
- One-way doors override never-ask (safety, surfaced to user).
- No behavior adaptation in v1 — this skill inspects and configures only.
Generates plan-tune/SKILL.md at ~11.6k tokens, well under the 100KB ceiling.
Generated for all hosts via `bun run gen:skill-docs --host all`.
Full free test suite: 1149 pass, 0 fail, 113 skip across 28 files.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* test: end-to-end pipeline + preamble injection coverage
Added 6 tests to test/plan-tune.test.ts:
Preamble injection (3 tests):
- tier 2+ includes Question Tuning section with preference check, log,
and user-origin gate language ('profile-poisoning defense', 'inline-user')
- tier 1 does NOT include the prose section (QUESTION_TUNING bash echo
still fires since it's in the bash block all tiers share)
- codex host swaps binDir references to $GSTACK_BIN
End-to-end pipeline (3 tests) — real binaries working together, not mocks:
- Log 5 expand choices → --derive → profile shows scope_appetite > 0.5
(full log → registry lookup → signal map → normalization round-trip)
- --write source: inline-tool-output rejected; --read confirms no pref
was persisted (the profile-poisoning defense actually works end-to-end)
- Migrate a 3-session legacy file; confirm legacy gstack-builder-profile
shim still returns SESSION_COUNT: 3, TIER: welcome_back, CROSS_PROJECT: true
test/plan-tune.test.ts now has 47 tests total.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* test: E2E test for /plan-tune plain-English inspection flow (gate tier)
test/skill-e2e-plan-tune.test.ts — verifies /plan-tune correctly routes
plain-English intent ("review the questions I've been asked") to the
Review question log section without requiring CLI subcommand syntax.
Seeds a synthetic question-log.jsonl with 3 entries exercising:
- override behavior (user chose expand over recommended selective)
- one-way door respect (user followed ship-test-failure-triage recommendation)
- two-way override (user skipped recommended changelog polish)
Invokes the skill via `claude -p` and asserts:
- Agent surfaces >= 2 of 3 logged question_ids in output
- Agent notices override/skip behavior from the log
- Exit reason is success or error_max_turns (not agent-crash)
Gate-tier because the core v1 DX promise is plain-English intent routing.
If it requires memorized subcommands or breaks on natural language, that's
a regression of the defining feature.
Registered in test/helpers/touchfiles.ts with dependencies:
- plan-tune/** (skill template + generated md)
- scripts/question-registry.ts (required for log lookup)
- scripts/psychographic-signals.ts, scripts/one-way-doors.ts (derive path)
- bin/gstack-question-log, gstack-question-preference, gstack-developer-profile
Skipped when EVALS_ENABLED is not set; runs on `bun run test:evals`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* chore: bump version and changelog (v0.19.0.0) — /plan-tune v1
Ships /plan-tune as observational substrate: typed question registry, dual-track
developer profile (declared + inferred), explicit per-question preferences with
user-origin gate, inline tune: feedback across every tier >= 2 skill, unified
developer-profile.json with migration from builder-profile.jsonl.
Scope rolled back from initial CEO EXPANSION plan after outside-voice review
(Codex). 6 deferrals tracked as P0 TODOs with explicit acceptance criteria:
E1 substrate wiring, E3 narrative/vibe, E4 blind-spot coach, E5 LANDED
celebration, E6 auto-adjustment, E7 psychographic auto-decide.
See docs/designs/PLAN_TUNING_V0.md for the full design record.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(ci): harden Dockerfile.ci against transient Ubuntu mirror failures
The CI image build failed with:
E: Failed to fetch http://archive.ubuntu.com/ubuntu/pool/main/...
Connection failed [IP: 91.189.92.22 80]
ERROR: process "/bin/sh -c apt-get update && apt-get install ..."
did not complete successfully: exit code: 100
archive.ubuntu.com periodically returns "connection refused" on individual
regional mirrors. Without retry logic a single failed fetch nukes the whole
Docker build. Three defenses, layered:
1. /etc/apt/apt.conf.d/80-retries — apt fetches each package up to 5 times
with a 30s timeout. Handles per-package flakes.
2. Shell-loop retry around the whole apt-get step (x3, 10s sleep) — handles
the case where apt-get update itself can't reach any mirror.
3. --retry 5 --retry-delay 5 --retry-connrefused on all curl fetches (bun
install script, GitHub CLI keyring, NodeSource setup script).
Applied to every apt-get and curl call in the Dockerfile. No behavior change
on happy path — only kicks in when mirrors blip. Fixes the build-image job
that was blocking CI on the /plan-tune PR.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs: add PLAN_TUNING_V1 + PACING_UPDATES_V0 design docs
Captures the V1 design (ELI10 writing + LOC reframe) in
docs/designs/PLAN_TUNING_V1.md and the extracted V1.1 pacing-overhaul
plan in docs/designs/PACING_UPDATES_V0.md. V1 scope was reduced from
the original bundled pacing + writing-style plan after three
engineering-review passes revealed structural gaps in the pacing
workstream that couldn't be closed via plan-text editing. TODOS.md
P0 entry links to V1.1.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat: curated jargon list for V1 writing-style glossing
Repo-owned list of ~50 high-frequency technical terms (idempotent,
race condition, N+1, backpressure, etc.) that gstack glosses on first
use in tier-≥2 skill output. Baked into generated SKILL.md prose at
gen-skill-docs time. Terms not on this list are assumed plain-English
enough. Contributions via PR.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(preamble): V1 Writing Style section + EXPLAIN_LEVEL echo + migration prompt
Adds a new Writing Style section to tier-≥2 preamble output composing with
the existing AskUserQuestion Format section. Six rules: jargon glossed on
first use per skill invocation (from scripts/jargon-list.json), outcome-
framed questions, short sentences, decisions close with user impact,
gloss-on-first-use even if user pasted term, user-turn override for "be
terse" requests. Baked conditionally (skip if EXPLAIN_LEVEL: terse).
Adds EXPLAIN_LEVEL preamble echo using \${binDir} (host-portable matching
V0 QUESTION_TUNING pattern). Adds WRITING_STYLE_PENDING echo reading a
flag file written by the V0→V1 upgrade migration; on first post-upgrade
skill run, the agent fires a one-time AskUserQuestion offering terse mode.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(gstack-config): validate explain_level + document in header
Adds explain_level: default|terse to the annotated config header with
a one-line description. Whitelists valid values; on set of an unknown
value, prints a specific warning ("explain_level '\$VALUE' not
recognized. Valid values: default, terse. Using default.") and writes
the default value. Matches V1 preamble's EXPLAIN_LEVEL echo expectation.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat: V1 upgrade migration — writing-style opt-out prompt
New migration script following existing v0.15.2.0.sh / v0.16.2.0.sh
pattern. Writes a .writing-style-prompt-pending flag file on first run
post-upgrade. The preamble's migration-prompt block reads the flag and
fires a one-time AskUserQuestion offering the user a choice between
the new default writing style and restoring V0 prose via
\`gstack-config set explain_level terse\`. Idempotent via flag files;
if the user has already set explain_level explicitly, counts as
answered and skips.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat: LOC reframe tooling — throughput comparison + README updater + scc installer
Three new scripts:
- scripts/garry-output-comparison.ts — enumerates Garry-authored commits
in 2013 + 2026 on public repos, extracts ADDED lines from git diff,
classifies as logical SLOC via scc --stdin (regex fallback if scc
missing). Writes docs/throughput-2013-vs-2026.json with per-language
breakdown + explicit caveats (public repos only, commit-style drift,
private-work exclusion).
- scripts/update-readme-throughput.ts — reads the JSON if present,
replaces the README's <!-- GSTACK-THROUGHPUT-PLACEHOLDER --> anchor
with the computed multiple (preserving the anchor for future runs).
If JSON missing, writes GSTACK-THROUGHPUT-PENDING marker that CI
rejects — forcing the build to run before commit.
- scripts/setup-scc.sh — standalone OS-detecting installer for scc.
Not a package.json dependency (95% of users never run throughput).
Brew on macOS, apt on Linux, GitHub releases link on Windows.
Two-string anchor pattern (PLACEHOLDER vs PENDING) prevents the
pipeline from destroying its own update path.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(retro): surface logical SLOC + weighted commits above raw LOC
V1 reorders the /retro summary table to lead with features shipped,
then commits + weighted commits (commits × files-touched capped at 20),
then PRs merged, then logical SLOC added as the primary code-volume
metric. Raw LOC stays present but is demoted to context. Rationale
inline in the template: ten lines of a good fix is not less shipping
than ten thousand lines of scaffold.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs(v1): README hero reframe + writing-style + CHANGELOG + version bump to 1.0.0.0
README.md:
- Hero removes "600,000+ lines of production code" framing; replaces
with the computed 2013-vs-2026 pro-rata multiple (via
<!-- GSTACK-THROUGHPUT-PLACEHOLDER --> anchor, filled by the
update-readme-throughput build step).
- Hiring callout: "ship real products at AI-coding speed" instead of
"10K+ LOC/day."
- New Writing Style section (~80 words) between Quick start and
Install: "v1 prompts = simpler" framing, outcome-language example,
terse-mode opt-out, pointer to /plan-tune.
CLAUDE.md: one-paragraph Writing style (V1) note under project
conventions, linking to preamble resolver + V1 design docs.
CHANGELOG.md: V1 entry on top of v0.19.0.0 with user-facing narrative
(what changes, how to opt out, for-contributors notes). Mentions
scope reduction — pacing overhaul ships in V1.1.
CONTRIBUTING.md: one-paragraph note on jargon-list.json maintenance
(PR to add/remove terms; regenerate via gen:skill-docs).
VERSION + package.json: bump to 1.0.0.0.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* chore: regenerate SKILL.md files + golden fixtures for V1
Mechanical regeneration from the updated templates in prior commits:
- Writing Style section now appears in tier-≥2 skill output.
- EXPLAIN_LEVEL + WRITING_STYLE_PENDING echoes in preamble bash.
- V1 migration-prompt block fires conditionally on first upgrade.
- Jargon list inlined into preamble prose at gen time.
- Retro template's logical SLOC + weighted commits order applied.
Regenerated for all 8 hosts via bun run gen:skill-docs --host all.
Golden ship-skill fixtures refreshed from regenerated outputs.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* test: V1 gate coverage — writing-style resolver + config + jargon + migration + dormancy
Six new gate-tier test files:
- test/writing-style-resolver.test.ts — asserts Writing Style section
is injected into tier-≥2 preamble, all 6 rules present, jargon list
inlined, terse-mode gate condition present, Codex output uses
\$GSTACK_BIN (not ~/.claude/), tier-1 does NOT get the section,
migration-prompt block present.
- test/explain-level-config.test.ts — gstack-config set/get round-trip
for default + terse, unknown-value warns + defaults to default,
header documents the key, round-trip across set→set→get.
- test/jargon-list.test.ts — shape + ~50 terms + no duplicates
(case-insensitive) + includes canonical high-signal terms.
- test/v0-dormancy.test.ts — 5D dimension names + archetype names
forbidden in default-mode tier-≥2 SKILL.md output, except for
plan-tune and office-hours where they're load-bearing.
- test/readme-throughput.test.ts — script replaces anchor with number
on happy path, writes PENDING marker when JSON missing, CI gate
asserts committed README contains no PENDING string.
- test/upgrade-migration-v1.test.ts — fresh run writes pending flag,
idempotent after user-answered, pre-existing explain_level counts
as answered.
All 95 V1 test-expect() calls pass. Full suite: 0 failures.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat: compute real 2013-vs-2026 throughput multiple (130.2×)
Ran scripts/garry-output-comparison.ts across all 15 public garrytan/*
repos. Aggregated results into docs/throughput-2013-vs-2026.json and
ran scripts/update-readme-throughput.ts to replace the README placeholder.
2013 public activity: 2 commits, 2,384 logical lines added across 1
week, in 1 repo (zurb-foundation-wysihtml5 upstream contribution).
2026 public activity: 279 commits, 310,484 logical lines added across
17 active weeks, in 3 repos (gbrain, gstack, resend_robot).
Multiples (public repos only, apples-to-apples):
- Logical SLOC: 130.2×
- Commits per active week: 8.2×
- Raw lines added: 134.4×
Private work at both eras (2013 Bookface at YC, Posterous-era code,
2026 internal tools) is excluded from this comparison.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat: 207× throughput multiple (with private repos + Bookface)
Re-ran scripts/garry-output-comparison.ts across all 41 repos under
garrytan/* (15 public + 26 private), including Bookface (YC's internal
social network, 2013-era work).
2013 activity: 71 commits, 5,143 logical lines, 4 active repos
(bookface, delicounter, tandong, zurb-foundation-wysihtml5)
2026 activity: 350 commits, 1,064,818 logical lines, 15 active repos
(gbrain, gstack, gbrowser, tax-app, kumo, tenjin, autoemail, kitsune,
easy-chromium-compiles, conductor-playground, garryslist-agent, baku,
gstack-website, resend_robot, garryslist-brain)
Multiples:
- Logical SLOC: 207× (up from 130.2× when including private work)
- Raw lines: 223×
- Commits/active-week: 3.4×
Stopped committing docs/throughput-2013-vs-2026.json — analysis is a
local artifact, not repo state. Added docs/throughput-*.json to
.gitignore. Full markdown analysis at ~/throughput-analysis-2026-04-18.md
(local-only). README multiple is now hardcoded; re-run the script and
edit manually when you want to refresh it.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs: run rate vs year-to-date throughput comparison
Two separate numbers in the README hero:
- Run rate: ~700× (9,859 logical lines/day in 2026 vs 14/day in 2013)
- Year-to-date: 207× (2026 through April 18 already exceeds 2013 full
year by 207×)
Previous "207× pro-rata" framing mixed full-year 2013 vs partial-year
2026. Run rate is the apples-to-apples normalization; YTD is the
"already produced" total. Both are honest; both are compelling; they
measure different things.
Analysis at ~/throughput-analysis-2026-04-18.md (local-only).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(throughput): script natively computes to-date + run-rate multiples
Enhanced scripts/garry-output-comparison.ts so both calculations come
out of a single run instead of being reassembled ad-hoc in bash:
PerYearResult now includes:
- days_elapsed — 365 for past years, day-of-year for current
- is_partial — flags the current (in-progress) year
- per_day_rate — logical/raw/commits normalized by calendar day
- annualized_projection — per_day_rate × 365
Output JSON's `multiples` now has two sibling blocks:
- multiples.to_date — raw volume ratios (2026-YTD / 2013-full-year)
- multiples.run_rate — per-day pace ratios (apples-to-apples)
Back-compat: multiples.logical_lines_added still aliases to_date for
older consumers reading the JSON.
Updated README hero to cite both (picking up brain/* repo that was
missed in the earlier aggregation pass):
2026 run rate: ~880× my 2013 pace (12,382 vs 14 logical lines/day)
2026 YTD: 260× the entire 2013 year
Stderr summary now prints both multiples at the end of each run.
Full analysis at ~/throughput-analysis-2026-04-18.md (local-only).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs: ON_THE_LOC_CONTROVERSY methodology post + README link
Long-form response to the "LOC is a meaningless vanity metric" critique.
Covers:
- The three branches of the LOC critique and which are right
- Why logical SLOC (NCLOC) beats raw LOC as the honest measurement
- Full method: author-scoped git diff, regex-classified added lines,
aggregated across 41 public + private garrytan/* repos
- Both calculations: to-date (260x) and run-rate (879x)
- Steelman of the critics (greenfield-vs-maintenance, survivorship bias,
quality-adjusted productivity, time-to-first-user)
- Reproduction instructions
Linked from README hero via a blockquote directly below the number.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* exclude: tax-app from throughput analysis (import-dominated history)
tax-app's history is one commit of 104K logical lines — an initial
import of a codebase, not authored work. Removing it to keep the
comparison honest.
Changes:
- scripts/garry-output-comparison.ts: added EXCLUDED_REPOS constant
with tax-app + a one-line rationale. The script now skips excluded
repos with a stderr note and deletes any stale output JSON so
aggregation loops don't pick up pre-exclusion numbers.
- README hero: updated to 810× run rate + 240× YTD (were 880×/260×).
Wording updated to "40 public + private repos ... after excluding
repos dominated by imported code."
- docs/ON_THE_LOC_CONTROVERSY.md: updated all numbers, added an
"Exclusions" paragraph explaining tax-app, removed tax-app from
the "shipped not WIP" example list.
New numbers (2026 through day 108, without tax-app):
- To-date: 240× logical SLOC (1,233,062 vs 5,143)
- Run rate: 810× per-day pace (11,417 vs 14 logical/day)
- Annualized: ~4.2M logical lines projected
Future re-runs automatically skip tax-app. Add more exclusions to
EXCLUDED_REPOS at the top of the script with a one-line rationale.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix: correct tax-app exclusion rationale
tax-app is a demo app I built for an upcoming YC channel video,
not an "import-dominated history" as the previous commit claimed.
Excluded because it's not production shipping work, not because
of an import commit.
Updated rationale in scripts/garry-output-comparison.ts's
EXCLUDED_REPOS constant, in docs/ON_THE_LOC_CONTROVERSY.md's
method section + conclusion, and in the README hero wording
("one demo repo" vs the earlier "repos dominated by imported code").
Numbers unchanged — the exclusion itself is the same, just the
reason.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs: harden ON_THE_LOC_CONTROVERSY against Cramer + neckbeard critiques
Reframes the thesis as "engineers can fly now" (amplification, not
replacement) and fortifies the soft spots critics will attack.
Added:
- Flight-thesis opener: pilot vs walker, leverage not replacement.
- Second deflation layer for AI verbosity (on top of NCLOC). Headline
moves from 810x to 408x after generous 2x AI-boilerplate cut, with
explicit sensitivity analysis showing the number is still large under
pessimistic priors (5x → 162x, 10x → 81x, 100x impossible).
- Weekly distribution check (kills "you had one burst week" attack).
- Revert rate (2.0%) and post-merge fix rate (6.3%) with OSS
comparables (K8s/Rails/Django band). Addresses "where are your error
rates" directly.
- Named production adoption signals (gstack 1000+ installs, gbrain beta,
resend_robot paying API) with explicit concession that "shipped != used
at scale" for most of the corpus.
- Harder steelman: 5 specific concessions with quantified pivot points
(e.g., "if 2013 baseline was 3.5x higher, 810x → 228x, still high").
Removed factual error: Posterous acquisition paragraph (Garry had already
left Posterous by 2011, so the "Twitter bought our private repos" excuse
for the 2013 corpus gap doesn't apply).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs: update gstack/gbrain adoption numbers in LOC controversy post
gstack: "1,000+ distinct project installations" → "tens of thousands of
daily active users" (telemetry-reported, community tier, opt-in).
gbrain: "small set of beta testers" → "hundreds of beta testers running
it live."
Both are the accurate current numbers. The concession paragraph below
(about shipped != adopted at scale for the long-tail repos) still reads
correctly since it's about the corpus as a whole, not gstack/gbrain
specifically.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs: reframe reproducibility note as OSS breakout flex
"You'd need access to my private repos" → "Bookface and Posthaven are
private, but gstack and gbrain are open-sourced with tens of thousands
of GitHub stars and tens of thousands of confirmed regular users, among
the most-used OSS projects in the world that didn't exist three months
ago."
Keeps the `gh repo list` command at the end for the actual
reproducibility instruction.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Rewrite LOC controversy post
- Lead with concession (LOC is garbage, do the math anyway)
- Preempt 14 lines/day meme with historical baselines (Brooks, Jones, McConnell)
- Remove 'neckbeard' language throughout
- Add slop-scan story (Ben Vinegar, 5.24 → 1.96, 62% cut)
- David Cramer GUnit joke
- Add testing philosophy section (the real unlock)
- ASCII weekly distribution chart
- gstack telemetry section with real numbers (15K installs, 305K invocations, 95.2% success)
- Top skills usage chart
- Pick-your-priors paragraph moved earlier (the killer)
- Sharper close: run the script, show me your numbers
* docs: four precision fixes on LOC controversy post
1. Citation fix. Kernighan didn't say anything about LOC-as-metric
(that's the famous "aircraft building by weight" quote, commonly
misattributed but actually Bill Gates). Replaced "Kernighan implied
it before that" with the real Dijkstra quote ("lines produced" vs
"lines spent" from EWD1036, with direct link) + the Gates quote.
Verified via web search.
2. Slop-scan direction clarified. "(highest on his benchmark)" was
ambiguous — could read as a brag. Now: "Higher score = more slop.
He ran it on gstack and we scored 5.24, the worst he'd measured
at the time." Then the 62% cut lands as an actual win.
3. Prose/chart skill-usage ordering now matches. Added /plan-eng-review
(28,014) to the prose list so it doesn't conflict with the chart
below it.
4. Cut the "David — I owe you one / GUnit" insider joke. Most readers
won't connect Cramer → Sentry → GUnit naming. Ends the slop-scan
paragraph on the stronger line: "Run `bun test` and watch 2,000+
tests pass."
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs: tighten four LOC post citations to match primary sources
1. Bill Gates quote: flagged as folklore-grade. Was "Bill Gates put it
more memorably" (firm attribution). Now "The old line (widely
attributed to Bill Gates, sourcing murky) puts it more memorably."
The quote stands; honesty about attribution avoids the same
misattribution trap we just fixed for Kernighan.
2. Capers Jones: "15-50 across thousands of projects" → "roughly 16-38
LOC/day across thousands of projects" — matches his actual published
measurements (which also report as 325-750 LOC/month).
3. Steve McConnell: "10-50 for finished, tested, delivered code" was
folklore. Replaced with his actual project-size-dependent range from
Code Complete: "20-125 LOC/day for small projects (10K LOC) down to
1.5-25 for large projects (10M LOC) — it's size-dependent, not a
single number."
4. Revert rate comparison: "Kubernetes, Rails, and Django historically
run 1.5-3%" was unsourced. Replaced with "mature OSS codebases
typically run 1-3%" + "run the same command on whatever you consider
the bar and compare." No false specificity about which repos.
Net: every quantitative citation in the post now matches primary-source
figures or is explicitly flagged as folklore. Neckbeards can verify.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs: drop Writing style section from README
Was sitting in prime real estate between Quick start and Install —
internal implementation detail, not something users need up-front.
Existing coverage is enough:
- Upgrade migration prompt notifies users on first post-upgrade run
- CLAUDE.md has the contributor note
- docs/designs/PLAN_TUNING_V1.md has the full design
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs: collapse team-mode setup into one paste-and-go command
Step 2 was three separate code blocks: setup --team, then team-init,
then git add/commit. Mirrors Step 1's style now — one shell one-liner
that does all three. Subshell (cd && ./setup --team) keeps the user
in their repo pwd so team-init + git commit land in the right place.
"Swap required for optional" moved to a one-liner below.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs: move full-clone footnote from README to CONTRIBUTING
The "Contributing or need full history?" note is for contributors, not
for someone following the README install flow. Moved into CONTRIBUTING's
Quick start section where it fits next to the existing clone command,
with a tip to upgrade an existing shallow clone via
\`git fetch --unshallow\`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: root <root@localhost>
|