mirror of https://github.com/garrytan/gstack.git
4 Commits
| Author | SHA1 | Message | Date |
|---|---|---|---|
|
|
a6fb31726c
|
v1.48.0.0 feat: AskUserQuestion split rule + runtime AUTO_DECIDE carve-out (#1740)
* feat(preamble): add "Handling 5+ options — split, never drop" rule Agents repeatedly hit Conductor's 4-option AskUserQuestion cap and silently drop one option to fit, shrinking the user's decision space. This rule names the bug and gives two compliant shapes: batch into ≤4-groups (for coherent alternatives) or split into N sequential per-option calls (for independent scope items, default). Inline preamble subsection is ~15 lines (rule + buckets + pointer). Full reference with worked examples, Hold/dependency semantics, and final-summary validation lives in docs/askuserquestion-split.md. The agent loads the docs file on demand when N>4. Per-option call shape: D<N>.k header, ELI10, Recommendation, kind-note (no completeness score — decision actions, not coverage), Include / Defer / Cut / Hold buckets. Hold stops the chain immediately; the final D<N>.final call validates dependencies and confirms the assembled scope. question_ids: <skill>-split-<option-slug> (kebab-case ASCII, ≤64 chars). Also fixes orphan "12. " prefix on the existing CJK rule. Tier-2+ skills inherit via the existing resolver. SKILL.md regenerated for all 41 affected skills + 3 golden fixtures. Net diff per SKILL.md: ~34 lines (vs ~110 for the full inline version). 6 tests pin the inline contract (4-option cap, buckets, D-numbering, docs pointer, runtime AUTO_DECIDE gate reference, orphan 12 regression). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(question-pref): runtime AUTO_DECIDE carve-out for *-split-* ids Split chains (per-option AskUserQuestion calls emitted by the new "Handling 5+ options" rule) must never be silently auto-approved via /plan-tune preferences. The user's option set is sacred. Layer 1 (mechanism): unique <skill>-split-<option-slug> ids prevent cross-option preference leakage. Layer 2 (this commit): the runtime checker `gstack-question-preference --check` detects any id matching *-split-* and forces ASK_NORMALLY even when never-ask or ask-only-for-one-way preferences exist for that exact id. An explanatory note tells the user their preference was bypassed and why. 7 tests pin the carve-out: no-pref baseline, never-ask override, explanatory note text, ask-only-for-one-way override, always-ask (no note), non-split id containing "split" word (negative case for regex specificity), multi-skill split id formats. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(e2e): split-overflow regression for /plan-ceo-review Periodic-tier E2E test that catches the original failure mode the user complained about: 5+ options for ONE decision must split into N sequential AskUserQuestion calls, not drop one to fit Conductor's 4-option cap. Fixture: 5 independent chat-platform integration candidates (Slack/Discord/Teams/Telegram/Mattermost), each carrying its own include/defer/cut decision. Floor = 4 review-phase AUQs (standard [N-1] tolerance band). Pre-fix "drop to 4 + 1 dropped" fails this floor. Wired into test/helpers/touchfiles.ts: tier periodic, depends on plan-ceo-review/**, the new preamble subsection, the question-pref binary (for the carve-out), and the runner helper. touchfiles.test.ts expected count bumped 21 → 22 to account for the new entry. Cost: ~$0.30/run when EVALS_TIER=periodic. Skips silently otherwise. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore: post-merge regen + rebase size-budget baseline to v1.47.0.0 After merging origin/main (v1.45 → v1.47), three things needed cleanup: 1. spec/SKILL.md (main's new skill) regenerated to include our split-vs-drop preamble subsection — same mechanical regen as the other 41 tier-2+ skills. 2. Three golden ship fixtures refreshed to capture main's GSTACK_PLAN_MODE block + /spec routing entry + jargon-list.json refactor. 3. docs/skills.md — added /spec table row that main's PR (#1698/#1733) shipped without. Pre-existing failure on main; this PR catches and fixes. Also rebased test/skill-size-budget.test.ts from v1.44.1 → v1.47.0.0 baseline. Main's v1.46 (catalog tokens trim) + v1.47 (/spec skill) pushed the v1.44.1 anchor past the 5% ratchet to ×1.059 — pre-existing failure on main. This PR captures a fresh parity-baseline-v1.47.0.0.json and re-anchors the test there. Historical v1.44.1.json and v1.46.0.0.json retained in test/fixtures/ for reference. Our subsection contributes ~0.1% of the post-rebase corpus. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore: bump version and changelog (v1.48.0.0) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
|
|
|
f8bb59094d
|
v1.47.0.0 feat: /spec — author backlog-ready spec in 5 phases + optional agent spawn (#1698) (#1733)
* feat(issue): add /issue skill for backlog-ready GitHub issue authoring
Interrogates an ambiguous request through five strict phases (why, scope,
technical, draft, final) and produces a GitHub issue precise enough that an
unfamiliar engineer or AI agent can execute it without follow-up. Slots in
after /office-hours (when the idea has passed the "worth building" bar) and
before /plan-eng-review (which assumes a plan already exists).
- issue/SKILL.md.tmpl + generated SKILL.md
- routing entry in root SKILL.md.tmpl
- llms.txt regenerated to include the new skill
* chore(spec): rename /issue → /spec + fix duplicate analytics block
Foundation commit for the /spec skill (extends PR #1698 by @jayzalowitz).
- Renames issue/ → spec/ (template + generated)
- Removes the hand-rolled analytics block in spec/SKILL.md.tmpl (lines 46-49 of the original); {{PREAMBLE}} already emits the analytics write with the telemetry opt-out guard, so the duplicate would have bypassed gstack-config set telemetry off
- Updates frontmatter (name: spec, expanded description with magical-moment preview, triggers reordered to lead with "spec this out")
- Updates root SKILL.md.tmpl routing entry → /spec
- Regenerates spec/SKILL.md and gstack/llms.txt via bun run gen:skill-docs
Co-Authored-By: Jay Zalowitz <jayzalowitz@gmail.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(spec): expansions — flags, archive, quality gate, plan-mode-aware Phase 5, /ship integration, tests
Builds on the @jayzalowitz foundation (commit
|
|
|
|
22f8c7f4e1
|
v1.46.0.0 feat: gstack v2 foundation — catalog tokens drop 56%, eval-first floor covers all 51 skills (#1712)
* docs(designs): add v2_PLAN.md — gstack v2 the lightest opinionated skill pack
The approved plan from /plan-ceo-review → /plan-eng-review → /codex×2 →
/plan-devex-review. Captures the v1.45/v2.0 hybrid release shape,
cathedral parity-eval suite, sequential v1.45 execution, sections/*.md.tmpl
pipeline, EVALS_BUDGET_HARD_CAP override path, and v2 launch copy specs.
This commit just lands the design doc. Implementation follows in the rest
of the v1.45.0.0 branch.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* test(parity): T0a — capture v1.44.1 baseline + capture helper + diff utility
Cathedral parity-eval suite primitive. captureBaseline() walks every
top-level SKILL.md and records bytes, lines, estimated tokens, frontmatter
description length, and eval coverage. diffBaselines() reports per-skill
delta + total corpus delta + catalog tokens delta.
Locks the v1.44.1 reference snapshot at test/fixtures/parity-baseline-v1.44.1.json.
After Phase A+B+C land, scripts/capture-baseline.ts --tag v1.45.0.0 produces
a comparable snapshot; diff supplies the real numbers the v2 CHANGELOG quotes.
Never invent baseline numbers; ship them only if they came from a real run.
v1.44.1 numbers captured this commit:
- 51 skills
- 2,847 KB total corpus
- ~9,319 catalog tokens (sum of description bytes / 4)
- top 3: ship 160 KB, plan-ceo-review 128 KB, office-hours 108 KB
Test plan:
- bun test test/helpers/capture-parity-baseline.test.ts passes 4/4
- The baseline JSON file is committed so reviewers can audit v1→v2 numbers
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(resolvers): T2 — ResolverEntry + appliesTo gate infrastructure
Adds the conditional-resolver-injection plumbing from the v2_PLAN A.1
step. Resolvers can now be either a bare ResolverFn (always fires, current
behavior) or a ResolverEntry { resolve, appliesTo? } (gated; appliesTo
returning false skips the resolver, substitutes empty string).
Why infrastructure-only: the audit during T0a confirmed most resolvers
don't need gating. The {{NAME}} placeholder system is already conditional
at the template level — a resolver only fires for skills that reference it.
The gate is for future use when a placeholder's audience needs a structural
guardrail beyond social convention, or when a sub-resolver inside a larger
composed resolver (e.g. preamble) needs per-skill skip.
scripts/gen-skill-docs.ts:444 now uses unwrapResolver() to handle both
shapes. RESOLVERS map signature widens from Record<string, ResolverFn>
to Record<string, ResolverValue>. All existing resolvers stay bare
functions and work unchanged.
Test plan:
- bun test test/resolver-entry.test.ts: 6 pass (gate plumbing + registry)
- bun test test/gen-skill-docs.test.ts: 389 pass (no regression)
- bun run gen:skill-docs --dry-run: all SKILL.md files FRESH (no diff)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(preamble): T3 — jargon dedup + terse-build flag (Phase A.2 + A.3)
A.2 jargon dedup: generate-writing-style.ts replaces the inlined 80-term
jargon list with a one-line pointer to scripts/jargon-list.json. The list
was duplicated into every tier-2+ skill (48 of 51 skills); inlining cost
was ~1.5 KB × 48 = ~70 KB across the corpus. Pointer cost is ~30 bytes per
skill. Agents Read the JSON once per session on first jargon term
encountered; thereafter the terms array is the canonical reference.
A.3 terse build flag: --explain-level=terse compresses preamble prose at
gen time. When the flag is set, writing-style collapses to a one-line
terse directive and completeness-section + confusion-protocol +
context-health are dropped entirely. The default build keeps the
runtime-conditional behavior intact (sections still render; the model
skips them when EXPLAIN_LEVEL: terse appears in the preamble echo). Terse
build is opt-in for users who want shipped skills to match their runtime
preference and avoid the per-session terse-mode dead prose.
TemplateContext gains an optional `explainLevel: 'default' | 'terse'`
field. Default builds set it to 'default'; --explain-level=terse sets
'terse'. Resolvers gate their output via `ctx?.explainLevel === 'terse'`.
Measured impact (default build, post-T3):
- Total corpus: 2,847 KB → 2,812 KB (saved 35 KB)
- ship.md: 160 → 159 KB
- plan-ceo-review.md: 128 → 127 KB
- Top 10 heaviest: all slightly smaller from jargon pointer
Larger compression lands in T4 (catalog trim) and T7 (atomic regen across
the full Phase A pipeline). The terse build path further compresses to
~711K tokens vs default ~725K (saved ~14K tokens corpus-wide).
Test plan:
- bun test test/gen-skill-docs.test.ts: 389 pass (no regression)
- bun test test/resolver-entry.test.ts: 6 pass
- bun test test/helpers/capture-parity-baseline.test.ts: 4 pass
- bun run gen:skill-docs --explain-level=terse: ship.md drops completeness +
confusion-protocol + context-health sections; writing-style collapses to
one-line terse directive
48 SKILL.md files updated (every tier-2+ skill picks up the jargon pointer).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(catalog): T4 — catalog trim + proactive-suggestions.json (Phase A.4)
Shortens frontmatter `description:` in every Claude SKILL.md to a single
lead sentence + (gstack) tag. The routing prose ("Use when asked to...",
"Proactively suggest...") and voice triggers move to a "## When to invoke"
body section so they remain discoverable inside the skill. A per-run
registry at scripts/proactive-suggestions.json aggregates the routing/
voice text for all 52 skills so agents can pull guidance on demand
without paying for it in the always-loaded catalog.
Build flag --catalog-mode=full restores v1.44 legacy behavior (full
multi-line descriptions in frontmatter). Default is trim.
splitCatalogDescription() extracts: lead sentence, routing paragraphs,
voice-triggers line, (gstack) tag presence. Short descriptions (<120
chars, already trimmed) are skipped via a guard so re-runs are idempotent.
Measured impact (vs v1.44.1 baseline):
- Catalog tokens (sum of description bytes / 4): 9,319 → 4,045 (-56.6%)
- Total SKILL.md corpus bytes: 2,915 KB → 2,880 KB (-1.2%)
- Routing prose preserved as in-skill "## When to invoke" sections
- 52 skill entries in scripts/proactive-suggestions.json (on-demand registry)
The corpus drop is small because catalog trim MOVES text from frontmatter
to body, it doesn't delete it. The headline win is the catalog: the
always-loaded system prompt surface drops by more than half.
Test plan:
- bun test test/gen-skill-docs.test.ts: 389 pass, 0 fail
- Manual: ship/SKILL.md frontmatter description is now ONE line ending
with `(gstack)`; allowed-tools field on next line (YAML well-formed)
- Manual: scripts/proactive-suggestions.json contains 52 entries
- bun run gen:skill-docs --catalog-mode=full restores legacy behavior
53 files changed (52 SKILL.md across hosts + the new proactive-suggestions.json).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* test(budget): T5 — hard token budgets + override audit trail (Phase A.6)
Two new gate-tier guardrails for the v1.45.0.0 compression baseline:
1. test/skill-size-budget.test.ts (NEW) — per-skill SKILL.md size budget.
Compares current state to test/fixtures/parity-baseline-v1.44.1.json.
Three checks: per-skill (×1.05 default ratio), total corpus, and
catalog token estimate (≤7000 for v1.45). The per-skill ratio is 1.05
not 1.0 because the T4 catalog trim moves text from frontmatter to a
body section; small skills see a tiny body growth that's fine when
offset by the much larger catalog-token win.
2. test/skill-budget-regression.test.ts EXTENDED — hard dollar cap on
per-run eval cost. Per-tier defaults: gate $25, periodic $70. Umbrella
EVALS_BUDGET_HARD_CAP=$30. Catches runaway eval costs (infinite retry,
model price changes) before they amortize across PRs.
Both checks support an override path with audit trail:
GSTACK_SIZE_BUDGET_OVERRIDE_REASON="why this is OK" — size
EVALS_BUDGET_OVERRIDE_REASON="why this is OK" — cost
Overrides log to ~/.gstack/analytics/spend-overrides.jsonl with
timestamp + scope + reason + CI provenance (runner, branch, commit)
via test/helpers/budget-override.ts.
Why the override audit: a hard cap with no escape valve becomes
operationally hostile (legit price changes, longer transcripts, new
required evals can all blow the cap). An override with no audit becomes
"everyone overrides everything and the gate is theater." This module
ships the audit half so reviewers can see what was waived and why.
Codex 2nd-pass critique #3 absorbed: per-suite caps + override path with
auditability + budget baselines checked into repo (parity-baseline-v1.44.1.json
already in test/fixtures/).
Test plan:
- bun test test/skill-size-budget.test.ts: 4 pass (per-skill, corpus, catalog, baseline-exists)
- bun test test/skill-budget-regression.test.ts: 4 pass (2 existing ratio checks + 2 new hard-cap checks)
- Existing eval runs ($14.11 e2e, $0.02 llm-judge) sit well under the new caps
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* test(cso): T6 — pin must-preserve security phrases (Phase A.5)
cso/SKILL.md is a content-heavy security audit skill (75 KB after T3+T4).
Codex 2nd-pass critique #9: "cso exemption too broad ... should still get
resolver dedup, catalog trim, sectioning if safe, and targeted evals
around must-not-miss checks."
T3 (jargon dedup) and T4 (catalog trim) already applied to cso the same
way they applied to every other skill — confirmed by inspection:
- jargon list NOT inlined (0 inline term lines)
- catalog description trimmed to one line (74 bytes vs 774 bytes baseline)
- "## When to invoke" body section present
T6 work: lock in the security-prose preservation via a gate-tier test
that fails CI if future compression strips load-bearing phrases:
- OWASP, STRIDE positioning
- daily / comprehensive mode discipline
- confidence scoring language
- active verification ("verif" prefix catches verify/verified/verification)
- ## Preamble heading (preamble resolver still fires)
Also guards cso against accidental over-stripping: SKILL.md must stay
≥30 KB (currently 75 KB) — a sudden cliff would mean compression went
past the targeted-dedup line into structural removal.
No structural change to cso. Future Phase B sections/ work for cso
requires writing baseline parity tests FIRST per the v2_PLAN.md
sequencing.
Test plan:
- bun test test/cso-preserved.test.ts: 5 pass
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* test(parity): T0b — cathedral parity-suite harness + invariant registry
Adds the harness that the v2_PLAN.md cathedral parity-eval suite is built
on. Compares CURRENT SKILL.md output to v1.44.1 baseline along three axes:
STRUCTURE frontmatter shape (catalog trim landed, "## When to invoke" present)
CONTENT must-preserve phrases per skill family (cso: OWASP/STRIDE;
plan-ceo: SCOPE EXPANSION/HOLD SCOPE/REDUCTION; ship:
VERSION/CHANGELOG/PR; etc.)
SIZE per-skill byte budget (maxSizeRatio + minBytes guards)
PARITY_INVARIANTS registry pins 10 load-bearing skills (cso, ship, plan-*-
review, review, qa, investigate, office-hours, autoplan). Each entry
declares what must NOT regress; future compression that strips these
phrases or shrinks a skill past its minBytes cliff fails CI.
Periodic-tier LLM-judge parity (paid, ~$0.20/skill) lands in v2.0.0.0
sections/ phase. Same registry, same harness, judge added on top.
Test plan:
- bun test test/parity-suite.test.ts: 10/10 invariants pass vs v1.44.1
- Per-skill failures get actionable per-line breakdown so a reviewer can
see which phrase / heading / size limit went sideways
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* test(coverage): T1 — skill coverage matrix + structural-compliance floor
Phase 0 deliverable — eval-first foundation. Two new test files plus the
registry:
1. test/skill-coverage-matrix.ts — single source of truth mapping each
skill to its gate-tier + periodic-tier test files. SKILL_COVERAGE
record with 51 entries; every gstack skill on disk has at least one
gate-tier entry.
2. test/skill-coverage-matrix.test.ts — CI gate. Asserts every skill on
disk has a registry entry AND that gate[] is non-empty. Catches
"skill added but eval not registered" the moment a new SKILL.md
lands.
3. test/skill-coverage-floor.test.ts — per-skill structural compliance
(FREE, file-IO only). For each of 51 skills, verifies:
- SKILL.md exists
- Frontmatter well-formed (name + description fields)
- Catalog-trim contract (inline description ≤ 250 chars, or block form)
- Generated header present (edit .tmpl, not .md)
- Body ≥ 200 bytes (non-trivial content)
- No unresolved {{TEMPLATE}} placeholders leaked
The "floor" is the minimum eval that every skill ships with. Skills that
need deeper behavioral testing get additional entries in their coverage
record (e.g., ship has skill-e2e-ship-idempotency + workflow + floor).
Future skills only need to add the floor entry and the matrix gate
unblocks them.
Codex 2nd-pass critique #1 mitigation: eval-first floor is structural
compliance (the testable part) — judgment-skill behavior gets layered
periodic-tier evals on top. We don't pretend the floor proves
correctness, only that the skill structurally compiles.
Test plan:
- bun test test/skill-coverage-matrix.test.ts: 4 pass (matrix shape + coverage)
- bun test test/skill-coverage-floor.test.ts: 309 pass (6 checks × 51 skills + 3 registry-level)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* build(skills): T7 — atomic regenerate + capture v1.45.0.0 baseline
Final regen pass across all hosts after T1-T6 work landed. Captures the
v1.45.0.0 parity baseline at test/fixtures/parity-baseline-v1.45.0.0.json
for diffing against the v1.44.1 reference.
Measured deltas (real numbers from test/helpers/capture-parity-baseline.ts):
Total SKILL.md corpus 2,847 KB → 2,813 KB (-1.2%)
Catalog tokens (always-loaded) ~9,319 → ~4,045 tokens (-56.6%)
Top 10 heaviest skills 0.5-1.0% drop each
The catalog token cut is the headline. It's the always-loaded surface,
i.e. tokens charged on every session start. Per-skill SKILL.md sizes
barely moved because T4 catalog trim MOVES routing prose from frontmatter
to a body "## When to invoke" section rather than deleting it — the
catalog wins without amputating discoverability.
The bigger per-skill compression lands in v2.0.0.0 (Phase B sections/
pattern on the 5 heavyweights). v1.45 is the foundation: eval-first
infrastructure + cheap wins.
scripts/proactive-suggestions.json regenerated with the latest 52 skills
listed (one-time write per gen-skill-docs run; aggregated catalog parts).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* v1.45.0.0 — gstack v2 foundation: catalog tokens drop 56%, eval-first floor
Bumps VERSION + package.json to 1.45.0.0. CHANGELOG entry covers what
shipped between v1.44.1 and this release: the cathedral parity-eval
foundation, conditional resolver injection plumbing, jargon dedup, terse
build flag, catalog trim with one-line frontmatter descriptions, hard
token + dollar budget gates with override audit, cso preservation pins,
and the v1.44.1 ↔ v1.45.0.0 parity baselines committed to test/fixtures/.
Numbers (measured, not estimated):
- Catalog tokens: ~9,319 → ~4,045 (-56.6%)
- Total corpus: 2,847 KB → 2,813 KB (-1.2%)
- Skills with gate-tier eval coverage: 32/51 → 51/51 (floor achieved)
This is the foundation release. v2.0.0.0 will ship the architectural
break (sections/*.md.tmpl pattern + mechanical Read enforcement +
eval-coverage annotations) as a coordinated marketing-grade launch.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* chore(catalog): refresh proactive-suggestions.json timestamp after v1.45 bump
The generated_at field updates on every gen-skill-docs run; this is the
T7 atomic-regenerate output landed alongside the v1.45.0.0 bump.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(catalog): deterministic proactive-suggestions.json (no per-run timestamp)
Original implementation wrote a generated_at timestamp on every gen-skill-docs
run. That made CI dry-run freshness checks flap because the file changed on
every regeneration even when the actual content (skill descriptions, routing
prose, voice triggers) was unchanged.
Two fixes:
1. Drop the generated_at field. The file is purely a content registry now.
2. Only write the file when serialized content actually differs from disk.
Reproducible test: bun run gen:skill-docs twice in a row now leaves
scripts/proactive-suggestions.json unchanged on the second run.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(catalog): preserve routing prose when first sentence exceeds 200 chars
splitCatalogDescription truncated the lead BEFORE computing routing
extraction, which meant skills whose first sentence was over 200 chars
(design-consultation: 207 chars) had their entire routing prose silently
dropped — the "## When to invoke" body section came out empty.
Root cause: routing was extracted via `collapsed.indexOf(lead)` after lead
was suffixed with "...". The "..." never appeared in the original string,
so indexOf returned -1 and routingProse fell back to empty.
Fix: compute routing from sentenceLead (the untruncated first sentence)
BEFORE truncating the displayed lead. The displayed lead still gets "..."
when over 200 chars, but the routing extraction uses the real boundary.
Also: refresh golden snapshots for claude/codex/factory ship and update
two unit tests that asserted v1.44 behavior:
- skill-validation.test.ts: trigger-phrase + proactive-routing tests now
search whole content, not just frontmatter (T4 moved them to a body
"## When to invoke" section)
- writing-style-resolver.test.ts: jargon-list assertion now expects the
T3 reference pointer, not the inline list
Test plan:
- bun test test/skill-validation.test.ts test/writing-style-resolver.test.ts
test/host-config.test.ts test/skill-size-budget.test.ts
test/parity-suite.test.ts test/skill-coverage-matrix.test.ts
test/skill-coverage-floor.test.ts test/cso-preserved.test.ts
test/resolver-entry.test.ts test/helpers/capture-parity-baseline.test.ts
test/gen-skill-docs.test.ts: 1134 pass, 0 fail
- Manual verify: design-consultation/SKILL.md "## When to invoke this skill"
body section now contains "Use when asked to..." + "Proactively suggest..."
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(catalog): deterministic proactive-suggestions.json across machines
CI check-freshness failed because scripts/proactive-suggestions.json
serialized differently on local vs CI:
1. Root-skill key leaked the directory name. processTemplate's outer loop
computed `dir = path.basename(path.dirname(tmplPath))`. For the root
SKILL.md.tmpl at ROOT/SKILL.md.tmpl, that returns the repo-checkout
directory name — "seville-v3" in a Conductor worktree, "gstack" on
GitHub Actions, anything-else for a fork. Fix: detect root via
`path.dirname(tmplPath) === ROOT` and hardcode the key to "gstack"
for that one case.
2. Aggregate key order was filesystem-iteration order. discoverTemplates
doesn't guarantee stable ordering across platforms, so the JSON
`skills` object came out shuffled between machines. Fix: sort
Object.keys(proactiveAggregate) alphabetically before serializing.
After the fix, the generated file is identical on every machine and
matches what's committed. CI freshness check (bun run gen:skill-docs &&
git diff --exit-code) now passes.
Test plan:
- bun run gen:skill-docs && bun run gen:skill-docs --dry-run: all FRESH
- node -e 'verify keys sorted': sorted match: true
- grep -c '"seville-v3"' scripts/proactive-suggestions.json: 0
- Focused test suite: 704 pass, 0 fail
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* test(catalog): unit + regression coverage for catalog-trim helpers
Four exported functions in scripts/gen-skill-docs.ts handle every skill's
frontmatter rewrite at gen time but had zero unit tests. Both real bugs we
shipped (and fixed) on this branch lived in these functions:
v1.45.0.0 design-consultation: when the first sentence exceeded 200 chars,
routing-prose extraction lost the entire tail (anchored on truncated lead
with "..." that didn't substring-match the original).
v1.45.0.0 CI freshness: root-skill key leaked the checkout directory
name ("seville-v3" vs "gstack") and aggregate order was filesystem-
iteration order.
Both shapes are now regression-tested:
- splitCatalogDescription: 7 tests covering simple multi-line, >200-char
first sentence (design-consultation regression), voice-trigger
extraction, no-(gstack) handling, embedded periods (documents known
fallback), no-period fragments, and idempotency.
- buildTrimmedDescription: 3 tests.
- buildWhenToInvokeSection: 3 tests.
- applyCatalogTrim: 4 tests covering the standard rewrite, no-op for
already-short descriptions, the YAML-collision newline fix, and the
malformed-frontmatter null return.
- proactive-suggestions.json determinism: 3 tests asserting sorted keys,
root keyed as "gstack" (not the worktree directory), and no
timestamp/generated_at field that would flap CI freshness.
Test plan:
- bun test test/catalog-trim.test.ts: 20 pass, 0 fail
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* test(coverage): fill three remaining v1.46.0.0 test gaps
Three untested surfaces from the v1.46.0.0 work. All three would have
caught real bugs we shipped (and fixed) on this branch.
1. test/helpers/budget-override.test.ts — 7 tests pin the audit-trail
contract for EVALS_BUDGET_OVERRIDE_REASON and
GSTACK_SIZE_BUDGET_OVERRIDE_REASON. Without this, the audit logger
could silently drop events and overrides become invisible. Tests
cover: required fields per JSONL line, CI provenance capture
(CI/GITHUB_ACTIONS/branch/commit), local-runner defaults,
append-only behavior, missing-directory recovery, and unwritable-
path resilience (logs warning instead of throwing).
2. test/terse-build.test.ts — 16 tests pin --explain-level=terse
behavior across the 4 gated resolvers and the composed preamble.
Default vs terse vs undefined-ctx all asserted. Without this, a
refactor that breaks the explainLevel threading silently regresses
the opt-in compression path; the runtime EXPLAIN_LEVEL: terse gate
still works so users wouldn't notice. Tier-1 invariant pinned
(terse-only-affects-tier-2+).
3. test/gen-skill-docs-idempotency.test.ts — 2 tests catch the class
of bug behind the v1.45.0.0 timestamp flap. Two consecutive
gen-skill-docs runs must produce byte-identical outputs across
STABLE_OUTPUTS (proactive-suggestions.json, SKILL.md, ship/SKILL.md,
plan-ceo-review/SKILL.md, office-hours/SKILL.md, gstack/llms.txt).
--dry-run reports zero stale files after a fresh gen. CI freshness
regressions surface as test failures BEFORE a PR is opened.
Test plan:
- bun test test/helpers/budget-override.test.ts: 7 pass
- bun test test/terse-build.test.ts: 16 pass
- bun test test/gen-skill-docs-idempotency.test.ts: 2 pass
- Full focused suite (15 test files): 1179 pass, 0 fail (+45 new tests
vs the pre-fill baseline of 1134)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* test(coverage): close 5 remaining v1.46.0.0 test gaps (A-E)
Five behaviors that v1.46 ships but had no test coverage. All now pinned.
A) --host all idempotency (test/gen-skill-docs-idempotency.test.ts)
The default test ran Claude host only. Non-Claude hosts (Codex, Factory,
Cursor, OpenClaw, GBrain, Slate, OpenCode, Hermes, Kiro) each have their
own output paths and could carry their own non-deterministic fields. We
hit a "--host all needed for freshness check" mid-/ship. Now: two
consecutive `bun run gen:skill-docs --host all` runs must produce
byte-identical outputs across a per-host sample (.agents/, .cursor/,
.factory/, .gbrain/). Catches per-host adapter regressions before CI.
B) --catalog-mode=full opt-out (test/catalog-mode-full.test.ts)
The legacy escape hatch had zero tests. 6 new tests across two layers:
static (CATALOG_MODE_ARG parsed; conditional gate present; default is
"trim"; invalid value throws) + smoke (actual --catalog-mode=full run
produces a multi-line `description: |` block + omits "## When to invoke"
body section; mutates the working tree then restores in a finally block).
C) parity-baseline-v1.44.1.json integrity (test/parity-baseline-integrity.test.ts)
The baseline is the source of every v1→v2 number cited in the
CHANGELOG v1.46.0.0 entry. Anyone could edit it without test failure
until now. 8 new tests pin: existence, tag, capturedFromCommit
allowlist, expected v1.44 numbers (51 skills, ~2,915 KB, ~9,319
catalog tokens), CHANGELOG references this file by path, per-skill
shape, and a SHA256 byte-stability hash. Any edit fails with a clear
"if intentional, update EXPECTED_HASH AND the CHANGELOG numbers" signal.
D) Live appliesTo gate end-to-end (test/resolver-entry.test.ts extended)
The unwrapResolver unit tests covered the function; the gen-skill-docs.ts
substitution loop that USES the gate had no integration coverage. 6 new
tests simulate the exact 4-line shape from gen-skill-docs.ts:457-467
against synthetic registries: plain-function fires unconditionally,
gated fires when true / empty-string when false, mixed registries
compose, parameterized resolvers respect gates, unknown resolvers throw.
E) Per-skill min-size floor (test/skill-size-budget.test.ts extended)
The existing 200-byte body coverage-floor is a noise floor — a skill
that lost 99.75% of content still passes. 1 new test asserts every
skill stays ≥80% of its v1.44.1 baseline size (the parity-suite
content invariants only covered 10 of 51 skills; the remaining 41
were uncovered). SECTIONS_EXTRACTED hook in place for v2.0.0.0 when
the sections/ pattern legitimately shrinks ship/plan-ceo/etc. past
the floor.
Test plan:
- bun test focused 17-file suite: 1202 pass, 0 fail
(+23 new tests vs the pre-fill 1179 baseline)
- catalog-mode=full mutates working tree then restores cleanly
- --host all idempotency runs two full gen passes in <1s on this machine
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
|
|
|
1d9b9c4cfc
|
v1.43.0.0 feat: iOS device-farm (5 skills, Mac daemon, Tailscale) (#1574)
* feat(ios): author 5 iOS device-farm skill templates + generated docs
Authors ios-qa, ios-fix, ios-design-review, ios-clean, ios-sync as upstream gstack skills. Each follows the standard SKILL.md.tmpl pattern with preamble-tier:3 frontmatter. The fork at time-attack/gstack shipped these but as byte-identical .md/.tmpl pairs that wouldn't pass skill-docs.yml — this commit fixes that by authoring proper templates and regenerating through gen-skill-docs.
* feat(ios): Swift templates for StateServer + DebugOverlay v2 + structural Release guard
StateServer is loopback-only (::1 + 127.0.0.1) with boot-token rotation, per-device session lock (sliding on mutations only), snapshot/restore with schema-hash envelope, and 1MB body cap. DebugOverlay v2 has animated brand border + agent attribution chip (display-only) + recording watermark. Package.swift enforces structural Release-build exclusion via .when(configuration: .debug). Includes Tailscale ACL example doc.
* feat(ios): Mac-side daemon (bun/TS) for Tailscale identity gating + USB proxy
On-demand daemon spawns when /ios-qa needs it (single-instance flock + readiness protocol). Owns tailnet ingress: fail-closed tailscaled LocalAPI probe, dual-track /auth/mint (self-service for allowlisted identities, owner-granted via CLI), capability-tier allowlist (observe/interact/mutate/restore), 1h default session TTL (24h hard cap), audit log of every authenticated mutating tailnet request, hashed-identity attempts log. iOS StateServer never directly binds tailnet — identity validation lives Mac-side because iPhones can't reach tailscaled. 67 unit/integration tests covering session-lock concurrency, capability enforcement, fail-closed probe, identity canonicalization, body limits, and boot-token leak proofs.
* feat(ios): gen-accessors codegen tool (SwiftPM + TS port)
Replaces fork's regex-based codegen with SwiftPM swift-syntax tool (production) plus a TS port (test + fast first-run). Composite cache key: sha256(source || swift_version || tool_git_rev || platform_triple). Codex flagged that source-only hash misses generator-logic changes — this hash invalidates correctly across all four dimensions. 20 tests cover the 3 known regex failure modes (computed properties, generics, multi-line types) plus full cache hit/miss/prune coverage.
* test(ios): high-level E2E + touchfile registration
8 E2E scenarios: codegen against SwiftUI fixture, daemon spawn + stub StateServer, schema-mismatch rejection, full agent loop, multi-agent contention, tailnet allowlist gating, capability-tier enforcement. Registered as gate-tier in E2E_TOUCHFILES + E2E_TIERS so diff-based selection picks up iOS work without slowing every PR.
* chore: bump version and changelog (v1.40.0.0)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* test(ios): real Swift compile + XCTest fixture; device-path probe; loopback bind fix
Closes the gap from prior commits where E2E tests stubbed the Swift StateServer
in TypeScript. Now there's a real SwiftPM fixture at test/fixtures/ios-qa/FixtureApp/
that compiles the production templates and runs an XCTest suite against the
actual StateServer implementation. Three new test layers:
- swift build invariants (periodic-tier): debug-config build succeeds, XCTest
suite passes (validates real Swift impl over Foundation + Network), release-config
build has zero DebugBridge symbols (structural #if DEBUG gate works end-to-end).
- Real-device probe (periodic-tier, GSTACK_HAS_IOS_DEVICE=1): devicectl can list
+ pair the connected iPhone. Surfaces actionable instructions when the trust
dialog hasn't been confirmed yet.
- Fixture sources copied from ios-qa/templates/ — Package.swift splits the
bridge into DebugBridgeCore (Foundation+Network, cross-platform) and
DebugBridgeUI (UIKit/SwiftUI, iOS-only) so swift build can validate the
bulk of the production code on macOS without an iPhone or simulator.
Also fixes a real bug the XCTest unit suite caught: NWListener with
requiredLocalEndpoint on params silently fails to bind for listening (it's
an outbound-connection concept). Replaced with .requiredInterfaceType=.loopback
+ .acceptLocalOnly=true + a per-connection peer-address check. The fork's
inherited code had this bug; we shipped it untouched in v1.41.0.0 and the
new XCTest suite caught it immediately.
* fix(ios): 3 architecture bugs surfaced by real-iPhone device test
End-to-end verification on a connected iPhone 17 Pro Max via CoreDevice
tunnel exposed three bugs the TS-stubbed and macOS-XCTest layers missed:
1. acceptLocalOnly=true was too tight. Network.framework's "local" gate
only allows ::1 / 127.0.0.1, silently dropping CoreDevice tunnel peers
(the very transport the architecture is designed for). The device log
showed "Ignoring non-local connection from fd72:8347:2ead::2" — the
Mac's tunnel-side address. Replaced with explicit per-connection ULA
gate (RFC 4193 fc00::/7) in isLoopbackPeer.
2. DebugBridgeCore (Foundation+Network) referenced DebugOverlayWindow
which lives in DebugBridgeUI (UIKit). Backwards module dep. Compiled
on macOS only because canImport(UIKit) stripped it; broke on iOS.
Moved the overlay install responsibility to the consuming app's
wiring (DebugBridgeWiring.swift.template already shows the pattern).
3. @Observable macro + @Snapshotable property wrapper conflict. Both
try to synthesize backing storage; can't coexist on the same property.
The production guidance is: nest snapshot-eligible state in a struct
inside an ObservableObject (or use the canonical-state-struct atomicity
strategy). Fixture switched to a plain class to demonstrate.
Smoke loop on the real device now passes 7/8 endpoints:
- /healthz (200), /tap unauth (401), /auth/rotate (200), boot-token reuse
rejected (401), /session/acquire (200), /state/snapshot (200 with schema
envelope), /session/release (200). /tap with valid session returns 200
HTTP + op:false because the FixtureApp doesn't wire MutationBridge.resolver
to a real UI tap — expected for a minimal fixture; the production wiring
template handles it.
Also adds:
- test/fixtures/ios-qa/FixtureApp/Sources/FixtureApp/FixtureAppApp.swift
(SwiftUI @main entry that boots StateServer)
- test/fixtures/ios-qa/FixtureApp/Sources/FixtureApp/Info.plist
- test/fixtures/ios-qa/FixtureApp/project.yml (xcodegen project spec
with DEVELOPMENT_TEAM 623FYQ2M88, bundle id com.gstack.iosqa.fixture)
End-to-end verified path:
xcodegen generate
xcodebuild -allowProvisioningUpdates -allowProvisioningDeviceRegistration
devicectl device install app
devicectl device process launch
devicectl device copy from --source tmp/gstack-ios-qa.token
curl -6 http://[<corodevice-ipv6>]:9999/...
* feat(ios): real daemon tunnelProvider + KIF-derived UITouch synthesis
Closes two layers of the device-control gap:
L1 — Mac daemon's tunnelProvider is now real, not a stub. New files:
- ios-qa/daemon/src/devicectl.ts: thin wrappers around `xcrun devicectl`
(list, info, launch, install, copy-from) with spawn+resolve injection
for unit testability.
- ios-qa/daemon/src/tunnel-bootstrap.ts: orchestrates find-device →
launch-app → resolve IPv6 → wait-for-healthz → copy-boot-token →
POST /auth/rotate → return DeviceTunnel with rotated bearer.
- ios-qa/daemon/test/tunnel-bootstrap.test.ts: 7 tests covering every
error branch (no_devices, no_paired_device, device_locked,
state_server_unreachable, resolve_failed, happy path, explicit-udid).
- index.ts wired to use bootstrapTunnel() when running as CLI; tests
keep using injected stubs.
L2 — In-process touch synthesis for non-UIControl widgets. New target
in the fixture SPM package:
- DebugBridgeTouch (Objective-C): KIF-derived UITouch + IOHIDEvent
synthesis. Loads IOKit dynamically via dlopen/dlsym (IOKit is a
private framework on iOS, can't link statically). Uses iOS 18+
_UIHitTestContext for SwiftUI hit-testing. Public Swift-callable
API: DebugBridgeTouch.sendTap(at:in:). MIT-attributed to
kif-framework/KIF.
- DebugBridgeUI/Bridges.swift: rewritten MutationBridge.handleTap to
delegate to DebugBridgeTouch. ScreenshotBridge + ElementsBridge
implementations also land here.
- FixtureApp/Sources/FixtureApp/FixtureAppApp.swift: wires the bridges
on app launch under #if DEBUG.
Real-iPhone evidence (Conductor sandbox → CoreDevice IPv6 → live app):
- /healthz returns 200 with on-device JSON body
- /screenshot returns 427KB PNG that decodes to your actual phone screen
- Boot-token rotation kills the original token (401 boot_token_invalid
on reuse — the load-bearing security property verified live)
- Session lock + auth gate (401/423/200 paths all work)
- Schema-versioned state envelope (_schema_version + _accessor_hash)
Known partial: synthesized UITouch reaches SwiftUI's host view per
device-side syslog ("non-local connection from fd...:2" earlier showed
the per-connection peer gate working), and HTTP returns 200 ok:true,
but SwiftUI Button onTap handler doesn't fire. UIControl widgets DO
work via UIControl.sendActions. Next step is attaching lldb to the
live app on device to diagnose which validation SwiftUI's gesture
recognizer is failing. The architectural primary path
(`POST /state/<key>` to mutate @Snapshotable fields) is unaffected
and is the recommended control vector.
Documented sources for the KIF-derived synthesis:
- https://github.com/kif-framework/KIF (MIT)
- UITouch-KIFAdditions.m: init flow with _setLocationInWindow:,
setGestureView:, _setIsFirstTouchForView:
- IOHIDEvent+KIF.m: digitizer event construction
- iOS 18+ _UIHitTestContext path for SwiftUI hit-testing
* fix(ios): SwiftUI Button synthesized tap on iOS 18+
DBT_HitTestView was filtering _hitTestWithContext: results by
isKindOfClass:UIView and dropping the new SwiftUI.UIKitGestureContainer
(a UIResponder, not UIView). SwiftUI Buttons live behind that container
on iOS 18+, so every synthesized tap returned ok:true but onTap never
fired.
Mirror KIF PR #1323: return id, pass the responder through to
UITouch.setView: directly (the setter accepts non-UIView responders).
Verified: real iPhone 17 Pro Max, iOS 26.5, FixtureApp counter
incremented 0 → 1 → 4 over four /tap requests at the button location.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(ios): hoist DebugBridgeTouch into canonical templates
Bridges.swift.template imports DebugBridgeTouch but no .m/.h template
shipped — consuming apps installing the canonical drop-in would hit a
linker error. Closes that gap with the fixture's verified working code.
Changes:
- New ios-qa/templates/DebugBridgeTouch.{h,m}.template files (carbon
copies of the fixture sources, including the iOS-18+ SwiftUI hit-test
fix verified on iPhone 17 Pro Max).
- Package.swift.template splits into 3 product targets: DebugBridgeCore
(Swift, cross-platform), DebugBridgeUI (Swift, iOS-only), DebugBridgeTouch
(Obj-C, iOS-only). Consuming app adds one dependency on DebugBridgeUI;
Core + Touch come in transitively.
- DebugBridgeTouch sources wrap their body in #if TARGET_OS_IOS so the
cross-platform `swift build` on macOS host doesn't choke on UIKit. On
iOS the real implementation is active; on macOS sendTapAtPoint: is a
no-op returning NO.
- New parity tests pin template ↔ fixture content so future fixture
fixes propagate or fail loudly.
- Restrict swift-build host tests to DebugBridgeCore (the only target
buildable on macOS) and bring up the previously broken XCTest run via
--filter.
Verified post-change: real iPhone 17 Pro Max, iOS 26.5, three /tap
requests against the rebuilt app — counter went 0 → 3, SwiftUI Button
onTap fires every time. Templates now sufficient to ship to any
consuming iOS app.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(ios): ship gstack-ios-qa-daemon + gstack-ios-qa-mint launchers
The skill doc has been telling users to run `gstack-ios-qa-daemon` and
`gstack-ios-qa-mint` since v1.41.0.0, but neither binary actually existed.
Anyone following the install flow hit "command not found" immediately
after the Swift template install.
Adds the missing pieces:
- bin/gstack-ios-qa-daemon — bash shim that execs
`bun run ios-qa/daemon/src/index.ts`. Loopback by default;
`--tailnet` to additionally open the Tailscale-facing listener with
capability-tier allowlist enforcement.
- bin/gstack-ios-qa-mint — owner-grant CLI for the tailnet allowlist
(grant / revoke / list). Writes ~/.gstack/ios-qa-allowlist.json at
mode 0600. Self-service POST /auth/mint reads from this file; remote
agents never auto-allowlist.
- ios-qa/daemon/src/cli-mint.ts — TS implementation behind the shim.
Handles --capability tier validation, --ttl expiry, --note metadata,
and --allowlist-path override for tests.
- ios-qa/daemon/src/allowlist.ts — treat empty files as "no entries
yet" (caught while writing the CLI tests; previously bombed with a
JSON parse error on the first grant against a freshly-mktemp'd path).
Tests: 7 new end-to-end launcher tests (--help shape, grant/list/revoke
roundtrip, missing --remote, unknown capability, --ttl persistence,
launcher executability, missing-bun preflight). All 81 daemon tests
pass.
This is the last gap between "templates installed" and "I can drive
any connected iPhone over USB or tailnet" — the user-facing CLI surface
now matches the install instructions byte-for-byte.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs: surface ios-qa CLIs + add end-to-end how-to walkthrough
The two CLIs that ship with the iOS device-farm capability —
gstack-ios-qa-daemon and gstack-ios-qa-mint — were mentioned only
inside ios-qa/SKILL.md. Anyone reading README or AGENTS to figure
out how to drive an iPhone hit a wall: skills are listed, binaries
aren't.
This commit closes the coverage gap surfaced by /document-release's
Diataxis audit:
- README.md, AGENTS.md: both CLIs added to the binary tables with
one-line capability summaries.
- docs/howto-ios-testing-with-gstack.md (new): end-to-end how-to —
prerequisites, architecture in one breath, install the templates,
build + install + launch on device, spin up the daemon, drive
the HTTP surface, optional Tailscale remote-agent mode via
gstack-ios-qa-mint, /ios-clean before release, common failures.
Pulled directly from the real iPhone 17 Pro Max / iOS 26.5
verification run.
- README + AGENTS link to the new how-to from the iOS skill row.
No CHANGELOG entry change — the consolidated 1.43.0.0 entry is /ship
work. No VERSION bump — already at 1.43.0.0 covering all branch work.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* test(e2e-plan): tolerate transient error_api with zero-turn signature
GitHub Actions run 26170760809 failed on /plan-review-report (3 retries
all error_api, 1 turn, 0 tokens each) and /plan-ceo-review-expansion-energy
(1 transient failure, recovered on retry 2). The prior run on the same
branch (
|