gstack

Author SHA1 Message Date

Author	SHA1	Message	Date
Garry Tan	24742f9dac	Merge origin/main: rebump v1.45→v1.46 (queue collision with design-daemon) Main shipped v1.45.0.0 (persistent design daemon) at `cf50443b` while this branch was at v1.45.0.0 (gstack v2 foundation). Same-MINOR-bump-level queue collision per CLAUDE.md versioning invariant — advance this branch to v1.46.0.0 to claim the next slot. Resolved CHANGELOG.md conflict: kept both v1.45.0.0 (design daemon, main) and v1.46.0.0 (gstack v2 foundation, this branch) entries in reverse-chronological order. Updated VERSION, package.json, and renamed test/fixtures/parity-baseline-v1.45.0.0.json → -v1.46.0.0.json with the internal tag field synced. Updated CHANGELOG entry numbers-table column header from v1.45.0.0 to v1.46.0.0 + the source-reproduction line + the "v1.45 absorbs..." prose and the eval target reference in skill-size-budget.test.ts comment. Auto-merged main's design-daemon SKILL.md changes for design-consultation, design-shotgun, office-hours, plan-design-review. Regenerated all SKILL.md files via gen-skill-docs --host all to ensure clean state. Refreshed golden ship snapshots (claude/codex/factory). Test plan: - bun test test/skill-validation.test.ts test/writing-style-resolver.test.ts test/host-config.test.ts test/skill-size-budget.test.ts test/parity-suite.test.ts test/skill-coverage-matrix.test.ts test/skill-coverage-floor.test.ts test/cso-preserved.test.ts test/resolver-entry.test.ts test/helpers/capture-parity-baseline.test.ts test/gen-skill-docs.test.ts: 1134 pass, 0 fail Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-25 21:18:15 -07:00
Garry Tan	81fdf9cc61	test(budget): T5 — hard token budgets + override audit trail (Phase A.6) Two new gate-tier guardrails for the v1.45.0.0 compression baseline: 1. test/skill-size-budget.test.ts (NEW) — per-skill SKILL.md size budget. Compares current state to test/fixtures/parity-baseline-v1.44.1.json. Three checks: per-skill (×1.05 default ratio), total corpus, and catalog token estimate (≤7000 for v1.45). The per-skill ratio is 1.05 not 1.0 because the T4 catalog trim moves text from frontmatter to a body section; small skills see a tiny body growth that's fine when offset by the much larger catalog-token win. 2. test/skill-budget-regression.test.ts EXTENDED — hard dollar cap on per-run eval cost. Per-tier defaults: gate $25, periodic $70. Umbrella EVALS_BUDGET_HARD_CAP=$30. Catches runaway eval costs (infinite retry, model price changes) before they amortize across PRs. Both checks support an override path with audit trail: GSTACK_SIZE_BUDGET_OVERRIDE_REASON="why this is OK" — size EVALS_BUDGET_OVERRIDE_REASON="why this is OK" — cost Overrides log to ~/.gstack/analytics/spend-overrides.jsonl with timestamp + scope + reason + CI provenance (runner, branch, commit) via test/helpers/budget-override.ts. Why the override audit: a hard cap with no escape valve becomes operationally hostile (legit price changes, longer transcripts, new required evals can all blow the cap). An override with no audit becomes "everyone overrides everything and the gate is theater." This module ships the audit half so reviewers can see what was waived and why. Codex 2nd-pass critique #3 absorbed: per-suite caps + override path with auditability + budget baselines checked into repo (parity-baseline-v1.44.1.json already in test/fixtures/). Test plan: - bun test test/skill-size-budget.test.ts: 4 pass (per-skill, corpus, catalog, baseline-exists) - bun test test/skill-budget-regression.test.ts: 4 pass (2 existing ratio checks + 2 new hard-cap checks) - Existing eval runs ($14.11 e2e, $0.02 llm-judge) sit well under the new caps Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-25 20:36:43 -07:00

Garry Tan

24742f9dac

Merge origin/main: rebump v1.45→v1.46 (queue collision with design-daemon)

Main shipped v1.45.0.0 (persistent design daemon) at cf50443b while this
branch was at v1.45.0.0 (gstack v2 foundation). Same-MINOR-bump-level
queue collision per CLAUDE.md versioning invariant — advance this branch
to v1.46.0.0 to claim the next slot.

Resolved CHANGELOG.md conflict: kept both v1.45.0.0 (design daemon, main)
and v1.46.0.0 (gstack v2 foundation, this branch) entries in
reverse-chronological order. Updated VERSION, package.json, and renamed
test/fixtures/parity-baseline-v1.45.0.0.json → -v1.46.0.0.json with the
internal tag field synced.

Updated CHANGELOG entry numbers-table column header from v1.45.0.0 to
v1.46.0.0 + the source-reproduction line + the "v1.45 absorbs..." prose
and the eval target reference in skill-size-budget.test.ts comment.

Auto-merged main's design-daemon SKILL.md changes for design-consultation,
design-shotgun, office-hours, plan-design-review. Regenerated all SKILL.md
files via gen-skill-docs --host all to ensure clean state. Refreshed golden
ship snapshots (claude/codex/factory).

Test plan:
- bun test test/skill-validation.test.ts test/writing-style-resolver.test.ts
  test/host-config.test.ts test/skill-size-budget.test.ts
  test/parity-suite.test.ts test/skill-coverage-matrix.test.ts
  test/skill-coverage-floor.test.ts test/cso-preserved.test.ts
  test/resolver-entry.test.ts test/helpers/capture-parity-baseline.test.ts
  test/gen-skill-docs.test.ts: 1134 pass, 0 fail

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-25 21:18:15 -07:00

Garry Tan

81fdf9cc61

test(budget): T5 — hard token budgets + override audit trail (Phase A.6)

Two new gate-tier guardrails for the v1.45.0.0 compression baseline:

1. test/skill-size-budget.test.ts (NEW) — per-skill SKILL.md size budget.
Compares current state to test/fixtures/parity-baseline-v1.44.1.json.
Three checks: per-skill (×1.05 default ratio), total corpus, and
catalog token estimate (≤7000 for v1.45). The per-skill ratio is 1.05
not 1.0 because the T4 catalog trim moves text from frontmatter to a
body section; small skills see a tiny body growth that's fine when
offset by the much larger catalog-token win.

2. test/skill-budget-regression.test.ts EXTENDED — hard dollar cap on
per-run eval cost. Per-tier defaults: gate $25, periodic $70. Umbrella
EVALS_BUDGET_HARD_CAP=$30. Catches runaway eval costs (infinite retry,
model price changes) before they amortize across PRs.

Both checks support an override path with audit trail:
GSTACK_SIZE_BUDGET_OVERRIDE_REASON="why this is OK" — size
EVALS_BUDGET_OVERRIDE_REASON="why this is OK" — cost
Overrides log to ~/.gstack/analytics/spend-overrides.jsonl with
timestamp + scope + reason + CI provenance (runner, branch, commit)
via test/helpers/budget-override.ts.

Why the override audit: a hard cap with no escape valve becomes
operationally hostile (legit price changes, longer transcripts, new
required evals can all blow the cap). An override with no audit becomes
"everyone overrides everything and the gate is theater." This module
ships the audit half so reviewers can see what was waived and why.

Codex 2nd-pass critique #3 absorbed: per-suite caps + override path with
auditability + budget baselines checked into repo (parity-baseline-v1.44.1.json
already in test/fixtures/).

Test plan:
- bun test test/skill-size-budget.test.ts: 4 pass (per-skill, corpus, catalog, baseline-exists)
- bun test test/skill-budget-regression.test.ts: 4 pass (2 existing ratio checks + 2 new hard-cap checks)
- Existing eval runs ($14.11 e2e, $0.02 llm-judge) sit well under the new caps

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-25 20:36:43 -07:00

2 Commits