gstack/scripts
Garry Tan 17c1c06cd9
feat: diff-based test selection for E2E and LLM-judge evals (v0.6.1.0) (#139)
* feat: diff-based test selection for E2E and LLM-judge evals

Each test declares file dependencies in a TOUCHFILES map. The test runner
checks git diff against the base branch and only runs tests whose
dependencies were modified. Global touchfiles (session-runner, eval-store,
gen-skill-docs) trigger all tests.
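
A minimal sketch of the selection logic described above, not the actual eval-select.ts implementation. Only the TOUCHFILES name comes from the commit message. The map entries, the GLOBAL_TOUCHFILES paths, and the helpers matches and selectTests are hypothetical. The sketch assumes the list of changed files is gathered separately, for example with `git diff --name-only <base>...HEAD`.

```typescript
// Hypothetical sketch: everything except the TOUCHFILES name is an assumption.
type TouchfileMap = Record<string, string[]>;

// Each test lists the files or directory prefixes it depends on (illustrative entries).
const TOUCHFILES: TouchfileMap = {
  "qa-basic": ["qa/", "browse/src/"],
  "plan-design-review-audit": ["plan-design-review/"],
};

// A change to any of these selects every test (hypothetical paths).
const GLOBAL_TOUCHFILES: string[] = [
  "test/helpers/session-runner.ts",
  "test/helpers/eval-store.ts",
  "scripts/gen-skill-docs.ts",
];

// A pattern ending in "/" matches any file under that directory;
// any other pattern must equal the changed path exactly.
function matches(changed: string, pattern: string): boolean {
  return pattern.endsWith("/") ? changed.startsWith(pattern) : changed === pattern;
}

// Return the names of the tests that should run for the given changed files.
function selectTests(
  changedFiles: string[],
  touchfiles: TouchfileMap,
  globals: string[],
): string[] {
  const all = Object.keys(touchfiles);
  // Global touchfiles short-circuit: run everything.
  if (changedFiles.some((f) => globals.some((g) => matches(f, g)))) {
    return all;
  }
  // Otherwise keep only the tests with at least one modified dependency.
  return all.filter((name) =>
    touchfiles[name].some((p) => changedFiles.some((f) => matches(f, p))),
  );
}
```

With these illustrative entries, a diff that touches only `qa/SKILL.md` selects `qa-basic`, a diff that touches `scripts/gen-skill-docs.ts` selects both tests, and a diff that touches only `README.md` selects none.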

New scripts: test:e2e:all, test:evals:all, eval:select

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* chore: bump version and changelog (v0.6.1.0)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: plan-design-review-audit eval — bump turns to 30, add efficiency hints

The test was flaky at 20 turns because the agent reads a 300-line SKILL.md,
navigates, extracts design data, and writes a report. Added hints (skip the
preamble, batch commands, write early) while still testing the real SKILL.md.
Now completes in ~13 turns consistently.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-17 18:45:41 -05:00
dev-skill.ts feat: SKILL.md template system, 3-tier testing, DX tools (v0.3.3) (#41) 2026-03-13 21:08:12 -07:00
eval-compare.ts feat: eval CLI tools + docs cleanup 2026-03-14 03:49:57 -05:00
eval-list.ts feat: QA restructure, browser ref staleness, eval efficiency metrics (v0.4.0) (#83) 2026-03-15 23:55:39 -05:00
eval-select.ts feat: diff-based test selection for E2E and LLM-judge evals (v0.6.1.0) (#139) 2026-03-17 18:45:41 -05:00
eval-summary.ts feat: QA restructure, browser ref staleness, eval efficiency metrics (v0.4.0) (#83) 2026-03-15 23:55:39 -05:00
eval-watch.ts fix: auto-clear stale heartbeat when process is dead 2026-03-14 12:55:40 -05:00
gen-skill-docs.ts feat: Completeness Principle — Boil the Lake (v0.6.1) (#140) 2026-03-17 16:34:08 -05:00
skill-check.ts feat: /plan-design-review + /qa-design-review skills (v0.5.0) (#102) 2026-03-16 21:55:07 -05:00