mirror of https://github.com/garrytan/gstack.git
Adds comprehensive eval infrastructure:

- Tier 1 (free): 13 new static tests: cross-skill path consistency, QA structure validation, greptile format, planted-bug fixture validation
- Tier 2 (Agent SDK E2E): /qa quick, /review with pre-built git repo, 3 planted-bug outcome evals (static, SPA, checkout; each with 5 bugs)
- Tier 3 (LLM judge): QA workflow quality, health rubric clarity, cross-skill consistency, baseline score pinning

New fixtures: 3 HTML pages with 15 total planted bugs, ground truth JSON, review-eval-vuln.rb, eval-baselines.json.
Shared llm-judge.ts helper (DRY). Unified EVALS=1 flag replaces SKILL_E2E + ANTHROPIC_API_KEY checks.
`bun run test:evals` runs everything that costs money (~$4/run).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

| File | | |
|---|---|---|
| basic.html | ||
| cursor-interactive.html | ||
| dialog.html | ||
| empty.html | ||
| forms.html | ||
| qa-eval-checkout.html | ||
| qa-eval-spa.html | ||
| qa-eval.html | ||
| responsive.html | ||
| snapshot.html | ||
| spa.html | ||
| states.html | ||
| upload.html | ||
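
The commit message above describes planted-bug outcome evals (fixtures such as qa-eval.html, each with 5 planted bugs and a ground truth JSON) and a single EVALS=1 flag that gates the paid tiers. The sketch below shows one way such a check could be scored. It is not taken from the repository: the `PlantedBug` shape, the `evalsEnabled` and `scoreOutcome` helpers, and the bug ids are all hypothetical, since the actual ground truth schema is not shown here.

```typescript
// Hypothetical shape of one ground-truth entry; the real JSON schema
// used by the repository is not shown in the source.
interface PlantedBug {
  id: string;
  description: string;
}

interface OutcomeScore {
  found: string[];       // planted bug ids that were reported
  missed: string[];      // planted bug ids that were not reported
  detectionRate: number; // found / total planted, in [0, 1]
}

// Assumed gate: paid evals run only when EVALS is exactly "1".
// The environment is passed in so the function is easy to test.
function evalsEnabled(env: Record<string, string | undefined>): boolean {
  return env["EVALS"] === "1";
}

// Compare the bug ids an agent reported against the planted ones.
// Reported ids that match no planted bug are ignored in this sketch.
function scoreOutcome(groundTruth: PlantedBug[], reportedIds: string[]): OutcomeScore {
  const reported = new Set(reportedIds);
  const found: string[] = [];
  const missed: string[] = [];
  for (const bug of groundTruth) {
    if (reported.has(bug.id)) {
      found.push(bug.id);
    } else {
      missed.push(bug.id);
    }
  }
  const detectionRate =
    groundTruth.length === 0 ? 0 : found.length / groundTruth.length;
  return { found, missed, detectionRate };
}
```

A test could call `evalsEnabled` first and skip when it returns false, then assert that `detectionRate` meets a threshold. Ignoring unmatched reported ids is a design choice of this sketch: it keeps the score focused on recall of planted bugs rather than penalizing extra findings.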