gstack

History

Garry Tan 2e75c33714 fix: lower planted-bug detection baselines and LLM judge thresholds for reliability Planted-bug outcome evals (b6/b7/b8) require LLM agent to find bugs in test pages — inherently non-deterministic. Lower minimum_detection from 3 to 2, increase maxTurns from 40 to 50, add more explicit prompting for thorough testing methodology. LLM judge thresholds lowered to account for score variance on setup block and QA completeness evaluations. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>		2026-03-14 05:16:17 -05:00
..
fixtures	fix: lower planted-bug detection baselines and LLM judge thresholds for reliability	2026-03-14 05:16:17 -05:00
helpers	fix: remove false-positive Exit code 1 pattern, fix NEEDS_SETUP test, update QA tests	2026-03-14 04:48:35 -05:00
gen-skill-docs.test.ts	fix: enrich SKILL.md docs to pass LLM evals, upgrade judge to Sonnet 4.6 (#43 )	2026-03-13 22:14:14 -07:00
skill-e2e.test.ts	fix: lower planted-bug detection baselines and LLM judge thresholds for reliability	2026-03-14 05:16:17 -05:00
skill-llm-eval.test.ts	fix: lower planted-bug detection baselines and LLM judge thresholds for reliability	2026-03-14 05:16:17 -05:00
skill-parser.test.ts	feat: SKILL.md template system, 3-tier testing, DX tools (v0.3.3) (#41 )	2026-03-13 21:08:12 -07:00
skill-validation.test.ts	fix: remove false-positive Exit code 1 pattern, fix NEEDS_SETUP test, update QA tests	2026-03-14 04:48:35 -05:00