Commit Graph

2 Commits

Author SHA1 Message Date
Garry Tan 7658179879
test(judge): pin every hedging-regex alternate with a fixture
Coverage audit flagged 5 unpinned alternates in the choice-portion hedging
regex (depends? on, depending, if .+ then, or maybe, whichever). Only "either"
was previously exercised, so the other 5 deterministic regex branches had no
fixture and a typo in any of them would have shipped silently.

Add one fixture per hedge form. Mixing has-because (LLM call) and
no-because (deterministic-only) cases keeps the added Haiku cost at ~$0.015
per fixture run while taking branch coverage from 9/14 to 14/14.
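
A minimal sketch of what "one fixture per hedge form" can look like. The
alternates are copied from the message above; the combined pattern, the
fixture strings, and the helper name unpinnedAlternates are assumptions made
for illustration, and the real regex in llm-judge.ts may differ in anchoring
and flags.

```typescript
// Alternates as listed in the commit message. How the real regex combines
// them is an assumption in this sketch.
const HEDGE_ALTERNATES: string[] = [
  "either",
  "depends? on",
  "depending",
  "if .+ then",
  "or maybe",
  "whichever",
];

// Hypothetical combined pattern: word-bounded, case-insensitive.
const HEDGE_REGEX = new RegExp(`\\b(?:${HEDGE_ALTERNATES.join("|")})\\b`, "i");

// One illustrative fixture per alternate (strings invented for this sketch).
const HEDGE_FIXTURES: Record<string, string> = {
  "either": "Choose either B or C",
  "depends? on": "It depends on your traffic",
  "depending": "B or C, depending on load",
  "if .+ then": "If latency matters then B",
  "or maybe": "Choose B, or maybe C",
  "whichever": "Pick whichever is cheaper",
};

// Returns the alternates that no fixture exercises. Each alternate is tested
// in isolation, so a fixture cannot pass by accidentally matching a sibling.
function unpinnedAlternates(
  alternates: string[],
  fixtures: Record<string, string>,
): string[] {
  return alternates.filter((alt) => {
    const sample = fixtures[alt];
    if (sample === undefined) {
      return true;
    }
    return !new RegExp(`\\b(?:${alt})\\b`, "i").test(sample);
  });
}
```

Testing each alternate on its own is what makes a typo visible: a misspelled
branch stops matching its fixture even when the combined pattern still matches
other hedges.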

Fixture passes 30/30 expect() calls in 20.7s.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 18:16:11 -07:00
Garry Tan dfb68fe88d
test: add fixture-based sanity test for judgeRecommendation rubric
Replaces "manually inject bad text into a captured file and revert the SKILL
template" sabotage testing with deterministic negative coverage: hand-graded
good/bad recommendation strings asserted against the same threshold (>= 4)
the production E2E tests use.

Seven fixtures cover the rubric corners: substance 5 (option-specific +
cross-alternative), substance 4 (option-specific without comparison), substance
~1 (boilerplate "because it's better"), substance ~3 (generic "because it's
faster"), no-because (deterministic skip), no-recommendation (deterministic
skip), and hedging ("either B or C", which fails the commits check).
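
A rough sketch of the threshold assertion for the four substance fixtures.
Only the >= 4 threshold and the two quoted "because" phrases come from the
message above; the types, the other fixture strings, the helper names, and the
synchronous stub judge are hypothetical. The real judgeRecommendation calls an
LLM and its signature is not shown here.

```typescript
// Threshold named in the commit message.
const PASS_THRESHOLD = 4;

type Verdict = "pass" | "fail";

interface RecommendationFixture {
  name: string;
  text: string;
  handGrade: number; // hand-assigned substance score for this sketch
  expected: Verdict;
}

// The option-specific strings are invented; the two bad "because" phrases
// are the ones quoted in the commit message.
const SUBSTANCE_FIXTURES: RecommendationFixture[] = [
  { name: "substance 5", text: "Choose B because its index scan avoids the full-table read that A and C require", handGrade: 5, expected: "pass" },
  { name: "substance 4", text: "Choose B because its index scan avoids a full-table read", handGrade: 4, expected: "pass" },
  { name: "substance ~1", text: "Choose B because it's better", handGrade: 1, expected: "fail" },
  { name: "substance ~3", text: "Choose B because it's faster", handGrade: 3, expected: "fail" },
];

function verdictFor(score: number): Verdict {
  return score >= PASS_THRESHOLD ? "pass" : "fail";
}

// Names of fixtures whose judged verdict disagrees with the hand-graded one.
function mismatchedFixtures(
  fixtures: RecommendationFixture[],
  judge: (text: string) => number,
): string[] {
  return fixtures
    .filter((f) => verdictFor(judge(f.text)) !== f.expected)
    .map((f) => f.name);
}

// Stand-in for the LLM judge: returns the hand grade, or 0 for unknown text.
function stubJudge(text: string): number {
  const hit = SUBSTANCE_FIXTURES.filter((f) => f.text === text)[0];
  return hit ? hit.handGrade : 0;
}
```

With a judge that always returns 5, mismatchedFixtures reports the two bad
fixtures, which is the negative coverage the manual sabotage step used to
provide.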

Tagged periodic-tier, so it doesn't run on every PR but does fire when the
llm-judge.ts rubric changes. Costs ~$0.04 per run via Haiku 4.5.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 14:18:16 -07:00