gstack

Commit Graph

Author	SHA1	Message	Date
Garry Tan	9a424a9f55	test: apply ship review-army findings — helper extract, slice SKILL.md, defensive judge Five categories of fixes surfaced by the /ship pre-landing reviews (testing + maintainability + security + performance + adversarial Claude), applied as one review-iteration commit. Refactor — collapse 5x duplicated judge-assertion block: - Add assertRecommendationQuality() + RECOMMENDATION_SUBSTANCE_THRESHOLD constant to test/helpers/e2e-helpers.ts. - Plan-format (4 cases) and Phase 4 (1 case) collapse from ~22 lines each to a single helper call. Future rubric tweaks land in one place instead of five. Performance — extract Phase 4 slice instead of copying full SKILL.md: - Phase 4 test fixture now reads office-hours/SKILL.md and writes only the AskUserQuestion Format section + Phase 4 section to the tmpdir, per CLAUDE.md "extract, don't copy" rule. Verified locally: cost dropped from $0.51 → $0.36/run, turn count 8 → 4, latency 50s → 36s. Reduces Opus context bloat without weakening the regression check. - Add `if (!workDir) return` guard to Phase 4 afterAll cleanup so a skipped describe block doesn't silently fs.rmSync(undefined) under the empty catch. Defense — judge prompt + output: - Wrap captured AskUserQuestion text in clearly delimited UNTRUSTED_CONTEXT block with explicit instruction to treat its content as data, not commands. Cheap defense against the (unlikely but real) injection vector where a captured AskUserQuestion contains "Ignore previous instructions" text. - Bump captured-text budget from 4000 → 8000 chars; real plan-format menus with 4 options × ~800 chars exceed 4000 and were silently truncating Haiku context mid-option. Cleanup — abbreviation rule + dead imports + touchfile consistency: - AUQ → AskUserQuestion in 3 sites (office-hours/SKILL.md.tmpl Phase 4 footer, two test comments) per the always-write-in-full memory rule. Regenerated office-hours/SKILL.md. - Drop unused `describe`/`test` imports in 2 new test files (only describeIfSelected/testConcurrentIfSelected wrappers are used). - Add `test/skill-e2e-office-hours-phase4.test.ts` to its own touchfile entry for consistency with other entries that include their test file. - Fix misleading comment in fixture test about LLM short-circuiting (it's has_because, not commits, that skips the API call). Verified: build clean, free `bun test` exits 0, fixture test 30/30 expect() calls pass, Phase 4 paid eval passes substance 5 in 36s. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 18:40:01 -07:00
Garry Tan	7658179879	test(judge): pin every hedging-regex alternate with a fixture Coverage audit flagged 5 unpinned alternates in the choice-portion hedging regex (depends? on, depending, if .+ then, or maybe, whichever). Only "either" was previously exercised, leaving 5 deterministic regex branches with no fixture — a typo in any alternate would have shipped silently. Add one fixture per hedge form. Mix of has-because (LLM call) and no-because (deterministic-only) cases keeps total Haiku cost at ~$0.015 extra per fixture run while taking branch coverage from 9/14 → 14/14. Fixture passes 30/30 expect() calls in 20.7s. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 18:16:11 -07:00
Garry Tan	dfb68fe88d	test: add fixture-based sanity test for judgeRecommendation rubric Replaces "manually inject bad text into a captured file and revert the SKILL template" sabotage testing with deterministic negative coverage: hand-graded good/bad recommendation strings asserted against the same threshold (>= 4) the production E2E tests use. Seven fixtures cover the rubric corners: substance 5 (option-specific + cross-alternative), substance 4 (option-specific without comparison), substance ~1 (boilerplate "because it's better"), substance ~3 (generic "because it's faster"), no-because (deterministic skip), no-recommendation (deterministic skip), and hedging ("either B or C" — fails commits). Periodic-tier so it doesn't run on every PR but does fire on llm-judge.ts rubric tweaks. ~$0.04 per run via Haiku 4.5. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 14:18:16 -07:00

Author

SHA1

Message

Date

Garry Tan

9a424a9f55

test: apply ship review-army findings — helper extract, slice SKILL.md, defensive judge

Five categories of fixes surfaced by the /ship pre-landing reviews
(testing + maintainability + security + performance + adversarial Claude),
applied as one review-iteration commit.

Refactor — collapse 5x duplicated judge-assertion block:
- Add assertRecommendationQuality() + RECOMMENDATION_SUBSTANCE_THRESHOLD
  constant to test/helpers/e2e-helpers.ts.
- Plan-format (4 cases) and Phase 4 (1 case) collapse from ~22 lines each
  to a single helper call. Future rubric tweaks land in one place instead
  of five.

Performance — extract Phase 4 slice instead of copying full SKILL.md:
- Phase 4 test fixture now reads office-hours/SKILL.md and writes only the
  AskUserQuestion Format section + Phase 4 section to the tmpdir, per
  CLAUDE.md "extract, don't copy" rule. Verified locally: cost dropped
  from $0.51 → $0.36/run, turn count 8 → 4, latency 50s → 36s. Reduces
  Opus context bloat without weakening the regression check.
- Add `if (!workDir) return` guard to Phase 4 afterAll cleanup so a
  skipped describe block doesn't silently fs.rmSync(undefined) under the
  empty catch.

Defense — judge prompt + output:
- Wrap captured AskUserQuestion text in clearly delimited UNTRUSTED_CONTEXT
  block with explicit instruction to treat its content as data, not commands.
  Cheap defense against the (unlikely but real) injection vector where a
  captured AskUserQuestion contains "Ignore previous instructions" text.
- Bump captured-text budget from 4000 → 8000 chars; real plan-format menus
  with 4 options × ~800 chars exceed 4000 and were silently truncating
  Haiku context mid-option.

Cleanup — abbreviation rule + dead imports + touchfile consistency:
- AUQ → AskUserQuestion in 3 sites (office-hours/SKILL.md.tmpl Phase 4
  footer, two test comments) per the always-write-in-full memory rule.
  Regenerated office-hours/SKILL.md.
- Drop unused `describe`/`test` imports in 2 new test files (only
  describeIfSelected/testConcurrentIfSelected wrappers are used).
- Add `test/skill-e2e-office-hours-phase4.test.ts` to its own touchfile
  entry for consistency with other entries that include their test file.
- Fix misleading comment in fixture test about LLM short-circuiting (it's
  has_because, not commits, that skips the API call).

Verified: build clean, free `bun test` exits 0, fixture test 30/30
expect() calls pass, Phase 4 paid eval passes substance 5 in 36s.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-01 18:40:01 -07:00

Garry Tan

7658179879

test(judge): pin every hedging-regex alternate with a fixture

Coverage audit flagged 5 unpinned alternates in the choice-portion hedging
regex (depends? on, depending, if .+ then, or maybe, whichever). Only "either"
was previously exercised, leaving 5 deterministic regex branches with no
fixture — a typo in any alternate would have shipped silently.

Add one fixture per hedge form. Mix of has-because (LLM call) and
no-because (deterministic-only) cases keeps total Haiku cost at ~$0.015
extra per fixture run while taking branch coverage from 9/14 → 14/14.

Fixture passes 30/30 expect() calls in 20.7s.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-01 18:16:11 -07:00

Garry Tan

dfb68fe88d

test: add fixture-based sanity test for judgeRecommendation rubric

Replaces "manually inject bad text into a captured file and revert the SKILL
template" sabotage testing with deterministic negative coverage: hand-graded
good/bad recommendation strings asserted against the same threshold (>= 4)
the production E2E tests use.

Seven fixtures cover the rubric corners: substance 5 (option-specific +
cross-alternative), substance 4 (option-specific without comparison), substance
~1 (boilerplate "because it's better"), substance ~3 (generic "because it's
faster"), no-because (deterministic skip), no-recommendation (deterministic
skip), and hedging ("either B or C" — fails commits).

Periodic-tier so it doesn't run on every PR but does fire on llm-judge.ts
rubric tweaks. ~$0.04 per run via Haiku 4.5.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-01 14:18:16 -07:00

3 Commits