First paid run of the 8 tests (commit bdcf2504) surfaced 3 genuine
failures all rooted in two mechanical problems:
1. Over-instructed prompts bypassed the Skill tool.
When the prompt said "Use GSTACK_HOME=X and the bin scripts at
./bin/ to save my state", the agent interpreted that as step-by-step
bash instructions and executed Bash+Write directly — never invoking
the Skill tool. skillCalls(result).includes("context-save") was
always false, so routing assertions failed. The whole point of the
routing test was exactly to prove the Skill tool got called, so
this was invalidating the test.
Fix: minimal slash-command prompts ("/context-save wintermute
progress", "/context-restore", "/context-save list"). Environment
setup moved to the runSkillTest env: param added in 5f316e0e.
2. Assertions were too strict on paraphrased agent output.
legacy-compat required the exact string OLD_CHECKPOINT_SKILL_LEGACYCOMPAT
in output — but the agent loaded the file, summarized it, and the
summary didn't include that marker verbatim. Similarly,
list-all-branches required 3 branch names in prose, but the agent
renders /context-save list as a table where filenames are the
reliable token and branch names may not appear.
Fix: relax assertions to accept multiple forms of evidence.
- legacy-compat: OR of (verbatim marker | title phrase | filename
prefix | branch name | "pre-rename" token) — any one is proof.
- list-all-branches + list-current-branch: check filename timestamp
prefixes (20260101-, 20260202-, 20260303-) which are unique and
unambiguous, instead of prose branch names.
Also bumped round-trip test: maxTurns 20→25, timeout 180s→240s. The
two-step flow (save then restore) needs headroom — one attempt timed
out mid-restore on the prior run, passed on retry.
Relaunched: PID 34131. Monitor armed. Will report whether the 3
previously-failing tests now pass.
First run results (pre-fix):
5/8 final pass (with retries)
3 failures: context-save-routing, legacy-compat, list-all-branches
Total cost: $3.69, 984s wall