gstack

Garry Tan 2ac44c432c fix(commands): tighten descriptions for LLM-judge baseline pinning

The skill-llm-eval test "baseline score pinning" failed CI on three retry attempts: judge gave command_reference.actionability=3, baseline demands ≥4. Judge cited 8 specific gaps in COMMAND_DESCRIPTIONS. This commit closes 7 of 8 by tightening the descriptions:

- press: documents that key names are case-sensitive Playwright keys, shows modifier syntax (Shift+Enter, Control+A), links the full key list. Removes the "is this case-sensitive?" guesswork.
- is: documents that <sel> accepts either a CSS selector OR an @ref token from a prior snapshot, and that property values are case-sensitive.
- scroll: documents that there is no --by/--to amount option, points at `js window.scrollTo(0, N)` for pixel-precise scrolling.
- js / eval: clarifies that both run in the same JS sandbox, the difference is just inline expr (js) vs file (eval).
- storage: clarifies sessionStorage is read-only via this command, points at `js sessionStorage.setItem(...)` for the write path.
- chain: walks through how to invoke (pipe a JSON array of arrays to $B chain), confirms it stops at the first error.
- cdp: explains how to discover allowed methods (read cdp-allowlist.ts) + shows a concrete example invocation.
- domain-skill: explains that the "classifier flag" is set automatically by the L4 prompt-injection scan (agents do not set it manually); enumerates the full lifecycle verbs.

The 8th gap (storage set syntax conflict) is also resolved as part of the storage rewrite.

Two pipe-character bugs caught by the existing `no command description contains pipe character` guard at `test/gen-skill-docs.test.ts:595`: the chain example originally used `echo '[...]' \| $B chain` (literal pipe) and the cdp description used `tab\|browser` / `trusted\|untrusted` (also literal pipes). Both rewritten to keep markdown table cells intact.

Verification: 696/0 pass on skill-validation + gen-skill-docs after regen across all hosts. The CI llm-judge eval will re-run against the new SKILL.md and should hit actionability ≥4 reliably.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-04-28 02:16:53 -07:00
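
The pipe-character guard mentioned in the commit message can be pictured roughly as follows. This is a minimal sketch, not the actual test at `test/gen-skill-docs.test.ts:595`: the `CommandDescriptions` type, the `findPipeOffenders` helper, and the sample data are all hypothetical, and it assumes descriptions are stored as a plain name-to-string map.

```typescript
// Hypothetical shape: the real COMMAND_DESCRIPTIONS in gstack may be structured differently.
type CommandDescriptions = Record<string, string>;

// Return the names of commands whose description contains a literal "|".
// A pipe inside a description would split the markdown table cell it is rendered into.
// Note that an escaped "\|" still contains "|" and is therefore flagged as well.
function findPipeOffenders(descriptions: CommandDescriptions): string[] {
  return Object.entries(descriptions)
    .filter(([, description]) => description.includes("|"))
    .map(([name]) => name)
    .sort();
}

// Illustrative data only, not the project's actual descriptions.
const sampleDescriptions: CommandDescriptions = {
  chain: "Pipe a JSON array of arrays to $B chain; stops at the first error.",
  cdp: "Scope is tab\\|browser.",
};

const offenders = findPipeOffenders(sampleDescriptions);
if (offenders.length > 0) {
  console.log(`descriptions containing a pipe: ${offenders.join(", ")}`);
}
```

A test built on a helper like this would assert that the returned list is empty, which is why a description has to be reworded to avoid the pipe altogether instead of escaping it.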
bin	feat: multi-agent support — gstack works on Codex, Gemini CLI, and Cursor (v0.9.0) (#226)	2026-03-19 18:20:50 -07:00
scripts	fix: ngrok Windows build + close CI error-swallowing gap (v0.18.0.1) (#1024)	2026-04-16 13:49:04 -07:00
src	fix(commands): tighten descriptions for LLM-judge baseline pinning	2026-04-28 02:16:53 -07:00
test	Merge origin/main into garrytan/browserharness	2026-04-28 01:47:26 -07:00
PLAN-snapshot-dropdown-interactive.md	fix: snapshot -i auto-detects dropdown/popover interactive elements (#845)	2026-04-05 22:57:45 -07:00
SKILL.md	fix(commands): tighten descriptions for LLM-judge baseline pinning	2026-04-28 02:16:53 -07:00
SKILL.md.tmpl	feat(browse): Puppeteer parity — load-html, screenshot --selector, viewport --scale, file:// (v1.1.0.0) (#1062)	2026-04-18 23:25:33 +08:00