gstack


Garry Tan afc6661f8c test(security): add BrowseSafe-Bench smoke harness (v1 baseline)		2026-04-20 04:50:53 +08:00

200-case smoke test against Perplexity's BrowseSafe-Bench adversarial dataset (3,680 cases, 11 attack types, 9 injection strategies). First run fetches from HF datasets-server in two 100-row chunks and caches to ~/.gstack/cache/browsesafe-bench-smoke/test-rows.json; subsequent runs are hermetic.

V1 baseline (recorded via console.log for regression tracking):
* Detection rate: ~15% at WARN=0.6
* FP rate: ~12%
* Detection > FP rate (non-zero signal separation)

These numbers reflect TestSavantAI alone on a distribution it wasn't trained on. The production ensemble (L4 content + L4b Haiku transcript agreement) filters most FPs; DeBERTa-v3 ensemble is a tracked P2 improvement that should raise detection substantially.

Gates are deliberately loose (sanity checks, not quality bars):
* tp > 0 (classifier fires on some attacks)
* tn > 0 (classifier not stuck-on)
* tp + fp > 0 (classifier fires at all)
* tp + tn > 40% of rows (beats random chance)

Quality gates arrive when the DeBERTa ensemble lands and we can measure 2-of-3 agreement rate against this same bench.

Model cache gate via test.skipIf(!ML_AVAILABLE): first-run CI gracefully skips until the sidebar-agent warmup primes ~/.gstack/models/testsavant-small/. Documented in the test file head comment.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
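
The sanity gates listed in the commit message can be sketched as a small confusion-matrix check. This is an illustration only, not code from the gstack test file: the `Row` and `Confusion` types, the function names, the `score >= threshold` comparison, and the definition of FP rate as fp / (fp + tn) are all assumptions. Only the 0.6 WARN threshold and the four gate conditions come from the commit message.

```typescript
// Hypothetical shapes; the real test file may model rows differently.
interface Row {
  isAttack: boolean; // ground-truth label from the benchmark
  score: number;     // classifier score in [0, 1]
}

interface Confusion {
  tp: number;
  fp: number;
  tn: number;
  fn: number;
}

const WARN = 0.6; // threshold named in the commit message

// Count outcomes, treating score >= threshold as "flagged" (an assumption).
function tally(rows: Row[], threshold: number = WARN): Confusion {
  const c: Confusion = { tp: 0, fp: 0, tn: 0, fn: 0 };
  for (const row of rows) {
    const flagged = row.score >= threshold;
    if (row.isAttack) {
      if (flagged) c.tp++; else c.fn++;
    } else {
      if (flagged) c.fp++; else c.tn++;
    }
  }
  return c;
}

// Share of attack rows that were flagged.
function detectionRate(c: Confusion): number {
  const attacks = c.tp + c.fn;
  return attacks === 0 ? 0 : c.tp / attacks;
}

// Share of benign rows that were flagged (assumed definition).
function falsePositiveRate(c: Confusion): number {
  const benign = c.fp + c.tn;
  return benign === 0 ? 0 : c.fp / benign;
}

// The four loose gates from the commit message.
function passesSanityGates(c: Confusion): boolean {
  const total = c.tp + c.fp + c.tn + c.fn;
  return (
    c.tp > 0 &&                 // fires on some attacks
    c.tn > 0 &&                 // not stuck-on
    c.tp + c.fp > 0 &&          // fires at all
    c.tp + c.tn > 0.4 * total   // accuracy above 40% of rows
  );
}
```

A classifier that flags every row yields tn = 0 and fails the second gate, while one that never fires yields tp = 0 and fails the first, so the gates catch both degenerate cases without asserting any particular quality level.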
bin	feat: multi-agent support — gstack works on Codex, Gemini CLI, and Cursor (v0.9.0) (#226)	2026-03-19 18:20:50 -07:00
scripts	fix: ngrok Windows build + close CI error-swallowing gap (v0.18.0.1) (#1024)	2026-04-16 13:49:04 -07:00
src	fix(security-classifier): truncation + HTML preprocessing	2026-04-20 04:50:53 +08:00
test	test(security): add BrowseSafe-Bench smoke harness (v1 baseline)	2026-04-20 04:50:53 +08:00
PLAN-snapshot-dropdown-interactive.md	fix: snapshot -i auto-detects dropdown/popover interactive elements (#845)	2026-04-05 22:57:45 -07:00
SKILL.md	feat(browse): Puppeteer parity — load-html, screenshot --selector, viewport --scale, file:// (v1.1.0.0) (#1062)	2026-04-18 23:25:33 +08:00
SKILL.md.tmpl	feat(browse): Puppeteer parity — load-html, screenshot --selector, viewport --scale, file:// (v1.1.0.0) (#1062)	2026-04-18 23:25:33 +08:00