mirror of https://github.com/garrytan/gstack.git
Merge PR #1290: align prompt-injection thresholds in CLAUDE.md and ARCHITECTURE.md to security.ts
commit 4fa770459b
@@ -156,7 +156,7 @@ The Chrome sidebar agent has tools (Bash, Read, Glob, Grep, WebFetch) and reads
 
 4. **L5 canary token (`browse/src/security.ts`).** A random token injected into the system prompt at session start. Rolling-buffer detection across `text_delta` and `input_json_delta` streams catches the token if it shows up anywhere in Claude's output, tool arguments, URLs, or file writes. Deterministic BLOCK — if the token leaks, the attacker convinced Claude to reveal the system prompt, and the session ends.
 
-5. **L6 ensemble combiner (`combineVerdict`).** BLOCK requires agreement from two ML classifiers at >= `WARN` (0.60), not a single confident hit. This is the Stack Overflow instruction-writing false-positive mitigation. On tool-output scans, single-layer high confidence BLOCKs directly — the content wasn't user-authored, so the FP concern doesn't apply.
+5. **L6 ensemble combiner (`combineVerdict`).** BLOCK requires agreement from two ML classifiers at >= `WARN` (0.75), not a single confident hit. This is the Stack Overflow instruction-writing false-positive mitigation. On tool-output scans, single-layer high confidence BLOCKs directly — the content wasn't user-authored, so the FP concern doesn't apply.
 
 **Critical constraint:** `security-classifier.ts` runs only in the sidebar-agent process, never in the compiled browse binary. `@huggingface/transformers` v4 requires `onnxruntime-node`, which fails `dlopen` from Bun compile's temp extract directory. Only the pure-string pieces (canary inject/check, verdict combiner, attack log, status) are in `security.ts`, which is safe to import from `server.ts`.
 
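The rolling-buffer detection described for the L5 canary token can be illustrated with a short sketch. This is not the code in `browse/src/security.ts`: the class name `CanaryDetector`, its `push` method, and the sample token are hypothetical, chosen only to show why a buffer is needed when the token may be split across two streamed deltas.

```typescript
// Hypothetical sketch of rolling-buffer canary detection over a chunked stream.
// None of these names come from security.ts.
class CanaryDetector {
  // Trailing characters of the stream seen so far, carried into the next chunk.
  private tail = "";

  constructor(private readonly canary: string) {
    if (canary.length === 0) throw new Error("canary must be non-empty");
  }

  // Feed one streamed chunk. Returns true when the canary is found in this
  // chunk or straddling the boundary with earlier chunks.
  push(chunk: string): boolean {
    const haystack = this.tail + chunk;
    const hit = haystack.includes(this.canary);
    // Keep canary.length - 1 characters: the longest prefix of the canary
    // that could still be completed by the next chunk.
    const keep = this.canary.length - 1;
    this.tail = keep > 0 ? haystack.slice(-keep) : "";
    return hit;
  }
}

// Usage: the token is split across two deltas, so neither chunk contains it alone.
const demoDetector = new CanaryDetector("CANARY-3f9a1c");
const demoLeaked = ["...CANARY-", "3f9a1c..."].some((c) => demoDetector.push(c));
```

A real integration would presumably keep one detector per stream (for example one for `text_delta` and one for `input_json_delta`) so that chunks from different streams are never concatenated; that split is an assumption of this sketch.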
@@ -287,8 +287,12 @@ for `server.ts`. See `~/.gstack/projects/garrytan-gstack/ceo-plans/2026-04-19-pr
 
 **Thresholds** (in `security.ts`):
 - `BLOCK: 0.85` — single-layer score that would cause BLOCK if cross-confirmed
-- `WARN: 0.60` — cross-confirm threshold. When L4 AND L4b both >= 0.60 → BLOCK
+- `WARN: 0.75` — cross-confirm threshold. When L4 AND L4b both >= 0.75 → BLOCK
 - `LOG_ONLY: 0.40` — gates transcript classifier (skip Haiku when all layers < 0.40)
+- `SOLO_CONTENT_BLOCK: 0.92` — single-layer threshold for label-less content classifiers
+(testsavant, deberta). Intentionally higher than `BLOCK` because these layers can't
+distinguish "this is an injection" from "this looks like phishing aimed at the user."
+The transcript classifier keeps a separate, label-gated solo path at `BLOCK` (0.85).
 
 **Ensemble rule:** BLOCK only when the ML content classifier AND the transcript
 classifier both report >= WARN. Single-layer high confidence degrades to WARN —
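As a rough illustration of how the thresholds and the ensemble rule fit together, here is a minimal sketch. It takes `WARN` as 0.75, the later of the two values shown in the diff. It is not the `combineVerdict` in `security.ts`: the function name `combineVerdictSketch`, the verdict strings, the `toolOutput` flag, and the choice to apply `SOLO_CONTENT_BLOCK` on the tool-output path are all assumptions made for illustration, and the label-gated solo path for the transcript classifier is left out.

```typescript
// Threshold values are the ones listed in the diff; every other name is hypothetical.
const THRESHOLDS = {
  BLOCK: 0.85,
  WARN: 0.75,
  LOG_ONLY: 0.4,
  SOLO_CONTENT_BLOCK: 0.92,
} as const;

type Verdict = "block" | "warn" | "log_only" | "safe";

interface LayerScores {
  content: number;    // ML content classifier score in [0, 1]
  transcript: number; // transcript classifier score in [0, 1]
}

function combineVerdictSketch(
  scores: LayerScores,
  opts: { toolOutput?: boolean } = {},
): Verdict {
  const { content, transcript } = scores;
  const top = Math.max(content, transcript);

  // Ensemble rule: BLOCK only when both layers reach WARN.
  if (content >= THRESHOLDS.WARN && transcript >= THRESHOLDS.WARN) return "block";

  // Assumed tool-output path: a single very confident content score blocks on
  // its own, because the scanned text was not written by the user.
  if (opts.toolOutput && content >= THRESHOLDS.SOLO_CONTENT_BLOCK) return "block";

  // A lone confident layer on user-authored input degrades to WARN.
  if (top >= THRESHOLDS.WARN) return "warn";
  if (top >= THRESHOLDS.LOG_ONLY) return "log_only";
  return "safe";
}
```

Under these assumptions, a pair of scores at 0.70 no longer blocks once `WARN` is 0.75, whereas with a 0.60 cross-confirm threshold the same pair would have blocked.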