security: gate domain-skill auto-promote on classifier_score > 0

`browse/src/domain-skill-commands.ts:140` (handleSave) writes
`classifier_score: 0` with the comment "L4 deferred to load-time / sidebar-agent
fills this in on first prompt-injection load." But CLAUDE.md "Sidebar
architecture" documents that sidebar-agent.ts was ripped, and grep for
recordSkillUse + classifierFlagged callers across browse/src/ returns zero hits
outside the module under test.

Net effect: every quarantined skill that survives three benign uses without a
flag (`recordSkillUse(..., classifierFlagged: false)` x3) auto-promotes to
`active` and lands in prompt context wrapped as UNTRUSTED on every subsequent
visit to that host. The L4 score that was supposed to gate the promotion was
never written — the production save path puts 0 on disk and nothing later
updates it.
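
To make the pre-patch walk to `active` concrete, here is a minimal sketch of the old gate as a pure function. The `SkillRow` shape and the value `PROMOTE_THRESHOLD = 3` are assumptions inferred from this message ("three benign uses"), and `recordUsePrePatch` is a hypothetical stand-in, not the real `recordSkillUse`, which is async and reads from disk.

```typescript
// Simplified, hypothetical model of the pre-patch state machine.
type SkillState = 'quarantined' | 'active';

interface SkillRow {
  state: SkillState;
  use_count: number;
  flag_count: number;
  classifier_score: number;
}

// Assumed value; the commit only says promotion happens after three uses.
const PROMOTE_THRESHOLD = 3;

// Pre-patch gate: classifier_score is never consulted.
function recordUsePrePatch(row: SkillRow, classifierFlagged: boolean): SkillRow {
  const use_count = row.use_count + 1;
  const flag_count = row.flag_count + (classifierFlagged ? 1 : 0);
  let state: SkillState = row.state;
  if (state === 'quarantined' && use_count >= PROMOTE_THRESHOLD && flag_count === 0) {
    state = 'active';
  }
  return { ...row, state, use_count, flag_count };
}

// A row shaped like what handleSave puts on disk: quarantined, score 0.
const saved: SkillRow = {
  state: 'quarantined',
  use_count: 0,
  flag_count: 0,
  classifier_score: 0,
};

// Three unflagged uses in a row.
const afterThreeUses = [false, false, false].reduce(recordUsePrePatch, saved);
console.log(afterThreeUses.state); // "active", even though the body was never scored
```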

Threat model: a domain-skill body authored by an agent under the influence of
a poisoned page (the new `gstackInjectToTerminal` PTY path runs no L1-L3
either) would lose its auto-promote barrier after three uses. The exploit
isn't single-step but the bar is exactly N=3 prompt-injection-shaped uses on
a hostile page, which is well within reach.

Fix adds a single condition to the auto-promote gate in `recordSkillUse`:

    if (state === 'quarantined' && useCount >= PROMOTE_THRESHOLD &&
        flagCount === 0 && current.classifier_score > 0) {
      state = 'active';
    }

`classifier_score` is set once at writeSkill and never updated. Production
saves it as 0 (handleSave), so the gate stays closed; existing tests that
explicitly pass `classifierScore: 0.1` still auto-promote (the auto-promote
path is preserved for the day L4 is rewired).
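
The two cases above (production shape vs. test shape) can be sketched with the patched condition pulled out into a predicate. `shouldAutoPromote`, `GateInput`, and `PROMOTE_THRESHOLD = 3` are illustrative assumptions; the actual patch keeps the condition inline in `recordSkillUse`.

```typescript
// Hypothetical extraction of the patched gate, for illustration only.
type SkillState = 'quarantined' | 'active';

interface GateInput {
  state: SkillState;
  useCount: number;
  flagCount: number;
  classifierScore: number;
}

// Assumed value, inferred from "three uses" in this message.
const PROMOTE_THRESHOLD = 3;

function shouldAutoPromote(g: GateInput): boolean {
  return (
    g.state === 'quarantined' &&
    g.useCount >= PROMOTE_THRESHOLD &&
    g.flagCount === 0 &&
    g.classifierScore > 0 // new condition: L4 must have scored the body
  );
}

const base = { state: 'quarantined' as SkillState, useCount: 3, flagCount: 0 };

// Production handleSave shape: score 0, so the gate stays closed.
const productionShape = shouldAutoPromote({ ...base, classifierScore: 0 });
// Existing test shape: explicit non-zero score, so auto-promote still fires.
const scoredShape = shouldAutoPromote({ ...base, classifierScore: 0.1 });

console.log(productionShape, scoredShape); // false true
```

Keeping the check as one extra conjunct means the auto-promote path needs no further change once non-zero scores start arriving.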

Manual promotion via `domain-skill promote-to-global` is unaffected (it goes
through `promoteToGlobal` which has its own state-machine guard at line 337+).

Test: new regression case `does NOT auto-promote when classifier_score is 0
(production handleSave shape)` plants a skill with classifierScore=0 (matches
domain-skill-commands.ts:140), runs three uses without flag, asserts the skill
stays quarantined and readSkill returns null. Negative control: with the patch
reverted, the test fails with `Received: "active"`. With the patch: 15/15 pass.
gus 2026-05-07 23:22:27 -03:00
parent 7b4738bca0
commit 01e584253d
2 changed files with 45 additions and 3 deletions

@@ -291,8 +291,20 @@ export async function writeSkill(input: WriteSkillInput): Promise<DomainSkillRow
  *
  * Auto-promote logic:
  *   - increment use_count
- *   - if use_count >= PROMOTE_THRESHOLD AND flag_count == 0 → state:active
- *   - else stay quarantined with updated counter
+ *   - if use_count >= PROMOTE_THRESHOLD AND flag_count == 0 AND L4 has scored
+ *     the body (classifier_score > 0) → state:active
+ *   - else stay quarantined with updated counter; user must run
+ *     `domain-skill promote-to-global` manually
+ *
+ * The classifier_score > 0 gate is load-bearing: handleSave currently writes
+ * classifier_score=0 with the comment "L4 deferred to load-time / sidebar-agent
+ * fills this in on first prompt-injection load," but sidebar-agent was ripped
+ * (CLAUDE.md "Sidebar architecture") and nothing else updates the score, so
+ * skills authored via the production path never had their body scanned by L4.
+ * Without this gate, three benign uses promote any quarantined skill, including
+ * one written under the influence of a poisoned page, into the prompt context
+ * for every subsequent visit. The gate re-opens automatically the day L4 is
+ * rewired and writeSkill / recordSkillUse start receiving non-zero scores.
  */
 export async function recordSkillUse(host: string, projectSlug: string, classifierFlagged: boolean): Promise<DomainSkillRow | null> {
   const normalized = normalizeHost(host);
@@ -303,7 +315,12 @@ export async function recordSkillUse(host: string, projectSlug: string, classifi
   const useCount = current.use_count + 1;
   const flagCount = current.flag_count + (classifierFlagged ? 1 : 0);
   let state: SkillState = current.state;
-  if (state === 'quarantined' && useCount >= PROMOTE_THRESHOLD && flagCount === 0) {
+  if (
+    state === 'quarantined' &&
+    useCount >= PROMOTE_THRESHOLD &&
+    flagCount === 0 &&
+    current.classifier_score > 0
+  ) {
     state = 'active';
   }
   const updated: DomainSkillRow = {

@@ -106,6 +106,31 @@ describe('domain-skills: state machine (T6)', () => {
       })
     ).rejects.toThrow(/classifier flagged/);
   });
+
+  // domain-skill-commands.ts:140 (handleSave) writes classifier_score=0 with
+  // the comment "L4 deferred to load-time" — but sidebar-agent (the deferred
+  // scanner) was ripped per CLAUDE.md "Sidebar architecture." Without an
+  // explicit gate, three benign uses promote any quarantined skill, including
+  // one authored under a poisoned page, into prompt context permanently.
+  it('does NOT auto-promote when classifier_score is 0 (production handleSave shape)', async () => {
+    const m = await freshImport();
+    await m.writeSkill({
+      host: 'linkedin.com',
+      body: '# LinkedIn',
+      projectSlug: 'test-slug',
+      source: 'agent',
+      classifierScore: 0, // matches domain-skill-commands.ts:140 production path
+    });
+    const after1 = await m.recordSkillUse('linkedin.com', 'test-slug', false);
+    await m.recordSkillUse('linkedin.com', 'test-slug', false);
+    const final = await m.recordSkillUse('linkedin.com', 'test-slug', false);
+    expect(after1?.state).toBe('quarantined');
+    expect(final?.state).toBe('quarantined');
+    expect(final?.use_count).toBe(3);
+    // readSkill returns null for quarantined skills — they don't fire.
+    const read = await m.readSkill('linkedin.com', 'test-slug');
+    expect(read).toBeNull();
+  });
 });

 describe('domain-skills: scope shadowing (T4)', () => {