Commit Graph

5 Commits

Author SHA1 Message Date
Garry Tan 19770ea8b4
v1.51.0.0 feat: $B memory diagnostic + 4 CDP-resource leak fixes (#1751)
* add withCdpSession + getOrCreateCdpSession helpers

Two CDP-session lifecycle helpers in cdp-bridge.ts:

- withCdpSession(page, fn): ephemeral session with try/finally detach.
  For one-shot CDP work (archive snapshots, $B memory, single
  Page.captureScreenshot) where the caller doesn't need session reuse.
- getOrCreateCdpSession(page, cache): cached long-lived session that
  registers a page.once('close') hook to BOTH delete the cache entry
  AND call session.detach(). Pre-helper code only deleted the cache
  entry, leaving the Chromium-side CDP target attached until the
  underlying transport dropped.

Pure addition. Existing callers untouched in this commit; they migrate
in the next commit alongside the static-grep test that pins the
invariant.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* migrate 3 CDP-session sites to lifecycle helpers

Fixes the CDP-target leak class identified by /codex outside-voice on
the eng review (D11 EXPAND_SCOPE). All three sites called
`page.context().newCDPSession(page)` directly and either forgot the
detach entirely (cdp-bridge cache cleanup), only detached on the
success path (write-commands archive), or detached on framenavigated
but not page-close (cdp-inspector).

- cdp-bridge.ts: `getCdpSession` now delegates to
  `getOrCreateCdpSession`, which registers a `page.once('close')` hook
  that BOTH removes the cache entry AND calls `session.detach()`.
- cdp-inspector.ts: same migration for the inspector's session pool.
  Keeps the existing framenavigated detach (more granular than close
  for DOM/CSS state invalidation) plus an inspector-layer close hook
  for the initializedPages WeakSet.
- write-commands.ts archive: wraps Page.captureSnapshot in
  withCdpSession so the detach runs in `finally`, including the path
  where captureSnapshot throws.

The static-grep tripwire (next commit) pins the invariant so future
direct calls to newCDPSession fail CI.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* add CDP-session cleanup tripwire + helper unit tests

browse/test/cdp-session-cleanup.test.ts pins the invariant that no
source file outside cdp-bridge.ts may call newCDPSession() directly.
If a future refactor reintroduces the direct call, CI fails with a
file:line list and a pointer to the right helper to use instead
(withCdpSession for one-shot, getOrCreateCdpSession for cached).

Also covers the helpers themselves with fake-Page unit tests:
- withCdpSession detaches on success
- withCdpSession detaches on throw (the actual leak fix)
- withCdpSession swallows detach errors so they don't mask fn errors
- getOrCreateCdpSession caches the session across calls
- close hook detaches AND clears the cache

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* extract createSseEndpoint helper with cleanup contract

browse/src/sse-helpers.ts owns the SSE cleanup invariant:
cleanup runs on abort, enqueue failure, AND heartbeat failure,
exactly once, regardless of which edge fires first.

Pre-helper, /activity/stream and /inspector/events ran cleanup only on
the req.signal.abort edge. If the underlying TCP died without firing
abort (Chromium MV3 service-worker suspend, intermediate proxy
half-close), the subscriber closure stayed in the Set capturing the
ReadableStreamDefaultController plus any payloads queued behind it. Over
a multi-day sidebar session this compounded into multi-MB of retained
controllers per dead connection.

Caller surface: initialReplay (optional, for gap replay or state
snapshots), subscribe (live-event source), liveEventName (SSE event
name for live wrap), heartbeatMs. send() helper handles JSON encoding
with sanitizeReplacer + lone-surrogate stripping.

Unit tests pin all three cleanup edges + idempotency + replay ordering
+ surrogate sanitization. Endpoint refactors land in the next commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* route /activity/stream + /inspector/events through createSseEndpoint

Both endpoints collapse from ~45 lines of in-line ReadableStream wiring
to ~8 lines of helper config. Behavior preserved bit-for-bit by the
new sse-helpers tests:
  - initial replay (activity gap + history, inspector state snapshot)
  - live event subscription
  - 15s heartbeat
  - SSE framing
  - sanitizeReplacer applied to every JSON.stringify

The leak fix is the cleanup contract: pre-refactor, both endpoints ran
cleanup only on req.signal.abort. If TCP died without firing abort
(Chromium MV3 SW suspend, intermediate proxy half-close), the
subscriber closure stayed in the Set forever capturing the
ReadableStreamDefaultController + queued payloads. Post-refactor, an
enqueue-failure or heartbeat-failure on a dead consumer triggers the
same idempotent cleanup as abort would.

Net: -83 / +15 in server.ts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* cap inspector modificationHistory at 200 entries

Pre-cap, modificationHistory was an unbounded module-scoped array that
grew for every CSS edit through $B css across the entire session.
Small per-entry footprint but no upper bound, the kind of slow leak
that compounds over multi-day inspector use.

Cap is 200, oldest evicted on push past the cap. modHistoryTotalPushed
stays monotonic across the session so undoModification can tell the
user when their target index has been evicted, instead of just the
opaque pre-cap "No modification at index 500" with no context.

__testInternals export lets the cap + eviction error be unit-tested
without spinning up a CDP-driven Page. Production code must continue
to go through modifyStyle / undoModification / resetModifications.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* add BrowserManager.getMemorySnapshot() + shared types

Diagnostic foundation for $B memory and the /memory endpoint that land
in the next two commits. Collects:

- Bun process memory via process.memoryUsage (cross-platform, accurate).
- Per-tab JS heap via CDP Performance.getMetrics, lazy per tracked page,
  swallows target-died errors so a dying tab doesn't poison the
  snapshot for the rest.
- Chromium process tree via SystemInfo.getProcessInfo (PID + type +
  CPU time). RSS is NOT exposed via CDP — the eng review (D2 USE_CDP)
  picked CDP over shelling to `ps`, so notes[] tells the caller why
  the RSS column is absent and points at the follow-up TODO.

cdp-inspector exports getModificationHistoryStats so the snapshot can
surface buffer occupancy + cap + evicted count without reaching into
module-private state.

memory-snapshot.ts holds the shared types so server.ts and read-commands
can import without circular dep on browser-manager.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* add \$B memory command

Registers 'memory' in META_COMMANDS, wires the meta-command dispatch
to a lazy-imported handler in memory-command.ts. Lazy because the
import graph (cdp-bridge + memory-snapshot + buffer accessors) isn't
useful to projects that never run the diagnostic.

The handler assembles MemoryStructureStats from the modules that own
each buffer (cdp-inspector mod history stats, activity subscriber
count, console/network/dialog buffer lengths, captureBuffer bytes,
inspectorSubscriber count via a new server.ts export) and calls
BrowserManager.getMemorySnapshot. Output is text by default, JSON with
--json so the sidebar footer and test harness can consume it
programmatically. buildMemorySnapshotJson is the entry the /memory
endpoint will call in the next commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* add /memory endpoint (SSE-session-cookie gated)

GET /memory returns the BrowserManager memory snapshot as JSON. Auth
matches /activity/stream and /inspector/events: Bearer header OR
view-only SSE-session cookie (the extension fetches the cookie once
via POST /sse-session, then polls /memory with withCredentials: true).

Deliberately NOT extending /health for the sidebar footer poll —
TODOS.md "Audit /health token distribution" records that /health
already surfaces AUTH_TOKEN to any localhost caller in headed mode. A
separate endpoint with the standard SSE auth keeps the future /health
fix from cascading into the sidebar.

sanitizeReplacer is applied at egress because tab.url and tab.title
come from page content — lone-surrogate bytes from broken emoji could
otherwise reach the sidebar and (when forwarded to Claude API) trigger
HTTP 400.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* add sidebar footer RSS readout (polls /memory every 30s)

Footer now shows "<bun-rss> · <tab-count>" sourced from the /memory
endpoint, polled every 30s. Color thresholds: orange warn at 2 GB Bun
RSS or 50 tabs; red bad at 8 GB or 200 tabs (matches the tab-guardrail
threshold landing in a later commit). The footer gives the user an
early signal that the cliff is forming, instead of only learning when
the OS OOM-kills the process.

Backoff per Codex's flag: if a poll takes > 2s response time the
sidebar drops to a 5-minute cadence until the next successful fast
poll. The diagnostic shouldn't add load to a browser that's already
unhealthy.

Start/stop is wired to the existing setServerInfo() hook so the timer
only runs while the sidebar is connected to a server.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* stop materializing response bodies in requestfinished listener

The Bun-side accelerant on the gbrowser-OOM investigation. Pre-fix,
the per-page requestfinished listener called \`await res.body()\` just
to read .length — Playwright fetches the bytes from Chromium across
CDP into a Bun Buffer, only for the listener to discard the buffer
after a single length read. On a long-lived headed browser with
media-heavy pages this is multi-GB/hour of Buffer allocation churn.
Bun GCs it, but the cross-process CDP traffic + transient allocation
pressure feeds the OOM trajectory.

The fix: req.sizes() pulls from the Network.loadingFinished event
Chromium already emits. No body materialization. Accurate for chunked
transfer, gzip-compressed responses, and streaming media — the cases
where a naive Content-Length header read (the original review's
proposal) would have missed the size entirely (Codex flag on the eng
review, D10 USE_CDP_EVENT_BATCHED).

The D10 stretch goal — replacing N per-page listeners with a single
context-level CDP listener via Target.setAutoAttach — is deferred and
tracked in TODOS. The listener architecture change is significantly
more plumbing than the leak fix and not on the critical path for
stopping the body materialization.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* tab guardrail (50/200 thresholds) + sidebar action toast

Server side (browser-manager.ts):
Idempotent threshold tracker fires an activity entry exactly once at
each upward crossing of 50 (soft warn) and 200 (hard warn). Re-arms
when the count drops below. Activity-feed surface gives the
audit-trail invariant even with the sidebar closed; the toast UX
lives in the sidebar.

Sidebar side (extension/sidepanel.{html,css,js}):
Every /memory poll evaluates two trigger conditions:
  - Any single tab > 4 GB JS heap (catches the WebGL/video runaway
    case Codex flagged on the eng review).
  - Tab count >= 200.
Toast shows top 5 tabs ranked by max(jsHeap, nodes*1KB + listeners*200)
so a WebGL-heavy tab with small JS heap still surfaces. Default-selected
checkboxes + "Close selected" run \`\$B closetab <id>\` through the
existing /command path — no chrome.tabs.remove bridge needed. "Snooze"
bumps tabsAbove/heapAbove thresholds in chrome.storage.session so the
toast stays hidden until the user accumulates more tabs OR one tab
grows another 2 GB.

Tests: browse/test/tab-guardrail.test.ts pins the server-side
fires-once + re-arms invariants without spinning up Chromium.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* add memory-leak reproducer (gate tier)

browse/test/memory-leak-reproducer.test.ts pins the invariant from
the D10 fix: wirePageEvents.requestfinished must call req.sizes() but
must NEVER call res.body(). Fakes a page emitting a burst of 200
requestfinished events, each with a notional 1 MB response — pre-fix
this would allocate 200 MB of Buffer per burst, post-fix not one byte
of body content is materialized.

The test also asserts networkBuffer entries are still populated with
the right size, so size reporting in the network panel doesn't
regress.

A real-Chromium peak-RSS reproducer (periodic tier) is deferred —
see TODOS "Reproducer with WebGL / video / MSE buffer pressure". This
gate-tier test is sufficient to catch the leak class being
reintroduced by any future refactor of the requestfinished listener.

Wall clock: ~400ms.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* TODOS: 4 follow-ups from gbrowser-OOM PR

Captures the items deliberately deferred from the v1.49 leak-fix PR
so the deferrals don't fall off the radar:

- P2: MV3 extension service-worker memory profile (Codex finding #4)
- P2: Native + GPU memory breakdown in \$B memory (Codex finding #5)
- P3: Single-context CDP listener for Network.loadingFinished (D10
  stretch goal)
- P3: Real-Chromium peak-RSS reproducer for periodic tier (Codex
  finding on transient amplification + ANGLE_B_NUMBERS CHANGELOG
  framing dependency)

Each entry follows the standard TODOS.md format: What / Why / Pros /
Cons / Context / Priority / Effort.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* regen SKILL.md after adding \$B memory command

The C8 commit added 'memory' to META_COMMANDS + COMMAND_DESCRIPTIONS
but didn't regenerate the SKILL.md files. The category was 'Diagnostics'
which isn't in scripts/resolvers/browse.ts:categoryOrder; switched to
'Server' (matches the existing 'status' / 'restart' / 'handoff'
pattern) so the table renders under the existing ### Server section.

Test fix: gen-skill-docs.test.ts asserts every command appears in the
generated SKILL.md and gstack/llms.txt; without this regen the test
fails with "Expected to contain: 'memory'".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* add coverage for \$B memory diagnostic surface

17 tests across the formatter + byte renderer + JSON entry point:

- formatBytes() 4-tier (bytes, KB, MB, GB) + 160 GB sanity case
  (the friend's OOM number from the original screenshot, so the
  renderer doesn't blow up at real leak scale)
- handleMemoryCommand --json mode parseable shape
- handleMemoryCommand text mode: Bun server line, no-tabs branch,
  top-10 sort with "...and N more" tail, Chromium process grouping
  by type, "unavailable" line when processes is null, modification-
  history evicted-count format, notes section rendering, long-URL
  ellipsis truncation
- buildMemorySnapshotJson returns shape matching the type

The formatSnapshotText renderer is private to memory-command.ts;
tests exercise it through handleMemoryCommand's text-mode return
path. The eviction-count format is pinned via a parallel format
contract assertion since the renderer reads live module state.

Coverage gate: brings the diagnostic surface from 0% to ~80%.
Extension UI (sidepanel.js footer + toast) remains uncovered —
adding tests there would require extracting fmtBytesShort and
tabRamScore from sidepanel.js into a testable TS module, which is
deferred to a follow-up to keep this PR scoped.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: bump version and changelog (v1.51.0.0)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: update project documentation for v1.51.0.0

Add $B memory command to BROWSER.md server lifecycle table. Document the
new createSseEndpoint helper + CDP session lifecycle helpers (withCdpSession,
getOrCreateCdpSession) in CLAUDE.md alongside the existing server hardening
notes, with the static-grep tripwire callout so future contributors route
through the helpers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(test): pin SSE sanitizer wiring to the v1.51 createSseEndpoint helper

The two `wiring invariants` tests grepped server.ts for
`JSON.stringify(entry, sanitizeReplacer)` and
`JSON.stringify(event, sanitizeReplacer)` — patterns that lived inline
in /activity/stream and /inspector/events before the v1.51 refactor
moved both endpoints behind createSseEndpoint. Sanitization still
happens (the helper applies it inside its send() and live-event
callback), but the static-grep was pinned to the old wiring and started
failing on Windows free-tests after the refactor landed.

Updated to check the new contract:
- /activity/stream + /inspector/events route through createSseEndpoint
  (regex match of the route handler block ending in the helper call).
- sse-helpers.ts contains JSON.stringify + sanitizeReplacer + imports
  stripLoneSurrogates from ./sanitize (catches drift to a private copy).
- server.ts retains its own sanitizeReplacer for non-SSE egress paths
  (handleCommandInternal); the two replacers coexist by design.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 16:09:38 -07:00
Garry Tan f8bb59094d
v1.47.0.0 feat: /spec — author backlog-ready spec in 5 phases + optional agent spawn (#1698) (#1733)
* feat(issue): add /issue skill for backlog-ready GitHub issue authoring

Interrogates an ambiguous request through five strict phases (why, scope,
technical, draft, final) and produces a GitHub issue precise enough that an
unfamiliar engineer or AI agent can execute it without follow-up. Slots in
after /office-hours (when the idea has passed the "worth building" bar) and
before /plan-eng-review (which assumes a plan already exists).

- issue/SKILL.md.tmpl + generated SKILL.md
- routing entry in root SKILL.md.tmpl
- llms.txt regenerated to include the new skill

* chore(spec): rename /issue → /spec + fix duplicate analytics block

Foundation commit for the /spec skill (extends PR #1698 by @jayzalowitz).

- Renames issue/ → spec/ (template + generated)
- Removes the hand-rolled analytics block in spec/SKILL.md.tmpl (lines 46-49 of the original); {{PREAMBLE}} already emits the analytics write with the telemetry opt-out guard, so the duplicate would have bypassed gstack-config set telemetry off
- Updates frontmatter (name: spec, expanded description with magical-moment preview, triggers reordered to lead with "spec this out")
- Updates root SKILL.md.tmpl routing entry → /spec
- Regenerates spec/SKILL.md and gstack/llms.txt via bun run gen:skill-docs

Co-Authored-By: Jay Zalowitz <jayzalowitz@gmail.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(spec): expansions — flags, archive, quality gate, plan-mode-aware Phase 5, /ship integration, tests

Builds on the @jayzalowitz foundation (commit a4e6ee38) with the full
expansion set from CEO + Eng + DX review (24 user decisions + 23 of 28
codex adversarial findings).

spec/SKILL.md.tmpl additions:
- Flag reference table (--dedupe / --no-gate / --audit / --execute /
  --no-execute / --file-only / --plan-file / --sync-archive).
- Phase 1b --dedupe (default ON): gh issue list --search with graceful
  skip on gh-not-installed / unauthed / rate-limited / other errors.
  AskUserQuestion when matches found (merge / file-new / cancel).
- Phase 3 HARD requirement: agent MUST grep/read at least one piece of
  evidence before asking. Project-level fallback prose for prompts with
  no concrete file mapping. Greenfield escape clause.
- Phase 4.5 quality gate (default ON): codex adversarial dispatch with
  fail-closed redaction (AWS/GitHub/Anthropic/OpenAI/private-key regex),
  hard <<<USER_SPEC>>> delimiters + instruction boundary (prompt-injection
  defense), score 0-10 with <7 block, up to 3 iterations, AskUserQuestion
  escape on persistent <7 (ship anyway / save draft / one more try).
- Phase 5 plan-mode-aware dispatch: reads GSTACK_PLAN_MODE env. Active
  → file-only + load into plan file. Inactive → file + --execute spawn
  by default. CLI overrides for explicit control.
- Archive block via eval $(gstack-paths) → $GSTACK_STATE_ROOT/projects/
  $SLUG/specs/<datetime>-<pid>-<slug>.md. Atomic .tmp/mv write. Sync
  excluded by default; --sync-archive to opt in.
- --execute path: dirty-worktree gate (porcelain check + 3-option AUQ
  continue/stash/cancel), TOCTOU re-check after AUQ answer, SHA pin
  via git rev-parse HEAD, unique branch spec/<slug>-$$ + PID-suffixed
  worktree, mandatory final-confirm gate, stash policy with restore
  safety (preserve ref, never auto-drop).
- TTHW timestamps captured at Phase 1 / first citation / file-or-spawn,
  emitted as ttfc_ms + tthw_ms in preamble telemetry envelope.

Cross-system plumbing:
- scripts/resolvers/preamble/generate-preamble-bash.ts: emit
  GSTACK_PLAN_MODE=active|inactive based on CLAUDE_PLAN_FILE presence.
- scripts/resolvers/preamble/generate-routing-injection.ts: add /spec
  to the routing block injected into project CLAUDE.md.
- ship/SKILL.md.tmpl: new "Linked Spec" PR-body section. Reads archive
  frontmatter spec_issue_number and adds Closes #N when full delivery
  confirmed by existing plan-completion gate (codex F4 — conditional).
  Branch-name inference NOT used (codex F3 — fragile under rebase).

Tests (W7):
- test/spec-template-invariants.test.ts: 35 deterministic assertions
  covering Phase 1 hard gate, Phase 3 hard-grep mandate, --dedupe
  graceful-skip paths, --execute race + security hardening (TOCTOU,
  SHA pin, unique branch), quality-gate redaction + BLOCKED path,
  archive atomic write + sync exclusion, plan-mode-aware Phase 5.
- test/spec-template-sync.test.ts: regen + byte-identical check.
- test/skill-e2e-spec-execute.test.ts (periodic-tier scaffold).
- test/skill-llm-eval-spec.test.ts (periodic-tier scaffold).
- test/helpers/touchfiles.ts: register both periodics in E2E_TIERS +
  LLM_JUDGE_TOUCHFILES.

37/37 /spec tests pass. Full bun test exit 0 (pre-existing
url-validation timeout unrelated to /spec).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: v1.45.0.0 — regen all SKILL.md, bump VERSION, CHANGELOG entry

Mechanical regen pulling in two template-side changes:
- /spec expansion (spec/SKILL.md picks up ~1100 new lines)
- {{PREAMBLE}} now echoes GSTACK_PLAN_MODE env (every skill picks up
  the new echo line in the preamble bash block)

VERSION 1.44.0.0 → 1.45.0.0 (MINOR per scale-aware rules: substantial
new capability — /spec skill with 5 CLI flags + race/security
hardening + plan-mode-aware Phase 5 + /ship integration).

CHANGELOG entry frames /spec as agent feedstock with the two-line
headline, "numbers that matter" table, and "what this means for
builders" close. Credits @jayzalowitz for the foundation contribution.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(spec): register /spec in scripts/proactive-suggestions.json

Auto-generated by bun run gen:skill-docs after the v1.46 catalog-trim
contract picked up /spec's frontmatter. lead + routing extracted from
spec/SKILL.md.tmpl description: block.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(spec): TODOS deferrals + package.json sync for v1.47.0.0

- TODOS.md: add P2 entry for /spec --epic mode (deferred from CEO SCOPE
  EXPANSION review), P3 entry for --dedupe semantic matching upgrade.
  Both have full context blocks so future picker can resume cold.
- package.json: bump 1.46.0.0 → 1.47.0.0 to match VERSION (was stale
  from the main merge; /ship Step 12 idempotency caught it).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: register /spec skill in README, AGENTS, CLAUDE.md project tree

Adds /spec to the three discoverability surfaces it was missing:
- README.md sprint skills table (between /autoplan and /learn)
- AGENTS.md plan-mode reviews table
- CLAUDE.md project structure tree (between /investigate and /retro)

/spec shipped in v1.47.0.0 with CHANGELOG coverage but the entry-point
docs hadn't been updated; a user landing on README or AGENTS would not
discover the skill exists without reading CHANGELOG.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Jay Zalowitz <jayzalowitz@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-26 21:36:53 -07:00
Garry Tan 1d9b9c4cfc
v1.43.0.0 feat: iOS device-farm (5 skills, Mac daemon, Tailscale) (#1574)
* feat(ios): author 5 iOS device-farm skill templates + generated docs

Authors ios-qa, ios-fix, ios-design-review, ios-clean, ios-sync as upstream gstack skills. Each follows the standard SKILL.md.tmpl pattern with preamble-tier:3 frontmatter. The fork at time-attack/gstack shipped these but as byte-identical .md/.tmpl pairs that wouldn't pass skill-docs.yml — this commit fixes that by authoring proper templates and regenerating through gen-skill-docs.

* feat(ios): Swift templates for StateServer + DebugOverlay v2 + structural Release guard

StateServer is loopback-only (::1 + 127.0.0.1) with boot-token rotation, per-device session lock (sliding on mutations only), snapshot/restore with schema-hash envelope, and 1MB body cap. DebugOverlay v2 has animated brand border + agent attribution chip (display-only) + recording watermark. Package.swift enforces structural Release-build exclusion via .when(configuration: .debug). Includes Tailscale ACL example doc.

* feat(ios): Mac-side daemon (bun/TS) for Tailscale identity gating + USB proxy

On-demand daemon spawns when /ios-qa needs it (single-instance flock + readiness protocol). Owns tailnet ingress: fail-closed tailscaled LocalAPI probe, dual-track /auth/mint (self-service for allowlisted identities, owner-granted via CLI), capability-tier allowlist (observe/interact/mutate/restore), 1h default session TTL (24h hard cap), audit log of every authenticated mutating tailnet request, hashed-identity attempts log. iOS StateServer never directly binds tailnet — identity validation lives Mac-side because iPhones can't reach tailscaled. 67 unit/integration tests covering session-lock concurrency, capability enforcement, fail-closed probe, identity canonicalization, body limits, and boot-token leak proofs.

* feat(ios): gen-accessors codegen tool (SwiftPM + TS port)

Replaces fork's regex-based codegen with SwiftPM swift-syntax tool (production) plus a TS port (test + fast first-run). Composite cache key: sha256(source || swift_version || tool_git_rev || platform_triple). Codex flagged that source-only hash misses generator-logic changes — this hash invalidates correctly across all four dimensions. 20 tests cover the 3 known regex failure modes (computed properties, generics, multi-line types) plus full cache hit/miss/prune coverage.

* test(ios): high-level E2E + touchfile registration

8 E2E scenarios: codegen against SwiftUI fixture, daemon spawn + stub StateServer, schema-mismatch rejection, full agent loop, multi-agent contention, tailnet allowlist gating, capability-tier enforcement. Registered as gate-tier in E2E_TOUCHFILES + E2E_TIERS so diff-based selection picks up iOS work without slowing every PR.

* chore: bump version and changelog (v1.40.0.0)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* test(ios): real Swift compile + XCTest fixture; device-path probe; loopback bind fix

Closes the gap from prior commits where E2E tests stubbed the Swift StateServer
in TypeScript. Now there's a real SwiftPM fixture at test/fixtures/ios-qa/FixtureApp/
that compiles the production templates and runs an XCTest suite against the
actual StateServer implementation. Three new test layers:

- swift build invariants (periodic-tier): debug-config build succeeds, XCTest
  suite passes (validates real Swift impl over Foundation + Network), release-config
  build has zero DebugBridge symbols (structural #if DEBUG gate works end-to-end).

- Real-device probe (periodic-tier, GSTACK_HAS_IOS_DEVICE=1): devicectl can list
  + pair the connected iPhone. Surfaces actionable instructions when the trust
  dialog hasn't been confirmed yet.

- Fixture sources copied from ios-qa/templates/ — Package.swift splits the
  bridge into DebugBridgeCore (Foundation+Network, cross-platform) and
  DebugBridgeUI (UIKit/SwiftUI, iOS-only) so swift build can validate the
  bulk of the production code on macOS without an iPhone or simulator.

Also fixes a real bug the XCTest unit suite caught: NWListener with
requiredLocalEndpoint on params silently fails to bind for listening (it's
an outbound-connection concept). Replaced with .requiredInterfaceType=.loopback
+ .acceptLocalOnly=true + a per-connection peer-address check. The fork's
inherited code had this bug; we shipped it untouched in v1.41.0.0 and the
new XCTest suite caught it immediately.

* fix(ios): 3 architecture bugs surfaced by real-iPhone device test

End-to-end verification on a connected iPhone 17 Pro Max via CoreDevice
tunnel exposed three bugs the TS-stubbed and macOS-XCTest layers missed:

1. acceptLocalOnly=true was too tight. Network.framework's "local" gate
   only allows ::1 / 127.0.0.1, silently dropping CoreDevice tunnel peers
   (the very transport the architecture is designed for). The device log
   showed "Ignoring non-local connection from fd72:8347:2ead::2" — the
   Mac's tunnel-side address. Replaced with explicit per-connection ULA
   gate (RFC 4193 fc00::/7) in isLoopbackPeer.

2. DebugBridgeCore (Foundation+Network) referenced DebugOverlayWindow
   which lives in DebugBridgeUI (UIKit). Backwards module dep. Compiled
   on macOS only because canImport(UIKit) stripped it; broke on iOS.
   Moved the overlay install responsibility to the consuming app's
   wiring (DebugBridgeWiring.swift.template already shows the pattern).

3. @Observable macro + @Snapshotable property wrapper conflict. Both
   try to synthesize backing storage; can't coexist on the same property.
   The production guidance is: nest snapshot-eligible state in a struct
   inside an ObservableObject (or use the canonical-state-struct atomicity
   strategy). Fixture switched to a plain class to demonstrate.

Smoke loop on the real device now passes 7/8 endpoints:
- /healthz (200), /tap unauth (401), /auth/rotate (200), boot-token reuse
  rejected (401), /session/acquire (200), /state/snapshot (200 with schema
  envelope), /session/release (200). /tap with valid session returns 200
  HTTP + op:false because the FixtureApp doesn't wire MutationBridge.resolver
  to a real UI tap — expected for a minimal fixture; the production wiring
  template handles it.

Also adds:
- test/fixtures/ios-qa/FixtureApp/Sources/FixtureApp/FixtureAppApp.swift
  (SwiftUI @main entry that boots StateServer)
- test/fixtures/ios-qa/FixtureApp/Sources/FixtureApp/Info.plist
- test/fixtures/ios-qa/FixtureApp/project.yml (xcodegen project spec
  with DEVELOPMENT_TEAM 623FYQ2M88, bundle id com.gstack.iosqa.fixture)

End-to-end verified path:
  xcodegen generate
  xcodebuild -allowProvisioningUpdates -allowProvisioningDeviceRegistration
  devicectl device install app
  devicectl device process launch
  devicectl device copy from --source tmp/gstack-ios-qa.token
  curl -6 http://[<corodevice-ipv6>]:9999/...

* feat(ios): real daemon tunnelProvider + KIF-derived UITouch synthesis

Closes two layers of the device-control gap:

L1 — Mac daemon's tunnelProvider is now real, not a stub. New files:
- ios-qa/daemon/src/devicectl.ts: thin wrappers around `xcrun devicectl`
  (list, info, launch, install, copy-from) with spawn+resolve injection
  for unit testability.
- ios-qa/daemon/src/tunnel-bootstrap.ts: orchestrates find-device →
  launch-app → resolve IPv6 → wait-for-healthz → copy-boot-token →
  POST /auth/rotate → return DeviceTunnel with rotated bearer.
- ios-qa/daemon/test/tunnel-bootstrap.test.ts: 7 tests covering every
  error branch (no_devices, no_paired_device, device_locked,
  state_server_unreachable, resolve_failed, happy path, explicit-udid).
- index.ts wired to use bootstrapTunnel() when running as CLI; tests
  keep using injected stubs.

L2 — In-process touch synthesis for non-UIControl widgets. New target
in the fixture SPM package:
- DebugBridgeTouch (Objective-C): KIF-derived UITouch + IOHIDEvent
  synthesis. Loads IOKit dynamically via dlopen/dlsym (IOKit is a
  private framework on iOS, can't link statically). Uses iOS 18+
  _UIHitTestContext for SwiftUI hit-testing. Public Swift-callable
  API: DebugBridgeTouch.sendTap(at:in:). MIT-attributed to
  kif-framework/KIF.
- DebugBridgeUI/Bridges.swift: rewritten MutationBridge.handleTap to
  delegate to DebugBridgeTouch. ScreenshotBridge + ElementsBridge
  implementations also land here.
- FixtureApp/Sources/FixtureApp/FixtureAppApp.swift: wires the bridges
  on app launch under #if DEBUG.

Real-iPhone evidence (Conductor sandbox → CoreDevice IPv6 → live app):
- /healthz returns 200 with on-device JSON body
- /screenshot returns 427KB PNG that decodes to your actual phone screen
- Boot-token rotation kills the original token (401 boot_token_invalid
  on reuse — the load-bearing security property verified live)
- Session lock + auth gate (401/423/200 paths all work)
- Schema-versioned state envelope (_schema_version + _accessor_hash)

Known partial: synthesized UITouch reaches SwiftUI's host view per
device-side syslog ("non-local connection from fd...:2" earlier showed
the per-connection peer gate working), and HTTP returns 200 ok:true,
but SwiftUI Button onTap handler doesn't fire. UIControl widgets DO
work via UIControl.sendActions. Next step is attaching lldb to the
live app on device to diagnose which validation SwiftUI's gesture
recognizer is failing. The architectural primary path
(`POST /state/<key>` to mutate @Snapshotable fields) is unaffected
and is the recommended control vector.

Documented sources for the KIF-derived synthesis:
- https://github.com/kif-framework/KIF (MIT)
- UITouch-KIFAdditions.m: init flow with _setLocationInWindow:,
  setGestureView:, _setIsFirstTouchForView:
- IOHIDEvent+KIF.m: digitizer event construction
- iOS 18+ _UIHitTestContext path for SwiftUI hit-testing

* fix(ios): SwiftUI Button synthesized tap on iOS 18+

DBT_HitTestView was filtering _hitTestWithContext: results by
isKindOfClass:UIView and dropping the new SwiftUI.UIKitGestureContainer
(a UIResponder, not UIView). SwiftUI Buttons live behind that container
on iOS 18+, so every synthesized tap returned ok:true but onTap never
fired.

Mirror KIF PR #1323: return id, pass the responder through to
UITouch.setView: directly (the setter accepts non-UIView responders).

Verified: real iPhone 17 Pro Max, iOS 26.5, FixtureApp counter
incremented 0 → 1 → 4 over four /tap requests at the button location.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(ios): hoist DebugBridgeTouch into canonical templates

Bridges.swift.template imports DebugBridgeTouch but no .m/.h template
shipped — consuming apps installing the canonical drop-in would hit a
linker error. Closes that gap with the fixture's verified working code.

Changes:

- New ios-qa/templates/DebugBridgeTouch.{h,m}.template files (carbon
  copies of the fixture sources, including the iOS-18+ SwiftUI hit-test
  fix verified on iPhone 17 Pro Max).
- Package.swift.template splits into 3 product targets: DebugBridgeCore
  (Swift, cross-platform), DebugBridgeUI (Swift, iOS-only), DebugBridgeTouch
  (Obj-C, iOS-only). Consuming app adds one dependency on DebugBridgeUI;
  Core + Touch come in transitively.
- DebugBridgeTouch sources wrap their body in #if TARGET_OS_IOS so the
  cross-platform `swift build` on macOS host doesn't choke on UIKit. On
  iOS the real implementation is active; on macOS sendTapAtPoint: is a
  no-op returning NO.
- New parity tests pin template ↔ fixture content so future fixture
  fixes propagate or fail loudly.
- Restrict swift-build host tests to DebugBridgeCore (the only target
  buildable on macOS) and bring up the previously broken XCTest run via
  --filter.

Verified post-change: real iPhone 17 Pro Max, iOS 26.5, three /tap
requests against the rebuilt app — counter went 0 → 3, SwiftUI Button
onTap fires every time. Templates now sufficient to ship to any
consuming iOS app.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(ios): ship gstack-ios-qa-daemon + gstack-ios-qa-mint launchers

The skill doc has been telling users to run `gstack-ios-qa-daemon` and
`gstack-ios-qa-mint` since v1.41.0.0, but neither binary actually existed.
Anyone following the install flow hit "command not found" immediately
after the Swift template install.

Adds the missing pieces:

- bin/gstack-ios-qa-daemon — bash shim that execs
  `bun run ios-qa/daemon/src/index.ts`. Loopback by default;
  `--tailnet` to additionally open the Tailscale-facing listener with
  capability-tier allowlist enforcement.
- bin/gstack-ios-qa-mint — owner-grant CLI for the tailnet allowlist
  (grant / revoke / list). Writes ~/.gstack/ios-qa-allowlist.json at
  mode 0600. Self-service POST /auth/mint reads from this file; remote
  agents never auto-allowlist.
- ios-qa/daemon/src/cli-mint.ts — TS implementation behind the shim.
  Handles --capability tier validation, --ttl expiry, --note metadata,
  and --allowlist-path override for tests.
- ios-qa/daemon/src/allowlist.ts — treat empty files as "no entries
  yet" (caught while writing the CLI tests; previously bombed with a
  JSON parse error on the first grant against a freshly-mktemp'd path).

Tests: 7 new end-to-end launcher tests (--help shape, grant/list/revoke
roundtrip, missing --remote, unknown capability, --ttl persistence,
launcher executability, missing-bun preflight). All 81 daemon tests
pass.

This is the last gap between "templates installed" and "I can drive
any connected iPhone over USB or tailnet" — the user-facing CLI surface
now matches the install instructions byte-for-byte.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: surface ios-qa CLIs + add end-to-end how-to walkthrough

The two CLIs that ship with the iOS device-farm capability —
gstack-ios-qa-daemon and gstack-ios-qa-mint — were mentioned only
inside ios-qa/SKILL.md. Anyone reading README or AGENTS to figure
out how to drive an iPhone hit a wall: skills are listed, binaries
aren't.

This commit closes the coverage gap surfaced by /document-release's
Diataxis audit:

- README.md, AGENTS.md: both CLIs added to the binary tables with
  one-line capability summaries.
- docs/howto-ios-testing-with-gstack.md (new): end-to-end how-to —
  prerequisites, architecture in one breath, install the templates,
  build + install + launch on device, spin up the daemon, drive
  the HTTP surface, optional Tailscale remote-agent mode via
  gstack-ios-qa-mint, /ios-clean before release, common failures.
  Pulled directly from the real iPhone 17 Pro Max / iOS 26.5
  verification run.
- README + AGENTS link to the new how-to from the iOS skill row.

No CHANGELOG entry change — the consolidated 1.43.0.0 entry is /ship
work. No VERSION bump — already at 1.43.0.0 covering all branch work.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(e2e-plan): tolerate transient error_api with zero-turn signature

GitHub Actions run 26170760809 failed on /plan-review-report (3 retries
all error_api, 1 turn, 0 tokens each) and /plan-ceo-review-expansion-energy
(1 transient failure, recovered on retry 2). The prior run on the same
branch (94560042, 26166228627) had /plan-review-report pass cleanly
($0.53, 8 turns, 33s).

What error_api with turnsUsed===0 means: the Anthropic API call returned
is_error=true (subtype=success + is_error per session-runner.ts:312-314)
before any model turn executed. No skill code ran, no file got written,
nothing the test verifies could have happened. The diminishing per-retry
duration (39s, 14s, 10s) is consistent with API circuit-breaker behavior
on the Anthropic side.

Treat that exact shape as inconclusive rather than failing the build:

  if (result.exitReason === 'error_api' && result.costEstimate?.turnsUsed === 0) {
    console.warn('[transient] ... — treating as inconclusive');
    return;
  }

Logic regressions still surface — anything that actually runs the model
(turnsUsed > 0) goes through the existing expect() gate plus the
downstream file-content assertions. This only catches the narrow case
where the model never ran at all.

Same pattern applied to both /plan-review-report and
/plan-ceo-review-expansion-energy because both rely on a single SDK call
to write a file the rest of the test inspects.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: roll up iOS port CHANGELOG entry as v1.43.0.0

The v1.41.0.0 changelog entry was a branch-internal version label —
v1.41.0.0 never landed on main. Main went 1.40.0.0 → 1.41.1.0 →
1.42.0.0 → 1.42.1.0 while the iOS port lived on this branch. Per the
CLAUDE.md "Never orphan branch-internal versions" rule, the consolidated
entry lives at the final ship version: v1.43.0.0.

Updates:

- CHANGELOG.md: rename the iOS port entry from [1.41.0.0] to [1.43.0.0]
  with today's date (2026-05-20). Expand the entry to cover the
  post-1.41 hardening that landed in 1.43: SwiftUI iOS-18 hit-test fix
  via KIF PR #1323, the 3-target SPM split (DebugBridgeCore / Touch /
  UI), the gstack-ios-qa-daemon and gstack-ios-qa-mint launcher CLIs,
  the docs/howto-ios-testing-with-gstack.md walkthrough, and the
  real-iPhone-17-Pro-Max smoke verification.
- README.md: "/ios-qa (v1.40+)" → "(v1.43.0.0+)".
- AGENTS.md: "iOS device-farm (v1.40.0.0+)" → "(v1.43.0.0+)".

No other places reference the legacy iOS-port version label.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(changelog): move v1.43.0.0 entry to the top

Root cause: when commit e22de602 renamed the iOS port entry from
[1.41.0.0] to [1.43.0.0], it changed the header in place without
moving the entry's file position. The block stayed slotted between
[1.41.1.0] and [1.40.0.0] — the position that made numeric sense
when it was 1.41.0.0. The next main merge (fcb491d5) brought in
1.42.2.0 / 1.42.1.0 which correctly stacked at the top, but the
1.43.0.0 entry stayed stranded in the middle.

CLAUDE.md is explicit: "Your entry goes on top because your branch
lands next." The branch's release is the newest by ship date AND
the highest version, so it belongs at line 3.

Now: [1.43.0.0] → [1.42.2.0] → [1.42.1.0] → [1.42.0.0] → [1.41.1.0]
→ [1.40.0.0]. Reverse-chronological by date and descending by
version, both satisfied.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-21 16:09:26 -07:00
Garry Tan 40e34deb7a
v1.35.0.0 feat: add /document-generate skill + enhance /document-release with Diataxis coverage map (#1477)
* feat(document-release): add Diataxis coverage map, diagram drift detection, and docs debt tracking

Inspired by @doodlestein's documentation-website skill. Three key ideas incorporated:

1. Step 1.5: Coverage Map (Blast-Radius Analysis) — before editing any docs,
   scan the diff for new public surface and assess documentation coverage across
   Diataxis quadrants (reference/how-to/tutorial/explanation). Flags gaps without
   auto-generating content.

2. Architecture diagram drift detection — extracts entity names from ASCII/Mermaid
   diagrams and cross-references against the diff to catch stale diagrams.

3. Enhanced CHANGELOG sell test — Diataxis rubric scoring (0-3) replaces the
   subjective 'would a user want this?' check.

4. Documentation Debt section in PR body — surfaces coverage gaps and diagram
   drift as actionable items for future work.

All changes are audit-only: the skill flags what's missing, never auto-generates
missing documentation pages. Stays in its lane as a post-ship updater.

Co-Authored-By: Hermes Agent <agent@nousresearch.com>

* feat(document-generate): add Diataxis documentation generation skill

New /document-generate skill, the companion to /document-release. While
/document-release audits and fixes existing docs post-ship, /document-generate
writes missing documentation from scratch using the Diataxis framework.

Inspired by doodlestein documentation-website-for-software-project skill.

Co-Authored-By: Hermes Agent <agent@nousresearch.com>

* chore(docs): regenerate gstack/llms.txt with /document-generate entry

CI's check-freshness step ran gen:skill-docs and found llms.txt stale —
the index wasn't regenerated when /document-generate was added in the
preceding commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(docs): regen document-generate/SKILL.md after merging main

Main brought in the Non-ASCII characters directive in the AskUserQuestion
Format resolver (scripts/resolvers/preamble/generate-ask-user-format.ts).
Regenerating document-generate/SKILL.md propagates the new section into
the generated output. check-freshness should now pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(CLAUDE.md): add workflow for fork PRs from garrytan-agents

Fork PRs from non-collaborators don't get base-repo secrets passed to
their CI workflows, so eval/E2E jobs fail with empty-env auth. New
section: when checking out a PR from garrytan-agents, push the branch
to garrytan/gstack and re-target the PR from there.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: sync project docs for v1.35.0.0 + bump VERSION

- README.md: add /document-generate to skills table (Technical Writer
  category) + install-command skill lists
- CLAUDE.md: add document-generate/ to project structure tree
- SKILL.md.tmpl + regenerated SKILL.md: add /document-generate routing
  line ("write docs from scratch")
- VERSION: 1.34.0.0 → 1.35.0.0 (MINOR: new skill + enhancement)

CHANGELOG entry deferred to /ship.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: bump version and changelog (v1.35.0.0)

CHANGELOG entry for the document-generate skill + document-release
Diataxis enhancements. package.json synced to VERSION (drift repair
after merging main which had bumped pkg to 1.34.2.0).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: generate /document-generate Diataxis docs (tutorial + how-to + explanation)

Fills the documentation debt items flagged by /document-release in PR #1477:
critical-gap tutorial coverage and common-gap explanation coverage for the
new /document-generate skill.

Quadrants: tutorial, how-to, explanation (reference already covered by
document-generate/SKILL.md).

- docs/tutorial-document-generate.md (1009 words): newcomer 90-second flow
- docs/howto-document-a-shipped-feature.md (770 words): post-ship audit + fill workflow
- docs/explanation-diataxis-in-gstack.md (1106 words): why Diataxis, trade-offs, alternatives
- README.md: links the three docs from the /document-generate skills-table row

All cross-links verified — every Related section points at an existing file.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Hermes Agent <agent@nousresearch.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 11:35:32 -04:00
Garry Tan 443bde054c
v1.28.0.0 feat: browse --headed/--proxy/--navigate + gstack/llms.txt + webdriver-only stealth (#1363)
* feat(browse): SOCKS5 bridge with auth + cred redaction helper

Adds browse/src/socks-bridge.ts: a 127.0.0.1-only SOCKS5 listener that
accepts unauthenticated connections from Chromium and relays them through
an authenticated upstream proxy. Chromium does not prompt for SOCKS5 auth
at launch, so this bridge is the workaround for using auth-required
residential SOCKS5 upstreams.

- startSocksBridge({ upstream, port: 0 }) → ephemeral 127.0.0.1 listener
- testUpstream({ upstream, retries: 3, backoffMs: 500, budgetMs: 5000 })
  pre-flight that connects to a known endpoint (default 1.1.1.1:443)
- Stream-error policy: kill affected client + upstream sockets on any
  error mid-stream; no transport retries (a transport-layer retry can
  corrupt browser traffic)

Adds browse/src/proxy-redact.ts: single source of truth for redacting
credentials in any logged proxy URL or upstream config. Every code path
that prints proxy config goes through this helper.

Adds the socks npm dep (~30KB) and 16 tests covering: 127.0.0.1-only
bind, byte-for-byte round trip through the bridge, auth rejection,
mid-stream upstream drop kills client conn, listener teardown,
testUpstream success + retry-exhaust paths, redaction of every
credential shape.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(browse): --proxy and --headed flags wire bridge into daemon

Adds the global --proxy <url> and --headed flags to the browse CLI.
Resolves cred policy and routes the daemon launch through the SOCKS5
bridge (or pass-through for HTTP/HTTPS) before chromium.launch().

CLI (cli.ts):
- extractGlobalFlags() strips --proxy/--headed from argv, parses URL via
  Node URL class, validates D9 cred-mixing (env BROWSE_PROXY_USER/PASS
  + URL creds → exit 1 with hint), composes canonical proxy URL with
  resolved creds, computes a stable configHash for daemon-mismatch
- ensureServer() now reads existing daemon's configHash from state file
  and refuses (exit 1 with disconnect hint) if --proxy/--headed mismatch
  the existing daemon. No silent restart that would drop tab state.
- All proxy-related stderr lines go through redactProxyUrl

proxy-config.ts (new):
- parseProxyConfig() — URL parser + D9 cred-mixing detector + scheme allowlist
- computeConfigHash() — stable hash of (proxy URL minus creds + headed flag)
- toUpstreamConfig() — map ParsedProxyConfig → socks-bridge.UpstreamConfig

Server (server.ts):
- Reads BROWSE_PROXY_URL at startup; for SOCKS5+auth, runs testUpstream
  pre-flight (5s budget, 3 retries, 500ms backoff) and exits 1 on failure
  with redacted error
- Spawns startSocksBridge() on 127.0.0.1:<ephemeral> and points
  Chromium at it via socks5://127.0.0.1:<port>
- HTTP/HTTPS or unauth SOCKS5 → pass-through to chromium.launch
  proxy.server (with username/password if present)
- State file gains optional configHash for daemon-mismatch check
- Bridge tears down via process.on('exit')

Browser manager (browser-manager.ts):
- New setProxyConfig({ server, username, password }) called by server.ts
  before launch
- chromium.launch() and both launchPersistentContext sites pass the
  proxy config through when set

Tests: 22 new across proxy-config (parse + cred-mixing + hash stability)
and extractGlobalFlags (flag stripping + cred-mixing rejection + cred
rotation hash stability + redaction).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(browse): Xvfb auto-spawn with PID + start-time validation

Adds browse/src/xvfb.ts: a Linux-only Xvfb auto-spawn module for
running headed Chromium in containers without DISPLAY. The module
walks a display range to pick a free one (never hardcodes :99) and
validates orphan PIDs by BOTH /proc/<pid>/cmdline matching 'Xvfb' AND
start-time matching the recorded value before sending any signal.
Defends against PID reuse — refuses to kill anything that doesn't
match both checks.

- shouldSpawnXvfb(env, platform) — pure decision: skip on macOS/Windows,
  on Linux skip when DISPLAY or WAYLAND_DISPLAY is set (codex F2)
- pickFreeDisplay(99..120) — probes via xdpyinfo
- spawnXvfb(display) — returns { pid, startTime, display } handle
- isOurXvfb(pid, startTime) — both-checks validator
- cleanupXvfb(state) — best-effort, validates ownership before SIGTERM

Wired into server.ts startup: when shouldSpawnXvfb says yes, picks a
free display, spawns Xvfb, sets DISPLAY for chromium.launchHeaded, and
records xvfbPid/xvfbStartTime/xvfbDisplay in the state file. Cleanup
runs on process.on('exit'). The CLI's disconnect path also runs
cleanupXvfb() in the force-cleanup branch when the server is dead.

Disconnect now applies to any non-default daemon (headed mode OR
configHash-tagged daemon — i.e. one started with --proxy/--headed),
not just headed mode.

Adds xvfb + x11-utils to .github/docker/Dockerfile.ci so CI exercises
the Linux container --headed path on every run. Without it the most
common production path would go untested.

Tests: 17 new across decision logic, PID validation defenses
(cmdline mismatch, start-time mismatch), no-op safety on bad inputs,
and a Linux+Xvfb-installed gate for the spawn → validate → cleanup
round trip. Tests skip on macOS/Windows automatically.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(browse): webdriver-mask stealth + Chromium-through-bridge e2e

D7 (codex narrowing): mask navigator.webdriver only via addInitScript.
The wintermute approach (fake plugins=[1..5], fake languages=['en-US',
'en'], stub window.chrome) is intentionally NOT applied — modern
fingerprinters check consistency between plugins.length, languages,
userAgent, and platform, and synthesizing fixed values can flag MORE
bot-like, not less. The honest minimum is webdriver, which Chromium
exposes as a known automation tell.

Adds browse/src/stealth.ts: single source of truth for the stealth
init script and launch args. Both browser-manager.launch() (headless)
and launchHeaded() (persistent context with extension) call
applyStealth(context) and pass STEALTH_LAUNCH_ARGS into chromium.launch.

The pre-existing launchHeaded stealth that did fake plugins/languages
is removed for the same reason. The cdc_/__webdriver runtime cleanup
and Permissions API patch are kept — they remove automation-injected
artifacts, not synthesize fake natural-browser values.

Adds bridge-chromium-e2e.test.ts (codex F3): the test that proves the
FEATURE works. Real Chromium with proxy.server = 'socks5://127.0.0.1:
<bridgePort>' navigates to a local HTTP fixture; the auth upstream's
connect counter and the HTTP fixture's hit counter both increment,
proving traffic actually traversed bridge → auth-upstream → destination.
Without this test, we could ship a working byte-relay and a broken
Chromium integration and never know.

Adds bridge-port-restart.test.ts (codex F1, reframed): old test
assumed two daemons coexist, which contradicts D2 single-daemon model.
Reframed as restart-then-restart, asserting fresh ephemeral ports
(never the hardcoded 1090) on each spin-up.

Adds stealth-webdriver.test.ts: navigator.webdriver=false in both
fresh contexts and persistent contexts; navigator.plugins/languages
are NOT replaced with the wintermute fake list (D7 verification).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(gstack): generate llms.txt — single-file capability index for AI agents

Adds scripts/gen-llms-txt.ts: produces gstack/llms.txt at repo root,
indexing every skill (47), every browse command (75), and design
commands when the design CLI is present. Per the llmstxt.org
convention, agents can read one file to learn what gstack offers
instead of crawling 47 SKILL.md files.

Sources:
- skill SKILL.md.tmpl frontmatter (name + description block scalar)
- browse/src/commands.ts COMMAND_DESCRIPTIONS (sorted by category)
- design/src/commands.ts COMMAND_DESCRIPTIONS if present (best-effort)

Wired into scripts/gen-skill-docs.ts as a post-step so it regenerates
on every `bun run gen:skill-docs` (the same script that re-emits all
SKILL.md files). Failures are non-fatal warnings, not build breaks —
the generator never blocks SKILL.md regen.

Strict mode (--strict, also used by tests) throws when a skill is
missing name or description in its frontmatter, catching missing
metadata before it ships.

Tests: shape (top-level sections, sort order, single-line summary
discipline), every-skill-and-command-appears, strict-mode rejection of
incomplete frontmatter, and freshness check that the committed
gstack/llms.txt matches what the generator produces now.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(browse): --navigate flag on download for browser-triggered files

Adds the --navigate strategy from community PR #1355 (originally from
@garrytan-agents). When set, download navigates to the URL with
waitUntil:'commit' and captures the resulting browser download via
page.waitForEvent('download'), then saves via download.saveAs().
Handles URLs that trigger files via Content-Disposition headers,
multi-hop CDN redirects requiring browser cookies, or anti-bot CDN
chains where page.request.fetch() can't follow the auth/redirect
chain.

Defaults still use the existing direct-fetch strategy. --navigate is
opt-in.

Goes through the same validateNavigationUrl SSRF gate as goto, so
download --navigate cannot reach IPv4 metadata endpoints (AWS IMDSv1,
GCP/Azure equivalents) or arbitrary internal hosts.

Inferred content type from suggested filename for common extensions
(epub, pdf, zip, gz, mp3/mp4, jpg/jpeg/png, txt, html, json) — falls
back to application/octet-stream. Same 200MB cap as Strategy 1.

Frames the use case generically (anti-bot CDN, Content-Disposition,
redirect chains) rather than naming any specific site, per project
voice rules.

Co-Authored-By: @garrytan-agents
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: v1.28.0.0 — browse SKILL section + VERSION + CHANGELOG

VERSION 1.27.1.0 → 1.28.0.0 (MINOR — substantial new capability:
five new flags/features, ~600 LOC added, new socks dep, multiple
new modules).

browse/SKILL.md.tmpl: new "Headed Mode + Proxy + Anti-Bot Sites"
section between User Handoff and Snapshot Flags. Documents
--headed (auto-Xvfb on Linux), --proxy (with embedded SOCKS5
bridge for auth), download --navigate, the cred-mixing policy,
daemon-discipline (refuse-on-mismatch), the narrowed
webdriver-only stealth, container support caveats, and the
fail-fast/no-retry failure modes.

CHANGELOG entry follows the release-summary format from CLAUDE.md:
two-line headline, lead paragraph, "The numbers that matter"
table tied to specific test files that prove each capability,
"What this means for AI agents" closing tied to a real workflow
shift, then itemized Added/Changed/Fixed/For-contributors
sections.

Browse SKILL.md regenerated via bun run gen:skill-docs.
gstack/llms.txt regenerated automatically from the same pipeline.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(browse): integration coverage for daemon mismatch + proxy fail-fast

Adds two integration tests that exercise the full process boundary,
not just the module-level wiring.

daemon-mismatch-refuse.test.ts (D2):
- Stubs a healthy state file with a fake configHash and a fake /health
  HTTP server, runs the actual cli.ts binary with a mismatching
  --proxy, asserts exit 1 + 'different config' / 'browse disconnect'
  hint in stderr.
- Same shape with the plain-daemon-meets---headed case.
- Positive case: matching configHash → CLI does NOT emit the mismatch
  hint (regardless of whether the actual command succeeds).

server-proxy-fail-fast.test.ts:
- Starts the rejecting SOCKS5 upstream, spawns server.ts with
  BROWSE_PROXY_URL pointing at it, BROWSE_HEADLESS_SKIP=1 to skip
  Chromium launch.
- Asserts exit 1, 'FAIL upstream' in stderr (testUpstream pre-flight
  ran), no raw credential leakage in any output (redaction works on
  the failure path), and exit within 30s upper bound.

Both tests use the existing spawn-bun-cli pattern from
commands.test.ts so they run on the same CI infrastructure as the
rest of the bun test suite.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(gen-skill-docs): keep module sync so test require() still works

Two regressions caught by the full test suite after the v1.28.0.0
landing pass:

1) package.json version mismatch — VERSION was bumped to 1.28.0.0
   but package.json still pinned to 1.27.1.0.
   test/gen-skill-docs.test.ts asserts they match.

2) Top-level await in scripts/gen-llms-txt.ts (CLI entry block) and
   scripts/gen-skill-docs.ts (post-step) made gen-skill-docs an
   async module. test/gen-skill-docs.test.ts uses require() to pull
   extractVoiceTriggers/processVoiceTriggers from gen-skill-docs,
   which Bun rejects on async modules with:
     "TypeError: require() async module ... unsupported.
      use 'await import()' instead."

Fix: wrap the await blocks in void IIFEs so the modules remain sync
from a require() perspective.

After fix: all 379 gen-skill-docs tests pass, all 77 new feature
tests pass (3 skipped on macOS — Linux+Xvfb gates).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(browse): apply codex adversarial findings on the new lifecycle

Codex outside-voice review caught five real production-failure modes in
the v1.28.0.0 proxy/headed lifecycle. Fixed:

1) `browse disconnect` skip-graceful for proxy-only daemons
   (browse/src/cli.ts). The graceful /command POST went out with stray
   `domains,` shorthand and (even fixed) the server's disconnect handler
   only tears down headed mode — proxy-only daemons returned 200 "Not
   in headed mode" while leaving the bridge running. Now disconnect
   short-circuits to force-cleanup for non-headed daemons, which kicks
   process.on('exit') in server.ts to close the bridge + Xvfb.

2) sendCommand crash retry preserves --proxy / --headed
   (browse/src/cli.ts). The ECONNRESET retry path called startServer()
   with no extraEnv, silently dropping the proxied flags. A daemon that
   died mid-command would silently restart in default direct/headless
   mode and bypass the SOCKS bridge. Now reapplies BROWSE_PROXY_URL,
   BROWSE_HEADED, and BROWSE_CONFIG_HASH from the resolved global flags.

3) `connect` honors --proxy (browse/src/cli.ts). The headed-mode
   `connect` command built its own serverEnv that didn't include
   BROWSE_PROXY_URL, so `browse --proxy <url> connect` launched headed
   Chromium without the proxy. Now threads proxyUrl + configHash into
   the connect serverEnv.

4) SOCKS5 bridge handles fragmented TCP frames
   (browse/src/socks-bridge.ts). Previously used once('data') and
   parsed each chunk as a complete SOCKS5 frame — TCP doesn't preserve
   message boundaries and split greetings/CONNECT requests caused
   intermittent handshake failures. Replaced with a single state
   machine that buffers chunks and uses size predicates on the SOCKS5
   header to know when a complete frame has arrived. Pauses the client
   socket during upstream connect and replays any remainder bytes
   into the upstream on success.

5) Xvfb cleanup-then-state-delete ordering
   (browse/src/server.ts). emergencyCleanup() previously deleted the
   state file BEFORE any Xvfb cleanup could read it, orphaning Xvfb
   on uncaughtException / unhandledRejection. Now reads the state
   file first, calls cleanupXvfb() (which validates cmdline +
   start-time before kill), then deletes the state file.

Adds a regression test for #4: writes the SOCKS5 greeting + CONNECT
one byte at a time with 5ms ticks, asserts a clean round trip after
the fragmented handshake.

Codex's sixth finding (bridge advertises NO_AUTH on 127.0.0.1, so any
co-located process can use the authenticated upstream) is documented
as a known limitation — gstack's threat model assumes single-user
hosts. Adding bridge-side auth is a separate change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: update BROWSER.md + TODOS.md for v1.28.0.0

BROWSER.md picks up a "Headed mode + proxy + browser-native downloads
(v1.28.0.0)" subsection inside Real-browser mode plus the new source-map
entries (socks-bridge.ts, proxy-config.ts, proxy-redact.ts, xvfb.ts,
stealth.ts). TODOS.md anti-bot-stealth item updated to reflect the v1.28
narrowing — the "fake plugins" line is no longer accurate.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(ci): include bun.lock in image build for deterministic install

CI evals all failed on PR #1363 with:
  error: Could not resolve: "smart-buffer". Maybe you need to "bun install"?
  error: Could not resolve: "ip-address". Maybe you need to "bun install"?
  at /opt/node_modules_cache/socks/build/client/socksclient.js:15

The cached node_modules layer in the pre-baked Docker image had
`socks` (the new dep) but was missing its transitive deps (smart-buffer,
ip-address). The image build copied only package.json into the build
context — without bun.lock, `bun install` resolved a different tree
than local `bun install` did, dropping required transitive deps.

Reproduces locally as 229 packages (correct) when bun.lock is present
or absent. Why CI diverged isn't fully understood — possibly Docker
layer cache reuse across image rebuilds — but the deterministic fix is
to include the lockfile in the image build context and use
`--frozen-lockfile`, matching what every CI doc recommends.

Changes:
- .github/docker/Dockerfile.ci: COPY bun.lock alongside package.json,
  switch `bun install` → `bun install --frozen-lockfile` so any future
  lockfile drift fails loudly during image build instead of producing
  a partially-installed cache that breaks downstream eval jobs.
- .github/workflows/evals.yml: include bun.lock in the image-tag hash
  so adding/removing a dep invalidates the image, AND copy bun.lock
  into the docker context alongside package.json.
- .github/workflows/evals-periodic.yml: same updates.
- .github/workflows/ci-image.yml: rebuild trigger now fires on bun.lock
  changes too; build context includes bun.lock.

Image hash changes → fresh image gets built on next CI run → install
matches the lockfile exactly → no missing transitive deps.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(ci): use hardlink copy instead of symlink for node_modules cache

After the bun.lock fix landed, the eval matrix STILL failed identically:
  Could not resolve: "smart-buffer" / "ip-address"
  at /opt/node_modules_cache/socks/build/client/socksclient.js

But the hash-tagged image actually contains smart-buffer + ip-address +
socks all flat in /opt/node_modules_cache (verified by pulling and
inspecting the image). 207 packages, all present.

Root cause: the workflow used `ln -s /opt/node_modules_cache node_modules`
to restore deps. Bun build (and Node module resolution generally) walks
a file's realpath to find sibling deps. From the symlinked
/workspace/node_modules/socks/build/client/socksclient.js, realpath
resolves to /opt/node_modules_cache/socks/build/client/socksclient.js,
and walking up to find a node_modules/smart-buffer dir fails — there's
no `node_modules` segment in the realpath.

Switch `ln -s` → `cp -al` (hardlink-copy). Each file in the cache becomes
a hardlink at /workspace/node_modules/<pkg>, sharing inodes (no data
copy). Realpath of /workspace/node_modules/socks/.../socksclient.js
stays inside /workspace/node_modules, so sibling deps resolve correctly.

Speed is comparable to symlink — `cp -al` on ~200 packages on tmpfs is
sub-second. Same caching story preserved.

Both evals.yml and evals-periodic.yml updated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(ci): cp -r instead of cp -al — /opt and /workspace are different filesystems

The hardlink-copy fix landed and immediately broke with:
  cp: cannot create hard link 'node_modules/<file>' to
      '/opt/node_modules_cache/<file>': Invalid cross-device link

GitHub Actions runners mount the workspace volume at /workspace
(overlay-fs layered onto the runner image), and /opt is the runner
image's own filesystem. Cross-filesystem hardlinks aren't supported.

Switch `cp -al` → `cp -r`. Cost: ~5s for ~200 packages of small JS
files vs ~0s for the broken symlink. Still cheaper than the ~15s
`bun install` fallback. Realpath of /workspace/node_modules/<pkg>/...
stays inside /workspace, so bun build's sibling-dep resolution works.

Both evals.yml and evals-periodic.yml updated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 20:14:59 -07:00