Merge remote-tracking branch 'origin/main' into garrytan/test-fail-ownership

# Conflicts:
#	scripts/gen-skill-docs.ts
2026-03-20 16:16:20 -07:00 · 2026-03-20 16:16:20 -07:00 · 0d9116c6e1
parent ed349027f5 9811ed37bf
commit 0d9116c6e1
48 changed files with 2351 additions and 285 deletions
--- a/.agents/skills/gstack-browse/SKILL.md
+++ b/.agents/skills/gstack-browse/SKILL.md
@ -372,7 +372,7 @@ The snapshot is your primary tool for understanding and interacting with pages.
 -s <sel>  --selector              Scope to CSS selector
 -D        --diff                  Unified diff against previous snapshot (first call stores baseline)
 -a        --annotate              Annotated screenshot with red overlay boxes and ref labels
-o <path> --output                Output path for annotated screenshot (default: /tmp/browse-annotated.png)
+-o <path> --output                Output path for annotated screenshot (default: <temp>/browse-annotated.png)
 -C        --cursor-interactive    Cursor-interactive elements (@c refs — divs with pointer, onclick)
 ```
--- a/.agents/skills/gstack-office-hours/SKILL.md
+++ b/.agents/skills/gstack-office-hours/SKILL.md
@ -232,6 +232,25 @@ success/error/abort, and `USED_BROWSE` with true/false based on whether `$B` was
 If you cannot determine the outcome, use "unknown". This runs in the background and
 never blocks the user.
 ## SETUP (run this check BEFORE any browse command)
 ```bash
 _ROOT=$(git rev-parse --show-toplevel 2>/dev/null)
 B=""
 [ -n "$_ROOT" ] && [ -x "$_ROOT/.agents/skills/gstack/browse/dist/browse" ] && B="$_ROOT/.agents/skills/gstack/browse/dist/browse"
 [ -z "$B" ] && B=~/.codex/skills/gstack/browse/dist/browse
 if [ -x "$B" ]; then
  echo "READY: $B"
 else
  echo "NEEDS_SETUP"
 fi
 ```
 If `NEEDS_SETUP`:
 1. Tell the user: "gstack browse needs a one-time build (~10 seconds). OK to proceed?" Then STOP and wait.
 2. Run: `cd <SKILL_DIR> && ./setup`
 3. If `bun` is not installed: `curl -fsSL https://bun.sh/install | bash`
 # YC Office Hours
 You are a **YC office hours partner**. Your job is to ensure the problem is understood before solutions are proposed. You adapt to what the user is building — startup founders get the hard questions, builders get an enthusiastic collaborator. This skill produces design docs, not code.
@ -496,6 +515,66 @@ Present via AskUserQuestion. Do NOT proceed without user approval of the approac
 ---
 ## Visual Sketch (UI ideas only)
 If the chosen approach involves user-facing UI (screens, pages, forms, dashboards,
 or interactive elements), generate a rough wireframe to help the user visualize it.
 If the idea is backend-only, infrastructure, or has no UI component — skip this
 section silently.
 **Step 1: Gather design context**
 1. Check if `DESIGN.md` exists in the repo root. If it does, read it for design
   system constraints (colors, typography, spacing, component patterns). Use these
   constraints in the wireframe.
 2. Apply core design principles:
   - **Information hierarchy** — what does the user see first, second, third?
   - **Interaction states** — loading, empty, error, success, partial
   - **Edge case paranoia** — what if the name is 47 chars? Zero results? Network fails?
   - **Subtraction default** — "as little design as possible" (Rams). Every element earns its pixels.
   - **Design for trust** — every interface element builds or erodes user trust.
 **Step 2: Generate wireframe HTML**
 Generate a single-page HTML file with these constraints:
 - **Intentionally rough aesthetic** — use system fonts, thin gray borders, no color,
  hand-drawn-style elements. This is a sketch, not a polished mockup.
 - Self-contained — no external dependencies, no CDN links, inline CSS only
 - Show the core interaction flow (1-3 screens/states max)
 - Include realistic placeholder content (not "Lorem ipsum" — use content that
  matches the actual use case)
 - Add HTML comments explaining design decisions
 Write to a temp file:
 ```bash
 SKETCH_FILE="/tmp/gstack-sketch-$(date +%s).html"
 ```
 **Step 3: Render and capture**
 ```bash
 $B goto "file://$SKETCH_FILE"
 $B screenshot /tmp/gstack-sketch.png
 ```
 If `$B` is not available (browse binary not set up), skip the render step. Tell the
 user: "Visual sketch requires the browse binary. Run the setup script to enable it."
 **Step 4: Present and iterate**
 Show the screenshot to the user. Ask: "Does this feel right? Want to iterate on the layout?"
 If they want changes, regenerate the HTML with their feedback and re-render.
 If they approve or say "good enough," proceed.
 **Step 5: Include in design doc**
 Reference the wireframe screenshot in the design doc's "Recommended Approach" section.
 The screenshot file at `/tmp/gstack-sketch.png` can be referenced by downstream skills
 (`/plan-design-review`, `/design-review`) to see what was originally envisioned.
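The temp-file and render steps above can be sketched as two small helpers. This is a minimal illustration under stated assumptions, not gstack's implementation: `sketch_path` and `render_sketch` are hypothetical names, and the only browse commands used are the `goto` and `screenshot` calls shown in Step 3.

```shell
# Sketch only: sketch_path and render_sketch are illustrative helpers, not gstack commands.

# Build a unique wireframe path under the platform temp dir (falls back to /tmp).
sketch_path() {
  local tmp="${TMPDIR:-/tmp}"
  tmp="${tmp%/}"   # drop a trailing slash so the path has no double separator
  printf '%s/gstack-sketch-%s.html\n' "$tmp" "$(date +%s)"
}

# Render only when the browse binary is usable; otherwise skip the render step.
# $1 = browse binary path (may be empty), $2 = wireframe HTML, $3 = screenshot path.
render_sketch() {
  local browse="$1" html="$2" shot="$3"
  if [ -z "$browse" ] || [ ! -x "$browse" ]; then
    echo "SKIP_RENDER"
    return 0
  fi
  "$browse" goto "file://$html" && "$browse" screenshot "$shot"
}
```

A caller might run `SKETCH_FILE=$(sketch_path)` and then `render_sketch "$B" "$SKETCH_FILE" /tmp/gstack-sketch.png`, telling the user about the setup script whenever `SKIP_RENDER` is printed.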
 ---
 ## Phase 4.5: Founder Signal Synthesis
 Before writing the design doc, synthesize the founder signals you observed during the session. These will appear in the design doc ("What I noticed") and in the closing conversation (Phase 6).
@ -632,7 +711,73 @@ Supersedes: {prior filename — omit this line if first design on this branch}
 {observational, mentor-like reflections referencing specific things the user said during the session. Quote their words back to them — don't characterize their behavior. 2-4 bullets.}
 ```
-Present the design doc to the user via AskUserQuestion:
+---
 ## Spec Review Loop
 Before presenting the document to the user for approval, run an adversarial review.
 **Step 1: Dispatch reviewer subagent**
 Use the Agent tool to dispatch an independent reviewer. The reviewer has fresh context
 and cannot see the brainstorming conversation — only the document. This ensures genuine
 adversarial independence.
 Prompt the subagent with:
 - The file path of the document just written
 - "Read this document and review it on 5 dimensions. For each dimension, note PASS or
  list specific issues with suggested fixes. At the end, output a quality score (1-10)
  across all dimensions."
 **Dimensions:**
 1. **Completeness** — Are all requirements addressed? Missing edge cases?
 2. **Consistency** — Do parts of the document agree with each other? Contradictions?
 3. **Clarity** — Could an engineer implement this without asking questions? Ambiguous language?
 4. **Scope** — Does the document creep beyond the original problem? YAGNI violations?
 5. **Feasibility** — Can this actually be built with the stated approach? Hidden complexity?
 The subagent should return:
 - A quality score (1-10)
 - PASS if no issues, or a numbered list of issues with dimension, description, and fix
 **Step 2: Fix and re-dispatch**
 If the reviewer returns issues:
 1. Fix each issue in the document on disk (use Edit tool)
 2. Re-dispatch the reviewer subagent with the updated document
 3. Maximum 3 iterations total
 **Convergence guard:** If the reviewer returns the same issues on consecutive iterations
 (the fix didn't resolve them or the reviewer disagrees with the fix), stop the loop
 and persist those issues as "Reviewer Concerns" in the document rather than looping
 further.
 If the subagent fails, times out, or is unavailable — skip the review loop entirely.
 Tell the user: "Spec review unavailable — presenting unreviewed doc." The document is
 already written to disk; the review is a quality bonus, not a gate.
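The control flow of Step 2 can be sketched as a bounded loop. This is a rough illustration, assuming two hypothetical hooks that are not part of gstack: `run_reviewer` stands in for the subagent dispatch (it prints the issue list, prints nothing on PASS, and exits non-zero on failure or timeout), and `apply_fixes` stands in for the Edit-tool fixes.

```shell
# Sketch only: run_reviewer and apply_fixes are hypothetical hooks supplied by the caller.
review_loop() {
  local doc="$1" max=3 i=0 prev="" issues=""
  while [ "$i" -lt "$max" ]; do
    i=$((i + 1))
    # Reviewer failed or timed out: skip the loop, the review is not a gate.
    if ! issues=$(run_reviewer "$doc"); then
      echo "UNAVAILABLE after $i"
      return 0
    fi
    # No issues reported means PASS.
    if [ -z "$issues" ]; then
      echo "PASS after $i"
      return 0
    fi
    # Same issues as the previous round: stop instead of looping further.
    if [ "$issues" = "$prev" ]; then
      echo "CONVERGENCE_GUARD after $i"
      return 0
    fi
    prev="$issues"
    # Fix only when another review round is still available.
    if [ "$i" -lt "$max" ]; then
      apply_fixes "$doc" "$issues"
    fi
  done
  echo "MAX_ITERATIONS after $max"
}
```

Comparing the raw issue text between rounds is the simplest possible convergence check; a real reviewer may reword the same concern, so treat the string comparison as a placeholder for whatever matching is appropriate.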
 **Step 3: Report and persist metrics**
 After the loop completes (PASS, max iterations, or convergence guard):
 1. Tell the user the result — summary by default:
   "Your doc survived N rounds of adversarial review. M issues caught and fixed.
   Quality score: X/10."
   If they ask "what did the reviewer find?", show the full reviewer output.
 2. If issues remain after max iterations or convergence, add a "## Reviewer Concerns"
   section to the document listing each unresolved issue. Downstream skills will see this.
 3. Append metrics:
 ```bash
 mkdir -p ~/.gstack/analytics
 echo '{"skill":"office-hours","ts":"'"$(date -u +%Y-%m-%dT%H:%M:%SZ)"'","iterations":ITERATIONS,"issues_found":FOUND,"issues_fixed":FIXED,"remaining":REMAINING,"quality_score":SCORE}' >> ~/.gstack/analytics/spec-review.jsonl 2>/dev/null || true
 ```
 Replace ITERATIONS, FOUND, FIXED, REMAINING, SCORE with actual values from the review.
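One way to avoid hand-editing the placeholders is to build the JSONL line from arguments. The helper below is a sketch under assumptions: `spec_review_metrics` is a hypothetical name, not a gstack command, and it rejects non-integer values so a leftover placeholder such as `SCORE` never reaches the log.

```shell
# Sketch only: spec_review_metrics is an illustrative helper, not part of gstack.
# Usage: spec_review_metrics SKILL ITERATIONS FOUND FIXED REMAINING SCORE
spec_review_metrics() {
  local skill="$1" iterations="$2" found="$3" fixed="$4" remaining="$5" score="$6"
  local n
  # Every metric must be a non-negative integer; anything else is rejected.
  for n in "$iterations" "$found" "$fixed" "$remaining" "$score"; do
    case "$n" in
      ''|*[!0-9]*)
        echo "non-numeric metric: $n" >&2
        return 1
        ;;
    esac
  done
  # Emit one JSON object per line with a UTC timestamp.
  printf '{"skill":"%s","ts":"%s","iterations":%s,"issues_found":%s,"issues_fixed":%s,"remaining":%s,"quality_score":%s}\n' \
    "$skill" "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
    "$iterations" "$found" "$fixed" "$remaining" "$score"
}
```

For example, `spec_review_metrics office-hours 2 3 3 0 8 >> ~/.gstack/analytics/spec-review.jsonl 2>/dev/null || true` appends one line in the same shape as the snippet above.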
 ---
 Present the reviewed design doc to the user via AskUserQuestion:
 - A) Approve — mark Status: APPROVED and proceed to handoff
 - B) Revise — specify which sections need changes (loop back to revise those sections)
 - C) Start over — return to Phase 2
--- a/.agents/skills/gstack-plan-ceo-review/SKILL.md
+++ b/.agents/skills/gstack-plan-ceo-review/SKILL.md
@ -338,6 +338,37 @@ DESIGN=$(ls -t ~/.gstack/projects/$SLUG/*-$BRANCH-design-*.md 2>/dev/null | head
 ```
 If a design doc exists (from `/office-hours`), read it. Use it as the source of truth for the problem statement, constraints, and chosen approach. If it has a `Supersedes:` field, note that this is a revised design.
 ## Prerequisite Skill Offer
 When the design doc check above prints "No design doc found," offer the prerequisite
 skill before proceeding.
 Say to the user via AskUserQuestion:
 > "No design doc found for this branch. `/office-hours` produces a structured problem
 > statement, premise challenge, and explored alternatives — it gives this review much
 > sharper input to work with. Takes about 10 minutes. The design doc is per-feature,
 > not per-product — it captures the thinking behind this specific change."
 Options:
 - A) Run /office-hours first (in another window, then come back)
 - B) Skip — proceed with standard review
 If they skip: "No worries — standard review. If you ever want sharper input, try
 /office-hours first next time." Then proceed normally. Do not re-offer later in the session.
 **Mid-session detection:** During Step 0A (Premise Challenge), if the user can't
 articulate the problem, keeps changing the problem statement, answers with "I'm not
 sure," or is clearly exploring rather than reviewing — offer `/office-hours`:
 > "It sounds like you're still figuring out what to build — that's totally fine, but
 > that's what /office-hours is designed for. Want to pause this review and run
 > /office-hours first? It'll help you nail down the problem and approach, then come
 > back here for the strategic review."
 Options: A) Yes, run /office-hours first. B) No, keep going.
 If they keep going, proceed normally — no guilt, no re-asking.
 When reading TODOS.md, specifically:
 * Note any TODOs this plan touches, blocks, or unlocks
 * Check if deferred work from prior reviews relates to this plan
@ -481,6 +512,70 @@ Repo: {owner/repo}
 Derive the feature slug from the plan being reviewed (e.g., "user-dashboard", "auth-refactor"). Use the date in YYYY-MM-DD format.
 After writing the CEO plan, run the spec review loop on it:
 ## Spec Review Loop
 Before presenting the document to the user for approval, run an adversarial review.
 **Step 1: Dispatch reviewer subagent**
 Use the Agent tool to dispatch an independent reviewer. The reviewer has fresh context
 and cannot see the brainstorming conversation — only the document. This ensures genuine
 adversarial independence.
 Prompt the subagent with:
 - The file path of the document just written
 - "Read this document and review it on 5 dimensions. For each dimension, note PASS or
  list specific issues with suggested fixes. At the end, output a quality score (1-10)
  across all dimensions."
 **Dimensions:**
 1. **Completeness** — Are all requirements addressed? Missing edge cases?
 2. **Consistency** — Do parts of the document agree with each other? Contradictions?
 3. **Clarity** — Could an engineer implement this without asking questions? Ambiguous language?
 4. **Scope** — Does the document creep beyond the original problem? YAGNI violations?
 5. **Feasibility** — Can this actually be built with the stated approach? Hidden complexity?
 The subagent should return:
 - A quality score (1-10)
 - PASS if no issues, or a numbered list of issues with dimension, description, and fix
 **Step 2: Fix and re-dispatch**
 If the reviewer returns issues:
 1. Fix each issue in the document on disk (use Edit tool)
 2. Re-dispatch the reviewer subagent with the updated document
 3. Maximum 3 iterations total
 **Convergence guard:** If the reviewer returns the same issues on consecutive iterations
 (the fix didn't resolve them or the reviewer disagrees with the fix), stop the loop
 and persist those issues as "Reviewer Concerns" in the document rather than looping
 further.
 If the subagent fails, times out, or is unavailable — skip the review loop entirely.
 Tell the user: "Spec review unavailable — presenting unreviewed doc." The document is
 already written to disk; the review is a quality bonus, not a gate.
 **Step 3: Report and persist metrics**
 After the loop completes (PASS, max iterations, or convergence guard):
 1. Tell the user the result — summary by default:
   "Your doc survived N rounds of adversarial review. M issues caught and fixed.
   Quality score: X/10."
   If they ask "what did the reviewer find?", show the full reviewer output.
 2. If issues remain after max iterations or convergence, add a "## Reviewer Concerns"
   section to the document listing each unresolved issue. Downstream skills will see this.
 3. Append metrics:
 ```bash
 mkdir -p ~/.gstack/analytics
 echo '{"skill":"plan-ceo-review","ts":"'"$(date -u +%Y-%m-%dT%H:%M:%SZ)"'","iterations":ITERATIONS,"issues_found":FOUND,"issues_fixed":FIXED,"remaining":REMAINING,"quality_score":SCORE}' >> ~/.gstack/analytics/spec-review.jsonl 2>/dev/null || true
 ```
 Replace ITERATIONS, FOUND, FIXED, REMAINING, SCORE with actual values from the review.
 ### 0E. Temporal Interrogation (EXPANSION, SELECTIVE EXPANSION, and HOLD modes)
 Think ahead to implementation: What decisions will need to be made during implementation that should be resolved NOW in the plan?
 ```
@ -912,7 +1007,7 @@ Parse the output. Find the most recent entry for each skill (plan-ceo-review, pl
 - **Eng Review (required by default):** The only review that gates shipping. Covers architecture, code quality, tests, performance. Can be disabled globally with \`gstack-config set skip_eng_review true\` (the "don't bother me" setting).
 - **CEO Review (optional):** Use your judgment. Recommend it for big product/business changes, new user-facing features, or scope decisions. Skip for bug fixes, refactors, infra, and cleanup.
 - **Design Review (optional):** Use your judgment. Recommend it for UI/UX changes. Skip for backend-only, infra, or prompt-only changes.
- **Codex Review (optional):** Independent second opinion from OpenAI Codex CLI. Shows pass/fail gate. Recommend for critical code changes where a second AI perspective adds value. Skip when Codex CLI is not installed.
+- **Codex Review (enabled by default when Codex CLI is installed):** Independent review + adversarial challenge from OpenAI Codex CLI. Shows pass/fail gate. Runs automatically when enabled — configure with \`gstack-config set codex_reviews enabled|disabled\`.
 **Verdict logic:**
 - **CLEARED**: Eng Review has >= 1 entry within 7 days with status "clean" (or \`skip_eng_review\` is \`true\`)
--- a/.agents/skills/gstack-plan-design-review/SKILL.md
+++ b/.agents/skills/gstack-plan-design-review/SKILL.md
@ -542,7 +542,7 @@ Parse the output. Find the most recent entry for each skill (plan-ceo-review, pl
 - **Eng Review (required by default):** The only review that gates shipping. Covers architecture, code quality, tests, performance. Can be disabled globally with \`gstack-config set skip_eng_review true\` (the "don't bother me" setting).
 - **CEO Review (optional):** Use your judgment. Recommend it for big product/business changes, new user-facing features, or scope decisions. Skip for bug fixes, refactors, infra, and cleanup.
 - **Design Review (optional):** Use your judgment. Recommend it for UI/UX changes. Skip for backend-only, infra, or prompt-only changes.
- **Codex Review (optional):** Independent second opinion from OpenAI Codex CLI. Shows pass/fail gate. Recommend for critical code changes where a second AI perspective adds value. Skip when Codex CLI is not installed.
+- **Codex Review (enabled by default when Codex CLI is installed):** Independent review + adversarial challenge from OpenAI Codex CLI. Shows pass/fail gate. Runs automatically when enabled — configure with \`gstack-config set codex_reviews enabled|disabled\`.
 **Verdict logic:**
 - **CLEARED**: Eng Review has >= 1 entry within 7 days with status "clean" (or \`skip_eng_review\` is \`true\`)
--- a/.agents/skills/gstack-plan-eng-review/SKILL.md
+++ b/.agents/skills/gstack-plan-eng-review/SKILL.md
@ -283,6 +283,25 @@ DESIGN=$(ls -t ~/.gstack/projects/$SLUG/*-$BRANCH-design-*.md 2>/dev/null | head
 ```
 If a design doc exists, read it. Use it as the source of truth for the problem statement, constraints, and chosen approach. If it has a `Supersedes:` field, note that this is a revised design — check the prior version for context on what changed and why.
 ## Prerequisite Skill Offer
 When the design doc check above prints "No design doc found," offer the prerequisite
 skill before proceeding.
 Say to the user via AskUserQuestion:
 > "No design doc found for this branch. `/office-hours` produces a structured problem
 > statement, premise challenge, and explored alternatives — it gives this review much
 > sharper input to work with. Takes about 10 minutes. The design doc is per-feature,
 > not per-product — it captures the thinking behind this specific change."
 Options:
 - A) Run /office-hours first (in another window, then come back)
 - B) Skip — proceed with standard review
 If they skip: "No worries — standard review. If you ever want sharper input, try
 /office-hours first next time." Then proceed normally. Do not re-offer later in the session.
 ### Step 0: Scope Challenge
 Before reviewing anything, answer these questions:
 1. **What existing code already partially or fully solves each sub-problem?** Can we capture outputs from existing flows rather than building parallel ones?
@ -512,7 +531,7 @@ Parse the output. Find the most recent entry for each skill (plan-ceo-review, pl
 - **Eng Review (required by default):** The only review that gates shipping. Covers architecture, code quality, tests, performance. Can be disabled globally with \`gstack-config set skip_eng_review true\` (the "don't bother me" setting).
 - **CEO Review (optional):** Use your judgment. Recommend it for big product/business changes, new user-facing features, or scope decisions. Skip for bug fixes, refactors, infra, and cleanup.
 - **Design Review (optional):** Use your judgment. Recommend it for UI/UX changes. Skip for backend-only, infra, or prompt-only changes.
- **Codex Review (optional):** Independent second opinion from OpenAI Codex CLI. Shows pass/fail gate. Recommend for critical code changes where a second AI perspective adds value. Skip when Codex CLI is not installed.
+- **Codex Review (enabled by default when Codex CLI is installed):** Independent review + adversarial challenge from OpenAI Codex CLI. Shows pass/fail gate. Runs automatically when enabled — configure with \`gstack-config set codex_reviews enabled|disabled\`.
 **Verdict logic:**
 - **CLEARED**: Eng Review has >= 1 entry within 7 days with status "clean" (or \`skip_eng_review\` is \`true\`)
--- a/.agents/skills/gstack-review/SKILL.md
+++ b/.agents/skills/gstack-review/SKILL.md
@ -488,54 +488,7 @@ If no documentation files exist, skip this step silently.
 ---
 ## Step 5.7: Codex second opinion (optional)
 After completing the review, check if the Codex CLI is available:
 ```bash
 which codex 2>/dev/null && echo "CODEX_AVAILABLE" || echo "CODEX_NOT_AVAILABLE"
 ```
 If Codex is available, use AskUserQuestion:
 ```
 Review complete. Want an independent second opinion from Codex (OpenAI)?
 A) Run Codex code review — independent diff review with pass/fail gate
 B) Run Codex adversarial challenge — try to find ways this code will fail in production
 C) Both — review first, then adversarial challenge
 D) Skip — no Codex review needed
 ```
 If the user chooses A, B, or C:
 **For code review (A or C):** Run `codex review --base <base>` with a 5-minute timeout.
 Present the full output verbatim under a `CODEX SAYS (code review):` header.
 Check the output for `[P1]` markers — if found, note `GATE: FAIL`, otherwise `GATE: PASS`.
 After presenting, compare Codex's findings with your own review findings from Steps 4-5
 and output a CROSS-MODEL ANALYSIS showing what both found, what only Codex found,
 and what only Claude found.
 **For adversarial challenge (B or C):** Run:
 ```bash
 codex exec "Review the changes on this branch against the base branch. Run git diff origin/<base> to see the diff. Your job is to find ways this code will fail in production. Think like an attacker and a chaos engineer. Find edge cases, race conditions, security holes, failure modes. Be adversarial." -s read-only
 ```
 Present the full output verbatim under a `CODEX SAYS (adversarial challenge):` header.
 **Only if a code review ran (user chose A or C):** Persist the Codex review result to the review log:
 ```bash
 ~/.codex/skills/gstack/bin/gstack-review-log '{"skill":"codex-review","timestamp":"'"$(date -u +%Y-%m-%dT%H:%M:%SZ)"'","status":"STATUS","gate":"GATE"}'
 ```
 Substitute: STATUS ("clean" if PASS, "issues_found" if FAIL), GATE ("pass" or "fail").
 **Do NOT persist a codex-review entry when only the adversarial challenge (B) ran** —
 there is no gate verdict to record, and a false entry would make the Review Readiness
 Dashboard believe a code review happened when it didn't.
 If Codex is not available, skip this step silently.
 ---
 ## Important Rules
--- a/.agents/skills/gstack-ship/SKILL.md
+++ b/.agents/skills/gstack-ship/SKILL.md
@ -309,7 +309,7 @@ Parse the output. Find the most recent entry for each skill (plan-ceo-review, pl
 - **Eng Review (required by default):** The only review that gates shipping. Covers architecture, code quality, tests, performance. Can be disabled globally with \`gstack-config set skip_eng_review true\` (the "don't bother me" setting).
 - **CEO Review (optional):** Use your judgment. Recommend it for big product/business changes, new user-facing features, or scope decisions. Skip for bug fixes, refactors, infra, and cleanup.
 - **Design Review (optional):** Use your judgment. Recommend it for UI/UX changes. Skip for backend-only, infra, or prompt-only changes.
- **Codex Review (optional):** Independent second opinion from OpenAI Codex CLI. Shows pass/fail gate. Recommend for critical code changes where a second AI perspective adds value. Skip when Codex CLI is not installed.
+- **Codex Review (enabled by default when Codex CLI is installed):** Independent review + adversarial challenge from OpenAI Codex CLI. Shows pass/fail gate. Runs automatically when enabled — configure with \`gstack-config set codex_reviews enabled|disabled\`.
 **Verdict logic:**
 - **CLEARED**: Eng Review has >= 1 entry within 7 days with status "clean" (or \`skip_eng_review\` is \`true\`)
@ -949,43 +949,7 @@ For each classified comment:
 ---
 ## Step 3.8: Codex second opinion (optional)
 Check if the Codex CLI is available:
 ```bash
 which codex 2>/dev/null && echo "CODEX_AVAILABLE" || echo "CODEX_NOT_AVAILABLE"
 ```
 If Codex is available, use AskUserQuestion:
 ```
 Pre-landing review complete. Want an independent Codex (OpenAI) review before shipping?
 A) Run Codex code review — independent diff review with pass/fail gate
 B) Run Codex adversarial challenge — try to break this code
 C) Skip — ship without Codex review
 ```
 If the user chooses A or B:
 **For code review (A):** Run `codex review --base <base>` with a 5-minute timeout.
 Present the full output verbatim under a `CODEX SAYS:` header. Check for `[P1]` markers
 to determine pass/fail gate. Persist the result:
 ```bash
 ~/.codex/skills/gstack/bin/gstack-review-log '{"skill":"codex-review","timestamp":"TIMESTAMP","status":"STATUS","gate":"GATE"}'
 ```
 If GATE is FAIL, use AskUserQuestion: "Codex found critical issues. Ship anyway?"
 If the user says no, stop. If yes, continue to Step 4.
 **For adversarial (B):** Run codex exec with the adversarial prompt (see /codex skill).
 Present findings. This is informational — does not block shipping.
 If Codex is not available, skip silently. Continue to Step 4.
 ---
 ## Step 4: Version bump (auto-decide)
@ -1226,7 +1190,7 @@ doc updates — the user runs `/ship` and documentation stays current without a
 - **Never skip tests.** If tests fail, stop.
 - **Never skip the pre-landing review.** If checklist.md is unreadable, stop.
 - **Never force push.** Use regular `git push` only.
- **Never ask for confirmation** except for MINOR/MAJOR version bumps and pre-landing review ASK items (batched into at most one AskUserQuestion).
+- **Never ask for trivial confirmations** (e.g., "ready to push?", "create PR?"). DO stop for: version bumps (MINOR/MAJOR), pre-landing review findings (ASK items), Codex critical findings ([P1]), and the one-time Codex adoption prompt.
 - **Always use the 4-digit version format** from the VERSION file.
 - **Date format in CHANGELOG:** `YYYY-MM-DD`
 - **Split commits for bisectability** — each commit = one logical change.
--- a/.agents/skills/gstack/SKILL.md
+++ b/.agents/skills/gstack/SKILL.md
@ -500,7 +500,7 @@ The snapshot is your primary tool for understanding and interacting with pages.
 -s <sel>  --selector              Scope to CSS selector
 -D        --diff                  Unified diff against previous snapshot (first call stores baseline)
 -a        --annotate              Annotated screenshot with red overlay boxes and ref labels
-o <path> --output                Output path for annotated screenshot (default: /tmp/browse-annotated.png)
+-o <path> --output                Output path for annotated screenshot (default: <temp>/browse-annotated.png)
 -C        --cursor-interactive    Cursor-interactive elements (@c refs — divs with pointer, onclick)
 ```
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@ -1,5 +1,47 @@
 # Changelog
 ## [0.9.4.0] - 2026-03-20 — Codex Reviews On By Default
 ### Changed
 - **Codex code reviews now run automatically in `/ship` and `/review`.** No more "want a second opinion?" prompt every time — Codex reviews both your code (with a pass/fail gate) and runs an adversarial challenge by default. First-time users get a one-time opt-in prompt; after that, it's hands-free. Configure with `gstack-config set codex_reviews enabled|disabled`.
 - **All Codex operations use maximum reasoning power.** Review, adversarial, and consult modes all use `xhigh` reasoning effort — when an AI is reviewing your code, you want it thinking as hard as possible.
 - **Codex review errors can't corrupt the dashboard.** Auth failures, timeouts, and empty responses are now detected before logging results, so the Review Readiness Dashboard never shows a false "passed" entry. Adversarial stderr is captured separately.
 - **Codex review log includes commit hash.** Staleness detection now works correctly for Codex reviews, matching the same commit-tracking behavior as eng/CEO/design reviews.
 ### Fixed
 - **Codex-for-Codex recursion prevented.** When gstack runs inside Codex CLI (`.agents/skills/`), the Codex review step is completely stripped — no accidental infinite loops.
 ## [0.9.3.0] - 2026-03-20 — Windows Support
 ### Fixed
 - **gstack now works on Windows 11.** Setup no longer hangs when verifying Playwright, and the browse server automatically falls back to Node.js to work around a Bun pipe-handling bug on Windows ([bun#4253](https://github.com/oven-sh/bun/issues/4253)). Just make sure Node.js is installed alongside Bun. macOS and Linux are completely unaffected.
 - **Path handling works on Windows.** All hardcoded `/tmp` paths and Unix-style path separators now use platform-aware equivalents via a new `platform.ts` module. Path traversal protection works correctly with Windows backslash separators.
 ### Added
 - **Bun API polyfill for Node.js.** When the browse server runs under Node.js on Windows, a compatibility layer provides `Bun.serve()`, `Bun.spawn()`, `Bun.spawnSync()`, and `Bun.sleep()` equivalents. Fully tested.
 - **Node server build script.** `browse/scripts/build-node-server.sh` transpiles the server for Node.js, stubs `bun:sqlite`, and injects the polyfill — all automated during `bun run build`.
 ## [0.9.2.0] - 2026-03-20 — Gemini CLI E2E Tests
 ### Added
 - **Gemini CLI is now tested end-to-end.** Two E2E tests verify that gstack skills work when invoked by Google's Gemini CLI (`gemini -p`). The `gemini-discover-skill` test confirms skill discovery from `.agents/skills/`, and `gemini-review-findings` runs a full code review via gstack-review. Both parse Gemini's stream-json NDJSON output and track token usage.
 - **Gemini JSONL parser with 10 unit tests.** `parseGeminiJSONL` handles all Gemini event types (init, message, tool_use, tool_result, result) with defensive parsing for malformed input. The parser is a pure function, independently testable without spawning the CLI.
 - **`bun run test:gemini`** and **`bun run test:gemini:all`** scripts for running Gemini E2E tests independently. Gemini tests are also included in `test:evals` and `test:e2e` aggregate scripts.
 ## [0.9.1.0] - 2026-03-20 — Adversarial Spec Review + Skill Chaining
 ### Added
 - **Your design docs now get stress-tested before you see them.** When you run `/office-hours`, an independent AI reviewer checks your design doc for completeness, consistency, clarity, scope creep, and feasibility — up to 3 rounds. You get a quality score (1-10) and a summary of what was caught and fixed. The doc you approve has already survived adversarial review.
 - **Visual wireframes during brainstorming.** For UI ideas, `/office-hours` now generates a rough HTML wireframe using your project's design system (from DESIGN.md) and screenshots it. You see what you're designing while you're still thinking, not after you've coded it.
 - **Skills help each other now.** `/plan-ceo-review` and `/plan-eng-review` detect when you'd benefit from running `/office-hours` first and offer it — one-tap to switch, one-tap to decline. If you seem lost during a CEO review, it'll gently suggest brainstorming first.
 - **Spec review metrics.** Every adversarial review logs iterations, issues found/fixed, and quality score to `~/.gstack/analytics/spec-review.jsonl`. Over time, you can see if your design docs are getting better.
 ## [0.9.0.1] - 2026-03-19
 ### Changed
--- a/README.md
+++ b/README.md
@ -42,7 +42,7 @@ Expect first useful run in under 5 minutes on any repo with tests already set up
 ## Install — takes 30 seconds
-**Requirements:** [Claude Code](https://docs.anthropic.com/en/docs/claude-code), [Git](https://git-scm.com/), [Bun](https://bun.sh/) v1.0+
+**Requirements:** [Claude Code](https://docs.anthropic.com/en/docs/claude-code), [Git](https://git-scm.com/), [Bun](https://bun.sh/) v1.0+, [Node.js](https://nodejs.org/) (Windows only)
 ### Step 1: Install on your machine
@ -238,6 +238,8 @@ Data is stored in [Supabase](https://supabase.com) (open source Firebase alterna
 **Stale install?** Run `/gstack-upgrade` — or set `auto_upgrade: true` in `~/.gstack/config.yaml`
 **Windows users:** gstack works on Windows 11 via Git Bash or WSL. Node.js is required in addition to Bun — Bun has a known bug with Playwright's pipe transport on Windows ([bun#4253](https://github.com/oven-sh/bun/issues/4253)). The browse server automatically falls back to Node.js. Make sure both `bun` and `node` are on your PATH.
 **Claude says it can't see the skills?** Make sure your project's `CLAUDE.md` has a gstack section. Add this:
 ```
--- a/SKILL.md
+++ b/SKILL.md
@ -506,7 +506,7 @@ The snapshot is your primary tool for understanding and interacting with pages.
 -s <sel>  --selector              Scope to CSS selector
 -D        --diff                  Unified diff against previous snapshot (first call stores baseline)
 -a        --annotate              Annotated screenshot with red overlay boxes and ref labels
--o <path> --output                Output path for annotated screenshot (default: /tmp/browse-annotated.png)
+-o <path> --output                Output path for annotated screenshot (default: <temp>/browse-annotated.png)
 -C        --cursor-interactive    Cursor-interactive elements (@c refs — divs with pointer, onclick)
 ```
--- a/2
+++ b/2
@ -1 +1 @@
-0.9.0.1
+0.9.4.0
--- a/browse/SKILL.md
+++ b/browse/SKILL.md
@ -378,7 +378,7 @@ The snapshot is your primary tool for understanding and interacting with pages.
 -s <sel>  --selector              Scope to CSS selector
 -D        --diff                  Unified diff against previous snapshot (first call stores baseline)
 -a        --annotate              Annotated screenshot with red overlay boxes and ref labels
--o <path> --output                Output path for annotated screenshot (default: /tmp/browse-annotated.png)
+-o <path> --output                Output path for annotated screenshot (default: <temp>/browse-annotated.png)
 -C        --cursor-interactive    Cursor-interactive elements (@c refs — divs with pointer, onclick)
 ```
--- a/browse/scripts/build-node-server.sh
+++ b/browse/scripts/build-node-server.sh
@ -0,0 +1,48 @@
 #!/usr/bin/env bash
 # Build a Node.js-compatible server bundle for Windows.
 #
 # On Windows, Bun can't launch or connect to Playwright's Chromium
 # (oven-sh/bun#4253, #9911). This script produces a server bundle
 # that runs under Node.js with Bun API polyfills.
 set -e
 GSTACK_DIR="$(cd "$(dirname "$0")/../.." && pwd)"
 SRC_DIR="$GSTACK_DIR/browse/src"
 DIST_DIR="$GSTACK_DIR/browse/dist"
 echo "Building Node-compatible server bundle..."
 # Step 1: Transpile server.ts to a single .mjs bundle (externalize runtime deps)
 bun build "$SRC_DIR/server.ts" \
  --target=node \
  --outfile "$DIST_DIR/server-node.mjs" \
  --external playwright \
  --external playwright-core \
  --external diff \
  --external "bun:sqlite"
 # Step 2: Post-process
 # Replace import.meta.dir with a resolvable reference
 perl -pi -e 's/import\.meta\.dir/__browseNodeSrcDir/g' "$DIST_DIR/server-node.mjs"
 # Stub out bun:sqlite (macOS-only cookie import, not needed on Windows)
 perl -pi -e 's|import { Database } from "bun:sqlite";|const Database = null; // bun:sqlite stubbed on Node|g' "$DIST_DIR/server-node.mjs"
 # Step 3: Create the final file with polyfill header injected after the first line
 {
  head -1 "$DIST_DIR/server-node.mjs"
  echo '// ── Windows Node.js compatibility (auto-generated) ──'
  echo 'import { fileURLToPath as _ftp } from "node:url";'
  echo 'import { dirname as _dn } from "node:path";'
  echo 'const __browseNodeSrcDir = _dn(_dn(_ftp(import.meta.url))) + "/src";'
  echo '{ const _r = createRequire(import.meta.url); _r("./bun-polyfill.cjs"); }'
  echo '// ── end compatibility ──'
  tail -n +2 "$DIST_DIR/server-node.mjs"
 } > "$DIST_DIR/server-node.tmp.mjs"
 mv "$DIST_DIR/server-node.tmp.mjs" "$DIST_DIR/server-node.mjs"
 # Step 4: Copy polyfill to dist/
 cp "$SRC_DIR/bun-polyfill.cjs" "$DIST_DIR/bun-polyfill.cjs"
 echo "Node server bundle ready: $DIST_DIR/server-node.mjs"
--- a/browse/src/bun-polyfill.cjs
+++ b/browse/src/bun-polyfill.cjs
@ -0,0 +1,109 @@
 /**
 * Bun API polyfill for Node.js — Windows compatibility layer.
 *
 * On Windows, Bun can't launch or connect to Playwright's Chromium
 * (oven-sh/bun#4253, #9911). The browse server falls back to running
 * under Node.js with this polyfill providing Bun API equivalents.
 *
 * Loaded via --require before the transpiled server bundle.
 */
 'use strict';
 const http = require('http');
 const { spawnSync, spawn } = require('child_process');
 globalThis.Bun = {
  serve(options) {
    const { port, hostname = '127.0.0.1', fetch } = options;
    const server = http.createServer(async (nodeReq, nodeRes) => {
      try {
        const url = `http://${hostname}:${port}${nodeReq.url}`;
        const headers = new Headers();
        for (const [key, val] of Object.entries(nodeReq.headers)) {
          if (val) headers.set(key, Array.isArray(val) ? val[0] : val);
        }
        let body = null;
        if (nodeReq.method !== 'GET' && nodeReq.method !== 'HEAD') {
          body = await new Promise((resolve) => {
            const chunks = [];
            nodeReq.on('data', (chunk) => chunks.push(chunk));
            nodeReq.on('end', () => resolve(Buffer.concat(chunks)));
          });
        }
        const webReq = new Request(url, {
          method: nodeReq.method,
          headers,
          body,
        });
        const webRes = await fetch(webReq);
        nodeRes.statusCode = webRes.status;
        webRes.headers.forEach((val, key) => {
          nodeRes.setHeader(key, val);
        });
        const resBody = await webRes.arrayBuffer();
        nodeRes.end(Buffer.from(resBody));
      } catch (err) {
        nodeRes.statusCode = 500;
        nodeRes.end(JSON.stringify({ error: err.message }));
      }
    });
    server.listen(port, hostname);
    return {
      stop() { server.close(); },
      port,
      hostname,
    };
  },
  spawnSync(cmd, options = {}) {
    const [command, ...args] = cmd;
    const result = spawnSync(command, args, {
      stdio: [
        options.stdin || 'pipe',
        options.stdout === 'pipe' ? 'pipe' : 'ignore',
        options.stderr === 'pipe' ? 'pipe' : 'ignore',
      ],
      timeout: options.timeout,
      env: options.env,
      cwd: options.cwd,
    });
    return {
      exitCode: result.status,
      stdout: result.stdout || Buffer.from(''),
      stderr: result.stderr || Buffer.from(''),
    };
  },
  spawn(cmd, options = {}) {
    const [command, ...args] = cmd;
    const stdio = options.stdio || ['pipe', 'pipe', 'pipe'];
    const proc = spawn(command, args, {
      stdio,
      env: options.env,
      cwd: options.cwd,
    });
    return {
      pid: proc.pid,
      stdout: proc.stdout,
      stderr: proc.stderr,
      stdin: proc.stdin,
      unref() { proc.unref(); },
      kill(signal) { proc.kill(signal); },
    };
  },
  sleep(ms) {
    return new Promise((resolve) => setTimeout(resolve, ms));
  },
 };
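The header loop in the polyfill's `serve()` keeps only the first entry of any array-valued header and skips empty values. A minimal sketch of that behavior, assuming plain objects instead of the `Headers` class; `flattenHeaders` and `NodeHeaders` are hypothetical names for illustration and are not part of this commit:

```typescript
// Hypothetical helper mirroring the header loop in the polyfill's serve():
// array-valued headers keep only their first entry, and empty values are skipped.
type NodeHeaders = Record<string, string | string[] | undefined>;

function flattenHeaders(nodeHeaders: NodeHeaders): Record<string, string> {
  const flat: Record<string, string> = {};
  for (const [key, val] of Object.entries(nodeHeaders)) {
    if (val) flat[key] = Array.isArray(val) ? val[0] : val;
  }
  return flat;
}

const flat = flattenHeaders({
  "content-type": "application/json",
  "set-cookie": ["a=1", "b=2"], // only "a=1" survives
  "x-empty": undefined,         // dropped entirely
});
console.log(flat);
```

If a caller depends on repeated headers, the second and later values would be lost under this approach.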
--- a/browse/src/cli.ts
+++ b/browse/src/cli.ts
@ -14,7 +14,8 @@ import * as path from 'path';
 import { resolveConfig, ensureStateDir, readVersionHash } from './config';
 const config = resolveConfig();
-const MAX_START_WAIT = 8000; // 8 seconds to start
+const IS_WINDOWS = process.platform === 'win32';
+const MAX_START_WAIT = IS_WINDOWS ? 15000 : 8000; // Node+Chromium takes longer on Windows
 export function resolveServerScript(
  env: Record<string, string | undefined> = process.env,
@ -26,7 +27,9 @@ export function resolveServerScript(
  }
  // Dev mode: cli.ts runs directly from browse/src
-  if (metaDir.startsWith('/') && !metaDir.includes('$bunfs')) {
+  // On macOS/Linux, import.meta.dir starts with /
+  // On Windows, it starts with a drive letter (e.g., C:\...)
+  if (!metaDir.includes('$bunfs')) {
    const direct = path.resolve(metaDir, 'server.ts');
    if (fs.existsSync(direct)) {
      return direct;
@ -48,6 +51,31 @@ export function resolveServerScript(
 const SERVER_SCRIPT = resolveServerScript();
 /**
 * On Windows, resolve the Node.js-compatible server bundle.
 * Falls back to null if not found (server will use Bun instead).
 */
 export function resolveNodeServerScript(
  metaDir: string = import.meta.dir,
  execPath: string = process.execPath
 ): string | null {
  // Dev mode
  if (!metaDir.includes('$bunfs')) {
    const distScript = path.resolve(metaDir, '..', 'dist', 'server-node.mjs');
    if (fs.existsSync(distScript)) return distScript;
  }
  // Compiled binary: browse/dist/browse → browse/dist/server-node.mjs
  if (execPath) {
    const adjacent = path.resolve(path.dirname(execPath), 'server-node.mjs');
    if (fs.existsSync(adjacent)) return adjacent;
  }
  return null;
 }
 const NODE_SERVER_SCRIPT = IS_WINDOWS ? resolveNodeServerScript() : null;
 interface ServerState {
  pid: number;
  port: number;
@ -139,8 +167,14 @@ async function startServer(): Promise<ServerState> {
  // Clean up stale state file
  try { fs.unlinkSync(config.stateFile); } catch {}
-  // Start server as detached background process
-  const proc = Bun.spawn(['bun', 'run', SERVER_SCRIPT], {
+  // Start server as detached background process.
+  // On Windows, Bun can't launch/connect to Playwright's Chromium (oven-sh/bun#4253, #9911).
+  // Fall back to running the server under Node.js with Bun API polyfills.
+  const useNode = IS_WINDOWS && NODE_SERVER_SCRIPT;
+  const serverCmd = useNode
+    ? ['node', NODE_SERVER_SCRIPT]
+    : ['bun', 'run', SERVER_SCRIPT];
+  const proc = Bun.spawn(serverCmd, {
    stdio: ['ignore', 'pipe', 'pipe'],
    env: { ...process.env, BROWSE_STATE_FILE: config.stateFile },
  });
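The selection logic in this hunk can be read as a small pure function: Node is chosen only when the platform is Windows and the Node bundle was actually found; otherwise the Bun path is kept. A minimal sketch under that reading; `pickServerCommand` is a hypothetical name, and the script paths are example values only:

```typescript
// Hypothetical helper mirroring the serverCmd selection in startServer().
function pickServerCommand(
  isWindows: boolean,
  nodeServerScript: string | null,
  serverScript: string,
): string[] {
  // Node is used only when on Windows AND the Node bundle was resolved.
  if (isWindows && nodeServerScript) {
    return ["node", nodeServerScript];
  }
  // Everywhere else, or when the bundle is missing, fall through to Bun.
  return ["bun", "run", serverScript];
}

console.log(pickServerCommand(true, "dist/server-node.mjs", "src/server.ts"));
console.log(pickServerCommand(true, null, "src/server.ts"));
console.log(pickServerCommand(false, "dist/server-node.mjs", "src/server.ts"));
```

Note the second case: on Windows without a built `server-node.mjs`, the command silently stays on Bun.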
--- a/browse/src/meta-commands.ts
+++ b/browse/src/meta-commands.ts
@ -10,13 +10,14 @@ import { validateNavigationUrl } from './url-validation';
 import * as Diff from 'diff';
 import * as fs from 'fs';
 import * as path from 'path';
+import { TEMP_DIR, isPathWithin } from './platform';
 // Security: Path validation to prevent path traversal attacks
-const SAFE_DIRECTORIES = ['/tmp', process.cwd()];
+const SAFE_DIRECTORIES = [TEMP_DIR, process.cwd()];
 export function validateOutputPath(filePath: string): void {
  const resolved = path.resolve(filePath);
-  const isSafe = SAFE_DIRECTORIES.some(dir => resolved === dir || resolved.startsWith(dir + '/'));
+  const isSafe = SAFE_DIRECTORIES.some(dir => isPathWithin(resolved, dir));
  if (!isSafe) {
    throw new Error(`Path must be within: ${SAFE_DIRECTORIES.join(', ')}`);
  }
@ -88,7 +89,7 @@ export async function handleMetaCommand(
    case 'screenshot': {
      // Parse priority: flags (--viewport, --clip) → selector (@ref, CSS) → output path
      const page = bm.getPage();
-      let outputPath = '/tmp/browse-screenshot.png';
+      let outputPath = `${TEMP_DIR}/browse-screenshot.png`;
      let clipRect: { x: number; y: number; width: number; height: number } | undefined;
      let targetSelector: string | undefined;
      let viewportOnly = false;
@ -147,7 +148,7 @@ export async function handleMetaCommand(
    case 'pdf': {
      const page = bm.getPage();
-      const pdfPath = args[0] || '/tmp/browse-page.pdf';
+      const pdfPath = args[0] || `${TEMP_DIR}/browse-page.pdf`;
      validateOutputPath(pdfPath);
      await page.pdf({ path: pdfPath, format: 'A4' });
      return `PDF saved: ${pdfPath}`;
@ -155,7 +156,7 @@ export async function handleMetaCommand(
    case 'responsive': {
      const page = bm.getPage();
-      const prefix = args[0] || '/tmp/browse-responsive';
+      const prefix = args[0] || `${TEMP_DIR}/browse-responsive`;
      validateOutputPath(prefix);
      const viewports = [
        { name: 'mobile', width: 375, height: 812 },
--- a/browse/src/platform.ts
+++ b/browse/src/platform.ts
@ -0,0 +1,17 @@
 /**
 * Cross-platform constants for gstack browse.
 *
 * On macOS/Linux: TEMP_DIR = '/tmp', path.sep = '/'  — identical to hardcoded values.
 * On Windows: TEMP_DIR = os.tmpdir(), path.sep = '\\' — correct Windows behavior.
 */
 import * as os from 'os';
 import * as path from 'path';
 export const IS_WINDOWS = process.platform === 'win32';
 export const TEMP_DIR = IS_WINDOWS ? os.tmpdir() : '/tmp';
 /** Check if resolvedPath is within dir, using platform-aware separators. */
 export function isPathWithin(resolvedPath: string, dir: string): boolean {
  return resolvedPath === dir || resolvedPath.startsWith(dir + path.sep);
 }
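The separator matters in two ways: appending it blocks sibling directories such as `/tmp-evil`, and using the platform's own separator lets the check work on Windows paths. A minimal sketch, assuming a variant that takes the separator as an explicit parameter so both cases can be exercised on any host; `isPathWithinSep`, `naivePrefixCheck`, and the `winTemp` value are hypothetical and not part of this commit (the committed `isPathWithin` uses `path.sep`):

```typescript
// Hypothetical variant of isPathWithin with the separator passed in,
// so POSIX and Windows behavior can both be checked on one machine.
function isPathWithinSep(resolvedPath: string, dir: string, sep: string): boolean {
  return resolvedPath === dir || resolvedPath.startsWith(dir + sep);
}

// A bare prefix check, shown only for contrast.
function naivePrefixCheck(resolvedPath: string, dir: string): boolean {
  return resolvedPath.startsWith(dir);
}

const winTemp = "C:\\Users\\me\\AppData\\Local\\Temp"; // example value only

console.log(isPathWithinSep("/tmp/foo", "/tmp", "/"));               // true
console.log(isPathWithinSep("/tmp-evil/foo", "/tmp", "/"));          // false
console.log(naivePrefixCheck("/tmp-evil/foo", "/tmp"));              // true (the bug being avoided)
console.log(isPathWithinSep(winTemp + "\\shot.png", winTemp, "\\")); // true
console.log(isPathWithinSep(winTemp + "\\shot.png", winTemp, "/"));  // false (hardcoded '/' fails on Windows)
```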
--- a/browse/src/read-commands.ts
+++ b/browse/src/read-commands.ts
@ -10,6 +10,7 @@ import { consoleBuffer, networkBuffer, dialogBuffer } from './buffers';
 import type { Page } from 'playwright';
 import * as fs from 'fs';
 import * as path from 'path';
+import { TEMP_DIR, isPathWithin } from './platform';
 /** Detect await keyword, ignoring comments. Accepted risk: await in string literals triggers wrapping (harmless). */
 function hasAwait(code: string): boolean {
@ -36,12 +37,12 @@ function wrapForEvaluate(code: string): string {
 }
 // Security: Path validation to prevent path traversal attacks
-const SAFE_DIRECTORIES = ['/tmp', process.cwd()];
+const SAFE_DIRECTORIES = [TEMP_DIR, process.cwd()];
 export function validateReadPath(filePath: string): void {
  if (path.isAbsolute(filePath)) {
    const resolved = path.resolve(filePath);
-    const isSafe = SAFE_DIRECTORIES.some(dir => resolved === dir || resolved.startsWith(dir + '/'));
+    const isSafe = SAFE_DIRECTORIES.some(dir => isPathWithin(resolved, dir));
    if (!isSafe) {
      throw new Error(`Absolute path must be within: ${SAFE_DIRECTORIES.join(', ')}`);
    }
--- a/browse/src/snapshot.ts
+++ b/browse/src/snapshot.ts
@ -20,6 +20,7 @@
 import type { Page, Locator } from 'playwright';
 import type { BrowserManager, RefEntry } from './browser-manager';
 import * as Diff from 'diff';
+import { TEMP_DIR, isPathWithin } from './platform';
 // Roles considered "interactive" for the -i flag
 const INTERACTIVE_ROLES = new Set([
@ -61,7 +62,7 @@ export const SNAPSHOT_FLAGS: Array<{
  { short: '-s', long: '--selector', description: 'Scope to CSS selector', takesValue: true, valueHint: '<sel>', optionKey: 'selector' },
  { short: '-D', long: '--diff', description: 'Unified diff against previous snapshot (first call stores baseline)', optionKey: 'diff' },
  { short: '-a', long: '--annotate', description: 'Annotated screenshot with red overlay boxes and ref labels', optionKey: 'annotate' },
-  { short: '-o', long: '--output', description: 'Output path for annotated screenshot (default: /tmp/browse-annotated.png)', takesValue: true, valueHint: '<path>', optionKey: 'outputPath' },
+  { short: '-o', long: '--output', description: 'Output path for annotated screenshot (default: <temp>/browse-annotated.png)', takesValue: true, valueHint: '<path>', optionKey: 'outputPath' },
  { short: '-C', long: '--cursor-interactive', description: 'Cursor-interactive elements (@c refs — divs with pointer, onclick)', optionKey: 'cursorInteractive' },
 ];
@ -308,11 +309,11 @@ export async function handleSnapshot(
  // ─── Annotated screenshot (-a) ────────────────────────────
  if (opts.annotate) {
-    const screenshotPath = opts.outputPath || '/tmp/browse-annotated.png';
+    const screenshotPath = opts.outputPath || `${TEMP_DIR}/browse-annotated.png`;
    // Validate output path (consistent with screenshot/pdf/responsive)
    const resolvedPath = require('path').resolve(screenshotPath);
-    const safeDirs = ['/tmp', process.cwd()];
+    const safeDirs = [TEMP_DIR, process.cwd()];
-    if (!safeDirs.some((dir: string) => resolvedPath === dir || resolvedPath.startsWith(dir + '/'))) {
+    if (!safeDirs.some((dir: string) => isPathWithin(resolvedPath, dir))) {
      throw new Error(`Path must be within: ${safeDirs.join(', ')}`);
    }
    try {
--- a/browse/src/write-commands.ts
+++ b/browse/src/write-commands.ts
@ -10,6 +10,7 @@ import { findInstalledBrowsers, importCookies } from './cookie-import-browser';
 import { validateNavigationUrl } from './url-validation';
 import * as fs from 'fs';
 import * as path from 'path';
+import { TEMP_DIR, isPathWithin } from './platform';
 export async function handleWriteCommand(
  command: string,
@ -277,9 +278,9 @@ export async function handleWriteCommand(
      if (!filePath) throw new Error('Usage: browse cookie-import <json-file>');
      // Path validation — prevent reading arbitrary files
      if (path.isAbsolute(filePath)) {
-        const safeDirs = ['/tmp', process.cwd()];
+        const safeDirs = [TEMP_DIR, process.cwd()];
        const resolved = path.resolve(filePath);
-        if (!safeDirs.some(dir => resolved === dir || resolved.startsWith(dir + '/'))) {
+        if (!safeDirs.some(dir => isPathWithin(resolved, dir))) {
          throw new Error(`Path must be within: ${safeDirs.join(', ')}`);
        }
      }
--- a/browse/test/bun-polyfill.test.ts
+++ b/browse/test/bun-polyfill.test.ts
@ -0,0 +1,72 @@
 import { describe, test, expect, afterAll } from 'bun:test';
 import * as path from 'path';
 // Load the polyfill into a fresh object (don't clobber globalThis.Bun)
 const polyfillPath = path.resolve(import.meta.dir, '../src/bun-polyfill.cjs');
 describe('bun-polyfill', () => {
  // We test the polyfill by requiring it in a subprocess under Node.js
  // since it's designed for Node, not Bun.
  test('Bun.sleep resolves after delay', async () => {
    const result = Bun.spawnSync(['node', '-e', `
      require('${polyfillPath}');
      (async () => {
        const start = Date.now();
        await Bun.sleep(50);
        const elapsed = Date.now() - start;
        console.log(elapsed >= 40 ? 'OK' : 'TOO_FAST');
      })();
    `], { stdout: 'pipe', stderr: 'pipe' });
    expect(result.stdout.toString().trim()).toBe('OK');
    expect(result.exitCode).toBe(0);
  });
  test('Bun.spawnSync runs a command and returns stdout', () => {
    const result = Bun.spawnSync(['node', '-e', `
      require('${polyfillPath}');
      const r = Bun.spawnSync(['echo', 'hello'], { stdout: 'pipe' });
      console.log(r.stdout.toString().trim());
      console.log('exit:' + r.exitCode);
    `], { stdout: 'pipe', stderr: 'pipe' });
    const lines = result.stdout.toString().trim().split('\n');
    expect(lines[0]).toBe('hello');
    expect(lines[1]).toBe('exit:0');
  });
  test('Bun.spawn launches a process with pid', async () => {
    const result = Bun.spawnSync(['node', '-e', `
      require('${polyfillPath}');
      const p = Bun.spawn(['echo', 'test'], { stdio: ['pipe', 'pipe', 'pipe'] });
      console.log(typeof p.pid === 'number' ? 'HAS_PID' : 'NO_PID');
      console.log(typeof p.kill === 'function' ? 'HAS_KILL' : 'NO_KILL');
      console.log(typeof p.unref === 'function' ? 'HAS_UNREF' : 'NO_UNREF');
    `], { stdout: 'pipe', stderr: 'pipe' });
    const lines = result.stdout.toString().trim().split('\n');
    expect(lines[0]).toBe('HAS_PID');
    expect(lines[1]).toBe('HAS_KILL');
    expect(lines[2]).toBe('HAS_UNREF');
  });
  test('Bun.serve creates an HTTP server that responds', async () => {
    const result = Bun.spawnSync(['node', '-e', `
      require('${polyfillPath}');
      const server = Bun.serve({
        port: 0,  // Note: polyfill uses port directly, so we pick one
        hostname: '127.0.0.1',
        fetch(req) {
          return new Response(JSON.stringify({ ok: true }), {
            headers: { 'Content-Type': 'application/json' },
          });
        },
      });
      // The polyfill doesn't support port 0, so we test the object shape
      console.log(typeof server.stop === 'function' ? 'HAS_STOP' : 'NO_STOP');
      console.log(typeof server.port === 'number' ? 'HAS_PORT' : 'NO_PORT');
      server.stop();
    `], { stdout: 'pipe', stderr: 'pipe' });
    const lines = result.stdout.toString().trim().split('\n');
    expect(lines[0]).toBe('HAS_STOP');
    expect(lines[1]).toBe('HAS_PORT');
  });
 });
--- a/browse/test/config.test.ts
+++ b/browse/test/config.test.ts
@ -197,6 +197,36 @@ describe('resolveServerScript', () => {
  });
 });
 describe('resolveNodeServerScript', () => {
  const { resolveNodeServerScript } = require('../src/cli');
  test('finds server-node.mjs in dist from dev mode', () => {
    const srcDir = path.resolve(__dirname, '../src');
    const distFile = path.resolve(srcDir, '..', 'dist', 'server-node.mjs');
    const fs = require('fs');
    // Only test if the file exists (it may not be built yet)
    if (fs.existsSync(distFile)) {
      const result = resolveNodeServerScript(srcDir, '');
      expect(result).toBe(distFile);
    }
  });
  test('returns null when server-node.mjs does not exist', () => {
    const result = resolveNodeServerScript('/nonexistent/$bunfs', '/nonexistent/browse');
    expect(result).toBeNull();
  });
  test('finds server-node.mjs adjacent to compiled binary', () => {
    const distDir = path.resolve(__dirname, '../dist');
    const distFile = path.join(distDir, 'server-node.mjs');
    const fs = require('fs');
    if (fs.existsSync(distFile)) {
      const result = resolveNodeServerScript('/$bunfs/something', path.join(distDir, 'browse'));
      expect(result).toBe(distFile);
    }
  });
 });
 describe('version mismatch detection', () => {
  test('detects when versions differ', () => {
    const stateVersion = 'abc123';
--- a/browse/test/platform.test.ts
+++ b/browse/test/platform.test.ts
@ -0,0 +1,37 @@
 import { describe, test, expect } from 'bun:test';
 import { TEMP_DIR, isPathWithin, IS_WINDOWS } from '../src/platform';
 describe('platform constants', () => {
  test('TEMP_DIR is /tmp on non-Windows', () => {
    if (!IS_WINDOWS) {
      expect(TEMP_DIR).toBe('/tmp');
    }
  });
  test('IS_WINDOWS reflects process.platform', () => {
    expect(IS_WINDOWS).toBe(process.platform === 'win32');
  });
 });
 describe('isPathWithin', () => {
  test('path inside directory returns true', () => {
    expect(isPathWithin('/tmp/foo', '/tmp')).toBe(true);
  });
  test('path outside directory returns false', () => {
    expect(isPathWithin('/etc/foo', '/tmp')).toBe(false);
  });
  test('exact match returns true', () => {
    expect(isPathWithin('/tmp', '/tmp')).toBe(true);
  });
  test('partial prefix does not match (path traversal)', () => {
    // /tmp-evil should NOT match /tmp
    expect(isPathWithin('/tmp-evil/foo', '/tmp')).toBe(false);
  });
  test('nested path returns true', () => {
    expect(isPathWithin('/tmp/a/b/c', '/tmp')).toBe(true);
  });
 });
--- a/codex/SKILL.md
+++ b/codex/SKILL.md
@ -314,13 +314,13 @@ TMPERR=$(mktemp /tmp/codex-err-XXXXXX.txt)
 2. Run the review (5-minute timeout):
 ```bash
-codex review --base <base> -c 'model_reasoning_effort="high"' --enable web_search_cached 2>"$TMPERR"
+codex review --base <base> -c 'model_reasoning_effort="xhigh"' --enable web_search_cached 2>"$TMPERR"
 ```
 Use `timeout: 300000` on the Bash call. If the user provided custom instructions
 (e.g., `/codex review focus on security`), pass them as the prompt argument:
 ```bash
-codex review "focus on security" --base <base> -c 'model_reasoning_effort="high"' --enable web_search_cached 2>"$TMPERR"
+codex review "focus on security" --base <base> -c 'model_reasoning_effort="xhigh"' --enable web_search_cached 2>"$TMPERR"
 ```
 3. Capture the output. Then parse cost from stderr:
@ -475,7 +475,7 @@ THE PLAN:
 For a **new session:**
 ```bash
-codex exec "<prompt>" -s read-only -c 'model_reasoning_effort="high"' --enable web_search_cached --json 2>"$TMPERR" | python3 -c "
+codex exec "<prompt>" -s read-only -c 'model_reasoning_effort="xhigh"' --enable web_search_cached --json 2>"$TMPERR" | python3 -c "
 import sys, json
 for line in sys.stdin:
    line = line.strip()
@ -508,7 +508,7 @@ for line in sys.stdin:
 For a **resumed session** (user chose "Continue"):
 ```bash
-codex exec resume <session-id> "<prompt>" -s read-only -c 'model_reasoning_effort="high"' --enable web_search_cached --json 2>"$TMPERR" | python3 -c "
+codex exec resume <session-id> "<prompt>" -s read-only -c 'model_reasoning_effort="xhigh"' --enable web_search_cached --json 2>"$TMPERR" | python3 -c "
 <same python streaming parser as above>
 "
 ```
@ -544,10 +544,7 @@ Session saved — run /codex again to continue this conversation.
 agentic coding model). This means as OpenAI ships newer models, /codex automatically
 uses them. If the user wants a specific model, pass `-m` through to codex.
-**Reasoning effort** varies by mode — use the right level for each task:
+**Reasoning effort:** All modes use `xhigh` — maximum reasoning power. When reviewing code, breaking code, or consulting on architecture, you want the model thinking as hard as possible.
-- **Review mode:** `high` — thorough but not slow. Diff review benefits from depth but doesn't need maximum compute.
-- **Challenge (adversarial) mode:** `xhigh` — maximum reasoning power. When trying to break code, you want the model thinking as hard as possible.
-- **Consult mode:** `high` — good balance of depth and speed for conversations.
 **Web search:** All codex commands use `--enable web_search_cached` so Codex can look up
 docs and APIs during review. This is OpenAI's cached index — fast, no extra cost.
--- a/codex/SKILL.md.tmpl
+++ b/codex/SKILL.md.tmpl
@ -79,13 +79,13 @@ TMPERR=$(mktemp /tmp/codex-err-XXXXXX.txt)
 2. Run the review (5-minute timeout):
 ```bash
-codex review --base <base> -c 'model_reasoning_effort="high"' --enable web_search_cached 2>"$TMPERR"
+codex review --base <base> -c 'model_reasoning_effort="xhigh"' --enable web_search_cached 2>"$TMPERR"
 ```
 Use `timeout: 300000` on the Bash call. If the user provided custom instructions
 (e.g., `/codex review focus on security`), pass them as the prompt argument:
 ```bash
-codex review "focus on security" --base <base> -c 'model_reasoning_effort="high"' --enable web_search_cached 2>"$TMPERR"
+codex review "focus on security" --base <base> -c 'model_reasoning_effort="xhigh"' --enable web_search_cached 2>"$TMPERR"
 ```
 3. Capture the output. Then parse cost from stderr:
@ -240,7 +240,7 @@ THE PLAN:
 For a **new session:**
 ```bash
-codex exec "<prompt>" -s read-only -c 'model_reasoning_effort="high"' --enable web_search_cached --json 2>"$TMPERR" | python3 -c "
+codex exec "<prompt>" -s read-only -c 'model_reasoning_effort="xhigh"' --enable web_search_cached --json 2>"$TMPERR" | python3 -c "
 import sys, json
 for line in sys.stdin:
    line = line.strip()
@ -273,7 +273,7 @@ for line in sys.stdin:
 For a **resumed session** (user chose "Continue"):
 ```bash
-codex exec resume <session-id> "<prompt>" -s read-only -c 'model_reasoning_effort="high"' --enable web_search_cached --json 2>"$TMPERR" | python3 -c "
+codex exec resume <session-id> "<prompt>" -s read-only -c 'model_reasoning_effort="xhigh"' --enable web_search_cached --json 2>"$TMPERR" | python3 -c "
 <same python streaming parser as above>
 "
 ```
@ -309,10 +309,7 @@ Session saved — run /codex again to continue this conversation.
 agentic coding model). This means as OpenAI ships newer models, /codex automatically
 uses them. If the user wants a specific model, pass `-m` through to codex.
-**Reasoning effort** varies by mode — use the right level for each task:
+**Reasoning effort:** All modes use `xhigh` — maximum reasoning power. When reviewing code, breaking code, or consulting on architecture, you want the model thinking as hard as possible.
-- **Review mode:** `high` — thorough but not slow. Diff review benefits from depth but doesn't need maximum compute.
-- **Challenge (adversarial) mode:** `xhigh` — maximum reasoning power. When trying to break code, you want the model thinking as hard as possible.
-- **Consult mode:** `high` — good balance of depth and speed for conversations.
 **Web search:** All codex commands use `--enable web_search_cached` so Codex can look up
 docs and APIs during review. This is OpenAI's cached index — fast, no extra cost.
--- a/office-hours/SKILL.md
+++ b/office-hours/SKILL.md
@ -241,6 +241,25 @@ success/error/abort, and `USED_BROWSE` with true/false based on whether `$B` was
 If you cannot determine the outcome, use "unknown". This runs in the background and
 never blocks the user.
 ## SETUP (run this check BEFORE any browse command)
 ```bash
 _ROOT=$(git rev-parse --show-toplevel 2>/dev/null)
 B=""
 [ -n "$_ROOT" ] && [ -x "$_ROOT/.claude/skills/gstack/browse/dist/browse" ] && B="$_ROOT/.claude/skills/gstack/browse/dist/browse"
 [ -z "$B" ] && B=~/.claude/skills/gstack/browse/dist/browse
 if [ -x "$B" ]; then
  echo "READY: $B"
 else
  echo "NEEDS_SETUP"
 fi
 ```
 If `NEEDS_SETUP`:
 1. Tell the user: "gstack browse needs a one-time build (~10 seconds). OK to proceed?" Then STOP and wait.
 2. Run: `cd <SKILL_DIR> && ./setup`
 3. If `bun` is not installed: `curl -fsSL https://bun.sh/install | bash`
 # YC Office Hours
 You are a **YC office hours partner**. Your job is to ensure the problem is understood before solutions are proposed. You adapt to what the user is building — startup founders get the hard questions, builders get an enthusiastic collaborator. This skill produces design docs, not code.
@ -505,6 +524,66 @@ Present via AskUserQuestion. Do NOT proceed without user approval of the approac
 ---
 ## Visual Sketch (UI ideas only)
 If the chosen approach involves user-facing UI (screens, pages, forms, dashboards,
 or interactive elements), generate a rough wireframe to help the user visualize it.
 If the idea is backend-only, infrastructure, or has no UI component — skip this
 section silently.
 **Step 1: Gather design context**
 1. Check if `DESIGN.md` exists in the repo root. If it does, read it for design
   system constraints (colors, typography, spacing, component patterns). Use these
   constraints in the wireframe.
 2. Apply core design principles:
   - **Information hierarchy** — what does the user see first, second, third?
   - **Interaction states** — loading, empty, error, success, partial
   - **Edge case paranoia** — what if the name is 47 chars? Zero results? Network fails?
   - **Subtraction default** — "as little design as possible" (Rams). Every element earns its pixels.
   - **Design for trust** — every interface element builds or erodes user trust.
 **Step 2: Generate wireframe HTML**
 Generate a single-page HTML file with these constraints:
 - **Intentionally rough aesthetic** — use system fonts, thin gray borders, no color,
  hand-drawn-style elements. This is a sketch, not a polished mockup.
 - Self-contained — no external dependencies, no CDN links, inline CSS only
 - Show the core interaction flow (1-3 screens/states max)
 - Include realistic placeholder content (not "Lorem ipsum" — use content that
  matches the actual use case)
 - Add HTML comments explaining design decisions
 Write to a temp file:
 ```bash
 SKETCH_FILE="/tmp/gstack-sketch-$(date +%s).html"
 ```
 **Step 3: Render and capture**
 ```bash
 $B goto "file://$SKETCH_FILE"
 $B screenshot /tmp/gstack-sketch.png
 ```
 If `$B` is not available (browse binary not set up), skip the render step. Tell the
 user: "Visual sketch requires the browse binary. Run the setup script to enable it."
 **Step 4: Present and iterate**
 Show the screenshot to the user. Ask: "Does this feel right? Want to iterate on the layout?"
 If they want changes, regenerate the HTML with their feedback and re-render.
 If they approve or say "good enough," proceed.
 **Step 5: Include in design doc**
 Reference the wireframe screenshot in the design doc's "Recommended Approach" section.
 The screenshot file at `/tmp/gstack-sketch.png` can be referenced by downstream skills
 (`/plan-design-review`, `/design-review`) to see what was originally envisioned.
 ---
 ## Phase 4.5: Founder Signal Synthesis
 Before writing the design doc, synthesize the founder signals you observed during the session. These will appear in the design doc ("What I noticed") and in the closing conversation (Phase 6).
@ -641,7 +720,73 @@ Supersedes: {prior filename — omit this line if first design on this branch}
 {observational, mentor-like reflections referencing specific things the user said during the session. Quote their words back to them — don't characterize their behavior. 2-4 bullets.}
 ```
-Present the design doc to the user via AskUserQuestion:
+---
 ## Spec Review Loop
 Before presenting the document to the user for approval, run an adversarial review.
 **Step 1: Dispatch reviewer subagent**
 Use the Agent tool to dispatch an independent reviewer. The reviewer has fresh context
 and cannot see the brainstorming conversation — only the document. This ensures genuine
 adversarial independence.
 Prompt the subagent with:
 - The file path of the document just written
 - "Read this document and review it on 5 dimensions. For each dimension, note PASS or
  list specific issues with suggested fixes. At the end, output a quality score (1-10)
  across all dimensions."
 **Dimensions:**
 1. **Completeness** — Are all requirements addressed? Missing edge cases?
 2. **Consistency** — Do parts of the document agree with each other? Contradictions?
 3. **Clarity** — Could an engineer implement this without asking questions? Ambiguous language?
 4. **Scope** — Does the document creep beyond the original problem? YAGNI violations?
 5. **Feasibility** — Can this actually be built with the stated approach? Hidden complexity?
 The subagent should return:
 - A quality score (1-10)
 - PASS if no issues, or a numbered list of issues with dimension, description, and fix
 **Step 2: Fix and re-dispatch**
 If the reviewer returns issues:
 1. Fix each issue in the document on disk (use Edit tool)
 2. Re-dispatch the reviewer subagent with the updated document
 3. Maximum 3 iterations total
 **Convergence guard:** If the reviewer returns the same issues on consecutive iterations
 (the fix didn't resolve them or the reviewer disagrees with the fix), stop the loop
 and persist those issues as "Reviewer Concerns" in the document rather than looping
 further.
 If the subagent fails, times out, or is unavailable — skip the review loop entirely.
 Tell the user: "Spec review unavailable — presenting unreviewed doc." The document is
 already written to disk; the review is a quality bonus, not a gate.
 **Step 3: Report and persist metrics**
 After the loop completes (PASS, max iterations, or convergence guard):
 1. Tell the user the result — summary by default:
   "Your doc survived N rounds of adversarial review. M issues caught and fixed.
   Quality score: X/10."
   If they ask "what did the reviewer find?", show the full reviewer output.
 2. If issues remain after max iterations or convergence, add a "## Reviewer Concerns"
   section to the document listing each unresolved issue. Downstream skills will see this.
 3. Append metrics:
 ```bash
 mkdir -p ~/.gstack/analytics
 echo '{"skill":"office-hours","ts":"'$(date -u +%Y-%m-%dT%H:%M:%SZ)'","iterations":ITERATIONS,"issues_found":FOUND,"issues_fixed":FIXED,"remaining":REMAINING,"quality_score":SCORE}' >> ~/.gstack/analytics/spec-review.jsonl 2>/dev/null || true
 ```
 Replace ITERATIONS, FOUND, FIXED, REMAINING, SCORE with actual values from the review.
 ---
 Present the reviewed design doc to the user via AskUserQuestion:
 - A) Approve — mark Status: APPROVED and proceed to handoff
 - B) Revise — specify which sections need changes (loop back to revise those sections)
 - C) Start over — return to Phase 2
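
As a reading aid for the Spec Review Loop added in this hunk, the control flow (at most three reviewer rounds, early exit on PASS, and the convergence guard) can be sketched in TypeScript. This is a minimal sketch, not the skill's implementation: the skill is carried out by an agent following prose, and `ReviewResult`, `LoopOutcome`, `runSpecReviewLoop`, and `sameIssues` are hypothetical names. It assumes that "same issues" means an identical set of issue strings and that "3 iterations" counts reviewer dispatches.

```typescript
// Hypothetical shape: the skill text does not define a data format for reviewer output.
interface ReviewResult {
  score: number;    // quality score, 1-10
  issues: string[]; // empty array means PASS
}

type StopReason = "pass" | "max_iterations" | "convergence";

interface LoopOutcome {
  iterations: number;
  stopReason: StopReason;
  remaining: string[]; // issues to record under "Reviewer Concerns"
  lastScore: number;
}

const MAX_ITERATIONS = 3;

// `review` stands in for dispatching the reviewer subagent;
// `fix` stands in for editing the document on disk.
function runSpecReviewLoop(
  review: () => ReviewResult,
  fix: (issues: string[]) => void,
): LoopOutcome {
  let previous: string[] | null = null;
  let result: ReviewResult = { score: 0, issues: [] };
  for (let i = 1; i <= MAX_ITERATIONS; i++) {
    result = review();
    if (result.issues.length === 0) {
      return { iterations: i, stopReason: "pass", remaining: [], lastScore: result.score };
    }
    // Convergence guard: the reviewer repeated the previous round's issues.
    if (previous !== null && sameIssues(previous, result.issues)) {
      return { iterations: i, stopReason: "convergence", remaining: result.issues, lastScore: result.score };
    }
    if (i === MAX_ITERATIONS) break; // no fix after the final round
    fix(result.issues);
    previous = result.issues;
  }
  return { iterations: MAX_ITERATIONS, stopReason: "max_iterations", remaining: result.issues, lastScore: result.score };
}

// Order-insensitive comparison of two issue lists (assumed definition of "same issues").
function sameIssues(a: string[], b: string[]): boolean {
  if (a.length !== b.length) return false;
  const sortedA = [...a].sort();
  const sortedB = [...b].sort();
  return sortedA.every((issue, idx) => issue === sortedB[idx]);
}
```

The `remaining` field corresponds to what the skill text persists as "Reviewer Concerns" when the loop stops without a PASS.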
--- a/office-hours/SKILL.md.tmpl
+++ b/office-hours/SKILL.md.tmpl
@ -23,6 +23,8 @@ allowed-tools:
 {{PREAMBLE}}
 {{BROWSE_SETUP}}
 # YC Office Hours
 You are a **YC office hours partner**. Your job is to ensure the problem is understood before solutions are proposed. You adapt to what the user is building — startup founders get the hard questions, builders get an enthusiastic collaborator. This skill produces design docs, not code.
@ -287,6 +289,10 @@ Present via AskUserQuestion. Do NOT proceed without user approval of the approac
 ---
 {{DESIGN_SKETCH}}
 ---
 ## Phase 4.5: Founder Signal Synthesis
 Before writing the design doc, synthesize the founder signals you observed during the session. These will appear in the design doc ("What I noticed") and in the closing conversation (Phase 6).
@ -423,7 +429,13 @@ Supersedes: {prior filename — omit this line if first design on this branch}
 {observational, mentor-like reflections referencing specific things the user said during the session. Quote their words back to them — don't characterize their behavior. 2-4 bullets.}
 ```
-Present the design doc to the user via AskUserQuestion:
+---
 {{SPEC_REVIEW_LOOP}}
 ---
 Present the reviewed design doc to the user via AskUserQuestion:
 - A) Approve — mark Status: APPROVED and proceed to handoff
 - B) Revise — specify which sections need changes (loop back to revise those sections)
 - C) Start over — return to Phase 2
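
The template hunks above replace inline prose with placeholders such as `{{SPEC_REVIEW_LOOP}}` and `{{BENEFITS_FROM}}`, which `scripts/gen-skill-docs.ts` expands through generator functions that take a `TemplateContext`. A rough sketch of that substitution pattern follows. The placeholder names come from the diff; `SketchContext`, `RESOLVERS`, `renderTemplate`, the stand-in bodies, and the choice to throw on an unknown placeholder are all assumptions made for illustration and may differ from the real script.

```typescript
// Hypothetical, trimmed-down context; the TemplateContext in the diff
// also carries tmplPath, host, and paths.
interface SketchContext {
  skillName: string;
  benefitsFrom?: string[];
}

type Resolver = (ctx: SketchContext) => string;

// Placeholder names are taken from the templates in this diff;
// the returned strings are stand-ins, not the generated text.
const RESOLVERS: Record<string, Resolver> = {
  SPEC_REVIEW_LOOP: () => "## Spec Review Loop",
  BENEFITS_FROM: (ctx) =>
    ctx.benefitsFrom && ctx.benefitsFrom.length > 0
      ? `## Prerequisite Skill Offer (/${ctx.benefitsFrom.join(", /")})`
      : "",
};

// Replace every {{NAME}} token with the output of its resolver.
function renderTemplate(tmpl: string, ctx: SketchContext): string {
  return tmpl.replace(/\{\{([A-Z_]+)\}\}/g, (_match: string, key: string) => {
    const resolver = RESOLVERS[key];
    if (!resolver) {
      // Assumed behavior: fail loudly rather than emit a raw placeholder.
      throw new Error(`Unknown placeholder {{${key}}} in ${ctx.skillName}`);
    }
    return resolver(ctx);
  });
}
```

Keeping the shared text in one resolver is what lets the same Spec Review Loop appear in both `office-hours/SKILL.md` and `plan-ceo-review/SKILL.md` without drifting apart.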
--- a/package.json
+++ b/package.json
@ -8,17 +8,19 @@
    "browse": "./browse/dist/browse"
  },
  "scripts": {
-    "build": "bun run gen:skill-docs && bun run gen:skill-docs --host codex && bun build --compile browse/src/cli.ts --outfile browse/dist/browse && bun build --compile browse/src/find-browse.ts --outfile browse/dist/find-browse && git rev-parse HEAD > browse/dist/.version && rm -f .*.bun-build || true",
+    "build": "bun run gen:skill-docs && bun run gen:skill-docs --host codex && bun build --compile browse/src/cli.ts --outfile browse/dist/browse && bun build --compile browse/src/find-browse.ts --outfile browse/dist/find-browse && bash browse/scripts/build-node-server.sh && git rev-parse HEAD > browse/dist/.version && rm -f .*.bun-build || true",
    "gen:skill-docs": "bun run scripts/gen-skill-docs.ts",
    "dev": "bun run browse/src/cli.ts",
    "server": "bun run browse/src/server.ts",
-    "test": "bun test browse/test/ test/ --ignore test/skill-e2e.test.ts --ignore test/skill-llm-eval.test.ts --ignore test/skill-routing-e2e.test.ts --ignore test/codex-e2e.test.ts",
+    "test": "bun test browse/test/ test/ --ignore test/skill-e2e.test.ts --ignore test/skill-llm-eval.test.ts --ignore test/skill-routing-e2e.test.ts --ignore test/codex-e2e.test.ts --ignore test/gemini-e2e.test.ts",
-    "test:evals": "EVALS=1 bun test test/skill-llm-eval.test.ts test/skill-e2e.test.ts test/skill-routing-e2e.test.ts test/codex-e2e.test.ts",
+    "test:evals": "EVALS=1 bun test test/skill-llm-eval.test.ts test/skill-e2e.test.ts test/skill-routing-e2e.test.ts test/codex-e2e.test.ts test/gemini-e2e.test.ts",
-    "test:evals:all": "EVALS=1 EVALS_ALL=1 bun test test/skill-llm-eval.test.ts test/skill-e2e.test.ts test/skill-routing-e2e.test.ts test/codex-e2e.test.ts",
+    "test:evals:all": "EVALS=1 EVALS_ALL=1 bun test test/skill-llm-eval.test.ts test/skill-e2e.test.ts test/skill-routing-e2e.test.ts test/codex-e2e.test.ts test/gemini-e2e.test.ts",
-    "test:e2e": "EVALS=1 bun test test/skill-e2e.test.ts test/skill-routing-e2e.test.ts test/codex-e2e.test.ts",
+    "test:e2e": "EVALS=1 bun test test/skill-e2e.test.ts test/skill-routing-e2e.test.ts test/codex-e2e.test.ts test/gemini-e2e.test.ts",
-    "test:e2e:all": "EVALS=1 EVALS_ALL=1 bun test test/skill-e2e.test.ts test/skill-routing-e2e.test.ts test/codex-e2e.test.ts",
+    "test:e2e:all": "EVALS=1 EVALS_ALL=1 bun test test/skill-e2e.test.ts test/skill-routing-e2e.test.ts test/codex-e2e.test.ts test/gemini-e2e.test.ts",
    "test:codex": "EVALS=1 bun test test/codex-e2e.test.ts",
    "test:codex:all": "EVALS=1 EVALS_ALL=1 bun test test/codex-e2e.test.ts",
    "test:gemini": "EVALS=1 bun test test/gemini-e2e.test.ts",
    "test:gemini:all": "EVALS=1 EVALS_ALL=1 bun test test/gemini-e2e.test.ts",
    "skill:check": "bun run scripts/skill-check.ts",
    "dev:skill": "bun run scripts/dev-skill.ts",
    "start": "bun run browse/src/server.ts",
--- a/plan-ceo-review/SKILL.md
+++ b/plan-ceo-review/SKILL.md
@ -10,6 +10,7 @@ description: |
  or "is this ambitious enough".
  Proactively suggest when the user is questioning scope or ambition of a plan,
  or when the plan feels like it could be thinking bigger.
 benefits-from: [office-hours]
 allowed-tools:
  - Read
  - Grep
@ -345,6 +346,37 @@ DESIGN=$(ls -t ~/.gstack/projects/$SLUG/*-$BRANCH-design-*.md 2>/dev/null | head
 ```
 If a design doc exists (from `/office-hours`), read it. Use it as the source of truth for the problem statement, constraints, and chosen approach. If it has a `Supersedes:` field, note that this is a revised design.
 ## Prerequisite Skill Offer
 When the design doc check above prints "No design doc found," offer the prerequisite
 skill before proceeding.
 Say to the user via AskUserQuestion:
 > "No design doc found for this branch. `/office-hours` produces a structured problem
 > statement, premise challenge, and explored alternatives — it gives this review much
 > sharper input to work with. Takes about 10 minutes. The design doc is per-feature,
 > not per-product — it captures the thinking behind this specific change."
 Options:
 - A) Run /office-hours first (in another window, then come back)
 - B) Skip — proceed with standard review
 If they skip: "No worries — standard review. If you ever want sharper input, try
 /office-hours first next time." Then proceed normally. Do not re-offer later in the session.
 **Mid-session detection:** During Step 0A (Premise Challenge), if the user can't
 articulate the problem, keeps changing the problem statement, answers with "I'm not
 sure," or is clearly exploring rather than reviewing — offer `/office-hours`:
 > "It sounds like you're still figuring out what to build — that's totally fine, but
 > that's what /office-hours is designed for. Want to pause this review and run
 > /office-hours first? It'll help you nail down the problem and approach, then come
 > back here for the strategic review."
 Options: A) Yes, run /office-hours first. B) No, keep going.
 If they keep going, proceed normally — no guilt, no re-asking.
 When reading TODOS.md, specifically:
 * Note any TODOs this plan touches, blocks, or unlocks
 * Check if deferred work from prior reviews relates to this plan
@ -488,6 +520,70 @@ Repo: {owner/repo}
 Derive the feature slug from the plan being reviewed (e.g., "user-dashboard", "auth-refactor"). Use the date in YYYY-MM-DD format.
 After writing the CEO plan, run the spec review loop on it:
 ## Spec Review Loop
 Before presenting the document to the user for approval, run an adversarial review.
 **Step 1: Dispatch reviewer subagent**
 Use the Agent tool to dispatch an independent reviewer. The reviewer has fresh context
 and cannot see the brainstorming conversation — only the document. This ensures genuine
 adversarial independence.
 Prompt the subagent with:
 - The file path of the document just written
 - "Read this document and review it on 5 dimensions. For each dimension, note PASS or
  list specific issues with suggested fixes. At the end, output a quality score (1-10)
  across all dimensions."
 **Dimensions:**
 1. **Completeness** — Are all requirements addressed? Missing edge cases?
 2. **Consistency** — Do parts of the document agree with each other? Contradictions?
 3. **Clarity** — Could an engineer implement this without asking questions? Ambiguous language?
 4. **Scope** — Does the document creep beyond the original problem? YAGNI violations?
 5. **Feasibility** — Can this actually be built with the stated approach? Hidden complexity?
 The subagent should return:
 - A quality score (1-10)
 - PASS if no issues, or a numbered list of issues with dimension, description, and fix
 **Step 2: Fix and re-dispatch**
 If the reviewer returns issues:
 1. Fix each issue in the document on disk (use Edit tool)
 2. Re-dispatch the reviewer subagent with the updated document
 3. Maximum 3 iterations total
 **Convergence guard:** If the reviewer returns the same issues on consecutive iterations
 (the fix didn't resolve them or the reviewer disagrees with the fix), stop the loop
 and persist those issues as "Reviewer Concerns" in the document rather than looping
 further.
 If the subagent fails, times out, or is unavailable — skip the review loop entirely.
 Tell the user: "Spec review unavailable — presenting unreviewed doc." The document is
 already written to disk; the review is a quality bonus, not a gate.
 **Step 3: Report and persist metrics**
 After the loop completes (PASS, max iterations, or convergence guard):
 1. Tell the user the result — summary by default:
   "Your doc survived N rounds of adversarial review. M issues caught and fixed.
   Quality score: X/10."
   If they ask "what did the reviewer find?", show the full reviewer output.
 2. If issues remain after max iterations or convergence, add a "## Reviewer Concerns"
   section to the document listing each unresolved issue. Downstream skills will see this.
 3. Append metrics:
 ```bash
 mkdir -p ~/.gstack/analytics
 echo '{"skill":"plan-ceo-review","ts":"'$(date -u +%Y-%m-%dT%H:%M:%SZ)'","iterations":ITERATIONS,"issues_found":FOUND,"issues_fixed":FIXED,"remaining":REMAINING,"quality_score":SCORE}' >> ~/.gstack/analytics/spec-review.jsonl 2>/dev/null || true
 ```
 Replace ITERATIONS, FOUND, FIXED, REMAINING, SCORE with actual values from the review.
 ### 0E. Temporal Interrogation (EXPANSION, SELECTIVE EXPANSION, and HOLD modes)
 Think ahead to implementation: What decisions will need to be made during implementation that should be resolved NOW in the plan?
 ```
@ -919,7 +1015,7 @@ Parse the output. Find the most recent entry for each skill (plan-ceo-review, pl
 - **Eng Review (required by default):** The only review that gates shipping. Covers architecture, code quality, tests, performance. Can be disabled globally with \`gstack-config set skip_eng_review true\` (the "don't bother me" setting).
 - **CEO Review (optional):** Use your judgment. Recommend it for big product/business changes, new user-facing features, or scope decisions. Skip for bug fixes, refactors, infra, and cleanup.
 - **Design Review (optional):** Use your judgment. Recommend it for UI/UX changes. Skip for backend-only, infra, or prompt-only changes.
-- **Codex Review (optional):** Independent second opinion from OpenAI Codex CLI. Shows pass/fail gate. Recommend for critical code changes where a second AI perspective adds value. Skip when Codex CLI is not installed.
+- **Codex Review (enabled by default when Codex CLI is installed):** Independent review + adversarial challenge from OpenAI Codex CLI. Shows pass/fail gate. Runs automatically when enabled — configure with \`gstack-config set codex_reviews enabled|disabled\`.
 **Verdict logic:**
 - **CLEARED**: Eng Review has >= 1 entry within 7 days with status "clean" (or \`skip_eng_review\` is \`true\`)
--- a/plan-ceo-review/SKILL.md.tmpl
+++ b/plan-ceo-review/SKILL.md.tmpl
@ -10,6 +10,7 @@ description: |
  or "is this ambitious enough".
  Proactively suggest when the user is questioning scope or ambition of a plan,
  or when the plan feels like it could be thinking bigger.
 benefits-from: [office-hours]
 allowed-tools:
  - Read
  - Grep
@ -110,6 +111,20 @@ DESIGN=$(ls -t ~/.gstack/projects/$SLUG/*-$BRANCH-design-*.md 2>/dev/null | head
 ```
 If a design doc exists (from `/office-hours`), read it. Use it as the source of truth for the problem statement, constraints, and chosen approach. If it has a `Supersedes:` field, note that this is a revised design.
 {{BENEFITS_FROM}}
 **Mid-session detection:** During Step 0A (Premise Challenge), if the user can't
 articulate the problem, keeps changing the problem statement, answers with "I'm not
 sure," or is clearly exploring rather than reviewing — offer `/office-hours`:
 > "It sounds like you're still figuring out what to build — that's totally fine, but
 > that's what /office-hours is designed for. Want to pause this review and run
 > /office-hours first? It'll help you nail down the problem and approach, then come
 > back here for the strategic review."
 Options: A) Yes, run /office-hours first. B) No, keep going.
 If they keep going, proceed normally — no guilt, no re-asking.
 When reading TODOS.md, specifically:
 * Note any TODOs this plan touches, blocks, or unlocks
 * Check if deferred work from prior reviews relates to this plan
@ -253,6 +268,10 @@ Repo: {owner/repo}
 Derive the feature slug from the plan being reviewed (e.g., "user-dashboard", "auth-refactor"). Use the date in YYYY-MM-DD format.
 After writing the CEO plan, run the spec review loop on it:
 {{SPEC_REVIEW_LOOP}}
 ### 0E. Temporal Interrogation (EXPANSION, SELECTIVE EXPANSION, and HOLD modes)
 Think ahead to implementation: What decisions will need to be made during implementation that should be resolved NOW in the plan?
 ```
--- a/plan-design-review/SKILL.md
+++ b/plan-design-review/SKILL.md
@ -550,7 +550,7 @@ Parse the output. Find the most recent entry for each skill (plan-ceo-review, pl
 - **Eng Review (required by default):** The only review that gates shipping. Covers architecture, code quality, tests, performance. Can be disabled globally with \`gstack-config set skip_eng_review true\` (the "don't bother me" setting).
 - **CEO Review (optional):** Use your judgment. Recommend it for big product/business changes, new user-facing features, or scope decisions. Skip for bug fixes, refactors, infra, and cleanup.
 - **Design Review (optional):** Use your judgment. Recommend it for UI/UX changes. Skip for backend-only, infra, or prompt-only changes.
-- **Codex Review (optional):** Independent second opinion from OpenAI Codex CLI. Shows pass/fail gate. Recommend for critical code changes where a second AI perspective adds value. Skip when Codex CLI is not installed.
+- **Codex Review (enabled by default when Codex CLI is installed):** Independent review + adversarial challenge from OpenAI Codex CLI. Shows pass/fail gate. Runs automatically when enabled — configure with \`gstack-config set codex_reviews enabled|disabled\`.
 **Verdict logic:**
 - **CLEARED**: Eng Review has >= 1 entry within 7 days with status "clean" (or \`skip_eng_review\` is \`true\`)
--- a/plan-eng-review/SKILL.md
+++ b/plan-eng-review/SKILL.md
@ -8,6 +8,7 @@ description: |
  "review the architecture", "engineering review", or "lock in the plan".
  Proactively suggest when the user has a plan or design doc and is about to
  start coding — to catch architecture issues before implementation.
 benefits-from: [office-hours]
 allowed-tools:
  - Read
  - Write
@ -291,6 +292,25 @@ DESIGN=$(ls -t ~/.gstack/projects/$SLUG/*-$BRANCH-design-*.md 2>/dev/null | head
 ```
 If a design doc exists, read it. Use it as the source of truth for the problem statement, constraints, and chosen approach. If it has a `Supersedes:` field, note that this is a revised design — check the prior version for context on what changed and why.
 ## Prerequisite Skill Offer
 When the design doc check above prints "No design doc found," offer the prerequisite
 skill before proceeding.
 Say to the user via AskUserQuestion:
 > "No design doc found for this branch. `/office-hours` produces a structured problem
 > statement, premise challenge, and explored alternatives — it gives this review much
 > sharper input to work with. Takes about 10 minutes. The design doc is per-feature,
 > not per-product — it captures the thinking behind this specific change."
 Options:
 - A) Run /office-hours first (in another window, then come back)
 - B) Skip — proceed with standard review
 If they skip: "No worries — standard review. If you ever want sharper input, try
 /office-hours first next time." Then proceed normally. Do not re-offer later in the session.
 ### Step 0: Scope Challenge
 Before reviewing anything, answer these questions:
 1. **What existing code already partially or fully solves each sub-problem?** Can we capture outputs from existing flows rather than building parallel ones?
@ -520,7 +540,7 @@ Parse the output. Find the most recent entry for each skill (plan-ceo-review, pl
 - **Eng Review (required by default):** The only review that gates shipping. Covers architecture, code quality, tests, performance. Can be disabled globally with \`gstack-config set skip_eng_review true\` (the "don't bother me" setting).
 - **CEO Review (optional):** Use your judgment. Recommend it for big product/business changes, new user-facing features, or scope decisions. Skip for bug fixes, refactors, infra, and cleanup.
 - **Design Review (optional):** Use your judgment. Recommend it for UI/UX changes. Skip for backend-only, infra, or prompt-only changes.
-- **Codex Review (optional):** Independent second opinion from OpenAI Codex CLI. Shows pass/fail gate. Recommend for critical code changes where a second AI perspective adds value. Skip when Codex CLI is not installed.
+- **Codex Review (enabled by default when Codex CLI is installed):** Independent review + adversarial challenge from OpenAI Codex CLI. Shows pass/fail gate. Runs automatically when enabled — configure with \`gstack-config set codex_reviews enabled|disabled\`.
 **Verdict logic:**
 - **CLEARED**: Eng Review has >= 1 entry within 7 days with status "clean" (or \`skip_eng_review\` is \`true\`)
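
The CLEARED rule shown in this hunk (at least one "clean" Eng Review entry within 7 days, or `skip_eng_review` set to `true`) can be written as a small predicate. The sketch below is illustrative only: `ReviewLogEntry` mirrors the `skill`, `timestamp`, and `status` fields seen in the review-log JSON elsewhere in this diff, while `isEngReviewCleared` and its parameters are hypothetical. The Eng Review skill name is passed in as an argument because the hunk truncates the list of skill names.

```typescript
// Fields mirror the review-log JSON shown elsewhere in this diff.
interface ReviewLogEntry {
  skill: string;
  timestamp: string; // ISO 8601, UTC
  status: string;    // e.g. "clean" or "issues_found"
}

const SEVEN_DAYS_MS = 7 * 24 * 60 * 60 * 1000;

// Returns true when the CLEARED condition holds.
// `engSkill` is supplied by the caller (assumption: the log records the skill name verbatim).
function isEngReviewCleared(
  entries: ReviewLogEntry[],
  engSkill: string,
  skipEngReview: boolean,
  now: Date,
): boolean {
  if (skipEngReview) return true;
  return entries.some((entry) => {
    if (entry.skill !== engSkill || entry.status !== "clean") return false;
    const ageMs = now.getTime() - Date.parse(entry.timestamp);
    // An unparsable timestamp yields NaN, which fails both comparisons.
    return ageMs >= 0 && ageMs <= SEVEN_DAYS_MS;
  });
}
```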
--- a/plan-eng-review/SKILL.md.tmpl
+++ b/plan-eng-review/SKILL.md.tmpl
@ -8,6 +8,7 @@ description: |
  "review the architecture", "engineering review", or "lock in the plan".
  Proactively suggest when the user has a plan or design doc and is about to
  start coding — to catch architecture issues before implementation.
 benefits-from: [office-hours]
 allowed-tools:
  - Read
  - Write
@ -73,6 +74,8 @@ DESIGN=$(ls -t ~/.gstack/projects/$SLUG/*-$BRANCH-design-*.md 2>/dev/null | head
 ```
 If a design doc exists, read it. Use it as the source of truth for the problem statement, constraints, and chosen approach. If it has a `Supersedes:` field, note that this is a revised design — check the prior version for context on what changed and why.
 {{BENEFITS_FROM}}
 ### Step 0: Scope Challenge
 Before reviewing anything, answer these questions:
 1. **What existing code already partially or fully solves each sub-problem?** Can we capture outputs from existing flows rather than building parallel ones?
--- a/review/SKILL.md
+++ b/review/SKILL.md
@ -497,52 +497,128 @@ If no documentation files exist, skip this step silently.
 ---
-## Step 5.7: Codex second opinion (optional)
+## Step 5.7: Codex review
-After completing the review, check if the Codex CLI is available:
+Check if the Codex CLI is available and read the user's Codex review preference:
 ```bash
 which codex 2>/dev/null && echo "CODEX_AVAILABLE" || echo "CODEX_NOT_AVAILABLE"
 CODEX_REVIEWS_CFG=$(~/.claude/skills/gstack/bin/gstack-config get codex_reviews 2>/dev/null || true)
 echo "CODEX_REVIEWS: ${CODEX_REVIEWS_CFG:-not_set}"
 ```
-If Codex is available, use AskUserQuestion:
+If `CODEX_NOT_AVAILABLE`: skip this step silently. Continue to the next step.
 If `CODEX_REVIEWS` is `disabled`: skip this step silently. Continue to the next step.
 If `CODEX_REVIEWS` is `enabled`: run both code review and adversarial challenge automatically (no prompt). Jump to the "Run Codex" section below.
 If `CODEX_REVIEWS` is `not_set`: use AskUserQuestion to offer the one-time adoption prompt:
 ```
-Review complete. Want an independent second opinion from Codex (OpenAI)?
+GStack recommends enabling Codex code reviews — Codex is the super smart quiet engineer friend who will save your butt.
-A) Run Codex code review — independent diff review with pass/fail gate
+A) Enable for all future runs (recommended, default)
-B) Run Codex adversarial challenge — try to find ways this code will fail in production
+B) Try it for now, ask me again later
-C) Both — review first, then adversarial challenge
+C) No thanks, don't ask me again
 D) Skip — no Codex review needed
 ```
-If the user chooses A, B, or C:
+If the user chooses A: persist the setting and run both:
 **For code review (A or C):** Run `codex review --base <base>` with a 5-minute timeout.
 Present the full output verbatim under a `CODEX SAYS (code review):` header.
 Check the output for `[P1]` markers — if found, note `GATE: FAIL`, otherwise `GATE: PASS`.
 After presenting, compare Codex's findings with your own review findings from Steps 4-5
 and output a CROSS-MODEL ANALYSIS showing what both found, what only Codex found,
 and what only Claude found.
 **For adversarial challenge (B or C):** Run:
 ```bash
-codex exec "Review the changes on this branch against the base branch. Run git diff origin/<base> to see the diff. Your job is to find ways this code will fail in production. Think like an attacker and a chaos engineer. Find edge cases, race conditions, security holes, failure modes. Be adversarial." -s read-only
+~/.claude/skills/gstack/bin/gstack-config set codex_reviews enabled
 ```
 Present the full output verbatim under a `CODEX SAYS (adversarial challenge):` header.
-**Only if a code review ran (user chose A or C):** Persist the Codex review result to the review log:
+If the user chooses B: run both this time but do not persist any setting.
 If the user chooses C: persist the opt-out and skip:
 ```bash
-~/.claude/skills/gstack/bin/gstack-review-log '{"skill":"codex-review","timestamp":"'"$(date -u +%Y-%m-%dT%H:%M:%SZ)"'","status":"STATUS","gate":"GATE"}'
+~/.claude/skills/gstack/bin/gstack-config set codex_reviews disabled
 ```
 Then skip this step. Continue to the next step.
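
The branching described above (Codex availability first, then the stored `codex_reviews` preference, then the one-time adoption prompt) reduces to two small lookups. The following is a sketch under stated assumptions: `decideCodexStep`, `applyAdoptionChoice`, and the type names are hypothetical, and only choices A to C are modeled because those are the ones the new text spells out.

```typescript
type CodexPreference = "enabled" | "disabled" | "not_set";
type CodexAction = "skip" | "run_both" | "ask_user";

// Maps CLI availability plus the stored preference to the next action.
function decideCodexStep(codexAvailable: boolean, preference: CodexPreference): CodexAction {
  if (!codexAvailable) return "skip";            // CODEX_NOT_AVAILABLE: skip silently
  if (preference === "disabled") return "skip";  // user opted out
  if (preference === "enabled") return "run_both";
  return "ask_user";                             // not_set: one-time adoption prompt
}

type AdoptionChoice = "A" | "B" | "C";

interface AdoptionResult {
  persist: "enabled" | "disabled" | null; // value to store via gstack-config, if any
  runNow: boolean;                        // whether to run both Codex passes this time
}

// Effect of each answer to the adoption prompt.
function applyAdoptionChoice(choice: AdoptionChoice): AdoptionResult {
  switch (choice) {
    case "A":
      return { persist: "enabled", runNow: true };
    case "B":
      return { persist: null, runNow: true };
    case "C":
      return { persist: "disabled", runNow: false };
  }
}
```

Checking availability before the preference means a stored `enabled` setting never triggers a run on a machine without the Codex CLI.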
 ### Run Codex
 Always run **both** code review and adversarial challenge. Use a 5-minute timeout (`timeout: 300000`) on each Bash call.
 First, create a temp file for stderr capture:
 ```bash
 TMPERR=$(mktemp /tmp/codex-review-XXXXXXXX)
 ```
 **Code review:** Run:
 ```bash
 codex review --base <base> -c 'model_reasoning_effort="xhigh"' --enable web_search_cached 2>"$TMPERR"
 ```
 After the command completes, read stderr for cost/error info:
 ```bash
 cat "$TMPERR"
 ```
 Present the full output verbatim under a `CODEX SAYS (code review):` header:
 ```
 CODEX SAYS (code review):
 ════════════════════════════════════════════════════════════
 <full codex output, verbatim — do not truncate or summarize>
 ════════════════════════════════════════════════════════════
 GATE: PASS                    Tokens: N | Est. cost: ~$X.XX
 ```
 Check the output for `[P1]` markers. If found: `GATE: FAIL`. If no `[P1]`: `GATE: PASS`.
 **If GATE is FAIL:** use AskUserQuestion:
 ```
 Codex found N critical issues in the diff.
 A) Investigate and fix now (recommended)
 B) Ship anyway — these issues may cause production problems
 ```
 If the user chooses A: read the Codex findings carefully and work to address them. Then re-run `codex review` to verify the gate is now PASS.
 If the user chooses B: continue to the next step.
 ### Error handling (code review)
 Before persisting the gate result, check for errors. All errors are non-blocking — Codex is a quality enhancement, not a prerequisite. Check `$TMPERR` output (already read above) for error indicators:
 - **Auth failure:** If stderr contains "auth", "login", "unauthorized", or "API key", tell the user: "Codex authentication failed. Run \`codex login\` in your terminal to authenticate via ChatGPT." Do NOT persist a review log entry. Continue to the adversarial step (it will likely fail too, but try anyway).
 - **Timeout:** If the Bash call times out (5 min), tell the user: "Codex timed out after 5 minutes. The diff may be too large or the API may be slow." Do NOT persist a review log entry. Skip to cleanup.
 - **Empty response:** If codex returned no stdout output, tell the user: "Codex returned no response. Stderr: <paste relevant error>." Do NOT persist a review log entry. Skip to cleanup.
 **Only if codex produced a real review (non-empty stdout):** Persist the code review result:
 ```bash
 ~/.claude/skills/gstack/bin/gstack-review-log '{"skill":"codex-review","timestamp":"'"$(date -u +%Y-%m-%dT%H:%M:%SZ)"'","status":"STATUS","gate":"GATE","commit":"'"$(git rev-parse --short HEAD)"'"}'
 ```
 Substitute: STATUS ("clean" if PASS, "issues_found" if FAIL), GATE ("pass" or "fail").
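
The gate rule and the STATUS/GATE substitution above can be expressed as one function that turns Codex stdout into a review-log entry, or into nothing when there is no real review to record. This is a sketch, not the skill's code: the field names follow the JSON in the bash command above, while `CodexReviewLogEntry` and `buildCodexLogEntry` are hypothetical names.

```typescript
interface CodexReviewLogEntry {
  skill: "codex-review";
  timestamp: string; // UTC, second precision, e.g. 2026-03-20T16:16:20Z
  status: "clean" | "issues_found";
  gate: "pass" | "fail";
  commit: string;    // short commit hash
}

// Returns null for an empty response, matching the rule that no
// review-log entry is persisted when Codex produced no stdout.
function buildCodexLogEntry(stdout: string, commit: string, now: Date): CodexReviewLogEntry | null {
  if (stdout.trim() === "") return null;
  const failed = stdout.includes("[P1]"); // any [P1] marker fails the gate
  return {
    skill: "codex-review",
    // toISOString() includes milliseconds; strip them to match the date(1) format.
    timestamp: now.toISOString().replace(/\.\d{3}Z$/, "Z"),
    status: failed ? "issues_found" : "clean",
    gate: failed ? "fail" : "pass",
    commit,
  };
}
```

Serializing the result with `JSON.stringify` would produce a payload of the same shape as the hand-quoted string passed to `gstack-review-log`.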
-**Do NOT persist a codex-review entry when only the adversarial challenge (B) ran** —
+**Adversarial challenge:** Run:
-there is no gate verdict to record, and a false entry would make the Review Readiness
+```bash
-Dashboard believe a code review happened when it didn't.
+TMPERR_ADV=$(mktemp /tmp/codex-adv-XXXXXXXX)
 codex exec "Review the changes on this branch against the base branch. Run git diff origin/<base> to see the diff. Your job is to find ways this code will fail in production. Think like an attacker and a chaos engineer. Find edge cases, race conditions, security holes, resource leaks, failure modes, and silent data corruption paths. Be adversarial. Be thorough. No compliments — just the problems." -s read-only -c 'model_reasoning_effort="xhigh"' --enable web_search_cached 2>"$TMPERR_ADV"
 ```
-If Codex is not available, skip this step silently.
+After the command completes, read adversarial stderr:
 ```bash
 cat "$TMPERR_ADV"
 ```
 Present the full output verbatim under a `CODEX SAYS (adversarial challenge):` header. This is informational — it never blocks shipping. If the adversarial command timed out or returned no output, note this to the user and continue.
 **Cross-model analysis:** After both Codex outputs are presented, compare Codex's findings with your own review findings from the earlier review steps and output:
 ```
 CROSS-MODEL ANALYSIS:
  Both found: [findings that overlap between Claude and Codex]
  Only Codex found: [findings unique to Codex]
  Only Claude found: [findings unique to Claude's review]
  Agreement rate: X% (N/M total unique findings overlap)
 ```
 **Cleanup:** Run `rm -f "$TMPERR" "$TMPERR_ADV"` after processing.
 ---
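
The "Agreement rate: X% (N/M ...)" line in the CROSS-MODEL ANALYSIS template above is simple arithmetic once findings are matched: N is the number of overlapping findings and M is the number of unique findings across both reviewers. The sketch below assumes findings have already been normalized to comparable strings. In practice the matching is a judgment made by the reviewing model, so exact string equality is a simplification, and `crossModelAnalysis` and its result type are hypothetical names.

```typescript
interface CrossModelAnalysis {
  both: string[];
  onlyCodex: string[];
  onlyClaude: string[];
  overlapCount: number;     // N in the template
  totalUnique: number;      // M in the template
  agreementPercent: number; // X in the template, rounded to a whole percent
}

// Compares two finding lists by exact string match (a simplifying assumption).
function crossModelAnalysis(claudeFindings: string[], codexFindings: string[]): CrossModelAnalysis {
  const claudeSet = new Set(claudeFindings);
  const codexSet = new Set(codexFindings);
  const both = Array.from(claudeSet).filter((f) => codexSet.has(f));
  const onlyClaude = Array.from(claudeSet).filter((f) => !codexSet.has(f));
  const onlyCodex = Array.from(codexSet).filter((f) => !claudeSet.has(f));
  const totalUnique = both.length + onlyClaude.length + onlyCodex.length;
  const agreementPercent = totalUnique === 0 ? 0 : Math.round((both.length / totalUnique) * 100);
  return { both, onlyCodex, onlyClaude, overlapCount: both.length, totalUnique, agreementPercent };
}
```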
--- a/review/SKILL.md.tmpl
+++ b/review/SKILL.md.tmpl
@ -231,54 +231,7 @@ If no documentation files exist, skip this step silently.
 ---
-## Step 5.7: Codex second opinion (optional)
+{{CODEX_REVIEW_STEP}}
 After completing the review, check if the Codex CLI is available:
 ```bash
 which codex 2>/dev/null && echo "CODEX_AVAILABLE" || echo "CODEX_NOT_AVAILABLE"
 ```
 If Codex is available, use AskUserQuestion:
 ```
 Review complete. Want an independent second opinion from Codex (OpenAI)?
 A) Run Codex code review — independent diff review with pass/fail gate
 B) Run Codex adversarial challenge — try to find ways this code will fail in production
 C) Both — review first, then adversarial challenge
 D) Skip — no Codex review needed
 ```
 If the user chooses A, B, or C:
 **For code review (A or C):** Run `codex review --base <base>` with a 5-minute timeout.
 Present the full output verbatim under a `CODEX SAYS (code review):` header.
 Check the output for `[P1]` markers — if found, note `GATE: FAIL`, otherwise `GATE: PASS`.
 After presenting, compare Codex's findings with your own review findings from Steps 4-5
 and output a CROSS-MODEL ANALYSIS showing what both found, what only Codex found,
 and what only Claude found.
 **For adversarial challenge (B or C):** Run:
 ```bash
 codex exec "Review the changes on this branch against the base branch. Run git diff origin/<base> to see the diff. Your job is to find ways this code will fail in production. Think like an attacker and a chaos engineer. Find edge cases, race conditions, security holes, failure modes. Be adversarial." -s read-only
 ```
 Present the full output verbatim under a `CODEX SAYS (adversarial challenge):` header.
 **Only if a code review ran (user chose A or C):** Persist the Codex review result to the review log:
 ```bash
 ~/.claude/skills/gstack/bin/gstack-review-log '{"skill":"codex-review","timestamp":"'"$(date -u +%Y-%m-%dT%H:%M:%SZ)"'","status":"STATUS","gate":"GATE"}'
 ```
 Substitute: STATUS ("clean" if PASS, "issues_found" if FAIL), GATE ("pass" or "fail").
 **Do NOT persist a codex-review entry when only the adversarial challenge (B) ran** —
 there is no gate verdict to record, and a false entry would make the Review Readiness
 Dashboard believe a code review happened when it didn't.
 If Codex is not available, skip this step silently.
 ---
 ## Important Rules
--- a/scripts/gen-skill-docs.ts
+++ b/scripts/gen-skill-docs.ts
@ -55,6 +55,7 @@ const HOST_PATHS: Record<Host, HostPaths> = {
 interface TemplateContext {
  skillName: string;
  tmplPath: string;
  benefitsFrom?: string[];
  host: Host;
  paths: HostPaths;
 }
@ -1206,7 +1207,7 @@ Parse the output. Find the most recent entry for each skill (plan-ceo-review, pl
 - **Eng Review (required by default):** The only review that gates shipping. Covers architecture, code quality, tests, performance. Can be disabled globally with \\\`gstack-config set skip_eng_review true\\\` (the "don't bother me" setting).
 - **CEO Review (optional):** Use your judgment. Recommend it for big product/business changes, new user-facing features, or scope decisions. Skip for bug fixes, refactors, infra, and cleanup.
 - **Design Review (optional):** Use your judgment. Recommend it for UI/UX changes. Skip for backend-only, infra, or prompt-only changes.
- **Codex Review (optional):** Independent second opinion from OpenAI Codex CLI. Shows pass/fail gate. Recommend for critical code changes where a second AI perspective adds value. Skip when Codex CLI is not installed.
+- **Codex Review (enabled by default when Codex CLI is installed):** Independent review + adversarial challenge from OpenAI Codex CLI. Shows pass/fail gate. Runs automatically when enabled — configure with \\\`gstack-config set codex_reviews enabled|disabled\\\`.
 **Verdict logic:**
 - **CLEARED**: Eng Review has >= 1 entry within 7 days with status "clean" (or \\\`skip_eng_review\\\` is \\\`true\\\`)
@ -1376,6 +1377,289 @@ Only commit if there are changes. Stage all bootstrap files (config, test direct
 ---`;
 }
 function generateSpecReviewLoop(_ctx: TemplateContext): string {
  return `## Spec Review Loop
 Before presenting the document to the user for approval, run an adversarial review.
 **Step 1: Dispatch reviewer subagent**
 Use the Agent tool to dispatch an independent reviewer. The reviewer has fresh context
 and cannot see the brainstorming conversation — only the document. This ensures genuine
 adversarial independence.
 Prompt the subagent with:
 - The file path of the document just written
 - "Read this document and review it on 5 dimensions. For each dimension, note PASS or
  list specific issues with suggested fixes. At the end, output a quality score (1-10)
  across all dimensions."
 **Dimensions:**
 1. **Completeness** — Are all requirements addressed? Missing edge cases?
 2. **Consistency** — Do parts of the document agree with each other? Contradictions?
 3. **Clarity** — Could an engineer implement this without asking questions? Ambiguous language?
 4. **Scope** — Does the document creep beyond the original problem? YAGNI violations?
 5. **Feasibility** — Can this actually be built with the stated approach? Hidden complexity?
 The subagent should return:
 - A quality score (1-10)
 - PASS if no issues, or a numbered list of issues with dimension, description, and fix
 **Step 2: Fix and re-dispatch**
 If the reviewer returns issues:
 1. Fix each issue in the document on disk (use Edit tool)
 2. Re-dispatch the reviewer subagent with the updated document
 3. Maximum 3 iterations total
 **Convergence guard:** If the reviewer returns the same issues on consecutive iterations
 (the fix didn't resolve them or the reviewer disagrees with the fix), stop the loop
 and persist those issues as "Reviewer Concerns" in the document rather than looping
 further.
 If the subagent fails, times out, or is unavailable — skip the review loop entirely.
 Tell the user: "Spec review unavailable — presenting unreviewed doc." The document is
 already written to disk; the review is a quality bonus, not a gate.
 **Step 3: Report and persist metrics**
 After the loop completes (PASS, max iterations, or convergence guard):
 1. Tell the user the result — summary by default:
   "Your doc survived N rounds of adversarial review. M issues caught and fixed.
   Quality score: X/10."
   If they ask "what did the reviewer find?", show the full reviewer output.
 2. If issues remain after max iterations or convergence, add a "## Reviewer Concerns"
   section to the document listing each unresolved issue. Downstream skills will see this.
 3. Append metrics:
 \`\`\`bash
 mkdir -p ~/.gstack/analytics
 echo '{"skill":"${_ctx.skillName}","ts":"'$(date -u +%Y-%m-%dT%H:%M:%SZ)'","iterations":ITERATIONS,"issues_found":FOUND,"issues_fixed":FIXED,"remaining":REMAINING,"quality_score":SCORE}' >> ~/.gstack/analytics/spec-review.jsonl 2>/dev/null || true
 \`\`\`
 Replace ITERATIONS, FOUND, FIXED, REMAINING, SCORE with actual values from the review.`;
 }
 function generateBenefitsFrom(ctx: TemplateContext): string {
  if (!ctx.benefitsFrom || ctx.benefitsFrom.length === 0) return '';
  const skillList = ctx.benefitsFrom.map(s => `\`/${s}\``).join(' or ');
  const first = ctx.benefitsFrom[0];
  return `## Prerequisite Skill Offer
 When the design doc check above prints "No design doc found," offer the prerequisite
 skill before proceeding.
 Say to the user via AskUserQuestion:
 > "No design doc found for this branch. ${skillList} produces a structured problem
 > statement, premise challenge, and explored alternatives — it gives this review much
 > sharper input to work with. Takes about 10 minutes. The design doc is per-feature,
 > not per-product — it captures the thinking behind this specific change."
 Options:
 - A) Run /${first} first (in another window, then come back)
 - B) Skip — proceed with standard review
 If they skip: "No worries — standard review. If you ever want sharper input, try
 /${first} first next time." Then proceed normally. Do not re-offer later in the session.`;
 }
 function generateDesignSketch(_ctx: TemplateContext): string {
  return `## Visual Sketch (UI ideas only)
 If the chosen approach involves user-facing UI (screens, pages, forms, dashboards,
 or interactive elements), generate a rough wireframe to help the user visualize it.
 If the idea is backend-only, infrastructure, or has no UI component — skip this
 section silently.
 **Step 1: Gather design context**
 1. Check if \`DESIGN.md\` exists in the repo root. If it does, read it for design
   system constraints (colors, typography, spacing, component patterns). Use these
   constraints in the wireframe.
 2. Apply core design principles:
   - **Information hierarchy** — what does the user see first, second, third?
   - **Interaction states** — loading, empty, error, success, partial
   - **Edge case paranoia** — what if the name is 47 chars? Zero results? Network fails?
   - **Subtraction default** — "as little design as possible" (Rams). Every element earns its pixels.
   - **Design for trust** — every interface element builds or erodes user trust.
 **Step 2: Generate wireframe HTML**
 Generate a single-page HTML file with these constraints:
 - **Intentionally rough aesthetic** — use system fonts, thin gray borders, no color,
  hand-drawn-style elements. This is a sketch, not a polished mockup.
 - Self-contained — no external dependencies, no CDN links, inline CSS only
 - Show the core interaction flow (1-3 screens/states max)
 - Include realistic placeholder content (not "Lorem ipsum" — use content that
  matches the actual use case)
 - Add HTML comments explaining design decisions
 Write to a temp file:
 \`\`\`bash
 SKETCH_FILE="/tmp/gstack-sketch-$(date +%s).html"
 \`\`\`
 **Step 3: Render and capture**
 \`\`\`bash
 $B goto "file://$SKETCH_FILE"
 $B screenshot /tmp/gstack-sketch.png
 \`\`\`
 If \`$B\` is not available (browse binary not set up), skip the render step. Tell the
 user: "Visual sketch requires the browse binary. Run the setup script to enable it."
 **Step 4: Present and iterate**
 Show the screenshot to the user. Ask: "Does this feel right? Want to iterate on the layout?"
 If they want changes, regenerate the HTML with their feedback and re-render.
 If they approve or say "good enough," proceed.
 **Step 5: Include in design doc**
 Reference the wireframe screenshot in the design doc's "Recommended Approach" section.
 The screenshot file at \`/tmp/gstack-sketch.png\` can be referenced by downstream skills
 (\`/plan-design-review\`, \`/design-review\`) to see what was originally envisioned.`;
 }
 function generateCodexReviewStep(ctx: TemplateContext): string {
  // Codex host: strip entirely — Codex should never invoke itself
  if (ctx.host === 'codex') return '';
  const isShip = ctx.skillName === 'ship';
  const stepNum = isShip ? '3.8' : '5.7';
  return `## Step ${stepNum}: Codex review
 Check if the Codex CLI is available and read the user's Codex review preference:
 \`\`\`bash
 which codex 2>/dev/null && echo "CODEX_AVAILABLE" || echo "CODEX_NOT_AVAILABLE"
 CODEX_REVIEWS_CFG=$(~/.claude/skills/gstack/bin/gstack-config get codex_reviews 2>/dev/null || true)
 echo "CODEX_REVIEWS: \${CODEX_REVIEWS_CFG:-not_set}"
 \`\`\`
 If \`CODEX_NOT_AVAILABLE\`: skip this step silently. Continue to the next step.
 If \`CODEX_REVIEWS\` is \`disabled\`: skip this step silently. Continue to the next step.
 If \`CODEX_REVIEWS\` is \`enabled\`: run both code review and adversarial challenge automatically (no prompt). Jump to the "Run Codex" section below.
 If \`CODEX_REVIEWS\` is \`not_set\`: use AskUserQuestion to offer the one-time adoption prompt:
 \`\`\`
 GStack recommends enabling Codex code reviews — Codex is the super smart quiet engineer friend who will save your butt.
 A) Enable for all future runs (recommended, default)
 B) Try it for now, ask me again later
 C) No thanks, don't ask me again
 \`\`\`
 If the user chooses A: persist the setting and run both:
 \`\`\`bash
 ~/.claude/skills/gstack/bin/gstack-config set codex_reviews enabled
 \`\`\`
 If the user chooses B: run both this time but do not persist any setting.
 If the user chooses C: persist the opt-out and skip:
 \`\`\`bash
 ~/.claude/skills/gstack/bin/gstack-config set codex_reviews disabled
 \`\`\`
 Then skip this step. Continue to the next step.
 ### Run Codex
 Always run **both** code review and adversarial challenge. Use a 5-minute timeout (\`timeout: 300000\`) on each Bash call.
 First, create a temp file for stderr capture:
 \`\`\`bash
 TMPERR=$(mktemp /tmp/codex-review-XXXXXXXX)
 \`\`\`
 **Code review:** Run:
 \`\`\`bash
 codex review --base <base> -c 'model_reasoning_effort="xhigh"' --enable web_search_cached 2>"$TMPERR"
 \`\`\`
 After the command completes, read stderr for cost/error info:
 \`\`\`bash
 cat "$TMPERR"
 \`\`\`
 Present the full output verbatim under a \`CODEX SAYS (code review):\` header:
 \`\`\`
 CODEX SAYS (code review):
 ════════════════════════════════════════════════════════════
 <full codex output, verbatim — do not truncate or summarize>
 ════════════════════════════════════════════════════════════
 GATE: PASS                    Tokens: N | Est. cost: ~$X.XX
 \`\`\`
 Check the output for \`[P1]\` markers. If found: \`GATE: FAIL\`. If no \`[P1]\`: \`GATE: PASS\`.
 **If GATE is FAIL:** use AskUserQuestion:
 \`\`\`
 Codex found N critical issues in the diff.
 A) Investigate and fix now (recommended)
 B) Ship anyway — these issues may cause production problems
 \`\`\`
 If the user chooses A: read the Codex findings carefully and work to address them${isShip ? '. After fixing, re-run tests (Step 3) since code has changed' : ''}. Then re-run \`codex review\` to verify the gate is now PASS.
 If the user chooses B: continue to the next step.
 ### Error handling (code review)
 Before persisting the gate result, check for errors. All errors are non-blocking — Codex is a quality enhancement, not a prerequisite. Check \`$TMPERR\` output (already read above) for error indicators:
 - **Auth failure:** If stderr contains "auth", "login", "unauthorized", or "API key", tell the user: "Codex authentication failed. Run \\\`codex login\\\` in your terminal to authenticate via ChatGPT." Do NOT persist a review log entry. Continue to the adversarial step (it will likely fail too, but try anyway).
 - **Timeout:** If the Bash call times out (5 min), tell the user: "Codex timed out after 5 minutes. The diff may be too large or the API may be slow." Do NOT persist a review log entry. Skip to cleanup.
 - **Empty response:** If codex returned no stdout output, tell the user: "Codex returned no response. Stderr: <paste relevant error>." Do NOT persist a review log entry. Skip to cleanup.
 **Only if codex produced a real review (non-empty stdout):** Persist the code review result:
 \`\`\`bash
 ~/.claude/skills/gstack/bin/gstack-review-log '{"skill":"codex-review","timestamp":"'"$(date -u +%Y-%m-%dT%H:%M:%SZ)"'","status":"STATUS","gate":"GATE","commit":"'"$(git rev-parse --short HEAD)"'"}'
 \`\`\`
 Substitute: STATUS ("clean" if PASS, "issues_found" if FAIL), GATE ("pass" or "fail").
 **Adversarial challenge:** Run:
 \`\`\`bash
 TMPERR_ADV=$(mktemp /tmp/codex-adv-XXXXXXXX)
 codex exec "Review the changes on this branch against the base branch. Run git diff origin/<base> to see the diff. Your job is to find ways this code will fail in production. Think like an attacker and a chaos engineer. Find edge cases, race conditions, security holes, resource leaks, failure modes, and silent data corruption paths. Be adversarial. Be thorough. No compliments — just the problems." -s read-only -c 'model_reasoning_effort="xhigh"' --enable web_search_cached 2>"$TMPERR_ADV"
 \`\`\`
 After the command completes, read adversarial stderr:
 \`\`\`bash
 cat "$TMPERR_ADV"
 \`\`\`
 Present the full output verbatim under a \`CODEX SAYS (adversarial challenge):\` header. This is informational — it never blocks shipping. If the adversarial command timed out or returned no output, note this to the user and continue.
 ${!isShip ? `
 **Cross-model analysis:** After both Codex outputs are presented, compare Codex's findings with your own review findings from the earlier review steps and output:
 \`\`\`
 CROSS-MODEL ANALYSIS:
  Both found: [findings that overlap between Claude and Codex]
  Only Codex found: [findings unique to Codex]
  Only Claude found: [findings unique to Claude's review]
  Agreement rate: X% (N/M total unique findings overlap)
 \`\`\`
 ` : ''}
 **Cleanup:** Run \`rm -f "$TMPERR" "$TMPERR_ADV"\` after processing.
 ---`;
 }
 const RESOLVERS: Record<string, (ctx: TemplateContext) => string> = {
  COMMAND_REFERENCE: generateCommandReference,
  SNAPSHOT_FLAGS: generateSnapshotFlags,
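Review note: the branching that `generateCodexReviewStep` spells out in prose (CLI availability first, then the `codex_reviews` setting) can be sketched as a small pure function. This is only an illustration of the decision table, not code from the repo; `CodexAction` and `decideCodexAction` are hypothetical names, and treating any unrecognized setting value like `not_set` is an assumption.

```typescript
// Hypothetical helper: mirrors the prose decision table in the generated step.
type CodexAction = 'skip' | 'run_both' | 'ask_user';

function decideCodexAction(
  codexAvailable: boolean,
  codexReviewsCfg: string | undefined,
): CodexAction {
  // CODEX_NOT_AVAILABLE: skip silently, regardless of the stored preference.
  if (!codexAvailable) return 'skip';

  // An empty or missing value is reported as "not_set" by the shell snippet.
  const cfg = (codexReviewsCfg ?? '').trim() || 'not_set';

  if (cfg === 'disabled') return 'skip';
  if (cfg === 'enabled') return 'run_both';

  // not_set (and, by assumption, any unknown value): one-time adoption prompt.
  return 'ask_user';
}
```

Checking availability before reading the preference keeps the "skip silently" path free of prompts on machines without the Codex CLI.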
@ -1388,6 +1672,10 @@ const RESOLVERS: Record<string, (ctx: TemplateContext) => string> = {
  REVIEW_DASHBOARD: generateReviewDashboard,
  TEST_BOOTSTRAP: generateTestBootstrap,
  TEST_FAILURE_TRIAGE: generateTestFailureTriage,
+  SPEC_REVIEW_LOOP: generateSpecReviewLoop,
+  DESIGN_SKETCH: generateDesignSketch,
+  BENEFITS_FROM: generateBenefitsFrom,
+  CODEX_REVIEW_STEP: generateCodexReviewStep,
 };
 // ─── Codex Helpers ───────────────────────────────────────────
@ -1510,7 +1798,14 @@ function processTemplate(tmplPath: string, host: Host = 'claude'): { outputPath:
  // Extract skill name from frontmatter for TemplateContext
  const nameMatch = tmplContent.match(/^name:\s*(.+)$/m);
  const skillName = nameMatch ? nameMatch[1].trim() : path.basename(path.dirname(tmplPath));
-  const ctx: TemplateContext = { skillName, tmplPath, host, paths: HOST_PATHS[host] };
+
+  // Extract benefits-from list from frontmatter (inline YAML: benefits-from: [a, b])
+  const benefitsMatch = tmplContent.match(/^benefits-from:\s*\[([^\]]*)\]/m);
+  const benefitsFrom = benefitsMatch
+    ? benefitsMatch[1].split(',').map(s => s.trim()).filter(Boolean)
+    : undefined;
+  const ctx: TemplateContext = { skillName, tmplPath, benefitsFrom, host, paths: HOST_PATHS[host] };
  // Replace placeholders
  let content = tmplContent.replace(/\{\{(\w+)\}\}/g, (match, name) => {
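Review note: the `benefits-from` extraction added in this hunk can be exercised in isolation. The sketch below wraps the same regex and split logic in a standalone function so the edge cases are visible; `parseBenefitsFrom` is a hypothetical name used only for illustration, and the sample frontmatter strings are made up.

```typescript
// Hypothetical wrapper around the regex used in processTemplate.
// Only the inline YAML list form (benefits-from: [a, b]) is recognized.
function parseBenefitsFrom(tmplContent: string): string[] | undefined {
  const match = tmplContent.match(/^benefits-from:\s*\[([^\]]*)\]/m);
  if (!match) return undefined; // key absent, or not written as an inline list
  return match[1]
    .split(',')
    .map(s => s.trim())
    .filter(Boolean); // drops empty entries such as a trailing comma
}

// Illustrative inputs (not taken from any real template):
const withList = parseBenefitsFrom('name: demo\nbenefits-from: [office-hours, plan ]\n');
const emptyList = parseBenefitsFrom('name: demo\nbenefits-from: []\n');
const missing = parseBenefitsFrom('name: demo\n');
```

An empty list yields `[]` rather than `undefined`; `generateBenefitsFrom` treats both the same way and returns an empty string.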
--- a/46
+++ b/46
@ -12,6 +12,11 @@ GSTACK_DIR="$(cd "$(dirname "$0")" && pwd)"
 SKILLS_DIR="$(dirname "$GSTACK_DIR")"
 BROWSE_BIN="$GSTACK_DIR/browse/dist/browse"
+IS_WINDOWS=0
+case "$(uname -s)" in
+  MINGW*|MSYS*|CYGWIN*|Windows_NT) IS_WINDOWS=1 ;;
+esac
 # ─── Parse --host flag ─────────────────────────────────────────
 HOST="claude"
 while [ $# -gt 0 ]; do
@ -44,10 +49,19 @@ elif [ "$HOST" = "codex" ]; then
 fi
 ensure_playwright_browser() {
-  (
-    cd "$GSTACK_DIR"
-    bun --eval 'import { chromium } from "playwright"; const browser = await chromium.launch(); await browser.close();'
-  ) >/dev/null 2>&1
+  if [ "$IS_WINDOWS" -eq 1 ]; then
+    # On Windows, Bun can't launch Chromium due to broken pipe handling
+    # (oven-sh/bun#4253). Use Node.js to verify Chromium works instead.
+    (
+      cd "$GSTACK_DIR"
+      node -e "const { chromium } = require('playwright'); (async () => { const b = await chromium.launch(); await b.close(); })()" 2>/dev/null
+    )
+  else
+    (
+      cd "$GSTACK_DIR"
+      bun --eval 'import { chromium } from "playwright"; const browser = await chromium.launch(); await browser.close();'
+    ) >/dev/null 2>&1
+  fi
 }
 # 1. Build browse binary if needed (smart rebuild: stale sources, package.json, lock)
@ -87,10 +101,32 @@ if ! ensure_playwright_browser; then
    cd "$GSTACK_DIR"
    bunx playwright install chromium
  )
+  if [ "$IS_WINDOWS" -eq 1 ]; then
+    # On Windows, Node.js launches Chromium (not Bun — see oven-sh/bun#4253).
+    # Ensure playwright is importable by Node from the gstack directory.
+    if ! command -v node >/dev/null 2>&1; then
+      echo "gstack setup failed: Node.js is required on Windows (Bun cannot launch Chromium due to a pipe bug)" >&2
+      echo "  Install Node.js: https://nodejs.org/" >&2
+      exit 1
+    fi
+    echo "Windows detected — verifying Node.js can load Playwright..."
+    (
+      cd "$GSTACK_DIR"
+      # Bun's node_modules already has playwright; verify Node can require it
+      node -e "require('playwright')" 2>/dev/null || npm install --no-save playwright
+    )
+  fi
 fi
 if ! ensure_playwright_browser; then
-  echo "gstack setup failed: Playwright Chromium could not be launched" >&2
+  if [ "$IS_WINDOWS" -eq 1 ]; then
+    echo "gstack setup failed: Playwright Chromium could not be launched via Node.js" >&2
+    echo "  This is a known issue with Bun on Windows (oven-sh/bun#4253)." >&2
+    echo "  Ensure Node.js is installed and 'node -e \"require('playwright')\"' works." >&2
+  else
+    echo "gstack setup failed: Playwright Chromium could not be launched" >&2
+  fi
  exit 1
 fi
--- a/ship/SKILL.md
+++ b/ship/SKILL.md
@ -319,7 +319,7 @@ Parse the output. Find the most recent entry for each skill (plan-ceo-review, pl
 - **Eng Review (required by default):** The only review that gates shipping. Covers architecture, code quality, tests, performance. Can be disabled globally with \`gstack-config set skip_eng_review true\` (the "don't bother me" setting).
 - **CEO Review (optional):** Use your judgment. Recommend it for big product/business changes, new user-facing features, or scope decisions. Skip for bug fixes, refactors, infra, and cleanup.
 - **Design Review (optional):** Use your judgment. Recommend it for UI/UX changes. Skip for backend-only, infra, or prompt-only changes.
- **Codex Review (optional):** Independent second opinion from OpenAI Codex CLI. Shows pass/fail gate. Recommend for critical code changes where a second AI perspective adds value. Skip when Codex CLI is not installed.
+- **Codex Review (enabled by default when Codex CLI is installed):** Independent review + adversarial challenge from OpenAI Codex CLI. Shows pass/fail gate. Runs automatically when enabled — configure with \`gstack-config set codex_reviews enabled|disabled\`.
 **Verdict logic:**
 - **CLEARED**: Eng Review has >= 1 entry within 7 days with status "clean" (or \`skip_eng_review\` is \`true\`)
@ -959,41 +959,118 @@ For each classified comment:
 ---
-## Step 3.8: Codex second opinion (optional)
+## Step 3.8: Codex review
-Check if the Codex CLI is available:
+Check if the Codex CLI is available and read the user's Codex review preference:
 ```bash
 which codex 2>/dev/null && echo "CODEX_AVAILABLE" || echo "CODEX_NOT_AVAILABLE"
 CODEX_REVIEWS_CFG=$(~/.claude/skills/gstack/bin/gstack-config get codex_reviews 2>/dev/null || true)
 echo "CODEX_REVIEWS: ${CODEX_REVIEWS_CFG:-not_set}"
 ```
-If Codex is available, use AskUserQuestion:
+If `CODEX_NOT_AVAILABLE`: skip this step silently. Continue to the next step.
 If `CODEX_REVIEWS` is `disabled`: skip this step silently. Continue to the next step.
 If `CODEX_REVIEWS` is `enabled`: run both code review and adversarial challenge automatically (no prompt). Jump to the "Run Codex" section below.
 If `CODEX_REVIEWS` is `not_set`: use AskUserQuestion to offer the one-time adoption prompt:
 ```
-Pre-landing review complete. Want an independent Codex (OpenAI) review before shipping?
+GStack recommends enabling Codex code reviews — Codex is the super smart quiet engineer friend who will save your butt.
-A) Run Codex code review — independent diff review with pass/fail gate
+A) Enable for all future runs (recommended, default)
-B) Run Codex adversarial challenge — try to break this code
+B) Try it for now, ask me again later
-C) Skip — ship without Codex review
+C) No thanks, don't ask me again
 ```
-If the user chooses A or B:
+If the user chooses A: persist the setting and run both:
-**For code review (A):** Run `codex review --base <base>` with a 5-minute timeout.
-Present the full output verbatim under a `CODEX SAYS:` header. Check for `[P1]` markers
-to determine pass/fail gate. Persist the result:
 ```bash
-~/.claude/skills/gstack/bin/gstack-review-log '{"skill":"codex-review","timestamp":"TIMESTAMP","status":"STATUS","gate":"GATE"}'
+~/.claude/skills/gstack/bin/gstack-config set codex_reviews enabled
 ```
-If GATE is FAIL, use AskUserQuestion: "Codex found critical issues. Ship anyway?"
+If the user chooses B: run both this time but do not persist any setting.
-If the user says no, stop. If yes, continue to Step 4.
-**For adversarial (B):** Run codex exec with the adversarial prompt (see /codex skill).
+If the user chooses C: persist the opt-out and skip:
-Present findings. This is informational — does not block shipping.
+```bash
 ~/.claude/skills/gstack/bin/gstack-config set codex_reviews disabled
 ```
 Then skip this step. Continue to the next step.
-If Codex is not available, skip silently. Continue to Step 4.
+### Run Codex
 Always run **both** code review and adversarial challenge. Use a 5-minute timeout (`timeout: 300000`) on each Bash call.
 First, create a temp file for stderr capture:
 ```bash
 TMPERR=$(mktemp /tmp/codex-review-XXXXXXXX)
 ```
 **Code review:** Run:
 ```bash
 codex review --base <base> -c 'model_reasoning_effort="xhigh"' --enable web_search_cached 2>"$TMPERR"
 ```
 After the command completes, read stderr for cost/error info:
 ```bash
 cat "$TMPERR"
 ```
 Present the full output verbatim under a `CODEX SAYS (code review):` header:
 ```
 CODEX SAYS (code review):
 ════════════════════════════════════════════════════════════
 <full codex output, verbatim — do not truncate or summarize>
 ════════════════════════════════════════════════════════════
 GATE: PASS                    Tokens: N | Est. cost: ~$X.XX
 ```
 Check the output for `[P1]` markers. If found: `GATE: FAIL`. If no `[P1]`: `GATE: PASS`.
 **If GATE is FAIL:** use AskUserQuestion:
 ```
 Codex found N critical issues in the diff.
 A) Investigate and fix now (recommended)
 B) Ship anyway — these issues may cause production problems
 ```
 If the user chooses A: read the Codex findings carefully and work to address them. After fixing, re-run tests (Step 3) since code has changed. Then re-run `codex review` to verify the gate is now PASS.
 If the user chooses B: continue to the next step.
 ### Error handling (code review)
 Before persisting the gate result, check for errors. All errors are non-blocking — Codex is a quality enhancement, not a prerequisite. Check `$TMPERR` output (already read above) for error indicators:
 - **Auth failure:** If stderr contains "auth", "login", "unauthorized", or "API key", tell the user: "Codex authentication failed. Run \`codex login\` in your terminal to authenticate via ChatGPT." Do NOT persist a review log entry. Continue to the adversarial step (it will likely fail too, but try anyway).
 - **Timeout:** If the Bash call times out (5 min), tell the user: "Codex timed out after 5 minutes. The diff may be too large or the API may be slow." Do NOT persist a review log entry. Skip to cleanup.
 - **Empty response:** If codex returned no stdout output, tell the user: "Codex returned no response. Stderr: <paste relevant error>." Do NOT persist a review log entry. Skip to cleanup.
 **Only if codex produced a real review (non-empty stdout):** Persist the code review result:
 ```bash
 ~/.claude/skills/gstack/bin/gstack-review-log '{"skill":"codex-review","timestamp":"'"$(date -u +%Y-%m-%dT%H:%M:%SZ)"'","status":"STATUS","gate":"GATE","commit":"'"$(git rev-parse --short HEAD)"'"}'
 ```
 Substitute: STATUS ("clean" if PASS, "issues_found" if FAIL), GATE ("pass" or "fail").
 **Adversarial challenge:** Run:
 ```bash
 TMPERR_ADV=$(mktemp /tmp/codex-adv-XXXXXXXX)
 codex exec "Review the changes on this branch against the base branch. Run git diff origin/<base> to see the diff. Your job is to find ways this code will fail in production. Think like an attacker and a chaos engineer. Find edge cases, race conditions, security holes, resource leaks, failure modes, and silent data corruption paths. Be adversarial. Be thorough. No compliments — just the problems." -s read-only -c 'model_reasoning_effort="xhigh"' --enable web_search_cached 2>"$TMPERR_ADV"
 ```
 After the command completes, read adversarial stderr:
 ```bash
 cat "$TMPERR_ADV"
 ```
 Present the full output verbatim under a `CODEX SAYS (adversarial challenge):` header. This is informational — it never blocks shipping. If the adversarial command timed out or returned no output, note this to the user and continue.
 **Cleanup:** Run `rm -f "$TMPERR" "$TMPERR_ADV"` after processing.
 ---
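Review note: the gate rule in the new Step 3.8 (any `[P1]` marker in the Codex output means FAIL, empty stdout means nothing is persisted) is simple enough to sketch as a pure function. This is an illustration under stated assumptions, not code from the repo: `GateResult` and `evaluateCodexGate` are hypothetical names, and the `status` and `gate` strings follow the substitution rule quoted in the step.

```typescript
// Hypothetical shape of the values substituted into the review-log entry.
interface GateResult {
  gate: 'pass' | 'fail';
  status: 'clean' | 'issues_found';
  p1Count: number; // the "N critical issues" shown in the FAIL prompt
}

// Returns null when Codex produced no real review (empty stdout),
// in which case no review-log entry should be written.
function evaluateCodexGate(stdout: string): GateResult | null {
  if (stdout.trim().length === 0) return null;

  // Count literal "[P1]" markers; brackets are escaped in the regex.
  const p1Count = (stdout.match(/\[P1\]/g) || []).length;

  return p1Count > 0
    ? { gate: 'fail', status: 'issues_found', p1Count }
    : { gate: 'pass', status: 'clean', p1Count: 0 };
}
```

Returning `null` for empty output keeps the "do NOT persist" cases out of the Review Readiness Dashboard without a separate flag.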
@ -1236,7 +1313,7 @@ doc updates — the user runs `/ship` and documentation stays current without a
 - **Never skip tests.** If tests fail, stop.
 - **Never skip the pre-landing review.** If checklist.md is unreadable, stop.
 - **Never force push.** Use regular `git push` only.
- **Never ask for confirmation** except for MINOR/MAJOR version bumps and pre-landing review ASK items (batched into at most one AskUserQuestion).
+- **Never ask for trivial confirmations** (e.g., "ready to push?", "create PR?"). DO stop for: version bumps (MINOR/MAJOR), pre-landing review findings (ASK items), Codex critical findings ([P1]), and the one-time Codex adoption prompt.
 - **Always use the 4-digit version format** from the VERSION file.
 - **Date format in CHANGELOG:** `YYYY-MM-DD`
 - **Split commits for bisectability** — each commit = one logical change.
--- a/ship/SKILL.md.tmpl
+++ b/ship/SKILL.md.tmpl
@ -407,43 +407,7 @@ For each classified comment:
 ---
-## Step 3.8: Codex second opinion (optional)
+{{CODEX_REVIEW_STEP}}
-Check if the Codex CLI is available:
-```bash
-which codex 2>/dev/null && echo "CODEX_AVAILABLE" || echo "CODEX_NOT_AVAILABLE"
-```
-If Codex is available, use AskUserQuestion:
-```
-Pre-landing review complete. Want an independent Codex (OpenAI) review before shipping?
-A) Run Codex code review — independent diff review with pass/fail gate
-B) Run Codex adversarial challenge — try to break this code
-C) Skip — ship without Codex review
-```
-If the user chooses A or B:
-**For code review (A):** Run `codex review --base <base>` with a 5-minute timeout.
-Present the full output verbatim under a `CODEX SAYS:` header. Check for `[P1]` markers
-to determine pass/fail gate. Persist the result:
-```bash
-~/.claude/skills/gstack/bin/gstack-review-log '{"skill":"codex-review","timestamp":"TIMESTAMP","status":"STATUS","gate":"GATE"}'
-```
-If GATE is FAIL, use AskUserQuestion: "Codex found critical issues. Ship anyway?"
-If the user says no, stop. If yes, continue to Step 4.
-**For adversarial (B):** Run codex exec with the adversarial prompt (see /codex skill).
-Present findings. This is informational — does not block shipping.
-If Codex is not available, skip silently. Continue to Step 4.
 ---
 ## Step 4: Version bump (auto-decide)
@ -684,7 +648,7 @@ doc updates — the user runs `/ship` and documentation stays current without a
 - **Never skip tests.** If tests fail, stop.
 - **Never skip the pre-landing review.** If checklist.md is unreadable, stop.
 - **Never force push.** Use regular `git push` only.
- **Never ask for confirmation** except for MINOR/MAJOR version bumps and pre-landing review ASK items (batched into at most one AskUserQuestion).
+- **Never ask for trivial confirmations** (e.g., "ready to push?", "create PR?"). DO stop for: version bumps (MINOR/MAJOR), pre-landing review findings (ASK items), Codex critical findings ([P1]), and the one-time Codex adoption prompt.
 - **Always use the 4-digit version format** from the VERSION file.
 - **Date format in CHANGELOG:** `YYYY-MM-DD`
 - **Split commits for bisectability** — each commit = one logical change.
--- a/test/gemini-e2e.test.ts
+++ b/test/gemini-e2e.test.ts
@ -0,0 +1,173 @@
 /**
 * Gemini CLI E2E tests — verify skills work when invoked by Gemini CLI.
 *
 * Spawns `gemini -p` with stream-json output in the repo root (where
 * .agents/skills/ already exists), parses JSONL events, and validates
 * structured results. Follows the same pattern as codex-e2e.test.ts.
 *
 * Prerequisites:
 * - `gemini` binary installed (npm install -g @google/gemini-cli)
 * - Gemini authenticated via ~/.gemini/ config or GEMINI_API_KEY env var
 * - EVALS=1 env var set (same gate as Claude E2E tests)
 *
 * Skips gracefully when prerequisites are not met.
 */
 import { describe, test, expect, afterAll } from 'bun:test';
 import { runGeminiSkill } from './helpers/gemini-session-runner';
 import type { GeminiResult } from './helpers/gemini-session-runner';
 import { EvalCollector } from './helpers/eval-store';
 import { selectTests, detectBaseBranch, getChangedFiles, GLOBAL_TOUCHFILES } from './helpers/touchfiles';
 import * as path from 'path';
 const ROOT = path.resolve(import.meta.dir, '..');
 // --- Prerequisites check ---
 const GEMINI_AVAILABLE = (() => {
  try {
    const result = Bun.spawnSync(['which', 'gemini']);
    return result.exitCode === 0;
  } catch { return false; }
 })();
 const evalsEnabled = !!process.env.EVALS;
 // Skip all tests if gemini is not available or EVALS is not set.
 const SKIP = !GEMINI_AVAILABLE || !evalsEnabled;
 const describeGemini = SKIP ? describe.skip : describe;
 // Log why we're skipping (helpful for debugging CI)
 if (!evalsEnabled) {
  // Silent — same as Claude E2E tests, EVALS=1 required
 } else if (!GEMINI_AVAILABLE) {
  process.stderr.write('\nGemini E2E: SKIPPED — gemini binary not found (install: npm i -g @google/gemini-cli)\n');
 }
 // --- Diff-based test selection ---
 // Gemini E2E touchfiles — keyed by test name, same pattern as Codex E2E
 const GEMINI_E2E_TOUCHFILES: Record<string, string[]> = {
  'gemini-discover-skill':  ['.agents/skills/**', 'test/helpers/gemini-session-runner.ts'],
  'gemini-review-findings': ['review/**', '.agents/skills/gstack-review/**', 'test/helpers/gemini-session-runner.ts'],
 };
 let selectedTests: string[] | null = null; // null = run all
 if (evalsEnabled && !process.env.EVALS_ALL) {
  const baseBranch = process.env.EVALS_BASE
    || detectBaseBranch(ROOT)
    || 'main';
  const changedFiles = getChangedFiles(baseBranch, ROOT);
  if (changedFiles.length > 0) {
    const selection = selectTests(changedFiles, GEMINI_E2E_TOUCHFILES, GLOBAL_TOUCHFILES);
    selectedTests = selection.selected;
    process.stderr.write(`\nGemini E2E selection (${selection.reason}): ${selection.selected.length}/${Object.keys(GEMINI_E2E_TOUCHFILES).length} tests\n`);
    if (selection.skipped.length > 0) {
      process.stderr.write(`  Skipped: ${selection.skipped.join(', ')}\n`);
    }
    process.stderr.write('\n');
  }
  // If changedFiles is empty (e.g., on main branch), selectedTests stays null -> run all
 }
 /** Skip an individual test if not selected by diff-based selection. */
 function testIfSelected(testName: string, fn: () => Promise<void>, timeout: number) {
  const shouldRun = selectedTests === null || selectedTests.includes(testName);
  (shouldRun ? test : test.skip)(testName, fn, timeout);
 }
 // --- Eval result collector ---
 const evalCollector = evalsEnabled && !SKIP ? new EvalCollector('e2e-gemini') : null;
 /** DRY helper to record a Gemini E2E test result into the eval collector. */
 function recordGeminiE2E(name: string, result: GeminiResult, passed: boolean) {
  evalCollector?.addTest({
    name,
    suite: 'gemini-e2e',
    tier: 'e2e',
    passed,
    duration_ms: result.durationMs,
    cost_usd: 0, // Gemini doesn't report cost in USD; tokens are tracked
    output: result.output?.slice(0, 2000),
    turns_used: result.toolCalls.length, // approximate: tool calls as turns
    exit_reason: result.exitCode === 0 ? 'success' : `exit_code_${result.exitCode}`,
  });
 }
 /** Print cost summary after a Gemini E2E test. */
 function logGeminiCost(label: string, result: GeminiResult) {
  const durationSec = Math.round(result.durationMs / 1000);
  console.log(`${label}: ${result.tokens} tokens, ${result.toolCalls.length} tool calls, ${durationSec}s`);
 }
 // Finalize eval results on exit
 afterAll(async () => {
  if (evalCollector) {
    await evalCollector.finalize();
  }
 });
 // --- Tests ---
 describeGemini('Gemini E2E', () => {
  testIfSelected('gemini-discover-skill', async () => {
    // Run Gemini in the repo root where .agents/skills/ exists
    const result = await runGeminiSkill({
      prompt: 'List any skills or instructions you have available. Just list the names.',
      timeoutMs: 60_000,
      cwd: ROOT,
    });
    logGeminiCost('gemini-discover-skill', result);
    // Gemini should have produced some output
    const passed = result.exitCode === 0 && result.output.length > 0;
    recordGeminiE2E('gemini-discover-skill', result, passed);
    expect(result.exitCode).toBe(0);
    expect(result.output.length).toBeGreaterThan(0);
    // The output should reference skills in some form
    const outputLower = result.output.toLowerCase();
    expect(
      outputLower.includes('review') || outputLower.includes('gstack') || outputLower.includes('skill'),
    ).toBe(true);
  }, 120_000);
  testIfSelected('gemini-review-findings', async () => {
    // Run gstack-review skill via Gemini on this repo
    const result = await runGeminiSkill({
      prompt: 'Run the gstack-review skill on this repository. Review the current branch diff and report your findings.',
      timeoutMs: 540_000,
      cwd: ROOT,
    });
    logGeminiCost('gemini-review-findings', result);
    // Should produce structured review-like output
    const output = result.output;
    const passed = result.exitCode === 0 && output.length > 50;
    recordGeminiE2E('gemini-review-findings', result, passed);
    expect(result.exitCode).toBe(0);
    expect(output.length).toBeGreaterThan(50);
    // Review output should contain some review-like content
    const outputLower = output.toLowerCase();
    const hasReviewContent =
      outputLower.includes('finding') ||
      outputLower.includes('issue') ||
      outputLower.includes('review') ||
      outputLower.includes('change') ||
      outputLower.includes('diff') ||
      outputLower.includes('clean') ||
      outputLower.includes('no issues') ||
      outputLower.includes('p1') ||
      outputLower.includes('p2');
    expect(hasReviewContent).toBe(true);
  }, 600_000);
 });
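The diff-based selection used above (a touchfile map per test, plus global touchfiles that trigger everything) can be sketched as follows. This is a minimal illustration only: `selectSketch` and `globToRegExp` are hypothetical names, and the glob handling is an assumption, not the repo's actual `selectTests` implementation.

```typescript
// Hypothetical sketch of touchfile-based test selection (not the repo's selectTests).
type Touchfiles = Record<string, string[]>;

/** Convert a simple glob to a RegExp: "**" matches any path, "*" matches within one segment. */
function globToRegExp(glob: string): RegExp {
  // Escape regex metacharacters except "*", which is handled below.
  const escaped = glob.replace(/[.+?^${}()|[\]\\]/g, '\\$&');
  const pattern = escaped
    .replace(/\*\*/g, '\u0000') // placeholder so the single-star pass skips "**"
    .replace(/\*/g, '[^/]*')
    .replace(/\u0000/g, '.*');
  return new RegExp(`^${pattern}$`);
}

/** Select tests whose touchfile patterns match a changed file; a global hit selects all. */
function selectSketch(changed: string[], touchfiles: Touchfiles, globals: string[] = []) {
  const hit = (patterns: string[]): boolean =>
    patterns.some((p) => {
      const re = globToRegExp(p);
      return changed.some((f) => re.test(f));
    });

  const all = Object.keys(touchfiles);
  if (hit(globals)) {
    return { selected: all, skipped: [] as string[], reason: 'global' };
  }
  const selected = Object.entries(touchfiles)
    .filter(([, patterns]) => hit(patterns))
    .map(([testName]) => testName);
  const skipped = all.filter((testName) => !selected.includes(testName));
  return { selected, skipped, reason: 'diff' };
}

const demoTouchfiles: Touchfiles = {
  'gemini-discover-skill': ['.agents/skills/**', 'test/helpers/gemini-session-runner.ts'],
  'gemini-review-findings': ['review/**', '.agents/skills/gstack-review/**', 'test/helpers/gemini-session-runner.ts'],
};

// A change under review/ selects only the review test in this sketch.
const demoSelection = selectSketch(['review/SKILL.md'], demoTouchfiles, ['scripts/gen-skill-docs.ts']);
console.log(demoSelection.selected, demoSelection.skipped);
```

Under these assumptions, a change to a global touchfile such as `scripts/gen-skill-docs.ts` would select every test regardless of the per-test patterns.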
--- a/test/gen-skill-docs.test.ts
+++ b/test/gen-skill-docs.test.ts
@ -416,6 +416,98 @@ describe('REVIEW_DASHBOARD resolver', () => {
  });
 });
 // --- {{SPEC_REVIEW_LOOP}} resolver tests ---
 describe('SPEC_REVIEW_LOOP resolver', () => {
  const content = fs.readFileSync(path.join(ROOT, 'office-hours', 'SKILL.md'), 'utf-8');
  test('contains all 5 review dimensions', () => {
    for (const dim of ['Completeness', 'Consistency', 'Clarity', 'Scope', 'Feasibility']) {
      expect(content).toContain(dim);
    }
  });
  test('references Agent tool for subagent dispatch', () => {
    expect(content).toMatch(/Agent.*tool/i);
  });
  test('specifies max 3 iterations', () => {
    expect(content).toMatch(/3.*iteration|maximum.*3/i);
  });
  test('includes quality score', () => {
    expect(content).toContain('quality score');
  });
  test('includes metrics path', () => {
    expect(content).toContain('spec-review.jsonl');
  });
  test('includes convergence guard', () => {
    expect(content).toMatch(/[Cc]onvergence/);
  });
  test('includes graceful failure handling', () => {
    expect(content).toMatch(/skip.*review|unavailable/i);
  });
 });
 // --- {{DESIGN_SKETCH}} resolver tests ---
 describe('DESIGN_SKETCH resolver', () => {
  const content = fs.readFileSync(path.join(ROOT, 'office-hours', 'SKILL.md'), 'utf-8');
  test('references DESIGN.md for design system constraints', () => {
    expect(content).toContain('DESIGN.md');
  });
  test('contains wireframe or sketch terminology', () => {
    expect(content).toMatch(/wireframe|sketch/i);
  });
  test('references browse binary for rendering', () => {
    expect(content).toContain('$B goto');
  });
  test('references screenshot capture', () => {
    expect(content).toContain('$B screenshot');
  });
  test('specifies rough aesthetic', () => {
    expect(content).toMatch(/[Rr]ough|hand-drawn/);
  });
  test('includes skip conditions', () => {
    expect(content).toMatch(/no UI component|skip/i);
  });
 });
 // --- {{BENEFITS_FROM}} resolver tests ---
 describe('BENEFITS_FROM resolver', () => {
  const ceoContent = fs.readFileSync(path.join(ROOT, 'plan-ceo-review', 'SKILL.md'), 'utf-8');
  const engContent = fs.readFileSync(path.join(ROOT, 'plan-eng-review', 'SKILL.md'), 'utf-8');
  test('plan-ceo-review contains prerequisite skill offer', () => {
    expect(ceoContent).toContain('Prerequisite Skill Offer');
    expect(ceoContent).toContain('/office-hours');
  });
  test('plan-eng-review contains prerequisite skill offer', () => {
    expect(engContent).toContain('Prerequisite Skill Offer');
    expect(engContent).toContain('/office-hours');
  });
  test('offer includes graceful decline', () => {
    expect(ceoContent).toContain('No worries');
  });
  test('skills without benefits-from do NOT have prerequisite offer', () => {
    const qaContent = fs.readFileSync(path.join(ROOT, 'qa', 'SKILL.md'), 'utf-8');
    expect(qaContent).not.toContain('Prerequisite Skill Offer');
  });
 });
 // ─── Codex Generation Tests ─────────────────────────────────
 describe('Codex generation (--host codex)', () => {
@ -492,6 +584,16 @@ describe('Codex generation (--host codex)', () => {
    expect(fs.existsSync(path.join(AGENTS_DIR, 'gstack-codex'))).toBe(false);
  });
  test('Codex review step stripped from Codex-host ship and review', () => {
    const shipContent = fs.readFileSync(path.join(AGENTS_DIR, 'gstack-ship', 'SKILL.md'), 'utf-8');
    expect(shipContent).not.toContain('codex review --base');
    expect(shipContent).not.toContain('Investigate and fix');
    const reviewContent = fs.readFileSync(path.join(AGENTS_DIR, 'gstack-review', 'SKILL.md'), 'utf-8');
    expect(reviewContent).not.toContain('codex review --base');
    expect(reviewContent).not.toContain('Investigate and fix');
  });
  test('--host codex --dry-run freshness', () => {
    const result = Bun.spawnSync(['bun', 'run', 'scripts/gen-skill-docs.ts', '--host', 'codex', '--dry-run'], {
      cwd: ROOT,
--- a/test/helpers/gemini-session-runner.test.ts
+++ b/test/helpers/gemini-session-runner.test.ts
@ -0,0 +1,104 @@
 import { describe, test, expect } from 'bun:test';
 import { parseGeminiJSONL } from './gemini-session-runner';
 // Fixture: actual Gemini CLI stream-json output with tool use
 const FIXTURE_LINES = [
  '{"type":"init","timestamp":"2026-03-20T15:14:46.455Z","session_id":"test-session-123","model":"auto-gemini-3"}',
  '{"type":"message","timestamp":"2026-03-20T15:14:46.456Z","role":"user","content":"list the files"}',
  '{"type":"message","timestamp":"2026-03-20T15:14:49.650Z","role":"assistant","content":"I will list the files.","delta":true}',
  '{"type":"tool_use","timestamp":"2026-03-20T15:14:49.690Z","tool_name":"run_shell_command","tool_id":"cmd_1","parameters":{"command":"ls"}}',
  '{"type":"tool_result","timestamp":"2026-03-20T15:14:49.931Z","tool_id":"cmd_1","status":"success","output":"file1.ts\\nfile2.ts"}',
  '{"type":"message","timestamp":"2026-03-20T15:14:51.945Z","role":"assistant","content":"Here are the files.","delta":true}',
  '{"type":"result","timestamp":"2026-03-20T15:14:52.030Z","status":"success","stats":{"total_tokens":27147,"input_tokens":26928,"output_tokens":87,"cached":0,"duration_ms":5575,"tool_calls":1}}',
 ];
 describe('parseGeminiJSONL', () => {
  test('extracts session ID from init event', () => {
    const parsed = parseGeminiJSONL(FIXTURE_LINES);
    expect(parsed.sessionId).toBe('test-session-123');
  });
  test('concatenates assistant message deltas into output', () => {
    const parsed = parseGeminiJSONL(FIXTURE_LINES);
    expect(parsed.output).toBe('I will list the files.Here are the files.');
  });
  test('ignores user messages', () => {
    const lines = [
      '{"type":"message","role":"user","content":"this should be ignored"}',
      '{"type":"message","role":"assistant","content":"this should be kept","delta":true}',
    ];
    const parsed = parseGeminiJSONL(lines);
    expect(parsed.output).toBe('this should be kept');
  });
  test('extracts tool names from tool_use events', () => {
    const parsed = parseGeminiJSONL(FIXTURE_LINES);
    expect(parsed.toolCalls).toHaveLength(1);
    expect(parsed.toolCalls[0]).toBe('run_shell_command');
  });
  test('extracts total tokens from result stats', () => {
    const parsed = parseGeminiJSONL(FIXTURE_LINES);
    expect(parsed.tokens).toBe(27147);
  });
  test('skips malformed lines without throwing', () => {
    const lines = [
      '{"type":"init","session_id":"ok"}',
      'this is not json',
      '{"type":"message","role":"assistant","content":"hello","delta":true}',
      '{incomplete json',
      '{"type":"result","status":"success","stats":{"total_tokens":100}}',
    ];
    const parsed = parseGeminiJSONL(lines);
    expect(parsed.sessionId).toBe('ok');
    expect(parsed.output).toBe('hello');
    expect(parsed.tokens).toBe(100);
  });
  test('skips empty and whitespace-only lines', () => {
    const lines = [
      '',
      '  ',
      '{"type":"init","session_id":"s1"}',
      '\t',
      '{"type":"result","status":"success","stats":{"total_tokens":50}}',
    ];
    const parsed = parseGeminiJSONL(lines);
    expect(parsed.sessionId).toBe('s1');
    expect(parsed.tokens).toBe(50);
  });
  test('handles empty input', () => {
    const parsed = parseGeminiJSONL([]);
    expect(parsed.output).toBe('');
    expect(parsed.toolCalls).toHaveLength(0);
    expect(parsed.tokens).toBe(0);
    expect(parsed.sessionId).toBeNull();
  });
  test('handles missing fields gracefully', () => {
    const lines = [
      '{"type":"init"}',                              // no session_id
      '{"type":"message","role":"assistant"}',         // no content
      '{"type":"tool_use"}',                           // no tool_name
      '{"type":"result","status":"success"}',          // no stats
    ];
    const parsed = parseGeminiJSONL(lines);
    expect(parsed.sessionId).toBeNull();
    expect(parsed.output).toBe('');
    expect(parsed.toolCalls).toHaveLength(0);
    expect(parsed.tokens).toBe(0);
  });
  test('handles multiple tool_use events', () => {
    const lines = [
      '{"type":"tool_use","tool_name":"run_shell_command","tool_id":"cmd_1","parameters":{"command":"ls"}}',
      '{"type":"tool_use","tool_name":"read_file","tool_id":"cmd_2","parameters":{"path":"foo.ts"}}',
      '{"type":"tool_use","tool_name":"run_shell_command","tool_id":"cmd_3","parameters":{"command":"cat bar.ts"}}',
    ];
    const parsed = parseGeminiJSONL(lines);
    expect(parsed.toolCalls).toEqual(['run_shell_command', 'read_file', 'run_shell_command']);
  });
 });
--- a/test/helpers/gemini-session-runner.ts
+++ b/test/helpers/gemini-session-runner.ts
@ -0,0 +1,201 @@
 /**
 * Gemini CLI subprocess runner for skill E2E testing.
 *
 * Spawns `gemini -p` as an independent process, parses its stream-json
 * output, and returns structured results. Follows the same pattern as
 * codex-session-runner.ts but adapted for the Gemini CLI.
 *
 * Key differences from Codex session-runner:
 * - Uses `gemini -p` instead of `codex exec`
 * - Output is NDJSON with event types: init, message, tool_use, tool_result, result
 * - Uses `--output-format stream-json --yolo` instead of `--json -s read-only`
 * - No temp HOME needed — Gemini discovers skills from `.agents/skills/` in cwd
 * - Message events are streamed with `delta: true` — must concatenate
 */
 import * as path from 'path';
 // --- Interfaces ---
 export interface GeminiResult {
  output: string;           // Full assistant message text (concatenated deltas)
  toolCalls: string[];      // Tool names from tool_use events
  tokens: number;           // Total tokens used
  exitCode: number;         // Process exit code
  durationMs: number;       // Wall clock time
  sessionId: string | null; // Session ID from init event
  rawLines: string[];       // Raw JSONL lines for debugging
 }
 // --- JSONL parser ---
 export interface ParsedGeminiJSONL {
  output: string;
  toolCalls: string[];
  tokens: number;
  sessionId: string | null;
 }
 /**
 * Parse an array of JSONL lines from `gemini -p --output-format stream-json`.
 * Pure function — no I/O, no side effects.
 *
 * Handles these Gemini event types:
 * - init → extract session_id
 * - message (role=assistant, delta=true) → concatenate content into output
 * - tool_use → extract tool_name
 * - tool_result → ignored (not extracted)
 * - result → extract token usage from stats
 */
 export function parseGeminiJSONL(lines: string[]): ParsedGeminiJSONL {
  const outputParts: string[] = [];
  const toolCalls: string[] = [];
  let tokens = 0;
  let sessionId: string | null = null;
  for (const line of lines) {
    if (!line.trim()) continue;
    try {
      const obj = JSON.parse(line);
      const t = obj.type || '';
      if (t === 'init') {
        const sid = obj.session_id || '';
        if (sid) sessionId = sid;
      } else if (t === 'message') {
        if (obj.role === 'assistant' && obj.content) {
          outputParts.push(obj.content);
        }
      } else if (t === 'tool_use') {
        const name = obj.tool_name || '';
        if (name) toolCalls.push(name);
      } else if (t === 'result') {
        const stats = obj.stats || {};
        tokens = (stats.total_tokens || 0);
      }
    } catch { /* skip malformed lines */ }
  }
  return {
    output: outputParts.join(''),
    toolCalls,
    tokens,
    sessionId,
  };
 }
 // --- Main runner ---
 /**
 * Run a prompt via `gemini -p` and return structured results.
 *
 * Spawns gemini with stream-json output, parses JSONL events,
 * and returns a GeminiResult. If the gemini binary is not found, returns a
 * placeholder result (exitCode -1, output starting with 'SKIP:') instead of throwing.
 */
 export async function runGeminiSkill(opts: {
  prompt: string;           // What to ask Gemini
  timeoutMs?: number;       // Default 300000 (5 min)
  cwd?: string;             // Working directory (where .agents/skills/ lives)
 }): Promise<GeminiResult> {
  const {
    prompt,
    timeoutMs = 300_000,
    cwd,
  } = opts;
  const startTime = Date.now();
  // Check if gemini binary exists
  const whichResult = Bun.spawnSync(['which', 'gemini']);
  if (whichResult.exitCode !== 0) {
    return {
      output: 'SKIP: gemini binary not found',
      toolCalls: [],
      tokens: 0,
      exitCode: -1,
      durationMs: Date.now() - startTime,
      sessionId: null,
      rawLines: [],
    };
  }
  // Build gemini command
  const args = ['-p', prompt, '--output-format', 'stream-json', '--yolo'];
  // Spawn gemini — uses real HOME for auth, cwd for skill discovery
  const proc = Bun.spawn(['gemini', ...args], {
    cwd: cwd || process.cwd(),
    stdout: 'pipe',
    stderr: 'pipe',
  });
  // Race against timeout
  let timedOut = false;
  const timeoutId = setTimeout(() => {
    timedOut = true;
    proc.kill();
  }, timeoutMs);
  // Stream and collect JSONL from stdout
  const collectedLines: string[] = [];
  const stderrPromise = new Response(proc.stderr).text();
  const reader = proc.stdout.getReader();
  const decoder = new TextDecoder();
  let buf = '';
  try {
    while (true) {
      const { done, value } = await reader.read();
      if (done) break;
      buf += decoder.decode(value, { stream: true });
      const lines = buf.split('\n');
      buf = lines.pop() || '';
      for (const line of lines) {
        if (!line.trim()) continue;
        collectedLines.push(line);
        // Real-time progress to stderr
        try {
          const event = JSON.parse(line);
          if (event.type === 'tool_use' && event.tool_name) {
            const elapsed = Math.round((Date.now() - startTime) / 1000);
            process.stderr.write(`  [gemini ${elapsed}s] tool: ${event.tool_name}\n`);
          } else if (event.type === 'message' && event.role === 'assistant' && event.content) {
            const elapsed = Math.round((Date.now() - startTime) / 1000);
            process.stderr.write(`  [gemini ${elapsed}s] message: ${event.content.slice(0, 100)}\n`);
          }
        } catch { /* skip — parseGeminiJSONL will handle it later */ }
      }
    }
  } catch { /* stream read error — fall through to exit code handling */ }
  // Flush remaining buffer
  if (buf.trim()) {
    collectedLines.push(buf);
  }
  const stderr = await stderrPromise;
  const exitCode = await proc.exited;
  clearTimeout(timeoutId);
  const durationMs = Date.now() - startTime;
  // Parse all collected JSONL lines
  const parsed = parseGeminiJSONL(collectedLines);
  // Log stderr if non-empty (may contain auth errors, etc.)
  if (stderr.trim()) {
    process.stderr.write(`  [gemini stderr] ${stderr.trim().slice(0, 200)}\n`);
  }
  return {
    output: parsed.output,
    toolCalls: parsed.toolCalls,
    tokens: parsed.tokens,
    exitCode: timedOut ? 124 : exitCode,
    durationMs,
    sessionId: parsed.sessionId,
    rawLines: collectedLines,
  };
 }
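The stdout loop in `runGeminiSkill` relies on a line-buffering step: chunks can end mid-line, so the trailing partial line is carried over to the next chunk and flushed at the end. A minimal standalone sketch of that step is below; `collectLines` is a hypothetical helper written for illustration and does not exist in the runner.

```typescript
/** Hypothetical helper: turn arbitrarily split text chunks into complete NDJSON lines. */
function collectLines(chunks: string[]): string[] {
  const lines: string[] = [];
  let buf = '';
  for (const chunk of chunks) {
    buf += chunk;
    const parts = buf.split('\n');
    buf = parts.pop() ?? ''; // the last piece may be an incomplete line; keep it buffered
    for (const part of parts) {
      if (part.trim()) lines.push(part); // drop empty and whitespace-only lines
    }
  }
  if (buf.trim()) lines.push(buf); // flush a final line that had no trailing newline
  return lines;
}

// Chunks deliberately split in the middle of JSON objects.
const demoChunks = ['{"type":"in', 'it"}\n{"type":"message"}\n', '\n{"type":"res', 'ult"}'];
const demoLines = collectLines(demoChunks);
console.log(demoLines); // three complete lines: init, message, result
```

Keeping this step separate from JSON parsing mirrors the split in the runner, where collected lines are handed to `parseGeminiJSONL` only after the stream ends.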
--- a/test/helpers/touchfiles.ts
+++ b/test/helpers/touchfiles.ts
@ -57,9 +57,13 @@ export const E2E_TOUCHFILES: Record<string, string[]> = {
  'review-base-branch':       ['review/**'],
  'review-design-lite':       ['review/**', 'test/fixtures/review-eval-design-slop.*'],
  // Office Hours
  'office-hours-spec-review':  ['office-hours/**', 'scripts/gen-skill-docs.ts'],
  // Plan reviews
  'plan-ceo-review':           ['plan-ceo-review/**'],
  'plan-ceo-review-selective': ['plan-ceo-review/**'],
  'plan-ceo-review-benefits':  ['plan-ceo-review/**', 'scripts/gen-skill-docs.ts'],
  'plan-eng-review':           ['plan-eng-review/**'],
  'plan-eng-review-artifact':  ['plan-eng-review/**'],
@ -80,6 +84,10 @@ export const E2E_TOUCHFILES: Record<string, string[]> = {
  'codex-discover-skill':  ['codex/**', '.agents/skills/**', 'test/helpers/codex-session-runner.ts'],
  'codex-review-findings': ['review/**', '.agents/skills/gstack-review/**', 'codex/**', 'test/helpers/codex-session-runner.ts'],
  // Gemini E2E (tests skills via Gemini CLI)
  'gemini-discover-skill':  ['.agents/skills/**', 'test/helpers/gemini-session-runner.ts'],
  'gemini-review-findings': ['review/**', '.agents/skills/gstack-review/**', 'test/helpers/gemini-session-runner.ts'],
  // QA bootstrap
  'qa-bootstrap': ['qa/**', 'browse/src/**', 'ship/**'],
@ -140,6 +148,10 @@ export const LLM_JUDGE_TOUCHFILES: Record<string, string[]> = {
  'design-review/SKILL.md fix loop':      ['design-review/SKILL.md', 'design-review/SKILL.md.tmpl'],
  'design-consultation/SKILL.md research': ['design-consultation/SKILL.md', 'design-consultation/SKILL.md.tmpl'],
  // Office Hours
  'office-hours/SKILL.md spec review':    ['office-hours/SKILL.md', 'office-hours/SKILL.md.tmpl', 'scripts/gen-skill-docs.ts'],
  'office-hours/SKILL.md design sketch':  ['office-hours/SKILL.md', 'office-hours/SKILL.md.tmpl', 'scripts/gen-skill-docs.ts'],
  // Other skills
  'retro/SKILL.md instructions':          ['retro/SKILL.md', 'retro/SKILL.md.tmpl'],
  'qa-only/SKILL.md workflow':            ['qa-only/SKILL.md', 'qa-only/SKILL.md.tmpl'],
@ -152,6 +164,7 @@ export const LLM_JUDGE_TOUCHFILES: Record<string, string[]> = {
 export const GLOBAL_TOUCHFILES = [
  'test/helpers/session-runner.ts',
  'test/helpers/codex-session-runner.ts',
  'test/helpers/gemini-session-runner.ts',
  'test/helpers/eval-store.ts',
  'test/helpers/llm-judge.ts',
  'scripts/gen-skill-docs.ts',
--- a/test/skill-e2e.test.ts
+++ b/test/skill-e2e.test.ts
@ -2911,6 +2911,128 @@ Write the full output (including the GATE verdict) to ${codexDir}/codex-output.m
  }, 360_000);
 });
 // --- Office Hours Spec Review E2E ---
 describeIfSelected('Office Hours Spec Review E2E', ['office-hours-spec-review'], () => {
  let ohDir: string;
  beforeAll(() => {
    ohDir = fs.mkdtempSync(path.join(os.tmpdir(), 'skill-e2e-oh-spec-'));
    const run = (cmd: string, args: string[]) =>
      spawnSync(cmd, args, { cwd: ohDir, stdio: 'pipe', timeout: 5000 });
    run('git', ['init', '-b', 'main']);
    run('git', ['config', 'user.email', 'test@test.com']);
    run('git', ['config', 'user.name', 'Test']);
    fs.writeFileSync(path.join(ohDir, 'README.md'), '# Test Project\n');
    run('git', ['add', '.']);
    run('git', ['commit', '-m', 'init']);
    // Copy office-hours skill
    fs.mkdirSync(path.join(ohDir, 'office-hours'), { recursive: true });
    fs.copyFileSync(
      path.join(ROOT, 'office-hours', 'SKILL.md'),
      path.join(ohDir, 'office-hours', 'SKILL.md'),
    );
  });
  afterAll(() => {
    try { fs.rmSync(ohDir, { recursive: true, force: true }); } catch {}
  });
  test('/office-hours SKILL.md contains spec review loop', async () => {
    const result = await runSkillTest({
      prompt: `Read office-hours/SKILL.md. I want to understand the spec review loop.
 Summarize what the "Spec Review Loop" section does — specifically:
 1. How many dimensions does the reviewer check?
 2. What tool is used to dispatch the reviewer?
 3. What's the maximum number of iterations?
 4. What metrics are tracked?
 Write your summary to ${ohDir}/spec-review-summary.md`,
      workingDirectory: ohDir,
      maxTurns: 8,
      timeout: 120_000,
      testName: 'office-hours-spec-review',
      runId,
    });
    logCost('/office-hours spec review', result);
    recordE2E('/office-hours-spec-review', 'Office Hours Spec Review E2E', result);
    expect(result.exitReason).toBe('success');
    const summaryPath = path.join(ohDir, 'spec-review-summary.md');
    if (fs.existsSync(summaryPath)) {
      const summary = fs.readFileSync(summaryPath, 'utf-8').toLowerCase();
      // Verify the agent understood the key concepts
      expect(summary).toMatch(/5.*dimension|dimension.*5|completeness|consistency|clarity|scope|feasibility/);
      expect(summary).toMatch(/agent|subagent/);
      expect(summary).toMatch(/3.*iteration|iteration.*3|maximum.*3/);
    }
  }, 180_000);
 });
 // --- Plan CEO Review Benefits-From E2E ---
 describeIfSelected('Plan CEO Review Benefits-From E2E', ['plan-ceo-review-benefits'], () => {
  let benefitsDir: string;
  beforeAll(() => {
    benefitsDir = fs.mkdtempSync(path.join(os.tmpdir(), 'skill-e2e-benefits-'));
    const run = (cmd: string, args: string[]) =>
      spawnSync(cmd, args, { cwd: benefitsDir, stdio: 'pipe', timeout: 5000 });
    run('git', ['init', '-b', 'main']);
    run('git', ['config', 'user.email', 'test@test.com']);
    run('git', ['config', 'user.name', 'Test']);
    fs.writeFileSync(path.join(benefitsDir, 'README.md'), '# Test Project\n');
    run('git', ['add', '.']);
    run('git', ['commit', '-m', 'init']);
    // Copy plan-ceo-review skill
    fs.mkdirSync(path.join(benefitsDir, 'plan-ceo-review'), { recursive: true });
    fs.copyFileSync(
      path.join(ROOT, 'plan-ceo-review', 'SKILL.md'),
      path.join(benefitsDir, 'plan-ceo-review', 'SKILL.md'),
    );
  });
  afterAll(() => {
    try { fs.rmSync(benefitsDir, { recursive: true, force: true }); } catch {}
  });
  test('/plan-ceo-review SKILL.md contains prerequisite skill offer', async () => {
    const result = await runSkillTest({
      prompt: `Read plan-ceo-review/SKILL.md. Search for sections about "Prerequisite" or "office-hours" or "design doc found".
 Summarize what happens when no design doc is found — specifically:
 1. Is /office-hours offered as a prerequisite?
 2. What options does the user get?
 3. Is there a mid-session detection for when the user seems lost?
 Write your summary to ${benefitsDir}/benefits-summary.md`,
      workingDirectory: benefitsDir,
      maxTurns: 8,
      timeout: 120_000,
      testName: 'plan-ceo-review-benefits',
      runId,
    });
    logCost('/plan-ceo-review benefits-from', result);
    recordE2E('/plan-ceo-review-benefits', 'Plan CEO Review Benefits-From E2E', result);
    expect(result.exitReason).toBe('success');
    const summaryPath = path.join(benefitsDir, 'benefits-summary.md');
    if (fs.existsSync(summaryPath)) {
      const summary = fs.readFileSync(summaryPath, 'utf-8').toLowerCase();
      // Verify the agent understood the skill chaining
      expect(summary).toMatch(/office.hours/);
      expect(summary).toMatch(/design doc|no design/i);
    }
  }, 180_000);
 });
 // Module-level afterAll — finalize eval collector after all tests complete
 afterAll(async () => {
  if (evalCollector) {
--- a/test/skill-validation.test.ts
+++ b/test/skill-validation.test.ts
@ -644,6 +644,59 @@ describe('office-hours skill structure', () => {
  test('contains builder operating principles', () => {
    expect(content).toContain('Delight is the currency');
  });
  // Spec Review Loop (Phase 5.5)
  test('contains spec review loop', () => {
    expect(content).toContain('Spec Review Loop');
  });
  test('contains adversarial review dimensions', () => {
    for (const dim of ['Completeness', 'Consistency', 'Clarity', 'Scope', 'Feasibility']) {
      expect(content).toContain(dim);
    }
  });
  test('contains subagent dispatch instruction', () => {
    expect(content).toMatch(/Agent.*tool|subagent/i);
  });
  test('contains max 3 iterations', () => {
    expect(content).toMatch(/3.*iteration|maximum.*3/i);
  });
  test('contains quality score', () => {
    expect(content).toContain('quality score');
  });
  test('contains spec review metrics path', () => {
    expect(content).toContain('spec-review.jsonl');
  });
  test('contains convergence guard', () => {
    expect(content).toMatch(/convergence/i);
  });
  // Visual Sketch (Phase 4.5)
  test('contains visual sketch section', () => {
    expect(content).toContain('Visual Sketch');
  });
  test('contains wireframe generation', () => {
    expect(content).toMatch(/wireframe|sketch/i);
  });
  test('contains DESIGN.md awareness', () => {
    expect(content).toContain('DESIGN.md');
  });
  test('contains browse rendering', () => {
    expect(content).toContain('$B goto');
    expect(content).toContain('$B screenshot');
  });
  test('contains rough aesthetic instruction', () => {
    expect(content).toMatch(/rough|hand-drawn/i);
  });
 });
 describe('investigate skill structure', () => {
@ -856,6 +909,22 @@ describe('CEO review mode validation', () => {
    expect(content).toContain('HOLD SCOPE');
    expect(content).toContain('REDUCTION');
  });
  // Skill chaining (benefits-from)
  test('contains prerequisite skill offer for office-hours', () => {
    expect(content).toContain('Prerequisite Skill Offer');
    expect(content).toContain('/office-hours');
  });
  test('contains mid-session detection', () => {
    expect(content).toContain('Mid-session detection');
    expect(content).toMatch(/still figuring out|seems lost/i);
  });
  // Spec review on CEO plans
  test('contains spec review loop for CEO plan documents', () => {
    expect(content).toContain('Spec Review Loop');
  });
 });
 // --- gstack-slug helper ---
@ -1187,18 +1256,36 @@ describe('Codex skill', () => {
    expect(content).toContain('mktemp');
  });
-  test('codex integration in /review offers second opinion', () => {
+  test('codex integration in /review has config-driven review step', () => {
    const content = fs.readFileSync(path.join(ROOT, 'review', 'SKILL.md'), 'utf-8');
-    expect(content).toContain('Codex second opinion');
+    expect(content).toContain('Codex review');
    expect(content).toContain('codex_reviews');
    expect(content).toContain('codex review');
    expect(content).toContain('adversarial');
    expect(content).toContain('xhigh');
    expect(content).toContain('Investigate and fix');
    expect(content).toContain('CROSS-MODEL');
  });
-  test('codex integration in /ship offers review gate', () => {
+  test('codex integration in /ship has config-driven review step', () => {
    const content = fs.readFileSync(path.join(ROOT, 'ship', 'SKILL.md'), 'utf-8');
-    expect(content).toContain('Codex');
+    expect(content).toContain('Codex review');
    expect(content).toContain('codex_reviews');
    expect(content).toContain('codex review');
    expect(content).toContain('codex-review');
    expect(content).toContain('xhigh');
    expect(content).toContain('Investigate and fix');
  });
  test('codex-host ship/review do NOT contain codex review step', () => {
    const shipContent = fs.readFileSync(path.join(ROOT, '.agents', 'skills', 'gstack-ship', 'SKILL.md'), 'utf-8');
    expect(shipContent).not.toContain('codex review --base');
    expect(shipContent).not.toContain('Investigate and fix');
    const reviewContent = fs.readFileSync(path.join(ROOT, '.agents', 'skills', 'gstack-review', 'SKILL.md'), 'utf-8');
    expect(reviewContent).not.toContain('codex review --base');
    expect(reviewContent).not.toContain('codex_reviews');
    expect(reviewContent).not.toContain('Investigate and fix');
  });
  test('codex integration in /plan-eng-review offers plan critique', () => {
--- a/test/touchfiles.test.ts
+++ b/test/touchfiles.test.ts
@ -78,8 +78,9 @@ describe('selectTests', () => {
    const result = selectTests(['plan-ceo-review/SKILL.md'], E2E_TOUCHFILES);
    expect(result.selected).toContain('plan-ceo-review');
    expect(result.selected).toContain('plan-ceo-review-selective');
-    expect(result.selected.length).toBe(2);
+    expect(result.selected).toContain('plan-ceo-review-benefits');
-    expect(result.skipped.length).toBe(Object.keys(E2E_TOUCHFILES).length - 2);
+    expect(result.selected.length).toBe(3);
+    expect(result.skipped.length).toBe(Object.keys(E2E_TOUCHFILES).length - 3);
  });
  test('global touchfile triggers ALL tests', () => {