mirror of https://github.com/garrytan/gstack.git
5 Commits
| Author | SHA1 | Message | Date |
|---|---|---|---|
|
|
a5833c413f
|
v1.57.10.0 feat: Codex review default-on across review/ship/plan/docs (#1966)
* feat(config): make codex_reviews the master switch for all Codex review Broaden the codex_reviews doc to describe it governing /review, /ship, /document-release, plan reviews, and /autoplan. Reject invalid values on set (preserving the existing value) so a typo can never silently flip paid Codex calls on or off. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * feat(review): Codex review default-on across review/ship/plan/docs Add a shared codexPreflight() helper (constants.ts) that, in one bash block, reads codex_reviews, sources gstack-codex-probe, checks install + auth, and echoes a single canonical mode (ready/not_installed/not_authed/ disabled). All Codex resolvers route through it. - generateCodexPlanReview: opt-in question removed; the outside voice now runs automatically (default-on), falling back to a Claude subagent when Codex is missing/unauthed. Cross-model tension still gates on user approval (sovereignty preserved). - generateAdversarialStep: probe-based availability (install AND auth), distinct not-installed vs not-authed guidance; 200-line structured-review threshold unchanged. - generateCodexDocReview (new, wired via CODEX_DOC_REVIEW): reviews the release's docs against the shipped diff range, informational + an explicit apply-fixes decision point, never auto-edits. - autoplan Phase 0.5 now honors codex_reviews=disabled so the switch is truly global. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * chore(docs): regenerate SKILL docs + refresh ship golden Output of gen:skill-docs for the Codex-default-on resolver/template changes. Refreshes the factory-ship golden fixture (codex-host output unchanged — resolvers strip for the codex host). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test(infra): widen size-budget guards for default-on Codex outside-voice The codexPreflight() block + CODEX_MODE branch prose (replacing the smaller opt-in question) grows plan-ceo/eng/devex-review and review by 5-7% over baseline. Each bump carries a comment justifying it as intentional capability, not slop. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test: guard Codex default-on + config reject-on-set skill-validation: assert plan reviews no longer carry the opt-in question and render the default-on outside-voice, document-release carries the doc review, and the codex host strips all of it. gstack-config: codex_reviews defaults to enabled, accepts enabled/disabled, and rejects an invalid value while preserving the existing one. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(test): align gstack-config tests with defaults-fallback behavior Three tests (last touched v0.13.7.0) asserted get/list print empty for unset keys, but gstack-config falls back to the documented defaults table (get returns the default, list shows the active-values block). Update the assertions to the real behavior and split out an unknown-key case that does still return empty. Pre-existing red, unrelated to codex review. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * v1.57.10.0 feat: Codex review default-on across review/ship/plan/docs Codex cross-model review now runs by default on /review, /ship, all four plan reviews, /document-release, and /autoplan, governed by one master switch (codex_reviews, default enabled). Plan-review outside voice is default-on; /document-release gets a new Codex doc-vs-diff audit; every call site detects install AND auth and falls back to a Claude subagent with a clear reason. Disable everything with: gstack-config set codex_reviews disabled Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com> |
|
|
|
45cc95d5f4
|
v1.57.5.0 feat: cross-session decision memory + gbrain dream-stage call graph (#1910)
* feat(gbrain-sync): add cycleCompleted() cycle-state probe
Reads `gbrain doctor` cycle_freshness to classify whether a source has
completed a full cycle (completed/never/unknown). A fail naming this source
-> never; a fail naming only other sources -> completed; an absent or
unparseable check -> unknown, so an unrelated doctor failure never masks a
real state. Gates the automatic call-graph build on --full.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* feat(gbrain-sync): --dream call-graph stage with lock-free gate + honest outcome guard
Adds a source-scoped `gbrain dream --source <id>` stage that builds this
worktree's call graph (code-callers/code-callees). Runs lock-free after the
sync lock releases so it never blocks sibling worktrees; a .dream-in-progress
marker dedupes concurrent dreams. --full auto-runs it only when the cycle was
never built; explicit --dream always forces; --no-dream opts out.
The stage parses the cycle's own output and reports the truth, not a flat
"built": a WARN when the schema pack can't extract code symbols, when the
embed phase failed for a missing key, or when 0 edges resolved; OK with the
resolved-edge count otherwise. gbrain exits 0 even when it skips on a held
cycle lock (e.g. autopilot), so that case reports SKIP, not success.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore: ignore gbrain .sources/ local staging dir
gbrain writes per-source staging and capability-check artifacts under
.sources/ in the repo root. It's machine-local runtime state, not source.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* docs(gbrain): honest call-graph guidance in /sync-gbrain + pin works on gbrain>=0.41.38
sync-gbrain frames the --dream offer honestly: building a call graph requires a
code-aware schema pack, and the dream stage reports a WARN when it can't. The
verdict's Call graph row mirrors the dream stage's real outcome instead of
assuming a completed cycle means edges exist. The ## GBrain Search Guidance
block written into CLAUDE.md drops the old code-callers --source caveat:
gbrain >=0.41.38.0 honors the .gbrain-source pin for code-callers/code-callees.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* feat(jsonl-store): shared audited JSONL plumbing (injection-reject + atomic append + tolerant read)
Single source of truth extracted for D2A: gstack-learnings-* and the upcoming
gstack-decision-* bins share one injection-pattern list, one atomic single-line
appender, and one tolerant reader. No more drift between stores.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* refactor(learnings-log): use shared hasInjection from lib/jsonl-store (D2A)
Replace the inline injection-pattern copy with the shared list. One audited
write-path rejection across learnings + the upcoming decision store. Behavior
unchanged (35/35 learnings tests green); learnings-search keeps its inline copy
because a structural test pins its bash/bun shape.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* feat(decision): event-sourced decision-memory model (lib/gstack-decision)
decide/supersede/redact events on lib/jsonl-store; active set is computed (no
mutable status), dangling refs tolerated. Free-text is injection-checked and
redact-scanned on write (HIGH secret -> reject). Scope filter (repo/branch/issue)
for relevant resurfacing. File-only + reliable; gbrain not required.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* feat(decision): bounded active snapshot + compaction (redact expunges, supersede archives)
writeSnapshot/readSnapshot/rebuildSnapshot give an O(active) bounded read for the
session-start hot path (D1A). compact() rewrites the log to active, archives
superseded decisions for history, and EXPUNGES redacted ones (dropped, never
archived) so an accidentally-captured secret leaves the store for good.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* feat(decision): gstack-decision-log + gstack-decision-search bins (non-interactive)
Two bins mirroring gstack-learnings-* (D3A). log writes decide/--supersede/--redact/
--compact events + refreshes the bounded snapshot + enqueues for cross-machine sync;
search reads the O(active) snapshot, scope-filtered to current branch, newest-first,
--all to include superseded, --json for machines. Empty store returns silently
(no snapshot write on an empty read).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* feat(memory): surface active decisions at session start + capture nudge (Context Recovery)
Context Recovery now shows recent scope-relevant active decisions (bounded read of
decisions.active.json via gstack-decision-search) and instructs the agent to treat
them as settled calls and to log durable decisions/reversals. Closes the Phase-1
capture->curate->resurface loop, reliable + file-only. Regen across all hosts folded
in (squash-with-regen); parity 10/10, freshness green.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* test: refresh ship golden baselines for the memory-loop preamble change
Context Recovery now emits the cross-session-decisions block, so ship's preamble
(all hosts) changed. Golden baselines are hand-maintained copies (gen does not
write them); refresh them from the fresh gen so golden-file regression passes.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* docs(memory): document the cross-session decision-memory loop in CLAUDE.md
Adds a '## Cross-session decision memory' section: how to resurface
(gstack-decision-search) and capture (gstack-decision-log) durable decisions,
the supersede/redact/compact verbs, and a crisp durable-vs-trivial definition
so the store stays signal. Reliable file-only path; gbrain not required.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* feat(memory): emit durable decisions from ship/ceo/eng/spec at structured points
Wires the four skills that finalize real decisions to capture them in the
cross-session decision store, from their STRUCTURED outputs (never free-text
scraping):
- ship: the version bump (level + why) at write time
- plan-ceo-review: accepted scope + verdict (branch-scoped)
- plan-eng-review: the architecture verdict + key call (branch-scoped)
- spec: the filed issue's core approach (issue-scoped)
All emits are non-interactive, schema-correct (content in decision/rationale,
source=skill, confidence 1-10), and best-effort (|| true) so a decision-log
failure never blocks the workflow. Includes regen across hosts + refreshed ship
golden baselines.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* feat(memory): optional gbrain --semantic recall for decision search
Adds gstack-decision-search --semantic (with --query): appends a 'Related from
memory' block from gbrain semantic search, scoped to the curated-memory source.
Pure enhancement, reliability-first: a new lib/gstack-decision-semantic.ts is the
ONLY decision module that touches gbrain and is imported lazily only on --semantic,
so the reliable file path never loads gbrain code. Every path degrades to the
reliable file results when gbrain is off, unconfigured, empty, or errors (never
throws, 10s timeout).
Built against the verified gbrain 0.42.x surface (text output [score] slug --
snippet, NOT JSON; curated-memory source resolved by worktree path, not a
gstack-brain-<user> id). Deterministic-contract tests only: parser units,
degrade-to-null when gbrain absent, and a fake-gbrain shim proving scope+search
end-to-end. find-contradictions deferred (no verifiable CLI surface yet + curated
memory not indexed).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* feat(gbrain-sync): self-heal stale autopilot lock (dead-pid)
detectAutopilot treated a lock FILE as proof of life, so a crashed gbrain daemon
left a stale lock that wedged every sync forever (observed: a dead pid refused
--full indefinitely). Now read the holder pid (bare or JSON body) and check
liveness via signal-0: ESRCH=dead → ignore the stale signal and keep checking;
EPERM=alive (other user) → active. A stale lock never masks a live autopilot
process. Pure decision function — does not delete the file; the caller may clean it.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* docs(review): drop stray trailing code fence in TODOS-format
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* fix(test): align section-loading E2E testNames with their TOUCHFILES keys
Pre-existing on main (v1.56.x): the two section-loading E2E tests used
human-label testNames ('/ship section-loading') that don't match their slug
keys ('ship-section-loading') in E2E_TOUCHFILES/E2E_TIERS. Every other E2E test
uses the slug as its testName, and the TOUCHFILES completeness gate requires
testName to be a registered key — so the gate was red. Align both testNames to
their slug keys (also fixes tier lookup for these two periodic tests).
Verified failing on a clean origin/main checkout before the fix.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* fix: pre-landing review fixes (datamark, DRY, compact, coverage)
Addresses the pre-landing review findings (all INFORMATIONAL, no criticals):
- security: datamark resurfaced decision text at the render boundary
(lib/gstack-decision.ts datamark() — neutralizes code fences, --- banners,
<|role|>/</system> markers, control chars, newlines). Applied in
gstack-decision-search human output so stored text can't masquerade as
instructions in Context Recovery (codex hardening #3 / AC #7). --json stays raw.
- DRY: extract resolveSlug/gitBranch/flagValue to lib/bin-context.ts; both
decision bins use it instead of duplicating the helpers.
- compact(): batch the archive append (one write, not N) and shrink the
mid-compact crash window; simplify the opaque branch/issue ternary.
- coverage: learnings-log injection rejection (D2A wiring), search --recent/
--scope + NaN-safe --recent, datamark-applied, unparseable lock body,
compact-empty, corrupt-snapshot degrade.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* fix(security): close adversarial-review findings in decision memory
Adversarial review (Claude subagent) found a CRITICAL the specialist pass missed:
- F1 (CRITICAL): 'Human:'/'Assistant:' turn-prefixes bypassed BOTH the write-time
denylist AND datamark(), landing verbatim in agent context inside the trusted
ACTIVE DECISIONS fence. Add 'human:' (+ 'disregard previous', 'from now on') to
the shared denylist, and have datamark() neutralize Human:/Assistant:/System:/User:
turn-prefixes (ZWSP) at the render boundary.
- F2: datamark() only stripped ASCII C0; extend to Unicode line terminators
(U+0085/2028/2029) and U+007F so 'strip newlines' actually holds.
- F3: validateDecide blocked only HIGH secrets; MEDIUM-tier PII (e.g. SSN) persisted
silently and synced cross-machine. The store is non-interactive (no confirm path),
so fail closed on MEDIUM too.
- F4: compact() was a lock-free read-modify-rewrite that could clobber a concurrent
append (lost decision). Add an O_EXCL compact lock + a pre-rename size recheck that
aborts untouched (skipped=true) if an append landed; caller re-runs.
- F7: filterByScope unknown/garbage scope fell through to 'return true' (leaked into
every context); fail conservative (false).
F5 (pid reuse) and F6 (pgrep over-match) are intentionally left as-is: both fail SAFE
(over-refuse sync); making them precise would introduce a fail-DANGEROUS path
(allowing sync during a real autopilot). True disambiguation needs gbrain to stamp the
lock with a start-time, which gstack doesn't own. F8 (compact moves history to archive)
is by design.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* fix(security): close cross-model (Codex) adversarial findings
Codex adversarial review found a HIGH the Claude pass missed plus 3 mediums:
- C1 (HIGH): gstack-decision-search --all returned every decide and IGNORED redact
events, so a redacted secret still resurfaced via --all until compact ran. --all
now excludes redacted (redact = expunge from every read path), still showing
superseded history.
- C-med: semantic (external gbrain) slug/snippet were printed raw — datamark them too
so a gbrain hit can't spoof role markers / fences into agent context.
- C4: semanticRecall fell back to an UNSCOPED gbrain search when no curated-memory
source resolved, pulling code/doc corpora mislabeled as 'related decisions'. Now
returns null (degrade) when there's no worktree-backed memory source.
- C5: validateDecide scanned only decision/rationale/alternatives; branch and issue
are stored + surfaced (raw via --json), so include them in the injection+secret scan.
C2 (snapshot staleness) / C3 (compact TOCTOU residual): accepted for a single-user
store — atomic appends never lose the event, rebuilds self-heal, and the compact
size-recheck leaves only a sub-ms window; full append-locking would break the
lock-free append design.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore: bump version and changelog (v1.57.5.0)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
|
|
|
d8c91c6267
|
v1.57.3.0 fix(ship): always-loaded PR-title-version rule + fork-PR title-sync backstop (#1909)
* fix(ship): restore always-loaded PR-title-version invariant to skeleton
The v1.54.0.0 carve moved the 'PR title MUST start with v$NEW_VERSION' rule
out of the always-loaded ship skeleton and entirely into the lazily-loaded
pr-body.md section. The agent only set the version prefix if it happened to
read that section before creating the PR, so PRs landed with bare titles.
Restore a one-line invariant (+ helper reference) to ship/SKILL.md.tmpl right
before the {{SECTION:pr-body}} pointer, mirroring the AUQ always-loaded
precedent. Full procedure stays sectioned. Regenerated all hosts.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* test(ship): guard PR-title-version rule + pull_request_target safety
Two free gate tests so a future carve or workflow refactor can't silently
regress:
- ship-pr-title-version-always-loaded: asserts the invariant lives in the
always-loaded ship/SKILL.md skeleton (not only sections/), and that the
skeleton+sections union keeps BOTH the create and the existing-PR update
title paths. Modeled on test/auq-format-always-loaded.test.ts.
- pr-title-sync-workflow-safety: static tripwire that fails CI if
pr-title-sync.yml checks out PR-head code or inlines an attacker-controlled
${{ github.event.pull_request.* }} field inside a run: block (the two
pull_request_target footguns actionlint cannot catch).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* fix(ci): pr-title-sync covers fork PRs via hardened pull_request_target
Under plain pull_request the GITHUB_TOKEN is read-only on fork PRs, so the
title-sync backstop could never edit a fork/agent PR title. Switch to
pull_request_target (write token in base context) and make it safe:
- Check out the base repo only (no ref:) — execute trusted infra, never
fork-head code.
- All attacker-controlled PR fields (title, head repo, head sha) pass via
env: and are referenced as shell-quoted "$VAR", never inlined into run:.
- Read the PR-head VERSION as data (raw media type) from the head repo at the
head sha; guard the assignment under set -e.
- Same-repo read failure fails loudly; fork miss warns and skips (the backstop
stays green without going silently optional).
- Never echo the raw fork title (Actions parses ::workflow-command:: from stdout).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* fix(ship): expand binDir path in pr-body Linked Spec block
ship/sections/pr-body.md.tmpl:98-99 used ${ctx.paths.binDir}, but the
gen-skill-docs generator only resolves {{TOKEN}} syntax in .tmpl files — the
${...} JS-template-literal form is substituted only inside .ts resolver files.
So the token passed through literally into the generated pr-body.md, leaving the
agent with an unexpandable ${ctx.paths.binDir}/gstack-paths command in the
Linked Spec auto-detect block. Use the hardcoded helper path, consistent with
every other path reference in this section.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* refactor(test): fold ship PR-title skeleton guard into carve-guard registry
main shipped a generalized carve-guard system (PR #1907) that is now the single
source of truth for carved-skill skeleton invariants. Register the PR-title rule
there instead of a standalone test: ship's mustStayInSkeleton asserts v$NEW_VERSION
+ the rewrite helper stay always-loaded, and mustMoveToSection asserts both the
create and update PR paths stay carved into pr-body.md (present in the union, out of
the skeleton). Delete the standalone ship-pr-title-version-always-loaded test it
replaces. The CI-workflow safety tripwire stays standalone (not a carve concern).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore: bump version and changelog (v1.57.3.0)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
|
|
|
4dfdb7cdc2
|
v1.57.2.0 feat: AskUserQuestion prose fallback when the tool fails at runtime (#1908)
* feat(auq): add gstack-session-kind + echo SESSION_KIND in preamble
Classifies the session as spawned | headless | interactive from env markers
(OPENCLAW_SESSION / GSTACK_HEADLESS / CONDUCTOR_* / CLAUDE_CODE_ENTRYPOINT / CI),
defaulting to interactive. Echoed once at skill start alongside BRANCH/REPO_MODE
so the AskUserQuestion-failure fallback can branch without a shell-out at failure
time. Degrade-safe: empty/error => interactive.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* feat(auq): prose fallback when AskUserQuestion fails (interactive sessions)
On a genuine AUQ failure (tool absent, or present-but-erroring like Conductor's
flaky MCP returning '[Tool result missing due to internal error]'): retry once,
then branch on SESSION_KIND — spawned auto-chooses, headless BLOCKs, interactive
renders a prose decision brief the user answers by typing a letter.
The prose fallback MUST surface the triad: a clear ELI10 of the issue, a
per-choice Completeness score, and a recommendation+why (one paragraph per
choice). Carves out the [plan-tune auto-decide] denial as NOT a failure, and
qualifies the former 'tool_use, not prose' assertions so the rule isn't
self-contradicting. Tests pin the triad, the SESSION_KIND branch, the OV2
collision guard, the always-loaded guarantee, and a cross-file invariant on the
auto-decide prefix.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* test(auq): default GSTACK_HEADLESS=1 in eval/E2E runners
Headless harness runs classify as headless (BLOCK on AUQ failure rather than
emit a prose question no one reads). SDK runner uses ambient mutation, not the
Options.env object, to avoid breaking the SDK auth pipeline. Interactive-path
suites opt out by overriding the env per-run.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* feat(auq): defensive PostToolUse error-fallback hook (OV3:B)
When an AskUserQuestion call returns an error/missing result, this hook injects
additionalContext reminding the model to run the prose fallback for the current
SESSION_KIND. It does not render prose itself — it guarantees the reminder fires
at the moment of failure instead of relying on the model recalling SESSION_KIND.
Inert on success and inert if the platform never invokes PostToolUse on tool
errors (unverified — could not force the Conductor MCP error in a harness; see
the spike doc). The prompt-level fallback covers the case regardless. Decision
logic is unit-tested deterministically; registered in setup beside the existing
AUQ hooks.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore(auq): regenerate SKILL.md for all hosts + refresh ship goldens
Regenerated from the resolver changes (gen:skill-docs --host all). Refreshes the
byte-exact ship golden fixtures (claude/codex/factory). Spec prose tightened so
the cross-cutting preamble addition stays under the 5% per-skill parity ceiling
(investigate 4.8%) — guard unchanged.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* fix(test): kebab testNames for section-loading E2Es to match TOUCHFILES keys
The two section-loading E2E tests used display-form testNames ('/ship
section-loading', '/plan-ceo-review section-loading') while every other E2E
testName and their E2E_TOUCHFILES keys are kebab. The completeness gate does an
exact `name in E2E_TOUCHFILES` check, so it failed (pre-existing on main); diff-
based selection also couldn't match them. Align to ship-section-loading /
plan-ceo-section-loading.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* fix(test): make external-host freshness checks deterministic
The parameterized host smoke + --host all freshness tests assumed an external
`gen:skill-docs --host all` had run first (it never does in `bun test`), so which
host reported STALE varied by sibling-test timing — flaky. Regenerate the
gitignored external host dirs in a beforeAll so the --dry-run check is
deterministic. It still catches non-deterministic generation (the real bug class
for regenerated outputs); the tracked-claude freshness test runs earlier and is
unaffected.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* test(parity): headroom for AUQ cross-cutting addition on carved document-release
Merging main brought the carve of document-release (smaller skeleton); the AUQ
prose-fallback adds ~2KB to every skill's always-loaded preamble, landing
document-release at ~5.9% over the pre-carve v1.53.0.0 baseline. Add a per-carve
maxSizeRatio override (CARVE_GUARDS single source of truth) and bump only this
skill to 1.08. All other skills keep the strict 1.05 ceiling.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* fix(auq): harden error-fallback hook + harness per adversarial review
Codex pre-landing review found three real issues:
- The PostToolUse fallback hook shared source 'plan-tune-cathedral' with the
question-log hook (same event+matcher); gstack-settings-hook replaces the entry,
so it would have clobbered plan-tune capture. Give it its own 'auq-error-fallback'
source (separate entry, both run); ALREADY_INSTALLED now requires both sources.
- isErrorResponse triggered on any string containing 'internal error'/'is_error',
so a real answer or a {"is_error": false} payload could fire the fallback after a
successful question. Narrow it to the missing-result sentinel + boolean is_error.
- The SDK runner mutated process.env.GSTACK_HEADLESS process-wide (leaked headless
into later tests). Removed; GSTACK_HEADLESS=1 now lives in the eval package.json
scripts, scoped to the invocation and inherited by the SDK child.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* chore: bump version and changelog (v1.57.2.0)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
|
|
|
e722c5bf89
|
v1.57.0.0 feat: carve-guard system + carve cso/document-release/design-consultation (#1907)
* test: canonical CARVE_GUARDS registry; derive parity + size-budget from it Single source of truth for the carved-skill set + per-skill invariants (EQ1). parity-harness.ts sectioned entries and skill-size-budget.ts SECTIONS_EXTRACTED now derive from it instead of hand-maintained lists. Closes a pre-existing drift: plan-devex-review was in SECTIONS_EXTRACTED but had no sectioned parity invariant; now generated. carve-guards.ts is a pure leaf data module (import type only) to avoid an import cycle. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test: shared carve-guard check fns with injectable root discoverCarvedSkills/checkOrdering/checkCompleteness take a root param so the negative tests can point the real guards at a fixture dir. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test: E2 data-driven carve static ordering guard (gate) Per-PR backstop for every carved skill, one test() per skill, driven by CARVE_GUARDS staticInvariants. Generalizes + retires the ceo-specific ordering test. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test: E1 carve-guard completeness meta-guard (gate) Asserts filesystem carved set == CARVE_GUARDS set both directions, so a future carve without a registry entry fails CI. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test: ET1 guard-of-guards negative tests (gate) Temp fixture broken 3 ways proves E1/E2 actually throw, via the injectable root. Kills the silent-pass-guard failure class. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test: T2 data-driven behavioral section-loading guard (periodic) One file iterating CARVE_GUARDS, one test() per skill with GSTACK_CARVE_SKILL cost-scoping (D-CODEX A). external carves (ship, plan-ceo) keep bespoke tests; testNames aligned to their touchfile keys. Registered in touchfiles. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * docs: defer E3 real-session carve canary to TODOS Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * feat: carve document-release into skeleton + on-demand section Steps 2-9 (per-file audit, auto-updates, risky-change asks, CHANGELOG voice polish, cross-doc consistency, TODOS cleanup, VERSION bump, commit + PR body) move to sections/release-body.md, read on demand after the Step 1.5 coverage map. Skeleton 59,256 -> 45,797 B (-23%); union preserved. Adds the CARVE_GUARDS entry (auto-extends parity + size-budget via EQ1). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * feat: carve design-consultation into skeleton + on-demand section Phases 3-6 (complete proposal, drill-downs, design preview, writing DESIGN.md) move to sections/proposal-and-preview.md, read on demand after product context + research. Skeleton 80,719 -> 59,229 B (-27%); union preserved. Adds the CARVE_GUARDS entry. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * feat: carve cso into skeleton + on-demand section (security-safe) Scope-dependent audit Phases 2-11 move to sections/audit-phases.md. Mode dispatch (## Arguments, ## Mode Resolution), always-run Phases 0/1, and the Phase 12 false-positive-filtering exceptions stay ALWAYS-LOADED in the skeleton. Skeleton 79,383 -> 65,117 B (-18%); union preserved. Adds a cso CARVE_GUARDS entry with an earliest-use invariant (mustPrecedeStop): mode dispatch must appear before any STOP-Read, so a directive that decides which sections to read can't be stranded behind the STOP that reads them (codex outside-voice #6). carve-guard-checks gains the mustPrecedeStop check. parity moves cso monolith -> generated carved entry. cso-preserved.test.ts strengthened: phrases checked against the union, plus an always-loaded contract on the skeleton (dispatch + FP-filtering, codex #5). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test: make redaction/taxonomy tests union-aware for cso + document-release carves The cso carve moved Secrets Archaeology (prefixes, lib/redact-patterns.ts pointer, git-history scan) into sections/audit-phases.md, and the document-release carve moved the Step 9 PR-body redaction scan into sections/release-body.md. Three content-presence tests asserted that content in the skeleton SKILL.md/.md.tmpl; they now read the skeleton+sections union (same fix as cso-preserved + parity). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * chore: bump version and changelog (v1.57.0.0) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix: address pre-landing review (codex) on the carve - cso section: add a scope-gate header so '--owasp' (and other scoped modes) run only their selected phases, not every phase bundled in the section ('execute in full' no longer overrides Mode Resolution). - carve-guard-checks: gateAfterStop now compares against the LAST STOP, not the first, so a gate stranded between two STOPs in a multi-STOP skeleton fails. - TODOS: behavioral section-loading hermeticity (verifier matches global-install path, not the fixture) — pre-existing in auq-sdk-capture.ts, deferred. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com> |