docs(spikes): Claude hook mutation + Codex session format

Plan-tune cathedral T4 (per D5/D10). Two Phase 1 design spikes that
downstream tasks (T3, T5, T6, T8, T9) depend on.

claude-code-hook-mutation.md
- Confirms PreToolUse allow + updatedInput is supported and is the right
  mechanism for substituting an auto-decided answer.
- Pins stdin/stdout JSON schemas with field-by-field reference.
- Documents matcher regex syntax for "(AskUserQuestion|mcp__.*__AskUserQuestion)"
  so Conductor's MCP-routed AUQ is covered.
- Captures parallel-hook merge order caveat and our settings.json snippet.

codex-session-format.md
- Maps the on-disk ~/.codex/sessions/<date>/rollout-*.jsonl schema by
  event type (response_item 76%, event_msg 19%, turn_context, session_meta).
- Critical finding: Codex has NO AskUserQuestion tool. Gstack AUQ-shaped
  Decision Briefs surface as agent_message text; answer is the next
  user_message. Two-tier recovery: marker-first (D18), then pattern
  fallback for hash-only logging.
- Confirms logs_2.sqlite is internal telemetry, not session content.
- Lists open questions to answer during T9 implementation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
Garry Tan 2026-05-27 07:33:57 -07:00
parent 9e0e185fe2
commit 6dc113838a
No known key found for this signature in database
GPG Key ID: C1F69E85C74EFE1D
2 changed files with 364 additions and 0 deletions

View File

@ -0,0 +1,193 @@
# Spike: Claude Code hook mutation for plan-tune cathedral
**Status:** complete (2026-05-27)
**Surfaces:** D10 (does PreToolUse allow mutating AUQ input?), D19/Codex (matcher must cover MCP variants)
**Downstream consumers:** T3, T5, T6, T8
## Question this spike answers
Can a PreToolUse hook on `AskUserQuestion` actually substitute the user's
answer via `updatedInput`? If yes, what's the exact protocol?
## Answer
**Yes.** `updatedInput` is the supported mechanism. Source:
https://code.claude.com/docs/en/hooks (confirmed 2026-04 reference).
## Hook stdin schema (PreToolUse + PostToolUse)
```json
{
"session_id": "abc123",
"transcript_path": "/path/to/transcript.jsonl",
"cwd": "/current/working/dir",
"permission_mode": "default",
"effort": { "level": "medium" },
"hook_event_name": "PreToolUse",
"tool_name": "AskUserQuestion",
"tool_input": { /* tool-specific */ },
"tool_use_id": "unique-id-12345"
}
```
Optional in subagent context: `agent_id`, `agent_type`.
## PreToolUse hook stdout schema for `allow + updatedInput`
```json
{
"hookSpecificOutput": {
"hookEventName": "PreToolUse",
"permissionDecision": "allow",
"permissionDecisionReason": "auto-decided by plan-tune preference",
"updatedInput": { /* shallow-merged into original tool_input */ },
"additionalContext": "optional context for Claude"
}
}
```
**permissionDecision values:**
- `"allow"` — proceed, optionally with `updatedInput`
- `"deny"` — block (feedback to Claude, NOT a synthetic answer per Codex
correction in D-prefixed decisions)
- `"ask"` — escalate to user
- `"defer"` — let permission flow continue
**`updatedInput` semantics:** shallow merge of fields present in the returned
object onto the original `tool_input`. Only valid with
`permissionDecision: "allow"`. This is what lets us substitute an
auto-decided answer for `never-ask` preferences.
## Matcher schema
The `matcher` field in `~/.claude/settings.json` supports JS-regex syntax
**when it contains regex metacharacters**. A matcher with only letters/
underscores is an exact match.
To cover both native + MCP `AskUserQuestion`:
```json
"matcher": "(AskUserQuestion|mcp__.*__AskUserQuestion)"
```
Conductor disables native `AskUserQuestion` via `--disallowedTools` and
routes through `mcp__conductor__AskUserQuestion` — the MCP suffix is
required for our hook to fire there.
## Multiple-hook concurrency caveat
> All matching hooks run in parallel, and identical handlers are
> deduplicated automatically.
**For our use case:**
- gstack registers exactly one PreToolUse hook and one PostToolUse hook on
AUQ-shaped tool names.
- If a user has THEIR own hook that also returns `updatedInput` on
AskUserQuestion, the merge order is undefined.
- Mitigation: document this constraint in `bin/gstack-settings-hook`
install prompt. User can detect the conflict from the diff preview before
accepting.
**`permissionDecision` precedence (when multiple hooks decide):**
`deny > ask > allow > defer` — most restrictive wins.
## Implementation hookSpecificOutput examples
**Auto-decide (PreToolUse, `never-ask` preference + non-one-way):**
```json
{
"hookSpecificOutput": {
"hookEventName": "PreToolUse",
"permissionDecision": "allow",
"permissionDecisionReason": "plan-tune: never-ask preference on ship-test-failure-triage",
"updatedInput": {
"questions": [{ /* same as input, but with auto-selected answer */ }]
}
}
}
```
**Pass-through (no preference, or one-way safety override):**
```json
{
"hookSpecificOutput": {
"hookEventName": "PreToolUse",
"permissionDecision": "defer"
}
}
```
**PostToolUse capture (always):**
```json
{
"hookSpecificOutput": {
"hookEventName": "PostToolUse"
}
}
```
(PostToolUse hooks can also set `additionalContext` to append to the tool
result; we don't need this for v1 capture.)
## Settings.json snippet for T8 hook installer
```json
{
"hooks": {
"PreToolUse": [
{
"matcher": "(AskUserQuestion|mcp__.*__AskUserQuestion)",
"hooks": [
{
"type": "command",
"command": "$CLAUDE_PROJECT_DIR/.claude/skills/gstack/hosts/claude/hooks/question-preference-hook",
"timeout": 5
}
]
}
],
"PostToolUse": [
{
"matcher": "(AskUserQuestion|mcp__.*__AskUserQuestion)",
"hooks": [
{
"type": "command",
"command": "$CLAUDE_PROJECT_DIR/.claude/skills/gstack/hosts/claude/hooks/question-log-hook",
"timeout": 5
}
]
}
]
}
}
```
Hook commands take `bun` invocation under the hood; absolute paths (or
`$CLAUDE_PROJECT_DIR` substitution) are required by Claude Code's hook
runner. The hooks themselves are TypeScript files that the bash wrapper
shells into bun.
## Open questions deferred to implementation
1. **Recommended-option parsing scope.** D2 says parse `(recommended)`
label first. The label is on the option's `label` field per
AskUserQuestion Format. Implementation will need to walk `tool_input.
questions[*].options[*]` looking for the label suffix. Worked
examples: ship/SKILL.md.tmpl emits options like `"A) Fix now"
(recommended)`.
2. **Auto-decided event tagging.** When hook returns `updatedInput`, the
PostToolUse hook will see the resolved input and log a normal event.
Need an extra field on the PostToolUse payload (e.g.,
`was_auto_decided: true`) that the hook can set via session state
tracking — write a marker file in `~/.gstack/sessions/<id>/.auto-decided-<tool_use_id>`
from PreToolUse, read it from PostToolUse, delete on read.
3. **Timeout behavior.** Default hook timeout is 60s but the docs are
thin on what happens at timeout. Set explicit `timeout: 5` so the
user never waits >5s on a hook misfire. Falls back to pass-through.
## References
- https://code.claude.com/docs/en/hooks (canonical, latest as of 2026-04)
- WebSearch results 2026-05-27
- Existing `bin/gstack-settings-hook` (SessionStart-only impl, to be
superseded by T3 schema-aware rewrite)

View File

@ -0,0 +1,171 @@
# Spike: Codex session storage format for plan-tune cathedral
**Status:** complete (2026-05-27)
**Surfaces:** D5 (Codex import parses structured files, not regex)
**Downstream consumers:** T9 (gstack-codex-session-import)
## Question this spike answers
What's the actual on-disk format of Codex sessions, and how do we recover
AskUserQuestion-shaped events from it for `gstack-codex-session-import`?
## Storage layout
```
~/.codex/
├── auth.json # Codex auth (do not touch)
├── config.toml # User config
├── goals_1.sqlite # ~24KB, internal goals DB (not relevant)
├── logs_2.sqlite # ~16MB, structured logs (target=*, see schema)
├── history.jsonl # ~9KB, command history
└── sessions/
└── 2026/05/27/
└── rollout-<iso8601>-<uuid>.jsonl # per-session transcript
```
Session files: one JSONL per `codex exec` or interactive session. Cwd path
embedded in the `session_meta` event. CLI version recorded.
## Session JSONL event types (measured on Garry's machine, 2026-05-27)
| type | count | meaning |
|----------------|------:|---------|
| `response_item`| 382 | model's response stream (~76%) |
| `event_msg` | 97 | high-level session events (~19%) |
| `turn_context` | 6 | per-turn context snapshot |
| `session_meta` | 6 | session header (one per session) |
### response_item subtypes
| subtype | count | meaning |
|--------------------------|------:|---------|
| `function_call` | 148 | model invoked a tool |
| `function_call_output` | 148 | tool result returned to model |
| `reasoning` | 44 | reasoning summary |
| `message` | 40 | text message (input_text or output_text) |
| `web_search_call` | 2 | web search tool call |
### event_msg subtypes
| subtype | count | meaning |
|-------------------|------:|---------|
| `token_count` | 55 | per-step token accounting |
| `agent_message` | 22 | agent's prose output |
| `user_message` | 6 | user's prose input |
| `task_started` | 6 | task start (one per top-level task) |
| `task_complete` | 6 | task complete |
| `web_search_end` | 2 | web search completion |
## Critical finding: Codex has no `AskUserQuestion` tool
Codex doesn't surface AskUserQuestion as a tool call in `response_item`
stream. Gstack skills running on Codex emit AskUserQuestion-shaped
Decision Briefs as plain prose inside `agent_message` events (the
`AskUserQuestion Format` from preamble). The user's answer comes back in
the next `user_message`.
This means importing AUQ events from Codex sessions is structurally
different from importing them from Claude Code (where they ARE
tool calls):
- **Claude Code:** hook captures structured `tool_input`/`tool_output`
for `AskUserQuestion`. Question + options + answer all separated.
- **Codex:** parser must extract from `agent_message.text` body, detect
the D-numbered Decision Brief pattern, then match against the
subsequent `user_message` for the answer.
## Recovery strategy for `gstack-codex-session-import`
**Two-tier extraction:**
1. **Marker-first (D18 mechanism).** Search `agent_message` text for the
`<gstack-qid:foo-bar>` marker. If present, we have an exact question_id
and can reliably recover. (Will work once T14 adds markers to the top
10 registry questions and Codex starts emitting them via the
host-aware preamble path.)
2. **Pattern fallback.** When no marker, parse for:
- `D<N> — <title>` line (D-number from AskUserQuestion Format)
- `Recommendation: ...` line
- Option block `A) ...`, `B) ...`, etc.
- Next `user_message` event for the chosen option label
Use this only to populate hash-based question_id (the same
`hook-<sha1(skill+text+sorted_options)[:10]>` shape Layer 1 uses on
Claude). Tagged `source: "codex-pattern-fallback"`, never used as
preference key (per D18 hash drift guidance).
## Schema we'll write to question-log.jsonl from Codex import
Per existing `bin/gstack-question-log` schema, augmented with:
- `source: "codex-import-marker"` (when qid marker found)
- `source: "codex-import-pattern"` (when fallback regex used)
- `codex_session_id` (UUID from session_meta)
- `codex_cwd` (working dir from session_meta — disambiguates project)
- `codex_ts` (timestamp from event)
## Sqlite logs_2.sqlite schema
```sql
CREATE TABLE logs (
id INTEGER PRIMARY KEY AUTOINCREMENT,
ts INTEGER NOT NULL,
ts_nanos INTEGER NOT NULL,
level TEXT NOT NULL,
target TEXT NOT NULL,
feedback_log_body TEXT,
module_path TEXT,
file TEXT,
line INTEGER,
thread_id TEXT,
process_uuid TEXT,
estimated_bytes INTEGER NOT NULL DEFAULT 0
);
```
`logs_2.sqlite` is internal telemetry, not session content. **Don't use
for AUQ extraction.** Sessions JSONL is authoritative.
## Project-slug derivation
From `session_meta.payload.cwd` — derive via the existing
`bin/gstack-slug` logic on the cwd path. Conductor worktrees have their
own slug naming convention encoded in cwd; the bin already handles this.
## Versioning safety
`session_meta.payload.cli_version` records the Codex CLI version (e.g.
`0.130.0`). When the importer encounters an unknown version, log a
warning to stderr but continue — schema additions are typically
backwards-compatible in JSONL.
If `type` or `payload.type` values change in a future version, we'll see
them as `unknown` in the importer's audit log. Add a guarded
`KNOWN_VERSIONS = ["0.130.x", "0.131.x", ...]` constant in the importer
and bump explicitly when re-testing.
## Open questions for implementation
1. **Where does Codex store the "user's answer" exactly?** Need to test
with a real `codex exec` run that triggers a Decision Brief and inspect
the next event. Likely `event_msg` of subtype `user_message` or a
`response_item` of subtype `message` with `role: "user"`. Confirm
during T9 implementation.
2. **Free-text extraction for "Other".** The Decision Brief prose
doesn't structurally separate "Other" responses from named options.
Pattern fallback will need to detect "Other: <text>" wording in the
answer. T10 (dream cycle distill) only fires on this when source is
`codex-import-marker` so we can trust the data.
3. **Conductor cwd handling.** Conductor worktrees share project state
but have distinct cwds. The import should bucket events by the
project slug, not the cwd directly, so events from sibling worktrees
accumulate into the same project view.
## References
- Live inspection of `~/.codex/sessions/2026/05/*/`
- `sqlite3 ~/.codex/logs_2.sqlite ".schema"` (2026-05-27)
- Codex CLI 0.130.0 (current at spike time)
- See also: D5 cross-model tension decision in plan file.