docs(spec): stakeholder interview subagents design

Approved design for a four-subagent post-simulation interview system (Longitudinal, Diversity, Delphi, Scenario) over MiroFish-simulated German fisheries stakeholders, with cross-method synthesiser. Includes architecture, instrument design, data flow, API surface, error handling, validation, testing, and methodological caveats.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-23 10:53:28 +02:00 · 2026-05-23 10:53:28 +02:00 · bf058080ac
parent fa0f6519b1
commit bf058080ac
1 changed files with 280 additions and 0 deletions
--- a/docs/superpowers/specs/2026-05-23-stakeholder-interview-subagents-design.md
+++ b/docs/superpowers/specs/2026-05-23-stakeholder-interview-subagents-design.md
@@ -0,0 +1,280 @@
+# Stakeholder Interview Subagents — Design Spec
+
+- **Date:** 2026-05-23
+- **Project:** MiroFish (multi-agent simulation engine for German fisheries discourse)
+- **Author:** Christian Möllmann (with Claude Code)
+- **Status:** Approved design — pending implementation plan
+
+## 1. Purpose
+
+After the OASIS Twitter + Reddit simulation produces a population of in-character stakeholder agents (fishers, NGOs, policy actors, scientists, consumers, etc.) grounded in a German fisheries discourse knowledge graph, we want to interrogate each agent individually with a structured questionnaire about the future of German fisheries.
+
+Four methodologies run as independent subagents over the same agent population:
+
+1. **Longitudinal** — pre/post Likert to measure opinion drift induced by simulated peer interaction
+2. **Diversity** — Q-sort + multi-dim Likert to map the value space and derive a stakeholder typology
+3. **Delphi** — three-round consensus probing to identify where stakeholder views converge vs. stay polarised
+4. **Scenario** — rating of 4 pre-defined 2040 scenarios on desirability, plausibility, group-impact, fairness
+
+A synthesiser combines the four outputs into a single cross-method report.
+
+## 2. Non-goals (v1)
+
+- Real-time WebSocket streaming of interview progress (polling suffices)
+- Adaptive instruments / IRT calibration
+- Web UI for editing instruments (YAML + restart is fine)
+- Cross-simulation comparison endpoints (CSV exports support this externally)
+- Multi-language support beyond DE / EN
+
+## 3. Architectural approach
+
+**Chosen approach: Deterministic instrument runners.** Each subagent is a fixed protocol, not a ReAct loop. Rationale: fisheries futures methodology favours instrument fidelity (every stakeholder sees the same scale) over agent autonomy; results must be directly tabularisable for downstream analysis in pandas/R.
+
+Rejected:
+- *ReAct-style subagents* — non-deterministic, ~3–10× cost, can't guarantee every agent answered every item
+- *Single InterviewService with mode enum* — couples four distinct methodologies (especially multi-round Delphi and two-phase Longitudinal) into one growing class
+
+## 4. System architecture
+
+```
+                    InterviewOrchestrator
+                          │
+   ┌──────────────┬───────┴───────┬──────────────┐
+   ▼              ▼               ▼              ▼
+Longitudinal   Diversity        Delphi       Scenario
+Subagent       Subagent         Subagent     Subagent
+   │              │               │              │
+   └──────────────┴──────┬────────┴──────────────┘
+                         ▼
+              StakeholderInterviewer (base)
+                         │
+       ┌─────────────────┼─────────────────┐
+       ▼                 ▼                 ▼
+   LLMClient        ZepEntityReader   ProfileLoader
+   (in-character)   (memory digest)   (reddit/twitter)
+                         │
+                         ▼
+       uploads/.../interviews/    +    Zep episodes
+```
+
+### 4.1 New files
+
+| Path | Purpose |
+|---|---|
+| `backend/app/services/interviews/base.py` | `StakeholderInterviewer` — persona+memory loading, in-character prompting, retry/validation |
+| `backend/app/services/interviews/longitudinal.py` | Pre/post Likert |
+| `backend/app/services/interviews/diversity.py` | Q-sort + multi-dim value-space mapping |
+| `backend/app/services/interviews/delphi.py` | Three-round consensus |
+| `backend/app/services/interviews/scenario.py` | Scenario rating |
+| `backend/app/services/interview_orchestrator.py` | Fan-out, parallel execution, two-phase lifecycle |
+| `backend/app/services/interview_synthesizer.py` | Cross-method narrative report |
+| `backend/app/api/interview.py` | New Flask blueprint `/api/interview/*` |
+| `backend/app/models/interview.py` | Pydantic schemas for instruments + responses |
+| `backend/scripts/instruments/*.yaml` | Editable instrument definitions (one YAML per subagent) |
+| `frontend/src/components/Step4bInterviews.vue` | Four tabs + synthesis tab |
+| `backend/tests/interviews/` | Unit tests per subagent + base + orchestrator + synthesiser |
+| `tests/integration/test_interview_pipeline.py` | End-to-end with stub LLM + disposable Zep graph |
+
+### 4.2 Lifecycle integration
+
+Two hooks added to `backend/app/services/simulation_manager.py`:
+
+- `on_ready()` — automatically triggers Longitudinal T0 (pre-simulation baseline)
+- `on_completed()` — queues a background task (tracked by its `task_id`) that runs Longitudinal T1, Diversity, Delphi, and Scenario in parallel, then the Synthesiser
+
+The two-phase split is **non-negotiable**: Longitudinal needs T0 captured before OASIS exposes agents to peer-generated content, otherwise drift is unmeasurable.
+
+## 5. Instrument design
+
+All instruments live in `backend/scripts/instruments/*.yaml` so content is editable without a code change or redeploy; per §2, edits take effect after a restart. Items default to German and are translatable via the existing locale system.
+
+### 5.1 Longitudinal — opinion drift
+
+- 12–15 item 5-point Likert ("lehne stark ab" → "stimme stark zu")
+- Administered at T0 (post-persona, pre-OASIS) and T1 (post-OASIS)
+- Item families (3–4 each): stock status & recovery; governance & CFP; market & MSC; climate & adaptation
+- Per-agent output: response value + LLM self-reported confidence per item + one open comment
+- Aggregate: Δ-matrix (N agents × M items), per-item Wilcoxon signed-rank test, per-agent total drift magnitude
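A minimal sketch of the aggregate step above, assuming responses arrive as `agent_id -> item_id -> value` dictionaries (a hypothetical in-memory shape, not the spec's JSONL schema). It builds the Δ-matrix, the per-agent drift magnitude, and the Wilcoxon signed-rank sums for one item. It stops at the rank sums; a p-value would come from a statistics package such as SciPy's `scipy.stats.wilcoxon`.

```python
from typing import Dict, List, Tuple

Responses = Dict[str, Dict[str, int]]  # agent_id -> item_id -> Likert value (1-5)


def delta_matrix(t0: Responses, t1: Responses, items: List[str]) -> Dict[str, List[int]]:
    """Return agent_id -> [T1 - T0 per item] for agents present in both phases."""
    return {
        agent: [t1[agent][item] - t0[agent][item] for item in items]
        for agent in t0
        if agent in t1
    }


def drift_magnitude(deltas: List[int]) -> int:
    """Per-agent total drift: sum of absolute item-level changes."""
    return sum(abs(d) for d in deltas)


def signed_rank_statistic(diffs: List[int]) -> Tuple[float, float]:
    """Wilcoxon signed-rank sums (W+, W-) for one item.

    Zero differences are dropped; tied |d| values share their average rank.
    """
    nonzero = sorted((d for d in diffs if d != 0), key=abs)
    ranks: List[float] = [0.0] * len(nonzero)
    i = 0
    while i < len(nonzero):
        j = i
        while j < len(nonzero) and abs(nonzero[j]) == abs(nonzero[i]):
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of the 1-based ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = avg_rank
        i = j
    w_plus = sum(r for d, r in zip(nonzero, ranks) if d > 0)
    w_minus = sum(r for d, r in zip(nonzero, ranks) if d < 0)
    return w_plus, w_minus


# Illustrative data only: three agents, two items.
t0 = {"a1": {"q1": 2, "q2": 3}, "a2": {"q1": 4, "q2": 3}, "a3": {"q1": 3, "q2": 3}}
t1 = {"a1": {"q1": 4, "q2": 3}, "a2": {"q1": 3, "q2": 4}, "a3": {"q1": 4, "q2": 3}}
deltas = delta_matrix(t0, t1, ["q1", "q2"])
q1_diffs = [row[0] for row in deltas.values()]
w_plus, w_minus = signed_rank_statistic(q1_diffs)
```

Agents missing from either phase are excluded from the matrix, which keeps the paired test honest about `n_responded`.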
+
+### 5.2 Diversity — typology mapping
+
+- One-shot, post-simulation only
+- **Part A (Q-sort lite):** 24 statements sorted onto a forced quasi-normal distribution from −3 to +3
+- **Part B:** 6 multi-dim Likert axes (preservation↔extraction, local↔EU, science-led↔tradition-led, individual↔collective, short-term↔long-term, market↔regulation)
+- Per-agent output: vector ∈ ℝ^30
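+- Aggregate: PCA + k-means → 3–5 stakeholder clusters with archetype descriptions + cluster-membership probabilities (k-means itself yields hard assignments, so the probabilities need a soft-assignment step, e.g. a Gaussian mixture fitted on the PCA scores)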
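The forced distribution in Part A can be checked mechanically before a response is accepted. The sketch below assumes a column shape of 2-3-4-6-4-3-2 for the 24 statements; the spec does not fix the shape, so `FORCED_SHAPE` and the function names are illustrative only.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical forced shape for 24 statements over -3..+3 (not taken from the spec).
FORCED_SHAPE: Dict[int, int] = {-3: 2, -2: 3, -1: 4, 0: 6, 1: 4, 2: 3, 3: 2}


def qsort_violations(placements: Dict[str, int]) -> List[str]:
    """Return human-readable problems; an empty list means the sort is valid."""
    problems = []
    counts = Counter(placements.values())
    for column in sorted(set(counts) - set(FORCED_SHAPE)):
        problems.append(f"column {column:+d} is outside -3..+3")
    for column, expected in FORCED_SHAPE.items():
        actual = counts.get(column, 0)
        if actual != expected:
            problems.append(f"column {column:+d}: expected {expected}, got {actual}")
    return problems


def to_vector(placements: Dict[str, int], likert_axes: List[int],
              statement_ids: List[str]) -> List[int]:
    """Concatenate the 24 Q-sort placements and the 6 Likert axes into one 30-dim vector."""
    return [placements[s] for s in statement_ids] + list(likert_axes)


# Illustrative data: a sort that exactly fills the forced shape.
statement_ids = [f"s{n:02d}" for n in range(1, 25)]
columns = [col for col, count in FORCED_SHAPE.items() for _ in range(count)]
valid_sort = dict(zip(statement_ids, columns))
vector = to_vector(valid_sort, [4, 2, 5, 3, 3, 1], statement_ids)
bad_sort = dict(valid_sort, s01=0)  # moves one statement from -3 to 0
```

Returning a list of violations rather than a boolean gives the corrective retry something specific to quote back to the agent.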
+
+### 5.3 Delphi — consensus probing
+
+- Three rounds, fully automated
+- **R1 (open):** 4 open questions; LLM extracts thematic codes from responses
+- **R2 (rate):** Agent sees anonymised list of all unique themes; rates each on importance (1–5) + plausibility (1–5)
+- **R3 (revise):** Agent sees group median + IQR per theme; can revise own ratings; free-text justification
+- Aggregate: per-theme convergence (Δ-IQR R2→R3), persistent disagreements (IQR > 2), ranked consensus statements
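The convergence aggregate can be sketched with the standard library alone. The example assumes ratings are grouped as `theme -> list of agent ratings` with matching themes in both rounds, and it uses the inclusive quartile method; the spec names neither, so treat both as assumptions.

```python
from statistics import median, quantiles
from typing import Dict, List


def iqr(values: List[int]) -> float:
    """Interquartile range using the inclusive quartile method."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1


def convergence(r2: Dict[str, List[int]], r3: Dict[str, List[int]]) -> Dict[str, dict]:
    """Per theme: R3 median, IQR change R2 -> R3, and a persistent-disagreement flag."""
    report = {}
    for theme, r2_ratings in r2.items():
        r3_ratings = r3[theme]
        iqr_r2, iqr_r3 = iqr(r2_ratings), iqr(r3_ratings)
        report[theme] = {
            "median_r3": median(r3_ratings),
            "delta_iqr": iqr_r3 - iqr_r2,  # negative means views converged
            "persistent_disagreement": iqr_r3 > 2,
        }
    return report


# Illustrative importance ratings from five agents on two made-up themes.
round2 = {"quota_reform": [1, 2, 3, 4, 5], "gear_bans": [1, 1, 3, 5, 5]}
round3 = {"quota_reform": [2, 3, 3, 3, 4], "gear_bans": [1, 1, 3, 5, 5]}
delphi_report = convergence(round2, round3)
```

Here `quota_reform` converges (its IQR shrinks from 2 to 0), while `gear_bans` stays polarised with an IQR of 4 and is flagged.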
+
+### 5.4 Scenario — futures evaluation
+
+Four 2040 scenarios (YAML-editable):
+
+- **S1 "Erholung"** — cod and herring recover, MSC ubiquitous, small-scale fleet stabilises
+- **S2 "Kollaps"** — both stocks collapse, fleet halved, aquaculture dominant
+- **S3 "Festung Europa"** — protectionist EU policy, MPAs cover 30%, recreational fishing curtailed
+- **S4 "Privatisierung"** — ITQs, consolidation, large operators only
+
+Each agent rates each scenario on 4 dimensions (1–7 Likert): desirability, plausibility, impact-on-my-group, fairness. Plus one open question per scenario: "If you woke up in this 2040, what would you do?"
+
+Aggregate: 4 × 4 per-agent matrix + open-text corpus → polarity charts (desirability × plausibility by stakeholder type), narrative themes.
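One way to derive the polarity quadrants is to average desirability and plausibility per stakeholder type and scenario, then split at the scale midpoint. The row shape, the quadrant labels, and the choice to count a mean of exactly 4 as the lower half are assumptions of this sketch.

```python
from statistics import mean
from typing import Dict, List, Tuple

MIDPOINT = 4  # centre of the 1-7 scale


def quadrant(desirability: float, plausibility: float) -> str:
    """Label a point in the desirability x plausibility plane."""
    wanted = "desired" if desirability > MIDPOINT else "undesired"
    likely = "plausible" if plausibility > MIDPOINT else "implausible"
    return f"{wanted}/{likely}"


def polarity_by_type(rows: List[dict]) -> Dict[Tuple[str, str], str]:
    """Average ratings per (stakeholder_type, scenario) and assign a quadrant."""
    grouped: Dict[Tuple[str, str], List[Tuple[int, int]]] = {}
    for row in rows:
        key = (row["stakeholder_type"], row["scenario"])
        grouped.setdefault(key, []).append((row["desirability"], row["plausibility"]))
    return {
        key: quadrant(mean([d for d, _ in pairs]), mean([p for _, p in pairs]))
        for key, pairs in grouped.items()
    }


# Illustrative ratings only.
ratings = [
    {"stakeholder_type": "fisher", "scenario": "S1", "desirability": 6, "plausibility": 3},
    {"stakeholder_type": "fisher", "scenario": "S1", "desirability": 7, "plausibility": 2},
    {"stakeholder_type": "ngo", "scenario": "S4", "desirability": 2, "plausibility": 6},
]
polarity = polarity_by_type(ratings)
```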
+
+### 5.5 Cross-cutting
+
+**In-character prompting.** Every LLM call uses a system prompt of the form:
+
+> You are [persona_text]. You are answering a survey about the future of German fisheries. Answer strictly in character based on your background, values, and what you experienced during the simulated social media discourse summarised below: [Zep memory digest]. Return JSON only.
+
+Memory digest comes from `ZepEntityReader.get_entity_with_context()`.
+
+**Structured output enforced.** Every response goes through `LLMClient.chat_json()` with a per-instrument JSON schema. One auto-retry on schema violation; agent flagged in audit log on second failure.
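The prompt template and the one-retry rule can be illustrated as follows. The signature of `LLMClient.chat_json()` is not given in this spec, so the sketch takes a plain callable `llm(system, user) -> str` as a stand-in and validates a 5-point Likert reply by hand; all function names here are hypothetical.

```python
import json
from typing import Callable, Dict, List, Optional

SYSTEM_TEMPLATE = (
    "You are {persona}. You are answering a survey about the future of German "
    "fisheries. Answer strictly in character based on your background, values, "
    "and what you experienced during the simulated social media discourse "
    "summarised below: {digest}. Return JSON only."
)


def parse_likert(raw: str, item_ids: List[str]) -> Optional[Dict[str, int]]:
    """Return item_id -> value if raw is JSON covering every item with 1-5, else None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    # type(...) is int rejects JSON booleans, which Python would treat as ints.
    if all(type(data.get(i)) is int and 1 <= data[i] <= 5 for i in item_ids):
        return {i: data[i] for i in item_ids}
    return None


def ask_with_retry(llm: Callable[[str, str], str], persona: str, digest: str,
                   question: str, item_ids: List[str]) -> dict:
    """One initial attempt plus one corrective retry; report failure after the second."""
    system = SYSTEM_TEMPLATE.format(persona=persona, digest=digest)
    prompts = [question, question + "\nYour last reply was invalid. Return JSON only."]
    for attempt, prompt in enumerate(prompts, start=1):
        answers = parse_likert(llm(system, prompt), item_ids)
        if answers is not None:
            return {"status": "ok", "attempts": attempt, "answers": answers}
    return {"status": "failed", "attempts": len(prompts), "answers": None}


# Stand-in LLM: the first reply is malformed, the second is valid.
replies = iter(["not json", '{"q1": 4, "q2": 2}'])
recovered = ask_with_retry(lambda s, u: next(replies), "a Baltic cod fisher",
                           "(memory digest)", "Rate q1 and q2.", ["q1", "q2"])
gave_up = ask_with_retry(lambda s, u: "{}", "a Baltic cod fisher",
                         "(memory digest)", "Rate q1 and q2.", ["q1", "q2"])
```

The `failed` result is what the caller would record in `audit.jsonl` for that agent.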
+
+**Cost guardrails.** Longitudinal (2 phases) and Delphi (3 rounds) are the heaviest workloads. For N=50 agents and ~100 LLM calls per agent across all 4 subagents, budget ~5k calls (50 × 100) / 5–10M tokens per simulation, i.e. roughly 1–2k tokens per call. Persona system prompts stay constant within a subagent run → cacheable.
+
+## 6. Data flow and storage
+
+### 6.1 Storage layout
+
+```
+uploads/simulations/{sim_id}/interviews/
+├── instruments_used.json          # frozen snapshot of YAML at run-time
+├── T0/
+│   └── longitudinal/
+│       ├── responses.jsonl
+│       ├── audit.jsonl            # raw LLM I/O, retries, validation failures
+│       └── aggregate.json
+├── T1/
+│   ├── longitudinal/{same structure}
+│   ├── diversity/
+│   │   ├── responses.jsonl
+│   │   ├── typology.json
+│   │   └── pca.json
+│   ├── delphi/
+│   │   ├── round1_themes.jsonl
+│   │   ├── round2_ratings.jsonl
+│   │   ├── round3_revisions.jsonl
+│   │   └── convergence.json
+│   └── scenario/
+│       ├── responses.jsonl
+│       └── polarity_matrix.json
+└── synthesis/
+    ├── report.md
+    └── exports/
+        ├── all_responses.csv      # tidy long format
+        └── codebook.json
+```
+
+JSONL for raw responses (append-safe, streams cleanly); JSON for aggregates; CSV for analysis hand-off. The `instruments_used.json` snapshot is critical for reproducibility when the YAML is later edited.
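A sketch of the write path under this layout, using only the standard library. The function names are hypothetical, the record fields are illustrative, and the snapshot assumes the YAML has already been parsed into a dictionary; the hash it returns is the kind of value the Limitations section in §10 could print.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def interview_dir(root: Path, sim_id: str, phase: str, subagent: str) -> Path:
    """Create and return {root}/simulations/{sim_id}/interviews/{phase}/{subagent}/."""
    path = root / "simulations" / sim_id / "interviews" / phase / subagent
    path.mkdir(parents=True, exist_ok=True)
    return path


def append_response(directory: Path, record: dict) -> None:
    """Append one response as a single JSON line; earlier lines are never rewritten."""
    with (directory / "responses.jsonl").open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record, ensure_ascii=False) + "\n")


def snapshot_instruments(interviews_root: Path, instruments: dict) -> str:
    """Freeze parsed instruments to instruments_used.json and return their SHA-256 hash."""
    canonical = json.dumps(instruments, sort_keys=True, ensure_ascii=False)
    (interviews_root / "instruments_used.json").write_text(canonical, encoding="utf-8")
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Exercise the helpers in a throwaway directory standing in for uploads/.
with tempfile.TemporaryDirectory() as tmp:
    t0_dir = interview_dir(Path(tmp), "sim_1", "T0", "longitudinal")
    append_response(t0_dir, {"agent_id": 1, "item_id": "q1", "value": 4})
    append_response(t0_dir, {"agent_id": 2, "item_id": "q1", "value": 2})
    lines = (t0_dir / "responses.jsonl").read_text(encoding="utf-8").splitlines()
    interviews_root = t0_dir.parent.parent
    hash_a = snapshot_instruments(interviews_root, {"b": 1, "a": 2})
    hash_b = snapshot_instruments(interviews_root, {"a": 2, "b": 1})
```

Serialising with `sort_keys=True` makes the hash independent of key order, so two semantically identical instrument sets produce the same fingerprint.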
+
+### 6.2 Zep integration
+
+Two write patterns, both reusing `ZepGraphMemoryUpdater.add_activity()`:
+
+- **Per-agent episode** — after each subagent finishes for an agent, write one episode: `"Agent {name} (interview/{subagent}/{phase}): {short summary of stance}"`. The existing ReportAgent can retrieve interview content via its current `panorama_search` / `insight_forge` tools without changes.
+- **Aggregate episodes** — after each subagent's aggregate step, write one summary episode per cluster / theme / scenario.
+
+No new Zep schemas. No new entity types. Interviews are just more episodes — append-only, safe.
+
+### 6.3 API surface
+
+New blueprint `/api/interview`:
+
+| Method | Path | Purpose |
+|---|---|---|
+| `POST` | `/api/interview/{sim_id}/pre` | Trigger T0 longitudinal (auto on READY, manual for re-runs) |
+| `POST` | `/api/interview/{sim_id}/post` | Trigger all 4 post-sim subagents; returns `task_id` |
+| `GET`  | `/api/interview/{sim_id}/status?task_id=...` | Per-subagent progress |
+| `GET`  | `/api/interview/{sim_id}/results/{subagent}` | Aggregate JSON for one subagent |
+| `GET`  | `/api/interview/{sim_id}/results/synthesis` | Full synthesis report |
+| `GET`  | `/api/interview/{sim_id}/export.csv` | Tidy long-format CSV across all 4 subagents |
+| `POST` | `/api/interview/{sim_id}/rerun` | Re-run one subagent (e.g. after editing YAML) |
+
+All responses follow the existing `{success, data, error}` envelope. Polling reuses `models/task.py`.
+
+### 6.4 Parallelism
+
+- Within a subagent: `ThreadPoolExecutor(max_workers=8)` for per-agent LLM calls
+- Across the 4 post-sim subagents: all four run in parallel with each other; Delphi's three rounds run sequentially inside its own subagent
+- Synthesiser waits for all four
+- Token budget guard: `Config.INTERVIEW_MAX_TOKENS_PER_RUN`; if the projected cost exceeds this limit, the API returns 400 with a dry-run estimate, and resubmitting with `confirm=true` overrides the guard
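The pre-flight budget check reduces to one multiplication and one comparison. In this sketch the per-call token figure is an assumed average (1,500, inside the 1–2k range implied by §5.5), and `check_budget` and `BudgetDecision` are illustrative names rather than part of the design.

```python
from dataclasses import dataclass


@dataclass
class BudgetDecision:
    allowed: bool
    projected_tokens: int


def check_budget(n_agents: int, calls_per_agent: int, tokens_per_call: int,
                 max_tokens: int, confirm: bool = False) -> BudgetDecision:
    """Project token use before any LLM call and decide whether the run may start."""
    projected = n_agents * calls_per_agent * tokens_per_call
    # A refusal corresponds to the HTTP 400 + dry-run estimate described above.
    allowed = projected <= max_tokens or confirm
    return BudgetDecision(allowed=allowed, projected_tokens=projected)


# 50 agents x 100 calls x 1,500 tokens against a hypothetical 5M-token limit.
refused = check_budget(50, 100, 1500, max_tokens=5_000_000)
overridden = check_budget(50, 100, 1500, max_tokens=5_000_000, confirm=True)
```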
+
+### 6.5 Frontend
+
+New `Step4bInterviews.vue` between the current Step4 (report) and Step5 (interaction). Four tabs (one per subagent) + a synthesis tab. Each tab shows a progress bar during the run, then results: Likert heatmap (longitudinal Δ), PCA scatter (diversity), convergence chart (Delphi), polarity quadrants (scenario). A download button per tab pulls the CSV export.
+
+## 7. Error handling
+
+**Per-agent failures are isolated.** If agent 17 times out or fails JSON validation twice, agent 17 is marked `failed` in `audit.jsonl`; the rest of the run continues. Aggregates report `n_responded` / `n_total` honestly.
+
+| Failure | Handling |
+|---|---|
+| LLM timeout / 5xx | Exponential-backoff retry (3 attempts) via existing `LLMClient`; then mark agent failed |
+| JSON schema violation | One auto-retry with explicit corrective instruction; then mark failed |
+| Likert out-of-range / missing items | Re-ask only the bad items; if still bad, item-level missing |
+| Zep memory fetch fails | Run without memory digest; flag in audit (`memory_available: false`); down-weight in drift analysis |
+| Whole-subagent crash | Other 3 continue; synthesiser runs on what completed and flags the gap |
+| Token budget exceeded | Pause, write partial results, return 503 with `resume_token` |
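Per-agent isolation combined with the `ThreadPoolExecutor(max_workers=8)` fan-out from §6.4 might look like the sketch below. The `interview` callable stands in for whatever a subagent does for one agent, and the audit record fields are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, List


def run_subagent(agent_ids: List[int], interview: Callable[[int], dict],
                 max_workers: int = 8) -> Dict[str, object]:
    """Interview every agent in parallel; one agent's exception never aborts the run."""
    responses, audit = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(interview, agent_id): agent_id for agent_id in agent_ids}
        for future in as_completed(futures):
            agent_id = futures[future]
            try:
                responses.append({"agent_id": agent_id, **future.result()})
            except Exception as exc:  # isolate the failure and keep going
                audit.append({"agent_id": agent_id, "status": "failed", "error": repr(exc)})
    return {
        "responses": sorted(responses, key=lambda r: r["agent_id"]),
        "audit": audit,
        "n_responded": len(responses),
        "n_total": len(agent_ids),
    }


def fake_interview(agent_id: int) -> dict:
    """Stand-in interview: agent 17 always times out."""
    if agent_id == 17:
        raise TimeoutError("LLM timed out")
    return {"q1": 3}


run_result = run_subagent(list(range(15, 20)), fake_interview)
```

Sorting the responses afterwards removes the nondeterministic completion order that `as_completed` introduces.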
+
+**Idempotency.** Every subagent run is keyed by `(sim_id, subagent, phase, run_id)`. Re-runs write a new `run_id` directory; never overwrite. A `latest.json` pointer tracks the canonical run.
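A sketch of the never-overwrite rule with a `latest.json` pointer. Where the run directories sit is not specified in §6.1, so the `runs/` subdirectory and the UUID-based `run_id` are assumptions of this example.

```python
import json
import tempfile
import uuid
from pathlib import Path


def start_run(subagent_dir: Path) -> Path:
    """Create a fresh run directory; an existing run is never reused or overwritten."""
    run_dir = subagent_dir / "runs" / uuid.uuid4().hex
    run_dir.mkdir(parents=True, exist_ok=False)
    return run_dir


def mark_canonical(subagent_dir: Path, run_dir: Path) -> None:
    """Point latest.json at run_dir by writing a temp file and renaming it into place."""
    pointer = subagent_dir / "latest.json"
    tmp_pointer = subagent_dir / "latest.json.tmp"
    tmp_pointer.write_text(json.dumps({"run_id": run_dir.name}), encoding="utf-8")
    tmp_pointer.replace(pointer)  # rename within one directory replaces the old pointer


# Two runs of the same subagent: both directories survive, the pointer names the second.
with tempfile.TemporaryDirectory() as tmp:
    delphi_dir = Path(tmp) / "T1" / "delphi"
    first_run = start_run(delphi_dir)
    second_run = start_run(delphi_dir)
    mark_canonical(delphi_dir, second_run)
    latest = json.loads((delphi_dir / "latest.json").read_text(encoding="utf-8"))
    first_still_exists = first_run.is_dir()
```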
+
+## 8. Validation
+
+Three layers:
+
+1. **Schema validation** — pydantic models for every response; JSONL files validated on write
+2. **Instrument validation** — `validate_instrument(yaml)` pre-flight: required fields, scale coherence, no duplicate item_ids, DE+EN both present if i18n enabled
+3. **Plausibility checks** on aggregates (flag, don't kill):
+   - Longitudinal: >80% of agents show zero drift on every item, OR >80% flip their stance — likely a prompting bug or acquiescence bias
+   - Diversity: first two PCA components explain <30% of variance — instrument not discriminating
+   - Delphi: R3 ratings identical to R2 for >90% of agents — no engagement with anonymised feedback
+   - Scenario: all agents rate all scenarios identically on `desirability` — instrument failure
+
+Flags surface in the synthesis report under "instrument health" so the user can decide whether data is publishable.
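Two of the plausibility checks above can be sketched as pure functions that return flags instead of raising. The input shapes (`agent_id -> list of ratings`) are assumptions, and the Scenario check reads "identically" as "one single value across all agents and scenarios".

```python
from typing import Dict, List


def delphi_no_engagement(r2: Dict[str, List[int]], r3: Dict[str, List[int]],
                         threshold: float = 0.9) -> bool:
    """Flag when more than `threshold` of agents kept every R2 rating unchanged in R3."""
    if not r2:
        return False
    unchanged = sum(1 for agent in r2 if r3.get(agent) == r2[agent])
    return unchanged / len(r2) > threshold


def scenario_flatline(desirability: Dict[str, List[int]]) -> bool:
    """Flag when every agent gives every scenario the same desirability rating."""
    distinct = {value for ratings in desirability.values() for value in ratings}
    return len(distinct) <= 1


def instrument_health(r2: Dict[str, List[int]], r3: Dict[str, List[int]],
                      desirability: Dict[str, List[int]]) -> List[str]:
    """Collect flags for the synthesis report; checks flag but never abort the run."""
    flags = []
    if delphi_no_engagement(r2, r3):
        flags.append("delphi: R3 identical to R2 for >90% of agents")
    if scenario_flatline(desirability):
        flags.append("scenario: identical desirability ratings")
    return flags


# Illustrative data: nobody revised in R3 and all desirability ratings are 4.
stale_r2 = {"a1": [3, 4], "a2": [2, 5]}
flat = {"a1": [4, 4, 4, 4], "a2": [4, 4, 4, 4]}
unhealthy = instrument_health(stale_r2, dict(stale_r2), flat)
healthy = instrument_health(stale_r2, {"a1": [3, 3], "a2": [3, 4]},
                            {"a1": [6, 1, 3, 2], "a2": [5, 2, 4, 1]})
```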
+
+## 9. Testing
+
+**Unit tests** (`backend/tests/interviews/`):
+
+- `test_instruments.py` — every YAML parses and validates
+- `test_base_interviewer.py` — persona+memory loading, in-character prompt construction, schema-retry logic (mock `LLMClient`)
+- One file per subagent — happy path + each failure mode in §7
+- `test_orchestrator.py` — fan-out, partial failures, two-phase ordering (T0 before T1)
+- `test_synthesizer.py` — missing-subagent handling, stable output shape
+
+**Integration test** (`tests/integration/test_interview_pipeline.py`):
+
+End-to-end with N=5 agents against a recorded LLM cassette. Verifies T0 at READY, T1 + 3 others at COMPLETED, CSV export well-formed, Zep episodes written.
+
+**Stub LLM mode** (`Config.LLM_STUB_MODE=true`) returns deterministic canned responses keyed by `(subagent, item_id, persona_hash)`. Full pipeline exercisable in CI for free.
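Deterministic canned responses can be derived from a hash of the key, which avoids maintaining a fixture file. This is one possible construction, not the spec's; `stub_likert` and `persona_hash` are illustrative names. It uses `hashlib` rather than the built-in `hash()`, because string hashing in Python is salted per process and would differ between CI runs.

```python
import hashlib
import json


def persona_hash(persona_text: str) -> str:
    """Short stable fingerprint of a persona."""
    return hashlib.sha256(persona_text.encode("utf-8")).hexdigest()[:12]


def stub_likert(subagent: str, item_id: str, persona_text: str, scale_max: int = 5) -> str:
    """Return a canned JSON answer that depends only on (subagent, item_id, persona_hash)."""
    key = f"{subagent}|{item_id}|{persona_hash(persona_text)}"
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    value = digest[0] % scale_max + 1  # maps the first byte onto 1..scale_max
    return json.dumps({"item_id": item_id, "value": value})


# The same key always yields the same reply.
first_reply = stub_likert("longitudinal", "q1", "a Baltic cod fisher")
second_reply = stub_likert("longitudinal", "q1", "a Baltic cod fisher")
stub_value = json.loads(first_reply)["value"]
```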
+
+**Zep**: disposable graph in integration tests (consistent with project conventions); unit tests stub.
+
+## 10. Methodological caveats (auto-emitted in synthesis)
+
+The synthesiser **always** emits a "Limitations" section, programmatically generated from run metadata:
+
+- **Simulated, not real stakeholders.** Responses reflect how the seed-document discourse + LLM jointly encode each stakeholder type, not what actual fishers / NGO staff would say. The instrument measures the *model of the stakeholder*, not the stakeholder.
+- **Memory digest is lossy.** Each agent's "experience" of OASIS is summarised to bounded length; agents do not have full episodic recall.
+- **LLM acquiescence and centrality bias.** Likert with LLM respondents skews toward 3–4 of 5; per-item distribution shape statistics are reported.
+- **N is what it is.** `n_total` and `n_responded` printed verbatim; no rounding, no smoothing.
+- **Instrument provenance.** Hash of `instruments_used.json` printed so future-you can rebuild the exact instrument.
+
+This section is load-bearing for any publication: it makes the system intellectually defensible rather than a black box.
+
+## 11. Defaulted decisions (revisit later if needed)
+
+- **N agents:** assumed 50, driven from existing simulation config; if you typically run more/fewer, cost guardrail threshold needs adjusting
+- **Default instrument language:** German with English fallback in YAML
+- **Delphi rounds = 3:** classic Delphi can run more; 3 is the methodological floor and the cost ceiling here
+
+## 12. Open questions for implementation phase
+
+- Whether to write a separate `instruments_changelog.md` per run, or embed change tracking in `instruments_used.json` metadata
+- Whether the synthesiser should write into Zep as a single mega-episode or stay file-only (current design: file-only, plus the per-agent + per-aggregate episodes from each subagent)
+- Whether `Step4bInterviews.vue` should sit strictly after Step4 (current design) or render in parallel — interviews depend on the simulation having reached `completed` (Step3 output) and on the `graph_id` (created in Step1); they do not depend on Step4's ReportAgent run, so a parallel layout is technically possible