MicroFish/docs/superpowers/specs/2026-05-23-stakeholder-inte...

# Stakeholder Interview Subagents — Design Spec

- **Date:** 2026-05-23
- **Project:** MiroFish (multi-agent simulation engine for German fisheries discourse)
- **Author:** Christian Möllmann (with Claude Code)
- **Status:** Approved design — pending implementation plan

## 1. Purpose

After the OASIS Twitter + Reddit simulation produces a population of in-character stakeholder agents (fishers, NGOs, policy actors, scientists, consumers, etc.) grounded in a German fisheries discourse knowledge graph, we want to interrogate each agent individually with a structured questionnaire about the future of German fisheries.

Four methodologies run as independent subagents over the same agent population:

1. **Longitudinal** — pre/post Likert to measure opinion drift induced by simulated peer interaction
2. **Diversity** — Q-sort + multi-dim Likert to map the value space and derive a stakeholder typology
3. **Delphi** — three-round consensus probing to identify where stakeholder views converge vs. stay polarised
4. **Scenario** — rating of 4 pre-defined 2040 scenarios on desirability, plausibility, group-impact, fairness

A synthesiser combines the four outputs into a single cross-method report.

## 2. Non-goals (v1)

- Real-time WebSocket streaming of interview progress (polling suffices)
- Adaptive instruments / IRT calibration
- Web UI for editing instruments (YAML + restart is fine)
- Cross-simulation comparison endpoints (CSV exports support this externally)
- Multi-language support beyond DE / EN

## 3. Architectural approach

**Chosen approach: Deterministic instrument runners.** Each subagent is a fixed protocol, not a ReAct loop. Rationale: fisheries futures methodology favours instrument fidelity (every stakeholder sees the same scale) over agent autonomy, and results must be directly tabularisable for downstream analysis in pandas/R.

Rejected:
- *ReAct-style subagents* — non-deterministic, ~3–10× cost, can't guarantee every agent answered every item
- *Single InterviewService with mode enum* — couples four distinct methodologies (especially multi-round Delphi and two-phase Longitudinal) into one growing class

## 4. System architecture

```
                    InterviewOrchestrator
                          │
   ┌──────────────┬───────┴───────┬──────────────┐
   ▼              ▼               ▼              ▼
Longitudinal   Diversity        Delphi       Scenario
Subagent       Subagent         Subagent     Subagent
   │              │               │              │
   └──────────────┴──────┬────────┴──────────────┘
                         ▼
              StakeholderInterviewer (base)
                         │
       ┌─────────────────┼─────────────────┐
       ▼                 ▼                 ▼
   LLMClient        ZepEntityReader   ProfileLoader
   (in-character)   (memory digest)   (reddit/twitter)
                         │
                         ▼
       uploads/.../interviews/    +    Zep episodes
```

### 4.1 New files

| Path | Purpose |
|---|---|
| `backend/app/services/interviews/base.py` | `StakeholderInterviewer` — persona+memory loading, in-character prompting, retry/validation |
| `backend/app/services/interviews/longitudinal.py` | Pre/post Likert |
| `backend/app/services/interviews/diversity.py` | Q-sort + multi-dim value-space mapping |
| `backend/app/services/interviews/delphi.py` | Three-round consensus |
| `backend/app/services/interviews/scenario.py` | Scenario rating |
| `backend/app/services/interview_orchestrator.py` | Fan-out, parallel execution, two-phase lifecycle |
| `backend/app/services/interview_synthesizer.py` | Cross-method narrative report |
| `backend/app/api/interview.py` | New Flask blueprint `/api/interview/*` |
| `backend/app/models/interview.py` | Pydantic schemas for instruments + responses |
| `backend/scripts/instruments/*.yaml` | Editable instrument definitions (one YAML per subagent) |
| `frontend/src/components/Step4bInterviews.vue` | Four tabs + synthesis tab |
| `backend/tests/interviews/` | Unit tests per subagent + base + orchestrator + synthesiser |
| `tests/integration/test_interview_pipeline.py` | End-to-end with stub LLM + disposable Zep graph |

### 4.2 Lifecycle integration

Two hooks added to `backend/app/services/simulation_manager.py`:

- `on_ready()` — automatically triggers Longitudinal T0 (pre-simulation baseline)
- `on_completed()` — queues a task (tracked by `task_id`) that runs Longitudinal T1, Diversity, Delphi, and Scenario in parallel, followed by the Synthesiser

The two-phase split is **non-negotiable**: Longitudinal needs T0 captured before OASIS exposes agents to peer-generated content, otherwise drift is unmeasurable.

## 5. Instrument design

All instruments live in `backend/scripts/instruments/*.yaml` so content is editable without redeploying. Items default to German and are translatable via the existing locale system.

### 5.1 Longitudinal — opinion drift

- 12–15 item 5-point Likert ("lehne stark ab" → "stimme stark zu")
- Administered at T0 (post-persona, pre-OASIS) and T1 (post-OASIS)
- Item families (3–4 each): stock status & recovery; governance & CFP; market & MSC; climate & adaptation
- Per-agent output: response value + LLM self-reported confidence per item + one open comment
- Aggregate: Δ-matrix (N × M items), per-item Wilcoxon signed-rank, per-agent total drift magnitude
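
The aggregate step can be sketched as follows. This is a minimal illustration rather than the planned implementation: the `Responses` shape and both function names are hypothetical, and drift magnitude is assumed to be the sum of absolute item changes, which the spec does not pin down. The per-item Wilcoxon signed-rank test would then run column-wise on the paired T0/T1 values, for example with `scipy.stats.wilcoxon`.

```python
from typing import Dict, List, Optional

# agent_id -> item_id -> Likert value (None = item-level missing); assumed shape
Responses = Dict[str, Dict[str, Optional[int]]]


def delta_matrix(t0: Responses, t1: Responses, items: List[str]) -> Responses:
    """Return T1 - T0 per agent and item; None where either phase is missing."""
    deltas: Responses = {}
    for agent in sorted(set(t0) & set(t1)):  # only agents present in both phases
        row = {}
        for item in items:
            before, after = t0[agent].get(item), t1[agent].get(item)
            row[item] = None if before is None or after is None else after - before
        deltas[agent] = row
    return deltas


def drift_magnitude(row: Dict[str, Optional[int]]) -> int:
    """Assumed definition: sum of absolute item-level changes, ignoring missing items."""
    return sum(abs(d) for d in row.values() if d is not None)
```

Keeping missing items as `None` instead of zero avoids counting a non-response as "no drift".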

### 5.2 Diversity — typology mapping

- One-shot, post-simulation only
- **Part A (Q-sort lite):** 24 statements sorted onto forced quasi-normal distribution from −3 to +3
- **Part B:** 6 multi-dim Likert axes (preservation↔extraction, local↔EU, science-led↔tradition-led, individual↔collective, short-term↔long-term, market↔regulation)
- Per-agent output: vector ∈ ℝ^30
- Aggregate: PCA + k-means → 3–5 stakeholder clusters with archetype descriptions + cluster-membership probabilities
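
A sketch of how one agent's answers could be checked and flattened into the 30-dimensional vector (24 Q-sort scores plus 6 axis ratings). The quota table is an assumption: the spec only says "forced quasi-normal", so the counts below are one plausible distribution that sums to 24. Function and parameter names are hypothetical.

```python
from collections import Counter
from typing import Dict, List

# Assumed forced distribution for 24 statements; the spec does not fix the quotas.
QSORT_QUOTA = {-3: 2, -2: 3, -1: 4, 0: 6, 1: 4, 2: 3, 3: 2}


def qsort_violations(placements: Dict[str, int]) -> Dict[int, int]:
    """Return {score: observed - required} for every score whose count is off."""
    counts = Counter(placements.values())
    scores = set(QSORT_QUOTA) | set(counts)
    return {
        s: counts.get(s, 0) - QSORT_QUOTA.get(s, 0)
        for s in sorted(scores)
        if counts.get(s, 0) != QSORT_QUOTA.get(s, 0)
    }


def to_vector(placements: Dict[str, int], axes: Dict[str, int],
              statement_ids: List[str], axis_ids: List[str]) -> List[float]:
    """Concatenate Q-sort scores and axis ratings in a fixed order (24 + 6 = 30)."""
    return [float(placements[s]) for s in statement_ids] + [float(axes[a]) for a in axis_ids]
```

An empty result from `qsort_violations` means the sort respects the forced distribution; a non-empty one could drive the "re-ask" path before the vector reaches PCA.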

### 5.3 Delphi — consensus probing

- Three rounds, fully automated
- **R1 (open):** 4 open questions; LLM extracts thematic codes from responses
- **R2 (rate):** Agent sees anonymised list of all unique themes; rates each on importance (1–5) + plausibility (1–5)
- **R3 (revise):** Agent sees group median + IQR per theme; can revise own ratings; free-text justification
- Aggregate: per-theme convergence (Δ-IQR R2→R3), persistent disagreements (IQR > 2), ranked consensus statements
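
The convergence metrics can be sketched with the standard library alone. Assumptions: ratings are grouped per theme as plain lists, quartiles use the "inclusive" method of `statistics.quantiles`, and the IQR > 2 rule is applied to the R3 ratings. The function names are hypothetical.

```python
from statistics import median, quantiles
from typing import Dict, List


def iqr(values: List[float]) -> float:
    """Interquartile range using the inclusive quartile method."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1


def convergence(r2: Dict[str, List[float]], r3: Dict[str, List[float]],
                threshold: float = 2.0) -> Dict[str, dict]:
    """Per theme: R3 median, IQR change R2->R3, and a persistent-disagreement flag."""
    out = {}
    for theme in r2:
        iqr2, iqr3 = iqr(r2[theme]), iqr(r3[theme])
        out[theme] = {
            "median_r3": median(r3[theme]),
            "delta_iqr": iqr3 - iqr2,  # negative means views converged
            "persistent_disagreement": iqr3 > threshold,
        }
    return out
```

For example, ratings of `[1, 2, 3, 4, 5]` in R2 that tighten to `[2, 3, 3, 3, 4]` in R3 move the IQR from 2 to 0, giving a Δ-IQR of −2.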

### 5.4 Scenario — futures evaluation

Four 2040 scenarios (YAML-editable):

- **S1 "Erholung"** — cod and herring recover, MSC ubiquitous, small-scale fleet stabilises
- **S2 "Kollaps"** — both stocks collapse, fleet halved, aquaculture dominant
- **S3 "Festung Europa"** — protectionist EU policy, MPAs cover 30%, recreational fishing curtailed
- **S4 "Privatisierung"** — ITQs, consolidation, large operators only

Each agent rates each scenario on 4 dimensions (1–7 Likert): desirability, plausibility, impact-on-my-group, fairness. Plus one open question per scenario: "If you woke up in this 2040, what would you do?"

Aggregate: 4 × 4 per-agent matrix + open-text corpus → polarity charts (desirability × plausibility by stakeholder type), narrative themes.
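
One way the polarity-chart input might be derived from the per-agent ratings is sketched below. The record field names (`stakeholder_type`, `scenario`, `group_impact`) are illustrative assumptions, not the final schema.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Tuple

DIMENSIONS = ("desirability", "plausibility", "group_impact", "fairness")


def polarity_points(responses: List[dict]) -> Dict[Tuple[str, str], Tuple[float, float]]:
    """Mean (desirability, plausibility) per (stakeholder type, scenario)."""
    buckets = defaultdict(list)
    for r in responses:
        for dim in DIMENSIONS:
            if not 1 <= r[dim] <= 7:  # the 1-7 Likert range from the spec
                raise ValueError(f"{dim} out of range: {r[dim]}")
        key = (r["stakeholder_type"], r["scenario"])
        buckets[key].append((r["desirability"], r["plausibility"]))
    return {
        key: (mean(d for d, _ in pts), mean(p for _, p in pts))
        for key, pts in buckets.items()
    }
```

Each resulting point places one stakeholder type in the desirability × plausibility plane for one scenario.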

### 5.5 Cross-cutting

**In-character prompting.** Every LLM call uses a system prompt of the form:

> You are [persona_text]. You are answering a survey about the future of German fisheries. Answer strictly in character based on your background, values, and what you experienced during the simulated social media discourse summarised below: [Zep memory digest]. Return JSON only.

Memory digest comes from `ZepEntityReader.get_entity_with_context()`.

**Structured output enforced.** Every response goes through `LLMClient.chat_json()` with a per-instrument JSON schema. One auto-retry on schema violation; agent flagged in audit log on second failure.
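
The prompt template and the one-retry rule could look roughly like this. It is a sketch under stated assumptions: `call_llm` is a stand-in callable because the spec does not give the signature of `LLMClient.chat_json()`, and `is_valid` stands in for the per-instrument schema check.

```python
import json
from typing import Callable, Optional


def build_system_prompt(persona_text: str, memory_digest: str) -> str:
    """Fill the in-character template quoted above."""
    return (
        f"You are {persona_text}. You are answering a survey about the future of "
        "German fisheries. Answer strictly in character based on your background, "
        "values, and what you experienced during the simulated social media "
        f"discourse summarised below: {memory_digest}. Return JSON only."
    )


def ask_with_retry(call_llm: Callable[[str], str], prompt: str,
                   is_valid: Callable[[dict], bool]) -> Optional[dict]:
    """Return a validated answer, or None after the second failure."""
    for _ in range(2):  # first try plus one auto-retry
        raw = call_llm(prompt)
        try:
            answer = json.loads(raw)
        except json.JSONDecodeError:
            answer = None
        if isinstance(answer, dict) and is_valid(answer):
            return answer
        # Corrective instruction for the retry (wording is illustrative).
        prompt += "\nYour previous reply was not valid. Return JSON matching the schema only."
    return None  # caller flags the agent in the audit log
```

Returning `None` instead of raising keeps the decision about audit logging with the caller.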

**Cost guardrails.** Longitudinal × 2 phases + Delphi × 3 rounds is heaviest. For N=50 agents and ~100 LLM calls per agent across all 4 subagents, budget ~5k calls / 5–10M tokens per simulation. Persona system prompts stay constant within a subagent run → cacheable.
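
The budget arithmetic above works out as follows. The per-call token figures are derived from the stated totals, not measured, and `estimate_run` is a hypothetical helper.

```python
def estimate_run(n_agents: int, calls_per_agent: int, tokens_per_call: int) -> dict:
    """Project total LLM calls and tokens for one simulation."""
    calls = n_agents * calls_per_agent
    return {"calls": calls, "tokens": calls * tokens_per_call}


# 50 agents x ~100 calls = 5,000 calls. A 5-10M token budget therefore
# implies roughly 1,000-2,000 tokens per call (prompt + completion).
low = estimate_run(50, 100, 1_000)
high = estimate_run(50, 100, 2_000)
```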

## 6. Data flow and storage

### 6.1 Storage layout

```
uploads/simulations/{sim_id}/interviews/
├── instruments_used.json          # frozen snapshot of YAML at run-time
├── T0/
│   └── longitudinal/
│       ├── responses.jsonl
│       ├── audit.jsonl            # raw LLM I/O, retries, validation failures
│       └── aggregate.json
├── T1/
│   ├── longitudinal/{same structure}
│   ├── diversity/
│   │   ├── responses.jsonl
│   │   ├── typology.json
│   │   └── pca.json
│   ├── delphi/
│   │   ├── round1_themes.jsonl
│   │   ├── round2_ratings.jsonl
│   │   ├── round3_revisions.jsonl
│   │   └── convergence.json
│   └── scenario/
│       ├── responses.jsonl
│       └── polarity_matrix.json
└── synthesis/
    ├── report.md
    └── exports/
        ├── all_responses.csv      # tidy long format
        └── codebook.json
```

JSONL for raw responses (append-safe, streams cleanly); JSON for aggregates; CSV for analysis hand-off. `instruments_used.json` snapshot is critical for reproducibility when YAML is later edited.
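
A minimal sketch of the two write paths, assuming the instruments have already been parsed from YAML into a dict. Both helper names are hypothetical. Serialising with `sort_keys=True` makes the hash independent of key order.

```python
import hashlib
import json
from pathlib import Path


def append_jsonl(path: Path, record: dict) -> None:
    """Append one record as a single line; existing lines are never rewritten."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")


def snapshot_instruments(instruments: dict, out_dir: Path) -> str:
    """Freeze the instruments to instruments_used.json and return their SHA-256."""
    payload = json.dumps(instruments, sort_keys=True, ensure_ascii=False).encode("utf-8")
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / "instruments_used.json").write_bytes(payload)
    return hashlib.sha256(payload).hexdigest()
```

The returned hash is the kind of value that could be printed under "instrument provenance" in the synthesis report.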

### 6.2 Zep integration

Two write patterns, both reusing `ZepGraphMemoryUpdater.add_activity()`:

- **Per-agent episode** — after each subagent finishes for an agent, write one episode: `"Agent {name} (interview/{subagent}/{phase}): {short summary of stance}"`. The existing ReportAgent can retrieve interview content via its current `panorama_search` / `insight_forge` tools without changes.
- **Aggregate episodes** — after each subagent's aggregate step, write one summary episode per cluster / theme / scenario.

No new Zep schemas. No new entity types. Interviews are just more episodes — append-only, safe.

### 6.3 API surface

New blueprint `/api/interview`:

| Method | Path | Purpose |
|---|---|---|
| `POST` | `/api/interview/{sim_id}/pre` | Trigger T0 longitudinal (auto on READY, manual for re-runs) |
| `POST` | `/api/interview/{sim_id}/post` | Trigger all 4 post-sim subagents; returns `task_id` |
| `GET`  | `/api/interview/{sim_id}/status?task_id=...` | Per-subagent progress |
| `GET`  | `/api/interview/{sim_id}/results/{subagent}` | Aggregate JSON for one subagent |
| `GET`  | `/api/interview/{sim_id}/results/synthesis` | Full synthesis report |
| `GET`  | `/api/interview/{sim_id}/export.csv` | Tidy long-format CSV across all 4 subagents |
| `POST` | `/api/interview/{sim_id}/rerun` | Re-run one subagent (e.g. after editing YAML) |

All responses follow the existing `{success, data, error}` envelope. Polling reuses `models/task.py`.

### 6.4 Parallelism

- Within a subagent: `ThreadPoolExecutor(max_workers=8)` for per-agent LLM calls
- Across the 4 post-sim subagents: all four run in parallel with one another; only Delphi's three rounds run sequentially, inside its own subagent
- Synthesiser waits for all four
- Token budget guard: `Config.INTERVIEW_MAX_TOKENS_PER_RUN`; if the projected cost exceeds this limit, the API returns 400 with a dry-run estimate, and resubmitting with `confirm=true` overrides the guard
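
The per-agent fan-out within a subagent might be sketched like this. `run_for_agents` and the result shape are hypothetical; `interview` stands in for whatever callable performs one agent's LLM calls.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, List


def run_for_agents(agent_ids: List[str], interview: Callable[[str], dict],
                   max_workers: int = 8) -> Dict[str, dict]:
    """Interview agents concurrently; one agent's exception never stops the rest."""
    results: Dict[str, dict] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(interview, a): a for a in agent_ids}
        for fut in as_completed(futures):
            agent = futures[fut]
            try:
                results[agent] = {"status": "ok", "response": fut.result()}
            except Exception as exc:  # isolate per-agent failures
                results[agent] = {"status": "failed", "error": str(exc)}
    return results
```

Threads suit this workload because each call is dominated by network wait on the LLM, not by CPU work.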

### 6.5 Frontend

New `Step4bInterviews.vue` between current Step4 (report) and Step5 (interaction). Four tabs (one per subagent) + a synthesis tab. Each tab shows progress bar during run, then results: Likert heatmap (longitudinal Δ), PCA scatter (diversity), convergence chart (Delphi), polarity quadrants (scenario). Download button per tab pulls the CSV export.

## 7. Error handling

**Per-agent failures are isolated.** If agent 17 times out or fails JSON validation twice, agent 17 is marked `failed` in `audit.jsonl`; the rest of the run continues. Aggregates report `n_responded` / `n_total` honestly.

| Failure | Handling |
|---|---|
| LLM timeout / 5xx | Exponential-backoff retry (3 attempts) via existing `LLMClient`; then mark agent failed |
| JSON schema violation | One auto-retry with explicit corrective instruction; then mark failed |
| Likert out-of-range / missing items | Re-ask only the bad items; if still bad, item-level missing |
| Zep memory fetch fails | Run without memory digest; flag in audit (`memory_available: false`); down-weight in drift analysis |
| Whole-subagent crash | Other 3 continue; synthesiser runs on what completed and flags the gap |
| Token budget exceeded | Pause, write partial results, return 503 with `resume_token` |

**Idempotency.** Every subagent run is keyed by `(sim_id, subagent, phase, run_id)`. Re-runs write a new `run_id` directory; never overwrite. A `latest.json` pointer tracks the canonical run.
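
A sketch of the never-overwrite rule. The directory nesting (`{phase}/{subagent}/{run_id}`) and the placement of `latest.json` beside the run directories are assumptions, since the storage tree in §6.1 does not show a `run_id` level. Helper names are hypothetical.

```python
import json
import uuid
from pathlib import Path


def new_run_dir(base: Path, sim_id: str, subagent: str, phase: str) -> Path:
    """Create a fresh directory for this run; exist_ok=False guards against overwrites."""
    run_id = uuid.uuid4().hex
    run_dir = base / sim_id / "interviews" / phase / subagent / run_id
    run_dir.mkdir(parents=True, exist_ok=False)
    return run_dir


def mark_latest(run_dir: Path) -> None:
    """Point latest.json (next to the run directories) at the canonical run."""
    pointer = run_dir.parent / "latest.json"
    tmp = pointer.with_suffix(".tmp")
    tmp.write_text(json.dumps({"run_id": run_dir.name}), encoding="utf-8")
    tmp.replace(pointer)  # swap in one step so readers never see a half-written file
```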

## 8. Validation

Three layers:

1. **Schema validation** — pydantic models for every response; JSONL files validated on write
2. **Instrument validation** — `validate_instrument(yaml)` pre-flight: required fields, scale coherence, no duplicate item_ids, DE+EN both present if i18n enabled
3. **Plausibility checks** on aggregates (flag, don't kill):
   - Longitudinal: more than 80% of agents show zero drift on every item, or more than 80% flip their position; either pattern suggests a prompting bug or acquiescence bias
   - Diversity: first two PCA components explain <30% of variance — instrument not discriminating
   - Delphi: R3 ratings identical to R2 for >90% of agents — no engagement with anonymised feedback
   - Scenario: all agents rate all scenarios identically on `desirability` — instrument failure
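
The instrument pre-flight in layer 2 might look like the sketch below. It takes the already-parsed YAML as a dict, and the field names (`scale.min`, `scale.max`, `items`, `text_de`, `text_en`) are assumptions about the instrument format, not the settled schema.

```python
from collections import Counter
from typing import List

REQUIRED_ITEM_FIELDS = ("item_id", "text_de")  # assumed field names


def validate_instrument(instrument: dict, require_en: bool = False) -> List[str]:
    """Return human-readable problems; an empty list means the instrument passes."""
    problems = []
    scale = instrument.get("scale", {})
    lo, hi = scale.get("min"), scale.get("max")
    if not (isinstance(lo, int) and isinstance(hi, int) and lo < hi):
        problems.append("scale: min/max missing or incoherent")
    items = instrument.get("items", [])
    required = REQUIRED_ITEM_FIELDS + (("text_en",) if require_en else ())
    for idx, item in enumerate(items):
        for field in required:
            if not item.get(field):
                problems.append(f"items[{idx}]: missing {field}")
    counts = Counter(it.get("item_id") for it in items)
    problems += [f"duplicate item_id: {i}" for i, n in counts.items() if i and n > 1]
    return problems
```

Collecting every problem in one pass, instead of failing on the first, lets an instrument author fix the YAML in a single edit.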

Flags surface in the synthesis report under "instrument health" so the user can decide whether data is publishable.
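
Two of the plausibility checks are sketched here to show the "flag, don't kill" pattern. The input shapes are assumptions (ratings keyed by agent id), and the Scenario check reads "identically" as a single desirability value across all agents and scenarios, which is one possible interpretation.

```python
from typing import Dict, List


def delphi_no_engagement(r2: Dict[str, List[int]], r3: Dict[str, List[int]],
                         threshold: float = 0.9) -> bool:
    """True when more than `threshold` of agents left every rating unchanged."""
    agents = [a for a in r2 if a in r3]
    if not agents:
        return False
    unchanged = sum(r2[a] == r3[a] for a in agents)
    return unchanged / len(agents) > threshold


def scenario_flatline(desirability: Dict[str, List[int]]) -> bool:
    """True when only one distinct desirability value occurs anywhere."""
    values = {v for ratings in desirability.values() for v in ratings}
    return len(values) == 1


def instrument_health(r2, r3, desirability) -> List[str]:
    """Collect flags for the synthesis report instead of aborting the run."""
    flags = []
    if delphi_no_engagement(r2, r3):
        flags.append("delphi: R3 identical to R2 for >90% of agents")
    if scenario_flatline(desirability):
        flags.append("scenario: identical desirability ratings")
    return flags
```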

## 9. Testing

**Unit tests** (`backend/tests/interviews/`):

- `test_instruments.py` — every YAML parses and validates
- `test_base_interviewer.py` — persona+memory loading, in-character prompt construction, schema-retry logic (mock `LLMClient`)
- One file per subagent — happy path + each failure mode in §7
- `test_orchestrator.py` — fan-out, partial failures, two-phase ordering (T0 before T1)
- `test_synthesizer.py` — missing-subagent handling, stable output shape

**Integration test** (`tests/integration/test_interview_pipeline.py`):

End-to-end with N=5 agents against a recorded LLM cassette. Verifies that T0 runs at READY, that T1 and the three other subagents run at COMPLETED, that the CSV export is well-formed, and that Zep episodes are written.

**Stub LLM mode** (`Config.LLM_STUB_MODE=true`) returns deterministic canned responses keyed by `(subagent, item_id, persona_hash)`. Full pipeline exercisable in CI for free.
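
One way to get deterministic canned answers from that key is to hash it, as in this sketch. `stub_likert` is a hypothetical helper and covers only Likert items; open-text stubs would need a separate fixture.

```python
import hashlib


def stub_likert(subagent: str, item_id: str, persona_hash: str, points: int = 5) -> int:
    """Deterministic canned Likert answer in 1..points for a given key."""
    key = f"{subagent}|{item_id}|{persona_hash}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % points + 1
```

Using `hashlib` instead of the built-in `hash()` matters here: string hashing is salted per process, so `hash()` would give different answers between CI runs.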

**Zep**: disposable graph in integration tests (consistent with project conventions); unit tests stub.

## 10. Methodological caveats (auto-emitted in synthesis)

The synthesiser **always** emits a "Limitations" section, programmatically generated from run metadata:

- **Simulated, not real stakeholders.** Responses reflect how the seed-document discourse + LLM jointly encode each stakeholder type, not what actual fishers / NGO staff would say. The instrument measures the *model of the stakeholder*, not the stakeholder.
- **Memory digest is lossy.** Each agent's "experience" of OASIS is summarised to bounded length; agents do not have full episodic recall.
- **LLM acquiescence and central-tendency bias.** Likert responses from LLM respondents tend to skew toward 3–4 on a 5-point scale; per-item distribution shape statistics are reported.
- **N is what it is.** `n_total` and `n_responded` printed verbatim; no rounding, no smoothing.
- **Instrument provenance.** Hash of `instruments_used.json` printed so future-you can rebuild the exact instrument.

This section is load-bearing for any publication: it makes the system intellectually defensible rather than a black box.

## 11. Defaulted decisions (revisit later if needed)

- **N agents:** assumed 50, driven from existing simulation config; if typical runs use more or fewer agents, the cost-guardrail threshold needs adjusting
- **Default instrument language:** German with English fallback in YAML
- **Delphi rounds = 3:** classic Delphi can run more; 3 is the methodological floor and the cost ceiling here

## 12. Open questions for implementation phase

- Whether to write a separate `instruments_changelog.md` per run, or embed change tracking in `instruments_used.json` metadata
- Whether the synthesiser should write into Zep as a single mega-episode or stay file-only (current design: file-only, plus the per-agent + per-aggregate episodes from each subagent)
- Whether `Step4bInterviews.vue` should sit strictly after Step4 (current design) or render in parallel. Interviews depend on the simulation having reached `completed` (Step3 output) and on the `graph_id` (created in Step1), but not on Step4's ReportAgent run, so a parallel layout is technically possible