Stakeholder Interview Subagents — Design Spec
- Date: 2026-05-23
- Project: MiroFish (multi-agent simulation engine for German fisheries discourse)
- Author: Christian Möllmann (with Claude Code)
- Status: Approved design — pending implementation plan
1. Purpose
After the OASIS Twitter + Reddit simulation produces a population of in-character stakeholder agents (fishers, NGOs, policy actors, scientists, consumers, etc.) grounded in a German fisheries discourse knowledge graph, we want to interrogate each agent individually with a structured questionnaire about the future of German fisheries.
Four methodologies run as independent subagents over the same agent population:
- Longitudinal — pre/post Likert to measure opinion drift induced by simulated peer interaction
- Diversity — Q-sort + multi-dim Likert to map the value space and derive a stakeholder typology
- Delphi — three-round consensus probing to identify where stakeholder views converge vs. stay polarised
- Scenario — rating of 4 pre-defined 2040 scenarios on desirability, plausibility, group-impact, fairness
A synthesiser combines the four outputs into a single cross-method report.
2. Non-goals (v1)
- Real-time WebSocket streaming of interview progress (polling suffices)
- Adaptive instruments / IRT calibration
- Web UI for editing instruments (YAML + restart is fine)
- Cross-simulation comparison endpoints (CSV exports support this externally)
- Multi-language support beyond DE / EN
3. Architectural approach
Chosen approach: Deterministic instrument runners. Each subagent is a fixed protocol, not a ReAct loop. Rationale: fisheries futures methodology favours instrument fidelity (every stakeholder sees the same scale) over agent autonomy; results must be directly tabularisable for downstream analysis in pandas/R.
Rejected:
- ReAct-style subagents: non-deterministic, ~3–10× cost, can't guarantee every agent answered every item
- Single InterviewService with mode enum: couples four distinct methodologies (especially multi-round Delphi and two-phase Longitudinal) into one growing class
4. System architecture
                       InterviewOrchestrator
                                 │
          ┌──────────────┬───────┴───────┬──────────────┐
          ▼              ▼               ▼              ▼
    Longitudinal     Diversity        Delphi        Scenario
      Subagent       Subagent        Subagent       Subagent
          │              │               │              │
          └──────────────┴──────┬────────┴──────────────┘
                                ▼
                  StakeholderInterviewer (base)
                                │
              ┌─────────────────┼─────────────────┐
              ▼                 ▼                 ▼
          LLMClient      ZepEntityReader    ProfileLoader
       (in-character)    (memory digest)  (reddit/twitter)
                                │
                                ▼
             uploads/.../interviews/ + Zep episodes
4.1 New files
| Path | Purpose |
|---|---|
| backend/app/services/interviews/base.py | StakeholderInterviewer: persona+memory loading, in-character prompting, retry/validation |
| backend/app/services/interviews/longitudinal.py | Pre/post Likert |
| backend/app/services/interviews/diversity.py | Q-sort + multi-dim value-space mapping |
| backend/app/services/interviews/delphi.py | Three-round consensus |
| backend/app/services/interviews/scenario.py | Scenario rating |
| backend/app/services/interview_orchestrator.py | Fan-out, parallel execution, two-phase lifecycle |
| backend/app/services/interview_synthesizer.py | Cross-method narrative report |
| backend/app/api/interview.py | New Flask blueprint /api/interview/* |
| backend/app/models/interview.py | Pydantic schemas for instruments + responses |
| backend/scripts/instruments/*.yaml | Editable instrument definitions (one YAML per subagent) |
| frontend/src/components/Step4bInterviews.vue | Four tabs + synthesis tab |
| backend/tests/interviews/ | Unit tests per subagent + base + orchestrator + synthesiser |
| tests/integration/test_interview_pipeline.py | End-to-end with stub LLM + disposable Zep graph |
4.2 Lifecycle integration
Two hooks added to backend/app/services/simulation_manager.py:
- on_ready(): automatically triggers Longitudinal T0 (pre-simulation baseline)
- on_completed(): queues a task_id running Longitudinal T1 + Diversity + Delphi + Scenario in parallel, then Synthesiser
The two-phase split is non-negotiable: Longitudinal needs T0 captured before OASIS exposes agents to peer-generated content, otherwise drift is unmeasurable.
5. Instrument design
All instruments live in backend/scripts/instruments/*.yaml so content is editable without redeploying. Items default to German, translatable via existing locale system.
5.1 Longitudinal — opinion drift
- 12–15 item 5-point Likert ("lehne stark ab" → "stimme stark zu")
- Administered at T0 (post-persona, pre-OASIS) and T1 (post-OASIS)
- Item families (3–4 each): stock status & recovery; governance & CFP; market & MSC; climate & adaptation
- Per-agent output: response value + LLM self-reported confidence per item + one open comment
- Aggregate: Δ-matrix (N × M items), per-item Wilcoxon signed-rank, per-agent total drift magnitude
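The aggregate step above can be sketched in plain Python. This is a minimal illustration, not the project's implementation: the `Responses` shape, the function names, and the choice of "sum of absolute item shifts" as the drift magnitude are all assumptions, since the spec does not fix them.

```python
from typing import Dict, List

# Assumed shape: agent_id -> Likert values (1-5), one entry per item, same item order.
Responses = Dict[str, List[int]]


def delta_matrix(t0: Responses, t1: Responses) -> Dict[str, List[int]]:
    """Return T1 - T0 per item for every agent present in both phases."""
    return {
        agent: [after - before for before, after in zip(t0[agent], t1[agent])]
        for agent in t0
        if agent in t1
    }


def drift_magnitude(deltas: Dict[str, List[int]]) -> Dict[str, int]:
    """Per-agent total drift, defined here (an assumption) as the sum of |delta|."""
    return {agent: sum(abs(d) for d in row) for agent, row in deltas.items()}


# Toy data: two agents, three items.
t0 = {"fisher_01": [2, 4, 3], "ngo_01": [5, 1, 4]}
t1 = {"fisher_01": [3, 4, 1], "ngo_01": [5, 2, 4]}

deltas = delta_matrix(t0, t1)
magnitudes = drift_magnitude(deltas)
print(deltas)
print(magnitudes)
```

The per-item Wilcoxon signed-rank test is left out to keep the sketch dependency-free; one option would be to pass each item's T0 and T1 columns to `scipy.stats.wilcoxon`.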
5.2 Diversity — typology mapping
- One-shot, post-simulation only
- Part A (Q-sort lite): 24 statements sorted onto forced quasi-normal distribution from −3 to +3
- Part B: 6 multi-dim Likert axes (preservation↔extraction, local↔EU, science-led↔tradition-led, individual↔collective, short-term↔long-term, market↔regulation)
- Per-agent output: vector ∈ ℝ^30
- Aggregate: PCA + k-means → 3–5 stakeholder clusters with archetype descriptions + cluster-membership probabilities
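A forced distribution only works if every Q-sort is checked against it before it enters the ℝ^30 vector. The sketch below shows one way to do that. The slot counts in `FORCED_SLOTS` are hypothetical (the spec only says "quasi-normal from −3 to +3" over 24 statements), as are the function names and the statement ids.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical forced distribution: score -> number of slots (sums to 24).
FORCED_SLOTS: Dict[int, int] = {-3: 2, -2: 3, -1: 4, 0: 6, 1: 4, 2: 3, 3: 2}


def qsort_is_valid(placements: Dict[str, int]) -> bool:
    """True if each score is used exactly as often as the distribution allows."""
    return Counter(placements.values()) == Counter(FORCED_SLOTS)


def agent_vector(placements: Dict[str, int], likert_axes: List[int]) -> List[int]:
    """Concatenate 24 Q-sort scores (ordered by statement id) with 6 axis ratings."""
    if not qsort_is_valid(placements):
        raise ValueError("Q-sort does not match the forced distribution")
    if len(likert_axes) != 6:
        raise ValueError("expected 6 Likert axis ratings")
    return [placements[k] for k in sorted(placements)] + list(likert_axes)


# Build one valid sort by filling the slots in order, then attach 6 axis ratings.
scores = [score for score, n in FORCED_SLOTS.items() for _ in range(n)]
placements = {f"q{i:02d}": s for i, s in enumerate(scores, start=1)}
vector = agent_vector(placements, [4, 2, 5, 3, 3, 1])
print(len(vector))
```

Rejecting an invalid sort at this point keeps the PCA input comparable across agents, which is the reason for forcing the distribution in the first place.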
5.3 Delphi — consensus probing
- Three rounds, fully automated
- R1 (open): 4 open questions; LLM extracts thematic codes from responses
- R2 (rate): Agent sees anonymised list of all unique themes; rates each on importance (1–5) + plausibility (1–5)
- R3 (revise): Agent sees group median + IQR per theme; can revise own ratings; free-text justification
- Aggregate: per-theme convergence (Δ-IQR R2→R3), persistent disagreements (IQR > 2), ranked consensus statements
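The convergence metrics for a single theme can be illustrated with the standard library. This is a sketch under assumptions: the function names and the returned keys are made up for illustration, and the inclusive quartile method is one of several reasonable choices for small samples.

```python
from statistics import median, quantiles
from typing import List


def iqr(values: List[int]) -> float:
    """Interquartile range using the inclusive quartile method."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1


def theme_convergence(r2: List[int], r3: List[int]) -> dict:
    """Compare the spread of ratings for one theme between round 2 and round 3."""
    iqr_r2, iqr_r3 = iqr(r2), iqr(r3)
    return {
        "median_r3": median(r3),
        "iqr_r2": iqr_r2,
        "iqr_r3": iqr_r3,
        "delta_iqr": iqr_r3 - iqr_r2,  # negative means views converged
        "persistent_disagreement": iqr_r3 > 2,
    }


# Five agents rate one theme on importance (1-5) in R2, then revise in R3.
r2 = [1, 2, 3, 4, 5]
r3 = [2, 3, 3, 3, 4]
result = theme_convergence(r2, r3)
print(result)
```

In this toy case the IQR shrinks from 2.0 to 0.0, so the theme would count as converged rather than as a persistent disagreement.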
5.4 Scenario — futures evaluation
Four 2040 scenarios (YAML-editable):
- S1 "Erholung" — cod and herring recover, MSC ubiquitous, small-scale fleet stabilises
- S2 "Kollaps" — both stocks collapse, fleet halved, aquaculture dominant
- S3 "Festung Europa" — protectionist EU policy, MPAs cover 30%, recreational fishing curtailed
- S4 "Privatisierung" — ITQs, consolidation, large operators only
Each agent rates each scenario on 4 dimensions (1–7 Likert): desirability, plausibility, impact-on-my-group, fairness. Plus one open question per scenario: "If you woke up in this 2040, what would you do?"
Aggregate: 4 × 4 per-agent matrix + open-text corpus → polarity charts (desirability × plausibility by stakeholder type), narrative themes.
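As a rough illustration of how the polarity charts could be fed, the sketch below averages desirability and plausibility per (stakeholder type, scenario) pair. The record fields and the function name are assumptions; the spec does not define the row format of responses.jsonl.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Tuple


def polarity_by_type(ratings: List[dict]) -> Dict[Tuple[str, str], dict]:
    """Mean desirability and plausibility per (stakeholder_type, scenario)."""
    grouped = defaultdict(list)
    for row in ratings:
        grouped[(row["stakeholder_type"], row["scenario"])].append(row)
    return {
        key: {
            "desirability": mean([r["desirability"] for r in rows]),
            "plausibility": mean([r["plausibility"] for r in rows]),
            "n": len(rows),
        }
        for key, rows in grouped.items()
    }


# Toy ratings on the 1-7 scale for scenario S1.
ratings = [
    {"stakeholder_type": "fisher", "scenario": "S1", "desirability": 6, "plausibility": 2},
    {"stakeholder_type": "fisher", "scenario": "S1", "desirability": 4, "plausibility": 4},
    {"stakeholder_type": "ngo", "scenario": "S1", "desirability": 7, "plausibility": 3},
]
polarity = polarity_by_type(ratings)
print(polarity[("fisher", "S1")])
```

Each resulting point can then be placed in a desirability × plausibility quadrant, with `n` shown alongside so small groups are not over-read.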
5.5 Cross-cutting
In-character prompting. Every LLM call uses a system prompt of the form:
You are [persona_text]. You are answering a survey about the future of German fisheries. Answer strictly in character based on your background, values, and what you experienced during the simulated social media discourse summarised below: [Zep memory digest]. Return JSON only.
Memory digest comes from ZepEntityReader.get_entity_with_context().
Structured output enforced. Every response goes through LLMClient.chat_json() with a per-instrument JSON schema. One auto-retry on schema violation; agent flagged in audit log on second failure.
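The "one auto-retry, then flag" rule can be sketched as follows. The signature of `LLMClient.chat_json()` is not given in this spec, so the example uses a plain callable (`AskFn`) as a stand-in, and a hand-written Likert check in place of a real JSON schema; both are assumptions.

```python
import json
from typing import Callable, List, Optional, Tuple

# Stand-in for the LLM call: takes a prompt, returns raw text.
AskFn = Callable[[str], str]


def validate_likert(payload: dict, item_ids: List[str]) -> List[str]:
    """Return a list of problems; an empty list means the payload is valid."""
    problems = [f"missing item {i}" for i in item_ids if i not in payload]
    problems += [
        f"{i} out of range"
        for i in item_ids
        if i in payload and payload[i] not in range(1, 6)
    ]
    return problems


def ask_with_retry(
    ask: AskFn, prompt: str, item_ids: List[str]
) -> Tuple[Optional[dict], List[dict]]:
    """Try once, retry once with a corrective instruction, then give up."""
    audit: List[dict] = []
    current = prompt
    for attempt in (1, 2):
        raw = ask(current)
        try:
            payload = json.loads(raw)
            problems = validate_likert(payload, item_ids)
        except (json.JSONDecodeError, TypeError):
            payload, problems = None, ["not valid JSON"]
        audit.append({"attempt": attempt, "raw": raw, "problems": problems})
        if not problems:
            return payload, audit
        current = f"{prompt}\nYour last answer was rejected: {problems}. Return JSON only."
    return None, audit  # caller marks the agent as failed in the audit log


# Fake LLM that fails once, then answers correctly.
replies = iter(["sorry, here is my view...", '{"q1": 4, "q2": 2}'])
payload, audit = ask_with_retry(lambda _prompt: next(replies), "Rate q1 and q2.", ["q1", "q2"])
print(payload, len(audit))
```

Returning the audit trail alongside the payload keeps raw LLM I/O and validation failures available for audit.jsonl without the caller having to re-derive them.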
Cost guardrails. Longitudinal × 2 phases + Delphi × 3 rounds is heaviest. For N=50 agents and ~100 LLM calls per agent across all 4 subagents, budget ~5k calls / 5–10M tokens per simulation. Persona system prompts stay constant within a subagent run → cacheable.
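The budget figures above follow from simple multiplication, which the sketch below makes explicit. The per-call token range of 1,000 to 2,000 is not stated in the spec; it is what the quoted 5–10M total implies for 5k calls. The function name is hypothetical.

```python
def estimate_run(n_agents: int, calls_per_agent: int, tokens_per_call: int) -> dict:
    """Project the call and token volume for one full interview run."""
    calls = n_agents * calls_per_agent
    return {"calls": calls, "tokens": calls * tokens_per_call}


# N=50 agents, ~100 calls each, with an assumed 1k-2k tokens per call.
low = estimate_run(50, 100, 1_000)
high = estimate_run(50, 100, 2_000)
print(low, high)
```

Because both factors scale linearly, doubling the agent population doubles the projected tokens, which is why the guardrail threshold has to move with N.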
6. Data flow and storage
6.1 Storage layout
    uploads/simulations/{sim_id}/interviews/
    ├── instruments_used.json        # frozen snapshot of YAML at run-time
    ├── T0/
    │   └── longitudinal/
    │       ├── responses.jsonl
    │       ├── audit.jsonl          # raw LLM I/O, retries, validation failures
    │       └── aggregate.json
    ├── T1/
    │   ├── longitudinal/{same structure}
    │   ├── diversity/
    │   │   ├── responses.jsonl
    │   │   ├── typology.json
    │   │   └── pca.json
    │   ├── delphi/
    │   │   ├── round1_themes.jsonl
    │   │   ├── round2_ratings.jsonl
    │   │   ├── round3_revisions.jsonl
    │   │   └── convergence.json
    │   └── scenario/
    │       ├── responses.jsonl
    │       └── polarity_matrix.json
    └── synthesis/
        ├── report.md
        └── exports/
            ├── all_responses.csv    # tidy long format
            └── codebook.json
JSONL for raw responses (append-safe, streams cleanly); JSON for aggregates; CSV for analysis hand-off. instruments_used.json snapshot is critical for reproducibility when YAML is later edited.
6.2 Zep integration
Two write patterns, both reusing ZepGraphMemoryUpdater.add_activity():
- Per-agent episode: after each subagent finishes for an agent, write one episode: "Agent {name} (interview/{subagent}/{phase}): {short summary of stance}". The existing ReportAgent can retrieve interview content via its current panorama_search / insight_forge tools without changes.
- Aggregate episodes: after each subagent's aggregate step, write one summary episode per cluster / theme / scenario.
No new Zep schemas. No new entity types. Interviews are just more episodes — append-only, safe.
6.3 API surface
New blueprint /api/interview:
| Method | Path | Purpose |
|---|---|---|
| POST | /api/interview/{sim_id}/pre | Trigger T0 longitudinal (auto on READY, manual for re-runs) |
| POST | /api/interview/{sim_id}/post | Trigger all 4 post-sim subagents; returns task_id |
| GET | /api/interview/{sim_id}/status?task_id=... | Per-subagent progress |
| GET | /api/interview/{sim_id}/results/{subagent} | Aggregate JSON for one subagent |
| GET | /api/interview/{sim_id}/results/synthesis | Full synthesis report |
| GET | /api/interview/{sim_id}/export.csv | Tidy long-format CSV across all 4 subagents |
| POST | /api/interview/{sim_id}/rerun | Re-run one subagent (e.g. after editing YAML) |
All responses follow the existing {success, data, error} envelope. Polling reuses models/task.py.
6.4 Parallelism
- Within a subagent: ThreadPoolExecutor(max_workers=8) for per-agent LLM calls
- Across the 4 post-sim subagents: parallel, except Delphi (sequential rounds internally)
- Synthesiser waits for all four
- Token budget guard: Config.INTERVIEW_MAX_TOKENS_PER_RUN; if the projected cost exceeds this limit, the API returns 400 with a dry-run estimate and a confirm=true override
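The per-agent fan-out inside one subagent might look like the sketch below. `run_for_all_agents` and `fake_interview` are hypothetical names; the real per-agent call would go through StakeholderInterviewer. The point illustrated is that one agent's exception is caught and recorded without cancelling the others.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, List


def run_for_all_agents(
    agent_ids: List[str],
    interview: Callable[[str], dict],
    max_workers: int = 8,
) -> Dict[str, dict]:
    """Interview every agent in a thread pool; isolate per-agent failures."""
    results: Dict[str, dict] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(interview, agent): agent for agent in agent_ids}
        for future in as_completed(futures):
            agent = futures[future]
            try:
                results[agent] = {"status": "ok", "response": future.result()}
            except Exception as exc:  # one bad agent must not stop the run
                results[agent] = {"status": "failed", "error": str(exc)}
    return results


def fake_interview(agent_id: str) -> dict:
    """Stand-in for the real LLM-backed interview; agent_17 always times out."""
    if agent_id == "agent_17":
        raise TimeoutError("LLM timed out")
    return {"q1": 3}


results = run_for_all_agents([f"agent_{i}" for i in range(15, 20)], fake_interview)
n_responded = sum(r["status"] == "ok" for r in results.values())
print(f"{n_responded} / {len(results)} responded")
```

Threads fit here because the work is dominated by waiting on network I/O rather than CPU.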
6.5 Frontend
New Step4bInterviews.vue between current Step4 (report) and Step5 (interaction). Four tabs (one per subagent) + a synthesis tab. Each tab shows progress bar during run, then results: Likert heatmap (longitudinal Δ), PCA scatter (diversity), convergence chart (Delphi), polarity quadrants (scenario). Download button per tab pulls the CSV export.
7. Error handling
Per-agent failures are isolated. If agent 17 times out or fails JSON validation twice, agent 17 is marked failed in audit.jsonl; the rest of the run continues. Aggregates report n_responded / n_total honestly.
| Failure | Handling |
|---|---|
| LLM timeout / 5xx | Exponential-backoff retry (3 attempts) via existing LLMClient; then mark agent failed |
| JSON schema violation | One auto-retry with explicit corrective instruction; then mark failed |
| Likert out-of-range / missing items | Re-ask only the bad items; if still bad, item-level missing |
| Zep memory fetch fails | Run without memory digest; flag in audit (memory_available: false); down-weight in drift analysis |
| Whole-subagent crash | Other 3 continue; synthesiser runs on what completed and flags the gap |
| Token budget exceeded | Pause, write partial results, return 503 with resume_token |
Idempotency. Every subagent run is keyed by (sim_id, subagent, phase, run_id). Re-runs write a new run_id directory; never overwrite. A latest.json pointer tracks the canonical run.
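One way to realise "new run_id directory plus latest.json pointer" is sketched below. The `runs/` subfolder, the UUID-based run_id, and the pointer's JSON shape are assumptions; the spec only requires that re-runs never overwrite earlier output.

```python
import json
import tempfile
import uuid
from pathlib import Path


def start_run(root: Path, sim_id: str, subagent: str, phase: str) -> Path:
    """Create a fresh run directory and point latest.json at it."""
    base = root / sim_id / "interviews" / phase / subagent
    run_id = uuid.uuid4().hex
    run_dir = base / "runs" / run_id  # hypothetical layout
    run_dir.mkdir(parents=True, exist_ok=False)  # refuse to reuse a directory
    (base / "latest.json").write_text(json.dumps({"run_id": run_id}))
    return run_dir


# Two runs of the same subagent: both directories survive, pointer follows the newest.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    first = start_run(root, "sim_1", "delphi", "T1")
    second = start_run(root, "sim_1", "delphi", "T1")
    pointer = json.loads((root / "sim_1" / "interviews" / "T1" / "delphi" / "latest.json").read_text())
    both_exist = first.is_dir() and second.is_dir()
    points_to_second = pointer["run_id"] == second.name

print(both_exist, points_to_second)
```

Keeping the pointer as a small separate file means a re-run can be promoted or rolled back by rewriting latest.json alone.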
8. Validation
Three layers:
- Schema validation: pydantic models for every response; JSONL files validated on write
- Instrument validation: validate_instrument(yaml) pre-flight checks for required fields, scale coherence, no duplicate item_ids, DE+EN both present if i18n enabled
- Plausibility checks on aggregates (flag, don't kill):
  - Longitudinal: >80% zero drift on every item OR >80% flip; likely a prompting bug or acquiescence bias
  - Diversity: first two PCA components explain <30% of variance; instrument not discriminating
  - Delphi: R3 ratings identical to R2 for >90% of agents; no engagement with anonymised feedback
  - Scenario: all agents rate all scenarios identically on desirability; instrument failure
Flags surface in the synthesis report under "instrument health" so the user can decide whether data is publishable.
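Two of the plausibility checks are easy to express as pure functions, sketched below. The thresholds come from the list above, but the exact reading of each rule is an assumption: "zero drift" is taken to mean an agent did not move on any item, and "identical" to mean an agent kept every R2 rating in R3. Function names are hypothetical.

```python
from typing import Dict, List


def share(flags: List[bool]) -> float:
    """Fraction of True values; 0.0 for an empty list."""
    return sum(flags) / len(flags) if flags else 0.0


def zero_drift_flag(deltas: Dict[str, List[int]], threshold: float = 0.8) -> bool:
    """True if more than `threshold` of agents did not move on any item."""
    unmoved = [all(d == 0 for d in row) for row in deltas.values()]
    return share(unmoved) > threshold


def delphi_unchanged_flag(
    r2: Dict[str, List[int]], r3: Dict[str, List[int]], threshold: float = 0.9
) -> bool:
    """True if more than `threshold` of agents kept every R2 rating in R3."""
    same = [r3.get(agent) == ratings for agent, ratings in r2.items()]
    return share(same) > threshold


# Nine of ten agents show no drift at all: above the 80% threshold, so flagged.
deltas = {f"a{i}": [0, 0, 0] for i in range(9)}
deltas["a9"] = [1, 0, -1]

# Nine of ten agents repeat their R2 ratings: exactly 90%, so not flagged.
r2 = {f"a{i}": [3, 4] for i in range(10)}
r3 = dict(r2, a9=[2, 4])

health = {
    "longitudinal_zero_drift": zero_drift_flag(deltas),
    "delphi_unchanged": delphi_unchanged_flag(r2, r3),
}
print(health)
```

The functions return flags rather than raising, matching the "flag, don't kill" rule.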
9. Testing
Unit tests (backend/tests/interviews/):
- test_instruments.py: every YAML parses and validates
- test_base_interviewer.py: persona+memory loading, in-character prompt construction, schema-retry logic (mock LLMClient)
- One file per subagent: happy path + each failure mode in §7
- test_orchestrator.py: fan-out, partial failures, two-phase ordering (T0 before T1)
- test_synthesizer.py: missing-subagent handling, stable output shape
Integration test (tests/integration/test_interview_pipeline.py):
End-to-end with N=5 agents against a recorded LLM cassette. Verifies T0 at READY, T1 + 3 others at COMPLETED, CSV export well-formed, Zep episodes written.
Stub LLM mode (Config.LLM_STUB_MODE=true) returns deterministic canned responses keyed by (subagent, item_id, persona_hash). Full pipeline exercisable in CI for free.
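A deterministic stub could derive its answer from a hash of the key instead of a lookup table. That is a swapped-in technique for illustration, not necessarily how the canned responses will be stored; `persona_hash`, `stub_likert`, and the response shape are hypothetical.

```python
import hashlib
import json


def persona_hash(persona_text: str) -> str:
    """Short, stable fingerprint of a persona description."""
    return hashlib.sha256(persona_text.encode("utf-8")).hexdigest()[:12]


def stub_likert(subagent: str, item_id: str, persona_text: str, scale_max: int = 5) -> str:
    """Return the same JSON answer every time for the same (subagent, item, persona)."""
    key = f"{subagent}|{item_id}|{persona_hash(persona_text)}"
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    value = digest[0] % scale_max + 1  # maps the first byte onto 1..scale_max
    return json.dumps({"item_id": item_id, "value": value})


first = stub_likert("longitudinal", "q1", "Baltic cod fisher, 54, Rügen")
second = stub_likert("longitudinal", "q1", "Baltic cod fisher, 54, Rügen")
print(first == second, json.loads(first)["value"])
```

Because SHA-256 is stable across processes and machines (unlike Python's built-in `hash()` for strings), the same persona produces the same answer in every CI run.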
Zep: disposable graph in integration tests (consistent with project conventions); unit tests stub.
10. Methodological caveats (auto-emitted in synthesis)
The synthesiser always emits a "Limitations" section, programmatically generated from run metadata:
- Simulated, not real stakeholders. Responses reflect how the seed-document discourse + LLM jointly encode each stakeholder type, not what actual fishers / NGO staff would say. The instrument measures the model of the stakeholder, not the stakeholder.
- Memory digest is lossy. Each agent's "experience" of OASIS is summarised to bounded length; agents do not have full episodic recall.
- LLM acquiescence and centrality bias. Likert with LLM respondents skews toward 3–4 of 5; per-item distribution shape statistics are reported.
- N is what it is. n_total and n_responded printed verbatim; no rounding, no smoothing.
- Instrument provenance. Hash of instruments_used.json printed so future-you can rebuild the exact instrument.
This section is load-bearing for any publication: it makes the system intellectually defensible rather than a black box.
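The instrument provenance hash could be computed as in the sketch below. Hashing a canonical JSON serialisation rather than the raw file bytes is an assumption; it makes the hash insensitive to key order and whitespace, which may or may not be what the final design wants. The function name is hypothetical.

```python
import hashlib
import json


def instrument_hash(snapshot: dict) -> str:
    """SHA-256 over a canonical JSON form of the instrument snapshot."""
    canonical = json.dumps(
        snapshot, sort_keys=True, ensure_ascii=False, separators=(",", ":")
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Same content with different key order hashes identically; an edited item does not.
a = {"longitudinal": {"scale": 5, "items": ["q1", "q2"]}}
b = {"longitudinal": {"items": ["q1", "q2"], "scale": 5}}
c = {"longitudinal": {"scale": 5, "items": ["q1", "q3"]}}

print(instrument_hash(a) == instrument_hash(b))
print(instrument_hash(a) == instrument_hash(c))
```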
11. Defaulted decisions (revisit later if needed)
- N agents: assumed 50, driven from existing simulation config; if you typically run more/fewer, cost guardrail threshold needs adjusting
- Default instrument language: German with English fallback in YAML
- Delphi rounds = 3: classic Delphi can run more; 3 is the methodological floor and the cost ceiling here
12. Open questions for implementation phase
- Whether to write a separate instruments_changelog.md per run, or embed change tracking in instruments_used.json metadata
- Whether the synthesiser should write into Zep as a single mega-episode or stay file-only (current design: file-only, plus the per-agent + per-aggregate episodes from each subagent)
- Whether Step4bInterviews.vue should sit strictly after Step4 (current design) or render in parallel: interviews depend on the simulation having reached completed (Step3 output) and on the graph_id (created in Step1); they do not depend on Step4's ReportAgent run, so a parallel layout is technically possible