Stakeholder Interview Subagents — Design Spec

Date: 2026-05-23
Project: MiroFish (multi-agent simulation engine for German fisheries discourse)
Author: Christian Möllmann (with Claude Code)
Status: Approved design — pending implementation plan

1. Purpose

After the OASIS Twitter + Reddit simulation produces a population of in-character stakeholder agents (fishers, NGOs, policy actors, scientists, consumers, etc.) grounded in a German fisheries discourse knowledge graph, we want to interrogate each agent individually with a structured questionnaire about the future of German fisheries.

Four methodologies run as independent subagents over the same agent population:

Longitudinal — pre/post Likert to measure opinion drift induced by simulated peer interaction
Diversity — Q-sort + multi-dim Likert to map the value space and derive a stakeholder typology
Delphi — three-round consensus probing to identify where stakeholder views converge vs. stay polarised
Scenario — rating of 4 pre-defined 2040 scenarios on desirability, plausibility, group-impact, fairness

A synthesiser combines the four outputs into a single cross-method report.

2. Non-goals (v1)

Real-time WebSocket streaming of interview progress (polling suffices)
Adaptive instruments / IRT calibration
Web UI for editing instruments (YAML + restart is fine)
Cross-simulation comparison endpoints (CSV exports support this externally)
Multi-language support beyond DE / EN

3. Architectural approach

Chosen approach: Deterministic instrument runners. Each subagent is a fixed protocol, not a ReAct loop. Rationale: fisheries futures methodology favours instrument fidelity (every stakeholder sees the same scale) over agent autonomy, and results must be directly tabularisable for downstream analysis in pandas/R.

Rejected:

ReAct-style subagents: non-deterministic, ~3–10× cost, and cannot guarantee that every agent answers every item
Single InterviewService with mode enum: couples four distinct methodologies (especially multi-round Delphi and two-phase Longitudinal) into one growing class

4. System architecture

                    InterviewOrchestrator
                          │
   ┌──────────────┬───────┴───────┬──────────────┐
   ▼              ▼               ▼              ▼
Longitudinal   Diversity        Delphi       Scenario
Subagent       Subagent         Subagent     Subagent
   │              │               │              │
   └──────────────┴──────┬────────┴──────────────┘
                         ▼
              StakeholderInterviewer (base)
                         │
       ┌─────────────────┼─────────────────┐
       ▼                 ▼                 ▼
   LLMClient        ZepEntityReader   ProfileLoader
   (in-character)   (memory digest)   (reddit/twitter)
                         │
                         ▼
       uploads/.../interviews/    +    Zep episodes

4.1 New files

Path	Purpose
`backend/app/services/interviews/base.py`	`StakeholderInterviewer` — persona+memory loading, in-character prompting, retry/validation
`backend/app/services/interviews/longitudinal.py`	Pre/post Likert
`backend/app/services/interviews/diversity.py`	Q-sort + multi-dim value-space mapping
`backend/app/services/interviews/delphi.py`	Three-round consensus
`backend/app/services/interviews/scenario.py`	Scenario rating
`backend/app/services/interview_orchestrator.py`	Fan-out, parallel execution, two-phase lifecycle
`backend/app/services/interview_synthesizer.py`	Cross-method narrative report
`backend/app/api/interview.py`	New Flask blueprint `/api/interview/*`
`backend/app/models/interview.py`	Pydantic schemas for instruments + responses
`backend/scripts/instruments/*.yaml`	Editable instrument definitions (one YAML per subagent)
`frontend/src/components/Step4bInterviews.vue`	Four tabs + synthesis tab
`backend/tests/interviews/`	Unit tests per subagent + base + orchestrator + synthesiser
`tests/integration/test_interview_pipeline.py`	End-to-end with stub LLM + disposable Zep graph

4.2 Lifecycle integration

Two hooks are added to backend/app/services/simulation_manager.py:

on_ready() — automatically triggers Longitudinal T0 (pre-simulation baseline)
on_completed() — queues a task_id running Longitudinal T1 + Diversity + Delphi + Scenario in parallel, then Synthesiser

The two-phase split is non-negotiable: Longitudinal needs T0 captured before OASIS exposes agents to peer-generated content, otherwise drift is unmeasurable.

5. Instrument design

All instruments live in backend/scripts/instruments/*.yaml so content is editable without redeploying. Items default to German and are translatable via the existing locale system.
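Because instruments are edited by hand, a pre-flight check of the kind listed in §8 can catch mistakes before any LLM call is made. The following is a minimal sketch, not the project's implementation: it takes an already-parsed YAML document as a dict, and the field names (`subagent`, `scale`, `items`, `item_id`, `text`) are assumptions for illustration.

```python
from collections import Counter

# Hypothetical top-level keys; the real instrument schema may differ.
REQUIRED_FIELDS = ("subagent", "scale", "items")


def validate_instrument(instrument: dict, i18n: bool = True) -> list[str]:
    """Return a list of problems; an empty list means the instrument is usable."""
    problems = [f"missing field: {f}" for f in REQUIRED_FIELDS if f not in instrument]
    if problems:
        return problems

    # Scale coherence: the lower bound must sit below the upper bound.
    scale = instrument["scale"]
    if scale["min"] >= scale["max"]:
        problems.append("scale min must be below scale max")

    # No duplicate item_ids.
    ids = Counter(item["item_id"] for item in instrument["items"])
    problems += [f"duplicate item_id: {i}" for i, n in ids.items() if n > 1]

    # DE and EN wording must both be present when i18n is enabled.
    if i18n:
        for item in instrument["items"]:
            for lang in ("de", "en"):
                if not item.get("text", {}).get(lang):
                    problems.append(f"{item['item_id']}: missing {lang} text")
    return problems
```

Returning a list rather than raising on the first error lets the caller show every problem in one pass, which suits a YAML-plus-restart editing loop.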

5.1 Longitudinal — opinion drift

12–15 item 5-point Likert ("lehne stark ab" → "stimme stark zu")
Administered at T0 (post-persona, pre-OASIS) and T1 (post-OASIS)
Item families (3–4 each): stock status & recovery; governance & CFP; market & MSC; climate & adaptation
Per-agent output: response value + LLM self-reported confidence per item + one open comment
Aggregate: Δ-matrix (N × M items), per-item Wilcoxon signed-rank, per-agent total drift magnitude
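The Δ-matrix and per-agent drift can be sketched with the standard library alone. This sketch assumes that "total drift magnitude" means the sum of absolute per-item shifts, which the spec does not pin down, and it treats `None` as an item-level missing value.

```python
def delta_matrix(t0, t1):
    """Element-wise T1 - T0 for N agents x M items; None marks a missing item."""
    return [
        [None if a is None or b is None else b - a for a, b in zip(row0, row1)]
        for row0, row1 in zip(t0, t1)
    ]


def drift_magnitude(delta_row):
    """Assumed definition: sum of absolute shifts over the items actually answered."""
    return sum(abs(d) for d in delta_row if d is not None)


# Two agents, three items, 5-point Likert.
t0 = [[2, 4, 3], [5, 1, 3]]
t1 = [[3, 4, 1], [5, None, 4]]

deltas = delta_matrix(t0, t1)                           # [[1, 0, -2], [0, None, 1]]
magnitudes = [drift_magnitude(row) for row in deltas]   # [3, 1]
```

For the per-item Wilcoxon signed-rank test, one option is `scipy.stats.wilcoxon` applied to the paired T0 and T1 columns of each item, after dropping agents with a missing value on that item.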

5.2 Diversity — typology mapping

One-shot, post-simulation only
Part A (Q-sort lite): 24 statements sorted onto forced quasi-normal distribution from −3 to +3
Part B: 6 multi-dim Likert axes (preservation↔extraction, local↔EU, science-led↔tradition-led, individual↔collective, short-term↔long-term, market↔regulation)
Per-agent output: vector ∈ ℝ^30
Aggregate: PCA + k-means → 3–5 stakeholder clusters with archetype descriptions + cluster-membership probabilities
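A forced distribution only works if every agent's sort is checked against the quotas. The sketch below assumes a hypothetical quota of 2-3-4-6-4-3-2 across the seven ranks (the spec fixes only the 24 statements and the −3 to +3 range), and it assembles the 30-dimensional per-agent vector from the 24 ranks plus the 6 Likert axes.

```python
from collections import Counter

# Hypothetical quota per rank; it sums to the 24 statements of Part A.
QUOTAS = {-3: 2, -2: 3, -1: 4, 0: 6, 1: 4, 2: 3, 3: 2}


def qsort_violations(placement: dict[str, int]) -> dict[int, int]:
    """Return {rank: surplus} for every rank whose count differs from its quota."""
    counts = Counter(placement.values())
    ranks = set(QUOTAS) | set(counts)  # also catches ranks outside -3..+3
    return {
        r: counts[r] - QUOTAS.get(r, 0)
        for r in ranks
        if counts[r] != QUOTAS.get(r, 0)
    }


def to_vector(placement: dict[str, int], likert_axes: list[int]) -> list[int]:
    """Concatenate the Q-sort ranks (ordered by statement id) with the Likert axes."""
    return [placement[k] for k in sorted(placement)] + list(likert_axes)


# A valid sort: statements q01..q24 filled rank by rank.
ranks = [r for r, quota in QUOTAS.items() for _ in range(quota)]
placement = {f"q{i:02d}": r for i, r in enumerate(ranks, start=1)}
vector = to_vector(placement, [4, 2, 5, 3, 3, 1])
```

A non-empty result from `qsort_violations` is a natural trigger for the "re-ask only the bad items" path in §7.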

5.3 Delphi — consensus probing

Three rounds, fully automated
R1 (open): 4 open questions; LLM extracts thematic codes from responses
R2 (rate): Agent sees anonymised list of all unique themes; rates each on importance (1–5) + plausibility (1–5)
R3 (revise): Agent sees group median + IQR per theme; can revise own ratings; free-text justification
Aggregate: per-theme convergence (Δ-IQR R2→R3), persistent disagreements (IQR > 2), ranked consensus statements
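The aggregate step reduces to a median and an interquartile range per theme and round. The sketch below makes two assumptions the spec leaves open: Δ-IQR is taken as IQR(R3) minus IQR(R2), so negative values mean convergence, and the "IQR > 2" disagreement test is applied to the R3 ratings.

```python
from statistics import median, quantiles


def iqr(values):
    """Interquartile range using inclusive quartiles."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1


def convergence(r2, r3, threshold=2.0):
    """Per-theme convergence summary from round-2 and round-3 ratings."""
    summary = {}
    for theme, ratings_r2 in r2.items():
        ratings_r3 = r3[theme]
        summary[theme] = {
            "median_r3": median(ratings_r3),
            "delta_iqr": iqr(ratings_r3) - iqr(ratings_r2),
            "persistent_disagreement": iqr(ratings_r3) > threshold,
        }
    return summary


# Importance ratings (1-5) from five agents; theme names are made up.
r2 = {"quota_cuts": [1, 2, 3, 4, 5], "msc_label": [1, 1, 5, 5, 5]}
r3 = {"quota_cuts": [2, 3, 3, 3, 4], "msc_label": [1, 1, 5, 5, 5]}

result = convergence(r2, r3)
# quota_cuts: IQR shrinks from 2.0 to 0.0, so delta_iqr is -2.0 (converged)
# msc_label:  IQR stays at 4.0, so it is flagged as a persistent disagreement
```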

5.4 Scenario — futures evaluation

Four 2040 scenarios (YAML-editable):

S1 "Erholung" — cod and herring recover, MSC ubiquitous, small-scale fleet stabilises
S2 "Kollaps" — both stocks collapse, fleet halved, aquaculture dominant
S3 "Festung Europa" — protectionist EU policy, MPAs cover 30%, recreational fishing curtailed
S4 "Privatisierung" — ITQs, consolidation, large operators only

Each agent rates each scenario on 4 dimensions (1–7 Likert): desirability, plausibility, impact-on-my-group, fairness. Plus one open question per scenario: "If you woke up in this 2040, what would you do?"

Aggregate: 4 × 4 per-agent matrix + open-text corpus → polarity charts (desirability × plausibility by stakeholder type), narrative themes.
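One way to derive the polarity chart data is to average desirability and plausibility per stakeholder type and scenario, then label each cell by quadrant. This is a sketch under assumptions: the record keys and the use of 4 as the midpoint of the 1–7 scale are illustrative choices, not part of the spec.

```python
from collections import defaultdict
from statistics import mean


def polarity(responses):
    """Mean (desirability, plausibility) per (stakeholder_type, scenario)."""
    cells = defaultdict(list)
    for r in responses:
        key = (r["stakeholder_type"], r["scenario"])
        cells[key].append((r["desirability"], r["plausibility"]))
    return {
        key: (mean(d for d, _ in pairs), mean(p for _, p in pairs))
        for key, pairs in cells.items()
    }


def quadrant(desirability, plausibility, midpoint=4):
    """Label a cell relative to the scale midpoint (assumed to be 4 on 1-7)."""
    want = "desired" if desirability > midpoint else "undesired"
    likely = "plausible" if plausibility > midpoint else "implausible"
    return f"{want}/{likely}"


responses = [
    {"stakeholder_type": "fisher", "scenario": "S1", "desirability": 7, "plausibility": 3},
    {"stakeholder_type": "fisher", "scenario": "S1", "desirability": 5, "plausibility": 3},
    {"stakeholder_type": "ngo", "scenario": "S4", "desirability": 2, "plausibility": 6},
]
matrix = polarity(responses)
labels = {key: quadrant(*cell) for key, cell in matrix.items()}
```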

5.5 Cross-cutting

In-character prompting. Every LLM call uses a system prompt of the form:

You are [persona_text]. You are answering a survey about the future of German fisheries. Answer strictly in character based on your background, values, and what you experienced during the simulated social media discourse summarised below: [Zep memory digest]. Return JSON only.

Memory digest comes from ZepEntityReader.get_entity_with_context().

Structured output enforced. Every response goes through LLMClient.chat_json() with a per-instrument JSON schema. One auto-retry on schema violation; agent flagged in audit log on second failure.
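The prompt template and the one-retry rule can be sketched as follows. The `chat_json` parameter here is a hypothetical callable taking `(system_prompt, user_prompt)` and returning a dict; the real `LLMClient.chat_json()` signature is not specified in this document. The `validate` callable stands in for the per-instrument schema check.

```python
PROMPT = (
    "You are {persona}. You are answering a survey about the future of German "
    "fisheries. Answer strictly in character based on your background, values, "
    "and what you experienced during the simulated social media discourse "
    "summarised below: {digest}. Return JSON only."
)


def build_system_prompt(persona: str, digest: str) -> str:
    return PROMPT.format(persona=persona, digest=digest)


def ask(chat_json, system_prompt, question, validate, audit):
    """Ask once, retry once with a corrective instruction, then give up.

    Returns the valid reply, or None when the agent should be flagged as failed.
    Every attempt is appended to `audit`, valid or not.
    """
    user_prompt = question
    for attempt in (1, 2):
        reply = chat_json(system_prompt, user_prompt)
        error = validate(reply)  # None means the reply satisfies the schema
        audit.append({"attempt": attempt, "reply": reply, "error": error})
        if error is None:
            return reply
        user_prompt = (
            f"{question}\n\nYour previous answer was invalid ({error}). "
            "Return corrected JSON only."
        )
    return None
```

Keeping the system prompt untouched on the retry and putting the correction in the user turn preserves the constant prefix that makes persona prompts cacheable.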

Cost guardrails. The heaviest load comes from Longitudinal (2 phases) and Delphi (3 rounds). For N=50 agents and ~100 LLM calls per agent across all 4 subagents, budget ~5k calls (50 × 100) and 5–10M tokens per simulation, i.e. roughly 1–2k tokens per call. Persona system prompts stay constant within a subagent run, so they are cacheable.
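The budget arithmetic is simple enough to compute before a run starts. The sketch below reproduces the figures above and adds a hypothetical `within_budget` check in the spirit of the token budget guard in §6.4; the 8M limit is an arbitrary example value.

```python
def estimate(n_agents: int, calls_per_agent: int, tokens_per_call: int):
    """Return (total calls, total tokens) for one simulation."""
    calls = n_agents * calls_per_agent
    return calls, calls * tokens_per_call


def within_budget(projected_tokens: int, max_tokens: int, confirm: bool = False) -> bool:
    """A run may start if it fits the budget or the caller explicitly confirms."""
    return confirm or projected_tokens <= max_tokens


calls, low = estimate(50, 100, 1_000)    # 5_000 calls, 5_000_000 tokens
_, high = estimate(50, 100, 2_000)       # 10_000_000 tokens

MAX_TOKENS = 8_000_000                   # example limit, not a project default
ok_low = within_budget(low, MAX_TOKENS)                      # True
ok_high = within_budget(high, MAX_TOKENS)                    # False
ok_override = within_budget(high, MAX_TOKENS, confirm=True)  # True
```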

6. Data flow and storage

6.1 Storage layout

uploads/simulations/{sim_id}/interviews/
├── instruments_used.json          # frozen snapshot of YAML at run-time
├── T0/
│   └── longitudinal/
│       ├── responses.jsonl
│       ├── audit.jsonl            # raw LLM I/O, retries, validation failures
│       └── aggregate.json
├── T1/
│   ├── longitudinal/{same structure}
│   ├── diversity/
│   │   ├── responses.jsonl
│   │   ├── typology.json
│   │   └── pca.json
│   ├── delphi/
│   │   ├── round1_themes.jsonl
│   │   ├── round2_ratings.jsonl
│   │   ├── round3_revisions.jsonl
│   │   └── convergence.json
│   └── scenario/
│       ├── responses.jsonl
│       └── polarity_matrix.json
└── synthesis/
    ├── report.md
    └── exports/
        ├── all_responses.csv      # tidy long format
        └── codebook.json

JSONL for raw responses (append-safe, streams cleanly); JSON for aggregates; CSV for analysis hand-off. The instruments_used.json snapshot is critical for reproducibility when the YAML is later edited.
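A minimal sketch of the two write paths, assuming the instruments have already been parsed into a dict. Serialising with sorted keys makes the snapshot hash independent of key order, so the same instrument content always yields the same hash; the function names are illustrative.

```python
import hashlib
import json
from pathlib import Path


def append_jsonl(path: Path, record: dict) -> None:
    """Append one response as a single JSON line, creating folders as needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")


def snapshot_instruments(interviews_root: Path, instruments: dict) -> str:
    """Freeze the instruments used for this run and return their SHA-256 hash."""
    payload = json.dumps(instruments, sort_keys=True, ensure_ascii=False).encode("utf-8")
    interviews_root.mkdir(parents=True, exist_ok=True)
    (interviews_root / "instruments_used.json").write_bytes(payload)
    return hashlib.sha256(payload).hexdigest()
```

`ensure_ascii=False` keeps German item text such as "lehne stark ab" and umlauts readable in the files instead of escaping them.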

6.2 Zep integration

Two write patterns, both reusing ZepGraphMemoryUpdater.add_activity():

Per-agent episode — after each subagent finishes for an agent, write one episode: "Agent {name} (interview/{subagent}/{phase}): {short summary of stance}". The existing ReportAgent can retrieve interview content via its current panorama_search / insight_forge tools without changes.
Aggregate episodes — after each subagent's aggregate step, write one summary episode per cluster / theme / scenario.

No new Zep schemas. No new entity types. Interviews are just more episodes — append-only, safe.

6.3 API surface

New blueprint /api/interview:

Method	Path	Purpose
`POST`	`/api/interview/{sim_id}/pre`	Trigger T0 longitudinal (auto on READY, manual for re-runs)
`POST`	`/api/interview/{sim_id}/post`	Trigger all 4 post-sim subagents; returns `task_id`
`GET`	`/api/interview/{sim_id}/status?task_id=...`	Per-subagent progress
`GET`	`/api/interview/{sim_id}/results/{subagent}`	Aggregate JSON for one subagent
`GET`	`/api/interview/{sim_id}/results/synthesis`	Full synthesis report
`GET`	`/api/interview/{sim_id}/export.csv`	Tidy long-format CSV across all 4 subagents
`POST`	`/api/interview/{sim_id}/rerun`	Re-run one subagent (e.g. after editing YAML)

All responses follow the existing {success, data, error} envelope. Polling reuses models/task.py.

6.4 Parallelism

Within a subagent: ThreadPoolExecutor(max_workers=8) for per-agent LLM calls
Across the 4 post-sim subagents: all four run in parallel; only Delphi's three rounds run sequentially inside its own subagent, since each round depends on the previous one
Synthesiser waits for all four
Token budget guard: Config.INTERVIEW_MAX_TOKENS_PER_RUN; if the projected cost exceeds it, the API returns 400 with a dry-run estimate, and resubmitting with confirm=true overrides the guard
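The per-agent fan-out inside one subagent could look like the sketch below. `interview_one` is a hypothetical callable that runs the full instrument for a single agent and raises on failure; collecting exceptions per future is what keeps one agent's failure from stopping the rest.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_subagent(agent_ids, interview_one, max_workers=8):
    """Interview every agent in parallel and isolate per-agent failures."""
    responses, failed = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(interview_one, a): a for a in agent_ids}
        for future in as_completed(futures):
            agent_id = futures[future]
            try:
                responses[agent_id] = future.result()
            except Exception as exc:  # recorded, never re-raised
                failed[agent_id] = repr(exc)
    return responses, failed
```

Threads fit here because the work is I/O-bound waiting on LLM responses. The returned pair maps directly onto the `n_responded / n_total` reporting described in §7.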

6.5 Frontend

New Step4bInterviews.vue between current Step4 (report) and Step5 (interaction). Four tabs (one per subagent) + a synthesis tab. Each tab shows progress bar during run, then results: Likert heatmap (longitudinal Δ), PCA scatter (diversity), convergence chart (Delphi), polarity quadrants (scenario). Download button per tab pulls the CSV export.

7. Error handling

Per-agent failures are isolated. If agent 17 times out or fails JSON validation twice, agent 17 is marked failed in audit.jsonl; the rest of the run continues. Aggregates report n_responded / n_total honestly.

Failure	Handling
LLM timeout / 5xx	Exponential-backoff retry (3 attempts) via existing `LLMClient`; then mark agent failed
JSON schema violation	One auto-retry with explicit corrective instruction; then mark failed
Likert out-of-range / missing items	Re-ask only the bad items; if still bad, item-level missing
Zep memory fetch fails	Run without memory digest; flag in audit (`memory_available: false`); down-weight in drift analysis
Whole-subagent crash	Other 3 continue; synthesiser runs on what completed and flags the gap
Token budget exceeded mid-run	Pause, write partial results, return 503 with `resume_token`

Idempotency. Every subagent run is keyed by (sim_id, subagent, phase, run_id). Re-runs write a new run_id directory; never overwrite. A latest.json pointer tracks the canonical run.
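A sketch of the never-overwrite rule, assuming each run gets its own directory under `{phase}/{subagent}/` and that `latest.json` sits beside the run directories. That placement is an assumption, since the layout in §6.1 does not show where `run_id` lives.

```python
import json
import uuid
from pathlib import Path


def start_run(interviews_root: Path, phase: str, subagent: str) -> Path:
    """Create a fresh run directory; an existing directory is never reused."""
    run_dir = interviews_root / phase / subagent / uuid.uuid4().hex
    run_dir.mkdir(parents=True, exist_ok=False)
    return run_dir


def mark_canonical(run_dir: Path) -> None:
    """Point latest.json at this run, swapping the file in a single rename."""
    pointer = run_dir.parent / "latest.json"
    tmp = pointer.with_name("latest.json.tmp")
    tmp.write_text(json.dumps({"run_id": run_dir.name}), encoding="utf-8")
    tmp.replace(pointer)
```

Writing to a temporary file and renaming it means a reader polling `latest.json` sees either the old pointer or the new one, never a half-written file.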

8. Validation

Three layers:

Schema validation — pydantic models for every response; JSONL files validated on write
Instrument validation — validate_instrument(yaml) pre-flight: required fields, scale coherence, no duplicate item_ids, DE+EN both present if i18n enabled
Plausibility checks on aggregates (flag, don't kill):
- Longitudinal: >80% zero drift on every item OR >80% flip — likely a prompting bug or acquiescence bias
- Diversity: first two PCA components explain <30% of variance — instrument not discriminating
- Delphi: R3 ratings identical to R2 for >90% of agents — no engagement with anonymised feedback
- Scenario: all agents rate all scenarios identically on desirability — instrument failure
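Three of these checks can be sketched as small predicates that return flags without stopping the run. The interpretations are assumptions: the scenario check here fires when every agent submits the same desirability vector, and the data shapes (`deltas` as an N × M list, ratings keyed by agent id) are illustrative.

```python
def longitudinal_frozen(deltas, threshold=0.8):
    """True when, on every item, more than `threshold` of agents show zero drift."""
    n_agents, n_items = len(deltas), len(deltas[0])
    return all(
        sum(1 for row in deltas if row[j] == 0) / n_agents > threshold
        for j in range(n_items)
    )


def delphi_no_engagement(r2, r3, threshold=0.9):
    """True when more than `threshold` of agents left all R2 ratings unchanged."""
    unchanged = sum(1 for agent, ratings in r2.items() if r3.get(agent) == ratings)
    return unchanged / len(r2) > threshold


def scenario_flat(desirability):
    """True when every agent submitted the same desirability vector."""
    return len({tuple(row) for row in desirability.values()}) == 1


def health_flags(deltas, r2, r3, desirability):
    """Collect the names of all raised flags; an empty list means no warnings."""
    checks = {
        "longitudinal_zero_drift": longitudinal_frozen(deltas),
        "delphi_no_engagement": delphi_no_engagement(r2, r3),
        "scenario_no_variance": scenario_flat(desirability),
    }
    return [name for name, raised in checks.items() if raised]


flags = health_flags(
    deltas=[[0, 0], [0, 0], [0, 1]],
    r2={"a1": {"theme_1": 3}, "a2": {"theme_1": 4}},
    r3={"a1": {"theme_1": 3}, "a2": {"theme_1": 4}},
    desirability={"a1": [7, 1, 2, 3], "a2": [6, 1, 2, 3]},
)
# flags == ["delphi_no_engagement"]
```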

Flags surface in the synthesis report under "instrument health" so the user can decide whether data is publishable.

9. Testing

Unit tests (backend/tests/interviews/):

test_instruments.py — every YAML parses and validates
test_base_interviewer.py — persona+memory loading, in-character prompt construction, schema-retry logic (mock LLMClient)
One file per subagent — happy path + each failure mode in §7
test_orchestrator.py — fan-out, partial failures, two-phase ordering (T0 before T1)
test_synthesizer.py — missing-subagent handling, stable output shape

Integration test (tests/integration/test_interview_pipeline.py):

End-to-end with N=5 agents against a recorded LLM cassette. Verifies T0 at READY, T1 + 3 others at COMPLETED, CSV export well-formed, Zep episodes written.

Stub LLM mode (Config.LLM_STUB_MODE=true) returns deterministic canned responses keyed by (subagent, item_id, persona_hash). Full pipeline exercisable in CI for free.
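One way to get deterministic stub answers without maintaining a table of canned responses is to derive the value from a hash of the key. This swaps a lookup table for a hash-based generator, so it illustrates the keying idea rather than the planned stub; `stub_likert` is a hypothetical name.

```python
import hashlib


def stub_likert(subagent: str, item_id: str, persona_hash: str, scale_max: int = 5) -> int:
    """Map (subagent, item_id, persona_hash) to a stable value in 1..scale_max."""
    key = f"{subagent}|{item_id}|{persona_hash}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    return 1 + digest[0] % scale_max


first = stub_likert("longitudinal", "gov_1", "ab12cd")
again = stub_likert("longitudinal", "gov_1", "ab12cd")
# first == again on every machine and every run
```

`hashlib` is used instead of the built-in `hash()` because string hashing in Python is randomised per process, which would break reproducibility across CI runs.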

Zep: disposable graph in integration tests (consistent with project conventions); unit tests stub.

10. Methodological caveats (auto-emitted in synthesis)

The synthesiser always emits a "Limitations" section, programmatically generated from run metadata:

Simulated, not real stakeholders. Responses reflect how the seed-document discourse + LLM jointly encode each stakeholder type, not what actual fishers / NGO staff would say. The instrument measures the model of the stakeholder, not the stakeholder.
Memory digest is lossy. Each agent's "experience" of OASIS is summarised to bounded length; agents do not have full episodic recall.
LLM acquiescence and centrality bias. Likert with LLM respondents skews toward 3–4 of 5; per-item distribution shape statistics are reported.
N is what it is. n_total and n_responded printed verbatim; no rounding, no smoothing.
Instrument provenance. Hash of instruments_used.json printed so future-you can rebuild the exact instrument.

This section is load-bearing for any publication: it makes the system intellectually defensible rather than a black box.

11. Defaulted decisions (revisit later if needed)

N agents: assumed 50, driven from existing simulation config; if you typically run more or fewer, the cost guardrail threshold needs adjusting
Default instrument language: German with English fallback in YAML
Delphi rounds = 3: classic Delphi can run more; 3 is the methodological floor and the cost ceiling here

12. Open questions for implementation phase

Whether to write a separate instruments_changelog.md per run, or embed change tracking in instruments_used.json metadata
Whether the synthesiser should write into Zep as a single mega-episode or stay file-only (current design: file-only, plus the per-agent + per-aggregate episodes from each subagent)
Whether Step4bInterviews.vue should sit strictly after Step4 (current design) or render in parallel — interviews depend on the simulation having reached completed (Step3 output) and on the graph_id (created in Step1); they do not depend on Step4's ReportAgent run, so a parallel layout is technically possible
