MicroFish/docs/superpowers/specs/2026-05-23-stakeholder-inte...

16 KiB
Raw Blame History

Stakeholder Interview Subagents — Design Spec

  • Date: 2026-05-23
  • Project: MiroFish (multi-agent simulation engine for German fisheries discourse)
  • Author: Christian Möllmann (with Claude Code)
  • Status: Approved design — pending implementation plan

1. Purpose

After the OASIS Twitter + Reddit simulation produces a population of in-character stakeholder agents (fishers, NGOs, policy actors, scientists, consumers, etc.) grounded in a German fisheries discourse knowledge graph, we want to interrogate each agent individually with a structured questionnaire about the future of German fisheries.

Four methodologies run as independent subagents over the same agent population:

  1. Longitudinal — pre/post Likert to measure opinion drift induced by simulated peer interaction
  2. Diversity — Q-sort + multi-dim Likert to map the value space and derive a stakeholder typology
  3. Delphi — three-round consensus probing to identify where stakeholder views converge vs. stay polarised
  4. Scenario — rating of 4 pre-defined 2040 scenarios on desirability, plausibility, group-impact, fairness

A synthesiser combines the four outputs into a single cross-method report.

2. Non-goals (v1)

  • Real-time WebSocket streaming of interview progress (polling suffices)
  • Adaptive instruments / IRT calibration
  • Web UI for editing instruments (YAML + restart is fine)
  • Cross-simulation comparison endpoints (CSV exports support this externally)
  • Multi-language support beyond DE / EN

3. Architectural approach

Chosen approach: Deterministic instrument runners. Each subagent is a fixed protocol, not a ReACT loop. Rationale: fisheries futures methodology favours instrument fidelity (every stakeholder sees the same scale) over agent autonomy; results must be directly tabularisable for downstream analysis in pandas/R.

Rejected:

  • ReACT-style subagents — non-deterministic, ~310× cost, can't guarantee every agent answered every item
  • Single InterviewService with mode enum — couples four distinct methodologies (especially multi-round Delphi and two-phase Longitudinal) into one growing class

4. System architecture

                    InterviewOrchestrator
                          │
   ┌──────────────┬───────┴───────┬──────────────┐
   ▼              ▼               ▼              ▼
Longitudinal   Diversity        Delphi       Scenario
Subagent       Subagent         Subagent     Subagent
   │              │               │              │
   └──────────────┴──────┬────────┴──────────────┘
                         ▼
              StakeholderInterviewer (base)
                         │
       ┌─────────────────┼─────────────────┐
       ▼                 ▼                 ▼
   LLMClient        ZepEntityReader   ProfileLoader
   (in-character)   (memory digest)   (reddit/twitter)
                         │
                         ▼
       uploads/.../interviews/    +    Zep episodes

4.1 New files

Path Purpose
backend/app/services/interviews/base.py StakeholderInterviewer — persona+memory loading, in-character prompting, retry/validation
backend/app/services/interviews/longitudinal.py Pre/post Likert
backend/app/services/interviews/diversity.py Q-sort + multi-dim value-space mapping
backend/app/services/interviews/delphi.py Three-round consensus
backend/app/services/interviews/scenario.py Scenario rating
backend/app/services/interview_orchestrator.py Fan-out, parallel execution, two-phase lifecycle
backend/app/services/interview_synthesizer.py Cross-method narrative report
backend/app/api/interview.py New Flask blueprint /api/interview/*
backend/app/models/interview.py Pydantic schemas for instruments + responses
backend/scripts/instruments/*.yaml Editable instrument definitions (one YAML per subagent)
frontend/src/components/Step4bInterviews.vue Four tabs + synthesis tab
backend/tests/interviews/ Unit tests per subagent + base + orchestrator + synthesiser
tests/integration/test_interview_pipeline.py End-to-end with stub LLM + disposable Zep graph

4.2 Lifecycle integration

Two hooks added to backend/app/services/simulation_manager.py:

  • on_ready() — automatically triggers Longitudinal T0 (pre-simulation baseline)
  • on_completed() — queues a task_id running Longitudinal T1 + Diversity + Delphi + Scenario in parallel, then Synthesiser

The two-phase split is non-negotiable: Longitudinal needs T0 captured before OASIS exposes agents to peer-generated content, otherwise drift is unmeasurable.

5. Instrument design

All instruments live in backend/scripts/instruments/*.yaml so content is editable without redeploying. Items default to German, translatable via existing locale system.

5.1 Longitudinal — opinion drift

  • 1215 item 5-point Likert ("lehne stark ab" → "stimme stark zu")
  • Administered at T0 (post-persona, pre-OASIS) and T1 (post-OASIS)
  • Item families (34 each): stock status & recovery; governance & CFP; market & MSC; climate & adaptation
  • Per-agent output: response value + LLM self-reported confidence per item + one open comment
  • Aggregate: Δ-matrix (N × M items), per-item Wilcoxon signed-rank, per-agent total drift magnitude

5.2 Diversity — typology mapping

  • One-shot, post-simulation only
  • Part A (Q-sort lite): 24 statements sorted onto forced quasi-normal distribution from 3 to +3
  • Part B: 6 multi-dim Likert axes (preservation↔extraction, local↔EU, science-led↔tradition-led, individual↔collective, short-term↔long-term, market↔regulation)
  • Per-agent output: vector ∈ ^30
  • Aggregate: PCA + k-means → 35 stakeholder clusters with archetype descriptions + cluster-membership probabilities

5.3 Delphi — consensus probing

  • Three rounds, fully automated
  • R1 (open): 4 open questions; LLM extracts thematic codes from responses
  • R2 (rate): Agent sees anonymised list of all unique themes; rates each on importance (15) + plausibility (15)
  • R3 (revise): Agent sees group median + IQR per theme; can revise own ratings; free-text justification
  • Aggregate: per-theme convergence (Δ-IQR R2→R3), persistent disagreements (IQR > 2), ranked consensus statements

5.4 Scenario — futures evaluation

Four 2040 scenarios (YAML-editable):

  • S1 "Erholung" — cod and herring recover, MSC ubiquitous, small-scale fleet stabilises
  • S2 "Kollaps" — both stocks collapse, fleet halved, aquaculture dominant
  • S3 "Festung Europa" — protectionist EU policy, MPAs cover 30%, recreational fishing curtailed
  • S4 "Privatisierung" — ITQs, consolidation, large operators only

Each agent rates each scenario on 4 dimensions (17 Likert): desirability, plausibility, impact-on-my-group, fairness. Plus one open question per scenario: "If you woke up in this 2040, what would you do?"

Aggregate: 4 × 4 per-agent matrix + open-text corpus → polarity charts (desirability × plausibility by stakeholder type), narrative themes.

5.5 Cross-cutting

In-character prompting. Every LLM call uses a system prompt of the form:

You are [persona_text]. You are answering a survey about the future of German fisheries. Answer strictly in character based on your background, values, and what you experienced during the simulated social media discourse summarised below: [Zep memory digest]. Return JSON only.

Memory digest comes from ZepEntityReader.get_entity_with_context().

Structured output enforced. Every response goes through LLMClient.chat_json() with a per-instrument JSON schema. One auto-retry on schema violation; agent flagged in audit log on second failure.

Cost guardrails. Longitudinal × 2 phases + Delphi × 3 rounds is heaviest. For N=50 agents and ~100 LLM calls per agent across all 4 subagents, budget ~5k calls / 510M tokens per simulation. Persona system prompts stay constant within a subagent run → cacheable.

6. Data flow and storage

6.1 Storage layout

uploads/simulations/{sim_id}/interviews/
├── instruments_used.json          # frozen snapshot of YAML at run-time
├── T0/
│   └── longitudinal/
│       ├── responses.jsonl
│       ├── audit.jsonl            # raw LLM I/O, retries, validation failures
│       └── aggregate.json
├── T1/
│   ├── longitudinal/{same structure}
│   ├── diversity/
│   │   ├── responses.jsonl
│   │   ├── typology.json
│   │   └── pca.json
│   ├── delphi/
│   │   ├── round1_themes.jsonl
│   │   ├── round2_ratings.jsonl
│   │   ├── round3_revisions.jsonl
│   │   └── convergence.json
│   └── scenario/
│       ├── responses.jsonl
│       └── polarity_matrix.json
└── synthesis/
    ├── report.md
    └── exports/
        ├── all_responses.csv      # tidy long format
        └── codebook.json

JSONL for raw responses (append-safe, streams cleanly); JSON for aggregates; CSV for analysis hand-off. instruments_used.json snapshot is critical for reproducibility when YAML is later edited.

6.2 Zep integration

Two write patterns, both reusing ZepGraphMemoryUpdater.add_activity():

  • Per-agent episode — after each subagent finishes for an agent, write one episode: "Agent {name} (interview/{subagent}/{phase}): {short summary of stance}". The existing ReportAgent can retrieve interview content via its current panorama_search / insight_forge tools without changes.
  • Aggregate episodes — after each subagent's aggregate step, write one summary episode per cluster / theme / scenario.

No new Zep schemas. No new entity types. Interviews are just more episodes — append-only, safe.

6.3 API surface

New blueprint /api/interview:

Method Path Purpose
POST /api/interview/{sim_id}/pre Trigger T0 longitudinal (auto on READY, manual for re-runs)
POST /api/interview/{sim_id}/post Trigger all 4 post-sim subagents; returns task_id
GET /api/interview/{sim_id}/status?task_id=... Per-subagent progress
GET /api/interview/{sim_id}/results/{subagent} Aggregate JSON for one subagent
GET /api/interview/{sim_id}/results/synthesis Full synthesis report
GET /api/interview/{sim_id}/export.csv Tidy long-format CSV across all 4 subagents
POST /api/interview/{sim_id}/rerun Re-run one subagent (e.g. after editing YAML)

All responses follow the existing {success, data, error} envelope. Polling reuses models/task.py.

6.4 Parallelism

  • Within a subagent: ThreadPoolExecutor(max_workers=8) for per-agent LLM calls
  • Across the 4 post-sim subagents: parallel, except Delphi (sequential rounds internally)
  • Synthesiser waits for all four
  • Token budget guard: Config.INTERVIEW_MAX_TOKENS_PER_RUN; if projected cost exceeds, API returns 400 with dry-run estimate and confirm=true override

6.5 Frontend

New Step4bInterviews.vue between current Step4 (report) and Step5 (interaction). Four tabs (one per subagent) + a synthesis tab. Each tab shows progress bar during run, then results: Likert heatmap (longitudinal Δ), PCA scatter (diversity), convergence chart (Delphi), polarity quadrants (scenario). Download button per tab pulls the CSV export.

7. Error handling

Per-agent failures are isolated. If agent 17 times out or fails JSON validation twice, agent 17 is marked failed in audit.jsonl; the rest of the run continues. Aggregates report n_responded / n_total honestly.

Failure Handling
LLM timeout / 5xx Exponential-backoff retry (3 attempts) via existing LLMClient; then mark agent failed
JSON schema violation One auto-retry with explicit corrective instruction; then mark failed
Likert out-of-range / missing items Re-ask only the bad items; if still bad, item-level missing
Zep memory fetch fails Run without memory digest; flag in audit (memory_available: false); down-weight in drift analysis
Whole-subagent crash Other 3 continue; synthesiser runs on what completed and flags the gap
Token budget exceeded Pause, write partial results, return 503 with resume_token

Idempotency. Every subagent run is keyed by (sim_id, subagent, phase, run_id). Re-runs write a new run_id directory; never overwrite. A latest.json pointer tracks the canonical run.

8. Validation

Three layers:

  1. Schema validation — pydantic models for every response; JSONL files validated on write
  2. Instrument validationvalidate_instrument(yaml) pre-flight: required fields, scale coherence, no duplicate item_ids, DE+EN both present if i18n enabled
  3. Plausibility checks on aggregates (flag, don't kill):
    • Longitudinal: >80% zero drift on every item OR >80% flip — likely a prompting bug or acquiescence bias
    • Diversity: first two PCA components explain <30% of variance — instrument not discriminating
    • Delphi: R3 ratings identical to R2 for >90% of agents — no engagement with anonymised feedback
    • Scenario: all agents rate all scenarios identically on desirability — instrument failure

Flags surface in the synthesis report under "instrument health" so the user can decide whether data is publishable.

9. Testing

Unit tests (backend/tests/interviews/):

  • test_instruments.py — every YAML parses and validates
  • test_base_interviewer.py — persona+memory loading, in-character prompt construction, schema-retry logic (mock LLMClient)
  • One file per subagent — happy path + each failure mode in §7
  • test_orchestrator.py — fan-out, partial failures, two-phase ordering (T0 before T1)
  • test_synthesizer.py — missing-subagent handling, stable output shape

Integration test (tests/integration/test_interview_pipeline.py):

End-to-end with N=5 agents against a recorded LLM cassette. Verifies T0 at READY, T1 + 3 others at COMPLETED, CSV export well-formed, Zep episodes written.

Stub LLM mode (Config.LLM_STUB_MODE=true) returns deterministic canned responses keyed by (subagent, item_id, persona_hash). Full pipeline exercisable in CI for free.

Zep: disposable graph in integration tests (consistent with project conventions); unit tests stub.

10. Methodological caveats (auto-emitted in synthesis)

The synthesiser always emits a "Limitations" section, programmatically generated from run metadata:

  • Simulated, not real stakeholders. Responses reflect how the seed-document discourse + LLM jointly encode each stakeholder type, not what actual fishers / NGO staff would say. The instrument measures the model of the stakeholder, not the stakeholder.
  • Memory digest is lossy. Each agent's "experience" of OASIS is summarised to bounded length; agents do not have full episodic recall.
  • LLM acquiescence and centrality bias. Likert with LLM respondents skews toward 34 of 5; per-item distribution shape statistics are reported.
  • N is what it is. n_total and n_responded printed verbatim; no rounding, no smoothing.
  • Instrument provenance. Hash of instruments_used.json printed so future-you can rebuild the exact instrument.

This section is load-bearing for any publication: it makes the system intellectually defensible rather than a black box.

11. Defaulted decisions (revisit later if needed)

  • N agents: assumed 50, driven from existing simulation config; if you typically run more/fewer, cost guardrail threshold needs adjusting
  • Default instrument language: German with English fallback in YAML
  • Delphi rounds = 3: classic Delphi can run more; 3 is the methodological floor and the cost ceiling here

12. Open questions for implementation phase

  • Whether to write a separate instruments_changelog.md per run, or embed change tracking in instruments_used.json metadata
  • Whether the synthesiser should write into Zep as a single mega-episode or stay file-only (current design: file-only, plus the per-agent + per-aggregate episodes from each subagent)
  • Whether Step4bInterviews.vue should sit strictly after Step4 (current design) or render in parallel — interviews depend on the simulation having reached completed (Step3 output) and on the graph_id (created in Step1); they do not depend on Step4's ReportAgent run, so a parallel layout is technically possible