Stakeholder Interview Subagents Implementation Plan

For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (- [ ]) syntax for tracking.

Goal: Build a four-subagent post-simulation interview system (Longitudinal, Diversity, Delphi, Scenario) over MiroFish-simulated stakeholders, plus a cross-method synthesiser, exposed via /api/interview and rendered in a new Vue Step4b.

Architecture: Deterministic instrument runners (not ReAct). A shared StakeholderInterviewer base loads the persona plus a Zep memory digest and administers per-instrument, JSON-schema-validated prompts via the existing LLMClient. Each of the four subagents owns its instrument YAML and output schema. InterviewOrchestrator fans out parallel post-simulation execution; InterviewSynthesizer aggregates. Files: backend Python services plus a new Flask blueprint; frontend: a new Vue component with d3 visualisations.

Tech Stack: Python 3.12, Flask, pydantic v2, PyYAML, scikit-learn (PCA, k-means), scipy (Wilcoxon), numpy, pytest; Vue 3, axios, d3 v7, vue-i18n.

Spec: docs/superpowers/specs/2026-05-23-stakeholder-interview-subagents-design.md
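
The fan-out mentioned under Architecture can be pictured with a small thread-pool sketch. This is only an illustration under stated assumptions: `fan_out`, `MAX_WORKERS`, and the lambda "subagents" are hypothetical names invented here, and the real InterviewOrchestrator may be structured differently.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

MAX_WORKERS = 8  # assumption: mirrors the INTERVIEW_MAX_WORKERS default

def fan_out(subagents: dict, personas: list) -> dict:
    """Run every (subagent, persona) pair in a thread pool, grouped by subagent name."""
    results = {name: [] for name in subagents}
    errors = []
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        futures = {
            pool.submit(administer, persona): (name, persona)
            for name, administer in subagents.items()
            for persona in personas
        }
        for fut in as_completed(futures):
            name, persona = futures[fut]
            try:
                results[name].append(fut.result())
            except Exception as exc:
                # One failed interview is recorded instead of aborting the run
                errors.append((name, persona, str(exc)))
    return {"results": results, "errors": errors}

# Hypothetical stand-ins for subagent.administer callables
demo = fan_out(
    {"longitudinal": lambda p: ("longitudinal", p), "delphi": lambda p: ("delphi", p)},
    personas=[1, 2, 3],
)
print(sorted(demo["results"]["delphi"]), demo["errors"])
```

Completion order is not deterministic with `as_completed`, which is why the demo sorts before printing.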

Phase 0 — Setup

Task 0: Add deps and pytest scaffold

Files:

Modify: backend/pyproject.toml
Create: backend/tests/__init__.py
Create: backend/tests/conftest.py
Create: backend/pytest.ini

Step 1: Add deps to backend/pyproject.toml

In the dependencies array (after pydantic>=2.0.0), add:

    "PyYAML>=6.0",
    "scikit-learn>=1.4",
    "scipy>=1.12",
    "numpy>=1.26",
    "pandas>=2.1",

Step 2: Create backend/pytest.ini

[pytest]
testpaths = tests
python_files = test_*.py
python_classes = Test*
python_functions = test_*
addopts = -ra --strict-markers
markers =
    integration: marks integration tests (deselect with -m 'not integration')

Step 3: Create backend/tests/__init__.py

Empty file.

Step 4: Create backend/tests/conftest.py

import os
import sys
import pathlib
import pytest

ROOT = pathlib.Path(__file__).resolve().parents[1]
sys.path.insert(0, str(ROOT))

os.environ.setdefault("LLM_API_KEY", "test")
os.environ.setdefault("LLM_BASE_URL", "https://example.invalid")
os.environ.setdefault("LLM_MODEL_NAME", "test-model")
os.environ.setdefault("ZEP_API_KEY", "test")

@pytest.fixture
def tmp_uploads(tmp_path, monkeypatch):
    monkeypatch.setenv("UPLOADS_DIR", str(tmp_path))
    return tmp_path

Step 5: Install + verify

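Run: cd backend && uv sync --python 3.12 && uv run pytest -q
Expected: no tests collected and no failures. pytest exits with code 5 when it collects nothing, so a non-zero status here is normal. This confirms the infrastructure works.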

Step 6: Commit

git add backend/pyproject.toml backend/uv.lock backend/pytest.ini backend/tests/__init__.py backend/tests/conftest.py
git commit -m "chore(interviews): add deps and pytest scaffold for interview subsystem"

Task 1: Add interview config keys

Files:

Modify: backend/app/config.py

Step 1: Read current config

Open backend/app/config.py and locate the Config class.

Step 2: Add config keys

Add inside the Config class (preserving existing keys):

    # Interview subsystem
    INTERVIEW_MAX_TOKENS_PER_RUN = int(os.environ.get("INTERVIEW_MAX_TOKENS_PER_RUN", 15_000_000))
    INTERVIEW_MAX_WORKERS = int(os.environ.get("INTERVIEW_MAX_WORKERS", 8))
    INTERVIEW_DEFAULT_LANGUAGE = os.environ.get("INTERVIEW_DEFAULT_LANGUAGE", "de")
    LLM_STUB_MODE = os.environ.get("LLM_STUB_MODE", "false").lower() == "true"

Step 3: Verify import

Run: cd backend && uv run python -c "from app.config import Config; print(Config.INTERVIEW_MAX_WORKERS, Config.LLM_STUB_MODE)"
Expected: 8 False
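
Because the new keys are class attributes, they are computed once, when the class body runs at import time. The sketch below illustrates that behaviour with a hypothetical `DemoConfig` stand-in (not the real `Config` class); only the parsing expressions are taken from Step 2.

```python
import os

# Hypothetical stand-in for app.config.Config, for illustration only
os.environ["LLM_STUB_MODE"] = "TRUE"
os.environ.pop("INTERVIEW_MAX_WORKERS", None)

class DemoConfig:
    # Both expressions are evaluated once, when the class body runs
    INTERVIEW_MAX_WORKERS = int(os.environ.get("INTERVIEW_MAX_WORKERS", 8))
    LLM_STUB_MODE = os.environ.get("LLM_STUB_MODE", "false").lower() == "true"

# Later changes to the environment do not touch the computed attributes
os.environ["LLM_STUB_MODE"] = "false"
os.environ["INTERVIEW_MAX_WORKERS"] = "2"

print(DemoConfig.LLM_STUB_MODE)          # True (the match on "TRUE" is case-insensitive)
print(DemoConfig.INTERVIEW_MAX_WORKERS)  # 8 (the default; the later "2" is ignored)
```

Tests that need to flip one of these flags therefore have to patch the attribute on the class; setting the environment variable after import has no effect.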

Step 4: Commit

git add backend/app/config.py
git commit -m "feat(interviews): add interview config keys (token budget, workers, language, stub mode)"

Phase 1 — Foundation

Task 2: Pydantic models for instruments and responses

Files:

Create: backend/app/models/interview.py
Create: backend/tests/interviews/__init__.py
Test: backend/tests/interviews/test_models.py

Step 1: Write failing test

Create backend/tests/interviews/__init__.py (empty), then backend/tests/interviews/test_models.py:

import pytest
from pydantic import ValidationError
from app.models.interview import (
    LikertItem, LikertInstrument, LikertResponse,
    InterviewPhase, SubagentKind,
)

def test_likert_item_requires_de_and_en():
    item = LikertItem(item_id="x1", de="Frage", en="Question", scale=5)
    assert item.scale == 5

def test_likert_item_rejects_bad_scale():
    with pytest.raises(ValidationError):
        LikertItem(item_id="x1", de="d", en="e", scale=2)

def test_likert_instrument_unique_item_ids():
    with pytest.raises(ValidationError):
        LikertInstrument(
            name="t",
            items=[LikertItem(item_id="a", de="d", en="e", scale=5),
                   LikertItem(item_id="a", de="d", en="e", scale=5)],
        )

def test_likert_response_validates_scale_range():
    # LikertResponse accepts 1..7, so use 8 to trigger the range check
    with pytest.raises(ValidationError):
        LikertResponse(agent_id=1, phase=InterviewPhase.T0,
                       responses={"a": 8}, confidence={"a": 0.5})

def test_subagent_kind_enum():
    assert SubagentKind.LONGITUDINAL.value == "longitudinal"

Step 2: Run test to verify it fails

Run: cd backend && uv run pytest tests/interviews/test_models.py -v
Expected: ImportError (module not yet created).

Step 3: Create backend/app/models/interview.py

from __future__ import annotations
from enum import Enum
from typing import Optional
from pydantic import BaseModel, Field, field_validator, model_validator

class InterviewPhase(str, Enum):
    T0 = "T0"
    T1 = "T1"

class SubagentKind(str, Enum):
    LONGITUDINAL = "longitudinal"
    DIVERSITY = "diversity"
    DELPHI = "delphi"
    SCENARIO = "scenario"

class LikertItem(BaseModel):
    item_id: str
    de: str
    en: str
    scale: int = Field(ge=3, le=7)
    family: Optional[str] = None
    reverse_coded: bool = False

    @field_validator("scale")
    @classmethod
    def odd_scale(cls, v: int) -> int:
        if v not in (3, 5, 7):
            raise ValueError("scale must be 3, 5, or 7")
        return v

class LikertInstrument(BaseModel):
    name: str
    version: str = "1.0"
    language_default: str = "de"
    items: list[LikertItem]

    @model_validator(mode="after")
    def unique_item_ids(self) -> "LikertInstrument":
        ids = [i.item_id for i in self.items]
        if len(set(ids)) != len(ids):
            raise ValueError("duplicate item_id in instrument")
        return self

class LikertResponse(BaseModel):
    agent_id: int
    phase: InterviewPhase
    responses: dict[str, int]
    confidence: dict[str, float] = Field(default_factory=dict)
    open_comment: Optional[str] = None
    memory_available: bool = True
    failed_items: list[str] = Field(default_factory=list)

    @model_validator(mode="after")
    def values_in_range(self) -> "LikertResponse":
        for k, v in self.responses.items():
            if not 1 <= v <= 7:
                raise ValueError(f"response {k}={v} out of 1..7 range")
        for k, v in self.confidence.items():
            if not 0.0 <= v <= 1.0:
                raise ValueError(f"confidence {k}={v} out of 0..1 range")
        return self

class QSortStatement(BaseModel):
    statement_id: str
    de: str
    en: str

class QSortInstrument(BaseModel):
    name: str
    version: str = "1.0"
    statements: list[QSortStatement]
    distribution: list[int]  # e.g. [2,3,4,6,4,3,2] for -3..+3

class QSortResponse(BaseModel):
    agent_id: int
    placements: dict[str, int]  # statement_id -> bucket (-3..+3)
    likert_axes: dict[str, int]  # axis_id -> 1..7

class DelphiOpenResponse(BaseModel):
    agent_id: int
    round: int = 1
    answers: dict[str, str]  # question_id -> free text

class DelphiRatingResponse(BaseModel):
    agent_id: int
    round: int
    ratings: dict[str, dict[str, int]]  # theme_id -> {importance, plausibility}
    justification: Optional[str] = None

class ScenarioRating(BaseModel):
    desirability: int = Field(ge=1, le=7)
    plausibility: int = Field(ge=1, le=7)
    impact_on_my_group: int = Field(ge=1, le=7)
    fairness: int = Field(ge=1, le=7)
    if_woke_up_response: str

class ScenarioResponse(BaseModel):
    agent_id: int
    ratings: dict[str, ScenarioRating]  # scenario_id -> rating
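
The models above store a `distribution` on QSortInstrument but do not check a response's placements against it. A minimal sketch of such a check follows, assuming an odd number of buckets centred on 0; `check_forced_distribution` is a hypothetical helper, not part of the plan's code.

```python
from collections import Counter

def check_forced_distribution(placements: dict[str, int], distribution: list[int]) -> list[str]:
    """Return a list of problems; an empty list means the sort matches the distribution."""
    half = len(distribution) // 2
    buckets = range(-half, half + 1)  # e.g. -3..+3 for 7 buckets
    expected = dict(zip(buckets, distribution))
    counts = Counter(placements.values())
    problems = []
    for bucket in sorted(set(counts) - set(expected)):
        problems.append(f"unknown bucket {bucket}")
    for bucket, want in expected.items():
        got = counts.get(bucket, 0)
        if got != want:
            problems.append(f"bucket {bucket:+d}: expected {want}, got {got}")
    return problems

distribution = [2, 3, 4, 6, 4, 3, 2]
# Build a legal sort: fill each bucket with exactly as many statements as it allows
slots = [b for b, n in zip(range(-3, 4), distribution) for _ in range(n)]
legal = {f"st_{i + 1:02d}": b for i, b in enumerate(slots)}
print(check_forced_distribution(legal, distribution))  # []

illegal = dict(legal, st_01=0)  # move one statement from -3 to 0
print(len(check_forced_distribution(illegal, distribution)))  # 2 (both affected buckets)
```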

Step 4: Run test to verify it passes

Run: cd backend && uv run pytest tests/interviews/test_models.py -v
Expected: 5 passed.

Step 5: Commit

git add backend/app/models/interview.py backend/tests/interviews/__init__.py backend/tests/interviews/test_models.py
git commit -m "feat(interviews): add pydantic models for instruments and responses"

Task 3: YAML instrument loader + validator

Files:

Create: backend/app/services/interviews/__init__.py
Create: backend/app/services/interviews/instrument_loader.py
Create: backend/scripts/instruments/__init__.py (empty marker so tests can import path)
Test: backend/tests/interviews/test_instrument_loader.py

Step 1: Write failing test

# backend/tests/interviews/test_instrument_loader.py
import pytest
from app.services.interviews.instrument_loader import (
    load_likert_instrument, InstrumentValidationError,
)

def _write(tmp_path, text):
    p = tmp_path / "inst.yaml"
    p.write_text(text, encoding="utf-8")
    return p

def test_loads_valid_likert(tmp_path):
    p = _write(tmp_path, """
name: longitudinal_v1
version: "1.0"
language_default: de
items:
  - item_id: stk_1
    de: "Der westliche Dorschbestand wird sich erholen"
    en: "Western cod stock will recover"
    scale: 5
    family: stocks
""")
    inst = load_likert_instrument(p)
    assert inst.name == "longitudinal_v1"
    assert len(inst.items) == 1

def test_rejects_duplicate_item_id(tmp_path):
    p = _write(tmp_path, """
name: x
items:
  - {item_id: a, de: d, en: e, scale: 5}
  - {item_id: a, de: d, en: e, scale: 5}
""")
    with pytest.raises(InstrumentValidationError):
        load_likert_instrument(p)

def test_rejects_missing_required_field(tmp_path):
    p = _write(tmp_path, """
name: x
items:
  - {item_id: a, de: d, scale: 5}
""")
    with pytest.raises(InstrumentValidationError):
        load_likert_instrument(p)

Step 2: Run test to verify it fails

Run: cd backend && uv run pytest tests/interviews/test_instrument_loader.py -v
Expected: ImportError.

Step 3: Create loader

Create backend/app/services/interviews/__init__.py (empty), backend/scripts/instruments/__init__.py (empty), then backend/app/services/interviews/instrument_loader.py:

from __future__ import annotations
import hashlib
import json
from pathlib import Path
import yaml
from pydantic import ValidationError
from app.models.interview import (
    LikertInstrument, QSortInstrument,
)

class InstrumentValidationError(ValueError):
    pass

def _parse_yaml(path: Path) -> dict:
    if not path.exists():
        raise InstrumentValidationError(f"instrument file not found: {path}")
    try:
        with path.open("r", encoding="utf-8") as f:
            data = yaml.safe_load(f)
    except yaml.YAMLError as e:
        raise InstrumentValidationError(f"YAML parse error in {path}: {e}") from e
    if not isinstance(data, dict):
        raise InstrumentValidationError(f"top-level YAML must be a mapping in {path}")
    return data

def load_likert_instrument(path: Path) -> LikertInstrument:
    data = _parse_yaml(Path(path))
    try:
        return LikertInstrument(**data)
    except ValidationError as e:
        raise InstrumentValidationError(str(e)) from e

def load_qsort_instrument(path: Path) -> QSortInstrument:
    data = _parse_yaml(Path(path))
    try:
        return QSortInstrument(**data)
    except ValidationError as e:
        raise InstrumentValidationError(str(e)) from e

def instrument_hash(path: Path) -> str:
    data = Path(path).read_bytes()
    return hashlib.sha256(data).hexdigest()[:16]

def freeze_snapshot(instruments: dict[str, Path], out_path: Path) -> dict:
    snapshot = {
        name: {
            "path": str(p),
            "hash": instrument_hash(p),
            "content": _parse_yaml(p),
        }
        for name, p in instruments.items()
    }
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(json.dumps(snapshot, ensure_ascii=False, indent=2), encoding="utf-8")
    return snapshot
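
The hashing recipe used by `instrument_hash` can be tried in isolation. The sketch below re-implements it as a hypothetical `short_hash` helper on a temporary file, so it runs without the app package.

```python
import hashlib
import tempfile
from pathlib import Path

def short_hash(path: Path) -> str:
    # Same recipe as instrument_hash: SHA-256 of the raw bytes, first 16 hex characters
    return hashlib.sha256(path.read_bytes()).hexdigest()[:16]

with tempfile.TemporaryDirectory() as tmp:
    inst = Path(tmp) / "demo.yaml"
    inst.write_text("name: demo\nitems: []\n", encoding="utf-8")
    before = short_hash(inst)
    unchanged = short_hash(inst)

    # Adding only a comment still changes the bytes, and therefore the hash
    inst.write_text("name: demo\nitems: []\n# edited\n", encoding="utf-8")
    after = short_hash(inst)

print(len(before), before == unchanged, before != after)  # 16 True True
```

Since the hash covers raw bytes rather than parsed content, whitespace and comment edits count as a new instrument version.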

Step 4: Run test to verify it passes

Run: cd backend && uv run pytest tests/interviews/test_instrument_loader.py -v
Expected: 3 passed.

Step 5: Commit

git add backend/app/services/interviews/__init__.py backend/app/services/interviews/instrument_loader.py backend/scripts/instruments/__init__.py backend/tests/interviews/test_instrument_loader.py
git commit -m "feat(interviews): YAML instrument loader with pydantic validation and hash freezing"

Task 4: LLM stub mode

Files:

Modify: backend/app/utils/llm_client.py
Test: backend/tests/interviews/test_llm_stub.py

Step 1: Write failing test

# backend/tests/interviews/test_llm_stub.py
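from app.config import Config
from app.utils.llm_client import LLMClient

def test_stub_mode_returns_deterministic_canned_json(monkeypatch):
    # Config reads the environment once at import time, so patch the attribute itself;
    # monkeypatch restores the original value when the test ends.
    monkeypatch.setattr(Config, "LLM_STUB_MODE", True)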
    client = LLMClient(api_key="x", base_url="x", model="x")
    messages = [
        {"role": "system", "content": "You are persona_42. Return JSON."},
        {"role": "user", "content": "stub_key=longitudinal:item_001"},
    ]
    out1 = client.chat_json(messages=messages, temperature=0.0)
    out2 = client.chat_json(messages=messages, temperature=0.0)
    assert out1 == out2
    assert isinstance(out1, dict)

Step 2: Run test to verify it fails

Run: cd backend && uv run pytest tests/interviews/test_llm_stub.py -v
Expected: FAIL (real API call attempted or stub absent).

Step 3: Read current llm_client.py

Read the file to locate chat and chat_json method bodies and where to insert the stub branch.

Step 4: Add stub branch

At the top of LLMClient.chat (before the OpenAI call), insert:

        from app.config import Config
        if getattr(Config, "LLM_STUB_MODE", False):
            return self._stub_response(messages)

And at the top of LLMClient.chat_json (before delegating), insert the same guard returning a parsed dict via self._stub_response_json(messages).

Add these methods to LLMClient:

    def _stub_key(self, messages: list[dict]) -> str:
        user_msg = next((m["content"] for m in reversed(messages) if m.get("role") == "user"), "")
        sys_msg = next((m["content"] for m in messages if m.get("role") == "system"), "")
        # Allow callers to embed an explicit stub_key=... token
        for chunk in user_msg.split():
            if chunk.startswith("stub_key="):
                return chunk[len("stub_key="):]
        import hashlib
        return hashlib.sha256((sys_msg + "|" + user_msg).encode("utf-8")).hexdigest()[:12]

    def _stub_response(self, messages: list[dict]) -> str:
        import json as _json
        return _json.dumps(self._stub_response_json(messages), ensure_ascii=False)

    def _stub_response_json(self, messages: list[dict]) -> dict:
        key = self._stub_key(messages)
        # Deterministic Likert value in 1..5 derived from the key, plus placeholder open text
        digit = sum(ord(c) for c in key) % 5 + 1
        return {
            "stub_key": key,
            "responses": {"item_001": digit, "item_002": digit, "item_003": (digit % 5) + 1},
            "confidence": {"item_001": 0.7, "item_002": 0.7, "item_003": 0.6},
            "open_comment": f"stub:{key}",
        }
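
Step 4 describes the chat_json guard in words only. The sketch below shows how the two guards could sit side by side, using hypothetical `DemoConfig` and `DemoClient` stand-ins; the real LLMClient method signatures are not shown in this plan and may differ.

```python
import json

class DemoConfig:
    # Hypothetical stand-in for app.config.Config
    LLM_STUB_MODE = True

class DemoClient:
    # Hypothetical skeleton of LLMClient, reduced to the stub path
    def _stub_response_json(self, messages: list[dict]) -> dict:
        return {"stub_key": "demo", "responses": {"item_001": 3}}

    def _stub_response(self, messages: list[dict]) -> str:
        return json.dumps(self._stub_response_json(messages), ensure_ascii=False)

    def chat(self, messages: list[dict], **kwargs) -> str:
        if getattr(DemoConfig, "LLM_STUB_MODE", False):
            return self._stub_response(messages)  # JSON string, like a raw completion
        raise RuntimeError("the real API call would happen here")

    def chat_json(self, messages: list[dict], **kwargs) -> dict:
        if getattr(DemoConfig, "LLM_STUB_MODE", False):
            return self._stub_response_json(messages)  # already-parsed dict
        return json.loads(self.chat(messages, **kwargs))

client = DemoClient()
msgs = [{"role": "user", "content": "hi"}]
# Both entry points agree on content; only the return type differs
print(json.loads(client.chat(msgs)) == client.chat_json(msgs))  # True
```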

Step 5: Run test to verify it passes

Run: cd backend && uv run pytest tests/interviews/test_llm_stub.py -v
Expected: 1 passed.

Step 6: Commit

git add backend/app/utils/llm_client.py backend/tests/interviews/test_llm_stub.py
git commit -m "feat(interviews): LLM stub mode for deterministic CI tests"

Task 5: StakeholderInterviewer base class

Files:

Create: backend/app/services/interviews/base.py
Test: backend/tests/interviews/test_base_interviewer.py

Step 1: Write failing test

# backend/tests/interviews/test_base_interviewer.py
import json
import pytest
from app.services.interviews.base import StakeholderInterviewer, MemoryDigest, PersonaRecord

class _FakeLLM:
    def __init__(self, responses):
        self.responses = list(responses)
        self.calls = []
    def chat_json(self, messages, temperature=0.0, max_tokens=None, **kw):
        self.calls.append(messages)
        return self.responses.pop(0)

class _FakeMemory:
    def get_digest(self, agent_id, max_chars=2000):
        return MemoryDigest(text=f"digest-for-{agent_id}", available=True)

def test_in_character_prompt_includes_persona_and_memory():
    llm = _FakeLLM([{"x": 1}])
    mem = _FakeMemory()
    interviewer = StakeholderInterviewer(llm=llm, memory=mem)
    persona = PersonaRecord(agent_id=7, name="A", persona="I am a small-scale Baltic fisher.")
    out = interviewer.ask_in_character(persona, user_prompt="Q?", schema_hint="{...}")
    assert out == {"x": 1}
    sys_msg = llm.calls[0][0]["content"]
    assert "small-scale Baltic fisher" in sys_msg
    assert "digest-for-7" in sys_msg

def test_schema_retry_on_first_failure():
    bad_then_good = [{}, {"responses": {"a": 3}}]
    llm = _FakeLLM(bad_then_good)
    mem = _FakeMemory()
    interviewer = StakeholderInterviewer(llm=llm, memory=mem)
    def validator(d): 
        return d if "responses" in d else None
    persona = PersonaRecord(agent_id=1, name="A", persona="p")
    out = interviewer.ask_in_character(persona, user_prompt="Q?", schema_hint="x", validate=validator)
    assert out == {"responses": {"a": 3}}
    assert len(llm.calls) == 2

def test_two_failures_raise():
    llm = _FakeLLM([{}, {}])
    mem = _FakeMemory()
    interviewer = StakeholderInterviewer(llm=llm, memory=mem)
    persona = PersonaRecord(agent_id=1, name="A", persona="p")
    with pytest.raises(ValueError):
        interviewer.ask_in_character(persona, user_prompt="Q?", schema_hint="x",
                                     validate=lambda d: d if "responses" in d else None)

Step 2: Run test to verify it fails

Run: cd backend && uv run pytest tests/interviews/test_base_interviewer.py -v
Expected: ImportError.

Step 3: Implement base

backend/app/services/interviews/base.py:

from __future__ import annotations
from dataclasses import dataclass
from typing import Callable, Optional, Protocol

@dataclass
class PersonaRecord:
    agent_id: int
    name: str
    persona: str
    profession: Optional[str] = None
    bio: Optional[str] = None

@dataclass
class MemoryDigest:
    text: str
    available: bool = True

class MemoryProvider(Protocol):
    def get_digest(self, agent_id: int, max_chars: int = 2000) -> MemoryDigest: ...

class StakeholderInterviewer:
    def __init__(self, llm, memory: MemoryProvider, language: str = "de"):
        self.llm = llm
        self.memory = memory
        self.language = language

    def _system_prompt(self, persona: PersonaRecord, digest: MemoryDigest, schema_hint: str) -> str:
        memory_block = digest.text if digest.available else "[no simulation memory available]"
        lang_note = "Antworte ausschließlich auf Deutsch." if self.language == "de" else "Answer in English."
        return (
            f"You are {persona.name}. {persona.persona}\n\n"
            "You are answering a survey about the future of German fisheries. "
            "Answer strictly in character based on your background, values, and what you experienced "
            "during the simulated social media discourse summarised below.\n\n"
            f"--- simulation memory digest ---\n{memory_block}\n--- end ---\n\n"
            f"{lang_note} Return JSON ONLY matching this schema:\n{schema_hint}"
        )

    def ask_in_character(
        self,
        persona: PersonaRecord,
        user_prompt: str,
        schema_hint: str,
        *,
        temperature: float = 0.3,
        max_tokens: Optional[int] = None,
        validate: Optional[Callable[[dict], Optional[dict]]] = None,
    ) -> dict:
        digest = self.memory.get_digest(persona.agent_id)
        messages = [
            {"role": "system", "content": self._system_prompt(persona, digest, schema_hint)},
            {"role": "user", "content": user_prompt},
        ]
        out = self.llm.chat_json(messages=messages, temperature=temperature, max_tokens=max_tokens)
        if validate is not None:
            validated = validate(out)
            if validated is not None:
                return validated
            messages.append({"role": "assistant", "content": str(out)})
            messages.append({"role": "user", "content":
                "Your previous response did not match the required schema. "
                f"Return ONLY valid JSON matching: {schema_hint}"})
            out = self.llm.chat_json(messages=messages, temperature=0.0, max_tokens=max_tokens)
            validated = validate(out)
            if validated is None:
                raise ValueError(f"agent {persona.agent_id}: schema violation after retry")
            return validated
        return out

Step 4: Run test to verify it passes

Run: cd backend && uv run pytest tests/interviews/test_base_interviewer.py -v
Expected: 3 passed.

Step 5: Commit

git add backend/app/services/interviews/base.py backend/tests/interviews/test_base_interviewer.py
git commit -m "feat(interviews): StakeholderInterviewer base with in-character prompting and schema retry"

Phase 2 — Subagents

Task 6: Longitudinal subagent + instrument YAML

Files:

Create: backend/scripts/instruments/longitudinal_v1.yaml
Create: backend/app/services/interviews/longitudinal.py
Test: backend/tests/interviews/test_longitudinal.py

Step 1: Write failing test

# backend/tests/interviews/test_longitudinal.py
from pathlib import Path
import pytest
from app.models.interview import InterviewPhase
from app.services.interviews.base import PersonaRecord, MemoryDigest
from app.services.interviews.longitudinal import LongitudinalSubagent, run_aggregate

class _FakeMem:
    def get_digest(self, agent_id, max_chars=2000):
        return MemoryDigest(text="x", available=True)

# The subagent's validator requires every item in longitudinal_v1.yaml, so answer all 12
ITEM_IDS = [f"{fam}_{n}" for fam in ("stk", "gov", "mkt", "clm") for n in (1, 2, 3)]

class _CannedLLM:
    def __init__(self): self.n = 0
    def chat_json(self, messages, temperature=0.0, max_tokens=None, **kw):
        self.n += 1
        return {
            "responses": {k: 4 for k in ITEM_IDS},
            "confidence": {k: 0.8 for k in ITEM_IDS},
            "open_comment": "test",
        }

INSTRUMENT = Path(__file__).resolve().parents[2] / "scripts" / "instruments" / "longitudinal_v1.yaml"

def test_longitudinal_administer_one_agent():
    sub = LongitudinalSubagent(llm=_CannedLLM(), memory=_FakeMem(), instrument_path=INSTRUMENT)
    persona = PersonaRecord(agent_id=3, name="A", persona="p")
    resp = sub.administer(persona, phase=InterviewPhase.T0)
    assert resp.agent_id == 3
    assert resp.phase == InterviewPhase.T0
    assert set(resp.responses.keys()) >= {"stk_1", "gov_1", "mkt_1", "clm_1"}

def test_longitudinal_aggregate_delta():
    from app.models.interview import LikertResponse
    t0 = [LikertResponse(agent_id=i, phase=InterviewPhase.T0,
                         responses={"stk_1": 3, "gov_1": 4},
                         confidence={"stk_1": 0.8, "gov_1": 0.8}) for i in range(5)]
    t1 = [LikertResponse(agent_id=i, phase=InterviewPhase.T1,
                         responses={"stk_1": 4, "gov_1": 4},
                         confidence={"stk_1": 0.8, "gov_1": 0.8}) for i in range(5)]
    agg = run_aggregate(t0, t1)
    assert agg["per_item"]["stk_1"]["mean_delta"] == 1.0
    assert agg["per_item"]["gov_1"]["mean_delta"] == 0.0
    assert agg["n_paired"] == 5

Step 2: Run test to verify it fails

Run: cd backend && uv run pytest tests/interviews/test_longitudinal.py -v
Expected: ImportError + missing YAML file.

Step 3: Create instrument YAML

backend/scripts/instruments/longitudinal_v1.yaml:

name: longitudinal_v1
version: "1.0"
language_default: de
items:
  # Stock status & recovery
  - {item_id: stk_1, family: stocks, scale: 5,
     de: "Der westliche Dorschbestand wird sich bis 2035 erholen.",
     en: "The Western Baltic cod stock will recover by 2035."}
  - {item_id: stk_2, family: stocks, scale: 5,
     de: "Der Heringsbestand in der westlichen Ostsee ist nicht mehr zu retten.",
     en: "The Western Baltic herring stock can no longer be saved.",
     reverse_coded: true}
  - {item_id: stk_3, family: stocks, scale: 5,
     de: "Wissenschaftliche Bestandsschätzungen sind generell zuverlässig.",
     en: "Scientific stock assessments are generally reliable."}
  # Governance & CFP
  - {item_id: gov_1, family: governance, scale: 5,
     de: "Die Gemeinsame Fischereipolitik der EU scheitert beim Schutz der Ostseefische.",
     en: "The EU Common Fisheries Policy fails to protect Baltic fish.",
     reverse_coded: true}
  - {item_id: gov_2, family: governance, scale: 5,
     de: "Entscheidungen über Fangquoten sollten stärker lokal getroffen werden.",
     en: "Decisions on catch quotas should be taken more locally."}
  - {item_id: gov_3, family: governance, scale: 5,
     de: "Die deutsche Bundesregierung handelt entschlossen bei Fischereifragen.",
     en: "The German federal government acts decisively on fisheries issues."}
  # Market & MSC
  - {item_id: mkt_1, family: market, scale: 5,
     de: "Nur MSC-zertifizierter Fisch sollte verkauft werden dürfen.",
     en: "Only MSC-certified fish should be allowed for sale."}
  - {item_id: mkt_2, family: market, scale: 5,
     de: "Importierter Fisch verdrängt die deutsche Kleinfischerei.",
     en: "Imported fish displaces German small-scale fisheries."}
  - {item_id: mkt_3, family: market, scale: 5,
     de: "Verbraucher zahlen gerne mehr für nachhaltigen Ostseefisch.",
     en: "Consumers gladly pay more for sustainable Baltic fish."}
  # Climate & adaptation
  - {item_id: clm_1, family: climate, scale: 5,
     de: "Der Klimawandel macht traditionelle Ostseefischerei unmöglich.",
     en: "Climate change makes traditional Baltic fisheries impossible.",
     reverse_coded: true}
  - {item_id: clm_2, family: climate, scale: 5,
     de: "Aquakultur ist die Zukunft der deutschen Fischwirtschaft.",
     en: "Aquaculture is the future of the German fishing industry."}
  - {item_id: clm_3, family: climate, scale: 5,
     de: "Die Fischerei muss sich grundlegend an neue Arten anpassen.",
     en: "Fisheries must fundamentally adapt to new species."}
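
Three items above carry `reverse_coded: true`, but none of the code shown in this section consumes that flag. One conventional way to use it is to mirror the answer around the scale midpoint before combining items within a family. The `rescore` helper below is a hypothetical sketch of that convention, not part of the plan's code.

```python
def rescore(value: int, scale: int, reverse_coded: bool) -> int:
    """Mirror a reverse-coded Likert answer so all items in a family point the same way."""
    if not 1 <= value <= scale:
        raise ValueError(f"value {value} outside 1..{scale}")
    return scale + 1 - value if reverse_coded else value

print(rescore(5, 5, reverse_coded=True))   # 1
print(rescore(2, 5, reverse_coded=True))   # 4
print(rescore(3, 5, reverse_coded=True))   # 3 (the midpoint is unchanged)
print(rescore(4, 5, reverse_coded=False))  # 4
```

Under this convention a T0 to T1 delta on a reverse-coded item simply flips sign, so it matters whether deltas are reported on raw or rescored values.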

Step 4: Implement subagent

backend/app/services/interviews/longitudinal.py:

from __future__ import annotations
import json
import math
from pathlib import Path
from typing import Optional
from app.models.interview import (
    LikertInstrument, LikertResponse, InterviewPhase,
)
from app.services.interviews.base import StakeholderInterviewer, PersonaRecord
from app.services.interviews.instrument_loader import load_likert_instrument

class LongitudinalSubagent:
    def __init__(self, llm, memory, instrument_path: Path, language: str = "de"):
        self.instrument: LikertInstrument = load_likert_instrument(Path(instrument_path))
        self.interviewer = StakeholderInterviewer(llm=llm, memory=memory, language=language)
        self.language = language

    def _schema_hint(self) -> str:
        ids = [i.item_id for i in self.instrument.items]
        return json.dumps({
            "responses": {k: "<int 1-5>" for k in ids},
            "confidence": {k: "<float 0-1>" for k in ids},
            "open_comment": "<string, optional>",
        }, ensure_ascii=False)

    def _user_prompt(self) -> str:
        lines = ["Bitte bewerten Sie die folgenden Aussagen auf einer Skala von 1 (lehne stark ab) bis 5 (stimme stark zu)." if self.language == "de"
                 else "Please rate the following statements on a scale from 1 (strongly disagree) to 5 (strongly agree)."]
        for it in self.instrument.items:
            txt = it.de if self.language == "de" else it.en
            lines.append(f"- [{it.item_id}] {txt}")
        return "\n".join(lines)

    def _validator(self, raw: dict) -> Optional[dict]:
        if not isinstance(raw, dict): return None
        resp = raw.get("responses")
        if not isinstance(resp, dict): return None
        required = {it.item_id for it in self.instrument.items}
        if not required.issubset(resp.keys()): return None
        for k, v in resp.items():
            # bool is a subclass of int, so reject JSON true/false explicitly
            if isinstance(v, bool) or not isinstance(v, int) or not 1 <= v <= 5: return None
        return raw

    def administer(self, persona: PersonaRecord, phase: InterviewPhase) -> LikertResponse:
        raw = self.interviewer.ask_in_character(
            persona,
            user_prompt=self._user_prompt(),
            schema_hint=self._schema_hint(),
            validate=self._validator,
        )
        return LikertResponse(
            agent_id=persona.agent_id,
            phase=phase,
            responses={k: int(v) for k, v in raw["responses"].items()},
            confidence={k: float(v) for k, v in raw.get("confidence", {}).items()},
            open_comment=raw.get("open_comment"),
        )

def run_aggregate(t0: list[LikertResponse], t1: list[LikertResponse]) -> dict:
    by_t0 = {r.agent_id: r for r in t0}
    by_t1 = {r.agent_id: r for r in t1}
    paired = sorted(set(by_t0) & set(by_t1))
    items: set[str] = set()
    for r in t0 + t1:
        items.update(r.responses.keys())
    per_item: dict[str, dict] = {}
    for it in sorted(items):
        deltas = []
        for aid in paired:
            v0 = by_t0[aid].responses.get(it)
            v1 = by_t1[aid].responses.get(it)
            if v0 is None or v1 is None: continue
            deltas.append(v1 - v0)
        if not deltas:
            per_item[it] = {"mean_delta": None, "n": 0}
            continue
        m = sum(deltas) / len(deltas)
        var = sum((d - m) ** 2 for d in deltas) / max(len(deltas) - 1, 1)
        per_item[it] = {
            "mean_delta": m,
            "sd_delta": math.sqrt(var),
            "n": len(deltas),
            "n_positive": sum(1 for d in deltas if d > 0),
            "n_negative": sum(1 for d in deltas if d < 0),
        }
    per_agent: dict[int, dict] = {}
    for aid in paired:
        r0 = by_t0[aid].responses
        r1 = by_t1[aid].responses
        common = set(r0) & set(r1)
        total = sum(abs(r1[k] - r0[k]) for k in common)
        per_agent[aid] = {"total_abs_drift": total, "n_items": len(common)}
    return {
        "n_paired": len(paired),
        "n_t0_only": len(set(by_t0) - set(by_t1)),
        "n_t1_only": len(set(by_t1) - set(by_t0)),
        "per_item": per_item,
        "per_agent": per_agent,
    }
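
The per-item statistics in `run_aggregate` can be checked by hand on a small example. The sketch below repeats the same arithmetic on hypothetical deltas and compares it with the standard library's sample standard deviation.

```python
import math
import statistics

deltas = [1, 1, 0, -1, 2]  # hypothetical T1 - T0 differences for one item

# Same arithmetic as run_aggregate: mean, then variance with an n-1 denominator
m = sum(deltas) / len(deltas)
var = sum((d - m) ** 2 for d in deltas) / max(len(deltas) - 1, 1)
sd = math.sqrt(var)

print(m)             # 0.6
print(round(sd, 4))  # 1.1402
print(math.isclose(sd, statistics.stdev(deltas)))  # True

# With a single pair, the max(..., 1) guard yields 0.0 instead of an error
single = [2]
m1 = sum(single) / len(single)
print(math.sqrt(sum((d - m1) ** 2 for d in single) / max(len(single) - 1, 1)))  # 0.0
```

`statistics.stdev` raises StatisticsError for fewer than two values, so the guard in `run_aggregate` is what keeps single-pair items from failing; an `sd_delta` of 0.0 with `n == 1` should be read as "undefined" rather than "no spread".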

Step 5: Run test to verify it passes

Run: cd backend && uv run pytest tests/interviews/test_longitudinal.py -v
Expected: 2 passed.

Step 6: Commit

git add backend/scripts/instruments/longitudinal_v1.yaml backend/app/services/interviews/longitudinal.py backend/tests/interviews/test_longitudinal.py
git commit -m "feat(interviews): longitudinal subagent + 12-item Likert instrument"

Task 7: Diversity subagent + Q-sort instrument

Files:

Create: backend/scripts/instruments/diversity_v1.yaml
Create: backend/app/services/interviews/diversity.py
Test: backend/tests/interviews/test_diversity.py

Step 1: Write failing test

# backend/tests/interviews/test_diversity.py
from pathlib import Path
import numpy as np
from app.services.interviews.base import PersonaRecord, MemoryDigest
from app.services.interviews.diversity import (
    DiversitySubagent, run_typology,
)

class _Mem:
    def get_digest(self, agent_id, max_chars=2000):
        return MemoryDigest(text="x", available=True)

class _CannedLLM:
    def chat_json(self, messages, temperature=0.0, max_tokens=None, **kw):
        # Place all 24 statements into legal buckets per the forced distribution
        placements = {}
        buckets = [-3]*2 + [-2]*3 + [-1]*4 + [0]*6 + [1]*4 + [2]*3 + [3]*2
        for i in range(24):
            placements[f"st_{i+1:02d}"] = buckets[i]
        return {
            "placements": placements,
            "likert_axes": {"ax_pres_extr": 5, "ax_loc_eu": 3, "ax_sci_trad": 4,
                            "ax_ind_col": 4, "ax_short_long": 5, "ax_mkt_reg": 3},
        }

INSTRUMENT = Path(__file__).resolve().parents[2] / "scripts" / "instruments" / "diversity_v1.yaml"

def test_diversity_administer():
    sub = DiversitySubagent(llm=_CannedLLM(), memory=_Mem(), instrument_path=INSTRUMENT)
    persona = PersonaRecord(agent_id=1, name="A", persona="p")
    resp = sub.administer(persona)
    assert len(resp.placements) == 24
    assert set(resp.likert_axes.keys()) == {
        "ax_pres_extr","ax_loc_eu","ax_sci_trad","ax_ind_col","ax_short_long","ax_mkt_reg"
    }

def test_typology_runs_pca_kmeans():
    from app.models.interview import QSortResponse
    rng = np.random.default_rng(42)
    responses = []
    for aid in range(20):
        placements = {f"st_{i+1:02d}": int(rng.integers(-3, 4)) for i in range(24)}
        axes = {f"ax_{j}": int(rng.integers(1, 8)) for j in range(6)}
        responses.append(QSortResponse(agent_id=aid, placements=placements, likert_axes=axes))
    result = run_typology(responses, n_clusters=3)
    assert "clusters" in result
    assert len(result["clusters"]) == 3
    assert "pca" in result
    assert len(result["pca"]["components"]) >= 2

Step 2: Run test to verify it fails

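Run: cd backend && uv run pytest tests/interviews/test_diversity.py -v
Expected: collection fails with ImportError (ModuleNotFoundError), because app.services.interviews.diversity does not exist yet.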

Step 3: Create instrument YAML

backend/scripts/instruments/diversity_v1.yaml:

name: diversity_v1
version: "1.0"
language_default: de
distribution: [2, 3, 4, 6, 4, 3, 2]   # buckets from -3 to +3, total 24
statements:
  - {statement_id: st_01, de: "Die Ostsee gehört den Fischern, die hier seit Generationen leben.", en: "The Baltic belongs to fishers who have lived here for generations."}
  - {statement_id: st_02, de: "MSC-Zertifizierung schützt vor allem große Konzerne.", en: "MSC certification mainly protects large corporations."}
  - {statement_id: st_03, de: "Wissenschaftliche Quoten sind die einzige Grundlage für Politik.", en: "Scientific quotas are the only legitimate basis for policy."}
  - {statement_id: st_04, de: "Aquakultur kann Ostseefischerei ersetzen.", en: "Aquaculture can replace Baltic fisheries."}
  - {statement_id: st_05, de: "Sportfischer schaden den Beständen mehr als die Berufsfischer.", en: "Recreational anglers harm stocks more than commercial fishers."}
  - {statement_id: st_06, de: "Die EU-Fischereipolitik kennt die Ostsee nicht.", en: "EU fisheries policy doesn't understand the Baltic."}
  - {statement_id: st_07, de: "Großtechnische Fischerei ist effizienter und damit nachhaltiger.", en: "Industrial fisheries are more efficient and therefore more sustainable."}
  - {statement_id: st_08, de: "Wer Fisch isst, sollte mehr dafür bezahlen.", en: "Those who eat fish should pay more for it."}
  - {statement_id: st_09, de: "Die Kleinfischerei muss subventioniert werden.", en: "Small-scale fisheries must be subsidised."}
  - {statement_id: st_10, de: "Marine Schutzgebiete sind reine Symbolpolitik.", en: "Marine protected areas are mere symbolism."}
  - {statement_id: st_11, de: "Russlands Krieg ändert alles in der Ostsee.", en: "Russia's war changes everything in the Baltic."}
  - {statement_id: st_12, de: "Nur drastische Reduktion der Fangmengen rettet die Bestände.", en: "Only drastic catch reductions will save the stocks."}
  - {statement_id: st_13, de: "NGOs übertreiben die Krise systematisch.", en: "NGOs systematically exaggerate the crisis."}
  - {statement_id: st_14, de: "Klimawandel ist das eigentliche Problem, nicht die Fischerei.", en: "Climate change is the real problem, not fisheries."}
  - {statement_id: st_15, de: "Tradition zählt mehr als kurzfristige Bestandszahlen.", en: "Tradition matters more than short-term stock numbers."}
  - {statement_id: st_16, de: "Verbraucher entscheiden über die Zukunft des Fisches.", en: "Consumers decide the future of fish."}
  - {statement_id: st_17, de: "Ohne Generalstreik der Fischer ändert sich nichts.", en: "Without a fishers' general strike, nothing will change."}
  - {statement_id: st_18, de: "Die Bundesregierung sollte Kutter aufkaufen und stilllegen.", en: "The federal government should buy out and decommission boats."}
  - {statement_id: st_19, de: "Die Dorschkrise ist Folge gescheiterter Politik.", en: "The cod crisis is the result of policy failure."}
  - {statement_id: st_20, de: "Ostsee-Aquakultur ist ökologisch problematisch.", en: "Baltic aquaculture is ecologically problematic."}
  - {statement_id: st_21, de: "Junge Menschen werden keinen Fischereibetrieb mehr übernehmen.", en: "Young people will no longer take over fishing businesses."}
  - {statement_id: st_22, de: "Markt regelt sich selbst, auch beim Fisch.", en: "The market regulates itself, also for fish."}
  - {statement_id: st_23, de: "Lokale Genossenschaften sind die Lösung.", en: "Local cooperatives are the solution."}
  - {statement_id: st_24, de: "In 20 Jahren gibt es keine deutsche Ostseefischerei mehr.", en: "In 20 years there will be no German Baltic fisheries left."}
likert_axes:
  - {axis_id: ax_pres_extr, scale: 7, de: "Bewahrung (1) vs. Nutzung (7)", en: "Preservation (1) vs. Extraction (7)"}
  - {axis_id: ax_loc_eu,    scale: 7, de: "Lokal (1) vs. EU-zentral (7)",  en: "Local (1) vs. EU-central (7)"}
  - {axis_id: ax_sci_trad,  scale: 7, de: "Wissenschaft (1) vs. Tradition (7)", en: "Science-led (1) vs. Tradition-led (7)"}
  - {axis_id: ax_ind_col,   scale: 7, de: "Individuum (1) vs. Kollektiv (7)", en: "Individual (1) vs. Collective (7)"}
  - {axis_id: ax_short_long,scale: 7, de: "Kurzfristig (1) vs. Langfristig (7)", en: "Short-term (1) vs. Long-term (7)"}
  - {axis_id: ax_mkt_reg,   scale: 7, de: "Markt (1) vs. Regulierung (7)", en: "Market (1) vs. Regulation (7)"}
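The `distribution` list encodes how many statements each box from -3 to +3 must hold. Below is a minimal sketch of how that list can be expanded and checked, assuming the seven boxes are always -3 to +3 as in this instrument. `expand_distribution` and `matches_distribution` are illustrative names, not functions from the plan.

```python
from collections import Counter

BOXES = list(range(-3, 4))  # assumed fixed seven-box layout: -3 .. +3


def expand_distribution(distribution: list[int]) -> list[int]:
    """Turn per-box counts into one box label per statement slot."""
    if len(distribution) != len(BOXES):
        raise ValueError("expected one count per box from -3 to +3")
    return [box for box, n in zip(BOXES, distribution) for _ in range(n)]


def matches_distribution(placements: dict[str, int], distribution: list[int]) -> bool:
    """True when the placements use each box exactly as often as required."""
    return Counter(placements.values()) == Counter(expand_distribution(distribution))


dist = [2, 3, 4, 6, 4, 3, 2]
slots = expand_distribution(dist)
legal = {f"st_{i + 1:02d}": box for i, box in enumerate(slots)}
print(len(slots), matches_distribution(legal, dist))  # 24 True
```

The expanded list is the same shape as the `buckets` list in the canned test LLM, so a legal placement is any permutation of it across the 24 statement ids.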

Step 4: Implement subagent

backend/app/services/interviews/diversity.py:

from __future__ import annotations
import json
from pathlib import Path
from typing import Optional
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
import yaml
from app.models.interview import QSortResponse
from app.services.interviews.base import StakeholderInterviewer, PersonaRecord
from app.services.interviews.instrument_loader import InstrumentValidationError

class DiversitySubagent:
    def __init__(self, llm, memory, instrument_path: Path, language: str = "de"):
        self.instrument = self._load(Path(instrument_path))
        self.interviewer = StakeholderInterviewer(llm=llm, memory=memory, language=language)
        self.language = language

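    def _load(self, path: Path) -> dict:
        with path.open("r", encoding="utf-8") as f:
            data = yaml.safe_load(f)
        required = ("statements", "distribution", "likert_axes")
        if not isinstance(data, dict) or any(key not in data for key in required):
            raise InstrumentValidationError(f"invalid diversity instrument: {path}")
        # The prompt and validator map the distribution onto the seven boxes -3..+3.
        if len(data["distribution"]) != 7:
            raise InstrumentValidationError("distribution must have 7 entries (boxes -3 to +3)")
        if sum(data["distribution"]) != len(data["statements"]):
            raise InstrumentValidationError("distribution sum must equal number of statements")
        return data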

    def _schema_hint(self) -> str:
        return json.dumps({
            "placements": {s["statement_id"]: "<int in -3..+3>" for s in self.instrument["statements"]},
            "likert_axes": {a["axis_id"]: "<int 1-7>" for a in self.instrument["likert_axes"]},
        }, ensure_ascii=False)

    def _user_prompt(self) -> str:
        de = self.language == "de"
        dist = self.instrument["distribution"]
        bucket_desc = ", ".join(f"{b}:{n}" for b, n in zip(range(-3, 4), dist))
        intro = (
            "Ordnen Sie jede Aussage genau einer Box von -3 (lehne stark ab) bis +3 (stimme stark zu) zu. "
            f"Die Verteilung ist erzwungen: {bucket_desc}."
        ) if de else (
            "Place every statement into exactly one box from -3 (strongly disagree) to +3 (strongly agree). "
            f"The distribution is forced: {bucket_desc}."
        )
        # Keep section headings in the interview language so the prompt is not mixed-language.
        lines = [intro, "", "Aussagen:" if de else "Statements:"]
        for s in self.instrument["statements"]:
            lines.append(f"- [{s['statement_id']}] {s['de'] if de else s['en']}")
        lines += ["", "Bewerten Sie anschließend jede Achse von 1 bis 7:" if de
                  else "Then rate each axis from 1 to 7:"]
        for a in self.instrument["likert_axes"]:
            lines.append(f"- [{a['axis_id']}] {a['de'] if de else a['en']}")
        return "\n".join(lines)

    def _validator(self, raw: dict) -> Optional[dict]:
        """Return raw if it honours the forced distribution and axis ranges, else None."""
        if not isinstance(raw, dict):
            return None
        placements = raw.get("placements")
        axes = raw.get("likert_axes")
        if not isinstance(placements, dict) or not isinstance(axes, dict):
            return None
        statements = {s["statement_id"] for s in self.instrument["statements"]}
        if set(placements.keys()) != statements:
            return None
        # Required count per box; boxes with a zero quota are left out so they compare cleanly.
        target = {b: n for b, n in zip(range(-3, 4), self.instrument["distribution"]) if n}
        got: dict[int, int] = {}
        for v in placements.values():
            # bool is a subclass of int, so reject it explicitly.
            if isinstance(v, bool) or not isinstance(v, int) or not -3 <= v <= 3:
                return None
            got[v] = got.get(v, 0) + 1
        if got != target:
            return None
        for a in self.instrument["likert_axes"]:
            v = axes.get(a["axis_id"])
            if isinstance(v, bool) or not isinstance(v, int) or not 1 <= v <= 7:
                return None
        return raw

    def administer(self, persona: PersonaRecord) -> QSortResponse:
        raw = self.interviewer.ask_in_character(
            persona,
            user_prompt=self._user_prompt(),
            schema_hint=self._schema_hint(),
            validate=self._validator,
        )
        return QSortResponse(
            agent_id=persona.agent_id,
            placements={k: int(v) for k, v in raw["placements"].items()},
            likert_axes={k: int(v) for k, v in raw["likert_axes"].items()},
        )

def _vectorize(r: QSortResponse, statements: list[str], axes: list[str]) -> np.ndarray:
    return np.array(
        [r.placements.get(s, 0) for s in statements] +
        [r.likert_axes.get(a, 4) for a in axes],
        dtype=float,
    )

def run_typology(responses: list[QSortResponse], n_clusters: int = 4) -> dict:
    if not responses:
        return {"n": 0, "clusters": [], "pca": {"components": [], "explained_variance": []}}
    statements = sorted({k for r in responses for k in r.placements})
    axes = sorted({k for r in responses for k in r.likert_axes})
    X = np.vstack([_vectorize(r, statements, axes) for r in responses])
    n_clusters = min(n_clusters, len(responses))
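    # PCA is used only for the low-dimensional projection; clustering runs on the raw features.
    pca = PCA(n_components=min(5, X.shape[1], X.shape[0]))
    pcs = pca.fit_transform(X)  # per-agent scores, one row per respondent
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    labels = km.fit_predict(X)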
    clusters = []
    for c in range(n_clusters):
        members = [responses[i].agent_id for i in range(len(responses)) if labels[i] == c]
        centroid = km.cluster_centers_[c]
        clusters.append({
            "cluster_id": int(c),
            "n": len(members),
            "agent_ids": members,
            "top_loadings": {
                statements[i] if i < len(statements) else axes[i - len(statements)]: float(centroid[i])
                for i in np.argsort(np.abs(centroid))[::-1][:8].tolist()
            },
        })
    return {
        "n": len(responses),
        "clusters": clusters,
        "pca": {
            "components": pcs.tolist(),
            "explained_variance": pca.explained_variance_ratio_.tolist(),
            "agent_ids": [r.agent_id for r in responses],
        },
    }
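One naming detail is easy to misread: `top_loadings` holds the eight largest-magnitude coordinates of each k-means centroid in the original feature space, not PCA loadings, and `pca["components"]` holds per-agent scores. Here is a dependency-free sketch of the selection step, with made-up feature names and centroid values:

```python
def top_centroid_features(centroid: list[float], names: list[str], k: int = 8) -> dict[str, float]:
    """Return the k features whose centroid coordinate is largest in magnitude."""
    order = sorted(range(len(centroid)), key=lambda i: abs(centroid[i]), reverse=True)
    return {names[i]: float(centroid[i]) for i in order[:k]}


# Illustrative values only; real centroids come from KMeans.cluster_centers_.
names = ["st_01", "st_02", "st_03", "ax_mkt_reg"]
centroid = [0.5, -2.0, 1.5, 4.0]
print(top_centroid_features(centroid, names, k=2))  # {'ax_mkt_reg': 4.0, 'st_02': -2.0}
```

The example also hints at a scaling effect: Likert axes range over 1 to 7 while placements are centred on 0, so unscaled axis values tend to rank highest by magnitude. Centring or standardising the columns before clustering would be one way to compare them on equal footing, if that turns out to matter for the typology.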

Step 5: Run test to verify it passes

Run: cd backend && uv run pytest tests/interviews/test_diversity.py -v
Expected: 2 passed.

Step 6: Commit

git add backend/scripts/instruments/diversity_v1.yaml backend/app/services/interviews/diversity.py backend/tests/interviews/test_diversity.py
git commit -m "feat(interviews): diversity subagent with Q-sort + 6 Likert axes + PCA/k-means typology"

Task 8: Delphi subagent (three rounds)

Files:

Create: backend/scripts/instruments/delphi_v1.yaml
Create: backend/app/services/interviews/delphi.py
Test: backend/tests/interviews/test_delphi.py
Step 1: Write failing test

# backend/tests/interviews/test_delphi.py
from pathlib import Path
from app.services.interviews.base import PersonaRecord, MemoryDigest
from app.services.interviews.delphi import (
    DelphiSubagent, extract_themes, convergence_metrics,
)

INSTRUMENT = Path(__file__).resolve().parents[2] / "scripts" / "instruments" / "delphi_v1.yaml"

class _Mem:
    def get_digest(self, agent_id, max_chars=2000):
        return MemoryDigest(text="x", available=True)

class _R1LLM:
    def chat_json(self, messages, temperature=0.0, max_tokens=None, **kw):
        return {"answers": {
            "q1": "Klimawandel, Quoten, Generationswechsel",
            "q2": "MSC, Aquakultur",
            "q3": "Russland, EU-Politik",
            "q4": "Verbraucherpreise",
        }}

class _R2LLM:
    def chat_json(self, messages, temperature=0.0, max_tokens=None, **kw):
        return {"ratings": {f"theme_{i}": {"importance": 4, "plausibility": 3} for i in range(5)}}

class _ExtractLLM:
    def chat_json(self, messages, temperature=0.0, max_tokens=None, **kw):
        return {"themes": [
            {"theme_id": "theme_0", "label": "Klimawandel"},
            {"theme_id": "theme_1", "label": "Quoten"},
            {"theme_id": "theme_2", "label": "MSC"},
            {"theme_id": "theme_3", "label": "EU-Politik"},
            {"theme_id": "theme_4", "label": "Generationswechsel"},
        ]}

def test_delphi_round1_open():
    sub = DelphiSubagent(llm=_R1LLM(), memory=_Mem(), instrument_path=INSTRUMENT)
    persona = PersonaRecord(agent_id=2, name="A", persona="p")
    resp = sub.administer_round1(persona)
    assert resp.round == 1
    assert len(resp.answers) == 4

def test_extract_themes_aggregates():
    from app.models.interview import DelphiOpenResponse
    r1 = [DelphiOpenResponse(agent_id=i, answers={"q1": "Klimawandel", "q2": "MSC"}) for i in range(3)]
    themes = extract_themes(r1, llm=_ExtractLLM())
    assert len(themes) == 5
    assert all("theme_id" in t for t in themes)

def test_convergence_metrics():
    from app.models.interview import DelphiRatingResponse
    r2 = [DelphiRatingResponse(agent_id=i, round=2,
            ratings={"t1": {"importance": 3, "plausibility": 3}}) for i in range(5)]
    r3 = [DelphiRatingResponse(agent_id=i, round=3,
            ratings={"t1": {"importance": 4, "plausibility": 4}}) for i in range(5)]
    conv = convergence_metrics(r2, r3)
    assert "t1" in conv
    assert conv["t1"]["delta_iqr_importance"] is not None

Step 2: Run test to verify it fails

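Run: cd backend && uv run pytest tests/interviews/test_delphi.py -v
Expected: collection fails with ImportError (ModuleNotFoundError), because app.services.interviews.delphi does not exist yet.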

Step 3: Create instrument YAML

backend/scripts/instruments/delphi_v1.yaml:

name: delphi_v1
version: "1.0"
language_default: de
rounds: 3
questions:
  - {question_id: q1, de: "Welche drei Faktoren werden die deutsche Fischerei bis 2040 am stärksten prägen?", en: "Which three factors will most shape German fisheries by 2040?"}
  - {question_id: q2, de: "Welche Akteurinnen und Akteure sind heute entscheidend, werden aber unterschätzt?", en: "Which actors are decisive today but underestimated?"}
  - {question_id: q3, de: "Was sollte sich in den nächsten fünf Jahren ändern, damit die Fischerei eine Zukunft hat?", en: "What should change in the next five years for fisheries to have a future?"}
  - {question_id: q4, de: "Welcher Trend macht Ihnen am meisten Hoffnung – und welcher am meisten Sorge?", en: "Which trend gives you most hope — and which most concern?"}
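These four open questions feed a three-round flow: open answers, theme extraction, a first rating pass, then a revision pass informed by group statistics. Below is a minimal sketch of that ordering, with each round passed in as a plain callable. `run_delphi` and its parameter names are hypothetical stand-ins, and the orchestration in the plan may differ.

```python
from typing import Callable


def run_delphi(
    personas: list[dict],
    ask_open: Callable[[dict], dict],
    extract: Callable[[list[dict]], list[dict]],
    rate: Callable[[dict, list[dict]], dict],
    summarise: Callable[[list[dict]], dict],
    revise: Callable[[dict, list[dict], dict, dict], dict],
) -> dict:
    """Run the rounds in order; every persona finishes a round before the next starts."""
    round1 = [ask_open(p) for p in personas]      # Round 1: open answers
    themes = extract(round1)                      # one shared theme list for all personas
    round2 = [rate(p, themes) for p in personas]  # Round 2: rate every theme
    stats = summarise(round2)                     # anonymised group feedback
    # Round 3: each persona sees the group stats next to its own Round 2 ratings.
    round3 = [revise(p, themes, stats, own) for p, own in zip(personas, round2)]
    return {"themes": themes, "round2": round2, "round3": round3}
```

The barrier between rounds is the point of the sketch: group statistics for Round 3 are only meaningful once every Round 2 rating is in, so the rounds cannot be interleaved per persona.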

Step 4: Implement subagent

backend/app/services/interviews/delphi.py:

from __future__ import annotations
import json
import statistics
from pathlib import Path
from typing import Optional
import yaml
from app.models.interview import (
    DelphiOpenResponse, DelphiRatingResponse,
)
from app.services.interviews.base import StakeholderInterviewer, PersonaRecord

class DelphiSubagent:
    def __init__(self, llm, memory, instrument_path: Path, language: str = "de"):
        with Path(instrument_path).open("r", encoding="utf-8") as f:
            self.instrument = yaml.safe_load(f)
        self.interviewer = StakeholderInterviewer(llm=llm, memory=memory, language=language)
        self.llm = llm
        self.language = language

    # --- Round 1: open questions ---
    def _r1_schema(self) -> str:
        return json.dumps({
            "answers": {q["question_id"]: "<string>" for q in self.instrument["questions"]}
        }, ensure_ascii=False)

    def _r1_prompt(self) -> str:
        lines = ["Bitte beantworten Sie offen:" if self.language == "de" else "Please answer openly:"]
        for q in self.instrument["questions"]:
            txt = q["de"] if self.language == "de" else q["en"]
            lines.append(f"[{q['question_id']}] {txt}")
        return "\n".join(lines)

    def _r1_validate(self, raw: dict) -> Optional[dict]:
        if not isinstance(raw, dict): return None
        ans = raw.get("answers")
        if not isinstance(ans, dict): return None
        required = {q["question_id"] for q in self.instrument["questions"]}
        if not required.issubset(ans.keys()): return None
        return raw

    def administer_round1(self, persona: PersonaRecord) -> DelphiOpenResponse:
        raw = self.interviewer.ask_in_character(
            persona, user_prompt=self._r1_prompt(),
            schema_hint=self._r1_schema(), validate=self._r1_validate,
        )
        return DelphiOpenResponse(agent_id=persona.agent_id, round=1,
                                  answers={k: str(v) for k, v in raw["answers"].items()})

    # --- Round 2: rate themes ---
    def _r2_schema(self, theme_ids: list[str]) -> str:
        return json.dumps({
            "ratings": {tid: {"importance": "<int 1-5>", "plausibility": "<int 1-5>"} for tid in theme_ids}
        }, ensure_ascii=False)

    def _r2_prompt(self, themes: list[dict]) -> str:
        head = "Bewerten Sie jedes Thema nach Wichtigkeit (1-5) und Plausibilität (1-5):" if self.language == "de" \
               else "Rate each theme on importance (1-5) and plausibility (1-5):"
        body = [f"- [{t['theme_id']}] {t['label']}" for t in themes]
        return head + "\n" + "\n".join(body)

    def _r2_validate(self, theme_ids: list[str]):
        def v(raw: dict) -> Optional[dict]:
            if not isinstance(raw, dict): return None
            ratings = raw.get("ratings", {})
            if set(ratings.keys()) != set(theme_ids): return None
            for tid, r in ratings.items():
                if not isinstance(r, dict): return None
                for key in ("importance", "plausibility"):
                    if not isinstance(r.get(key), int) or not 1 <= r[key] <= 5: return None
            return raw
        return v

    def administer_round2(self, persona: PersonaRecord, themes: list[dict]) -> DelphiRatingResponse:
        theme_ids = [t["theme_id"] for t in themes]
        raw = self.interviewer.ask_in_character(
            persona, user_prompt=self._r2_prompt(themes),
            schema_hint=self._r2_schema(theme_ids), validate=self._r2_validate(theme_ids),
        )
        return DelphiRatingResponse(agent_id=persona.agent_id, round=2,
                                    ratings={k: dict(v) for k, v in raw["ratings"].items()})

    # --- Round 3: revise after seeing group stats ---
    def administer_round3(
        self, persona: PersonaRecord, themes: list[dict], group_stats: dict, own_r2: DelphiRatingResponse
    ) -> DelphiRatingResponse:
        theme_ids = [t["theme_id"] for t in themes]
        head = ("Sie sehen unten die anonymisierten Gruppenwerte (Median, IQR). "
                "Bitte überarbeiten Sie Ihre Bewertungen, wenn Sie möchten, und begründen Sie kurz.") \
               if self.language == "de" else \
               ("Below are the anonymised group values (median, IQR). "
                "Please revise your ratings if you wish and add a short justification.")
        ctx_lines = []
        for t in themes:
            tid = t["theme_id"]
            gs = group_stats.get(tid, {})
            own = own_r2.ratings.get(tid, {})
            ctx_lines.append(
                f"[{tid}] {t['label']} — group importance median={gs.get('imp_median')}, "
                f"IQR={gs.get('imp_iqr')}; plausibility median={gs.get('plaus_median')}, "
                f"IQR={gs.get('plaus_iqr')}. Your R2: imp={own.get('importance')}, plaus={own.get('plausibility')}."
            )
        prompt = head + "\n\n" + "\n".join(ctx_lines)
        schema = json.dumps({
            "ratings": {tid: {"importance": "<int 1-5>", "plausibility": "<int 1-5>"} for tid in theme_ids},
            "justification": "<string>",
        }, ensure_ascii=False)
        # Round 3 ratings have the same shape and ranges as Round 2, so reuse that validator.
        raw = self.interviewer.ask_in_character(persona, user_prompt=prompt,
                                                schema_hint=schema,
                                                validate=self._r2_validate(theme_ids))
        return DelphiRatingResponse(
            agent_id=persona.agent_id, round=3,
            ratings={k: dict(v) for k, v in raw["ratings"].items()},
            justification=raw.get("justification"),
        )

def extract_themes(round1: list[DelphiOpenResponse], llm) -> list[dict]:
    text_blocks = []
    for r in round1:
        for qid, ans in r.answers.items():
            text_blocks.append(f"[agent {r.agent_id} {qid}] {ans}")
    schema = json.dumps({"themes": [{"theme_id": "<string>", "label": "<short string>"}]}, ensure_ascii=False)
    messages = [
        {"role": "system", "content":
            "You extract distinct thematic codes from open-ended German fisheries survey responses. "
            f"Return JSON ONLY matching: {schema}. Use stable theme_ids of form theme_0, theme_1, …"},
        {"role": "user", "content": "Responses:\n" + "\n".join(text_blocks) + "\n\nReturn up to 12 distinct themes."},
    ]
    raw = llm.chat_json(messages=messages, temperature=0.0)
    themes = raw.get("themes", []) if isinstance(raw, dict) else []
    out = []
    for i, t in enumerate(themes):
        if isinstance(t, dict) and "label" in t:
            out.append({"theme_id": t.get("theme_id") or f"theme_{i}", "label": str(t["label"])})
    return out

def _iqr(xs: list[float]) -> float:
    """Interquartile range; with fewer than four values, falls back to the full range."""
    if not xs:
        return 0.0
    xs = sorted(xs)
    if len(xs) < 4:
        # Too few points for stable quartiles: use max - min as a coarse spread.
        return xs[-1] - xs[0]
    q1, _, q3 = statistics.quantiles(xs, n=4)
    return q3 - q1

def convergence_metrics(r2: list[DelphiRatingResponse], r3: list[DelphiRatingResponse]) -> dict:
    by_r2 = {r.agent_id: r for r in r2}
    by_r3 = {r.agent_id: r for r in r3}
    themes: set[str] = set()
    for r in r2 + r3:
        themes.update(r.ratings.keys())
    out: dict[str, dict] = {}
    for t in sorted(themes):
        imp_r2 = [by_r2[a].ratings[t]["importance"] for a in by_r2 if t in by_r2[a].ratings]
        imp_r3 = [by_r3[a].ratings[t]["importance"] for a in by_r3 if t in by_r3[a].ratings]
        plaus_r2 = [by_r2[a].ratings[t]["plausibility"] for a in by_r2 if t in by_r2[a].ratings]
        plaus_r3 = [by_r3[a].ratings[t]["plausibility"] for a in by_r3 if t in by_r3[a].ratings]
        out[t] = {
            "imp_median_r2": statistics.median(imp_r2) if imp_r2 else None,
            "imp_median_r3": statistics.median(imp_r3) if imp_r3 else None,
            "imp_iqr_r2": _iqr(imp_r2),
            "imp_iqr_r3": _iqr(imp_r3),
            "delta_iqr_importance": _iqr(imp_r3) - _iqr(imp_r2),
            "plaus_iqr_r2": _iqr(plaus_r2),
            "plaus_iqr_r3": _iqr(plaus_r3),
            "delta_iqr_plausibility": _iqr(plaus_r3) - _iqr(plaus_r2),
        }
    return out

def group_stats_from_r2(r2: list[DelphiRatingResponse]) -> dict:
    themes: set[str] = set()
    for r in r2: themes.update(r.ratings.keys())
    stats: dict[str, dict] = {}
    for t in themes:
        imps = [r.ratings[t]["importance"] for r in r2 if t in r.ratings]
        plauss = [r.ratings[t]["plausibility"] for r in r2 if t in r.ratings]
        stats[t] = {
            "imp_median": statistics.median(imps) if imps else None,
            "imp_iqr": _iqr(imps),
            "plaus_median": statistics.median(plauss) if plauss else None,
            "plaus_iqr": _iqr(plauss),
        }
    return stats
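To see what `delta_iqr_importance` measures, the following sketch computes the interquartile range for one theme in two rounds using only the standard library. The ratings are invented for illustration, and `iqr` here is a simplified helper that assumes at least four ratings.

```python
import statistics


def iqr(values: list[float]) -> float:
    """Interquartile range using statistics.quantiles (default 'exclusive' method)."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    return q3 - q1


round2 = [1, 2, 3, 4, 5]  # wide spread before feedback
round3 = [2, 3, 3, 3, 4]  # ratings pulled toward the group median
delta = iqr(round3) - iqr(round2)
print(iqr(round2), iqr(round3), delta)  # 3.0 1.0 -2.0
```

A negative delta means the spread narrowed, which is read as the panel converging on that theme. For fewer than four ratings, the plan's `_iqr` falls back to the full range (max minus min), so values from very small panels are not directly comparable with a true IQR.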

Step 5: Run test to verify it passes

Run: cd backend && uv run pytest tests/interviews/test_delphi.py -v
Expected: 3 passed.

Step 6: Commit

git add backend/scripts/instruments/delphi_v1.yaml backend/app/services/interviews/delphi.py backend/tests/interviews/test_delphi.py
git commit -m "feat(interviews): Delphi subagent (3 rounds: open, rate, revise) + convergence metrics"

Task 9: Scenario subagent

Files:

Create: backend/scripts/instruments/scenario_v1.yaml
Create: backend/app/services/interviews/scenario.py
Test: backend/tests/interviews/test_scenario.py
Step 1: Write failing test

# backend/tests/interviews/test_scenario.py
from pathlib import Path
from app.services.interviews.base import PersonaRecord, MemoryDigest
from app.services.interviews.scenario import ScenarioSubagent, polarity_matrix

INSTRUMENT = Path(__file__).resolve().parents[2] / "scripts" / "instruments" / "scenario_v1.yaml"

class _Mem:
    def get_digest(self, agent_id, max_chars=2000):
        return MemoryDigest(text="x", available=True)

class _LLM:
    def chat_json(self, messages, temperature=0.0, max_tokens=None, **kw):
        return {"ratings": {sid: {
            "desirability": 4, "plausibility": 3, "impact_on_my_group": 5, "fairness": 3,
            "if_woke_up_response": f"act-on-{sid}",
        } for sid in ("S1", "S2", "S3", "S4")}}

def test_scenario_administer():
    sub = ScenarioSubagent(llm=_LLM(), memory=_Mem(), instrument_path=INSTRUMENT)
    persona = PersonaRecord(agent_id=1, name="A", persona="p")
    resp = sub.administer(persona)
    assert set(resp.ratings.keys()) == {"S1", "S2", "S3", "S4"}
    assert resp.ratings["S1"].desirability == 4

def test_polarity_matrix():
    from app.models.interview import ScenarioResponse, ScenarioRating
    responses = [ScenarioResponse(agent_id=i, ratings={
        "S1": ScenarioRating(desirability=5, plausibility=4, impact_on_my_group=5, fairness=4,
                              if_woke_up_response="x"),
    }) for i in range(3)]
    m = polarity_matrix(responses)
    assert "S1" in m
    assert m["S1"]["mean_desirability"] == 5
    assert m["S1"]["n"] == 3

Step 2: Run test to verify it fails

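Run: cd backend && uv run pytest tests/interviews/test_scenario.py -v
Expected: collection fails with ImportError (ModuleNotFoundError), because app.services.interviews.scenario does not exist yet.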

Step 3: Create instrument YAML

backend/scripts/instruments/scenario_v1.yaml:

name: scenario_v1
version: "1.0"
language_default: de
scenarios:
  - scenario_id: S1
    label_de: "Erholung 2040"
    label_en: "Recovery 2040"
    description_de: |
      Bis 2040 haben sich Dorsch- und Heringsbestände in der westlichen Ostsee
      deutlich erholt. MSC-Zertifizierung ist branchenweit Standard. Die kleine
      Küstenfischerei hat sich stabilisiert; die Politik gilt als erfolgreich.
    description_en: |
      By 2040, Western Baltic cod and herring stocks have substantially recovered.
      MSC certification is industry-wide standard. Small-scale coastal fisheries
      have stabilised; policy is regarded as successful.
  - scenario_id: S2
    label_de: "Kollaps 2040"
    label_en: "Collapse 2040"
    description_de: |
      Bis 2040 sind Dorsch- und Heringsbestände zusammengebrochen. Die Flotte
      ist halbiert, Aquakultur dominiert den Markt, Häfen veröden.
    description_en: |
      By 2040, cod and herring stocks have collapsed. The fleet is halved,
      aquaculture dominates the market, harbour towns decline.
  - scenario_id: S3
    label_de: "Festung Europa 2040"
    label_en: "Fortress Europe 2040"
    description_de: |
      Bis 2040 verfolgt die EU eine protektionistische Politik mit hohen Importzöllen,
      Meeresschutzgebiete bedecken 30% der Ostsee, Sportfischerei ist stark eingeschränkt.
    description_en: |
      By 2040, the EU pursues a protectionist policy with high import tariffs,
      MPAs cover 30% of the Baltic, recreational fishing is strongly curtailed.
  - scenario_id: S4
    label_de: "Privatisierung 2040"
    label_en: "Privatisation 2040"
    description_de: |
      Bis 2040 sind Fangrechte als handelbare Quoten (ITQs) etabliert. Die Branche
      hat sich konsolidiert; nur große, kapitalstarke Unternehmen sind übrig.
    description_en: |
      By 2040, fishing rights are tradable quotas (ITQs). The industry has
      consolidated; only large, well-capitalised firms remain.
dimensions:
  - {dimension_id: desirability, scale: 7,
     de: "Wie wünschenswert ist dieses Szenario?", en: "How desirable is this scenario?"}
  - {dimension_id: plausibility, scale: 7,
     de: "Wie plausibel ist dieses Szenario?",   en: "How plausible is this scenario?"}
  - {dimension_id: impact_on_my_group, scale: 7,
     de: "Wie stark trifft es Ihre Gruppe?",     en: "How strongly does it affect your group?"}
  - {dimension_id: fairness, scale: 7,
     de: "Wie fair ist dieses Szenario?",        en: "How fair is this scenario?"}

Step 4: Implement subagent

backend/app/services/interviews/scenario.py:

from __future__ import annotations
import json
import statistics
from pathlib import Path
from typing import Optional
import yaml
from app.models.interview import ScenarioRating, ScenarioResponse
from app.services.interviews.base import StakeholderInterviewer, PersonaRecord

class ScenarioSubagent:
    def __init__(self, llm, memory, instrument_path: Path, language: str = "de"):
        with Path(instrument_path).open("r", encoding="utf-8") as f:
            self.instrument = yaml.safe_load(f)
        self.interviewer = StakeholderInterviewer(llm=llm, memory=memory, language=language)
        self.language = language

    def _schema_hint(self) -> str:
        sids = [s["scenario_id"] for s in self.instrument["scenarios"]]
        return json.dumps({
            "ratings": {sid: {
                "desirability": "<int 1-7>",
                "plausibility": "<int 1-7>",
                "impact_on_my_group": "<int 1-7>",
                "fairness": "<int 1-7>",
                "if_woke_up_response": "<string>",
            } for sid in sids}
        }, ensure_ascii=False)

    def _user_prompt(self) -> str:
        head = ("Bewerten Sie jedes der folgenden Szenarien auf vier Dimensionen (1-7) "
                "und beantworten Sie kurz, was Sie tun würden, wenn Sie in dieser Welt aufwachten.") \
               if self.language == "de" else \
               ("Rate each of the following scenarios on four dimensions (1-7) "
                "and briefly answer what you would do if you woke up in this world.")
        blocks = []
        for s in self.instrument["scenarios"]:
            label = s["label_de"] if self.language == "de" else s["label_en"]
            desc = s["description_de"] if self.language == "de" else s["description_en"]
            blocks.append(f"--- {s['scenario_id']}: {label} ---\n{desc}")
        return head + "\n\n" + "\n\n".join(blocks)

    def _validate(self, raw: dict) -> Optional[dict]:
        if not isinstance(raw, dict):
            return None
        sids = {s["scenario_id"] for s in self.instrument["scenarios"]}
        ratings = raw.get("ratings")
        if not isinstance(ratings, dict) or set(ratings.keys()) != sids:
            return None
        for v in ratings.values():
            if not isinstance(v, dict):
                return None
            for k in ("desirability", "plausibility", "impact_on_my_group", "fairness"):
                # bool is a subclass of int, so reject it explicitly.
                if isinstance(v.get(k), bool) or not isinstance(v.get(k), int) or not 1 <= v[k] <= 7:
                    return None
            if not isinstance(v.get("if_woke_up_response", ""), str):
                return None
        return raw

    def administer(self, persona: PersonaRecord) -> ScenarioResponse:
        raw = self.interviewer.ask_in_character(
            persona, user_prompt=self._user_prompt(),
            schema_hint=self._schema_hint(), validate=self._validate,
        )
        ratings = {sid: ScenarioRating(**v) for sid, v in raw["ratings"].items()}
        return ScenarioResponse(agent_id=persona.agent_id, ratings=ratings)

def polarity_matrix(responses: list[ScenarioResponse]) -> dict:
    matrix: dict[str, dict] = {}
    sids: set[str] = set()
    for r in responses: sids.update(r.ratings.keys())
    for sid in sorted(sids):
        vals = [r.ratings[sid] for r in responses if sid in r.ratings]
        if not vals:
            matrix[sid] = {"n": 0}
            continue
        matrix[sid] = {
            "n": len(vals),
            "mean_desirability": statistics.mean(v.desirability for v in vals),
            "mean_plausibility": statistics.mean(v.plausibility for v in vals),
            "mean_impact": statistics.mean(v.impact_on_my_group for v in vals),
            "mean_fairness": statistics.mean(v.fairness for v in vals),
            "sd_desirability": statistics.pstdev([v.desirability for v in vals]) if len(vals) > 1 else 0.0,
            "sd_plausibility": statistics.pstdev([v.plausibility for v in vals]) if len(vals) > 1 else 0.0,
        }
    return matrix

Step 5: Run test to verify it passes

Run: cd backend && uv run pytest tests/interviews/test_scenario.py -v Expected: 2 passed.
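
To make the shape of the polarity matrix concrete, here is a minimal standalone sketch of the same per-scenario aggregation. It does not import the plan's modules: `Rating` is a hypothetical stand-in for `ScenarioRating` with only two of the four dimensions, and `summarize` is an illustrative name, not part of `scenario.py`. A high population SD on desirability is what flags a scenario as polarising.

```python
import statistics
from dataclasses import dataclass

@dataclass
class Rating:
    """Hypothetical stand-in for ScenarioRating (two dimensions only)."""
    desirability: int
    plausibility: int

def summarize(ratings_by_agent: list[dict[str, Rating]]) -> dict[str, dict]:
    """Per-scenario n, mean and population SD, mirroring polarity_matrix."""
    sids = sorted({sid for ratings in ratings_by_agent for sid in ratings})
    out: dict[str, dict] = {}
    for sid in sids:
        # Agents who did not rate this scenario are simply left out of n
        vals = [ratings[sid] for ratings in ratings_by_agent if sid in ratings]
        out[sid] = {
            "n": len(vals),
            "mean_desirability": statistics.mean(v.desirability for v in vals),
            "sd_desirability": statistics.pstdev([v.desirability for v in vals]),
        }
    return out

demo = summarize([
    {"S1": Rating(7, 4), "S2": Rating(2, 5)},
    {"S1": Rating(1, 4), "S2": Rating(2, 6)},
    {"S1": Rating(4, 3)},  # this agent has no rating for S2
])
```

In this toy input S1 and S2 both have unremarkable means, but only S1 has a non-zero spread, which is the signal the matrix is meant to surface.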

Step 6: Commit

git add backend/scripts/instruments/scenario_v1.yaml backend/app/services/interviews/scenario.py backend/tests/interviews/test_scenario.py
git commit -m "feat(interviews): scenario subagent with 4 futures × 4 dimensions + polarity matrix"

Phase 3 — Storage and Zep

Task 10: Interview storage layout writer

Files:

Create: backend/app/services/interviews/storage.py
Test: backend/tests/interviews/test_storage.py
Step 1: Write failing test

# backend/tests/interviews/test_storage.py
import json
from pathlib import Path
from app.models.interview import (
    LikertResponse, InterviewPhase, SubagentKind,
)
from app.services.interviews.storage import InterviewStore

def test_run_directory_layout(tmp_path):
    store = InterviewStore(root=tmp_path, sim_id="sim42")
    run_dir = store.start_run(phase=InterviewPhase.T0, subagent=SubagentKind.LONGITUDINAL)
    assert run_dir.exists()
    assert run_dir.parent.name == "longitudinal"
    assert run_dir.parent.parent.name == "T0"

def test_append_response(tmp_path):
    store = InterviewStore(root=tmp_path, sim_id="sim42")
    run_dir = store.start_run(phase=InterviewPhase.T0, subagent=SubagentKind.LONGITUDINAL)
    r = LikertResponse(agent_id=1, phase=InterviewPhase.T0,
                       responses={"a": 3}, confidence={"a": 0.5})
    store.append_response(run_dir, r)
    contents = (run_dir / "responses.jsonl").read_text()
    assert json.loads(contents.splitlines()[0])["agent_id"] == 1

def test_write_aggregate_and_latest_pointer(tmp_path):
    store = InterviewStore(root=tmp_path, sim_id="sim42")
    run_dir = store.start_run(phase=InterviewPhase.T1, subagent=SubagentKind.SCENARIO)
    store.write_aggregate(run_dir, {"k": 1})
    store.mark_latest(run_dir)
    latest = (run_dir.parent / "latest.json").read_text()
    assert json.loads(latest)["run_dir"].endswith(run_dir.name)

def test_audit_log_append(tmp_path):
    store = InterviewStore(root=tmp_path, sim_id="sim42")
    run_dir = store.start_run(phase=InterviewPhase.T0, subagent=SubagentKind.DELPHI)
    store.audit(run_dir, agent_id=7, event="schema_violation", detail="missing key x")
    audit = (run_dir / "audit.jsonl").read_text()
    assert "schema_violation" in audit

Step 2: Run test to verify it fails

Run: cd backend && uv run pytest tests/interviews/test_storage.py -v Expected: ImportError.

Step 3: Implement storage

backend/app/services/interviews/storage.py:

from __future__ import annotations
import json
import time
import uuid
from pathlib import Path
from typing import Any
from pydantic import BaseModel
from app.models.interview import InterviewPhase, SubagentKind

class InterviewStore:
    def __init__(self, root: Path, sim_id: str):
        self.base = Path(root) / "simulations" / sim_id / "interviews"
        self.base.mkdir(parents=True, exist_ok=True)

    def start_run(self, phase: InterviewPhase, subagent: SubagentKind) -> Path:
        run_id = time.strftime("%Y%m%dT%H%M%S") + "-" + uuid.uuid4().hex[:6]
        run_dir = self.base / phase.value / subagent.value / run_id
        run_dir.mkdir(parents=True, exist_ok=True)
        meta = {"run_id": run_id, "phase": phase.value, "subagent": subagent.value,
                "created_at": time.time()}
        (run_dir / "run.json").write_text(json.dumps(meta, indent=2), encoding="utf-8")
        return run_dir

    def append_response(self, run_dir: Path, model: BaseModel) -> None:
        path = run_dir / "responses.jsonl"
        with path.open("a", encoding="utf-8") as f:
            f.write(model.model_dump_json() + "\n")

    def append_jsonl(self, run_dir: Path, filename: str, payload: dict | BaseModel) -> None:
        path = run_dir / filename
        with path.open("a", encoding="utf-8") as f:
            if isinstance(payload, BaseModel):
                f.write(payload.model_dump_json() + "\n")
            else:
                f.write(json.dumps(payload, ensure_ascii=False) + "\n")

    def read_responses(self, run_dir: Path, filename: str = "responses.jsonl") -> list[dict]:
        path = run_dir / filename
        if not path.exists(): return []
        return [json.loads(line) for line in path.read_text(encoding="utf-8").splitlines() if line.strip()]

    def write_aggregate(self, run_dir: Path, payload: dict) -> None:
        (run_dir / "aggregate.json").write_text(
            json.dumps(payload, ensure_ascii=False, indent=2), encoding="utf-8")

    def write_named(self, run_dir: Path, name: str, payload: Any) -> None:
        (run_dir / name).write_text(
            json.dumps(payload, ensure_ascii=False, indent=2), encoding="utf-8")

    def audit(self, run_dir: Path, agent_id: int | None, event: str, detail: str = "") -> None:
        entry = {"ts": time.time(), "agent_id": agent_id, "event": event, "detail": detail}
        with (run_dir / "audit.jsonl").open("a", encoding="utf-8") as f:
            f.write(json.dumps(entry, ensure_ascii=False) + "\n")

    def mark_latest(self, run_dir: Path) -> None:
        pointer = run_dir.parent / "latest.json"
        # Store a POSIX-style relative path so the pointer reads the same on every platform
        pointer.write_text(json.dumps({
            "run_dir": run_dir.relative_to(self.base).as_posix(),
        }), encoding="utf-8")

    def latest_run(self, phase: InterviewPhase, subagent: SubagentKind) -> Path | None:
        pointer = self.base / phase.value / subagent.value / "latest.json"
        if not pointer.exists(): return None
        rel = json.loads(pointer.read_text(encoding="utf-8"))["run_dir"]
        path = self.base / rel
        return path if path.exists() else None

Step 4: Run test to verify it passes

Run: cd backend && uv run pytest tests/interviews/test_storage.py -v Expected: 4 passed.
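
The layout above is `<root>/simulations/<sim_id>/interviews/<phase>/<subagent>/<run_id>/`, with a `latest.json` pointer next to the run directories. As a rough sketch of how a consumer might follow that pointer without importing `InterviewStore`, the snippet below rebuilds one run directory by hand. The function name `resolve_latest` and the run id are made up for illustration.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional

def resolve_latest(base: Path, phase: str, subagent: str) -> Optional[Path]:
    """Follow <base>/<phase>/<subagent>/latest.json to its run directory, if any."""
    pointer = base / phase / subagent / "latest.json"
    if not pointer.exists():
        return None
    rel = json.loads(pointer.read_text(encoding="utf-8"))["run_dir"]
    run_dir = base / rel
    return run_dir if run_dir.exists() else None

with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp) / "simulations" / "sim42" / "interviews"
    # Hypothetical run id in the timestamp-plus-suffix shape used by start_run
    run_dir = base / "T1" / "scenario" / "20260523T101500-ab12cd"
    run_dir.mkdir(parents=True)
    (run_dir.parent / "latest.json").write_text(
        json.dumps({"run_dir": run_dir.relative_to(base).as_posix()}),
        encoding="utf-8",
    )
    found = resolve_latest(base, "T1", "scenario")
    found_name = found.name if found else None
    missing = resolve_latest(base, "T0", "delphi")  # no run recorded here
```

Because the pointer holds a path relative to the interviews base, the whole simulation directory can be moved or archived without invalidating it.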

Step 5: Commit

git add backend/app/services/interviews/storage.py backend/tests/interviews/test_storage.py
git commit -m "feat(interviews): JSONL/JSON storage layout with run_id directories and latest pointer"

Task 11: Zep episode writer for interviews

Files:

Create: backend/app/services/interviews/zep_writer.py
Test: backend/tests/interviews/test_zep_writer.py
Step 1: Write failing test

# backend/tests/interviews/test_zep_writer.py
from app.models.interview import (
    LikertResponse, InterviewPhase, SubagentKind,
)
from app.services.interviews.zep_writer import InterviewZepWriter

class _FakeMemoryUpdater:
    def __init__(self):
        self.events = []
    def add_activity(self, activity):
        self.events.append(activity)
    def add_text_episode(self, graph_id, text):
        self.events.append({"graph_id": graph_id, "text": text})

def test_per_agent_episode_text():
    upd = _FakeMemoryUpdater()
    w = InterviewZepWriter(memory_updater=upd, graph_id="g1")
    r = LikertResponse(agent_id=42, phase=InterviewPhase.T1,
                       responses={"stk_1": 4, "gov_1": 3},
                       confidence={"stk_1": 0.8, "gov_1": 0.7})
    w.write_per_agent(SubagentKind.LONGITUDINAL, r, agent_name="Fischer Müller")
    assert any("Fischer Müller" in str(e) for e in upd.events)
    assert any("longitudinal/T1" in str(e) for e in upd.events)

def test_aggregate_episode():
    upd = _FakeMemoryUpdater()
    w = InterviewZepWriter(memory_updater=upd, graph_id="g1")
    w.write_aggregate(SubagentKind.SCENARIO, summary="S1 mean desirability 5.2; S2 mean 2.1")
    assert any("S1 mean" in str(e) for e in upd.events)

Step 2: Run test to verify it fails

Run: cd backend && uv run pytest tests/interviews/test_zep_writer.py -v Expected: ImportError.

Step 3: Implement writer

backend/app/services/interviews/zep_writer.py:

from __future__ import annotations
from typing import Any, Optional
from app.models.interview import (
    LikertResponse, QSortResponse, DelphiRatingResponse, ScenarioResponse, SubagentKind,
)

class InterviewZepWriter:
    """Mirrors `ZepGraphMemoryUpdater.add_activity` usage but for interview episodes.

    The real `ZepGraphMemoryUpdater` may expose `add_activity` or a lower-level
    text-episode method; `_emit` duck-types the updater, trying `add_text_episode`
    first and falling back to `add_activity`.
    """
    def __init__(self, memory_updater, graph_id: str):
        self.updater = memory_updater
        self.graph_id = graph_id

    def _emit(self, text: str) -> None:
        if hasattr(self.updater, "add_text_episode"):
            self.updater.add_text_episode(self.graph_id, text)
        elif hasattr(self.updater, "add_activity"):
            self.updater.add_activity({"graph_id": self.graph_id, "text": text})
        else:
            raise RuntimeError("memory_updater has neither add_text_episode nor add_activity")

    def _summarize_likert(self, r: LikertResponse) -> str:
        mean_v = sum(r.responses.values()) / max(len(r.responses), 1)
        # Items are ranked by score, so these are relative extremes, not absolute agreement
        top = sorted(r.responses.items(), key=lambda kv: -kv[1])[:3]
        bot = sorted(r.responses.items(), key=lambda kv: kv[1])[:3]
        return (f"mean={mean_v:.2f}; highest-rated {[k for k,_ in top]}; "
                f"lowest-rated {[k for k,_ in bot]}")

    def _summarize_qsort(self, r: QSortResponse) -> str:
        plus = [k for k, v in r.placements.items() if v >= 2]
        minus = [k for k, v in r.placements.items() if v <= -2]
        return f"+strongly:{plus}; -strongly:{minus}"

    def _summarize_scenario(self, r: ScenarioResponse) -> str:
        parts = [f"{sid}: des={rt.desirability} plaus={rt.plausibility}"
                 for sid, rt in r.ratings.items()]
        return "; ".join(parts)

    def write_per_agent(
        self, subagent: SubagentKind, response: Any, agent_name: str,
        phase: Optional[str] = None,
    ) -> None:
        if isinstance(response, LikertResponse):
            phase = phase or response.phase.value
            summary = self._summarize_likert(response)
        elif isinstance(response, QSortResponse):
            phase = phase or "T1"
            summary = self._summarize_qsort(response)
        elif isinstance(response, ScenarioResponse):
            phase = phase or "T1"
            summary = self._summarize_scenario(response)
        elif isinstance(response, DelphiRatingResponse):
            phase = phase or f"T1/R{response.round}"
            summary = f"round={response.round}; {len(response.ratings)} themes rated"
        else:
            phase = phase or "T1"
            summary = str(response)[:200]
        text = f"Agent {agent_name} (interview/{subagent.value}/{phase}): {summary}"
        self._emit(text)

    def write_aggregate(self, subagent: SubagentKind, summary: str) -> None:
        self._emit(f"Interview aggregate ({subagent.value}): {summary}")

Step 4: Run test to verify it passes

Run: cd backend && uv run pytest tests/interviews/test_zep_writer.py -v Expected: 2 passed.
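
The duck-typed dispatch in `_emit` can be exercised in isolation. The sketch below is illustrative only: `TextEpisodeUpdater` and `ActivityUpdater` are hypothetical fakes, not the real `ZepGraphMemoryUpdater`, and `emit` is a free-function copy of the method's logic.

```python
class TextEpisodeUpdater:
    """Hypothetical updater exposing only the low-level text method."""
    def __init__(self):
        self.calls = []

    def add_text_episode(self, graph_id, text):
        self.calls.append(("text", graph_id, text))

class ActivityUpdater:
    """Hypothetical updater exposing only add_activity."""
    def __init__(self):
        self.calls = []

    def add_activity(self, activity):
        self.calls.append(("activity", activity))

def emit(updater, graph_id: str, text: str) -> None:
    """Send text to whichever method the updater offers, text-episode first."""
    if hasattr(updater, "add_text_episode"):
        updater.add_text_episode(graph_id, text)
    elif hasattr(updater, "add_activity"):
        updater.add_activity({"graph_id": graph_id, "text": text})
    else:
        raise RuntimeError("memory_updater has neither add_text_episode nor add_activity")

text_updater, activity_updater = TextEpisodeUpdater(), ActivityUpdater()
message = "Interview aggregate (scenario): S1 mean desirability 5.2"
emit(text_updater, "g1", message)
emit(activity_updater, "g1", message)
```

An object with neither method raises `RuntimeError`, so a misconfigured updater fails on the first write rather than dropping episodes silently.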

Step 5: Commit

git add backend/app/services/interviews/zep_writer.py backend/tests/interviews/test_zep_writer.py
git commit -m "feat(interviews): Zep writer adapts add_activity/add_text_episode for per-agent + aggregate episodes"

Phase 4 — Orchestrator, lifecycle, synthesiser

Task 12: InterviewOrchestrator (parallel fan-out)

Files:

Create: backend/app/services/interview_orchestrator.py
Test: backend/tests/interviews/test_orchestrator.py
Step 1: Write failing test

# backend/tests/interviews/test_orchestrator.py
from pathlib import Path
import pytest
from app.models.interview import InterviewPhase, SubagentKind
from app.services.interviews.base import PersonaRecord, MemoryDigest
from app.services.interview_orchestrator import (
    InterviewOrchestrator, PersonaProvider,
)

INST_DIR = Path(__file__).resolve().parents[2] / "scripts" / "instruments"

class _Mem:
    def get_digest(self, agent_id, max_chars=2000):
        return MemoryDigest(text="x", available=True)

class _LLM:
    def chat_json(self, messages, temperature=0.0, max_tokens=None, **kw):
        sys_text = next((m["content"] for m in messages if m["role"] == "system"), "")
        if "longitudinal" in sys_text or "stk_" in (messages[-1].get("content") or ""):
            return {
                "responses": {k: 3 for k in ("stk_1","stk_2","stk_3","gov_1","gov_2","gov_3",
                                             "mkt_1","mkt_2","mkt_3","clm_1","clm_2","clm_3")},
                "confidence": {}, "open_comment": "ok",
            }
        return {}

class _Personas(PersonaProvider):
    def __init__(self, n=3):
        self._items = [PersonaRecord(agent_id=i, name=f"A{i}", persona="p") for i in range(n)]
    def all(self): return list(self._items)

class _NoopZep:
    def write_per_agent(self, *a, **kw): pass
    def write_aggregate(self, *a, **kw): pass

def test_pre_phase_runs_longitudinal_only(tmp_path):
    orch = InterviewOrchestrator(
        llm=_LLM(), memory=_Mem(), personas=_Personas(3),
        instrument_dir=INST_DIR, store_root=tmp_path, sim_id="sim1",
        zep_writer=_NoopZep(), max_workers=2,
    )
    result = orch.run_pre()
    assert result["longitudinal"]["n_responded"] == 3
    assert "diversity" not in result  # only longitudinal in pre-phase

def test_partial_failure_does_not_kill_run(tmp_path):
    class _FlakyLLM:
        def __init__(self): self.n = 0
        def chat_json(self, messages, temperature=0.0, max_tokens=None, **kw):
            self.n += 1
            if self.n % 2 == 0:
                raise RuntimeError("simulated LLM 5xx")
            return {
                "responses": {k: 3 for k in ("stk_1","stk_2","stk_3","gov_1","gov_2","gov_3",
                                             "mkt_1","mkt_2","mkt_3","clm_1","clm_2","clm_3")},
                "confidence": {}, "open_comment": "ok",
            }
    orch = InterviewOrchestrator(
        llm=_FlakyLLM(), memory=_Mem(), personas=_Personas(4),
        instrument_dir=INST_DIR, store_root=tmp_path, sim_id="sim2",
        zep_writer=_NoopZep(), max_workers=1,
    )
    result = orch.run_pre()
    assert result["longitudinal"]["n_responded"] < 4
    assert result["longitudinal"]["n_failed"] > 0

Step 2: Run test to verify it fails

Run: cd backend && uv run pytest tests/interviews/test_orchestrator.py -v Expected: ImportError.

Step 3: Implement orchestrator

backend/app/services/interview_orchestrator.py:

from __future__ import annotations
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
from typing import Protocol
from app.models.interview import (
    InterviewPhase, SubagentKind, LikertResponse, QSortResponse,
    DelphiOpenResponse, DelphiRatingResponse, ScenarioResponse,
)
from app.services.interviews.base import PersonaRecord
from app.services.interviews.longitudinal import LongitudinalSubagent, run_aggregate as longitudinal_aggregate
from app.services.interviews.diversity import DiversitySubagent, run_typology
from app.services.interviews.delphi import (
    DelphiSubagent, extract_themes, convergence_metrics, group_stats_from_r2,
)
from app.services.interviews.scenario import ScenarioSubagent, polarity_matrix
from app.services.interviews.storage import InterviewStore
from app.services.interviews.instrument_loader import freeze_snapshot

class PersonaProvider(Protocol):
    def all(self) -> list[PersonaRecord]: ...

class InterviewOrchestrator:
    def __init__(
        self, llm, memory, personas: PersonaProvider,
        instrument_dir: Path, store_root: Path, sim_id: str,
        zep_writer, max_workers: int = 8, language: str = "de",
    ):
        self.llm = llm
        self.memory = memory
        self.personas = personas
        self.instrument_dir = Path(instrument_dir)
        self.store = InterviewStore(root=store_root, sim_id=sim_id)
        self.zep_writer = zep_writer
        self.max_workers = max_workers
        self.language = language
        # Freeze snapshot once per orchestrator lifetime
        freeze_snapshot(
            instruments={
                "longitudinal": self.instrument_dir / "longitudinal_v1.yaml",
                "diversity":    self.instrument_dir / "diversity_v1.yaml",
                "delphi":       self.instrument_dir / "delphi_v1.yaml",
                "scenario":     self.instrument_dir / "scenario_v1.yaml",
            },
            out_path=self.store.base / "instruments_used.json",
        )

    # --- Generic per-agent runner ---
    def _fan_out(self, run_dir, agent_fn, personas, audit_label):
        """Run agent_fn per persona in a thread pool; one agent's failure never aborts the rest."""
        ok: list = []
        failed: list[int] = []
        with ThreadPoolExecutor(max_workers=self.max_workers) as pool:
            futures = {pool.submit(agent_fn, p): p for p in personas}
            for fut in as_completed(futures):
                p = futures[fut]
                try:
                    out = fut.result()
                    # Persist before counting success, so a storage error cannot
                    # leave the same agent in both ok and failed
                    self.store.append_response(run_dir, out)
                    ok.append(out)
                except Exception as e:
                    failed.append(p.agent_id)
                    self.store.audit(run_dir, agent_id=p.agent_id,
                                     event="agent_failed", detail=f"{audit_label}: {e!r}")
        return ok, failed

    # --- Pre-phase (T0) ---
    def run_pre(self) -> dict:
        personas = self.personas.all()  # fetch once; reused for the fan-out and the name lookup
        names = {p.agent_id: p.name for p in personas}
        sub = LongitudinalSubagent(self.llm, self.memory,
                                   self.instrument_dir / "longitudinal_v1.yaml",
                                   language=self.language)
        run_dir = self.store.start_run(InterviewPhase.T0, SubagentKind.LONGITUDINAL)
        ok, failed = self._fan_out(
            run_dir, lambda p: sub.administer(p, phase=InterviewPhase.T0),
            personas, audit_label="longitudinal_T0",
        )
        for r in ok:
            # Zep writes are best-effort; a failure here must not fail the interview run
            try: self.zep_writer.write_per_agent(SubagentKind.LONGITUDINAL, r, names[r.agent_id])
            except Exception: pass
        self.store.mark_latest(run_dir)
        return {"longitudinal": {"n_responded": len(ok), "n_failed": len(failed),
                                 "run_dir": str(run_dir)}}

    # --- Post-phase (T1) ---
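    def run_post(self) -> dict:
        personas = self.personas.all()
        out: dict = {}
        # Three subagents run concurrently; each opens its own pool in _fan_out,
        # so peak concurrency is up to 3 × self.max_workers LLM calls.
        with ThreadPoolExecutor(max_workers=3) as pool: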
            futures = {
                "longitudinal": pool.submit(self._post_longitudinal, personas),
                "diversity":    pool.submit(self._post_diversity, personas),
                "scenario":     pool.submit(self._post_scenario, personas),
            }
            for name, fut in futures.items():
                try: out[name] = fut.result()
                except Exception as e: out[name] = {"error": repr(e)}
        # Delphi runs sequentially (R1 → R2 → R3) and uses the LLM for theme extraction
        try: out["delphi"] = self._post_delphi(personas)
        except Exception as e: out["delphi"] = {"error": repr(e)}
        return out

    def _post_longitudinal(self, personas) -> dict:
        sub = LongitudinalSubagent(self.llm, self.memory,
                                   self.instrument_dir / "longitudinal_v1.yaml",
                                   language=self.language)
        run_dir = self.store.start_run(InterviewPhase.T1, SubagentKind.LONGITUDINAL)
        ok, failed = self._fan_out(
            run_dir, lambda p: sub.administer(p, phase=InterviewPhase.T1),
            personas, audit_label="longitudinal_T1",
        )
        # Aggregate using T0 + T1
        t0_path = self.store.latest_run(InterviewPhase.T0, SubagentKind.LONGITUDINAL)
        t0_raw = self.store.read_responses(t0_path) if t0_path else []
        t0 = [LikertResponse(**d) for d in t0_raw]
        agg = longitudinal_aggregate(t0, ok)
        self.store.write_aggregate(run_dir, agg)
        for r in ok:
            persona = next(p for p in personas if p.agent_id == r.agent_id)
            try: self.zep_writer.write_per_agent(SubagentKind.LONGITUDINAL, r, persona.name)
            except Exception: pass
        try: self.zep_writer.write_aggregate(SubagentKind.LONGITUDINAL,
                                             f"n_paired={agg['n_paired']}")
        except Exception: pass
        self.store.mark_latest(run_dir)
        return {"n_responded": len(ok), "n_failed": len(failed), "run_dir": str(run_dir)}

    def _post_diversity(self, personas) -> dict:
        sub = DiversitySubagent(self.llm, self.memory,
                                self.instrument_dir / "diversity_v1.yaml",
                                language=self.language)
        run_dir = self.store.start_run(InterviewPhase.T1, SubagentKind.DIVERSITY)
        ok, failed = self._fan_out(
            run_dir, lambda p: sub.administer(p), personas, audit_label="diversity",
        )
        typology = run_typology(ok)
        self.store.write_named(run_dir, "typology.json", typology)
        self.store.write_aggregate(run_dir, {"n": len(ok), "n_failed": len(failed),
                                             "clusters": typology["clusters"]})
        for r in ok:
            persona = next(p for p in personas if p.agent_id == r.agent_id)
            try: self.zep_writer.write_per_agent(SubagentKind.DIVERSITY, r, persona.name)
            except Exception: pass
        self.store.mark_latest(run_dir)
        return {"n_responded": len(ok), "n_failed": len(failed), "run_dir": str(run_dir)}

    def _post_scenario(self, personas) -> dict:
        sub = ScenarioSubagent(self.llm, self.memory,
                               self.instrument_dir / "scenario_v1.yaml",
                               language=self.language)
        run_dir = self.store.start_run(InterviewPhase.T1, SubagentKind.SCENARIO)
        ok, failed = self._fan_out(
            run_dir, lambda p: sub.administer(p), personas, audit_label="scenario",
        )
        matrix = polarity_matrix(ok)
        self.store.write_named(run_dir, "polarity_matrix.json", matrix)
        self.store.write_aggregate(run_dir, {"n": len(ok), "n_failed": len(failed),
                                             "polarity": matrix})
        for r in ok:
            persona = next(p for p in personas if p.agent_id == r.agent_id)
            try: self.zep_writer.write_per_agent(SubagentKind.SCENARIO, r, persona.name)
            except Exception: pass
        self.store.mark_latest(run_dir)
        return {"n_responded": len(ok), "n_failed": len(failed), "run_dir": str(run_dir)}

    def _post_delphi(self, personas) -> dict:
        sub = DelphiSubagent(self.llm, self.memory,
                             self.instrument_dir / "delphi_v1.yaml",
                             language=self.language)
        run_dir = self.store.start_run(InterviewPhase.T1, SubagentKind.DELPHI)
        # Round 1
        r1_ok, r1_failed = self._fan_out(
            run_dir, lambda p: sub.administer_round1(p), personas, audit_label="delphi_r1",
        )
        # Copy R1 responses into a dedicated file (_fan_out has already appended them to responses.jsonl)
        for r in r1_ok: self.store.append_jsonl(run_dir, "round1_themes.jsonl", r)
        # Extract themes from R1
        themes = extract_themes(r1_ok, llm=self.llm)
        self.store.write_named(run_dir, "themes.json", {"themes": themes})
        # Round 2: only agents who answered Round 1
        r1_ids = {r.agent_id for r in r1_ok}
        r2_ok, r2_failed = self._fan_out(
            run_dir, lambda p: sub.administer_round2(p, themes),
            [p for p in personas if p.agent_id in r1_ids],
            audit_label="delphi_r2",
        )
        for r in r2_ok: self.store.append_jsonl(run_dir, "round2_ratings.jsonl", r)
        gstats = group_stats_from_r2(r2_ok)
        # Round 3
        r2_by = {r.agent_id: r for r in r2_ok}
        r3_personas = [p for p in personas if p.agent_id in r2_by]
        def r3_call(p): return sub.administer_round3(p, themes, gstats, r2_by[p.agent_id])
        r3_ok, r3_failed = self._fan_out(run_dir, r3_call, r3_personas, audit_label="delphi_r3")
        for r in r3_ok: self.store.append_jsonl(run_dir, "round3_revisions.jsonl", r)
        # Convergence
        conv = convergence_metrics(r2_ok, r3_ok)
        self.store.write_named(run_dir, "convergence.json", conv)
        self.store.write_aggregate(run_dir, {
            "n_r1": len(r1_ok), "n_r2": len(r2_ok), "n_r3": len(r3_ok),
            "n_failed_r1": len(r1_failed), "n_failed_r2": len(r2_failed), "n_failed_r3": len(r3_failed),
            "themes": themes,
        })
        for r in r3_ok:
            persona = next(p for p in personas if p.agent_id == r.agent_id)
            try: self.zep_writer.write_per_agent(SubagentKind.DELPHI, r, persona.name)
            except Exception: pass
        self.store.mark_latest(run_dir)
        return {"n_r1": len(r1_ok), "n_r2": len(r2_ok), "n_r3": len(r3_ok),
                "run_dir": str(run_dir)}

    # --- Re-run a single subagent ---
    def rerun(self, subagent: SubagentKind) -> dict:
        personas = self.personas.all()
        if subagent == SubagentKind.LONGITUDINAL: return {"longitudinal": self._post_longitudinal(personas)}
        if subagent == SubagentKind.DIVERSITY:    return {"diversity":    self._post_diversity(personas)}
        if subagent == SubagentKind.SCENARIO:     return {"scenario":     self._post_scenario(personas)}
        if subagent == SubagentKind.DELPHI:       return {"delphi":       self._post_delphi(personas)}
        raise ValueError(f"unknown subagent {subagent}")

Step 4: Run test to verify it passes

Run: cd backend && uv run pytest tests/interviews/test_orchestrator.py -v Expected: 2 passed.
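
The failure isolation that `test_partial_failure_does_not_kill_run` checks comes down to one pattern: catch exceptions per future, not around the whole pool. A minimal standalone sketch of that pattern follows. The names `fan_out` and `flaky` are illustrative, and the real `_fan_out` additionally persists responses and audit entries.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def fan_out(fn, items, max_workers: int = 4):
    """Apply fn to every item in parallel; collect results and failures separately."""
    ok, failed = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fn, item): item for item in items}
        for fut in as_completed(futures):
            item = futures[fut]
            try:
                # result() re-raises whatever fn raised in the worker thread
                ok.append(fut.result())
            except Exception as exc:
                failed.append((item, repr(exc)))
    return ok, failed

def flaky(agent_id: int) -> dict:
    """Stand-in for an interview call that fails for even agent ids."""
    if agent_id % 2 == 0:
        raise RuntimeError("simulated LLM 5xx")
    return {"agent_id": agent_id}

ok, failed = fan_out(flaky, range(5))
ok_ids = sorted(r["agent_id"] for r in ok)
failed_ids = sorted(item for item, _ in failed)
```

Results arrive in completion order, not submission order, which is why the sketch sorts before comparing and why the orchestrator looks personas up by `agent_id` afterwards.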

Step 5: Commit

git add backend/app/services/interview_orchestrator.py backend/tests/interviews/test_orchestrator.py
git commit -m "feat(interviews): orchestrator with two-phase lifecycle, parallel fan-out, isolated failures"

Task 13: Simulation manager lifecycle hooks

Files:

Modify: backend/app/services/simulation_manager.py
Test: backend/tests/interviews/test_simulation_hooks.py
Step 1: Write failing test

# backend/tests/interviews/test_simulation_hooks.py
from app.services.simulation_manager import SimulationManager, SimulationState

def test_register_post_ready_hook_invoked():
    called = []
    mgr = SimulationManager()
    mgr.register_on_ready(lambda state: called.append(("ready", state.sim_id)))
    state = SimulationState(sim_id="abc", status="ready")
    mgr._notify_on_ready(state)
    assert called == [("ready", "abc")]

def test_register_post_completed_hook_invoked():
    called = []
    mgr = SimulationManager()
    mgr.register_on_completed(lambda state: called.append(("done", state.sim_id)))
    state = SimulationState(sim_id="abc", status="completed")
    mgr._notify_on_completed(state)
    assert called == [("done", "abc")]

Step 2: Run test to verify it fails

Run: cd backend && uv run pytest tests/interviews/test_simulation_hooks.py -v Expected: AttributeError on register_on_ready / register_on_completed.

Step 3: Add hook registry to SimulationManager

In backend/app/services/simulation_manager.py, find the SimulationManager class. Add to __init__ (preserving existing init):

        self._on_ready_hooks: list = []
        self._on_completed_hooks: list = []

Add methods to the class:

    def register_on_ready(self, fn) -> None:
        self._on_ready_hooks.append(fn)

    def register_on_completed(self, fn) -> None:
        self._on_completed_hooks.append(fn)

    def _notify_on_ready(self, state) -> None:
        for fn in list(self._on_ready_hooks):
            try: fn(state)
            except Exception as e:
                from app.utils.logger import get_logger
                get_logger(__name__).warning(f"on_ready hook failed: {e!r}")

    def _notify_on_completed(self, state) -> None:
        for fn in list(self._on_completed_hooks):
            try: fn(state)
            except Exception as e:
                from app.utils.logger import get_logger
                get_logger(__name__).warning(f"on_completed hook failed: {e!r}")

Locate the existing code that transitions state to ready (after prepare_simulation completes) and to completed (after simulation finishes). Insert calls to self._notify_on_ready(state) and self._notify_on_completed(state) immediately after each transition. If SimulationState is not a simple dataclass with sim_id and status attributes, adjust the test fixture to match the actual class shape (read the file first).
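
As a rough illustration of where the notify call sits relative to the state transition, here is a self-contained sketch. `MiniManager` and its `finish` method are hypothetical stand-ins; the real `SimulationManager` transition code will look different and must be read first. The sketch also shows that one failing hook does not stop later hooks.

```python
import logging

log = logging.getLogger("hooks_sketch")

class MiniManager:
    """Hypothetical stand-in for SimulationManager, reduced to the hook registry."""
    def __init__(self):
        self._on_completed_hooks = []
        self.status = {}

    def register_on_completed(self, fn) -> None:
        self._on_completed_hooks.append(fn)

    def _notify_on_completed(self, sim_id: str) -> None:
        # Iterate over a copy so a hook may register further hooks safely
        for fn in list(self._on_completed_hooks):
            try:
                fn(sim_id)
            except Exception as exc:
                log.warning("on_completed hook failed: %r", exc)

    def finish(self, sim_id: str) -> None:
        """Hypothetical transition: set the state first, then notify."""
        self.status[sim_id] = "completed"
        self._notify_on_completed(sim_id)

def broken_hook(sim_id: str) -> None:
    raise ValueError("hook blew up")

seen = []
mgr = MiniManager()
mgr.register_on_completed(broken_hook)
mgr.register_on_completed(seen.append)
mgr.finish("abc")
```

Notifying after the status change means a hook that inspects the simulation state sees the new status, which is presumably what an interview trigger would rely on.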

Step 4: Run test to verify it passes

Run: cd backend && uv run pytest tests/interviews/test_simulation_hooks.py -v Expected: 2 passed.

Step 5: Commit

git add backend/app/services/simulation_manager.py backend/tests/interviews/test_simulation_hooks.py
git commit -m "feat(interviews): on_ready / on_completed hook registry on SimulationManager"

Task 14: InterviewSynthesizer

Files:

Create: backend/app/services/interview_synthesizer.py
Test: backend/tests/interviews/test_synthesizer.py
Step 1: Write failing test

# backend/tests/interviews/test_synthesizer.py
import json
from pathlib import Path
from app.services.interviews.storage import InterviewStore
from app.models.interview import InterviewPhase, SubagentKind, LikertResponse
from app.services.interview_synthesizer import InterviewSynthesizer

def _seed_minimal(tmp_path: Path) -> InterviewStore:
    store = InterviewStore(root=tmp_path, sim_id="s1")
    rd = store.start_run(InterviewPhase.T0, SubagentKind.LONGITUDINAL)
    for i in range(3):
        store.append_response(rd, LikertResponse(
            agent_id=i, phase=InterviewPhase.T0,
            responses={"stk_1": 3, "gov_1": 3}, confidence={"stk_1": 0.5, "gov_1": 0.5},
        ))
    store.write_aggregate(rd, {"per_item": {}, "n_paired": 0})
    store.mark_latest(rd)
    return store

def test_synthesizer_runs_with_partial_data(tmp_path):
    store = _seed_minimal(tmp_path)
    synth = InterviewSynthesizer(store=store)
    report = synth.run()
    assert "limitations" in report.lower()
    assert "stub mode" in report.lower() or "n_responded" in report.lower()

def test_synthesizer_writes_files(tmp_path):
    store = _seed_minimal(tmp_path)
    synth = InterviewSynthesizer(store=store)
    synth.run()
    files = list((store.base / "synthesis").iterdir())
    names = {f.name for f in files}
    assert "report.md" in names

Step 2: Run test to verify it fails

Run: cd backend && uv run pytest tests/interviews/test_synthesizer.py -v Expected: ImportError.

Step 3: Implement synthesiser

backend/app/services/interview_synthesizer.py:

from __future__ import annotations
import csv
import json
from pathlib import Path
from app.models.interview import InterviewPhase, SubagentKind
from app.services.interviews.storage import InterviewStore

class InterviewSynthesizer:
    def __init__(self, store: InterviewStore):
        self.store = store

    def _maybe(self, phase: InterviewPhase, sub: SubagentKind) -> dict | None:
        run = self.store.latest_run(phase, sub)
        if run is None: return None
        agg = run / "aggregate.json"
        if not agg.exists(): return None
        return {"run_dir": str(run), "aggregate": json.loads(agg.read_text(encoding="utf-8"))}

    def _instrument_hashes(self) -> dict:
        snap = self.store.base / "instruments_used.json"
        if not snap.exists(): return {}
        try: data = json.loads(snap.read_text(encoding="utf-8"))
        except Exception: return {}
        return {k: v.get("hash") for k, v in data.items()}

    def _limitations_text(self, present: dict[str, bool]) -> str:
        lines = [
            "## Limitations",
            "- **Simulated, not real stakeholders.** Responses reflect how the seed-document discourse "
            "and the LLM jointly encode each stakeholder type, not what an actual fisher or NGO "
            "staffer would say. The instrument measures the *model of the stakeholder*, not the stakeholder.",
            "- **Memory digest is lossy.** Each agent's experience of OASIS is summarised to bounded length; "
            "agents do not have full episodic recall.",
            "- **LLM acquiescence and centrality bias.** Likert scales with LLM respondents skew toward 3–4 "
            "of 5; check per-item distribution shape before drawing conclusions.",
            "- **N is what it is.** `n_responded` and `n_failed` are printed verbatim per subagent; no smoothing.",
            "- **Instrument provenance.** Hashes of frozen instruments are listed below; an identical run "
            "is reproducible from these snapshots.",
        ]
        for k, ok in present.items():
            if not ok:
                lines.append(f"- *{k}* subagent results are missing for this run.")
        return "\n".join(lines)

    def run(self) -> str:
        sections: list[str] = []
        sections.append("# Stakeholder Interview Synthesis\n")

        long_t0 = self._maybe(InterviewPhase.T0, SubagentKind.LONGITUDINAL)
        long_t1 = self._maybe(InterviewPhase.T1, SubagentKind.LONGITUDINAL)
        if long_t1:
            agg = long_t1["aggregate"]
            sections.append("## Longitudinal opinion drift (T0 → T1)")
            sections.append(f"- N paired: {agg.get('n_paired', 'NA')}")
            per_item = agg.get("per_item", {})
            top = sorted(per_item.items(),
                         key=lambda kv: abs(kv[1].get("mean_delta") or 0), reverse=True)[:5]
            sections.append("- Largest mean shifts:")
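            for k, v in top:
                # mean_delta may be missing or None (the sort key above already allows for that)
                delta = v.get("mean_delta")
                delta_txt = f"{delta:+0.2f}" if delta is not None else "NA"
                sections.append(f"  - `{k}`: Δ̄ = {delta_txt}  (n={v.get('n')})")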

        diversity = self._maybe(InterviewPhase.T1, SubagentKind.DIVERSITY)
        if diversity:
            clusters = diversity["aggregate"].get("clusters", [])
            sections.append("## Stakeholder typology")
            sections.append(f"- N agents: {diversity['aggregate'].get('n', 'NA')}")
            sections.append(f"- Clusters: {len(clusters)}")
            for c in clusters:
                sections.append(f"  - cluster {c['cluster_id']}: n={c['n']}, "
                                f"top loadings = {list(c['top_loadings'].keys())[:5]}")

        delphi = self._maybe(InterviewPhase.T1, SubagentKind.DELPHI)
        if delphi:
            agg = delphi["aggregate"]
            sections.append("## Delphi consensus")
            sections.append(f"- Rounds completed: R1={agg.get('n_r1')}, R2={agg.get('n_r2')}, R3={agg.get('n_r3')}")
            themes = agg.get("themes", [])
            sections.append(f"- Themes: {[t.get('label') for t in themes]}")

        scenario = self._maybe(InterviewPhase.T1, SubagentKind.SCENARIO)
        if scenario:
            pol = scenario["aggregate"].get("polarity", {})
            sections.append("## Scenario evaluation")
            for sid in sorted(pol):
                v = pol[sid]
                if v.get("n", 0) == 0: continue
                sections.append(
                    f"- **{sid}**: n={v['n']}, desirability {v['mean_desirability']:.2f}, "
                    f"plausibility {v['mean_plausibility']:.2f}, impact {v['mean_impact']:.2f}, "
                    f"fairness {v['mean_fairness']:.2f}")

        sections.append("")
        sections.append(self._limitations_text({
            "longitudinal": bool(long_t1),
            "diversity":    bool(diversity),
            "delphi":       bool(delphi),
            "scenario":     bool(scenario),
        }))
        sections.append("")
        sections.append("### Instrument provenance")
        for name, h in self._instrument_hashes().items():
            sections.append(f"- `{name}`: hash `{h}`")

        report = "\n\n".join(sections)
        out_dir = self.store.base / "synthesis"
        out_dir.mkdir(parents=True, exist_ok=True)
        (out_dir / "report.md").write_text(report, encoding="utf-8")
        self._write_tidy_csv(out_dir / "exports" / "all_responses.csv")
        return report

    def _write_tidy_csv(self, csv_path: Path) -> None:
        csv_path.parent.mkdir(parents=True, exist_ok=True)
        rows: list[dict] = []
        for phase in (InterviewPhase.T0, InterviewPhase.T1):
            for sub in SubagentKind:
                run = self.store.latest_run(phase, sub)
                if run is None: continue
                files = ["responses.jsonl", "round1_themes.jsonl",
                         "round2_ratings.jsonl", "round3_revisions.jsonl"]
                for fname in files:
                    for rec in self.store.read_responses(run, fname):
                        flat = self._flatten(rec, phase=phase.value, subagent=sub.value)
                        rows.extend(flat)
        if not rows:
            csv_path.write_text("phase,subagent,agent_id,key,value\n", encoding="utf-8")
            return
        fieldnames = sorted({k for r in rows for k in r.keys()})
        with csv_path.open("w", encoding="utf-8", newline="") as f:
            w = csv.DictWriter(f, fieldnames=fieldnames)
            w.writeheader()
            for r in rows: w.writerow(r)

    def _flatten(self, rec: dict, *, phase: str, subagent: str) -> list[dict]:
        out: list[dict] = []
        aid = rec.get("agent_id")
        for key, val in rec.items():
            if key == "agent_id": continue
            if isinstance(val, dict):
                for k2, v2 in val.items():
                    if isinstance(v2, dict):
                        for k3, v3 in v2.items():
                            out.append({"phase": phase, "subagent": subagent, "agent_id": aid,
                                        "key": f"{key}.{k2}.{k3}", "value": v3})
                    else:
                        out.append({"phase": phase, "subagent": subagent, "agent_id": aid,
                                    "key": f"{key}.{k2}", "value": v2})
            else:
                out.append({"phase": phase, "subagent": subagent, "agent_id": aid,
                            "key": key, "value": val})
        return out
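
To make the tidy export easier to picture, here is a minimal standalone sketch of the same flattening idea: one long-format row per leaf value, with nested keys joined by dots. The function name `flatten_record` and the sample record are illustrative assumptions; the `"T0"` and `"longitudinal"` strings stand in for whatever `InterviewPhase` and `SubagentKind` actually serialise to.

```python
from typing import Any

def flatten_record(rec: dict[str, Any], *, phase: str, subagent: str) -> list[dict[str, Any]]:
    """Expand one response record into long-format rows (one row per leaf value)."""
    base = {"phase": phase, "subagent": subagent, "agent_id": rec.get("agent_id")}
    rows: list[dict[str, Any]] = []
    for key, val in rec.items():
        if key == "agent_id":
            continue
        if not isinstance(val, dict):
            rows.append({**base, "key": key, "value": val})
            continue
        for k2, v2 in val.items():
            if isinstance(v2, dict):
                # Two levels of nesting, e.g. ratings -> S1 -> desirability
                for k3, v3 in v2.items():
                    rows.append({**base, "key": f"{key}.{k2}.{k3}", "value": v3})
            else:
                rows.append({**base, "key": f"{key}.{k2}", "value": v2})
    return rows

# Hypothetical record shaped like the Likert responses seeded in the test.
sample = {"agent_id": 0, "phase": "T0",
          "responses": {"stk_1": 3, "gov_1": 3},
          "confidence": {"stk_1": 0.5, "gov_1": 0.5}}
rows = flatten_record(sample, phase="T0", subagent="longitudinal")
for row in rows:
    print(row["key"], row["value"])
```

Note that a scalar field stored on the record itself (here `phase`) also becomes a row of its own, so downstream analysis may want to filter on the `key` prefix.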

Step 4: Run test to verify it passes

Run: cd backend && uv run pytest tests/interviews/test_synthesizer.py -v
Expected: 2 passed.
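
The "largest mean shifts" list in the report ranks items by the absolute value of their mean delta. The sketch below isolates that ranking in a standalone helper and shows one way to print an item whose `mean_delta` is absent. The helper name `top_shifts` and the sample aggregate are assumptions made for illustration, not part of the plan's API.

```python
def top_shifts(per_item: dict[str, dict], limit: int = 5) -> list[str]:
    """Return report lines for the items with the largest absolute mean delta."""
    ranked = sorted(per_item.items(),
                    key=lambda kv: abs(kv[1].get("mean_delta") or 0),
                    reverse=True)[:limit]
    lines = []
    for item_id, stats in ranked:
        delta = stats.get("mean_delta")
        # A missing delta cannot take a numeric format spec, so print a placeholder.
        shown = f"{delta:+.2f}" if delta is not None else "NA"
        lines.append(f"{item_id}: mean delta = {shown} (n={stats.get('n')})")
    return lines

# Hypothetical aggregate; item ids follow the stk_/gov_/mkt_ naming used in the tests.
per_item = {
    "stk_1": {"mean_delta": 0.25, "n": 4},
    "gov_1": {"mean_delta": -1.5, "n": 4},
    "mkt_1": {"mean_delta": None, "n": 0},
}
lines = top_shifts(per_item)
print("\n".join(lines))
```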

Step 5: Commit

git add backend/app/services/interview_synthesizer.py backend/tests/interviews/test_synthesizer.py
git commit -m "feat(interviews): synthesiser emits cross-method report + tidy CSV + limitations section"

Phase 5 — Adapters and API

Task 15: Persona + memory adapters

Files:

Create: backend/app/services/interviews/adapters.py
Test: backend/tests/interviews/test_adapters.py
Step 1: Write failing test

# backend/tests/interviews/test_adapters.py
import csv
import json
from pathlib import Path
from app.services.interviews.adapters import (
    FileSystemPersonaProvider, ZepMemoryProvider,
)

def _write_reddit_profiles(tmp_path: Path):
    data = [
        {"user_id": 0, "user_name": "fischer1", "name": "Fischer Müller",
         "persona": "I am a small-scale Baltic fisher.", "profession": "fisher", "bio": ""},
        {"user_id": 1, "user_name": "ngo1", "name": "Ines NGO",
         "persona": "I work for an environmental NGO.", "profession": "ngo_staff", "bio": ""},
    ]
    p = tmp_path / "reddit_profiles.json"
    p.write_text(json.dumps(data), encoding="utf-8")
    return p

def test_file_system_persona_provider_reads_reddit_json(tmp_path):
    p = _write_reddit_profiles(tmp_path)
    provider = FileSystemPersonaProvider(reddit_path=p, twitter_path=None)
    personas = provider.all()
    assert len(personas) == 2
    assert personas[0].name == "Fischer Müller"
    assert personas[0].agent_id == 0

def test_zep_memory_provider_returns_empty_when_unavailable():
    class _BrokenReader:
        def get_entity_with_context(self, *a, **kw):
            raise RuntimeError("offline")
    prov = ZepMemoryProvider(entity_reader=_BrokenReader(), graph_id="g1",
                             agent_to_entity={0: "uuid-zero"})
    d = prov.get_digest(0)
    assert d.available is False
    assert d.text != ""

def test_zep_memory_provider_truncates_to_max_chars():
    class _R:
        def get_entity_with_context(self, *a, **kw):
            class _Ctx:
                name = "X"; summary = "Y"
                related_edges = [{"fact": "very long fact " * 200}]
            return _Ctx()
    prov = ZepMemoryProvider(entity_reader=_R(), graph_id="g1",
                             agent_to_entity={5: "uuid-five"})
    d = prov.get_digest(5, max_chars=300)
    assert d.available is True
    assert len(d.text) <= 300
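
The last test pins down a length bound on the digest. As a rough sketch of a truncation rule that satisfies it, assuming a single-character ellipsis marks the cut (the helper name `bound_text` is hypothetical):

```python
def bound_text(text: str, max_chars: int) -> str:
    """Clip text to at most max_chars characters, marking the cut with an ellipsis."""
    if max_chars < 1:
        raise ValueError("max_chars must be at least 1")
    if len(text) <= max_chars:
        return text
    # Reserve one character for the ellipsis so the result is exactly max_chars long.
    return text[: max_chars - 1] + "…"

long_text = "very long fact " * 200   # 3000 characters, as in the test fixture
clipped = bound_text(long_text, 300)
print(len(clipped))
```

Reserving the slot for the ellipsis before slicing is what keeps the result within the bound; appending the marker after slicing to `max_chars` would overshoot by one.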

Step 2: Run test to verify it fails

Run: cd backend && uv run pytest tests/interviews/test_adapters.py -v
Expected: ImportError.

Step 3: Implement adapters

backend/app/services/interviews/adapters.py:

from __future__ import annotations
import csv
import json
from pathlib import Path
from typing import Optional
from app.services.interviews.base import PersonaRecord, MemoryDigest

class FileSystemPersonaProvider:
    """Reads OASIS profiles from the simulation's `reddit_profiles.json` and/or `twitter_profiles.csv`.

    If both are present, agents from `reddit_profiles.json` take precedence; twitter-only agents are appended.
    """
    def __init__(self, reddit_path: Optional[Path], twitter_path: Optional[Path]):
        self.reddit_path = Path(reddit_path) if reddit_path else None
        self.twitter_path = Path(twitter_path) if twitter_path else None

    def _load_reddit(self) -> list[PersonaRecord]:
        if not self.reddit_path or not self.reddit_path.exists(): return []
        data = json.loads(self.reddit_path.read_text(encoding="utf-8"))
        out = []
        for row in data:
            out.append(PersonaRecord(
                agent_id=int(row.get("user_id")),
                name=str(row.get("name") or row.get("user_name") or f"agent_{row.get('user_id')}"),
                persona=str(row.get("persona") or row.get("bio") or ""),
                profession=row.get("profession"),
                bio=row.get("bio"),
            ))
        return out

    def _load_twitter(self) -> list[PersonaRecord]:
        if not self.twitter_path or not self.twitter_path.exists(): return []
        out = []
        with self.twitter_path.open("r", encoding="utf-8", newline="") as f:
            for row in csv.DictReader(f):
                if not row.get("user_id"): continue
                out.append(PersonaRecord(
                    agent_id=int(row["user_id"]),
                    name=str(row.get("name") or row.get("user_name") or f"agent_{row['user_id']}"),
                    persona=str(row.get("persona") or row.get("bio") or ""),
                    profession=row.get("profession"),
                    bio=row.get("bio"),
                ))
        return out

    def all(self) -> list[PersonaRecord]:
        reddit = self._load_reddit()
        seen = {p.agent_id for p in reddit}
        twitter = [p for p in self._load_twitter() if p.agent_id not in seen]
        return reddit + twitter

class ZepMemoryProvider:
    """Builds a bounded memory digest per agent from Zep entity context.

    Maps `agent_id` (OASIS user_id) to a Zep entity UUID; falls back to the agent_id as a string.
    """
    def __init__(self, entity_reader, graph_id: str, agent_to_entity: dict[int, str] | None = None):
        self.reader = entity_reader
        self.graph_id = graph_id
        self.map = dict(agent_to_entity or {})

    def get_digest(self, agent_id: int, max_chars: int = 2000) -> MemoryDigest:
        entity_uuid = self.map.get(agent_id) or str(agent_id)
        try:
            ctx = self.reader.get_entity_with_context(self.graph_id, entity_uuid)
        except Exception:
            return MemoryDigest(text=f"[no memory for agent {agent_id}]", available=False)
        parts: list[str] = []
        name = getattr(ctx, "name", None)
        summary = getattr(ctx, "summary", None)
        if name: parts.append(f"Name: {name}")
        if summary: parts.append(f"Summary: {summary}")
        edges = getattr(ctx, "related_edges", []) or []
        for e in edges[:20]:
            fact = e.get("fact") if isinstance(e, dict) else getattr(e, "fact", None)
            if fact: parts.append(f"- {fact}")
        text = "\n".join(parts)
        if len(text) > max_chars: text = text[: max_chars - 1] + "…"
        return MemoryDigest(text=text or f"[empty memory for agent {agent_id}]", available=True)
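
The precedence rule in the `FileSystemPersonaProvider` docstring (reddit entries win, twitter-only agents are appended) can be checked in isolation. This is a minimal sketch with a hypothetical `Persona` dataclass standing in for `PersonaRecord`:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Persona:            # hypothetical stand-in for PersonaRecord
    agent_id: int
    name: str

def merge_personas(reddit: list[Persona], twitter: list[Persona]) -> list[Persona]:
    """Keep every reddit persona; append twitter personas whose id is not yet taken."""
    seen = {p.agent_id for p in reddit}
    return reddit + [p for p in twitter if p.agent_id not in seen]

reddit = [Persona(0, "Fischer Müller"), Persona(1, "Ines NGO")]
twitter = [Persona(1, "Ines (twitter)"), Persona(2, "Twitter Only")]
merged = merge_personas(reddit, twitter)
print([(p.agent_id, p.name) for p in merged])
```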

Step 4: Run test to verify it passes

Run: cd backend && uv run pytest tests/interviews/test_adapters.py -v
Expected: 3 passed.

Step 5: Commit

git add backend/app/services/interviews/adapters.py backend/tests/interviews/test_adapters.py
git commit -m "feat(interviews): persona + Zep memory adapters bridging existing services to interview subsystem"

Task 16: /api/interview Flask blueprint

Files:

Create: backend/app/api/interview.py
Modify: backend/app/api/__init__.py
Test: backend/tests/interviews/test_api_interview.py
Step 1: Write failing test

# backend/tests/interviews/test_api_interview.py
import json
import os
from pathlib import Path
import pytest

@pytest.fixture
def client(tmp_path, monkeypatch):
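    monkeypatch.setenv("LLM_STUB_MODE", "true")
    monkeypatch.setenv("UPLOADS_DIR", str(tmp_path))
    from app.config import Config
    # Patch via monkeypatch so the overrides are rolled back after each test
    monkeypatch.setattr(Config, "LLM_STUB_MODE", True, raising=False)
    monkeypatch.setattr(Config, "UPLOADS_DIR", str(tmp_path), raising=False)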
    # Seed a minimal reddit_profiles.json
    sim_dir = tmp_path / "simulations" / "sim_test"
    sim_dir.mkdir(parents=True)
    profiles = [{"user_id": i, "user_name": f"u{i}", "name": f"A{i}",
                 "persona": "p", "profession": "fisher"} for i in range(3)]
    (sim_dir / "reddit_profiles.json").write_text(json.dumps(profiles), encoding="utf-8")
    from flask import Flask
    from app.api import register_blueprints
    app = Flask(__name__)
    register_blueprints(app)
    return app.test_client()

def test_post_pre_returns_task_id(client):
    res = client.post("/api/interview/sim_test/pre")
    assert res.status_code == 200
    body = res.get_json()
    assert body["success"] is True
    assert "task_id" in body["data"]

def test_status_endpoint_returns_progress(client):
    res = client.post("/api/interview/sim_test/pre")
    task_id = res.get_json()["data"]["task_id"]
    res2 = client.get(f"/api/interview/sim_test/status?task_id={task_id}")
    assert res2.status_code == 200
    assert "status" in res2.get_json()["data"]

def test_unknown_subagent_returns_400(client):
    res = client.post("/api/interview/sim_test/rerun",
                      json={"subagent": "nonsense"})
    assert res.status_code == 400

Step 2: Run test to verify it fails

Run: cd backend && uv run pytest tests/interviews/test_api_interview.py -v
Expected: ImportError / 404.

Step 3: Check current api/__init__.py

Read backend/app/api/__init__.py and identify how graph_bp, simulation_bp, report_bp are registered. The test expects a register_blueprints(app) helper — if one doesn't exist, add it.

Step 4: Modify api/__init__.py

Replace contents (preserving existing blueprint imports — adjust to match actual file):

from flask import Flask
from .graph import graph_bp
from .simulation import simulation_bp
from .report import report_bp
from .interview import interview_bp

def register_blueprints(app: Flask) -> None:
    app.register_blueprint(graph_bp, url_prefix="/api/graph")
    app.register_blueprint(simulation_bp, url_prefix="/api/simulation")
    app.register_blueprint(report_bp, url_prefix="/api/report")
    app.register_blueprint(interview_bp, url_prefix="/api/interview")

If the existing app factory in app/__init__.py already registers these blueprints by hand, replace those calls with a single register_blueprints(app) call so that nothing is registered twice.

Step 5: Implement blueprint

backend/app/api/interview.py:

from __future__ import annotations
import threading
import traceback
import uuid
from pathlib import Path
from flask import Blueprint, jsonify, request, send_file
from app.config import Config
from app.models.interview import SubagentKind, InterviewPhase
from app.services.interviews.adapters import FileSystemPersonaProvider, ZepMemoryProvider
from app.services.interviews.zep_writer import InterviewZepWriter
from app.services.interview_orchestrator import InterviewOrchestrator
from app.services.interview_synthesizer import InterviewSynthesizer
from app.services.interviews.storage import InterviewStore
from app.utils.llm_client import LLMClient

interview_bp = Blueprint("interview", __name__)
_TASKS: dict[str, dict] = {}
_LOCK = threading.Lock()

INSTRUMENT_DIR = Path(__file__).resolve().parents[2] / "scripts" / "instruments"

def _uploads_root() -> Path:
    return Path(getattr(Config, "UPLOADS_DIR", "uploads"))

def _build_orchestrator(sim_id: str) -> InterviewOrchestrator:
    sim_dir = _uploads_root() / "simulations" / sim_id
    reddit = sim_dir / "reddit_profiles.json"
    twitter = sim_dir / "twitter_profiles.csv"
    personas = FileSystemPersonaProvider(reddit_path=reddit if reddit.exists() else None,
                                         twitter_path=twitter if twitter.exists() else None)
    # Zep memory + writer: best-effort; in stub/test mode the writer no-ops on exceptions
    class _NullUpdater:
        def add_text_episode(self, *a, **kw): return None
    try:
        from app.services.zep_entity_reader import ZepEntityReader
        from app.services.zep_graph_memory_updater import ZepGraphMemoryUpdater
        graph_id = (sim_dir / "graph_id.txt").read_text().strip() if (sim_dir / "graph_id.txt").exists() else ""
        reader = ZepEntityReader()
        updater = ZepGraphMemoryUpdater()
        memory = ZepMemoryProvider(reader, graph_id=graph_id)
        zep_writer = InterviewZepWriter(memory_updater=updater, graph_id=graph_id)
    except Exception:
        class _Mem:
            def get_digest(self, agent_id, max_chars=2000):
                from app.services.interviews.base import MemoryDigest
                return MemoryDigest(text="[memory unavailable]", available=False)
        memory = _Mem()
        zep_writer = InterviewZepWriter(memory_updater=_NullUpdater(), graph_id="")
    llm = LLMClient(api_key=Config.LLM_API_KEY, base_url=Config.LLM_BASE_URL,
                    model=Config.LLM_MODEL_NAME)
    return InterviewOrchestrator(
        llm=llm, memory=memory, personas=personas,
        instrument_dir=INSTRUMENT_DIR, store_root=_uploads_root(), sim_id=sim_id,
        zep_writer=zep_writer, max_workers=Config.INTERVIEW_MAX_WORKERS,
        language=Config.INTERVIEW_DEFAULT_LANGUAGE,
    )

def _run_task(task_id: str, fn) -> None:
    with _LOCK:
        _TASKS[task_id] = {"status": "running", "progress": {}, "result": None, "error": None}
    try:
        result = fn(task_id)
        with _LOCK:
            _TASKS[task_id]["status"] = "completed"; _TASKS[task_id]["result"] = result
    except Exception as e:
        with _LOCK:
            _TASKS[task_id]["status"] = "failed"
            _TASKS[task_id]["error"] = repr(e)
            _TASKS[task_id]["traceback"] = traceback.format_exc()

def _start_task(fn) -> str:
    task_id = uuid.uuid4().hex[:12]
    with _LOCK:
        _TASKS[task_id] = {"status": "queued", "progress": {}, "result": None, "error": None}
    threading.Thread(target=_run_task, args=(task_id, fn), daemon=True).start()
    return task_id

def _envelope(data=None, error=None, status: int = 200):
    body = {"success": error is None, "data": data or {}, "error": error}
    return jsonify(body), status

@interview_bp.route("/<sim_id>/pre", methods=["POST"])
def post_pre(sim_id: str):
    orch = _build_orchestrator(sim_id)
    task_id = _start_task(lambda tid: orch.run_pre())
    return _envelope({"task_id": task_id})

@interview_bp.route("/<sim_id>/post", methods=["POST"])
def post_post(sim_id: str):
    orch = _build_orchestrator(sim_id)
    def run(tid):
        out = orch.run_post()
        synth = InterviewSynthesizer(store=orch.store)
        out["synthesis"] = synth.run()[:1000]  # short preview
        return out
    task_id = _start_task(run)
    return _envelope({"task_id": task_id})

@interview_bp.route("/<sim_id>/rerun", methods=["POST"])
def post_rerun(sim_id: str):
    body = request.get_json(silent=True) or {}
    sub = body.get("subagent")
    try: subagent = SubagentKind(sub)
    except ValueError: return _envelope(error=f"unknown subagent {sub!r}", status=400)
    orch = _build_orchestrator(sim_id)
    task_id = _start_task(lambda tid: orch.rerun(subagent))
    return _envelope({"task_id": task_id})

@interview_bp.route("/<sim_id>/status", methods=["GET"])
def get_status(sim_id: str):
    task_id = request.args.get("task_id")
    with _LOCK:
        task = _TASKS.get(task_id)
    if task is None: return _envelope(error="unknown task_id", status=404)
    return _envelope({"status": task["status"], "progress": task.get("progress", {}),
                      "result": task.get("result"), "error": task.get("error")})

@interview_bp.route("/<sim_id>/results/<subagent>", methods=["GET"])
def get_results(sim_id: str, subagent: str):
    try: sub = SubagentKind(subagent)
    except ValueError: return _envelope(error=f"unknown subagent {subagent!r}", status=400)
    store = InterviewStore(root=_uploads_root(), sim_id=sim_id)
    # Results are served from the post-simulation (T1) run for every subagent
    phase = InterviewPhase.T1
    run = store.latest_run(phase, sub)
    if run is None: return _envelope(error="no results yet", status=404)
    agg = (run / "aggregate.json")
    if not agg.exists(): return _envelope(error="aggregate missing", status=404)
    import json as _j
    return _envelope({"aggregate": _j.loads(agg.read_text(encoding="utf-8")),
                      "run_dir": str(run)})

@interview_bp.route("/<sim_id>/results/synthesis", methods=["GET"])
def get_synthesis(sim_id: str):
    store = InterviewStore(root=_uploads_root(), sim_id=sim_id)
    report = store.base / "synthesis" / "report.md"
    if not report.exists():
        synth = InterviewSynthesizer(store=store)
        synth.run()
    return _envelope({"report_markdown": report.read_text(encoding="utf-8")})

@interview_bp.route("/<sim_id>/export.csv", methods=["GET"])
def get_export_csv(sim_id: str):
    store = InterviewStore(root=_uploads_root(), sim_id=sim_id)
    csv_path = store.base / "synthesis" / "exports" / "all_responses.csv"
    if not csv_path.exists():
        InterviewSynthesizer(store=store).run()
    # resolve(): send_file interprets a relative path against the app root, not the CWD
    return send_file(csv_path.resolve(), mimetype="text/csv", as_attachment=True,
                     download_name=f"{sim_id}_interviews.csv")

Step 6: Run test to verify it passes

Run: cd backend && uv run pytest tests/interviews/test_api_interview.py -v
Expected: 3 passed.

Step 7: Commit

git add backend/app/api/__init__.py backend/app/api/interview.py backend/tests/interviews/test_api_interview.py
git commit -m "feat(interviews): Flask blueprint /api/interview with task-based async + CSV export"

Phase 6 — Integration

Task 17: End-to-end pipeline test (stub LLM)

Files:

Create: backend/tests/integration/__init__.py
Test: backend/tests/integration/test_interview_pipeline.py
Step 1: Write failing test

Create backend/tests/integration/__init__.py (empty), then:

# backend/tests/integration/test_interview_pipeline.py
import json
import pytest
from pathlib import Path
from app.config import Config
from app.models.interview import SubagentKind, InterviewPhase
from app.services.interviews.adapters import FileSystemPersonaProvider
from app.services.interviews.base import MemoryDigest
from app.services.interviews.zep_writer import InterviewZepWriter
from app.services.interview_orchestrator import InterviewOrchestrator
from app.services.interview_synthesizer import InterviewSynthesizer
from app.utils.llm_client import LLMClient

pytestmark = pytest.mark.integration

INST_DIR = Path(__file__).resolve().parents[2] / "scripts" / "instruments"

class _NullUpdater:
    def __init__(self): self.events = []
    def add_text_episode(self, graph_id, text): self.events.append(text)

class _StaticMem:
    def get_digest(self, agent_id, max_chars=2000):
        return MemoryDigest(text=f"agent {agent_id} memory snippet", available=True)

@pytest.fixture
def seeded_uploads(tmp_path, monkeypatch):
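    monkeypatch.setenv("LLM_STUB_MODE", "true")
    # Patch via monkeypatch so the override does not leak into other tests
    monkeypatch.setattr(Config, "LLM_STUB_MODE", True, raising=False)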
    sim_dir = tmp_path / "simulations" / "intg_sim"
    sim_dir.mkdir(parents=True)
    profiles = [{"user_id": i, "user_name": f"u{i}", "name": f"A{i}",
                 "persona": "stakeholder p", "profession": "fisher"} for i in range(5)]
    (sim_dir / "reddit_profiles.json").write_text(json.dumps(profiles), encoding="utf-8")
    return tmp_path

def _make_orch(tmp_path):
    sim_dir = tmp_path / "simulations" / "intg_sim"
    personas = FileSystemPersonaProvider(
        reddit_path=sim_dir / "reddit_profiles.json", twitter_path=None,
    )
    llm = LLMClient(api_key="x", base_url="x", model="x")
    updater = _NullUpdater()
    writer = InterviewZepWriter(memory_updater=updater, graph_id="g")
    return InterviewOrchestrator(
        llm=llm, memory=_StaticMem(), personas=personas,
        instrument_dir=INST_DIR, store_root=tmp_path, sim_id="intg_sim",
        zep_writer=writer, max_workers=2, language="de",
    )

def test_pipeline_runs_pre_then_post_then_synthesis(seeded_uploads):
    tmp = seeded_uploads
    orch = _make_orch(tmp)

    pre = orch.run_pre()
    assert pre["longitudinal"]["n_responded"] >= 1

    post = orch.run_post()
    assert "longitudinal" in post
    assert "diversity" in post
    assert "scenario" in post
    assert "delphi" in post

    synth = InterviewSynthesizer(store=orch.store)
    report = synth.run()
    assert "Stakeholder Interview Synthesis" in report
    assert "Limitations" in report

    csv_path = orch.store.base / "synthesis" / "exports" / "all_responses.csv"
    assert csv_path.exists()
    lines = csv_path.read_text(encoding="utf-8").splitlines()
    assert "agent_id" in lines[0].split(",")

def test_idempotent_rerun_creates_new_run_id(seeded_uploads):
    tmp = seeded_uploads
    orch = _make_orch(tmp)
    orch.run_pre()
    first = orch.run_post()
    second = orch.rerun(SubagentKind.SCENARIO)
    first_scn = first["scenario"]["run_dir"]
    second_scn = second["scenario"]["run_dir"]
    assert first_scn != second_scn

Step 2: Run test to verify it fails

Run: cd backend && uv run pytest tests/integration/test_interview_pipeline.py -v -m integration
Expected: most likely a ValidationError, because the stub LLM's canned JSON does not satisfy every subagent's strict validator (forced Q-sort distribution, scenarios, Delphi). That failure is the signal to enrich the stub.
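
To see why a generic canned response trips the Q-sort validator, it helps to spell out what a forced-distribution check does. The sketch below assumes the 2,3,4,6,4,3,2 distribution over scores -3..+3 used by the stub in Step 3. The function name and error format are hypothetical and need not match the real validator.

```python
from collections import Counter

# Bucket score -> required count, from the forced distribution 2,3,4,6,4,3,2.
FORCED = {-3: 2, -2: 3, -1: 4, 0: 6, 1: 4, 2: 3, 3: 2}

def distribution_errors(placements: dict[str, int]) -> list[str]:
    """List every bucket whose observed count differs from the forced count."""
    counts = Counter(placements.values())
    errors = []
    for bucket in sorted(set(FORCED) | set(counts)):
        want, got = FORCED.get(bucket, 0), counts.get(bucket, 0)
        if want != got:
            errors.append(f"bucket {bucket:+d}: expected {want}, got {got}")
    return errors

buckets = [-3] * 2 + [-2] * 3 + [-1] * 4 + [0] * 6 + [1] * 4 + [2] * 3 + [3] * 2
valid = {f"st_{i + 1:02d}": b for i, b in enumerate(buckets)}
flat = {f"st_{i + 1:02d}": 0 for i in range(24)}   # everything placed in the middle

print(distribution_errors(valid))
print(len(distribution_errors(flat)))
```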

Step 3: Enrich _stub_response_json in LLMClient to satisfy each subagent

Read the current _stub_response_json (Task 4). Replace its body with content-aware stubs that choose a response shape by inspecting the system and user message text. In backend/app/utils/llm_client.py, replace _stub_response_json with:

    def _stub_response_json(self, messages: list[dict]) -> dict:
        import hashlib, json as _json
        sys_msg = next((m["content"] for m in messages if m.get("role") == "system"), "")
        usr_msg = next((m["content"] for m in reversed(messages) if m.get("role") == "user"), "")
        h = hashlib.sha256((sys_msg + "|" + usr_msg).encode("utf-8")).hexdigest()
        seed = int(h[:8], 16)
        rng = (seed % 5) + 1

        # Longitudinal Likert (12 items)
        if all(tok in usr_msg for tok in ("stk_1", "gov_1", "mkt_1", "clm_1")):
            ids = ["stk_1","stk_2","stk_3","gov_1","gov_2","gov_3",
                   "mkt_1","mkt_2","mkt_3","clm_1","clm_2","clm_3"]
            return {"responses": {k: ((seed >> (i*3)) % 5) + 1 for i, k in enumerate(ids)},
                    "confidence": {k: 0.6 for k in ids},
                    "open_comment": f"stub:{h[:8]}"}

        # Diversity Q-sort: 24 statements + 6 axes, forced distribution 2,3,4,6,4,3,2
        if "st_01" in usr_msg and "ax_pres_extr" in usr_msg:
            buckets = [-3]*2 + [-2]*3 + [-1]*4 + [0]*6 + [1]*4 + [2]*3 + [3]*2
            stmts = [f"st_{i+1:02d}" for i in range(24)]
            # shuffle deterministically
            order = sorted(range(24), key=lambda i: (h[i % len(h)], i))
            placements = {stmts[i]: buckets[order.index(i)] for i in range(24)}
            return {
                "placements": placements,
                "likert_axes": {a: ((seed >> (j*3)) % 7) + 1 for j, a in enumerate(
                    ["ax_pres_extr","ax_loc_eu","ax_sci_trad",
                     "ax_ind_col","ax_short_long","ax_mkt_reg"])},
            }

        # Scenario: S1..S4 × 4 dims
        if all(s in usr_msg for s in ("S1:", "S2:", "S3:", "S4:")):
            return {"ratings": {sid: {
                "desirability": ((seed >> (i*3)) % 7) + 1,
                "plausibility": ((seed >> (i*3+1)) % 7) + 1,
                "impact_on_my_group": ((seed >> (i*3+2)) % 7) + 1,
                "fairness": ((seed >> (i*3+4)) % 7) + 1,
                "if_woke_up_response": f"act-{sid}-{h[:4]}",
            } for i, sid in enumerate(["S1","S2","S3","S4"])}}

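        # Delphi theme extraction (no in-character system prompt). Checked before R1
        # because the extraction prompt may quote R1 material that mentions q1/q2.
        if "extract distinct thematic codes" in sys_msg:
            return {"themes": [{"theme_id": f"theme_{i}", "label": f"Thema {i}"} for i in range(5)]}

        # Delphi R1: q1..q4 free text; skip R2/R3 prompts in either language
        r2_r3_markers = ("Bewerten", "Sie sehen", "Rate each theme", "Below are the anonymised")
        if "q1" in usr_msg and "q2" in usr_msg and not any(m in usr_msg for m in r2_r3_markers):
            return {"answers": {qid: f"stub-themes-{qid}-{h[:4]}" for qid in ("q1","q2","q3","q4")}}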

        # Delphi R2 (rate) or R3 (revise)
        if "Bewerten Sie jedes Thema" in usr_msg or "Sie sehen unten" in usr_msg \
           or "Rate each theme" in usr_msg or "Below are the anonymised" in usr_msg:
            theme_ids = [f"theme_{i}" for i in range(5)]
            out = {"ratings": {tid: {"importance": ((seed >> (i*2)) % 5) + 1,
                                     "plausibility": ((seed >> (i*2+1)) % 5) + 1}
                               for i, tid in enumerate(theme_ids)}}
            if "Sie sehen unten" in usr_msg or "Below are the anonymised" in usr_msg:
                out["justification"] = "stub-revision"
            return out

        # Fallback
        return {"stub_key": h[:12], "value": rng}
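
The stub derives every value from a SHA-256 digest of the prompt, so identical prompts give identical answers. The sketch below isolates the Likert branch to make the bit-slicing visible. `stub_likert` is a hypothetical helper, with item ids built to match the twelve used above.

```python
import hashlib

def stub_likert(system: str, user: str, item_ids: list[str]) -> dict[str, int]:
    """Map a prompt to deterministic 1..5 answers by slicing a 32-bit hash seed."""
    digest = hashlib.sha256(f"{system}|{user}".encode("utf-8")).hexdigest()
    seed = int(digest[:8], 16)          # first 8 hex digits -> 32-bit seed
    # Each item reads the seed shifted by another 3 bits, folded into 1..5.
    return {item: ((seed >> (i * 3)) % 5) + 1 for i, item in enumerate(item_ids)}

ids = [f"{grp}_{n}" for grp in ("stk", "gov", "mkt", "clm") for n in (1, 2, 3)]
a = stub_likert("persona", "prompt text", ids)
b = stub_likert("persona", "prompt text", ids)
print(a == b, a["clm_3"])
```

One property worth knowing: a 32-bit seed is exhausted after a shift of 33 bits, so the twelfth item (`clm_3`) always comes out as 1 and the eleventh only ranges over 1..4. If that matters for a test, taking more hex digits for the seed would spread the later items as well.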

Step 4: Run test to verify it passes

Run: cd backend && uv run pytest tests/integration/test_interview_pipeline.py -v -m integration
Expected: 2 passed.

Step 5: Commit

git add backend/app/utils/llm_client.py backend/tests/integration/__init__.py backend/tests/integration/test_interview_pipeline.py
git commit -m "test(interviews): end-to-end pipeline test + content-aware LLM stubs for all 4 subagents"

Phase 7 — Frontend

Note: this project has no frontend test framework. Tasks below use the build (npm run build) plus a manual smoke check via npm run dev as the verification gate. Commit after each task once the build is green.

Task 18: Step4bInterviews.vue scaffold + tab shell

Files:

Create: frontend/src/components/Step4bInterviews.vue
Create: frontend/src/api/interview.js
Modify: frontend/src/App.vue (or the parent that orchestrates Step1..Step5 — locate and adjust)
Step 1: Add API client module

frontend/src/api/interview.js:

import { api } from "./index"

export async function startPre(simId) {
  const r = await api.post(`/api/interview/${simId}/pre`)
  return r.data
}
export async function startPost(simId) {
  const r = await api.post(`/api/interview/${simId}/post`)
  return r.data
}
export async function rerun(simId, subagent) {
  const r = await api.post(`/api/interview/${simId}/rerun`, { subagent })
  return r.data
}
export async function getStatus(simId, taskId) {
  const r = await api.get(`/api/interview/${simId}/status`, { params: { task_id: taskId } })
  return r.data
}
export async function getResults(simId, subagent) {
  const r = await api.get(`/api/interview/${simId}/results/${subagent}`)
  return r.data
}
export async function getSynthesis(simId) {
  const r = await api.get(`/api/interview/${simId}/results/synthesis`)
  return r.data
}
export function exportCsvUrl(simId) {
  return `/api/interview/${simId}/export.csv`
}

Step 2: Implement Step4bInterviews.vue scaffold

frontend/src/components/Step4bInterviews.vue:

<template>
  <section class="step4b">
    <header>
      <h2>{{ t('interview.title') }}</h2>
      <p class="subtitle">{{ t('interview.subtitle') }}</p>
    </header>

    <div class="actions">
      <button :disabled="busy" @click="startPostRun">{{ t('interview.runAll') }}</button>
      <a :href="csvUrl" target="_blank" rel="noopener">{{ t('interview.downloadCsv') }}</a>
    </div>

    <nav class="tabs">
      <button v-for="t in tabs" :key="t.id"
              :class="{ active: active === t.id }"
              @click="active = t.id">
        {{ t.label }}
      </button>
    </nav>

    <component :is="currentPanel" :sim-id="simId" :status="status" />
  </section>
</template>

<script setup>
import { computed, onMounted, ref } from 'vue'
import { useI18n } from 'vue-i18n'
import LongitudinalPanel from './interviews/LongitudinalPanel.vue'
import DiversityPanel from './interviews/DiversityPanel.vue'
import DelphiPanel from './interviews/DelphiPanel.vue'
import ScenarioPanel from './interviews/ScenarioPanel.vue'
import SynthesisPanel from './interviews/SynthesisPanel.vue'
import { startPost, getStatus, exportCsvUrl } from '../api/interview'

const props = defineProps({ simId: { type: String, required: true } })
const { t } = useI18n()
const tabs = [
  { id: 'longitudinal', label: t('interview.tab.longitudinal') },
  { id: 'diversity',    label: t('interview.tab.diversity') },
  { id: 'delphi',       label: t('interview.tab.delphi') },
  { id: 'scenario',     label: t('interview.tab.scenario') },
  { id: 'synthesis',    label: t('interview.tab.synthesis') },
]
const active = ref('longitudinal')
const status = ref({ status: 'idle' })
const busy = ref(false)
const csvUrl = computed(() => exportCsvUrl(props.simId))

const panels = {
  longitudinal: LongitudinalPanel, diversity: DiversityPanel,
  delphi: DelphiPanel, scenario: ScenarioPanel, synthesis: SynthesisPanel,
}
const currentPanel = computed(() => panels[active.value])

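async function startPostRun() {
  busy.value = true
  try {
    const res = await startPost(props.simId)
    if (!res.success) throw new Error(res.error || 'failed to start')
    await poll(res.data.task_id)
  } catch (e) {
    // Surface start/poll failures to the panels instead of leaving an unhandled rejection.
    status.value = { status: 'failed', error: String(e) }
  } finally { busy.value = false }
}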

async function poll(taskId) {
  while (true) {
    const r = await getStatus(props.simId, taskId)
    status.value = r.data
    if (['completed', 'failed'].includes(r.data.status)) break
    await new Promise(r => setTimeout(r, 1500))
  }
}
</script>

<style scoped>
.step4b { padding: 1rem; }
.tabs { display: flex; gap: .5rem; margin: 1rem 0; }
.tabs button.active { font-weight: 700; border-bottom: 2px solid #333; }
.actions { display: flex; gap: 1rem; align-items: center; }
</style>
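
Since the frontend has no test framework, the status endpoint can also be exercised from a script during the manual smoke check. The following is a minimal sketch of the same polling loop with a timeout added. `poll_until_done` and its `get_status` callable are hypothetical: `get_status` stands in for whatever wraps `GET /api/interview/<sim_id>/status` and returns the payload whose `status` field the component reads.

```python
import time
from typing import Callable

TERMINAL = {"completed", "failed"}  # same terminal states the Vue poll loop checks


def poll_until_done(
    get_status: Callable[[], dict],
    interval: float = 1.5,
    timeout: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call get_status() until it reports a terminal state or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        payload = get_status()
        if payload.get("status") in TERMINAL:
            return payload
        if time.monotonic() >= deadline:
            raise TimeoutError(f"interview task still {payload.get('status')!r} after {timeout}s")
        sleep(interval)
```

The `sleep` parameter is injectable so the loop can be driven in a test without real waiting.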

Step 3: Create placeholder panel components (to be filled in Task 19)

Create five empty-but-renderable Vue components so the scaffold compiles:

frontend/src/components/interviews/LongitudinalPanel.vue:

<template><div class="panel">Longitudinal: results will appear here.</div></template>
<script setup>
defineProps({ simId: String, status: Object })
</script>

Repeat the same pattern (changing only the inner text) for DiversityPanel.vue, DelphiPanel.vue, ScenarioPanel.vue, SynthesisPanel.vue in frontend/src/components/interviews/.

Step 4: Wire Step4b into parent navigation

Read frontend/src/App.vue (or wherever Step1..Step5 are rendered). Locate the routing/visibility logic. Add a Step4b state between Step4 and Step5, and import Step4bInterviews from ./components/Step4bInterviews.vue. Pass :sim-id="currentSimId" where the others receive the sim id. Add i18n keys to locales/en.json, locales/de.json, locales/zh.json:

"interview": {
  "title": "Stakeholder interviews",
  "subtitle": "Four independent surveys of the simulated stakeholder population.",
  "runAll": "Run all post-simulation interviews",
  "downloadCsv": "Download CSV",
  "tab": {
    "longitudinal": "Longitudinal (Δ)",
    "diversity": "Diversity",
    "delphi": "Delphi",
    "scenario": "Scenarios",
    "synthesis": "Synthesis"
  }
}

Step 5: Build to verify it compiles

Run: cd frontend && npm run build Expected: build succeeds with no errors.

Step 6: Commit

git add frontend/src/api/interview.js frontend/src/components/Step4bInterviews.vue \
        frontend/src/components/interviews/*.vue frontend/src/App.vue \
        locales/*.json
git commit -m "feat(interviews): Step4b Vue scaffold with five-tab navigation, API client, i18n keys"

Task 19: Per-tab d3 visualisations

Files:

Modify: frontend/src/components/interviews/LongitudinalPanel.vue
Modify: frontend/src/components/interviews/DiversityPanel.vue
Modify: frontend/src/components/interviews/DelphiPanel.vue
Modify: frontend/src/components/interviews/ScenarioPanel.vue
Modify: frontend/src/components/interviews/SynthesisPanel.vue

Each panel fetches its aggregate via the API on mount and renders it with d3 (the Synthesis panel renders the report as plain text instead). The five implementations share the same structure; the full content of each is shown below.

Step 1: Longitudinal panel — heatmap of Δ̄ per item

frontend/src/components/interviews/LongitudinalPanel.vue:

<template>
  <div class="panel">
    <h3>Longitudinal Δ (T0 → T1)</h3>
    <div v-if="loading">Loading…</div>
    <div v-else-if="error">{{ error }}</div>
    <svg v-else ref="chart" :width="width" :height="height"></svg>
  </div>
</template>

<script setup>
import { nextTick, onMounted, ref, watch } from 'vue'
import * as d3 from 'd3'
import { getResults } from '../../api/interview'

const props = defineProps({ simId: String, status: Object })
const chart = ref(null)
const loading = ref(true)
const error = ref(null)
const width = 640
const height = 360

watch(() => props.status?.status, (s) => { if (s === 'completed') load() })
onMounted(load)

async function load() {
  loading.value = true; error.value = null
  try {
    const r = await getResults(props.simId, 'longitudinal')
    if (!r.success) { error.value = r.error; return }
    // The <svg v-else> is only mounted once loading is false, so flip the flag
    // and wait for the DOM update before drawing into the ref.
    loading.value = false
    await nextTick()
    draw(r.data.aggregate)
  } catch (e) { error.value = String(e) }
  finally { loading.value = false }
}

function draw(agg) {
  const items = Object.entries(agg.per_item || {})
  if (items.length === 0) return
  const svg = d3.select(chart.value)
  svg.selectAll('*').remove()
  const margin = { top: 20, right: 20, bottom: 60, left: 80 }
  const w = width - margin.left - margin.right
  const h = height - margin.top - margin.bottom
  const g = svg.append('g').attr('transform', `translate(${margin.left},${margin.top})`)
  const x = d3.scaleBand().domain(items.map(([k]) => k)).range([0, w]).padding(0.1)
  const y = d3.scaleLinear().domain([-4, 4]).range([h, 0])
  const color = d3.scaleDiverging(d3.interpolateRdBu).domain([-4, 0, 4])
  g.selectAll('rect').data(items).enter().append('rect')
    .attr('x', d => x(d[0]))
    .attr('y', d => y(Math.max(0, d[1].mean_delta || 0)))
    .attr('width', x.bandwidth())
    .attr('height', d => Math.abs(y(d[1].mean_delta || 0) - y(0)))
    .attr('fill', d => color(d[1].mean_delta || 0))
  g.append('g').attr('transform', `translate(0,${y(0)})`)
    .call(d3.axisBottom(x)).selectAll('text')
    .attr('transform', 'rotate(-40)').attr('text-anchor', 'end')
  g.append('g').call(d3.axisLeft(y))
}
</script>

<style scoped>
.panel { padding: .5rem; }
</style>
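
The panel reads `aggregate.per_item[<item>].mean_delta`. As a rough illustration of how such a payload could be produced on the backend, the sketch below averages T1 minus T0 per item over agents that answered both waves. The input shape (agent id mapped to item scores) and the function name `per_item_mean_delta` are assumptions for illustration; the Longitudinal subagent in the plan may aggregate differently.

```python
from statistics import fmean


def per_item_mean_delta(t0: dict, t1: dict) -> dict:
    """Mean of (T1 - T0) per item, over agents present in both waves.

    t0 / t1: {agent_id: {item_id: numeric score}}  (assumed shape)
    """
    deltas: dict[str, list[float]] = {}
    for agent_id, before in t0.items():
        after = t1.get(agent_id)
        if after is None:
            continue  # agent has no T1 response, so no paired difference
        for item, score0 in before.items():
            if item in after:
                deltas.setdefault(item, []).append(after[item] - score0)
    return {item: {"mean_delta": fmean(vals), "n": len(vals)} for item, vals in deltas.items()}
```

Agents or items missing from either wave are skipped rather than treated as zero change, which keeps the mean a paired statistic.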

Step 2: Diversity panel — PCA scatter coloured by cluster

frontend/src/components/interviews/DiversityPanel.vue:

<template>
  <div class="panel">
    <h3>Stakeholder typology (PCA)</h3>
    <div v-if="loading">Loading…</div>
    <div v-else-if="error">{{ error }}</div>
    <svg v-else ref="chart" :width="width" :height="height"></svg>
  </div>
</template>

<script setup>
import { nextTick, onMounted, ref, watch } from 'vue'
import * as d3 from 'd3'
import { getResults } from '../../api/interview'

const props = defineProps({ simId: String, status: Object })
const chart = ref(null); const loading = ref(true); const error = ref(null)
const width = 640, height = 480

watch(() => props.status?.status, (s) => { if (s === 'completed') load() })
onMounted(load)

async function load() {
  loading.value = true; error.value = null
  try {
    const r = await getResults(props.simId, 'diversity')
    if (!r.success) { error.value = r.error; return }
    // The <svg v-else> is only mounted once loading is false, so flip the flag
    // and wait for the DOM update before drawing into the ref.
    loading.value = false
    await nextTick()
    draw(r.data.aggregate)
  } catch (e) { error.value = String(e) } finally { loading.value = false }
}

function draw(agg) {
  // The /results endpoint returns aggregate.json, which contains clusters + agent_ids.
  // PCA components live in typology.json (a separate file), so v1 does not plot real
  // PCA coordinates: it places each cluster on a notional 2D ring layout instead.
  const clusters = agg.clusters || []
  if (!clusters.length) return
  const svg = d3.select(chart.value); svg.selectAll('*').remove()
  const margin = { top: 20, right: 20, bottom: 30, left: 30 }
  const w = width - margin.left - margin.right
  const h = height - margin.top - margin.bottom
  const g = svg.append('g').attr('transform', `translate(${margin.left},${margin.top})`)
  const points = []
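  clusters.forEach((c, i) => {
    const ids = c.agent_ids || []
    ids.forEach((aid, k) => {
      // Spread each cluster over its own angular sector (leaving a gap to the next one)
      // and cap the radius at 0.45 so every point stays inside the [0, 1] plot domain.
      const angle = ((i + 0.8 * (k + 0.5) / ids.length) / clusters.length) * 2 * Math.PI
      const radius = (k % 5 + 1) * 0.08 + 0.05
      points.push({ x: 0.5 + Math.cos(angle) * radius, y: 0.5 + Math.sin(angle) * radius,
                    cluster: c.cluster_id, agent_id: aid })
    })
  })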
  const x = d3.scaleLinear().domain([0, 1]).range([0, w])
  const y = d3.scaleLinear().domain([0, 1]).range([h, 0])
  const color = d3.scaleOrdinal(d3.schemeCategory10)
  g.selectAll('circle').data(points).enter().append('circle')
    .attr('cx', d => x(d.x)).attr('cy', d => y(d.y)).attr('r', 5)
    .attr('fill', d => color(d.cluster)).attr('opacity', .7)
    .append('title').text(d => `agent ${d.agent_id} · cluster ${d.cluster}`)
}
</script>

<style scoped>
.panel { padding: .5rem; }
</style>

Step 3: Delphi panel — convergence bar chart (R2 IQR vs R3 IQR per theme)

frontend/src/components/interviews/DelphiPanel.vue:

<template>
  <div class="panel">
    <h3>Delphi convergence (IQR shift R2 → R3)</h3>
    <div v-if="loading">Loading…</div>
    <div v-else-if="error">{{ error }}</div>
    <svg v-else ref="chart" :width="width" :height="height"></svg>
  </div>
</template>

<script setup>
import { nextTick, onMounted, ref, watch } from 'vue'
import * as d3 from 'd3'
import { api } from '../../api/index'

const props = defineProps({ simId: String, status: Object })
const chart = ref(null); const loading = ref(true); const error = ref(null)
const width = 640, height = 420

watch(() => props.status?.status, (s) => { if (s === 'completed') load() })
onMounted(load)

async function load() {
  loading.value = true; error.value = null
  try {
    const r = await api.get(`/api/interview/${props.simId}/results/delphi`)
    if (!r.data.success) { error.value = r.data.error; return }
    // v1 limitation: aggregate.json only exposes the themes plus the n_r1/n_r2/n_r3 values,
    // so draw() plots n_r3 for every theme. Per-theme convergence detail lives in
    // convergence.json and needs a follow-up endpoint before it can be charted.
    // The <svg v-else> is only mounted once loading is false, so flip the flag
    // and wait for the DOM update before drawing into the ref.
    loading.value = false
    await nextTick()
    draw(r.data.data.aggregate)
  } catch (e) { error.value = String(e) } finally { loading.value = false }
}

function draw(agg) {
  const themes = agg.themes || []
  if (!themes.length) return
  const svg = d3.select(chart.value); svg.selectAll('*').remove()
  const margin = { top: 20, right: 20, bottom: 80, left: 60 }
  const w = width - margin.left - margin.right
  const h = height - margin.top - margin.bottom
  const g = svg.append('g').attr('transform', `translate(${margin.left},${margin.top})`)
  const x = d3.scaleBand().domain(themes.map(t => t.theme_id)).range([0, w]).padding(0.15)
  const y = d3.scaleLinear().domain([0, agg.n_r1 || 1]).range([h, 0])
  const bars = themes.map((t, i) => ({
    theme: t.theme_id, label: t.label,
    nr1: agg.n_r1, nr2: agg.n_r2, nr3: agg.n_r3,
  }))
  g.selectAll('rect').data(bars).enter().append('rect')
    .attr('x', d => x(d.theme)).attr('y', d => y(d.nr3))
    .attr('width', x.bandwidth()).attr('height', d => h - y(d.nr3))
    .attr('fill', d3.schemeCategory10[2])
  g.append('g').attr('transform', `translate(0,${h})`).call(d3.axisBottom(x))
    .selectAll('text').attr('transform', 'rotate(-30)').attr('text-anchor', 'end')
  g.append('g').call(d3.axisLeft(y))
}
</script>

<style scoped>
.panel { padding: .5rem; }
</style>
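
The step title promises an R2 versus R3 IQR comparison per theme, which the v1 panel does not yet draw. A minimal sketch of that computation is given below, using only the standard library. The input shape (theme id mapped to a list of ratings per round) and the names `iqr` and `iqr_shift` are assumptions for illustration; the contents of `convergence.json` in the plan may be structured differently.

```python
from statistics import quantiles


def iqr(values: list[float]) -> float:
    """Interquartile range; returns 0.0 when there are too few ratings to split."""
    if len(values) < 2:
        return 0.0
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    return float(q3 - q1)


def iqr_shift(r2: dict, r3: dict) -> dict:
    """Per-theme IQR in round 2 and round 3, plus the shift (negative = converging)."""
    out = {}
    for theme_id in sorted(r2.keys() & r3.keys()):  # only themes rated in both rounds
        before, after = iqr(r2[theme_id]), iqr(r3[theme_id])
        out[theme_id] = {"iqr_r2": before, "iqr_r3": after, "shift": after - before}
    return out
```

A grouped bar chart over `iqr_r2` and `iqr_r3` per theme would then match the panel heading; a negative `shift` indicates that ratings tightened after feedback.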

Step 4: Scenario panel — polarity quadrant (desirability × plausibility)

frontend/src/components/interviews/ScenarioPanel.vue:

<template>
  <div class="panel">
    <h3>Scenarios: desirability × plausibility</h3>
    <div v-if="loading">Loading…</div>
    <div v-else-if="error">{{ error }}</div>
    <svg v-else ref="chart" :width="width" :height="height"></svg>
  </div>
</template>

<script setup>
import { nextTick, onMounted, ref, watch } from 'vue'
import * as d3 from 'd3'
import { getResults } from '../../api/interview'

const props = defineProps({ simId: String, status: Object })
const chart = ref(null); const loading = ref(true); const error = ref(null)
const width = 520, height = 520

watch(() => props.status?.status, (s) => { if (s === 'completed') load() })
onMounted(load)

async function load() {
  loading.value = true; error.value = null
  try {
    const r = await getResults(props.simId, 'scenario')
    if (!r.success) { error.value = r.error; return }
    // The <svg v-else> is only mounted once loading is false, so flip the flag
    // and wait for the DOM update before drawing into the ref.
    loading.value = false
    await nextTick()
    draw(r.data.aggregate.polarity || {})
  } catch (e) { error.value = String(e) } finally { loading.value = false }
}

function draw(polarity) {
  const pts = Object.entries(polarity)
    .filter(([, v]) => v && v.n > 0)
    .map(([sid, v]) => ({
      sid, x: v.mean_plausibility, y: v.mean_desirability,
      n: v.n, sdx: v.sd_plausibility, sdy: v.sd_desirability,
    }))
  if (!pts.length) return
  const svg = d3.select(chart.value); svg.selectAll('*').remove()
  const margin = { top: 20, right: 20, bottom: 40, left: 40 }
  const w = width - margin.left - margin.right
  const h = height - margin.top - margin.bottom
  const g = svg.append('g').attr('transform', `translate(${margin.left},${margin.top})`)
  const x = d3.scaleLinear().domain([1, 7]).range([0, w])
  const y = d3.scaleLinear().domain([1, 7]).range([h, 0])
  g.append('line').attr('x1', 0).attr('x2', w).attr('y1', y(4)).attr('y2', y(4)).attr('stroke', '#ccc')
  g.append('line').attr('x1', x(4)).attr('x2', x(4)).attr('y1', 0).attr('y2', h).attr('stroke', '#ccc')
  g.selectAll('circle').data(pts).enter().append('circle')
    .attr('cx', d => x(d.x)).attr('cy', d => y(d.y))
    .attr('r', d => 6 + Math.sqrt(d.n))
    .attr('fill', d3.schemeCategory10[1]).attr('opacity', .7)
  g.selectAll('text.lbl').data(pts).enter().append('text')
    .attr('class', 'lbl').attr('x', d => x(d.x) + 8).attr('y', d => y(d.y))
    .text(d => `${d.sid} (n=${d.n})`)
  g.append('g').attr('transform', `translate(0,${h})`).call(d3.axisBottom(x))
  g.append('g').call(d3.axisLeft(y))
  g.append('text').attr('x', w/2).attr('y', h+34).attr('text-anchor', 'middle').text('plausibility')
  g.append('text').attr('transform', `rotate(-90)`).attr('x', -h/2).attr('y', -28)
    .attr('text-anchor', 'middle').text('desirability')
}
</script>

<style scoped>
.panel { padding: .5rem; }
</style>
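
The panel expects `aggregate.polarity[<scenario>]` to carry `n`, `mean_plausibility`, `mean_desirability`, `sd_plausibility` and `sd_desirability`, and splits the plot at the scale midpoint of 4. The sketch below shows one way such a structure and its quadrant label could be derived. The input shape (scenario id mapped to desirability/plausibility pairs) and the function names are assumptions; the Scenario subagent in the plan may compute these differently.

```python
from statistics import fmean, stdev

MIDPOINT = 4  # centre of the 1..7 scale, where the panel draws its quadrant lines


def polarity(ratings: dict) -> dict:
    """ratings: {scenario_id: [(desirability, plausibility), ...]}  (assumed shape)."""
    out = {}
    for sid, pairs in ratings.items():
        if not pairs:
            out[sid] = {"n": 0}  # the panel filters these out via v.n > 0
            continue
        des = [d for d, _ in pairs]
        pla = [p for _, p in pairs]
        out[sid] = {
            "n": len(pairs),
            "mean_desirability": fmean(des),
            "mean_plausibility": fmean(pla),
            # Sample SD needs at least two ratings; fall back to 0.0 otherwise.
            "sd_desirability": stdev(des) if len(des) > 1 else 0.0,
            "sd_plausibility": stdev(pla) if len(pla) > 1 else 0.0,
        }
    return out


def quadrant(entry: dict) -> str:
    """Label a non-empty polarity entry by which side of the midpoint it falls on."""
    d = "desirable" if entry["mean_desirability"] >= MIDPOINT else "undesirable"
    p = "plausible" if entry["mean_plausibility"] >= MIDPOINT else "implausible"
    return f"{d}/{p}"
```

Treating a mean exactly at the midpoint as the positive side is an arbitrary choice here; the quadrant lines in the chart make no such distinction.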

Step 5: Synthesis panel — render markdown report

frontend/src/components/interviews/SynthesisPanel.vue:

<template>
  <div class="panel">
    <h3>Synthesis</h3>
    <div v-if="loading">Loading…</div>
    <div v-else-if="error">{{ error }}</div>
    <pre v-else class="report">{{ report }}</pre>
  </div>
</template>

<script setup>
import { onMounted, ref, watch } from 'vue'
import { getSynthesis } from '../../api/interview'

const props = defineProps({ simId: String, status: Object })
const loading = ref(true); const error = ref(null); const report = ref('')

watch(() => props.status?.status, (s) => { if (s === 'completed') load() })
onMounted(load)

async function load() {
  loading.value = true; error.value = null
  try {
    const r = await getSynthesis(props.simId)
    if (!r.success) { error.value = r.error; return }
    report.value = r.data.report_markdown
  } catch (e) { error.value = String(e) } finally { loading.value = false }
}
</script>

<style scoped>
.panel { padding: .5rem; }
.report { white-space: pre-wrap; font-family: ui-monospace, monospace; line-height: 1.4; }
</style>

Step 6: Build + smoke test

Run: cd frontend && npm run build Expected: build succeeds. Then cd .. && npm run dev and manually visit Step4b for a completed sim_id — verify all five tabs render without console errors.

Step 7: Commit

git add frontend/src/components/interviews/*.vue
git commit -m "feat(interviews): d3 visualisations for longitudinal Δ, diversity PCA, Delphi, scenario polarity, synthesis"

Task 20: Auto-trigger pre-survey on simulation `ready`

Files:

Create: backend/app/services/interviews/lifecycle.py
Create: backend/tests/interviews/test_lifecycle.py
Modify: backend/app/__init__.py (app factory) to install the hook
Step 1: Write failing test

# backend/tests/interviews/test_lifecycle.py
from app.services.interviews.lifecycle import install_hooks

class _StubMgr:
    def __init__(self): self.ready = []; self.completed = []
    def register_on_ready(self, fn): self.ready.append(fn)
    def register_on_completed(self, fn): self.completed.append(fn)

def test_install_hooks_registers_two_callables():
    mgr = _StubMgr()
    install_hooks(mgr)
    assert len(mgr.ready) == 1
    assert len(mgr.completed) == 1
    assert callable(mgr.ready[0])
    assert callable(mgr.completed[0])

Step 2: Run test to verify it fails

Run: cd backend && uv run pytest tests/interviews/test_lifecycle.py -v Expected: ImportError.

Step 3: Implement lifecycle hook installer

backend/app/services/interviews/lifecycle.py:

from __future__ import annotations
import threading
from app.utils.logger import get_logger

logger = get_logger(__name__)

def install_hooks(manager) -> None:
    """Attach interview lifecycle callbacks to a SimulationManager.

    on_ready  → spawn T0 longitudinal in a background thread
    on_completed → spawn full post-sim batch in a background thread
    Hooks are best-effort; failures only log.
    """
    def _on_ready(state) -> None:
        sim_id = getattr(state, "sim_id", None) or getattr(state, "id", None)
        if not sim_id: return
        threading.Thread(target=_run_pre, args=(sim_id,), daemon=True).start()

    def _on_completed(state) -> None:
        sim_id = getattr(state, "sim_id", None) or getattr(state, "id", None)
        if not sim_id: return
        threading.Thread(target=_run_post, args=(sim_id,), daemon=True).start()

    manager.register_on_ready(_on_ready)
    manager.register_on_completed(_on_completed)

def _run_pre(sim_id: str) -> None:
    try:
        from app.api.interview import _build_orchestrator
        orch = _build_orchestrator(sim_id)
        orch.run_pre()
    except Exception as e:
        logger.warning(f"auto pre-survey failed for {sim_id}: {e!r}")

def _run_post(sim_id: str) -> None:
    try:
        from app.api.interview import _build_orchestrator
        from app.services.interview_synthesizer import InterviewSynthesizer
        orch = _build_orchestrator(sim_id)
        orch.run_post()
        InterviewSynthesizer(store=orch.store).run()
    except Exception as e:
        logger.warning(f"auto post-survey failed for {sim_id}: {e!r}")
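
`install_hooks` only needs a manager that exposes `register_on_ready` and `register_on_completed`. The sketch below shows a stand-in manager that fires registered callbacks best-effort, which can help when trying the hooks outside the full application. `DemoManager`, `mark_ready` and `mark_completed` are hypothetical names for illustration; the real `SimulationManager` from the earlier lifecycle task may dispatch differently.

```python
import logging
from typing import Any, Callable, List

log = logging.getLogger(__name__)
Hook = Callable[[Any], None]


class DemoManager:
    """Hypothetical stand-in that mimics the two registration methods used above."""

    def __init__(self) -> None:
        self._on_ready: List[Hook] = []
        self._on_completed: List[Hook] = []

    def register_on_ready(self, fn: Hook) -> None:
        self._on_ready.append(fn)

    def register_on_completed(self, fn: Hook) -> None:
        self._on_completed.append(fn)

    def _fire(self, hooks: List[Hook], state: Any) -> None:
        for fn in hooks:
            try:
                fn(state)
            except Exception as exc:  # best-effort: one failing hook must not block the rest
                log.warning("lifecycle hook %r failed: %r", fn, exc)

    def mark_ready(self, state: Any) -> None:
        self._fire(self._on_ready, state)

    def mark_completed(self, state: Any) -> None:
        self._fire(self._on_completed, state)
```

Catching exceptions at dispatch time complements the try/except inside `_run_pre` and `_run_post`: a hook that fails before its thread starts still cannot break the simulation lifecycle.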

Step 4: Wire into app factory

Read backend/app/__init__.py. Locate where SimulationManager (or its singleton) is instantiated. Add:

    from app.services.interviews.lifecycle import install_hooks
    install_hooks(simulation_manager)

immediately after the manager is constructed. If simulation_manager is module-level in simulation_manager.py, attach the hooks at the bottom of that module instead — the goal is "install once on app startup".
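
An app factory can be called more than once in a process (for example once per test), so a module-level manager could otherwise collect duplicate hooks. One possible guard is sketched below. `install_hooks_once` and the marker attribute are hypothetical and not part of the plan; the `install` argument stands in for `install_hooks`.

```python
from typing import Callable

_FLAG = "_interview_hooks_installed"  # hypothetical marker attribute set on the manager


def install_hooks_once(manager: object, install: Callable[[object], None]) -> bool:
    """Run install(manager) at most once per manager; return True if it ran."""
    if getattr(manager, _FLAG, False):
        return False
    install(manager)
    setattr(manager, _FLAG, True)
    return True
```

Under these assumptions the factory would call `install_hooks_once(simulation_manager, install_hooks)` in place of the direct call.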

Step 5: Run test to verify it passes

Run: cd backend && uv run pytest tests/interviews/test_lifecycle.py -v Expected: 1 passed.

Step 6: Full backend test suite

Run: cd backend && uv run pytest -m "not integration" -q Expected: all unit tests pass.

Run: cd backend && uv run pytest -m integration -q Expected: integration tests pass.

Step 7: Commit

git add backend/app/services/interviews/lifecycle.py backend/app/__init__.py backend/tests/interviews/test_lifecycle.py
git commit -m "feat(interviews): auto-trigger pre and post interviews via SimulationManager lifecycle hooks"

Final verification

Run full backend test suite

Run: cd backend && uv run pytest -q Expected: every test passes.

Run frontend build

Run: cd frontend && npm run build Expected: build succeeds with no errors.

Smoke test the running app

Run: npm run dev from project root. With an existing completed simulation:

Navigate to Step4b in the UI
Click "Run all post-simulation interviews"
Wait for status to reach completed
Verify each of the five tabs renders without console errors
Click "Download CSV" and confirm the file downloads

Verify spec coverage

Re-open docs/superpowers/specs/2026-05-23-stakeholder-interview-subagents-design.md and confirm every section in the spec has a corresponding task:

§3 architectural approach (deterministic runners) → Tasks 5–9
§4 file structure + lifecycle hooks → Tasks 2–14, 20
§5.1–5.4 four instruments → Tasks 6, 7, 8, 9
§5.5 in-character prompting + structured output + cost guardrails → Tasks 4, 5
§6.1 storage layout → Task 10
§6.2 Zep integration → Task 11
§6.3 API surface (all 7 endpoints) → Task 16
§6.4 parallelism + token guard → Task 12 (parallelism); token guard sits in Config.INTERVIEW_MAX_TOKENS_PER_RUN from Task 1 — open: enforcement not implemented in v1; flag if you want it added
§6.5 frontend Step4b + per-tab viz → Tasks 18, 19
§7 error handling (per-agent isolation, schema retry, idempotency) → Tasks 5, 10, 12
§8 validation (schema, instrument, plausibility flags) → Tasks 2, 3 (schema + instrument); plausibility-flags currently sit inside synthesiser §10 — check that flagged thresholds in §8 plausibility checks match what synthesiser currently emits
§9 testing (unit per subagent + integration + stub mode) → Tasks 4, 6–9, 12, 17
§10 methodological caveats in synthesis → Task 14
§11 defaults — already encoded in Task 1 config keys and instrument YAML

If §6.4 token-guard enforcement is needed for v1, add a small follow-up task that computes a projected-token estimate before run_post and returns 400 with confirm=true override — but the spec keeps this as a guard, not a blocker, so it can ship in v1.1.
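
As a starting point for that follow-up task, a projected-token check could look like the sketch below. It is an assumption-laden illustration: the function name, the multiplication-based estimate and the response fields are all hypothetical, and only the idea of comparing against `Config.INTERVIEW_MAX_TOKENS_PER_RUN` with a `confirm=true` override comes from the plan.

```python
def check_token_budget(
    n_agents: int,
    prompts_per_agent: int,
    avg_tokens_per_prompt: int,
    max_tokens: int,
    confirm: bool = False,
) -> tuple[int, dict]:
    """Return (HTTP status, body) for a projected post-sim interview run.

    max_tokens is expected to come from Config.INTERVIEW_MAX_TOKENS_PER_RUN.
    The estimate is a plain product and ignores retries and synthesis calls.
    """
    projected = n_agents * prompts_per_agent * avg_tokens_per_prompt
    allowed = confirm or projected <= max_tokens
    body = {"projected_tokens": projected, "max_tokens": max_tokens, "allowed": allowed}
    return (200 if allowed else 400), body
```

Returning the projection in the body lets the caller show the estimate before resubmitting with `confirm=true`.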

Plan complete and saved to docs/superpowers/plans/2026-05-23-stakeholder-interview-subagents.md. Two execution options:

1. Subagent-Driven (recommended) — I dispatch a fresh subagent per task, review between tasks, fast iteration.

2. Inline Execution — Execute tasks in this session using executing-plans, batch execution with checkpoints.

Which approach?
