# Design: graph-build-empty-fix

## Overview

**Purpose**: Restore non-empty knowledge-graph builds under the post-migration Graphiti + Neo4j stack and migrate the embedding pipeline to a local-by-default model so the documented happy path produces a working pipeline end-to-end.

**Users**: MiroFish maintainers and operators running a fresh checkout, plus existing operators who already pinned `EMBEDDING_*` to a remote provider.

**Impact**: Flips three default values in `backend/app/config.py` (`EMBEDDING_MODEL`, `EMBEDDING_BASE_URL`, `EMBEDDING_API_KEY`) so the embedder targets a local Ollama instance with `mxbai-embed-large`, adds a non-zero-node-count gate to the graph-build worker's completion path, and updates `README.md` / `CLAUDE.md` / `docker-compose.yml` comments / `.env.example` so the documentation matches the new defaults. No new env var, no new dependency, no new provider branch in `_build_llm_and_embedder` — Ollama is reached through the existing `"openai"` provider against its OpenAI-compatible `/v1` endpoint.

### Goals
- Default `.env`-free configuration produces a non-empty `(:Entity {group_id})` set in Neo4j for the uploaded seed material.
- Any silent "succeeded but empty" graph-build outcome is converted into a `Task.status = FAILED` with an actionable error.
- Operators who explicitly set `EMBEDDING_*` to an OpenAI- or Gemini-compatible provider are unaffected on the happy path.
- Documentation (README, CLAUDE.md, docker-compose.yml, `.env.example`) reflects the new default unambiguously.

### Non-Goals
- Startup-time embedder health probe that refuses to boot on dim/model mismatch.
- Tunable `EMBEDDING_DIM` (768/1536 support) — explicit follow-up.
- New provider branch in `_build_llm_and_embedder` (e.g., a dedicated `"ollama"` enum).
- Bundling Ollama or any model binary in `docker-compose.yml`.
- Auto-rebuilding or invalidating project graphs created before this change.
- LLM-side default change — only embedding defaults move.

## Boundary Commitments

### This Spec Owns
- The three `EMBEDDING_*` default values in `backend/app/config.py`.
- The `.env.example` block ordering / commenting that presents Ollama as active and OpenAI/Gemini as fallbacks.
- A non-zero-node-count gate in `GraphBuilderService._build_graph_worker` that converts an empty-graph completion into a `fail_task(...)`.
- Wording of the embedder section in `README.md`, `CLAUDE.md`, and `docker-compose.yml` comments.
- One new locale key (`progress.emptyGraphFailure`) in `locales/en.json` and `locales/zh.json` for the gate's failure message.

### Out of Boundary
- Any change to `_build_llm_and_embedder`'s provider factory beyond what these defaults exercise.
- Changes to `_recover_stuck_projects` — its `count > 0` gate already matches the contract.
- The single-episode `_GraphNamespace.add(...)` path (already raises naturally).
- The graphiti-core dependency version.
- Pre-existing project-graph migration / backfill.

### Allowed Dependencies
- `backend/app/services/graphiti_adapter.py` (read-only — the OpenAI branch is already Ollama-compatible).
- `backend/app/models/task.py` `TaskManager.fail_task` / `complete_task`.
- `backend/app/utils/locale.t` for the new failure message.
- Existing loud-failure contract from spec `graphiti-ollama-embedder`.

### Revalidation Triggers
- A graphiti-core upgrade that changes `EMBEDDING_DIM` away from 1024.
- A change to the recovery contract in `_recover_stuck_projects`.
- Introduction of a new embedder provider branch (would invalidate the "Ollama-via-openai-branch" assumption).
- Any new long-running task type built on the same pattern would need a parallel non-zero-count gate.

## Architecture

### Existing Architecture Analysis

- **Embedder construction**: `_build_llm_and_embedder` (`graphiti_adapter.py:92-139`) branches on `GRAPHITI_LLM_PROVIDER` ∈ {`openai`, `gemini`}. The `"openai"` branch composes `OpenAIEmbedder(OpenAIEmbedderConfig(api_key, base_url, embedding_model))` where each field falls back from `EMBEDDING_*` to `LLM_*`. Ollama's `/v1/embeddings` is OpenAI-shape-compatible, so the existing branch suffices.
- **Graph-build worker**: `_build_graph_worker` (`graph_builder.py:140-230`) ingests chunks via `add_text_batches` → `_GraphNamespace.add_batch`, waits on a no-op `episode.get` poll, fetches `_get_graph_info(graph_id)`, then calls `complete_task` with the resulting node/edge counts. The failure path is a broad `except Exception` → traceback → `fail_task(task_id, error_msg)`.
- **Loud-failure contract** (from spec #18): `_GraphNamespace.add_batch` logs the underlying `add_episode` exception at `ERROR` and `raise`s — no placeholder UUID return path.
- **Startup recovery**: `_recover_stuck_projects` (`__init__.py:88-109`) promotes `GRAPH_BUILDING` → `GRAPH_COMPLETED` only when `count(:Entity {group_id}) > 0`.

These patterns are preserved; this design extends `_build_graph_worker` with a symmetric `count > 0` check before `complete_task` is called.
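
The `EMBEDDING_*` to `LLM_*` fallback described for the `"openai"` branch can be sketched as below. This is a minimal illustration rather than the adapter's actual code: `EmbedderSettings` and `resolve_embedder_settings` are hypothetical names, a plain dataclass stands in for `OpenAIEmbedderConfig`, and the `LLM_BASE_URL` and `LLM_MODEL_NAME` keys are assumed (only `LLM_API_KEY` is named in this design).

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class EmbedderSettings:
    """Hypothetical stand-in for the fields passed to OpenAIEmbedderConfig."""
    api_key: Optional[str]
    base_url: Optional[str]
    embedding_model: Optional[str]


def resolve_embedder_settings(cfg: Mapping[str, Optional[str]]) -> EmbedderSettings:
    """Prefer EMBEDDING_* values; fall back to LLM_* when a value is unset or empty."""
    return EmbedderSettings(
        api_key=cfg.get("EMBEDDING_API_KEY") or cfg.get("LLM_API_KEY"),
        # LLM_BASE_URL and LLM_MODEL_NAME are assumed key names for this sketch.
        base_url=cfg.get("EMBEDDING_BASE_URL") or cfg.get("LLM_BASE_URL"),
        embedding_model=cfg.get("EMBEDDING_MODEL") or cfg.get("LLM_MODEL_NAME"),
    )
```

With the new defaults all three `EMBEDDING_*` values are populated on a clean checkout, so under this sketch the right-hand side of each `or` is never reached unless an operator sets a value to an empty string.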

### Architecture Pattern & Boundary Map

```mermaid
graph TB
    EnvFile[dotenv]
    Config[Config]
    Adapter[GraphitiAdapter _build_llm_and_embedder]
    Embedder[OpenAIEmbedder]
    Ollama[Local Ollama mxbai_embed_large]
    Worker[_build_graph_worker]
    Neo4j[Neo4j Vector Index 1024 dim]
    TaskMgr[TaskManager]
    Recovery[_recover_stuck_projects]

    EnvFile --> Config
    Config --> Adapter
    Adapter --> Embedder
    Embedder --> Ollama
    Worker --> Adapter
    Adapter --> Neo4j
    Worker --> Neo4j
    Worker --> TaskMgr
    Recovery --> Neo4j
    Recovery --> TaskMgr
```

**Architecture Integration**:
- **Selected pattern**: Defaults-flip + completion-gate (Option C from the gap analysis). Preserves the existing layered flow; the gate reuses the worker's existing synchronous `_get_graph_info` read and adds only one branch, not an extra Neo4j round-trip.
- **Domain / feature boundaries**: `Config` owns env-driven defaults; `GraphitiAdapter` owns provider construction; `GraphBuilderService` owns the worker lifecycle and the new `count > 0` gate; `_recover_stuck_projects` owns the symmetric startup-side gate. No cross-cutting changes.
- **Existing patterns preserved**: Single-Graphiti-singleton; persistent event loop; loud `add_batch`; broad worker `except Exception`; `group_id`-scoped reads.
- **New components rationale**: None. Only one new locale key and a ~5-line gate inside the existing worker.
- **Steering compliance**: Stays inside the adapter (`database.md`); reaches `fail_task` on the unhappy path (`error-handling.md`); configuration centralised in `config.py` (`structure.md`); per-project `group_id` filter preserved.

### Technology Stack & Alignment

| Layer | Choice / Version | Role in Feature | Notes |
|-------|------------------|-----------------|-------|
| Backend / Services | Python ≥3.11, Flask 3.0 | Hosts the graph-build worker and its new non-zero-count gate | Existing stack; no change. |
| Data / Storage | Neo4j 5.x Community + `graphiti-core` ≥ 0.3 | Owns the 1024-dim vector index that the embedder must match | `EMBEDDING_DIM = 1024` is a graphiti-core invariant; not exposed. |
| External | Ollama (operator-managed) + `mxbai-embed-large` (1024-dim) | New default embedding provider, reached over OpenAI-shaped `/v1/embeddings` | Reached via `http://localhost:11434/v1` in host mode; `http://host.docker.internal:11434/v1` in Docker. |
| Frontend / CLI | Vue 3 + `vue-i18n` | Renders the new "graph build produced 0 entities" failure message | One new locale key in `locales/en.json` and `locales/zh.json`. |
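
The 1024-dim requirement in the table can be checked by hand against an OpenAI-shaped `/v1/embeddings` response, whose vectors sit under `data[i].embedding`. The sketch below is an ad hoc operator check, not the startup probe that the Non-Goals exclude; `embedding_dim` and `check_embedding_dim` are hypothetical helper names.

```python
from typing import Any, Mapping

EXPECTED_DIM = 1024  # the graphiti-core invariant named in this design


def embedding_dim(response: Mapping[str, Any]) -> int:
    """Return the vector length of the first embedding in an OpenAI-shaped response."""
    data = response.get("data") or []
    if not data:
        raise ValueError("embeddings response contains no data entries")
    return len(data[0]["embedding"])


def check_embedding_dim(response: Mapping[str, Any], expected: int = EXPECTED_DIM) -> None:
    """Raise ValueError if the embedder returned vectors of an unexpected dimension."""
    actual = embedding_dim(response)
    if actual != expected:
        raise ValueError(
            f"embedding dimension {actual} does not match expected {expected}"
        )
```

Feeding it the parsed JSON body returned by the embedder endpoint would flag a 768-dim or 1536-dim model before a graph build is attempted.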

## File Structure Plan

### Modified Files

- `backend/app/config.py` — Change the three `EMBEDDING_*` defaults (lines 42, 52, 53). No new fields.
- `backend/app/services/graph_builder.py` — After `_get_graph_info(graph_id)` in `_build_graph_worker`, check `graph_info.node_count > 0`; if zero, call `task_manager.fail_task(task_id, …)` with a `t('progress.emptyGraphFailure')` message and `return` instead of `complete_task`. Log at `ERROR` level.
- `locales/en.json` — Add `progress.emptyGraphFailure` (English).
- `locales/zh.json` — Add `progress.emptyGraphFailure` (Chinese, mirroring the existing `progress.*` style).
- `README.md` — In the env-block code fence (around lines 163-173), move the Ollama lines out of comments and demote the OpenAI/Gemini line to a commented fallback example. Adjust the surrounding prose so the Ollama prerequisite (`ollama pull mxbai-embed-large`) is part of the default setup checklist alongside Neo4j.
- `CLAUDE.md` — In the "Required Environment Variables" section (around lines 72-80), state that the active default `EMBEDDING_MODEL` is `mxbai-embed-large` via Ollama; demote OpenAI/Gemini to "Other supported configurations".
- `docker-compose.yml` — Tighten the L31-33 comment so it points operators at the `.env.example` Ollama block as the active default rather than as an optional override.

### Hook-Protected File (operator-coordinated)

- `.env.example` — The block layout must end up: uncommented `EMBEDDING_BASE_URL=http://host.docker.internal:11434/v1`, `EMBEDDING_API_KEY=ollama`, `EMBEDDING_MODEL=mxbai-embed-large`; OpenAI and Gemini examples remain present but as commented blocks below. The implementation phase produces the exact diff and either coordinates the edit with the developer or records it in `HANDOFF.md`.

> Directory structure is unchanged; no new files are introduced.

## System Flows

### Graph-build completion gate

```mermaid
sequenceDiagram
    participant API as graph_bp
    participant Worker as _build_graph_worker
    participant Adapter as GraphitiAdapter
    participant Neo as Neo4j
    participant Task as TaskManager

    API->>Worker: start (text, ontology, group_id)
    Worker->>Adapter: add_batch(chunks)
    Adapter->>Neo: add_episode per chunk (entities, edges)
    Adapter-->>Worker: episode_uuids OR raises
    alt add_batch raised
        Worker->>Task: fail_task(err)
    else add_batch returned
        Worker->>Adapter: _get_graph_info(group_id)
        Adapter-->>Worker: GraphInfo(node_count, edge_count)
        alt node_count == 0
            Worker->>Task: fail_task("graph build produced 0 entities")
        else node_count > 0
            Worker->>Task: complete_task(graph_info)
        end
    end
```

**Key decisions captured by the diagram**:
- The gate runs *after* the existing `_get_graph_info` call so it costs one extra branch, not an extra Neo4j round-trip.
- The gate fires only when `add_batch` returned without raising — it is strictly a defense for "succeeded but empty," not a replacement for the loud-failure contract.
- Edge count is **not** part of the gate: the contract from `_recover_stuck_projects` is "non-zero entities ⇒ COMPLETED", and edges may legitimately lag entities in some graphiti-core flows.

## Requirements Traceability

| Requirement | Summary | Components | Interfaces | Flows |
|-------------|---------|------------|------------|-------|
| 1.1 | Reproduce on `main` defaults before fixing | Implementation log (PR) | — | — |
| 1.2 | Document root cause(s) in PR + design.md | `design.md` Overview, `research.md` Research Log | — | — |
| 1.3 | If dim-mismatch, record dims | `research.md` Research Log → Embedder construction path | — | — |
| 1.4 | If a new silent path is found, remediate via R4 | `graph_builder.py` worker gate | `TaskManager.fail_task` | Graph-build completion gate |
| 1.5 | Post-fix reproduction writes non-zero entities | End-to-end smoke (PR description) | — | — |
| 2.1 | `Config.EMBEDDING_*` defaults point to local Ollama | `backend/app/config.py` | — | — |
| 2.2 | `.env.example` presents Ollama uncommented | `.env.example` | — | — |
| 2.3 | Default config end-to-end produces non-empty graph | All modified files | — | Graph-build completion gate |
| 2.4 | No reachable Ollama ⇒ `Task.FAILED` with named error | `graph_builder.py` worker (existing `except`) | `TaskManager.fail_task` | — |
| 2.5 | Ollama goes through existing `_build_llm_and_embedder` `openai` branch | `graphiti_adapter.py` (read-only) | — | — |
| 3.1 | Keep `EMBEDDING_DIM = 1024`; default model is 1024-dim | `config.py`, `CLAUDE.md` | — | — |
| 3.2 | CLAUDE.md states the 1024 invariant and rules out 768-dim | `CLAUDE.md` | — | — |
| 3.3 | Dim-mismatch override ⇒ loud `Task.FAILED` | `graph_builder.py` worker gate + existing loud `add_batch` | `TaskManager.fail_task` | Graph-build completion gate |
| 3.4 | No new `EMBEDDING_DIM` env var | — | — | — |
| 4.1 | Preserve loud `add_batch` from #18 | `graphiti_adapter.py` (read-only) | — | — |
| 4.2 | Remediate any new silent call site found in R1 | `graph_builder.py` worker gate | `TaskManager.fail_task` | Graph-build completion gate |
| 4.3 | Embedder-construction failure ⇒ worker `Task.FAILED` | `graph_builder.py` worker (existing `except`) | `TaskManager.fail_task` | — |
| 4.4 | Log propagated failure at ERROR before `fail_task` | `graph_builder.py` worker gate | `logger.error` / `logger.exception` | — |
| 4.5 | `GRAPH_COMPLETED` only when `node_count > 0` | `graph_builder.py` worker gate | `TaskManager.complete_task` | Graph-build completion gate |
| 5.1 | Existing OpenAI/Gemini configs unchanged behavior | `graphiti_adapter.py` (read-only), `config.py` | — | — |
| 5.2 | No new env var | — | — | — |
| 5.3 | Pre-existing 1536-dim graphs remain readable when operator keeps their override | `config.py` (override-wins semantics unchanged) | — | — |
| 5.4 | `GRAPHITI_LLM_PROVIDER` default stays `openai` | `config.py` (unchanged) | — | — |
| 6.1 | CLAUDE.md describes Ollama as default | `CLAUDE.md` | — | — |
| 6.2 | README setup names `ollama pull` prerequisite | `README.md` | — | — |
| 6.3 | docker-compose / README documents host.docker.internal:11434 | `docker-compose.yml`, `README.md` | — | — |
| 6.4 | One-line `curl` smoke test in docs | `README.md` (already present, retain) | — | — |
| 7.1 | Profile generation reads the new graph | End-to-end smoke (PR description) | — | — |
| 7.2 | Report-agent tools return non-empty results | End-to-end smoke (PR description) | — | — |
| 7.3 | PR documents the smoke-test path | PR description | — | — |
| 7.4 | If smoke test not run, PR says so explicitly | PR description | — | — |

## Components and Interfaces

| Component | Domain/Layer | Intent | Req Coverage | Key Dependencies (P0/P1) | Contracts |
|-----------|--------------|--------|--------------|--------------------------|-----------|
| `Config` (modified) | Backend / config | Owns the three `EMBEDDING_*` defaults that flip from OpenAI to Ollama | 2.1, 5.4 | dotenv (P0) | State |
| `GraphBuilderService._build_graph_worker` (modified) | Backend / services | Adds the non-zero-node-count gate before `complete_task` | 1.4, 3.3, 4.2, 4.4, 4.5 | `_get_graph_info` (P0), `TaskManager` (P0), `locale.t` (P1) | Batch |
| Locale entries (new key) | Shared / i18n | One key (`progress.emptyGraphFailure`) so the gate's message is translated | 4.4, 4.5 | `vue-i18n` (P1), `utils.locale.t` (P1) | State |
| Docs set (`README.md`, `CLAUDE.md`, `docker-compose.yml`, `.env.example`) | Docs | Updates the documented happy path to local-by-default | 2.2, 6.1, 6.2, 6.3, 6.4 | — | — |

### Backend / Config

#### `Config` (modified)

| Field | Detail |
|-------|--------|
| Intent | Flip the three `EMBEDDING_*` defaults from OpenAI to Ollama. |
| Requirements | 2.1, 5.4 |

**Responsibilities & Constraints**
- Owns the env-driven embedder defaults consumed by `_build_llm_and_embedder`.
- Must not introduce a new env var or remove any existing one.
- Operator-set `EMBEDDING_*` continues to win over the defaults (override semantics unchanged).

**Dependencies**
- Inbound: `_build_llm_and_embedder` (P0) reads the three values.
- Outbound: none.
- External: dotenv (P0) loads the `.env` file before class evaluation.

**Contracts**: State ☑.

##### State Management
- **State model**: Three class attributes on `Config`, evaluated once when the module is imported:
  - `EMBEDDING_MODEL = os.environ.get('EMBEDDING_MODEL', 'mxbai-embed-large')`
  - `EMBEDDING_BASE_URL = os.environ.get('EMBEDDING_BASE_URL', 'http://localhost:11434/v1')`
  - `EMBEDDING_API_KEY = os.environ.get('EMBEDDING_API_KEY', 'ollama')`
- **Persistence & consistency**: Read once at import; no runtime mutation.
- **Concurrency strategy**: N/A (read-only after import).
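
The override-wins behaviour of these three defaults can be illustrated with a small function that takes the environment as a parameter. `embedding_defaults` is a hypothetical helper written for this sketch; the real values are class attributes on `Config`.

```python
import os
from typing import Dict, Mapping


def embedding_defaults(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Resolve the three EMBEDDING_* values with the same defaults as Config."""
    return {
        "EMBEDDING_MODEL": env.get("EMBEDDING_MODEL", "mxbai-embed-large"),
        "EMBEDDING_BASE_URL": env.get(
            "EMBEDDING_BASE_URL", "http://localhost:11434/v1"
        ),
        "EMBEDDING_API_KEY": env.get("EMBEDDING_API_KEY", "ollama"),
    }
```

Note that `get` only applies the default when the variable is absent. A variable that is set but empty comes back as an empty string, which is the one case where the adapter's `or` fallback to `LLM_*` would still apply.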

**Implementation Notes**
- Integration: `_build_llm_and_embedder`'s existing fallback `Config.EMBEDDING_API_KEY or Config.LLM_API_KEY` continues to work; with the new defaults, the fallback is no longer triggered on a clean checkout.
- Validation: None added — embedder errors continue to surface via the worker's existing `except`.
- Risks: An operator who previously relied on "leave `EMBEDDING_*` unset to inherit `LLM_*`" will, after this change, hit Ollama at `http://localhost:11434/v1` instead. README and CLAUDE.md call this out under "Backwards compatibility".

### Backend / Services

#### `GraphBuilderService._build_graph_worker` (modified)

| Field | Detail |
|-------|--------|
| Intent | Convert "graph build succeeded but produced 0 entities" into a `Task.FAILED`. |
| Requirements | 1.4, 3.3, 4.2, 4.4, 4.5 |

**Responsibilities & Constraints**
- Preserves the existing stage progression (create → set ontology → split → batch → wait → fetch info → complete).
- New behavior: after `graph_info = self._get_graph_info(graph_id)`, if `graph_info.node_count == 0`, call `task_manager.fail_task(...)` with a localised error and `return` (skip `complete_task`).
- Must log at `ERROR` level *before* the `fail_task` call so server logs carry the diagnostic ahead of the task envelope.
- Must not weaken the existing `except Exception` branch — the gate is *additional*, not a replacement.

**Dependencies**
- Inbound: `graph_bp` (P0) invokes `build_graph_async` which calls this worker.
- Outbound: `_get_graph_info(graph_id)` (P0), `TaskManager.fail_task` (P0), `TaskManager.complete_task` (P0), `utils.locale.t` (P1).
- External: `logger.error` (P1).

**Contracts**: Batch ☑.

##### Batch / Job Contract
- **Trigger**: `build_graph_async` spawns the worker thread from a `POST /api/graph/build` request.
- **Input / validation**: Unchanged from current contract (`text`, `ontology`, `graph_name`, `chunk_size`, `chunk_overlap`, `batch_size`, `locale`).
- **Output / destination**: `Task` envelope on `TaskManager`. On success: `Task.status = COMPLETED`, `Task.result = {graph_id, graph_info, chunks_processed}`. On gate trip: `Task.status = FAILED`, `Task.error = t('progress.emptyGraphFailure')`.
- **Idempotency & recovery**: Unchanged. `_recover_stuck_projects` continues to gate on `count(:Entity) > 0`; the worker's new gate makes the live-completion path symmetric.

**Implementation Notes**
- Integration: One block inserted between the existing `graph_info = self._get_graph_info(graph_id)` (line ~219) and `self.task_manager.complete_task(...)` (line ~221). Approximately 5 lines.
- Validation: Confirmed empirically that `_get_graph_info` returns `node_count == 0` when Neo4j holds no `(:Entity {group_id})` rows.
- Risks: A misconfigured embedder already surfaces through the existing `except Exception`, because `add_batch` re-raises. The new gate covers the residual case where graphiti-core *returns successfully but writes nothing*, which is the exact failure mode the ticket reports.
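
A minimal sketch of the gate, under stated assumptions: `InMemoryTaskManager` and `finish_build` are hypothetical stand-ins, the `complete_task` signature is assumed, and a constant replaces the `t('progress.emptyGraphFailure')` lookup. The `GraphInfo` fields follow the shape used in the Testing Strategy.

```python
import logging
from dataclasses import dataclass, field
from typing import List, Optional

logger = logging.getLogger("graph_builder_sketch")

# Stand-in for t('progress.emptyGraphFailure'); the real text lives in the locale files.
EMPTY_GRAPH_MESSAGE = "graph build produced 0 entities"


@dataclass
class GraphInfo:
    graph_id: str
    node_count: int
    edge_count: int
    entity_types: List[str] = field(default_factory=list)


class InMemoryTaskManager:
    """Hypothetical stand-in that records the terminal call made by the worker."""

    def __init__(self) -> None:
        self.status: Optional[str] = None
        self.detail: object = None

    def fail_task(self, task_id: str, error: str) -> None:
        self.status, self.detail = "FAILED", error

    def complete_task(self, task_id: str, result: dict) -> None:
        self.status, self.detail = "COMPLETED", result


def finish_build(
    task_manager: InMemoryTaskManager, task_id: str, graph_info: GraphInfo
) -> None:
    """Apply the completion gate: fail on zero entities, otherwise complete."""
    if graph_info.node_count == 0:
        # Log first so server logs carry the diagnostic ahead of the task envelope.
        logger.error(
            "graph build produced 0 entities for group_id=%s", graph_info.graph_id
        )
        task_manager.fail_task(task_id, EMPTY_GRAPH_MESSAGE)
        return
    # Edge count is deliberately not inspected; only entities decide the outcome.
    task_manager.complete_task(
        task_id, {"graph_id": graph_info.graph_id, "graph_info": graph_info}
    )
```

A graph with entities but no edges still completes in this sketch, matching the decision to keep edge count out of the gate.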

### Shared / i18n

#### `progress.emptyGraphFailure` (new locale key)

| Field | Detail |
|-------|--------|
| Intent | Localised failure message for the new gate. |
| Requirements | 4.4, 4.5 |

**Contracts**: State ☑.

##### State Management
- **State model**: One additional entry in the `progress` namespace of `locales/en.json` and `locales/zh.json`.
- **Persistence & consistency**: File-based locales loaded by `vue-i18n` (frontend) and `utils.locale` (backend). Keys must exist in both files; the `progress.*` namespace is the established home for graph-build status strings.
- **Concurrency strategy**: N/A.

**Implementation Notes**
- Integration: Backend calls `t('progress.emptyGraphFailure')` from `_build_graph_worker`. Frontend renders the same key in `Step1GraphBuild.vue`'s failure surface (no code change — it already displays `Task.error`).
- Validation: Smoke-test the key resolves in both locales (`set_locale('en')` / `set_locale('zh')`).
- Risks: None — additive change.
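
The "key must exist in both files" rule can be checked with a short lookup over the parsed locale data. This sketch assumes the locale files are nested JSON objects (for example a `progress` object containing `emptyGraphFailure`) rather than flat dotted keys; `lookup` and `missing_locales` are hypothetical helpers, not part of `utils.locale`.

```python
from typing import Any, Dict, List, Optional


def lookup(messages: Dict[str, Any], dotted_key: str) -> Optional[str]:
    """Resolve a dotted key such as 'progress.emptyGraphFailure' in nested data."""
    node: Any = messages
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node if isinstance(node, str) else None


def missing_locales(locales: Dict[str, Dict[str, Any]], dotted_key: str) -> List[str]:
    """Return the locale names where the key is absent or not a non-empty string."""
    return sorted(
        name for name, messages in locales.items() if not lookup(messages, dotted_key)
    )
```

Loading each file with `json.load` and passing `{"en": ..., "zh": ...}` to `missing_locales` should return an empty list once both entries are in place.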

## Error Handling

### Error Strategy

This spec contributes one new error case (`progress.emptyGraphFailure`) and re-uses the existing transport (`TaskManager.fail_task` → polling endpoint → frontend renders `Task.error`).

### Error Categories and Responses

- **Embedder unreachable** (e.g., Ollama not running): caught by `_build_graph_worker`'s existing `except Exception` after `add_batch` raises. `Task.FAILED` with the underlying connection error.
- **Dim-mismatch override** (operator points `EMBEDDING_MODEL` at a non-1024-dim model): caught by `add_batch`'s loud-failure contract (Neo4j or graphiti-core raises). `Task.FAILED` with the underlying dim-mismatch error.
- **Empty graph after a clean `add_batch`** (new case): caught by the gate. `Task.FAILED` with `t('progress.emptyGraphFailure')`. Logged at `ERROR` before the `fail_task` call.

### Monitoring

- Existing `logger.exception` / `logger.error` lines in `graphiti_adapter.py` and `graph_builder.py` carry the underlying error.
- New `logger.error('graph build produced 0 entities for group_id=%s', graph_id)` line precedes the gate's `fail_task` call.

## Testing Strategy

- **Unit-level smoke** (manual, scripted): With Ollama down → `npm run dev` → start a graph build → expect `Task.status = FAILED` and `Task.error` containing a connectivity message. With Ollama up and `mxbai-embed-large` pulled → expect `Task.status = COMPLETED` and `graph_info.node_count > 0`.
- **Configuration smoke**: Confirm `Config.EMBEDDING_MODEL`, `Config.EMBEDDING_BASE_URL`, `Config.EMBEDDING_API_KEY` resolve to the new defaults when `.env` is empty.
- **Backwards-compat smoke**: With `.env` setting `EMBEDDING_*` to OpenAI's values, confirm `_build_llm_and_embedder` constructs the OpenAI embedder exactly as before (no observable change).
- **Gate unit-style test**: Patch `_get_graph_info` to return `GraphInfo(graph_id=…, node_count=0, edge_count=0, entity_types=[])` and assert the worker calls `fail_task` with the localised key (no real pytest harness expansion — short repro in PR description is sufficient given the existing minimal test policy).
- **End-to-end** (Req 7): Graph build → env-setup (profile generation) → report-agent query on a representative seed file. PR description documents the run; if the maintainer cannot run it locally, the PR description states that explicitly.
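
The gate unit-style test could look roughly like the sketch below, using `unittest.mock` from the standard library. `WorkerSketch` is a hypothetical stand-in for `GraphBuilderService` reduced to the completion path, `SimpleNamespace` stands in for `GraphInfo`, and the locale key is passed untranslated to keep the example self-contained.

```python
from types import SimpleNamespace
from unittest.mock import MagicMock, patch


class WorkerSketch:
    """Hypothetical stand-in for GraphBuilderService, reduced to the completion path."""

    def __init__(self, task_manager):
        self.task_manager = task_manager

    def _get_graph_info(self, graph_id):
        raise NotImplementedError("patched in the test; the real method reads Neo4j")

    def finish(self, task_id, graph_id):
        graph_info = self._get_graph_info(graph_id)
        if graph_info.node_count == 0:
            self.task_manager.fail_task(task_id, "progress.emptyGraphFailure")
            return
        self.task_manager.complete_task(task_id, {"graph_id": graph_id})


def run_gate_check():
    """Patch _get_graph_info to report an empty graph and return the mock task manager."""
    task_manager = MagicMock()
    worker = WorkerSketch(task_manager)
    empty = SimpleNamespace(
        graph_id="g1", node_count=0, edge_count=0, entity_types=[]
    )
    with patch.object(WorkerSketch, "_get_graph_info", return_value=empty):
        worker.finish("task-1", "g1")
    return task_manager
```

The returned mock can then be inspected with `fail_task.assert_called_once_with(...)` and `complete_task.assert_not_called()`.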

## Migration Strategy

```mermaid
flowchart LR
    Start[merge to main]
    Pull[operator runs ollama pull mxbai_embed_large]
    Restart[restart backend]
    Build[start fresh graph build]
    Verify[verify entities in Neo4j]
    Done[done]

    Start --> Pull
    Pull --> Restart
    Restart --> Build
    Build --> Verify
    Verify --> Done
```

- **Phase 1 (no operator action)**: For operators with explicit `EMBEDDING_*` overrides — no change. Pre-existing project graphs remain readable.
- **Phase 2 (default-using operators)**: One-time `ollama pull mxbai-embed-large` and restart. Pre-existing project graphs created against the previous default (1536-dim text-embedding-3-small with a Dashscope LLM key, which most likely already produced empty graphs per the ticket) are effectively invalid under the new default. Because automatic invalidation and rebuild are non-goals, operators rebuild them manually.
- **Rollback trigger**: If an operator cannot run Ollama, they re-add the OpenAI or Gemini `EMBEDDING_*` block to `.env` (the README's commented fallback) and restart. No code rollback required.
- **Validation checkpoint**: After the first graph build under the new defaults, the `node_count > 0` gate proves the migration succeeded.