MicroFish/.kiro/specs/graph-build-empty-fix/design.md

Design: graph-build-empty-fix

Overview

Purpose: Restore non-empty knowledge-graph builds under the post-migration Graphiti + Neo4j stack and migrate the embedding pipeline to a local-by-default model so the documented happy path produces a working pipeline end-to-end.

Users: MiroFish maintainers and operators running a fresh checkout, plus existing operators who already pinned EMBEDDING_* to a remote provider.

Impact: Flips three default values in backend/app/config.py (EMBEDDING_MODEL, EMBEDDING_BASE_URL, EMBEDDING_API_KEY) so the embedder targets a local Ollama instance with mxbai-embed-large, adds a non-zero-node-count gate to the graph-build worker's completion path, and updates README.md / CLAUDE.md / docker-compose.yml comments / .env.example so the documentation matches the new defaults. No new env var, no new dependency, no new provider branch in _build_llm_and_embedder — Ollama is reached through the existing "openai" provider against its OpenAI-compatible /v1 endpoint.

Goals

  • Default .env-free configuration produces a non-empty (:Entity {group_id}) set in Neo4j for the uploaded seed material.
  • Any silent "succeeded but empty" graph-build outcome is converted into a Task.status = FAILED with an actionable error.
  • Existing OpenAI- / Gemini-compatible operators are unaffected on the happy path.
  • Documentation (README, CLAUDE.md, docker-compose.yml, .env.example) reflects the new default unambiguously.

Non-Goals

  • Startup-time embedder health probe that refuses to boot on dim/model mismatch.
  • Tunable EMBEDDING_DIM (768/1536 support) — explicit follow-up.
  • New provider branch in _build_llm_and_embedder (e.g., a dedicated "ollama" enum).
  • Bundling Ollama or any model binary in docker-compose.yml.
  • Auto-rebuilding or invalidating project graphs created before this change.
  • LLM-side default change — only embedding defaults move.

Boundary Commitments

This Spec Owns

  • The three EMBEDDING_* default values in backend/app/config.py.
  • The .env.example block ordering / commenting that presents Ollama as active and OpenAI/Gemini as fallbacks.
  • A non-zero-node-count gate in GraphBuilderService._build_graph_worker that converts an empty-graph completion into a fail_task(...).
  • Wording of the embedder section in README.md, CLAUDE.md, and docker-compose.yml comments.
  • One new locale key (progress.emptyGraphFailure) in locales/en.json and locales/zh.json for the gate's failure message.

Out of Boundary

  • Any change to _build_llm_and_embedder's provider factory beyond what these defaults exercise.
  • Changes to _recover_stuck_projects — its count > 0 gate already matches the contract.
  • The single-episode _GraphNamespace.add(...) path (already raises naturally).
  • The graphiti-core dependency version.
  • Pre-existing project-graph migration / backfill.

Allowed Dependencies

  • backend/app/services/graphiti_adapter.py (read-only — the OpenAI branch is already Ollama-compatible).
  • backend/app/models/task.py TaskManager.fail_task / complete_task.
  • backend/app/utils/locale.t for the new failure message.
  • Existing loud-failure contract from spec graphiti-ollama-embedder.

Revalidation Triggers

  • A graphiti-core upgrade that changes EMBEDDING_DIM away from 1024.
  • A change to the recovery contract in _recover_stuck_projects.
  • Introduction of a new embedder provider branch (would invalidate the "Ollama-via-openai-branch" assumption).
  • Any new long-running task type built on the same pattern would need a parallel non-zero-count gate.

Architecture

Existing Architecture Analysis

  • Embedder construction: _build_llm_and_embedder (graphiti_adapter.py:92-139) branches on GRAPHITI_LLM_PROVIDER ∈ {openai, gemini}. The "openai" branch composes OpenAIEmbedder(OpenAIEmbedderConfig(api_key, base_url, embedding_model)) where each field falls back from EMBEDDING_* to LLM_*. Ollama's /v1/embeddings is OpenAI-shape-compatible, so the existing branch suffices.
  • Graph-build worker: _build_graph_worker (graph_builder.py:140-230) ingests chunks via add_text_batches → _GraphNamespace.add_batch, waits on a no-op episode.get poll, fetches _get_graph_info(graph_id), then calls complete_task with the resulting node/edge counts. The failure path is a broad except Exception → traceback → fail_task(task_id, error_msg).
  • Loud-failure contract (from spec #18): _GraphNamespace.add_batch logs the underlying add_episode exception at ERROR and raises — no placeholder UUID return path.
  • Startup recovery: _recover_stuck_projects (__init__.py:88-109) promotes GRAPH_BUILDING → GRAPH_COMPLETED only when count(:Entity {group_id}) > 0.

These patterns are preserved; this design extends _build_graph_worker with a symmetric count > 0 check before complete_task is called.
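
The EMBEDDING_* to LLM_* fallback described in the embedder-construction bullet can be sketched as follows. This is an illustration under stated assumptions, not the adapter's actual code: EmbedderSettings and resolve_embedder_settings are hypothetical names, and LLM_BASE_URL and LLM_MODEL_NAME are assumed field names (only the EMBEDDING_API_KEY / LLM_API_KEY pair is named in this design).

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class EmbedderSettings:
    """Hypothetical container for the three values handed to the embedder."""
    api_key: Optional[str]
    base_url: Optional[str]
    model: Optional[str]


def resolve_embedder_settings(cfg: Any) -> EmbedderSettings:
    """Prefer each EMBEDDING_* value; fall back to its LLM_* twin when empty.

    LLM_BASE_URL and LLM_MODEL_NAME are assumed attribute names.
    """
    return EmbedderSettings(
        api_key=cfg.EMBEDDING_API_KEY or cfg.LLM_API_KEY,
        base_url=cfg.EMBEDDING_BASE_URL or cfg.LLM_BASE_URL,
        model=cfg.EMBEDDING_MODEL or cfg.LLM_MODEL_NAME,
    )
```

Because the fallback uses `or`, any non-empty EMBEDDING_* value wins. With the new defaults all three values are non-empty, so on a clean checkout the LLM_* side is never consulted.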

Architecture Pattern & Boundary Map

graph TB
    EnvFile[dotenv]
    Config[Config]
    Adapter[GraphitiAdapter _build_llm_and_embedder]
    Embedder[OpenAIEmbedder]
    Ollama[Local Ollama mxbai_embed_large]
    Worker[_build_graph_worker]
    Neo4j[Neo4j Vector Index 1024 dim]
    TaskMgr[TaskManager]
    Recovery[_recover_stuck_projects]

    EnvFile --> Config
    Config --> Adapter
    Adapter --> Embedder
    Embedder --> Ollama
    Worker --> Adapter
    Adapter --> Neo4j
    Worker --> Neo4j
    Worker --> TaskMgr
    Recovery --> Neo4j
    Recovery --> TaskMgr

Architecture Integration:

  • Selected pattern: Defaults-flip + completion-gate (Option C from the gap analysis). Preserves the existing layered flow; adds one synchronous check inside the worker on the node count that _get_graph_info already returns.
  • Domain / feature boundaries: Config owns env-driven defaults; GraphitiAdapter owns provider construction; GraphBuilderService owns the worker lifecycle and the new count > 0 gate; _recover_stuck_projects owns the symmetric startup-side gate. No cross-cutting changes.
  • Existing patterns preserved: Single-Graphiti-singleton; persistent event loop; loud add_batch; broad worker except Exception; group_id-scoped reads.
  • New components rationale: None. Only one new locale key and a ~5-line gate inside the existing worker.
  • Steering compliance: Stays inside the adapter (database.md); reaches fail_task on the unhappy path (error-handling.md); configuration centralised in config.py (structure.md); per-project group_id filter preserved.

Technology Stack & Alignment

| Layer | Choice / Version | Role in Feature | Notes |
|---|---|---|---|
| Backend / Services | Python ≥3.11, Flask 3.0 | Hosts the unchanged graph-build worker and the new non-zero-count gate | Existing stack; no change. |
| Data / Storage | Neo4j 5.x Community + graphiti-core ≥ 0.3 | Owns the 1024-dim vector index that the embedder must match | EMBEDDING_DIM = 1024 is a graphiti-core invariant; not exposed. |
| External | Ollama (operator-managed) + mxbai-embed-large (1024-dim) | New default embedding provider, reached over OpenAI-shaped /v1/embeddings | Reached via http://localhost:11434/v1 in host mode; http://host.docker.internal:11434/v1 in Docker. |
| Frontend / CLI | Vue 3 + vue-i18n | Renders the new "graph build produced 0 entities" failure message | One new locale key in locales/en.json and locales/zh.json. |
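
An operator who wants to confirm by hand that the configured endpoint returns 1024-dim vectors could run something like the sketch below. It is an ad hoc check, not the startup probe listed under Non-Goals. The helper names are hypothetical; the request and response shapes assume the OpenAI-style embeddings format (a `model` and `input` in the request, vectors under `data[i].embedding` in the response).

```python
import json
import urllib.request

EXPECTED_DIM = 1024  # the graphiti-core invariant cited in this design


def embedding_dim(payload: dict) -> int:
    """Return the vector length from an OpenAI-shaped embeddings response."""
    return len(payload["data"][0]["embedding"])


def check_dim(payload: dict, expected: int = EXPECTED_DIM) -> bool:
    """True when the first returned vector has the expected length."""
    return embedding_dim(payload) == expected


def fetch_embedding(base_url: str, model: str, text: str,
                    api_key: str = "ollama") -> dict:
    """POST one input to {base_url}/embeddings and return the decoded JSON."""
    request = urllib.request.Request(
        url=f"{base_url.rstrip('/')}/embeddings",
        data=json.dumps({"model": model, "input": text}).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)
```

With Ollama running in host mode, `check_dim(fetch_embedding("http://localhost:11434/v1", "mxbai-embed-large", "hello"))` should return True if the model matches the 1024-dim index.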

File Structure Plan

Modified Files

  • backend/app/config.py — Change the three EMBEDDING_* defaults (lines 42, 52, 53). No new fields.
  • backend/app/services/graph_builder.py — After _get_graph_info(graph_id) in _build_graph_worker, check graph_info.node_count > 0; if zero, call task_manager.fail_task(task_id, …) with a t('progress.emptyGraphFailure') message and return instead of complete_task. Log at ERROR level.
  • locales/en.json — Add progress.emptyGraphFailure (English).
  • locales/zh.json — Add progress.emptyGraphFailure (Chinese, mirroring the existing progress.* style).
  • README.md — In the env-block code fence (around lines 163-173), move the Ollama lines out of comments and demote the OpenAI/Gemini line to a commented fallback example. Adjust the surrounding prose so the Ollama prerequisite (ollama pull mxbai-embed-large) is part of the default setup checklist alongside Neo4j.
  • CLAUDE.md — In the "Required Environment Variables" section (around lines 72-80), state that the active default EMBEDDING_MODEL is mxbai-embed-large via Ollama; demote OpenAI/Gemini to "Other supported configurations".
  • docker-compose.yml — Tighten the L31-33 comment so it points operators at the .env.example Ollama block as the active default rather than as an optional override.

Hook-Protected File (operator-coordinated)

  • .env.example — The block layout must end up: uncommented EMBEDDING_BASE_URL=http://host.docker.internal:11434/v1, EMBEDDING_API_KEY=ollama, EMBEDDING_MODEL=mxbai-embed-large; OpenAI and Gemini examples remain present but as commented blocks below. The implementation phase produces the exact diff and either coordinates the edit with the developer or records it in HANDOFF.md.

Directory structure is unchanged; no new files are introduced.

System Flows

Graph-build completion gate

sequenceDiagram
    participant API as graph_bp
    participant Worker as _build_graph_worker
    participant Adapter as GraphitiAdapter
    participant Neo as Neo4j
    participant Task as TaskManager

    API->>Worker: start (text, ontology, group_id)
    Worker->>Adapter: add_batch(chunks)
    Adapter->>Neo: add_episode per chunk (entities, edges)
    Adapter-->>Worker: episode_uuids OR raises
    alt add_batch raised
        Worker->>Task: fail_task(err)
    else add_batch returned
        Worker->>Adapter: _get_graph_info(group_id)
        Adapter-->>Worker: GraphInfo(node_count, edge_count)
        alt node_count == 0
            Worker->>Task: fail_task("graph build produced 0 entities")
        else node_count > 0
            Worker->>Task: complete_task(graph_info)
        end
    end

Key decisions captured by the diagram:

  • The gate runs after the existing _get_graph_info call so it costs one extra branch, not an extra Neo4j round-trip.
  • The gate fires only when add_batch returned without raising — it is strictly a defense for "succeeded but empty," not a replacement for the loud-failure contract.
  • Edge count is not part of the gate: the contract from _recover_stuck_projects is "non-zero entities ⇒ COMPLETED", and edges may legitimately lag entities in some graphiti-core flows.
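
The three outcomes in the diagram can be sketched as below. This is a simplified stand-in rather than the worker itself: finish_build, RecordingTasks, and the callable parameters are hypothetical, and the real worker passes a localised message from t('progress.emptyGraphFailure') instead of the literal string used here.

```python
import logging
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple

logger = logging.getLogger(__name__)


@dataclass
class GraphInfo:
    graph_id: str
    node_count: int
    edge_count: int


class RecordingTasks:
    """Hypothetical stand-in for TaskManager that records terminal calls."""

    def __init__(self) -> None:
        self.calls: List[Tuple[str, str, Any]] = []

    def fail_task(self, task_id: str, error: str) -> None:
        self.calls.append(("failed", task_id, error))

    def complete_task(self, task_id: str, result: dict) -> None:
        self.calls.append(("completed", task_id, result))


def finish_build(task_id: str, graph_id: str,
                 add_batch: Callable[[], Any],
                 get_graph_info: Callable[[str], GraphInfo],
                 tasks: RecordingTasks,
                 empty_message: str = "graph build produced 0 entities") -> None:
    try:
        add_batch()  # loud-failure contract: raises on ingest errors
        info = get_graph_info(graph_id)  # the fetch the worker already does
        if info.node_count == 0:
            # Gate: log first, then fail the task, and skip complete_task.
            logger.error("graph build produced 0 entities for group_id=%s", graph_id)
            tasks.fail_task(task_id, empty_message)
            return
        tasks.complete_task(task_id, {"graph_id": graph_id, "graph_info": info})
    except Exception as exc:  # the existing broad handler stays in place
        logger.exception("graph build failed for group_id=%s", graph_id)
        tasks.fail_task(task_id, str(exc))
```

The gate only inspects node_count, so a result with entities but no edges still completes, matching the third decision above.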

Requirements Traceability

| Requirement | Summary | Components | Interfaces | Flows |
|---|---|---|---|---|
| 1.1 | Reproduce on main defaults before fixing | Implementation log (PR) | | |
| 1.2 | Document root cause(s) in PR + design.md | design.md Overview, research.md Research Log | | |
| 1.3 | If dim-mismatch, record dims | research.md Research Log → Embedder construction path | | |
| 1.4 | If a new silent path is found, remediate via R4 | graph_builder.py worker gate | TaskManager.fail_task | Graph-build completion gate |
| 1.5 | Post-fix reproduction writes non-zero entities | End-to-end smoke (PR description) | | |
| 2.1 | Config.EMBEDDING_* defaults point to local Ollama | backend/app/config.py | | |
| 2.2 | .env.example presents Ollama uncommented | .env.example | | |
| 2.3 | Default config end-to-end produces non-empty graph | All modified files | | Graph-build completion gate |
| 2.4 | No reachable Ollama ⇒ Task.FAILED with named error | graph_builder.py worker (existing except) | TaskManager.fail_task | |
| 2.5 | Ollama goes through existing _build_llm_and_embedder openai branch | graphiti_adapter.py (read-only) | | |
| 3.1 | Keep EMBEDDING_DIM = 1024; default model is 1024-dim | config.py, CLAUDE.md | | |
| 3.2 | CLAUDE.md states the 1024 invariant and rules out 768-dim | CLAUDE.md | | |
| 3.3 | Dim-mismatch override ⇒ loud Task.FAILED | graph_builder.py worker gate + existing loud add_batch | TaskManager.fail_task | Graph-build completion gate |
| 3.4 | No new EMBEDDING_DIM env var | | | |
| 4.1 | Preserve loud add_batch from #18 | graphiti_adapter.py (read-only) | | |
| 4.2 | Remediate any new silent call site found in R1 | graph_builder.py worker gate | TaskManager.fail_task | Graph-build completion gate |
| 4.3 | Embedder-construction failure ⇒ worker Task.FAILED | graph_builder.py worker (existing except) | TaskManager.fail_task | |
| 4.4 | Log propagated failure at ERROR before fail_task | graph_builder.py worker gate | logger.error / logger.exception | |
| 4.5 | GRAPH_COMPLETED only when node_count > 0 | graph_builder.py worker gate | TaskManager.complete_task | Graph-build completion gate |
| 5.1 | Existing OpenAI/Gemini configs unchanged behavior | graphiti_adapter.py (read-only), config.py | | |
| 5.2 | No new env var | | | |
| 5.3 | Pre-existing 1536-dim graphs remain readable when operator keeps their override | config.py (override-wins semantics unchanged) | | |
| 5.4 | GRAPHITI_LLM_PROVIDER default stays openai | config.py (unchanged) | | |
| 6.1 | CLAUDE.md describes Ollama as default | CLAUDE.md | | |
| 6.2 | README setup names ollama pull prerequisite | README.md | | |
| 6.3 | docker-compose / README documents host.docker.internal:11434 | docker-compose.yml, README.md | | |
| 6.4 | One-line curl smoke test in docs | README.md (already present, retain) | | |
| 7.1 | Profile generation reads the new graph | End-to-end smoke (PR description) | | |
| 7.2 | Report-agent tools return non-empty results | End-to-end smoke (PR description) | | |
| 7.3 | PR documents the smoke-test path | PR description | | |
| 7.4 | If smoke test not run, PR says so explicitly | PR description | | |

Components and Interfaces

| Component | Domain/Layer | Intent | Req Coverage | Key Dependencies (P0/P1) | Contracts |
|---|---|---|---|---|---|
| Config (modified) | Backend / config | Owns the three EMBEDDING_* defaults that flip from OpenAI to Ollama | 2.1, 5.4 | dotenv (P0) | State |
| GraphBuilderService._build_graph_worker (modified) | Backend / services | Adds the non-zero-node-count gate before complete_task | 1.4, 3.3, 4.2, 4.4, 4.5 | _get_graph_info (P0), TaskManager (P0), locale.t (P1) | Batch |
| Locale entries (new key) | Shared / i18n | One key (progress.emptyGraphFailure) so the gate's message is translated | 4.4, 4.5 | vue-i18n (P1), utils.locale.t (P1) | State |
| Docs set (README.md, CLAUDE.md, docker-compose.yml, .env.example) | Docs | Updates the documented happy path to local-by-default | 2.2, 6.1, 6.2, 6.3, 6.4 | | |

Backend / Config

Config (modified)

| Field | Detail |
|---|---|
| Intent | Flip the three EMBEDDING_* defaults from OpenAI to Ollama. |
| Requirements | 2.1, 5.4 |

Responsibilities & Constraints

  • Owns the env-driven embedder defaults consumed by _build_llm_and_embedder.
  • Must not introduce a new env var or remove any existing one.
  • Operator-set EMBEDDING_* continues to win over the defaults (override semantics unchanged).

Dependencies

  • Inbound: _build_llm_and_embedder (P0) reads the three values.
  • Outbound: none.
  • External: dotenv (P0) loads the .env file before class evaluation.

Contracts: State ☑.

State Management
  • State model: Three class-level attributes on Config, evaluated once when the module is imported:
    • EMBEDDING_MODEL = os.environ.get('EMBEDDING_MODEL', 'mxbai-embed-large')
    • EMBEDDING_BASE_URL = os.environ.get('EMBEDDING_BASE_URL', 'http://localhost:11434/v1')
    • EMBEDDING_API_KEY = os.environ.get('EMBEDDING_API_KEY', 'ollama')
  • Persistence & consistency: Read once at import; no runtime mutation.
  • Concurrency strategy: N/A (read-only after import).
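
The override-wins behaviour of these defaults can be illustrated with a small sketch. It resolves the same three names against an arbitrary mapping instead of reading os.environ at class-definition time, so it can be exercised without touching the process environment; DEFAULTS and embedding_config are hypothetical helpers, not part of config.py.

```python
import os
from typing import Dict, Mapping

# The three defaults proposed in this design.
DEFAULTS: Dict[str, str] = {
    "EMBEDDING_MODEL": "mxbai-embed-large",
    "EMBEDDING_BASE_URL": "http://localhost:11434/v1",
    "EMBEDDING_API_KEY": "ollama",
}


def embedding_config(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Return each EMBEDDING_* value, using the default only when unset."""
    return {name: env.get(name, default) for name, default in DEFAULTS.items()}
```

Note that `.get(name, default)` treats a variable set to the empty string as set, so `EMBEDDING_API_KEY=` in an environment yields `""` rather than `"ollama"`.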

Implementation Notes

  • Integration: _build_llm_and_embedder's existing fallback Config.EMBEDDING_API_KEY or Config.LLM_API_KEY continues to work; with the new defaults, the fallback is no longer triggered on a clean checkout.
  • Validation: None added — embedder errors continue to surface via the worker's existing except.
  • Risks: An operator who previously relied on "leave EMBEDDING_* unset to inherit LLM_*" will, after this change, hit Ollama at http://localhost:11434/v1 instead. README and CLAUDE.md call this out under "Backwards compatibility".

Backend / Services

GraphBuilderService._build_graph_worker (modified)

| Field | Detail |
|---|---|
| Intent | Convert "graph build succeeded but produced 0 entities" into a Task.FAILED. |
| Requirements | 1.4, 3.3, 4.2, 4.4, 4.5 |

Responsibilities & Constraints

  • Preserves the existing stage progression (create → set ontology → split → batch → wait → fetch info → complete).
  • New behavior: after graph_info = self._get_graph_info(graph_id), if graph_info.node_count == 0, call task_manager.fail_task(...) with a localised error and return (skip complete_task).
  • Must log at ERROR level before the fail_task call so server logs carry the diagnostic ahead of the task envelope.
  • Must not weaken the existing except Exception branch — the gate is additional, not a replacement.

Dependencies

  • Inbound: graph_bp (P0) invokes build_graph_async which calls this worker.
  • Outbound: _get_graph_info(graph_id) (P0), TaskManager.fail_task (P0), TaskManager.complete_task (P0), utils.locale.t (P1).
  • External: logger.error (P1).

Contracts: Batch ☑.

Batch / Job Contract
  • Trigger: build_graph_async spawns the worker thread from a POST /api/graph/build request.
  • Input / validation: Unchanged from current contract (text, ontology, graph_name, chunk_size, chunk_overlap, batch_size, locale).
  • Output / destination: Task envelope on TaskManager. On success: Task.status = COMPLETED, Task.result = {graph_id, graph_info, chunks_processed}. On gate trip: Task.status = FAILED, Task.error = t('progress.emptyGraphFailure').
  • Idempotency & recovery: Unchanged. _recover_stuck_projects continues to gate on count(:Entity) > 0; the worker's new gate makes the live-completion path symmetric.

Implementation Notes

  • Integration: One block inserted between the existing graph_info = self._get_graph_info(graph_id) (line ~219) and self.task_manager.complete_task(...) (line ~221). Approximately 5 lines.
  • Validation: Confirmed empirically that _get_graph_info returns node_count == 0 when Neo4j holds no (:Entity {group_id}) rows.
  • Risks: A build that runs against a misconfigured embedder already fails through the existing except Exception (because add_batch re-raises). The new gate catches the residual case where graphiti-core returns successfully but writes nothing, which is the exact failure mode the ticket reports.

Shared / i18n

progress.emptyGraphFailure (new locale key)

| Field | Detail |
|---|---|
| Intent | Localised failure message for the new gate. |
| Requirements | 4.4, 4.5 |

Contracts: State ☑.

State Management
  • State model: One additional entry in the progress namespace of locales/en.json and locales/zh.json.
  • Persistence & consistency: File-based locales loaded by vue-i18n (frontend) and utils.locale (backend). Keys must exist in both files; the progress.* namespace is the established home for graph-build status strings.
  • Concurrency strategy: N/A.

Implementation Notes

  • Integration: Backend calls t('progress.emptyGraphFailure') from _build_graph_worker. Frontend renders the same key in Step1GraphBuild.vue's failure surface (no code change — it already displays Task.error).
  • Validation: Smoke-test the key resolves in both locales (set_locale('en') / set_locale('zh')).
  • Risks: None — additive change.
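
The "key must exist in both files" rule lends itself to a small parity check. The sketch below assumes the locale files are nested JSON objects (for example `{"progress": {"emptyGraphFailure": "..."}}`); that layout, the helper names, and the message strings in the usage example are illustrative assumptions, not the project's actual catalog contents.

```python
from typing import Any, List, Mapping, Optional


def resolve(catalog: Mapping[str, Any], dotted_key: str) -> Optional[str]:
    """Walk a nested catalog along 'a.b.c'; return the string or None."""
    node: Any = catalog
    for part in dotted_key.split("."):
        if not isinstance(node, Mapping) or part not in node:
            return None
        node = node[part]
    return node if isinstance(node, str) else None


def missing_locales(catalogs: Mapping[str, Mapping[str, Any]],
                    dotted_key: str) -> List[str]:
    """Return the names of the locales in which the key does not resolve."""
    return sorted(
        name for name, catalog in catalogs.items()
        if resolve(catalog, dotted_key) is None
    )
```

For example, with `catalogs = {"en": json.load(open("locales/en.json")), "zh": json.load(open("locales/zh.json"))}`, an empty result from `missing_locales(catalogs, "progress.emptyGraphFailure")` means the key is present in both files.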

Error Handling

Error Strategy

This spec contributes one new error case (progress.emptyGraphFailure) and re-uses the existing transport (TaskManager.fail_task → polling endpoint → frontend renders Task.error).

Error Categories and Responses

  • Embedder unreachable (e.g., Ollama not running): caught by _build_graph_worker's existing except Exception after add_batch raises. Task.FAILED with the underlying connection error.
  • Dim-mismatch override (operator points EMBEDDING_MODEL at a non-1024-dim model): caught by add_batch's loud-failure contract (Neo4j or graphiti-core raises). Task.FAILED with the underlying dim-mismatch error.
  • Empty graph after a clean add_batch (new case): caught by the gate. Task.FAILED with t('progress.emptyGraphFailure'). Logged at ERROR before the fail_task call.

Monitoring

  • Existing logger.exception / logger.error lines in graphiti_adapter.py and graph_builder.py carry the underlying error.
  • New logger.error('graph build produced 0 entities for group_id=%s', graph_id) line precedes the gate's fail_task call.

Testing Strategy

  • Unit-level smoke (manual, scripted): With Ollama down → npm run dev → start a graph build → expect Task.status = FAILED and Task.error containing a connectivity message. With Ollama up and mxbai-embed-large pulled → expect Task.status = COMPLETED and graph_info.node_count > 0.
  • Configuration smoke: Confirm Config.EMBEDDING_MODEL, Config.EMBEDDING_BASE_URL, Config.EMBEDDING_API_KEY resolve to the new defaults when .env is empty.
  • Backwards-compat smoke: With .env setting EMBEDDING_* to OpenAI's values, confirm _build_llm_and_embedder constructs the OpenAI embedder exactly as before (no observable change).
  • Gate unit-style test: Patch _get_graph_info to return GraphInfo(graph_id=…, node_count=0, edge_count=0, entity_types=[]) and assert the worker calls fail_task with the localised key (no real pytest harness expansion — short repro in PR description is sufficient given the existing minimal test policy).
  • End-to-end (Req 7): Graph build → env-setup (profile generation) → report-agent query on a representative seed file. PR description documents the run; if the maintainer cannot run it locally, the PR description states that explicitly.
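
The gate unit-style test could look roughly like the following short repro. WorkerStandIn is a hypothetical reduction of GraphBuilderService to the gate alone, and its `_finish` method does not exist in the codebase; only the patching of `_get_graph_info` and the assertion on `fail_task` mirror what the bullet describes. The error argument is the bare key here, whereas the real worker would pass the output of t(...).

```python
from dataclasses import dataclass, field
from unittest.mock import MagicMock, patch


@dataclass
class GraphInfo:
    graph_id: str
    node_count: int
    edge_count: int
    entity_types: list = field(default_factory=list)


class WorkerStandIn:
    """Hypothetical stand-in for GraphBuilderService, reduced to the gate."""

    def __init__(self, task_manager) -> None:
        self.task_manager = task_manager

    def _get_graph_info(self, graph_id: str) -> GraphInfo:
        raise NotImplementedError("patched in the test")

    def _finish(self, task_id: str, graph_id: str) -> None:
        info = self._get_graph_info(graph_id)
        if info.node_count == 0:
            self.task_manager.fail_task(task_id, "progress.emptyGraphFailure")
            return
        self.task_manager.complete_task(
            task_id, {"graph_id": graph_id, "graph_info": info}
        )


def run_gate_check() -> MagicMock:
    """Drive the gate with an empty graph and return the mock task manager."""
    tasks = MagicMock()
    worker = WorkerStandIn(tasks)
    empty = GraphInfo(graph_id="g1", node_count=0, edge_count=0)
    with patch.object(WorkerStandIn, "_get_graph_info", return_value=empty):
        worker._finish("task-1", "g1")
    return tasks
```

A caller would then assert `tasks.fail_task.assert_called_once_with("task-1", "progress.emptyGraphFailure")` and `tasks.complete_task.assert_not_called()` on the returned mock.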

Migration Strategy

flowchart LR
    Start[merge to main]
    Pull[operator runs ollama pull mxbai_embed_large]
    Restart[restart backend]
    Build[start fresh graph build]
    Verify[verify entities in Neo4j]
    Done[done]

    Start --> Pull
    Pull --> Restart
    Restart --> Build
    Build --> Verify
    Verify --> Done

  • Phase 1 (no operator action): For operators with explicit EMBEDDING_* overrides: no change. Pre-existing project graphs remain readable.
  • Phase 2 (default-using operators): One-time ollama pull mxbai-embed-large and restart. Project graphs built under the previous default (1536-dim text-embedding-3-small with a Dashscope LLM key, which per the ticket most likely already produced empty graphs) do not match the new 1024-dim default and are not migrated or rebuilt automatically; operators rebuild them by hand.
  • Rollback trigger: If an operator cannot run Ollama, they re-add the OpenAI or Gemini EMBEDDING_* block to .env (the README's commented fallback) and restart. No code rollback required.
  • Validation checkpoint: After the first graph build under the new defaults, the node_count > 0 gate proves the migration succeeded.