MicroFish/.kiro/specs/graph-build-empty-fix/gap-analysis.md

9.9 KiB
Raw Blame History

Gap Analysis: graph-build-empty-fix

Scope Snapshot

The fix is small in code surface (config defaults, embedder construction, docs) but research-heavy on root cause (why does Graphiti add_episode appear to succeed yet leave Neo4j empty?). The Ollama documentation, OpenAI-compatible embedder support, and loud add_batch failure already exist from spec graphiti-ollama-embedder (issue #18). What's still missing: flipping the active default to Ollama and confirming the dimension-mismatch hypothesis that drives the empty-graph symptom.

Current State

Relevant Assets

  • backend/app/services/graphiti_adapter.py:92139_build_llm_and_embedder constructs OpenAI or Gemini providers. The OpenAI branch reads Config.EMBEDDING_BASE_URL or Config.LLM_BASE_URL and Config.EMBEDDING_API_KEY or Config.LLM_API_KEY. This branch is already Ollama-compatible (Ollama exposes an OpenAI-shaped /v1/embeddings).
  • backend/app/services/graphiti_adapter.py:466486_GraphNamespace.add_batch re-raises on episode-ingestion failures (spec #18). No placeholder UUIDs. Logger is ERROR with traceback.
  • backend/app/services/graph_builder.py:227230_build_graph_worker catches Exception, captures traceback, calls TaskManager().fail_task(task_id, error_msg).
  • backend/app/__init__.py:88109_recover_stuck_projects gates promotion to GRAPH_COMPLETED on count(n:Entity {group_id}) > 0. Matches Req 4 AC5 already.
  • backend/app/config.py:42, 5253 — current defaults:
    • EMBEDDING_MODEL = 'text-embedding-3-small' (OpenAI, 1536-dim)
    • EMBEDDING_BASE_URL = None → falls back to LLM_BASE_URL
    • EMBEDDING_API_KEY = None → falls back to LLM_API_KEY
  • README.md:163183 — Ollama section present but commented out; OpenAI defaults are still the active path.
  • CLAUDE.md:7280 — already names mxbai-embed-large (1024-dim) and explicitly rules out 768-dim nomic-embed-text. Documentation framing already treats Ollama as a supported provider.
  • docker-compose.yml:3133 — already notes the host.docker.internal:11434 reach-through for Ollama.

Conventions in Play

  • All Neo4j/Graphiti access goes through services/graphiti_adapter.py (per .kiro/steering/database.md).
  • Configuration is centralized in backend/app/config.py — env-driven, single file.
  • Background-task error handling: worker try/exceptfail_task(task_id, str(e)) (per .kiro/steering/error-handling.md).
  • Graph is multi-tenant by group_id; every read/write must be scoped.

Integration Surfaces Out of This Repo

  • graphiti-core package — owns EMBEDDING_DIM = 1024, Neo4j vector index DDL, and add_episode LLM-extraction → embedding → write pipeline. Not vendored here; behavior must be inferred from runtime + their public API.
  • Local Ollama daemon (operator-managed) — out of scope to bundle, in scope to assume runs at host:11434.

Requirement-to-Asset Map

Requirement Asset / Touchpoint Gap
R1 Root cause _build_llm_and_embedder, add_batch, add_episode runtime behavior, Neo4j vector-index dim Unknown — need a reproduction run on default main config to capture the exact failure surface (dimension mismatch vs. embedder 404 vs. silent LLM-extraction-returns-empty).
R2 Local-default config.py:42, 5253, .env.example (protected — operator will reload), README.md:163183 Missing — defaults still point to OpenAI; need to flip to Ollama (mxbai-embed-large @ http://localhost:11434/v1) and demote OpenAI/Gemini to commented fallbacks.
R3 Dimension consistency graphiti-core's EMBEDDING_DIM = 1024 (external constant); CLAUDE.md:7280 documentation Constraint — keep dim at 1024, don't expose a tunable. The Ollama default mxbai-embed-large is 1024-dim, so the defaults align.
R4 Loud failure on every silent path add_batch (already loud), _build_llm_and_embedder (no pre-flight), graph_builder._wait_for_episodes (polls a no-op episode.get) Constraint + possibly Missing — if R1 turns up an additional silent-failure call site (likely candidates: graphiti add_episode swallowing extraction failures, or a dim-mismatch returning soft errors), add a remediation there.
R5 Backwards compatibility _build_llm_and_embedder OpenAI/Gemini branches Constraint — no logic change required; only defaults change. Operator's explicit EMBEDDING_* settings continue to win.
R6 Documentation README.md:163183, CLAUDE.md:7280, docker-compose.yml:3133, .env.example (protected) Missing — flip the README from "Ollama as commented option" to "Ollama as active default, OpenAI/Gemini commented fallback"; CLAUDE.md needs a small wording tweak; .env.example requires operator-coordinated edit (file is hook-protected).
R7 End-to-end smoke Graph build → env setup (profile_generator) → report agent tools — not directly modified, just exercised Constraint — requires a representative seed file. PR description documents whether the smoke test ran.

Implementation Approach Options

Option A — Defaults-Only Flip (extend existing)

Change backend/app/config.py defaults and .env.example + README/CLAUDE.md/docker-compose comments. No code-path changes to _build_llm_and_embedder (the "openai" branch already serves Ollama). Optionally add a one-line ERROR log in graphiti_adapter._build_llm_and_embedder when EMBEDDING_BASE_URL is unset, warning that the LLM base URL is being reused (which is a known dim/model mismatch trap with Dashscope/Qwen).

Trade-offs

  • Tiny, reversible, matches conventions.
  • Fully relies on existing loud-failure plumbing.
  • If R1 turns up a silent path inside add_episode itself (graphiti-core), defaults-only does not fix it.

Option B — Defaults Flip + Pre-flight Embedder Probe (extend existing + small new helper)

Same as A, plus a one-shot embedder ping during _get_graphiti() initialization: synchronously call the configured embedder on a known string and assert the response length matches Graphiti's EMBEDDING_DIM. On mismatch or connectivity failure, raise so the first Project creation surfaces the error rather than the first graph-build worker.

Trade-offs

  • Surfaces dimension/connectivity bugs before a long graph build runs.
  • Avoids per-batch "is this even reachable?" guessing.
  • Requirements explicitly call out "no startup-time embedder health probe that refuses to boot" (Boundary Context, out of scope). This option contradicts that boundary.

Option A, plus targeted hardening based on what R1 reveals. Likely candidates if root cause is a graphiti-core silent path:

  • Wrap the first add_episode call with an explicit dimension-check on the produced embedding (compare against EMBEDDING_DIM) and raise a clear ValueError("embedding dim mismatch") from add_batch before the Neo4j write, so the worker fails the task with an actionable message.
  • Tighten _get_graph_info such that complete_task is gated on a non-zero node count (Req 4 AC5), so a "succeeded but empty" graph never reaches GRAPH_COMPLETED.

Trade-offs

  • Targets the actual failure mode identified by R1 instead of speculating.
  • Stays within the boundary (no startup probe, no new env var, no new provider).
  • Matches existing conventions: a small if not nodes: raise inside the worker, propagated to fail_task.
  • Slightly larger PR than A; the dim-check helper is new code (1020 lines).

Effort & Risk

  • Effort: S (13 days) — code change is small; majority of the work is the root-cause repro + smoke-testing the end-to-end pipeline with a local Ollama instance.
  • Risk: Medium — relies on a Graphiti-core/Neo4j interaction that we don't fully control. If the root cause is upstream and only fixable via a graphiti-core version bump, scope creeps. Mitigation: if upstream fix is required, capture it in the PR description and ship the defaults-flip + first-batch dim check now; the loud failure ensures operators see the real error rather than an empty graph.

Research Items for Design Phase

  1. Confirm the exact silent-failure call site. Run a fresh build on main's default .env and trace where the entity-extraction-or-write disappears: graphiti-core LLM extraction stage, the embedder call, or the Neo4j vector-index write. Log/instrument as needed.
  2. Verify the embedder-output dimension at runtime. With mxbai-embed-large via Ollama, confirm len(embedding) == 1024. With text-embedding-3-small, confirm 1536, and observe what Neo4j (or graphiti-core's vector-index check) does with the mismatch.
  3. Decide whether _get_graph_info gating belongs in this PR. If R1 root cause is fully addressed by the defaults flip, the node_count > 0 gate in complete_task is belt-and-braces. If R1 reveals a residual silent path, the gate becomes essential.

Recommendations for Design Phase

  • Preferred approach: Option C — flip defaults, instrument the first batch enough to capture R1 evidence, gate complete_task on a non-zero node count.
  • Key decisions to lock in design:
    • Concrete default values for EMBEDDING_MODEL, EMBEDDING_BASE_URL, EMBEDDING_API_KEY in backend/app/config.py.
    • Whether _get_graph_info(graph_id) returning node_count == 0 should raise inside the worker (Req 4 AC5) or only when add_batch succeeded — the latter is the right semantics.
    • Wording for the README / CLAUDE.md flip: keep both providers documented; only change the active line.
  • Carry forward to implementation:
    • .env.example is hook-protected; either coordinate with the developer to update it manually, or document the required diff in HANDOFF.md.
    • The end-to-end smoke test (graph build → profile generation → report query) needs a representative seed file; if unavailable, mark that explicitly in the PR.