MicroFish/.kiro/specs/graph-build-empty-fix/gap-analysis.md

100 lines
9.9 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Gap Analysis: graph-build-empty-fix
## Scope Snapshot
The fix is small in code surface (config defaults, embedder construction, docs) but research-heavy on root cause (why does Graphiti `add_episode` appear to succeed yet leave Neo4j empty?). The Ollama documentation, OpenAI-compatible embedder support, and loud `add_batch` failure already exist from spec `graphiti-ollama-embedder` (issue #18). What's still missing: flipping the active default to Ollama and confirming the dimension-mismatch hypothesis that drives the empty-graph symptom.
## Current State
### Relevant Assets
- `backend/app/services/graphiti_adapter.py:92139``_build_llm_and_embedder` constructs OpenAI or Gemini providers. The OpenAI branch reads `Config.EMBEDDING_BASE_URL or Config.LLM_BASE_URL` and `Config.EMBEDDING_API_KEY or Config.LLM_API_KEY`. This branch is already Ollama-compatible (Ollama exposes an OpenAI-shaped `/v1/embeddings`).
- `backend/app/services/graphiti_adapter.py:466486``_GraphNamespace.add_batch` re-raises on episode-ingestion failures (spec #18). No placeholder UUIDs. Logger is `ERROR` with traceback.
- `backend/app/services/graph_builder.py:227230``_build_graph_worker` catches `Exception`, captures traceback, calls `TaskManager().fail_task(task_id, error_msg)`.
- `backend/app/__init__.py:88109``_recover_stuck_projects` gates promotion to `GRAPH_COMPLETED` on `count(n:Entity {group_id}) > 0`. Matches Req 4 AC5 already.
- `backend/app/config.py:42, 5253` — current defaults:
- `EMBEDDING_MODEL = 'text-embedding-3-small'` (OpenAI, 1536-dim)
- `EMBEDDING_BASE_URL = None` → falls back to `LLM_BASE_URL`
- `EMBEDDING_API_KEY = None` → falls back to `LLM_API_KEY`
- `README.md:163183` — Ollama section present but commented out; OpenAI defaults are still the active path.
- `CLAUDE.md:7280` — already names `mxbai-embed-large` (1024-dim) and explicitly rules out 768-dim `nomic-embed-text`. Documentation framing already treats Ollama as a supported provider.
- `docker-compose.yml:3133` — already notes the `host.docker.internal:11434` reach-through for Ollama.
### Conventions in Play
- All Neo4j/Graphiti access goes through `services/graphiti_adapter.py` (per `.kiro/steering/database.md`).
- Configuration is centralized in `backend/app/config.py` — env-driven, single file.
- Background-task error handling: worker `try/except``fail_task(task_id, str(e))` (per `.kiro/steering/error-handling.md`).
- Graph is multi-tenant by `group_id`; every read/write must be scoped.
### Integration Surfaces Out of This Repo
- `graphiti-core` package — owns `EMBEDDING_DIM = 1024`, Neo4j vector index DDL, and `add_episode` LLM-extraction → embedding → write pipeline. Not vendored here; behavior must be inferred from runtime + their public API.
- Local Ollama daemon (operator-managed) — out of scope to bundle, in scope to assume runs at `host:11434`.
## Requirement-to-Asset Map
| Requirement | Asset / Touchpoint | Gap |
| --- | --- | --- |
| **R1 Root cause** | `_build_llm_and_embedder`, `add_batch`, `add_episode` runtime behavior, Neo4j vector-index dim | **Unknown** — need a reproduction run on default `main` config to capture the exact failure surface (dimension mismatch vs. embedder 404 vs. silent LLM-extraction-returns-empty). |
| **R2 Local-default** | `config.py:42, 5253`, `.env.example` (protected — operator will reload), `README.md:163183` | **Missing** — defaults still point to OpenAI; need to flip to Ollama (`mxbai-embed-large` @ `http://localhost:11434/v1`) and demote OpenAI/Gemini to commented fallbacks. |
| **R3 Dimension consistency** | `graphiti-core`'s `EMBEDDING_DIM = 1024` (external constant); `CLAUDE.md:7280` documentation | **Constraint** — keep dim at 1024, don't expose a tunable. The Ollama default `mxbai-embed-large` is 1024-dim, so the defaults align. |
| **R4 Loud failure on every silent path** | `add_batch` (already loud), `_build_llm_and_embedder` (no pre-flight), `graph_builder._wait_for_episodes` (polls a no-op `episode.get`) | **Constraint** + possibly **Missing** — if R1 turns up an additional silent-failure call site (likely candidates: graphiti `add_episode` swallowing extraction failures, or a dim-mismatch returning soft errors), add a remediation there. |
| **R5 Backwards compatibility** | `_build_llm_and_embedder` OpenAI/Gemini branches | **Constraint** — no logic change required; only defaults change. Operator's explicit `EMBEDDING_*` settings continue to win. |
| **R6 Documentation** | `README.md:163183`, `CLAUDE.md:7280`, `docker-compose.yml:3133`, `.env.example` (protected) | **Missing** — flip the README from "Ollama as commented option" to "Ollama as active default, OpenAI/Gemini commented fallback"; CLAUDE.md needs a small wording tweak; `.env.example` requires operator-coordinated edit (file is hook-protected). |
| **R7 End-to-end smoke** | Graph build → env setup (`profile_generator`) → report agent tools — not directly modified, just exercised | **Constraint** — requires a representative seed file. PR description documents whether the smoke test ran. |
## Implementation Approach Options
### Option A — Defaults-Only Flip (extend existing)
Change `backend/app/config.py` defaults and `.env.example` + README/CLAUDE.md/docker-compose comments. No code-path changes to `_build_llm_and_embedder` (the "openai" branch already serves Ollama). Optionally add a one-line ERROR log in `graphiti_adapter._build_llm_and_embedder` when `EMBEDDING_BASE_URL` is unset, warning that the LLM base URL is being reused (which is a known dim/model mismatch trap with Dashscope/Qwen).
**Trade-offs**
- ✅ Tiny, reversible, matches conventions.
- ✅ Fully relies on existing loud-failure plumbing.
- ❌ If R1 turns up a silent path inside `add_episode` itself (graphiti-core), defaults-only does not fix it.
### Option B — Defaults Flip + Pre-flight Embedder Probe (extend existing + small new helper)
Same as A, plus a one-shot embedder ping during `_get_graphiti()` initialization: synchronously call the configured embedder on a known string and assert the response length matches Graphiti's `EMBEDDING_DIM`. On mismatch or connectivity failure, raise so the first `Project` creation surfaces the error rather than the first graph-build worker.
**Trade-offs**
- ✅ Surfaces dimension/connectivity bugs before a long graph build runs.
- ✅ Avoids per-batch "is this even reachable?" guessing.
- ❌ Requirements explicitly call out "no startup-time embedder health probe that refuses to boot" (Boundary Context, out of scope). This option contradicts that boundary.
### Option C — Defaults Flip + First-Batch Failure Surfacing (hybrid, recommended)
Option A, plus targeted hardening based on what R1 reveals. Likely candidates if root cause is a graphiti-core silent path:
- Wrap the first `add_episode` call with an explicit dimension-check on the produced embedding (compare against `EMBEDDING_DIM`) and raise a clear `ValueError("embedding dim mismatch")` from `add_batch` before the Neo4j write, so the worker fails the task with an actionable message.
- Tighten `_get_graph_info` such that `complete_task` is gated on a non-zero node count (Req 4 AC5), so a "succeeded but empty" graph never reaches `GRAPH_COMPLETED`.
**Trade-offs**
- ✅ Targets the actual failure mode identified by R1 instead of speculating.
- ✅ Stays within the boundary (no startup probe, no new env var, no new provider).
- ✅ Matches existing conventions: a small `if not nodes: raise` inside the worker, propagated to `fail_task`.
- ❌ Slightly larger PR than A; the dim-check helper is new code (1020 lines).
## Effort & Risk
- **Effort:** **S (13 days)** — code change is small; majority of the work is the root-cause repro + smoke-testing the end-to-end pipeline with a local Ollama instance.
- **Risk:** **Medium** — relies on a Graphiti-core/Neo4j interaction that we don't fully control. If the root cause is upstream and only fixable via a graphiti-core version bump, scope creeps. Mitigation: if upstream fix is required, capture it in the PR description and ship the defaults-flip + first-batch dim check now; the loud failure ensures operators see the real error rather than an empty graph.
## Research Items for Design Phase
1. **Confirm the exact silent-failure call site.** Run a fresh build on `main`'s default `.env` and trace where the entity-extraction-or-write disappears: graphiti-core LLM extraction stage, the embedder call, or the Neo4j vector-index write. Log/instrument as needed.
2. **Verify the embedder-output dimension at runtime.** With `mxbai-embed-large` via Ollama, confirm `len(embedding) == 1024`. With `text-embedding-3-small`, confirm 1536, and observe what Neo4j (or graphiti-core's vector-index check) does with the mismatch.
3. **Decide whether `_get_graph_info` gating belongs in this PR.** If R1 root cause is fully addressed by the defaults flip, the `node_count > 0` gate in `complete_task` is belt-and-braces. If R1 reveals a residual silent path, the gate becomes essential.
## Recommendations for Design Phase
- **Preferred approach: Option C** — flip defaults, instrument the first batch enough to capture R1 evidence, gate `complete_task` on a non-zero node count.
- **Key decisions to lock in design:**
- Concrete default values for `EMBEDDING_MODEL`, `EMBEDDING_BASE_URL`, `EMBEDDING_API_KEY` in `backend/app/config.py`.
- Whether `_get_graph_info(graph_id)` returning `node_count == 0` should raise inside the worker (Req 4 AC5) or only when `add_batch` succeeded — the latter is the right semantics.
- Wording for the README / CLAUDE.md flip: keep both providers documented; only change the *active* line.
- **Carry forward to implementation:**
- `.env.example` is hook-protected; either coordinate with the developer to update it manually, or document the required diff in `HANDOFF.md`.
- The end-to-end smoke test (graph build → profile generation → report query) needs a representative seed file; if unavailable, mark that explicitly in the PR.