9.9 KiB
Gap Analysis: graph-build-empty-fix
Scope Snapshot
The fix is small in code surface (config defaults, embedder construction, docs) but research-heavy on root cause (why does Graphiti add_episode appear to succeed yet leave Neo4j empty?). The Ollama documentation, OpenAI-compatible embedder support, and loud add_batch failure already exist from spec graphiti-ollama-embedder (issue #18). What's still missing: flipping the active default to Ollama and confirming the dimension-mismatch hypothesis that drives the empty-graph symptom.
Current State
Relevant Assets
backend/app/services/graphiti_adapter.py:92–139—_build_llm_and_embedderconstructs OpenAI or Gemini providers. The OpenAI branch readsConfig.EMBEDDING_BASE_URL or Config.LLM_BASE_URLandConfig.EMBEDDING_API_KEY or Config.LLM_API_KEY. This branch is already Ollama-compatible (Ollama exposes an OpenAI-shaped/v1/embeddings).backend/app/services/graphiti_adapter.py:466–486—_GraphNamespace.add_batchre-raises on episode-ingestion failures (spec #18). No placeholder UUIDs. Logger isERRORwith traceback.backend/app/services/graph_builder.py:227–230—_build_graph_workercatchesException, captures traceback, callsTaskManager().fail_task(task_id, error_msg).backend/app/__init__.py:88–109—_recover_stuck_projectsgates promotion toGRAPH_COMPLETEDoncount(n:Entity {group_id}) > 0. Matches Req 4 AC5 already.backend/app/config.py:42, 52–53— current defaults:EMBEDDING_MODEL = 'text-embedding-3-small'(OpenAI, 1536-dim)EMBEDDING_BASE_URL = None→ falls back toLLM_BASE_URLEMBEDDING_API_KEY = None→ falls back toLLM_API_KEY
README.md:163–183— Ollama section present but commented out; OpenAI defaults are still the active path.CLAUDE.md:72–80— already namesmxbai-embed-large(1024-dim) and explicitly rules out 768-dimnomic-embed-text. Documentation framing already treats Ollama as a supported provider.docker-compose.yml:31–33— already notes thehost.docker.internal:11434reach-through for Ollama.
Conventions in Play
- All Neo4j/Graphiti access goes through
services/graphiti_adapter.py(per.kiro/steering/database.md). - Configuration is centralized in
backend/app/config.py— env-driven, single file. - Background-task error handling: worker
try/except→fail_task(task_id, str(e))(per.kiro/steering/error-handling.md). - Graph is multi-tenant by
group_id; every read/write must be scoped.
Integration Surfaces Out of This Repo
graphiti-corepackage — ownsEMBEDDING_DIM = 1024, Neo4j vector index DDL, andadd_episodeLLM-extraction → embedding → write pipeline. Not vendored here; behavior must be inferred from runtime + their public API.- Local Ollama daemon (operator-managed) — out of scope to bundle, in scope to assume runs at
host:11434.
Requirement-to-Asset Map
| Requirement | Asset / Touchpoint | Gap |
|---|---|---|
| R1 Root cause | _build_llm_and_embedder, add_batch, add_episode runtime behavior, Neo4j vector-index dim |
Unknown — need a reproduction run on default main config to capture the exact failure surface (dimension mismatch vs. embedder 404 vs. silent LLM-extraction-returns-empty). |
| R2 Local-default | config.py:42, 52–53, .env.example (protected — operator will reload), README.md:163–183 |
Missing — defaults still point to OpenAI; need to flip to Ollama (mxbai-embed-large @ http://localhost:11434/v1) and demote OpenAI/Gemini to commented fallbacks. |
| R3 Dimension consistency | graphiti-core's EMBEDDING_DIM = 1024 (external constant); CLAUDE.md:72–80 documentation |
Constraint — keep dim at 1024, don't expose a tunable. The Ollama default mxbai-embed-large is 1024-dim, so the defaults align. |
| R4 Loud failure on every silent path | add_batch (already loud), _build_llm_and_embedder (no pre-flight), graph_builder._wait_for_episodes (polls a no-op episode.get) |
Constraint + possibly Missing — if R1 turns up an additional silent-failure call site (likely candidates: graphiti add_episode swallowing extraction failures, or a dim-mismatch returning soft errors), add a remediation there. |
| R5 Backwards compatibility | _build_llm_and_embedder OpenAI/Gemini branches |
Constraint — no logic change required; only defaults change. Operator's explicit EMBEDDING_* settings continue to win. |
| R6 Documentation | README.md:163–183, CLAUDE.md:72–80, docker-compose.yml:31–33, .env.example (protected) |
Missing — flip the README from "Ollama as commented option" to "Ollama as active default, OpenAI/Gemini commented fallback"; CLAUDE.md needs a small wording tweak; .env.example requires operator-coordinated edit (file is hook-protected). |
| R7 End-to-end smoke | Graph build → env setup (profile_generator) → report agent tools — not directly modified, just exercised |
Constraint — requires a representative seed file. PR description documents whether the smoke test ran. |
Implementation Approach Options
Option A — Defaults-Only Flip (extend existing)
Change backend/app/config.py defaults and .env.example + README/CLAUDE.md/docker-compose comments. No code-path changes to _build_llm_and_embedder (the "openai" branch already serves Ollama). Optionally add a one-line ERROR log in graphiti_adapter._build_llm_and_embedder when EMBEDDING_BASE_URL is unset, warning that the LLM base URL is being reused (which is a known dim/model mismatch trap with Dashscope/Qwen).
Trade-offs
- ✅ Tiny, reversible, matches conventions.
- ✅ Fully relies on existing loud-failure plumbing.
- ❌ If R1 turns up a silent path inside
add_episodeitself (graphiti-core), defaults-only does not fix it.
Option B — Defaults Flip + Pre-flight Embedder Probe (extend existing + small new helper)
Same as A, plus a one-shot embedder ping during _get_graphiti() initialization: synchronously call the configured embedder on a known string and assert the response length matches Graphiti's EMBEDDING_DIM. On mismatch or connectivity failure, raise so the first Project creation surfaces the error rather than the first graph-build worker.
Trade-offs
- ✅ Surfaces dimension/connectivity bugs before a long graph build runs.
- ✅ Avoids per-batch "is this even reachable?" guessing.
- ❌ Requirements explicitly call out "no startup-time embedder health probe that refuses to boot" (Boundary Context, out of scope). This option contradicts that boundary.
Option C — Defaults Flip + First-Batch Failure Surfacing (hybrid, recommended)
Option A, plus targeted hardening based on what R1 reveals. Likely candidates if root cause is a graphiti-core silent path:
- Wrap the first
add_episodecall with an explicit dimension-check on the produced embedding (compare againstEMBEDDING_DIM) and raise a clearValueError("embedding dim mismatch")fromadd_batchbefore the Neo4j write, so the worker fails the task with an actionable message. - Tighten
_get_graph_infosuch thatcomplete_taskis gated on a non-zero node count (Req 4 AC5), so a "succeeded but empty" graph never reachesGRAPH_COMPLETED.
Trade-offs
- ✅ Targets the actual failure mode identified by R1 instead of speculating.
- ✅ Stays within the boundary (no startup probe, no new env var, no new provider).
- ✅ Matches existing conventions: a small
if not nodes: raiseinside the worker, propagated tofail_task. - ❌ Slightly larger PR than A; the dim-check helper is new code (10–20 lines).
Effort & Risk
- Effort: S (1–3 days) — code change is small; majority of the work is the root-cause repro + smoke-testing the end-to-end pipeline with a local Ollama instance.
- Risk: Medium — relies on a Graphiti-core/Neo4j interaction that we don't fully control. If the root cause is upstream and only fixable via a graphiti-core version bump, scope creeps. Mitigation: if upstream fix is required, capture it in the PR description and ship the defaults-flip + first-batch dim check now; the loud failure ensures operators see the real error rather than an empty graph.
Research Items for Design Phase
- Confirm the exact silent-failure call site. Run a fresh build on
main's default.envand trace where the entity-extraction-or-write disappears: graphiti-core LLM extraction stage, the embedder call, or the Neo4j vector-index write. Log/instrument as needed. - Verify the embedder-output dimension at runtime. With
mxbai-embed-largevia Ollama, confirmlen(embedding) == 1024. Withtext-embedding-3-small, confirm 1536, and observe what Neo4j (or graphiti-core's vector-index check) does with the mismatch. - Decide whether
_get_graph_infogating belongs in this PR. If R1 root cause is fully addressed by the defaults flip, thenode_count > 0gate incomplete_taskis belt-and-braces. If R1 reveals a residual silent path, the gate becomes essential.
Recommendations for Design Phase
- Preferred approach: Option C — flip defaults, instrument the first batch enough to capture R1 evidence, gate
complete_taskon a non-zero node count. - Key decisions to lock in design:
- Concrete default values for
EMBEDDING_MODEL,EMBEDDING_BASE_URL,EMBEDDING_API_KEYinbackend/app/config.py. - Whether
_get_graph_info(graph_id)returningnode_count == 0should raise inside the worker (Req 4 AC5) or only whenadd_batchsucceeded — the latter is the right semantics. - Wording for the README / CLAUDE.md flip: keep both providers documented; only change the active line.
- Concrete default values for
- Carry forward to implementation:
.env.exampleis hook-protected; either coordinate with the developer to update it manually, or document the required diff inHANDOFF.md.- The end-to-end smoke test (graph build → profile generation → report query) needs a representative seed file; if unavailable, mark that explicitly in the PR.