Research & Design Decisions
Summary
- Feature: `graph-build-empty-fix`
- Discovery Scope: Extension
- Key Findings:
  - `_build_llm_and_embedder` (`backend/app/services/graphiti_adapter.py:92-139`) already supports any OpenAI-compatible `/v1/embeddings` endpoint through the existing `"openai"` branch, so Ollama at `host:11434/v1` works without a new provider branch.
  - The empty-graph symptom is consistent with a vector-dimension mismatch: `Config.EMBEDDING_MODEL` defaults to OpenAI's `text-embedding-3-small` (1536-dim), but `graphiti-core` initialises the Neo4j vector index at 1024 dims. When `EMBEDDING_BASE_URL` / `EMBEDDING_API_KEY` are unset, the embedder reuses `LLM_BASE_URL` / `LLM_API_KEY`, which on the documented Dashscope/Qwen default cannot serve OpenAI's embedding model and produces either a 4xx (since #18, raised to the worker) or a dim-mismatch write that graphiti-core does not validate.
  - The loud-failure plumbing from spec `graphiti-ollama-embedder` (issue #18) is intact: `_GraphNamespace.add_batch` re-raises with `logger.exception`, and `_build_graph_worker` calls `fail_task(...)`. Belt-and-braces: gate `complete_task` on a non-zero entity-node count so a "succeeded but empty" graph cannot reach `GRAPH_COMPLETED` if any silent path remains.
  - `_recover_stuck_projects` (`backend/app/__init__.py:88-109`) already gates recovery promotion on `count(:Entity {group_id}) > 0`, so Requirement 4 AC5's contract holds symmetrically on the startup side.
Research Log
Embedder construction path under current defaults
- Context: Determine the runtime configuration of the embedder when an operator runs `main` with the documented `.env` (Qwen via Dashscope for LLM, all `EMBEDDING_*` unset).
- Sources Consulted: `backend/app/services/graphiti_adapter.py:92-139`, `backend/app/config.py:32-54`, README.md L150-184.
- Findings:
  - Resolved values: `embedding_model = "text-embedding-3-small"` (1536-dim), `base_url = LLM_BASE_URL = https://dashscope.aliyuncs.com/compatible-mode/v1`, `api_key = LLM_API_KEY` (a Dashscope key).
  - Dashscope's OpenAI-compatible mode does not serve `text-embedding-3-small`. The call either 404s on the model name or returns an empty/incorrect response. Since spec #18, this failure path propagates to the worker, but operators reading the README's default config still trip it.
- Implications: Flipping the default `EMBEDDING_*` to a local Ollama embedder both (a) restores a self-hosted, free-by-default flow and (b) collapses the dim-mismatch class of empty-graph regressions because `mxbai-embed-large` is 1024-dim, matching graphiti-core's vector index.
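The fallback described in the findings can be illustrated with a minimal sketch. `resolve_embedder_settings` and `EmbedderSettings` are hypothetical names introduced here for illustration; the real logic lives in `_build_llm_and_embedder` and may differ in detail. The sketch only assumes what the findings state: an unset `EMBEDDING_*` value falls back to its `LLM_*` counterpart, and the model defaults to `text-embedding-3-small`.

```python
from typing import Mapping, NamedTuple, Optional


class EmbedderSettings(NamedTuple):
    model: str
    base_url: Optional[str]
    api_key: Optional[str]


def resolve_embedder_settings(env: Mapping[str, str]) -> EmbedderSettings:
    """Hypothetical helper: EMBEDDING_* wins, otherwise reuse LLM_*."""
    return EmbedderSettings(
        # Pre-fix default model, as recorded in the findings.
        model=env.get("EMBEDDING_MODEL") or "text-embedding-3-small",
        base_url=env.get("EMBEDDING_BASE_URL") or env.get("LLM_BASE_URL"),
        api_key=env.get("EMBEDDING_API_KEY") or env.get("LLM_API_KEY"),
    )


# The documented .env: Dashscope for the LLM, every EMBEDDING_* unset.
documented_env = {
    "LLM_BASE_URL": "https://dashscope.aliyuncs.com/compatible-mode/v1",
    "LLM_API_KEY": "placeholder-dashscope-key",  # placeholder, not a real key
}
resolved = resolve_embedder_settings(documented_env)
```

Under these assumptions `resolved` pairs OpenAI's model name with Dashscope's base URL and key, which is the mismatched combination the findings describe.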
Graphiti-core vector index dimension
- Context: Confirm graphiti-core's expected embedding dimension and whether it is configurable from MiroFish.
- Sources Consulted: CLAUDE.md L78-80 (states the 1024-dim invariant), `.kiro/specs/graphiti-ollama-embedder/requirements.md` (Requirement 3 AC1), `_PassthroughReranker` in `graphiti_adapter.py:38-51` (precedent for working around upstream defaults).
- Findings:
  - `graphiti-core` ≥ 0.3 ships with `EMBEDDING_DIM = 1024`. It is not surfaced as an env knob in MiroFish today and is explicitly out of scope to change.
  - Therefore the embedder must produce 1024-dim vectors. `mxbai-embed-large` does; `text-embedding-3-small` (1536) and `nomic-embed-text` (768) do not.
- Implications: The only correct default model is one whose output is 1024-dim. Ollama's `mxbai-embed-large` is the project's already-documented choice (CLAUDE.md, README).
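The dimension constraint can be made concrete with a small sketch. Nothing here exists in MiroFish or graphiti-core: `MODEL_DIMS`, `models_matching_index`, and `assert_vector_dim` are hypothetical names, and the table holds only the default output dimensions quoted in the findings.

```python
from typing import Dict, List, Sequence

INDEX_DIM = 1024  # graphiti-core's EMBEDDING_DIM, per the findings

# Default output dimensions of the models named in the findings.
MODEL_DIMS: Dict[str, int] = {
    "mxbai-embed-large": 1024,
    "text-embedding-3-small": 1536,
    "nomic-embed-text": 768,
}


def models_matching_index(model_dims: Dict[str, int], index_dim: int = INDEX_DIM) -> List[str]:
    """Return the models whose output dimension equals the index dimension."""
    return sorted(name for name, dim in model_dims.items() if dim == index_dim)


def assert_vector_dim(vector: Sequence[float], index_dim: int = INDEX_DIM) -> None:
    """Raise ValueError when a vector would not fit the index."""
    if len(vector) != index_dim:
        raise ValueError(f"embedding has {len(vector)} dims, index expects {index_dim}")
```

A check like `assert_vector_dim` is shown only to illustrate the invariant. The spec itself rules out a startup probe, so this is not a proposal to add one.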
Existing loud-failure contract
- Context: Verify that this spec inherits a working error-propagation contract rather than re-establishing one.
- Sources Consulted: `backend/app/services/graphiti_adapter.py:455-486` (`add_batch`), `backend/app/services/graph_builder.py:227-230` (worker `except`), `.kiro/steering/error-handling.md`.
- Findings:
  - `add_batch` calls `logger.exception(...)` and `raise` on the first failed episode (lines 478-483). No placeholder UUIDs.
  - The worker catches `Exception`, formats the traceback, and calls `TaskManager().fail_task(task_id, error_msg)`.
- Implications: This spec must not weaken the contract. The only remaining silent surface is "the entire batch succeeds but produces no entities", which the design handles by gating `complete_task` on a non-zero node count returned by `_get_graph_info(graph_id)`.
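The contract in the findings (log, then re-raise on the first failed episode, with no placeholder UUIDs) has roughly the following shape. This is a sketch, not the adapter's code: the signature, the injected `add_episode` callable, and the logger name are assumptions made so the example is self-contained.

```python
import logging
from typing import Callable, Iterable, List

logger = logging.getLogger("sketch.add_batch")  # hypothetical logger name


def add_batch(episodes: Iterable[str], add_episode: Callable[[str], str]) -> List[str]:
    """Add episodes in order; log and re-raise on the first failure."""
    uuids: List[str] = []
    for index, episode in enumerate(episodes):
        try:
            uuids.append(add_episode(episode))
        except Exception:
            # Loud failure: record the traceback and propagate to the caller.
            # No placeholder UUID is appended for the failed episode.
            logger.exception("episode %d failed", index)
            raise
    return uuids
```

Because the exception propagates, a worker wrapping this call in `try`/`except Exception` gets the chance to mark the task as failed.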
Startup recovery contract
- Context: Confirm that `_recover_stuck_projects` already aligns with Requirement 4 AC5.
- Sources Consulted: `backend/app/__init__.py:88-109`.
- Findings: Recovery only promotes to `GRAPH_COMPLETED` when `count(:Entity {group_id}) > 0`. It gates on entities, not edges.
- Implications: No change needed in the recovery path. Symmetric gating in `complete_task` (this spec) yields a consistent "non-empty entities ⇒ COMPLETED" invariant on both startup recovery and live worker completion.
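The recovery rule can be sketched as a pure function over an entity count. `ENTITY_COUNT_QUERY` is one plausible Cypher spelling of the `count(:Entity {group_id})` check, and `projects_to_promote` with its injected `count_entities` callable is hypothetical. The actual code in `_recover_stuck_projects` may be structured differently.

```python
from typing import Callable, Iterable, List

# Assumed Cypher form of the entity count used by the recovery gate.
ENTITY_COUNT_QUERY = (
    "MATCH (n:Entity {group_id: $group_id}) RETURN count(n) AS entity_count"
)


def projects_to_promote(
    group_ids: Iterable[str],
    count_entities: Callable[[str], int],
) -> List[str]:
    """Return the group_ids that may be promoted to GRAPH_COMPLETED.

    count_entities stands in for running ENTITY_COUNT_QUERY against Neo4j.
    Only entity counts matter; edges are not consulted.
    """
    return [gid for gid in group_ids if count_entities(gid) > 0]
```

For example, with counts `{"mirofish_a": 12, "mirofish_b": 0}` only `mirofish_a` qualifies for promotion.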
Architecture Pattern Evaluation
| Option | Description | Strengths | Risks / Limitations | Notes |
|---|---|---|---|---|
| A: Defaults-only flip | Change `config.py` + `.env.example` + docs. No code logic change. | Smallest diff, fully reversible, leverages existing loud-failure plumbing. | Doesn't address the residual silent path of "Graphiti succeeded but produced no entities". | Sufficient if the dim-mismatch is the sole root cause. |
| B: Defaults flip + startup embedder probe | Plus a synchronous one-shot embedding ping during `_get_graphiti()` init, asserting dim match. | Surfaces dim/connectivity errors at boot. | Explicitly out of boundary per requirements (no startup probe). | Rejected. |
| C: Defaults flip + non-zero-count gate | Flip defaults; gate `complete_task` on `_get_graph_info(graph_id).node_count > 0`; if 0, call `fail_task` with a clear "graph build produced 0 entities" message. | Closes the "succeeded but empty" silent path symmetrically with `_recover_stuck_projects`. Stays within boundary. | Slightly larger diff (≈10 lines in `graph_builder.py`). | Selected. |
Design Decisions
Decision: Local Ollama (mxbai-embed-large) as the embedding default
- Context: Requirement 2: local embedder is the default; remote providers stay as opt-in fallbacks.
- Alternatives Considered:
  - Keep OpenAI default, document Ollama as the recommended path. Rejected; doesn't satisfy R2 AC1/AC2.
  - Switch default to a remote 1024-dim provider (e.g., Cohere `embed-english-v3.0`; the `light` variant is 384-dim and would not fit the index). Rejected; reintroduces a remote dependency in the hot path.
  - Bundle Ollama in `docker-compose.yml`. Rejected; explicitly out of boundary, operator-managed.
- Selected Approach: `Config.EMBEDDING_MODEL = 'mxbai-embed-large'`, `Config.EMBEDDING_BASE_URL = 'http://localhost:11434/v1'`, `Config.EMBEDDING_API_KEY = 'ollama'`. `.env.example` presents the Ollama block uncommented and the OpenAI/Gemini blocks commented out.
- Rationale: Matches the already-documented invariant (1024-dim, self-hosted), removes the dim-mismatch root cause, and removes the per-request remote cost.
- Trade-offs: New operators must `ollama pull mxbai-embed-large` before the first graph build. README and `.env.example` already cover this prerequisite, so the burden is small. Operators in pure-cloud deployments must explicitly opt in to a remote embedder, which is the desired direction.
- Follow-up: README setup section must mention the `ollama pull` prerequisite alongside Neo4j.
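The selected defaults and their override semantics can be sketched as follows. The three default values come from the Selected Approach; `embedding_config` is a hypothetical function used here so the behaviour can be exercised without touching the process environment, whereas the real `Config` class presumably reads `os.environ` directly.

```python
from typing import Dict, Mapping

# New defaults from the selected approach.
EMBEDDING_DEFAULTS: Dict[str, str] = {
    "EMBEDDING_MODEL": "mxbai-embed-large",
    "EMBEDDING_BASE_URL": "http://localhost:11434/v1",
    "EMBEDDING_API_KEY": "ollama",
}


def embedding_config(env: Mapping[str, str]) -> Dict[str, str]:
    """Hypothetical resolver: an explicit env value wins over the default."""
    return {key: env.get(key, default) for key, default in EMBEDDING_DEFAULTS.items()}


# No overrides: the local Ollama embedder is used.
local = embedding_config({})

# An operator opting back in to OpenAI overrides all three values.
remote = embedding_config({
    "EMBEDDING_MODEL": "text-embedding-3-small",
    "EMBEDDING_BASE_URL": "https://api.openai.com/v1",
    "EMBEDDING_API_KEY": "placeholder-openai-key",  # placeholder, not a real key
})
```

Keeping the defaults in one mapping is a choice made for this sketch only; it makes the "explicit env vars still win" rule a one-line dictionary lookup.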
Decision: Gate complete_task on a non-zero entity-node count
- Context: Requirement 4 AC5: `GRAPH_COMPLETED` must not be reachable while Neo4j holds zero entities for the project's `group_id`.
- Alternatives Considered:
  - Trust `add_batch`'s loud-failure contract entirely. Rejected; if any future Graphiti call returns without raising but writes nothing, the symptom recurs silently.
  - Add a separate "verify graph" task after build. Rejected; over-engineering for a 5-line gate.
- Selected Approach: Inside `_build_graph_worker`, after `_get_graph_info(graph_id)`, if `node_count == 0`, call `TaskManager().fail_task(...)` with a localised message naming the failure (and skip `complete_task`).
- Rationale: Mirrors `_recover_stuck_projects`' "promote only when count > 0" rule; preserves the contract symmetrically on both completion paths.
- Trade-offs: Tiny additional code surface. Eliminates the regression vector for any future silent failure inside graphiti-core.
- Follow-up: Add the new failure message to `locales/en.json` and `locales/zh.json` under keys consistent with the existing `progress.*` namespace.
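A minimal sketch of the gate, under stated assumptions: `finish_build`, `GraphInfo`, and the injected callables are hypothetical stand-ins for the tail of `_build_graph_worker`, `_get_graph_info`, and the `TaskManager` methods, and the failure text is a fixed English string rather than the localised message the spec calls for.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class GraphInfo:
    node_count: int


EMPTY_GRAPH_MESSAGE = "Graph build produced 0 entities for this project."


def finish_build(
    graph_id: str,
    task_id: str,
    get_graph_info: Callable[[str], GraphInfo],
    complete_task: Callable[[str, Dict], None],
    fail_task: Callable[[str, str], None],
) -> bool:
    """Complete the task only when the graph holds at least one entity."""
    info = get_graph_info(graph_id)
    if info.node_count == 0:
        # "Succeeded but empty" is converted into an explicit failure,
        # and complete_task is skipped.
        fail_task(task_id, EMPTY_GRAPH_MESSAGE)
        return False
    complete_task(task_id, {"graph_info": {"node_count": info.node_count}})
    return True
```

Injecting the three callables keeps the gate testable with a stubbed graph-info lookup, which is how the negative path can be driven without Neo4j.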
Decision: No new env var for EMBEDDING_DIM
- Context: Requirement 3 AC4: keep dim fixed at 1024.
- Selected Approach: Continue to inherit graphiti-core's `EMBEDDING_DIM = 1024`. Document the constraint in CLAUDE.md.
- Rationale: Avoids surface-area creep; supporting 768/1536 dims is its own follow-up that would require a graphiti-core upgrade or fork.
Decision: README documents the Ollama path as the active default; OpenAI/Gemini as commented fallbacks
- Context: Requirement 6: the documented happy path must match the new behavior.
- Selected Approach: Swap the `# EMBEDDING_*=` comments in README's env block so the Ollama lines are uncommented and the OpenAI/Gemini lines move to a comment-only example.
- Rationale: Matches `.env.example`'s structure; minimises drift between the two files.
Risks & Mitigations
- Risk: The actual root cause is upstream in `graphiti-core`, not the dim mismatch, so the defaults flip alone may not produce non-empty graphs.
  - Mitigation: R1 mandates a reproduction run on `main` before the fix; the design includes the `complete_task` gate so a silent upstream failure is surfaced as a `fail_task` rather than an "empty graph, COMPLETED" outcome. The PR description records the captured failure mode.
- Risk: Operators upgrade in place and discover their old project graphs (1536-dim OpenAI embeddings) are unreachable.
  - Mitigation: Requirement 5 AC3: operators continue to set `EMBEDDING_MODEL` to their previous value; no auto-rebuild. Document in CLAUDE.md and README's migration note that switching embedder models invalidates existing project graphs (already a baseline rule from `database.md`).
- Risk: `.env.example` is hook-protected (the assistant cannot write to it).
  - Mitigation: Implementation will provide the required diff and a one-line `cat`-friendly snippet in the PR description / `HANDOFF.md`. Operator applies the change manually.
Smoke Run
2026-05-11 — sandbox validation
- Gate firing (Task 5.3 / negative path): validated in-process with the worker driven by a stubbed `_get_graph_info` that returns `node_count=0`. Result captured by the implementation script: `Task.status == FAILED`, `Task.error` starts with "Graph build produced 0 entities for this project. …", and the ERROR log line `graph build produced 0 entities for group_id=mirofish_test (task=…)` is emitted via the new `mirofish.graph_builder` logger. The symmetric happy path with `node_count=42` was also driven and yielded `Task.status == COMPLETED` with `result.graph_info.node_count == 42`.
- Config defaults (Task 2.1): validated in-process. With no `.env` override, `Config.EMBEDDING_MODEL = "mxbai-embed-large"`, `Config.EMBEDDING_BASE_URL = "http://localhost:11434/v1"`, `Config.EMBEDDING_API_KEY = "ollama"`, `Config.GRAPHITI_LLM_PROVIDER = "openai"`. Override semantics confirmed: explicit env vars still win over the new defaults.
- End-to-end smoke (Task 5.1): deferred to operator validation because the sandbox lacks Neo4j, Ollama, and LLM credentials. The PR description will state explicitly that the smoke run was not executed in this environment and list the steps an operator should run before tagging the PR ready: `ollama pull mxbai-embed-large` → `docker compose up -d neo4j` → `npm run dev` → upload a representative seed file → confirm `Task.result.graph_info.node_count > 0` → run Step 2 (Env Setup) → run Step 4 (Report) and confirm tool calls return non-empty results.
- Backwards-compat (Task 5.2): deferred to operator validation under the same constraint. The PR description includes the operator runbook for the OpenAI override scenario (`.env` with `EMBEDDING_*` pointing at `https://api.openai.com/v1` and `text-embedding-3-small`) plus the Gemini provider scenario (`GRAPHITI_LLM_PROVIDER=gemini`, `EMBEDDING_MODEL=gemini-embedding-001`).
Reproduction Log
2026-05-11 — sandbox run
- Context: Implementation phase Task 1.1 attempted live reproduction on `main`'s default `.env` (LLM via Dashscope, all `EMBEDDING_*` unset).
- Result: Reproduction could not be executed inside the Claude sandbox: no Neo4j daemon, no Ollama daemon, no LLM API key, no network egress to Dashscope. A live capture of the failing `Task` envelope and Neo4j node count is therefore deferred to operator validation (Task 5.1).
- Working hypothesis (carried forward): Two compounding silent paths produce the empty-graph symptom on default config:
  - With `EMBEDDING_API_KEY` / `EMBEDDING_BASE_URL` unset, the embedder falls back to `LLM_API_KEY` / `LLM_BASE_URL`. On the documented default (Dashscope/Qwen for LLM), Dashscope's OpenAI-compatible surface does not serve `text-embedding-3-small`, so calls either 404 or return non-conformant payloads. Post #18 this would propagate as a `Task.FAILED`, not an "empty graph, COMPLETED".
  - If the embedder returns a payload (e.g., on an OpenAI key) the resulting 1536-dim vector mismatches Graphiti's 1024-dim vector index. Behaviour at this boundary is graphiti-core-dependent and may have surfaced historically as "wrote metadata, dropped entities".
- Verdict: diverged-by-sandbox. The fix is robust against either failure mode: flipping the defaults to a 1024-dim local embedder collapses both classes, and the `_get_graph_info(...).node_count == 0` gate (Task 3.1) converts any residual silent path into a `Task.FAILED` with `progress.emptyGraphFailure`.
- Operator-side verification: Task 5.1 captures the live Smoke Run; Task 5.3 forces the gate's negative path to confirm it surfaces the residual silent case as expected.
References
- `backend/app/services/graphiti_adapter.py`: embedder construction, loud-failure batch
- `backend/app/services/graph_builder.py`: graph-build worker
- `backend/app/__init__.py`: startup recovery
- `backend/app/config.py`: env-driven defaults
- `.kiro/specs/graphiti-ollama-embedder/requirements.md`: preceding loud-failure work (issue #18)
- `.kiro/specs/graphiti-neo4j-finalize/`: initial Zep → Graphiti migration
- `.kiro/steering/database.md`, `.kiro/steering/error-handling.md`: invariants relied upon
- `.ticket/37.md`: bug ticket source