11 KiB
11 KiB
Research & Design Decisions — graphiti-ollama-reranker
Summary
- Feature:
graphiti-ollama-reranker - Discovery Scope: Extension (one new service module + factory branch + config + docs).
- Key Findings:
CrossEncoderClient.rank(query, passages) -> list[tuple[str, float]]is the only abstract contract Graphiti requires of the reranker. The existing_PassthroughRerankeralready exercises this contract correctly.- Ollama's OpenAI-compatible
/v1/chat/completionsendpoint does not reliably exposelogprobs/logit_bias, so Graphiti's default OpenAI scoring approach (binary YES/NO over token logits) cannot be ported. The reranker must use prompted numeric scoring with text-output parsing. - The
openaiSDK already shipped inbackend/.venv(v2.35.1) exposesAsyncOpenAI, which is the right client for the asyncrank()method without introducing any new dependency.
Research Log
Graphiti's CrossEncoderClient contract
- Context: Need to confirm the precise shape of the
rankinterface and any other abstract members. - Sources Consulted:
backend/app/services/graphiti_adapter.py:38-51(_PassthroughReranker);.kiro/specs/graphiti-neo4j-finalize/research.mdandgap-analysis.md(which captured the upstream contract on first integration); ticket #39 narrative. - Findings:
_PassthroughRerankersubclassesCrossEncoderClientand only overridesasync def rank(query: str, passages: list[str]) -> list[tuple[str, float]].- Graphiti's internal call site (
graphiti_core/graphiti.py:154) constructs the reranker once and callsrankper search. There is no separate batch interface to satisfy. - Passages are short text snippets (entity-edge facts / node summaries). Typical N per search ≤ 10 (limit defaulted in
_GraphNamespace.search).
- Implications: A drop-in subclass that implements
rankis sufficient. No additional abstract methods to wire.
Ollama OpenAI-compatible scoring surface
- Context: Decide how to obtain a relevance score per passage from a small Ollama-served chat model.
- Sources Consulted: Project-internal
backend/app/utils/llm_client.py(usesopenai.OpenAI+chat.completions.createagainst Dashscope / OpenAI / Ollama uniformly); ticket #39 "Proposed approach" section enumerating Ollama chat-model scoring vs. embedding cosine. - Findings:
- Ollama supports
/v1/chat/completionsfor chat models likeqwen2.5:3b,llama3.2:3b,phi3:3.8b. Pulling a model is required (ollama pull <model>). - JSON-mode (
response_format={"type": "json_object"}) is honored by recent Ollama versions but not universally; project convention is to fall back gracefully (cf.LLMClient.chat_json). - Embedding-cosine reranker is feasible (re-embed query and passages with
mxbai-embed-large) but produces a weaker ordering signal than an LLM that can reason about the question. Picking LLM scoring matches the ticket's preferred path.
- Ollama supports
- Implications:
- Use a chat-completion call per passage with a deterministic temperature (0.0) and a tight system prompt asking for a JSON score in [0.0, 1.0].
- Parse with the same defensive strategy used elsewhere: strip
<think>blocks, strip markdown fences, attemptjson.loads, regex-fallback to first float, deterministic low score on hard failure.
Concurrency strategy
- Context: Decide between per-passage parallel calls vs. one batched call.
- Findings:
- Per-passage with
asyncio.gatheris simpler to align outputs and resilient — a single bad output only loses one passage's score. - Single batched prompt requires the model to emit aligned scores (often by index); LLMs occasionally drop entries or misorder them, demanding additional validation.
- With typical
limit ≤ 10, parallel per-passage calls hit Ollama briefly; on a 3B model this is < 5s for 10 passages.
- Per-passage with
- Implications: Default to per-passage
asyncio.gather. Expose no extra concurrency knob initially (avoid premature configuration surface; YAGNI per project guidelines).
Failure semantics
- Context: Required by R5 — Flask must keep serving on Ollama outage, and graph search should remain functional.
- Sources Consulted:
backend/app/services/graphiti_adapter.py:515-517(_GraphNamespace.searchswallows all exceptions and logs a warning);_get_graphiti()runs once at first call. - Findings:
- Construction of an
openai.AsyncOpenAIclient does not perform any network I/O. ThereforeOllamaReranker.__init__can be safe at startup even when Ollama is down. - If
rank()itself raises, the upstreamGraphiti.searchmay surface the exception. The new reranker should therefore catch its own errors and degrade to passthrough behavior in-method rather than relying on the outertry/exceptin_GraphNamespace.search.
- Construction of an
- Implications:
OllamaReranker.rankshould never raise. On exception or unparseable output it returns the input passages in the original order with passthrough-style synthetic scores and emits a single WARNING log per failure (rate-limited by intent: one log per rank() call).
Architecture Pattern Evaluation
| Option | Description | Strengths | Risks / Limitations | Notes |
|---|---|---|---|---|
A: Add class to graphiti_adapter.py |
Define OllamaReranker next to _PassthroughReranker in the same file. |
Minimal diff; single file to read. | Bloats an already-long adapter; mixes wiring with provider-specific logic. | — |
B: New services/ollama_reranker.py module |
Dedicated module owns prompt + parse + async client; adapter only selects it. | Single-responsibility module; matches ticket suggestion; reusable in isolation. | One extra import in adapter. | Selected. Aligns with project pattern of one concern per services/* file. |
| C: Hybrid provider registry | Map RERANKER_PROVIDER → builder in adapter; class still in B's module. |
Future providers are a one-line registry change. | Over-engineering for two providers (ollama + none). |
Deferred until a third provider is needed. |
Design Decisions
Decision: Provider selected via env var, branch lives in _get_graphiti()
- Context: R3 requires env-driven provider selection; only two values supported by this spec (
ollamaandnone). - Alternatives Considered:
- Function-pointer registry (Option C).
- Inline
if/elsein the factory selecting one of two classes.
- Selected Approach: Inline branch in
_get_graphiti()readsConfig.RERANKER_PROVIDER, picks_build_ollama_reranker()or_PassthroughReranker(), validates unknown values with aValueErrormatching the existing_ALLOWED_GRAPHITI_PROVIDERSconvention. - Rationale: Mirrors the established
GRAPHITI_LLM_PROVIDERvalidation pattern (_ALLOWED_GRAPHITI_PROVIDERS) without adding speculative abstraction. Two values, two branches. - Trade-offs: Adding a third provider later costs one more
elif; acceptable. - Follow-up: Surface the selected provider in the INFO startup log so operators can confirm.
Decision: Per-passage scoring with asyncio.gather, no concurrency knob
- Context: R2.3 requires one score per passage in descending order; R5 requires graceful per-call failure.
- Alternatives Considered:
- Single batched prompt with index-aligned output.
- Per-passage call with bounded
Semaphore.
- Selected Approach: Per-passage
asyncio.gatherwith no explicit limit; rely on defaultlimit ≤ 10in_GraphNamespace.search. - Rationale: Simple, deterministic, isolates per-passage failures. Avoids premature configuration knob.
- Trade-offs: If a future caller asks for
limit=100, Ollama may queue 100 requests; acceptable for now because no caller does this. - Follow-up: If real-world rerank latency becomes a concern, add
RERANKER_MAX_PARALLELthen.
Decision: Default model = qwen2.5:3b
- Context: Need a small, broadly-available Ollama chat model that reliably emits a numeric score in 1–2 tokens.
- Alternatives Considered:
qwen2.5:3b(Apache-2.0, 3B params, strong instruction following).llama3.2:3b(Llama community license, 3B).phi3:3.8b(MIT, 3.8B).
- Selected Approach:
qwen2.5:3b. - Rationale: Matches the Qwen-family alignment of the rest of the project (
qwen-plusis the documented LLM default). Apache-2.0 license is permissive. Small enough for typical dev machines. - Trade-offs: Operators on systems without
qwen2.5:3bmustollama pull qwen2.5:3bor overrideRERANKER_MODEL. - Follow-up: README will document
ollama pull qwen2.5:3balongside the existingollama pull mxbai-embed-largestep.
Decision: Defensive output parsing (json.loads → regex float → deterministic low score)
- Context: R2.6 requires deterministic handling of unparseable model responses.
- Selected Approach:
- Strip
<think>...</think>blocks (project convention fromllm_client.py:64). - Strip markdown fences (project convention from
llm_client.chat_json). json.loadsand readscore(float in[0, 1], clipped on out-of-range).- On JSON failure, regex-extract the first float token; clip to
[0, 1]. - On total failure, assign
0.0 - 0.001 * passage_index(deterministic and below any successfully-parsed score).
- Strip
- Rationale: Reuses patterns already in the codebase. Keeps every passage in the output (R2.6).
- Trade-offs: One failed parse silently downranks a passage; logged at DEBUG (not WARNING) to avoid log spam.
Risks & Mitigations
- Risk: Ollama service is not running on startup → boot must not fail. Mitigation: Construct only
AsyncOpenAI(no network call) during__init__. Defer connectivity to firstrank(). R5.4. - Risk: Model is not pulled →
rank()raises 404 from Ollama. Mitigation: Catch withinrank(), log WARNING naming model + error class, return passthrough-ordered tuples so search still works. R5.1, R5.3. - Risk: Operator misconfigures
RERANKER_PROVIDERto an unknown value → silent fallthrough to wrong reranker. Mitigation:_get_graphiti()raisesValueErrorlisting allowed values, mirroring_ALLOWED_GRAPHITI_PROVIDERS. R3.5. - Risk: Multiple concurrent
rank()calls overwhelm a small local Ollama daemon. Mitigation: Accept default Graphitilimit ≤ 10; documentRERANKER_MAX_PARALLELas a future follow-up if needed.
References
backend/app/services/graphiti_adapter.py:38-51— current passthrough reranker contract.backend/app/services/graphiti_adapter.py:142-162— current_get_graphiti()wiring point.backend/app/utils/llm_client.py— project pattern for OpenAI-SDK chat + JSON parsing + reasoning-block stripping..kiro/specs/graphiti-neo4j-finalize/research.md— historical context for why the passthrough was introduced.- Ticket
#39in.ticket/39.md— feature brief and acceptance criteria.