11 KiB

Raw Blame History

Research & Design Decisions — graphiti-ollama-reranker

Summary

Feature: graphiti-ollama-reranker
Discovery Scope: Extension (one new service module + factory branch + config + docs).
Key Findings:
- CrossEncoderClient.rank(query, passages) -> list[tuple[str, float]] is the only abstract contract Graphiti requires of the reranker. The existing _PassthroughReranker already exercises this contract correctly.
- Ollama's OpenAI-compatible /v1/chat/completions endpoint does not reliably expose logprobs / logit_bias, so Graphiti's default OpenAI scoring approach (binary YES/NO over token logits) cannot be ported. The reranker must use prompted numeric scoring with text-output parsing.
- The openai SDK already shipped in backend/.venv (v2.35.1) exposes AsyncOpenAI, which is the right client for the async rank() method without introducing any new dependency.

Research Log

Graphiti's `CrossEncoderClient` contract

Context: Need to confirm the precise shape of the rank interface and any other abstract members.
Sources Consulted: backend/app/services/graphiti_adapter.py:38-51 (_PassthroughReranker); .kiro/specs/graphiti-neo4j-finalize/research.md and gap-analysis.md (which captured the upstream contract on first integration); ticket #39 narrative.
Findings:
- _PassthroughReranker subclasses CrossEncoderClient and only overrides async def rank(query: str, passages: list[str]) -> list[tuple[str, float]].
- Graphiti's internal call site (graphiti_core/graphiti.py:154) constructs the reranker once and calls rank per search. There is no separate batch interface to satisfy.
- Passages are short text snippets (entity-edge facts / node summaries). Typical N per search ≤ 10 (limit defaulted in _GraphNamespace.search).
Implications: A drop-in subclass that implements rank is sufficient. No additional abstract methods to wire.

Ollama OpenAI-compatible scoring surface

Context: Decide how to obtain a relevance score per passage from a small Ollama-served chat model.
Sources Consulted: Project-internal backend/app/utils/llm_client.py (uses openai.OpenAI + chat.completions.create against Dashscope / OpenAI / Ollama uniformly); ticket #39 "Proposed approach" section enumerating Ollama chat-model scoring vs. embedding cosine.
Findings:
- Ollama supports /v1/chat/completions for chat models like qwen2.5:3b, llama3.2:3b, phi3:3.8b. Pulling a model is required (ollama pull <model>).
- JSON-mode (response_format={"type": "json_object"}) is honored by recent Ollama versions but not universally; project convention is to fall back gracefully (cf. LLMClient.chat_json).
- Embedding-cosine reranker is feasible (re-embed query and passages with mxbai-embed-large) but produces a weaker ordering signal than an LLM that can reason about the question. Picking LLM scoring matches the ticket's preferred path.
Implications:
- Use a chat-completion call per passage with a deterministic temperature (0.0) and a tight system prompt asking for a JSON score in [0.0, 1.0].
- Parse with the same defensive strategy used elsewhere: strip <think> blocks, strip markdown fences, attempt json.loads, regex-fallback to first float, deterministic low score on hard failure.

Concurrency strategy

Context: Decide between per-passage parallel calls vs. one batched call.
Findings:
- Per-passage with asyncio.gather is simpler to align outputs and resilient — a single bad output only loses one passage's score.
- Single batched prompt requires the model to emit aligned scores (often by index); LLMs occasionally drop entries or misorder them, demanding additional validation.
- With typical limit ≤ 10, parallel per-passage calls hit Ollama briefly; on a 3B model this is < 5s for 10 passages.
Implications: Default to per-passage asyncio.gather. Expose no extra concurrency knob initially (avoid premature configuration surface; YAGNI per project guidelines).

Failure semantics

Context: Required by R5 — Flask must keep serving on Ollama outage, and graph search should remain functional.
Sources Consulted: backend/app/services/graphiti_adapter.py:515-517 (_GraphNamespace.search swallows all exceptions and logs a warning); _get_graphiti() runs once at first call.
Findings:
- Construction of an openai.AsyncOpenAI client does not perform any network I/O. Therefore OllamaReranker.__init__ can be safe at startup even when Ollama is down.
- If rank() itself raises, the upstream Graphiti.search may surface the exception. The new reranker should therefore catch its own errors and degrade to passthrough behavior in-method rather than relying on the outer try/except in _GraphNamespace.search.
Implications: OllamaReranker.rank should never raise. On exception or unparseable output it returns the input passages in the original order with passthrough-style synthetic scores and emits a single WARNING log per failure (rate-limited by intent: one log per rank() call).

Architecture Pattern Evaluation

Option	Description	Strengths	Risks / Limitations	Notes
A: Add class to `graphiti_adapter.py`	Define `OllamaReranker` next to `_PassthroughReranker` in the same file.	Minimal diff; single file to read.	Bloats an already-long adapter; mixes wiring with provider-specific logic.	—
B: New `services/ollama_reranker.py` module	Dedicated module owns prompt + parse + async client; adapter only selects it.	Single-responsibility module; matches ticket suggestion; reusable in isolation.	One extra import in adapter.	Selected. Aligns with project pattern of one concern per `services/*` file.
C: Hybrid provider registry	Map `RERANKER_PROVIDER → builder` in adapter; class still in B's module.	Future providers are a one-line registry change.	Over-engineering for two providers (`ollama` + `none`).	Deferred until a third provider is needed.

Design Decisions

Decision: Provider selected via env var, branch lives in `_get_graphiti()`

Context: R3 requires env-driven provider selection; only two values supported by this spec (ollama and none).
Alternatives Considered:
1. Function-pointer registry (Option C).
2. Inline if/else in the factory selecting one of two classes.
Selected Approach: Inline branch in _get_graphiti() reads Config.RERANKER_PROVIDER, picks _build_ollama_reranker() or _PassthroughReranker(), validates unknown values with a ValueError matching the existing _ALLOWED_GRAPHITI_PROVIDERS convention.
Rationale: Mirrors the established GRAPHITI_LLM_PROVIDER validation pattern (_ALLOWED_GRAPHITI_PROVIDERS) without adding speculative abstraction. Two values, two branches.
Trade-offs: Adding a third provider later costs one more elif; acceptable.
Follow-up: Surface the selected provider in the INFO startup log so operators can confirm.

Decision: Per-passage scoring with `asyncio.gather`, no concurrency knob

Context: R2.3 requires one score per passage in descending order; R5 requires graceful per-call failure.
Alternatives Considered:
1. Single batched prompt with index-aligned output.
2. Per-passage call with bounded Semaphore.
Selected Approach: Per-passage asyncio.gather with no explicit limit; rely on default limit ≤ 10 in _GraphNamespace.search.
Rationale: Simple, deterministic, isolates per-passage failures. Avoids premature configuration knob.
Trade-offs: If a future caller asks for limit=100, Ollama may queue 100 requests; acceptable for now because no caller does this.
Follow-up: If real-world rerank latency becomes a concern, add RERANKER_MAX_PARALLEL then.

Decision: Default model = `qwen2.5:3b`

Context: Need a small, broadly-available Ollama chat model that reliably emits a numeric score in 1–2 tokens.
Alternatives Considered:
1. qwen2.5:3b (Apache-2.0, 3B params, strong instruction following).
2. llama3.2:3b (Llama community license, 3B).
3. phi3:3.8b (MIT, 3.8B).
Selected Approach: qwen2.5:3b.
Rationale: Matches the Qwen-family alignment of the rest of the project (qwen-plus is the documented LLM default). Apache-2.0 license is permissive. Small enough for typical dev machines.
Trade-offs: Operators on systems without qwen2.5:3b must ollama pull qwen2.5:3b or override RERANKER_MODEL.
Follow-up: README will document ollama pull qwen2.5:3b alongside the existing ollama pull mxbai-embed-large step.

Decision: Defensive output parsing (`json.loads` → regex float → deterministic low score)

Context: R2.6 requires deterministic handling of unparseable model responses.
Selected Approach:
1. Strip <think>...</think> blocks (project convention from llm_client.py:64).
2. Strip markdown fences (project convention from llm_client.chat_json).
3. json.loads and read score (float in [0, 1], clipped on out-of-range).
4. On JSON failure, regex-extract the first float token; clip to [0, 1].
5. On total failure, assign 0.0 - 0.001 * passage_index (deterministic and below any successfully-parsed score).
Rationale: Reuses patterns already in the codebase. Keeps every passage in the output (R2.6).
Trade-offs: One failed parse silently downranks a passage; logged at DEBUG (not WARNING) to avoid log spam.

Risks & Mitigations

Risk: Ollama service is not running on startup → boot must not fail. Mitigation: Construct only AsyncOpenAI (no network call) during __init__. Defer connectivity to first rank(). R5.4.
Risk: Model is not pulled → rank() raises 404 from Ollama. Mitigation: Catch within rank(), log WARNING naming model + error class, return passthrough-ordered tuples so search still works. R5.1, R5.3.
Risk: Operator misconfigures RERANKER_PROVIDER to an unknown value → silent fallthrough to wrong reranker. Mitigation: _get_graphiti() raises ValueError listing allowed values, mirroring _ALLOWED_GRAPHITI_PROVIDERS. R3.5.
Risk: Multiple concurrent rank() calls overwhelm a small local Ollama daemon. Mitigation: Accept default Graphiti limit ≤ 10; document RERANKER_MAX_PARALLEL as a future follow-up if needed.

References

backend/app/services/graphiti_adapter.py:38-51 — current passthrough reranker contract.
backend/app/services/graphiti_adapter.py:142-162 — current _get_graphiti() wiring point.
backend/app/utils/llm_client.py — project pattern for OpenAI-SDK chat + JSON parsing + reasoning-block stripping.
.kiro/specs/graphiti-neo4j-finalize/research.md — historical context for why the passthrough was introduced.
Ticket #39 in .ticket/39.md — feature brief and acceptance criteria.

11 KiB Raw Blame History Unescape Escape

Research & Design Decisions — graphiti-ollama-reranker

Summary

Research Log

Graphiti's CrossEncoderClient contract

Ollama OpenAI-compatible scoring surface

Concurrency strategy

Failure semantics

Architecture Pattern Evaluation

Design Decisions

Decision: Provider selected via env var, branch lives in _get_graphiti()

Decision: Per-passage scoring with asyncio.gather, no concurrency knob

Decision: Default model = qwen2.5:3b

Decision: Defensive output parsing (json.loads → regex float → deterministic low score)

Risks & Mitigations

References

11 KiB

Raw Blame History

Graphiti's `CrossEncoderClient` contract

Decision: Provider selected via env var, branch lives in `_get_graphiti()`

Decision: Per-passage scoring with `asyncio.gather`, no concurrency knob

Decision: Default model = `qwen2.5:3b`

Decision: Defensive output parsing (`json.loads` → regex float → deterministic low score)