MicroFish/.kiro/specs/graphiti-ollama-reranker/research.md

113 lines
11 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Research & Design Decisions — graphiti-ollama-reranker
## Summary
- **Feature**: `graphiti-ollama-reranker`
- **Discovery Scope**: Extension (one new service module + factory branch + config + docs).
- **Key Findings**:
- `CrossEncoderClient.rank(query, passages) -> list[tuple[str, float]]` is the only abstract contract Graphiti requires of the reranker. The existing `_PassthroughReranker` already exercises this contract correctly.
- Ollama's OpenAI-compatible `/v1/chat/completions` endpoint does not reliably expose `logprobs` / `logit_bias`, so Graphiti's default OpenAI scoring approach (binary YES/NO over token logits) cannot be ported. The reranker must use **prompted numeric scoring** with text-output parsing.
- The `openai` SDK already shipped in `backend/.venv` (v2.35.1) exposes `AsyncOpenAI`, which is the right client for the async `rank()` method without introducing any new dependency.
## Research Log
### Graphiti's `CrossEncoderClient` contract
- **Context**: Need to confirm the precise shape of the `rank` interface and any other abstract members.
- **Sources Consulted**: `backend/app/services/graphiti_adapter.py:38-51` (`_PassthroughReranker`); `.kiro/specs/graphiti-neo4j-finalize/research.md` and `gap-analysis.md` (which captured the upstream contract on first integration); ticket #39 narrative.
- **Findings**:
- `_PassthroughReranker` subclasses `CrossEncoderClient` and only overrides `async def rank(query: str, passages: list[str]) -> list[tuple[str, float]]`.
- Graphiti's internal call site (`graphiti_core/graphiti.py:154`) constructs the reranker once and calls `rank` per search. There is no separate batch interface to satisfy.
- Passages are short text snippets (entity-edge facts / node summaries). Typical N per search ≤ 10 (limit defaulted in `_GraphNamespace.search`).
- **Implications**: A drop-in subclass that implements `rank` is sufficient. No additional abstract methods to wire.
### Ollama OpenAI-compatible scoring surface
- **Context**: Decide how to obtain a relevance score per passage from a small Ollama-served chat model.
- **Sources Consulted**: Project-internal `backend/app/utils/llm_client.py` (uses `openai.OpenAI` + `chat.completions.create` against Dashscope / OpenAI / Ollama uniformly); ticket #39 "Proposed approach" section enumerating Ollama chat-model scoring vs. embedding cosine.
- **Findings**:
- Ollama supports `/v1/chat/completions` for chat models like `qwen2.5:3b`, `llama3.2:3b`, `phi3:3.8b`. Pulling a model is required (`ollama pull <model>`).
- JSON-mode (`response_format={"type": "json_object"}`) is honored by recent Ollama versions but not universally; project convention is to fall back gracefully (cf. `LLMClient.chat_json`).
- Embedding-cosine reranker is feasible (re-embed query and passages with `mxbai-embed-large`) but produces a weaker ordering signal than an LLM that can reason about the question. Picking LLM scoring matches the ticket's preferred path.
- **Implications**:
- Use a chat-completion call per passage with a deterministic temperature (0.0) and a tight system prompt asking for a JSON score in [0.0, 1.0].
- Parse with the same defensive strategy used elsewhere: strip `<think>` blocks, strip markdown fences, attempt `json.loads`, regex-fallback to first float, deterministic low score on hard failure.
### Concurrency strategy
- **Context**: Decide between per-passage parallel calls vs. one batched call.
- **Findings**:
- Per-passage with `asyncio.gather` is simpler to align outputs and resilient — a single bad output only loses one passage's score.
- Single batched prompt requires the model to emit aligned scores (often by index); LLMs occasionally drop entries or misorder them, demanding additional validation.
- With typical `limit ≤ 10`, parallel per-passage calls hit Ollama briefly; on a 3B model this is < 5s for 10 passages.
- **Implications**: Default to per-passage `asyncio.gather`. Expose no extra concurrency knob initially (avoid premature configuration surface; YAGNI per project guidelines).
### Failure semantics
- **Context**: Required by R5 Flask must keep serving on Ollama outage, and graph search should remain functional.
- **Sources Consulted**: `backend/app/services/graphiti_adapter.py:515-517` (`_GraphNamespace.search` swallows all exceptions and logs a warning); `_get_graphiti()` runs once at first call.
- **Findings**:
- Construction of an `openai.AsyncOpenAI` client does not perform any network I/O. Therefore `OllamaReranker.__init__` can be safe at startup even when Ollama is down.
- If `rank()` itself raises, the upstream `Graphiti.search` may surface the exception. The new reranker should therefore catch its own errors and degrade to passthrough behavior in-method rather than relying on the outer `try/except` in `_GraphNamespace.search`.
- **Implications**: `OllamaReranker.rank` should never raise. On exception or unparseable output it returns the input passages in the original order with passthrough-style synthetic scores and emits a single WARNING log per failure (rate-limited by intent: one log per rank() call).
## Architecture Pattern Evaluation
| Option | Description | Strengths | Risks / Limitations | Notes |
|--------|-------------|-----------|---------------------|-------|
| A: Add class to `graphiti_adapter.py` | Define `OllamaReranker` next to `_PassthroughReranker` in the same file. | Minimal diff; single file to read. | Bloats an already-long adapter; mixes wiring with provider-specific logic. | |
| B: New `services/ollama_reranker.py` module | Dedicated module owns prompt + parse + async client; adapter only selects it. | Single-responsibility module; matches ticket suggestion; reusable in isolation. | One extra import in adapter. | **Selected.** Aligns with project pattern of one concern per `services/*` file. |
| C: Hybrid provider registry | Map `RERANKER_PROVIDER → builder` in adapter; class still in B's module. | Future providers are a one-line registry change. | Over-engineering for two providers (`ollama` + `none`). | Deferred until a third provider is needed. |
## Design Decisions
### Decision: Provider selected via env var, branch lives in `_get_graphiti()`
- **Context**: R3 requires env-driven provider selection; only two values supported by this spec (`ollama` and `none`).
- **Alternatives Considered**:
1. Function-pointer registry (Option C).
2. Inline `if/else` in the factory selecting one of two classes.
- **Selected Approach**: Inline branch in `_get_graphiti()` reads `Config.RERANKER_PROVIDER`, picks `_build_ollama_reranker()` or `_PassthroughReranker()`, validates unknown values with a `ValueError` matching the existing `_ALLOWED_GRAPHITI_PROVIDERS` convention.
- **Rationale**: Mirrors the established `GRAPHITI_LLM_PROVIDER` validation pattern (`_ALLOWED_GRAPHITI_PROVIDERS`) without adding speculative abstraction. Two values, two branches.
- **Trade-offs**: Adding a third provider later costs one more `elif`; acceptable.
- **Follow-up**: Surface the selected provider in the INFO startup log so operators can confirm.
### Decision: Per-passage scoring with `asyncio.gather`, no concurrency knob
- **Context**: R2.3 requires one score per passage in descending order; R5 requires graceful per-call failure.
- **Alternatives Considered**:
1. Single batched prompt with index-aligned output.
2. Per-passage call with bounded `Semaphore`.
- **Selected Approach**: Per-passage `asyncio.gather` with no explicit limit; rely on default `limit ≤ 10` in `_GraphNamespace.search`.
- **Rationale**: Simple, deterministic, isolates per-passage failures. Avoids premature configuration knob.
- **Trade-offs**: If a future caller asks for `limit=100`, Ollama may queue 100 requests; acceptable for now because no caller does this.
- **Follow-up**: If real-world rerank latency becomes a concern, add `RERANKER_MAX_PARALLEL` then.
### Decision: Default model = `qwen2.5:3b`
- **Context**: Need a small, broadly-available Ollama chat model that reliably emits a numeric score in 12 tokens.
- **Alternatives Considered**:
1. `qwen2.5:3b` (Apache-2.0, 3B params, strong instruction following).
2. `llama3.2:3b` (Llama community license, 3B).
3. `phi3:3.8b` (MIT, 3.8B).
- **Selected Approach**: `qwen2.5:3b`.
- **Rationale**: Matches the Qwen-family alignment of the rest of the project (`qwen-plus` is the documented LLM default). Apache-2.0 license is permissive. Small enough for typical dev machines.
- **Trade-offs**: Operators on systems without `qwen2.5:3b` must `ollama pull qwen2.5:3b` or override `RERANKER_MODEL`.
- **Follow-up**: README will document `ollama pull qwen2.5:3b` alongside the existing `ollama pull mxbai-embed-large` step.
### Decision: Defensive output parsing (`json.loads` → regex float → deterministic low score)
- **Context**: R2.6 requires deterministic handling of unparseable model responses.
- **Selected Approach**:
1. Strip `<think>...</think>` blocks (project convention from `llm_client.py:64`).
2. Strip markdown fences (project convention from `llm_client.chat_json`).
3. `json.loads` and read `score` (float in `[0, 1]`, clipped on out-of-range).
4. On JSON failure, regex-extract the first float token; clip to `[0, 1]`.
5. On total failure, assign `0.0 - 0.001 * passage_index` (deterministic and below any successfully-parsed score).
- **Rationale**: Reuses patterns already in the codebase. Keeps every passage in the output (R2.6).
- **Trade-offs**: One failed parse silently downranks a passage; logged at DEBUG (not WARNING) to avoid log spam.
## Risks & Mitigations
- **Risk**: Ollama service is not running on startup boot must not fail. **Mitigation**: Construct only `AsyncOpenAI` (no network call) during `__init__`. Defer connectivity to first `rank()`. R5.4.
- **Risk**: Model is not pulled `rank()` raises 404 from Ollama. **Mitigation**: Catch within `rank()`, log WARNING naming model + error class, return passthrough-ordered tuples so search still works. R5.1, R5.3.
- **Risk**: Operator misconfigures `RERANKER_PROVIDER` to an unknown value silent fallthrough to wrong reranker. **Mitigation**: `_get_graphiti()` raises `ValueError` listing allowed values, mirroring `_ALLOWED_GRAPHITI_PROVIDERS`. R3.5.
- **Risk**: Multiple concurrent `rank()` calls overwhelm a small local Ollama daemon. **Mitigation**: Accept default Graphiti `limit ≤ 10`; document `RERANKER_MAX_PARALLEL` as a future follow-up if needed.
## References
- `backend/app/services/graphiti_adapter.py:38-51` current passthrough reranker contract.
- `backend/app/services/graphiti_adapter.py:142-162` current `_get_graphiti()` wiring point.
- `backend/app/utils/llm_client.py` project pattern for OpenAI-SDK chat + JSON parsing + reasoning-block stripping.
- `.kiro/specs/graphiti-neo4j-finalize/research.md` historical context for why the passthrough was introduced.
- Ticket `#39` in `.ticket/39.md` feature brief and acceptance criteria.