# Database / Knowledge Graph Standards
The "database" in MiroFish is **Neo4j accessed via Graphiti**, not a
relational store. There is no SQL, no migrations file, no ORM. Generic
relational guidance does not apply — these are the project-specific
patterns.
## Architecture
- **Engine**: Neo4j 5.x Community over `bolt://`.
- **Graph layer**: `graphiti-core` ≥ 0.3 — handles node/edge writes,
embeddings, hybrid search, reranking.
- **Adapter**: `backend/app/services/graphiti_adapter.py` is the **only**
module that imports `graphiti_core` directly. Every other module talks
to the graph through this adapter.

The adapter exposes a Zep-Cloud-shaped namespace
(`client.graph.add_episode(...)`, `client.graph.search(...)`, etc.) so
legacy `zep_*` services kept their existing call sites after the
migration. New code should use the same surface — do not introduce a
parallel API.
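
A minimal sketch of what that Zep-shaped surface looks like from a call site. `AdapterClient`, `_GraphNamespace`, and the keyword names are hypothetical, and an in-memory list stands in for Graphiti and Neo4j; only the `client.graph.add_episode(...)` / `client.graph.search(...)` shape is taken from the description above.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class _GraphNamespace:
    """Hypothetical stand-in for the adapter's `graph` namespace."""

    # In-memory store; the real adapter delegates to Graphiti instead.
    _episodes: List[Dict[str, Any]] = field(default_factory=list)

    def add_episode(self, *, group_id: str, name: str, body: str) -> None:
        self._episodes.append({"group_id": group_id, "name": name, "body": body})

    def search(self, *, group_id: str, query: str) -> List[Dict[str, Any]]:
        # Naive substring match, only to keep the sketch runnable.
        needle = query.lower()
        return [
            e for e in self._episodes
            if e["group_id"] == group_id and needle in e["body"].lower()
        ]


class AdapterClient:
    """Hypothetical client exposing a single `graph` namespace."""

    def __init__(self) -> None:
        self.graph = _GraphNamespace()


client = AdapterClient()
client.graph.add_episode(group_id="proj-a", name="doc-1", body="Alice founded Acme.")
client.graph.add_episode(group_id="proj-b", name="doc-2", body="Alice joined Initech.")
hits = client.graph.search(group_id="proj-a", query="alice")
```

Keeping every operation behind one `graph` attribute is what lets legacy `zep_*` call sites and new code share the same surface.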
## Core Rule: `group_id` Isolation
**Every read or write to the graph must be scoped by the project's
`group_id`.** The graph is multi-tenant by construction; cross-project
access is not permitted and is grounds for rejecting a change in review.
- A project's `group_id` lives on its `Project` model and never changes
after creation.
- When constructing search filters, episode adds, or node/edge fetches,
always pass `group_id=project.group_id` (or the equivalent
`group_ids=[...]`).
- If you need data spanning projects (e.g. an admin view), aggregate
per-project at the API layer; do not query the graph without a
`group_id` filter.
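
One way to make the rule hard to forget is to build graph-call arguments through a small guard. This is a sketch under assumptions: `Project` here is a pared-down stand-in for the real model, and `scoped_kwargs`, `count_per_project`, and the `count_nodes` callable are hypothetical helpers, not part of the adapter.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Iterable


@dataclass(frozen=True)
class Project:
    """Hypothetical minimal project model; only `group_id` matters here."""

    project_id: str
    group_id: str


def scoped_kwargs(project: Project, **kwargs: Any) -> Dict[str, Any]:
    """Return kwargs for a graph call with the project's group_id enforced."""
    if not project.group_id:
        raise ValueError("project has no group_id; refusing unscoped graph access")
    if "group_id" in kwargs and kwargs["group_id"] != project.group_id:
        raise ValueError("group_id does not match the project's group_id")
    return {**kwargs, "group_id": project.group_id}


def count_per_project(
    projects: Iterable[Project], count_nodes: Callable[..., int]
) -> Dict[str, int]:
    """Admin-style aggregate: one scoped call per project, merged afterwards."""
    return {p.project_id: count_nodes(**scoped_kwargs(p)) for p in projects}
```

The aggregate helper issues one scoped call per project and merges the results in plain Python, so no single graph query ever spans tenants.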
## Adapter Patterns That Must Stay Intact
These are non-obvious and break subtly when violated:
- **Single Graphiti singleton.** `_get_graphiti()` lazily constructs one
`Graphiti` instance for the whole process. Do not instantiate
`Graphiti` in services or tests.
- **Persistent event loop in a dedicated thread.** All async graph calls
are dispatched through `_run(coro)` onto a single background event
loop (the `graphiti-event-loop` thread). The Neo4j async driver is
bound to whichever loop opened it; crossing loops corrupts the driver
state. Never call `asyncio.run(...)` on a Graphiti coroutine, and
never schedule one on a request thread's loop.
- **Indices and constraints on first init.** `build_indices_and_constraints()`
runs once when the singleton is created. New required indexes go
through Graphiti's mechanisms, not raw Cypher in services.
## What Belongs in the Graph
- **Entities** — Domain objects extracted by the ontology generator
(people, organizations, concepts, events, etc.).
- **Edges** — Relationships between entities, typed per the project's
generated ontology.
- **Episodes** — The raw text/units the entities were derived from;
Graphiti owns chunking and embedding.

What does **not** belong in the graph:
- Project / task metadata (lives in the in-memory `ProjectManager` and
`TaskManager`).
- Simulation state (owned by OASIS subprocesses).
- User-uploaded files (filesystem only — paths, not contents, are
passed through the API).
## Schema & Ontology
- Ontology (entity types + edge types) is **generated per project** by
the LLM in step 1, stored on the `Project` model, and used to
constrain extraction during graph build.
- There is no global, hand-maintained schema file. Don't add one — the
ontology is intentionally per-project.
- Reasoning-model outputs from ontology generation are stripped of
`<think>` blocks and code fences before JSON parsing (see
`tech.md`'s "reasoning-model output stripping" decision).
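
A minimal sketch of the stripping step, assuming the model wraps its reasoning in `<think>...</think>` and may fence the JSON. `parse_llm_json` and the `entity_types` / `edge_types` keys are hypothetical; the project's actual helper and ontology shape may differ.

```python
import json
import re
from typing import Any

# Built programmatically so the example can sit inside a Markdown fence.
_FENCE = "`" * 3

_THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL | re.IGNORECASE)
_FENCE_RE = re.compile(
    r"^" + _FENCE + r"[\w-]*[ \t]*\n(.*?)\n?" + _FENCE + r"\s*$", re.DOTALL
)


def parse_llm_json(raw: str) -> Any:
    """Drop reasoning blocks and an optional code fence, then parse JSON."""
    text = _THINK_RE.sub("", raw).strip()
    fenced = _FENCE_RE.match(text)
    if fenced:
        text = fenced.group(1).strip()
    return json.loads(text)


raw = (
    "<think>Entities first, then edges.</think>\n"
    + _FENCE + "json\n"
    + '{"entity_types": ["Person"], "edge_types": ["WORKS_AT"]}\n'
    + _FENCE
)
ontology = parse_llm_json(raw)
```

Stripping happens before `json.loads` so that a malformed payload still fails loudly instead of being silently repaired.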
## Embeddings
- `EMBEDDING_MODEL` is configurable per provider:
  - OpenAI default: `text-embedding-3-small`
  - Gemini: `text-embedding-004` / `gemini-embedding-001`
- Embedding model selection lives in `config.py`. Don't hard-code it in
services.
- Switching the embedding model **invalidates existing project graphs**;
document this if you add an option that changes the default.
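
A sketch of how centralised selection might look. Reading `EMBEDDING_MODEL` from the environment, choosing `text-embedding-004` as the Gemini default, and the helper names are all assumptions for illustration; the real logic lives in `config.py`.

```python
import os
from typing import Mapping, Optional

# Assumed defaults; which Gemini model is the default is a guess.
_DEFAULT_EMBEDDING_MODELS = {
    "openai": "text-embedding-3-small",
    "gemini": "text-embedding-004",
}


def resolve_embedding_model(
    provider: str, env: Optional[Mapping[str, str]] = None
) -> str:
    """Prefer an explicit EMBEDDING_MODEL, else the provider's default."""
    env = os.environ if env is None else env
    override = env.get("EMBEDDING_MODEL", "").strip()
    if override:
        return override
    try:
        return _DEFAULT_EMBEDDING_MODELS[provider.lower()]
    except KeyError:
        raise ValueError(
            f"no default embedding model for provider {provider!r}"
        ) from None


def graph_needs_rebuild(built_with: str, current: str) -> bool:
    """Vectors produced by different embedding models are not comparable."""
    return built_with != current
```

Recording the model a graph was built with (hypothetically, on the project record) would let a check like `graph_needs_rebuild` flag stale graphs after a default changes.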
## Query Patterns
- Read via the adapter's search methods (hybrid RRF recipes are wired
in `graphiti_adapter.py`); avoid raw Cypher in feature code.
- If a feature genuinely requires raw Cypher, add it as a method on the
adapter, scoped by `group_id`, with a comment explaining why
Graphiti's API is insufficient.
- Pagination over Graphiti results uses `utils/zep_paging.py` (legacy
name, still applicable).
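
If raw Cypher is unavoidable, the adapter method might look like the sketch below. The function name, the injected `execute` callable, and the query are hypothetical, and the query assumes nodes carry a `group_id` property; the point is only that the filter is baked into the query and passed as a parameter rather than interpolated.

```python
from typing import Any, Callable, Dict, List

# Why raw Cypher (hypothetical justification): a per-label node count is an
# aggregate that the search-oriented API does not return.
_COUNT_BY_LABEL = (
    "MATCH (n) WHERE n.group_id = $group_id "
    "RETURN labels(n) AS labels, count(n) AS total"
)

Executor = Callable[[str, Dict[str, Any]], List[Dict[str, Any]]]


def count_nodes_by_label(execute: Executor, group_id: str) -> List[Dict[str, Any]]:
    """Run the scoped aggregate; refuse to run without a group_id."""
    if not group_id:
        raise ValueError("group_id is required for every graph query")
    return execute(_COUNT_BY_LABEL, {"group_id": group_id})
```

Injecting `execute` keeps the sketch testable without a database; in the adapter it would be whatever routine runs a query on the shared background loop.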
## Startup Recovery
`_recover_stuck_projects` runs on app boot and promotes any project
left in `GRAPH_BUILDING` to `GRAPH_COMPLETED` if the graph already has
that project's nodes — handling the case where the original task was
killed by a restart. **Any new long-running graph operation must
either:**
1. Be safe to re-run from the start, OR
2. Add an analogous recovery path so a restart mid-task doesn't strand
the project.
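
The existing recovery path can be sketched as follows. The status names come from the text above; the `Project` dataclass, the enum values, and the `has_nodes` callable are hypothetical stand-ins, and the sketch leaves projects without nodes untouched because the source does not say how they are handled.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Iterable, List


class ProjectStatus(str, Enum):
    GRAPH_BUILDING = "graph_building"
    GRAPH_COMPLETED = "graph_completed"


@dataclass
class Project:
    """Hypothetical minimal project record."""

    project_id: str
    group_id: str
    status: ProjectStatus


def recover_stuck_projects(
    projects: Iterable[Project], has_nodes: Callable[[str], bool]
) -> List[str]:
    """Promote projects stuck in GRAPH_BUILDING whose graph already has nodes."""
    recovered = []
    for project in projects:
        if project.status is not ProjectStatus.GRAPH_BUILDING:
            continue
        # The existence check is scoped by the project's own group_id.
        if has_nodes(project.group_id):
            project.status = ProjectStatus.GRAPH_COMPLETED
            recovered.append(project.project_id)
    return recovered
```

A new long-running operation that is not safe to re-run would need its own status pair and an equivalent check at boot.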
## Backups
Graph data is treated as **regenerable from seed material**, not as
durable user data — there is no project-managed backup/restore. If a
deployment requires durability, that's an operator concern (Neo4j
backups), not a feature-code one.

---
_Focus on patterns and decisions. No environment-specific settings._