# Database / Knowledge Graph Standards The "database" in MiroFish is **Neo4j accessed via Graphiti**, not a relational store. There is no SQL, no migrations file, no ORM. Generic relational guidance does not apply — these are the project-specific patterns. ## Architecture - **Engine**: Neo4j 5.x Community over `bolt://`. - **Graph layer**: `graphiti-core` ≥ 0.3 — handles node/edge writes, embeddings, hybrid search, reranking. - **Adapter**: `backend/app/services/graphiti_adapter.py` is the **only** module that imports `graphiti_core` directly. Every other module talks to the graph through this adapter. The adapter exposes a Zep-Cloud-shaped namespace (`client.graph.add_episode(...)`, `client.graph.search(...)`, etc.) so legacy `zep_*` services kept their existing call sites after the migration. New code should use the same surface — do not introduce a parallel API. ## Core Rule: `group_id` Isolation **Every read or write to the graph must be scoped by the project's `group_id`.** The graph is multi-tenant by construction; cross-project access is not permitted and is grounds for rejecting a change in review. - A project's `group_id` lives on its `Project` model and never changes after creation. - When constructing search filters, episode adds, or node/edge fetches, always pass `group_id=project.group_id` (or the equivalent `group_ids=[...]`). - If you need data spanning projects (e.g. an admin view), aggregate per-project at the API layer; do not query the graph without a `group_id` filter. ## Adapter Patterns That Must Stay Intact These are non-obvious and break subtly when violated: - **Single Graphiti singleton.** `_get_graphiti()` lazily constructs one `Graphiti` instance for the whole process. Do not instantiate `Graphiti` in services or tests. - **Persistent event loop in a dedicated thread.** All async graph calls are dispatched through `_run(coro)` onto a single background event loop (see `graphiti-event-loop` thread). The Neo4j async driver is bound to whichever loop opened it; crossing loops corrupts the driver state. Never call `asyncio.run(...)` on a Graphiti coroutine, and never schedule one on a request thread's loop. - **Indices and constraints on first init.** `build_indices_and_constraints()` runs once when the singleton is created. New required indexes go through Graphiti's mechanisms, not raw Cypher in services. ## What Belongs in the Graph - **Entities** — Domain objects extracted by the ontology generator (people, organizations, concepts, events, etc.). - **Edges** — Relationships between entities, typed per the project's generated ontology. - **Episodes** — The raw text/units the entities were derived from; Graphiti owns chunking and embedding. What does **not** belong in the graph: - Project / task metadata (lives in in-memory `ProjectManager` and `TaskManager`). - Simulation state (owned by OASIS subprocesses). - User-uploaded files (filesystem only — paths, not contents, are passed through the API). ## Schema & Ontology - Ontology (entity types + edge types) is **generated per project** by the LLM in step 1, stored on the `Project` model, and used to constrain extraction during graph build. - There is no global, hand-maintained schema file. Don't add one — the ontology is intentionally per-project. - Reasoning-model outputs from ontology generation are stripped of `` blocks and code fences before JSON parsing (see `tech.md`'s "reasoning-model output stripping" decision). ## Embeddings - `EMBEDDING_MODEL` is configurable per provider: - OpenAI default: `text-embedding-3-small` - Gemini: `text-embedding-004` / `gemini-embedding-001` - Embedding model selection lives in `config.py`. Don't hard-code it in services. - Switching embedding model **invalidates existing project graphs** — document this if you add an option that changes the default. ## Query Patterns - Read via the adapter's search methods (hybrid RRF recipes are wired in `graphiti_adapter.py`); avoid raw Cypher in feature code. - If a feature genuinely requires raw Cypher, add it as a method on the adapter, scoped by `group_id`, with a comment explaining why Graphiti's API is insufficient. - Pagination over Graphiti results uses `utils/zep_paging.py` (legacy name, still applicable). ## Startup Recovery `_recover_stuck_projects` runs on app boot and promotes any project left in `GRAPH_BUILDING` to `GRAPH_COMPLETED` if the graph already has that project's nodes — handling the case where the original task was killed by a restart. **Any new long-running graph operation must either:** 1. Be safe to re-run from the start, OR 2. Add an analogous recovery path so a restart mid-task doesn't strand the project. ## Backups Graph data is treated as **regenerable from seed material**, not as durable user data — there is no project-managed backup/restore. If a deployment requires durability, that's an operator concern (Neo4j backups), not a feature-code one. --- _Focus on patterns and decisions. No environment-specific settings._