5.0 KiB

Raw Blame History

Database / Knowledge Graph Standards

The "database" in MiroFish is Neo4j accessed via Graphiti, not a relational store. There is no SQL, no migrations file, no ORM. Generic relational guidance does not apply — these are the project-specific patterns.

Architecture

Engine: Neo4j 5.x Community over bolt://.
Graph layer: graphiti-core ≥ 0.3 — handles node/edge writes, embeddings, hybrid search, reranking.
Adapter: backend/app/services/graphiti_adapter.py is the only module that imports graphiti_core directly. Every other module talks to the graph through this adapter.

The adapter exposes a Zep-Cloud-shaped namespace (client.graph.add_episode(...), client.graph.search(...), etc.) so legacy zep_* services kept their existing call sites after the migration. New code should use the same surface — do not introduce a parallel API.

Core Rule: `group_id` Isolation

Every read or write to the graph must be scoped by the project's group_id. The graph is multi-tenant by construction; cross-project access is not permitted and is grounds for rejecting a change in review.

A project's group_id lives on its Project model and never changes after creation.
When constructing search filters, episode adds, or node/edge fetches, always pass group_id=project.group_id (or the equivalent group_ids=[...]).
If you need data spanning projects (e.g. an admin view), aggregate per-project at the API layer; do not query the graph without a group_id filter.

Adapter Patterns That Must Stay Intact

These are non-obvious and break subtly when violated:

Single Graphiti singleton. _get_graphiti() lazily constructs one Graphiti instance for the whole process. Do not instantiate Graphiti in services or tests.
Persistent event loop in a dedicated thread. All async graph calls are dispatched through _run(coro) onto a single background event loop (see graphiti-event-loop thread). The Neo4j async driver is bound to whichever loop opened it; crossing loops corrupts the driver state. Never call asyncio.run(...) on a Graphiti coroutine, and never schedule one on a request thread's loop.
Indices and constraints on first init. build_indices_and_constraints() runs once when the singleton is created. New required indexes go through Graphiti's mechanisms, not raw Cypher in services.

What Belongs in the Graph

Entities — Domain objects extracted by the ontology generator (people, organizations, concepts, events, etc.).
Edges — Relationships between entities, typed per the project's generated ontology.
Episodes — The raw text/units the entities were derived from; Graphiti owns chunking and embedding.

What does not belong in the graph:

Project / task metadata (lives in in-memory ProjectManager and TaskManager).
Simulation state (owned by OASIS subprocesses).
User-uploaded files (filesystem only — paths, not contents, are passed through the API).

Schema & Ontology

Ontology (entity types + edge types) is generated per project by the LLM in step 1, stored on the Project model, and used to constrain extraction during graph build.
There is no global, hand-maintained schema file. Don't add one — the ontology is intentionally per-project.
Reasoning-model outputs from ontology generation are stripped of <think> blocks and code fences before JSON parsing (see tech.md's "reasoning-model output stripping" decision).

Embeddings

EMBEDDING_MODEL is configurable per provider:
- OpenAI default: text-embedding-3-small
- Gemini: text-embedding-004 / gemini-embedding-001
Embedding model selection lives in config.py. Don't hard-code it in services.
Switching embedding model invalidates existing project graphs — document this if you add an option that changes the default.

Query Patterns

Read via the adapter's search methods (hybrid RRF recipes are wired in graphiti_adapter.py); avoid raw Cypher in feature code.
If a feature genuinely requires raw Cypher, add it as a method on the adapter, scoped by group_id, with a comment explaining why Graphiti's API is insufficient.
Pagination over Graphiti results uses utils/zep_paging.py (legacy name, still applicable).

Startup Recovery

_recover_stuck_projects runs on app boot and promotes any project left in GRAPH_BUILDING to GRAPH_COMPLETED if the graph already has that project's nodes — handling the case where the original task was killed by a restart. Any new long-running graph operation must either:

Be safe to re-run from the start, OR
Add an analogous recovery path so a restart mid-task doesn't strand the project.

Backups

Graph data is treated as regenerable from seed material, not as durable user data — there is no project-managed backup/restore. If a deployment requires durability, that's an operator concern (Neo4j backups), not a feature-code one.

Focus on patterns and decisions. No environment-specific settings.

5.0 KiB Raw Blame History