Error Handling Standards
Most errors in MiroFish originate from LLM calls, graph operations, subprocess simulation, or user-uploaded files — not classical 4xx/5xx web flows. These standards target those failure modes specifically.
Philosophy
- Fail fast in services; convert to a stable response envelope at the API layer.
- Long-running tasks must always reach a terminal state (COMPLETED or FAILED); a stuck PROCESSING task is a bug.
- LLM responses are untrusted by default: validate, strip, parse, then use.
- Background-thread errors are silent unless explicitly captured; always wrap the work in try/except.
Error Surfaces (where they appear, where they're handled)
| Surface | Handle in | Convert to |
|---|---|---|
| HTTP request errors | api/ handler try/except + envelope | {"success": false, "error": …} |
| Background task | Worker thread try/except → fail_task() | Task.status = FAILED + error |
| LLM call failures | retry_with_backoff decorator | Exception bubbles after retries |
| Graph adapter errors | Caller catches & maps | Service-specific error or Task.fail |
| Simulation IPC | simulation_ipc.py catches & logs | Task fail or simulation cleanup |
| File parsing | utils/file_parser.py | Raised as ValueError to caller |
A handler should never let an exception reach Flask's default 500 formatter — wrap and return the canonical envelope instead.
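The handler rule above can be sketched as a small decorator. This is a minimal, framework-agnostic sketch under stated assumptions: api_envelope, the (body, status) return shape, the "data" key, and the 400/500 mapping are hypothetical and are not taken from the MiroFish codebase. In a real Flask handler the body would be passed through jsonify, and the logger would come from utils/logger.get_logger.

```python
import functools
import logging

# Stand-in logger so the sketch is self-contained; the project uses utils/logger.get_logger.
logger = logging.getLogger("mirofish.example.api")


def api_envelope(handler):
    """Hypothetical decorator: convert any exception into the canonical envelope."""

    @functools.wraps(handler)
    def wrapper(*args, **kwargs):
        try:
            data = handler(*args, **kwargs)
            return {"success": True, "data": data}, 200
        except ValueError as exc:
            # Expected input problems (e.g. an unparseable upload): message is safe to show.
            logger.warning("bad request in %s: %s", handler.__name__, exc)
            return {"success": False, "error": str(exc)}, 400
        except Exception:
            # Unexpected failure: the traceback goes to the log, never to the client.
            logger.exception("unhandled error in %s", handler.__name__)
            return {"success": False, "error": "Internal error"}, 500

    return wrapper


@api_envelope
def upload_document(filename):
    """Toy handler body used only to exercise the decorator."""
    if not filename.endswith(".pdf"):
        raise ValueError("unsupported file type")
    return {"filename": filename}
```

Calling upload_document("notes.txt") returns the failure envelope with status 400 rather than letting the ValueError reach the framework's default error page.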
LLM-Specific Failure Modes
These are recurring and worth handling explicitly:
1. Reasoning-model output contamination
Some providers (MiniMax, GLM, certain Qwen variants) emit <think>… </think> blocks and/or markdown code fences (```json ... ```)
around JSON output.
Rule: Strip both before json.loads(...). See commit 985f89f for the original fix. Any new LLM-output JSON parser must do the same; do not call json.loads on raw model output.
2. Transient API errors
Network blips, rate limits, intermittent 5xx from the provider.
Rule: Use utils/retry.py:
from app.utils.retry import retry_with_backoff
@retry_with_backoff(max_retries=3, exceptions=(SomeAPIError,))
def call_llm(...): ...
- Sync version: retry_with_backoff
- Async version: retry_with_backoff_async
- For batch processing where partial failure is acceptable, use RetryableAPIClient.call_batch_with_retry(items, fn, continue_on_failure=True).
Don't write a hand-rolled retry loop — it'll drift from the project's backoff/jitter conventions.
3. Schema mismatch in structured output
The LLM returns valid JSON, but with missing or extra fields.
Rule: Validate with Pydantic v2 models where the call expects structure. Fail loudly (raise) rather than silently coercing — better to retry the LLM call than to feed bad data downstream.
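As an illustration of failing loudly, here is a minimal sketch using Pydantic v2. The EntitySummary model, its fields, and parse_entity_summary are hypothetical; only the Pydantic calls themselves (ConfigDict, model_validate_json, ValidationError) are real library API.

```python
from pydantic import BaseModel, ConfigDict, ValidationError


class EntitySummary(BaseModel):
    """Hypothetical schema for one structured LLM response."""

    # strict=True disables type coercion; extra="forbid" rejects unknown fields.
    model_config = ConfigDict(strict=True, extra="forbid")

    name: str
    mention_count: int


def parse_entity_summary(cleaned_json: str) -> EntitySummary:
    """Validate already-cleaned LLM output; raises ValidationError on any mismatch."""
    return EntitySummary.model_validate_json(cleaned_json)
```

A caller would catch ValidationError at the point where it can retry the LLM call or fail the task, instead of patching the payload into shape.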
Background Task Errors
Inside a worker thread spawned from an API handler:
def _worker(task_id, project_id, ...):
    try:
        # work
        TaskManager().update_task(task_id, progress=50, message=...)
        result = do_real_work(...)
        TaskManager().complete_task(task_id, result)
    except Exception as e:
        logger.exception(f"task {task_id} failed")
        TaskManager().fail_task(task_id, str(e))
Rules:
- The outer except must be broad (Exception): the goal is "task always terminates," not "narrow down failures here."
- Log the full traceback (logger.exception), then store a concise str(e) on the task for the frontend to display.
- Never re-raise from the worker; the thread has no caller.
- Update related Project state (e.g. revert GRAPH_BUILDING → previous status) inside the except, before fail_task.
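The rules above, including the project-state revert, can be sketched end to end. This is a self-contained illustration with in-memory stand-ins: TASKS, PROJECTS, the free functions, and the status names CREATED and GRAPH_COMPLETED are hypothetical; the real code goes through TaskManager and the project store.

```python
import logging
import threading

logger = logging.getLogger("mirofish.example.worker")

# In-memory stand-ins for the real TaskManager and Project storage (hypothetical).
TASKS = {}
PROJECTS = {}


def complete_task(task_id, result):
    TASKS[task_id] = {"status": "COMPLETED", "result": result}


def fail_task(task_id, error):
    TASKS[task_id] = {"status": "FAILED", "error": error}


def build_graph_worker(task_id, project_id, previous_status, work):
    try:
        result = work()
        PROJECTS[project_id] = "GRAPH_COMPLETED"  # hypothetical status name
        complete_task(task_id, result)
    except Exception as exc:
        # Full traceback goes to the log; the task only stores a short message.
        logger.exception("task %s failed", task_id)
        # Revert the project first so it is never left in GRAPH_BUILDING.
        PROJECTS[project_id] = previous_status
        fail_task(task_id, str(exc))
        # No re-raise: the thread has no caller to receive it.


def start_build(task_id, project_id, work):
    previous_status = PROJECTS[project_id]
    PROJECTS[project_id] = "GRAPH_BUILDING"
    TASKS[task_id] = {"status": "PROCESSING"}
    thread = threading.Thread(
        target=build_graph_worker,
        args=(task_id, project_id, previous_status, work),
        daemon=True,
    )
    thread.start()
    return thread
```

Capturing previous_status before spawning the thread is the design choice that makes the revert possible: the worker never has to guess what state to restore.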
Graph & Subprocess Errors
- Graphiti / Neo4j errors: caller decides — usually fail the task with a user-friendly message; for non-fatal search failures, log and return empty results.
- OASIS subprocess crashes: simulation_ipc.py is the single surface. It owns lifecycle, logging, and signaling task failure. Don't catch subprocess errors elsewhere.
- Startup recovery: _recover_stuck_projects re-classifies projects left GRAPH_BUILDING after a restart; see database.md.
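For the non-fatal search case mentioned in the first bullet, a minimal sketch of "log and return empty results" might look like this. safe_search and its search_fn parameter are hypothetical names; the real graph adapter's interface is not shown in this document.

```python
import logging

logger = logging.getLogger("mirofish.example.graph")


def safe_search(search_fn, query):
    """Hypothetical wrapper: a failed search degrades to an empty result list."""
    try:
        return list(search_fn(query))
    except Exception as exc:
        # WARNING rather than ERROR: the caller recovers and the task continues.
        logger.warning("graph search failed, returning no results: %s", exc)
        return []
```

This pattern only suits read paths where an empty result is an acceptable answer; graph writes and builds should fail the task instead.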
Logging
- Use utils/logger.get_logger('mirofish.<module>'); never use print or logging.getLogger directly.
- Levels:
  - ERROR: task failure, unrecoverable exception
  - WARNING: retry triggered, transient failure, recovered state
  - INFO: task lifecycle (created, completed), pipeline milestones
  - DEBUG: payload shapes, intermediate counts, off by default
- User-visible log messages should go through utils/locale.t(...) so they translate; internal diagnostic logs stay in the file's existing language (English or Chinese; match the surrounding code).
- Never log: API keys, full LLM prompts containing user-uploaded text (truncate or hash), Neo4j credentials, full .env contents.
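One way to follow the "truncate or hash" rule for prompts is to log a short fingerprint instead of the text. The helper below is a hypothetical sketch, not an existing MiroFish utility; describe_prompt and its preview_chars parameter are assumed names.

```python
import hashlib


def describe_prompt(prompt: str, preview_chars: int = 0) -> str:
    """Hypothetical helper: a log-safe summary of an LLM prompt."""
    # A short hash prefix lets two log lines be matched without exposing the text.
    digest = hashlib.sha256(prompt.encode("utf-8")).hexdigest()[:12]
    summary = f"prompt len={len(prompt)} sha256={digest}"
    if preview_chars > 0:
        # Optional truncated preview; keep it short enough to exclude uploaded text.
        preview = prompt[:preview_chars].replace("\n", " ")
        summary += f" preview={preview!r}"
    return summary
```

A call such as logger.debug("LLM call: %s", describe_prompt(prompt)) then records the size and identity of the prompt without its contents.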
What Not to Do
- Don't catch Exception inside an API handler just to log and continue; fail the request and return the envelope.
- Don't retry non-idempotent work (e.g. graph writes that may have partially completed).
- Don't translate exceptions into success: true responses with an embedded error message; use success: false.
- Don't surface raw stack traces or LLM internals to the frontend.