
Error Handling Standards

Most errors in MiroFish originate from LLM calls, graph operations, subprocess simulation, or user-uploaded files — not classical 4xx/5xx web flows. These standards target those failure modes specifically.

Philosophy

Fail fast in services; convert to a stable response envelope at the API layer.
Long-running tasks must always reach a terminal state (COMPLETED or FAILED) — a stuck PROCESSING task is a bug.
LLM responses are untrusted by default: validate, strip, parse, then use.
Background-thread errors are silent unless explicitly captured — always wrap the work in try/except.

Error Surfaces (where they appear, where they're handled)

Surface	Handle in	Convert to
HTTP request errors	`api/` handler `try/except` + envelope	`{"success": false, "error": …}`
Background task	Worker thread `try/except` → `fail_task()`	`Task.status = FAILED` + `error`
LLM call failures	`retry_with_backoff` decorator	Exception bubbles after retries
Graph adapter errors	Caller catches & maps	Service-specific error or `Task.fail`
Simulation IPC	`simulation_ipc.py` catches & logs	Task fail or simulation cleanup
File parsing	`utils/file_parser.py`	Raised as `ValueError` to caller

A handler should never let an exception reach Flask's default 500 formatter — wrap and return the canonical envelope instead.
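A minimal sketch of that wrapping rule, written as a decorator so each handler does not repeat the same try/except. The name `with_envelope`, the `data` key on success, and the `ValueError` to 400 mapping are assumptions for illustration, not MiroFish's actual helper. The `(body, status)` tuple is a return shape that recent Flask versions accept from a view function.

```python
import functools
import logging

# Stand-in logger; project code would go through utils/logger.get_logger.
logger = logging.getLogger("envelope_sketch")


def with_envelope(handler):
    """Wrap a handler so every outcome is returned in the response envelope."""

    @functools.wraps(handler)
    def wrapper(*args, **kwargs):
        try:
            data = handler(*args, **kwargs)
            # The "data" key is an assumed success shape.
            return {"success": True, "data": data}, 200
        except ValueError as exc:
            # Assumed mapping: bad input (e.g. an unparseable upload) is a client error.
            return {"success": False, "error": str(exc)}, 400
        except Exception:
            # Full traceback goes to the log; the client sees a generic message.
            logger.exception("unhandled error in %s", handler.__name__)
            return {"success": False, "error": "Internal error"}, 500

    return wrapper


@with_envelope
def upload_file(name):
    """Hypothetical handler body that rejects anything but PDFs."""
    if not name.endswith(".pdf"):
        raise ValueError("unsupported file type")
    return {"filename": name}
```

The broad `except` here fails the request rather than continuing, and the generic 500 message keeps stack traces away from the frontend.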

LLM-Specific Failure Modes

These are recurring and worth handling explicitly:

1. Reasoning-model output contamination

Some providers (MiniMax, GLM, certain Qwen variants) emit <think>… </think> blocks and/or markdown code fences (```json ... ```) around JSON output.

Rule: Strip both before calling json.loads(...). Commit 985f89f contains the original fix, for context. Any new parser for LLM-output JSON must do the same; never call json.loads on raw model output.
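As a rough illustration of the strip-then-parse order, the sketch below removes reasoning blocks first, then unwraps a code fence if one is present. It is not the code from the referenced commit; the function name `parse_llm_json` and both regular expressions are assumptions.

```python
import json
import re

# Matches a complete <think>...</think> block, including newlines inside it.
_THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL | re.IGNORECASE)

# The fence marker is built programmatically so this example holds no literal fence.
_FENCE = "`" * 3
# Matches a fenced block (optionally tagged "json") and captures its body.
_FENCE_RE = re.compile(
    _FENCE + r"(?:json)?\s*(.*?)\s*" + _FENCE, re.DOTALL | re.IGNORECASE
)


def parse_llm_json(raw: str):
    """Strip reasoning blocks and code fences, then parse the remainder as JSON."""
    text = _THINK_RE.sub("", raw).strip()
    fenced = _FENCE_RE.search(text)
    if fenced:
        text = fenced.group(1)
    # Raises json.JSONDecodeError if the cleaned text is still not valid JSON.
    return json.loads(text)
```

Letting `json.JSONDecodeError` propagate keeps the fail-fast behaviour: the caller can retry the LLM call instead of working with a partial parse.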

2. Transient API errors

Network blips, rate limits, and intermittent 5xx responses from the provider.

Rule: Use utils/retry.py:

from app.utils.retry import retry_with_backoff

@retry_with_backoff(max_retries=3, exceptions=(SomeAPIError,))
def call_llm(...): ...

Sync version: retry_with_backoff
Async version: retry_with_backoff_async
For batch processing where partial failure is acceptable, use RetryableAPIClient.call_batch_with_retry(items, fn, continue_on_failure=True).

Don't write a hand-rolled retry loop — it'll drift from the project's backoff/jitter conventions.

3. Schema mismatch in structured output

The LLM returns valid JSON, but with missing or extra fields.

Rule: Validate with Pydantic v2 models where the call expects structure. Fail loudly (raise) rather than silently coercing — better to retry the LLM call than to feed bad data downstream.
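One way to get the fail-loudly behaviour with Pydantic v2 is to forbid extra fields and enable strict mode, so a string such as "0.9" is rejected rather than coerced to a float. The model `EntityExtraction` and its fields are hypothetical; only the Pydantic calls (`ConfigDict`, `model_validate`, `ValidationError`) are real API.

```python
from pydantic import BaseModel, ConfigDict, ValidationError


class EntityExtraction(BaseModel):
    """Hypothetical shape for one structured LLM reply."""

    # extra="forbid" rejects unexpected fields; strict=True disables type coercion.
    model_config = ConfigDict(extra="forbid", strict=True)

    name: str
    entity_type: str
    confidence: float


def validate_extraction(payload: dict) -> EntityExtraction:
    """Return a validated model, or raise ValidationError so the caller can retry."""
    return EntityExtraction.model_validate(payload)
```

A caller would typically catch `ValidationError` at the point where it can re-issue the LLM call, not inside the validator itself.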

Background Task Errors

Inside a worker thread spawned from an API handler:

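def _worker(task_id, project_id, ...):
    try:
        # Report progress while the work runs.
        TaskManager().update_task(task_id, progress=50, message=...)
        result = do_real_work(...)
        TaskManager().complete_task(task_id, result)
    except Exception as e:
        # Full traceback goes to the log; the task stores only a concise message.
        logger.exception("task %s failed", task_id)
        # Revert related Project state here, before fail_task.
        TaskManager().fail_task(task_id, str(e))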

Rules:

The outer except must be broad (Exception) — the goal is "task always terminates," not "narrow down failures here."
Log the full traceback (logger.exception), then store a concise str(e) on the task for the frontend to display.
Never re-raise from the worker; the thread has no caller.
Update related Project state (e.g. revert GRAPH_BUILDING → previous status) inside the except, before fail_task.
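The ordering in the last rule can be sketched with in-memory stand-ins. Everything here is hypothetical except the `GRAPH_BUILDING` status: the `TASKS` and `PROJECTS` dictionaries replace the real TaskManager and project store, and `GRAPH_COMPLETED` is an invented success status.

```python
import logging
import threading

# Stand-in logger; project code would go through utils/logger.get_logger.
logger = logging.getLogger("worker_sketch")

# In-memory stand-ins for the real task and project stores (hypothetical).
TASKS = {}
PROJECTS = {}


def fail_task(task_id, error):
    TASKS[task_id] = {"status": "FAILED", "error": error}


def complete_task(task_id, result):
    TASKS[task_id] = {"status": "COMPLETED", "result": result}


def build_graph_worker(task_id, project_id, build_fn):
    """Run build_fn and guarantee the task ends COMPLETED or FAILED."""
    previous_status = PROJECTS[project_id]
    PROJECTS[project_id] = "GRAPH_BUILDING"
    try:
        result = build_fn()
        PROJECTS[project_id] = "GRAPH_COMPLETED"  # invented status name
        complete_task(task_id, result)
    except Exception as exc:
        logger.exception("task %s failed", task_id)
        # Revert the project first, so it is never left in GRAPH_BUILDING.
        PROJECTS[project_id] = previous_status
        fail_task(task_id, str(exc))


def run_in_background(task_id, project_id, build_fn):
    """Start the worker on a daemon thread and return the thread."""
    thread = threading.Thread(
        target=build_graph_worker,
        args=(task_id, project_id, build_fn),
        daemon=True,
    )
    thread.start()
    return thread
```

Reverting the project before marking the task failed means a frontend that polls the task and then reads the project never sees a FAILED task paired with a project still in GRAPH_BUILDING.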

Graph & Subprocess Errors

Graphiti / Neo4j errors: caller decides — usually fail the task with a user-friendly message; for non-fatal search failures, log and return empty results.
OASIS subprocess crashes: simulation_ipc.py is the single surface. It owns lifecycle, logging, and signaling task failure. Don't catch subprocess errors elsewhere.
Startup recovery: _recover_stuck_projects re-classifies projects left GRAPH_BUILDING after a restart — see database.md.
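For the non-fatal search case, a caller-side wrapper might look like the sketch below. `GraphSearchError` is a hypothetical stand-in for whatever the graph adapter raises, and `safe_search` is not an existing MiroFish function.

```python
import logging

# Stand-in logger; project code would go through utils/logger.get_logger.
logger = logging.getLogger("graph_sketch")


class GraphSearchError(Exception):
    """Hypothetical stand-in for the exception type the graph adapter raises."""


def safe_search(search_fn, query):
    """Run a non-fatal graph search; on adapter failure, log and return no hits."""
    try:
        return search_fn(query)
    except GraphSearchError:
        # Log the query length only, so user text stays out of the log.
        logger.warning(
            "graph search failed for query of length %d", len(query), exc_info=True
        )
        return []
```

Only the adapter's own exception type is swallowed; anything else still propagates, so real bugs are not hidden behind empty results.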

Logging

Use utils/logger.get_logger('mirofish.<module>'); never call print or logging.getLogger directly.
Levels:
- ERROR — task failure, unrecoverable exception
- WARNING — retry triggered, transient failure, recovered state
- INFO — task lifecycle (created, completed), pipeline milestones
- DEBUG — payload shapes, intermediate counts, off by default
User-visible log messages should go through utils/locale.t(...) so they translate; internal diagnostic logs stay in the file's existing language (English or Chinese — match the surrounding code).
Never log: API keys, full LLM prompts containing user-uploaded text (truncate or hash), Neo4j credentials, full .env contents.
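The truncate-or-hash advice for prompts could be implemented along these lines. The helper name `prompt_fingerprint` and the 80-character preview limit are assumptions; pick a limit that suits the data involved, or drop the preview entirely for sensitive uploads.

```python
import hashlib


def prompt_fingerprint(prompt: str, preview_chars: int = 80) -> str:
    """Return a log-safe summary: length, short SHA-256 prefix, truncated preview."""
    # A short digest prefix lets identical prompts be correlated across log lines.
    digest = hashlib.sha256(prompt.encode("utf-8")).hexdigest()[:12]
    preview = prompt[:preview_chars].replace("\n", " ")
    suffix = "..." if len(prompt) > preview_chars else ""
    return f"len={len(prompt)} sha256={digest} preview={preview!r}{suffix}"
```

A log call would then pass `prompt_fingerprint(prompt)` instead of the prompt itself.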

What Not to Do

Don't catch Exception inside an API handler just to log and continue — fail the request and return the envelope.
Don't retry non-idempotent work (e.g. graph writes that may have partially completed).
Don't translate exceptions into success: true responses with an embedded error message; use success: false.
Don't surface raw stack traces or LLM internals to the frontend.

