MicroFish/.kiro/steering/error-handling.md

5.6 KiB

Error Handling Standards

Most errors in MiroFish originate from LLM calls, graph operations, subprocess simulation, or user-uploaded files — not classical 4xx/5xx web flows. These standards target those failure modes specifically.

Philosophy

  • Fail fast in services; convert to a stable response envelope at the API layer.
  • Long-running tasks must always reach a terminal state (COMPLETED or FAILED) — a stuck PROCESSING task is a bug.
  • LLM responses are untrusted by default: validate, strip, parse, then use.
  • Background-thread errors are silent unless explicitly captured — always wrap the work in try/except.

Error Surfaces (where they appear, where they're handled)

Surface Handle in Convert to
HTTP request errors api/ handler try/except + envelope {"success": false, "error": …}
Background task Worker thread try/exceptfail_task() Task.status = FAILED + error
LLM call failures retry_with_backoff decorator Exception bubbles after retries
Graph adapter errors Caller catches & maps Service-specific error or Task.fail
Simulation IPC simulation_ipc.py catches & logs Task fail or simulation cleanup
File parsing utils/file_parser.py Raised as ValueError to caller

A handler should never let an exception reach Flask's default 500 formatter — wrap and return the canonical envelope instead.

LLM-Specific Failure Modes

These are recurring and worth handling explicitly:

1. Reasoning-model output contamination

Some providers (MiniMax, GLM, certain Qwen variants) emit <think>… </think> blocks and/or markdown code fences (```json ... ```) around JSON output.

Rule: Strip both before json.loads(...). The fix lives in commit 985f89f for context. Any new LLM-output JSON parser must do the same — do not call json.loads on raw model output.

2. Transient API errors

Network blips, rate limits, intermittent 5xx from the provider.

Rule: Use utils/retry.py:

from app.utils.retry import retry_with_backoff

@retry_with_backoff(max_retries=3, exceptions=(SomeAPIError,))
def call_llm(...): ...
  • Sync version: retry_with_backoff
  • Async version: retry_with_backoff_async
  • For batch processing where partial failure is acceptable, use RetryableAPIClient.call_batch_with_retry(items, fn, continue_on_failure=True).

Don't write a hand-rolled retry loop — it'll drift from the project's backoff/jitter conventions.

3. Schema mismatch in structured output

LLM returns valid JSON but missing/extra fields.

Rule: Validate with Pydantic v2 models where the call expects structure. Fail loudly (raise) rather than silently coercing — better to retry the LLM call than to feed bad data downstream.

Background Task Errors

Inside a worker thread spawned from an API handler:

def _worker(task_id, project_id, ...):
    try:
        # work
        TaskManager().update_task(task_id, progress=50, message=...)
        result = do_real_work(...)
        TaskManager().complete_task(task_id, result)
    except Exception as e:
        logger.exception(f"task {task_id} failed")
        TaskManager().fail_task(task_id, str(e))

Rules:

  • The outer except must be broad (Exception) — the goal is "task always terminates," not "narrow down failures here."
  • Log the full traceback (logger.exception), then store a concise str(e) on the task for the frontend to display.
  • Never re-raise from the worker; the thread has no caller.
  • Update related Project state (e.g. revert GRAPH_BUILDING → previous status) inside the except, before fail_task.

Graph & Subprocess Errors

  • Graphiti / Neo4j errors: caller decides — usually fail the task with a user-friendly message; for non-fatal search failures, log and return empty results.
  • OASIS subprocess crashes: simulation_ipc.py is the single surface. It owns lifecycle, logging, and signaling task failure. Don't catch subprocess errors elsewhere.
  • Startup recovery: _recover_stuck_projects re-classifies projects left GRAPH_BUILDING after a restart — see database.md.

Logging

  • Use utils/logger.get_logger('mirofish.<module>') — never print or logging.getLogger directly.
  • Levels:
    • ERROR — task failure, unrecoverable exception
    • WARNING — retry triggered, transient failure, recovered state
    • INFO — task lifecycle (created, completed), pipeline milestones
    • DEBUG — payload shapes, intermediate counts, off by default
  • User-visible log messages should go through utils/locale.t(...) so they translate; internal diagnostic logs stay in the file's existing language (English or Chinese — match the surrounding code).
  • Never log: API keys, full LLM prompts containing user-uploaded text (truncate or hash), Neo4j credentials, full .env contents.

What Not to Do

  • Don't catch Exception inside an API handler just to log and continue — fail the request and return the envelope.
  • Don't retry non-idempotent work (e.g. graph writes that may have partially completed).
  • Don't translate exceptions into success: true responses with an embedded error message; use success: false.
  • Don't surface raw stack traces or LLM internals to the frontend.

Focus on patterns and decisions. No implementation details or exhaustive lists.