MicroFish/.kiro/steering/error-handling.md

141 lines
5.6 KiB
Markdown

# Error Handling Standards
Most errors in MiroFish originate from **LLM calls**, **graph
operations**, **subprocess simulation**, or **user-uploaded files**
not classical 4xx/5xx web flows. These standards target those failure
modes specifically.
## Philosophy
- Fail fast in services; convert to a stable response envelope at the
API layer.
- Long-running tasks must always reach a terminal state
(`COMPLETED` or `FAILED`) — a stuck `PROCESSING` task is a bug.
- LLM responses are untrusted by default: validate, strip, parse, then
use.
- Background-thread errors are silent unless explicitly captured —
always wrap the work in `try/except`.
## Error Surfaces (where they appear, where they're handled)
| Surface | Handle in | Convert to |
| -------------------- | ------------------------------------------ | --------------------------------- |
| HTTP request errors | `api/` handler `try/except` + envelope | `{"success": false, "error": …}` |
| Background task | Worker thread `try/except``fail_task()` | `Task.status = FAILED` + `error` |
| LLM call failures | `retry_with_backoff` decorator | Exception bubbles after retries |
| Graph adapter errors | Caller catches & maps | Service-specific error or `Task.fail` |
| Simulation IPC | `simulation_ipc.py` catches & logs | Task fail or simulation cleanup |
| File parsing | `utils/file_parser.py` | Raised as `ValueError` to caller |
A handler should never let an exception reach Flask's default 500
formatter — wrap and return the canonical envelope instead.
## LLM-Specific Failure Modes
These are recurring and worth handling explicitly:
### 1. Reasoning-model output contamination
Some providers (MiniMax, GLM, certain Qwen variants) emit `<think>…
</think>` blocks and/or markdown code fences (```` ```json ... ``` ````)
around JSON output.
**Rule:** Strip both before `json.loads(...)`. The fix lives in commit
`985f89f` for context. Any new LLM-output JSON parser must do the same
— do not call `json.loads` on raw model output.
### 2. Transient API errors
Network blips, rate limits, intermittent 5xx from the provider.
**Rule:** Use `utils/retry.py`:
```python
from app.utils.retry import retry_with_backoff
@retry_with_backoff(max_retries=3, exceptions=(SomeAPIError,))
def call_llm(...): ...
```
- Sync version: `retry_with_backoff`
- Async version: `retry_with_backoff_async`
- For batch processing where partial failure is acceptable, use
`RetryableAPIClient.call_batch_with_retry(items, fn,
continue_on_failure=True)`.
Don't write a hand-rolled retry loop — it'll drift from the project's
backoff/jitter conventions.
### 3. Schema mismatch in structured output
LLM returns valid JSON but missing/extra fields.
**Rule:** Validate with Pydantic v2 models where the call expects
structure. Fail loudly (raise) rather than silently coercing — better
to retry the LLM call than to feed bad data downstream.
## Background Task Errors
Inside a worker thread spawned from an API handler:
```python
def _worker(task_id, project_id, ...):
try:
# work
TaskManager().update_task(task_id, progress=50, message=...)
result = do_real_work(...)
TaskManager().complete_task(task_id, result)
except Exception as e:
logger.exception(f"task {task_id} failed")
TaskManager().fail_task(task_id, str(e))
```
Rules:
- The outer `except` must be broad (`Exception`) — the goal is "task
always terminates," not "narrow down failures here."
- Log the full traceback (`logger.exception`), then store a concise
`str(e)` on the task for the frontend to display.
- Never re-raise from the worker; the thread has no caller.
- Update related `Project` state (e.g. revert `GRAPH_BUILDING`
previous status) **inside** the except, before `fail_task`.
## Graph & Subprocess Errors
- **Graphiti / Neo4j errors:** caller decides — usually fail the task
with a user-friendly message; for non-fatal search failures, log and
return empty results.
- **OASIS subprocess crashes:** `simulation_ipc.py` is the single
surface. It owns lifecycle, logging, and signaling task failure.
Don't catch subprocess errors elsewhere.
- **Startup recovery:** `_recover_stuck_projects` re-classifies
projects left `GRAPH_BUILDING` after a restart — see `database.md`.
## Logging
- Use `utils/logger.get_logger('mirofish.<module>')` — never
`print` or `logging.getLogger` directly.
- Levels:
- `ERROR` — task failure, unrecoverable exception
- `WARNING` — retry triggered, transient failure, recovered state
- `INFO` — task lifecycle (created, completed), pipeline milestones
- `DEBUG` — payload shapes, intermediate counts, off by default
- User-visible log messages should go through `utils/locale.t(...)` so
they translate; internal diagnostic logs stay in the file's existing
language (English or Chinese — match the surrounding code).
- **Never log:** API keys, full LLM prompts containing user-uploaded
text (truncate or hash), Neo4j credentials, full `.env` contents.
## What Not to Do
- Don't catch `Exception` inside an API handler just to log and
continue — fail the request and return the envelope.
- Don't retry non-idempotent work (e.g. graph writes that may have
partially completed).
- Don't translate exceptions into `success: true` responses with an
embedded error message; use `success: false`.
- Don't surface raw stack traces or LLM internals to the frontend.
---
_Focus on patterns and decisions. No implementation details or exhaustive lists._