# Error Handling Standards

Most errors in MiroFish originate from **LLM calls**, **graph
operations**, **subprocess simulation**, or **user-uploaded files**,
not classical 4xx/5xx web flows. These standards target those failure
modes specifically.

## Philosophy

- Fail fast in services; convert to a stable response envelope at the
  API layer.
- Long-running tasks must always reach a terminal state
  (`COMPLETED` or `FAILED`); a stuck `PROCESSING` task is a bug.
- LLM responses are untrusted by default: strip, parse, validate, then
  use.
- Background-thread errors are silent unless explicitly captured, so
  always wrap the work in `try/except`.

## Error Surfaces (where they appear, where they're handled)

| Surface              | Handle in                                  | Convert to                            |
| -------------------- | ------------------------------------------ | ------------------------------------- |
| HTTP request errors  | `api/` handler `try/except` + envelope     | `{"success": false, "error": …}`      |
| Background task      | Worker thread `try/except` → `fail_task()` | `Task.status = FAILED` + `error`      |
| LLM call failures    | `retry_with_backoff` decorator             | Exception bubbles after retries       |
| Graph adapter errors | Caller catches & maps                      | Service-specific error or `Task.fail` |
| Simulation IPC       | `simulation_ipc.py` catches & logs         | Task fail or simulation cleanup       |
| File parsing         | `utils/file_parser.py`                     | Raised as `ValueError` to caller      |

A handler should never let an exception reach Flask's default 500
formatter; wrap it and return the canonical envelope instead.

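The handler-level conversion can be sketched as a small decorator. This is a minimal illustration, not the project's actual helper: `with_envelope`, the `"data"` key on success, the 400/500 status codes, and the two demo handlers are all assumptions. Only the `{"success": false, "error": …}` failure shape comes from the table above. Flask 1.1+ accepts a `(dict, status)` tuple as a view return value, so a wrapper of this shape could sit directly on a route function.

```python
import functools
import logging

# Stand-in logger so the sketch is self-contained; project code would use
# utils/logger.get_logger('mirofish.<module>') instead.
logger = logging.getLogger("mirofish.api.example")


def with_envelope(handler):
    """Hypothetical decorator: every outcome becomes an (envelope, status) pair."""

    @functools.wraps(handler)
    def wrapper(*args, **kwargs):
        try:
            # The "data" key is an assumption; only the failure shape is documented.
            return {"success": True, "data": handler(*args, **kwargs)}, 200
        except ValueError as exc:
            # Caller-facing problems, e.g. file parsing raised as ValueError.
            logger.warning("request rejected: %s", exc)
            return {"success": False, "error": str(exc)}, 400
        except Exception:
            # Full traceback goes to the log; the client gets a generic message.
            logger.exception("unhandled error in %s", handler.__name__)
            return {"success": False, "error": "Internal error"}, 500

    return wrapper


@with_envelope
def parse_upload(filename):
    # Hypothetical handler body used only to exercise the decorator.
    if not filename.endswith(".txt"):
        raise ValueError("unsupported file type")
    return {"filename": filename}


@with_envelope
def broken_handler():
    # Simulates an unexpected failure whose details must not leak.
    raise RuntimeError("neo4j password rejected")
```

The broad `except Exception` here does not "log and continue": it ends the request with `success: false` and keeps the stack trace out of the response.
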
## LLM-Specific Failure Modes

These are recurring and worth handling explicitly:

### 1. Reasoning-model output contamination

Some providers (MiniMax, GLM, certain Qwen variants) emit
`<think>…</think>` blocks and/or markdown code fences
(```` ```json ... ``` ````) around JSON output.

**Rule:** Strip both before calling `json.loads(...)`; see commit
`985f89f` for the original fix. Any new parser for LLM-output JSON must
do the same. Do not call `json.loads` on raw model output.

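A minimal sketch of the strip-then-parse rule follows. It is not the code from commit `985f89f`; the function name `parse_llm_json` and both regular expressions are assumptions chosen for illustration, and the sketch only handles a closed `<think>…</think>` block and a single fence wrapping the whole payload.

```python
import json
import re

# Built indirectly so this example can itself live inside a fenced block.
_FENCE = "`" * 3

# Matches a closed <think>...</think> block, including newlines inside it.
_THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL | re.IGNORECASE)

# Matches one fence (with optional language tag) wrapping the whole text.
_FENCE_RE = re.compile(
    r"^" + _FENCE + r"[A-Za-z0-9_-]*[ \t]*\n(.*?)\n?" + _FENCE + r"\s*$",
    re.DOTALL,
)


def parse_llm_json(raw: str):
    """Hypothetical helper: strip reasoning and fences, then parse JSON."""
    text = _THINK_RE.sub("", raw).strip()
    match = _FENCE_RE.match(text)
    if match:
        text = match.group(1).strip()
    # json.JSONDecodeError propagates so the caller can retry or fail the task.
    return json.loads(text)
```

Letting `json.JSONDecodeError` propagate keeps the "fail fast in services" rule intact: the caller decides whether to retry the LLM call or fail the task.
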
### 2. Transient API errors

Network blips, rate limits, intermittent 5xx from the provider.

**Rule:** Use `utils/retry.py`:

```python
from app.utils.retry import retry_with_backoff

# SomeAPIError stands in for the provider SDK's transient error type.
@retry_with_backoff(max_retries=3, exceptions=(SomeAPIError,))
def call_llm(prompt: str) -> str:
    ...
```

- Sync version: `retry_with_backoff`
- Async version: `retry_with_backoff_async`
- For batch processing where partial failure is acceptable, use
  `RetryableAPIClient.call_batch_with_retry(items, fn, continue_on_failure=True)`.

Don't write a hand-rolled retry loop; it will drift from the project's
backoff/jitter conventions.

### 3. Schema mismatch in structured output

The LLM returns valid JSON, but with missing or extra fields.

**Rule:** Validate with Pydantic v2 models where the call expects
structure. Fail loudly (raise) rather than silently coercing: it is
better to retry the LLM call than to feed bad data downstream.

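As an illustration of "fail loudly", the sketch below uses a Pydantic v2 model that rejects extra fields and refuses type coercion. The `EntitySummary` model, its fields, and `parse_entity` are hypothetical; only the use of Pydantic v2 comes from the rule above. `ConfigDict(extra="forbid", strict=True)` and `model_validate` are standard Pydantic v2 API.

```python
from pydantic import BaseModel, ConfigDict, ValidationError


class EntitySummary(BaseModel):
    """Hypothetical shape expected from a structured LLM call."""

    # extra="forbid" rejects unknown keys; strict=True disables coercion
    # such as turning the string "0.9" into a float.
    model_config = ConfigDict(extra="forbid", strict=True)

    name: str
    confidence: float


def parse_entity(payload: dict) -> EntitySummary:
    # ValidationError propagates so the caller can retry the LLM call.
    return EntitySummary.model_validate(payload)
```

A caller would typically catch `ValidationError` at the point where it can re-issue the LLM request, rather than patching the payload in place.
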
## Background Task Errors

Inside a worker thread spawned from an API handler:

```python
def _worker(task_id, project_id, **kwargs):
    try:
        # Report progress while the real work runs.
        TaskManager().update_task(task_id, progress=50, message=...)
        result = do_real_work(...)
        TaskManager().complete_task(task_id, result)
    except Exception as e:
        # Full traceback goes to the log; a concise message goes on the task.
        logger.exception("task %s failed", task_id)
        TaskManager().fail_task(task_id, str(e))
```

Rules:

- The outer `except` must be broad (`Exception`). The goal is "task
  always terminates," not "narrow down failures here."
- Log the full traceback (`logger.exception`), then store a concise
  `str(e)` on the task for the frontend to display.
- Never re-raise from the worker; the thread has no caller.
- Update related `Project` state (e.g. revert `GRAPH_BUILDING` →
  previous status) **inside** the except, before `fail_task`.

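The last rule, reverting project state before failing the task, can be sketched with in-memory stand-ins. Everything here is hypothetical: the `TASKS` and `PROJECTS` dictionaries replace the real `TaskManager` and `Project` storage, and the `CREATED` and `GRAPH_READY` statuses are invented for the example. The `try/finally` around the revert is a design choice of this sketch so that `fail_task` still runs if the revert itself raises.

```python
import logging
import threading

# Stand-in logger; project code would use utils/logger.get_logger(...).
logger = logging.getLogger("mirofish.worker.example")

# Hypothetical in-memory replacements for task and project storage.
TASKS = {}
PROJECTS = {}


def complete_task(task_id, result):
    TASKS[task_id] = {"status": "COMPLETED", "result": result}


def fail_task(task_id, error):
    TASKS[task_id] = {"status": "FAILED", "error": error}


def build_graph(project_id):
    # Simulated failure so the error path below is exercised.
    raise RuntimeError("graph backend unavailable")


def _worker(task_id, project_id, previous_status):
    try:
        result = build_graph(project_id)
        PROJECTS[project_id] = "GRAPH_READY"
        complete_task(task_id, result)
    except Exception as exc:
        logger.exception("task %s failed", task_id)
        try:
            # Revert the project first so it is not left in GRAPH_BUILDING.
            PROJECTS[project_id] = previous_status
        finally:
            # Runs even if the revert raises: the task always terminates.
            fail_task(task_id, str(exc))


PROJECTS["p1"] = "GRAPH_BUILDING"
TASKS["t1"] = {"status": "PROCESSING"}
thread = threading.Thread(target=_worker, args=("t1", "p1", "CREATED"))
thread.start()
thread.join()
```

After the thread finishes, the task is `FAILED` with the short error string and the project is back at its previous status, so nothing is left for startup recovery to clean up.
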
## Graph & Subprocess Errors

- **Graphiti / Neo4j errors:** the caller decides. Usually fail the
  task with a user-friendly message; for non-fatal search failures, log
  and return empty results.
- **OASIS subprocess crashes:** `simulation_ipc.py` is the single
  surface. It owns lifecycle, logging, and signaling task failure.
  Don't catch subprocess errors elsewhere.
- **Startup recovery:** `_recover_stuck_projects` re-classifies
  projects left in `GRAPH_BUILDING` after a restart (see `database.md`).

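The "log and return empty results" case for non-fatal search failures might look like the sketch below. `GraphSearchError`, `safe_search`, and the two demo search functions are hypothetical names; a real caller would catch whatever exception types the graph adapter actually raises.

```python
import logging

# Stand-in logger; project code would use utils/logger.get_logger(...).
logger = logging.getLogger("mirofish.graph.example")


class GraphSearchError(Exception):
    """Hypothetical error type standing in for the adapter's search failures."""


def safe_search(search_fn, query):
    """Run a non-fatal search; on failure, log a warning and return []."""
    try:
        return list(search_fn(query))
    except GraphSearchError:
        # WARNING level: the failure was recovered by degrading to no results.
        logger.warning("graph search failed; returning no results", exc_info=True)
        return []


def flaky_search(query):
    # Simulated adapter failure.
    raise GraphSearchError("connection reset")


def working_search(query):
    # Simulated successful lookup.
    return [f"node:{query}"]
```

Only the narrow search error is swallowed here. Anything else still propagates to the caller, which can then fail the task.
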
## Logging

- Use `utils/logger.get_logger('mirofish.<module>')`; never call
  `print` or `logging.getLogger` directly.
- Levels:
  - `ERROR`: task failure, unrecoverable exception
  - `WARNING`: retry triggered, transient failure, recovered state
  - `INFO`: task lifecycle (created, completed), pipeline milestones
  - `DEBUG`: payload shapes, intermediate counts, off by default
- User-visible log messages should go through `utils/locale.t(...)` so
  they translate; internal diagnostic logs stay in the file's existing
  language (English or Chinese; match the surrounding code).
- **Never log:** API keys, full LLM prompts containing user-uploaded
  text (truncate or hash), Neo4j credentials, full `.env` contents.

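One way to follow the "truncate or hash" guidance for prompts is to log a short fingerprint plus a bounded preview. The helper name `redact_prompt`, the 80-character default, and the output format are assumptions for illustration only.

```python
import hashlib


def redact_prompt(prompt: str, preview_chars: int = 80) -> str:
    """Hypothetical helper: length, short SHA-256 prefix, and a truncated preview."""
    # The hash prefix lets two log lines be matched without storing the text.
    digest = hashlib.sha256(prompt.encode("utf-8")).hexdigest()[:12]
    # Keep the preview on a single line and cap its length.
    preview = prompt[:preview_chars].replace("\n", " ")
    suffix = "…" if len(prompt) > preview_chars else ""
    return f"prompt[len={len(prompt)} sha256={digest}] {preview}{suffix}"
```

Setting `preview_chars=0` drops the preview entirely, which may be the safer choice when prompts embed user-uploaded documents.
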
## What Not to Do

- Don't catch `Exception` inside an API handler just to log and
  continue; fail the request and return the envelope.
- Don't retry non-idempotent work (e.g. graph writes that may have
  partially completed).
- Don't translate exceptions into `success: true` responses with an
  embedded error message; use `success: false`.
- Don't surface raw stack traces or LLM internals to the frontend.

---

_Focus on patterns and decisions. No implementation details or exhaustive lists._