Running MiroFish with a Local LLM

MiroFish talks to any OpenAI-compatible chat endpoint, so you can swap the cloud provider for a local runtime such as LM Studio, Ollama, or llama.cpp server. This document covers the moving parts.

Zep Cloud is still required for the memory graph — only the LLM call path can be replaced. See Limitations below.

TL;DR

# .env
LLM_API_KEY=local-anything       # any non-empty string
LLM_BASE_URL=http://localhost:1234/v1
LLM_MODEL_NAME=<the model id your runtime exposes>
LLM_JSON_MODE=none               # IMPORTANT for LM Studio / llama.cpp
ZEP_API_KEY=<your zep cloud key>

The single critical knob is LLM_JSON_MODE=none. Cloud providers accept response_format={"type":"json_object"}, but most local runtimes reject it with HTTP 400. Setting LLM_JSON_MODE=none makes MiroFish skip that parameter and rely on prompt-driven JSON output, which the existing parser handles robustly.
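
The effect of LLM_JSON_MODE can be sketched in a few lines of Python. This is a minimal illustration, not MiroFish's actual code: `build_chat_kwargs` and `extract_json` are hypothetical helper names, and the fallback strategy (keep the span from the first `{` to the last `}`) is an assumption about what a tolerant parser might do.

```python
import json


def build_chat_kwargs(model, messages, json_mode="json_object"):
    """Build keyword arguments for an OpenAI-style chat completion call.

    Hypothetical helper: response_format is only included when the
    configured mode is not "none", so local runtimes never see it.
    """
    kwargs = {"model": model, "messages": messages}
    if json_mode != "none":
        kwargs["response_format"] = {"type": json_mode}
    return kwargs


def extract_json(text):
    """Parse a JSON object from a reply that may wrap it in prose or a fence."""
    # Keep only the span from the first "{" to the last "}".
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in model output")
    return json.loads(text[start:end + 1])
```

With `json_mode="none"` the request carries no `response_format` key at all, and the JSON is recovered from whatever text the model returns.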

Provider quick reference

| Runtime | LLM_BASE_URL | LLM_JSON_MODE | Notes |
| --- | --- | --- | --- |
| OpenAI | https://api.openai.com/v1 | json_object (default) | Strict JSON via response_format |
| Anthropic (OpenAI-compat) | https://api.anthropic.com/v1/ | none | Trailing slash matters; rejects json_object |
| Qwen / Dashscope | https://dashscope.aliyuncs.com/compatible-mode/v1 | json_object | Project default |
| Ollama | http://localhost:11434/v1 | json_object | Ollama mostly accepts it |
| LM Studio | http://localhost:1234/v1 | none | Returns 400 if response_format is sent |
| llama.cpp server | http://localhost:8080/v1 | none | Same constraint as LM Studio |
| vLLM | depends on deployment | json_object | Generally OpenAI-faithful |

Recipe: LM Studio

LM Studio ships an OpenAI-compatible server backed by an MLX runtime that is currently the most reliable option on macOS Tahoe + Apple Silicon (M1-M5). See Apple Silicon caveats for why Ollama is not recommended on that combination right now.

# 1. Install (Homebrew cask, or download from https://lmstudio.ai)
brew install --cask lm-studio

# 2. Open LM Studio once to complete the first-run flow.
#    The CLI is bootstrapped at ~/.lmstudio/bin/lms after that.

# 3. Make `lms` available in your shell
export PATH="$HOME/.lmstudio/bin:$PATH"

# 4. Pull a chat-tuned model in MLX format. Examples:
lms get qwen/qwen3-4b-2507 --mlx -y          # ~2.3 GB, fast, instruction-tuned
# lms get qwen/qwen3-coder-30b --mlx -y      # only if you have the RAM

# 5. Start the server with CORS enabled
lms server start --cors

# 6. Load the model with a generous context window
lms load qwen/qwen3-4b-2507 --gpu max --context-length 32768 -y

Then in .env:

LLM_API_KEY=lm-studio-local
LLM_BASE_URL=http://localhost:1234/v1
LLM_MODEL_NAME=qwen/qwen3-4b-2507
LLM_JSON_MODE=none

Smoke-test the endpoint before starting MiroFish:

curl -sS http://localhost:1234/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"qwen/qwen3-4b-2507","messages":[{"role":"user","content":"Reply OK"}],"max_tokens":10}'

If you see a normal completion, you're good. Now run npm run dev from the repo root.
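
The same smoke test can be run from Python using only the standard library, which is handy when curl is unavailable. This is a sketch: `build_smoke_request` and `smoke_test` are hypothetical helper names, and the response is assumed to follow the usual OpenAI chat-completion shape (`choices[0].message.content`).

```python
import json
import urllib.request


def build_smoke_request(base_url, model):
    """Return a urllib Request equivalent to the curl smoke test."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": "Reply OK"}],
        "max_tokens": 10,
    }
    return urllib.request.Request(
        base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def smoke_test(base_url, model, timeout=60):
    """Send the request and return the assistant's reply text."""
    request = build_smoke_request(base_url, model)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

Calling `smoke_test("http://localhost:1234/v1", "qwen/qwen3-4b-2507")` should return a short reply if the server is up and the model is loaded; a connection error or an HTTP error means the runtime is not ready yet.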

Why --context-length 32768?

Ontology generation feeds all of your uploaded documents into a single prompt. With four medium-sized PDFs (~30 KB extracted text) you'll need ~8k tokens of input, plus headroom for the model's response. The default 4k context window cannot hold that and fails with the error "The number of tokens to keep from the initial prompt is greater than the context length". 32k is a safe choice on a 16 GB Mac; raise it on machines with more RAM if you're loading more documents.
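
The sizing above can be checked with a rough calculation. The sketch below assumes an average of 4 bytes of text per token, a common rule of thumb that varies by tokenizer and language, and an arbitrary 2048-token response budget; both numbers and both function names are illustrative assumptions.

```python
def estimate_tokens(text_bytes, bytes_per_token=4):
    """Rough token count; 4 bytes per token is an assumed average."""
    return text_bytes // bytes_per_token


def fits_context(text_bytes, context_length, response_budget=2048):
    """True if the estimated prompt plus a response budget fits the window."""
    return estimate_tokens(text_bytes) + response_budget <= context_length
```

For 30,000 bytes of extracted text this gives about 7,500 prompt tokens, which does not fit a 4,096-token window but fits comfortably in 32,768.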

Recipe: Ollama

ollama pull qwen2.5:7b-instruct
ollama serve

Then in .env:

LLM_API_KEY=ollama-local
LLM_BASE_URL=http://localhost:11434/v1
LLM_MODEL_NAME=qwen2.5:7b-instruct
LLM_JSON_MODE=json_object   # Ollama accepts the param

Apple Silicon caveats

If you're on macOS 26 (Tahoe) with an M3/M4/M5 chip, Ollama versions ≤ 0.21 fail to compile their Metal shaders against the updated MetalPerformancePrimitives framework, terminating every model load with "static_assert failed [bfloat/half] ... panic: unable to create llama context". Tracked upstream in ollama/ollama#15748 and #15594.

Until that's resolved, prefer LM Studio on those machines. Its MLX runtime side-steps the broken Metal path.

Memory & throughput expectations

A typical MiroFish simulation (200-500 agents × 30 rounds × 2 platforms) issues thousands of LLM calls. On a 16 GB MacBook with a local 4B model:

  • A single round can take 5-15 minutes when the model is the bottleneck.
  • A full simulation can run for hours and may exhaust RAM (backend Python + LM Studio model + frontend Node + Zep cache).
  • Concurrent requests can crash the MLX runtime under memory pressure, surfaced as "The model has crashed without additional information".

If you need 200+ agents over many rounds, a cloud LLM (Claude Haiku, GPT-4o-mini, Qwen-plus) finishes in far less wall-clock time and frees RAM for everything else.

For first-time validation, start with a 20-agent / 3-round smoke test to confirm the pipeline before committing to a long run.
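
To see why the smoke test is worth doing first, here is a back-of-the-envelope estimate of the figures in this section. It assumes at least one LLM call per agent per round per platform, and that the per-round time quoted above covers a whole round; both are assumptions for illustration, not measured MiroFish behaviour, and `simulation_estimate` is a hypothetical helper.

```python
def simulation_estimate(agents, rounds, platforms, minutes_per_round):
    """Return (minimum LLM calls, wall-clock hours) for one simulation."""
    # Assumed lower bound: one call per agent, per round, per platform.
    calls = agents * rounds * platforms
    # Assumed: minutes_per_round already covers every platform in that round.
    hours = rounds * minutes_per_round / 60
    return calls, hours
```

`simulation_estimate(200, 30, 2, 5)` gives 12,000 calls and 2.5 hours, while `simulation_estimate(500, 30, 2, 15)` gives 30,000 calls and 7.5 hours. The 20-agent / 3-round smoke test is `simulation_estimate(20, 3, 2, 5)`, or 120 calls.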

Limitations

  • Zep Cloud is still required. MiroFish hardcodes the zep_cloud SDK in several services (zep_tools.py, graph_builder.py, zep_graph_memory_updater.py, zep_entity_reader.py). There is no ZEP_BASE_URL knob today, so Zep self-hosting requires a code patch. The free Zep tier (5 req/min) is enough for small simulations; busier ones will benefit from a paid tier.
  • Embeddings are handled by Zep, not the local LLM. Your local runtime does not need an embeddings endpoint.
  • MiroFish prompts are written in Chinese internally. Local models with weaker multilingual coverage may inject Chinese phrases into otherwise English/Portuguese outputs. Cloud models handle this gracefully.
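
If you are already patching the Zep call sites, a client-side throttle is one way to stay under the free tier's 5 requests per minute. The class below is a generic sliding-window limiter, not something MiroFish ships; the name `MinuteRateLimiter` and the idea of calling `wait()` before each Zep request are assumptions for illustration.

```python
import time
from collections import deque


class MinuteRateLimiter:
    """Sliding-window limiter: at most max_calls per period seconds."""

    def __init__(self, max_calls=5, period=60.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.period = period
        self._clock = clock
        self._sleep = sleep
        self._stamps = deque()

    def wait(self):
        """Block until another call is allowed, then record it."""
        now = self._clock()
        # Forget calls that have already left the window.
        while self._stamps and now - self._stamps[0] >= self.period:
            self._stamps.popleft()
        if len(self._stamps) >= self.max_calls:
            # Sleep until the oldest recorded call leaves the window.
            self._sleep(self.period - (now - self._stamps[0]))
            now = self._clock()
            self._stamps.popleft()
        self._stamps.append(now)
```

The clock and sleep functions are injectable so the limiter can be exercised in tests without actually waiting a minute.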

Troubleshooting

| Symptom | Likely cause | Fix |
| --- | --- | --- |
| 400 - 'response_format.type' must be 'json_schema' or 'text' | Runtime rejects json_object | Set LLM_JSON_MODE=none |
| 400 - The number of tokens to keep from the initial prompt is greater than the context length | Context window too small for combined seed documents | Reload model with --context-length 32768 (or higher) |
| The model has crashed without additional information | MLX runtime OOM under concurrency | Reduce agent count, lower context length, or switch to a cloud LLM |
| panic: unable to create llama context (Ollama) | macOS Tahoe + Apple Silicon Metal bug | Use LM Studio instead; see Apple Silicon caveats |
| Zep 429 Rate limit exceeded for FREE plan | Free tier is 5 req/min | Reduce simulation size, or upgrade Zep, or close the UI tab to stop graph polling |
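
The symptoms above can also be matched programmatically, for example when scanning backend logs. The lookup below is a sketch built only from the entries in this section; `KNOWN_ISSUES` and `suggest_fix` are hypothetical names, and substring matching is an assumed (and deliberately simple) strategy.

```python
# (substring of the error message, suggested fix), taken from the table above.
KNOWN_ISSUES = [
    ("'response_format.type' must be", "Set LLM_JSON_MODE=none"),
    ("number of tokens to keep from the initial prompt",
     "Reload the model with --context-length 32768 (or higher)"),
    ("model has crashed without additional information",
     "Reduce agent count, lower context length, or switch to a cloud LLM"),
    ("unable to create llama context", "Use LM Studio instead of Ollama"),
    ("rate limit exceeded for free plan",
     "Reduce simulation size, upgrade Zep, or close the UI tab"),
]


def suggest_fix(error_message):
    """Return the suggested fix for a known error message, or None."""
    lowered = error_message.lower()
    for needle, fix in KNOWN_ISSUES:
        if needle.lower() in lowered:
            return fix
    return None
```

An unrecognised message returns `None`, which is a signal to read the full log rather than guess.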