MicroFish/.kiro/specs/i18n-ontology-generator-pro.../research.md

100 lines
9.4 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Research & Design Decisions — i18n-ontology-generator-prompts
## Summary
- **Feature**: `i18n-ontology-generator-prompts`
- **Discovery Scope**: Extension (string-literal localization within an existing service)
- **Key Findings**:
- The `<think>` / markdown-fence stripping logic relied on by the ticket's R5 lives in `backend/app/utils/llm_client.py` (`LLMClient.chat` line 65 and `chat_json` lines 8487), **not** in `ontology_generator.py`. R5 is therefore satisfied implicitly so long as the call to `self.llm_client.chat_json(...)` is preserved exactly.
- The locale postfix (`get_language_instruction()`) is sourced from `/locales/languages.json` via `backend/app/utils/locale.py`; the English postfix and Chinese postfix are both already defined and resolved per request via `Accept-Language`. No locale-machinery changes are needed.
- `_validate_and_process` in the same file enforces the `Person` / `Organization` fallback invariant in code (lines 344393) regardless of prompt language. This means the prompt translation cannot break the post-validation invariants — the validator is the safety net.
- The sole production caller is `backend/app/api/graph.py:223228`, which consumes only the JSON shape (`entity_types`, `edge_types`, `analysis_summary`). It does not introspect prompt language. Stable.
## Research Log
### Locale resolution semantics
- **Context**: R3 requires `zh` to continue producing Chinese descriptions of equivalent quality after the base prompt is translated to English.
- **Sources Consulted**: `backend/app/utils/locale.py`, `/locales/languages.json` (referenced via `_languages` registry).
- **Findings**:
- `get_locale()` returns the `Accept-Language` header value when in a request context (falling back to `zh`) and a thread-local otherwise.
- `get_language_instruction()` returns `_languages[locale].llmInstruction`, defaulting to `请使用中文回答。`.
- The system prompt at line 210 already concatenates `lang_instruction` plus an English directive locking identifier formats. Both stay byte-for-byte unchanged.
- **Implications**: Locale switching survives the translation; no code changes are needed in locale.py. The English base prompt + Chinese postfix is a known-working pattern (R3 acceptance criteria 3 stays valid).
### `<think>` and markdown-fence stripping path
- **Context**: R5 requires preservation of the `<think>` and markdown-fence stripping per `CLAUDE.md` (commit 985f89f).
- **Sources Consulted**: `backend/app/utils/llm_client.py` lines 5093.
- **Findings**:
- `LLMClient.chat` strips `<think>...</think>` after every response (line 65).
- `LLMClient.chat_json` additionally strips ` ```json `, ` ``` `, and trailing fences (lines 8487) before `json.loads`.
- `ontology_generator.py` only invokes `chat_json` — it does not perform stripping itself.
- **Implications**: Translating the prompts in `ontology_generator.py` cannot break the stripping logic. The single call to `self.llm_client.chat_json(messages=messages, temperature=0.3, max_tokens=4096)` at lines 217221 must be preserved verbatim.
### Caller surface and contract
- **Context**: R4 requires zero diff to call sites of these prompts.
- **Sources Consulted**: `backend/app/api/graph.py:223228`, `backend/app/services/__init__.py`.
- **Findings**:
- Only one production caller. It uses default constructor and the public `generate(document_texts, simulation_requirement, additional_context)` signature.
- It reads `entity_types`, `edge_types`, `analysis_summary` from the result.
- No tests or scripts under `backend/scripts/` reference the module.
- **Implications**: The translation is invisible to callers as long as we hold the public surface constant and continue to produce the same JSON shape (which the validator guarantees).
### Validator safety net
- **Context**: R1 / R5 acceptance: the post-validation invariant (10 entity types, ending in Person/Organization) must hold under both locales after translation.
- **Sources Consulted**: `_validate_and_process` lines 277398.
- **Findings**:
- `_to_pascal_case` normalizes entity names regardless of language.
- `_validate_and_process` enforces `MAX_ENTITY_TYPES = 10`, `MAX_EDGE_TYPES = 10`, deduplicates by name, force-injects `Person` and `Organization` fallbacks if missing, and truncates `description` to 100 chars.
- **Implications**: Even if the LLM under an English prompt deviates from the count or fallback rules, the validator self-heals. Translation cannot break the JSON contract.
## Architecture Pattern Evaluation
| Option | Description | Strengths | Risks / Limitations | Notes |
|--------|-------------|-----------|---------------------|-------|
| In-place translation | Translate the constant and method strings; preserve all code structure. | Minimal diff; matches sibling-issue pattern (#3, #4, #5); no new abstractions. | English base biases output; mitigated by `get_language_instruction()` postfix and the trailing English directive at line 210. | Selected. |
| Externalize to locale files | Move prompts to `/locales/{en,zh}.json` keyed under `ontology.system_prompt`. | Eliminates cross-locale bias entirely; symmetric prompts. | Out of scope per ticket (#2 is file-internal); inconsistent with how the codebase currently handles LLM prompts; conflicts with #6 i18n track. | Rejected. |
| Hybrid (translate + extract postfix helper) | Translate in place; extract postfix concatenation into a helper. | Slightly cleaner. | Adds refactor outside ticket scope; ticket says "no diff to call sites" and "same function signatures". | Rejected. |
## Design Decisions
### Decision: Translate `ONTOLOGY_SYSTEM_PROMPT` and `_build_user_message` strings in place
- **Context**: Prompts in `ontology_generator.py` are Chinese, biasing model output toward Chinese structure under `Accept-Language: en`. The ticket scopes the work to this single file and excludes refactors.
- **Alternatives Considered**:
1. Externalize to `/locales/*.json` keyed prompts (rejected; out of scope, inconsistent with codebase).
2. Hybrid in-place translation + helper extraction (rejected; refactor outside scope).
- **Selected Approach**: Replace the body of `ONTOLOGY_SYSTEM_PROMPT` with an English translation that preserves section structure, JSON template, taxonomy lists, fallback rules, and reserved-name lists. Replace the four Chinese string literals in `_build_user_message` (section headings, additional-context block heading, trailing rules block, truncation notice) with English equivalents while preserving `f"..."` interpolations and the conditional inclusion of additional context.
- **Rationale**: Minimal-surface change; aligns with how sibling i18n issues are scoped; preserves the locale-postfix mechanism, the validator safety net, and all caller contracts.
- **Trade-offs**: An English base prompt biases the model toward English structure. Mitigations: the per-locale `llmInstruction` postfix instructs the model to respond in the requested language; the trailing English directive at line 210 already locks identifier formats; `_validate_and_process` self-heals invariants.
- **Follow-up**: Manual verification under both `en` and `zh` locales: assert valid JSON, assert exactly 10 entity types ending in `Person` and `Organization`, and assert description fields are in the expected language.
### Decision: Preserve all surrounding code unchanged
- **Context**: The ticket forbids changes to call sites and the surrounding code shape.
- **Alternatives Considered**:
1. Refactor language-locking directive into the locale module (rejected; out of scope and crosses file boundary).
2. Add a docstring or constant for "prompt version" (rejected; introduces unused state).
- **Selected Approach**: Translation-only diff. No new imports, no new constants, no signature changes, no logger or comment changes (those are #6 and #7 respectively).
- **Rationale**: Smallest possible review surface, fewest possible regression vectors, easiest possible PR to land.
- **Trade-offs**: None of consequence within the ticket's stated scope.
- **Follow-up**: Run the ontology generator round-trip locally; verify zero CJK characters in the patched literals via regex `[一-鿿]`.
## Risks & Mitigations
- **Risk: English prompt produces lower-quality Chinese ontology under `zh` locale** → Mitigation: The `llmInstruction` postfix already steers Chinese output. The trailing English directive at line 210 already locks identifier formats. If quality regresses in practice, future work (issue #6 / a follow-up) can externalize prompts to locale files.
- **Risk: Translator inadvertently changes the JSON template structure or a reserved attribute name** → Mitigation: Acceptance criteria R1.2R1.6 enumerate the structural constants verbatim. Validator code already enforces `MAX_ENTITY_TYPES`, `MAX_EDGE_TYPES`, and fallback injection independently.
- **Risk: Logger or comment text gets translated as part of the same edit** → Mitigation: The change scope is explicit (R7); reviewer compares the diff against R7 acceptance criteria.
## References
- [.ticket/2.md](../../../.ticket/2.md) — ticket snapshot for issue #2.
- [CLAUDE.md](../../../CLAUDE.md) — project conventions, including reasoning-model output stripping.
- [/locales/languages.json](../../../locales/languages.json) — `llmInstruction` definitions per locale.
- `backend/app/utils/locale.py` — locale resolver implementation.
- `backend/app/utils/llm_client.py``<think>` / markdown-fence stripping.
- `backend/app/api/graph.py:223228` — sole production caller.