MicroFish/.kiro/specs/i18n-ontology-generator-pro.../research.md

9.4 KiB
Raw Blame History

Research & Design Decisions — i18n-ontology-generator-prompts

Summary

  • Feature: i18n-ontology-generator-prompts
  • Discovery Scope: Extension (string-literal localization within an existing service)
  • Key Findings:
    • The <think> / markdown-fence stripping logic relied on by the ticket's R5 lives in backend/app/utils/llm_client.py (LLMClient.chat line 65 and chat_json lines 8487), not in ontology_generator.py. R5 is therefore satisfied implicitly so long as the call to self.llm_client.chat_json(...) is preserved exactly.
    • The locale postfix (get_language_instruction()) is sourced from /locales/languages.json via backend/app/utils/locale.py; the English postfix and Chinese postfix are both already defined and resolved per request via Accept-Language. No locale-machinery changes are needed.
    • _validate_and_process in the same file enforces the Person / Organization fallback invariant in code (lines 344393) regardless of prompt language. This means the prompt translation cannot break the post-validation invariants — the validator is the safety net.
    • The sole production caller is backend/app/api/graph.py:223228, which consumes only the JSON shape (entity_types, edge_types, analysis_summary). It does not introspect prompt language. Stable.

Research Log

Locale resolution semantics

  • Context: R3 requires zh to continue producing Chinese descriptions of equivalent quality after the base prompt is translated to English.
  • Sources Consulted: backend/app/utils/locale.py, /locales/languages.json (referenced via _languages registry).
  • Findings:
    • get_locale() returns the Accept-Language header value when in a request context (falling back to zh) and a thread-local otherwise.
    • get_language_instruction() returns _languages[locale].llmInstruction, defaulting to 请使用中文回答。.
    • The system prompt at line 210 already concatenates lang_instruction plus an English directive locking identifier formats. Both stay byte-for-byte unchanged.
  • Implications: Locale switching survives the translation; no code changes are needed in locale.py. The English base prompt + Chinese postfix is a known-working pattern (R3 acceptance criteria 3 stays valid).

<think> and markdown-fence stripping path

  • Context: R5 requires preservation of the <think> and markdown-fence stripping per CLAUDE.md (commit 985f89f).
  • Sources Consulted: backend/app/utils/llm_client.py lines 5093.
  • Findings:
    • LLMClient.chat strips <think>...</think> after every response (line 65).
    • LLMClient.chat_json additionally strips ```json, ```, and trailing fences (lines 8487) before json.loads.
    • ontology_generator.py only invokes chat_json — it does not perform stripping itself.
  • Implications: Translating the prompts in ontology_generator.py cannot break the stripping logic. The single call to self.llm_client.chat_json(messages=messages, temperature=0.3, max_tokens=4096) at lines 217221 must be preserved verbatim.

Caller surface and contract

  • Context: R4 requires zero diff to call sites of these prompts.
  • Sources Consulted: backend/app/api/graph.py:223228, backend/app/services/__init__.py.
  • Findings:
    • Only one production caller. It uses default constructor and the public generate(document_texts, simulation_requirement, additional_context) signature.
    • It reads entity_types, edge_types, analysis_summary from the result.
    • No tests or scripts under backend/scripts/ reference the module.
  • Implications: The translation is invisible to callers as long as we hold the public surface constant and continue to produce the same JSON shape (which the validator guarantees).

Validator safety net

  • Context: R1 / R5 acceptance: the post-validation invariant (10 entity types, ending in Person/Organization) must hold under both locales after translation.
  • Sources Consulted: _validate_and_process lines 277398.
  • Findings:
    • _to_pascal_case normalizes entity names regardless of language.
    • _validate_and_process enforces MAX_ENTITY_TYPES = 10, MAX_EDGE_TYPES = 10, deduplicates by name, force-injects Person and Organization fallbacks if missing, and truncates description to 100 chars.
  • Implications: Even if the LLM under an English prompt deviates from the count or fallback rules, the validator self-heals. Translation cannot break the JSON contract.

Architecture Pattern Evaluation

Option Description Strengths Risks / Limitations Notes
In-place translation Translate the constant and method strings; preserve all code structure. Minimal diff; matches sibling-issue pattern (#3, #4, #5); no new abstractions. English base biases output; mitigated by get_language_instruction() postfix and the trailing English directive at line 210. Selected.
Externalize to locale files Move prompts to /locales/{en,zh}.json keyed under ontology.system_prompt. Eliminates cross-locale bias entirely; symmetric prompts. Out of scope per ticket (#2 is file-internal); inconsistent with how the codebase currently handles LLM prompts; conflicts with #6 i18n track. Rejected.
Hybrid (translate + extract postfix helper) Translate in place; extract postfix concatenation into a helper. Slightly cleaner. Adds refactor outside ticket scope; ticket says "no diff to call sites" and "same function signatures". Rejected.

Design Decisions

Decision: Translate ONTOLOGY_SYSTEM_PROMPT and _build_user_message strings in place

  • Context: Prompts in ontology_generator.py are Chinese, biasing model output toward Chinese structure under Accept-Language: en. The ticket scopes the work to this single file and excludes refactors.
  • Alternatives Considered:
    1. Externalize to /locales/*.json keyed prompts (rejected; out of scope, inconsistent with codebase).
    2. Hybrid in-place translation + helper extraction (rejected; refactor outside scope).
  • Selected Approach: Replace the body of ONTOLOGY_SYSTEM_PROMPT with an English translation that preserves section structure, JSON template, taxonomy lists, fallback rules, and reserved-name lists. Replace the four Chinese string literals in _build_user_message (section headings, additional-context block heading, trailing rules block, truncation notice) with English equivalents while preserving f"..." interpolations and the conditional inclusion of additional context.
  • Rationale: Minimal-surface change; aligns with how sibling i18n issues are scoped; preserves the locale-postfix mechanism, the validator safety net, and all caller contracts.
  • Trade-offs: An English base prompt biases the model toward English structure. Mitigations: the per-locale llmInstruction postfix instructs the model to respond in the requested language; the trailing English directive at line 210 already locks identifier formats; _validate_and_process self-heals invariants.
  • Follow-up: Manual verification under both en and zh locales: assert valid JSON, assert exactly 10 entity types ending in Person and Organization, and assert description fields are in the expected language.

Decision: Preserve all surrounding code unchanged

  • Context: The ticket forbids changes to call sites and the surrounding code shape.
  • Alternatives Considered:
    1. Refactor language-locking directive into the locale module (rejected; out of scope and crosses file boundary).
    2. Add a docstring or constant for "prompt version" (rejected; introduces unused state).
  • Selected Approach: Translation-only diff. No new imports, no new constants, no signature changes, no logger or comment changes (those are #6 and #7 respectively).
  • Rationale: Smallest possible review surface, fewest possible regression vectors, easiest possible PR to land.
  • Trade-offs: None of consequence within the ticket's stated scope.
  • Follow-up: Run the ontology generator round-trip locally; verify zero CJK characters in the patched literals via regex [一-鿿].

Risks & Mitigations

  • Risk: English prompt produces lower-quality Chinese ontology under zh locale → Mitigation: The llmInstruction postfix already steers Chinese output. The trailing English directive at line 210 already locks identifier formats. If quality regresses in practice, future work (issue #6 / a follow-up) can externalize prompts to locale files.
  • Risk: Translator inadvertently changes the JSON template structure or a reserved attribute name → Mitigation: Acceptance criteria R1.2R1.6 enumerate the structural constants verbatim. Validator code already enforces MAX_ENTITY_TYPES, MAX_EDGE_TYPES, and fallback injection independently.
  • Risk: Logger or comment text gets translated as part of the same edit → Mitigation: The change scope is explicit (R7); reviewer compares the diff against R7 acceptance criteria.

References

  • .ticket/2.md — ticket snapshot for issue #2.
  • CLAUDE.md — project conventions, including reasoning-model output stripping.
  • /locales/languages.jsonllmInstruction definitions per locale.
  • backend/app/utils/locale.py — locale resolver implementation.
  • backend/app/utils/llm_client.py<think> / markdown-fence stripping.
  • backend/app/api/graph.py:223228 — sole production caller.