MicroFish/.kiro/specs/i18n-ontology-generator-pro.../gap-analysis.md

10 KiB
Raw Blame History

Gap Analysis — i18n-ontology-generator-prompts

1. Current State Investigation

Domain assets

  • Subject file: backend/app/services/ontology_generator.py (507 lines).
  • Module-level system prompt: ONTOLOGY_SYSTEM_PROMPT (lines 30173) — Chinese, ~140 lines of structured prompt content describing task, output format, design guidelines, entity reference list, relationship reference list.
  • User-message builder: OntologyGenerator._build_user_message (lines 231275) — Chinese section headings, truncation notice, and trailing rules block.
  • Locale postfix call site: get_language_instruction() is invoked at line 209 and concatenated into the system prompt at line 210, alongside an English directive that locks identifier formats.
  • Locale resolver: backend/app/utils/locale.py reads Accept-Language from request context, falls back to thread-local for background tasks, and ultimately defaults to zh. The English postfix lives in /locales/languages.json (llmInstruction).
  • LLM client: backend/app/utils/llm_client.py:LLMClient.chat_json performs <think> stripping (line 65) and markdown-fence stripping (lines 8487). This is outside ontology_generator.py, so the file does not own that logic — it just consumes it. Requirement R5 is satisfied trivially as long as we keep the chat_json call unchanged.

Call sites (consumers)

  • backend/app/api/graph.py:223228 — the only production caller. Uses OntologyGenerator() with no constructor args, calls .generate(document_texts, simulation_requirement, additional_context), and reads entity_types, edge_types, analysis_summary from the result. The shape contract is what matters; language of description is not parsed.
  • backend/app/services/__init__.py — re-exports the class.
  • No tests currently reference this module (verified via Grep ontology_generator|OntologyGenerator|ONTOLOGY_SYSTEM_PROMPT).

Conventions

  • 4-space indentation, Python 3.11+, snake_case identifiers, type hints where present (matches surrounding file style).
  • No linter/formatter — match existing style; existing file uses Chinese inline comments which are out of scope (issue #7).
  • LLM prompts in this codebase are typically defined as module-level string constants and concatenated with get_language_instruction() for locale steering.
  • Variable interpolation in user messages uses Python f-strings; the system prompt uses no interpolation today.

Integration surfaces

  • Output JSON schema (entity_types[], edge_types[], analysis_summary) is consumed by _validate_and_process (also in this file) and by Graphiti via the project's ontology field (set in graph.py:235).
  • Reserved attribute names list (name, uuid, group_id, created_at, summary) is asserted in the prompt for the LLM to obey, not enforced by code in this file.
  • Entity/edge fallback rules (Person, Organization) are both prompted and enforced by _validate_and_process lines 344393. Code is the safety net; prompt is the steering.

2. Requirements → Asset Map

Requirement Existing Asset Gap Type Notes
R1 (system prompt EN) ONTOLOGY_SYSTEM_PROMPT constant, lines 30173 Missing — needs translation Mechanically a string-literal swap. Must preserve JSON template, taxonomy lists, fallback rules, count constraints, length constraint.
R2 (user message EN) _build_user_message, lines 231275 Missing — needs translation Three string literals: section headings, additional-context block, trailing rules block, plus the truncation notice.
R3 (locale switching) get_language_instruction() call, line 209; trailing English directive, line 210 Constraint Must be preserved verbatim. No new code needed.
R4 (API stability) __init__, generate, generate_python_code, _to_pascal_case, _validate_and_process, MAX_TEXT_LENGTH_FOR_LLM, chat_json(temperature=0.3, max_tokens=4096) Constraint No changes to signatures or constants.
R5 (reasoning-model compat) LLMClient.chat_json (separate file) Constraint Already external; preservation is automatic if chat_json call is untouched.
R6 (graph build parity) Graph build pipeline rooted in graph.py Verification — manual run Requires a sample seed file run; not a code change.
R7 (out-of-scope discipline) Loggers (lines 297, 314, 341), docstrings, comments Constraint Translator must not touch them.

Gaps tagged

  • Missing: prompt content needs human/operator-quality English translation (R1, R2).
  • Constraint: signatures, JSON contract, taxonomy names, locale postfix, LLM-call parameters, comments/docstrings/loggers are immutable in this PR (R3, R4, R5, R7).
  • Verification: locale zh and locale en end-to-end runs to confirm parity (R3, R6).
  • Research Needed: none — locale machinery, JSON contract, and LLM client behaviour are all already understood from reading existing code in this repo.

Complexity signals

  • This is string-literal localization with structural preservation, not feature work. No data model, API, or workflow changes. No external integrations. No new patterns. The risk is content quality, not technical correctness.

3. Implementation Approach Options

Translate ONTOLOGY_SYSTEM_PROMPT and the three Chinese string literals inside _build_user_message directly. No new files, no new abstractions.

  • Minimal diff, easy to review, matches the file's existing style.
  • Preserves the locale-postfix mechanism unchanged (the postfix is what currently steers zh output and will continue to do so under an English base prompt).
  • Aligns with how the analogous i18n issues for sibling files (#3, #4, #5) are framed in the epic.
  • The English base prompt biases the model toward English structure for Chinese locale runs; mitigated by the existing trailing English directive that locks identifier formats and by the per-locale llmInstruction postfix.

Option B — Externalize prompts to locale files

Move ONTOLOGY_SYSTEM_PROMPT content to /locales/en.json and /locales/zh.json and resolve at runtime via t("ontology.system_prompt").

  • Provides parallel zh/en prompts, eliminating cross-locale bias entirely.
  • Out of scope per issue #2 — externalizing log messages is issue #6 and a similar pattern would expand this PR's surface beyond the ticket. Adopting it here would also risk merge conflicts with #6.
  • Adds runtime indirection (file IO, key lookups) for a string that has not been externalized in any other prompt module. Inconsistent with current convention until a future i18n-prompt initiative.
  • Requires authoring high-quality Chinese prompts as locale data, which is exactly what's being moved away from for English-bias reasons.

Option C — Hybrid: translate in place, parameterize the locale postfix

Translate in place per Option A, and additionally factor system_prompt = f"{ONTOLOGY_SYSTEM_PROMPT}\n\n{lang_instruction}\n..." into a small helper.

  • Slightly cleaner.
  • Refactor outside the ticket's scope. Issue #2 is explicit: "No diff to call sites of these prompts — same function signatures and return shapes." A helper would change a private code shape unnecessarily.

4. Effort & Risk

  • Effort: S (1 day) — string-literal translation with structural preservation. The bulk of the time is producing accurate, terminology-faithful English prose for the system prompt's design guidelines.
  • Risk: Low — well-bounded change, no API surface impact, JSON contract preserved by validator code that already exists, no new dependencies. The single residual risk is qualitative (English prompt failing to elicit equivalent ontology quality), mitigated by:
    • The trailing English directive at line 210 already locks identifier formats.
    • _validate_and_process enforces fallback Person / Organization types in code regardless of prompt.
    • Manual verification under both en and zh locales is part of acceptance.

5. Recommendations for Design Phase

  • Preferred approach: Option A — translate ONTOLOGY_SYSTEM_PROMPT and the four user-message string fragments in place. Preserve every code structure around them.
  • Key decisions for design:
    1. Translation style for the system prompt: faithful, terminology-preserving English. Maintain the same section structure (## Core Task Background, ## Output Format, ## Design Guidelines, ## Entity Type Reference, ## Relationship Type Reference). Keep all Chinese-language gloss in the entity reference list intact in spirit but rendered in English (e.g. Student: 学生 becomes Student: a student).
    2. Heading translations for user message: ## 模拟需求## Simulation Requirement; ## 文档内容## Document Content; ## 额外说明## Additional Context.
    3. Truncation notice: render in English, preserve both numeric interpolations ({original_length}, {self.MAX_TEXT_LENGTH_FOR_LLM}).
    4. Trailing rules block: render in English, preserve the five-rule enumeration semantics verbatim, and keep the call to action ("Based on the content above ...").
    5. The trailing English directive at line 210 (IMPORTANT: Entity type names MUST be in English PascalCase ...) is already English; leave it byte-for-byte unchanged.
    6. No code structure changes. No new imports. No changes to signatures, constants, or the chat_json call.
  • Verification plan for design:
    • Static check: zero CJK characters in any prompt string literal post-edit (regex [一-鿿] over the patched constant and the patched method body).
    • Runtime check: under LLM_API_KEY configured to a test provider, run a small OntologyGenerator().generate(...) round-trip with locale en and locale zh, asserting JSON validity and the 10/Person+Organization invariant.
    • End-to-end check: run the Step 1 graph build on a representative seed file with locale en; compare node and edge counts to a recent zh baseline within operator tolerance.
  • Research items: none open. All adjacent systems (locale resolver, LLM client, validator, graph build pipeline) are read-only and behave deterministically with respect to the changes proposed.