Gap Analysis — i18n-ontology-generator-prompts
1. Current State Investigation
Domain assets
- Subject file: `backend/app/services/ontology_generator.py` (507 lines).
- Module-level system prompt: `ONTOLOGY_SYSTEM_PROMPT` (lines 30–173). Chinese, ~140 lines of structured prompt content describing the task, output format, design guidelines, entity reference list, and relationship reference list.
- User-message builder: `OntologyGenerator._build_user_message` (lines 231–275). Chinese section headings, truncation notice, and trailing rules block.
- Locale postfix call site: `get_language_instruction()` is invoked at line 209 and concatenated into the system prompt at line 210, alongside an English directive that locks identifier formats.
- Locale resolver: `backend/app/utils/locale.py` reads `Accept-Language` from request context, falls back to thread-local for background tasks, and ultimately defaults to `zh`. The English postfix lives in `/locales/languages.json` (`llmInstruction`).
- LLM client: `backend/app/utils/llm_client.py:LLMClient.chat_json` performs `<think>` stripping (line 65) and markdown-fence stripping (lines 84–87). This is outside `ontology_generator.py`, so the file does not own that logic; it just consumes it. Requirement R5 is satisfied trivially as long as we keep the `chat_json` call unchanged.
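To make the R5 dependency concrete, here is a minimal sketch of the kind of cleanup described for `chat_json`: dropping `<think>` blocks and a wrapping markdown fence before JSON parsing. It is not the code in `llm_client.py`; the helper names (`strip_reasoning_and_fences`, `parse_llm_json`) and the regexes are assumptions made for illustration.

```python
import json
import re

# Hypothetical names throughout; the real logic lives in LLMClient.chat_json.
FENCE = "`" * 3  # built programmatically so the example nests cleanly in markdown
THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)
OPEN_FENCE = re.compile(r"^" + FENCE + r"[a-zA-Z]*\s*")
CLOSE_FENCE = re.compile(r"\s*" + FENCE + r"$")


def strip_reasoning_and_fences(raw: str) -> str:
    """Remove <think>...</think> blocks and one wrapping markdown fence."""
    text = THINK_BLOCK.sub("", raw).strip()
    text = OPEN_FENCE.sub("", text)   # opening fence, with or without a language tag
    text = CLOSE_FENCE.sub("", text)  # closing fence at the very end
    return text.strip()


def parse_llm_json(raw: str) -> dict:
    """Parse the cleaned response; raises json.JSONDecodeError on bad JSON."""
    return json.loads(strip_reasoning_and_fences(raw))
```

Because this cleanup runs before parsing, the language of the prompt does not affect it, which is why leaving the `chat_json` call untouched is sufficient for R5.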
Call sites (consumers)
- `backend/app/api/graph.py:223–228`: the only production caller. Uses `OntologyGenerator()` with no constructor args, calls `.generate(document_texts, simulation_requirement, additional_context)`, and reads `entity_types`, `edge_types`, `analysis_summary` from the result. The shape contract is what matters; the language of `description` is not parsed.
- `backend/app/services/__init__.py`: re-exports the class.
- No tests currently reference this module (verified via Grep `ontology_generator|OntologyGenerator|ONTOLOGY_SYSTEM_PROMPT`).
Conventions
- 4-space indentation, Python 3.11+, snake_case identifiers, type hints where present (matches surrounding file style).
- No linter/formatter — match existing style; existing file uses Chinese inline comments which are out of scope (issue #7).
- LLM prompts in this codebase are typically defined as module-level string constants and concatenated with `get_language_instruction()` for locale steering.
- Variable interpolation in user messages uses Python f-strings; the system prompt uses no interpolation today.
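The prompt convention above can be sketched as follows. The constant body, the postfix strings, and the `locale` parameter on the stub are placeholders (the real `get_language_instruction()` resolves the locale itself and reads `llmInstruction` from `/locales/languages.json`); only the concatenation shape is taken from this analysis. The wrapper function exists to make the sketch testable and is not a proposed refactor.

```python
# Stand-in for the module-level constant; the real one spans lines 30-173.
ONTOLOGY_SYSTEM_PROMPT = "You are an ontology design expert. Return valid JSON only."

# Hypothetical postfix table standing in for /locales/languages.json.
_LLM_INSTRUCTIONS = {
    "en": "Please respond in English.",
    "zh": "请使用中文回答。",
}


def get_language_instruction(locale: str = "zh") -> str:
    """Stub with an explicit locale argument; unknown locales default to zh."""
    return _LLM_INSTRUCTIONS.get(locale, _LLM_INSTRUCTIONS["zh"])


def build_system_prompt(locale: str) -> str:
    """Base prompt, then the locale postfix, then the fixed English directive."""
    lang_instruction = get_language_instruction(locale)
    return (
        f"{ONTOLOGY_SYSTEM_PROMPT}\n\n{lang_instruction}\n"
        # Directive text is elided here exactly as it is elided in this document.
        "IMPORTANT: Entity type names MUST be in English PascalCase ..."
    )
```

The base prompt contains no interpolation fields, so translating it changes only the first segment; the postfix and the directive are appended identically for every locale.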
Integration surfaces
- Output JSON schema (`entity_types[]`, `edge_types[]`, `analysis_summary`) is consumed by `_validate_and_process` (also in this file) and by Graphiti via the project's `ontology` field (set in `graph.py:235`).
- Reserved attribute names list (`name`, `uuid`, `group_id`, `created_at`, `summary`) is asserted in the prompt for the LLM to obey, not enforced by code in this file.
- Entity/edge fallback rules (`Person`, `Organization`) are both prompted and enforced by `_validate_and_process` lines 344–393. Code is the safety net; prompt is the steering.
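The "code is the safety net" point can be illustrated with a small sketch. It is not the body of `_validate_and_process`; the function names, the fallback descriptions, and the assumed entity shape (`name`, `description`, `attributes` with a `name` key) are illustrative. The reserved-name scan in particular is a hypothetical verification aid, since the file itself does not enforce that list.

```python
RESERVED_ATTRIBUTE_NAMES = {"name", "uuid", "group_id", "created_at", "summary"}
FALLBACK_ENTITY_TYPES = ("Person", "Organization")


def ensure_fallback_types(entity_types: list[dict]) -> list[dict]:
    """Return a copy of the list with any missing fallback type appended."""
    present = {entity.get("name") for entity in entity_types}
    result = list(entity_types)
    for fallback in FALLBACK_ENTITY_TYPES:
        if fallback not in present:
            result.append(
                {
                    "name": fallback,
                    "description": f"Fallback type: {fallback}",  # placeholder text
                    "attributes": [],
                }
            )
    return result


def find_reserved_attributes(entity_types: list[dict]) -> list[tuple[str, str]]:
    """List (entity name, attribute name) pairs that use a reserved name."""
    hits = []
    for entity in entity_types:
        for attribute in entity.get("attributes", []):
            if attribute.get("name") in RESERVED_ATTRIBUTE_NAMES:
                hits.append((entity.get("name"), attribute.get("name")))
    return hits
```

A check like `find_reserved_attributes` could be run against `en` and `zh` outputs during verification to see whether the translated prompt still steers the model away from reserved names.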
2. Requirements → Asset Map
| Requirement | Existing Asset | Gap Type | Notes |
|---|---|---|---|
| R1 (system prompt EN) | `ONTOLOGY_SYSTEM_PROMPT` constant, lines 30–173 | Missing: needs translation | Mechanically a string-literal swap. Must preserve JSON template, taxonomy lists, fallback rules, count constraints, length constraint. |
| R2 (user message EN) | `_build_user_message`, lines 231–275 | Missing: needs translation | Three string literals (section headings, additional-context block, trailing rules block) plus the truncation notice. |
| R3 (locale switching) | `get_language_instruction()` call, line 209; trailing English directive, line 210 | Constraint | Must be preserved verbatim. No new code needed. |
| R4 (API stability) | `__init__`, `generate`, `generate_python_code`, `_to_pascal_case`, `_validate_and_process`, `MAX_TEXT_LENGTH_FOR_LLM`, `chat_json(temperature=0.3, max_tokens=4096)` | Constraint | No changes to signatures or constants. |
| R5 (reasoning-model compat) | `LLMClient.chat_json` (separate file) | Constraint | Already external; preservation is automatic if the `chat_json` call is untouched. |
| R6 (graph build parity) | Graph build pipeline rooted in `graph.py` | Verification: manual run | Requires a sample seed file run; not a code change. |
| R7 (out-of-scope discipline) | Loggers (lines 297, 314, 341), docstrings, comments | Constraint | Translator must not touch them. |
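For R2, a rough sketch of what an English user-message builder might look like. This is not the existing `_build_user_message`: the document joiner, the placeholder value of `MAX_TEXT_LENGTH_FOR_LLM`, and the exact truncation wording are assumptions, and the five trailing rules are deliberately left elided because their text is not reproduced in this analysis.

```python
from typing import Optional

# Placeholder value; the real MAX_TEXT_LENGTH_FOR_LLM is a class constant.
MAX_TEXT_LENGTH_FOR_LLM = 50_000


def build_user_message(
    document_texts: list[str],
    simulation_requirement: str,
    additional_context: Optional[str] = None,
    max_length: int = MAX_TEXT_LENGTH_FOR_LLM,
) -> str:
    """Assemble the user message with English headings and truncation notice."""
    combined_text = "\n\n---\n\n".join(document_texts)  # joiner is an assumption
    original_length = len(combined_text)

    if original_length > max_length:
        combined_text = combined_text[:max_length]
        # Both numeric interpolations survive the translation.
        combined_text += (
            f"\n\n...(original text is {original_length} characters; "
            f"only the first {max_length} are included)..."
        )

    message = (
        f"## Simulation Requirement\n\n{simulation_requirement}\n\n"
        f"## Document Content\n\n{combined_text}\n"
    )
    if additional_context:
        message += f"\n## Additional Context\n\n{additional_context}\n"

    # Trailing rules block elided; only the call to action is shown.
    message += "\nBased on the content above ..."
    return message
```

The control flow is identical to a Chinese-language version; only the literals differ, which is what keeps the change a string swap.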
Gaps tagged
- Missing: prompt content needs human/operator-quality English translation (R1, R2).
- Constraint: signatures, JSON contract, taxonomy names, locale postfix, LLM-call parameters, comments/docstrings/loggers are immutable in this PR (R3, R4, R5, R7).
- Verification: locale `zh` and locale `en` end-to-end runs to confirm parity (R3, R6).
- Research Needed: none. Locale machinery, JSON contract, and LLM client behaviour are all already understood from reading existing code in this repo.
Complexity signals
- This is string-literal localization with structural preservation, not feature work. No data model, API, or workflow changes. No external integrations. No new patterns. The risk is content quality, not technical correctness.
3. Implementation Approach Options
Option A — In-place translation of the existing constant and method (recommended)
Translate ONTOLOGY_SYSTEM_PROMPT and the three Chinese string literals inside _build_user_message directly. No new files, no new abstractions.
- ✅ Minimal diff, easy to review, matches the file's existing style.
- ✅ Preserves the locale-postfix mechanism unchanged (the postfix is what currently steers `zh` output and will continue to do so under an English base prompt).
- ✅ Aligns with how the analogous i18n issues for sibling files (#3, #4, #5) are framed in the epic.
- ❌ The English base prompt biases the model toward English structure for Chinese locale runs; mitigated by the existing trailing English directive that locks identifier formats and by the per-locale `llmInstruction` postfix.
Option B — Externalize prompts to locale files
Move ONTOLOGY_SYSTEM_PROMPT content to /locales/en.json and /locales/zh.json and resolve at runtime via t("ontology.system_prompt").
- ✅ Provides parallel zh/en prompts, eliminating cross-locale bias entirely.
- ❌ Out of scope per issue #2 — externalizing log messages is issue #6 and a similar pattern would expand this PR's surface beyond the ticket. Adopting it here would also risk merge conflicts with #6.
- ❌ Adds runtime indirection (file IO, key lookups) for a string that has not been externalized in any other prompt module. Inconsistent with current convention until a future i18n-prompt initiative.
- ❌ Requires authoring high-quality Chinese prompts as locale data, which is exactly what's being moved away from for English-bias reasons.
Option C — Hybrid: translate in place, parameterize the locale postfix
Translate in place per Option A, and additionally factor system_prompt = f"{ONTOLOGY_SYSTEM_PROMPT}\n\n{lang_instruction}\n..." into a small helper.
- ✅ Slightly cleaner: system-prompt assembly would live in one named helper.
- ❌ Refactor outside the ticket's scope. Issue #2 is explicit: "No diff to call sites of these prompts — same function signatures and return shapes." A helper would change a private code shape unnecessarily.
4. Effort & Risk
- Effort: S (1 day) — string-literal translation with structural preservation. The bulk of the time is producing accurate, terminology-faithful English prose for the system prompt's design guidelines.
- Risk: Low — well-bounded change, no API surface impact, JSON contract preserved by validator code that already exists, no new dependencies. The single residual risk is qualitative (English prompt failing to elicit equivalent ontology quality), mitigated by:
  - The trailing English directive at line 210 already locks identifier formats.
  - `_validate_and_process` enforces fallback `Person`/`Organization` types in code regardless of prompt.
  - Manual verification under both `en` and `zh` locales is part of acceptance.
5. Recommendations for Design Phase
- Preferred approach: Option A. Translate `ONTOLOGY_SYSTEM_PROMPT` and the four user-message string fragments in place. Preserve every code structure around them.
- Key decisions for design:
  - Translation style for the system prompt: faithful, terminology-preserving English. Maintain the same section structure (`## Core Task Background`, `## Output Format`, `## Design Guidelines`, `## Entity Type Reference`, `## Relationship Type Reference`). Keep all Chinese-language gloss in the entity reference list intact in spirit but rendered in English (e.g. `Student: 学生` becomes `Student: a student`).
  - Heading translations for user message: `## 模拟需求` → `## Simulation Requirement`; `## 文档内容` → `## Document Content`; `## 额外说明` → `## Additional Context`.
  - Truncation notice: render in English, preserve both numeric interpolations (`{original_length}`, `{self.MAX_TEXT_LENGTH_FOR_LLM}`).
  - Trailing rules block: render in English, preserve the five-rule enumeration semantics verbatim, and keep the call to action ("Based on the content above ...").
  - The trailing English directive at line 210 (`IMPORTANT: Entity type names MUST be in English PascalCase ...`) is already English; leave it byte-for-byte unchanged.
  - No code structure changes. No new imports. No changes to signatures, constants, or the `chat_json` call.
- Verification plan for design:
  - Static check: zero CJK characters in any prompt string literal post-edit (regex `[一-鿿]` over the patched constant and the patched method body).
  - Runtime check: under `LLM_API_KEY` configured to a test provider, run a small `OntologyGenerator().generate(...)` round-trip with locale `en` and locale `zh`, asserting JSON validity and the 10/Person+Organization invariant.
  - End-to-end check: run the Step 1 graph build on a representative seed file with locale `en`; compare node and edge counts to a recent `zh` baseline within operator tolerance.
- Research items: none open. All adjacent systems (locale resolver, LLM client, validator, graph build pipeline) are read-only and behave deterministically with respect to the changes proposed.
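The static check in the verification plan could be scripted along these lines. This is a sketch with assumed helper names (`find_cjk_lines`, `assert_no_cjk`); the character class is the same range as `[一-鿿]`, written with escapes (U+4E00 to U+9FFF) so the source stays ASCII. It operates on prompt strings passed in directly, so docstrings and comments that remain in Chinese are not flagged.

```python
import re

# Same range as the regex in the verification plan: U+4E00 through U+9FFF.
CJK_PATTERN = re.compile("[\u4e00-\u9fff]")


def find_cjk_lines(text: str) -> list[tuple[int, str]]:
    """Return (1-based line number, line) for every line containing CJK."""
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if CJK_PATTERN.search(line)
    ]


def assert_no_cjk(label: str, text: str) -> None:
    """Raise AssertionError naming the offending lines, if any."""
    offenders = find_cjk_lines(text)
    assert not offenders, f"{label} still contains CJK text: {offenders}"
```

In practice the inputs would be `ONTOLOGY_SYSTEM_PROMPT` and a rendered user message, for example `assert_no_cjk("system prompt", ONTOLOGY_SYSTEM_PROMPT)`.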