# Design Document — i18n-ontology-generator-prompts ## Overview **Purpose**: Translate the Chinese prompt strings in `backend/app/services/ontology_generator.py` (the system prompt constant and the user-message template) to English while preserving every functional contract — JSON output schema, taxonomy lists, reserved-attribute names, fallback rules, variable interpolations, and the `get_language_instruction()` locale-postfix mechanism. The goal is to remove the Chinese-language base-prompt bias that currently leaks Chinese structure and word choice into ontology output even when `Accept-Language: en`. **Users**: MiroFish operators running the Step 1 graph-build pipeline under any locale; downstream developers consuming the JSON ontology emitted by `OntologyGenerator.generate(...)`. **Impact**: Replaces approximately one large module-level string constant and four embedded string literals with English equivalents. No API surface change. No new dependencies. No new files. The single production caller (`backend/app/api/graph.py:223–228`) and all consumers of the validator output are unaffected. ### Goals - Zero CJK characters in any prompt string literal contributed by `ontology_generator.py` to the system prompt or the user message. - English ontology descriptions and `analysis_summary` under `Accept-Language: en`. - Continued Chinese descriptions and `analysis_summary` under `Accept-Language: zh`, of equivalent quality to the pre-change behaviour. - No diff to public signatures, constants, LLM-call parameters, or call sites. ### Non-Goals - Externalizing prompts to `/locales/*.json` (out of scope per ticket). - Translating logger calls in this file (covered by issue #6). - Translating module/class/method docstrings or inline comments in this file (covered by issue #7). - Refactoring the ontology JSON schema, the validator, or the extraction flow. - Changing the entity-type or relationship-type reference taxonomies. - Modifying `backend/app/utils/locale.py`, the locale registries, or any non-target file. ## Boundary Commitments ### This Spec Owns - The English content of `ONTOLOGY_SYSTEM_PROMPT` (module-level constant in `backend/app/services/ontology_generator.py`). - The English content of the four string literals embedded in `OntologyGenerator._build_user_message`: section headings, additional-context block, trailing rules block, and truncation notice. ### Out of Boundary - Locale resolution machinery (`backend/app/utils/locale.py`). - Per-locale `llmInstruction` definitions (`/locales/languages.json`). - Reasoning-model output stripping (`backend/app/utils/llm_client.py`). - Logger calls and `logger.warning` strings inside `ontology_generator.py` (issue #6). - Module/class/method docstrings and inline comments inside `ontology_generator.py` (issue #7). - The entity / edge taxonomy itself; only its descriptive prose changes language. - All callers of `OntologyGenerator`, including `backend/app/api/graph.py`. - Tests, scripts, and frontend code. ### Allowed Dependencies - Existing `get_language_instruction()` import from `..utils.locale` (already imported; unchanged). - Existing `LLMClient.chat_json` invocation (unchanged). - No new imports. ### Revalidation Triggers The following changes elsewhere would invalidate this design and require revisiting the prompt: - A change to the JSON contract emitted by the LLM (`entity_types`, `edge_types`, `analysis_summary` keys or sub-keys). - A change to `_validate_and_process` invariants (10 entity types, fallback `Person`/`Organization`, `MAX_*` caps, description length). - A change to `get_language_instruction()` semantics or the per-locale `llmInstruction` strings. - A change to the reasoning-model output stripping in `LLMClient.chat`/`chat_json`. ## Architecture ### Existing Architecture Analysis `OntologyGenerator` lives in `backend/app/services/`, follows the in-process service pattern (no IO besides the LLM call), and is invoked synchronously from `backend/app/api/graph.py` inside a background `Task`. It depends on `LLMClient` for transport and on `get_language_instruction()` for locale steering. The relevant flow is: 1. The Flask handler resolves the request locale via `Accept-Language`; locale is set via `set_locale()` for the background thread. 2. `OntologyGenerator.generate()` builds a user message from inputs, prepends the (currently Chinese) system prompt with the locale postfix and the English identifier-format directive, calls `chat_json`, then runs the response through `_validate_and_process`. 3. The validator self-heals invariants (count, fallback types, length, deduplication). This design preserves all of the above. The change is purely lexical inside two regions of one file. ### Architecture Pattern & Boundary Map ```mermaid graph TB Caller[graph.py handler] Generator[OntologyGenerator] Validator[_validate_and_process] Locale[locale.get_language_instruction] Client[LLMClient.chat_json] Caller -->|generate inputs| Generator Generator -->|read locale postfix| Locale Generator -->|JSON request| Client Client -->|raw JSON| Generator Generator -->|self-heal invariants| Validator Validator -->|validated ontology| Caller ``` **Architecture Integration**: - Selected pattern: **In-place lexical translation** of two regions of an existing service. No structural change. - Domain/feature boundaries: locale machinery vs. service prompt vs. transport stripping remain cleanly separated. - Existing patterns preserved: prompt-as-constant; `f"..."` user-message construction; locale-postfix concatenation; validator self-healing. - New components rationale: none — no new components. - Steering compliance: matches `tech.md` ("translate keys, not raw log lines, when adding new logs that surface to users") for what is in-scope here, and respects the steering note that "existing files mix English and Chinese in comments/docstrings — preserve both; do not translate one into the other unless asked." This ticket is the explicit ask for the prompt strings, scoped to exclude comments/docstrings. ### Technology Stack | Layer | Choice / Version | Role in Feature | Notes | |-------|------------------|-----------------|-------| | Backend / Services | Python 3.11+ | Hosts `OntologyGenerator` | Existing — unchanged. | | Backend / Services | `openai` SDK via `LLMClient` | Issues the prompt; performs `` and fence stripping | Existing — unchanged. | | Backend / Services | `backend/app/utils/locale.py` | Resolves `Accept-Language` → `llmInstruction` postfix | Existing — unchanged. | No new dependencies. No version changes. ## File Structure Plan ### Modified Files - `backend/app/services/ontology_generator.py` — Replace the body of `ONTOLOGY_SYSTEM_PROMPT` with an English translation; replace the four Chinese string fragments in `_build_user_message` with English equivalents; preserve every other character of the file. No new files. No deletions. No moves. ## System Flows The control-flow diagram in *Architecture Pattern & Boundary Map* covers the relevant flow; no additional diagrams are needed for this string-literal change. ## Requirements Traceability | Requirement | Summary | Components | Interfaces | Flows | |-------------|---------|------------|------------|-------| | 1.1 | Zero Chinese in `ONTOLOGY_SYSTEM_PROMPT` | OntologyGenerator → `ONTOLOGY_SYSTEM_PROMPT` | None changed | n/a | | 1.2 | Preserve JSON output keys | OntologyGenerator → prompt template region | LLM JSON contract | Architecture diagram | | 1.3 | Preserve entity-type reference list verbatim | OntologyGenerator → prompt reference list | Prompt-only | n/a | | 1.4 | Preserve relationship-type reference list verbatim | OntologyGenerator → prompt reference list | Prompt-only | n/a | | 1.5 | Preserve reserved attribute names | OntologyGenerator → prompt rules region | Prompt-only | n/a | | 1.6 | Preserve fallback rule (Person, Organization) | OntologyGenerator → prompt + validator | Validator self-healing | n/a | | 1.7 | Preserve count constraints | OntologyGenerator → prompt + validator | Validator self-healing | n/a | | 1.8 | Preserve description-length constraint | OntologyGenerator → prompt + validator | Validator self-healing | n/a | | 2.1 | English section headings in user message | OntologyGenerator → `_build_user_message` | None changed | n/a | | 2.2 | English trailing rules block | OntologyGenerator → `_build_user_message` | None changed | n/a | | 2.3 | English truncation notice | OntologyGenerator → `_build_user_message` | None changed | n/a | | 2.4 | Variable interpolations preserved | OntologyGenerator → `_build_user_message` | f-string interpolation | n/a | | 2.5 | Conditional additional-context block preserved | OntologyGenerator → `_build_user_message` | Python conditional | n/a | | 2.6 | Zero Chinese in user message | OntologyGenerator → `_build_user_message` | n/a | n/a | | 3.1 | Postfix call site preserved | OntologyGenerator → `generate` line ~209 | `get_language_instruction()` | Architecture diagram | | 3.2 | English identifier-format directive preserved | OntologyGenerator → system_prompt assembly | Prompt-only | n/a | | 3.3 | `zh` locale produces Chinese output | OntologyGenerator + Locale | `get_language_instruction()` | Architecture diagram | | 3.4 | `en` locale produces English output | OntologyGenerator + Locale | `get_language_instruction()` | Architecture diagram | | 3.5 | No edits to locale module or registries | n/a (boundary commitment) | n/a | n/a | | 4.1–4.7 | API and constant stability | OntologyGenerator (signatures, constants) | Public surface | n/a | | 5.1–5.4 | Reasoning-model compatibility | OntologyGenerator → `chat_json` call | LLMClient.chat_json | Architecture diagram | | 6.1–6.3 | Step 1 graph-build parity | Validation runs (manual) | n/a | n/a | | 7.1–7.4 | Out-of-scope surfaces untouched | OntologyGenerator (boundary commitment) | n/a | n/a | ## Components and Interfaces | Component | Domain/Layer | Intent | Req Coverage | Key Dependencies (P0/P1) | Contracts | |-----------|--------------|--------|--------------|--------------------------|-----------| | OntologyGenerator (modified) | Backend / Service | Render English ontology-generation prompts; preserve all behaviour | 1.1–1.8, 2.1–2.6, 3.1–3.5, 4.1–4.7, 5.1–5.4, 7.1–7.4 | LLMClient.chat_json (P0), get_language_instruction (P0), `_validate_and_process` (P0) | Service | ### Backend / Service #### OntologyGenerator (modified) | Field | Detail | |-------|--------| | Intent | Translate prompt strings to English while preserving every functional contract. | | Requirements | 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 3.1, 3.2, 3.3, 3.4, 3.5, 4.1, 4.2, 4.3, 4.4, 4.5, 4.6, 4.7, 5.1, 5.2, 5.3, 5.4, 7.1, 7.2, 7.3, 7.4 | **Responsibilities & Constraints** - Owns: the English wording of `ONTOLOGY_SYSTEM_PROMPT` and the four user-message string fragments. - Domain boundary: prompt content only. Does not own locale resolution, transport, or validation logic. - Invariants: - `ONTOLOGY_SYSTEM_PROMPT` after translation MUST contain zero CJK characters. - The translated system prompt MUST present the same JSON template by key (`entity_types`, `edge_types`, `analysis_summary`; entity sub-keys `name`, `description`, `attributes`, `examples`; edge sub-keys `name`, `description`, `source_targets`, `attributes`; `source_targets` sub-keys `source`, `target`). - The translated system prompt MUST list the same entity-type names verbatim: `Student`, `Professor`, `Journalist`, `Celebrity`, `Executive`, `Official`, `Lawyer`, `Doctor`, `Person`, `University`, `Company`, `GovernmentAgency`, `MediaOutlet`, `Hospital`, `School`, `NGO`, `Organization`. - The translated system prompt MUST list the same relationship-type names verbatim: `WORKS_FOR`, `STUDIES_AT`, `AFFILIATED_WITH`, `REPRESENTS`, `REGULATES`, `REPORTS_ON`, `COMMENTS_ON`, `RESPONDS_TO`, `SUPPORTS`, `OPPOSES`, `COLLABORATES_WITH`, `COMPETES_WITH`. - The translated system prompt MUST list the same reserved attribute names verbatim: `name`, `uuid`, `group_id`, `created_at`, `summary`. - The translated system prompt MUST express the same numeric constraints: exactly 10 entity types, with the last 2 being `Person` and `Organization` fallbacks; 6–10 relationship types; 1–3 attributes per entity; description ≤ 100 characters. - The translated user message MUST preserve all f-string interpolations: `{simulation_requirement}`, `{combined_text}`, `{additional_context}`, `{original_length}`, `{self.MAX_TEXT_LENGTH_FOR_LLM}`. - The translated user message MUST conditionally include the `## Additional Context` block only when `additional_context` is truthy. - The call to `get_language_instruction()` MUST remain at its current location with its current return-value usage. - The trailing English identifier-format directive (`IMPORTANT: Entity type names MUST be in English PascalCase ...`) MUST remain byte-for-byte identical. - The call to `self.llm_client.chat_json(messages=messages, temperature=0.3, max_tokens=4096)` MUST remain unchanged. - All public signatures, the constant `MAX_TEXT_LENGTH_FOR_LLM`, and the private helpers `_to_pascal_case` and `_validate_and_process` MUST remain unchanged. - All `logger.warning(...)` calls and inline comments and docstrings in this file MUST remain unchanged (out of scope per #6 and #7). **Dependencies** - Inbound: `backend/app/api/graph.py:223–228` — sole production caller (P0). - Outbound: `backend/app/utils/locale.get_language_instruction` — locale postfix (P0). `backend/app/utils/llm_client.LLMClient.chat_json` — JSON LLM transport with stripping (P0). - External: none. **Contracts**: Service [x] / API [ ] / Event [ ] / Batch [ ] / State [ ] ##### Service Interface The public Python interface is unchanged: ```python class OntologyGenerator: def __init__(self, llm_client: Optional[LLMClient] = None) -> None: ... def generate( self, document_texts: List[str], simulation_requirement: str, additional_context: Optional[str] = None, ) -> Dict[str, Any]: ... def generate_python_code(self, ontology: Dict[str, Any]) -> str: ... ``` - Preconditions: `document_texts` is a non-empty list of strings; `simulation_requirement` is a non-empty string; locale is resolvable via the existing chain. - Postconditions: `generate()` returns a dict with `entity_types` (length ≤ 10, ending in `Person` and `Organization`), `edge_types` (length ≤ 10), and `analysis_summary` (string). - Invariants: see *Responsibilities & Constraints*. **Implementation Notes** - **Integration**: No new imports. No call-site changes. The only diff is the body of `ONTOLOGY_SYSTEM_PROMPT` and four string literals inside `_build_user_message`. - **Validation**: After implementation, run a targeted regex check (`[一-鿿]` over `ONTOLOGY_SYSTEM_PROMPT` and the relevant lines of `_build_user_message`) to confirm zero CJK in those literals. Run a manual round-trip via `OntologyGenerator().generate(...)` under both `en` and `zh` locales using a small seed text and assert: valid JSON, exactly 10 entity types ending in `Person` and `Organization`, descriptions in the expected language. Optionally run end-to-end Step 1 graph build on a representative seed file under `en` and compare node/edge counts to a recent `zh` baseline. - **Risks**: English-base bias on Chinese-locale output (mitigated by the `llmInstruction` postfix and the trailing English directive that locks identifier formats). Validator self-healing covers structural drift independent of prompt language. ## Data Models No data-model changes. The JSON schema emitted by the LLM and consumed by `_validate_and_process` is preserved verbatim. ## Error Handling ### Error Strategy Error handling is unchanged from the existing implementation: - LLM transport errors propagate from `LLMClient.chat_json` (raises on failure modes the SDK exposes). - Invalid JSON from the LLM raises `ValueError("LLM返回的JSON格式无效: ...")` from `chat_json`. Note: the error message itself is in `llm_client.py` and is out of scope for this spec (issue #6). - Validator self-healing handles structural drift (missing fallbacks, count overflows, invalid attribute reservations). ### Error Categories and Responses - **User errors (4xx)**: not applicable at this layer; surfaced by the API handler. - **System errors (5xx)**: LLM/network failures propagate to the API handler, which converts them to JSON error responses. - **Business logic errors**: structurally invalid ontology output is auto-corrected by `_validate_and_process` to satisfy the 10-type / fallback / length invariants. ### Monitoring Existing `logger.warning` and `logger.info` calls already log auto-conversions and final counts; no new monitoring is added. ## Testing Strategy ### Unit Tests Given the project's intentionally minimal test harness (`backend/scripts/test_profile_format.py` only, per `tech.md`), introducing a heavy new test suite is out of scope. Instead, two lightweight checks accompany the change: - **Static check**: a regex assertion in a small ad-hoc script (or a one-shot `python -c`) confirming that `ONTOLOGY_SYSTEM_PROMPT` and the patched literals in `_build_user_message` contain zero characters in `[一-鿿]`. This can be a permanent simple test under `backend/scripts/` if desired or a one-off check during PR review. - **Round-trip smoke test**: a manual run of `OntologyGenerator().generate(...)` against a configured LLM, locale `en`, with a small seed text. Assert: dict shape, entity-types length 10 ending in `Person`/`Organization`, description fields contain no `[一-鿿]`. Repeat under locale `zh` and assert description fields contain at least some `[一-鿿]` (sanity check that the postfix still steers Chinese output). ### Integration Tests - **Step 1 graph build under EN locale**: run the full pipeline end-to-end with a representative seed file under `Accept-Language: en`. Assert: pipeline completes without exception, ontology validates, node/edge counts in Neo4j are within operator-acceptable tolerance of a recent `zh` baseline. This is documented as an operator-run verification step in the PR description; automation is not required. ### E2E/UI Tests Not applicable — change does not affect frontend. ### Performance/Load Not applicable — change does not alter performance characteristics. LLM call parameters (`temperature=0.3`, `max_tokens=4096`) are unchanged. ## Optional Sections ### Security Considerations Not applicable. Translation does not introduce new authentication, authorization, data-handling, or input-validation paths. Reserved attribute names remain enforced via prompt and validator. ### Performance & Scalability Not applicable. Prompt token counts may differ slightly between Chinese and English renderings, but well within the existing `max_tokens=4096` budget. ### Migration Strategy Not applicable. The change is a single in-place edit; no data migration. Rollback is `git revert`. ## Supporting References - `backend/app/services/ontology_generator.py` — current Chinese prompt content (the source of translation). - `backend/app/utils/locale.py` — locale resolver. - `backend/app/utils/llm_client.py` — `chat_json` and `` / fence stripping. - `backend/app/api/graph.py:223–228` — sole production caller. - `.kiro/specs/i18n-ontology-generator-prompts/research.md` — discovery findings, alternatives evaluation, and design decisions. - `.ticket/2.md` — ticket snapshot.