MicroFish/.kiro/specs/i18n-ontology-generator-pro.../design.md

# Design Document — i18n-ontology-generator-prompts

## Overview

**Purpose**: Translate the Chinese prompt strings in `backend/app/services/ontology_generator.py` (the system prompt constant and the user-message template) to English while preserving every functional contract — JSON output schema, taxonomy lists, reserved-attribute names, fallback rules, variable interpolations, and the `get_language_instruction()` locale-postfix mechanism. The goal is to remove the Chinese-language base-prompt bias that currently leaks Chinese structure and word choice into ontology output even when `Accept-Language: en`.

**Users**: MiroFish operators running the Step 1 graph-build pipeline under any locale; downstream developers consuming the JSON ontology emitted by `OntologyGenerator.generate(...)`.

**Impact**: Replaces approximately one large module-level string constant and four embedded string literals with English equivalents. No API surface change. No new dependencies. No new files. The single production caller (`backend/app/api/graph.py:223–228`) and all consumers of the validator output are unaffected.

### Goals

- Zero CJK characters in any prompt string literal contributed by `ontology_generator.py` to the system prompt or the user message.
- English ontology descriptions and `analysis_summary` under `Accept-Language: en`.
- Continued Chinese descriptions and `analysis_summary` under `Accept-Language: zh`, of equivalent quality to the pre-change behaviour.
- No diff to public signatures, constants, LLM-call parameters, or call sites.

### Non-Goals

- Externalizing prompts to `/locales/*.json` (out of scope per ticket).
- Translating logger calls in this file (covered by issue #6).
- Translating module/class/method docstrings or inline comments in this file (covered by issue #7).
- Refactoring the ontology JSON schema, the validator, or the extraction flow.
- Changing the entity-type or relationship-type reference taxonomies.
- Modifying `backend/app/utils/locale.py`, the locale registries, or any non-target file.

## Boundary Commitments

### This Spec Owns

- The English content of `ONTOLOGY_SYSTEM_PROMPT` (module-level constant in `backend/app/services/ontology_generator.py`).
- The English content of the four string literals embedded in `OntologyGenerator._build_user_message`: section headings, additional-context block, trailing rules block, and truncation notice.

### Out of Boundary

- Locale resolution machinery (`backend/app/utils/locale.py`).
- Per-locale `llmInstruction` definitions (`/locales/languages.json`).
- Reasoning-model output stripping (`backend/app/utils/llm_client.py`).
- Logger calls and `logger.warning` strings inside `ontology_generator.py` (issue #6).
- Module/class/method docstrings and inline comments inside `ontology_generator.py` (issue #7).
- The entity / edge taxonomy itself; only its descriptive prose changes language.
- All callers of `OntologyGenerator`, including `backend/app/api/graph.py`.
- Tests, scripts, and frontend code.

### Allowed Dependencies

- Existing `get_language_instruction()` import from `..utils.locale` (already imported; unchanged).
- Existing `LLMClient.chat_json` invocation (unchanged).
- No new imports.

### Revalidation Triggers

The following changes elsewhere would invalidate this design and require revisiting the prompt:

- A change to the JSON contract emitted by the LLM (`entity_types`, `edge_types`, `analysis_summary` keys or sub-keys).
- A change to `_validate_and_process` invariants (10 entity types, fallback `Person`/`Organization`, `MAX_*` caps, description length).
- A change to `get_language_instruction()` semantics or the per-locale `llmInstruction` strings.
- A change to the reasoning-model output stripping in `LLMClient.chat`/`chat_json`.

## Architecture

### Existing Architecture Analysis

`OntologyGenerator` lives in `backend/app/services/`, follows the in-process service pattern (no IO besides the LLM call), and is invoked synchronously from `backend/app/api/graph.py` inside a background `Task`. It depends on `LLMClient` for transport and on `get_language_instruction()` for locale steering. The relevant flow is:

1. The Flask handler resolves the request locale via `Accept-Language`; locale is set via `set_locale()` for the background thread.
2. `OntologyGenerator.generate()` builds a user message from inputs, prepends the (currently Chinese) system prompt with the locale postfix and the English identifier-format directive, calls `chat_json`, then runs the response through `_validate_and_process`.
3. The validator self-heals invariants (count, fallback types, length, deduplication).

This design preserves all of the above. The change is purely lexical inside two regions of one file.

### Architecture Pattern & Boundary Map

```mermaid
graph TB
    Caller[graph.py handler]
    Generator[OntologyGenerator]
    Validator[_validate_and_process]
    Locale[locale.get_language_instruction]
    Client[LLMClient.chat_json]

    Caller -->|generate inputs| Generator
    Generator -->|read locale postfix| Locale
    Generator -->|JSON request| Client
    Client -->|raw JSON| Generator
    Generator -->|self-heal invariants| Validator
    Validator -->|validated ontology| Caller
```

**Architecture Integration**:

- Selected pattern: **In-place lexical translation** of two regions of an existing service. No structural change.
- Domain/feature boundaries: locale machinery vs. service prompt vs. transport stripping remain cleanly separated.
- Existing patterns preserved: prompt-as-constant; `f"..."` user-message construction; locale-postfix concatenation; validator self-healing.
- New components rationale: none — no new components.
- Steering compliance: matches `tech.md` ("translate keys, not raw log lines, when adding new logs that surface to users") for what is in-scope here, and respects the steering note that "existing files mix English and Chinese in comments/docstrings — preserve both; do not translate one into the other unless asked." This ticket is the explicit ask for the prompt strings, scoped to exclude comments/docstrings.

### Technology Stack

| Layer | Choice / Version | Role in Feature | Notes |
|-------|------------------|-----------------|-------|
| Backend / Services | Python 3.11+ | Hosts `OntologyGenerator` | Existing — unchanged. |
| Backend / Services | `openai` SDK via `LLMClient` | Issues the prompt; performs `<think>` and fence stripping | Existing — unchanged. |
| Backend / Services | `backend/app/utils/locale.py` | Resolves `Accept-Language` → `llmInstruction` postfix | Existing — unchanged. |

No new dependencies. No version changes.

## File Structure Plan

### Modified Files

- `backend/app/services/ontology_generator.py` — Replace the body of `ONTOLOGY_SYSTEM_PROMPT` with an English translation; replace the four Chinese string fragments in `_build_user_message` with English equivalents; preserve every other character of the file.

No new files. No deletions. No moves.

## System Flows

The control-flow diagram in *Architecture Pattern & Boundary Map* covers the relevant flow; no additional diagrams are needed for this string-literal change.

## Requirements Traceability

| Requirement | Summary | Components | Interfaces | Flows |
|-------------|---------|------------|------------|-------|
| 1.1 | Zero Chinese in `ONTOLOGY_SYSTEM_PROMPT` | OntologyGenerator → `ONTOLOGY_SYSTEM_PROMPT` | None changed | n/a |
| 1.2 | Preserve JSON output keys | OntologyGenerator → prompt template region | LLM JSON contract | Architecture diagram |
| 1.3 | Preserve entity-type reference list verbatim | OntologyGenerator → prompt reference list | Prompt-only | n/a |
| 1.4 | Preserve relationship-type reference list verbatim | OntologyGenerator → prompt reference list | Prompt-only | n/a |
| 1.5 | Preserve reserved attribute names | OntologyGenerator → prompt rules region | Prompt-only | n/a |
| 1.6 | Preserve fallback rule (Person, Organization) | OntologyGenerator → prompt + validator | Validator self-healing | n/a |
| 1.7 | Preserve count constraints | OntologyGenerator → prompt + validator | Validator self-healing | n/a |
| 1.8 | Preserve description-length constraint | OntologyGenerator → prompt + validator | Validator self-healing | n/a |
| 2.1 | English section headings in user message | OntologyGenerator → `_build_user_message` | None changed | n/a |
| 2.2 | English trailing rules block | OntologyGenerator → `_build_user_message` | None changed | n/a |
| 2.3 | English truncation notice | OntologyGenerator → `_build_user_message` | None changed | n/a |
| 2.4 | Variable interpolations preserved | OntologyGenerator → `_build_user_message` | f-string interpolation | n/a |
| 2.5 | Conditional additional-context block preserved | OntologyGenerator → `_build_user_message` | Python conditional | n/a |
| 2.6 | Zero Chinese in user message | OntologyGenerator → `_build_user_message` | n/a | n/a |
| 3.1 | Postfix call site preserved | OntologyGenerator → `generate` line ~209 | `get_language_instruction()` | Architecture diagram |
| 3.2 | English identifier-format directive preserved | OntologyGenerator → system_prompt assembly | Prompt-only | n/a |
| 3.3 | `zh` locale produces Chinese output | OntologyGenerator + Locale | `get_language_instruction()` | Architecture diagram |
| 3.4 | `en` locale produces English output | OntologyGenerator + Locale | `get_language_instruction()` | Architecture diagram |
| 3.5 | No edits to locale module or registries | n/a (boundary commitment) | n/a | n/a |
| 4.1–4.7 | API and constant stability | OntologyGenerator (signatures, constants) | Public surface | n/a |
| 5.1–5.4 | Reasoning-model compatibility | OntologyGenerator → `chat_json` call | LLMClient.chat_json | Architecture diagram |
| 6.1–6.3 | Step 1 graph-build parity | Validation runs (manual) | n/a | n/a |
| 7.1–7.4 | Out-of-scope surfaces untouched | OntologyGenerator (boundary commitment) | n/a | n/a |

## Components and Interfaces

| Component | Domain/Layer | Intent | Req Coverage | Key Dependencies (P0/P1) | Contracts |
|-----------|--------------|--------|--------------|--------------------------|-----------|
| OntologyGenerator (modified) | Backend / Service | Render English ontology-generation prompts; preserve all behaviour | 1.1–1.8, 2.1–2.6, 3.1–3.5, 4.1–4.7, 5.1–5.4, 7.1–7.4 | LLMClient.chat_json (P0), get_language_instruction (P0), `_validate_and_process` (P0) | Service |

### Backend / Service

#### OntologyGenerator (modified)

| Field | Detail |
|-------|--------|
| Intent | Translate prompt strings to English while preserving every functional contract. |
| Requirements | 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 3.1, 3.2, 3.3, 3.4, 3.5, 4.1, 4.2, 4.3, 4.4, 4.5, 4.6, 4.7, 5.1, 5.2, 5.3, 5.4, 7.1, 7.2, 7.3, 7.4 |

**Responsibilities & Constraints**

- Owns: the English wording of `ONTOLOGY_SYSTEM_PROMPT` and the four user-message string fragments.
- Domain boundary: prompt content only. Does not own locale resolution, transport, or validation logic.
- Invariants:
    - `ONTOLOGY_SYSTEM_PROMPT` after translation MUST contain zero CJK characters.
    - The translated system prompt MUST present the same JSON template by key (`entity_types`, `edge_types`, `analysis_summary`; entity sub-keys `name`, `description`, `attributes`, `examples`; edge sub-keys `name`, `description`, `source_targets`, `attributes`; `source_targets` sub-keys `source`, `target`).
    - The translated system prompt MUST list the same entity-type names verbatim: `Student`, `Professor`, `Journalist`, `Celebrity`, `Executive`, `Official`, `Lawyer`, `Doctor`, `Person`, `University`, `Company`, `GovernmentAgency`, `MediaOutlet`, `Hospital`, `School`, `NGO`, `Organization`.
    - The translated system prompt MUST list the same relationship-type names verbatim: `WORKS_FOR`, `STUDIES_AT`, `AFFILIATED_WITH`, `REPRESENTS`, `REGULATES`, `REPORTS_ON`, `COMMENTS_ON`, `RESPONDS_TO`, `SUPPORTS`, `OPPOSES`, `COLLABORATES_WITH`, `COMPETES_WITH`.
    - The translated system prompt MUST list the same reserved attribute names verbatim: `name`, `uuid`, `group_id`, `created_at`, `summary`.
    - The translated system prompt MUST express the same numeric constraints: exactly 10 entity types, with the last 2 being `Person` and `Organization` fallbacks; 6–10 relationship types; 1–3 attributes per entity; description ≤ 100 characters.
    - The translated user message MUST preserve all f-string interpolations: `{simulation_requirement}`, `{combined_text}`, `{additional_context}`, `{original_length}`, `{self.MAX_TEXT_LENGTH_FOR_LLM}`.
    - The translated user message MUST conditionally include the `## Additional Context` block only when `additional_context` is truthy.
    - The call to `get_language_instruction()` MUST remain at its current location with its current return-value usage.
    - The trailing English identifier-format directive (`IMPORTANT: Entity type names MUST be in English PascalCase ...`) MUST remain byte-for-byte identical.
    - The call to `self.llm_client.chat_json(messages=messages, temperature=0.3, max_tokens=4096)` MUST remain unchanged.
    - All public signatures, the constant `MAX_TEXT_LENGTH_FOR_LLM`, and the private helpers `_to_pascal_case` and `_validate_and_process` MUST remain unchanged.
    - All `logger.warning(...)` calls and inline comments and docstrings in this file MUST remain unchanged (out of scope per #6 and #7).

**Dependencies**

- Inbound: `backend/app/api/graph.py:223–228` — sole production caller (P0).
- Outbound: `backend/app/utils/locale.get_language_instruction` — locale postfix (P0). `backend/app/utils/llm_client.LLMClient.chat_json` — JSON LLM transport with stripping (P0).
- External: none.

**Contracts**: Service [x] / API [ ] / Event [ ] / Batch [ ] / State [ ]

##### Service Interface

The public Python interface is unchanged:

```python
class OntologyGenerator:
    def __init__(self, llm_client: Optional[LLMClient] = None) -> None: ...

    def generate(
        self,
        document_texts: List[str],
        simulation_requirement: str,
        additional_context: Optional[str] = None,
    ) -> Dict[str, Any]: ...

    def generate_python_code(self, ontology: Dict[str, Any]) -> str: ...
```

- Preconditions: `document_texts` is a non-empty list of strings; `simulation_requirement` is a non-empty string; locale is resolvable via the existing chain.
- Postconditions: `generate()` returns a dict with `entity_types` (length ≤ 10, ending in `Person` and `Organization`), `edge_types` (length ≤ 10), and `analysis_summary` (string).
- Invariants: see *Responsibilities & Constraints*.

**Implementation Notes**

- **Integration**: No new imports. No call-site changes. The only diff is the body of `ONTOLOGY_SYSTEM_PROMPT` and four string literals inside `_build_user_message`.
- **Validation**: After implementation, run a targeted regex check (`[一-鿿]` over `ONTOLOGY_SYSTEM_PROMPT` and the relevant lines of `_build_user_message`) to confirm zero CJK in those literals. Run a manual round-trip via `OntologyGenerator().generate(...)` under both `en` and `zh` locales using a small seed text and assert: valid JSON, exactly 10 entity types ending in `Person` and `Organization`, descriptions in the expected language. Optionally run end-to-end Step 1 graph build on a representative seed file under `en` and compare node/edge counts to a recent `zh` baseline.
- **Risks**: English-base bias on Chinese-locale output (mitigated by the `llmInstruction` postfix and the trailing English directive that locks identifier formats). Validator self-healing covers structural drift independent of prompt language.

## Data Models

No data-model changes. The JSON schema emitted by the LLM and consumed by `_validate_and_process` is preserved verbatim.

## Error Handling

### Error Strategy

Error handling is unchanged from the existing implementation:

- LLM transport errors propagate from `LLMClient.chat_json` (raises on failure modes the SDK exposes).
- Invalid JSON from the LLM raises `ValueError("LLM返回的JSON格式无效: ...")` from `chat_json`. Note: the error message itself is in `llm_client.py` and is out of scope for this spec (issue #6).
- Validator self-healing handles structural drift (missing fallbacks, count overflows, invalid attribute reservations).

### Error Categories and Responses

- **User errors (4xx)**: not applicable at this layer; surfaced by the API handler.
- **System errors (5xx)**: LLM/network failures propagate to the API handler, which converts them to JSON error responses.
- **Business logic errors**: structurally invalid ontology output is auto-corrected by `_validate_and_process` to satisfy the 10-type / fallback / length invariants.

### Monitoring

Existing `logger.warning` and `logger.info` calls already log auto-conversions and final counts; no new monitoring is added.

## Testing Strategy

### Unit Tests

Given the project's intentionally minimal test harness (`backend/scripts/test_profile_format.py` only, per `tech.md`), introducing a heavy new test suite is out of scope. Instead, two lightweight checks accompany the change:

- **Static check**: a regex assertion in a small ad-hoc script (or a one-shot `python -c`) confirming that `ONTOLOGY_SYSTEM_PROMPT` and the patched literals in `_build_user_message` contain zero characters in `[一-鿿]`. This can be a permanent simple test under `backend/scripts/` if desired or a one-off check during PR review.
- **Round-trip smoke test**: a manual run of `OntologyGenerator().generate(...)` against a configured LLM, locale `en`, with a small seed text. Assert: dict shape, entity-types length 10 ending in `Person`/`Organization`, description fields contain no `[一-鿿]`. Repeat under locale `zh` and assert description fields contain at least some `[一-鿿]` (sanity check that the postfix still steers Chinese output).

### Integration Tests

- **Step 1 graph build under EN locale**: run the full pipeline end-to-end with a representative seed file under `Accept-Language: en`. Assert: pipeline completes without exception, ontology validates, node/edge counts in Neo4j are within operator-acceptable tolerance of a recent `zh` baseline. This is documented as an operator-run verification step in the PR description; automation is not required.

### E2E/UI Tests

Not applicable — change does not affect frontend.

### Performance/Load

Not applicable — change does not alter performance characteristics. LLM call parameters (`temperature=0.3`, `max_tokens=4096`) are unchanged.

## Optional Sections

### Security Considerations

Not applicable. Translation does not introduce new authentication, authorization, data-handling, or input-validation paths. Reserved attribute names remain enforced via prompt and validator.

### Performance & Scalability

Not applicable. Prompt token counts may differ slightly between Chinese and English renderings, but well within the existing `max_tokens=4096` budget.

### Migration Strategy

Not applicable. The change is a single in-place edit; no data migration. Rollback is `git revert`.

## Supporting References

- `backend/app/services/ontology_generator.py` — current Chinese prompt content (the source of translation).
- `backend/app/utils/locale.py` — locale resolver.
- `backend/app/utils/llm_client.py` — `chat_json` and `<think>` / fence stripping.
- `backend/app/api/graph.py:223–228` — sole production caller.
- `.kiro/specs/i18n-ontology-generator-prompts/research.md` — discovery findings, alternatives evaluation, and design decisions.
- `.ticket/2.md` — ticket snapshot.