# Requirements Document
## Introduction
This specification covers the English translation of the prompt strings in `backend/app/services/ontology_generator.py`. The file produces the project ontology (entity types, relationship types, schema commentary) that drives the Graphiti graph build (Step 1 of the MiroFish pipeline). Today, the system prompt and user-message templates are written in Chinese; the language is steered at runtime by appending `get_language_instruction()` to the system message. While that postfix instructs the model *which* language to respond in, the base-prompt language biases the model's structural and lexical output. As a result, ontology descriptions, reasoning, and schema commentary skew Chinese under `Accept-Language: en`. Translating the base prompt to English removes that bias while preserving the existing locale-switching mechanism for non-English locales (verified: `get_language_instruction()` returns the Chinese postfix `请使用中文回答。` when locale is `zh`, so a Chinese model response remains achievable from an English base prompt).
This work tracks GitHub issue [#2](https://github.com/salestech-group/MiroFish/issues/2).
## Boundary Context
- **In scope**:
  - Translating `ONTOLOGY_SYSTEM_PROMPT` (the module-level system prompt constant) from Chinese to English.
  - Translating the user-message template constructed in `OntologyGenerator._build_user_message` (Chinese section headings and instruction list) to English.
  - Translating the truncation notice string emitted when input text exceeds `MAX_TEXT_LENGTH_FOR_LLM`.
  - Translating the trailing instruction string appended to the user message ("必须遵守的规则" block).
  - Preserving all functional contracts: JSON schema, key names, entity-type taxonomy, relationship-type taxonomy, attribute reserved-word list, fallback rules, variable interpolation, and the `get_language_instruction()` postfix call site.
- **Out of scope**:
  - Logger messages, including warnings emitted by `_validate_and_process` (covered by issue #6).
  - Module docstring, class docstrings, method docstrings, and inline comments (covered by issue #7).
  - Refactoring the ontology JSON schema, validation flow, or extraction strategy.
  - Changing the entity-type or relationship-type reference taxonomies (the categories themselves remain; only their description language changes).
  - Editing call sites of `OntologyGenerator.generate` or `generate_python_code`.
  - Translating the auto-generated Python code emitted by `generate_python_code` (the comment headers there are documentation, covered by #7).
- **Adjacent expectations**:
  - The Graphiti adapter (`graphiti_adapter`) and Step 1 graph build pipeline must continue to consume the ontology output unchanged. The adapter has no coupling to prompt language; this is verified by the JSON schema contract being preserved.
  - The locale resolution chain (`Accept-Language` header → `get_locale()` → `get_language_instruction()`) is owned by `backend/app/utils/locale.py` and is unchanged by this work. Translating the base prompt does not modify locale resolution semantics.
  - Companion i18n issues (#3, #4, #5, #6, #7, #8, #9, #10) operate on different files or scopes and should not be touched here.
## Requirements
### Requirement 1: English Translation of the Ontology System Prompt
**Objective:** As a MiroFish operator running the pipeline under `Accept-Language: en`, I want the ontology-generation system prompt to be authored in English, so that the LLM's ontology descriptions, reasoning, and schema commentary are not biased toward Chinese structure or word choice.
#### Acceptance Criteria
1. The Ontology Generator shall define `ONTOLOGY_SYSTEM_PROMPT` containing zero Chinese characters in any string-literal content.
2. The Ontology Generator shall preserve the JSON output contract of the system prompt verbatim: the keys `entity_types`, `edge_types`, `analysis_summary`, and the entity sub-keys `name`, `description`, `attributes`, `examples`, and the edge sub-keys `name`, `description`, `source_targets`, `attributes`, plus the `source_targets` sub-keys `source` and `target`.
3. The Ontology Generator shall preserve the entity-type reference list verbatim by name (`Student`, `Professor`, `Journalist`, `Celebrity`, `Executive`, `Official`, `Lawyer`, `Doctor`, `Person`, `University`, `Company`, `GovernmentAgency`, `MediaOutlet`, `Hospital`, `School`, `NGO`, `Organization`).
4. The Ontology Generator shall preserve the relationship-type reference list verbatim by name (`WORKS_FOR`, `STUDIES_AT`, `AFFILIATED_WITH`, `REPRESENTS`, `REGULATES`, `REPORTS_ON`, `COMMENTS_ON`, `RESPONDS_TO`, `SUPPORTS`, `OPPOSES`, `COLLABORATES_WITH`, `COMPETES_WITH`).
5. The Ontology Generator shall preserve the reserved-attribute-name list verbatim (`name`, `uuid`, `group_id`, `created_at`, `summary`).
6. The Ontology Generator shall preserve the fallback-type rule that exactly two fallback entity types — `Person` and `Organization` — must appear at the end of a 10-item list.
7. The Ontology Generator shall preserve the entity-count constraint (exactly 10 entity types) and the edge-count constraint (6-10 relationship types).
8. The Ontology Generator shall preserve the description-length constraint (entity and edge `description` ≤ 100 characters).
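
The structural constraints in criteria 5 to 8 lend themselves to a mechanical check on a generated ontology. The following is a minimal sketch, not the project's validator: `check_ontology_constraints` is a hypothetical helper, and it assumes each attribute is a dict with a `name` key and that the reserved-name rule applies to both entity and edge attributes.

```python
from typing import Any, Dict, List

RESERVED_ATTRIBUTE_NAMES = {"name", "uuid", "group_id", "created_at", "summary"}
FALLBACK_ENTITY_TYPES = {"Person", "Organization"}
MAX_DESCRIPTION_LENGTH = 100


def check_ontology_constraints(ontology: Dict[str, Any]) -> List[str]:
    """Return human-readable violations; an empty list means all checks passed."""
    problems: List[str] = []
    entities = ontology.get("entity_types", [])
    edges = ontology.get("edge_types", [])

    if len(entities) != 10:
        problems.append(f"expected exactly 10 entity types, got {len(entities)}")
    if not 6 <= len(edges) <= 10:
        problems.append(f"expected 6 to 10 edge types, got {len(edges)}")

    # The two fallback types must close the entity list (order not assumed).
    last_two = {e.get("name") for e in entities[-2:]}
    if last_two != FALLBACK_ENTITY_TYPES:
        problems.append("last two entity types must be Person and Organization")

    for item in entities + edges:
        item_name = item.get("name", "<unnamed>")
        if len(item.get("description", "")) > MAX_DESCRIPTION_LENGTH:
            problems.append(f"{item_name}: description exceeds {MAX_DESCRIPTION_LENGTH} characters")
        for attr in item.get("attributes", []):
            # Assumption: attributes are dicts carrying a "name" key.
            if attr.get("name") in RESERVED_ATTRIBUTE_NAMES:
                problems.append(f"{item_name}: reserved attribute name {attr['name']!r}")
    return problems
```

Returning a list of violations rather than raising on the first one lets a reviewer see every broken constraint from a single LLM response.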
### Requirement 2: English Translation of the User-Message Template
**Objective:** As a MiroFish operator running the pipeline under `Accept-Language: en`, I want the user-message template constructed by `_build_user_message` to be authored in English, so that the rendered prompt does not interleave English `get_language_instruction()` directives with Chinese section headings.
#### Acceptance Criteria
1. The Ontology Generator shall render the user message with English section headings in place of `## 模拟需求`, `## 文档内容`, and `## 额外说明`.
2. The Ontology Generator shall render the trailing rules block in English (replacing `请根据以上内容...` and the `必须遵守的规则` enumeration), preserving the rule semantics: 10 entity types total, last 2 are `Person`/`Organization` fallbacks, first 8 are concrete types, all entities must be real-world social-media-capable subjects (not abstract concepts), and reserved attribute names cannot be used.
3. The Ontology Generator shall render the truncation notice in English when the combined document text exceeds `MAX_TEXT_LENGTH_FOR_LLM`, including the original character count and the truncation length.
4. The Ontology Generator shall preserve all variable interpolations verbatim by name (`simulation_requirement`, `combined_text`, `additional_context`, and the `{original_length}` / `{self.MAX_TEXT_LENGTH_FOR_LLM}` interpolations in the truncation notice).
5. The Ontology Generator shall preserve the conditional inclusion of the `## Additional Context` section only when `additional_context` is truthy.
6. The Ontology Generator shall contribute zero Chinese characters to the assembled user message through its own string literals.
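
The "zero Chinese characters" criteria (here and in Requirement 1) can be checked with a simple scan over the prompt constant or a rendered message. This is a sketch under assumptions: `find_chinese_characters` is a hypothetical helper, and the choice to also flag CJK punctuation and full-width forms is an interpretation of the criterion, not something the spec states.

```python
import re
from typing import List

# CJK symbols and punctuation, Extension A, Unified Ideographs,
# compatibility ideographs, and half-width/full-width forms.
# Assumption: punctuation such as "。" counts as a Chinese character here.
_CJK_PATTERN = re.compile(
    r"[\u3000-\u303f\u3400-\u4dbf\u4e00-\u9fff\uf900-\ufaff\uff00-\uffef]"
)


def find_chinese_characters(text: str) -> List[str]:
    """Return every CJK character found in text, in order of appearance."""
    return _CJK_PATTERN.findall(text)
```

For example, `find_chinese_characters("## Additional Context")` returns an empty list, while the pre-change heading `## 模拟需求` yields its four ideographs. When scanning a rendered user message, the interpolated inputs should themselves be ASCII-only so that any hit is attributable to the template.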
### Requirement 3: Locale Switching Continues to Work via `get_language_instruction()`
**Objective:** As a MiroFish operator running the pipeline under `Accept-Language: zh` (or any other configured non-English locale), I want the ontology output to remain in the requested locale at equivalent quality, so that translating the base prompt does not regress non-English support.
#### Acceptance Criteria
1. The Ontology Generator shall preserve the call to `get_language_instruction()` exactly at the existing location (currently the line above `system_prompt = f"{ONTOLOGY_SYSTEM_PROMPT}\n\n{lang_instruction}\n..."`), continuing to read locale via the existing thread-local / request-header resolution chain.
2. The Ontology Generator shall preserve the trailing English directive that locks identifier formats (`Entity type names MUST be in English PascalCase ...`, `Relationship type names MUST be in English UPPER_SNAKE_CASE ...`, `Attribute names MUST be in English snake_case ...`, `Only description fields and analysis_summary should use the specified language above.`).
3. When the locale is `zh`, the Ontology Generator shall produce a JSON ontology whose `description` and `analysis_summary` fields are in Chinese, equivalent in quality to the pre-change behaviour.
4. When the locale is `en`, the Ontology Generator shall produce a JSON ontology whose `description` and `analysis_summary` fields are in English.
5. The Ontology Generator shall not alter `backend/app/utils/locale.py`, the `_languages` and `_translations` registries, or the locales under `/locales/`.
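
The mechanism this requirement protects can be illustrated as follows: the base prompt stays fixed in English and only the appended instruction varies by locale. This is a simplified sketch, not the module's code. The stand-in `get_language_instruction` takes a `locale` argument for illustration (the real helper takes none and resolves the locale itself), the English instruction wording and the placeholder prompt body are assumptions, and the trailing identifier-format directive is omitted.

```python
# Placeholder body; the real constant holds the full translated prompt.
ONTOLOGY_SYSTEM_PROMPT = "You design ontologies for social-media simulations."

# Stand-in for the registry behind backend/app/utils/locale.py.
_LANGUAGE_INSTRUCTIONS = {
    "zh": "请使用中文回答。",               # value quoted in this spec
    "en": "Please respond in English.",  # assumed wording
}


def get_language_instruction(locale: str) -> str:
    """Hypothetical stand-in; the real helper reads the locale from request context."""
    return _LANGUAGE_INSTRUCTIONS[locale]


def build_system_prompt(locale: str) -> str:
    """Append the locale instruction to the unchanged English base prompt."""
    lang_instruction = get_language_instruction(locale)
    return f"{ONTOLOGY_SYSTEM_PROMPT}\n\n{lang_instruction}"
```

Under this shape, `build_system_prompt("zh")` and `build_system_prompt("en")` share an identical English prefix and differ only in the final instruction line.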
### Requirement 4: Public API and Call-Site Stability
**Objective:** As a developer maintaining the rest of the MiroFish backend pipeline, I want the public surface of `OntologyGenerator` to remain unchanged, so that the graph-build flow and existing callers continue to work without modification.
#### Acceptance Criteria
1. The Ontology Generator shall preserve the signature of `OntologyGenerator.__init__(self, llm_client: Optional[LLMClient] = None)`.
2. The Ontology Generator shall preserve the signature of `OntologyGenerator.generate(self, document_texts: List[str], simulation_requirement: str, additional_context: Optional[str] = None) -> Dict[str, Any]`.
3. The Ontology Generator shall preserve the signature of `OntologyGenerator.generate_python_code(self, ontology: Dict[str, Any]) -> str`.
4. The Ontology Generator shall preserve the return-shape contract of `generate()`: a `Dict[str, Any]` with keys `entity_types`, `edge_types`, `analysis_summary` matching the existing JSON schema, post-validation.
5. The Ontology Generator shall preserve the signature of the private helper `_to_pascal_case(name: str) -> str` and the validator `_validate_and_process(self, result: Dict[str, Any]) -> Dict[str, Any]`.
6. The Ontology Generator shall preserve the constant `MAX_TEXT_LENGTH_FOR_LLM = 50000`.
7. The Ontology Generator shall preserve the LLM invocation parameters (`temperature=0.3`, `max_tokens=4096`) and the call to `self.llm_client.chat_json(...)`.
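
One way to guard the signatures above against accidental drift is to compare parameter names and defaults using `inspect.signature`. The sketch below runs against stand-in classes that carry only the signatures listed in this requirement; `parameter_summary`, `signature_drift`, and `EXPECTED` are hypothetical names, and in a real test the stand-ins would be replaced by an import of the actual module.

```python
import inspect
from typing import Any, Dict, List, Optional


class LLMClient:
    """Empty stand-in so the annotations below resolve."""


class OntologyGenerator:
    """Stand-in carrying only the public signatures listed in this requirement."""

    def __init__(self, llm_client: Optional[LLMClient] = None):
        self.llm_client = llm_client

    def generate(self, document_texts: List[str], simulation_requirement: str,
                 additional_context: Optional[str] = None) -> Dict[str, Any]:
        raise NotImplementedError

    def generate_python_code(self, ontology: Dict[str, Any]) -> str:
        raise NotImplementedError


def parameter_summary(func) -> list:
    """Return (name, default) pairs; inspect.Parameter.empty means 'no default'."""
    return [(p.name, p.default) for p in inspect.signature(func).parameters.values()]


EMPTY = inspect.Parameter.empty
EXPECTED = {
    "__init__": [("self", EMPTY), ("llm_client", None)],
    "generate": [("self", EMPTY), ("document_texts", EMPTY),
                 ("simulation_requirement", EMPTY), ("additional_context", None)],
    "generate_python_code": [("self", EMPTY), ("ontology", EMPTY)],
}


def signature_drift(cls) -> List[str]:
    """Return the names of methods whose parameters differ from EXPECTED."""
    return [name for name, expected in EXPECTED.items()
            if parameter_summary(getattr(cls, name)) != expected]
```

Comparing names and defaults rather than the printed signature avoids false alarms from how different Python versions format type annotations.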
### Requirement 5: Reasoning-Model Output Compatibility
**Objective:** As a MiroFish operator using a reasoning-model provider (e.g. MiniMax, GLM with `<think>` tags or markdown code fences), I want JSON parsing of the ontology response to continue working, so that translating the base prompt does not regress provider compatibility.
#### Acceptance Criteria
1. The Ontology Generator shall delegate JSON parsing to `LLMClient.chat_json` exactly as today (the call at the existing site is unchanged in name and arguments).
2. If a reasoning-model provider returns `<think>`-tagged or markdown-fenced output, then the existing stripping logic in `LLMClient.chat_json` shall continue to apply unchanged.
3. The Ontology Generator shall not introduce any new pre-processing of the LLM response that depends on prompt language.
4. After translation, the Ontology Generator shall continue to round-trip a sample seed file through `generate()` and `_validate_and_process()` and produce a non-empty `entity_types` list of length 10 with the `Person` and `Organization` fallbacks present at indices 8 and 9 (or earlier, in the order produced).
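
For readers unfamiliar with the stripping step that criterion 2 refers to, the sketch below shows what such logic typically looks like. It is an illustration only and not the implementation inside `LLMClient.chat_json`; `parse_llm_json` and the two patterns are hypothetical.

```python
import json
import re
from typing import Any, Dict

FENCE = "`" * 3  # a markdown code fence, built indirectly to keep this block readable

# Reasoning models may prepend a <think>...</think> block to the answer.
_THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)
# The JSON payload may also be wrapped in a fenced block, optionally tagged "json".
_FENCED = re.compile(rf"^{FENCE}(?:json)?\s*\n(.*?)\n?{FENCE}\s*$", re.DOTALL)


def parse_llm_json(raw: str) -> Dict[str, Any]:
    """Strip reasoning tags and a surrounding code fence, then parse JSON."""
    text = _THINK_BLOCK.sub("", raw).strip()
    match = _FENCED.match(text)
    if match:
        text = match.group(1)
    return json.loads(text)
```

Nothing in this kind of stripping inspects the natural language of the prompt or the response, which is why translating the base prompt is not expected to affect it.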
### Requirement 6: Step 1 Graph Build Parity
**Objective:** As a MiroFish operator validating the change, I want the Graphiti / Neo4j Step 1 graph build to complete with comparable structure under the English ontology, so that the translation does not silently degrade graph quality.
#### Acceptance Criteria
1. When a representative seed file is processed end-to-end with locale `en`, the Step 1 graph build shall complete without raising an exception attributable to the ontology output.
2. When a representative seed file is processed end-to-end with locale `en`, the resulting Neo4j graph shall contain a node count and edge count comparable to the pre-change Chinese-prompt baseline within an operator-acceptable tolerance (a small percentage variance is acceptable; doubling or zeroing is not).
3. The Ontology Generator shall not change the function signatures or call sequence used by the Step 1 graph build pipeline (verified by Requirement 4).
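
Criterion 2 leaves the tolerance to the operator. A relative-difference check such as the one below makes the comparison explicit; `within_tolerance` is a hypothetical helper and the 20% default is an assumed value, not one taken from this spec.

```python
def within_tolerance(baseline: int, current: int, tolerance: float = 0.2) -> bool:
    """Return True when current deviates from baseline by at most `tolerance` (relative)."""
    if baseline == 0:
        # No meaningful ratio exists; only an equally empty graph counts as comparable.
        return current == 0
    return abs(current - baseline) / baseline <= tolerance
```

With the assumed default, a baseline of 100 nodes accepts 110 but rejects both 200 (doubling) and 0 (zeroing), matching the two failure cases the criterion names.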
### Requirement 7: Out-of-Scope Surfaces Remain Untouched
**Objective:** As a reviewer of this PR, I want the change to remain narrowly scoped to prompt strings, so that translation responsibilities for adjacent surfaces (issues #6 and #7) are not absorbed into this change.
#### Acceptance Criteria
1. The change shall not modify any `logger.warning(...)`, `logger.info(...)`, `logger.error(...)`, or `logger.debug(...)` call in `ontology_generator.py` (covered by issue #6).
2. The change shall not modify the module docstring, class docstring, method docstrings, or inline comments in `ontology_generator.py` (covered by issue #7).
3. The change shall not edit any file outside `backend/app/services/ontology_generator.py` for production code, except for adding test fixtures or scripts under a clearly-isolated directory if a verification harness is needed.
4. The change shall not introduce a new dependency or modify `backend/pyproject.toml` / `backend/uv.lock`.