Requirements Document

Introduction

This specification covers the English translation of the prompt strings in backend/app/services/ontology_generator.py. The file produces the project ontology (entity types, relationship types, schema commentary) that drives the Graphiti graph build (Step 1 of the MiroFish pipeline). Today, the system prompt and user-message templates are written in Chinese; the language is steered at runtime by appending get_language_instruction() to the system message. While that postfix instructs the model which language to respond in, the base-prompt language biases the model's structural and lexical output. As a result, ontology descriptions, reasoning, and schema commentary skew Chinese under Accept-Language: en. Translating the base prompt to English removes that bias while preserving the existing locale-switching mechanism for non-English locales (verified: get_language_instruction() returns the Chinese postfix 请使用中文回答。 when locale is zh, so a Chinese model response remains achievable from an English base prompt).

This work tracks GitHub issue #2.

Boundary Context

  • In scope:
    • Translating ONTOLOGY_SYSTEM_PROMPT (the module-level system prompt constant) from Chinese to English.
    • Translating the user-message template constructed in OntologyGenerator._build_user_message (Chinese section headings and instruction list) to English.
    • Translating the truncation notice string emitted when input text exceeds MAX_TEXT_LENGTH_FOR_LLM.
    • Translating the trailing instruction string appended to the user message ("必须遵守的规则" block).
    • Preserving all functional contracts: JSON schema, key names, entity-type taxonomy, relationship-type taxonomy, attribute reserved-word list, fallback rules, variable interpolation, and the get_language_instruction() postfix call site.
  • Out of scope:
    • Logger messages, including warnings emitted by _validate_and_process (covered by issue #6).
    • Module docstring, class docstrings, method docstrings, and inline comments (covered by issue #7).
    • Refactoring the ontology JSON schema, validation flow, or extraction strategy.
    • Changing the entity-type or relationship-type reference taxonomies (the categories themselves remain — only their description language changes).
    • Editing call sites of OntologyGenerator.generate or generate_python_code.
    • Translating the auto-generated Python code emitted by generate_python_code (the comment headers there are documentation, covered by #7).
  • Adjacent expectations:
    • The Graphiti adapter (graphiti_adapter) and Step 1 graph build pipeline must continue to consume the ontology output unchanged. No coupling to prompt language exists in the adapter; this is verified via the JSON schema contract being preserved.
    • The locale resolution chain (Accept-Language header → get_locale() → get_language_instruction()) is owned by backend/app/utils/locale.py and is unchanged by this work. Translating the base prompt does not modify locale resolution semantics.
    • Companion i18n issues (#3, #4, #5, #6, #7, #8, #9, #10) operate on different files or scopes and should not be touched here.

Requirements

Requirement 1: English Translation of the Ontology System Prompt

Objective: As a MiroFish operator running the pipeline under Accept-Language: en, I want the ontology-generation system prompt to be authored in English, so that the LLM's ontology descriptions, reasoning, and schema commentary are not biased toward Chinese structure or word choice.

Acceptance Criteria

  1. The Ontology Generator shall define ONTOLOGY_SYSTEM_PROMPT containing zero Chinese characters in any string-literal content.
  2. The Ontology Generator shall preserve the JSON output contract of the system prompt verbatim: the keys entity_types, edge_types, analysis_summary, and the entity sub-keys name, description, attributes, examples, and the edge sub-keys name, description, source_targets, attributes, plus the source_targets sub-keys source and target.
  3. The Ontology Generator shall preserve the entity-type reference list verbatim by name (Student, Professor, Journalist, Celebrity, Executive, Official, Lawyer, Doctor, Person, University, Company, GovernmentAgency, MediaOutlet, Hospital, School, NGO, Organization).
  4. The Ontology Generator shall preserve the relationship-type reference list verbatim by name (WORKS_FOR, STUDIES_AT, AFFILIATED_WITH, REPRESENTS, REGULATES, REPORTS_ON, COMMENTS_ON, RESPONDS_TO, SUPPORTS, OPPOSES, COLLABORATES_WITH, COMPETES_WITH).
  5. The Ontology Generator shall preserve the reserved-attribute-name list verbatim (name, uuid, group_id, created_at, summary).
  6. The Ontology Generator shall preserve the fallback-type rule that exactly two fallback entity types — Person and Organization — must appear at the end of a 10-item list.
  7. The Ontology Generator shall preserve the entity-count constraint (exactly 10 entity types) and the edge-count constraint (6–10 relationship types).
  8. The Ontology Generator shall preserve the description-length constraint (entity and edge description ≤ 100 characters).
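
Criteria 1, 2, and 5 lend themselves to a mechanical check. The following is a minimal sketch, not part of the existing codebase: the helper names (find_cjk, missing_tokens) are hypothetical, and the character ranges are an assumed definition of "Chinese characters" (CJK ideographs, CJK punctuation, and full-width forms).

```python
import re

# Assumed definition of "Chinese characters": CJK punctuation, CJK ideographs
# (including Extension A), and full-width forms.
CJK_PATTERN = re.compile(r"[\u3000-\u303f\u3400-\u4dbf\u4e00-\u9fff\uff00-\uffef]")

# A subset of the contract names listed in the acceptance criteria above.
CONTRACT_TOKENS = (
    "entity_types",
    "edge_types",
    "analysis_summary",
    "source_targets",
    "group_id",
    "created_at",
)


def find_cjk(text: str) -> list:
    """Return every CJK character found in text (empty list if none)."""
    return CJK_PATTERN.findall(text)


def missing_tokens(prompt: str, tokens=CONTRACT_TOKENS) -> list:
    """Return the contract names that no longer appear in the prompt."""
    return [token for token in tokens if token not in prompt]
```

In a verification script, both helpers would be applied to the translated ONTOLOGY_SYSTEM_PROMPT, with an empty list expected from each. Substring matching is deliberately loose; it detects a dropped key, not a malformed schema.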

Requirement 2: English Translation of the User-Message Template

Objective: As a MiroFish operator running the pipeline under Accept-Language: en, I want the user-message template constructed by _build_user_message to be authored in English, so that the rendered prompt does not interleave English get_language_instruction() directives with Chinese section headings.

Acceptance Criteria

  1. The Ontology Generator shall render the user message with English section headings in place of ## 模拟需求, ## 文档内容, and ## 额外说明.
  2. The Ontology Generator shall render the trailing rules block in English (replacing 请根据以上内容... and the 必须遵守的规则 enumeration), preserving the rule semantics: 10 entity types total, last 2 are Person/Organization fallbacks, first 8 are concrete types, all entities must be real-world social-media-capable subjects (not abstract concepts), and reserved attribute names cannot be used.
  3. The Ontology Generator shall render the truncation notice in English when the combined document text exceeds MAX_TEXT_LENGTH_FOR_LLM, including the original character count and the truncation length.
  4. The Ontology Generator shall preserve all variable interpolations verbatim by name (simulation_requirement, combined_text, additional_context, and the {original_length} / {self.MAX_TEXT_LENGTH_FOR_LLM} interpolations in the truncation notice).
  5. The Ontology Generator shall preserve the conditional inclusion of the ## Additional Context section only when additional_context is truthy.
  6. The Ontology Generator shall return zero Chinese characters across all string literals contributed to the assembled user message.
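
As an illustration of the criteria above, here is a rough sketch of how an English user-message template could be assembled. It is a standalone function rather than the real _build_user_message method. The heading wording (other than ## Additional Context), the document separator, the truncation-notice wording, and the max_len parameter are all assumptions made for the example.

```python
from typing import List, Optional

MAX_TEXT_LENGTH_FOR_LLM = 50000  # value stated in Requirement 4


def build_user_message(
    document_texts: List[str],
    simulation_requirement: str,
    additional_context: Optional[str] = None,
    max_len: int = MAX_TEXT_LENGTH_FOR_LLM,  # hypothetical knob, eases testing
) -> str:
    # Assumed separator between documents; the real one may differ.
    combined_text = "\n\n---\n\n".join(document_texts)
    original_length = len(combined_text)

    if original_length > max_len:
        combined_text = combined_text[:max_len]
        # Hypothetical English wording for the truncation notice.
        combined_text += (
            f"\n\n...(Original text has {original_length} characters; "
            f"only the first {max_len} were used for ontology analysis)..."
        )

    message = (
        f"## Simulation Requirement\n\n{simulation_requirement}\n\n"
        f"## Document Content\n\n{combined_text}\n"
    )

    # The section is included only when additional_context is truthy.
    if additional_context:
        message += f"\n## Additional Context\n\n{additional_context}\n"

    # Hypothetical paraphrase of the trailing rules block.
    message += (
        "\nBased on the content above, design entity and relationship types.\n"
        "Rules that MUST be followed:\n"
        "1. Output exactly 10 entity types.\n"
        "2. The last 2 must be the fallbacks Person and Organization.\n"
        "3. The first 8 must be concrete types derived from the text.\n"
        "4. Every entity must be a real-world subject able to post on social "
        "media, not an abstract concept.\n"
        "5. Do not use reserved attribute names.\n"
    )
    return message
```

Keeping the interpolated names (simulation_requirement, combined_text, additional_context, original_length) identical to the originals means the translation touches only literal text.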

Requirement 3: Locale Switching Continues to Work via get_language_instruction()

Objective: As a MiroFish operator running the pipeline under Accept-Language: zh (or any other configured non-English locale), I want the ontology output to remain in the requested locale of equivalent quality, so that translating the base prompt does not regress non-English support.

Acceptance Criteria

  1. The Ontology Generator shall preserve the call to get_language_instruction() exactly at the existing location (currently the line above system_prompt = f"{ONTOLOGY_SYSTEM_PROMPT}\n\n{lang_instruction}\n..."), continuing to read locale via the existing thread-local / request-header resolution chain.
  2. The Ontology Generator shall preserve the trailing English directive that locks identifier formats (Entity type names MUST be in English PascalCase ..., Relationship type names MUST be in English UPPER_SNAKE_CASE ..., Attribute names MUST be in English snake_case ..., Only description fields and analysis_summary should use the specified language above.).
  3. When the locale is zh, the Ontology Generator shall produce a JSON ontology whose description and analysis_summary fields are in Chinese, equivalent in quality to the pre-change behaviour.
  4. When the locale is en, the Ontology Generator shall produce a JSON ontology whose description and analysis_summary fields are in English.
  5. The Ontology Generator shall not alter backend/app/utils/locale.py, the _languages and _translations registries, or the locale files under /locales/.
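
The locale-switching mechanism described above can be sketched as follows. This is a simplified stand-in, not the real backend/app/utils/locale.py. The real get_language_instruction() takes no argument and reads the locale from the request context, whereas this version takes the locale explicitly. The English instruction text and both placeholder constants are assumptions; only the Chinese postfix is taken from this document.

```python
# Placeholder for the translated base prompt (the real text is much longer).
ONTOLOGY_SYSTEM_PROMPT = "You are an ontology design expert. (placeholder)"

# Placeholder for the trailing directive that locks identifier formats.
IDENTIFIER_RULES = "(placeholder for the identifier-format directive)"


def get_language_instruction(locale: str) -> str:
    """Stand-in for the real helper, which resolves the locale itself."""
    instructions = {
        "zh": "请使用中文回答。",  # postfix quoted in the Introduction
        "en": "Please respond in English.",  # assumed wording
    }
    return instructions.get(locale, instructions["en"])


def build_system_prompt(locale: str) -> str:
    """Append the language postfix to the English base prompt."""
    lang_instruction = get_language_instruction(locale)
    return f"{ONTOLOGY_SYSTEM_PROMPT}\n\n{lang_instruction}\n{IDENTIFIER_RULES}"
```

The base prompt stays identical across locales; only the appended instruction changes, which is why translating the base prompt leaves a Chinese response reachable under zh.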

Requirement 4: Public API and Call-Site Stability

Objective: As a developer maintaining the rest of the MiroFish backend pipeline, I want the public surface of OntologyGenerator to remain unchanged, so that the graph-build flow and existing callers continue to work without modification.

Acceptance Criteria

  1. The Ontology Generator shall preserve the signature of OntologyGenerator.__init__(self, llm_client: Optional[LLMClient] = None).
  2. The Ontology Generator shall preserve the signature of OntologyGenerator.generate(self, document_texts: List[str], simulation_requirement: str, additional_context: Optional[str] = None) -> Dict[str, Any].
  3. The Ontology Generator shall preserve the signature of OntologyGenerator.generate_python_code(self, ontology: Dict[str, Any]) -> str.
  4. The Ontology Generator shall preserve the return-shape contract of generate(): a Dict[str, Any] with keys entity_types, edge_types, analysis_summary matching the existing JSON schema, post-validation.
  5. The Ontology Generator shall preserve the signature of the private helper _to_pascal_case(name: str) -> str and the validator _validate_and_process(self, result: Dict[str, Any]) -> Dict[str, Any].
  6. The Ontology Generator shall preserve the constant MAX_TEXT_LENGTH_FOR_LLM = 50000.
  7. The Ontology Generator shall preserve the LLM invocation parameters (temperature=0.3, max_tokens=4096) and the call to self.llm_client.chat_json(...).

Requirement 5: Reasoning-Model Output Compatibility

Objective: As a MiroFish operator using a reasoning-model provider (e.g. MiniMax, GLM with <think> tags or markdown code fences), I want JSON parsing of the ontology response to continue working, so that translating the base prompt does not regress provider compatibility.

Acceptance Criteria

  1. The Ontology Generator shall delegate JSON parsing to LLMClient.chat_json exactly as today (the call at the existing site is unchanged in name and arguments).
  2. If a reasoning-model provider returns <think>-tagged or markdown-fenced output, then the existing stripping logic in LLMClient.chat_json shall continue to apply unchanged.
  3. The Ontology Generator shall not introduce any new pre-processing of the LLM response that depends on prompt language.
  4. After translation, the Ontology Generator shall continue to round-trip a sample seed file through generate() and _validate_and_process() and produce an entity_types list of exactly 10 items that includes the Person and Organization fallbacks (at indices 8 and 9 when the model follows the ordering rule, or earlier if it emits them out of order).
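
The round-trip expectation in criterion 4 can be expressed as a small shape check on the returned ontology. This is a sketch under assumptions: check_entity_types is a hypothetical helper, and it only requires the fallbacks to be present, without enforcing their position.

```python
FALLBACK_TYPES = ("Person", "Organization")
EXPECTED_ENTITY_COUNT = 10


def check_entity_types(ontology: dict) -> list:
    """Return a list of human-readable problems; empty means the shape is OK."""
    names = [entity.get("name") for entity in ontology.get("entity_types", [])]
    problems = []
    if len(names) != EXPECTED_ENTITY_COUNT:
        problems.append(
            f"expected {EXPECTED_ENTITY_COUNT} entity types, got {len(names)}"
        )
    for fallback in FALLBACK_TYPES:
        if fallback not in names:
            problems.append(f"missing fallback type {fallback}")
    return problems
```

A verification script would call this on the dict returned by generate() for the sample seed file and fail if the list is non-empty.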

Requirement 6: Step 1 Graph Build Parity

Objective: As a MiroFish operator validating the change, I want the Graphiti / Neo4j Step 1 graph build to complete with comparable structure under the English ontology, so that the translation does not silently degrade graph quality.

Acceptance Criteria

  1. When a representative seed file is processed end-to-end with locale en, the Step 1 graph build shall complete without raising an exception attributable to the ontology output.
  2. When a representative seed file is processed end-to-end with locale en, the resulting Neo4j graph shall contain a node count and edge count comparable to the pre-change Chinese-prompt baseline within an operator-acceptable tolerance (a small percentage variance is acceptable; doubling or zeroing is not).
  3. The Ontology Generator shall not change the function signatures or call sequence used by the Step 1 graph build pipeline (verified by Requirement 4).
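
The "comparable within tolerance" criterion can be made concrete with a relative-difference check. The 20% default below is purely an assumption for illustration; the requirement leaves the threshold to the operator.

```python
def within_tolerance(baseline: int, observed: int, tolerance: float = 0.2) -> bool:
    """True if observed differs from baseline by at most tolerance (relative)."""
    if baseline == 0:
        # With an empty baseline, only an empty result counts as comparable.
        return observed == 0
    return abs(observed - baseline) / baseline <= tolerance


def graph_is_comparable(baseline: dict, observed: dict, tolerance: float = 0.2) -> bool:
    """Compare node and edge counts given as {'nodes': int, 'edges': int}."""
    return all(
        within_tolerance(baseline[key], observed[key], tolerance)
        for key in ("nodes", "edges")
    )
```

Under this default a 10% drift passes, while doubling or zeroing either count fails, which matches the examples given in criterion 2.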

Requirement 7: Out-of-Scope Surfaces Remain Untouched

Objective: As a reviewer of this PR, I want the change to remain narrowly scoped to prompt strings, so that translation responsibilities for adjacent surfaces (issues #6 and #7) are not absorbed into this change.

Acceptance Criteria

  1. The change shall not modify any logger.warning(...), logger.info(...), logger.error(...), or logger.debug(...) call in ontology_generator.py (covered by issue #6).
  2. The change shall not modify the module docstring, class docstring, method docstrings, or inline comments in ontology_generator.py (covered by issue #7).
  3. The change shall not edit any file outside backend/app/services/ontology_generator.py for production code, except for adding test fixtures or scripts under a clearly-isolated directory if a verification harness is needed.
  4. The change shall not introduce a new dependency or modify backend/pyproject.toml / backend/uv.lock.