8.1 KiB
8.1 KiB
Implementation Plan
-
1. Translate the ontology system-prompt constant to English
- Replace the body of
ONTOLOGY_SYSTEM_PROMPTwith an English rendering that preserves the section structure (core task background, output format JSON template, design guidelines, entity-type reference list, relationship-type reference list, attribute reserved-name rules) - Preserve the JSON template keys verbatim:
entity_types,edge_types,analysis_summary, and the entity sub-keysname,description,attributes,examples, the edge sub-keysname,description,source_targets,attributes, plus thesource_targetssub-keyssourceandtarget - Preserve the entity-type reference list verbatim by name (
Student,Professor,Journalist,Celebrity,Executive,Official,Lawyer,Doctor,Person,University,Company,GovernmentAgency,MediaOutlet,Hospital,School,NGO,Organization) - Preserve the relationship-type reference list verbatim by name (
WORKS_FOR,STUDIES_AT,AFFILIATED_WITH,REPRESENTS,REGULATES,REPORTS_ON,COMMENTS_ON,RESPONDS_TO,SUPPORTS,OPPOSES,COLLABORATES_WITH,COMPETES_WITH) - Preserve the reserved-attribute-name list verbatim (
name,uuid,group_id,created_at,summary) - Express the same numeric constraints in English: exactly 10 entity types, last 2 are
PersonandOrganizationfallbacks, 6–10 relationship types, 1–3 attributes per entity, descriptions ≤ 100 characters - Observable completion:
ONTOLOGY_SYSTEM_PROMPTis an English-only string with zero CJK characters and identical structural keys/values to the original - Requirements: 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8
- Replace the body of
-
2. Translate the user-message template strings to English
- Replace the section headings
## 模拟需求,## 文档内容, and## 额外说明with English equivalents (## Simulation Requirement,## Document Content,## Additional Context) - Replace the trailing rules block (the Chinese
请根据以上内容.../必须遵守的规则enumeration) with an English block that conveys the same five rules: 10 entity types total; last 2 arePersonandOrganizationfallbacks; first 8 are concrete types from the document; entities must be real-world social-media-capable subjects (not abstract concepts); reserved attribute names cannot be used - Replace the truncation notice (the Chinese
(原文共...字,已截取前...字用于本体分析)) with an English equivalent that retains both numeric interpolations - Preserve every f-string interpolation by name and position:
{simulation_requirement},{combined_text},{additional_context},{original_length},{self.MAX_TEXT_LENGTH_FOR_LLM} - Preserve the conditional inclusion of the
## Additional Contextblock — it appears only whenadditional_contextis truthy - Observable completion:
_build_user_messageproduces an English-only message body for any input combination, with zero CJK characters in any string literal it contributes; under the same inputs as before, all interpolated values still appear in the rendered output - Requirements: 2.1, 2.2, 2.3, 2.4, 2.5, 2.6
- Replace the section headings
-
3. Confirm boundary commitments around the translation
- Confirm the call to
get_language_instruction()and the assembledsystem_promptline remain at their existing position with their existing arguments (no rename, no relocation) - Confirm the trailing English identifier-format directive (
IMPORTANT: Entity type names MUST be in English PascalCase ...,Relationship type names MUST be in English UPPER_SNAKE_CASE ...,Attribute names MUST be in English snake_case ...,Only description fields and analysis_summary should use the specified language above.) remains byte-for-byte identical - Confirm the public signatures of
OntologyGenerator.__init__,generate,generate_python_code, the private_to_pascal_case, and_validate_and_processare unchanged - Confirm the constant
MAX_TEXT_LENGTH_FOR_LLM = 50000is unchanged - Confirm the LLM invocation parameters
temperature=0.3, max_tokens=4096and theself.llm_client.chat_json(...)call site are unchanged - Confirm
backend/app/utils/locale.py,/locales/languages.json,/locales/en.json, and/locales/zh.jsonare not modified - Confirm
logger.warning(...),logger.info(...), module/class/method docstrings, and inline comments inontology_generator.pyare not modified (these are owned by issues #6 and #7) - Confirm
backend/pyproject.toml,backend/uv.lock, and any file outsidebackend/app/services/ontology_generator.pyare not modified - Observable completion: a
git diffreview againstmainshows changes only insidebackend/app/services/ontology_generator.py, only insideONTOLOGY_SYSTEM_PROMPTand_build_user_message, and the surrounding lines are byte-identical - Requirements: 3.1, 3.2, 3.5, 4.1, 4.2, 4.3, 4.4, 4.5, 4.6, 4.7, 5.1, 5.3, 7.1, 7.2, 7.3, 7.4
- Confirm the call to
-
4. Verify reasoning-model output compatibility and JSON shape stability
- Inspect
LLMClient.chat_jsonto confirm<think>tag stripping (inchat) and markdown-fence stripping (inchat_json) are still the only post-processors applied to the LLM response, and that no new pre-processing has been added inontology_generator.py - Run an in-process round-trip: instantiate
OntologyGenerator, callgenerate(...)with a small representativedocument_textslist andsimulation_requirement, and assert the returned dict has keysentity_types(length 10),edge_types,analysis_summary; assert the last two entity-type names arePersonandOrganization - Repeat the round-trip under simulated reasoning-model output to confirm the existing stripping path still parses cleanly (e.g. by patching
chatto wrap a known-good JSON in<think>...</think>and triple-fenced code, then assertingchat_jsonstill parses) - Observable completion: a short verification script under
backend/scripts/(or an inlinepython -crecorded in the PR description) demonstrates the round-trip succeeds with both clean and<think>/fenced LLM outputs - Requirements: 5.1, 5.2, 5.3, 5.4
- Inspect
-
5. Verify locale-driven output language under both
enandzh- Set the thread-local locale to
enviaset_locale("en"), runOntologyGenerator().generate(...)against the configured LLM, and confirm the returneddescriptionfields andanalysis_summarycontain no CJK characters and read as natural English - Set the thread-local locale to
zhviaset_locale("zh"), run the same round-trip, and confirm the returneddescriptionfields andanalysis_summarycontain CJK characters of equivalent quality to the pre-change baseline - Observable completion: both runs succeed; the
enrun is CJK-free in description fields, thezhrun continues to produce Chinese descriptions; results are recorded in the PR description - Requirements: 3.3, 3.4
- Set the thread-local locale to
-
6. Verify Step 1 graph-build parity end-to-end under
enlocale- Using a representative seed file, exercise the full Step 1 graph-build pipeline (upload → ontology → Graphiti → Neo4j) under
Accept-Language: en - Confirm the run completes without raising an exception attributable to ontology output
- Compare the resulting Neo4j node and edge counts against a recent
zh-locale baseline; confirm they are within an operator-acceptable tolerance (no doubling, no zeroing) - Observable completion: the pipeline reaches
GRAPH_COMPLETED, and the comparison numbers are recorded in the PR description - Requirements: 6.1, 6.2, 6.3
- Using a representative seed file, exercise the full Step 1 graph-build pipeline (upload → ontology → Graphiti → Neo4j) under
-
* 7. Add a static guard against CJK regression in this file's prompt strings
- Add a small one-shot script under
backend/scripts/that loadsONTOLOGY_SYSTEM_PROMPTand the rendered output of_build_user_message(...)for representative inputs, and asserts zero matches against the regex[一-鿿]over those strings - Optional: extend the existing
pytest-style harness if a thin assertion fits the project's minimal test surface - Observable completion: running the script exits 0 against the patched module; running it against a hypothetical revert of the patch exits non-zero
- Requirements: 1.1, 2.6
- Add a small one-shot script under