feat(i18n): translate ontology_generator prompts to english
Translate the system prompt constant and the user-message template in backend/app/services/ontology_generator.py from Chinese to English. The Chinese base prompt was biasing the model toward Chinese structure and word choice even when Accept-Language was en, leaving ontology descriptions and analysis_summary fields Chinese-flavoured.

The translation is in place and preserves every functional contract: the JSON output schema, the entity-type and relationship-type taxonomies verbatim, the reserved-attribute-name list, the count and length constraints, and all f-string interpolations. The get_language_instruction() postfix call site and the trailing English identifier-format directive are unchanged, so zh and other locales continue to receive locale-appropriate descriptions.

Logger calls, docstrings, and inline comments are intentionally left in Chinese; they are owned by issues #6 and #7.

A small static guard script (backend/scripts/test_ontology_prompts_no_cjk.py) AST-parses the module and asserts zero CJK in the system prompt and in every string literal of _build_user_message except the docstring, so the regression cannot reappear silently.

Closes #2
parent 3b17c0b9ba
commit 080683295d

@@ -0,0 +1,284 @@
# Design Document: i18n-ontology-generator-prompts

## Overview

**Purpose**: Translate the Chinese prompt strings in `backend/app/services/ontology_generator.py` (the system prompt constant and the user-message template) to English while preserving every functional contract: JSON output schema, taxonomy lists, reserved-attribute names, fallback rules, variable interpolations, and the `get_language_instruction()` locale-postfix mechanism. The goal is to remove the Chinese-language base-prompt bias that currently leaks Chinese structure and word choice into ontology output even when `Accept-Language: en`.

**Users**: MiroFish operators running the Step 1 graph-build pipeline under any locale; downstream developers consuming the JSON ontology emitted by `OntologyGenerator.generate(...)`.

**Impact**: Replaces one large module-level string constant and four embedded string literals with English equivalents. No API surface change. No new dependencies. No new files. The single production caller (`backend/app/api/graph.py:223–228`) and all consumers of the validator output are unaffected.

### Goals

- Zero CJK characters in any prompt string literal contributed by `ontology_generator.py` to the system prompt or the user message.
- English ontology descriptions and `analysis_summary` under `Accept-Language: en`.
- Continued Chinese descriptions and `analysis_summary` under `Accept-Language: zh`, of equivalent quality to the pre-change behaviour.
- No diff to public signatures, constants, LLM-call parameters, or call sites.

### Non-Goals

- Externalizing prompts to `/locales/*.json` (out of scope per ticket).
- Translating logger calls in this file (covered by issue #6).
- Translating module/class/method docstrings or inline comments in this file (covered by issue #7).
- Refactoring the ontology JSON schema, the validator, or the extraction flow.
- Changing the entity-type or relationship-type reference taxonomies.
- Modifying `backend/app/utils/locale.py`, the locale registries, or any non-target file.

## Boundary Commitments

### This Spec Owns

- The English content of `ONTOLOGY_SYSTEM_PROMPT` (module-level constant in `backend/app/services/ontology_generator.py`).
- The English content of the four string literals embedded in `OntologyGenerator._build_user_message`: section headings, additional-context block, trailing rules block, and truncation notice.

### Out of Boundary

- Locale resolution machinery (`backend/app/utils/locale.py`).
- Per-locale `llmInstruction` definitions (`/locales/languages.json`).
- Reasoning-model output stripping (`backend/app/utils/llm_client.py`).
- Logger calls and `logger.warning` strings inside `ontology_generator.py` (issue #6).
- Module/class/method docstrings and inline comments inside `ontology_generator.py` (issue #7).
- The entity / edge taxonomy itself; only its descriptive prose changes language.
- All callers of `OntologyGenerator`, including `backend/app/api/graph.py`.
- Tests, scripts, and frontend code.

### Allowed Dependencies

- Existing `get_language_instruction()` import from `..utils.locale` (already imported; unchanged).
- Existing `LLMClient.chat_json` invocation (unchanged).
- No new imports.

### Revalidation Triggers

The following changes elsewhere would invalidate this design and require revisiting the prompt:

- A change to the JSON contract emitted by the LLM (`entity_types`, `edge_types`, `analysis_summary` keys or sub-keys).
- A change to `_validate_and_process` invariants (10 entity types, fallback `Person`/`Organization`, `MAX_*` caps, description length).
- A change to `get_language_instruction()` semantics or the per-locale `llmInstruction` strings.
- A change to the reasoning-model output stripping in `LLMClient.chat`/`chat_json`.

## Architecture

### Existing Architecture Analysis

`OntologyGenerator` lives in `backend/app/services/`, follows the in-process service pattern (no IO besides the LLM call), and is invoked synchronously from `backend/app/api/graph.py` inside a background `Task`. It depends on `LLMClient` for transport and on `get_language_instruction()` for locale steering. The relevant flow is:

1. The Flask handler resolves the request locale via `Accept-Language`; the locale is set via `set_locale()` for the background thread.
2. `OntologyGenerator.generate()` builds a user message from its inputs, appends the locale postfix and the English identifier-format directive to the (currently Chinese) system prompt, calls `chat_json`, then runs the response through `_validate_and_process`.
3. The validator self-heals invariants (count, fallback types, length, deduplication).

This design preserves all of the above. The change is purely lexical inside two regions of one file.
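
Step 2 of the flow above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's code: the placeholder prompt text, the stub `get_language_instruction(locale)` (the real function takes its locale from context), the English instruction string, and the helper name `assemble_system_prompt` are all hypothetical. Only the concatenation order (base prompt, locale postfix, identifier-format directive) and the `zh` postfix string come from this spec.

```python
# Sketch only: every name below is a stand-in, not the project's code.
ONTOLOGY_SYSTEM_PROMPT = "You are an ontology designer. (placeholder for the translated prompt)"

# The spec quotes only the start of the directive; the remainder stays elided here too.
IDENTIFIER_DIRECTIVE = "IMPORTANT: Entity type names MUST be in English PascalCase ..."

# The zh string is quoted in the requirements document; the en string is an assumption.
_LLM_INSTRUCTIONS = {
    "zh": "请使用中文回答。",
    "en": "Please answer in English.",
}


def get_language_instruction(locale: str = "zh") -> str:
    """Stub resolver: look up the postfix, defaulting to zh like the real chain."""
    return _LLM_INSTRUCTIONS.get(locale, _LLM_INSTRUCTIONS["zh"])


def assemble_system_prompt(locale: str) -> str:
    """Base prompt first, then the locale postfix, then the identifier directive."""
    lang_instruction = get_language_instruction(locale)
    return f"{ONTOLOGY_SYSTEM_PROMPT}\n\n{lang_instruction}\n{IDENTIFIER_DIRECTIVE}"
```

Because the base prompt is a constant and the postfix is resolved per call, translating the constant changes the language of everything before the postfix without touching how the postfix is chosen.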

### Architecture Pattern & Boundary Map

```mermaid
graph TB
    Caller[graph.py handler]
    Generator[OntologyGenerator]
    Validator[_validate_and_process]
    Locale[locale.get_language_instruction]
    Client[LLMClient.chat_json]

    Caller -->|generate inputs| Generator
    Generator -->|read locale postfix| Locale
    Generator -->|JSON request| Client
    Client -->|raw JSON| Generator
    Generator -->|self-heal invariants| Validator
    Validator -->|validated ontology| Caller
```

**Architecture Integration**:

- Selected pattern: **In-place lexical translation** of two regions of an existing service. No structural change.
- Domain/feature boundaries: locale machinery vs. service prompt vs. transport stripping remain cleanly separated.
- Existing patterns preserved: prompt-as-constant; `f"..."` user-message construction; locale-postfix concatenation; validator self-healing.
- New components rationale: none; no new components are introduced.
- Steering compliance: matches `tech.md` ("translate keys, not raw log lines, when adding new logs that surface to users") for what is in-scope here, and respects the steering note that "existing files mix English and Chinese in comments/docstrings — preserve both; do not translate one into the other unless asked." This ticket is the explicit ask for the prompt strings, scoped to exclude comments/docstrings.

### Technology Stack

| Layer | Choice / Version | Role in Feature | Notes |
|-------|------------------|-----------------|-------|
| Backend / Services | Python 3.11+ | Hosts `OntologyGenerator` | Existing, unchanged. |
| Backend / Services | `openai` SDK via `LLMClient` | Issues the prompt; performs `<think>` and fence stripping | Existing, unchanged. |
| Backend / Services | `backend/app/utils/locale.py` | Resolves `Accept-Language` → `llmInstruction` postfix | Existing, unchanged. |

No new dependencies. No version changes.

## File Structure Plan

### Modified Files

- `backend/app/services/ontology_generator.py`: replace the body of `ONTOLOGY_SYSTEM_PROMPT` with an English translation; replace the four Chinese string fragments in `_build_user_message` with English equivalents; preserve every other character of the file.

No new files. No deletions. No moves.

## System Flows

The control-flow diagram in *Architecture Pattern & Boundary Map* covers the relevant flow; no additional diagrams are needed for this string-literal change.

## Requirements Traceability

| Requirement | Summary | Components | Interfaces | Flows |
|-------------|---------|------------|------------|-------|
| 1.1 | Zero Chinese in `ONTOLOGY_SYSTEM_PROMPT` | OntologyGenerator → `ONTOLOGY_SYSTEM_PROMPT` | None changed | n/a |
| 1.2 | Preserve JSON output keys | OntologyGenerator → prompt template region | LLM JSON contract | Architecture diagram |
| 1.3 | Preserve entity-type reference list verbatim | OntologyGenerator → prompt reference list | Prompt-only | n/a |
| 1.4 | Preserve relationship-type reference list verbatim | OntologyGenerator → prompt reference list | Prompt-only | n/a |
| 1.5 | Preserve reserved attribute names | OntologyGenerator → prompt rules region | Prompt-only | n/a |
| 1.6 | Preserve fallback rule (Person, Organization) | OntologyGenerator → prompt + validator | Validator self-healing | n/a |
| 1.7 | Preserve count constraints | OntologyGenerator → prompt + validator | Validator self-healing | n/a |
| 1.8 | Preserve description-length constraint | OntologyGenerator → prompt + validator | Validator self-healing | n/a |
| 2.1 | English section headings in user message | OntologyGenerator → `_build_user_message` | None changed | n/a |
| 2.2 | English trailing rules block | OntologyGenerator → `_build_user_message` | None changed | n/a |
| 2.3 | English truncation notice | OntologyGenerator → `_build_user_message` | None changed | n/a |
| 2.4 | Variable interpolations preserved | OntologyGenerator → `_build_user_message` | f-string interpolation | n/a |
| 2.5 | Conditional additional-context block preserved | OntologyGenerator → `_build_user_message` | Python conditional | n/a |
| 2.6 | Zero Chinese in user message | OntologyGenerator → `_build_user_message` | n/a | n/a |
| 3.1 | Postfix call site preserved | OntologyGenerator → `generate` line ~209 | `get_language_instruction()` | Architecture diagram |
| 3.2 | English identifier-format directive preserved | OntologyGenerator → system_prompt assembly | Prompt-only | n/a |
| 3.3 | `zh` locale produces Chinese output | OntologyGenerator + Locale | `get_language_instruction()` | Architecture diagram |
| 3.4 | `en` locale produces English output | OntologyGenerator + Locale | `get_language_instruction()` | Architecture diagram |
| 3.5 | No edits to locale module or registries | n/a (boundary commitment) | n/a | n/a |
| 4.1–4.7 | API and constant stability | OntologyGenerator (signatures, constants) | Public surface | n/a |
| 5.1–5.4 | Reasoning-model compatibility | OntologyGenerator → `chat_json` call | LLMClient.chat_json | Architecture diagram |
| 6.1–6.3 | Step 1 graph-build parity | Validation runs (manual) | n/a | n/a |
| 7.1–7.4 | Out-of-scope surfaces untouched | OntologyGenerator (boundary commitment) | n/a | n/a |

## Components and Interfaces

| Component | Domain/Layer | Intent | Req Coverage | Key Dependencies (P0/P1) | Contracts |
|-----------|--------------|--------|--------------|--------------------------|-----------|
| OntologyGenerator (modified) | Backend / Service | Render English ontology-generation prompts; preserve all behaviour | 1.1–1.8, 2.1–2.6, 3.1–3.5, 4.1–4.7, 5.1–5.4, 7.1–7.4 | LLMClient.chat_json (P0), get_language_instruction (P0), `_validate_and_process` (P0) | Service |

### Backend / Service

#### OntologyGenerator (modified)

| Field | Detail |
|-------|--------|
| Intent | Translate prompt strings to English while preserving every functional contract. |
| Requirements | 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 3.1, 3.2, 3.3, 3.4, 3.5, 4.1, 4.2, 4.3, 4.4, 4.5, 4.6, 4.7, 5.1, 5.2, 5.3, 5.4, 7.1, 7.2, 7.3, 7.4 |

**Responsibilities & Constraints**

- Owns: the English wording of `ONTOLOGY_SYSTEM_PROMPT` and the four user-message string fragments.
- Domain boundary: prompt content only. Does not own locale resolution, transport, or validation logic.
- Invariants:
  - `ONTOLOGY_SYSTEM_PROMPT` after translation MUST contain zero CJK characters.
  - The translated system prompt MUST present the same JSON template by key (`entity_types`, `edge_types`, `analysis_summary`; entity sub-keys `name`, `description`, `attributes`, `examples`; edge sub-keys `name`, `description`, `source_targets`, `attributes`; `source_targets` sub-keys `source`, `target`).
  - The translated system prompt MUST list the same entity-type names verbatim: `Student`, `Professor`, `Journalist`, `Celebrity`, `Executive`, `Official`, `Lawyer`, `Doctor`, `Person`, `University`, `Company`, `GovernmentAgency`, `MediaOutlet`, `Hospital`, `School`, `NGO`, `Organization`.
  - The translated system prompt MUST list the same relationship-type names verbatim: `WORKS_FOR`, `STUDIES_AT`, `AFFILIATED_WITH`, `REPRESENTS`, `REGULATES`, `REPORTS_ON`, `COMMENTS_ON`, `RESPONDS_TO`, `SUPPORTS`, `OPPOSES`, `COLLABORATES_WITH`, `COMPETES_WITH`.
  - The translated system prompt MUST list the same reserved attribute names verbatim: `name`, `uuid`, `group_id`, `created_at`, `summary`.
  - The translated system prompt MUST express the same numeric constraints: exactly 10 entity types, with the last 2 being `Person` and `Organization` fallbacks; 6–10 relationship types; 1–3 attributes per entity; description ≤ 100 characters.
  - The translated user message MUST preserve all f-string interpolations: `{simulation_requirement}`, `{combined_text}`, `{additional_context}`, `{original_length}`, `{self.MAX_TEXT_LENGTH_FOR_LLM}`.
  - The translated user message MUST conditionally include the `## Additional Context` block only when `additional_context` is truthy.
  - The call to `get_language_instruction()` MUST remain at its current location with its current return-value usage.
  - The trailing English identifier-format directive (`IMPORTANT: Entity type names MUST be in English PascalCase ...`) MUST remain byte-for-byte identical.
  - The call to `self.llm_client.chat_json(messages=messages, temperature=0.3, max_tokens=4096)` MUST remain unchanged.
  - All public signatures, the constant `MAX_TEXT_LENGTH_FOR_LLM`, and the private helpers `_to_pascal_case` and `_validate_and_process` MUST remain unchanged.
  - All `logger.warning(...)` calls, inline comments, and docstrings in this file MUST remain unchanged (out of scope per #6 and #7).
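
The user-message invariants above can be illustrated with a standalone sketch. It is not the method from `ontology_generator.py`: the function name `build_user_message`, the value of `MAX_TEXT_LENGTH_FOR_LLM`, the document separator, and the exact wording of the truncation notice are assumptions. The three headings, the truthiness check on `additional_context`, and the two numeric interpolations are the parts taken from this spec.

```python
from typing import List, Optional

MAX_TEXT_LENGTH_FOR_LLM = 50_000  # assumed value; the spec names the constant, not its size


def build_user_message(
    document_texts: List[str],
    simulation_requirement: str,
    additional_context: Optional[str] = None,
) -> str:
    """Illustrative stand-in for OntologyGenerator._build_user_message."""
    combined_text = "\n\n---\n\n".join(document_texts)  # separator is an assumption
    original_length = len(combined_text)

    if original_length > MAX_TEXT_LENGTH_FOR_LLM:
        combined_text = combined_text[:MAX_TEXT_LENGTH_FOR_LLM]
        # Hypothetical wording; both numeric interpolations are preserved.
        combined_text += (
            f"\n\n...(Original text is {original_length} characters; "
            f"only the first {MAX_TEXT_LENGTH_FOR_LLM} are used for analysis)..."
        )

    message = (
        f"## Simulation Requirement\n\n{simulation_requirement}\n\n"
        f"## Document Content\n\n{combined_text}\n"
    )

    # The block is emitted only when additional_context is truthy.
    if additional_context:
        message += f"\n## Additional Context\n\n{additional_context}\n"

    # Placeholder for the trailing rules block, whose wording is not given in this spec.
    message += "\nBased on the content above ..."
    return message
```

Note that an empty string for `additional_context` suppresses the block just as `None` does, which is what "truthy" implies.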

**Dependencies**

- Inbound: `backend/app/api/graph.py:223–228`, the sole production caller (P0).
- Outbound: `backend/app/utils/locale.get_language_instruction` for the locale postfix (P0). `backend/app/utils/llm_client.LLMClient.chat_json` for JSON LLM transport with stripping (P0).
- External: none.

**Contracts**: Service [x] / API [ ] / Event [ ] / Batch [ ] / State [ ]

##### Service Interface

The public Python interface is unchanged:

```python
class OntologyGenerator:
    def __init__(self, llm_client: Optional[LLMClient] = None) -> None: ...

    def generate(
        self,
        document_texts: List[str],
        simulation_requirement: str,
        additional_context: Optional[str] = None,
    ) -> Dict[str, Any]: ...

    def generate_python_code(self, ontology: Dict[str, Any]) -> str: ...
```

- Preconditions: `document_texts` is a non-empty list of strings; `simulation_requirement` is a non-empty string; locale is resolvable via the existing chain.
- Postconditions: `generate()` returns a dict with `entity_types` (length ≤ 10, ending in `Person` and `Organization`), `edge_types` (length ≤ 10), and `analysis_summary` (string).
- Invariants: see *Responsibilities & Constraints*.
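
The postconditions above lend themselves to a small shape check that a smoke test could run on the returned dict. The helper below is a hypothetical sketch, not part of the codebase; it assumes each entity type is a dict with a `name` key and that the fallbacks appear in the order `Person`, then `Organization`.

```python
from typing import Any, Dict, List


def check_postconditions(result: Dict[str, Any]) -> List[str]:
    """Return the violated postconditions; an empty list means the shape is fine."""
    problems: List[str] = []

    entity_types = result.get("entity_types")
    edge_types = result.get("edge_types")

    if not isinstance(entity_types, list) or len(entity_types) > 10:
        problems.append("entity_types must be a list of at most 10 items")
    else:
        # Assumed order of the two fallbacks: Person first, Organization last.
        tail = [e.get("name") for e in entity_types[-2:] if isinstance(e, dict)]
        if tail != ["Person", "Organization"]:
            problems.append("entity_types must end with Person, Organization")

    if not isinstance(edge_types, list) or len(edge_types) > 10:
        problems.append("edge_types must be a list of at most 10 items")

    if not isinstance(result.get("analysis_summary"), str):
        problems.append("analysis_summary must be a string")

    return problems
```

Returning a list of messages rather than raising on the first failure makes it easier to see every deviation from a single LLM response.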

**Implementation Notes**

- **Integration**: No new imports. No call-site changes. The only diff is the body of `ONTOLOGY_SYSTEM_PROMPT` and four string literals inside `_build_user_message`.
- **Validation**: After implementation, run a targeted regex check (`[一-鿿]`, i.e. `[\u4e00-\u9fff]`, over `ONTOLOGY_SYSTEM_PROMPT` and the relevant lines of `_build_user_message`) to confirm zero CJK in those literals. Run a manual round-trip via `OntologyGenerator().generate(...)` under both `en` and `zh` locales using a small seed text and assert: valid JSON, exactly 10 entity types ending in `Person` and `Organization`, descriptions in the expected language. Optionally run the end-to-end Step 1 graph build on a representative seed file under `en` and compare node/edge counts to a recent `zh` baseline.
- **Risks**: English-base bias on Chinese-locale output (mitigated by the `llmInstruction` postfix and the trailing English directive that locks identifier formats). Validator self-healing covers structural drift independent of prompt language.

## Data Models

No data-model changes. The JSON schema emitted by the LLM and consumed by `_validate_and_process` is preserved verbatim.

## Error Handling

### Error Strategy

Error handling is unchanged from the existing implementation:

- LLM transport errors propagate from `LLMClient.chat_json` (raises on failure modes the SDK exposes).
- Invalid JSON from the LLM raises `ValueError("LLM返回的JSON格式无效: ...")` ("The JSON returned by the LLM is malformed") from `chat_json`. Note: the error message itself lives in `llm_client.py` and is out of scope for this spec (issue #6).
- Validator self-healing handles structural drift (missing fallbacks, count overflows, invalid attribute reservations).
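
To make "self-healing" concrete, here is a minimal sketch of the kind of correction described, covering only the count cap, the two fallback types, and the description length. It is an assumption-laden illustration rather than the body of `_validate_and_process`: the constant names, the default fallback descriptions, and the choice to truncate (rather than reject) long descriptions are all hypothetical.

```python
from typing import Any, Dict, List

MAX_ENTITY_TYPES = 10          # hypothetical name for the 10-type cap
MAX_DESCRIPTION_LENGTH = 100   # hypothetical name for the description limit
FALLBACKS = ("Person", "Organization")


def self_heal_entity_types(entity_types: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Cap the list, put the two fallback types last, and truncate long descriptions."""
    # Copy each dict so the caller's data is left untouched.
    specific = [dict(e) for e in entity_types if e.get("name") not in FALLBACKS]
    present = {e["name"]: dict(e) for e in entity_types if e.get("name") in FALLBACKS}

    # Keep room for the two fallback types at the end of the list.
    healed = specific[: MAX_ENTITY_TYPES - len(FALLBACKS)]
    for name in FALLBACKS:
        default = {"name": name, "description": f"Fallback {name} type.", "attributes": []}
        healed.append(present.get(name, default))

    for entity in healed:
        description = entity.get("description", "")
        if len(description) > MAX_DESCRIPTION_LENGTH:
            entity["description"] = description[:MAX_DESCRIPTION_LENGTH]
    return healed
```

This sketch caps the list at 10 but does not pad a shorter list up to 10; whether the real validator pads is not stated in this spec. The point it illustrates is that these corrections operate on structure, so they behave the same whether the prompt is Chinese or English.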

### Error Categories and Responses

- **User errors (4xx)**: not applicable at this layer; surfaced by the API handler.
- **System errors (5xx)**: LLM/network failures propagate to the API handler, which converts them to JSON error responses.
- **Business logic errors**: structurally invalid ontology output is auto-corrected by `_validate_and_process` to satisfy the 10-type / fallback / length invariants.

### Monitoring

Existing `logger.warning` and `logger.info` calls already log auto-conversions and final counts; no new monitoring is added.

## Testing Strategy

### Unit Tests

Given the project's intentionally minimal test harness (`backend/scripts/test_profile_format.py` only, per `tech.md`), introducing a heavy new test suite is out of scope. Instead, two lightweight checks accompany the change:

- **Static check**: a regex assertion in a small ad-hoc script (or a one-shot `python -c`) confirming that `ONTOLOGY_SYSTEM_PROMPT` and the patched literals in `_build_user_message` contain zero characters in `[一-鿿]`. This can be a permanent simple test under `backend/scripts/` if desired or a one-off check during PR review.
- **Round-trip smoke test**: a manual run of `OntologyGenerator().generate(...)` against a configured LLM, locale `en`, with a small seed text. Assert: dict shape, entity-types length 10 ending in `Person`/`Organization`, description fields contain no `[一-鿿]`. Repeat under locale `zh` and assert description fields contain at least some `[一-鿿]` (sanity check that the postfix still steers Chinese output).
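
The static check could look roughly like the sketch below, which parses source text with `ast` rather than importing the module, so it needs no LLM configuration. The function name `cjk_literals` and the inline `SAMPLE` source are illustrative assumptions; a real guard would read `ontology_generator.py` from disk. The regex is the `[一-鿿]` range from this spec written with escapes (`\u4e00`-`\u9fff`); it covers CJK Unified Ideographs only, so CJK punctuation and fullwidth forms would need extra ranges.

```python
import ast
import re
from typing import List

CJK = re.compile(r"[\u4e00-\u9fff]")  # CJK Unified Ideographs, the range used in this spec

# Tiny stand-in for the real module: Chinese docstring, comment, and one Chinese literal.
SAMPLE = '''
ONTOLOGY_SYSTEM_PROMPT = """You are an ontology designer."""

class OntologyGenerator:
    def _build_user_message(self, simulation_requirement):
        """构建用户消息"""
        # 中文注释 never reaches the AST
        return f"## 模拟需求\\n\\n{simulation_requirement}"
'''


def cjk_literals(source: str, constant: str, function: str) -> List[str]:
    """Return string literals containing CJK in the named constant or function body."""
    offenders: List[str] = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Assign) and any(
            isinstance(t, ast.Name) and t.id == constant for t in node.targets
        ):
            scope = [node.value]
        elif isinstance(node, ast.FunctionDef) and node.name == function:
            scope = node.body
            # Skip the docstring, which is out of scope here (issue #7).
            if ast.get_docstring(node, clean=False) is not None:
                scope = scope[1:]
        else:
            continue
        for root in scope:
            for child in ast.walk(root):
                # Literal parts of f-strings also appear as str constants.
                if isinstance(child, ast.Constant) and isinstance(child.value, str):
                    if CJK.search(child.value):
                        offenders.append(child.value)
    return offenders
```

Calling `cjk_literals(SAMPLE, "ONTOLOGY_SYSTEM_PROMPT", "_build_user_message")` flags only the untranslated heading; the docstring is skipped explicitly and the comment is invisible to the parser, which matches the scope split with issues #6 and #7.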

### Integration Tests

- **Step 1 graph build under EN locale**: run the full pipeline end-to-end with a representative seed file under `Accept-Language: en`. Assert: pipeline completes without exception, ontology validates, node/edge counts in Neo4j are within operator-acceptable tolerance of a recent `zh` baseline. This is documented as an operator-run verification step in the PR description; automation is not required.

### E2E/UI Tests

Not applicable: the change does not affect the frontend.

### Performance/Load

Not applicable: the change does not alter performance characteristics. LLM call parameters (`temperature=0.3`, `max_tokens=4096`) are unchanged.

## Optional Sections

### Security Considerations

Not applicable. Translation does not introduce new authentication, authorization, data-handling, or input-validation paths. Reserved attribute names remain enforced via prompt and validator.

### Performance & Scalability

Not applicable. Prompt token counts may differ slightly between the Chinese and English renderings; the completion cap (`max_tokens=4096`) limits output length rather than prompt length and is unchanged.

### Migration Strategy

Not applicable. The change is a single in-place edit; no data migration. Rollback is `git revert`.

## Supporting References

- `backend/app/services/ontology_generator.py`: current Chinese prompt content (the source of translation).
- `backend/app/utils/locale.py`: locale resolver.
- `backend/app/utils/llm_client.py`: `chat_json` and `<think>` / fence stripping.
- `backend/app/api/graph.py:223–228`: sole production caller.
- `.kiro/specs/i18n-ontology-generator-prompts/research.md`: discovery findings, alternatives evaluation, and design decisions.
- `.ticket/2.md`: ticket snapshot.

@@ -0,0 +1,105 @@
# Gap Analysis: i18n-ontology-generator-prompts

## 1. Current State Investigation

### Domain assets

- **Subject file**: `backend/app/services/ontology_generator.py` (507 lines).
- **Module-level system prompt**: `ONTOLOGY_SYSTEM_PROMPT` (lines 30–173). Chinese, ~140 lines of structured prompt content describing task, output format, design guidelines, entity reference list, relationship reference list.
- **User-message builder**: `OntologyGenerator._build_user_message` (lines 231–275). Chinese section headings, truncation notice, and trailing rules block.
- **Locale postfix call site**: `get_language_instruction()` is invoked at line 209 and concatenated into the system prompt at line 210, alongside an English directive that locks identifier formats.
- **Locale resolver**: `backend/app/utils/locale.py` reads `Accept-Language` from request context, falls back to thread-local for background tasks, and ultimately defaults to `zh`. The English postfix lives in `/locales/languages.json` (`llmInstruction`).
- **LLM client**: `backend/app/utils/llm_client.py:LLMClient.chat_json` performs `<think>` stripping (line 65) and markdown-fence stripping (lines 84–87). This is **outside** `ontology_generator.py`, so the file does not own that logic; it just consumes it. Requirement R5 is satisfied trivially as long as we keep the `chat_json` call unchanged.
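
For readers unfamiliar with the stripping that `chat_json` performs, the sketch below shows the general technique: remove any `<think>...</think>` block, remove a surrounding Markdown fence, then parse the remainder as JSON. It is an assumption-based illustration, not the code in `llm_client.py`; the function name `parse_llm_json` and the exact patterns are hypothetical.

```python
import json
import re
from typing import Any, Dict

FENCE = "`" * 3  # a Markdown code fence, built at runtime to keep this listing readable

THINK_BLOCK = re.compile(r"<think>[\s\S]*?</think>")
FENCE_OPEN = re.compile(r"^" + FENCE + r"(?:json)?\s*", re.IGNORECASE)
FENCE_CLOSE = re.compile(r"\s*" + FENCE + r"\s*$")


def parse_llm_json(raw: str) -> Dict[str, Any]:
    """Strip reasoning output and a surrounding code fence, then parse the JSON."""
    cleaned = THINK_BLOCK.sub("", raw).strip()
    cleaned = FENCE_OPEN.sub("", cleaned)
    cleaned = FENCE_CLOSE.sub("", cleaned)
    # json.JSONDecodeError is a ValueError subclass, so malformed output
    # surfaces as a ValueError to the caller.
    return json.loads(cleaned.strip())
```

Since this cleanup happens on the response and is independent of the prompt's language, leaving the `chat_json` call untouched is enough to keep reasoning-model compatibility.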

### Call sites (consumers)

- `backend/app/api/graph.py:223–228`: the only production caller. Uses `OntologyGenerator()` with no constructor args, calls `.generate(document_texts, simulation_requirement, additional_context)`, and reads `entity_types`, `edge_types`, `analysis_summary` from the result. The shape contract is what matters; the language of `description` is not parsed.
- `backend/app/services/__init__.py`: re-exports the class.
- No tests currently reference this module (verified via `Grep ontology_generator|OntologyGenerator|ONTOLOGY_SYSTEM_PROMPT`).

### Conventions

- 4-space indentation, Python 3.11+, snake_case identifiers, type hints where present (matches surrounding file style).
- No linter/formatter; match existing style. The existing file uses Chinese inline comments, which are *out of scope* (issue #7).
- LLM prompts in this codebase are typically defined as module-level string constants and concatenated with `get_language_instruction()` for locale steering.
- Variable interpolation in user messages uses Python f-strings; the system prompt uses no interpolation today.

### Integration surfaces

- Output JSON schema (entity_types[], edge_types[], analysis_summary) is consumed by `_validate_and_process` (also in this file) and by Graphiti via the project's `ontology` field (set in `graph.py:235`).
- Reserved attribute names list (`name`, `uuid`, `group_id`, `created_at`, `summary`) is asserted in the prompt for the LLM to obey, not enforced by code in this file.
- Entity/edge fallback rules (`Person`, `Organization`) are *both* prompted and enforced by `_validate_and_process` lines 344–393. Code is the safety net; prompt is the steering.

## 2. Requirements → Asset Map

| Requirement | Existing Asset | Gap Type | Notes |
| --- | --- | --- | --- |
| R1 (system prompt EN) | `ONTOLOGY_SYSTEM_PROMPT` constant, lines 30–173 | **Missing: needs translation** | Mechanically a string-literal swap. Must preserve JSON template, taxonomy lists, fallback rules, count constraints, length constraint. |
| R2 (user message EN) | `_build_user_message`, lines 231–275 | **Missing: needs translation** | Four string literals: section headings, additional-context block, trailing rules block, and truncation notice. |
| R3 (locale switching) | `get_language_instruction()` call, line 209; trailing English directive, line 210 | **Constraint** | Must be preserved verbatim. No new code needed. |
| R4 (API stability) | `__init__`, `generate`, `generate_python_code`, `_to_pascal_case`, `_validate_and_process`, `MAX_TEXT_LENGTH_FOR_LLM`, `chat_json(temperature=0.3, max_tokens=4096)` | **Constraint** | No changes to signatures or constants. |
| R5 (reasoning-model compat) | `LLMClient.chat_json` (separate file) | **Constraint** | Already external; preservation is automatic if the `chat_json` call is untouched. |
| R6 (graph build parity) | Graph build pipeline rooted in `graph.py` | **Verification** (manual run) | Requires a sample seed file run; not a code change. |
| R7 (out-of-scope discipline) | Loggers (lines 297, 314, 341), docstrings, comments | **Constraint** | Translator must not touch them. |

### Gaps tagged

- **Missing**: prompt content needs human/operator-quality English translation (R1, R2).
- **Constraint**: signatures, JSON contract, taxonomy names, locale postfix, LLM-call parameters, comments/docstrings/loggers are immutable in this PR (R3, R4, R5, R7).
- **Verification**: locale `zh` and locale `en` end-to-end runs to confirm parity (R3, R6).
- **Research Needed**: none. Locale machinery, JSON contract, and LLM client behaviour are all already understood from reading existing code in this repo.

### Complexity signals

- This is **string-literal localization with structural preservation**, not feature work. No data model, API, or workflow changes. No external integrations. No new patterns. The risk is content quality, not technical correctness.

## 3. Implementation Approach Options

### Option A: In-place translation of the existing constant and method (recommended)

Translate `ONTOLOGY_SYSTEM_PROMPT` and the four Chinese string literals inside `_build_user_message` directly. No new files, no new abstractions.

- ✅ Minimal diff, easy to review, matches the file's existing style.
- ✅ Preserves the locale-postfix mechanism unchanged (the postfix is what currently steers `zh` output and will continue to do so under an English base prompt).
- ✅ Aligns with how the analogous i18n issues for sibling files (#3, #4, #5) are framed in the epic.
- ❌ The English base prompt biases the model toward English structure for Chinese locale runs; mitigated by the existing trailing English directive that locks identifier formats and by the per-locale `llmInstruction` postfix.

### Option B: Externalize prompts to locale files

Move `ONTOLOGY_SYSTEM_PROMPT` content to `/locales/en.json` and `/locales/zh.json` and resolve at runtime via `t("ontology.system_prompt")`.

- ✅ Provides parallel zh/en prompts, eliminating cross-locale bias entirely.
- ❌ Out of scope per issue #2: externalizing log messages is issue #6, and a similar pattern would expand this PR's surface beyond the ticket. Adopting it here would also risk merge conflicts with #6.
- ❌ Adds runtime indirection (file IO, key lookups) for a string that has not been externalized in any other prompt module. Inconsistent with current convention until a future i18n-prompt initiative.
- ❌ Requires authoring high-quality Chinese prompts as locale data, which is exactly what this ticket moves away from to fix the bias in English output.

### Option C: Hybrid (translate in place, parameterize the locale postfix)

Translate in place per Option A, and additionally factor `system_prompt = f"{ONTOLOGY_SYSTEM_PROMPT}\n\n{lang_instruction}\n..."` into a small helper.

- ✅ Slightly cleaner.
- ❌ Refactor outside the ticket's scope. Issue #2 is explicit: "No diff to call sites of these prompts — same function signatures and return shapes." A helper would change a private code shape unnecessarily.

## 4. Effort & Risk

- **Effort: S (1 day)**. String-literal translation with structural preservation. The bulk of the time is producing accurate, terminology-faithful English prose for the system prompt's design guidelines.
- **Risk: Low**. Well-bounded change, no API surface impact, JSON contract preserved by validator code that already exists, no new dependencies. The single residual risk is qualitative (the English prompt failing to elicit equivalent ontology quality), mitigated by:
  - The trailing English directive at line 210, which already locks identifier formats.
  - `_validate_and_process`, which enforces fallback `Person` / `Organization` types in code regardless of prompt.
  - Manual verification under both `en` and `zh` locales, which is part of acceptance.
|
||||
|
||||
## 5. Recommendations for Design Phase
|
||||
|
||||
- **Preferred approach**: Option A — translate `ONTOLOGY_SYSTEM_PROMPT` and the four user-message string fragments in place. Preserve every code structure around them.
|
||||
- **Key decisions for design**:
|
||||
1. Translation style for the system prompt: faithful, terminology-preserving English. Maintain the same section structure (`## Core Task Background`, `## Output Format`, `## Design Guidelines`, `## Entity Type Reference`, `## Relationship Type Reference`). Keep all Chinese-language gloss in the entity reference list intact in spirit but rendered in English (e.g. `Student: 学生` becomes `Student: a student`).
|
||||
2. Heading translations for user message: `## 模拟需求` → `## Simulation Requirement`; `## 文档内容` → `## Document Content`; `## 额外说明` → `## Additional Context`.
|
||||
3. Truncation notice: render in English, preserve both numeric interpolations (`{original_length}`, `{self.MAX_TEXT_LENGTH_FOR_LLM}`).
|
||||
4. Trailing rules block: render in English, preserve the five-rule enumeration semantics verbatim, and keep the call to action ("Based on the content above ...").
|
||||
5. The trailing English directive at line 210 (`IMPORTANT: Entity type names MUST be in English PascalCase ...`) is already English; leave it byte-for-byte unchanged.
|
||||
6. No code structure changes. No new imports. No changes to signatures, constants, or the `chat_json` call.
|
||||
- **Verification plan for design**:
|
||||
- Static check: zero CJK characters in any prompt string literal post-edit (regex `[一-鿿]`, i.e. U+4E00 through U+9FFF, over the patched constant and the patched method body).
|
||||
- Runtime check: with `LLM_API_KEY` configured for a test provider, run a small `OntologyGenerator().generate(...)` round-trip with locale `en` and locale `zh`, asserting JSON validity and the invariant of exactly 10 entity types ending in `Person` and `Organization`.
|
||||
- End-to-end check: run the Step 1 graph build on a representative seed file with locale `en`; compare node and edge counts to a recent `zh` baseline within operator tolerance.
|
||||
- **Research items**: none open. All adjacent systems (locale resolver, LLM client, validator, graph build pipeline) are read-only and behave deterministically with respect to the changes proposed.
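
The static check in the verification plan could be sketched roughly as below. This is only an illustration, assuming the prompt is assigned to a module-level constant; `SAMPLE_SOURCE` and the helper name `cjk_literals` are hypothetical stand-ins and not the project's guard script.

```python
import ast
import re

# CJK Unified Ideographs, U+4E00 through U+9FFF (the range quoted in the plan).
CJK = re.compile(r"[\u4e00-\u9fff]")

# Hypothetical stand-in for the source text of the module under test.
SAMPLE_SOURCE = '''
ONTOLOGY_SYSTEM_PROMPT = """You are an ontology designer."""
LEGACY_HEADING = "## 模拟需求"
'''


def cjk_literals(source: str, constant_name: str) -> list[str]:
    """Return string literals assigned to `constant_name` that contain CJK."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Assign):
            continue
        names = [t.id for t in node.targets if isinstance(t, ast.Name)]
        if constant_name not in names:
            continue
        # Walk the assigned value so f-string fragments are covered too.
        for child in ast.walk(node.value):
            if isinstance(child, ast.Constant) and isinstance(child.value, str):
                if CJK.search(child.value):
                    hits.append(child.value)
    return hits
```

Parsing with `ast` rather than grepping the file keeps comments and docstrings, which are out of scope here, from producing false positives.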
|
||||
|
|
@ -0,0 +1,115 @@
|
|||
# Requirements Document
|
||||
|
||||
## Introduction
|
||||
|
||||
This specification covers the English translation of the prompt strings in `backend/app/services/ontology_generator.py`. The file produces the project ontology (entity types, relationship types, schema commentary) that drives the Graphiti graph build (Step 1 of the MiroFish pipeline). Today, the system prompt and user-message templates are written in Chinese; the language is steered at runtime by appending `get_language_instruction()` to the system message. While that postfix instructs the model *which* language to respond in, the base-prompt language biases the model's structural and lexical output. As a result, ontology descriptions, reasoning, and schema commentary skew Chinese under `Accept-Language: en`. Translating the base prompt to English removes that bias while preserving the existing locale-switching mechanism for non-English locales (verified: `get_language_instruction()` returns the Chinese postfix `请使用中文回答。` when locale is `zh`, so a Chinese model response remains achievable from an English base prompt).
|
||||
|
||||
This work tracks GitHub issue [#2](https://github.com/salestech-group/MiroFish/issues/2).
|
||||
|
||||
## Boundary Context
|
||||
|
||||
- **In scope**:
|
||||
- Translating `ONTOLOGY_SYSTEM_PROMPT` (the module-level system prompt constant) from Chinese to English.
|
||||
- Translating the user-message template constructed in `OntologyGenerator._build_user_message` (Chinese section headings and instruction list) to English.
|
||||
- Translating the truncation notice string emitted when input text exceeds `MAX_TEXT_LENGTH_FOR_LLM`.
|
||||
- Translating the trailing instruction string appended to the user message ("必须遵守的规则" block).
|
||||
- Preserving all functional contracts: JSON schema, key names, entity-type taxonomy, relationship-type taxonomy, attribute reserved-word list, fallback rules, variable interpolation, and the `get_language_instruction()` postfix call site.
|
||||
- **Out of scope**:
|
||||
- Logger messages, including warnings emitted by `_validate_and_process` (covered by issue #6).
|
||||
- Module docstring, class docstrings, method docstrings, and inline comments (covered by issue #7).
|
||||
- Refactoring the ontology JSON schema, validation flow, or extraction strategy.
|
||||
- Changing the entity-type or relationship-type reference taxonomies (the categories themselves remain — only their description language changes).
|
||||
- Editing call sites of `OntologyGenerator.generate` or `generate_python_code`.
|
||||
- Translating the auto-generated Python code emitted by `generate_python_code` (the comment headers there are documentation, covered by #7).
|
||||
- **Adjacent expectations**:
|
||||
- The Graphiti adapter (`graphiti_adapter`) and Step 1 graph build pipeline must continue to consume the ontology output unchanged. No coupling to prompt language exists in the adapter; this is verified via the JSON schema contract being preserved.
|
||||
- The locale resolution chain (`Accept-Language` header → `get_locale()` → `get_language_instruction()`) is owned by `backend/app/utils/locale.py` and is unchanged by this work. Translating the base prompt does not modify locale resolution semantics.
|
||||
- Companion i18n issues (#3, #4, #5, #6, #7, #8, #9, #10) operate on different files or scopes and should not be touched here.
|
||||
|
||||
## Requirements
|
||||
|
||||
### Requirement 1: English Translation of the Ontology System Prompt
|
||||
|
||||
**Objective:** As a MiroFish operator running the pipeline under `Accept-Language: en`, I want the ontology-generation system prompt to be authored in English, so that the LLM's ontology descriptions, reasoning, and schema commentary are not biased toward Chinese structure or word choice.
|
||||
|
||||
#### Acceptance Criteria
|
||||
|
||||
1. The Ontology Generator shall define `ONTOLOGY_SYSTEM_PROMPT` containing zero Chinese characters in any string-literal content.
|
||||
2. The Ontology Generator shall preserve the JSON output contract of the system prompt verbatim: the keys `entity_types`, `edge_types`, `analysis_summary`, and the entity sub-keys `name`, `description`, `attributes`, `examples`, and the edge sub-keys `name`, `description`, `source_targets`, `attributes`, plus the `source_targets` sub-keys `source` and `target`.
|
||||
3. The Ontology Generator shall preserve the entity-type reference list verbatim by name (`Student`, `Professor`, `Journalist`, `Celebrity`, `Executive`, `Official`, `Lawyer`, `Doctor`, `Person`, `University`, `Company`, `GovernmentAgency`, `MediaOutlet`, `Hospital`, `School`, `NGO`, `Organization`).
|
||||
4. The Ontology Generator shall preserve the relationship-type reference list verbatim by name (`WORKS_FOR`, `STUDIES_AT`, `AFFILIATED_WITH`, `REPRESENTS`, `REGULATES`, `REPORTS_ON`, `COMMENTS_ON`, `RESPONDS_TO`, `SUPPORTS`, `OPPOSES`, `COLLABORATES_WITH`, `COMPETES_WITH`).
|
||||
5. The Ontology Generator shall preserve the reserved-attribute-name list verbatim (`name`, `uuid`, `group_id`, `created_at`, `summary`).
|
||||
6. The Ontology Generator shall preserve the fallback-type rule that exactly two fallback entity types — `Person` and `Organization` — must appear at the end of a 10-item list.
|
||||
7. The Ontology Generator shall preserve the entity-count constraint (exactly 10 entity types) and the edge-count constraint (6–10 relationship types).
|
||||
8. The Ontology Generator shall preserve the description-length constraint (entity and edge `description` ≤ 100 characters).
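
One way a reviewer might spot-check criteria 2 through 5 is a token-presence test over the translated prompt. The sketch below is an assumption-laden illustration: the name lists are copied from the criteria above, while `missing_tokens` is a hypothetical helper and the check only proves that each name still appears somewhere in the text, not that it appears in the right section.

```python
import re

ENTITY_TYPES = [
    "Student", "Professor", "Journalist", "Celebrity", "Executive", "Official",
    "Lawyer", "Doctor", "Person", "University", "Company", "GovernmentAgency",
    "MediaOutlet", "Hospital", "School", "NGO", "Organization",
]
RELATIONSHIP_TYPES = [
    "WORKS_FOR", "STUDIES_AT", "AFFILIATED_WITH", "REPRESENTS", "REGULATES",
    "REPORTS_ON", "COMMENTS_ON", "RESPONDS_TO", "SUPPORTS", "OPPOSES",
    "COLLABORATES_WITH", "COMPETES_WITH",
]
RESERVED_ATTRIBUTES = ["name", "uuid", "group_id", "created_at", "summary"]
JSON_KEYS = [
    "entity_types", "edge_types", "analysis_summary", "name", "description",
    "attributes", "examples", "source_targets", "source", "target",
]


def missing_tokens(prompt: str) -> list[str]:
    """Return required identifiers that do not occur as whole words in `prompt`."""
    # dict.fromkeys removes duplicates (e.g. "name") while keeping order.
    required = dict.fromkeys(
        ENTITY_TYPES + RELATIONSHIP_TYPES + RESERVED_ATTRIBUTES + JSON_KEYS
    )
    return [t for t in required if not re.search(rf"\b{re.escape(t)}\b", prompt)]
```

An empty return value means no identifier was lost in translation; any returned name points at a term the translator renamed or dropped.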
|
||||
|
||||
### Requirement 2: English Translation of the User-Message Template
|
||||
|
||||
**Objective:** As a MiroFish operator running the pipeline under `Accept-Language: en`, I want the user-message template constructed by `_build_user_message` to be authored in English, so that the rendered prompt does not interleave English `get_language_instruction()` directives with Chinese section headings.
|
||||
|
||||
#### Acceptance Criteria
|
||||
|
||||
1. The Ontology Generator shall render the user message with English section headings in place of `## 模拟需求`, `## 文档内容`, and `## 额外说明`.
|
||||
2. The Ontology Generator shall render the trailing rules block in English (replacing `请根据以上内容...` and the `必须遵守的规则` enumeration), preserving the rule semantics: 10 entity types total, last 2 are `Person`/`Organization` fallbacks, first 8 are concrete types, all entities must be real-world social-media-capable subjects (not abstract concepts), and reserved attribute names cannot be used.
|
||||
3. The Ontology Generator shall render the truncation notice in English when the combined document text exceeds `MAX_TEXT_LENGTH_FOR_LLM`, including the original character count and the truncation length.
|
||||
4. The Ontology Generator shall preserve all variable interpolations verbatim by name (`simulation_requirement`, `combined_text`, `additional_context`, and the `{original_length}` / `{self.MAX_TEXT_LENGTH_FOR_LLM}` interpolations in the truncation notice).
|
||||
5. The Ontology Generator shall preserve the conditional inclusion of the `## Additional Context` section only when `additional_context` is truthy.
|
||||
6. The Ontology Generator shall contribute zero Chinese characters from its own string literals to the assembled user message (interpolated values such as the document text may still contain Chinese).
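
The criteria above might translate into a message builder along the following lines. This is a standalone sketch, not the real `_build_user_message`: the function name, the `max_len` parameter, the document separator, and the exact English wording of the truncation notice and rules block are all assumptions made for illustration. Only the three headings and the interpolation names come from the requirements.

```python
from typing import List, Optional

MAX_TEXT_LENGTH_FOR_LLM = 50000  # value stated in Requirement 4


def build_user_message(
    document_texts: List[str],
    simulation_requirement: str,
    additional_context: Optional[str] = None,
    max_len: int = MAX_TEXT_LENGTH_FOR_LLM,
) -> str:
    # Assumption: documents are joined with a simple separator.
    combined_text = "\n\n---\n\n".join(document_texts)
    original_length = len(combined_text)

    # Truncation notice keeps both numeric interpolations.
    if original_length > max_len:
        combined_text = combined_text[:max_len]
        combined_text += (
            f"\n\n...(original text is {original_length} characters; "
            f"the first {max_len} characters were used for ontology analysis)"
        )

    message = (
        f"## Simulation Requirement\n\n{simulation_requirement}\n\n"
        f"## Document Content\n\n{combined_text}\n"
    )
    # The section is included only when additional_context is truthy.
    if additional_context:
        message += f"\n## Additional Context\n\n{additional_context}\n"

    # Hypothetical wording that conveys the five rules.
    message += (
        "\nBased on the content above, design entity types and relationship types.\n\n"
        "Rules that must be followed:\n"
        "1. Output exactly 10 entity types.\n"
        "2. The last 2 must be the fallback types Person and Organization.\n"
        "3. The first 8 are concrete types derived from the document content.\n"
        "4. Every entity type must be a real-world subject able to act on social media, "
        "not an abstract concept.\n"
        "5. Attribute names must not use the reserved names name, uuid, group_id, "
        "created_at, summary.\n"
    )
    return message
```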
|
||||
|
||||
### Requirement 3: Locale Switching Continues to Work via `get_language_instruction()`
|
||||
|
||||
**Objective:** As a MiroFish operator running the pipeline under `Accept-Language: zh` (or any other configured non-English locale), I want the ontology output to remain in the requested locale at equivalent quality, so that translating the base prompt does not regress non-English support.
|
||||
|
||||
#### Acceptance Criteria
|
||||
|
||||
1. The Ontology Generator shall preserve the call to `get_language_instruction()` exactly at the existing location (currently the line above `system_prompt = f"{ONTOLOGY_SYSTEM_PROMPT}\n\n{lang_instruction}\n..."`), continuing to read locale via the existing thread-local / request-header resolution chain.
|
||||
2. The Ontology Generator shall preserve the trailing English directive that locks identifier formats (`Entity type names MUST be in English PascalCase ...`, `Relationship type names MUST be in English UPPER_SNAKE_CASE ...`, `Attribute names MUST be in English snake_case ...`, `Only description fields and analysis_summary should use the specified language above.`).
|
||||
3. When the locale is `zh`, the Ontology Generator shall produce a JSON ontology whose `description` and `analysis_summary` fields are in Chinese, equivalent in quality to the pre-change behaviour.
|
||||
4. When the locale is `en`, the Ontology Generator shall produce a JSON ontology whose `description` and `analysis_summary` fields are in English.
|
||||
5. The Ontology Generator shall not alter `backend/app/utils/locale.py`, the `_languages` and `_translations` registries, or the locale files under `/locales/`.
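
To make criteria 1 and 2 concrete, the assembly can be pictured as a fixed base, a locale-dependent postfix, and a fixed directive. The sketch below uses placeholder text for the base prompt, abbreviates the directive exactly where this document does, and assumes the directive directly follows the language instruction; the English instruction wording is hypothetical.

```python
# Placeholder; the real constant is the translated system prompt.
ONTOLOGY_SYSTEM_PROMPT = "You are an ontology design assistant."

# Abbreviated at the same point where this document abbreviates it.
IDENTIFIER_DIRECTIVE = "IMPORTANT: Entity type names MUST be in English PascalCase ..."


def assemble_system_prompt(lang_instruction: str) -> str:
    """Base prompt, then the locale postfix, then the identifier directive."""
    return f"{ONTOLOGY_SYSTEM_PROMPT}\n\n{lang_instruction}\n{IDENTIFIER_DIRECTIVE}"


zh_prompt = assemble_system_prompt("请使用中文回答。")
en_prompt = assemble_system_prompt("Please answer in English.")  # hypothetical wording
```

Under this shape only the middle segment varies by locale, which is why translating the base does not touch the locale-switching path.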
|
||||
|
||||
### Requirement 4: Public API and Call-Site Stability
|
||||
|
||||
**Objective:** As a developer maintaining the rest of the MiroFish backend pipeline, I want the public surface of `OntologyGenerator` to remain unchanged, so that the graph-build flow and existing callers continue to work without modification.
|
||||
|
||||
#### Acceptance Criteria
|
||||
|
||||
1. The Ontology Generator shall preserve the signature of `OntologyGenerator.__init__(self, llm_client: Optional[LLMClient] = None)`.
|
||||
2. The Ontology Generator shall preserve the signature of `OntologyGenerator.generate(self, document_texts: List[str], simulation_requirement: str, additional_context: Optional[str] = None) -> Dict[str, Any]`.
|
||||
3. The Ontology Generator shall preserve the signature of `OntologyGenerator.generate_python_code(self, ontology: Dict[str, Any]) -> str`.
|
||||
4. The Ontology Generator shall preserve the return-shape contract of `generate()`: a `Dict[str, Any]` with keys `entity_types`, `edge_types`, `analysis_summary` matching the existing JSON schema, post-validation.
|
||||
5. The Ontology Generator shall preserve the signature of the private helper `_to_pascal_case(name: str) -> str` and the validator `_validate_and_process(self, result: Dict[str, Any]) -> Dict[str, Any]`.
|
||||
6. The Ontology Generator shall preserve the constant `MAX_TEXT_LENGTH_FOR_LLM = 50000`.
|
||||
7. The Ontology Generator shall preserve the LLM invocation parameters (`temperature=0.3`, `max_tokens=4096`) and the call to `self.llm_client.chat_json(...)`.
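
A signature-drift check for the public methods could look something like this. The stub class only mirrors the signatures listed above (with `object` standing in for `LLMClient`), and `signature_drift` is a hypothetical helper; in practice the real `OntologyGenerator` would be passed in instead of the stub.

```python
import inspect
from typing import Any, Dict, List, Optional


class OntologyGeneratorStub:
    """Stand-in carrying the signatures from Requirement 4; bodies omitted."""

    MAX_TEXT_LENGTH_FOR_LLM = 50000

    def __init__(self, llm_client: Optional[object] = None): ...

    def generate(
        self,
        document_texts: List[str],
        simulation_requirement: str,
        additional_context: Optional[str] = None,
    ) -> Dict[str, Any]: ...

    def generate_python_code(self, ontology: Dict[str, Any]) -> str: ...


EXPECTED_PARAMS = {
    "__init__": ["self", "llm_client"],
    "generate": ["self", "document_texts", "simulation_requirement", "additional_context"],
    "generate_python_code": ["self", "ontology"],
}


def signature_drift(cls) -> dict:
    """Map each method whose parameter names changed to its current names."""
    drift = {}
    for method, expected in EXPECTED_PARAMS.items():
        actual = list(inspect.signature(getattr(cls, method)).parameters)
        if actual != expected:
            drift[method] = actual
    return drift
```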
|
||||
|
||||
### Requirement 5: Reasoning-Model Output Compatibility
|
||||
|
||||
**Objective:** As a MiroFish operator using a reasoning-model provider (e.g. MiniMax, GLM with `<think>` tags or markdown code fences), I want JSON parsing of the ontology response to continue working, so that translating the base prompt does not regress provider compatibility.
|
||||
|
||||
#### Acceptance Criteria
|
||||
|
||||
1. The Ontology Generator shall delegate JSON parsing to `LLMClient.chat_json` exactly as today (the call at the existing site is unchanged in name and arguments).
|
||||
2. If a reasoning-model provider returns `<think>`-tagged or markdown-fenced output, then the existing stripping logic in `LLMClient.chat_json` shall continue to apply unchanged.
|
||||
3. The Ontology Generator shall not introduce any new pre-processing of the LLM response that depends on prompt language.
|
||||
4. After translation, the Ontology Generator shall continue to round-trip a sample seed file through `generate()` and `_validate_and_process()` and produce a non-empty `entity_types` list of length 10 with the `Person` and `Organization` fallbacks present at indices 8 and 9 (or earlier, in the order produced).
|
||||
|
||||
### Requirement 6: Step 1 Graph Build Parity
|
||||
|
||||
**Objective:** As a MiroFish operator validating the change, I want the Graphiti / Neo4j Step 1 graph build to complete with comparable structure under the English ontology, so that the translation does not silently degrade graph quality.
|
||||
|
||||
#### Acceptance Criteria
|
||||
|
||||
1. When a representative seed file is processed end-to-end with locale `en`, the Step 1 graph build shall complete without raising an exception attributable to the ontology output.
|
||||
2. When a representative seed file is processed end-to-end with locale `en`, the resulting Neo4j graph shall contain a node count and edge count comparable to the pre-change Chinese-prompt baseline within an operator-acceptable tolerance (a small percentage variance is acceptable; doubling or zeroing is not).
|
||||
3. The Ontology Generator shall not change the function signatures or call sequence used by the Step 1 graph build pipeline (verified by Requirement 4).
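
The "operator-acceptable tolerance" in criterion 2 is not quantified in this document. As a rough sketch, a relative-difference check with an assumed 25% default might be used; both the threshold and the helper name are hypothetical and should be replaced by whatever the operator agrees on.

```python
def within_tolerance(baseline: int, observed: int, tolerance: float = 0.25) -> bool:
    """Return True when `observed` is within `tolerance` of `baseline`.

    The 0.25 default is an assumed value, not one taken from the requirements.
    """
    if baseline == 0:
        return observed == 0
    if observed == 0:
        # Zeroing is never acceptable, whatever the tolerance.
        return False
    return abs(observed - baseline) / baseline <= tolerance
```

The same function would be applied separately to node counts and edge counts from the `en` run against the `zh` baseline.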
|
||||
|
||||
### Requirement 7: Out-of-Scope Surfaces Remain Untouched
|
||||
|
||||
**Objective:** As a reviewer of this PR, I want the change to remain narrowly scoped to prompt strings, so that translation responsibilities for adjacent surfaces (issues #6 and #7) are not absorbed into this change.
|
||||
|
||||
#### Acceptance Criteria
|
||||
|
||||
1. The change shall not modify any `logger.warning(...)`, `logger.info(...)`, `logger.error(...)`, or `logger.debug(...)` call in `ontology_generator.py` (covered by issue #6).
|
||||
2. The change shall not modify the module docstring, class docstring, method docstrings, or inline comments in `ontology_generator.py` (covered by issue #7).
|
||||
3. The change shall not edit any file outside `backend/app/services/ontology_generator.py` for production code, except for adding test fixtures or scripts under a clearly-isolated directory if a verification harness is needed.
|
||||
4. The change shall not introduce a new dependency or modify `backend/pyproject.toml` / `backend/uv.lock`.
|
||||
|
|
@ -0,0 +1,99 @@
|
|||
# Research & Design Decisions — i18n-ontology-generator-prompts
|
||||
|
||||
## Summary
|
||||
|
||||
- **Feature**: `i18n-ontology-generator-prompts`
|
||||
- **Discovery Scope**: Extension (string-literal localization within an existing service)
|
||||
- **Key Findings**:
|
||||
- The `<think>` / markdown-fence stripping logic relied on by the ticket's R5 lives in `backend/app/utils/llm_client.py` (`LLMClient.chat` line 65 and `chat_json` lines 84–87), **not** in `ontology_generator.py`. R5 is therefore satisfied implicitly so long as the call to `self.llm_client.chat_json(...)` is preserved exactly.
|
||||
- The locale postfix (`get_language_instruction()`) is sourced from `/locales/languages.json` via `backend/app/utils/locale.py`; the English postfix and Chinese postfix are both already defined and resolved per request via `Accept-Language`. No locale-machinery changes are needed.
|
||||
- `_validate_and_process` in the same file enforces the `Person` / `Organization` fallback invariant in code (lines 344–393) regardless of prompt language. This means the prompt translation cannot break the post-validation invariants — the validator is the safety net.
|
||||
- The sole production caller is `backend/app/api/graph.py:223–228`, which consumes only the JSON shape (`entity_types`, `edge_types`, `analysis_summary`). It does not introspect prompt language. Stable.
|
||||
|
||||
## Research Log
|
||||
|
||||
### Locale resolution semantics
|
||||
|
||||
- **Context**: R3 requires `zh` to continue producing Chinese descriptions of equivalent quality after the base prompt is translated to English.
|
||||
- **Sources Consulted**: `backend/app/utils/locale.py`, `/locales/languages.json` (referenced via `_languages` registry).
|
||||
- **Findings**:
|
||||
- `get_locale()` returns the `Accept-Language` header value when in a request context (falling back to `zh`) and a thread-local otherwise.
|
||||
- `get_language_instruction()` returns `_languages[locale].llmInstruction`, defaulting to `请使用中文回答。`.
|
||||
- The system prompt at line 210 already concatenates `lang_instruction` plus an English directive locking identifier formats. Both stay byte-for-byte unchanged.
|
||||
- **Implications**: Locale switching survives the translation; no code changes are needed in `locale.py`. The English base prompt + Chinese postfix is a known-working pattern (R3 acceptance criterion 3 stays valid).
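
The resolution behaviour described in the findings can be sketched as follows. This is a simplified stand-in and not the contents of `locale.py`: the explicit `accept_language` and `in_request` parameters replace the real request-context lookup, the thread-local default is assumed to be `zh`, and the English instruction text is hypothetical.

```python
import threading
from typing import Optional

DEFAULT_LOCALE = "zh"
DEFAULT_INSTRUCTION = "请使用中文回答。"

# Stand-in registry; the English wording is hypothetical.
_languages = {
    "zh": {"llmInstruction": "请使用中文回答。"},
    "en": {"llmInstruction": "Please answer in English."},
}

_thread_state = threading.local()


def set_locale(locale: str) -> None:
    """Store the locale for the current thread (used outside a request)."""
    _thread_state.locale = locale


def get_locale(accept_language: Optional[str] = None, in_request: bool = False) -> str:
    # Inside a request: header value, falling back to zh.
    if in_request:
        return accept_language or DEFAULT_LOCALE
    # Outside a request: thread-local value (default assumed to be zh).
    return getattr(_thread_state, "locale", DEFAULT_LOCALE)


def get_language_instruction(locale: str) -> str:
    """Look up the per-locale instruction, defaulting to the Chinese postfix."""
    return _languages.get(locale, {}).get("llmInstruction", DEFAULT_INSTRUCTION)
```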
|
||||
|
||||
### `<think>` and markdown-fence stripping path
|
||||
|
||||
- **Context**: R5 requires preservation of the `<think>` and markdown-fence stripping per `CLAUDE.md` (commit 985f89f).
|
||||
- **Sources Consulted**: `backend/app/utils/llm_client.py` lines 50–93.
|
||||
- **Findings**:
|
||||
- `LLMClient.chat` strips `<think>...</think>` after every response (line 65).
|
||||
- `LLMClient.chat_json` additionally strips ` ```json `, ` ``` `, and trailing fences (lines 84–87) before `json.loads`.
|
||||
- `ontology_generator.py` only invokes `chat_json` — it does not perform stripping itself.
|
||||
- **Implications**: Translating the prompts in `ontology_generator.py` cannot break the stripping logic. The single call to `self.llm_client.chat_json(messages=messages, temperature=0.3, max_tokens=4096)` at lines 217–221 must be preserved verbatim.
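
For readers unfamiliar with the stripping path, the behaviour described above amounts to something like the sketch below. It is an independent illustration of the technique, not the code in `llm_client.py`; the function names and the exact regular expressions are assumptions.

```python
import json
import re


def strip_reasoning_wrappers(raw: str) -> str:
    """Remove <think> blocks and a surrounding markdown code fence, if present."""
    # Drop any <think>...</think> span, including multi-line ones.
    text = re.sub(r"<think>[\s\S]*?</think>", "", raw).strip()
    # Drop a leading fence, optionally tagged as json.
    text = re.sub(r"^`{3}(?:json)?\s*", "", text, flags=re.IGNORECASE)
    # Drop a trailing fence.
    text = re.sub(r"\s*`{3}\s*$", "", text)
    return text.strip()


def parse_json_response(raw: str) -> dict:
    """Strip wrappers, then parse the remaining text as JSON."""
    return json.loads(strip_reasoning_wrappers(raw))
```

Nothing in this path inspects the prompt, which is the reason prompt language cannot affect it.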
|
||||
|
||||
### Caller surface and contract
|
||||
|
||||
- **Context**: R4 requires zero diff to call sites of these prompts.
|
||||
- **Sources Consulted**: `backend/app/api/graph.py:223–228`, `backend/app/services/__init__.py`.
|
||||
- **Findings**:
|
||||
- Only one production caller. It uses the default constructor and the public `generate(document_texts, simulation_requirement, additional_context)` signature.
|
||||
- It reads `entity_types`, `edge_types`, `analysis_summary` from the result.
|
||||
- No tests or scripts under `backend/scripts/` reference the module.
|
||||
- **Implications**: The translation is invisible to callers as long as we hold the public surface constant and continue to produce the same JSON shape (which the validator guarantees).
|
||||
|
||||
### Validator safety net
|
||||
|
||||
- **Context**: R1 / R5 acceptance: the post-validation invariant (10 entity types, ending in Person/Organization) must hold under both locales after translation.
|
||||
- **Sources Consulted**: `_validate_and_process` lines 277–398.
|
||||
- **Findings**:
|
||||
- `_to_pascal_case` normalizes entity names regardless of language.
|
||||
- `_validate_and_process` enforces `MAX_ENTITY_TYPES = 10`, `MAX_EDGE_TYPES = 10`, deduplicates by name, force-injects `Person` and `Organization` fallbacks if missing, and truncates `description` to 100 chars.
|
||||
- **Implications**: Even if the LLM under an English prompt deviates from the count or fallback rules, the validator self-heals. Translation cannot break the JSON contract.
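
The self-healing behaviour summarised in the findings might be approximated as below. This is a sketch of the idea only, not `_validate_and_process` itself: the function name, the placeholder fallback descriptions, and the choice to move the fallbacks to the end of the list are assumptions.

```python
MAX_ENTITY_TYPES = 10
MAX_DESCRIPTION_LENGTH = 100
FALLBACK_NAMES = ("Person", "Organization")


def enforce_entity_invariants(entity_types: list[dict]) -> list[dict]:
    """Deduplicate, cap at 10, guarantee the two fallbacks, trim descriptions."""
    # Deduplicate by name, keeping the first occurrence (copied, not mutated).
    by_name: dict[str, dict] = {}
    for entity in entity_types:
        by_name.setdefault(entity["name"], dict(entity))

    # Keep at most 8 concrete types so the 2 fallbacks always fit.
    concrete = [e for n, e in by_name.items() if n not in FALLBACK_NAMES]
    concrete = concrete[: MAX_ENTITY_TYPES - len(FALLBACK_NAMES)]

    # Reuse a fallback the model produced, or inject a placeholder one.
    fallbacks = [
        by_name.get(
            n,
            {"name": n, "description": f"Generic {n} fallback type.",
             "attributes": [], "examples": []},
        )
        for n in FALLBACK_NAMES
    ]

    result = concrete + fallbacks
    for entity in result:
        entity["description"] = entity.get("description", "")[:MAX_DESCRIPTION_LENGTH]
    return result
```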
|
||||
|
||||
## Architecture Pattern Evaluation
|
||||
|
||||
| Option | Description | Strengths | Risks / Limitations | Notes |
|
||||
|--------|-------------|-----------|---------------------|-------|
|
||||
| In-place translation | Translate the constant and method strings; preserve all code structure. | Minimal diff; matches sibling-issue pattern (#3, #4, #5); no new abstractions. | English base biases output; mitigated by `get_language_instruction()` postfix and the trailing English directive at line 210. | Selected. |
|
||||
| Externalize to locale files | Move prompts to `/locales/{en,zh}.json` keyed under `ontology.system_prompt`. | Eliminates cross-locale bias entirely; symmetric prompts. | Out of scope per ticket (#2 is file-internal); inconsistent with how the codebase currently handles LLM prompts; conflicts with #6 i18n track. | Rejected. |
|
||||
| Hybrid (translate + extract postfix helper) | Translate in place; extract postfix concatenation into a helper. | Slightly cleaner. | Adds refactor outside ticket scope; ticket says "no diff to call sites" and "same function signatures". | Rejected. |
|
||||
|
||||
## Design Decisions
|
||||
|
||||
### Decision: Translate `ONTOLOGY_SYSTEM_PROMPT` and `_build_user_message` strings in place
|
||||
|
||||
- **Context**: Prompts in `ontology_generator.py` are Chinese, biasing model output toward Chinese structure under `Accept-Language: en`. The ticket scopes the work to this single file and excludes refactors.
|
||||
- **Alternatives Considered**:
|
||||
1. Externalize to `/locales/*.json` keyed prompts (rejected; out of scope, inconsistent with codebase).
|
||||
2. Hybrid in-place translation + helper extraction (rejected; refactor outside scope).
|
||||
- **Selected Approach**: Replace the body of `ONTOLOGY_SYSTEM_PROMPT` with an English translation that preserves section structure, JSON template, taxonomy lists, fallback rules, and reserved-name lists. Replace the four Chinese string literals in `_build_user_message` (section headings, additional-context block heading, trailing rules block, truncation notice) with English equivalents while preserving `f"..."` interpolations and the conditional inclusion of additional context.
|
||||
- **Rationale**: Minimal-surface change; aligns with how sibling i18n issues are scoped; preserves the locale-postfix mechanism, the validator safety net, and all caller contracts.
|
||||
- **Trade-offs**: An English base prompt biases the model toward English structure. Mitigations: the per-locale `llmInstruction` postfix instructs the model to respond in the requested language; the trailing English directive at line 210 already locks identifier formats; `_validate_and_process` self-heals invariants.
|
||||
- **Follow-up**: Manual verification under both `en` and `zh` locales: assert valid JSON, assert exactly 10 entity types ending in `Person` and `Organization`, and assert description fields are in the expected language.
|
||||
|
||||
### Decision: Preserve all surrounding code unchanged
|
||||
|
||||
- **Context**: The ticket forbids changes to call sites and the surrounding code shape.
|
||||
- **Alternatives Considered**:
|
||||
1. Refactor language-locking directive into the locale module (rejected; out of scope and crosses file boundary).
|
||||
2. Add a docstring or constant for "prompt version" (rejected; introduces unused state).
|
||||
- **Selected Approach**: Translation-only diff. No new imports, no new constants, no signature changes, no logger or comment changes (those are #6 and #7 respectively).
|
||||
- **Rationale**: Smallest possible review surface, fewest possible regression vectors, easiest possible PR to land.
|
||||
- **Trade-offs**: None of consequence within the ticket's stated scope.
|
||||
- **Follow-up**: Run the ontology generator round-trip locally; verify zero CJK characters in the patched literals via regex `[一-鿿]` (U+4E00 through U+9FFF).
|
||||
|
||||
## Risks & Mitigations
|
||||
|
||||
- **Risk: English prompt produces lower-quality Chinese ontology under `zh` locale** → Mitigation: The `llmInstruction` postfix already steers Chinese output. The trailing English directive at line 210 already locks identifier formats. If quality regresses in practice, future work (issue #6 / a follow-up) can externalize prompts to locale files.
|
||||
- **Risk: Translator inadvertently changes the JSON template structure or a reserved attribute name** → Mitigation: Acceptance criteria R1.2–R1.6 enumerate the structural constants verbatim. Validator code already enforces `MAX_ENTITY_TYPES`, `MAX_EDGE_TYPES`, and fallback injection independently.
|
||||
- **Risk: Logger or comment text gets translated as part of the same edit** → Mitigation: The change scope is explicit (R7); reviewer compares the diff against R7 acceptance criteria.
|
||||
|
||||
## References
|
||||
|
||||
- [.ticket/2.md](../../../.ticket/2.md) — ticket snapshot for issue #2.
|
||||
- [CLAUDE.md](../../../CLAUDE.md) — project conventions, including reasoning-model output stripping.
|
||||
- [/locales/languages.json](../../../locales/languages.json) — `llmInstruction` definitions per locale.
|
||||
- `backend/app/utils/locale.py` — locale resolver implementation.
|
||||
- `backend/app/utils/llm_client.py` — `<think>` / markdown-fence stripping.
|
||||
- `backend/app/api/graph.py:223–228` — sole production caller.
|
||||
|
|
@ -0,0 +1,27 @@
|
|||
{
|
||||
"feature_name": "i18n-ontology-generator-prompts",
|
||||
"created_at": "2026-05-07T09:24:17Z",
|
||||
"updated_at": "2026-05-07T10:00:00Z",
|
||||
"language": "en",
|
||||
"phase": "tasks-generated",
|
||||
"approvals": {
|
||||
"requirements": {
|
||||
"generated": true,
|
||||
"approved": true
|
||||
},
|
||||
"design": {
|
||||
"generated": true,
|
||||
"approved": true
|
||||
},
|
||||
"tasks": {
|
||||
"generated": true,
|
||||
"approved": true
|
||||
}
|
||||
},
|
||||
"ready_for_implementation": true,
|
||||
"ticket": {
|
||||
"number": 2,
|
||||
"url": "https://github.com/salestech-group/MiroFish/issues/2",
|
||||
"snapshot": ".ticket/2.md"
|
||||
}
|
||||
}
|
||||
|
|
@ -0,0 +1,58 @@
|
|||
# Implementation Plan
|
||||
|
||||
- [x] 1. Translate the ontology system-prompt constant to English
|
||||
- Replace the body of `ONTOLOGY_SYSTEM_PROMPT` with an English rendering that preserves the section structure (core task background, output format JSON template, design guidelines, entity-type reference list, relationship-type reference list, attribute reserved-name rules)
|
||||
- Preserve the JSON template keys verbatim: `entity_types`, `edge_types`, `analysis_summary`, and the entity sub-keys `name`, `description`, `attributes`, `examples`, the edge sub-keys `name`, `description`, `source_targets`, `attributes`, plus the `source_targets` sub-keys `source` and `target`
|
||||
- Preserve the entity-type reference list verbatim by name (`Student`, `Professor`, `Journalist`, `Celebrity`, `Executive`, `Official`, `Lawyer`, `Doctor`, `Person`, `University`, `Company`, `GovernmentAgency`, `MediaOutlet`, `Hospital`, `School`, `NGO`, `Organization`)
|
||||
- Preserve the relationship-type reference list verbatim by name (`WORKS_FOR`, `STUDIES_AT`, `AFFILIATED_WITH`, `REPRESENTS`, `REGULATES`, `REPORTS_ON`, `COMMENTS_ON`, `RESPONDS_TO`, `SUPPORTS`, `OPPOSES`, `COLLABORATES_WITH`, `COMPETES_WITH`)
|
||||
- Preserve the reserved-attribute-name list verbatim (`name`, `uuid`, `group_id`, `created_at`, `summary`)
|
||||
- Express the same numeric constraints in English: exactly 10 entity types, last 2 are `Person` and `Organization` fallbacks, 6–10 relationship types, 1–3 attributes per entity, descriptions ≤ 100 characters
|
||||
- Observable completion: `ONTOLOGY_SYSTEM_PROMPT` is an English-only string with zero CJK characters and identical structural keys/values to the original
|
||||
- _Requirements: 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8_
|
||||
|
||||
- [x] 2. Translate the user-message template strings to English
|
||||
- Replace the section headings `## 模拟需求`, `## 文档内容`, and `## 额外说明` with English equivalents (`## Simulation Requirement`, `## Document Content`, `## Additional Context`)
|
||||
- Replace the trailing rules block (the Chinese `请根据以上内容...` / `必须遵守的规则` enumeration) with an English block that conveys the same five rules: 10 entity types total; last 2 are `Person` and `Organization` fallbacks; first 8 are concrete types from the document; entities must be real-world social-media-capable subjects (not abstract concepts); reserved attribute names cannot be used
|
||||
- Replace the truncation notice (the Chinese `(原文共...字,已截取前...字用于本体分析)`) with an English equivalent that retains both numeric interpolations
|
||||
- Preserve every f-string interpolation by name and position: `{simulation_requirement}`, `{combined_text}`, `{additional_context}`, `{original_length}`, `{self.MAX_TEXT_LENGTH_FOR_LLM}`
|
||||
- Preserve the conditional inclusion of the `## Additional Context` block — it appears only when `additional_context` is truthy
|
||||
- Observable completion: `_build_user_message` produces an English-only message body for any input combination, with zero CJK characters in any string literal it contributes; under the same inputs as before, all interpolated values still appear in the rendered output
|
||||
- _Requirements: 2.1, 2.2, 2.3, 2.4, 2.5, 2.6_
|
||||
|
||||
- [x] 3. Confirm boundary commitments around the translation
|
||||
- Confirm the call to `get_language_instruction()` and the assembled `system_prompt` line remain at their existing position with their existing arguments (no rename, no relocation)
|
||||
- Confirm the trailing English identifier-format directive (`IMPORTANT: Entity type names MUST be in English PascalCase ...`, `Relationship type names MUST be in English UPPER_SNAKE_CASE ...`, `Attribute names MUST be in English snake_case ...`, `Only description fields and analysis_summary should use the specified language above.`) remains byte-for-byte identical
|
||||
- Confirm the public signatures of `OntologyGenerator.__init__`, `generate`, `generate_python_code`, the private `_to_pascal_case`, and `_validate_and_process` are unchanged
|
||||
- Confirm the constant `MAX_TEXT_LENGTH_FOR_LLM = 50000` is unchanged
|
||||
- Confirm the LLM invocation parameters `temperature=0.3, max_tokens=4096` and the `self.llm_client.chat_json(...)` call site are unchanged
|
||||
- Confirm `backend/app/utils/locale.py`, `/locales/languages.json`, `/locales/en.json`, and `/locales/zh.json` are not modified
|
||||
- Confirm `logger.warning(...)`, `logger.info(...)`, module/class/method docstrings, and inline comments in `ontology_generator.py` are not modified (these are owned by issues #6 and #7)
|
||||
- Confirm `backend/pyproject.toml`, `backend/uv.lock`, and any file outside `backend/app/services/ontology_generator.py` are not modified
|
||||
- Observable completion: a `git diff` review against `main` shows changes only inside `backend/app/services/ontology_generator.py`, only inside `ONTOLOGY_SYSTEM_PROMPT` and `_build_user_message`, and the surrounding lines are byte-identical
|
||||
- _Requirements: 3.1, 3.2, 3.5, 4.1, 4.2, 4.3, 4.4, 4.5, 4.6, 4.7, 5.1, 5.3, 7.1, 7.2, 7.3, 7.4_
|
||||
|
||||
- [x] 4. Verify reasoning-model output compatibility and JSON shape stability
  - Inspect `LLMClient.chat_json` to confirm `<think>` tag stripping (in `chat`) and markdown-fence stripping (in `chat_json`) are still the only post-processors applied to the LLM response, and that no new pre-processing has been added in `ontology_generator.py`
  - Run an in-process round-trip: instantiate `OntologyGenerator`, call `generate(...)` with a small representative `document_texts` list and `simulation_requirement`, and assert the returned dict has keys `entity_types` (length 10), `edge_types`, `analysis_summary`; assert the last two entity-type names are `Person` and `Organization`
  - Repeat the round-trip under simulated reasoning-model output to confirm the existing stripping path still parses cleanly (e.g. by patching `chat` to wrap a known-good JSON in `<think>...</think>` and triple-fenced code, then asserting `chat_json` still parses)
  - Observable completion: a short verification script under `backend/scripts/` (or an inline `python -c` recorded in the PR description) demonstrates the round-trip succeeds with both clean and `<think>`/fenced LLM outputs
  - _Requirements: 5.1, 5.2, 5.3, 5.4_

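The round-trip in task 4 can be rehearsed without a live model. The sketch below only imitates the two stripping steps described above (removing a `<think>` block, then removing a markdown fence) and checks the expected JSON shape. It does not call or reproduce `LLMClient`; `parse_reasoning_output` and `check_ontology_shape` are hypothetical names introduced for illustration.

```python
import json
import re

# Assumed stand-ins for the stripping described in the task, not the real client code.
THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)
FENCE_RE = re.compile(r"^`{3}(?:json)?\s*|\s*`{3}$")


def parse_reasoning_output(raw: str) -> dict:
    """Drop a <think> block and a surrounding markdown fence, then parse JSON."""
    text = THINK_RE.sub("", raw).strip()
    text = FENCE_RE.sub("", text).strip()
    return json.loads(text)


def check_ontology_shape(result: dict) -> None:
    """Assert the shape the task expects from `generate(...)`."""
    assert {"entity_types", "edge_types", "analysis_summary"} <= result.keys()
    names = [entity["name"] for entity in result["entity_types"]]
    assert len(names) == 10, f"expected 10 entity types, got {len(names)}"
    assert names[-2:] == ["Person", "Organization"], names[-2:]
```

Feeding the same known-good payload through `parse_reasoning_output` both bare and wrapped in `<think>` tags plus a fence should yield equal dicts.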
- [ ] 5. Verify locale-driven output language under both `en` and `zh`
  - Set the thread-local locale to `en` via `set_locale("en")`, run `OntologyGenerator().generate(...)` against the configured LLM, and confirm the returned `description` fields and `analysis_summary` contain no CJK characters and read as natural English
  - Set the thread-local locale to `zh` via `set_locale("zh")`, run the same round-trip, and confirm the returned `description` fields and `analysis_summary` contain CJK characters of equivalent quality to the pre-change baseline
  - Observable completion: both runs succeed; the `en` run is CJK-free in description fields, the `zh` run continues to produce Chinese descriptions; results are recorded in the PR description
  - _Requirements: 3.3, 3.4_

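The CJK part of the locale check in task 5 can be sketched as a small scan over the returned ontology dict. This is an illustration under assumptions: `iter_localized_fields` and `fields_with_cjk` are hypothetical helpers, attribute descriptions are assumed to count as description fields, and the pattern covers only the CJK Unified Ideographs block (U+4E00 to U+9FFF). Whether the English reads naturally still needs a human reviewer.

```python
import re
from typing import Iterator

# CJK Unified Ideographs block only; other scripts are not detected.
CJK_RE = re.compile(r"[\u4e00-\u9fff]")


def iter_localized_fields(ontology: dict) -> Iterator[tuple[str, str]]:
    """Yield (label, text) for analysis_summary and every description field."""
    yield "analysis_summary", ontology.get("analysis_summary", "")
    for kind in ("entity_types", "edge_types"):
        for item in ontology.get(kind, []):
            prefix = f"{kind}.{item.get('name', '?')}"
            yield f"{prefix}.description", item.get("description", "")
            for attr in item.get("attributes", []):
                label = f"{prefix}.{attr.get('name', '?')}.description"
                yield label, attr.get("description", "")


def fields_with_cjk(ontology: dict) -> list[str]:
    """Return labels of localized fields that contain at least one CJK character."""
    return [label for label, text in iter_localized_fields(ontology) if CJK_RE.search(text or "")]
```

Under `en` the returned list is expected to be empty; under `zh` it is expected to be non-empty.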
- [ ] 6. Verify Step 1 graph-build parity end-to-end under `en` locale
  - Using a representative seed file, exercise the full Step 1 graph-build pipeline (upload → ontology → Graphiti → Neo4j) under `Accept-Language: en`
  - Confirm the run completes without raising an exception attributable to ontology output
  - Compare the resulting Neo4j node and edge counts against a recent `zh`-locale baseline; confirm they are within an operator-acceptable tolerance (no doubling, no zeroing)
  - Observable completion: the pipeline reaches `GRAPH_COMPLETED`, and the comparison numbers are recorded in the PR description
  - _Requirements: 6.1, 6.2, 6.3_

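The count comparison in task 6 leaves the tolerance to the operator. As a minimal sketch, assuming a symmetric ratio bound is acceptable, the check could look like this. `within_tolerance` and the default `max_ratio=1.5` are assumptions chosen only so that doubling and zeroing both fail.

```python
def within_tolerance(baseline: int, observed: int, max_ratio: float = 1.5) -> bool:
    """True when observed/baseline lies in [1/max_ratio, max_ratio].

    The threshold is an assumed example; the task leaves it to the operator.
    """
    if baseline == 0:
        # With an empty baseline, only an empty result counts as parity.
        return observed == 0
    if observed == 0:
        return False  # "no zeroing"
    ratio = observed / baseline
    return 1 / max_ratio <= ratio <= max_ratio
```

Applied separately to node counts and edge counts, a `False` result on either is a signal to inspect the ontology output before recording the numbers.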
- [x]* 7. Add a static guard against CJK regression in this file's prompt strings
  - Add a small one-shot script under `backend/scripts/` that loads `ONTOLOGY_SYSTEM_PROMPT` and the rendered output of `_build_user_message(...)` for representative inputs, and asserts zero matches against the regex `[一-鿿]` over those strings
  - Optional: extend the existing `pytest`-style harness if a thin assertion fits the project's minimal test surface
  - Observable completion: running the script exits 0 against the patched module; running it against a hypothetical revert of the patch exits non-zero
  - _Requirements: 1.1, 2.6_

@@ -27,149 +27,149 @@ def _to_pascal_case(name: str) -> str:
|
|||
|
||||
|
||||
# 本体生成的系统提示词
|
||||
ONTOLOGY_SYSTEM_PROMPT = """你是一个专业的知识图谱本体设计专家。你的任务是分析给定的文本内容和模拟需求,设计适合**社交媒体舆论模拟**的实体类型和关系类型。
|
||||
ONTOLOGY_SYSTEM_PROMPT = """You are a professional knowledge-graph ontology designer. Your task is to analyze the supplied text and simulation requirement and design entity types and relationship types suitable for a **social-media public-opinion simulation**.
|
||||
|
||||
**重要:你必须输出有效的JSON格式数据,不要输出任何其他内容。**
|
||||
**Important: you must output valid JSON data and nothing else.**
|
||||
|
||||
## 核心任务背景
|
||||
## Core Task Background
|
||||
|
||||
我们正在构建一个**社交媒体舆论模拟系统**。在这个系统中:
|
||||
- 每个实体都是一个可以在社交媒体上发声、互动、传播信息的"账号"或"主体"
|
||||
- 实体之间会相互影响、转发、评论、回应
|
||||
- 我们需要模拟舆论事件中各方的反应和信息传播路径
|
||||
We are building a **social-media public-opinion simulation system**. In this system:
|
||||
- Every entity is an "account" or "actor" that can post on social media, interact with other accounts, and propagate information.
|
||||
- Entities influence each other, repost, comment on, and respond to one another.
|
||||
- We need to simulate how each side of a public-opinion event reacts and how information flows.
|
||||
|
||||
因此,**实体必须是现实中真实存在的、可以在社媒上发声和互动的主体**:
|
||||
Therefore, **entities must be real-world subjects that can plausibly post on social media and interact with others**:
|
||||
|
||||
**可以是**:
|
||||
- 具体的个人(公众人物、当事人、意见领袖、专家学者、普通人)
|
||||
- 公司、企业(包括其官方账号)
|
||||
- 组织机构(大学、协会、NGO、工会等)
|
||||
- 政府部门、监管机构
|
||||
- 媒体机构(报纸、电视台、自媒体、网站)
|
||||
- 社交媒体平台本身
|
||||
- 特定群体代表(如校友会、粉丝团、维权群体等)
|
||||
**Acceptable**:
|
||||
- Specific individuals (public figures, parties to the event, opinion leaders, experts and scholars, ordinary people)
|
||||
- Companies and businesses (including their official accounts)
|
||||
- Organizations (universities, associations, NGOs, unions, etc.)
|
||||
- Government departments and regulators
|
||||
- Media organizations (newspapers, broadcasters, independent media, websites)
|
||||
- Social-media platforms themselves
|
||||
- Representatives of specific groups (alumni associations, fan communities, advocacy groups, etc.)
|
||||
|
||||
**不可以是**:
|
||||
- 抽象概念(如"舆论"、"情绪"、"趋势")
|
||||
- 主题/话题(如"学术诚信"、"教育改革")
|
||||
- 观点/态度(如"支持方"、"反对方")
|
||||
**Not acceptable**:
|
||||
- Abstract concepts (such as "public opinion", "sentiment", "trend")
|
||||
- Topics or subjects (such as "academic integrity", "education reform")
|
||||
- Viewpoints or stances (such as "supporters", "opponents")
|
||||
|
||||
## 输出格式
|
||||
## Output Format
|
||||
|
||||
请输出JSON格式,包含以下结构:
|
||||
Return JSON with the following structure:
|
||||
|
||||
```json
|
||||
{
|
||||
"entity_types": [
|
||||
{
|
||||
"name": "实体类型名称(英文,PascalCase)",
|
||||
"description": "简短描述(英文,不超过100字符)",
|
||||
"name": "entity type name (English, PascalCase)",
|
||||
"description": "short description (English, no more than 100 characters)",
|
||||
"attributes": [
|
||||
{
|
||||
"name": "属性名(英文,snake_case)",
|
||||
"name": "attribute name (English, snake_case)",
|
||||
"type": "text",
|
||||
"description": "属性描述"
|
||||
"description": "attribute description"
|
||||
}
|
||||
],
|
||||
"examples": ["示例实体1", "示例实体2"]
|
||||
"examples": ["example entity 1", "example entity 2"]
|
||||
}
|
||||
],
|
||||
"edge_types": [
|
||||
{
|
||||
"name": "关系类型名称(英文,UPPER_SNAKE_CASE)",
|
||||
"description": "简短描述(英文,不超过100字符)",
|
||||
"name": "relationship type name (English, UPPER_SNAKE_CASE)",
|
||||
"description": "short description (English, no more than 100 characters)",
|
||||
"source_targets": [
|
||||
{"source": "源实体类型", "target": "目标实体类型"}
|
||||
{"source": "source entity type", "target": "target entity type"}
|
||||
],
|
||||
"attributes": []
|
||||
}
|
||||
],
|
||||
"analysis_summary": "对文本内容的简要分析说明"
|
||||
"analysis_summary": "brief analytical summary of the text content"
|
||||
}
|
||||
```
|
||||
|
||||
## 设计指南(极其重要!)
|
||||
## Design Guidelines (must be followed)
|
||||
|
||||
### 1. 实体类型设计 - 必须严格遵守
|
||||
### 1. Entity Type Design - strictly required
|
||||
|
||||
**数量要求:必须正好10个实体类型**
|
||||
**Count requirement: exactly 10 entity types.**
|
||||
|
||||
**层次结构要求(必须同时包含具体类型和兜底类型)**:
|
||||
**Hierarchy requirement (must include both concrete types and fallback types)**:
|
||||
|
||||
你的10个实体类型必须包含以下层次:
|
||||
Your 10 entity types must form the following hierarchy:
|
||||
|
||||
A. **兜底类型(必须包含,放在列表最后2个)**:
|
||||
- `Person`: 任何自然人个体的兜底类型。当一个人不属于其他更具体的人物类型时,归入此类。
|
||||
- `Organization`: 任何组织机构的兜底类型。当一个组织不属于其他更具体的组织类型时,归入此类。
|
||||
A. **Fallback types (mandatory; placed as the last 2 entries)**:
|
||||
- `Person`: the fallback type for any individual. When a person does not fit any more specific person type, classify them here.
|
||||
- `Organization`: the fallback type for any organization. When an organization does not fit any more specific organization type, classify it here.
|
||||
|
||||
B. **具体类型(8个,根据文本内容设计)**:
|
||||
- 针对文本中出现的主要角色,设计更具体的类型
|
||||
- 例如:如果文本涉及学术事件,可以有 `Student`, `Professor`, `University`
|
||||
- 例如:如果文本涉及商业事件,可以有 `Company`, `CEO`, `Employee`
|
||||
B. **Concrete types (8 entries, designed from the text content)**:
|
||||
- Define more specific types for the major roles that appear in the text.
|
||||
- Example: for an academic event, you might use `Student`, `Professor`, `University`.
|
||||
- Example: for a business event, you might use `Company`, `CEO`, `Employee`.
|
||||
|
||||
**为什么需要兜底类型**:
|
||||
- 文本中会出现各种人物,如"中小学教师"、"路人甲"、"某位网友"
|
||||
- 如果没有专门的类型匹配,他们应该被归入 `Person`
|
||||
- 同理,小型组织、临时团体等应该归入 `Organization`
|
||||
**Why fallback types are required**:
|
||||
- The text will mention many kinds of people, e.g. "primary-school teachers", "passersby", "an anonymous netizen".
|
||||
- When no dedicated type fits, they should fall into `Person`.
|
||||
- Likewise, small organizations and ad-hoc groups should fall into `Organization`.
|
||||
|
||||
**具体类型的设计原则**:
|
||||
- 从文本中识别出高频出现或关键的角色类型
|
||||
- 每个具体类型应该有明确的边界,避免重叠
|
||||
- description 必须清晰说明这个类型和兜底类型的区别
|
||||
**Principles for concrete types**:
|
||||
- Identify the high-frequency or pivotal role types in the text.
|
||||
- Each concrete type should have a clear boundary and avoid overlap.
|
||||
- The description must clearly state how the concrete type differs from the corresponding fallback type.
|
||||
|
||||
### 2. 关系类型设计
|
||||
### 2. Relationship Type Design
|
||||
|
||||
- 数量:6-10个
|
||||
- 关系应该反映社媒互动中的真实联系
|
||||
- 确保关系的 source_targets 涵盖你定义的实体类型
|
||||
- Count: 6 to 10.
|
||||
- Relationships should reflect realistic interactions on social media.
|
||||
- Ensure each relationship's source_targets cover the entity types you defined.
|
||||
|
||||
### 3. 属性设计
|
||||
### 3. Attribute Design
|
||||
|
||||
- 每个实体类型1-3个关键属性
|
||||
- **注意**:属性名不能使用 `name`、`uuid`、`group_id`、`created_at`、`summary`(这些是系统保留字)
|
||||
- 推荐使用:`full_name`, `title`, `role`, `position`, `location`, `description` 等
|
||||
- 1 to 3 key attributes per entity type.
|
||||
- **Note**: attribute names must not use `name`, `uuid`, `group_id`, `created_at`, or `summary` (these are reserved by the system).
|
||||
- Recommended names: `full_name`, `title`, `role`, `position`, `location`, `description`, etc.
|
||||
|
||||
## 实体类型参考
|
||||
## Entity Type Reference
|
||||
|
||||
**个人类(具体)**:
|
||||
- Student: 学生
|
||||
- Professor: 教授/学者
|
||||
- Journalist: 记者
|
||||
- Celebrity: 明星/网红
|
||||
- Executive: 高管
|
||||
- Official: 政府官员
|
||||
- Lawyer: 律师
|
||||
- Doctor: 医生
|
||||
**Individuals (concrete)**:
|
||||
- Student: a student.
|
||||
- Professor: a professor or scholar.
|
||||
- Journalist: a journalist.
|
||||
- Celebrity: a celebrity or internet personality.
|
||||
- Executive: a senior business leader.
|
||||
- Official: a government official.
|
||||
- Lawyer: a lawyer.
|
||||
- Doctor: a physician.
|
||||
|
||||
**个人类(兜底)**:
|
||||
- Person: 任何自然人(不属于上述具体类型时使用)
|
||||
**Individuals (fallback)**:
|
||||
- Person: any individual person (use when no concrete person type above applies).
|
||||
|
||||
**组织类(具体)**:
|
||||
- University: 高校
|
||||
- Company: 公司企业
|
||||
- GovernmentAgency: 政府机构
|
||||
- MediaOutlet: 媒体机构
|
||||
- Hospital: 医院
|
||||
- School: 中小学
|
||||
- NGO: 非政府组织
|
||||
**Organizations (concrete)**:
|
||||
- University: a university or higher-education institution.
|
||||
- Company: a company or business.
|
||||
- GovernmentAgency: a government agency.
|
||||
- MediaOutlet: a media organization.
|
||||
- Hospital: a hospital.
|
||||
- School: a primary or secondary school.
|
||||
- NGO: a non-governmental organization.
|
||||
|
||||
**组织类(兜底)**:
|
||||
- Organization: 任何组织机构(不属于上述具体类型时使用)
|
||||
**Organizations (fallback)**:
|
||||
- Organization: any organization (use when no concrete organization type above applies).
|
||||
|
||||
## 关系类型参考
|
||||
## Relationship Type Reference
|
||||
|
||||
- WORKS_FOR: 工作于
|
||||
- STUDIES_AT: 就读于
|
||||
- AFFILIATED_WITH: 隶属于
|
||||
- REPRESENTS: 代表
|
||||
- REGULATES: 监管
|
||||
- REPORTS_ON: 报道
|
||||
- COMMENTS_ON: 评论
|
||||
- RESPONDS_TO: 回应
|
||||
- SUPPORTS: 支持
|
||||
- OPPOSES: 反对
|
||||
- COLLABORATES_WITH: 合作
|
||||
- COMPETES_WITH: 竞争
|
||||
- WORKS_FOR: works for.
|
||||
- STUDIES_AT: studies at.
|
||||
- AFFILIATED_WITH: is affiliated with.
|
||||
- REPRESENTS: represents.
|
||||
- REGULATES: regulates.
|
||||
- REPORTS_ON: reports on.
|
||||
- COMMENTS_ON: comments on.
|
||||
- RESPONDS_TO: responds to.
|
||||
- SUPPORTS: supports.
|
||||
- OPPOSES: opposes.
|
||||
- COLLABORATES_WITH: collaborates with.
|
||||
- COMPETES_WITH: competes with.
|
||||
"""
|
||||
|
||||
|
||||
|
|
@@ -243,35 +243,35 @@ class OntologyGenerator:
|
|||
# 如果文本超过5万字,截断(仅影响传给LLM的内容,不影响图谱构建)
|
||||
if len(combined_text) > self.MAX_TEXT_LENGTH_FOR_LLM:
|
||||
combined_text = combined_text[:self.MAX_TEXT_LENGTH_FOR_LLM]
|
||||
combined_text += f"\n\n...(原文共{original_length}字,已截取前{self.MAX_TEXT_LENGTH_FOR_LLM}字用于本体分析)..."
|
||||
|
||||
message = f"""## 模拟需求
|
||||
combined_text += f"\n\n...(original text is {original_length} characters; only the first {self.MAX_TEXT_LENGTH_FOR_LLM} characters were used for ontology analysis)..."
|
||||
|
||||
message = f"""## Simulation Requirement
|
||||
|
||||
{simulation_requirement}
|
||||
|
||||
## 文档内容
|
||||
## Document Content
|
||||
|
||||
{combined_text}
|
||||
"""
|
||||
|
||||
|
||||
if additional_context:
|
||||
message += f"""
|
||||
## 额外说明
|
||||
## Additional Context
|
||||
|
||||
{additional_context}
|
||||
"""
|
||||
|
||||
message += """
|
||||
请根据以上内容,设计适合社会舆论模拟的实体类型和关系类型。
|
||||
|
||||
**必须遵守的规则**:
|
||||
1. 必须正好输出10个实体类型
|
||||
2. 最后2个必须是兜底类型:Person(个人兜底)和 Organization(组织兜底)
|
||||
3. 前8个是根据文本内容设计的具体类型
|
||||
4. 所有实体类型必须是现实中可以发声的主体,不能是抽象概念
|
||||
5. 属性名不能使用 name、uuid、group_id 等保留字,用 full_name、org_name 等替代
|
||||
message += """
|
||||
Based on the content above, design entity types and relationship types suitable for a social-media public-opinion simulation.
|
||||
|
||||
**Rules that must be followed**:
|
||||
1. You must output exactly 10 entity types.
|
||||
2. The last 2 must be fallback types: Person (individual fallback) and Organization (organization fallback).
|
||||
3. The first 8 are concrete types designed from the text content.
|
||||
4. Every entity type must be a real-world subject that can post on social media; abstract concepts are not allowed.
|
||||
5. Attribute names must not use reserved words such as name, uuid, group_id; use alternatives such as full_name, org_name, etc.
|
||||
"""
|
||||
|
||||
|
||||
return message
|
||||
|
||||
def _validate_and_process(self, result: Dict[str, Any]) -> Dict[str, Any]:
|
||||
|
|
|
|||
|
|
@@ -0,0 +1,121 @@
"""Static guard: assert ontology prompt strings contain no CJK characters.

This script enforces the i18n contract for `ontology_generator.py` (issue #2):
the module-level system prompt constant and every string literal contributed
by `_build_user_message` (excluding the method's docstring) must contain
zero CJK characters.

Logger calls, docstrings, and inline comments in the same module are
explicitly out of scope (issues #6 and #7) and are not inspected here.

The check is purely AST-based to avoid coupling to the heavy Flask /
LLM client import chain. Exit 0 on success, non-zero on regression.
"""

import ast
import os
import re
import sys


CJK_PATTERN = re.compile(r"[一-鿿]")


def _string_literals_in_function(node: ast.FunctionDef) -> list[str]:
    """Return all string-literal payloads inside a function body, except the
    function's own docstring.

    Both plain strings (`ast.Constant` of type `str`) and f-strings
    (`ast.JoinedStr`) are included. For f-strings, only the static text
    portions (`ast.Constant` children) are returned — interpolation
    placeholders cannot contain CJK literals, so they are irrelevant.
    """
    docstring = ast.get_docstring(node, clean=False)
    pieces: list[str] = []

    for child in ast.walk(node):
        if isinstance(child, ast.Constant) and isinstance(child.value, str):
            pieces.append(child.value)
        elif isinstance(child, ast.JoinedStr):
            for part in child.values:
                if isinstance(part, ast.Constant) and isinstance(part.value, str):
                    pieces.append(part.value)

    if docstring is not None:
        try:
            pieces.remove(docstring)
        except ValueError:
            pass

    return pieces


def _module_constant_value(tree: ast.Module, name: str) -> str:
    for node in tree.body:
        if isinstance(node, ast.Assign):
            for target in node.targets:
                if isinstance(target, ast.Name) and target.id == name:
                    if isinstance(node.value, ast.Constant) and isinstance(
                        node.value.value, str
                    ):
                        return node.value.value
    raise SystemExit(f"Could not locate string constant '{name}' in source.")


def _find_method(tree: ast.Module, class_name: str, method_name: str) -> ast.FunctionDef:
    for node in tree.body:
        if isinstance(node, ast.ClassDef) and node.name == class_name:
            for item in node.body:
                if isinstance(item, ast.FunctionDef) and item.name == method_name:
                    return item
    raise SystemExit(f"Could not locate method '{class_name}.{method_name}'.")


def _assert_no_cjk(label: str, text: str) -> int:
    matches = CJK_PATTERN.findall(text)
    if matches:
        sample = "".join(matches[:30])
        print(
            f"FAIL: {label} contains {len(matches)} CJK character(s). "
            f"First few: {sample!r}",
            file=sys.stderr,
        )
        return 1
    print(f"OK: {label} is CJK-free ({len(text)} chars inspected).")
    return 0


def main() -> int:
    target = os.path.join(
        os.path.dirname(os.path.dirname(os.path.abspath(__file__))),
        "app",
        "services",
        "ontology_generator.py",
    )
    with open(target, "r", encoding="utf-8") as f:
        source = f.read()

    tree = ast.parse(source)

    failures = 0

    system_prompt_value = _module_constant_value(tree, "ONTOLOGY_SYSTEM_PROMPT")
    failures += _assert_no_cjk("ONTOLOGY_SYSTEM_PROMPT", system_prompt_value)

    method = _find_method(tree, "OntologyGenerator", "_build_user_message")
    literals = _string_literals_in_function(method)
    aggregated = "\n".join(literals)
    failures += _assert_no_cjk(
        "_build_user_message string literals (excl. docstring)", aggregated
    )

    if failures:
        print(f"\n{failures} CJK-regression check(s) failed.", file=sys.stderr)
        return 1

    print("\nAll CJK-regression checks passed.")
    return 0


if __name__ == "__main__":
    sys.exit(main())
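To see the AST-based technique on a self-contained input, the following sketch runs a simplified variant of the guard against an inline sample instead of the real module. It is an illustration, not the committed script: `SAMPLE`, `Demo`, `build`, and `cjk_literals` are hypothetical, and the variant relies on `ast.walk` visiting the `ast.Constant` children of an f-string directly rather than handling `ast.JoinedStr` as a separate case.

```python
import ast
import re

CJK = re.compile(r"[\u4e00-\u9fff]")

# Hypothetical sample source: a CJK docstring (ignored) and one CJK string literal.
SAMPLE = '''
class Demo:
    def build(self, topic):
        """文档字符串 is ignored."""
        return f"## Topic\\n{topic}\\n" + "请 design types"
'''


def cjk_literals(source: str, class_name: str, method_name: str) -> list[str]:
    """Return string literals in the named method that contain CJK, excluding its docstring."""
    tree = ast.parse(source)
    for node in tree.body:
        if isinstance(node, ast.ClassDef) and node.name == class_name:
            for item in node.body:
                if isinstance(item, ast.FunctionDef) and item.name == method_name:
                    doc = ast.get_docstring(item, clean=False)
                    found = []
                    # ast.walk also reaches the static parts of f-strings.
                    for child in ast.walk(item):
                        if isinstance(child, ast.Constant) and isinstance(child.value, str):
                            if child.value != doc and CJK.search(child.value):
                                found.append(child.value)
                    return found
    raise LookupError(f"{class_name}.{method_name} not found")


print(cjk_literals(SAMPLE, "Demo", "build"))
```

Because the source is only parsed and never imported, the check stays independent of the module's runtime dependencies, which is the same motivation the guard script's docstring gives.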