Merge pull request #32 from salestech-group/feat/25-i18n-oasis-profile-generator-prompts

fix(i18n): translate oasis profile generator prompts to english
Dominik Seemann 2026-05-11 11:21:11 +02:00 committed by GitHub
commit 54d7fb7828
No known key found for this signature in database
GPG Key ID: B5690EEEBB952194
5 changed files with 464 additions and 708 deletions

@@ -2,616 +2,316 @@
## Overview ## Overview
**Purpose**: Translate the Chinese prompt strings in **Purpose**: Translate the Chinese prompt strings, context-builder section labels, fallback persona templates, and console-output formatting in `backend/app/services/oasis_profile_generator.py` to English while preserving every functional contract — LLM JSON output schema, the `_normalize_gender` mapping that must continue to accept Chinese gender values, the `_generate_profile_rule_based` default `country: "中国"` data value, all f-string interpolations, and the `get_language_instruction()` locale-postfix mechanism. The goal is to remove the Chinese-language base-prompt and context-label bias that currently leaks Chinese structure and word choice into OASIS profile output even when `Accept-Language: en`.
`backend/app/services/oasis_profile_generator.py` (the system prompt
inside `_get_system_prompt`, the individual-persona f-string template
inside `_build_individual_persona_prompt`, the group-persona f-string
template inside `_build_group_persona_prompt`, and the four
`attrs_str`/`context_str` fallback literals) to English while
preserving every functional contract — JSON output keys, the `gender`
English enum, the `age` integer rule, the `persona` no-newline rule,
all `{variable}` interpolations, and every `get_language_instruction()`
call site. The goal is to remove the Chinese-language base-prompt bias
that currently leaks Chinese structure and word choice into persona
output even when `Accept-Language: en`.
**Users**: MiroFish operators running the Step 2 environment-setup **Users**: MiroFish operators running the Step 2 OASIS profile generation under any locale; downstream OASIS / CAMEL-OASIS consumers of the agent JSON / CSV produced by `OasisProfileGenerator`.
pipeline under any locale; downstream Step 3 (CAMEL-OASIS subprocess)
which consumes the produced persona dictionaries.
**Impact**: Replaces approximately one one-line system prompt and two **Impact**: Replaces approximately one base-prompt string, two large user-message templates, four context-builder section labels, three fallback persona templates, and ten console-output strings with English equivalents inside one file. No API surface change. No new dependencies. No new files. Callers (`backend/app/api/simulation.py`, etc.) and OASIS consumers are unaffected.
large f-string templates with English equivalents inside one file. No
API change, no new dependencies, no new files. The two production
callers (`backend/app/services/simulation_manager.py:316` and
`backend/app/api/simulation.py:1413`) and the OASIS subprocess are
unaffected.
### Goals ### Goals
- Zero CJK characters in any prompt string literal contributed by - Zero CJK characters in any prompt string literal contributed by `oasis_profile_generator.py` to the system prompt, the user message, or the context block.
`oasis_profile_generator.py` to the system prompt or the two - Zero CJK characters in any console-output literal in `_print_generated_profile` and the surrounding banners.
user-message bodies (including the `attrs_str`/`context_str` - English `bio` / `persona` output under `Accept-Language: en`.
fallback literals). - Continued Chinese `bio` / `persona` output under `Accept-Language: zh`, of equivalent quality to the pre-change behaviour.
- English persona prose (`bio`, `persona`, `profession`, - No diff to public signatures, dataclass schema, LLM-call parameters, or call sites.
`interested_topics`) under `Accept-Language: en`.
- Continued Chinese persona prose under `Accept-Language: zh`, of
equivalent quality to the pre-change behaviour.
- `gender` field stays exactly one of `"male"`/`"female"`/`"other"`
regardless of locale.
- No diff to public signatures, taxonomy lists, LLM-call parameters,
or call sites.
### Non-Goals ### Non-Goals
- Externalizing prompts to `/locales/*.json` (out of scope per ticket). - Externalizing prompts to `/locales/*.json` (out of scope per ticket and consistent with `i18n-ontology-generator-prompts`).
- Translating logger calls in this file (covered by issue #6). - Translating logger calls in this file (covered by issue #6).
- Translating module/class/method docstrings or inline comments - Translating module/class/method docstrings or inline comments in this file (covered by issue #7).
(covered by issue #7). - Refactoring the OASIS profile JSON schema, the OASIS adapter, or the simulation flow.
- Refactoring the `OasisAgentProfile` schema, `MBTI_TYPES` / - Modifying the `_normalize_gender` mapping table (it must keep accepting Chinese gender keys).
`COUNTRIES` lists, or the `INDIVIDUAL_ENTITY_TYPES` / - Modifying the `_generate_profile_rule_based` default `"中国"` country value (data, not prompt).
`GROUP_ENTITY_TYPES` taxonomies. - Modifying the `ValueError("LLM_API_KEY 未配置")` raise (covered by issue #6).
- Modifying the rule-based fallback (`_generate_profile_rule_based`) - Modifying `backend/app/utils/locale.py`, the locale registries, or any non-target file.
including its Chinese country defaults.
- Modifying the resilience helpers `_fix_truncated_json` /
`_try_fix_json` and the Chinese persona fallback fragments inside
them (e.g. `f"{entity_name}是一个{entity_type}。"`).
- Modifying `backend/app/utils/locale.py`, the locale registries, or
any non-target file.
- Modifying `backend/scripts/test_profile_format.py`.
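The requirement that `_normalize_gender` keep accepting Chinese gender values can be illustrated with a minimal sketch. The mapping table and the `"other"` default below are assumptions made for illustration; the real table in `oasis_profile_generator.py` may contain different keys.

```python
# Hypothetical stand-in for the real `_normalize_gender` table; the actual
# keys and default in oasis_profile_generator.py may differ.
GENDER_MAP = {
    "男": "male",
    "女": "female",
    "其他": "other",
    "male": "male",
    "female": "female",
    "other": "other",
}


def normalize_gender(value):
    """Map a raw gender value (Chinese or English) to the OASIS enum."""
    if not value:
        return "other"
    return GENDER_MAP.get(str(value).strip().lower(), "other")
```

Because the table is data rather than prompt text, translating the prompts leaves it untouched: an upstream value such as `"男"` still resolves to `"male"` after the change.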
## Boundary Commitments ## Boundary Commitments
### This Spec Owns ### This Spec Owns
- The English content of `_get_system_prompt`'s `base_prompt` literal. - The English content of the `base_prompt` string in `OasisProfileGenerator._get_system_prompt` (line 664).
- The English content of the f-string template body in - The English content of every string literal in `OasisProfileGenerator._build_individual_persona_prompt` (lines 677-714).
`_build_individual_persona_prompt`. - The English content of every string literal in `OasisProfileGenerator._build_group_persona_prompt` (lines 726-762).
- The English content of the f-string template body in - The English content of the section-label literals embedded in `OasisProfileGenerator._search_zep_for_entity` (lines 384, 390, 392) and `OasisProfileGenerator._build_entity_context` (lines 422, 438, 440, 443, 463, 472, 475).
`_build_group_persona_prompt`. - The English content of the fallback persona templates in `OasisProfileGenerator._generate_profile_with_llm` (line 547) and `OasisProfileGenerator._try_fix_json` (lines 644, 659).
- The English replacements for the four `"无"` / `"无额外上下文"` - The English content of the no-attributes / no-context placeholder literals (`"无"`, `"无额外上下文"`) at lines 677, 678, 726, 727.
fallback literals (in both individual and group builders). - The English content of every string literal in `OasisProfileGenerator._print_generated_profile` (lines 1011, 1017, 1019, 1022, 1025, 1026, 1027, 1028) and the surrounding banners in `OasisProfileGenerator.generate_profiles_from_entities` (lines 945, 1001).
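The fallback persona templates listed above keep an `entity_summary or template` priority order. A minimal sketch of that behaviour follows; the function name and the English wording are hypothetical, not the text chosen in this PR.

```python
def fallback_persona(entity_name, entity_type, entity_summary=None):
    """Prefer the entity summary; otherwise synthesize a one-line persona."""
    # Hypothetical English replacement for the Chinese fallback template.
    template = f"{entity_name} is a {entity_type}."
    return entity_summary or template
```

An empty or missing summary falls through to the template, so the translated literal is only visible when no summary is available.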
### Out of Boundary ### Out of Boundary
- Locale resolution machinery (`backend/app/utils/locale.py`). - Locale resolution machinery (`backend/app/utils/locale.py`).
- Per-locale `llmInstruction` definitions - Per-locale `llmInstruction` definitions (`/locales/languages.json`).
(`/locales/languages.json`). - Reasoning-model output stripping (`backend/app/utils/llm_client.py`).
- Reasoning-model output stripping inside `_fix_truncated_json` / - All `logger.*` calls (already keyed via `t("log.profile_generator.*")`; covered by issue #6).
`_try_fix_json`. - Module / class / method docstrings and inline comments (covered by issue #7), including the inline comments at lines 65, 93, 641, 804-807, 816-819.
- Logger calls and translation keys (`t("log.profile_generator.*")`) - The `_normalize_gender` mapping table (lines 1123-1132) — must continue to accept Chinese gender keys from upstream.
inside `oasis_profile_generator.py` (issue #6, already merged). - The hard-coded `country: "中国"` default in `_generate_profile_rule_based` (lines 807, 819) — this is a data value, not a prompt.
- Module / class / method docstrings and inline comments inside - The `ValueError("LLM_API_KEY 未配置")` raise (line 194) — covered by issue #6.
`oasis_profile_generator.py` (issue #7). - All callers of `OasisProfileGenerator`, including `backend/app/api/simulation.py`.
- Rule-based fallback (`_generate_profile_rule_based`) including its
Chinese country defaults `"中国"`.
- Chinese persona fragments inside the resilience helpers (e.g.
`f"{entity_name}是一个{entity_type}。"`) — those are runtime data
fallbacks, not LLM prompts.
- All callers of `OasisProfileGenerator`
(`simulation_manager.py`, `api/simulation.py`).
- Tests, scripts, and frontend code. - Tests, scripts, and frontend code.
- The `print(...)` banner at line 945 (closely associated with logger
externalization #6).
### Allowed Dependencies ### Allowed Dependencies
- Existing imports in the target file (no additions). Specifically: - Existing `get_language_instruction`, `get_locale`, `set_locale`, `t` imports from `..utils.locale` (already imported; unchanged).
`get_language_instruction`, `get_locale`, `set_locale`, `t` from - Existing `OpenAI` SDK invocation (unchanged).
`..utils.locale` are already imported and remain unchanged. - No new imports.
- Existing LLM transport via `self.client.chat.completions.create`
(unchanged).
### Revalidation Triggers ### Revalidation Triggers
The following changes elsewhere would invalidate this design: The following changes elsewhere would invalidate this design and require revisiting the prompt:
- A change to the JSON contract emitted by the LLM (`bio`, `persona`, - A change to the JSON contract emitted by the LLM (`bio`, `persona`, `age`, `gender`, `mbti`, `country`, `profession`, `interested_topics`).
`age`, `gender`, `mbti`, `country`, `profession`, - A change to `OasisAgentProfile` field semantics.
`interested_topics` keys). - A change to `get_language_instruction()` semantics or the per-locale `llmInstruction` strings.
- A change to the `OasisAgentProfile` dataclass field set or the - A change to OASIS / CAMEL-OASIS profile field expectations (e.g. if `gender` accepts more than `male` / `female` / `other`).
Reddit/Twitter serializers.
- A change to `get_language_instruction()` semantics or the per-locale
`llmInstruction` strings.
- A change to OASIS subprocess profile-format expectations (verified
via `backend/scripts/test_profile_format.py`).
## Architecture ## Architecture
### Existing Architecture Analysis ### Existing Architecture Analysis
`OasisProfileGenerator` lives in `backend/app/services/`, follows the `OasisProfileGenerator` lives in `backend/app/services/`, follows the in-process service pattern with bounded thread-pool fan-out for batched profile generation, and is invoked from `backend/app/api/simulation.py` inside a background `Task`. It depends on:
in-process service pattern, and is invoked from a Flask handler inside
a background task. The relevant flow:
1. The Flask handler resolves the request locale via `Accept-Language`; - `OpenAI` SDK for the LLM call.
`set_locale()` is propagated into worker threads in - `GraphitiAdapter` (legacy `zep_client` field name) for the Zep / Graphiti graph search.
`generate_profiles_for_entities` (locale captured at line ~910 and - `get_language_instruction()` for locale steering.
restored inside `generate_single_profile` at line ~914). - `t()` for already-keyed log strings.
2. For each entity, `generate_profile_from_entity` decides between the
individual or group prompt builder via
`self._is_individual_entity(entity_type)`.
3. The chosen builder produces a user-message string; `_get_system_prompt`
produces a system-message string. Both are sent to the LLM via
`self.client.chat.completions.create(..., response_format={"type": "json_object"})`.
4. The LLM response is JSON-decoded; on failure, `_try_fix_json` and
`_fix_truncated_json` attempt recovery; on terminal failure,
`_generate_profile_rule_based` produces a rule-based persona.
5. The result is wrapped in an `OasisAgentProfile` dataclass and
serialized to Reddit JSON or Twitter CSV via `_save_reddit_json` /
`_save_twitter_csv`.
This design preserves all of the above. The change is purely lexical The relevant flow is:
inside three method bodies and four literal defaults.
1. The Flask handler resolves the request locale via `Accept-Language`; the locale is propagated to thread-pool workers via the `set_locale(current_locale)` capture in `generate_profiles_from_entities` (line 914).
2. For each entity, `_build_entity_context()` is called: it composes a context block by concatenating headed sub-sections (entity attributes, related facts/edges, related node summaries, Graphiti-search facts, Graphiti-search nodes). Some of these labels are currently in Chinese.
3. The context string is interpolated into the user-message template by either `_build_individual_persona_prompt` or `_build_group_persona_prompt`. Both templates are currently in Chinese, with English `gender` token directives interleaved.
4. The system prompt is built by `_get_system_prompt`: a Chinese base prompt followed by the locale-appropriate `get_language_instruction()`.
5. The two messages are sent to `chat.completions.create` with `response_format={"type": "json_object"}`. The result flows through the `json.loads` → `_try_fix_json` → `_fix_truncated_json` fallback chain. Synthesized fallback personas use the Chinese template `f"{entity_name}是一个{entity_type}。"` if the LLM result is unusable.
6. After per-profile completion, `_print_generated_profile` writes a Chinese-headed banner to stdout, and `generate_profiles_from_entities` writes Chinese batch banners.
This design preserves all of the above structurally. The change is purely lexical inside the seven regions of one file.
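Steps 1 and 4 of the flow (locale capture for worker threads, then base prompt plus locale postfix) can be sketched as follows. The locale helpers here are simplified stand-ins for `backend/app/utils/locale.py`; the instruction strings, the `"zh"` default, and the base-prompt wording are assumptions made for illustration only.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for backend/app/utils/locale.py. The real module
# reads the llmInstruction strings from /locales/languages.json.
_state = threading.local()
_LLM_INSTRUCTIONS = {
    "en": "Please respond in English.",
    "zh": "请使用中文回答。",
}


def set_locale(locale: str) -> None:
    _state.locale = locale


def get_locale() -> str:
    return getattr(_state, "locale", "zh")  # assumed default


def get_language_instruction() -> str:
    return _LLM_INSTRUCTIONS.get(get_locale(), _LLM_INSTRUCTIONS["en"])


# Illustrative wording only, not the translated prompt from this PR.
BASE_PROMPT = "You are an expert at generating social media user personas."


def get_system_prompt() -> str:
    # Same shape as the spec: English base prompt, blank line, locale postfix.
    return f"{BASE_PROMPT}\n\n{get_language_instruction()}"


def build_system_prompts(entity_names):
    current_locale = get_locale()  # captured in the request thread

    def worker(name):
        set_locale(current_locale)  # restored inside each worker thread
        return name, get_system_prompt()

    with ThreadPoolExecutor(max_workers=2) as pool:
        return list(pool.map(worker, entity_names))
```

With the base prompt in English, the only language signal left in the system message is the postfix, which is why the `get_language_instruction()` call sites have to stay where they are.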
### Architecture Pattern & Boundary Map ### Architecture Pattern & Boundary Map
```mermaid ```mermaid
graph TB graph TB
Caller["simulation_manager.py / api/simulation.py"] Caller[simulation.py handler]
Generator["OasisProfileGenerator"] Generator[OasisProfileGenerator]
Sys["_get_system_prompt"] Locale[locale.get_language_instruction]
Ind["_build_individual_persona_prompt"] Graph[GraphitiAdapter graph.search]
Grp["_build_group_persona_prompt"] LLM[OpenAI chat.completions]
Locale["locale.get_language_instruction"]
Client["openai.chat.completions.create"]
Parser["_try_fix_json / _fix_truncated_json"]
Fallback["_generate_profile_rule_based"]
Serializer["_save_reddit_json / _save_twitter_csv"]
Caller --> Generator Caller -->|generate_profiles_from_entities| Generator
Generator --> Sys Generator -->|build context block| Generator
Generator --> Ind Generator -->|read locale postfix| Locale
Generator --> Grp Generator -->|search facts/nodes| Graph
Sys -. inline call .-> Locale Generator -->|JSON request| LLM
Ind -. inline call .-> Locale LLM -->|raw JSON| Generator
Grp -. inline call .-> Locale Generator -->|OasisAgentProfile| Caller
Sys --> Client
Ind --> Client
Grp --> Client
Client --> Parser
Parser --> Fallback
Generator --> Serializer
classDef change fill:#fff4ce,stroke:#a16207,color:#000
class Sys,Ind,Grp change
``` ```
The three highlighted nodes (`_get_system_prompt`,
`_build_individual_persona_prompt`,
`_build_group_persona_prompt`) are the only nodes whose **string
contents** change. Every edge — including each call to
`get_language_instruction()` — remains intact.
**Architecture Integration**: **Architecture Integration**:
- **Selected pattern**: In-place lexical translation of the three - Selected pattern: **In-place lexical translation** of seven regions of an existing service. No structural change.
prompt builders (Option A from `gap-analysis.md` / `research.md`). - Domain/feature boundaries: locale machinery vs. prompt assembly vs. LLM transport remain cleanly separated.
- **Domain/feature boundaries**: Same as today; `OasisProfileGenerator` - Existing patterns preserved: prompt-as-f-string user-message construction; Chinese-keyed `_normalize_gender` mapping; `t(...)` for log strings; `get_language_instruction()` postfix concatenation.
remains the sole owner of persona prompt content. `LocaleService` - New components rationale: none — no new components.
remains the sole owner of locale-postfix steering. - Steering compliance: matches the established `i18n-*-prompts` family pattern (issues #2, #3, #4, #5) of in-place translation rather than `t()` keying for prompt bodies. Respects the steering note that "existing files mix English and Chinese in comments/docstrings — preserve both; do not translate one into the other unless asked." This ticket is the explicit ask for prompt strings, scoped to exclude comments/docstrings.
- **Existing patterns preserved**: locale-thread propagation, retry
logic with temperature decay, JSON resilience helpers, rule-based
fallback, two-platform serialization.
- **New components rationale**: none — no new components.
- **Steering compliance**: aligns with `tech.md` ("LLM prompts use the
`get_language_instruction()` postfix mechanism, not key files") and
`structure.md` ("services own their own prompt strings").
### Technology Stack & Alignment ### Technology Stack
| Layer | Choice / Version | Role in Feature | Notes | | Layer | Choice / Version | Role in Feature | Notes |
|-------|------------------|-----------------|-------| |-------|------------------|-----------------|-------|
| Backend / Services | Python ≥3.11 | Hosts the prompt builders | No version change | | Backend / Services | Python 3.11+ | Hosts `OasisProfileGenerator` | Existing — unchanged. |
| LLM transport | `openai` SDK against any OpenAI-compatible endpoint | Sends translated prompts | Unchanged | | Backend / Services | `openai` SDK | Issues the prompt; returns JSON | Existing — unchanged. |
| i18n | `backend/app/utils/locale.py` | Resolves locale and provides `get_language_instruction()` postfix | Unchanged | | Backend / Services | `backend/app/utils/locale.py` | Resolves `Accept-Language` → `llmInstruction` postfix | Existing — unchanged. |
| Storage | None | — | No persistence change | | Backend / Services | `GraphitiAdapter` | Provides Graphiti graph search facts/nodes | Existing — unchanged. |
No new dependencies. No version bumps. The locale infrastructure used No new dependencies. No version changes.
by the change is the same one used by every sibling i18n spec already
merged.
## File Structure Plan ## File Structure Plan
### Modified Files ### Modified Files
- `backend/app/services/oasis_profile_generator.py` — only file that - `backend/app/services/oasis_profile_generator.py` — Replace the body of `_get_system_prompt` `base_prompt`; replace every Chinese string literal in `_build_individual_persona_prompt` and `_build_group_persona_prompt` with English equivalents; replace the four section labels in `_search_zep_for_entity` and the six section labels in `_build_entity_context`; replace the three fallback persona templates; replace the two `"无"` / `"无额外上下文"` placeholders; replace the console-output literals in `_print_generated_profile` and the two `print(...)` banners in `generate_profiles_from_entities`. Preserve every other character of the file.
changes.
- `_get_system_prompt(self, is_individual: bool) -> str` — translate
`base_prompt` literal to English. Keep
`f"{base_prompt}\n\n{get_language_instruction()}"` shape.
- `_build_individual_persona_prompt(self, entity_name, entity_type,
entity_summary, entity_attributes, context) -> str` — translate
the f-string body to English; replace `"无"` and `"无额外上下文"`
defaults; keep every `{variable}` interpolation and the inline
`{get_language_instruction()}` call.
- `_build_group_persona_prompt(self, entity_name, entity_type,
entity_summary, entity_attributes, context) -> str` — same
treatment as the individual builder.
No other files in the repository are touched by this change. No new files. No deletions. No moves.
## System Flows ## System Flows
The runtime flow does not change. The only way to demonstrate this is The control-flow diagram in *Architecture Pattern & Boundary Map* covers the relevant flow; no additional diagrams are needed for this string-literal change.
to compare the call graph before and after — and the call graph is
already shown in the Architecture diagram above. Skipping a separate
sequence diagram.
## Requirements Traceability ## Requirements Traceability
| Requirement | Summary | Components | Interfaces | Flows | | Requirement | Summary | Components | Interfaces | Flows |
|-------------|---------|------------|------------|-------| |-------------|---------|------------|------------|-------|
| 1.1 | `base_prompt` contains zero Chinese characters | `_get_system_prompt` | `(self, is_individual: bool) -> str` | system-message construction | | 1.1-1.4 | English `_get_system_prompt` `base_prompt`; preserve `get_language_instruction()` site | OasisProfileGenerator → `_get_system_prompt` | None changed | Architecture diagram |
| 1.2 | Preserve `f"{base_prompt}\n\n{get_language_instruction()}"` | `_get_system_prompt` | inline `get_language_instruction()` | system-message construction | | 2.1-2.9 | English `_build_individual_persona_prompt`; preserve interpolations and JSON keys | OasisProfileGenerator → `_build_individual_persona_prompt` | f-string interpolation | n/a |
| 1.3 | Preserve role/intent semantics | `_get_system_prompt` | — | — | | 3.1-3.9 | English `_build_group_persona_prompt`; preserve fixed-value rules and interpolations | OasisProfileGenerator → `_build_group_persona_prompt` | f-string interpolation | n/a |
| 1.4 | Preserve signature `_get_system_prompt(self, is_individual: bool) -> str` | `_get_system_prompt` | (signature) | — | | 4.1-4.10 | English context-builder section labels | OasisProfileGenerator → `_search_zep_for_entity`, `_build_entity_context` | Prompt-only | n/a |
| 2.1 | Individual prompt body in English | `_build_individual_persona_prompt` | f-string body | user-message construction | | 5.1-5.3 | English fallback persona templates | OasisProfileGenerator → `_generate_profile_with_llm`, `_try_fix_json` | None changed | n/a |
| 2.2 | Preserve `{entity_name}`, `{entity_type}`, `{entity_summary}`, `{attrs_str}`, `{context_str}`, `{get_language_instruction()}` | `_build_individual_persona_prompt` | f-string interpolations | — | | 6.1-6.7 | English console-output formatting | OasisProfileGenerator → `_print_generated_profile`, `generate_profiles_from_entities` | None changed | n/a |
| 2.3 | Preserve JSON keys `bio, persona, age, gender, mbti, country, profession, interested_topics` | `_build_individual_persona_prompt` | prompt content | — | | 7.1-7.4 | Locale switching preserved via `get_language_instruction()` | OasisProfileGenerator + Locale | `get_language_instruction()` | Architecture diagram |
| 2.4 | Preserve field-level constraints (lengths, MBTI, gender enum, age int) | `_build_individual_persona_prompt` | prompt content | — | | 8.1-8.6 | Public API and call-site stability; preserve `_normalize_gender` and `country: "中国"` data default | OasisProfileGenerator (signatures, dataclass) | Public surface | n/a |
| 2.5 | Preserve trailing-rules block semantics | `_build_individual_persona_prompt` | prompt content | — | | 9.1-9.3 | Reasoning-model compatibility | OasisProfileGenerator → `chat.completions.create` + `_try_fix_json` | OpenAI SDK | Architecture diagram |
| 2.6 | Preserve method signature | `_build_individual_persona_prompt` | (signature) | — | | 10.1-10.7 | Out-of-scope surfaces untouched | OasisProfileGenerator (boundary commitment) | n/a | n/a |
| 2.7 | Translate `"无"` and `"无额外上下文"` defaults | `_build_individual_persona_prompt` | literal defaults | — |
| 2.8 | Zero Chinese in assembled body | `_build_individual_persona_prompt` | — | — |
| 3.1 | Group prompt body in English | `_build_group_persona_prompt` | f-string body | user-message construction |
| 3.2 | Preserve interpolations | `_build_group_persona_prompt` | f-string interpolations | — |
| 3.3 | Preserve JSON keys | `_build_group_persona_prompt` | prompt content | — |
| 3.4 | Preserve field-level constraints (age=30, gender="other", etc.) | `_build_group_persona_prompt` | prompt content | — |
| 3.5 | Preserve trailing-rules semantics | `_build_group_persona_prompt` | prompt content | — |
| 3.6 | Preserve method signature | `_build_group_persona_prompt` | (signature) | — |
| 3.7 | Translate `"无"` / `"无额外上下文"` defaults | `_build_group_persona_prompt` | literal defaults | — |
| 3.8 | Zero Chinese in assembled body | `_build_group_persona_prompt` | — | — |
| 4.1 | Preserve every `get_language_instruction()` call site | all three builders | inline call | system + user message construction |
| 4.2 | Preserve locale-thread plumbing | `generate_profiles_for_entities` (untouched) | `set_locale(current_locale)` | worker thread spawn |
| 4.3 | Locale=zh produces Chinese personas | runtime behaviour | locale postfix | LLM call |
| 4.4 | Locale=en produces English personas | runtime behaviour | locale postfix | LLM call |
| 4.5 | `gender` ∈ {male, female, other} regardless of locale | prompt content | — | — |
| 4.6 | Don't alter locale.py / locales/ | (none) | — | — |
| 5.1 | Preserve `OasisAgentProfile` dataclass | (untouched) | dataclass | — |
| 5.2 | Preserve method signatures | (untouched) | signatures | — |
| 5.3 | Preserve LLM invocation parameters | (untouched) | `chat.completions.create(...)` | — |
| 5.4 | Preserve `MBTI_TYPES`, `COUNTRIES`, taxonomy lists | (untouched) | class constants | — |
| 6.1 | Preserve `_fix_truncated_json` / `_try_fix_json` | (untouched) | helpers | — |
| 6.2 | Reasoning-model recovery still works | (untouched) | resilience helpers | — |
| 6.3 | No new prompt-language-dependent pre-processing | (none added) | — | — |
| 6.4 | Round-trip yields non-empty `bio` and `persona` | runtime behaviour | LLM call | — |
| 7.1 | `pytest test_profile_format.py` passes | runtime behaviour | serializers | — |
| 7.2 | Reddit format schema preserved | (untouched) | `to_reddit_format` | — |
| 7.3 | Twitter format schema preserved | (untouched) | `to_twitter_format` | — |
| 7.4 | `gender` enum preserved | prompt content | — | — |
| 8.1 | No logger edits | (untouched) | — | — |
| 8.2 | No docstring/comment edits | (untouched) | — | — |
| 8.3 | No rule-based fallback edits | (untouched) | — | — |
| 8.4 | No edits outside the target file | (none) | — | — |
| 8.5 | No new dependencies | (none) | `pyproject.toml` / `uv.lock` untouched | — |
| 8.6 | No edits to `test_profile_format.py` | (untouched) | — | — |
## Components and Interfaces ## Components and Interfaces
| Component | Domain/Layer | Intent | Req Coverage | Key Dependencies (P0/P1) | Contracts | | Component | Domain/Layer | Intent | Req Coverage | Key Dependencies (P0/P1) | Contracts |
|-----------|--------------|--------|--------------|--------------------------|-----------| |-----------|--------------|--------|--------------|--------------------------|-----------|
| `_get_system_prompt` | backend service / prompt builder | Produce the system message (English base + locale postfix) | 1.1, 1.2, 1.3, 1.4, 4.1, 4.5 | `get_language_instruction` (P0) | Service | | OasisProfileGenerator (modified) | Backend / Service | Render English profile-generation prompts and context labels; preserve all behaviour | 1.1-10.7 | `OpenAI.chat.completions.create` (P0), `get_language_instruction` (P0), `GraphitiAdapter.graph.search` (P1), `_normalize_gender` (P0) | Service |
| `_build_individual_persona_prompt` | backend service / prompt builder | Produce the individual-entity user message in English | 2.x, 4.1, 4.5 | `get_language_instruction` (P0); JSON encoder (P1) | Service |
| `_build_group_persona_prompt` | backend service / prompt builder | Produce the group/institution user message in English | 3.x, 4.1, 4.5 | `get_language_instruction` (P0); JSON encoder (P1) | Service |
Only the three prompt-builder methods change. They all live inside the ### Backend / Service
single class `OasisProfileGenerator` in
`backend/app/services/oasis_profile_generator.py`. No new components.
### Backend / Services #### OasisProfileGenerator (modified)
#### `_get_system_prompt`
| Field | Detail | | Field | Detail |
|-------|--------| |-------|--------|
| Intent | Build the `system` message: a one-line English directive that frames the model as a social-media persona expert + the per-locale postfix. | | Intent | Translate prompt strings, context labels, fallback persona templates, and console output to English while preserving every functional contract. |
| Requirements | 1.1, 1.2, 1.3, 1.4, 4.1, 4.5 | | Requirements | 1.1, 1.2, 1.3, 1.4, 2.1-2.9, 3.1-3.9, 4.1-4.10, 5.1-5.3, 6.1-6.7, 7.1-7.4, 8.1-8.6, 9.1-9.3, 10.1-10.7 |
**Responsibilities & Constraints** **Responsibilities & Constraints**
- Construct and return a single string of the form - Owns: the English wording of the system prompt body, the two user-message templates, the context-builder section labels, the fallback persona templates, the no-attributes / no-context placeholders, and the console-output formatting.
`f"{base_prompt}\n\n{get_language_instruction()}"`. - Domain boundary: prompt content and proximate console output only. Does not own locale resolution, transport, validation, or data values like the OASIS `country` default.
- Preserve the signature - Invariants:
`_get_system_prompt(self, is_individual: bool) -> str`. - All seven owned regions after translation MUST contain zero CJK characters.
- The English `base_prompt` MUST convey: (a) expert role in - The translated user-message templates MUST present the same eight required JSON keys: `bio`, `persona`, `age`, `gender`, `mbti`, `country`, `profession`, `interested_topics`.
social-media persona generation; (b) intent to produce detailed, - The translated individual-persona template MUST require `gender ∈ {"male", "female"}` and `age` to be a valid integer.
realistic personas for opinion-simulation, faithful to existing - The translated group-persona template MUST require `age == 30` and `gender == "other"`.
reality; (c) the JSON-output requirement and the no-unescaped-newline - The translated user-message templates MUST preserve the f-string interpolations: `{entity_name}`, `{entity_type}`, `{entity_summary}`, `{attrs_str}`, `{context_str}`, `{get_language_instruction()}`.
rule. - The translated context-builder labels MUST preserve the section structure (heading + bulleted body).
- The English `base_prompt` MUST NOT contain any CJK codepoint. - The translated fallback persona templates MUST preserve the `entity_summary or template` priority order.
- The call to `get_language_instruction()` MUST remain at its current locations.
- The call to `self.client.chat.completions.create(...)` MUST remain unchanged.
- All public signatures, dataclass schema, and the private helper signatures MUST remain unchanged.
- All `logger.*` calls (already keyed) and inline comments and docstrings in this file MUST remain unchanged (out of scope per #6 and #7).
- The `_normalize_gender` mapping table MUST remain unchanged.
- The rule-based `country: "中国"` default MUST remain unchanged.
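Two of the invariants above (the eight JSON keys and the f-string interpolations) lend themselves to a mechanical check. The sketch below renders a prompt builder with sentinel values and reports anything that went missing. `build_prompt` is an abridged, hypothetical template rather than the real `_build_individual_persona_prompt`, and the check assumes the template names each JSON key in double quotes. The `{get_language_instruction()}` interpolation is not covered here.

```python
REQUIRED_KEYS = (
    "bio", "persona", "age", "gender",
    "mbti", "country", "profession", "interested_topics",
)
FIELDS = ("entity_name", "entity_type", "entity_summary", "attrs_str", "context_str")


def build_prompt(entity_name, entity_type, entity_summary, attrs_str, context_str):
    """Abridged, hypothetical English template used only for this sketch."""
    keys = ", ".join(f'"{k}"' for k in REQUIRED_KEYS)
    return (
        f"Entity name: {entity_name}\n"
        f"Entity type: {entity_type}\n"
        f"Entity summary: {entity_summary}\n"
        f"Entity attributes: {attrs_str}\n"
        f"Context:\n{context_str}\n\n"
        f"Return a JSON object with the keys {keys}.\n"
    )


def missing_invariants(builder):
    """Return the interpolations and JSON keys absent from the rendered prompt."""
    sentinels = {name: f"<<{name}>>" for name in FIELDS}
    prompt = builder(**sentinels)
    missing = [name for name, s in sentinels.items() if s not in prompt]
    missing += [key for key in REQUIRED_KEYS if f'"{key}"' not in prompt]
    return missing
```

An empty list means every sentinel and every quoted key survived the translation; a dropped `{context_str}` or a renamed key shows up by name.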
**Dependencies** **Dependencies**
- Outbound: `get_language_instruction()` from - Inbound: `backend/app/api/simulation.py` — production caller (P0).
`backend/app/utils/locale.py` (P0, criticality high — the entire - Outbound: `backend/app/utils/locale.get_language_instruction` — locale postfix (P0); `backend/app/utils/locale.t` — already-keyed log strings (P0); `backend/app/services/graphiti_adapter.GraphitiAdapter.graph.search` — facts/nodes retrieval (P1); `OpenAI.chat.completions.create` — JSON LLM transport (P0).
locale-steering chain depends on it). - External: none.
**Contracts**: Service [x] / API [ ] / Event [ ] / Batch [ ] / State [ ] **Contracts**: Service [x] / API [ ] / Event [ ] / Batch [ ] / State [ ]
##### Service Interface ##### Service Interface
The public Python interface is unchanged. Representative signatures:
```python ```python
def _get_system_prompt(self, is_individual: bool) -> str: class OasisProfileGenerator:
"""Return the LLM system message: English base + locale postfix.""" def __init__(
... self,
api_key: Optional[str] = None,
base_url: Optional[str] = None,
model_name: Optional[str] = None,
zep_api_key: Optional[str] = None,
graph_id: Optional[str] = None,
) -> None: ...
def generate_profile_from_entity(
self,
entity: EntityNode,
user_id: int,
use_llm: bool = True,
) -> OasisAgentProfile: ...
def generate_profiles_from_entities(
self,
entities: List[EntityNode],
use_llm: bool = True,
progress_callback: Optional[callable] = None,
graph_id: Optional[str] = None,
parallel_count: int = 5,
realtime_output_path: Optional[str] = None,
output_platform: str = "reddit",
) -> List[OasisAgentProfile]: ...
def save_profiles(
self,
profiles: List[OasisAgentProfile],
file_path: str,
platform: str = "reddit",
) -> None: ...
``` ```
- Preconditions: none. - Preconditions: a configured LLM provider; a configured Graphiti / Neo4j graph; a non-empty `entities` list when batching.
- Postconditions: returns a non-empty string ending with the locale - Postconditions: `OasisAgentProfile` instances with English `bio` and `persona` under locale `en`, Chinese under locale `zh`, and structurally equivalent across locales.
postfix produced by `get_language_instruction()`. - Invariants: see *Responsibilities & Constraints*.
- Invariants: contains zero CJK codepoints.
**Implementation Notes** **Implementation Notes**
- Integration: called only from `_call_llm_with_retry` (line ~523) - **Integration**: No new imports. No call-site changes. The diff is confined to seven regions of one file.
with `is_individual` decided upstream. The `is_individual` flag is - **Validation**: After implementation, run a targeted regex check (`[一-鿿]`) over the seven owned regions to confirm zero CJK; smoke-test `_build_individual_persona_prompt(...)` and `_build_group_persona_prompt(...)` with representative inputs to confirm interpolations still work; round-trip a single profile end-to-end under both `en` and `zh` locales.
reserved for future divergence between system prompts; the current - **Risks**: English-base bias on Chinese-locale output (mitigated by the `llmInstruction` postfix already present in both system and user messages). Reduced LLM compliance with `gender ∈ {male, female}` for individual entities (mitigated by retaining the explicit English-token directive verbatim in the rules block).
implementation does not branch on it, and this design preserves
that.
- Validation: a CJK regex audit on the method body after the edit must
match zero codepoints.
- Risks: dropping one of the three role/intent pieces (expert framing,
JSON output requirement, no-newline rule). Implementation task lists
all three explicitly.
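
The CJK audit named under Validation can be sketched as a small helper. This is a minimal sketch, assuming "CJK codepoint" means only the CJK Unified Ideographs block (U+4E00 to U+9FFF); `find_cjk` and `CJK_PATTERN` are hypothetical names, not part of the repository. Fullwidth punctuation such as `。` or `：` lies outside that block, so a stricter audit may need additional ranges.

```python
import re

# Assumption: the audit targets the CJK Unified Ideographs block only.
CJK_PATTERN = re.compile(r"[\u4e00-\u9fff]")


def find_cjk(text: str) -> list[tuple[int, str]]:
    """Return (offset, character) pairs for every CJK ideograph in text."""
    return [(m.start(), m.group()) for m in CJK_PATTERN.finditer(text)]


if __name__ == "__main__":
    # Illustrative English wording; not the text shipped in the PR.
    english = "You are an expert in social-media persona generation."
    # Opening of the original Chinese base prompt, as quoted in the requirements.
    chinese = "你是社交媒体用户画像生成专家。"
    print(find_cjk(english))
    print(find_cjk(chinese))
```

Returning offsets rather than a boolean makes it easier to locate a stray character inside a long f-string during review.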
#### `_build_individual_persona_prompt`
| Field | Detail |
|-------|--------|
| Intent | Build the user-message string for an individual entity in English. Preserve every `{variable}` interpolation, the inline `{get_language_instruction()}` call, every JSON-output key, and every locale-independent constraint. |
| Requirements | 2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 2.7, 2.8, 4.1, 4.5 |
**Responsibilities & Constraints**
- Preserve signature
`_build_individual_persona_prompt(self, entity_name: str, entity_type: str, entity_summary: str, entity_attributes: Dict[str, Any], context: str) -> str`.
- Preserve `attrs_str = json.dumps(entity_attributes, ensure_ascii=False) if entity_attributes else <fallback>` with `<fallback>` translated to English (`"None"`).
- Preserve `context_str = context[:3000] if context else <fallback>` with `<fallback>` translated to English (`"No additional context"`).
- Translate the f-string body to English with these structural sections (mirror the original Chinese intent):
1. **Lead sentence** — instruct the model to generate a detailed
social-media persona for the entity, faithful to existing reality.
2. **Entity context block** — labelled lines for `entity_name`,
`entity_type`, `entity_summary`, `entity_attributes` (English
labels; values via `{...}` interpolation).
3. **Context information block**: `Context information:` heading
followed by `{context_str}`.
4. **JSON-fields enumeration** — `Generate JSON with the following
fields:` followed by the eight numbered items (`bio`, `persona`,
`age`, `gender`, `mbti`, `country`, `profession`,
`interested_topics`) with English descriptions matching
Requirement 2.4.
5. **Trailing rules block**: `Important:` followed by:
- `All field values must be strings or numbers; do not use newlines.`
- `persona must be a single coherent block of text.`
- `{get_language_instruction()} (gender field MUST use English values: "male" or "female")`
- `Content must remain consistent with the entity information.`
- `age must be a valid integer; gender must be exactly "male" or "female".`
- Preserve every `{variable}` interpolation present in the original by
name: `{entity_name}`, `{entity_type}`, `{entity_summary}`,
`{attrs_str}`, `{context_str}`, `{get_language_instruction()}`.
- The translated body MUST NOT contain any CJK codepoint.
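
The two placeholder rules above (`attrs_str` and `context_str`) can be sketched in isolation. This is a minimal sketch under stated assumptions: `render_fallbacks` and `MAX_CONTEXT_CHARS` are hypothetical names introduced for illustration, while the expressions themselves follow the constraints listed in this section.

```python
import json
from typing import Any, Dict, Tuple

MAX_CONTEXT_CHARS = 3000  # truncation length stated in the design


def render_fallbacks(entity_attributes: Dict[str, Any], context: str) -> Tuple[str, str]:
    """Return (attrs_str, context_str), using English placeholders for empty inputs."""
    # ensure_ascii=False keeps non-ASCII attribute values readable instead of \u-escaped.
    attrs_str = (
        json.dumps(entity_attributes, ensure_ascii=False) if entity_attributes else "None"
    )
    context_str = context[:MAX_CONTEXT_CHARS] if context else "No additional context"
    return attrs_str, context_str


if __name__ == "__main__":
    print(render_fallbacks({}, ""))
    print(render_fallbacks({"city": "北京"}, "some context"))
```

Because `ensure_ascii=False` is preserved, attribute values that arrive in Chinese still appear verbatim in the rendered prompt. That is entity data rather than a prompt literal, so it presumably does not count against the zero-CJK constraint on the template itself.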
**Dependencies**
- Outbound: `json.dumps(..., ensure_ascii=False)` (P1, formatting the
attributes dict) — unchanged.
- Outbound: `get_language_instruction()` (P0) — interpolated inline.
**Contracts**: Service [x] / API [ ] / Event [ ] / Batch [ ] / State [ ]
##### Service Interface
```python
def _build_individual_persona_prompt(
self,
entity_name: str,
entity_type: str,
entity_summary: str,
entity_attributes: Dict[str, Any],
context: str,
) -> str:
"""Return the LLM user message for an individual-entity persona."""
...
```
- Preconditions: `entity_name`, `entity_type`, `entity_summary`
are strings (may be empty); `entity_attributes` is a dict (may be
empty); `context` is a string (may be empty).
- Postconditions: returns a non-empty English string with all six
interpolations resolved.
- Invariants: contains zero CJK codepoints; preserves every
`{variable}` interpolation by name.
**Implementation Notes**
- Integration: called from `_call_llm_with_retry` (line ~506) when
`is_individual` is true.
- Validation: post-edit CJK regex audit; interpolation-set audit
(verify the multiset of `{...}` tokens equals the pre-change set);
smoke import + `pytest backend/scripts/test_profile_format.py`.
- Risks: dropping the `gender` enum lock when translating; dropping
the inline `{get_language_instruction()}` call. The implementation
task list calls these out as discrete checks.
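
The interpolation-set audit mentioned under Validation can be approximated with the standard library. This is a minimal sketch, assuming the template source is available as plain text and that every replacement field is a simple expression (no `:` or `!` inside it); `interpolation_counts` is a hypothetical helper name. `string.Formatter.parse` targets `str.format` syntax, so it is only an approximation of f-string parsing.

```python
from collections import Counter
from string import Formatter


def interpolation_counts(template_source: str) -> Counter:
    """Count replacement fields in template text, ignoring escaped {{ }} braces."""
    return Counter(
        field_name
        for _, field_name, _, _ in Formatter().parse(template_source)
        if field_name is not None
    )


if __name__ == "__main__":
    # Shortened, illustrative templates; not the full prompt bodies.
    before = "实体名称: {entity_name}\n{get_language_instruction()}"
    after = "Entity name: {entity_name}\n{get_language_instruction()}"
    print(interpolation_counts(before) == interpolation_counts(after))
```

Comparing `Counter` objects rather than sets catches the case where a token is kept by name but accidentally duplicated or dropped once.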
#### `_build_group_persona_prompt`
| Field | Detail |
|-------|--------|
| Intent | Build the user-message string for a group/institution entity in English. Preserve every `{variable}` interpolation, the inline `{get_language_instruction()}` call, every JSON-output key, and every locale-independent constraint (notably `age == 30` and `gender == "other"`). |
| Requirements | 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7, 3.8, 4.1, 4.5 |
**Responsibilities & Constraints**
- Preserve signature
`_build_group_persona_prompt(self, entity_name: str, entity_type: str, entity_summary: str, entity_attributes: Dict[str, Any], context: str) -> str`.
- Preserve the `attrs_str` and `context_str` fallback handling with
English defaults (`"None"`, `"No additional context"`), identical to
the individual builder.
- Translate the f-string body to English with these structural
sections (mirror the original Chinese intent for institutions):
1. **Lead sentence** — instruct the model to generate a detailed
social-media account profile for the institution/group, faithful
to existing reality.
2. **Entity context block** — labelled lines for `entity_name`,
`entity_type`, `entity_summary`, `entity_attributes`.
3. **Context information block**: `Context information:` heading
followed by `{context_str}`.
4. **JSON-fields enumeration** — `Generate JSON with the following
fields:` followed by the eight numbered items as defined in
Requirement 3.4: `bio` (~200 chars, official voice), `persona`
(~2000 chars, single coherent text covering institutional
basics, account positioning, voice, publishing pattern, stance,
special notes, institutional memory), `age` (= integer 30,
institutional virtual age), `gender` (= literal `"other"`),
`mbti` (e.g. ISTJ for strict/conservative), `country` (country
name string), `profession` (institutional function),
`interested_topics` (array).
5. **Trailing rules block**: `Important:` followed by:
- `All field values must be strings or numbers; null is not allowed.`
- `persona must be a single coherent block of text without newlines.`
- `{get_language_instruction()} (gender field MUST use English value "other")`
- `age must be the integer 30; gender must be the string "other".`
- `Account voice must match its identity positioning.`
- Preserve every `{variable}` interpolation present in the original.
- The translated body MUST NOT contain any CJK codepoint.
**Dependencies**
- Outbound: same as individual builder.
**Contracts**: Service [x] / API [ ] / Event [ ] / Batch [ ] / State [ ]
##### Service Interface
```python
def _build_group_persona_prompt(
self,
entity_name: str,
entity_type: str,
entity_summary: str,
entity_attributes: Dict[str, Any],
context: str,
) -> str:
"""Return the LLM user message for a group/institution persona."""
...
```
- Preconditions / Postconditions / Invariants: same shape as the
individual builder.
**Implementation Notes**
- Integration: called from `_call_llm_with_retry` (line ~510) when
`is_individual` is false.
- Validation: same checks as the individual builder, plus an explicit
audit that the institutional sentinels (`age == 30`,
`gender == "other"`) appear in English in the trailing-rules block.
- Risks: same as the individual builder; additionally, the `country`
language hint (`"使用中文,如\"中国\""`) is intentionally dropped
during translation — the validation task verifies that under
`Accept-Language: en` a sample run produces an English country
name.
## Data Models ## Data Models
No data-model changes. The persona JSON schema, the No data-model changes. The `OasisAgentProfile` dataclass is preserved verbatim.
`OasisAgentProfile` dataclass, the Reddit/Twitter serializers, and the
OASIS subprocess profile-format expectations are all preserved
verbatim.
## Error Handling ## Error Handling
### Error Strategy ### Error Strategy
No new error paths. The existing flow is preserved: Error handling is unchanged from the existing implementation:
- `json.JSONDecodeError` → `_try_fix_json` → `_fix_truncated_json` → - LLM transport errors propagate from `chat.completions.create`.
partial-extract via regex → `_generate_profile_rule_based`. - Truncation (`finish_reason == "length"`) is repaired by `_fix_truncated_json`.
- LLM call failure → retry with temperature decay (`0.7 - attempt * 0.1`) - Invalid JSON falls through to `_try_fix_json`, then to a synthesized fallback profile (now with English persona text).
up to `max_attempts = 3`. - Per-entity exceptions are caught and a fallback `OasisAgentProfile` is constructed with English fallback strings.
- Terminal failure → rule-based fallback persona.
- Per-entity worker exception → fallback `OasisAgentProfile` produced
inside `generate_single_profile` at line ~932.
The translated prompts do not introduce new failure modes. Translating
prompt language has no semantic effect on JSON parsing or on the
`response_format={"type": "json_object"}` constraint.
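
The retry-with-temperature-decay step described in the existing flow can be sketched as follows. This is a minimal sketch, not the repository's implementation: `call_with_temperature_decay` and the `call` parameter are hypothetical, and only the formula `0.7 - attempt * 0.1` and the limit of three attempts come from this design.

```python
from typing import Callable, Optional

MAX_ATTEMPTS = 3  # max_attempts stated in the design


def call_with_temperature_decay(call: Callable[[float], str]) -> Optional[str]:
    """Invoke call(temperature) up to MAX_ATTEMPTS times with decreasing temperature.

    Returns the first successful result, or None so the caller can fall back
    to a rule-based persona.
    """
    for attempt in range(MAX_ATTEMPTS):
        temperature = 0.7 - attempt * 0.1
        try:
            return call(temperature)
        except Exception:
            # The design mentions a logger call on failure; omitted in this sketch.
            continue
    return None


if __name__ == "__main__":
    print(call_with_temperature_decay(lambda temperature: f"ok at {temperature}"))
```

Returning `None` on terminal failure keeps the fallback decision with the caller, which matches the "terminal failure → rule-based fallback persona" ordering listed above.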
### Error Categories and Responses ### Error Categories and Responses
- **User errors**: not applicable (this is an internal pipeline). - **User errors (4xx)**: not applicable at this layer; surfaced by the API handler.
- **System errors**: LLM transport errors are retried; logger emits - **System errors (5xx)**: LLM/network failures propagate to the API handler, which converts them to JSON error responses.
`t("log.profile_generator.m011")` etc. Logger keys already exist in - **Business logic errors**: malformed JSON is auto-repaired or replaced with a fallback profile.
`locales/{en,zh}.json`.
- **Business-logic errors**: `gender` not in the English enum, `age`
not an integer — the prompt explicitly mandates them; the validator
inside `_try_fix_json` does not enforce these but the OASIS
subprocess does. No change in either direction.
### Monitoring ### Monitoring
Existing logger calls are unchanged. Logger keys already i18n-keyed via Existing `logger.*` calls (keyed via `t("log.profile_generator.*")`) cover progress and warnings; no new monitoring is added.
`t("log.profile_generator.*")`.
## Testing Strategy ## Testing Strategy
### Unit Tests ### Unit Tests
- **(Existing)** Given the project's intentionally minimal test harness (`backend/scripts/test_profile_format.py` only), the change is verified via:
`backend/scripts/test_profile_format.py::test_profile_formats`
must continue to pass without modification. - **Static check**: a one-shot regex assertion against the patched module ensuring zero CJK characters in the seven owned regions. This can be a quick `python -c` invocation during PR review.
- **(Manual)** Smoke import: - **Round-trip smoke test**: instantiate `OasisProfileGenerator()`, call `_build_individual_persona_prompt(...)` and `_build_group_persona_prompt(...)` with representative inputs, and verify all required interpolations appear in the output and no CJK characters remain.
`cd backend && uv run python -c "from app.services.oasis_profile_generator import OasisProfileGenerator"` - **Fallback rendering**: simulate a JSON parse failure and verify the English fallback persona template is produced.
— confirms no syntax errors after editing f-strings.
### Integration Tests ### Integration Tests
- **(Manual)** Run the prompt builders directly under each locale: - **Step 2 profile generation under EN locale**: run a small batched profile generation against a real Graphiti graph with locale `en`. Verify produced profiles have English `bio` / `persona` and pass the existing OASIS profile-format check.
- `set_locale("en")` →
`OasisProfileGenerator()._build_individual_persona_prompt("Alice", "Student", "summary", {"k": "v"}, "ctx")`
— assert no CJK codepoints in the output, assert the English
locale postfix appears via `get_language_instruction()` (which is
`"Please respond in English."`).
- `set_locale("zh")` → same call → assert the locale postfix is
`"请使用中文回答。"`.
- These do not require an LLM call; they only verify the rendered
prompt string.
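
The locale checks above can be rehearsed without importing the application. This is a minimal sketch, assuming a dictionary stand-in for `get_language_instruction()`; `LANGUAGE_INSTRUCTIONS`, `build_system_prompt`, and `BASE_PROMPT` are hypothetical names, and the base-prompt wording is illustrative rather than the text shipped in the PR. The two postfix strings and the `base + blank line + postfix` assembly are the ones quoted in this specification.

```python
import re

# Stand-in for backend/app/utils/locale.get_language_instruction (hypothetical dict).
LANGUAGE_INSTRUCTIONS = {
    "en": "Please respond in English.",
    "zh": "请使用中文回答。",
}

# Assumption: the audit targets the CJK Unified Ideographs block only.
CJK_PATTERN = re.compile(r"[\u4e00-\u9fff]")

# Illustrative wording only; not the base prompt from the PR.
BASE_PROMPT = "You are an expert in social-media persona generation. Return valid JSON."


def build_system_prompt(base_prompt: str, locale: str) -> str:
    """Assemble the base prompt, a blank line, and the locale postfix."""
    return f"{base_prompt}\n\n{LANGUAGE_INSTRUCTIONS[locale]}"


en_prompt = build_system_prompt(BASE_PROMPT, "en")
zh_prompt = build_system_prompt(BASE_PROMPT, "zh")

if __name__ == "__main__":
    print(CJK_PATTERN.search(en_prompt) is None)
    print(zh_prompt.endswith(LANGUAGE_INSTRUCTIONS["zh"]))
```

The same shape of assertion should carry over to the real user-message builders once `set_locale` and `OasisProfileGenerator` are imported from the backend.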
### E2E Tests ### E2E/UI Tests
- **(Manual, optional, preferred but skippable when no LLM key Not applicable — change does not affect frontend.
present)** Run `npm run dev` and trigger Step 2 profile generation
from the UI under English locale on a small entity set; spot-check
that bios and persona prose are in English. Skip if a live LLM key
is unavailable in CI; sibling specs #2/#4/#5 used the same manual
E2E approach.
### Performance / Load ### Performance/Load
Not applicable. Prompt translation has no measurable performance Not applicable — token counts may differ slightly between Chinese and English renderings, but the LLM call has no `max_tokens` cap and remains within provider-acceptable limits.
impact.
## Optional Sections ## Optional Sections
### Security Considerations ### Security Considerations
No security implications. No new external surfaces; no new data Not applicable. Translation does not introduce new authentication, authorization, data-handling, or input-validation paths.
retention; no change to authentication or authorization.
### Performance & Scalability
Not applicable.
### Migration Strategy ### Migration Strategy
No migration required. The change is forward-compatible: a deployment Not applicable. The change is a single in-place edit; no data migration. Rollback is `git revert`.
that picks up the translated prompts continues to serve users on the
`zh` locale via the unchanged
`get_language_instruction()` postfix mechanism.
## Supporting References ## Supporting References
- `gap-analysis.md` — option evaluation and effort/risk sizing. - `backend/app/services/oasis_profile_generator.py` — current Chinese prompt content (the source of translation).
- `research.md` — discovery findings, design decisions (in particular - `backend/app/utils/locale.py` — locale resolver.
the "drop the country language hint" decision), and risk register. - `backend/app/api/simulation.py` — call site.
- `requirements.md` — EARS requirements with numeric IDs. - `.kiro/specs/i18n-ontology-generator-prompts/design.md` — adjacent reference design for in-place prompt translation.
- Sibling specs `i18n-ontology-generator-prompts`, - `.ticket/25.md` — ticket snapshot.
`i18n-simulation-config-generator-prompts`,
`i18n-report-agent-prompts` — same translation pattern, already
merged.


@ -2,144 +2,167 @@
## Introduction ## Introduction
This specification covers the English translation of the prompt strings in `backend/app/services/oasis_profile_generator.py`. The file converts Graphiti graph entities into OASIS agent persona dictionaries that drive Step 2 (Environment Setup) of the MiroFish pipeline. Today, the system prompt and the two `_build_*_persona_prompt` user-message templates are written in Chinese; the language is steered at runtime by appending `get_language_instruction()` to the system prompt and inside the user prompt body. While that postfix instructs the model *which* language to respond in, the base-prompt language biases the model's structural and lexical output, so persona prose (bio, persona, profession, interested_topics) skews Chinese under `Accept-Language: en`. Translating the base prompts to English removes that bias while preserving the existing locale-switching mechanism for non-English locales (`get_language_instruction()` returns `请使用中文回答。` when locale is `zh`, so a Chinese model response remains achievable from an English base prompt). This specification covers the English translation of the LLM-prompt assembly strings in `backend/app/services/oasis_profile_generator.py`. The file generates OASIS Agent profiles (bio, persona, demographics) from Graphiti/Zep entities during pipeline Step 2. Today, the system prompt and the two user-message builders (`_build_individual_persona_prompt`, `_build_group_persona_prompt`) are written in Chinese, and the runtime context-builders (`_search_zep_for_entity`, `_build_entity_context`) embed Chinese section labels (`事实信息:`, `相关实体:`, `### 实体属性`, `### 关联实体信息`, etc.) into the prompt context that is later interpolated into the user message. Locale is steered at runtime by appending `get_language_instruction()` to the system message and the user-message rules block, but the base-prompt language and the embedded context labels bias the LLM toward Chinese output even when `Accept-Language: en`. 
Translating the prompt body and the context labels removes that bias while preserving the existing locale-switching mechanism for non-English locales.
This work tracks GitHub issue [#3](https://github.com/salestech-group/MiroFish/issues/3) and is sibling to the already-merged ontology-generator (#2), simulation-config-generator (#4), and report-agent (#5) prompt translation specs. This work tracks GitHub issue [#25](https://github.com/salestech-group/MiroFish/issues/25).
## Boundary Context ## Boundary Context
- **In scope**: - **In scope**:
- Translating the system-prompt base string in `OasisProfileGenerator._get_system_prompt` (currently `"你是社交媒体用户画像生成专家。…"` at line ~664) from Chinese to English. - Translating the system-prompt base string in `_get_system_prompt` (`base_prompt = "你是社交媒体用户画像生成专家..."`).
- Translating the individual-persona user-message template in `OasisProfileGenerator._build_individual_persona_prompt` (currently lines ~680–714) from Chinese to English. - Translating the user-message body in `_build_individual_persona_prompt` (header line, field labels, JSON-field descriptions, "重要" rules block).
- Translating the group/institution-persona user-message template in `OasisProfileGenerator._build_group_persona_prompt` (currently lines ~729–762) from Chinese to English. - Translating the user-message body in `_build_group_persona_prompt` (header line, field labels, JSON-field descriptions, "重要" rules block).
- Translating the small `attrs_str` and `context_str` fallback default literals (`"无"`, `"无额外上下文"`) to English equivalents. - Translating the placeholder values used inside those builders: `"无"` and `"无额外上下文"` (substituted when an entity has no attributes or no context).
- Preserving all functional contracts: every `get_language_instruction()` call site, all variable interpolations, all JSON output keys, the `gender` enum constraint, the `age` integer constraint, and the institutional age=30 / gender="other" rule. - Translating the section-heading labels prepended to context fragments by `_search_zep_for_entity` (`"相关实体: "` prefix on node-name labels; `"事实信息:"`, `"相关实体:"` block headings).
- Translating the section-heading labels prepended to context fragments by `_build_entity_context` (`"### 实体属性"`, `"### 相关事实和关系"`, `"### 关联实体信息"`, `"### Zep检索到的事实信息"`, `"### Zep检索到的相关节点"`, plus the inline `(相关实体)` placeholder in edge-direction fragments).
- Translating the fallback persona templates (`f"{entity_name}是一个{entity_type}。"`) used when LLM JSON parsing fails or fields are missing.
- Translating the console-output formatting in `_print_generated_profile` (the `【简介】`, `【详细人设】`, `【基本属性】` headings and the `用户名:`, `年龄:`, `性别:`, `MBTI:`, `职业:`, `国家:`, `兴趣话题:` row labels) and the surrounding `print` banners in `generate_profiles_from_entities` (`开始生成Agent人设...`, `人设生成完成!...`).
- Translating the `'无'` sentinel emitted when `interested_topics` is empty in `_print_generated_profile`.
- Preserving all functional contracts: f-string interpolations, JSON output schema, `get_language_instruction()` postfix call sites, `_normalize_gender` mappings (Chinese `男`/`女`/`机构`/`其他` keys remain — input data may still arrive in those forms), the `country: "中国"` rule-based default in `_generate_profile_rule_based`, the `OASIS 库要求字段名为 username无下划线` inline comments at lines 65 and 93 (these are code-level documentation, owned by issue #7), and the `# 可能被截断` / `# 机构虚拟年龄` etc. inline comments (owned by issue #7).
- **Out of scope**: - **Out of scope**:
- Logger calls (`logger.info`, `logger.warning`, `logger.error`) and the printed banner text inside `oasis_profile_generator.py` — covered by issue #6. - Logger calls in this file (covered by issue #6 and the in-flight #24/#25 backend-log work — the logger calls already use `t("log.profile_generator.*")` keys).
- Module docstring, class docstrings, method docstrings, and inline comments — covered by issue #7. - Module/class/method docstrings and inline code comments (covered by issue #7 — including the `# OASIS 库要求字段名为 username` and `# 机构虚拟年龄` style comments).
- The fallback Chinese string literals embedded in non-prompt code paths (e.g. `f"{entity_name}是一个{entity_type}。"` inside `_try_fix_json` and the rule-based fallback) — those are runtime data fallbacks, not LLM prompts, and are out of scope for this issue (they are part of the fallback flow covered when comments/docstrings #7 lands or in a future cleanup; they are not user-visible while the LLM path succeeds). - The `_normalize_gender` mapping table (it must continue to accept Chinese gender inputs that may still arrive from upstream LLM output or user-supplied data).
- Refactoring the OASIS profile JSON schema, the `OasisAgentProfile` dataclass, the MBTI list, the `COMMON_COUNTRIES` list, the entity-type taxonomy splits (`PERSONAL_ENTITY_TYPES` vs `GROUP_ENTITY_TYPES`), or persona-generation flow control. - The hard-coded `"中国"` rule-based country default (this is a data value that downstream OASIS expects in a free-form `country` field; changing the default is a data migration, not a translation).
- Changing OASIS profile-format compatibility — verified by `backend/scripts/test_profile_format.py`. - The Chinese identifier in the `ValueError("LLM_API_KEY 未配置")` raise — that is an exception message, not a prompt fragment, and will be translated under issue #6 (already partially in progress under #24).
- Editing the locale plumbing block (currently the `current_locale = get_locale()` capture and the `set_locale(current_locale)` call inside `generate_single_profile` around lines ~910–916). - Externalising prompt strings to `/locales/*.json` (out of scope per the `i18n-*-prompts` family of tickets — same pattern as issues #2/#3/#4/#5).
- Editing call sites of `OasisProfileGenerator` (`api/simulation.py`, etc.).
- Editing `backend/app/utils/locale.py`, the locale registries, or `/locales/`.
- **Adjacent expectations**: - **Adjacent expectations**:
- The Step 2 environment-setup pipeline must continue to consume the OASIS profile output unchanged. The Reddit (`to_reddit_format`) and Twitter (`to_twitter_format`) serializers are not coupled to prompt language; this is verified via the JSON schema contract preservation. - The OASIS / CAMEL-OASIS simulation layer must continue to consume profile JSON unchanged. No coupling to prompt language exists in the OASIS adapter.
- The locale resolution chain (`Accept-Language` header → `get_locale()` → `get_language_instruction()`) is owned by `backend/app/utils/locale.py` and is unchanged by this work. - The locale resolution chain (`Accept-Language` header → `get_locale()` → `get_language_instruction()`) is owned by `backend/app/utils/locale.py` and is unchanged by this work. Translating the base prompt does not modify locale resolution semantics.
- Companion i18n issues (#6 logs, #7 comments/docstrings, #9 frontend comments, #10 e2e verification, #12 README) operate on different files or scopes and must not be touched here. - Companion i18n issues (#3, #4, #5, #6, #7, #9, #10, #23, #24, #26) operate on different files or scopes and should not be touched here.
## Requirements ## Requirements
### Requirement 1: English Translation of the System Prompt ### Requirement 1: English Translation of the Profile-Generation System Prompt
**Objective:** As a MiroFish operator running the pipeline under `Accept-Language: en`, I want the persona-generation system prompt to be authored in English, so that the LLM's persona prose is not biased toward Chinese structure or word choice. **Objective:** As a MiroFish operator running the pipeline under `Accept-Language: en`, I want the profile-generation system prompt to be authored in English, so that the LLM's persona output is not biased toward Chinese structure or word choice.
#### Acceptance Criteria #### Acceptance Criteria
1. The OASIS Profile Generator shall set the `base_prompt` constant inside `_get_system_prompt` to an English string containing zero Chinese characters. 1. The OASIS Profile Generator shall define `base_prompt` (in `_get_system_prompt`) containing zero CJK characters in any string-literal content.
2. The OASIS Profile Generator shall preserve the system-prompt assembly contract verbatim: the format `f"{base_prompt}\n\n{get_language_instruction()}"` and the call to `get_language_instruction()` at exactly that site. 2. The OASIS Profile Generator shall preserve the system-prompt requirement that the model returns valid JSON whose string values do not contain unescaped newline characters.
3. The OASIS Profile Generator shall preserve the role and intent semantics of the original prompt: identifying the model as an expert in social-media user-persona generation, requesting detailed and realistic personas for opinion simulation that reflect existing real-world conditions, and mandating valid JSON output where string values must not contain unescaped newlines. 3. The OASIS Profile Generator shall preserve the call to `get_language_instruction()` appended to `base_prompt`, exactly at the existing concatenation site, so locale steering continues to work for non-English locales.
4. The OASIS Profile Generator shall preserve the function signature `_get_system_prompt(self, is_individual: bool) -> str`. 4. The OASIS Profile Generator shall preserve the `is_individual` parameter of `_get_system_prompt` and continue to return a single concatenated system-prompt string of the form `"{base_prompt}\n\n{language_instruction}"`.
### Requirement 2: English Translation of the Individual-Persona User-Message Template ### Requirement 2: English Translation of the Individual-Persona User-Message Template
**Objective:** As a MiroFish operator generating personas for individual entities under `Accept-Language: en`, I want the user-message template constructed by `_build_individual_persona_prompt` to be authored in English, so that the rendered prompt does not interleave English `get_language_instruction()` directives with Chinese section headings. **Objective:** As a MiroFish operator running the pipeline under `Accept-Language: en`, I want the individual-persona user-message template constructed by `_build_individual_persona_prompt` to be authored in English, so that the rendered prompt does not interleave English instructions with Chinese section headings, and the LLM is not biased toward Chinese output.
#### Acceptance Criteria #### Acceptance Criteria
1. The OASIS Profile Generator shall render the individual-persona user message with English section headings and prose in place of the current Chinese (entity name, entity type, entity summary, entity attributes, context section, JSON-fields enumeration, "important" trailing block). 1. The OASIS Profile Generator shall render the individual-persona user message with English field labels in place of `实体名称`, `实体类型`, `实体摘要`, `实体属性`, and `上下文信息`.
2. The OASIS Profile Generator shall preserve all variable interpolations verbatim by name: `{entity_name}`, `{entity_type}`, `{entity_summary}`, `{attrs_str}`, `{context_str}`, and the inline `{get_language_instruction()}` call inside the trailing rules block. 2. The OASIS Profile Generator shall render the JSON-field descriptions (the `请生成JSON包含以下字段` enumeration) in English while preserving the eight required output keys verbatim by name (`bio`, `persona`, `age`, `gender`, `mbti`, `country`, `profession`, `interested_topics`).
3. The OASIS Profile Generator shall preserve the JSON output contract enumerated in the prompt: the keys `bio`, `persona`, `age`, `gender`, `mbti`, `country`, `profession`, `interested_topics` (verbatim, English). 3. The OASIS Profile Generator shall preserve the requirement language that `gender` MUST be the literal English token `"male"` or `"female"` for individual entities, and that `age` MUST be a valid integer.
4. The OASIS Profile Generator shall preserve the field-level constraints in the prompt: 4. The OASIS Profile Generator shall preserve the trailing rules block (the `重要:` enumeration) in English, conveying the same constraints: all field values must be strings or numbers, no embedded newlines; persona must be a coherent single text block; the `gender` field uses English `male`/`female`; content must remain consistent with the entity information; `age` must be a valid integer.
- `bio` ≈ 200 characters, social-media biography. 5. The OASIS Profile Generator shall preserve the call to `get_language_instruction()` interpolated into the rules block.
- `persona` ≈ 2000 characters, single coherent text covering: basic information (age, profession, education, location), background (notable experience, event association, social ties), personality (MBTI, core traits, emotional expression), social-media behavior (posting frequency, content preferences, interaction style, language traits), stance (attitudes toward the topic, emotional triggers), unique features (catchphrases, special experiences, hobbies), and personal memory (the entity's relation to the event and prior actions/reactions in it). 6. The OASIS Profile Generator shall preserve all f-string interpolations verbatim by name and position: `{entity_name}`, `{entity_type}`, `{entity_summary}`, `{attrs_str}`, `{context_str}`, `{get_language_instruction()}`.
- `age` MUST be an integer. 7. The OASIS Profile Generator shall replace the no-attributes placeholder `"无"` with the English `"None"` when `entity_attributes` is empty / falsy, and the no-context placeholder `"无额外上下文"` with an English equivalent (e.g. `"No additional context"`) when `context` is empty / falsy.
- `gender` MUST be one of `"male"` or `"female"` (English enum value, locale-independent). 8. The OASIS Profile Generator shall return zero CJK characters across all string literals contributed by `_build_individual_persona_prompt`.
- `mbti` MUST be an MBTI four-letter type (e.g. INTJ, ENFP). 9. The OASIS Profile Generator shall preserve the existing `country` field instruction semantics (a free-form country name is requested) but replace the example `"中国"` with a locale-neutral English phrasing that does not bias the model toward any single country (e.g. `Free-form country name`).
- `country` MUST be a country name string.
- `profession` MUST be a profession string.
- `interested_topics` MUST be an array.
5. The OASIS Profile Generator shall preserve the trailing-block rules verbatim in spirit: every value is a string or number, no newlines inside string values, `persona` is a single coherent text, `gender` must be the English `male`/`female` enum even when locale is `zh`, content must stay consistent with the source entity, `age` must be a valid integer.
6. The OASIS Profile Generator shall preserve the function signature `_build_individual_persona_prompt(self, entity_name: str, entity_type: str, entity_summary: str, entity_attributes: Dict[str, Any], context: str) -> str`.
7. The OASIS Profile Generator shall preserve the `context[:3000]` truncation behaviour and the conditional fallback (`"无额外上下文"` translated to `"No additional context"`) when `context` is empty/falsy. Likewise, `attrs_str` shall fall back to an English placeholder (`"None"`) when `entity_attributes` is empty/falsy, replacing the current `"无"` literal.
8. The OASIS Profile Generator shall return zero Chinese characters across all string literals contributed to the assembled individual-persona prompt body.
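The "zero Chinese characters" criterion could be checked mechanically. The following is a minimal sketch, not part of the change itself: the helper name `contains_cjk` and the chosen Unicode ranges (CJK ideographs, CJK punctuation, fullwidth forms) are assumptions made for illustration.

```python
import re

# Assumed ranges: CJK symbols/punctuation, Extension A, Unified Ideographs,
# and halfwidth/fullwidth forms (covers characters such as the fullwidth colon).
_CJK_PATTERN = re.compile(r"[\u3000-\u303f\u3400-\u4dbf\u4e00-\u9fff\uff00-\uffef]")


def contains_cjk(text: str) -> bool:
    """Return True if the text holds at least one character in the assumed CJK ranges."""
    return _CJK_PATTERN.search(text) is not None


# Hypothetical usage: render the prompt with ASCII-only inputs and check the result.
rendered_prompt = "Entity name: Alice\nEntity type: Student"
print(contains_cjk(rendered_prompt))
```

Rendering the prompt with ASCII-only arguments isolates the literals contributed by the builder, since any CJK hit must then come from the template rather than from the interpolated values.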
### Requirement 3: English Translation of the Group/Institution-Persona User-Message Template ### Requirement 3: English Translation of the Group/Institution-Persona User-Message Template
**Objective:** As a MiroFish operator generating personas for institutional/group entities under `Accept-Language: en`, I want the user-message template constructed by `_build_group_persona_prompt` to be authored in English, so that the rendered prompt does not interleave English `get_language_instruction()` directives with Chinese section headings. **Objective:** As a MiroFish operator running the pipeline under `Accept-Language: en`, I want the group-persona user-message template constructed by `_build_group_persona_prompt` to be authored in English, with the same scope and contract as Requirement 2 but for institutional entities.
#### Acceptance Criteria #### Acceptance Criteria
1. The OASIS Profile Generator shall render the group-persona user message with English section headings and prose in place of the current Chinese. 1. The OASIS Profile Generator shall render the group-persona user message with English field labels in place of `实体名称`, `实体类型`, `实体摘要`, `实体属性`, and `上下文信息`.
2. The OASIS Profile Generator shall preserve all variable interpolations verbatim by name: `{entity_name}`, `{entity_type}`, `{entity_summary}`, `{attrs_str}`, `{context_str}`, and the inline `{get_language_instruction()}` call inside the trailing rules block. 2. The OASIS Profile Generator shall render the JSON-field descriptions (the `请生成JSON包含以下字段` enumeration) in English while preserving the eight required output keys verbatim by name (`bio`, `persona`, `age`, `gender`, `mbti`, `country`, `profession`, `interested_topics`).
3. The OASIS Profile Generator shall preserve the JSON output contract enumerated in the prompt: the keys `bio`, `persona`, `age`, `gender`, `mbti`, `country`, `profession`, `interested_topics` (verbatim, English). 3. The OASIS Profile Generator shall preserve the fixed-value requirements: `age` MUST be the integer literal `30`; `gender` MUST be the literal English token `"other"`.
4. The OASIS Profile Generator shall preserve the field-level constraints in the prompt: 4. The OASIS Profile Generator shall preserve the trailing rules block (the `重要:` enumeration) in English, conveying the same constraints: all field values must be strings or numbers (no nulls); persona must be a coherent single text block (no embedded newlines); the `gender` field uses English `"other"`; `age` must be the integer `30`; the institutional account's voice must match its identity.
- `bio` ≈ 200 characters, an official-account biography that reads as professionally appropriate. 5. The OASIS Profile Generator shall preserve the call to `get_language_instruction()` interpolated into the rules block.
- `persona` ≈ 2000 characters, single coherent text covering: institutional basics (formal name, type, founding background, primary functions), account positioning (account type, target audience, core function), voice (language traits, common phrasing, taboo topics), publishing pattern (content types, publishing frequency, active hours), stance (official position on the core topic, controversy-handling style), special notes (group portrait represented, operational habits), and institutional memory (the institution's relation to the event and prior actions/reactions in it). 6. The OASIS Profile Generator shall preserve all f-string interpolations verbatim by name and position: `{entity_name}`, `{entity_type}`, `{entity_summary}`, `{attrs_str}`, `{context_str}`, `{get_language_instruction()}`.
- `age` MUST be the integer `30` (the institutional virtual-age sentinel). 7. The OASIS Profile Generator shall use the same English placeholders as Requirement 2 for the no-attributes and no-context cases.
- `gender` MUST be the literal `"other"` (English enum value, locale-independent), indicating non-individual. 8. The OASIS Profile Generator shall return zero CJK characters across all string literals contributed by `_build_group_persona_prompt`.
- `mbti` MUST be an MBTI four-letter type used to characterize account voice (e.g. ISTJ for strict/conservative). 9. The OASIS Profile Generator shall preserve the existing `country` field instruction with a locale-neutral English phrasing (matching Requirement 2.9).
- `country` MUST be a country name string.
- `profession` MUST describe institutional function.
- `interested_topics` MUST be an array of focus areas.
5. The OASIS Profile Generator shall preserve the trailing-block rules verbatim in spirit: every value is a string or number, no `null` values, no newlines in string values, `persona` is a single coherent text, `gender` must be the English `"other"` enum even when locale is `zh`, the institutional account voice must match its identity positioning, and `age` must be the integer `30`.
6. The OASIS Profile Generator shall preserve the function signature `_build_group_persona_prompt(self, entity_name: str, entity_type: str, entity_summary: str, entity_attributes: Dict[str, Any], context: str) -> str`.
7. The OASIS Profile Generator shall preserve the `context[:3000]` truncation behaviour and the conditional English-equivalent fallback for empty `context` and empty `entity_attributes`, mirroring Requirement 2.
8. The OASIS Profile Generator shall return zero Chinese characters across all string literals contributed to the assembled group-persona prompt body.
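The placeholder and truncation rules shared by Requirements 2 and 3 can be sketched as follows. This is an illustration under stated assumptions: the helper name `build_placeholders` is hypothetical, and serializing non-empty attributes with `json.dumps` is an assumption about the existing code, not something this spec confirms.

```python
import json
from typing import Any, Dict, Tuple


def build_placeholders(entity_attributes: Dict[str, Any], context: str) -> Tuple[str, str]:
    """Return (attrs_str, context_str) with English fallbacks for empty inputs."""
    # Assumption: non-empty attributes are serialized as JSON without ASCII escaping.
    attrs_str = json.dumps(entity_attributes, ensure_ascii=False) if entity_attributes else "None"
    # The context is truncated to its first 3000 characters, as the spec requires.
    context_str = context[:3000] if context else "No additional context"
    return attrs_str, context_str


attrs_str, context_str = build_placeholders({}, "")
print(attrs_str, "|", context_str)
```

Both fallbacks trigger on any falsy input, so an empty dict, `None`, and an empty string all take the English placeholder path.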
### Requirement 4: Locale Switching Continues to Work via `get_language_instruction()` ### Requirement 4: English Translation of the Context-Builder Section Labels
**Objective:** As a MiroFish operator running the pipeline under `Accept-Language: zh` (or any other configured non-English locale), I want generated personas to remain in the requested locale at equivalent quality, so that translating the base prompt does not regress non-English support. **Objective:** As a MiroFish operator running the pipeline under `Accept-Language: en`, I want the section labels embedded in the context string by `_search_zep_for_entity` and `_build_entity_context` to be in English, so that the prompt context block interpolated into the user message is fully English and the LLM is not biased toward Chinese output by the context labels.
#### Acceptance Criteria #### Acceptance Criteria
1. The OASIS Profile Generator shall preserve every existing `get_language_instruction()` call site exactly: the system-prompt site in `_get_system_prompt`, the inline call inside the trailing rules block of `_build_individual_persona_prompt`, and the inline call inside the trailing rules block of `_build_group_persona_prompt`. 1. The OASIS Profile Generator shall render the related-node prefix (currently `"相关实体: "`) in English (e.g. `"Related entity: "`) in `_search_zep_for_entity`.
2. The OASIS Profile Generator shall preserve the locale-capture/restore plumbing inside `generate_profiles_for_entities` (currently the `current_locale = get_locale()` capture and the `set_locale(current_locale)` call inside `generate_single_profile`) — this code is not modified by the change. 2. The OASIS Profile Generator shall render the facts block heading (currently `"事实信息:"`) in English (e.g. `"Facts:"`) in `_search_zep_for_entity`.
3. While the locale is `zh`, the OASIS Profile Generator shall produce profiles whose `bio`, `persona`, `profession`, and `interested_topics` content is in Chinese, equivalent in quality to the pre-change behaviour. 3. The OASIS Profile Generator shall render the related-entities block heading (currently `"相关实体:"`) in English (e.g. `"Related entities:"`) in `_search_zep_for_entity`.
4. While the locale is `en`, the OASIS Profile Generator shall produce profiles whose `bio`, `persona`, `profession`, and `interested_topics` content is in English. 4. The OASIS Profile Generator shall render the entity-attributes section heading (currently `"### 实体属性"`) in English (e.g. `"### Entity attributes"`) in `_build_entity_context`.
5. While the locale is `en` or `zh`, the OASIS Profile Generator shall produce profiles whose `gender` field is one of the literal English values `"male"`, `"female"` (individual entities) or `"other"` (group entities), regardless of locale. 5. The OASIS Profile Generator shall render the related-facts/relationships section heading (currently `"### 相关事实和关系"`) in English (e.g. `"### Related facts and relationships"`) in `_build_entity_context`.
6. The OASIS Profile Generator shall not alter `backend/app/utils/locale.py`, the `_languages`, the `_translations` registries, or the locales under `/locales/`. 6. The OASIS Profile Generator shall render the related-entity-information section heading (currently `"### 关联实体信息"`) in English (e.g. `"### Related entity information"`) in `_build_entity_context`.
7. The OASIS Profile Generator shall render the Zep-retrieved facts section heading (currently `"### Zep检索到的事实信息"`) in English (e.g. `"### Facts retrieved from the graph"`) in `_build_entity_context`.
8. The OASIS Profile Generator shall render the Zep-retrieved related-nodes section heading (currently `"### Zep检索到的相关节点"`) in English (e.g. `"### Related nodes retrieved from the graph"`) in `_build_entity_context`.
9. The OASIS Profile Generator shall render the inline edge-direction placeholder (currently `(相关实体)`) in English (e.g. `(related entity)`) in both outgoing and incoming branches of `_build_entity_context`.
10. The OASIS Profile Generator shall return zero CJK characters across all section-label string literals contributed by `_search_zep_for_entity` and `_build_entity_context`.
### Requirement 5: Public API and Call-Site Stability ### Requirement 5: English Translation of the Fallback Persona Templates
**Objective:** As a developer maintaining the rest of the MiroFish backend pipeline, I want the public surface of `OasisProfileGenerator` and `OasisAgentProfile` to remain unchanged, so that the Step 2 environment-setup flow and existing callers continue to work without modification. **Objective:** As a MiroFish operator running the pipeline under `Accept-Language: en`, when the LLM JSON parse fails or returns missing fields and the code falls back to a synthesized persona template, I want the fallback persona to be in English so that the resulting profile JSON does not contain unintended Chinese strings.
#### Acceptance Criteria #### Acceptance Criteria
1. The OASIS Profile Generator shall preserve the dataclass `OasisAgentProfile`, including its field set (`user_id`, `user_name`, `name`, `bio`, `persona`, `karma`, `friend_count`, `follower_count`, `statuses_count`, `age`, `gender`, `mbti`, `country`, `profession`, `interested_topics`, `source_entity_uuid`, `source_entity_type`, `created_at`), default values, and the `to_reddit_format`, `to_twitter_format`, `to_full_dict` serializers. 1. The OASIS Profile Generator shall replace the fallback persona template `f"{entity_name}是一个{entity_type}。"` at every occurrence (currently at the persona-validation branch in `_generate_profile_with_llm` line 547, the regex-extraction branch in `_try_fix_json` line 644, and the catastrophic-failure branch line 659) with an English equivalent (e.g. `f"{entity_name} is a {entity_type}."`).
2. The OASIS Profile Generator shall preserve the signatures and call semantics of `OasisProfileGenerator.__init__`, `generate_profile_from_entity`, `generate_profiles_for_entities`, `_call_llm_with_retry`, `_generate_profile_rule_based`, `_get_system_prompt`, `_build_individual_persona_prompt`, `_build_group_persona_prompt`, `_print_generated_profile`, `_fix_truncated_json`, `_try_fix_json`, and `_generate_username`. 2. The OASIS Profile Generator shall preserve the priority order of the fallback chain (`entity_summary or template`).
3. The OASIS Profile Generator shall preserve the LLM invocation parameters (`temperature`, `max_tokens`, model selection, retry behaviour) at the call sites that consume the prompts produced by the translated builders. 3. The OASIS Profile Generator shall return zero CJK characters across all fallback persona literals.
4. The OASIS Profile Generator shall preserve the `PERSONAL_ENTITY_TYPES` and `GROUP_ENTITY_TYPES` taxonomies, the `MBTI_TYPES` list, and the `COMMON_COUNTRIES` list verbatim.
### Requirement 6: Reasoning-Model Output Compatibility ### Requirement 6: English Translation of the Console-Output Formatting
**Objective:** As a MiroFish operator using a reasoning-model provider (e.g. MiniMax, GLM with `<think>` tags or markdown code fences), I want JSON parsing of the persona response to continue working, so that translating the base prompt does not regress provider compatibility. **Objective:** As a MiroFish operator monitoring profile generation in the console under `Accept-Language: en`, I want the per-profile diagnostic banner and the start/end batch banners to be in English so that the entire console stream is consistent with the requested locale.
#### Acceptance Criteria #### Acceptance Criteria
1. The OASIS Profile Generator shall preserve the existing `_fix_truncated_json` and `_try_fix_json` resilience helpers exactly, including their regex-based extraction of `bio` and `persona` from partial output. 1. The OASIS Profile Generator shall render the per-profile section headings in English in `_print_generated_profile`: `【简介】` → `[Bio]`, `【详细人设】` → `[Persona]`, `【基本属性】` → `[Basic attributes]` (or equivalent English markers).
2. If a reasoning-model provider returns truncated, `<think>`-tagged, or markdown-fenced output, then the existing parsing/recovery flow shall continue to apply unchanged. 2. The OASIS Profile Generator shall render the per-profile row labels in English in `_print_generated_profile`: `用户名:` → `Username:`, `年龄:` → `Age:`, `性别:` → `Gender:`, `职业:` → `Profession:`, `国家:` → `Country:`, `兴趣话题:` → `Interested topics:`.
3. The OASIS Profile Generator shall not introduce any new pre-processing of the LLM response that depends on prompt language. 3. The OASIS Profile Generator shall replace the empty-topics sentinel `'无'` in `_print_generated_profile` with an English equivalent (e.g. `'None'`).
4. After translation, the OASIS Profile Generator shall continue to round-trip a representative entity through `generate_profile_from_entity` and produce a JSON object with at minimum a non-empty `bio` and a non-empty `persona`, matching the pre-change behaviour. 4. The OASIS Profile Generator shall render the start-of-batch and end-of-batch banners in `generate_profiles_from_entities` in English: `开始生成Agent人设 - 共 {total} 个实体,并行数: {parallel_count}` → `Generating agent profiles — {total} entities, parallel: {parallel_count}` (or equivalent); `人设生成完成!共生成 {len([p for p in profiles if p])} 个Agent` → `Profile generation complete — produced {n} agents` (or equivalent).
5. The OASIS Profile Generator shall preserve all f-string interpolations in the banners verbatim (`{total}`, `{parallel_count}`, the count expression).
6. The OASIS Profile Generator shall return zero CJK characters across all string literals contributed by `_print_generated_profile` and the surrounding `print(...)` banners in `generate_profiles_from_entities`.
7. The OASIS Profile Generator shall continue to use the existing `t('progress.profileGenerated', ...)` key for the per-profile heading row, since that key is already locale-keyed via the `t()` helper.
### Requirement 7: Step 2 Environment-Setup Parity (OASIS Format Compatibility) ### Requirement 7: Locale Switching Continues to Work via `get_language_instruction()`
**Objective:** As a MiroFish operator validating the change, I want the OASIS subprocess to accept the generated profiles unchanged, so that the translation does not silently break Step 2 → Step 3 hand-off. **Objective:** As a MiroFish operator running the pipeline under `Accept-Language: zh` (or any other configured non-English locale), I want the profile output to remain in the requested locale of equivalent quality, so that translating the base prompt does not regress non-English support.
#### Acceptance Criteria #### Acceptance Criteria
1. While `uv run python -m pytest backend/scripts/test_profile_format.py` runs against the changed code, the test suite shall pass with zero regressions versus the pre-change baseline. 1. The OASIS Profile Generator shall preserve the call to `get_language_instruction()` exactly at its existing locations (currently inside `_get_system_prompt` and inside both `_build_individual_persona_prompt` and `_build_group_persona_prompt` rules blocks), continuing to read locale via the existing thread-local / request-header resolution chain.
2. While a representative Reddit-format profile dictionary is produced under locale `en`, every field name shall match the existing OASIS-required schema: `user_id`, `username`, `name`, `bio`, `persona`, `karma`, `created_at`, plus optional `age`, `gender`, `mbti`, `country`, `profession`, `interested_topics`. 2. When the locale is `zh`, the OASIS Profile Generator shall produce profile JSON whose `bio` and `persona` fields are in Chinese, equivalent in quality to the pre-change behaviour.
3. While a representative Twitter-format profile dictionary is produced under locale `en`, every field name shall match the existing OASIS-required schema: `user_id`, `username`, `name`, `bio`, `persona`, `friend_count`, `follower_count`, `statuses_count`, `created_at`, plus optional `age`, `gender`, `mbti`, `country`, `profession`, `interested_topics`. 3. When the locale is `en`, the OASIS Profile Generator shall produce profile JSON whose `bio` and `persona` fields are in English.
4. The OASIS Profile Generator shall produce `gender` values that are exactly one of `"male"`, `"female"`, `"other"` regardless of locale, satisfying the OASIS subprocess's expected enum. 4. The OASIS Profile Generator shall not alter `backend/app/utils/locale.py`, the `_languages` registry, the `_translations` registries, or the locales under `/locales/`.
### Requirement 8: Out-of-Scope Surfaces Remain Untouched ### Requirement 8: Public API and Call-Site Stability
**Objective:** As a reviewer of this PR, I want the change to remain narrowly scoped to prompt strings, so that translation responsibilities for adjacent surfaces (issues #6, #7, and the rule-based fallback) are not absorbed into this change. **Objective:** As a developer maintaining the rest of the MiroFish backend pipeline, I want the public surface of `OasisProfileGenerator` to remain unchanged, so that the simulation pipeline and existing callers continue to work without modification.
#### Acceptance Criteria #### Acceptance Criteria
1. The change shall not modify any `logger.warning(...)`, `logger.info(...)`, `logger.error(...)`, or `logger.debug(...)` call in `oasis_profile_generator.py` (covered by issue #6). 1. The OASIS Profile Generator shall preserve the signatures of `OasisProfileGenerator.__init__`, `generate_profile_from_entity`, `generate_profiles_from_entities`, `set_graph_id`, `save_profiles`, and `save_profiles_to_json`.
2. The change shall not modify the module docstring, class docstrings, method docstrings, or inline comments in `oasis_profile_generator.py` (covered by issue #7). 2. The OASIS Profile Generator shall preserve the signatures of all private helpers, including `_generate_profile_with_llm`, `_build_individual_persona_prompt`, `_build_group_persona_prompt`, `_get_system_prompt`, `_build_entity_context`, `_search_zep_for_entity`, `_print_generated_profile`, `_normalize_gender`, `_save_twitter_csv`, `_save_reddit_json`, `_try_fix_json`, `_fix_truncated_json`, `_is_individual_entity`, `_is_group_entity`, `_generate_profile_rule_based`, `_generate_username`.
3. The change shall not modify the rule-based fallback Chinese fragments inside `_try_fix_json` (e.g. `f"{entity_name}是一个{entity_type}。"`) and the rule-based path inside `_generate_profile_rule_based` — those are runtime data fallbacks, not LLM prompts, and remain out of scope here. 3. The OASIS Profile Generator shall preserve the return shape of `generate_profile_from_entity` (a populated `OasisAgentProfile` dataclass instance) and `generate_profiles_from_entities` (a `List[OasisAgentProfile]`).
4. The change shall not edit any file outside `backend/app/services/oasis_profile_generator.py` for production code. 4. The OASIS Profile Generator shall preserve the LLM invocation parameters (`response_format={"type": "json_object"}`, the `temperature=0.7 - (attempt * 0.1)` schedule, the absence of `max_tokens`) and the call to `self.client.chat.completions.create(...)`.
5. The change shall not introduce a new dependency or modify `backend/pyproject.toml` / `backend/uv.lock`. 5. The OASIS Profile Generator shall preserve the `_normalize_gender` mapping table verbatim (the Chinese keys `男`, `女`, `机构`, `其他` continue to accept upstream Chinese input).
6. The change shall not modify `backend/scripts/test_profile_format.py` (the test is the contract; the implementation must match it). 6. The OASIS Profile Generator shall preserve the rule-based `country: "中国"` default in `_generate_profile_rule_based` (this is a data value, not a prompt; changing it is out of scope per the boundary commitments).
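The gender-normalization contract referenced above can be illustrated with a small sketch. This is not the existing `_normalize_gender` implementation: the function name, the English-key entries, the target values for `机构` and `其他`, and the default of `"other"` for unknown input are all assumptions chosen to match the `male` / `female` / `other` enum described in this spec.

```python
from typing import Optional

# Assumed mapping: the Chinese keys are kept so upstream Chinese input still normalizes.
_GENDER_MAP = {
    "男": "male",
    "女": "female",
    "机构": "other",
    "其他": "other",
    "male": "male",
    "female": "female",
    "other": "other",
}


def normalize_gender(value: Optional[str]) -> str:
    """Map a raw gender value onto the English enum; unknown values become 'other' (assumed)."""
    if not value:
        return "other"
    return _GENDER_MAP.get(str(value).strip().lower(), "other")


print(normalize_gender("女"))
```

Keeping the Chinese keys in the table is what lets the prompts move to English without breaking responses from models that still answer with Chinese gender values.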
### Requirement 9: Reasoning-Model Output Compatibility
**Objective:** As a MiroFish operator using a reasoning-model provider (e.g. MiniMax, GLM with `<think>` tags or markdown code fences), I want JSON parsing of the profile response to continue working, so that translating the base prompt does not regress provider compatibility.
#### Acceptance Criteria
1. The OASIS Profile Generator shall continue to call `self.client.chat.completions.create(...)` with `response_format={"type": "json_object"}` and parse the response via the existing `json.loads` / `_try_fix_json` / `_fix_truncated_json` chain unchanged.
2. The OASIS Profile Generator shall not introduce any new pre-processing of the LLM response that depends on prompt language.
3. The fallback persona templates from Requirement 5 shall be safe to embed in JSON (no embedded raw newlines, balanced quotes).
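The fallback chain from Requirement 5 and the JSON-safety criterion above can be sketched together. The helper name `fallback_persona` is hypothetical; the sketch only assumes the `entity_summary or template` priority order and the English template quoted in this spec.

```python
import json


def fallback_persona(entity_name: str, entity_type: str, entity_summary: str = "") -> str:
    """Prefer the entity summary; otherwise synthesize the English one-line template."""
    return entity_summary or f"{entity_name} is a {entity_type}."


# The template has no raw newlines, and json.dumps escapes any quotes
# that arrive through the interpolated values.
profile = {"bio": "n/a", "persona": fallback_persona('Acme "Labs"', "Organization")}
encoded = json.dumps(profile, ensure_ascii=False)
decoded = json.loads(encoded)
print(decoded["persona"])
```

The round trip through `json.dumps` and `json.loads` is the property that matters here: the fallback text survives serialization unchanged even when the entity name contains quote characters.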
### Requirement 10: Out-of-Scope Surfaces Remain Untouched
**Objective:** As a reviewer of this PR, I want the change to remain narrowly scoped to prompt strings and the immediately-adjacent context labels and console output, so that translation responsibilities for adjacent surfaces (issues #6 and #7) are not absorbed into this change.
#### Acceptance Criteria
1. The change shall not modify any `logger.warning(...)`, `logger.info(...)`, `logger.error(...)`, or `logger.debug(...)` call in `oasis_profile_generator.py` (covered by issues #6 / #24 / #25-style backend-log work — the calls already use `t("log.profile_generator.*")`).
2. The change shall not modify the module docstring, class docstrings, method docstrings, or inline comments in `oasis_profile_generator.py` (covered by issue #7), including the inline comments at lines 65, 93, 641, 804-807, 816-819, etc.
3. The change shall not modify the `_normalize_gender` mapping table (Chinese gender keys must remain to handle upstream input).
4. The change shall not modify the rule-based `country: "中国"` default in `_generate_profile_rule_based`.
5. The change shall not modify the `ValueError("LLM_API_KEY 未配置")` raise (covered by issue #6).
6. The change shall not edit any file outside `backend/app/services/oasis_profile_generator.py` for production code, except for adding test fixtures or scripts under a clearly-isolated directory if a verification harness is needed.
7. The change shall not introduce a new dependency or modify `backend/pyproject.toml` / `backend/uv.lock`.

@ -1,10 +1,9 @@
{ {
"feature_name": "i18n-oasis-profile-generator-prompts", "feature_name": "i18n-oasis-profile-generator-prompts",
"created_at": "2026-05-08T05:26:06Z", "created_at": "2026-05-07T22:50:00Z",
"updated_at": "2026-05-08T05:30:00Z", "updated_at": "2026-05-07T22:50:00Z",
"language": "en", "language": "en",
"phase": "tasks-generated", "phase": "tasks-generated",
"ticket": 3,
"approvals": { "approvals": {
"requirements": { "requirements": {
"generated": true, "generated": true,
@ -19,5 +18,10 @@
"approved": true "approved": true
} }
}, },
"ready_for_implementation": true "ready_for_implementation": true,
"ticket": {
"number": 25,
"url": "https://github.com/salestech-group/MiroFish/issues/25",
"snapshot": ".ticket/25.md"
}
} }

@ -1,66 +1,92 @@
# Implementation Plan # Implementation Plan
- [x] 1. Translate the system-prompt builder to English - [ ] 1. Translate the system-prompt base string in `_get_system_prompt`
- Replace the Chinese `base_prompt` literal inside `_get_system_prompt` (currently `"你是社交媒体用户画像生成专家。…"` at line ~664) with an English rendering that conveys the same role and intent: identifies the model as an expert in social-media user-persona generation, asks for detailed and realistic personas suitable for opinion-simulation that faithfully reflect existing real-world conditions, mandates valid JSON output, and forbids unescaped newlines inside string values - Replace the body of `base_prompt` (currently `"你是社交媒体用户画像生成专家。生成详细、真实的人设用于舆论模拟,最大程度还原已有现实情况。必须返回有效的JSON格式所有字符串值不能包含未转义的换行符。"`) with an English equivalent that preserves the same intent: define the LLM as an expert social-media-persona generator; require detailed, realistic personas grounded in supplied context; require valid JSON output; forbid unescaped newlines in string values
- Preserve the assembled return shape `f"{base_prompt}\n\n{get_language_instruction()}"` exactly — the call to `get_language_instruction()` is unchanged in name and position - Preserve the trailing `f"{base_prompt}\n\n{get_language_instruction()}"` concatenation site exactly
- Preserve the method signature `_get_system_prompt(self, is_individual: bool) -> str`; do not branch on `is_individual` (current behaviour preserved) - Preserve the `is_individual` parameter (still accepted, still unused — no signature change)
- Observable completion: `_get_system_prompt(True)` and `_get_system_prompt(False)` both return non-empty English strings ending with the per-locale postfix from `get_language_instruction()`; the `base_prompt` body contains zero CJK characters - Observable completion: `_get_system_prompt(...)` returns an English-only base prompt followed by the locale-appropriate `get_language_instruction()` postfix
- _Requirements: 1.1, 1.2, 1.3, 1.4_ - _Requirements: 1.1, 1.2, 1.3, 1.4_
- [x] 2. Translate the individual-persona user-message builder to English - [ ] 2. Translate the individual-persona user-message template in `_build_individual_persona_prompt`
- Replace the Chinese f-string body inside `_build_individual_persona_prompt` (currently lines ~680-714) with an English rendering structured as: a lead sentence requesting a detailed social-media persona faithful to existing reality; an entity-context block with English labels for `entity_name`, `entity_type`, `entity_summary`, `entity_attributes`; a `Context information:` block; a `Generate JSON with the following fields:` enumeration of the eight output keys (`bio`, `persona`, `age`, `gender`, `mbti`, `country`, `profession`, `interested_topics`); and a trailing `Important:` rules block - Replace the introductory line (`"为实体生成详细的社交媒体用户人设,..."`) with an English equivalent
- Translate the field-level descriptions verbatim in spirit: `bio` ≈ 200 chars; `persona` ≈ 2000 chars covering basic info (age, profession, education, location), background (notable experience, event association, social ties), personality (MBTI, core traits, emotional expression), social-media behaviour (posting frequency, content preferences, interaction style, language traits), stance (attitudes toward the topic, emotional triggers), unique features (catchphrases, special experiences, hobbies), and personal memory (the entity's relation to the event and prior actions/reactions); `age` integer; `gender` MUST be the literal `"male"` or `"female"`; `mbti` four-letter type; `country` country name; `profession`; `interested_topics` array - Replace the field-label rows (`实体名称`, `实体类型`, `实体摘要`, `实体属性`, `上下文信息`) with English equivalents
- Translate the trailing rules block to English while keeping every locale-independent constraint intact: all values are strings or numbers; `persona` is a single coherent text without unescaped newlines; the inline `{get_language_instruction()}` call remains followed by the parenthetical reminder that `gender` MUST use the English values `"male"` / `"female"`; content stays consistent with the entity; `age` MUST be a valid integer - Replace the `请生成JSON包含以下字段:` enumeration block with an English equivalent that preserves the eight required output keys verbatim by name (`bio`, `persona`, `age`, `gender`, `mbti`, `country`, `profession`, `interested_topics`)
- Replace the `attrs_str` and `context_str` Chinese fallback defaults with English: `"无"` → `"None"` (used when `entity_attributes` is empty/falsy) and `"无额外上下文"` → `"No additional context"` (used when `context` is empty/falsy) - Translate the per-field guidance: `bio` is a 200-character social-media bio; `persona` is a coherent ~2000-character text containing basic info, background, personality (with MBTI), social-media behavior, stance, distinctive traits, and event-specific memories; `age` must be an integer; `gender` must be the literal English token `"male"` or `"female"`; `mbti` is an MBTI four-letter code; `country` is a free-form country name; `profession` is a free-form occupation; `interested_topics` is a list of topics
- Drop the country-language hint `(使用中文,如"中国"` so `get_language_instruction()` steers the country language; preserve the country line as a neutral `country: country name` entry - Replace the trailing `重要:` rules block with an English equivalent: all field values must be strings or numbers, no embedded newlines; persona must be a coherent single text block; `gender` must use English `male`/`female`; content must remain consistent with the entity information; `age` must be a valid integer
- Preserve the call to `get_language_instruction()` interpolated into the rules block
- Replace the `attrs_str` no-attributes placeholder `"无"` with `"None"` (or English equivalent) at line 677
- Replace the `context_str` no-context placeholder `"无额外上下文"` with `"No additional context"` (or English equivalent) at line 678
- Preserve every f-string interpolation by name and position: `{entity_name}`, `{entity_type}`, `{entity_summary}`, `{attrs_str}`, `{context_str}`, `{get_language_instruction()}` - Preserve every f-string interpolation by name and position: `{entity_name}`, `{entity_type}`, `{entity_summary}`, `{attrs_str}`, `{context_str}`, `{get_language_instruction()}`
- Preserve the `context[:3000]` truncation behaviour and the method signature `_build_individual_persona_prompt(self, entity_name: str, entity_type: str, entity_summary: str, entity_attributes: Dict[str, Any], context: str) -> str` - Observable completion: `_build_individual_persona_prompt(...)` produces an English-only message body for any input combination, with zero CJK characters in any string literal it contributes; under the same inputs as before, all interpolated values still appear in the rendered output
- Observable completion: calling `_build_individual_persona_prompt("Alice", "Student", "summary", {"k": "v"}, "ctx")` returns a non-empty English string with all six interpolations resolved, with zero CJK characters in any literal contributed by this method, and the string contains the `gender` enum lock-in `"male"` / `"female"` exactly once - _Requirements: 2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 2.7, 2.8, 2.9_
- _Requirements: 2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 2.7, 2.8, 4.1, 4.5_
- [x] 3. Translate the group/institution-persona user-message builder to English - [ ] 3. Translate the group-persona user-message template in `_build_group_persona_prompt`
- Replace the Chinese f-string body inside `_build_group_persona_prompt` (currently lines ~729–762) with an English rendering structured the same way as Task 2 but adapted for institutional voice: lead sentence requesting a detailed social-media account profile for an institution/group faithful to existing reality; entity-context block; `Context information:` block; `Generate JSON with the following fields:` enumeration of the eight output keys; trailing `Important:` rules block - Replace the introductory line (`"为机构/群体实体生成详细的社交媒体账号设定,..."`) with an English equivalent
- Translate the field-level descriptions verbatim in spirit: `bio` ≈ 200 chars in an official-account voice; `persona` ≈ 2000 chars covering institutional basics (formal name, type, founding background, primary functions), account positioning (account type, target audience, core function), voice (language traits, common phrasing, taboo topics), publishing pattern (content types, publishing frequency, active hours), stance (official position on the core topic, controversy-handling style), special notes (group portrait represented, operational habits), and institutional memory (the institution's relation to the event and prior actions/reactions); `age` MUST be the integer `30`; `gender` MUST be the literal `"other"`; `mbti` four-letter type characterizing account voice; `country`; `profession` describes institutional function; `interested_topics` array - Replace the field-label rows (`实体名称`, `实体类型`, `实体摘要`, `实体属性`, `上下文信息`) with English equivalents (matching task 2)
- Translate the trailing rules block to English while keeping every locale-independent constraint intact: all values are strings or numbers, no `null` allowed; `persona` is a single coherent text without unescaped newlines; the inline `{get_language_instruction()}` call remains followed by the parenthetical reminder that `gender` MUST use the English value `"other"`; `age` MUST be the integer `30` and `gender` MUST be the string `"other"`; account voice must match identity positioning - Replace the `请生成JSON包含以下字段:` enumeration block with an English equivalent that preserves the eight required output keys verbatim by name (`bio`, `persona`, `age`, `gender`, `mbti`, `country`, `profession`, `interested_topics`)
- Replace the `attrs_str` and `context_str` Chinese fallback defaults with the same English replacements applied in Task 2 (`"None"` and `"No additional context"`) - Translate the per-field guidance: `bio` is a polished ~200-character official-account bio; `persona` is a coherent ~2000-character text covering institutional background, account positioning, voice, content patterns, official stance, distinctive traits, and event-specific memories; `age` must be the integer literal `30`; `gender` must be the literal English token `"other"`; `mbti` describes account voice; `country` is a free-form country name; `profession` is the institution's role; `interested_topics` is a list of focus areas
- Drop the country-language hint as in Task 2 - Replace the trailing `重要:` rules block with an English equivalent: all field values must be strings or numbers (no nulls); persona must be a coherent single text block (no embedded newlines); `gender` must use English `"other"`; `age` must be the integer `30`; the institutional account's voice must match its identity
- Preserve every f-string interpolation by name and position: `{entity_name}`, `{entity_type}`, `{entity_summary}`, `{attrs_str}`, `{context_str}`, `{get_language_instruction()}` - Preserve the call to `get_language_instruction()` interpolated into the rules block
- Preserve the `context[:3000]` truncation behaviour and the method signature `_build_group_persona_prompt(self, entity_name: str, entity_type: str, entity_summary: str, entity_attributes: Dict[str, Any], context: str) -> str` - Replace the `attrs_str` and `context_str` placeholders the same way as in task 2 (lines 726, 727)
- Observable completion: calling `_build_group_persona_prompt("ACME Corp", "Organization", "summary", {"k": "v"}, "ctx")` returns a non-empty English string with all six interpolations resolved, with zero CJK characters in any literal contributed by this method, and the string contains both the `age == 30` lock-in and the `gender == "other"` lock-in - Preserve every f-string interpolation by name and position
- _Requirements: 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7, 3.8, 4.1, 4.5_ - Observable completion: `_build_group_persona_prompt(...)` produces an English-only message body for any input combination, with zero CJK characters; under the same inputs as before, all interpolated values still appear in the rendered output
- _Requirements: 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7, 3.8, 3.9_
- [x] 4. Confirm boundary commitments around the translation - [ ] 4. Translate the section labels in `_search_zep_for_entity` and `_build_entity_context`
- Confirm every existing `get_language_instruction()` call site is preserved verbatim: the system-prompt assembly inside `_get_system_prompt`, the inline call inside the trailing rules block of `_build_individual_persona_prompt`, and the inline call inside the trailing rules block of `_build_group_persona_prompt` - Replace the related-node prefix `f"相关实体: {node.name}"` with an English equivalent (e.g. `f"Related entity: {node.name}"`) at line 384
- Confirm the locale-thread plumbing in `generate_profiles_for_entities` (capture `current_locale = get_locale()` at line ~910 and `set_locale(current_locale)` inside the worker at line ~914) is byte-identical - Replace the facts block heading `"事实信息:\n"` with `"Facts:\n"` (or equivalent) at line 390
- Confirm the public signatures of `OasisProfileGenerator.__init__`, `generate_profile_from_entity`, `generate_profiles_for_entities`, `set_graph_id`, and the private helpers `_call_llm_with_retry`, `_generate_profile_rule_based`, `_print_generated_profile`, `_fix_truncated_json`, `_try_fix_json`, `_save_twitter_csv`, `_save_reddit_json`, `_generate_username` are unchanged - Replace the related-entities block heading `"相关实体:\n"` with `"Related entities:\n"` (or equivalent) at line 392
- Confirm the `OasisAgentProfile` dataclass field set, default values, and the `to_reddit_format`, `to_twitter_format`, `to_full_dict` serializers are unchanged - Replace the entity-attributes section heading `"### 实体属性\n"` with `"### Entity attributes\n"` (or equivalent) at line 422
- Confirm class constants `MBTI_TYPES`, `COUNTRIES`, `INDIVIDUAL_ENTITY_TYPES`, `GROUP_ENTITY_TYPES` are unchanged - Replace the inline edge-direction placeholder `(相关实体)` with `(related entity)` (or equivalent) at lines 438 and 440 (both outgoing and incoming branches)
- Confirm the LLM invocation parameters at the call site that consumes the translated prompts (`response_format={"type": "json_object"}`, `temperature=0.7 - (attempt * 0.1)`, `max_attempts=3`) are unchanged - Replace the related-facts/relationships section heading `"### 相关事实和关系\n"` with `"### Related facts and relationships\n"` (or equivalent) at line 443
- Confirm `_fix_truncated_json` and `_try_fix_json` (including their Chinese persona fragments such as `f"{entity_name}是一个{entity_type}。"`) are not modified — these are runtime data fallbacks, not prompts, and are out of scope - Replace the related-entity-information section heading `"### 关联实体信息\n"` with `"### Related entity information\n"` (or equivalent) at line 463
- Confirm `_generate_profile_rule_based` is not modified — including its Chinese country defaults `"中国"` at lines ~807 and ~819 - Replace the Zep-retrieved facts section heading `"### Zep检索到的事实信息\n"` with `"### Facts retrieved from the graph\n"` (or equivalent) at line 472
- Confirm `backend/app/utils/locale.py`, `/locales/languages.json`, `/locales/en.json`, and `/locales/zh.json` are not modified - Replace the Zep-retrieved related-nodes section heading `"### Zep检索到的相关节点\n"` with `"### Related nodes retrieved from the graph\n"` (or equivalent) at line 475
- Confirm `logger.warning(...)`, `logger.info(...)`, `logger.error(...)`, the print banner at line ~945, module / class / method docstrings, and inline comments in `oasis_profile_generator.py` are not modified (owned by issues #6 and #7) - Preserve the structure (heading + bulleted body, joined by `"\n".join(...)`)
- Confirm `backend/scripts/test_profile_format.py`, `backend/pyproject.toml`, `backend/uv.lock`, and any file outside `backend/app/services/oasis_profile_generator.py` are not modified - Observable completion: the context string returned by `_build_entity_context(...)` contains zero CJK characters in section labels for any input
- Observable completion: a `git diff` review against `main` shows changes only inside `backend/app/services/oasis_profile_generator.py`, only inside `_get_system_prompt`, `_build_individual_persona_prompt`, `_build_group_persona_prompt`, and the surrounding lines (method headers, neighbouring methods) are byte-identical - _Requirements: 4.1, 4.2, 4.3, 4.4, 4.5, 4.6, 4.7, 4.8, 4.9, 4.10_
- _Requirements: 1.4, 2.6, 3.6, 4.1, 4.2, 4.6, 5.1, 5.2, 5.3, 5.4, 6.1, 6.3, 8.1, 8.2, 8.3, 8.4, 8.5, 8.6_
- [x] 5. Verify smoke import and OASIS profile-format pytest - [ ] 5. Translate the fallback persona templates
- Run `cd backend && uv run python -c "from app.services.oasis_profile_generator import OasisProfileGenerator, OasisAgentProfile"` and confirm it exits 0 (catches f-string syntax errors) - Replace `f"{entity_name}是一个{entity_type}。"` with `f"{entity_name} is a {entity_type}."` (or equivalent) at line 547 (`_generate_profile_with_llm`, missing-persona branch)
- Run `cd backend && uv run python -m pytest backend/scripts/test_profile_format.py` (or equivalent invocation per project convention) and confirm it passes — the test does not exercise prompts, so a pure-translation diff must keep it green - Replace the same template at line 644 (`_try_fix_json`, regex-extraction branch)
- Construct an instance of `OasisProfileGenerator` (using `OasisProfileGenerator.__new__(OasisProfileGenerator)` to skip `__init__` if the LLM key is unavailable, mirroring the pattern in `test_profile_format.py`) and confirm `_get_system_prompt(True)`, `_build_individual_persona_prompt("Alice", "Student", "summary", {"k": "v"}, "ctx")`, and `_build_group_persona_prompt("ACME", "Organization", "summary", {"k": "v"}, "ctx")` each return a string with zero CJK matches against the regex `[一-鿿]` - Replace the same template at line 659 (`_try_fix_json`, catastrophic-failure branch)
- Observable completion: smoke import exits 0; pytest passes with zero regressions; the three prompt-builder calls each produce English-only output under the default `zh` locale (the `get_language_instruction()` postfix at the end is the only place where Chinese is allowed to appear, and only when locale is `zh`) - Preserve the `entity_summary or template` priority order at every site
- _Requirements: 6.4, 7.1, 7.2, 7.3, 7.4_ - Observable completion: when the LLM fails JSON parse and the fallback template is invoked, the resulting `persona` value is English
- _Requirements: 5.1, 5.2, 5.3_
- [x] 6. Verify locale-driven output language under both `en` and `zh` - [ ] 6. Translate the console-output formatting in `_print_generated_profile` and the surrounding banners
- With the thread-local locale forced via `set_locale("en")`, render each of the three builders against representative inputs and confirm: each output contains zero CJK characters; each ends with the English locale postfix `"Please respond in English."`; the `gender` enum constraint appears as English `"male"` / `"female"` (individual) or `"other"` (group) - Replace the section headings in `_print_generated_profile`: `f"【简介】"` → English equivalent (e.g. `"[Bio]"`), `f"【详细人设】"` → English equivalent (e.g. `"[Persona]"`), `f"【基本属性】"` → English equivalent (e.g. `"[Basic attributes]"`)
- With `set_locale("zh")`, render the same three builders and confirm: the per-prompt body remains English-only (the translated base prompt does not depend on locale); each ends with the Chinese locale postfix `"请使用中文回答。"`; the `gender` enum constraint still appears as the English literal values - Replace the row labels in `_print_generated_profile`: `f"用户名:"` → `f"Username: {profile.user_name}"`, `f"年龄: {profile.age} | 性别: {profile.gender} | MBTI: {profile.mbti}"` → `f"Age: {profile.age} | Gender: {profile.gender} | MBTI: {profile.mbti}"`, `f"职业: {profile.profession} | 国家: {profile.country}"` → `f"Profession: {profile.profession} | Country: {profile.country}"`, `f"兴趣话题: {topics_str}"` → `f"Interested topics: {topics_str}"`
- Optionally, with a configured LLM key, run `OasisProfileGenerator().generate_profile_from_entity(...)` end-to-end under each locale against a synthetic `EntityNode` and spot-check that the produced `bio`, `persona`, `profession` are English under `en` and Chinese under `zh`, while `gender` is one of the three English enum literals under both - Replace the empty-topics sentinel `'无'` with `'None'` (or equivalent) at line 1011
- Observable completion: the locale-`en` rendering is CJK-free in the prompt body and ends with the English locale postfix; the locale-`zh` rendering preserves the prompt body in English and ends with the Chinese locale postfix; if the LLM round-trip is exercised, results are recorded in the PR description - Replace the start-of-batch banner in `generate_profiles_from_entities` (currently `f"开始生成Agent人设 - 共 {total} 个实体,并行数: {parallel_count}"` at line 945) with an English equivalent (e.g. `f"Generating agent profiles — {total} entities, parallel: {parallel_count}"`)
- _Requirements: 4.3, 4.4, 4.5_ - Replace the end-of-batch banner (currently `f"人设生成完成!共生成 {len([p for p in profiles if p])} 个Agent"` at line 1001) with an English equivalent (e.g. `f"Profile generation complete — produced {len([p for p in profiles if p])} agents"`)
- Preserve all f-string interpolations
- Preserve the existing `t('progress.profileGenerated', name=entity_name, type=entity_type)` call (already locale-keyed)
- Observable completion: the console output stream contains zero CJK characters in literals contributed by `_print_generated_profile` and the two batch banners (the entity name itself may still contain CJK because it is data, not a literal)
- _Requirements: 6.1, 6.2, 6.3, 6.4, 6.5, 6.6, 6.7_
- [x] 7. Final CJK regression sweep on the three builders - [ ] 7. Confirm boundary commitments around the translation
- Run a regex audit limited to the three method bodies (`_get_system_prompt`, `_build_individual_persona_prompt`, `_build_group_persona_prompt`) using the project-level CJK guard regex (`[一-鿿]`) and confirm zero matches inside their string literals - Confirm `logger.warning(...)`, `logger.info(...)`, `logger.error(...)`, `logger.debug(...)` calls and their `t("log.profile_generator.*")` keys in this file are unchanged
- Run a CJK audit on the rendered output of the three builders for representative inputs and confirm zero matches in the prompt body (the locale postfix is excluded — its Chinese form is a deliberate kept use under `zh`) - Confirm the module/class/method docstrings and inline comments are unchanged (including lines 65, 93, 641, 804–807, 816–819)
- Confirm the file-level `git grep -nE '[\\x{4e00}-\\x{9fff}]' -- backend/app/services/oasis_profile_generator.py` output still flags only known out-of-scope locations: docstrings, comments, logger keys, rule-based fallback country `"中国"` defaults, and resilience-helper Chinese fragments — and does not flag any line inside the three translated method bodies - Confirm `_normalize_gender` mapping table (Chinese keys `男`/`女`/`机构`/`其他`) is unchanged
- Observable completion: the targeted regex audit returns zero matches inside the three method bodies; the file-level audit's residual CJK lines all fall outside the three method bodies and match the out-of-scope inventory in `design.md` § Boundary Commitments → Out of Boundary - Confirm the rule-based `country: "中国"` default at lines 807, 819 is unchanged
- _Requirements: 1.1, 2.8, 3.8, 8.1, 8.2, 8.3_ - Confirm the `ValueError("LLM_API_KEY 未配置")` raise at line 194 is unchanged
- Confirm public signatures (`__init__`, `generate_profile_from_entity`, `generate_profiles_from_entities`, `set_graph_id`, `save_profiles`, `save_profiles_to_json`) and private helper signatures are unchanged
- Confirm the `OasisAgentProfile` dataclass schema is unchanged
- Confirm the LLM call (`response_format={"type": "json_object"}`, `temperature=0.7 - (attempt * 0.1)`, no `max_tokens`) is unchanged
- Confirm `backend/app/utils/locale.py`, `/locales/languages.json`, `/locales/en.json`, `/locales/zh.json` are not modified
- Confirm `backend/pyproject.toml`, `backend/uv.lock`, and any file outside `backend/app/services/oasis_profile_generator.py` are not modified
- Observable completion: a `git diff` review against `main` shows changes only inside `backend/app/services/oasis_profile_generator.py`, only inside the seven owned regions
- _Requirements: 7.1, 7.4, 8.1, 8.2, 8.3, 8.4, 8.5, 8.6, 10.1, 10.2, 10.3, 10.4, 10.5, 10.6, 10.7_
- [ ] 8. Verify CJK-free invariant in the seven owned regions
- Run a one-shot script that imports `OasisProfileGenerator`, calls `_build_individual_persona_prompt(...)`, `_build_group_persona_prompt(...)`, `_get_system_prompt(...)`, and `_build_entity_context(...)` with representative inputs that contain no CJK in the inputs themselves, and asserts the rendered output contains zero matches against the regex `[一-鿿]`
- Manually inspect the seven owned regions in the patched file with a CJK regex (`grep -nP '[\x{4e00}-\x{9fff}]'`) and confirm there are no remaining matches inside the owned regions
- Observable completion: the inspection passes; if it fails, fix the offending region and re-run before completing this task
- _Requirements: 1.1, 2.8, 3.8, 4.10, 5.3, 6.6_
- [ ] 9. Verify locale-driven output language under both `en` and `zh`
- Set the thread-local locale to `en` via `set_locale("en")`, run `OasisProfileGenerator().generate_profile_from_entity(...)` against the configured LLM with a small representative entity, and confirm the returned `bio` and `persona` are in English
- Set the thread-local locale to `zh` via `set_locale("zh")`, run the same round-trip, and confirm the returned `bio` and `persona` are in Chinese, equivalent in quality to the pre-change baseline
- Observable completion: both runs succeed; the `en` run is CJK-free in `bio` and `persona`; the `zh` run continues to produce Chinese; results recorded in the PR description
- _Requirements: 7.2, 7.3_
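
The CJK-free check described in the verification tasks could be sketched roughly as follows. This is a minimal illustration, not the project's actual script: `build_prompt_stub` is a hypothetical stand-in for the translated prompt builders (the real methods live on `OasisProfileGenerator` and are not imported here), and `find_cjk` / `assert_cjk_free` are helper names invented for this example. Only the regex range `U+4E00..U+9FFF` and the two English fallback strings are taken from the tasks above.

```python
import re
from typing import Any, Dict, List

# CJK Unified Ideographs block (U+4E00..U+9FFF), the range the tasks audit with.
CJK_RE = re.compile(r"[\u4e00-\u9fff]")


def find_cjk(text: str) -> List[str]:
    """Return every CJK ideograph found in text, in order of appearance."""
    return CJK_RE.findall(text)


def build_prompt_stub(entity_name: str, attrs: Dict[str, Any], context: str) -> str:
    """Hypothetical stand-in for a translated prompt builder (not the real method)."""
    # English fallbacks as named in the tasks above.
    attrs_str = str(attrs) if attrs else "None"
    context_str = context[:3000] if context else "No additional context"
    return (
        f"Entity name: {entity_name}\n"
        f"Entity attributes: {attrs_str}\n"
        f"Context:\n{context_str}\n"
    )


def assert_cjk_free(label: str, text: str) -> None:
    """Raise AssertionError naming the offending characters if any CJK is present."""
    hits = find_cjk(text)
    if hits:
        raise AssertionError(f"{label}: found CJK characters {hits!r}")


if __name__ == "__main__":
    # Inputs deliberately contain no CJK, so any hit must come from a literal.
    rendered = build_prompt_stub("Alice", {}, "")
    assert_cjk_free("stub prompt", rendered)
    print("stub prompt is CJK-free")
```

In a real run the stub would be swapped for calls to the actual builders, and the locale postfix from `get_language_instruction()` would need to be excluded from the check under `zh`, since its Chinese form is intentional.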


@ -374,15 +374,15 @@ class OasisProfileGenerator:
if hasattr(node, 'summary') and node.summary: if hasattr(node, 'summary') and node.summary:
all_summaries.add(node.summary) all_summaries.add(node.summary)
if hasattr(node, 'name') and node.name and node.name != entity_name: if hasattr(node, 'name') and node.name and node.name != entity_name:
all_summaries.add(f"相关实体: {node.name}") all_summaries.add(f"Related entity: {node.name}")
results["node_summaries"] = list(all_summaries) results["node_summaries"] = list(all_summaries)
# Assemble the combined context block. # Assemble the combined context block.
context_parts = [] context_parts = []
if results["facts"]: if results["facts"]:
context_parts.append("事实信息:\n" + "\n".join(f"- {f}" for f in results["facts"][:20])) context_parts.append("Facts:\n" + "\n".join(f"- {f}" for f in results["facts"][:20]))
if results["node_summaries"]: if results["node_summaries"]:
context_parts.append("相关实体:\n" + "\n".join(f"- {s}" for s in results["node_summaries"][:10])) context_parts.append("Related entities:\n" + "\n".join(f"- {s}" for s in results["node_summaries"][:10]))
results["context"] = "\n\n".join(context_parts) results["context"] = "\n\n".join(context_parts)
logger.info(t("log.profile_generator.m006", entity_name=entity_name, len=len(results['facts']), len_2=len(results['node_summaries']))) logger.info(t("log.profile_generator.m006", entity_name=entity_name, len=len(results['facts']), len_2=len(results['node_summaries'])))
@ -411,7 +411,7 @@ class OasisProfileGenerator:
if value and str(value).strip(): if value and str(value).strip():
attrs.append(f"- {key}: {value}") attrs.append(f"- {key}: {value}")
if attrs: if attrs:
context_parts.append("### 实体属性\n" + "\n".join(attrs)) context_parts.append("### Entity attributes\n" + "\n".join(attrs))
# 2. Related edges (facts / relationships). # 2. Related edges (facts / relationships).
existing_facts = set() existing_facts = set()
@ -427,12 +427,12 @@ class OasisProfileGenerator:
existing_facts.add(fact) existing_facts.add(fact)
elif edge_name: elif edge_name:
if direction == "outgoing": if direction == "outgoing":
relationships.append(f"- {entity.name} --[{edge_name}]--> (相关实体)") relationships.append(f"- {entity.name} --[{edge_name}]--> (related entity)")
else: else:
relationships.append(f"- (相关实体) --[{edge_name}]--> {entity.name}") relationships.append(f"- (related entity) --[{edge_name}]--> {entity.name}")
if relationships: if relationships:
context_parts.append("### 相关事实和关系\n" + "\n".join(relationships)) context_parts.append("### Related facts and relationships\n" + "\n".join(relationships))
# 3. Detailed information for related nodes. # 3. Detailed information for related nodes.
if entity.related_nodes: if entity.related_nodes:
@ -452,7 +452,7 @@ class OasisProfileGenerator:
related_info.append(f"- **{node_name}**{label_str}") related_info.append(f"- **{node_name}**{label_str}")
if related_info: if related_info:
context_parts.append("### 关联实体信息\n" + "\n".join(related_info)) context_parts.append("### Related entity information\n" + "\n".join(related_info))
# 4. Augment with Zep hybrid retrieval. # 4. Augment with Zep hybrid retrieval.
zep_results = self._search_zep_for_entity(entity) zep_results = self._search_zep_for_entity(entity)
@ -461,10 +461,10 @@ class OasisProfileGenerator:
# Deduplicate against already-known facts. # Deduplicate against already-known facts.
new_facts = [f for f in zep_results["facts"] if f not in existing_facts] new_facts = [f for f in zep_results["facts"] if f not in existing_facts]
if new_facts: if new_facts:
context_parts.append("### Zep检索到的事实信息\n" + "\n".join(f"- {f}" for f in new_facts[:15])) context_parts.append("### Facts retrieved from the graph\n" + "\n".join(f"- {f}" for f in new_facts[:15]))
if zep_results.get("node_summaries"): if zep_results.get("node_summaries"):
context_parts.append("### Zep检索到的相关节点\n" + "\n".join(f"- {s}" for s in zep_results["node_summaries"][:10])) context_parts.append("### Related nodes retrieved from the graph\n" + "\n".join(f"- {s}" for s in zep_results["node_summaries"][:10]))
return "\n\n".join(context_parts) return "\n\n".join(context_parts)
@ -535,7 +535,7 @@ class OasisProfileGenerator:
if "bio" not in result or not result["bio"]: if "bio" not in result or not result["bio"]:
result["bio"] = entity_summary[:200] if entity_summary else f"{entity_type}: {entity_name}" result["bio"] = entity_summary[:200] if entity_summary else f"{entity_type}: {entity_name}"
if "persona" not in result or not result["persona"]: if "persona" not in result or not result["persona"]:
result["persona"] = entity_summary or f"{entity_name}是一个{entity_type}" result["persona"] = entity_summary or f"{entity_name} is a {entity_type}."
return result return result
@ -631,7 +631,7 @@ class OasisProfileGenerator:
persona_match = re.search(r'"persona"\s*:\s*"([^"]*)', content) # May be truncated. persona_match = re.search(r'"persona"\s*:\s*"([^"]*)', content) # May be truncated.
bio = bio_match.group(1) if bio_match else (entity_summary[:200] if entity_summary else f"{entity_type}: {entity_name}") bio = bio_match.group(1) if bio_match else (entity_summary[:200] if entity_summary else f"{entity_type}: {entity_name}")
persona = persona_match.group(1) if persona_match else (entity_summary or f"{entity_name}是一个{entity_type}") persona = persona_match.group(1) if persona_match else (entity_summary or f"{entity_name} is a {entity_type}.")
# If we recovered something meaningful, mark the result as fixed. # If we recovered something meaningful, mark the result as fixed.
if bio_match or persona_match: if bio_match or persona_match:
@ -646,12 +646,12 @@ class OasisProfileGenerator:
logger.warning(t("log.profile_generator.m014")) logger.warning(t("log.profile_generator.m014"))
return { return {
"bio": entity_summary[:200] if entity_summary else f"{entity_type}: {entity_name}", "bio": entity_summary[:200] if entity_summary else f"{entity_type}: {entity_name}",
"persona": entity_summary or f"{entity_name}是一个{entity_type}" "persona": entity_summary or f"{entity_name} is a {entity_type}."
} }
def _get_system_prompt(self, is_individual: bool) -> str: def _get_system_prompt(self, is_individual: bool) -> str:
"""Return the system prompt for persona generation.""" """Return the system prompt for persona generation."""
base_prompt = "You are an expert in social-media user-persona generation. Produce detailed, realistic personas for opinion simulation that faithfully reflect existing real-world conditions. You MUST return valid JSON; no string value may contain unescaped newlines." base_prompt = "You are an expert at generating social-media user personas. Produce detailed, realistic personas for opinion-simulation, faithfully grounded in the supplied real-world context. You MUST return valid JSON; no string value may contain unescaped newline characters."
return f"{base_prompt}\n\n{get_language_instruction()}" return f"{base_prompt}\n\n{get_language_instruction()}"
def _build_individual_persona_prompt( def _build_individual_persona_prompt(
@ -667,40 +667,41 @@ class OasisProfileGenerator:
attrs_str = json.dumps(entity_attributes, ensure_ascii=False) if entity_attributes else "None" attrs_str = json.dumps(entity_attributes, ensure_ascii=False) if entity_attributes else "None"
context_str = context[:3000] if context else "No additional context" context_str = context[:3000] if context else "No additional context"
return f"""Generate a detailed social-media user persona for the entity, faithfully reflecting existing real-world conditions. return f"""Generate a detailed social-media user persona for an entity, faithfully grounded in the supplied real-world context.
Entity name: {entity_name} Entity name: {entity_name}
Entity type: {entity_type} Entity type: {entity_type}
Entity summary: {entity_summary} Entity summary: {entity_summary}
Entity attributes: {attrs_str} Entity attributes: {attrs_str}
Context information: Context:
{context_str} {context_str}
Generate JSON with the following fields: Produce a JSON object with the following fields:
1. bio: social-media biography, ~200 characters 1. bio: ~200-character social-media bio.
2. persona: detailed persona description (~2000 characters of plain text), covering: 2. persona: detailed persona description as a single coherent ~2000-character plain-text passage covering:
- Basic information (age, profession, education, location) - basic info (age, profession, educational background, location)
- Background (notable experience, association with the event, social ties) - background (notable experiences, link to the focal event, social relationships)
- Personality (MBTI type, core traits, emotional expression) - personality (MBTI type, core traits, emotional expression style)
- Social-media behavior (posting frequency, content preferences, interaction style, language traits) - social-media behaviour (posting frequency, content preferences, interaction style, voice)
- Stance (attitudes toward the topic, content likely to anger or move them) - stance and opinions (attitude toward the topic, content likely to provoke or move them)
- Unique features (catchphrases, special experiences, hobbies) - distinctive traits (catchphrases, unusual experiences, hobbies)
- Personal memory (a key part of the persona: this individual's relation to the event and prior actions/reactions in it) - personal memories (a key part of the persona; describe this individual's link to the focal event and any actions / reactions they have already taken in connection with it)
3. age: age number (MUST be an integer) 3. age: an integer.
4. gender: gender, MUST be one of the English literals: "male" or "female" 4. gender: must be the literal English token "male" or "female".
5. mbti: MBTI type (e.g. INTJ, ENFP) 5. mbti: MBTI type (e.g. INTJ, ENFP).
6. country: country name 6. country: free-form country name.
7. profession: profession 7. profession: free-form occupation.
8. interested_topics: array of interest topics 8. interested_topics: array of topic strings.
Important: Important:
- All field values MUST be strings or numbers; do not use unescaped newlines. - All field values must be strings or numbers; do not include newline characters in any string value.
- persona MUST be a single coherent block of text. - persona must be a single coherent prose passage.
- {get_language_instruction()} (gender field MUST use the English values "male" or "female") - {get_language_instruction()} (the gender field must remain English: male/female.)
- Content must remain consistent with the entity information. - The content must remain consistent with the supplied entity information.
- age MUST be a valid integer; gender MUST be "male" or "female". - age must be a valid integer; gender must be exactly "male" or "female".
""" """
def _build_group_persona_prompt( def _build_group_persona_prompt(
@ -716,40 +717,41 @@ Important:
attrs_str = json.dumps(entity_attributes, ensure_ascii=False) if entity_attributes else "None" attrs_str = json.dumps(entity_attributes, ensure_ascii=False) if entity_attributes else "None"
context_str = context[:3000] if context else "No additional context" context_str = context[:3000] if context else "No additional context"
return f"""Generate a detailed social-media account profile for the institution/group entity, faithfully reflecting existing real-world conditions. return f"""Generate a detailed social-media account profile for an institutional or group entity, faithfully grounded in the supplied real-world context.
Entity name: {entity_name} Entity name: {entity_name}
Entity type: {entity_type} Entity type: {entity_type}
Entity summary: {entity_summary} Entity summary: {entity_summary}
Entity attributes: {attrs_str} Entity attributes: {attrs_str}
Context information: Context:
{context_str} {context_str}
Generate JSON with the following fields: Produce a JSON object with the following fields:
1. bio: official-account biography, ~200 characters, professional and appropriate 1. bio: ~200-character official-account bio, polished and professional.
2. persona: detailed account-profile description (~2000 characters of plain text), covering: 2. persona: detailed account profile as a single coherent ~2000-character plain-text passage covering:
- Institutional basics (formal name, institution type, founding background, primary functions) - institution basics (formal name, type of institution, founding background, primary functions)
- Account positioning (account type, target audience, core function) - account positioning (account type, target audience, core purpose)
- Voice (language traits, common phrasing, taboo topics) - voice (linguistic style, common expressions, taboo topics)
- Publishing pattern (content types, publishing frequency, active hours) - content patterns (content types, posting frequency, active hours)
- Stance (official position on the core topic, controversy-handling style) - stance (official position on the focal topic, how disputes are handled)
- Special notes (the group portrait represented, operational habits) - special notes (the group profile it represents, operational habits)
- Institutional memory (a key part of the account profile: this institution's relation to the event and prior actions/reactions in it) - institutional memory (a key part of the persona; describe this institution's link to the focal event and any actions / reactions it has already taken in connection with it)
3. age: fixed integer 30 (the institutional virtual age) 3. age: must be the integer 30 (a virtual age used for institutional accounts).
4. gender: fixed literal "other" (institutional accounts use "other" to indicate non-individual) 4. gender: must be the literal English token "other" (institutional accounts use "other" to indicate non-individual).
5. mbti: MBTI type used to characterize account voice (e.g. ISTJ for strict/conservative) 5. mbti: MBTI type used to describe the account's voice (e.g. ISTJ for a rigorous, conservative tone).
6. country: country name 6. country: free-form country name.
7. profession: institutional function description 7. profession: free-form description of the institution's role.
8. interested_topics: array of focus areas 8. interested_topics: array of focus areas.
Important: Important:
- All field values MUST be strings or numbers; null values are not allowed. - All field values must be strings or numbers; null values are not allowed.
- persona MUST be a single coherent block of text without unescaped newlines. - persona must be a single coherent prose passage; do not include newline characters in any string value.
- {get_language_instruction()} (gender field MUST use the English value "other") - {get_language_instruction()} (the gender field must remain English: "other".)
- age MUST be the integer 30; gender MUST be the string "other". - age must be the integer 30; gender must be exactly the string "other".
- Account voice MUST match the institution's identity positioning.""" - The institutional account's voice must match its identity."""
def _generate_profile_rule_based( def _generate_profile_rule_based(
self, self,
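Review note: the two prompts in this hunk and the one before it state an output contract (string or number values, no newlines in strings, integer `age`, English `gender` tokens, and for group accounts `age` 30 with `gender` "other"). A minimal sketch of how that contract could be checked on a raw LLM response is shown here. `check_profile_contract` is a hypothetical helper introduced only for illustration; it is not part of this commit and is stricter than the real pipeline, which per the overview still accepts Chinese gender values through `_normalize_gender`.

```python
import json


# Hypothetical helper for illustration only; not part of the commit.
def check_profile_contract(raw: str, is_group: bool) -> list[str]:
    """Return the contract violations found in one LLM JSON response."""
    profile = json.loads(raw)
    if not isinstance(profile, dict):
        return ["response must be a JSON object"]

    problems = []
    for key, value in profile.items():
        if key == "interested_topics":
            # The prompts ask for an array here, so it is exempt from the
            # "strings or numbers" rule.
            if not (isinstance(value, list) and all(isinstance(t, str) for t in value)):
                problems.append("interested_topics must be an array of strings")
        elif isinstance(value, bool) or not isinstance(value, (str, int, float)):
            # Rejects null, booleans, nested objects and arrays.
            problems.append(f"{key} must be a string or number")
        elif isinstance(value, str) and "\n" in value:
            problems.append(f"{key} contains a newline")

    age = profile.get("age")
    if isinstance(age, bool) or not isinstance(age, int):
        problems.append("age must be an integer")
    elif is_group and age != 30:
        problems.append("group age must be 30")

    # Assumption: strict English tokens only, as the prompts request.
    allowed = ("other",) if is_group else ("male", "female")
    if profile.get("gender") not in allowed:
        problems.append(f"gender must be one of {list(allowed)}")
    return problems
```

An empty list means the response satisfies the rules as written in the prompts. A check like this would most plausibly live in a test that feeds recorded LLM responses through it, rather than in the generator itself.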
@@ -959,7 +961,7 @@ Important:
progress_callback( progress_callback(
current, current,
total, total,
f"已完成 {current}/{total}: {entity.name}{entity_type}" f"Completed {current}/{total}: {entity.name} ({entity_type})"
) )
if error: if error:
@@ -994,24 +996,25 @@ Important:
separator = "-" * 70 separator = "-" * 70
# Assemble the full output (no truncation). # Assemble the full output (no truncation).
topics_str = ', '.join(profile.interested_topics) if profile.interested_topics else '' topics_str = ', '.join(profile.interested_topics) if profile.interested_topics else 'None'
output_lines = [ output_lines = [
f"\n{separator}", f"\n{separator}",
t('progress.profileGenerated', name=entity_name, type=entity_type), t('progress.profileGenerated', name=entity_name, type=entity_type),
f"{separator}", f"{separator}",
f"用户名: {profile.user_name}", f"Username: {profile.user_name}",
f"", f"",
f"【简介】", f"[Bio]",
f"{profile.bio}", f"{profile.bio}",
f"", f"",
f"【详细人设】", f"[Persona]",
f"{profile.persona}", f"{profile.persona}",
f"", f"",
f"【基本属性】", f"[Basic attributes]",
f"年龄: {profile.age} | 性别: {profile.gender} | MBTI: {profile.mbti}", f"Age: {profile.age} | Gender: {profile.gender} | MBTI: {profile.mbti}",
f"职业: {profile.profession} | 国家: {profile.country}", f"Profession: {profile.profession} | Country: {profile.country}",
f"兴趣话题: {topics_str}", f"Interested topics: {topics_str}",
separator separator
] ]