Merge pull request #34 from salestech-group/feat/i18n-3-translate-oasis-profile-prompts

feat(i18n): translate oasis_profile_generator prompts to english
Dominik Seemann 2026-05-08 11:09:17 +02:00 committed by GitHub
commit e6f939592c
7 changed files with 1380 additions and 66 deletions

@@ -0,0 +1,617 @@
# Design Document — i18n-oasis-profile-generator-prompts
## Overview
**Purpose**: Translate the Chinese prompt strings in
`backend/app/services/oasis_profile_generator.py` (the system prompt
inside `_get_system_prompt`, the individual-persona f-string template
inside `_build_individual_persona_prompt`, the group-persona f-string
template inside `_build_group_persona_prompt`, and the four
`attrs_str`/`context_str` fallback literals) to English while
preserving every functional contract — JSON output keys, the `gender`
English enum, the `age` integer rule, the `persona` no-newline rule,
all `{variable}` interpolations, and every `get_language_instruction()`
call site. The goal is to remove the Chinese-language base-prompt bias
that currently leaks Chinese structure and word choice into persona
output even when `Accept-Language: en`.
**Users**: MiroFish operators running the Step 2 environment-setup
pipeline under any locale; downstream Step 3 (CAMEL-OASIS subprocess)
which consumes the produced persona dictionaries.
**Impact**: Replaces one single-line system prompt and two large
f-string templates with English equivalents inside one file. No
API change, no new dependencies, no new files. The two production
callers (`backend/app/services/simulation_manager.py:316` and
`backend/app/api/simulation.py:1413`) and the OASIS subprocess are
unaffected.
### Goals
- Zero CJK characters in any prompt string literal contributed by
`oasis_profile_generator.py` to the system prompt or the two
user-message bodies (including the `attrs_str`/`context_str`
fallback literals).
- English persona prose (`bio`, `persona`, `profession`,
`interested_topics`) under `Accept-Language: en`.
- Continued Chinese persona prose under `Accept-Language: zh`, of
equivalent quality to the pre-change behaviour.
- `gender` field stays exactly one of `"male"`/`"female"`/`"other"`
regardless of locale.
- No diff to public signatures, taxonomy lists, LLM-call parameters,
or call sites.
### Non-Goals
- Externalizing prompts to `/locales/*.json` (out of scope per ticket).
- Translating logger calls in this file (covered by issue #6).
- Translating module/class/method docstrings or inline comments
(covered by issue #7).
- Refactoring the `OasisAgentProfile` schema, `MBTI_TYPES` /
`COUNTRIES` lists, or the `INDIVIDUAL_ENTITY_TYPES` /
`GROUP_ENTITY_TYPES` taxonomies.
- Modifying the rule-based fallback (`_generate_profile_rule_based`)
including its Chinese country defaults.
- Modifying the resilience helpers `_fix_truncated_json` /
`_try_fix_json` and the Chinese persona fallback fragments inside
them (e.g. `f"{entity_name}是一个{entity_type}。"`).
- Modifying `backend/app/utils/locale.py`, the locale registries, or
any non-target file.
- Modifying `backend/scripts/test_profile_format.py`.
## Boundary Commitments
### This Spec Owns
- The English content of `_get_system_prompt`'s `base_prompt` literal.
- The English content of the f-string template body in
`_build_individual_persona_prompt`.
- The English content of the f-string template body in
`_build_group_persona_prompt`.
- The English replacements for the four `"无"` / `"无额外上下文"`
fallback literals (in both individual and group builders).
### Out of Boundary
- Locale resolution machinery (`backend/app/utils/locale.py`).
- Per-locale `llmInstruction` definitions
(`/locales/languages.json`).
- Reasoning-model output stripping inside `_fix_truncated_json` /
`_try_fix_json`.
- Logger calls and translation keys (`t("log.profile_generator.*")`)
inside `oasis_profile_generator.py` (issue #6, already merged).
- Module / class / method docstrings and inline comments inside
`oasis_profile_generator.py` (issue #7).
- Rule-based fallback (`_generate_profile_rule_based`) including its
Chinese country defaults `"中国"`.
- Chinese persona fragments inside the resilience helpers (e.g.
`f"{entity_name}是一个{entity_type}。"`) — those are runtime data
fallbacks, not LLM prompts.
- All callers of `OasisProfileGenerator`
(`simulation_manager.py`, `api/simulation.py`).
- Tests, scripts, and frontend code.
- The `print(...)` banner at line 945 (closely associated with logger
externalization #6).
### Allowed Dependencies
- Existing imports in the target file (no additions). Specifically:
`get_language_instruction`, `get_locale`, `set_locale`, `t` from
`..utils.locale` are already imported and remain unchanged.
- Existing LLM transport via `self.client.chat.completions.create`
(unchanged).
### Revalidation Triggers
The following changes elsewhere would invalidate this design:
- A change to the JSON contract emitted by the LLM (`bio`, `persona`,
`age`, `gender`, `mbti`, `country`, `profession`,
`interested_topics` keys).
- A change to the `OasisAgentProfile` dataclass field set or the
Reddit/Twitter serializers.
- A change to `get_language_instruction()` semantics or the per-locale
`llmInstruction` strings.
- A change to OASIS subprocess profile-format expectations (verified
via `backend/scripts/test_profile_format.py`).
## Architecture
### Existing Architecture Analysis
`OasisProfileGenerator` lives in `backend/app/services/`, follows the
in-process service pattern, and is invoked from a Flask handler inside
a background task. The relevant flow:
1. The Flask handler resolves the request locale via `Accept-Language`;
`set_locale()` is propagated into worker threads in
`generate_profiles_for_entities` (locale captured at line ~910 and
restored inside `generate_single_profile` at line ~914).
2. For each entity, `generate_profile_from_entity` decides between the
individual or group prompt builder via
`self._is_individual_entity(entity_type)`.
3. The chosen builder produces a user-message string; `_get_system_prompt`
produces a system-message string. Both are sent to the LLM via
`self.client.chat.completions.create(..., response_format={"type": "json_object"})`.
4. The LLM response is JSON-decoded; on failure, `_try_fix_json` and
`_fix_truncated_json` attempt recovery; on terminal failure,
`_generate_profile_rule_based` produces a rule-based persona.
5. The result is wrapped in an `OasisAgentProfile` dataclass and
serialized to Reddit JSON or Twitter CSV via `_save_reddit_json` /
`_save_twitter_csv`.
This design preserves all of the above. The change is purely lexical
inside three method bodies and four literal defaults.
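Step 1 of the flow above (capture the locale on the request thread, restore it in each worker) can be illustrated with a minimal sketch. The `set_locale` / `get_locale` functions below are hypothetical stand-ins built on `threading.local`, not the real `backend/app/utils/locale.py`; the default locale and the `generate_all` helper are assumptions made only for this example.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for set_locale / get_locale; the real module may differ.
_state = threading.local()


def set_locale(locale: str) -> None:
    _state.locale = locale


def get_locale() -> str:
    # Assumed default for this sketch only.
    return getattr(_state, "locale", "zh")


def generate_all(entity_names):
    # Capture the locale on the calling (request) thread.
    current_locale = get_locale()

    def generate_single(name):
        # Thread-local state does not cross threads, so restore it here.
        set_locale(current_locale)
        return f"{name}:{get_locale()}"

    with ThreadPoolExecutor(max_workers=2) as pool:
        return list(pool.map(generate_single, entity_names))
```

Without the `set_locale(current_locale)` line inside the worker, each worker thread would see the default locale rather than the one resolved from the request, which is why requirement 4.2 treats this plumbing as untouchable.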
### Architecture Pattern & Boundary Map
```mermaid
graph TB
Caller["simulation_manager.py / api/simulation.py"]
Generator["OasisProfileGenerator"]
Sys["_get_system_prompt"]
Ind["_build_individual_persona_prompt"]
Grp["_build_group_persona_prompt"]
Locale["locale.get_language_instruction"]
Client["openai.chat.completions.create"]
Parser["_try_fix_json / _fix_truncated_json"]
Fallback["_generate_profile_rule_based"]
Serializer["_save_reddit_json / _save_twitter_csv"]
Caller --> Generator
Generator --> Sys
Generator --> Ind
Generator --> Grp
Sys -. inline call .-> Locale
Ind -. inline call .-> Locale
Grp -. inline call .-> Locale
Sys --> Client
Ind --> Client
Grp --> Client
Client --> Parser
Parser --> Fallback
Generator --> Serializer
classDef change fill:#fff4ce,stroke:#a16207,color:#000
class Sys,Ind,Grp change
```
The three highlighted nodes (`_get_system_prompt`,
`_build_individual_persona_prompt`,
`_build_group_persona_prompt`) are the only nodes whose **string
contents** change. Every edge — including each call to
`get_language_instruction()` — remains intact.
**Architecture Integration**:
- **Selected pattern**: In-place lexical translation of the three
prompt builders (Option A from `gap-analysis.md` / `research.md`).
- **Domain/feature boundaries**: Same as today; `OasisProfileGenerator`
remains the sole owner of persona prompt content. `LocaleService`
remains the sole owner of locale-postfix steering.
- **Existing patterns preserved**: locale-thread propagation, retry
logic with temperature decay, JSON resilience helpers, rule-based
fallback, two-platform serialization.
- **New components rationale**: none — no new components.
- **Steering compliance**: aligns with `tech.md` ("LLM prompts use the
`get_language_instruction()` postfix mechanism, not key files") and
`structure.md` ("services own their own prompt strings").
### Technology Stack & Alignment
| Layer | Choice / Version | Role in Feature | Notes |
|-------|------------------|-----------------|-------|
| Backend / Services | Python ≥3.11 | Hosts the prompt builders | No version change |
| LLM transport | `openai` SDK against any OpenAI-compatible endpoint | Sends translated prompts | Unchanged |
| i18n | `backend/app/utils/locale.py` | Resolves locale and provides `get_language_instruction()` postfix | Unchanged |
| Storage | None | — | No persistence change |
No new dependencies. No version bumps. The locale infrastructure used
by the change is the same one used by every sibling i18n spec already
merged.
## File Structure Plan
### Modified Files
- `backend/app/services/oasis_profile_generator.py` — only file that
changes.
  - `_get_system_prompt(self, is_individual: bool) -> str` — translate
    `base_prompt` literal to English. Keep
    `f"{base_prompt}\n\n{get_language_instruction()}"` shape.
  - `_build_individual_persona_prompt(self, entity_name, entity_type,
    entity_summary, entity_attributes, context) -> str` — translate
    the f-string body to English; replace `"无"` and `"无额外上下文"`
    defaults; keep every `{variable}` interpolation and the inline
    `{get_language_instruction()}` call.
  - `_build_group_persona_prompt(self, entity_name, entity_type,
    entity_summary, entity_attributes, context) -> str` — same
    treatment as the individual builder.
No other files in the repository are touched by this change.
## System Flows
The runtime flow does not change. The call graph is identical before
and after the edit, and it is already shown in the Architecture
diagram above, so a separate sequence diagram is omitted.
## Requirements Traceability
| Requirement | Summary | Components | Interfaces | Flows |
|-------------|---------|------------|------------|-------|
| 1.1 | `base_prompt` contains zero Chinese characters | `_get_system_prompt` | `(self, is_individual: bool) -> str` | system-message construction |
| 1.2 | Preserve `f"{base_prompt}\n\n{get_language_instruction()}"` | `_get_system_prompt` | inline `get_language_instruction()` | system-message construction |
| 1.3 | Preserve role/intent semantics | `_get_system_prompt` | — | — |
| 1.4 | Preserve signature `_get_system_prompt(self, is_individual: bool) -> str` | `_get_system_prompt` | (signature) | — |
| 2.1 | Individual prompt body in English | `_build_individual_persona_prompt` | f-string body | user-message construction |
| 2.2 | Preserve `{entity_name}`, `{entity_type}`, `{entity_summary}`, `{attrs_str}`, `{context_str}`, `{get_language_instruction()}` | `_build_individual_persona_prompt` | f-string interpolations | — |
| 2.3 | Preserve JSON keys `bio, persona, age, gender, mbti, country, profession, interested_topics` | `_build_individual_persona_prompt` | prompt content | — |
| 2.4 | Preserve field-level constraints (lengths, MBTI, gender enum, age int) | `_build_individual_persona_prompt` | prompt content | — |
| 2.5 | Preserve trailing-rules block semantics | `_build_individual_persona_prompt` | prompt content | — |
| 2.6 | Preserve method signature | `_build_individual_persona_prompt` | (signature) | — |
| 2.7 | Translate `"无"` and `"无额外上下文"` defaults | `_build_individual_persona_prompt` | literal defaults | — |
| 2.8 | Zero Chinese in assembled body | `_build_individual_persona_prompt` | — | — |
| 3.1 | Group prompt body in English | `_build_group_persona_prompt` | f-string body | user-message construction |
| 3.2 | Preserve interpolations | `_build_group_persona_prompt` | f-string interpolations | — |
| 3.3 | Preserve JSON keys | `_build_group_persona_prompt` | prompt content | — |
| 3.4 | Preserve field-level constraints (age=30, gender="other", etc.) | `_build_group_persona_prompt` | prompt content | — |
| 3.5 | Preserve trailing-rules semantics | `_build_group_persona_prompt` | prompt content | — |
| 3.6 | Preserve method signature | `_build_group_persona_prompt` | (signature) | — |
| 3.7 | Translate `"无"` / `"无额外上下文"` defaults | `_build_group_persona_prompt` | literal defaults | — |
| 3.8 | Zero Chinese in assembled body | `_build_group_persona_prompt` | — | — |
| 4.1 | Preserve every `get_language_instruction()` call site | all three builders | inline call | system + user message construction |
| 4.2 | Preserve locale-thread plumbing | `generate_profiles_for_entities` (untouched) | `set_locale(current_locale)` | worker thread spawn |
| 4.3 | Locale=zh produces Chinese personas | runtime behaviour | locale postfix | LLM call |
| 4.4 | Locale=en produces English personas | runtime behaviour | locale postfix | LLM call |
| 4.5 | `gender` ∈ {male, female, other} regardless of locale | prompt content | — | — |
| 4.6 | Don't alter locale.py / locales/ | (none) | — | — |
| 5.1 | Preserve `OasisAgentProfile` dataclass | (untouched) | dataclass | — |
| 5.2 | Preserve method signatures | (untouched) | signatures | — |
| 5.3 | Preserve LLM invocation parameters | (untouched) | `chat.completions.create(...)` | — |
| 5.4 | Preserve `MBTI_TYPES`, `COUNTRIES`, taxonomy lists | (untouched) | class constants | — |
| 6.1 | Preserve `_fix_truncated_json` / `_try_fix_json` | (untouched) | helpers | — |
| 6.2 | Reasoning-model recovery still works | (untouched) | resilience helpers | — |
| 6.3 | No new prompt-language-dependent pre-processing | (none added) | — | — |
| 6.4 | Round-trip yields non-empty `bio` and `persona` | runtime behaviour | LLM call | — |
| 7.1 | `pytest test_profile_format.py` passes | runtime behaviour | serializers | — |
| 7.2 | Reddit format schema preserved | (untouched) | `to_reddit_format` | — |
| 7.3 | Twitter format schema preserved | (untouched) | `to_twitter_format` | — |
| 7.4 | `gender` enum preserved | prompt content | — | — |
| 8.1 | No logger edits | (untouched) | — | — |
| 8.2 | No docstring/comment edits | (untouched) | — | — |
| 8.3 | No rule-based fallback edits | (untouched) | — | — |
| 8.4 | No edits outside the target file | (none) | — | — |
| 8.5 | No new dependencies | (none) | `pyproject.toml` / `uv.lock` untouched | — |
| 8.6 | No edits to `test_profile_format.py` | (untouched) | — | — |
## Components and Interfaces
| Component | Domain/Layer | Intent | Req Coverage | Key Dependencies (P0/P1) | Contracts |
|-----------|--------------|--------|--------------|--------------------------|-----------|
| `_get_system_prompt` | backend service / prompt builder | Produce the system message (English base + locale postfix) | 1.1, 1.2, 1.3, 1.4, 4.1, 4.5 | `get_language_instruction` (P0) | Service |
| `_build_individual_persona_prompt` | backend service / prompt builder | Produce the individual-entity user message in English | 2.x, 4.1, 4.5 | `get_language_instruction` (P0); JSON encoder (P1) | Service |
| `_build_group_persona_prompt` | backend service / prompt builder | Produce the group/institution user message in English | 3.x, 4.1, 4.5 | `get_language_instruction` (P0); JSON encoder (P1) | Service |
Only the three prompt-builder methods change. They all live inside the
single class `OasisProfileGenerator` in
`backend/app/services/oasis_profile_generator.py`. No new components.
### Backend / Services
#### `_get_system_prompt`
| Field | Detail |
|-------|--------|
| Intent | Build the `system` message: a one-line English directive that frames the model as a social-media persona expert + the per-locale postfix. |
| Requirements | 1.1, 1.2, 1.3, 1.4, 4.1, 4.5 |
**Responsibilities & Constraints**
- Construct and return a single string of the form
`f"{base_prompt}\n\n{get_language_instruction()}"`.
- Preserve the signature
`_get_system_prompt(self, is_individual: bool) -> str`.
- The English `base_prompt` MUST convey: (a) expert role in
social-media persona generation; (b) intent to produce detailed,
realistic personas for opinion-simulation, faithful to existing
reality; (c) the JSON-output requirement and the no-unescaped-newline
rule.
- The English `base_prompt` MUST NOT contain any CJK codepoint.
**Dependencies**
- Outbound: `get_language_instruction()` from
`backend/app/utils/locale.py` (P0, criticality high — the entire
locale-steering chain depends on it).
**Contracts**: Service [x] / API [ ] / Event [ ] / Batch [ ] / State [ ]
##### Service Interface
```python
def _get_system_prompt(self, is_individual: bool) -> str:
    """Return the LLM system message: English base + locale postfix."""
    ...
```
- Preconditions: none.
- Postconditions: returns a non-empty string ending with the locale
postfix produced by `get_language_instruction()`.
- Invariants: contains zero CJK codepoints.
**Implementation Notes**
- Integration: called only from `_call_llm_with_retry` (line ~523)
with `is_individual` decided upstream. The `is_individual` flag is
reserved for future divergence between system prompts; the current
implementation does not branch on it, and this design preserves
that.
- Validation: a CJK regex audit on the method body after the edit must
match zero codepoints.
- Risks: dropping one of the three role/intent pieces (expert framing,
JSON output requirement, no-newline rule). Implementation task lists
all three explicitly.
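The prompt shape and the CJK audit described above might be checked along these lines. This is a sketch, not the project's audit script: the Unicode ranges are an assumed definition of "CJK codepoint", `build_system_prompt` is a hypothetical free-function version of the method, and the sample base prompt in the usage note is illustrative wording rather than the shipped literal.

```python
import re
from typing import List

# Assumed coverage: CJK symbols/punctuation, Extension A, Unified Ideographs,
# compatibility ideographs, and halfwidth/fullwidth forms.
CJK_PATTERN = re.compile(
    r"[\u3000-\u303f\u3400-\u4dbf\u4e00-\u9fff\uf900-\ufaff\uff00-\uffef]"
)


def find_cjk(text: str) -> List[str]:
    """Return every CJK character found in text (empty list means clean)."""
    return CJK_PATTERN.findall(text)


def build_system_prompt(base_prompt: str, language_instruction: str) -> str:
    """Mirror the documented shape: base prompt, blank line, locale postfix."""
    return f"{base_prompt}\n\n{language_instruction}"
```

For example, `find_cjk("You are an expert in social-media persona generation.")` returns an empty list, while the assembled prompt under the `zh` locale legitimately contains CJK characters from the postfix. The audit therefore targets the `base_prompt` literal, not the rendered string.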
#### `_build_individual_persona_prompt`
| Field | Detail |
|-------|--------|
| Intent | Build the user-message string for an individual entity in English. Preserve every `{variable}` interpolation, the inline `{get_language_instruction()}` call, every JSON-output key, and every locale-independent constraint. |
| Requirements | 2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 2.7, 2.8, 4.1, 4.5 |
**Responsibilities & Constraints**
- Preserve signature
`_build_individual_persona_prompt(self, entity_name: str, entity_type: str, entity_summary: str, entity_attributes: Dict[str, Any], context: str) -> str`.
- Preserve `attrs_str = json.dumps(entity_attributes, ensure_ascii=False) if entity_attributes else <fallback>` with `<fallback>` translated to English (`"None"`).
- Preserve `context_str = context[:3000] if context else <fallback>` with `<fallback>` translated to English (`"No additional context"`).
- Translate the f-string body to English with these structural sections (mirror the original Chinese intent):
  1. **Lead sentence** — instruct the model to generate a detailed
     social-media persona for the entity, faithful to existing reality.
  2. **Entity context block** — labelled lines for `entity_name`,
     `entity_type`, `entity_summary`, `entity_attributes` (English
     labels; values via `{...}` interpolation).
  3. **Context information block**: `Context information:` heading
     followed by `{context_str}`.
  4. **JSON-fields enumeration** — `Generate JSON with the following
     fields:` followed by the eight numbered items (`bio`, `persona`,
     `age`, `gender`, `mbti`, `country`, `profession`,
     `interested_topics`) with English descriptions matching
     Requirement 2.4.
  5. **Trailing rules block**: `Important:` followed by:
     - `All field values must be strings or numbers; do not use newlines.`
     - `persona must be a single coherent block of text.`
     - `{get_language_instruction()} (gender field MUST use English values: "male" or "female")`
     - `Content must remain consistent with the entity information.`
     - `age must be a valid integer; gender must be exactly "male" or "female".`
- Preserve every `{variable}` interpolation present in the original by
name: `{entity_name}`, `{entity_type}`, `{entity_summary}`,
`{attrs_str}`, `{context_str}`, `{get_language_instruction()}`.
- The translated body MUST NOT contain any CJK codepoint.
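The two fallback expressions listed above can be exercised in isolation. The sketch below lifts them into a hypothetical helper, `render_fallbacks`, purely for illustration; in the target file they are inline assignments inside the builder, and only the fallback literals change.

```python
import json
from typing import Any, Dict, Tuple


def render_fallbacks(entity_attributes: Dict[str, Any], context: str) -> Tuple[str, str]:
    """Return (attrs_str, context_str) using the English fallback literals."""
    # ensure_ascii=False keeps non-ASCII attribute values readable in the prompt.
    attrs_str = (
        json.dumps(entity_attributes, ensure_ascii=False)
        if entity_attributes
        else "None"
    )
    # Context is truncated to 3000 characters before interpolation.
    context_str = context[:3000] if context else "No additional context"
    return attrs_str, context_str
```

Note that `ensure_ascii=False` means entity data containing Chinese text still reaches the rendered prompt unchanged; the zero-CJK invariant applies to the literals the file contributes, not to interpolated entity data.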
**Dependencies**
- Outbound: `json.dumps(..., ensure_ascii=False)` (P1, formatting the
attributes dict) — unchanged.
- Outbound: `get_language_instruction()` (P0) — interpolated inline.
**Contracts**: Service [x] / API [ ] / Event [ ] / Batch [ ] / State [ ]
##### Service Interface
```python
def _build_individual_persona_prompt(
    self,
    entity_name: str,
    entity_type: str,
    entity_summary: str,
    entity_attributes: Dict[str, Any],
    context: str,
) -> str:
    """Return the LLM user message for an individual-entity persona."""
    ...
```
- Preconditions: `entity_name`, `entity_type`, `entity_summary`
are strings (may be empty); `entity_attributes` is a dict (may be
empty); `context` is a string (may be empty).
- Postconditions: returns a non-empty English string with all six
interpolations resolved.
- Invariants: contains zero CJK codepoints; preserves every
`{variable}` interpolation by name.
**Implementation Notes**
- Integration: called from `_call_llm_with_retry` (line ~506) when
`is_individual` is true.
- Validation: post-edit CJK regex audit; interpolation-set audit
(verify the multiset of `{...}` tokens equals the pre-change set);
smoke import + `pytest backend/scripts/test_profile_format.py`.
- Risks: dropping the `gender` enum lock when translating; dropping
the inline `{get_language_instruction()}` call. The implementation
task list calls these out as discrete checks.
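One possible way to run the interpolation-set audit is to compare the multiset of `{...}` tokens in the template source before and after translation. The helper name `interpolation_tokens` and the sample templates in the usage note are hypothetical; the sketch assumes the audit runs over the f-string source text, where literal JSON braces are written as `{{` and `}}`.

```python
import re
from collections import Counter


def interpolation_tokens(template_source: str) -> Counter:
    """Count each {expr} token in f-string source, ignoring escaped braces."""
    # Remove {{ and }} first so literal JSON braces are not counted as tokens.
    unescaped = template_source.replace("{{", "").replace("}}", "")
    return Counter(re.findall(r"\{([^{}]+)\}", unescaped))
```

A usage example with made-up template fragments (not the real prompt text):

```python
before = "名称: {entity_name}\n类型: {entity_type}\n{get_language_instruction()}"
after = "Name: {entity_name}\nType: {entity_type}\n{get_language_instruction()}"
assert interpolation_tokens(before) == interpolation_tokens(after)
```

Comparing `Counter` objects rather than sets also catches a token that was accidentally duplicated or dropped once while still appearing elsewhere in the template.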
#### `_build_group_persona_prompt`
| Field | Detail |
|-------|--------|
| Intent | Build the user-message string for a group/institution entity in English. Preserve every `{variable}` interpolation, the inline `{get_language_instruction()}` call, every JSON-output key, and every locale-independent constraint (notably `age == 30` and `gender == "other"`). |
| Requirements | 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7, 3.8, 4.1, 4.5 |
**Responsibilities & Constraints**
- Preserve signature
`_build_group_persona_prompt(self, entity_name: str, entity_type: str, entity_summary: str, entity_attributes: Dict[str, Any], context: str) -> str`.
- Preserve the `attrs_str` and `context_str` fallback handling with
English defaults (`"None"`, `"No additional context"`), identical to
the individual builder.
- Translate the f-string body to English with these structural
  sections (mirror the original Chinese intent for institutions):
  1. **Lead sentence** — instruct the model to generate a detailed
     social-media account profile for the institution/group, faithful
     to existing reality.
  2. **Entity context block** — labelled lines for `entity_name`,
     `entity_type`, `entity_summary`, `entity_attributes`.
  3. **Context information block**: `Context information:` heading
     followed by `{context_str}`.
  4. **JSON-fields enumeration** — `Generate JSON with the following
     fields:` followed by the eight numbered items as defined in
     Requirement 3.4: `bio` (~200 chars, official voice), `persona`
     (~2000 chars, single coherent text covering institutional
     basics, account positioning, voice, publishing pattern, stance,
     special notes, institutional memory), `age` (= integer 30,
     institutional virtual age), `gender` (= literal `"other"`),
     `mbti` (e.g. ISTJ for strict/conservative), `country` (country
     name string), `profession` (institutional function),
     `interested_topics` (array).
  5. **Trailing rules block**: `Important:` followed by:
     - `All field values must be strings or numbers; null is not allowed.`
     - `persona must be a single coherent block of text without newlines.`
     - `{get_language_instruction()} (gender field MUST use English value "other")`
     - `age must be the integer 30; gender must be the string "other".`
     - `Account voice must match its identity positioning.`
- Preserve every `{variable}` interpolation present in the original.
- The translated body MUST NOT contain any CJK codepoint.
**Dependencies**
- Outbound: same as individual builder.
**Contracts**: Service [x] / API [ ] / Event [ ] / Batch [ ] / State [ ]
##### Service Interface
```python
def _build_group_persona_prompt(
    self,
    entity_name: str,
    entity_type: str,
    entity_summary: str,
    entity_attributes: Dict[str, Any],
    context: str,
) -> str:
    """Return the LLM user message for a group/institution persona."""
    ...
```
- Preconditions / Postconditions / Invariants: same shape as the
individual builder.
**Implementation Notes**
- Integration: called from `_call_llm_with_retry` (line ~510) when
`is_individual` is false.
- Validation: same checks as the individual builder, plus an explicit
audit that the institutional sentinels (`age == 30`,
`gender == "other"`) appear in English in the trailing-rules block.
- Risks: same as the individual builder; additionally, the `country`
language hint (`"使用中文,如\"中国\""`) is intentionally dropped
during translation — the validation task verifies that under
`Accept-Language: en` a sample run produces an English country
name.
## Data Models
No data-model changes. The persona JSON schema, the
`OasisAgentProfile` dataclass, the Reddit/Twitter serializers, and the
OASIS subprocess profile-format expectations are all preserved
verbatim.
## Error Handling
### Error Strategy
No new error paths. The existing flow is preserved:
- `json.JSONDecodeError` → `_try_fix_json` → `_fix_truncated_json`
partial-extract via regex → `_generate_profile_rule_based`.
- LLM call failure → retry with temperature decay (`0.7 - attempt * 0.1`)
up to `max_attempts = 3`.
- Terminal failure → rule-based fallback persona.
- Per-entity worker exception → fallback `OasisAgentProfile` produced
inside `generate_single_profile` at line ~932.
The translated prompts do not introduce new failure modes. Translating
prompt language has no semantic effect on JSON parsing or on the
`response_format={"type": "json_object"}` constraint.
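The retry-then-fallback behaviour described above can be sketched as follows. This is not the code of `_call_llm_with_retry`: `call_llm` and `fallback` are hypothetical callables standing in for the LLM request and the rule-based generator, and the sketch assumes `attempt` is zero-based so that the first temperature is 0.7.

```python
from typing import Any, Callable, Dict


def call_with_retry(
    call_llm: Callable[[float], Dict[str, Any]],
    fallback: Callable[[], Dict[str, Any]],
    max_attempts: int = 3,
) -> Dict[str, Any]:
    """Try the LLM with decaying temperature, then fall back to rules."""
    for attempt in range(max_attempts):
        # Temperature decay from the design: 0.7 - attempt * 0.1.
        temperature = round(0.7 - attempt * 0.1, 2)
        try:
            return call_llm(temperature)
        except Exception:
            # Any failure (transport or JSON decoding) triggers the next attempt.
            continue
    # Terminal failure: rule-based persona instead of an exception.
    return fallback()
```

Nothing in this loop depends on the language of the prompt text, which is the reason the design expects the translation to leave the error paths unchanged.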
### Error Categories and Responses
- **User errors**: not applicable (this is an internal pipeline).
- **System errors**: LLM transport errors are retried; logger emits
`t("log.profile_generator.m011")` etc. Logger keys already exist in
`locales/{en,zh}.json`.
- **Business-logic errors**: `gender` not in the English enum, `age`
not an integer — the prompt explicitly mandates them; the validator
inside `_try_fix_json` does not enforce these but the OASIS
subprocess does. No change in either direction.
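For manual spot-checks of the business-logic rules above, a small checker along the following lines could be used. It is a hypothetical helper and not part of this design, which deliberately adds no validator; the function name and the returned field list are assumptions made for the example.

```python
from typing import Any, Dict, List

ALLOWED_GENDERS = {"male", "female", "other"}


def contract_violations(profile: Dict[str, Any]) -> List[str]:
    """Return the names of fields that break the locale-independent rules."""
    problems = []
    if profile.get("gender") not in ALLOWED_GENDERS:
        problems.append("gender")
    age = profile.get("age")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not isinstance(age, int) or isinstance(age, bool):
        problems.append("age")
    persona = profile.get("persona")
    # persona must be a single block of text without newlines.
    if not isinstance(persona, str) or "\n" in persona:
        problems.append("persona")
    return problems
```

Running it over a handful of generated personas under both `en` and `zh` would give a quick signal on requirement 4.5 without touching production code.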
### Monitoring
Existing logger calls are unchanged. Logger keys already i18n-keyed via
`t("log.profile_generator.*")`.
## Testing Strategy
### Unit Tests
- **(Existing)**
`backend/scripts/test_profile_format.py::test_profile_formats`
must continue to pass without modification.
- **(Manual)** Smoke import:
`cd backend && uv run python -c "from app.services.oasis_profile_generator import OasisProfileGenerator"`
— confirms no syntax errors after editing f-strings.
### Integration Tests
- **(Manual)** Run the prompt builders directly under each locale:
- `set_locale("en")` →
`OasisProfileGenerator()._build_individual_persona_prompt("Alice", "Student", "summary", {"k": "v"}, "ctx")`
— assert no CJK codepoints in the output, assert the English
locale postfix appears via `get_language_instruction()` (which is
`"Please respond in English."`).
- `set_locale("zh")` → same call → assert the locale postfix is
`"请使用中文回答。"`.
- These do not require an LLM call; they only verify the rendered
prompt string.
### E2E Tests
- **(Manual, optional; skip when no live LLM key is available)** Run
`npm run dev` and trigger Step 2 profile generation from the UI under
the English locale on a small entity set; spot-check that bios and
persona prose are in English. Sibling specs #2/#4/#5 used the same
manual E2E approach.
### Performance / Load
Not applicable. Prompt translation has no measurable performance
impact.
## Optional Sections
### Security Considerations
No security implications. No new external surfaces; no new data
retention; no change to authentication or authorization.
### Migration Strategy
No migration required. The change is forward-compatible: a deployment
that picks up the translated prompts continues to serve users on the
`zh` locale via the unchanged
`get_language_instruction()` postfix mechanism.
## Supporting References
- `gap-analysis.md` — option evaluation and effort/risk sizing.
- `research.md` — discovery findings, design decisions (in particular
the "drop the country language hint" decision), and risk register.
- `requirements.md` — EARS requirements with numeric IDs.
- Sibling specs `i18n-ontology-generator-prompts`,
`i18n-simulation-config-generator-prompts`,
`i18n-report-agent-prompts` — same translation pattern, already
merged.

@@ -0,0 +1,241 @@
# Gap Analysis — i18n-oasis-profile-generator-prompts
This document analyzes the gap between the requirements and the existing
codebase, lists implementation options, and recommends an approach for the
design phase.
## 1. Current State Investigation
### Target file
`backend/app/services/oasis_profile_generator.py` — 1195 lines. Defines:
- `OasisAgentProfile` dataclass with Reddit / Twitter serializers.
- `OasisProfileGenerator` class with the following public-API surface:
`__init__`, `generate_profile_from_entity`, `generate_profiles_from_entities`,
`set_graph_id`, plus private helpers `_call_llm_with_retry`,
`_generate_profile_rule_based`, `_get_system_prompt`,
`_build_individual_persona_prompt`, `_build_group_persona_prompt`,
`_print_generated_profile`, `_fix_truncated_json`, `_try_fix_json`,
`_save_twitter_csv`, `_save_reddit_json`, `_generate_username`.
### Chinese surfaces in the file (by category)
| Category | Lines | In scope this issue? |
| --- | --- | --- |
| Module / class / method docstrings | scattered | **No** — covered by #7 |
| Inline `#` comments | scattered | **No** — covered by #7 |
| `logger.{info,warning,error}` calls (translated via `t("log.profile_generator.*")`) | scattered | **No** — already done by #6 |
| `print(...)` banners (e.g. line 945) | a few | **No** — companion to #6 in spirit; not a prompt literal |
| **System prompt `base_prompt`** (line 664) | 1 line | **Yes** |
| **Individual-persona prompt body** (lines 680-714) | block | **Yes** |
| **Group-persona prompt body** (lines 729-762) | block | **Yes** |
| `attrs_str` / `context_str` defaults `"无"` / `"无额外上下文"` (lines 677, 678, 726, 727) | 4 lines | **Yes** — they substitute *into* the prompt body |
| Rule-based fallback (`_generate_profile_rule_based`, lines 764-835) including `"country": "中国"` and `"国家"` placeholders | block | **No** — runtime data, not a prompt |
| Resilience-helper Chinese fragments (`f"{entity_name}是一个{entity_type}。"` at lines 547, 644, 659) | a few | **No** — runtime data, not a prompt |
The file already imports `get_locale`, `set_locale`, `t`, and
`get_language_instruction` from `app.utils.locale`. The locale-capture /
restore plumbing inside `generate_profiles_for_entities` (lines ~910-916)
already propagates the request locale to background-thread workers — no
changes required.
### Locale infrastructure (already in place)
`backend/app/utils/locale.py`:
- `get_language_instruction()` returns the per-locale postfix from
`/locales/languages.json` (e.g. `Please respond in English.` for `en`,
`请使用中文回答。` for `zh`).
- `t(key, **kwargs)` resolves `log.*` keys for backend logger messages;
not used by this issue.
- `set_locale` / `get_locale` are thread-local, with restoration plumbed
into `generate_profiles_for_entities`.
### Sibling specs already shipped
- `i18n-ontology-generator-prompts` (#2 — merged)
- `i18n-simulation-config-generator-prompts` (#4 — merged)
- `i18n-report-agent-prompts` (#5 — merged)
- `i18n-externalize-backend-logs` (#6 — merged; logger keys for
`log.profile_generator.*` are already in `locales/{en,zh}.json`)
The translation pattern they established:
1. Translate the base prompt body (English narrative + headings).
2. Preserve every `get_language_instruction()` call site verbatim so
`Accept-Language: zh` still produces Chinese output.
3. Preserve all `{variable}` interpolations in f-strings.
4. Preserve all locale-independent "lock" rules (e.g. `gender` enum) in
English text within the prompt.
5. No new dependencies, no new files, single-file diff.
This is a direct sibling — same pattern applies.
### Test contract
`backend/scripts/test_profile_format.py`:
- Pytest-collectable function `test_profile_formats`.
- Constructs `OasisAgentProfile` instances directly (no LLM call) and
serializes them via `_save_twitter_csv` / `_save_reddit_json`.
- Verifies CSV header includes `user_id, user_name, name, bio,
friend_count, follower_count, statuses_count, created_at` and JSON
output includes `realname, username, bio, persona`.
- **Does not exercise the prompts.** A pure prompt translation cannot
break it; a refactor of dataclass field names or serializers would.
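To make the CSV half of that contract concrete, here is a rough sketch of a header check of the kind described. It does not call the real `_save_twitter_csv`; `write_profiles` and `read_header` are hypothetical helpers, and only the header column names come from the text above.

```python
import csv
import io

# Column names as listed in the test-contract description above.
EXPECTED_HEADER = [
    "user_id", "user_name", "name", "bio",
    "friend_count", "follower_count", "statuses_count", "created_at",
]


def write_profiles(rows: list[dict]) -> str:
    """Hypothetical serializer: write rows as CSV text with the expected header."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=EXPECTED_HEADER)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def read_header(csv_text: str) -> list[str]:
    """Return the first CSV row, i.e. the header."""
    return next(csv.reader(io.StringIO(csv_text)))


# An empty placeholder row is enough to exercise the header.
sample_row = {column: "" for column in EXPECTED_HEADER}
header = read_header(write_profiles([sample_row]))
```

Nothing in this check depends on prompt text, which is why a prompt-only translation leaves it unaffected.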
### Callers
- `backend/app/services/simulation_manager.py:316` calls
`OasisProfileGenerator(graph_id=state.graph_id)`.
- `backend/app/api/simulation.py:1413` calls `OasisProfileGenerator()`.
Neither caller looks at prompt language; both consume the persona dict
output. No call-site changes are needed.
## 2. Requirement-to-Asset Map
| Req. | Asset / file | Gap |
| --- | --- | --- |
| 1. System prompt → English | `_get_system_prompt` line 664 | **Missing** — Chinese literal needs to become English literal |
| 2. Individual-persona template → English | `_build_individual_persona_prompt` lines 680–714 | **Missing** — Chinese block needs translation; preserve `{...}` interpolations and inline `{get_language_instruction()}` |
| 3. Group-persona template → English | `_build_group_persona_prompt` lines 729–762 | **Missing** — Chinese block needs translation; preserve `{...}` interpolations and inline `{get_language_instruction()}` |
| 4. Locale switching unchanged | `app.utils.locale` + the three `get_language_instruction()` call sites | **Constraint** — code path must stay byte-identical at those call sites |
| 5. Public API stability | `OasisAgentProfile` dataclass + `OasisProfileGenerator` method signatures | **Constraint** — no signatures change |
| 6. Reasoning-model parsing unchanged | `_fix_truncated_json`, `_try_fix_json` | **Constraint** — no edits |
| 7. OASIS schema parity | `_save_twitter_csv`, `_save_reddit_json`, `to_*_format` serializers | **Constraint** — no edits; pytest must continue passing |
| 8. Out-of-scope guard | logger calls, docstrings, comments, rule-based fallback | **Constraint** — explicitly do not edit |
No requirement is blocked or unknown. Every requirement maps to a known
location with a clear, narrow change.
## 3. Implementation Approach Options
### Option A — In-place edit of the three prompt builders (extend existing)
Translate `base_prompt` (1 line), the individual-persona f-string body
(~35 lines), and the group-persona f-string body (~34 lines) directly,
plus the four `"无"` / `"无额外上下文"` fallback literals. Keep all method
bodies otherwise byte-identical.
- **Files touched**: `backend/app/services/oasis_profile_generator.py`
only.
- **Compatibility**: zero API change. All call sites unaffected. Locale
switching preserved by leaving the inline `{get_language_instruction()}`
placeholders untouched.
- **Complexity**: low. Pattern is identical to merged siblings #2, #4,
#5.
**Trade-offs**:
- ✅ Minimal diff, exactly the pattern reviewers expect.
- ✅ No risk to the unrelated rule-based fallback or serialization paths.
- ✅ Out-of-scope items (logger, docstrings, rule-based fallback) are not
touched, so #6/#7 remain clean.
- ❌ Leaves the file mixed-language in non-prompt parts (docstrings, rule
fallback) until #7 lands. Acceptable per scope split.
### Option B — Move prompt strings into module-level constants
Introduce `INDIVIDUAL_PERSONA_PROMPT_TEMPLATE` and
`GROUP_PERSONA_PROMPT_TEMPLATE` constants at module scope (mirroring
`ONTOLOGY_SYSTEM_PROMPT` style in `ontology_generator.py`), and have the
builders `.format(**kwargs)` against them.
- **Files touched**: same single file, but with structural refactor.
- **Compatibility**: still zero public API change, but the diff is
larger and reviewers must verify equivalent behaviour around
`{get_language_instruction()}` (which would need to become a runtime
substitution not an f-string interpolation, since constants don't
re-evaluate per call).
**Trade-offs**:
- ✅ Constants are easier to spot in `git grep`.
- ❌ Larger diff, more review surface.
- ❌ The inline `get_language_instruction()` call is currently captured at
f-string render time; moving to a `.format(...)` template requires
passing the resolved instruction in as a kwarg — a behavioural change
that exceeds "translate prompts only".
- ❌ Diverges from the sibling pattern just shipped (#4, #5 used in-place
edits, not module constants). #2 used module constants but only for the
system prompt — the user-message template was still built inside the
method.
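The evaluation-time concern behind Option B can be illustrated with a small sketch. Everything here is hypothetical: `get_language_instruction` is a stand-in driven by a module-level dict rather than the real thread-local locale, and the one-line templates are invented for brevity.

```python
# Hypothetical stand-in for app.utils.locale.get_language_instruction.
_locale = {"value": "en"}


def get_language_instruction() -> str:
    return {"en": "Please respond in English.", "zh": "请使用中文回答。"}[_locale["value"]]


# Module-level f-string: the call runs once, when the module is imported.
FROZEN_TEMPLATE = f"Entity: {{entity_name}}\n{get_language_instruction()}"

# Plain template: the instruction must be passed in on every call.
LAZY_TEMPLATE = "Entity: {entity_name}\n{language_instruction}"


def build_frozen(entity_name: str) -> str:
    return FROZEN_TEMPLATE.format(entity_name=entity_name)


def build_lazy(entity_name: str) -> str:
    return LAZY_TEMPLATE.format(
        entity_name=entity_name,
        language_instruction=get_language_instruction(),
    )


def build_inline(entity_name: str) -> str:
    # Option A style: the f-string lives inside the method and is rendered per call.
    return f"Entity: {entity_name}\n{get_language_instruction()}"


# Switch locale after import: only the per-call variants pick up the change.
_locale["value"] = "zh"
```

In this sketch `build_frozen` keeps the English instruction after the locale switch, while `build_lazy` and `build_inline` follow it. Option B would therefore need the `build_lazy` shape, which is the extra kwarg plumbing the trade-off list refers to.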
### Option C — Externalize prompt text into `/locales/*.json`
Move every prompt sentence into `locales/en.json` and `locales/zh.json`,
keyed under `prompt.profile_generator.*`, and use `t(key, **vars)` to
resolve.
- **Compatibility**: would address `Accept-Language` purely via the
existing translation mechanism without depending on the
`get_language_instruction()` postfix.
**Trade-offs**:
- ✅ Most i18n-pure approach.
- ❌ Significantly larger diff (touches three files: the source file,
`en.json`, `zh.json`).
- ❌ Diverges from the established project pattern. The sibling specs
(#2, #4, #5) deliberately did **not** externalize prompts — the
project rationale (per `tech.md`) is that backend logger messages are
the i18n surface, while LLM prompts use the `get_language_instruction()`
postfix mechanism.
- ❌ Higher review and merge cost for no operational gain.
## 4. Recommended Approach
**Option A** — single-file in-place edit of the three prompt builders
plus the four `"无"` / `"无额外上下文"` fallback literals.
Rationale:
- Matches the merged sibling specs verbatim (#2, #4, #5) so reviewers
can apply the same mental checklist.
- Smallest possible diff that satisfies every acceptance criterion in
requirements.md.
- Leaves out-of-scope surfaces (logger, docstrings, rule-based
fallback) untouched — clean handoff to #7 and clean separation from
already-merged #6.
- Zero new dependencies, zero new files, zero API change, zero risk to
`test_profile_format.py`.
### Translation choices to lock in during design
1. The system prompt `base_prompt` becomes a single English sentence in
the spirit of the original (expert in social-media persona generation;
detailed and realistic personas for opinion simulation; faithful
reflection of real-world conditions; valid JSON, no unescaped
newlines).
2. The two persona prompt bodies adopt English section headings and
prose. The previously-Chinese hint
`country: 国家(使用中文,如"中国"` is dropped — the
`get_language_instruction()` postfix already steers locale, and the
rule-based fallback (out of scope) handles its own country values.
3. The trailing rules block keeps the locale-independent "lock"
constraints inline (`gender` enum, `age` integer requirement,
`persona` newline rule) and continues to embed
`{get_language_instruction()}` verbatim.
## 5. Effort & Risk
- **Effort**: **S** (1–3 days; realistically <½ day). One-file diff,
established sibling pattern, no new test infrastructure.
- **Risk**: **Low**. The translated prompts touch only the LLM
`messages` payload. The locale-switching pathway, public API,
serializers, retry logic, fallback, and tests are all untouched. The
only failure mode is a mistranslated constraint (e.g. accidentally
dropping `gender ∈ {male, female, other}`), which the design checklist
enumerates and reviewers can verify by diff.
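Part of that review can be mechanized. The sketch below, under stated assumptions, compares the `{...}` interpolation expressions in the source text of an original and a translated template. The helper names and the three short sample templates are invented for illustration and are not the real prompt bodies.

```python
import re

# Matches a single-brace interpolation such as {entity_name}
# or {get_language_instruction()} in template source text.
PLACEHOLDER = re.compile(r"\{([^{}]+)\}")


def placeholders(template_source: str) -> set[str]:
    """Return the set of interpolation expressions found in the template text."""
    return set(PLACEHOLDER.findall(template_source))


def missing_placeholders(original: str, translated: str) -> set[str]:
    """Interpolations present in the original but absent from the translation."""
    return placeholders(original) - placeholders(translated)


# Illustrative template sources (plain strings, so the braces are kept literally).
ORIGINAL = "实体名称: {entity_name}\n实体类型: {entity_type}\n{get_language_instruction()}"
GOOD = "Entity name: {entity_name}\nEntity type: {entity_type}\n{get_language_instruction()}"
BAD = "Entity name: {entity_name}\nEntity type: {entity_type}"

good_missing = missing_placeholders(ORIGINAL, GOOD)
bad_missing = missing_placeholders(ORIGINAL, BAD)
```

Escaped braces around JSON examples (`{{` and `}}` in an f-string) would need extra handling before a check like this could run on the real templates.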
### Research items carried into design phase
- None blocking. The design phase will:
- Enumerate the exact final English text for each of the three blocks.
- Verify each translated block preserves every JSON-output key,
every `{variable}` interpolation, and the inline
`{get_language_instruction()}` call.
- Spot-check that the diff stays within
`backend/app/services/oasis_profile_generator.py`.

@ -0,0 +1,145 @@
# Requirements Document
## Introduction
This specification covers the English translation of the prompt strings in `backend/app/services/oasis_profile_generator.py`. The file converts Graphiti graph entities into OASIS agent persona dictionaries that drive Step 2 (Environment Setup) of the MiroFish pipeline. Today, the system prompt and the two `_build_*_persona_prompt` user-message templates are written in Chinese; the language is steered at runtime by appending `get_language_instruction()` to the system prompt and inside the user prompt body. While that postfix instructs the model *which* language to respond in, the base-prompt language biases the model's structural and lexical output, so persona prose (bio, persona, profession, interested_topics) skews Chinese under `Accept-Language: en`. Translating the base prompts to English removes that bias while preserving the existing locale-switching mechanism for non-English locales (`get_language_instruction()` returns `请使用中文回答。` when locale is `zh`, so a Chinese model response remains achievable from an English base prompt).
This work tracks GitHub issue [#3](https://github.com/salestech-group/MiroFish/issues/3) and is sibling to the already-merged ontology-generator (#2), simulation-config-generator (#4), and report-agent (#5) prompt translation specs.
## Boundary Context
- **In scope**:
- Translating the system-prompt base string in `OasisProfileGenerator._get_system_prompt` (currently `"你是社交媒体用户画像生成专家。…"` at line ~664) from Chinese to English.
- Translating the individual-persona user-message template in `OasisProfileGenerator._build_individual_persona_prompt` (currently lines ~680–714) from Chinese to English.
- Translating the group/institution-persona user-message template in `OasisProfileGenerator._build_group_persona_prompt` (currently lines ~729–762) from Chinese to English.
- Translating the small `attrs_str` and `context_str` fallback default literals (`"无"`, `"无额外上下文"`) to English equivalents.
- Preserving all functional contracts: every `get_language_instruction()` call site, all variable interpolations, all JSON output keys, the `gender` enum constraint, the `age` integer constraint, and the institutional age=30 / gender="other" rule.
- **Out of scope**:
- Logger calls (`logger.info`, `logger.warning`, `logger.error`) and the printed banner text inside `oasis_profile_generator.py` — covered by issue #6.
- Module docstring, class docstrings, method docstrings, and inline comments — covered by issue #7.
- The fallback Chinese string literals embedded in non-prompt code paths (e.g. `f"{entity_name}是一个{entity_type}。"` inside `_try_fix_json` and the rule-based fallback). These are runtime data fallbacks, not LLM prompts; they are deferred to the comments/docstrings work in #7 or to a future cleanup, and they are not user-visible while the LLM path succeeds.
- Refactoring the OASIS profile JSON schema, the `OasisAgentProfile` dataclass, the MBTI list, the `COMMON_COUNTRIES` list, the entity-type taxonomy splits (`PERSONAL_ENTITY_TYPES` vs `GROUP_ENTITY_TYPES`), or persona-generation flow control.
- Changing OASIS profile-format compatibility — verified by `backend/scripts/test_profile_format.py`.
- Editing the locale plumbing block (currently the `current_locale = get_locale()` capture and the `set_locale(current_locale)` call inside `generate_single_profile` around lines ~910–916).
- **Adjacent expectations**:
- The Step 2 environment-setup pipeline must continue to consume the OASIS profile output unchanged. The Reddit (`to_reddit_format`) and Twitter (`to_twitter_format`) serializers are not coupled to prompt language; this is verified via the JSON schema contract preservation.
- The locale resolution chain (`Accept-Language` header → `get_locale()` → `get_language_instruction()`) is owned by `backend/app/utils/locale.py` and is unchanged by this work.
- Companion i18n issues (#6 logs, #7 comments/docstrings, #9 frontend comments, #10 e2e verification, #12 README) operate on different files or scopes and must not be touched here.
## Requirements
### Requirement 1: English Translation of the System Prompt
**Objective:** As a MiroFish operator running the pipeline under `Accept-Language: en`, I want the persona-generation system prompt to be authored in English, so that the LLM's persona prose is not biased toward Chinese structure or word choice.
#### Acceptance Criteria
1. The OASIS Profile Generator shall set the `base_prompt` constant inside `_get_system_prompt` to an English string containing zero Chinese characters.
2. The OASIS Profile Generator shall preserve the system-prompt assembly contract verbatim: the format `f"{base_prompt}\n\n{get_language_instruction()}"` and the call to `get_language_instruction()` at exactly that site.
3. The OASIS Profile Generator shall preserve the role and intent semantics of the original prompt: identifying the model as an expert in social-media user-persona generation, requesting detailed and realistic personas for opinion simulation that reflect existing real-world conditions, and mandating valid JSON output where string values must not contain unescaped newlines.
4. The OASIS Profile Generator shall preserve the function signature `_get_system_prompt(self, is_individual: bool) -> str`.
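The assembly contract in criterion 2 can be pictured with the following sketch. `ProfileGeneratorSketch` is a hypothetical class, the injected `language_instruction` callable is an assumption made so the example runs on its own (the real method calls `get_language_instruction()` directly), and the English `base_prompt` wording is illustrative rather than the final translated text.

```python
from typing import Callable


class ProfileGeneratorSketch:
    """Hypothetical stand-in; only the prompt-assembly shape follows the spec."""

    def __init__(self, language_instruction: Callable[[], str]) -> None:
        # Assumption: injected here for testability; not how the real class is built.
        self._language_instruction = language_instruction

    def _get_system_prompt(self, is_individual: bool) -> str:
        # Illustrative wording only, following the intent listed in criterion 3.
        base_prompt = (
            "You are an expert in social-media user-persona generation. "
            "Generate detailed, realistic personas for opinion simulation that "
            "reflect existing real-world conditions. Return valid JSON; string "
            "values must not contain unescaped newlines."
        )
        # Assembly contract: base prompt, blank line, then the locale postfix.
        return f"{base_prompt}\n\n{self._language_instruction()}"


generator = ProfileGeneratorSketch(lambda: "Please respond in English.")
system_prompt = generator._get_system_prompt(is_individual=True)
```

Because the postfix is resolved at call time, swapping the callable for one that returns `请使用中文回答。` changes only the last line of the prompt.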
### Requirement 2: English Translation of the Individual-Persona User-Message Template
**Objective:** As a MiroFish operator generating personas for individual entities under `Accept-Language: en`, I want the user-message template constructed by `_build_individual_persona_prompt` to be authored in English, so that the rendered prompt does not interleave English `get_language_instruction()` directives with Chinese section headings.
#### Acceptance Criteria
1. The OASIS Profile Generator shall render the individual-persona user message with English section headings and prose in place of the current Chinese (entity name, entity type, entity summary, entity attributes, context section, JSON-fields enumeration, "important" trailing block).
2. The OASIS Profile Generator shall preserve all variable interpolations verbatim by name: `{entity_name}`, `{entity_type}`, `{entity_summary}`, `{attrs_str}`, `{context_str}`, and the inline `{get_language_instruction()}` call inside the trailing rules block.
3. The OASIS Profile Generator shall preserve the JSON output contract enumerated in the prompt: the keys `bio`, `persona`, `age`, `gender`, `mbti`, `country`, `profession`, `interested_topics` (verbatim, English).
4. The OASIS Profile Generator shall preserve the field-level constraints in the prompt:
- `bio` ≈ 200 characters, social-media biography.
- `persona` ≈ 2000 characters, single coherent text covering: basic information (age, profession, education, location), background (notable experience, event association, social ties), personality (MBTI, core traits, emotional expression), social-media behavior (posting frequency, content preferences, interaction style, language traits), stance (attitudes toward the topic, emotional triggers), unique features (catchphrases, special experiences, hobbies), and personal memory (the entity's relation to the event and prior actions/reactions in it).
- `age` MUST be an integer.
- `gender` MUST be one of `"male"` or `"female"` (English enum value, locale-independent).
- `mbti` MUST be an MBTI four-letter type (e.g. INTJ, ENFP).
- `country` MUST be a country name string.
- `profession` MUST be a profession string.
- `interested_topics` MUST be an array.
5. The OASIS Profile Generator shall preserve the meaning of every trailing-block rule: every value is a string or number, no newlines inside string values, `persona` is a single coherent text, `gender` must be the English `male`/`female` enum even when locale is `zh`, content must stay consistent with the source entity, `age` must be a valid integer.
6. The OASIS Profile Generator shall preserve the function signature `_build_individual_persona_prompt(self, entity_name: str, entity_type: str, entity_summary: str, entity_attributes: Dict[str, Any], context: str) -> str`.
7. The OASIS Profile Generator shall preserve the `context[:3000]` truncation behaviour and the conditional fallback (`"无额外上下文"` translated to `"No additional context"`) when `context` is empty/falsy. Likewise, `attrs_str` shall fall back to an English placeholder (`"None"`) when `entity_attributes` is empty/falsy, replacing the current `"无"` literal.
8. The OASIS Profile Generator shall return zero Chinese characters across all string literals contributed to the assembled individual-persona prompt body.
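The fallback and truncation behaviour in criterion 7 could look roughly like this. It is a sketch under assumptions: `build_fallback_strings` is a hypothetical helper, and serializing attributes with `json.dumps` is a guess, since the document does not show how `attrs_str` is built from `entity_attributes`.

```python
import json
from typing import Any, Dict, Tuple

MAX_CONTEXT_CHARS = 3000  # mirrors the context[:3000] truncation in the spec


def build_fallback_strings(
    entity_attributes: Dict[str, Any], context: str
) -> Tuple[str, str]:
    """Return (attrs_str, context_str) with English placeholders for empty input."""
    # Assumption: JSON serialization; the real formatting may differ.
    attrs_str = (
        json.dumps(entity_attributes, ensure_ascii=False)
        if entity_attributes
        else "None"
    )
    context_str = context[:MAX_CONTEXT_CHARS] if context else "No additional context"
    return attrs_str, context_str


empty_case = build_fallback_strings({}, "")
attrs_str, context_str = build_fallback_strings({"role": "student"}, "x" * 5000)
```

Both placeholders are substituted into the prompt body, which is why they count as prompt text even though they sit outside the f-string templates.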
### Requirement 3: English Translation of the Group/Institution-Persona User-Message Template
**Objective:** As a MiroFish operator generating personas for institutional/group entities under `Accept-Language: en`, I want the user-message template constructed by `_build_group_persona_prompt` to be authored in English, so that the rendered prompt does not interleave English `get_language_instruction()` directives with Chinese section headings.
#### Acceptance Criteria
1. The OASIS Profile Generator shall render the group-persona user message with English section headings and prose in place of the current Chinese.
2. The OASIS Profile Generator shall preserve all variable interpolations verbatim by name: `{entity_name}`, `{entity_type}`, `{entity_summary}`, `{attrs_str}`, `{context_str}`, and the inline `{get_language_instruction()}` call inside the trailing rules block.
3. The OASIS Profile Generator shall preserve the JSON output contract enumerated in the prompt: the keys `bio`, `persona`, `age`, `gender`, `mbti`, `country`, `profession`, `interested_topics` (verbatim, English).
4. The OASIS Profile Generator shall preserve the field-level constraints in the prompt:
- `bio` ≈ 200 characters, an official-account biography that reads as professionally appropriate.
- `persona` ≈ 2000 characters, single coherent text covering: institutional basics (formal name, type, founding background, primary functions), account positioning (account type, target audience, core function), voice (language traits, common phrasing, taboo topics), publishing pattern (content types, publishing frequency, active hours), stance (official position on the core topic, controversy-handling style), special notes (group portrait represented, operational habits), and institutional memory (the institution's relation to the event and prior actions/reactions in it).
- `age` MUST be the integer `30` (the institutional virtual-age sentinel).
- `gender` MUST be the literal `"other"` (English enum value, locale-independent), indicating non-individual.
- `mbti` MUST be an MBTI four-letter type used to characterize account voice (e.g. ISTJ for strict/conservative).
- `country` MUST be a country name string.
- `profession` MUST describe institutional function.
- `interested_topics` MUST be an array of focus areas.
5. The OASIS Profile Generator shall preserve the meaning of every trailing-block rule: every value is a string or number, no `null` values, no newlines in string values, `persona` is a single coherent text, `gender` must be the English `"other"` enum even when locale is `zh`, the institutional account voice must match its identity positioning, and `age` must be the integer `30`.
6. The OASIS Profile Generator shall preserve the function signature `_build_group_persona_prompt(self, entity_name: str, entity_type: str, entity_summary: str, entity_attributes: Dict[str, Any], context: str) -> str`.
7. The OASIS Profile Generator shall preserve the `context[:3000]` truncation behaviour and the conditional English-equivalent fallback for empty `context` and empty `entity_attributes`, mirroring Requirement 2.
8. The OASIS Profile Generator shall return zero Chinese characters across all string literals contributed to the assembled group-persona prompt body.
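The "zero Chinese characters" criteria in Requirements 1 to 3 lend themselves to a simple scan. The following is one possible check, not part of the spec: `find_cjk` is a hypothetical helper, and the chosen Unicode ranges (CJK punctuation, Extension A, the main unified ideographs block, and fullwidth forms) are an assumption about what should count as Chinese text here.

```python
import re

# CJK Symbols and Punctuation, CJK Extension A, CJK Unified Ideographs,
# and Halfwidth/Fullwidth Forms.
CJK_PATTERN = re.compile(r"[\u3000-\u303f\u3400-\u4dbf\u4e00-\u9fff\uff00-\uffef]")


def find_cjk(text: str) -> list[str]:
    """Return every character in text that falls in the CJK ranges above."""
    return CJK_PATTERN.findall(text)


clean = find_cjk("Entity name: {entity_name}")
postfix_hits = find_cjk("请使用中文回答。")
hint_hits = find_cjk('country: "中国"')
```

A rendered prompt under locale `zh` will still contain the Chinese postfix from `get_language_instruction()`, so a scan like this is only meaningful on the string literals themselves or on a prompt rendered under locale `en`.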
### Requirement 4: Locale Switching Continues to Work via `get_language_instruction()`
**Objective:** As a MiroFish operator running the pipeline under `Accept-Language: zh` (or any other configured non-English locale), I want generated personas to remain in the requested locale at equivalent quality, so that translating the base prompt does not regress non-English support.
#### Acceptance Criteria
1. The OASIS Profile Generator shall preserve every existing `get_language_instruction()` call site exactly: the system-prompt site in `_get_system_prompt`, the inline call inside the trailing rules block of `_build_individual_persona_prompt`, and the inline call inside the trailing rules block of `_build_group_persona_prompt`.
2. The OASIS Profile Generator shall preserve the locale-capture/restore plumbing inside `generate_profiles_for_entities` (currently the `current_locale = get_locale()` capture and the `set_locale(current_locale)` call inside `generate_single_profile`) — this code is not modified by the change.
3. While the locale is `zh`, the OASIS Profile Generator shall produce profiles whose `bio`, `persona`, `profession`, and `interested_topics` content is in Chinese, equivalent in quality to the pre-change behaviour.
4. While the locale is `en`, the OASIS Profile Generator shall produce profiles whose `bio`, `persona`, `profession`, and `interested_topics` content is in English.
5. While the locale is `en` or `zh`, the OASIS Profile Generator shall produce profiles whose `gender` field is one of the literal English values `"male"`, `"female"` (individual entities) or `"other"` (group entities), regardless of locale.
6. The OASIS Profile Generator shall not alter `backend/app/utils/locale.py`, the `_languages` and `_translations` registries, or the locale files under `/locales/`.
### Requirement 5: Public API and Call-Site Stability
**Objective:** As a developer maintaining the rest of the MiroFish backend pipeline, I want the public surface of `OasisProfileGenerator` and `OasisAgentProfile` to remain unchanged, so that the Step 2 environment-setup flow and existing callers continue to work without modification.
#### Acceptance Criteria
1. The OASIS Profile Generator shall preserve the dataclass `OasisAgentProfile`, including its field set (`user_id`, `user_name`, `name`, `bio`, `persona`, `karma`, `friend_count`, `follower_count`, `statuses_count`, `age`, `gender`, `mbti`, `country`, `profession`, `interested_topics`, `source_entity_uuid`, `source_entity_type`, `created_at`), default values, and the `to_reddit_format`, `to_twitter_format`, `to_full_dict` serializers.
2. The OASIS Profile Generator shall preserve the signatures and call semantics of `OasisProfileGenerator.__init__`, `generate_profile_from_entity`, `generate_profiles_for_entities`, `_call_llm_with_retry`, `_generate_profile_rule_based`, `_get_system_prompt`, `_build_individual_persona_prompt`, `_build_group_persona_prompt`, `_print_generated_profile`, `_fix_truncated_json`, `_try_fix_json`, and `_generate_username`.
3. The OASIS Profile Generator shall preserve the LLM invocation parameters (`temperature`, `max_tokens`, model selection, retry behaviour) at the call sites that consume the prompts produced by the translated builders.
4. The OASIS Profile Generator shall preserve the `PERSONAL_ENTITY_TYPES` and `GROUP_ENTITY_TYPES` taxonomies, the `MBTI_TYPES` list, and the `COMMON_COUNTRIES` list verbatim.
### Requirement 6: Reasoning-Model Output Compatibility
**Objective:** As a MiroFish operator using a reasoning-model provider (e.g. MiniMax, GLM with `<think>` tags or markdown code fences), I want JSON parsing of the persona response to continue working, so that translating the base prompt does not regress provider compatibility.
#### Acceptance Criteria
1. The OASIS Profile Generator shall preserve the existing `_fix_truncated_json` and `_try_fix_json` resilience helpers exactly, including their regex-based extraction of `bio` and `persona` from partial output.
2. If a reasoning-model provider returns truncated, `<think>`-tagged, or markdown-fenced output, then the existing parsing/recovery flow shall continue to apply unchanged.
3. The OASIS Profile Generator shall not introduce any new pre-processing of the LLM response that depends on prompt language.
4. After translation, the OASIS Profile Generator shall continue to round-trip a representative entity through `generate_profile_from_entity` and produce a JSON object with at minimum a non-empty `bio` and a non-empty `persona`, matching the pre-change behaviour.
### Requirement 7: Step 2 Environment-Setup Parity (OASIS Format Compatibility)
**Objective:** As a MiroFish operator validating the change, I want the OASIS subprocess to accept the generated profiles unchanged, so that the translation does not silently break Step 2 → Step 3 hand-off.
#### Acceptance Criteria
1. While `uv run python -m pytest backend/scripts/test_profile_format.py` runs against the changed code, the test suite shall pass with zero regressions versus the pre-change baseline.
2. While a representative Reddit-format profile dictionary is produced under locale `en`, every field name shall match the existing OASIS-required schema: `user_id`, `username`, `name`, `bio`, `persona`, `karma`, `created_at`, plus optional `age`, `gender`, `mbti`, `country`, `profession`, `interested_topics`.
3. While a representative Twitter-format profile dictionary is produced under locale `en`, every field name shall match the existing OASIS-required schema: `user_id`, `username`, `name`, `bio`, `persona`, `friend_count`, `follower_count`, `statuses_count`, `created_at`, plus optional `age`, `gender`, `mbti`, `country`, `profession`, `interested_topics`.
4. The OASIS Profile Generator shall produce `gender` values that are exactly one of `"male"`, `"female"`, `"other"` regardless of locale, satisfying the OASIS subprocess's expected enum.
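The field-name and enum expectations in criteria 2 and 4 can be expressed as a small validator. This is a sketch for illustration: `validate_reddit_profile` is a hypothetical function, the key set is copied from criterion 2, and the sample profile values are made up.

```python
ALLOWED_GENDERS = {"male", "female", "other"}
REQUIRED_REDDIT_KEYS = {
    "user_id", "username", "name", "bio", "persona", "karma", "created_at",
}


def validate_reddit_profile(profile: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means the profile passes."""
    problems = []
    missing = REQUIRED_REDDIT_KEYS - profile.keys()
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    gender = profile.get("gender")
    if gender is not None and gender not in ALLOWED_GENDERS:
        problems.append(f"gender not in enum: {gender!r}")
    age = profile.get("age")
    # bool is a subclass of int in Python, so it is rejected explicitly.
    if age is not None and (isinstance(age, bool) or not isinstance(age, int)):
        problems.append(f"age is not an integer: {age!r}")
    return problems


# Made-up sample values, for illustration only.
good_profile = {
    "user_id": 1, "username": "acme_official", "name": "Acme",
    "bio": "Official account.", "persona": "A formal institutional voice.",
    "karma": 1000, "created_at": "2026-01-01",
    "age": 30, "gender": "other",
}
bad_profile = {k: v for k, v in good_profile.items() if k != "karma"}
bad_profile.update(gender="男", age="30")
```

The `gender` check is deliberately locale-independent: a Chinese-language persona with `gender` set to a Chinese word would fail it, which is the regression the prompt's English "lock" rule is meant to prevent.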
### Requirement 8: Out-of-Scope Surfaces Remain Untouched
**Objective:** As a reviewer of this PR, I want the change to remain narrowly scoped to prompt strings, so that translation responsibilities for adjacent surfaces (issues #6, #7, and the rule-based fallback) are not absorbed into this change.
#### Acceptance Criteria
1. The change shall not modify any `logger.warning(...)`, `logger.info(...)`, `logger.error(...)`, or `logger.debug(...)` call in `oasis_profile_generator.py` (covered by issue #6).
2. The change shall not modify the module docstring, class docstrings, method docstrings, or inline comments in `oasis_profile_generator.py` (covered by issue #7).
3. The change shall not modify the rule-based fallback Chinese fragments inside `_try_fix_json` (e.g. `f"{entity_name}是一个{entity_type}。"`) and the rule-based path inside `_generate_profile_rule_based` — those are runtime data fallbacks, not LLM prompts, and remain out of scope here.
4. The change shall not edit any file outside `backend/app/services/oasis_profile_generator.py` for production code.
5. The change shall not introduce a new dependency or modify `backend/pyproject.toml` / `backend/uv.lock`.
6. The change shall not modify `backend/scripts/test_profile_format.py` (the test is the contract; the implementation must match it).

@ -0,0 +1,222 @@
# Research & Design Decisions — i18n-oasis-profile-generator-prompts
## Summary
- **Feature**: `i18n-oasis-profile-generator-prompts`
- **Discovery Scope**: **Extension** (single-file translation in an existing
brownfield service; sibling pattern already merged in #2, #4, #5)
- **Key Findings**:
- The existing `get_language_instruction()` postfix mechanism (defined in
`backend/app/utils/locale.py`) is the project-canonical way to steer LLM
output language. Translating the base prompt does not interfere with it
and is the same approach taken in already-merged sibling specs.
- The only Chinese surfaces inside the prompt-rendering path are
`_get_system_prompt`, `_build_individual_persona_prompt`,
`_build_group_persona_prompt`, and the four `attrs_str`/`context_str`
fallback literals (`"无"`, `"无额外上下文"`). All other Chinese in the
file is logger messages (already externalized by #6), docstrings/comments
(out-of-scope, #7), or rule-based fallback data (out-of-scope).
- `backend/scripts/test_profile_format.py` does not exercise prompts; it
only constructs `OasisAgentProfile` and round-trips through
`_save_twitter_csv` / `_save_reddit_json`. A pure-translation diff
cannot break it.
## Research Log
### Locale steering mechanism
- **Context**: Confirm that translating the base prompt does not regress
Chinese output under `Accept-Language: zh`.
- **Sources Consulted**:
- `backend/app/utils/locale.py` (lines 50–96).
- `locales/languages.json` (entries for `en` and `zh` with
`llmInstruction` field).
- Sibling spec `i18n-ontology-generator-prompts/design.md` and the
merged commits referenced by it.
- **Findings**:
- `get_language_instruction()` returns `Please respond in English.`
for locale `en`, `请使用中文回答。` for locale `zh`.
- The function is called as an inline f-string interpolation in the
individual-persona and group-persona prompt bodies, and explicitly
appended in `_get_system_prompt`. All three sites must be preserved
byte-for-byte.
- The thread-local locale is captured in
`generate_profiles_for_entities` (line ~910) and restored inside the
worker via `set_locale(current_locale)` (line ~914). This plumbing is
untouched by the change.
- **Implications**:
- Design lock-in: the inline `{get_language_instruction()}` call must
remain in each of the three builders. Removing or renaming it would
silently regress non-English locales.
- The Chinese hint `country: 国家(使用中文,如"中国"` in the original
prompt overrides the locale postfix and forces Chinese output for one
field. The English translation drops that hint so the locale postfix
decides the country language. The rule-based fallback (out of scope)
has its own (Chinese) defaults and is not affected.
### Test contract
- **Context**: Verify that `backend/scripts/test_profile_format.py`
remains green after a prompt-only translation.
- **Sources Consulted**: `backend/scripts/test_profile_format.py`,
`oasis_profile_generator.py:_save_twitter_csv`,
`oasis_profile_generator.py:_save_reddit_json`,
`oasis_profile_generator.py:to_reddit_format`,
`oasis_profile_generator.py:to_twitter_format`.
- **Findings**:
- The pytest function `test_profile_formats` constructs
`OasisAgentProfile` instances directly without invoking the LLM.
- It calls `_save_twitter_csv` and `_save_reddit_json` to verify CSV
and JSON shape. Required CSV header: `user_id, user_name, name, bio,
friend_count, follower_count, statuses_count, created_at`. Required
JSON keys: `realname, username, bio, persona`.
- **Implications**:
- Translating prompts cannot regress this test. The validation
requirement (Requirement 7) is satisfied automatically as long as
serializer code is not edited.
- No new tests are required for this change.
### Sibling specs already shipped
- **Context**: Confirm there is an established project pattern this work
must mirror.
- **Sources Consulted**:
- `.kiro/specs/i18n-ontology-generator-prompts/{design,tasks,requirements}.md`
- `.kiro/specs/i18n-report-agent-prompts/`
- `.kiro/specs/i18n-simulation-config-generator-prompts/`
- Recent merged commits referencing #2, #4, #5.
- **Findings**:
- All three siblings used a single-file in-place translation diff.
- All three preserved every `get_language_instruction()` call site.
- All three left logger calls and docstrings to companion issues
(#6 / #7).
- None externalized prompts to `/locales/*.json`.
- **Implications**:
- The same approach is correct here. Reviewer expectations are set by
the sibling diffs.
### OASIS profile schema
- **Context**: Verify that translated prompts continue to satisfy the
OASIS subprocess's expected schema (especially `gender` enum and
`age` integer).
- **Sources Consulted**: `OasisAgentProfile` dataclass,
`to_reddit_format`, `to_twitter_format`, sibling `_generate_profile_rule_based`.
- **Findings**:
- OASIS-required fields are produced by serializers, not by the
prompt: `user_id`, `username`, `name`, `bio`, `karma`/`friend_count`/`follower_count`/`statuses_count`, `created_at`.
- The prompt-defined fields land in optional positions: `age`,
`gender`, `mbti`, `country`, `profession`, `interested_topics`.
- The `gender` enum constraint (`"male"`/`"female"` for individuals,
`"other"` for groups) is locale-independent and must remain in
English text inside the translated prompt.
- **Implications**:
- The English prompt must explicitly call out `gender ∈ {male, female}`
(individual) and `gender == "other"` (group), independent of the
`get_language_instruction()` postfix.
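
A minimal sketch of how the two locale-independent constraints could be checked on a parsed LLM response. The function name `check_locked_fields` and the violation messages are hypothetical; only the `gender` values and the `age` integer rule come from this document.

```python
from typing import Any, Dict, List

INDIVIDUAL_GENDERS = {"male", "female"}
GROUP_GENDERS = {"other"}


def check_locked_fields(profile: Dict[str, Any], is_individual: bool) -> List[str]:
    """Return a list of violations; an empty list means both constraints hold."""
    problems = []
    age = profile.get("age")
    # bool is a subclass of int in Python, so reject it explicitly.
    if not isinstance(age, int) or isinstance(age, bool):
        problems.append(f"age must be an integer, got {age!r}")
    allowed = INDIVIDUAL_GENDERS if is_individual else GROUP_GENDERS
    gender = profile.get("gender")
    if gender not in allowed:
        problems.append(f"gender must be one of {sorted(allowed)}, got {gender!r}")
    return problems
```

A response such as `{"age": "27", "gender": "女"}` would fail both checks regardless of the active locale, which is the behaviour the English lock-in in the prompt is meant to prevent.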
## Architecture Pattern Evaluation
| Option | Description | Strengths | Risks / Limitations | Notes |
|--------|-------------|-----------|---------------------|-------|
| **A — In-place builder edit** | Translate three method bodies + four fallback literals directly | Smallest diff; matches sibling pattern; zero API change | None of note | **Selected** |
| B — Module-level constants | Hoist prompts to `INDIVIDUAL_PERSONA_PROMPT_TEMPLATE` etc. | Easier `git grep` | Larger diff; the inline `{get_language_instruction()}` call would need to become a `.format()` kwarg, which is a behavioural change beyond translation | Diverges from #4 / #5 |
| C — Externalize to `locales/*.json` | Move every prompt sentence into `t(...)` keys | Most i18n-pure | Three-file diff; diverges from project rationale (prompts use postfix mechanism, not key files) | Rejected |
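
The behavioural difference noted for Option B can be illustrated with a small sketch. The `get_language_instruction` function below is a hypothetical stand-in for the real helper, and `GROUP_TEMPLATE`, `build_inline`, and `build_from_constant` are illustrative names that do not exist in the codebase.

```python
def get_language_instruction() -> str:
    """Hypothetical stand-in for the helper in backend/app/utils/locale.py."""
    return "Please respond in English."


def build_inline(entity_name: str) -> str:
    # Option A: the call sits inside the f-string and runs each time it is built.
    return f"Entity name: {entity_name}\n- {get_language_instruction()}"


# Option B: a module-level constant is a plain str and cannot embed a call,
# so the instruction has to be supplied as a named field instead.
GROUP_TEMPLATE = "Entity name: {entity_name}\n- {language_instruction}"


def build_from_constant(entity_name: str) -> str:
    return GROUP_TEMPLATE.format(
        entity_name=entity_name,
        language_instruction=get_language_instruction(),
    )


def call_inside_format_template_fails() -> bool:
    """str.format treats 'get_language_instruction()' as a field name, not a call."""
    try:
        "{get_language_instruction()}".format()
    except KeyError:
        return True
    return False
```

Both builders produce the same text here, but Option B moves the call out of the template and into every call site, which is the extra change the table flags.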
## Design Decisions
### Decision: In-place edit of the three prompt builders (Option A)
- **Context**: Three methods build prompt strings; one of them is a
one-line system prompt, the other two are large f-string templates
with embedded `{variable}` interpolations and an inline
`{get_language_instruction()}` call.
- **Alternatives Considered**:
1. Option B — module-level constants.
2. Option C — externalize to `/locales/*.json` keys.
- **Selected Approach**: Translate each method body in place. Replace
the four `"无"` / `"无额外上下文"` fallbacks with English equivalents
(`"None"` and `"No additional context"`). Preserve all `{...}`
interpolations and the inline `{get_language_instruction()}` call.
- **Rationale**: Matches merged sibling specs verbatim. Smallest review
surface. Zero API change. Out-of-scope surfaces (logger, docstrings,
rule-based fallback) cleanly avoided.
- **Trade-offs**: Leaves the file mixed-language in non-prompt parts
(docstrings, rule fallback) until #7 lands. Acceptable per scope
split.
- **Follow-up**: During implementation, run a regex audit for any
Chinese codepoints inside the three method bodies after the edit and
confirm the diff stays within
`backend/app/services/oasis_profile_generator.py`.
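
One way to run the follow-up audit is to parse the file and inspect only the string literals inside the named methods. The sketch below is an assumption about how such an audit could look, not the project's guard script; `cjk_literals` is a hypothetical helper, and it skips docstrings because those are out of scope for this change.

```python
import ast
import re
from typing import Iterable, List, Tuple

CJK = re.compile(r"[\u4e00-\u9fff]")  # CJK Unified Ideographs, U+4E00 to U+9FFF


def cjk_literals(source: str, method_names: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (method, literal) pairs for CJK string literals, ignoring docstrings."""
    wanted = set(method_names)
    hits = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.FunctionDef) or node.name not in wanted:
            continue
        body = node.body
        first = body[0] if body else None
        # A leading string expression is the docstring; leave it out of the audit.
        if (isinstance(first, ast.Expr) and isinstance(first.value, ast.Constant)
                and isinstance(first.value.value, str)):
            body = body[1:]
        for stmt in body:
            for sub in ast.walk(stmt):
                if (isinstance(sub, ast.Constant) and isinstance(sub.value, str)
                        and CJK.search(sub.value)):
                    hits.append((node.name, sub.value))
    return hits


SAMPLE = '''
class Demo:
    def _get_system_prompt(self):
        """获取系统提示词"""
        return "You are an expert."

    def _build_group_persona_prompt(self, entity_name):
        attrs_str = "无"
        return f"Entity name: {entity_name} {attrs_str}"
'''

print(cjk_literals(SAMPLE, ["_get_system_prompt", "_build_group_persona_prompt"]))
```

On the sample, the Chinese docstring is ignored and only the untranslated `attrs_str` fallback is reported. The literal parts of an f-string are visited as ordinary string constants, so the large templates are covered too.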
### Decision: Drop the "use Chinese country names" hint
- **Context**: The current prompt reads
`country: 国家(使用中文,如"中国")` ("country (use Chinese, e.g.
'China')") at both line 704 and line 753. This forces Chinese for the
`country` field even under `Accept-Language: en`.
- **Alternatives Considered**:
1. Translate to English literally:
`country: country (use English, e.g. "China")`.
2. Drop the language hint entirely:
`country: country name string`.
- **Selected Approach**: Drop the language hint. Let
`get_language_instruction()` steer the country language alongside
every other free-text field.
- **Rationale**: Hard-coding a language in the prompt defeats the
locale-steering mechanism. The rule-based fallback (out of scope)
carries its own Chinese defaults; under the LLM path, locale should
decide.
- **Trade-offs**: Under `Accept-Language: zh`, the LLM may produce a
Chinese country name (e.g. `中国`) — this is the desired behaviour.
Under `Accept-Language: en`, the LLM produces English (`China`),
matching `COUNTRIES = ["China", "US", ...]` already in the file.
- **Follow-up**: Verify in the validation phase that a sample run under
locale `en` produces an English country name.
### Decision: Keep `gender` enum constraint in English inside the prompt
- **Context**: `gender` must be one of `"male"`/`"female"`/`"other"`
regardless of locale, because OASIS consumers and the
`_generate_profile_rule_based` fallback assume English values.
- **Alternatives Considered**: None — the constraint is a contract.
- **Selected Approach**: The translated prompt explicitly states the
enum in English, even when the locale postfix asks for Chinese
output: `gender MUST be one of "male" or "female" (English literal)`.
- **Rationale**: Same as the existing Chinese prompt (which already
states `必须是英文: "male" 或 "female"`). The translation preserves
the same lock-in.
- **Trade-offs**: None.
- **Follow-up**: Validation phase will check that under both locales
the produced `gender` is one of the three English literals.
## Risks & Mitigations
- **Risk**: Mistranslation drops a locale-independent constraint
(e.g. `gender` enum, `age` integer rule, `persona` no-newline rule).
- **Mitigation**: The implementation task list will enumerate every
constraint inline so reviewers can check by diff.
- **Risk**: Variable-name typo inside an f-string causes a `NameError`
at runtime.
- **Mitigation**: Implementation task verifies that the set of
`{variable}` interpolations in each translated block matches the
pre-change set 1:1; a `python -c "import ..."` smoke import and a
`pytest backend/scripts/test_profile_format.py` run are mandatory.
- **Risk**: Accidentally leaving a CJK codepoint inside the three
builders.
- **Mitigation**: Final implementation step runs the project's
repo-level CJK guard regex (added by #26) constrained to the three
builders' line ranges.
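
The 1:1 interpolation check from the second mitigation can be sketched by parsing the source text of each f-string and counting its `{...}` expressions. The helper name `interpolations` is hypothetical, and `ast.unparse` requires Python 3.9 or later.

```python
import ast
from collections import Counter


def interpolations(fstring_source: str) -> Counter:
    """Count each `{...}` expression in the source text of one f-string literal."""
    tree = ast.parse(fstring_source, mode="eval")
    return Counter(
        ast.unparse(node.value)
        for node in ast.walk(tree)
        if isinstance(node, ast.FormattedValue)
    )


# Abbreviated before/after lines, written as source text rather than evaluated.
OLD = 'f"实体名称: {entity_name} - {get_language_instruction()}"'
NEW = 'f"Entity name: {entity_name} - {get_language_instruction()}"'
TYPO = 'f"Entity name: {entity_nme} - {get_language_instruction()}"'

print(interpolations(OLD) == interpolations(NEW))
print(interpolations(OLD) == interpolations(TYPO))
```

Comparing counters rather than sets also catches an interpolation that was accidentally duplicated or dropped during translation.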
## References
- `backend/app/services/oasis_profile_generator.py` — target file.
- `backend/app/utils/locale.py` — locale infrastructure.
- `locales/languages.json`, `locales/en.json`, `locales/zh.json`
locale registries.
- `.kiro/specs/i18n-ontology-generator-prompts/` — sibling spec #2.
- `.kiro/specs/i18n-simulation-config-generator-prompts/` — sibling
spec #4.
- `.kiro/specs/i18n-report-agent-prompts/` — sibling spec #5.
- GitHub issue
[#3](https://github.com/salestech-group/MiroFish/issues/3).

@ -0,0 +1,23 @@
{
"feature_name": "i18n-oasis-profile-generator-prompts",
"created_at": "2026-05-08T05:26:06Z",
"updated_at": "2026-05-08T05:30:00Z",
"language": "en",
"phase": "tasks-generated",
"ticket": 3,
"approvals": {
"requirements": {
"generated": true,
"approved": true
},
"design": {
"generated": true,
"approved": true
},
"tasks": {
"generated": true,
"approved": true
}
},
"ready_for_implementation": true
}

@ -0,0 +1,66 @@
# Implementation Plan
- [x] 1. Translate the system-prompt builder to English
- Replace the Chinese `base_prompt` literal inside `_get_system_prompt` (currently `"你是社交媒体用户画像生成专家。…"` at line ~664) with an English rendering that conveys the same role and intent: identifies the model as an expert in social-media user-persona generation, asks for detailed and realistic personas suitable for opinion-simulation that faithfully reflect existing real-world conditions, mandates valid JSON output, and forbids unescaped newlines inside string values
- Preserve the assembled return shape `f"{base_prompt}\n\n{get_language_instruction()}"` exactly — the call to `get_language_instruction()` is unchanged in name and position
- Preserve the method signature `_get_system_prompt(self, is_individual: bool) -> str`; do not branch on `is_individual` (current behaviour preserved)
- Observable completion: `_get_system_prompt(True)` and `_get_system_prompt(False)` both return non-empty English strings ending with the per-locale postfix from `get_language_instruction()`; the `base_prompt` body contains zero CJK characters
- _Requirements: 1.1, 1.2, 1.3, 1.4_
- [x] 2. Translate the individual-persona user-message builder to English
- Replace the Chinese f-string body inside `_build_individual_persona_prompt` (currently lines ~680–714) with an English rendering structured as: a lead sentence requesting a detailed social-media persona faithful to existing reality; an entity-context block with English labels for `entity_name`, `entity_type`, `entity_summary`, `entity_attributes`; a `Context information:` block; a `Generate JSON with the following fields:` enumeration of the eight output keys (`bio`, `persona`, `age`, `gender`, `mbti`, `country`, `profession`, `interested_topics`); and a trailing `Important:` rules block
- Translate the field-level descriptions verbatim in spirit: `bio` ≈ 200 chars; `persona` ≈ 2000 chars covering basic info (age, profession, education, location), background (notable experience, event association, social ties), personality (MBTI, core traits, emotional expression), social-media behaviour (posting frequency, content preferences, interaction style, language traits), stance (attitudes toward the topic, emotional triggers), unique features (catchphrases, special experiences, hobbies), and personal memory (the entity's relation to the event and prior actions/reactions); `age` integer; `gender` MUST be the literal `"male"` or `"female"`; `mbti` four-letter type; `country` country name; `profession`; `interested_topics` array
- Translate the trailing rules block to English while keeping every locale-independent constraint intact: all values are strings or numbers; `persona` is a single coherent text without unescaped newlines; the inline `{get_language_instruction()}` call remains followed by the parenthetical reminder that `gender` MUST use the English values `"male"` / `"female"`; content stays consistent with the entity; `age` MUST be a valid integer
- Replace the `attrs_str` and `context_str` Chinese fallback defaults with English: `"无"` → `"None"` (used when `entity_attributes` is empty/falsy) and `"无额外上下文"` → `"No additional context"` (used when `context` is empty/falsy)
- Drop the country-language hint `(使用中文,如"中国")` so `get_language_instruction()` steers the country language; preserve the country line as a neutral `country: country name` entry
- Preserve every f-string interpolation by name and position: `{entity_name}`, `{entity_type}`, `{entity_summary}`, `{attrs_str}`, `{context_str}`, `{get_language_instruction()}`
- Preserve the `context[:3000]` truncation behaviour and the method signature `_build_individual_persona_prompt(self, entity_name: str, entity_type: str, entity_summary: str, entity_attributes: Dict[str, Any], context: str) -> str`
- Observable completion: calling `_build_individual_persona_prompt("Alice", "Student", "summary", {"k": "v"}, "ctx")` returns a non-empty English string with all six interpolations resolved, with zero CJK characters in any literal contributed by this method, and the string contains the `gender` enum lock-in `"male"` / `"female"` exactly once
- _Requirements: 2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 2.7, 2.8, 4.1, 4.5_
- [x] 3. Translate the group/institution-persona user-message builder to English
- Replace the Chinese f-string body inside `_build_group_persona_prompt` (currently lines ~729–762) with an English rendering structured the same way as Task 2 but adapted for institutional voice: lead sentence requesting a detailed social-media account profile for an institution/group faithful to existing reality; entity-context block; `Context information:` block; `Generate JSON with the following fields:` enumeration of the eight output keys; trailing `Important:` rules block
- Translate the field-level descriptions verbatim in spirit: `bio` ≈ 200 chars in an official-account voice; `persona` ≈ 2000 chars covering institutional basics (formal name, type, founding background, primary functions), account positioning (account type, target audience, core function), voice (language traits, common phrasing, taboo topics), publishing pattern (content types, publishing frequency, active hours), stance (official position on the core topic, controversy-handling style), special notes (group portrait represented, operational habits), and institutional memory (the institution's relation to the event and prior actions/reactions); `age` MUST be the integer `30`; `gender` MUST be the literal `"other"`; `mbti` four-letter type characterizing account voice; `country`; `profession` describes institutional function; `interested_topics` array
- Translate the trailing rules block to English while keeping every locale-independent constraint intact: all values are strings or numbers, no `null` allowed; `persona` is a single coherent text without unescaped newlines; the inline `{get_language_instruction()}` call remains followed by the parenthetical reminder that `gender` MUST use the English value `"other"`; `age` MUST be the integer `30` and `gender` MUST be the string `"other"`; account voice must match identity positioning
- Replace the `attrs_str` and `context_str` Chinese fallback defaults with the same English replacements applied in Task 2 (`"None"` and `"No additional context"`)
- Drop the country-language hint as in Task 2
- Preserve every f-string interpolation by name and position: `{entity_name}`, `{entity_type}`, `{entity_summary}`, `{attrs_str}`, `{context_str}`, `{get_language_instruction()}`
- Preserve the `context[:3000]` truncation behaviour and the method signature `_build_group_persona_prompt(self, entity_name: str, entity_type: str, entity_summary: str, entity_attributes: Dict[str, Any], context: str) -> str`
- Observable completion: calling `_build_group_persona_prompt("ACME Corp", "Organization", "summary", {"k": "v"}, "ctx")` returns a non-empty English string with all six interpolations resolved, with zero CJK characters in any literal contributed by this method, and the string contains both the `age == 30` lock-in and the `gender == "other"` lock-in
- _Requirements: 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7, 3.8, 4.1, 4.5_
- [x] 4. Confirm boundary commitments around the translation
- Confirm every existing `get_language_instruction()` call site is preserved verbatim: the system-prompt assembly inside `_get_system_prompt`, the inline call inside the trailing rules block of `_build_individual_persona_prompt`, and the inline call inside the trailing rules block of `_build_group_persona_prompt`
- Confirm the locale-thread plumbing in `generate_profiles_for_entities` (capture `current_locale = get_locale()` at line ~910 and `set_locale(current_locale)` inside the worker at line ~914) is byte-identical
- Confirm the public signatures of `OasisProfileGenerator.__init__`, `generate_profile_from_entity`, `generate_profiles_for_entities`, `set_graph_id`, and the private helpers `_call_llm_with_retry`, `_generate_profile_rule_based`, `_print_generated_profile`, `_fix_truncated_json`, `_try_fix_json`, `_save_twitter_csv`, `_save_reddit_json`, `_generate_username` are unchanged
- Confirm the `OasisAgentProfile` dataclass field set, default values, and the `to_reddit_format`, `to_twitter_format`, `to_full_dict` serializers are unchanged
- Confirm class constants `MBTI_TYPES`, `COUNTRIES`, `INDIVIDUAL_ENTITY_TYPES`, `GROUP_ENTITY_TYPES` are unchanged
- Confirm the LLM invocation parameters at the call site that consumes the translated prompts (`response_format={"type": "json_object"}`, `temperature=0.7 - (attempt * 0.1)`, `max_attempts=3`) are unchanged
- Confirm `_fix_truncated_json` and `_try_fix_json` (including their Chinese persona fragments such as `f"{entity_name}是一个{entity_type}。"`) are not modified — these are runtime data fallbacks, not prompts, and are out of scope
- Confirm `_generate_profile_rule_based` is not modified — including its Chinese country defaults `"中国"` at lines ~807 and ~819
- Confirm `backend/app/utils/locale.py`, `/locales/languages.json`, `/locales/en.json`, and `/locales/zh.json` are not modified
- Confirm `logger.warning(...)`, `logger.info(...)`, `logger.error(...)`, the print banner at line ~945, module / class / method docstrings, and inline comments in `oasis_profile_generator.py` are not modified (owned by issues #6 and #7)
- Confirm `backend/scripts/test_profile_format.py`, `backend/pyproject.toml`, `backend/uv.lock`, and any file outside `backend/app/services/oasis_profile_generator.py` are not modified
- Observable completion: a `git diff` review against `main` shows changes only inside `backend/app/services/oasis_profile_generator.py` and, within that file, only inside `_get_system_prompt`, `_build_individual_persona_prompt`, and `_build_group_persona_prompt`; the surrounding lines (method headers, neighbouring methods) are byte-identical
- _Requirements: 1.4, 2.6, 3.6, 4.1, 4.2, 4.6, 5.1, 5.2, 5.3, 5.4, 6.1, 6.3, 8.1, 8.2, 8.3, 8.4, 8.5, 8.6_
- [x] 5. Verify smoke import and OASIS profile-format pytest
- Run `cd backend && uv run python -c "from app.services.oasis_profile_generator import OasisProfileGenerator, OasisAgentProfile"` and confirm it exits 0 (catches f-string syntax errors)
- Run `cd backend && uv run python -m pytest scripts/test_profile_format.py` (or equivalent invocation per project convention) and confirm it passes; the test does not exercise prompts, so a pure-translation diff must keep it green
- Construct an instance of `OasisProfileGenerator` (using `OasisProfileGenerator.__new__(OasisProfileGenerator)` to skip `__init__` if the LLM key is unavailable, mirroring the pattern in `test_profile_format.py`) and confirm `_get_system_prompt(True)`, `_build_individual_persona_prompt("Alice", "Student", "summary", {"k": "v"}, "ctx")`, and `_build_group_persona_prompt("ACME", "Organization", "summary", {"k": "v"}, "ctx")` each return a string with zero CJK matches against the regex `[一-鿿]`
- Observable completion: smoke import exits 0; pytest passes with zero regressions; the three prompt-builder calls each produce English-only output under the default `zh` locale (the `get_language_instruction()` postfix at the end is the only place where Chinese is allowed to appear, and only when locale is `zh`)
- _Requirements: 6.4, 7.1, 7.2, 7.3, 7.4_
- [x] 6. Verify locale-driven output language under both `en` and `zh`
- With the thread-local locale forced via `set_locale("en")`, render each of the three builders against representative inputs and confirm: each output contains zero CJK characters; each carries the English locale postfix `"Please respond in English."` (at the end of the system prompt, inline in the `Important:` block of the two user-message builders); the `gender` enum constraint appears as English `"male"` / `"female"` (individual) or `"other"` (group)
- With `set_locale("zh")`, render the same three builders and confirm: the per-prompt body remains English-only (the translated base prompt does not depend on locale); each carries the Chinese locale postfix `"请使用中文回答。"` in the same position; the `gender` enum constraint still appears as the English literal values
- Optionally, with a configured LLM key, run `OasisProfileGenerator().generate_profile_from_entity(...)` end-to-end under each locale against a synthetic `EntityNode` and spot-check that the produced `bio`, `persona`, `profession` are English under `en` and Chinese under `zh`, while `gender` is one of the three English enum literals under both
- Observable completion: the locale-`en` rendering is CJK-free in the prompt body and carries the English locale postfix; the locale-`zh` rendering preserves the prompt body in English and carries the Chinese locale postfix; if the LLM round-trip is exercised, results are recorded in the PR description
- _Requirements: 4.3, 4.4, 4.5_
- [x] 7. Final CJK regression sweep on the three builders
- Run a regex audit limited to the three method bodies (`_get_system_prompt`, `_build_individual_persona_prompt`, `_build_group_persona_prompt`) using the project-level CJK guard regex (`[一-鿿]`) and confirm zero matches inside their string literals
- Run a CJK audit on the rendered output of the three builders for representative inputs and confirm zero matches in the prompt body (the locale postfix is excluded; its Chinese form under `zh` is deliberately retained)
- Confirm the file-level `git grep -nP '[\x{4e00}-\x{9fff}]' -- backend/app/services/oasis_profile_generator.py` output still flags only known out-of-scope locations: docstrings, comments, logger keys, rule-based fallback country `"中国"` defaults, and resilience-helper Chinese fragments, and does not flag any line inside the three translated method bodies
- Observable completion: the targeted regex audit returns zero matches inside the three method bodies; the file-level audit's residual CJK lines all fall outside the three method bodies and match the out-of-scope inventory in `design.md` § Boundary Commitments → Out of Boundary
- _Requirements: 1.1, 2.8, 3.8, 8.1, 8.2, 8.3_
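
The locale checks in Tasks 5 to 7 can be sketched as below. The `set_locale` and `get_language_instruction` functions here are hypothetical stand-ins for the helpers in `backend/app/utils/locale.py`, the prompt body is abbreviated, and `body_is_cjk_free` is an illustrative name. Only the two postfix strings and the default `zh` locale are taken from the tasks above.

```python
import re
import threading

CJK = re.compile(r"[\u4e00-\u9fff]")

# Stand-in locale state; the real module's internals are not shown in this spec.
_state = threading.local()
_POSTFIX = {"en": "Please respond in English.", "zh": "请使用中文回答。"}


def set_locale(locale: str) -> None:
    _state.locale = locale


def get_language_instruction() -> str:
    return _POSTFIX[getattr(_state, "locale", "zh")]


def get_system_prompt() -> str:
    # Same assembly shape as `_get_system_prompt`, with a shortened body.
    base_prompt = "You are an expert in social-media user-persona generation."
    return f"{base_prompt}\n\n{get_language_instruction()}"


def body_is_cjk_free(rendered: str, locale: str) -> bool:
    """True if the locale postfix is present and the remaining text has no CJK."""
    postfix = _POSTFIX[locale]
    if postfix not in rendered:
        return False
    return CJK.search(rendered.replace(postfix, "")) is None


for locale in ("en", "zh"):
    set_locale(locale)
    print(locale, body_is_cjk_free(get_system_prompt(), locale))
```

Removing the postfix before searching keeps the check valid for the two user-message builders as well, where the postfix sits inside the `Important:` block rather than at the end of the string.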

@ -661,9 +661,9 @@ class OasisProfileGenerator:
def _get_system_prompt(self, is_individual: bool) -> str:
"""获取系统提示词"""
base_prompt = "你是社交媒体用户画像生成专家。生成详细、真实的人设用于舆论模拟,最大程度还原已有现实情况。必须返回有效的JSON格式所有字符串值不能包含未转义的换行符。"
base_prompt = "You are an expert in social-media user-persona generation. Produce detailed, realistic personas for opinion simulation that faithfully reflect existing real-world conditions. You MUST return valid JSON; no string value may contain unescaped newlines."
return f"{base_prompt}\n\n{get_language_instruction()}"
def _build_individual_persona_prompt(
self,
entity_name: str,
@ -673,44 +673,44 @@ class OasisProfileGenerator:
context: str
) -> str:
"""构建个人实体的详细人设提示词"""
attrs_str = json.dumps(entity_attributes, ensure_ascii=False) if entity_attributes else ""
context_str = context[:3000] if context else "无额外上下文"
return f"""为实体生成详细的社交媒体用户人设,最大程度还原已有现实情况。
实体名称: {entity_name}
实体类型: {entity_type}
实体摘要: {entity_summary}
实体属性: {attrs_str}
attrs_str = json.dumps(entity_attributes, ensure_ascii=False) if entity_attributes else "None"
context_str = context[:3000] if context else "No additional context"
上下文信息:
return f"""Generate a detailed social-media user persona for the entity, faithfully reflecting existing real-world conditions.
Entity name: {entity_name}
Entity type: {entity_type}
Entity summary: {entity_summary}
Entity attributes: {attrs_str}
Context information:
{context_str}
请生成JSON包含以下字段:
Generate JSON with the following fields:
1. bio: 社交媒体简介200
2. persona: 详细人设描述2000字的纯文本需包含:
- 基本信息年龄职业教育背景所在地
- 人物背景重要经历与事件的关联社会关系
- 性格特征MBTI类型核心性格情绪表达方式
- 社交媒体行为发帖频率内容偏好互动风格语言特点
- 立场观点对话题的态度可能被激怒/感动的内容
- 独特特征口头禅特殊经历个人爱好
- 个人记忆人设的重要部分要介绍这个个体与事件的关联以及这个个体在事件中的已有动作与反应
3. age: 年龄数字必须是整数
4. gender: 性别必须是英文: "male" "female"
5. mbti: MBTI类型如INTJENFP等
6. country: 国家使用中文"中国"
7. profession: 职业
8. interested_topics: 感兴趣话题数组
1. bio: social-media biography, ~200 characters
2. persona: detailed persona description (~2000 characters of plain text), covering:
- Basic information (age, profession, education, location)
- Background (notable experience, association with the event, social ties)
- Personality (MBTI type, core traits, emotional expression)
- Social-media behavior (posting frequency, content preferences, interaction style, language traits)
- Stance (attitudes toward the topic, content likely to anger or move them)
- Unique features (catchphrases, special experiences, hobbies)
- Personal memory (a key part of the persona: this individual's relation to the event and prior actions/reactions in it)
3. age: age number (MUST be an integer)
4. gender: gender, MUST be one of the English literals: "male" or "female"
5. mbti: MBTI type (e.g. INTJ, ENFP)
6. country: country name
7. profession: profession
8. interested_topics: array of interest topics
重要:
- 所有字段值必须是字符串或数字不要使用换行符
- persona必须是一段连贯的文字描述
- {get_language_instruction()} (gender字段必须用英文male/female)
- 内容要与实体信息保持一致
- age必须是有效的整数gender必须是"male""female"
Important:
- All field values MUST be strings or numbers; do not use unescaped newlines.
- persona MUST be a single coherent block of text.
- {get_language_instruction()} (gender field MUST use the English values "male" or "female")
- Content must remain consistent with the entity information.
- age MUST be a valid integer; gender MUST be "male" or "female".
"""
def _build_group_persona_prompt(
@ -722,44 +722,44 @@ class OasisProfileGenerator:
context: str
) -> str:
"""构建群体/机构实体的详细人设提示词"""
attrs_str = json.dumps(entity_attributes, ensure_ascii=False) if entity_attributes else ""
context_str = context[:3000] if context else "无额外上下文"
return f"""为机构/群体实体生成详细的社交媒体账号设定,最大程度还原已有现实情况。
实体名称: {entity_name}
实体类型: {entity_type}
实体摘要: {entity_summary}
实体属性: {attrs_str}
attrs_str = json.dumps(entity_attributes, ensure_ascii=False) if entity_attributes else "None"
context_str = context[:3000] if context else "No additional context"
上下文信息:
return f"""Generate a detailed social-media account profile for the institution/group entity, faithfully reflecting existing real-world conditions.
Entity name: {entity_name}
Entity type: {entity_type}
Entity summary: {entity_summary}
Entity attributes: {attrs_str}
Context information:
{context_str}
请生成JSON包含以下字段:
Generate JSON with the following fields:
1. bio: 官方账号简介200专业得体
2. persona: 详细账号设定描述2000字的纯文本需包含:
- 机构基本信息正式名称机构性质成立背景主要职能
- 账号定位账号类型目标受众核心功能
- 发言风格语言特点常用表达禁忌话题
- 发布内容特点内容类型发布频率活跃时间段
- 立场态度对核心话题的官方立场面对争议的处理方式
- 特殊说明代表的群体画像运营习惯
- 机构记忆机构人设的重要部分要介绍这个机构与事件的关联以及这个机构在事件中的已有动作与反应
3. age: 固定填30机构账号的虚拟年龄
4. gender: 固定填"other"机构账号使用other表示非个人
5. mbti: MBTI类型用于描述账号风格如ISTJ代表严谨保守
6. country: 国家使用中文"中国"
7. profession: 机构职能描述
8. interested_topics: 关注领域数组
1. bio: official-account biography, ~200 characters, professional and appropriate
2. persona: detailed account-profile description (~2000 characters of plain text), covering:
- Institutional basics (formal name, institution type, founding background, primary functions)
- Account positioning (account type, target audience, core function)
- Voice (language traits, common phrasing, taboo topics)
- Publishing pattern (content types, publishing frequency, active hours)
- Stance (official position on the core topic, controversy-handling style)
- Special notes (the group portrait represented, operational habits)
- Institutional memory (a key part of the account profile: this institution's relation to the event and prior actions/reactions in it)
3. age: fixed integer 30 (the institutional virtual age)
4. gender: fixed literal "other" (institutional accounts use "other" to indicate non-individual)
5. mbti: MBTI type used to characterize account voice (e.g. ISTJ for strict/conservative)
6. country: country name
7. profession: institutional function description
8. interested_topics: array of focus areas
重要:
- 所有字段值必须是字符串或数字不允许null值
- persona必须是一段连贯的文字描述不要使用换行符
- {get_language_instruction()} (gender字段必须用英文"other")
- age必须是整数30gender必须是字符串"other"
- 机构账号发言要符合其身份定位"""
Important:
- All field values MUST be strings or numbers; null values are not allowed.
- persona MUST be a single coherent block of text without unescaped newlines.
- {get_language_instruction()} (gender field MUST use the English value "other")
- age MUST be the integer 30; gender MUST be the string "other".
- Account voice MUST match the institution's identity positioning."""
def _generate_profile_rule_based(
self,