diff --git a/.kiro/specs/i18n-oasis-profile-generator-prompts/design.md b/.kiro/specs/i18n-oasis-profile-generator-prompts/design.md new file mode 100644 index 00000000..5541711b --- /dev/null +++ b/.kiro/specs/i18n-oasis-profile-generator-prompts/design.md @@ -0,0 +1,617 @@ +# Design Document — i18n-oasis-profile-generator-prompts + +## Overview + +**Purpose**: Translate the Chinese prompt strings in +`backend/app/services/oasis_profile_generator.py` (the system prompt +inside `_get_system_prompt`, the individual-persona f-string template +inside `_build_individual_persona_prompt`, the group-persona f-string +template inside `_build_group_persona_prompt`, and the four +`attrs_str`/`context_str` fallback literals) to English while +preserving every functional contract — JSON output keys, the `gender` +English enum, the `age` integer rule, the `persona` no-newline rule, +all `{variable}` interpolations, and every `get_language_instruction()` +call site. The goal is to remove the Chinese-language base-prompt bias +that currently leaks Chinese structure and word choice into persona +output even when `Accept-Language: en`. + +**Users**: MiroFish operators running the Step 2 environment-setup +pipeline under any locale; downstream Step 3 (CAMEL-OASIS subprocess) +which consumes the produced persona dictionaries. + +**Impact**: Replaces one one-line system prompt, two large f-string +templates, and four fallback literals with English equivalents inside +one file. No +API change, no new dependencies, no new files. The two production +callers (`backend/app/services/simulation_manager.py:316` and +`backend/app/api/simulation.py:1413`) and the OASIS subprocess are +unaffected. + +### Goals + +- Zero CJK characters in any prompt string literal contributed by + `oasis_profile_generator.py` to the system prompt or the two + user-message bodies (including the `attrs_str`/`context_str` + fallback literals). 
+- English persona prose (`bio`, `persona`, `profession`, + `interested_topics`) under `Accept-Language: en`. +- Continued Chinese persona prose under `Accept-Language: zh`, of + equivalent quality to the pre-change behaviour. +- `gender` field stays exactly one of `"male"`/`"female"`/`"other"` + regardless of locale. +- No diff to public signatures, taxonomy lists, LLM-call parameters, + or call sites. + +### Non-Goals + +- Externalizing prompts to `/locales/*.json` (out of scope per ticket). +- Translating logger calls in this file (covered by issue #6). +- Translating module/class/method docstrings or inline comments + (covered by issue #7). +- Refactoring the `OasisAgentProfile` schema, `MBTI_TYPES` / + `COUNTRIES` lists, or the `INDIVIDUAL_ENTITY_TYPES` / + `GROUP_ENTITY_TYPES` taxonomies. +- Modifying the rule-based fallback (`_generate_profile_rule_based`) + including its Chinese country defaults. +- Modifying the resilience helpers `_fix_truncated_json` / + `_try_fix_json` and the Chinese persona fallback fragments inside + them (e.g. `f"{entity_name}是一个{entity_type}。"`). +- Modifying `backend/app/utils/locale.py`, the locale registries, or + any non-target file. +- Modifying `backend/scripts/test_profile_format.py`. + +## Boundary Commitments + +### This Spec Owns + +- The English content of `_get_system_prompt`'s `base_prompt` literal. +- The English content of the f-string template body in + `_build_individual_persona_prompt`. +- The English content of the f-string template body in + `_build_group_persona_prompt`. +- The English replacements for the four `"无"` / `"无额外上下文"` + fallback literals (in both individual and group builders). + +### Out of Boundary + +- Locale resolution machinery (`backend/app/utils/locale.py`). +- Per-locale `llmInstruction` definitions + (`/locales/languages.json`). +- Reasoning-model output stripping inside `_fix_truncated_json` / + `_try_fix_json`. 
+- Logger calls and translation keys (`t("log.profile_generator.*")`) + inside `oasis_profile_generator.py` (issue #6, already merged). +- Module / class / method docstrings and inline comments inside + `oasis_profile_generator.py` (issue #7). +- Rule-based fallback (`_generate_profile_rule_based`) including its + Chinese country defaults `"中国"`. +- Chinese persona fragments inside the resilience helpers (e.g. + `f"{entity_name}是一个{entity_type}。"`) — those are runtime data + fallbacks, not LLM prompts. +- All callers of `OasisProfileGenerator` + (`simulation_manager.py`, `api/simulation.py`). +- Tests, scripts, and frontend code. +- The `print(...)` banner at line 945 (closely associated with logger + externalization #6). + +### Allowed Dependencies + +- Existing imports in the target file (no additions). Specifically: + `get_language_instruction`, `get_locale`, `set_locale`, `t` from + `..utils.locale` are already imported and remain unchanged. +- Existing LLM transport via `self.client.chat.completions.create` + (unchanged). + +### Revalidation Triggers + +The following changes elsewhere would invalidate this design: + +- A change to the JSON contract emitted by the LLM (`bio`, `persona`, + `age`, `gender`, `mbti`, `country`, `profession`, + `interested_topics` keys). +- A change to the `OasisAgentProfile` dataclass field set or the + Reddit/Twitter serializers. +- A change to `get_language_instruction()` semantics or the per-locale + `llmInstruction` strings. +- A change to OASIS subprocess profile-format expectations (verified + via `backend/scripts/test_profile_format.py`). + +## Architecture + +### Existing Architecture Analysis + +`OasisProfileGenerator` lives in `backend/app/services/`, follows the +in-process service pattern, and is invoked from a Flask handler inside +a background task. The relevant flow: + +1. 
The Flask handler resolves the request locale via `Accept-Language`; + `set_locale()` is propagated into worker threads in + `generate_profiles_for_entities` (locale captured at line ~910 and + restored inside `generate_single_profile` at line ~914). +2. For each entity, `generate_profile_from_entity` decides between the + individual or group prompt builder via + `self._is_individual_entity(entity_type)`. +3. The chosen builder produces a user-message string; `_get_system_prompt` + produces a system-message string. Both are sent to the LLM via + `self.client.chat.completions.create(..., response_format={"type": "json_object"})`. +4. The LLM response is JSON-decoded; on failure, `_try_fix_json` and + `_fix_truncated_json` attempt recovery; on terminal failure, + `_generate_profile_rule_based` produces a rule-based persona. +5. The result is wrapped in an `OasisAgentProfile` dataclass and + serialized to Reddit JSON or Twitter CSV via `_save_reddit_json` / + `_save_twitter_csv`. + +This design preserves all of the above. The change is purely lexical +inside three method bodies and four literal defaults. + +### Architecture Pattern & Boundary Map + +```mermaid +graph TB + Caller["simulation_manager.py / api/simulation.py"] + Generator["OasisProfileGenerator"] + Sys["_get_system_prompt"] + Ind["_build_individual_persona_prompt"] + Grp["_build_group_persona_prompt"] + Locale["locale.get_language_instruction"] + Client["openai.chat.completions.create"] + Parser["_try_fix_json / _fix_truncated_json"] + Fallback["_generate_profile_rule_based"] + Serializer["_save_reddit_json / _save_twitter_csv"] + + Caller --> Generator + Generator --> Sys + Generator --> Ind + Generator --> Grp + Sys -. inline call .-> Locale + Ind -. inline call .-> Locale + Grp -. 
inline call .-> Locale + Sys --> Client + Ind --> Client + Grp --> Client + Client --> Parser + Parser --> Fallback + Generator --> Serializer + + classDef change fill:#fff4ce,stroke:#a16207,color:#000 + class Sys,Ind,Grp change +``` + +The three highlighted nodes (`_get_system_prompt`, +`_build_individual_persona_prompt`, +`_build_group_persona_prompt`) are the only nodes whose **string +contents** change. Every edge — including each call to +`get_language_instruction()` — remains intact. + +**Architecture Integration**: + +- **Selected pattern**: In-place lexical translation of the three + prompt builders (Option A from `gap-analysis.md` / `research.md`). +- **Domain/feature boundaries**: Same as today; `OasisProfileGenerator` + remains the sole owner of persona prompt content. `LocaleService` + remains the sole owner of locale-postfix steering. +- **Existing patterns preserved**: locale-thread propagation, retry + logic with temperature decay, JSON resilience helpers, rule-based + fallback, two-platform serialization. +- **New components rationale**: none — no new components. +- **Steering compliance**: aligns with `tech.md` ("LLM prompts use the + `get_language_instruction()` postfix mechanism, not key files") and + `structure.md` ("services own their own prompt strings"). + +### Technology Stack & Alignment + +| Layer | Choice / Version | Role in Feature | Notes | +|-------|------------------|-----------------|-------| +| Backend / Services | Python ≥3.11 | Hosts the prompt builders | No version change | +| LLM transport | `openai` SDK against any OpenAI-compatible endpoint | Sends translated prompts | Unchanged | +| i18n | `backend/app/utils/locale.py` | Resolves locale and provides `get_language_instruction()` postfix | Unchanged | +| Storage | None | — | No persistence change | + +No new dependencies. No version bumps. The locale infrastructure used +by the change is the same one used by every sibling i18n spec already +merged. 
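
The postfix mechanism that the Architecture section preserves can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's implementation: `set_locale` and `get_language_instruction` are simplified stand-ins for the functions in `backend/app/utils/locale.py` (the real ones read `/locales/languages.json` and keep the locale thread-local), and the English `base_prompt` wording is hypothetical. Only the two postfix strings are taken from this spec.

```python
# Simplified stand-in for backend/app/utils/locale.py (assumption: the real
# module loads these strings from /locales/languages.json and stores the
# active locale per thread).
_LLM_INSTRUCTION = {
    "en": "Please respond in English.",
    "zh": "请使用中文回答。",
}
_current_locale = "en"


def set_locale(locale: str) -> None:
    """Select the active locale for subsequent prompt construction."""
    global _current_locale
    _current_locale = locale


def get_language_instruction() -> str:
    """Return the per-locale postfix appended to every prompt."""
    return _LLM_INSTRUCTION[_current_locale]


def build_system_prompt() -> str:
    # Hypothetical English wording; the actual base_prompt text is chosen
    # during implementation and only has to satisfy Requirements 1.1-1.4.
    base_prompt = (
        "You are an expert in generating social-media personas. "
        "Return valid JSON without unescaped newlines."
    )
    # Same shape the design preserves: English base + locale postfix.
    return f"{base_prompt}\n\n{get_language_instruction()}"
```

Because the base text is locale-neutral English, the output language is steered only by the postfix, which is why every `get_language_instruction()` call site has to survive the translation.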
+ +## File Structure Plan + +### Modified Files + +- `backend/app/services/oasis_profile_generator.py` — only file that + changes. + - `_get_system_prompt(self, is_individual: bool) -> str` — translate + `base_prompt` literal to English. Keep + `f"{base_prompt}\n\n{get_language_instruction()}"` shape. + - `_build_individual_persona_prompt(self, entity_name, entity_type, + entity_summary, entity_attributes, context) -> str` — translate + the f-string body to English; replace `"无"` and `"无额外上下文"` + defaults; keep every `{variable}` interpolation and the inline + `{get_language_instruction()}` call. + - `_build_group_persona_prompt(self, entity_name, entity_type, + entity_summary, entity_attributes, context) -> str` — same + treatment as the individual builder. + +No other files in the repository are touched by this change. + +## System Flows + +The runtime flow does not change. The call graph is identical before +and after the edit and is already shown in the Architecture diagram +above, so this design omits a separate sequence diagram. 
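
The "keep every `{variable}` interpolation" constraint from the File Structure Plan can be checked mechanically. The sketch below is one possible approach, assuming the pre-change and post-change f-string literals are available as source text; `interpolation_multiset` is a hypothetical helper, and the `BEFORE` / `AFTER` templates are placeholders rather than the real prompt bodies.

```python
import ast
from collections import Counter


def interpolation_multiset(fstring_source: str) -> Counter:
    """Count the expressions interpolated by an f-string, given its source text."""
    tree = ast.parse(fstring_source, mode="eval")
    return Counter(
        ast.unparse(node.value)
        for node in ast.walk(tree)
        if isinstance(node, ast.FormattedValue)
    )


# Hypothetical before/after templates: the labels are placeholders, not the
# real prompt text from oasis_profile_generator.py.
BEFORE = 'f"名称: {entity_name} 类型: {entity_type} {get_language_instruction()}"'
AFTER = 'f"Name: {entity_name} Type: {entity_type} {get_language_instruction()}"'

# True when the translation kept exactly the same interpolations.
same = interpolation_multiset(BEFORE) == interpolation_multiset(AFTER)
```

Comparing multisets rather than sets also catches a translation that accidentally duplicates or drops one of two identical interpolations.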
+ +## Requirements Traceability + +| Requirement | Summary | Components | Interfaces | Flows | +|-------------|---------|------------|------------|-------| +| 1.1 | `base_prompt` contains zero Chinese characters | `_get_system_prompt` | `(self, is_individual: bool) -> str` | system-message construction | +| 1.2 | Preserve `f"{base_prompt}\n\n{get_language_instruction()}"` | `_get_system_prompt` | inline `get_language_instruction()` | system-message construction | +| 1.3 | Preserve role/intent semantics | `_get_system_prompt` | — | — | +| 1.4 | Preserve signature `_get_system_prompt(self, is_individual: bool) -> str` | `_get_system_prompt` | (signature) | — | +| 2.1 | Individual prompt body in English | `_build_individual_persona_prompt` | f-string body | user-message construction | +| 2.2 | Preserve `{entity_name}`, `{entity_type}`, `{entity_summary}`, `{attrs_str}`, `{context_str}`, `{get_language_instruction()}` | `_build_individual_persona_prompt` | f-string interpolations | — | +| 2.3 | Preserve JSON keys `bio, persona, age, gender, mbti, country, profession, interested_topics` | `_build_individual_persona_prompt` | prompt content | — | +| 2.4 | Preserve field-level constraints (lengths, MBTI, gender enum, age int) | `_build_individual_persona_prompt` | prompt content | — | +| 2.5 | Preserve trailing-rules block semantics | `_build_individual_persona_prompt` | prompt content | — | +| 2.6 | Preserve method signature | `_build_individual_persona_prompt` | (signature) | — | +| 2.7 | Translate `"无"` and `"无额外上下文"` defaults | `_build_individual_persona_prompt` | literal defaults | — | +| 2.8 | Zero Chinese in assembled body | `_build_individual_persona_prompt` | — | — | +| 3.1 | Group prompt body in English | `_build_group_persona_prompt` | f-string body | user-message construction | +| 3.2 | Preserve interpolations | `_build_group_persona_prompt` | f-string interpolations | — | +| 3.3 | Preserve JSON keys | `_build_group_persona_prompt` | prompt content | — | +| 
3.4 | Preserve field-level constraints (age=30, gender="other", etc.) | `_build_group_persona_prompt` | prompt content | — | +| 3.5 | Preserve trailing-rules semantics | `_build_group_persona_prompt` | prompt content | — | +| 3.6 | Preserve method signature | `_build_group_persona_prompt` | (signature) | — | +| 3.7 | Translate `"无"` / `"无额外上下文"` defaults | `_build_group_persona_prompt` | literal defaults | — | +| 3.8 | Zero Chinese in assembled body | `_build_group_persona_prompt` | — | — | +| 4.1 | Preserve every `get_language_instruction()` call site | all three builders | inline call | system + user message construction | +| 4.2 | Preserve locale-thread plumbing | `generate_profiles_for_entities` (untouched) | `set_locale(current_locale)` | worker thread spawn | +| 4.3 | Locale=zh produces Chinese personas | runtime behaviour | locale postfix | LLM call | +| 4.4 | Locale=en produces English personas | runtime behaviour | locale postfix | LLM call | +| 4.5 | `gender` ∈ {male, female, other} regardless of locale | prompt content | — | — | +| 4.6 | Don't alter locale.py / locales/ | (none) | — | — | +| 5.1 | Preserve `OasisAgentProfile` dataclass | (untouched) | dataclass | — | +| 5.2 | Preserve method signatures | (untouched) | signatures | — | +| 5.3 | Preserve LLM invocation parameters | (untouched) | `chat.completions.create(...)` | — | +| 5.4 | Preserve `MBTI_TYPES`, `COUNTRIES`, taxonomy lists | (untouched) | class constants | — | +| 6.1 | Preserve `_fix_truncated_json` / `_try_fix_json` | (untouched) | helpers | — | +| 6.2 | Reasoning-model recovery still works | (untouched) | resilience helpers | — | +| 6.3 | No new prompt-language-dependent pre-processing | (none added) | — | — | +| 6.4 | Round-trip yields non-empty `bio` and `persona` | runtime behaviour | LLM call | — | +| 7.1 | `pytest test_profile_format.py` passes | runtime behaviour | serializers | — | +| 7.2 | Reddit format schema preserved | (untouched) | `to_reddit_format` | — | +| 7.3 | Twitter 
format schema preserved | (untouched) | `to_twitter_format` | — | +| 7.4 | `gender` enum preserved | prompt content | — | — | +| 8.1 | No logger edits | (untouched) | — | — | +| 8.2 | No docstring/comment edits | (untouched) | — | — | +| 8.3 | No rule-based fallback edits | (untouched) | — | — | +| 8.4 | No edits outside the target file | (none) | — | — | +| 8.5 | No new dependencies | (none) | `pyproject.toml` / `uv.lock` untouched | — | +| 8.6 | No edits to `test_profile_format.py` | (untouched) | — | — | + +## Components and Interfaces + +| Component | Domain/Layer | Intent | Req Coverage | Key Dependencies (P0/P1) | Contracts | +|-----------|--------------|--------|--------------|--------------------------|-----------| +| `_get_system_prompt` | backend service / prompt builder | Produce the system message (English base + locale postfix) | 1.1, 1.2, 1.3, 1.4, 4.1, 4.5 | `get_language_instruction` (P0) | Service | +| `_build_individual_persona_prompt` | backend service / prompt builder | Produce the individual-entity user message in English | 2.x, 4.1, 4.5 | `get_language_instruction` (P0); JSON encoder (P1) | Service | +| `_build_group_persona_prompt` | backend service / prompt builder | Produce the group/institution user message in English | 3.x, 4.1, 4.5 | `get_language_instruction` (P0); JSON encoder (P1) | Service | + +Only the three prompt-builder methods change. They all live inside the +single class `OasisProfileGenerator` in +`backend/app/services/oasis_profile_generator.py`. No new components. + +### Backend / Services + +#### `_get_system_prompt` + +| Field | Detail | +|-------|--------| +| Intent | Build the `system` message: a one-line English directive that frames the model as a social-media persona expert + the per-locale postfix. | +| Requirements | 1.1, 1.2, 1.3, 1.4, 4.1, 4.5 | + +**Responsibilities & Constraints** + +- Construct and return a single string of the form + `f"{base_prompt}\n\n{get_language_instruction()}"`. 
+- Preserve the signature + `_get_system_prompt(self, is_individual: bool) -> str`. +- The English `base_prompt` MUST convey: (a) expert role in + social-media persona generation; (b) intent to produce detailed, + realistic personas for opinion-simulation, faithful to existing + reality; (c) the JSON-output requirement and the no-unescaped-newline + rule. +- The English `base_prompt` MUST NOT contain any CJK codepoint. + +**Dependencies** + +- Outbound: `get_language_instruction()` from + `backend/app/utils/locale.py` (P0, criticality high — the entire + locale-steering chain depends on it). + +**Contracts**: Service [x] / API [ ] / Event [ ] / Batch [ ] / State [ ] + +##### Service Interface + +```python +def _get_system_prompt(self, is_individual: bool) -> str: + """Return the LLM system message: English base + locale postfix.""" + ... +``` + +- Preconditions: none. +- Postconditions: returns a non-empty string ending with the locale + postfix produced by `get_language_instruction()`. +- Invariants: contains zero CJK codepoints. + +**Implementation Notes** + +- Integration: called only from `_call_llm_with_retry` (line ~523) + with `is_individual` decided upstream. The `is_individual` flag is + reserved for future divergence between system prompts; the current + implementation does not branch on it, and this design preserves + that. +- Validation: a CJK regex audit on the method body after the edit must + match zero codepoints. +- Risks: dropping one of the three role/intent pieces (expert framing, + JSON output requirement, no-newline rule). Implementation task lists + all three explicitly. + +#### `_build_individual_persona_prompt` + +| Field | Detail | +|-------|--------| +| Intent | Build the user-message string for an individual entity in English. Preserve every `{variable}` interpolation, the inline `{get_language_instruction()}` call, every JSON-output key, and every locale-independent constraint. 
| +| Requirements | 2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 2.7, 2.8, 4.1, 4.5 | + +**Responsibilities & Constraints** + +- Preserve signature + `_build_individual_persona_prompt(self, entity_name: str, entity_type: str, entity_summary: str, entity_attributes: Dict[str, Any], context: str) -> str`. +- Preserve `attrs_str = json.dumps(entity_attributes, ensure_ascii=False) if entity_attributes else "无"` with `"无"` translated to English (`"None"`). +- Preserve `context_str = context[:3000] if context else "无额外上下文"` with `"无额外上下文"` translated to English (`"No additional context"`). +- Translate the f-string body to English with these structural sections (mirror the original Chinese intent): + 1. **Lead sentence** — instruct the model to generate a detailed + social-media persona for the entity, faithful to existing reality. + 2. **Entity context block** — labelled lines for `entity_name`, + `entity_type`, `entity_summary`, `entity_attributes` (English + labels; values via `{...}` interpolation). + 3. **Context information block** — `Context information:` heading + followed by `{context_str}`. + 4. **JSON-fields enumeration** — `Generate JSON with the following + fields:` followed by the eight numbered items (`bio`, `persona`, + `age`, `gender`, `mbti`, `country`, `profession`, + `interested_topics`) with English descriptions matching + Requirement 2.4. + 5. **Trailing rules block** — `Important:` followed by: + - `All field values must be strings or numbers; do not use newlines.` + - `persona must be a single coherent block of text.` + - `{get_language_instruction()} (gender field MUST use English values: "male" or "female")` + - `Content must remain consistent with the entity information.` + - `age must be a valid integer; gender must be exactly "male" or "female".` +- Preserve every `{variable}` interpolation present in the original by + name: `{entity_name}`, `{entity_type}`, `{entity_summary}`, + `{attrs_str}`, `{context_str}`, `{get_language_instruction()}`. 
+- The translated body MUST NOT contain any CJK codepoint. + +**Dependencies** + +- Outbound: `json.dumps(..., ensure_ascii=False)` (P1, formatting the + attributes dict) — unchanged. +- Outbound: `get_language_instruction()` (P0) — interpolated inline. + +**Contracts**: Service [x] / API [ ] / Event [ ] / Batch [ ] / State [ ] + +##### Service Interface + +```python +def _build_individual_persona_prompt( + self, + entity_name: str, + entity_type: str, + entity_summary: str, + entity_attributes: Dict[str, Any], + context: str, +) -> str: + """Return the LLM user message for an individual-entity persona.""" + ... +``` + +- Preconditions: `entity_name`, `entity_type`, `entity_summary` + are strings (may be empty); `entity_attributes` is a dict (may be + empty); `context` is a string (may be empty). +- Postconditions: returns a non-empty English string with all six + interpolations resolved. +- Invariants: contains zero CJK codepoints; preserves every + `{variable}` interpolation by name. + +**Implementation Notes** + +- Integration: called from `_call_llm_with_retry` (line ~506) when + `is_individual` is true. +- Validation: post-edit CJK regex audit; interpolation-set audit + (verify the multiset of `{...}` tokens equals the pre-change set); + smoke import + `pytest backend/scripts/test_profile_format.py`. +- Risks: dropping the `gender` enum lock when translating; dropping + the inline `{get_language_instruction()}` call. The implementation + task list calls these out as discrete checks. + +#### `_build_group_persona_prompt` + +| Field | Detail | +|-------|--------| +| Intent | Build the user-message string for a group/institution entity in English. Preserve every `{variable}` interpolation, the inline `{get_language_instruction()}` call, every JSON-output key, and every locale-independent constraint (notably `age == 30` and `gender == "other"`). 
| +| Requirements | 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7, 3.8, 4.1, 4.5 | + +**Responsibilities & Constraints** + +- Preserve signature + `_build_group_persona_prompt(self, entity_name: str, entity_type: str, entity_summary: str, entity_attributes: Dict[str, Any], context: str) -> str`. +- Preserve the `attrs_str` and `context_str` fallback handling with + English defaults (`"None"`, `"No additional context"`), identical to + the individual builder. +- Translate the f-string body to English with these structural + sections (mirror the original Chinese intent for institutions): + 1. **Lead sentence** — instruct the model to generate a detailed + social-media account profile for the institution/group, faithful + to existing reality. + 2. **Entity context block** — labelled lines for `entity_name`, + `entity_type`, `entity_summary`, `entity_attributes`. + 3. **Context information block** — `Context information:` heading + followed by `{context_str}`. + 4. **JSON-fields enumeration** — `Generate JSON with the following + fields:` followed by the eight numbered items as defined in + Requirement 3.4: `bio` (~200 chars, official voice), `persona` + (~2000 chars, single coherent text covering institutional + basics, account positioning, voice, publishing pattern, stance, + special notes, institutional memory), `age` (= integer 30, + institutional virtual age), `gender` (= literal `"other"`), + `mbti` (e.g. ISTJ for strict/conservative), `country` (country + name string), `profession` (institutional function), + `interested_topics` (array). + 5. 
**Trailing rules block** — `Important:` followed by: + - `All field values must be strings or numbers; null is not allowed.` + - `persona must be a single coherent block of text without newlines.` + - `{get_language_instruction()} (gender field MUST use English value "other")` + - `age must be the integer 30; gender must be the string "other".` + - `Account voice must match its identity positioning.` +- Preserve every `{variable}` interpolation present in the original. +- The translated body MUST NOT contain any CJK codepoint. + +**Dependencies** + +- Outbound: same as individual builder. + +**Contracts**: Service [x] / API [ ] / Event [ ] / Batch [ ] / State [ ] + +##### Service Interface + +```python +def _build_group_persona_prompt( + self, + entity_name: str, + entity_type: str, + entity_summary: str, + entity_attributes: Dict[str, Any], + context: str, +) -> str: + """Return the LLM user message for a group/institution persona.""" + ... +``` + +- Preconditions / Postconditions / Invariants: same shape as the + individual builder. + +**Implementation Notes** + +- Integration: called from `_call_llm_with_retry` (line ~510) when + `is_individual` is false. +- Validation: same checks as the individual builder, plus an explicit + audit that the institutional sentinels (`age == 30`, + `gender == "other"`) appear in English in the trailing-rules block. +- Risks: same as the individual builder; additionally, the `country` + language hint (`"使用中文,如\"中国\""`) is intentionally dropped + during translation — the validation task verifies that under + `Accept-Language: en` a sample run produces an English country + name. + +## Data Models + +No data-model changes. The persona JSON schema, the +`OasisAgentProfile` dataclass, the Reddit/Twitter serializers, and the +OASIS subprocess profile-format expectations are all preserved +verbatim. + +## Error Handling + +### Error Strategy + +No new error paths. 
The existing flow is preserved: + +- `json.JSONDecodeError` → `_try_fix_json` → `_fix_truncated_json` → + partial-extract via regex → `_generate_profile_rule_based`. +- LLM call failure → retry with temperature decay (`0.7 - attempt * 0.1`) + up to `max_attempts = 3`. +- Terminal failure → rule-based fallback persona. +- Per-entity worker exception → fallback `OasisAgentProfile` produced + inside `generate_single_profile` at line ~932. + +The translated prompts do not introduce new failure modes. Translating +prompt language has no semantic effect on JSON parsing or on the +`response_format={"type": "json_object"}` constraint. + +### Error Categories and Responses + +- **User errors**: not applicable (this is an internal pipeline). +- **System errors**: LLM transport errors are retried; logger emits + `t("log.profile_generator.m011")` etc. Logger keys already exist in + `locales/{en,zh}.json`. +- **Business-logic errors**: `gender` not in the English enum, `age` + not an integer — the prompt explicitly mandates them; the validator + inside `_try_fix_json` does not enforce these but the OASIS + subprocess does. No change in either direction. + +### Monitoring + +Existing logger calls are unchanged. Logger keys already i18n-keyed via +`t("log.profile_generator.*")`. + +## Testing Strategy + +### Unit Tests + +- **(Existing)** + `backend/scripts/test_profile_format.py::test_profile_formats` — + must continue to pass without modification. +- **(Manual)** Smoke import: + `cd backend && uv run python -c "from app.services.oasis_profile_generator import OasisProfileGenerator"` + — confirms no syntax errors after editing f-strings. 
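
The CJK regex audit named in the Implementation Notes could look like the sketch below. It is an assumption about how the audit might be written, not an existing script in the repository: `CJK_PATTERN` and `find_cjk` are hypothetical names, and the chosen Unicode ranges (CJK symbols and punctuation, Extension A, the main Unified Ideographs block, and fullwidth forms) are one reasonable reading of "CJK codepoint".

```python
import re

# Hypothetical audit pattern covering:
#   U+3000-U+303F  CJK symbols and punctuation (e.g. the ideographic full stop)
#   U+3400-U+4DBF  CJK Unified Ideographs Extension A
#   U+4E00-U+9FFF  CJK Unified Ideographs
#   U+FF00-U+FFEF  halfwidth and fullwidth forms (e.g. the fullwidth comma)
CJK_PATTERN = re.compile(r"[\u3000-\u303f\u3400-\u4dbf\u4e00-\u9fff\uff00-\uffef]")


def find_cjk(text: str) -> list[str]:
    """Return every CJK character found in text, in order of appearance."""
    return CJK_PATTERN.findall(text)
```

The audit is meant for the prompt literals and for prompts rendered under the `en` locale; a prompt rendered under `zh` legitimately contains the Chinese postfix and would be reported by this check.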
+ +### Integration Tests + +- **(Manual)** Run the prompt builders directly under each locale: + - `set_locale("en")` → + `OasisProfileGenerator()._build_individual_persona_prompt("Alice", "Student", "summary", {"k": "v"}, "ctx")` + — assert no CJK codepoints in the output, assert the English + locale postfix appears via `get_language_instruction()` (which is + `"Please respond in English."`). + - `set_locale("zh")` → same call → assert the locale postfix is + `"请使用中文回答。"`. +- These do not require an LLM call; they only verify the rendered + prompt string. + +### E2E Tests + +- **(Manual, optional)** Run `npm run dev` and trigger Step 2 profile + generation from the UI under English locale on a small entity set; + spot-check that bios and persona prose are in English. Skip this + check when no live LLM key is available, for example in CI; sibling + specs #2/#4/#5 used the same manual E2E approach. + +### Performance / Load + +Not applicable. Prompt translation has no measurable performance +impact. + +## Optional Sections + +### Security Considerations + +No security implications. No new external surfaces; no new data +retention; no change to authentication or authorization. + +### Migration Strategy + +No migration required. The change is forward-compatible: a deployment +that picks up the translated prompts continues to serve users on the +`zh` locale via the unchanged +`get_language_instruction()` postfix mechanism. + +## Supporting References + +- `gap-analysis.md` — option evaluation and effort/risk sizing. +- `research.md` — discovery findings, design decisions (in particular + the "drop the country language hint" decision), and risk register. +- `requirements.md` — EARS requirements with numeric IDs. +- Sibling specs `i18n-ontology-generator-prompts`, + `i18n-simulation-config-generator-prompts`, + `i18n-report-agent-prompts` — same translation pattern, already + merged. 
diff --git a/.kiro/specs/i18n-oasis-profile-generator-prompts/gap-analysis.md b/.kiro/specs/i18n-oasis-profile-generator-prompts/gap-analysis.md new file mode 100644 index 00000000..ce934662 --- /dev/null +++ b/.kiro/specs/i18n-oasis-profile-generator-prompts/gap-analysis.md @@ -0,0 +1,241 @@ +# Gap Analysis — i18n-oasis-profile-generator-prompts + +This document analyzes the gap between the requirements and the existing +codebase, lists implementation options, and recommends an approach for the +design phase. + +## 1. Current State Investigation + +### Target file + +`backend/app/services/oasis_profile_generator.py` — 1195 lines. Defines: + +- `OasisAgentProfile` dataclass with Reddit / Twitter serializers. +- `OasisProfileGenerator` class with the following public-API surface: + `__init__`, `generate_profile_from_entity`, `generate_profiles_from_entities`, + `set_graph_id`, plus private helpers `_call_llm_with_retry`, + `_generate_profile_rule_based`, `_get_system_prompt`, + `_build_individual_persona_prompt`, `_build_group_persona_prompt`, + `_print_generated_profile`, `_fix_truncated_json`, `_try_fix_json`, + `_save_twitter_csv`, `_save_reddit_json`, `_generate_username`. + +### Chinese surfaces in the file (by category) + +| Category | Lines | In scope this issue? | +| --- | --- | --- | +| Module / class / method docstrings | scattered | **No** — covered by #7 | +| Inline `#` comments | scattered | **No** — covered by #7 | +| `logger.{info,warning,error}` calls (translated via `t("log.profile_generator.*")`) | scattered | **No** — already done by #6 | +| `print(...)` banners (e.g. 
line 945) | a few | **No** — companion to #6 in spirit; not a prompt literal | +| **System prompt `base_prompt`** (line 664) | 1 line | **Yes** | +| **Individual-persona prompt body** (lines 680–714) | block | **Yes** | +| **Group-persona prompt body** (lines 729–762) | block | **Yes** | +| `attrs_str` / `context_str` defaults `"无"` / `"无额外上下文"` (lines 677, 678, 726, 727) | 4 lines | **Yes** — they substitute *into* the prompt body | +| Rule-based fallback (`_generate_profile_rule_based`, lines 764–835) including `"country": "中国"` and `"国家"` placeholders | block | **No** — runtime data, not a prompt | +| Resilience-helper Chinese fragments (`f"{entity_name}是一个{entity_type}。"` at lines 547, 644, 659) | a few | **No** — runtime data, not a prompt | + +The file already imports `get_locale`, `set_locale`, `t`, and +`get_language_instruction` from `app.utils.locale`. The locale-capture / +restore plumbing inside `generate_profiles_for_entities` (lines ~910–916) +already propagates the request locale to background-thread workers — no +changes required. + +### Locale infrastructure (already in place) + +`backend/app/utils/locale.py`: + +- `get_language_instruction()` returns the per-locale postfix from + `/locales/languages.json` (e.g. `Please respond in English.` for `en`, + `请使用中文回答。` for `zh`). +- `t(key, **kwargs)` resolves `log.*` keys for backend logger messages; + not used by this issue. +- `set_locale` / `get_locale` are thread-local, with restoration plumbed + into `generate_profiles_for_entities`. + +### Sibling specs already shipped + +- `i18n-ontology-generator-prompts` (#2 — merged) +- `i18n-simulation-config-generator-prompts` (#4 — merged) +- `i18n-report-agent-prompts` (#5 — merged) +- `i18n-externalize-backend-logs` (#6 — merged; logger keys for + `log.profile_generator.*` are already in `locales/{en,zh}.json`) + +The translation pattern they established: + +1. Translate the base prompt body (English narrative + headings). +2. 
Preserve every `get_language_instruction()` call site verbatim so + `Accept-Language: zh` still produces Chinese output. +3. Preserve all `{variable}` interpolations in f-strings. +4. Preserve all locale-independent "lock" rules (e.g. `gender` enum) in + English text within the prompt. +5. No new dependencies, no new files, single-file diff. + +This is a direct sibling — same pattern applies. + +### Test contract + +`backend/scripts/test_profile_format.py`: + +- Pytest-collectable function `test_profile_formats`. +- Constructs `OasisAgentProfile` instances directly (no LLM call) and + serializes them via `_save_twitter_csv` / `_save_reddit_json`. +- Verifies CSV header includes `user_id, user_name, name, bio, + friend_count, follower_count, statuses_count, created_at` and JSON + output includes `realname, username, bio, persona`. +- **Does not exercise the prompts.** A pure prompt translation cannot + break it; a refactor of dataclass field names or serializers would. + +### Callers + +- `backend/app/services/simulation_manager.py:316` — + `OasisProfileGenerator(graph_id=state.graph_id)`. +- `backend/app/api/simulation.py:1413` — `OasisProfileGenerator()`. + +Neither caller looks at prompt language; both consume the persona dict +output. No call-site changes are needed. + +## 2. Requirement-to-Asset Map + +| Req. | Asset / file | Gap | +| --- | --- | --- | +| 1. System prompt → English | `_get_system_prompt` line 664 | **Missing** — Chinese literal needs to become English literal | +| 2. Individual-persona template → English | `_build_individual_persona_prompt` lines 680–714 | **Missing** — Chinese block needs translation; preserve `{...}` interpolations and inline `{get_language_instruction()}` | +| 3. Group-persona template → English | `_build_group_persona_prompt` lines 729–762 | **Missing** — Chinese block needs translation; preserve `{...}` interpolations and inline `{get_language_instruction()}` | +| 4. 
Locale switching unchanged | `app.utils.locale` + the three `get_language_instruction()` call sites | **Constraint** — code path must stay byte-identical at those call sites | +| 5. Public API stability | `OasisAgentProfile` dataclass + `OasisProfileGenerator` method signatures | **Constraint** — no signatures change | +| 6. Reasoning-model parsing unchanged | `_fix_truncated_json`, `_try_fix_json` | **Constraint** — no edits | +| 7. OASIS schema parity | `_save_twitter_csv`, `_save_reddit_json`, `to_*_format` serializers | **Constraint** — no edits; pytest must continue passing | +| 8. Out-of-scope guard | logger calls, docstrings, comments, rule-based fallback | **Constraint** — explicitly do not edit | + +No requirement is blocked or unknown. Every requirement maps to a known +location with a clear, narrow change. + +## 3. Implementation Approach Options + +### Option A — In-place edit of the three prompt builders (extend existing) + +Translate `base_prompt` (1 line), the individual-persona f-string body +(~35 lines), and the group-persona f-string body (~34 lines) directly, +plus the four `"无"` / `"无额外上下文"` fallback literals. Keep all method +bodies otherwise byte-identical. + +- **Files touched**: `backend/app/services/oasis_profile_generator.py` + only. +- **Compatibility**: zero API change. All call sites unaffected. Locale + switching preserved by leaving the inline `{get_language_instruction()}` + placeholders untouched. +- **Complexity**: low. Pattern is identical to merged siblings #2, #4, + #5. + +**Trade-offs**: + +- ✅ Minimal diff, exactly the pattern reviewers expect. +- ✅ No risk to the unrelated rule-based fallback or serialization paths. +- ✅ Out-of-scope items (logger, docstrings, rule-based fallback) are not + touched, so #6/#7 remain clean. +- ❌ Leaves the file mixed-language in non-prompt parts (docstrings, rule + fallback) until #7 lands. Acceptable per scope split. 
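A minimal sketch of what an Option A edit could look like for the individual-persona builder. The English wording, the `json.dumps` rendering of `attrs_str`, and the stubbed `get_language_instruction()` are assumptions made for illustration. Only the interpolation names, the `context[:3000]` truncation, and the two English fallbacks (`"None"`, `"No additional context"`) come from this spec.

```python
import json
from typing import Any, Dict


def get_language_instruction() -> str:
    # Stand-in for app.utils.locale.get_language_instruction();
    # assumed here to return the "en" postfix quoted in this document.
    return "Please respond in English."


def build_individual_persona_prompt(
    entity_name: str,
    entity_type: str,
    entity_summary: str,
    entity_attributes: Dict[str, Any],
    context: str,
) -> str:
    # English fallbacks that replace the former Chinese literals.
    # Rendering attributes with json.dumps is an assumption of this sketch.
    attrs_str = (
        json.dumps(entity_attributes, ensure_ascii=False)
        if entity_attributes
        else "None"
    )
    context_str = context[:3000] if context else "No additional context"

    # Hypothetical English wording; the interpolations keep their names and
    # get_language_instruction() stays inline, evaluated on every call.
    return f"""Generate a detailed social-media persona for this entity.

Entity name: {entity_name}
Entity type: {entity_type}
Entity summary: {entity_summary}
Entity attributes: {attrs_str}

Context:
{context_str}

Important:
- gender must be the English value "male" or "female".
- age must be a valid integer.
- persona must be one coherent text with no newlines.
- {get_language_instruction()}
"""
```

Calling the function with an empty attribute dict and an empty context should render both English fallbacks, which is the behaviour reviewers would look for in the diff.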
+ +### Option B — Move prompt strings into module-level constants + +Introduce `INDIVIDUAL_PERSONA_PROMPT_TEMPLATE` and +`GROUP_PERSONA_PROMPT_TEMPLATE` constants at module scope (mirroring +`ONTOLOGY_SYSTEM_PROMPT` style in `ontology_generator.py`), and have the +builders `.format(**kwargs)` against them. + +- **Files touched**: same single file, but with structural refactor. +- **Compatibility**: still zero public API change, but the diff is + larger and reviewers must verify equivalent behaviour around + `{get_language_instruction()}` (which would need to become a runtime + substitution rather than an f-string interpolation, since constants don't + re-evaluate per call). + +**Trade-offs**: + +- ✅ Constants are easier to spot in `git grep`. +- ❌ Larger diff, more review surface. +- ❌ The inline `get_language_instruction()` call is currently captured at + f-string render time; moving to a `.format(...)` template requires + passing the resolved instruction in as a kwarg — a behavioural change + that exceeds "translate prompts only". +- ❌ Diverges from the sibling pattern just shipped (#4, #5 used in-place + edits, not module constants). #2 used module constants but only for the + system prompt — the user-message template was still built inside the + method. + +### Option C — Externalize prompt text into `/locales/*.json` + +Move every prompt sentence into `locales/en.json` and `locales/zh.json`, +keyed under `prompt.profile_generator.*`, and use `t(key, **vars)` to +resolve. + +- **Compatibility**: would address `Accept-Language` purely via the + existing translation mechanism without depending on the + `get_language_instruction()` postfix. + +**Trade-offs**: + +- ✅ Most i18n-pure approach. +- ❌ Significantly larger diff (touches three files: source file, + `en.json`, `zh.json`). +- ❌ Diverges from the established project pattern. 
The sibling specs + (#2, #4, #5) deliberately did **not** externalize prompts — the + project rationale (per `tech.md`) is that backend logger messages are + the i18n surface, while LLM prompts use the `get_language_instruction()` + postfix mechanism. +- ❌ Higher review and merge cost for no operational gain. + +## 4. Recommended Approach + +**Option A** — single-file in-place edit of the three prompt builders +plus the four `"无"` / `"无额外上下文"` fallback literals. + +Rationale: + +- Matches the merged sibling specs verbatim (#2, #4, #5) so reviewers + can apply the same mental checklist. +- Smallest possible diff that satisfies every acceptance criterion in + requirements.md. +- Leaves out-of-scope surfaces (logger, docstrings, rule-based + fallback) untouched — clean handoff to #7 and clean separation from + already-merged #6. +- Zero new dependencies, zero new files, zero API change, zero risk to + `test_profile_format.py`. + +### Translation choices to lock in during design + +1. The system prompt `base_prompt` becomes a single English sentence in + the spirit of the original (expert in social-media persona generation; + detailed and realistic personas for opinion simulation; faithful + reflection of real-world conditions; valid JSON, no unescaped + newlines). +2. The two persona prompt bodies adopt English section headings and + prose. The previously-Chinese hint + `country: 国家(使用中文,如"中国")` is dropped — the + `get_language_instruction()` postfix already steers locale, and the + rule-based fallback (out of scope) handles its own country values. +3. The trailing rules block keeps the locale-independent "lock" + constraints inline (`gender` enum, `age` integer requirement, + `persona` newline rule) and continues to embed + `{get_language_instruction()}` verbatim. + +## 5. Effort & Risk + +- **Effort**: **S** (1–3 days; realistically <½ day). One-file diff, + established sibling pattern, no new test infrastructure. +- **Risk**: **Low**. 
The translated prompts touch only the LLM + `messages` payload. The locale-switching pathway, public API, + serializers, retry logic, fallback, and tests are all untouched. The + only failure mode is a mistranslated constraint (e.g. accidentally + dropping `gender ∈ {male, female, other}`), which the design checklist + enumerates and reviewers can verify by diff. + +### Research items carried into design phase + +- None blocking. The design phase will: + - Enumerate the exact final English text for each of the three blocks. + - Verify each translated block preserves every JSON-output key, + every `{variable}` interpolation, and the inline + `{get_language_instruction()}` call. + - Spot-check that the diff stays within + `backend/app/services/oasis_profile_generator.py`. diff --git a/.kiro/specs/i18n-oasis-profile-generator-prompts/requirements.md b/.kiro/specs/i18n-oasis-profile-generator-prompts/requirements.md new file mode 100644 index 00000000..f37262bc --- /dev/null +++ b/.kiro/specs/i18n-oasis-profile-generator-prompts/requirements.md @@ -0,0 +1,145 @@ +# Requirements Document + +## Introduction + +This specification covers the English translation of the prompt strings in `backend/app/services/oasis_profile_generator.py`. The file converts Graphiti graph entities into OASIS agent persona dictionaries that drive Step 2 (Environment Setup) of the MiroFish pipeline. Today, the system prompt and the two `_build_*_persona_prompt` user-message templates are written in Chinese; the language is steered at runtime by appending `get_language_instruction()` to the system prompt and inside the user prompt body. While that postfix instructs the model *which* language to respond in, the base-prompt language biases the model's structural and lexical output, so persona prose (bio, persona, profession, interested_topics) skews Chinese under `Accept-Language: en`. 
Translating the base prompts to English removes that bias while preserving the existing locale-switching mechanism for non-English locales (`get_language_instruction()` returns `请使用中文回答。` when locale is `zh`, so a Chinese model response remains achievable from an English base prompt). + +This work tracks GitHub issue [#3](https://github.com/salestech-group/MiroFish/issues/3) and is sibling to the already-merged ontology-generator (#2), simulation-config-generator (#4), and report-agent (#5) prompt translation specs. + +## Boundary Context + +- **In scope**: + - Translating the system-prompt base string in `OasisProfileGenerator._get_system_prompt` (currently `"你是社交媒体用户画像生成专家。…"` at line ~664) from Chinese to English. + - Translating the individual-persona user-message template in `OasisProfileGenerator._build_individual_persona_prompt` (currently lines ~680–714) from Chinese to English. + - Translating the group/institution-persona user-message template in `OasisProfileGenerator._build_group_persona_prompt` (currently lines ~729–762) from Chinese to English. + - Translating the small `attrs_str` and `context_str` fallback default literals (`"无"`, `"无额外上下文"`) to English equivalents. + - Preserving all functional contracts: every `get_language_instruction()` call site, all variable interpolations, all JSON output keys, the `gender` enum constraint, the `age` integer constraint, and the institutional age=30 / gender="other" rule. +- **Out of scope**: + - Logger calls (`logger.info`, `logger.warning`, `logger.error`) and the printed banner text inside `oasis_profile_generator.py` — covered by issue #6. + - Module docstring, class docstrings, method docstrings, and inline comments — covered by issue #7. + - The fallback Chinese string literals embedded in non-prompt code paths (e.g. 
`f"{entity_name}是一个{entity_type}。"` inside `_try_fix_json` and the rule-based fallback) — those are runtime data fallbacks, not LLM prompts, and are out of scope for this issue (they belong to the fallback flow, to be revisited when the comments/docstrings issue #7 lands or in a future cleanup, and they are not user-visible while the LLM path succeeds). + - Refactoring the OASIS profile JSON schema, the `OasisAgentProfile` dataclass, the MBTI list, the `COUNTRIES` list, the entity-type taxonomy splits (`INDIVIDUAL_ENTITY_TYPES` vs `GROUP_ENTITY_TYPES`), or persona-generation flow control. + - Changing OASIS profile-format compatibility — verified by `backend/scripts/test_profile_format.py`. + - Editing the locale plumbing block (currently the `current_locale = get_locale()` capture and the `set_locale(current_locale)` call inside `generate_single_profile` around lines ~910–916). +- **Adjacent expectations**: + - The Step 2 environment-setup pipeline must continue to consume the OASIS profile output unchanged. The Reddit (`to_reddit_format`) and Twitter (`to_twitter_format`) serializers are not coupled to prompt language; this is verified by preserving the JSON schema contract. + - The locale resolution chain (`Accept-Language` header → `get_locale()` → `get_language_instruction()`) is owned by `backend/app/utils/locale.py` and is unchanged by this work. + - Companion i18n issues (#6 logs, #7 comments/docstrings, #9 frontend comments, #10 e2e verification, #12 README) operate on different files or scopes and must not be touched here. + +## Requirements + +### Requirement 1: English Translation of the System Prompt + +**Objective:** As a MiroFish operator running the pipeline under `Accept-Language: en`, I want the persona-generation system prompt to be authored in English, so that the LLM's persona prose is not biased toward Chinese structure or word choice. + +#### Acceptance Criteria + +1. 
The OASIS Profile Generator shall set the `base_prompt` constant inside `_get_system_prompt` to an English string containing zero Chinese characters. +2. The OASIS Profile Generator shall preserve the system-prompt assembly contract verbatim: the format `f"{base_prompt}\n\n{get_language_instruction()}"` and the call to `get_language_instruction()` at exactly that site. +3. The OASIS Profile Generator shall preserve the role and intent semantics of the original prompt: identifying the model as an expert in social-media user-persona generation, requesting detailed and realistic personas for opinion simulation that reflect existing real-world conditions, and mandating valid JSON output where string values must not contain unescaped newlines. +4. The OASIS Profile Generator shall preserve the function signature `_get_system_prompt(self, is_individual: bool) -> str`. + +### Requirement 2: English Translation of the Individual-Persona User-Message Template + +**Objective:** As a MiroFish operator generating personas for individual entities under `Accept-Language: en`, I want the user-message template constructed by `_build_individual_persona_prompt` to be authored in English, so that the rendered prompt does not interleave English `get_language_instruction()` directives with Chinese section headings. + +#### Acceptance Criteria + +1. The OASIS Profile Generator shall render the individual-persona user message with English section headings and prose in place of the current Chinese (entity name, entity type, entity summary, entity attributes, context section, JSON-fields enumeration, "important" trailing block). +2. The OASIS Profile Generator shall preserve all variable interpolations verbatim by name: `{entity_name}`, `{entity_type}`, `{entity_summary}`, `{attrs_str}`, `{context_str}`, and the inline `{get_language_instruction()}` call inside the trailing rules block. +3. 
The OASIS Profile Generator shall preserve the JSON output contract enumerated in the prompt: the keys `bio`, `persona`, `age`, `gender`, `mbti`, `country`, `profession`, `interested_topics` (verbatim, English). +4. The OASIS Profile Generator shall preserve the field-level constraints in the prompt: + - `bio` ≈ 200 characters, social-media biography. + - `persona` ≈ 2000 characters, single coherent text covering: basic information (age, profession, education, location), background (notable experience, event association, social ties), personality (MBTI, core traits, emotional expression), social-media behavior (posting frequency, content preferences, interaction style, language traits), stance (attitudes toward the topic, emotional triggers), unique features (catchphrases, special experiences, hobbies), and personal memory (the entity's relation to the event and prior actions/reactions in it). + - `age` MUST be an integer. + - `gender` MUST be one of `"male"` or `"female"` (English enum value, locale-independent). + - `mbti` MUST be an MBTI four-letter type (e.g. INTJ, ENFP). + - `country` MUST be a country name string. + - `profession` MUST be a profession string. + - `interested_topics` MUST be an array. +5. The OASIS Profile Generator shall preserve the trailing-block rules verbatim in spirit: every value is a string or number, no newlines inside string values, `persona` is a single coherent text, `gender` must be the English `male`/`female` enum even when locale is `zh`, content must stay consistent with the source entity, `age` must be a valid integer. +6. The OASIS Profile Generator shall preserve the function signature `_build_individual_persona_prompt(self, entity_name: str, entity_type: str, entity_summary: str, entity_attributes: Dict[str, Any], context: str) -> str`. +7. 
The OASIS Profile Generator shall preserve the `context[:3000]` truncation behaviour and the conditional fallback (`"无额外上下文"` translated to `"No additional context"`) when `context` is empty/falsy. Likewise, `attrs_str` shall fall back to an English placeholder (`"None"`) when `entity_attributes` is empty/falsy, replacing the current `"无"` literal. +8. The OASIS Profile Generator shall return zero Chinese characters across all string literals contributed to the assembled individual-persona prompt body. + +### Requirement 3: English Translation of the Group/Institution-Persona User-Message Template + +**Objective:** As a MiroFish operator generating personas for institutional/group entities under `Accept-Language: en`, I want the user-message template constructed by `_build_group_persona_prompt` to be authored in English, so that the rendered prompt does not interleave English `get_language_instruction()` directives with Chinese section headings. + +#### Acceptance Criteria + +1. The OASIS Profile Generator shall render the group-persona user message with English section headings and prose in place of the current Chinese. +2. The OASIS Profile Generator shall preserve all variable interpolations verbatim by name: `{entity_name}`, `{entity_type}`, `{entity_summary}`, `{attrs_str}`, `{context_str}`, and the inline `{get_language_instruction()}` call inside the trailing rules block. +3. The OASIS Profile Generator shall preserve the JSON output contract enumerated in the prompt: the keys `bio`, `persona`, `age`, `gender`, `mbti`, `country`, `profession`, `interested_topics` (verbatim, English). +4. The OASIS Profile Generator shall preserve the field-level constraints in the prompt: + - `bio` ≈ 200 characters, an official-account biography that reads as professionally appropriate. 
+ - `persona` ≈ 2000 characters, single coherent text covering: institutional basics (formal name, type, founding background, primary functions), account positioning (account type, target audience, core function), voice (language traits, common phrasing, taboo topics), publishing pattern (content types, publishing frequency, active hours), stance (official position on the core topic, controversy-handling style), special notes (group portrait represented, operational habits), and institutional memory (the institution's relation to the event and prior actions/reactions in it). + - `age` MUST be the integer `30` (the institutional virtual-age sentinel). + - `gender` MUST be the literal `"other"` (English enum value, locale-independent), indicating non-individual. + - `mbti` MUST be an MBTI four-letter type used to characterize account voice (e.g. ISTJ for strict/conservative). + - `country` MUST be a country name string. + - `profession` MUST describe institutional function. + - `interested_topics` MUST be an array of focus areas. +5. The OASIS Profile Generator shall preserve the trailing-block rules verbatim in spirit: every value is a string or number, no `null` values, no newlines in string values, `persona` is a single coherent text, `gender` must be the English `"other"` enum even when locale is `zh`, the institutional account voice must match its identity positioning, and `age` must be the integer `30`. +6. The OASIS Profile Generator shall preserve the function signature `_build_group_persona_prompt(self, entity_name: str, entity_type: str, entity_summary: str, entity_attributes: Dict[str, Any], context: str) -> str`. +7. The OASIS Profile Generator shall preserve the `context[:3000]` truncation behaviour and the conditional English-equivalent fallback for empty `context` and empty `entity_attributes`, mirroring Requirement 2. +8. 
The OASIS Profile Generator shall return zero Chinese characters across all string literals contributed to the assembled group-persona prompt body. + +### Requirement 4: Locale Switching Continues to Work via `get_language_instruction()` + +**Objective:** As a MiroFish operator running the pipeline under `Accept-Language: zh` (or any other configured non-English locale), I want generated personas to remain in the requested locale at equivalent quality, so that translating the base prompt does not regress non-English support. + +#### Acceptance Criteria + +1. The OASIS Profile Generator shall preserve every existing `get_language_instruction()` call site exactly: the system-prompt site in `_get_system_prompt`, the inline call inside the trailing rules block of `_build_individual_persona_prompt`, and the inline call inside the trailing rules block of `_build_group_persona_prompt`. +2. The OASIS Profile Generator shall preserve the locale-capture/restore plumbing inside `generate_profiles_for_entities` (currently the `current_locale = get_locale()` capture and the `set_locale(current_locale)` call inside `generate_single_profile`) — this code is not modified by the change. +3. While the locale is `zh`, the OASIS Profile Generator shall produce profiles whose `bio`, `persona`, `profession`, and `interested_topics` content is in Chinese, equivalent in quality to the pre-change behaviour. +4. While the locale is `en`, the OASIS Profile Generator shall produce profiles whose `bio`, `persona`, `profession`, and `interested_topics` content is in English. +5. While the locale is `en` or `zh`, the OASIS Profile Generator shall produce profiles whose `gender` field is one of the literal English values `"male"`, `"female"` (individual entities) or `"other"` (group entities), regardless of locale. +6. The OASIS Profile Generator shall not alter `backend/app/utils/locale.py`, the `_languages`, the `_translations` registries, or the locales under `/locales/`. 
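The locale-switching contract in Requirement 4 can be illustrated with a small, self-contained sketch. Everything here is a stand-in: the thread-local registry, the default locale of `en`, and the English `base_prompt` wording are assumptions, while the two instruction strings and the `f"{base_prompt}\n\n{get_language_instruction()}"` assembly are taken from this document.

```python
import threading

# Hypothetical thread-local state standing in for app.utils.locale.
_state = threading.local()
_INSTRUCTIONS = {
    "en": "Please respond in English.",
    "zh": "请使用中文回答。",
}


def set_locale(locale: str) -> None:
    _state.locale = locale


def get_locale() -> str:
    # The default of "en" is an assumption of this sketch.
    return getattr(_state, "locale", "en")


def get_language_instruction() -> str:
    return _INSTRUCTIONS.get(get_locale(), _INSTRUCTIONS["en"])


def get_system_prompt() -> str:
    # Hypothetical English base prompt; the assembly format is unchanged.
    base_prompt = (
        "You are an expert in generating social-media user personas. "
        "Return valid JSON with no unescaped newlines in string values."
    )
    return f"{base_prompt}\n\n{get_language_instruction()}"


def render_in_worker() -> str:
    # Capture the locale in the requesting thread...
    current_locale = get_locale()
    result = {}

    def worker() -> None:
        # ...and restore it inside the worker, since thread-local state
        # does not carry over to a new thread on its own.
        set_locale(current_locale)
        result["prompt"] = get_system_prompt()

    thread = threading.Thread(target=worker)
    thread.start()
    thread.join()
    return result["prompt"]
```

With the locale set to `zh` in the calling thread, the worker renders an English base prompt followed by the Chinese instruction, which is the combination this requirement relies on.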
+ +### Requirement 5: Public API and Call-Site Stability + +**Objective:** As a developer maintaining the rest of the MiroFish backend pipeline, I want the public surface of `OasisProfileGenerator` and `OasisAgentProfile` to remain unchanged, so that the Step 2 environment-setup flow and existing callers continue to work without modification. + +#### Acceptance Criteria + +1. The OASIS Profile Generator shall preserve the dataclass `OasisAgentProfile`, including its field set (`user_id`, `user_name`, `name`, `bio`, `persona`, `karma`, `friend_count`, `follower_count`, `statuses_count`, `age`, `gender`, `mbti`, `country`, `profession`, `interested_topics`, `source_entity_uuid`, `source_entity_type`, `created_at`), default values, and the `to_reddit_format`, `to_twitter_format`, `to_full_dict` serializers. +2. The OASIS Profile Generator shall preserve the signatures and call semantics of `OasisProfileGenerator.__init__`, `generate_profile_from_entity`, `generate_profiles_for_entities`, `_call_llm_with_retry`, `_generate_profile_rule_based`, `_get_system_prompt`, `_build_individual_persona_prompt`, `_build_group_persona_prompt`, `_print_generated_profile`, `_fix_truncated_json`, `_try_fix_json`, and `_generate_username`. +3. The OASIS Profile Generator shall preserve the LLM invocation parameters (`temperature`, `max_tokens`, model selection, retry behaviour) at the call sites that consume the prompts produced by the translated builders. +4. The OASIS Profile Generator shall preserve the `INDIVIDUAL_ENTITY_TYPES` and `GROUP_ENTITY_TYPES` taxonomies, the `MBTI_TYPES` list, and the `COUNTRIES` list verbatim. + +### Requirement 6: Reasoning-Model Output Compatibility + +**Objective:** As a MiroFish operator using a reasoning-model provider (e.g. MiniMax, GLM with `<think>` tags or markdown code fences), I want JSON parsing of the persona response to continue working, so that translating the base prompt does not regress provider compatibility. 
+ +#### Acceptance Criteria + +1. The OASIS Profile Generator shall preserve the existing `_fix_truncated_json` and `_try_fix_json` resilience helpers exactly, including their regex-based extraction of `bio` and `persona` from partial output. +2. If a reasoning-model provider returns truncated, `<think>`-tagged, or markdown-fenced output, then the existing parsing/recovery flow shall continue to apply unchanged. +3. The OASIS Profile Generator shall not introduce any new pre-processing of the LLM response that depends on prompt language. +4. After translation, the OASIS Profile Generator shall continue to round-trip a representative entity through `generate_profile_from_entity` and produce a JSON object with at minimum a non-empty `bio` and a non-empty `persona`, matching the pre-change behaviour. + +### Requirement 7: Step 2 Environment-Setup Parity (OASIS Format Compatibility) + +**Objective:** As a MiroFish operator validating the change, I want the OASIS subprocess to accept the generated profiles unchanged, so that the translation does not silently break Step 2 → Step 3 hand-off. + +#### Acceptance Criteria + +1. While `uv run python -m pytest backend/scripts/test_profile_format.py` runs against the changed code, the test suite shall pass with zero regressions versus the pre-change baseline. +2. While a representative Reddit-format profile dictionary is produced under locale `en`, every field name shall match the existing OASIS-required schema: `user_id`, `username`, `name`, `bio`, `persona`, `karma`, `created_at`, plus optional `age`, `gender`, `mbti`, `country`, `profession`, `interested_topics`. +3. While a representative Twitter-format profile dictionary is produced under locale `en`, every field name shall match the existing OASIS-required schema: `user_id`, `username`, `name`, `bio`, `persona`, `friend_count`, `follower_count`, `statuses_count`, `created_at`, plus optional `age`, `gender`, `mbti`, `country`, `profession`, `interested_topics`. +4. 
The OASIS Profile Generator shall produce `gender` values that are exactly one of `"male"`, `"female"`, `"other"` regardless of locale, satisfying the OASIS subprocess's expected enum. + +### Requirement 8: Out-of-Scope Surfaces Remain Untouched + +**Objective:** As a reviewer of this PR, I want the change to remain narrowly scoped to prompt strings, so that translation responsibilities for adjacent surfaces (issues #6, #7, and the rule-based fallback) are not absorbed into this change. + +#### Acceptance Criteria + +1. The change shall not modify any `logger.warning(...)`, `logger.info(...)`, `logger.error(...)`, or `logger.debug(...)` call in `oasis_profile_generator.py` (covered by issue #6). +2. The change shall not modify the module docstring, class docstrings, method docstrings, or inline comments in `oasis_profile_generator.py` (covered by issue #7). +3. The change shall not modify the rule-based fallback Chinese fragments inside `_try_fix_json` (e.g. `f"{entity_name}是一个{entity_type}。"`) and the rule-based path inside `_generate_profile_rule_based` — those are runtime data fallbacks, not LLM prompts, and remain out of scope here. +4. The change shall not edit any file outside `backend/app/services/oasis_profile_generator.py` for production code. +5. The change shall not introduce a new dependency or modify `backend/pyproject.toml` / `backend/uv.lock`. +6. The change shall not modify `backend/scripts/test_profile_format.py` (the test is the contract; the implementation must match it). 
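The "zero Chinese characters" criteria in Requirements 1 to 3 could be spot-checked with a small helper such as the one below. This is a sketch and not part of the spec: the function name `find_cjk` and the chosen Unicode ranges (CJK ideographs, CJK punctuation, full-width forms) are assumptions about what "Chinese characters" should cover.

```python
import re
from typing import List

# Assumed coverage: CJK symbols and punctuation, CJK Extension A,
# CJK Unified Ideographs, and half-width/full-width forms.
_CJK_PATTERN = re.compile(
    "[\u3000-\u303f\u3400-\u4dbf\u4e00-\u9fff\uff00-\uffef]"
)


def find_cjk(text: str) -> List[str]:
    """Return every CJK character found in text, in order of appearance."""
    return _CJK_PATTERN.findall(text)
```

Such a check is best applied to the prompt literals themselves, or to a prompt rendered under locale `en`, because under locale `zh` the `get_language_instruction()` postfix legitimately contributes Chinese text.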
diff --git a/.kiro/specs/i18n-oasis-profile-generator-prompts/research.md b/.kiro/specs/i18n-oasis-profile-generator-prompts/research.md new file mode 100644 index 00000000..baae60a9 --- /dev/null +++ b/.kiro/specs/i18n-oasis-profile-generator-prompts/research.md @@ -0,0 +1,222 @@ +# Research & Design Decisions — i18n-oasis-profile-generator-prompts + +## Summary + +- **Feature**: `i18n-oasis-profile-generator-prompts` +- **Discovery Scope**: **Extension** (single-file translation in an existing + brownfield service; sibling pattern already merged in #2, #4, #5) +- **Key Findings**: + - The existing `get_language_instruction()` postfix mechanism (defined in + `backend/app/utils/locale.py`) is the project-canonical way to steer LLM + output language. Translating the base prompt does not interfere with it + and is the same approach taken in already-merged sibling specs. + - The only Chinese surfaces inside the prompt-rendering path are + `_get_system_prompt`, `_build_individual_persona_prompt`, + `_build_group_persona_prompt`, and the four `attrs_str`/`context_str` + fallback literals (`"无"`, `"无额外上下文"`). All other Chinese in the + file is logger keys (already done by #6), docstrings/comments + (out-of-scope, #7), or rule-based fallback data (out-of-scope). + - `backend/scripts/test_profile_format.py` does not exercise prompts; it + only constructs `OasisAgentProfile` and round-trips through + `_save_twitter_csv` / `_save_reddit_json`. A pure-translation diff + cannot break it. + +## Research Log + +### Locale steering mechanism + +- **Context**: Confirm that translating the base prompt does not regress + Chinese output under `Accept-Language: zh`. +- **Sources Consulted**: + - `backend/app/utils/locale.py` (lines 50–96). + - `locales/languages.json` (entries for `en` and `zh` with + `llmInstruction` field). + - Sibling spec `i18n-ontology-generator-prompts/design.md` and the + merged commits referenced by it. 
+- **Findings**: + - `get_language_instruction()` returns `Please respond in English.` + for locale `en`, `请使用中文回答。` for locale `zh`. + - The function is called as an inline f-string interpolation in the + individual-persona and group-persona prompt bodies, and explicitly + appended in `_get_system_prompt`. All three sites must be preserved + byte-for-byte. + - The thread-local locale is captured in + `generate_profiles_for_entities` (line ~910) and restored inside the + worker via `set_locale(current_locale)` (line ~914). This plumbing is + untouched by the change. +- **Implications**: + - Design lock-in: the inline `{get_language_instruction()}` call must + remain in each of the three builders. Removing or renaming it would + silently regress non-English locales. + - The Chinese hint `country: 国家(使用中文,如"中国")` in the original + prompt overrides the locale postfix and forces Chinese output for one + field. The English translation drops that hint so the locale postfix + decides the country language. The rule-based fallback (out of scope) + has its own (Chinese) defaults and is not affected. + +### Test contract + +- **Context**: Verify that `backend/scripts/test_profile_format.py` + remains green after a prompt-only translation. +- **Sources Consulted**: `backend/scripts/test_profile_format.py`, + `oasis_profile_generator.py:_save_twitter_csv`, + `oasis_profile_generator.py:_save_reddit_json`, + `oasis_profile_generator.py:to_reddit_format`, + `oasis_profile_generator.py:to_twitter_format`. +- **Findings**: + - The pytest function `test_profile_formats` constructs + `OasisAgentProfile` instances directly without invoking the LLM. + - It calls `_save_twitter_csv` and `_save_reddit_json` to verify CSV + and JSON shape. Required CSV header: `user_id, user_name, name, bio, + friend_count, follower_count, statuses_count, created_at`. Required + JSON keys: `realname, username, bio, persona`. +- **Implications**: + - Translating prompts cannot regress this test. 
The validation + requirement (Requirement 7) is satisfied automatically as long as + serializer code is not edited. + - No new tests are required for this change. + +### Sibling specs already shipped + +- **Context**: Confirm there is an established project pattern this work + must mirror. +- **Sources Consulted**: + - `.kiro/specs/i18n-ontology-generator-prompts/{design,tasks,requirements}.md` + - `.kiro/specs/i18n-report-agent-prompts/` + - `.kiro/specs/i18n-simulation-config-generator-prompts/` + - Recent merged commits referencing #2, #4, #5. +- **Findings**: + - All three siblings used a single-file in-place translation diff. + - All three preserved every `get_language_instruction()` call site. + - All three left logger calls and docstrings to companion issues + (#6 / #7). + - None externalized prompts to `/locales/*.json`. +- **Implications**: + - The same approach is correct here. Reviewer expectations are set by + the sibling diffs. + +### OASIS profile schema + +- **Context**: Verify that translated prompts continue to satisfy the + OASIS subprocess's expected schema (especially `gender` enum and + `age` integer). +- **Sources Consulted**: `OasisAgentProfile` dataclass, + `to_reddit_format`, `to_twitter_format`, sibling `_generate_profile_rule_based`. +- **Findings**: + - OASIS-required fields are produced by serializers, not by the + prompt: `user_id`, `username`, `name`, `bio`, `karma`/`friend_count`/`follower_count`/`statuses_count`, `created_at`. + - The prompt-defined fields land in optional positions: `age`, + `gender`, `mbti`, `country`, `profession`, `interested_topics`. + - The `gender` enum constraint (`"male"`/`"female"` for individuals, + `"other"` for groups) is locale-independent and must remain in + English text inside the translated prompt. +- **Implications**: + - The English prompt must explicitly call out `gender ∈ {male, female}` + (individual) and `gender == "other"` (group), independent of the + `get_language_instruction()` postfix. 
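The locale-independent constraints listed in this section (the `gender` enum and the integer `age`) could be checked on a parsed LLM response with a sketch like the one below. `check_locale_independent_fields` is a hypothetical helper introduced only for illustration. It is not part of `oasis_profile_generator.py`, and the spec does not require adding it.

```python
from typing import Any, Dict, List


def check_locale_independent_fields(profile: Dict[str, Any], is_group: bool) -> List[str]:
    """Return a list of problems; an empty list means the constraints hold.

    Hypothetical helper: mirrors the rules stated in the spec
    (individuals: gender is "male" or "female", age is an integer;
    groups: gender is "other", age is the integer 30).
    """
    problems: List[str] = []

    expected_genders = {"other"} if is_group else {"male", "female"}
    gender = profile.get("gender")
    if gender not in expected_genders:
        problems.append(f"gender {gender!r} not in {sorted(expected_genders)}")

    age = profile.get("age")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not isinstance(age, int) or isinstance(age, bool):
        problems.append(f"age {age!r} is not an integer")
    elif is_group and age != 30:
        problems.append(f"group age {age!r} is not 30")

    return problems
```

Run against a response produced under locale `zh`, a check like this would catch a translated `gender` value, which is the regression the English enum wording in the prompt is meant to prevent.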
+ +## Architecture Pattern Evaluation + +| Option | Description | Strengths | Risks / Limitations | Notes | +|--------|-------------|-----------|---------------------|-------| +| **A — In-place builder edit** | Translate three method bodies + four fallback literals directly | Smallest diff; matches sibling pattern; zero API change | None of note | **Selected** | +| B — Module-level constants | Hoist prompts to `INDIVIDUAL_PERSONA_PROMPT_TEMPLATE` etc. | Easier `git grep` | Larger diff; the inline `{get_language_instruction()}` call would need to become a `.format()` kwarg, which is a behavioural change beyond translation | Diverges from #4 / #5 | +| C — Externalize to `locales/*.json` | Move every prompt sentence into `t(...)` keys | Most i18n-pure | Three-file diff; diverges from project rationale (prompts use postfix mechanism, not key files) | Rejected | + +## Design Decisions + +### Decision: In-place edit of the three prompt builders (Option A) + +- **Context**: Three methods build prompt strings; one of them is a + one-line system prompt, the other two are large f-string templates + with embedded `{variable}` interpolations and an inline + `{get_language_instruction()}` call. +- **Alternatives Considered**: + 1. Option B — module-level constants. + 2. Option C — externalize to `/locales/*.json` keys. +- **Selected Approach**: Translate each method body in place. Replace + the four `"无"` / `"无额外上下文"` fallbacks with English equivalents + (`"None"` and `"No additional context"`). Preserve all `{...}` + interpolations and the inline `{get_language_instruction()}` call. +- **Rationale**: Matches merged sibling specs verbatim. Smallest review + surface. Zero API change. Out-of-scope surfaces (logger, docstrings, + rule-based fallback) cleanly avoided. +- **Trade-offs**: Leaves the file mixed-language in non-prompt parts + (docstrings, rule fallback) until #7 lands. Acceptable per scope + split. 
+- **Follow-up**: During implementation, run a regex audit for any + Chinese codepoints inside the three method bodies after the edit and + confirm the diff stays within + `backend/app/services/oasis_profile_generator.py`. + +### Decision: Drop the "use Chinese country names" hint + +- **Context**: The current prompt at line 704 reads + `country: 国家(使用中文,如"中国")` and at line 753 + `country: 国家(使用中文,如"中国")`. This forces Chinese for the + `country` field even under `Accept-Language: en`. +- **Alternatives Considered**: + 1. Translate to English literally: + `country: country (use English, e.g. "China")`. + 2. Drop the language hint entirely: + `country: country name string`. +- **Selected Approach**: Drop the language hint. Let + `get_language_instruction()` steer the country language alongside + every other free-text field. +- **Rationale**: Hard-coding a language in the prompt defeats the + locale-steering mechanism. The rule-based fallback (out of scope) + carries its own Chinese defaults; under the LLM path, locale should + decide. +- **Trade-offs**: Under `Accept-Language: zh`, the LLM may produce a + Chinese country name (e.g. `中国`) — this is the desired behaviour. + Under `Accept-Language: en`, the LLM produces English (`China`), + matching `COUNTRIES = ["China", "US", ...]` already in the file. +- **Follow-up**: Verify in the validation phase that a sample run under + locale `en` produces an English country name. + +### Decision: Keep `gender` enum constraint in English inside the prompt + +- **Context**: `gender` must be one of `"male"`/`"female"`/`"other"` + regardless of locale, because OASIS consumers and the + `_generate_profile_rule_based` fallback assume English values. +- **Alternatives Considered**: None — the constraint is a contract. +- **Selected Approach**: The translated prompt explicitly states the + enum in English, even when the locale postfix asks for Chinese + output: `gender MUST be one of "male" or "female" (English literal)`. 
+- **Rationale**: Same as the existing Chinese prompt (which already + states `必须是英文: "male" 或 "female"`). The translation preserves + the same lock-in. +- **Trade-offs**: None. +- **Follow-up**: Validation phase will check that under both locales + the produced `gender` is one of the three English literals. + +## Risks & Mitigations + +- **Risk**: Mistranslation drops a locale-independent constraint + (e.g. `gender` enum, `age` integer rule, `persona` no-newline rule). + - **Mitigation**: The implementation task list will enumerate every + constraint inline so reviewers can check by diff. +- **Risk**: Variable-name typo inside an f-string raises a `NameError` + when the builder is called (not at import time). + - **Mitigation**: Implementation task verifies that the set of + `{variable}` interpolations in each translated block matches the + pre-change set 1:1; a `python -c "import ..."` smoke import and a + `pytest backend/scripts/test_profile_format.py` run are mandatory. +- **Risk**: Accidentally leaving a CJK codepoint inside the three + builders. + - **Mitigation**: Final implementation step runs the project's + repo-level CJK guard regex (added by #26) constrained to the three + builders' line ranges. + +## References + +- `backend/app/services/oasis_profile_generator.py` — target file. +- `backend/app/utils/locale.py` — locale infrastructure. +- `locales/languages.json`, `locales/en.json`, `locales/zh.json` — + locale registries. +- `.kiro/specs/i18n-ontology-generator-prompts/` — sibling spec #2. +- `.kiro/specs/i18n-simulation-config-generator-prompts/` — sibling + spec #4. +- `.kiro/specs/i18n-report-agent-prompts/` — sibling spec #5. +- GitHub issue + [#3](https://github.com/salestech-group/MiroFish/issues/3).
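The CJK-codepoint risk listed under Risks & Mitigations can be audited on rendered prompt text with a few lines of Python. This is a minimal sketch, assuming "CJK" means the CJK Unified Ideographs block U+4E00..U+9FFF (the range the guard regex in this spec covers). `find_cjk_lines` and `strip_postfix` are hypothetical names, not functions from the repository.

```python
import re
from typing import List, Tuple

# Assumption: only the CJK Unified Ideographs block is audited. Full-width
# punctuation such as U+3002 lies outside this range and is not flagged.
CJK_RE = re.compile(r"[\u4e00-\u9fff]")


def find_cjk_lines(text: str) -> List[Tuple[int, str]]:
    """Return (line_number, line) for every line containing a CJK ideograph."""
    return [
        (lineno, line)
        for lineno, line in enumerate(text.splitlines(), start=1)
        if CJK_RE.search(line)
    ]


def strip_postfix(rendered: str, postfix: str) -> str:
    """Remove the locale postfix so that only the prompt body is audited."""
    return rendered.replace(postfix, "")
```

In use, the rendered output of a builder would be passed through `strip_postfix` with the value returned by `get_language_instruction()` for the active locale, and `find_cjk_lines` should then return an empty list.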
diff --git a/.kiro/specs/i18n-oasis-profile-generator-prompts/spec.json b/.kiro/specs/i18n-oasis-profile-generator-prompts/spec.json new file mode 100644 index 00000000..9b510223 --- /dev/null +++ b/.kiro/specs/i18n-oasis-profile-generator-prompts/spec.json @@ -0,0 +1,23 @@ +{ + "feature_name": "i18n-oasis-profile-generator-prompts", + "created_at": "2026-05-08T05:26:06Z", + "updated_at": "2026-05-08T05:30:00Z", + "language": "en", + "phase": "tasks-generated", + "ticket": 3, + "approvals": { + "requirements": { + "generated": true, + "approved": true + }, + "design": { + "generated": true, + "approved": true + }, + "tasks": { + "generated": true, + "approved": true + } + }, + "ready_for_implementation": true +} diff --git a/.kiro/specs/i18n-oasis-profile-generator-prompts/tasks.md b/.kiro/specs/i18n-oasis-profile-generator-prompts/tasks.md new file mode 100644 index 00000000..fc8a6810 --- /dev/null +++ b/.kiro/specs/i18n-oasis-profile-generator-prompts/tasks.md @@ -0,0 +1,66 @@ +# Implementation Plan + +- [x] 1. 
Translate the system-prompt builder to English + - Replace the Chinese `base_prompt` literal inside `_get_system_prompt` (currently `"你是社交媒体用户画像生成专家。…"` at line ~664) with an English rendering that conveys the same role and intent: identifies the model as an expert in social-media user-persona generation, asks for detailed and realistic personas suitable for opinion-simulation that faithfully reflect existing real-world conditions, mandates valid JSON output, and forbids unescaped newlines inside string values + - Preserve the assembled return shape `f"{base_prompt}\n\n{get_language_instruction()}"` exactly — the call to `get_language_instruction()` is unchanged in name and position + - Preserve the method signature `_get_system_prompt(self, is_individual: bool) -> str`; do not branch on `is_individual` (current behaviour preserved) + - Observable completion: `_get_system_prompt(True)` and `_get_system_prompt(False)` both return non-empty English strings ending with the per-locale postfix from `get_language_instruction()`; the `base_prompt` body contains zero CJK characters + - _Requirements: 1.1, 1.2, 1.3, 1.4_ + +- [x] 2. 
Translate the individual-persona user-message builder to English + - Replace the Chinese f-string body inside `_build_individual_persona_prompt` (currently lines ~680–714) with an English rendering structured as: a lead sentence requesting a detailed social-media persona faithful to existing reality; an entity-context block with English labels for `entity_name`, `entity_type`, `entity_summary`, `entity_attributes`; a `Context information:` block; a `Generate JSON with the following fields:` enumeration of the eight output keys (`bio`, `persona`, `age`, `gender`, `mbti`, `country`, `profession`, `interested_topics`); and a trailing `Important:` rules block + - Translate the field-level descriptions verbatim in spirit: `bio` ≈ 200 chars; `persona` ≈ 2000 chars covering basic info (age, profession, education, location), background (notable experience, event association, social ties), personality (MBTI, core traits, emotional expression), social-media behaviour (posting frequency, content preferences, interaction style, language traits), stance (attitudes toward the topic, emotional triggers), unique features (catchphrases, special experiences, hobbies), and personal memory (the entity's relation to the event and prior actions/reactions); `age` integer; `gender` MUST be the literal `"male"` or `"female"`; `mbti` four-letter type; `country` country name; `profession`; `interested_topics` array + - Translate the trailing rules block to English while keeping every locale-independent constraint intact: all values are strings or numbers; `persona` is a single coherent text without unescaped newlines; the inline `{get_language_instruction()}` call remains followed by the parenthetical reminder that `gender` MUST use the English values `"male"` / `"female"`; content stays consistent with the entity; `age` MUST be a valid integer + - Replace the `attrs_str` and `context_str` Chinese fallback defaults with English: `"无"` → `"None"` (used when `entity_attributes` is empty/falsy) 
and `"无额外上下文"` → `"No additional context"` (used when `context` is empty/falsy) + - Drop the country-language hint `(使用中文,如"中国")` so `get_language_instruction()` steers the country language; preserve the country line as a neutral `country: country name` entry + - Preserve every f-string interpolation by name and position: `{entity_name}`, `{entity_type}`, `{entity_summary}`, `{attrs_str}`, `{context_str}`, `{get_language_instruction()}` + - Preserve the `context[:3000]` truncation behaviour and the method signature `_build_individual_persona_prompt(self, entity_name: str, entity_type: str, entity_summary: str, entity_attributes: Dict[str, Any], context: str) -> str` + - Observable completion: calling `_build_individual_persona_prompt("Alice", "Student", "summary", {"k": "v"}, "ctx")` returns a non-empty English string with all six interpolations resolved, with zero CJK characters in any literal contributed by this method, and the string contains the `gender` enum lock-in `"male"` / `"female"` at least once + - _Requirements: 2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 2.7, 2.8, 4.1, 4.5_ + +- [x] 3. 
Translate the group/institution-persona user-message builder to English + - Replace the Chinese f-string body inside `_build_group_persona_prompt` (currently lines ~729–762) with an English rendering structured the same way as Task 2 but adapted for institutional voice: lead sentence requesting a detailed social-media account profile for an institution/group faithful to existing reality; entity-context block; `Context information:` block; `Generate JSON with the following fields:` enumeration of the eight output keys; trailing `Important:` rules block + - Translate the field-level descriptions verbatim in spirit: `bio` ≈ 200 chars in an official-account voice; `persona` ≈ 2000 chars covering institutional basics (formal name, type, founding background, primary functions), account positioning (account type, target audience, core function), voice (language traits, common phrasing, taboo topics), publishing pattern (content types, publishing frequency, active hours), stance (official position on the core topic, controversy-handling style), special notes (group portrait represented, operational habits), and institutional memory (the institution's relation to the event and prior actions/reactions); `age` MUST be the integer `30`; `gender` MUST be the literal `"other"`; `mbti` four-letter type characterizing account voice; `country`; `profession` describes institutional function; `interested_topics` array + - Translate the trailing rules block to English while keeping every locale-independent constraint intact: all values are strings or numbers, no `null` allowed; `persona` is a single coherent text without unescaped newlines; the inline `{get_language_instruction()}` call remains followed by the parenthetical reminder that `gender` MUST use the English value `"other"`; `age` MUST be the integer `30` and `gender` MUST be the string `"other"`; account voice must match identity positioning + - Replace the `attrs_str` and `context_str` Chinese fallback defaults with the 
same English replacements applied in Task 2 (`"None"` and `"No additional context"`) + - Drop the country-language hint as in Task 2 + - Preserve every f-string interpolation by name and position: `{entity_name}`, `{entity_type}`, `{entity_summary}`, `{attrs_str}`, `{context_str}`, `{get_language_instruction()}` + - Preserve the `context[:3000]` truncation behaviour and the method signature `_build_group_persona_prompt(self, entity_name: str, entity_type: str, entity_summary: str, entity_attributes: Dict[str, Any], context: str) -> str` + - Observable completion: calling `_build_group_persona_prompt("ACME Corp", "Organization", "summary", {"k": "v"}, "ctx")` returns a non-empty English string with all six interpolations resolved, with zero CJK characters in any literal contributed by this method, and the string contains both the `age == 30` lock-in and the `gender == "other"` lock-in + - _Requirements: 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7, 3.8, 4.1, 4.5_ + +- [x] 4. Confirm boundary commitments around the translation + - Confirm every existing `get_language_instruction()` call site is preserved verbatim: the system-prompt assembly inside `_get_system_prompt`, the inline call inside the trailing rules block of `_build_individual_persona_prompt`, and the inline call inside the trailing rules block of `_build_group_persona_prompt` + - Confirm the locale-thread plumbing in `generate_profiles_for_entities` (capture `current_locale = get_locale()` at line ~910 and `set_locale(current_locale)` inside the worker at line ~914) is byte-identical + - Confirm the public signatures of `OasisProfileGenerator.__init__`, `generate_profile_from_entity`, `generate_profiles_for_entities`, `set_graph_id`, and the private helpers `_call_llm_with_retry`, `_generate_profile_rule_based`, `_print_generated_profile`, `_fix_truncated_json`, `_try_fix_json`, `_save_twitter_csv`, `_save_reddit_json`, `_generate_username` are unchanged + - Confirm the `OasisAgentProfile` dataclass field set, 
default values, and the `to_reddit_format`, `to_twitter_format`, `to_full_dict` serializers are unchanged + - Confirm class constants `MBTI_TYPES`, `COUNTRIES`, `INDIVIDUAL_ENTITY_TYPES`, `GROUP_ENTITY_TYPES` are unchanged + - Confirm the LLM invocation parameters at the call site that consumes the translated prompts (`response_format={"type": "json_object"}`, `temperature=0.7 - (attempt * 0.1)`, `max_attempts=3`) are unchanged + - Confirm `_fix_truncated_json` and `_try_fix_json` (including their Chinese persona fragments such as `f"{entity_name}是一个{entity_type}。"`) are not modified — these are runtime data fallbacks, not prompts, and are out of scope + - Confirm `_generate_profile_rule_based` is not modified — including its Chinese country defaults `"中国"` at lines ~807 and ~819 + - Confirm `backend/app/utils/locale.py`, `/locales/languages.json`, `/locales/en.json`, and `/locales/zh.json` are not modified + - Confirm `logger.warning(...)`, `logger.info(...)`, `logger.error(...)`, the print banner at line ~945, module / class / method docstrings, and inline comments in `oasis_profile_generator.py` are not modified (owned by issues #6 and #7) + - Confirm `backend/scripts/test_profile_format.py`, `backend/pyproject.toml`, `backend/uv.lock`, and any file outside `backend/app/services/oasis_profile_generator.py` are not modified + - Observable completion: a `git diff` review against `main` shows changes only inside `backend/app/services/oasis_profile_generator.py`, only inside `_get_system_prompt`, `_build_individual_persona_prompt`, `_build_group_persona_prompt`, and the surrounding lines (method headers, neighbouring methods) are byte-identical + - _Requirements: 1.4, 2.6, 3.6, 4.1, 4.2, 4.6, 5.1, 5.2, 5.3, 5.4, 6.1, 6.3, 8.1, 8.2, 8.3, 8.4, 8.5, 8.6_ + +- [x] 5. 
Verify smoke import and OASIS profile-format pytest + - Run `cd backend && uv run python -c "from app.services.oasis_profile_generator import OasisProfileGenerator, OasisAgentProfile"` and confirm it exits 0 (catches f-string syntax errors) + - Run `cd backend && uv run python -m pytest scripts/test_profile_format.py` (or equivalent invocation per project convention) and confirm it passes; the test does not exercise prompts, so a pure-translation diff must keep it green + - Construct an instance of `OasisProfileGenerator` (using `OasisProfileGenerator.__new__(OasisProfileGenerator)` to skip `__init__` if the LLM key is unavailable, mirroring the pattern in `test_profile_format.py`) and confirm `_get_system_prompt(True)`, `_build_individual_persona_prompt("Alice", "Student", "summary", {"k": "v"}, "ctx")`, and `_build_group_persona_prompt("ACME", "Organization", "summary", {"k": "v"}, "ctx")` each return a string with zero CJK matches against the regex `[一-鿿]` outside the text contributed by `get_language_instruction()` + - Observable completion: smoke import exits 0; pytest passes with zero regressions; the three prompt-builder calls each produce English-only output under the default `zh` locale (the text returned by `get_language_instruction()` is the only place where Chinese is allowed to appear, and only when locale is `zh`) + - _Requirements: 6.4, 7.1, 7.2, 7.3, 7.4_ + +- [x] 6. 
Verify locale-driven output language under both `en` and `zh` + - With the thread-local locale forced via `set_locale("en")`, render each of the three builders against representative inputs and confirm: each output contains zero CJK characters; each contains the English locale postfix `"Please respond in English."` (at the very end for `_get_system_prompt`, inline in the `Important:` block for the two persona builders); the `gender` enum constraint appears as English `"male"` / `"female"` (individual) or `"other"` (group) + - With `set_locale("zh")`, render the same three builders and confirm: the per-prompt body remains English-only (the translated base prompt does not depend on locale); each contains the Chinese locale postfix `"请使用中文回答。"` in the same positions; the `gender` enum constraint still appears as the English literal values + - Optionally, with a configured LLM key, run `OasisProfileGenerator().generate_profile_from_entity(...)` end-to-end under each locale against a synthetic `EntityNode` and spot-check that the produced `bio`, `persona`, `profession` are English under `en` and Chinese under `zh`, while `gender` is one of the three English enum literals under both + - Observable completion: the locale-`en` rendering is CJK-free in the prompt body and contains the English locale postfix; the locale-`zh` rendering preserves the prompt body in English and contains the Chinese locale postfix; if the LLM round-trip is exercised, results are recorded in the PR description + - _Requirements: 4.3, 4.4, 4.5_ + +- [x] 7. 
Final CJK regression sweep on the three builders + - Run a regex audit limited to the three method bodies (`_get_system_prompt`, `_build_individual_persona_prompt`, `_build_group_persona_prompt`) using the project-level CJK guard regex (`[一-鿿]`) and confirm zero matches inside their prompt string literals (the untouched Chinese docstrings belong to #7 and are excluded) + - Run a CJK audit on the rendered output of the three builders for representative inputs and confirm zero matches in the prompt body (the locale postfix is excluded; its Chinese form under `zh` is deliberately kept) + - Confirm the file-level `git grep -nP '[一-鿿]' -- backend/app/services/oasis_profile_generator.py` output still flags only known out-of-scope locations (docstrings, comments, logger keys, rule-based fallback country `"中国"` defaults, and resilience-helper Chinese fragments) and does not flag any prompt line inside the three translated method bodies + - Observable completion: the targeted regex audit returns zero matches inside the prompt literals of the three method bodies; the file-level audit's residual CJK lines all match the out-of-scope inventory in `design.md` § Boundary Commitments → Out of Boundary + - _Requirements: 1.1, 2.8, 3.8, 8.1, 8.2, 8.3_ diff --git a/backend/app/services/oasis_profile_generator.py b/backend/app/services/oasis_profile_generator.py index d80f8df3..98236ffd 100644 --- a/backend/app/services/oasis_profile_generator.py +++ b/backend/app/services/oasis_profile_generator.py @@ -661,9 +661,9 @@ class OasisProfileGenerator: def _get_system_prompt(self, is_individual: bool) -> str: """获取系统提示词""" - base_prompt = "你是社交媒体用户画像生成专家。生成详细、真实的人设用于舆论模拟,最大程度还原已有现实情况。必须返回有效的JSON格式,所有字符串值不能包含未转义的换行符。" + base_prompt = "You are an expert in social-media user-persona generation. Produce detailed, realistic personas for opinion simulation that faithfully reflect existing real-world conditions. You MUST return valid JSON; no string value may contain unescaped newlines." 
return f"{base_prompt}\n\n{get_language_instruction()}" - + def _build_individual_persona_prompt( self, entity_name: str, @@ -673,44 +673,44 @@ class OasisProfileGenerator: context: str ) -> str: """构建个人实体的详细人设提示词""" - - attrs_str = json.dumps(entity_attributes, ensure_ascii=False) if entity_attributes else "无" - context_str = context[:3000] if context else "无额外上下文" - - return f"""为实体生成详细的社交媒体用户人设,最大程度还原已有现实情况。 -实体名称: {entity_name} -实体类型: {entity_type} -实体摘要: {entity_summary} -实体属性: {attrs_str} + attrs_str = json.dumps(entity_attributes, ensure_ascii=False) if entity_attributes else "None" + context_str = context[:3000] if context else "No additional context" -上下文信息: + return f"""Generate a detailed social-media user persona for the entity, faithfully reflecting existing real-world conditions. + +Entity name: {entity_name} +Entity type: {entity_type} +Entity summary: {entity_summary} +Entity attributes: {attrs_str} + +Context information: {context_str} -请生成JSON,包含以下字段: +Generate JSON with the following fields: -1. bio: 社交媒体简介,200字 -2. persona: 详细人设描述(2000字的纯文本),需包含: - - 基本信息(年龄、职业、教育背景、所在地) - - 人物背景(重要经历、与事件的关联、社会关系) - - 性格特征(MBTI类型、核心性格、情绪表达方式) - - 社交媒体行为(发帖频率、内容偏好、互动风格、语言特点) - - 立场观点(对话题的态度、可能被激怒/感动的内容) - - 独特特征(口头禅、特殊经历、个人爱好) - - 个人记忆(人设的重要部分,要介绍这个个体与事件的关联,以及这个个体在事件中的已有动作与反应) -3. age: 年龄数字(必须是整数) -4. gender: 性别,必须是英文: "male" 或 "female" -5. mbti: MBTI类型(如INTJ、ENFP等) -6. country: 国家(使用中文,如"中国") -7. profession: 职业 -8. interested_topics: 感兴趣话题数组 +1. bio: social-media biography, ~200 characters +2. 
persona: detailed persona description (~2000 characters of plain text), covering: + - Basic information (age, profession, education, location) + - Background (notable experience, association with the event, social ties) + - Personality (MBTI type, core traits, emotional expression) + - Social-media behavior (posting frequency, content preferences, interaction style, language traits) + - Stance (attitudes toward the topic, content likely to anger or move them) + - Unique features (catchphrases, special experiences, hobbies) + - Personal memory (a key part of the persona: this individual's relation to the event and prior actions/reactions in it) +3. age: age number (MUST be an integer) +4. gender: gender, MUST be one of the English literals: "male" or "female" +5. mbti: MBTI type (e.g. INTJ, ENFP) +6. country: country name +7. profession: profession +8. interested_topics: array of interest topics -重要: -- 所有字段值必须是字符串或数字,不要使用换行符 -- persona必须是一段连贯的文字描述 -- {get_language_instruction()} (gender字段必须用英文male/female) -- 内容要与实体信息保持一致 -- age必须是有效的整数,gender必须是"male"或"female" +Important: +- All field values MUST be strings or numbers; do not use unescaped newlines. +- persona MUST be a single coherent block of text. +- {get_language_instruction()} (gender field MUST use the English values "male" or "female") +- Content must remain consistent with the entity information. +- age MUST be a valid integer; gender MUST be "male" or "female". 
""" def _build_group_persona_prompt( @@ -722,44 +722,44 @@ class OasisProfileGenerator: context: str ) -> str: """构建群体/机构实体的详细人设提示词""" - - attrs_str = json.dumps(entity_attributes, ensure_ascii=False) if entity_attributes else "无" - context_str = context[:3000] if context else "无额外上下文" - - return f"""为机构/群体实体生成详细的社交媒体账号设定,最大程度还原已有现实情况。 -实体名称: {entity_name} -实体类型: {entity_type} -实体摘要: {entity_summary} -实体属性: {attrs_str} + attrs_str = json.dumps(entity_attributes, ensure_ascii=False) if entity_attributes else "None" + context_str = context[:3000] if context else "No additional context" -上下文信息: + return f"""Generate a detailed social-media account profile for the institution/group entity, faithfully reflecting existing real-world conditions. + +Entity name: {entity_name} +Entity type: {entity_type} +Entity summary: {entity_summary} +Entity attributes: {attrs_str} + +Context information: {context_str} -请生成JSON,包含以下字段: +Generate JSON with the following fields: -1. bio: 官方账号简介,200字,专业得体 -2. persona: 详细账号设定描述(2000字的纯文本),需包含: - - 机构基本信息(正式名称、机构性质、成立背景、主要职能) - - 账号定位(账号类型、目标受众、核心功能) - - 发言风格(语言特点、常用表达、禁忌话题) - - 发布内容特点(内容类型、发布频率、活跃时间段) - - 立场态度(对核心话题的官方立场、面对争议的处理方式) - - 特殊说明(代表的群体画像、运营习惯) - - 机构记忆(机构人设的重要部分,要介绍这个机构与事件的关联,以及这个机构在事件中的已有动作与反应) -3. age: 固定填30(机构账号的虚拟年龄) -4. gender: 固定填"other"(机构账号使用other表示非个人) -5. mbti: MBTI类型,用于描述账号风格,如ISTJ代表严谨保守 -6. country: 国家(使用中文,如"中国") -7. profession: 机构职能描述 -8. interested_topics: 关注领域数组 +1. bio: official-account biography, ~200 characters, professional and appropriate +2. 
persona: detailed account-profile description (~2000 characters of plain text), covering: + - Institutional basics (formal name, institution type, founding background, primary functions) + - Account positioning (account type, target audience, core function) + - Voice (language traits, common phrasing, taboo topics) + - Publishing pattern (content types, publishing frequency, active hours) + - Stance (official position on the core topic, controversy-handling style) + - Special notes (the group portrait represented, operational habits) + - Institutional memory (a key part of the account profile: this institution's relation to the event and prior actions/reactions in it) +3. age: fixed integer 30 (the institutional virtual age) +4. gender: fixed literal "other" (institutional accounts use "other" to indicate non-individual) +5. mbti: MBTI type used to characterize account voice (e.g. ISTJ for strict/conservative) +6. country: country name +7. profession: institutional function description +8. interested_topics: array of focus areas -重要: -- 所有字段值必须是字符串或数字,不允许null值 -- persona必须是一段连贯的文字描述,不要使用换行符 -- {get_language_instruction()} (gender字段必须用英文"other") -- age必须是整数30,gender必须是字符串"other" -- 机构账号发言要符合其身份定位""" +Important: +- All field values MUST be strings or numbers; null values are not allowed. +- persona MUST be a single coherent block of text without unescaped newlines. +- {get_language_instruction()} (gender field MUST use the English value "other") +- age MUST be the integer 30; gender MUST be the string "other". +- Account voice MUST match the institution's identity positioning.""" def _generate_profile_rule_based( self,