Design Document — i18n-oasis-profile-generator-prompts

Overview

Purpose: Translate the Chinese prompt strings in backend/app/services/oasis_profile_generator.py (the system prompt inside _get_system_prompt, the individual-persona f-string template inside _build_individual_persona_prompt, the group-persona f-string template inside _build_group_persona_prompt, and the four attrs_str/context_str fallback literals) to English while preserving every functional contract — JSON output keys, the gender English enum, the age integer rule, the persona no-newline rule, all {variable} interpolations, and every get_language_instruction() call site. The goal is to remove the Chinese-language base-prompt bias that currently leaks Chinese structure and word choice into persona output even when Accept-Language: en.

Users: MiroFish operators running the Step 2 environment-setup pipeline under any locale; downstream Step 3 (CAMEL-OASIS subprocess) which consumes the produced persona dictionaries.

Impact: Replaces one one-line system prompt, two large f-string templates, and four fallback literals with English equivalents, all inside one file. No API change, no new dependencies, no new files. The two production callers (backend/app/services/simulation_manager.py:316 and backend/app/api/simulation.py:1413) and the OASIS subprocess are unaffected.

Goals

  • Zero CJK characters in any prompt string literal contributed by oasis_profile_generator.py to the system prompt or the two user-message bodies (including the attrs_str/context_str fallback literals).
  • English persona prose (bio, persona, profession, interested_topics) under Accept-Language: en.
  • Continued Chinese persona prose under Accept-Language: zh, of equivalent quality to the pre-change behaviour.
  • gender field stays exactly one of "male"/"female"/"other" regardless of locale.
  • No diff to public signatures, taxonomy lists, LLM-call parameters, or call sites.

Non-Goals

  • Externalizing prompts to /locales/*.json (out of scope per ticket).
  • Translating logger calls in this file (covered by issue #6).
  • Translating module/class/method docstrings or inline comments (covered by issue #7).
  • Refactoring the OasisAgentProfile schema, MBTI_TYPES / COUNTRIES lists, or the INDIVIDUAL_ENTITY_TYPES / GROUP_ENTITY_TYPES taxonomies.
  • Modifying the rule-based fallback (_generate_profile_rule_based) including its Chinese country defaults.
  • Modifying the resilience helpers _fix_truncated_json / _try_fix_json and the Chinese persona fallback fragments inside them (e.g. f"{entity_name}是一个{entity_type}。").
  • Modifying backend/app/utils/locale.py, the locale registries, or any non-target file.
  • Modifying backend/scripts/test_profile_format.py.

Boundary Commitments

This Spec Owns

  • The English content of _get_system_prompt's base_prompt literal.
  • The English content of the f-string template body in _build_individual_persona_prompt.
  • The English content of the f-string template body in _build_group_persona_prompt.
  • The English replacements for the four "无" / "无额外上下文" fallback literals (in both individual and group builders).

Out of Boundary

  • Locale resolution machinery (backend/app/utils/locale.py).
  • Per-locale llmInstruction definitions (/locales/languages.json).
  • Reasoning-model output stripping inside _fix_truncated_json / _try_fix_json.
  • Logger calls and translation keys (t("log.profile_generator.*")) inside oasis_profile_generator.py (issue #6, already merged).
  • Module / class / method docstrings and inline comments inside oasis_profile_generator.py (issue #7).
  • Rule-based fallback (_generate_profile_rule_based) including its Chinese country defaults "中国".
  • Chinese persona fragments inside the resilience helpers (e.g. f"{entity_name}是一个{entity_type}。") — those are runtime data fallbacks, not LLM prompts.
  • All callers of OasisProfileGenerator (simulation_manager.py, api/simulation.py).
  • Tests, scripts, and frontend code.
  • The print(...) banner at line 945 (closely associated with logger externalization #6).

Allowed Dependencies

  • Existing imports in the target file (no additions). Specifically: get_language_instruction, get_locale, set_locale, t from ..utils.locale are already imported and remain unchanged.
  • Existing LLM transport via self.client.chat.completions.create (unchanged).

Revalidation Triggers

The following changes elsewhere would invalidate this design:

  • A change to the JSON contract emitted by the LLM (bio, persona, age, gender, mbti, country, profession, interested_topics keys).
  • A change to the OasisAgentProfile dataclass field set or the Reddit/Twitter serializers.
  • A change to get_language_instruction() semantics or the per-locale llmInstruction strings.
  • A change to OASIS subprocess profile-format expectations (verified via backend/scripts/test_profile_format.py).

Architecture

Existing Architecture Analysis

OasisProfileGenerator lives in backend/app/services/, follows the in-process service pattern, and is invoked from a Flask handler inside a background task. The relevant flow:

  1. The Flask handler resolves the request locale via Accept-Language; set_locale() is propagated into worker threads in generate_profiles_for_entities (locale captured at line ~910 and restored inside generate_single_profile at line ~914).
  2. For each entity, generate_profile_from_entity decides between the individual or group prompt builder via self._is_individual_entity(entity_type).
  3. The chosen builder produces a user-message string; _get_system_prompt produces a system-message string. Both are sent to the LLM via self.client.chat.completions.create(..., response_format={"type": "json_object"}).
  4. The LLM response is JSON-decoded; on failure, _try_fix_json and _fix_truncated_json attempt recovery; on terminal failure, _generate_profile_rule_based produces a rule-based persona.
  5. The result is wrapped in an OasisAgentProfile dataclass and serialized to Reddit JSON or Twitter CSV via _save_reddit_json / _save_twitter_csv.
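The locale hand-off in step 1 can be sketched as follows. This is a minimal illustration that assumes the active locale is held in thread-local storage; `set_locale` and `get_locale` below are simplified stand-ins for the helpers in backend/app/utils/locale.py, and `generate_all` / `generate_single` are hypothetical names, not the real method bodies.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Stand-ins for the locale helpers (assumed here to be thread-local).
_state = threading.local()


def set_locale(locale: str) -> None:
    _state.locale = locale


def get_locale() -> str:
    # The "zh" default is an assumption made for this sketch only.
    return getattr(_state, "locale", "zh")


def generate_all(entity_names: list[str]) -> list[str]:
    # Capture the request locale in the calling thread.
    current_locale = get_locale()

    def generate_single(name: str) -> str:
        # Restore it inside the worker, where thread-local state starts empty.
        set_locale(current_locale)
        return f"{name}:{get_locale()}"

    with ThreadPoolExecutor(max_workers=2) as pool:
        return list(pool.map(generate_single, entity_names))


set_locale("en")
results = generate_all(["Alice", "Bob"])
```

Without the `set_locale(current_locale)` call inside the worker, each worker in this sketch would fall back to the default locale, which is why the plumbing is listed as preserved.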

This design preserves all of the above. The change is purely lexical inside three method bodies and four literal defaults.

Architecture Pattern & Boundary Map

graph TB
    Caller["simulation_manager.py / api/simulation.py"]
    Generator["OasisProfileGenerator"]
    Sys["_get_system_prompt"]
    Ind["_build_individual_persona_prompt"]
    Grp["_build_group_persona_prompt"]
    Locale["locale.get_language_instruction"]
    Client["openai.chat.completions.create"]
    Parser["_try_fix_json / _fix_truncated_json"]
    Fallback["_generate_profile_rule_based"]
    Serializer["_save_reddit_json / _save_twitter_csv"]

    Caller --> Generator
    Generator --> Sys
    Generator --> Ind
    Generator --> Grp
    Sys -. inline call .-> Locale
    Ind -. inline call .-> Locale
    Grp -. inline call .-> Locale
    Sys --> Client
    Ind --> Client
    Grp --> Client
    Client --> Parser
    Parser --> Fallback
    Generator --> Serializer

    classDef change fill:#fff4ce,stroke:#a16207,color:#000
    class Sys,Ind,Grp change

The three highlighted nodes (_get_system_prompt, _build_individual_persona_prompt, _build_group_persona_prompt) are the only nodes whose string contents change. Every edge — including each call to get_language_instruction() — remains intact.

Architecture Integration:

  • Selected pattern: In-place lexical translation of the three prompt builders (Option A from gap-analysis.md / research.md).
  • Domain/feature boundaries: Same as today; OasisProfileGenerator remains the sole owner of persona prompt content. LocaleService remains the sole owner of locale-postfix steering.
  • Existing patterns preserved: locale-thread propagation, retry logic with temperature decay, JSON resilience helpers, rule-based fallback, two-platform serialization.
  • New components rationale: none — no new components.
  • Steering compliance: aligns with tech.md ("LLM prompts use the get_language_instruction() postfix mechanism, not key files") and structure.md ("services own their own prompt strings").

Technology Stack & Alignment

| Layer | Choice / Version | Role in Feature | Notes |
| --- | --- | --- | --- |
| Backend / Services | Python ≥3.11 | Hosts the prompt builders | No version change |
| LLM transport | openai SDK against any OpenAI-compatible endpoint | Sends translated prompts | Unchanged |
| i18n | backend/app/utils/locale.py | Resolves locale and provides get_language_instruction() postfix | Unchanged |
| Storage | None | | No persistence change |

No new dependencies. No version bumps. The locale infrastructure used by the change is the same one used by every sibling i18n spec already merged.

File Structure Plan

Modified Files

  • backend/app/services/oasis_profile_generator.py — only file that changes.
    • _get_system_prompt(self, is_individual: bool) -> str — translate base_prompt literal to English. Keep f"{base_prompt}\n\n{get_language_instruction()}" shape.
    • _build_individual_persona_prompt(self, entity_name, entity_type, entity_summary, entity_attributes, context) -> str — translate the f-string body to English; replace "无" and "无额外上下文" defaults; keep every {variable} interpolation and the inline {get_language_instruction()} call.
    • _build_group_persona_prompt(self, entity_name, entity_type, entity_summary, entity_attributes, context) -> str — same treatment as the individual builder.

No other files in the repository are touched by this change.

System Flows

The runtime flow does not change. The call graph before and after the edit is identical and is already shown in the Architecture diagram above, so no separate sequence diagram is included.

Requirements Traceability

| Requirement | Summary | Components | Interfaces | Flows |
| --- | --- | --- | --- | --- |
| 1.1 | base_prompt contains zero Chinese characters | _get_system_prompt | (self, is_individual: bool) -> str | system-message construction |
| 1.2 | Preserve f"{base_prompt}\n\n{get_language_instruction()}" | _get_system_prompt | inline get_language_instruction() | system-message construction |
| 1.3 | Preserve role/intent semantics | _get_system_prompt | | |
| 1.4 | Preserve signature _get_system_prompt(self, is_individual: bool) -> str | _get_system_prompt | (signature) | |
| 2.1 | Individual prompt body in English | _build_individual_persona_prompt | f-string body | user-message construction |
| 2.2 | Preserve {entity_name}, {entity_type}, {entity_summary}, {attrs_str}, {context_str}, {get_language_instruction()} | _build_individual_persona_prompt | f-string interpolations | |
| 2.3 | Preserve JSON keys bio, persona, age, gender, mbti, country, profession, interested_topics | _build_individual_persona_prompt | prompt content | |
| 2.4 | Preserve field-level constraints (lengths, MBTI, gender enum, age int) | _build_individual_persona_prompt | prompt content | |
| 2.5 | Preserve trailing-rules block semantics | _build_individual_persona_prompt | prompt content | |
| 2.6 | Preserve method signature | _build_individual_persona_prompt | (signature) | |
| 2.7 | Translate "无" and "无额外上下文" defaults | _build_individual_persona_prompt | literal defaults | |
| 2.8 | Zero Chinese in assembled body | _build_individual_persona_prompt | | |
| 3.1 | Group prompt body in English | _build_group_persona_prompt | f-string body | user-message construction |
| 3.2 | Preserve interpolations | _build_group_persona_prompt | f-string interpolations | |
| 3.3 | Preserve JSON keys | _build_group_persona_prompt | prompt content | |
| 3.4 | Preserve field-level constraints (age=30, gender="other", etc.) | _build_group_persona_prompt | prompt content | |
| 3.5 | Preserve trailing-rules semantics | _build_group_persona_prompt | prompt content | |
| 3.6 | Preserve method signature | _build_group_persona_prompt | (signature) | |
| 3.7 | Translate "无" / "无额外上下文" defaults | _build_group_persona_prompt | literal defaults | |
| 3.8 | Zero Chinese in assembled body | _build_group_persona_prompt | | |
| 4.1 | Preserve every get_language_instruction() call site | all three builders | inline call | system + user message construction |
| 4.2 | Preserve locale-thread plumbing | generate_profiles_for_entities (untouched) | set_locale(current_locale) | worker thread spawn |
| 4.3 | Locale=zh produces Chinese personas | runtime behaviour | locale postfix | LLM call |
| 4.4 | Locale=en produces English personas | runtime behaviour | locale postfix | LLM call |
| 4.5 | gender ∈ {male, female, other} regardless of locale | | prompt content | |
| 4.6 | Don't alter locale.py / locales/ | (none) | | |
| 5.1 | Preserve OasisAgentProfile dataclass | (untouched) | dataclass | |
| 5.2 | Preserve method signatures | (untouched) | signatures | |
| 5.3 | Preserve LLM invocation parameters | (untouched) | chat.completions.create(...) | |
| 5.4 | Preserve MBTI_TYPES, COUNTRIES, taxonomy lists | (untouched) | class constants | |
| 6.1 | Preserve _fix_truncated_json / _try_fix_json | (untouched) | helpers | |
| 6.2 | Reasoning-model recovery still works | (untouched) | resilience helpers | |
| 6.3 | No new prompt-language-dependent pre-processing | (none added) | | |
| 6.4 | Round-trip yields non-empty bio and persona | runtime behaviour | | LLM call |
| 7.1 | pytest test_profile_format.py passes | runtime behaviour | serializers | |
| 7.2 | Reddit format schema preserved | (untouched) | to_reddit_format | |
| 7.3 | Twitter format schema preserved | (untouched) | to_twitter_format | |
| 7.4 | gender enum preserved | | prompt content | |
| 8.1 | No logger edits | (untouched) | | |
| 8.2 | No docstring/comment edits | (untouched) | | |
| 8.3 | No rule-based fallback edits | (untouched) | | |
| 8.4 | No edits outside the target file | (none) | | |
| 8.5 | No new dependencies | (none) | pyproject.toml / uv.lock untouched | |
| 8.6 | No edits to test_profile_format.py | (untouched) | | |

Components and Interfaces

| Component | Domain/Layer | Intent | Req Coverage | Key Dependencies (P0/P1) | Contracts |
| --- | --- | --- | --- | --- | --- |
| _get_system_prompt | backend service / prompt builder | Produce the system message (English base + locale postfix) | 1.1, 1.2, 1.3, 1.4, 4.1, 4.5 | get_language_instruction (P0) | Service |
| _build_individual_persona_prompt | backend service / prompt builder | Produce the individual-entity user message in English | 2.x, 4.1, 4.5 | get_language_instruction (P0); JSON encoder (P1) | Service |
| _build_group_persona_prompt | backend service / prompt builder | Produce the group/institution user message in English | 3.x, 4.1, 4.5 | get_language_instruction (P0); JSON encoder (P1) | Service |

Only the three prompt-builder methods change. They all live inside the single class OasisProfileGenerator in backend/app/services/oasis_profile_generator.py. No new components.

Backend / Services

_get_system_prompt

| Field | Detail |
| --- | --- |
| Intent | Build the system message: a one-line English directive that frames the model as a social-media persona expert + the per-locale postfix. |
| Requirements | 1.1, 1.2, 1.3, 1.4, 4.1, 4.5 |

Responsibilities & Constraints

  • Construct and return a single string of the form f"{base_prompt}\n\n{get_language_instruction()}".
  • Preserve the signature _get_system_prompt(self, is_individual: bool) -> str.
  • The English base_prompt MUST convey: (a) expert role in social-media persona generation; (b) intent to produce detailed, realistic personas for opinion-simulation, faithful to existing reality; (c) the JSON-output requirement and the no-unescaped-newline rule.
  • The English base_prompt MUST NOT contain any CJK codepoint.
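The required shape can be illustrated with a short sketch. The English wording of `base_prompt` below is only a placeholder that covers points (a) to (c); the final text is decided during implementation. `get_language_instruction` is a stand-in that takes the locale as an argument for testability, whereas the real helper takes no argument and reads the active locale.

```python
def get_language_instruction(locale: str) -> str:
    # Stand-in for the helper in backend/app/utils/locale.py.
    postfixes = {
        "en": "Please respond in English.",
        "zh": "请使用中文回答。",
    }
    return postfixes[locale]


def build_system_prompt(locale: str) -> str:
    # Placeholder wording: expert role, realistic personas for opinion
    # simulation, JSON output with no unescaped newlines.
    base_prompt = (
        "You are an expert in generating social-media user personas. "
        "Produce detailed, realistic personas for opinion simulation, "
        "staying faithful to existing reality. "
        "Return valid JSON; string values must not contain unescaped newlines."
    )
    # Same shape as today: base prompt, blank line, locale postfix.
    return f"{base_prompt}\n\n{get_language_instruction(locale)}"


en_prompt = build_system_prompt("en")
zh_prompt = build_system_prompt("zh")
```

Only the postfix differs between locales, so the base text stays English while the locale instruction steers the output language.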

Dependencies

  • Outbound: get_language_instruction() from backend/app/utils/locale.py (P0, criticality high — the entire locale-steering chain depends on it).

Contracts: Service [x] / API [ ] / Event [ ] / Batch [ ] / State [ ]

Service Interface
def _get_system_prompt(self, is_individual: bool) -> str:
    """Return the LLM system message: English base + locale postfix."""
    ...
  • Preconditions: none.
  • Postconditions: returns a non-empty string ending with the locale postfix produced by get_language_instruction().
  • Invariants: contains zero CJK codepoints.

Implementation Notes

  • Integration: called only from _call_llm_with_retry (line ~523) with is_individual decided upstream. The is_individual flag is reserved for future divergence between system prompts; the current implementation does not branch on it, and this design preserves that.
  • Validation: a CJK regex audit on the method body after the edit must match zero codepoints.
  • Risks: dropping one of the three role/intent pieces (expert framing, JSON output requirement, no-newline rule). Implementation task lists all three explicitly.
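The CJK regex audit mentioned under Validation could look like the sketch below. The character ranges are an assumption (CJK Unified Ideographs with Extension A, CJK punctuation, and fullwidth forms); the range set can be widened if the audit should cover more blocks. `find_cjk` is a hypothetical helper name.

```python
import re

# Assumed range set: CJK punctuation, Extension A, Unified Ideographs,
# and halfwidth/fullwidth forms.
CJK_PATTERN = re.compile(r"[\u3000-\u303f\u3400-\u4dbf\u4e00-\u9fff\uff00-\uffef]")


def find_cjk(text: str) -> list[str]:
    """Return every CJK character found in text; an empty list passes the audit."""
    return CJK_PATTERN.findall(text)


clean = find_cjk("You are an expert in social-media persona generation.")
dirty = find_cjk("entity_name是一个entity_type。")
```

The audit is meant for the prompt literals only; the Chinese fragments that stay in the resilience helpers and the rule-based fallback are out of boundary and would still match.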

_build_individual_persona_prompt

| Field | Detail |
| --- | --- |
| Intent | Build the user-message string for an individual entity in English. Preserve every {variable} interpolation, the inline {get_language_instruction()} call, every JSON-output key, and every locale-independent constraint. |
| Requirements | 2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 2.7, 2.8, 4.1, 4.5 |

Responsibilities & Constraints

  • Preserve signature _build_individual_persona_prompt(self, entity_name: str, entity_type: str, entity_summary: str, entity_attributes: Dict[str, Any], context: str) -> str.
  • Preserve attrs_str = json.dumps(entity_attributes, ensure_ascii=False) if entity_attributes else <fallback> with <fallback> translated to English ("None").
  • Preserve context_str = context[:3000] if context else <fallback> with <fallback> translated to English ("No additional context").
  • Translate the f-string body to English with these structural sections (mirror the original Chinese intent):
    1. Lead sentence — instruct the model to generate a detailed social-media persona for the entity, faithful to existing reality.
    2. Entity context block — labelled lines for entity_name, entity_type, entity_summary, entity_attributes (English labels; values via {...} interpolation).
    3. Context information block: a "Context information:" heading followed by {context_str}.
    4. JSON-fields enumeration: "Generate JSON with the following fields:" followed by the eight numbered items (bio, persona, age, gender, mbti, country, profession, interested_topics) with English descriptions matching Requirement 2.4.
    5. Trailing rules block: "Important:" followed by:
      • All field values must be strings or numbers; do not use newlines.
      • persona must be a single coherent block of text.
      • {get_language_instruction()} (gender field MUST use English values: "male" or "female")
      • Content must remain consistent with the entity information.
      • age must be a valid integer; gender must be exactly "male" or "female".
  • Preserve every {variable} interpolation present in the original by name: {entity_name}, {entity_type}, {entity_summary}, {attrs_str}, {context_str}, {get_language_instruction()}.
  • The translated body MUST NOT contain any CJK codepoint.
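The two fallback expressions can be exercised in isolation. The sketch below copies the expressions from the constraints above with the English defaults filled in; `render_inputs` is a hypothetical wrapper used only to make them testable.

```python
import json
from typing import Any, Dict, Tuple


def render_inputs(entity_attributes: Dict[str, Any], context: str) -> Tuple[str, str]:
    # ensure_ascii=False keeps non-ASCII attribute values readable
    # instead of turning them into \uXXXX escapes.
    attrs_str = (
        json.dumps(entity_attributes, ensure_ascii=False)
        if entity_attributes
        else "None"
    )
    # Context is capped at 3000 characters, as in the existing builder.
    context_str = context[:3000] if context else "No additional context"
    return attrs_str, context_str


empty = render_inputs({}, "")
filled = render_inputs({"role": "学生"}, "x" * 5000)
```

Interpolated values are runtime data and may still contain Chinese text, so the zero-CJK rule is read here as applying to the literal template text, in line with the Goals section.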

Dependencies

  • Outbound: json.dumps(..., ensure_ascii=False) (P1, formatting the attributes dict) — unchanged.
  • Outbound: get_language_instruction() (P0) — interpolated inline.

Contracts: Service [x] / API [ ] / Event [ ] / Batch [ ] / State [ ]

Service Interface
def _build_individual_persona_prompt(
    self,
    entity_name: str,
    entity_type: str,
    entity_summary: str,
    entity_attributes: Dict[str, Any],
    context: str,
) -> str:
    """Return the LLM user message for an individual-entity persona."""
    ...
  • Preconditions: entity_name, entity_type, entity_summary are strings (may be empty); entity_attributes is a dict (may be empty); context is a string (may be empty).
  • Postconditions: returns a non-empty English string with all six interpolations resolved.
  • Invariants: contains zero CJK codepoints; preserves every {variable} interpolation by name.

Implementation Notes

  • Integration: called from _call_llm_with_retry (line ~506) when is_individual is true.
  • Validation: post-edit CJK regex audit; interpolation-set audit (verify the multiset of {...} tokens equals the pre-change set); smoke import + pytest backend/scripts/test_profile_format.py.
  • Risks: dropping the gender enum lock when translating; dropping the inline {get_language_instruction()} call. The implementation task list calls these out as discrete checks.
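The interpolation-set audit can be approximated with a regex over the template source text. This is a rough sketch: it assumes placeholders contain no nested braces, and the before/after snippets are shortened, hypothetical examples rather than the real template bodies.

```python
import re
from collections import Counter


def interpolation_tokens(template_source: str) -> Counter:
    """Count {...} placeholders in f-string source, ignoring escaped {{ and }}."""
    unescaped = template_source.replace("{{", "").replace("}}", "")
    return Counter(re.findall(r"\{([^{}]+)\}", unescaped))


# Hypothetical snippets for illustration only.
before = "实体名称: {entity_name}\n实体类型: {entity_type}\n{get_language_instruction()}"
after = "Entity name: {entity_name}\nEntity type: {entity_type}\n{get_language_instruction()}"

same = interpolation_tokens(before) == interpolation_tokens(after)
# Counter subtraction shows which placeholders a bad translation dropped.
dropped = interpolation_tokens(before) - interpolation_tokens("Entity name: {entity_name}")
```

Comparing Counters rather than sets also catches a placeholder that appears twice in the original but only once after translation.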

_build_group_persona_prompt

| Field | Detail |
| --- | --- |
| Intent | Build the user-message string for a group/institution entity in English. Preserve every {variable} interpolation, the inline {get_language_instruction()} call, every JSON-output key, and every locale-independent constraint (notably age == 30 and gender == "other"). |
| Requirements | 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7, 3.8, 4.1, 4.5 |

Responsibilities & Constraints

  • Preserve signature _build_group_persona_prompt(self, entity_name: str, entity_type: str, entity_summary: str, entity_attributes: Dict[str, Any], context: str) -> str.
  • Preserve the attrs_str and context_str fallback handling with English defaults ("None", "No additional context"), identical to the individual builder.
  • Translate the f-string body to English with these structural sections (mirror the original Chinese intent for institutions):
    1. Lead sentence — instruct the model to generate a detailed social-media account profile for the institution/group, faithful to existing reality.
    2. Entity context block — labelled lines for entity_name, entity_type, entity_summary, entity_attributes.
    3. Context information block: a "Context information:" heading followed by {context_str}.
    4. JSON-fields enumeration: "Generate JSON with the following fields:" followed by the eight numbered items as defined in Requirement 3.4: bio (~200 chars, official voice), persona (~2000 chars, single coherent text covering institutional basics, account positioning, voice, publishing pattern, stance, special notes, institutional memory), age (= integer 30, institutional virtual age), gender (= literal "other"), mbti (e.g. ISTJ for strict/conservative), country (country name string), profession (institutional function), interested_topics (array).
    5. Trailing rules block: "Important:" followed by:
      • All field values must be strings or numbers; null is not allowed.
      • persona must be a single coherent block of text without newlines.
      • {get_language_instruction()} (gender field MUST use English value "other")
      • age must be the integer 30; gender must be the string "other".
      • Account voice must match its identity positioning.
  • Preserve every {variable} interpolation present in the original.
  • The translated body MUST NOT contain any CJK codepoint.

Dependencies

  • Outbound: same as individual builder.

Contracts: Service [x] / API [ ] / Event [ ] / Batch [ ] / State [ ]

Service Interface
def _build_group_persona_prompt(
    self,
    entity_name: str,
    entity_type: str,
    entity_summary: str,
    entity_attributes: Dict[str, Any],
    context: str,
) -> str:
    """Return the LLM user message for a group/institution persona."""
    ...
  • Preconditions / Postconditions / Invariants: same shape as the individual builder.

Implementation Notes

  • Integration: called from _call_llm_with_retry (line ~510) when is_individual is false.
  • Validation: same checks as the individual builder, plus an explicit audit that the institutional sentinels (age == 30, gender == "other") appear in English in the trailing-rules block.
  • Risks: same as the individual builder; additionally, the country language hint ("使用中文,如\"中国\"", roughly "use Chinese, e.g. \"中国\"") is intentionally dropped during translation. The validation task verifies that under Accept-Language: en a sample run produces an English country name.

Data Models

No data-model changes. The persona JSON schema, the OasisAgentProfile dataclass, the Reddit/Twitter serializers, and the OASIS subprocess profile-format expectations are all preserved verbatim.

Error Handling

Error Strategy

No new error paths. The existing flow is preserved:

  • json.JSONDecodeError → _try_fix_json → _fix_truncated_json → partial-extract via regex → _generate_profile_rule_based.
  • LLM call failure → retry with temperature decay (0.7 - attempt * 0.1) up to max_attempts = 3.
  • Terminal failure → rule-based fallback persona.
  • Per-entity worker exception → fallback OasisAgentProfile produced inside generate_single_profile at line ~932.
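The retry schedule and terminal fallback can be sketched as below. This is a simplification: it collapses the JSON-repair helpers into a plain retry, and `call_llm` and `rule_based` are hypothetical callables standing in for the LLM transport and _generate_profile_rule_based.

```python
import json
from typing import Any, Callable, Dict, List


def generate_with_retry(
    call_llm: Callable[[float], str],           # takes a temperature, returns raw text
    rule_based: Callable[[], Dict[str, Any]],   # terminal fallback
    max_attempts: int = 3,
) -> Dict[str, Any]:
    for attempt in range(max_attempts):
        # Temperature decays per attempt: roughly 0.7, 0.6, 0.5.
        temperature = 0.7 - attempt * 0.1
        try:
            return json.loads(call_llm(temperature))
        except Exception:
            # The real flow attempts JSON repair before moving on.
            continue
    return rule_based()


seen: List[float] = []


def flaky(temperature: float) -> str:
    # Fails twice with unparseable text, then returns valid JSON.
    seen.append(temperature)
    return '{"bio": "ok"}' if len(seen) == 3 else "not json"


result = generate_with_retry(flaky, lambda: {"bio": "fallback"})
```

Nothing in this loop inspects the prompt language, which is the basis for the claim that translation adds no failure mode.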

The translated prompts do not introduce new failure modes. Translating prompt language has no semantic effect on JSON parsing or on the response_format={"type": "json_object"} constraint.

Error Categories and Responses

  • User errors: not applicable (this is an internal pipeline).
  • System errors: LLM transport errors are retried; logger emits t("log.profile_generator.m011") etc. Logger keys already exist in locales/{en,zh}.json.
  • Business-logic errors: gender not in the English enum, age not an integer — the prompt explicitly mandates them; the validator inside _try_fix_json does not enforce these but the OASIS subprocess does. No change in either direction.
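For manual spot-checks of sample outputs, the two business-logic constraints can be expressed as a small check. This helper is purely illustrative and is not part of the change; `contract_violations` is a hypothetical name, and the design adds no validation to the target file.

```python
from typing import Any, Dict, List

ALLOWED_GENDERS = {"male", "female", "other"}


def contract_violations(profile: Dict[str, Any]) -> List[str]:
    """Return the names of fields that break the locale-independent contract."""
    problems: List[str] = []
    if profile.get("gender") not in ALLOWED_GENDERS:
        problems.append("gender")
    age = profile.get("age")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not isinstance(age, int) or isinstance(age, bool):
        problems.append("age")
    return problems


ok = contract_violations({"gender": "other", "age": 30})
bad = contract_violations({"gender": "女", "age": "thirty"})
```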

Monitoring

Existing logger calls are unchanged. Logger messages are already i18n-keyed via t("log.profile_generator.*").

Testing Strategy

Unit Tests

  • (Existing) backend/scripts/test_profile_format.py::test_profile_formats — must continue to pass without modification.
  • (Manual) Smoke import: cd backend && uv run python -c "from app.services.oasis_profile_generator import OasisProfileGenerator" — confirms no syntax errors after editing f-strings.

Integration Tests

  • (Manual) Run the prompt builders directly under each locale:
    • set_locale("en") → OasisProfileGenerator()._build_individual_persona_prompt("Alice", "Student", "summary", {"k": "v"}, "ctx") → assert no CJK codepoints in the output and assert that the English locale postfix from get_language_instruction() ("Please respond in English.") appears.
    • set_locale("zh") → same call → assert the locale postfix is "请使用中文回答。".
  • These do not require an LLM call; they only verify the rendered prompt string.

E2E Tests

  • (Manual, optional) Run npm run dev and trigger Step 2 profile generation from the UI under the English locale on a small entity set; spot-check that bio and persona prose are in English. Skip when no live LLM key is available (for example in CI); sibling specs #2/#4/#5 used the same manual E2E approach.

Performance / Load

Not applicable. Prompt translation has no measurable performance impact.

Optional Sections

Security Considerations

No security implications. No new external surfaces; no new data retention; no change to authentication or authorization.

Migration Strategy

No migration required. The change is forward-compatible: a deployment that picks up the translated prompts continues to serve users on the zh locale via the unchanged get_language_instruction() postfix mechanism.

Supporting References

  • gap-analysis.md — option evaluation and effort/risk sizing.
  • research.md — discovery findings, design decisions (in particular the "drop the country language hint" decision), and risk register.
  • requirements.md — EARS requirements with numeric IDs.
  • Sibling specs i18n-ontology-generator-prompts, i18n-simulation-config-generator-prompts, i18n-report-agent-prompts — same translation pattern, already merged.