# Research & Design Decisions — i18n-oasis-profile-generator-prompts

## Summary

- **Feature**: `i18n-oasis-profile-generator-prompts`
- **Discovery Scope**: **Extension** (single-file translation in an existing
  brownfield service; sibling pattern already merged in #2, #4, #5)
- **Key Findings**:
  - The existing `get_language_instruction()` postfix mechanism (defined in
    `backend/app/utils/locale.py`) is the project-canonical way to steer LLM
    output language. Translating the base prompt does not interfere with it
    and is the same approach taken in already-merged sibling specs.
  - The only Chinese surfaces inside the prompt-rendering path are
    `_get_system_prompt`, `_build_individual_persona_prompt`,
    `_build_group_persona_prompt`, and the four `attrs_str`/`context_str`
    fallback literals (`"无"`, `"无额外上下文"`). All other Chinese in the
    file is in logger keys (already handled by #6), docstrings and
    comments (out of scope, #7), or rule-based fallback data (out of scope).
  - `backend/scripts/test_profile_format.py` does not exercise prompts; it
    only constructs `OasisAgentProfile` and round-trips through
    `_save_twitter_csv` / `_save_reddit_json`. A pure-translation diff
    cannot break it.

## Research Log

### Locale steering mechanism

- **Context**: Confirm that translating the base prompt does not regress
  Chinese output under `Accept-Language: zh`.
- **Sources Consulted**:
  - `backend/app/utils/locale.py` (lines 50–96).
  - `locales/languages.json` (entries for `en` and `zh` with
    `llmInstruction` field).
  - Sibling spec `i18n-ontology-generator-prompts/design.md` and the
    merged commits referenced by it.
- **Findings**:
  - `get_language_instruction()` returns `Please respond in English.`
    for locale `en`, `请使用中文回答。` for locale `zh`.
  - The function is called as an inline f-string interpolation in the
    individual-persona and group-persona prompt bodies, and explicitly
    appended in `_get_system_prompt`. All three sites must be preserved
    byte-for-byte.
  - The thread-local locale is captured in
    `generate_profiles_for_entities` (line ~910) and restored inside the
    worker via `set_locale(current_locale)` (line ~914). This plumbing is
    untouched by the change.
- **Implications**:
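  - Design lock-in: the `get_language_instruction()` call must remain at
    all three sites: inline as `{get_language_instruction()}` in the two
    persona builders and as the explicit append in `_get_system_prompt`.
    Removing or renaming it would silently regress non-English locales,
    because the base prompt is now English and nothing else would
    request another language.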
  - The Chinese hint `country: 国家（使用中文，如"中国"）` in the original
    prompt overrides the locale postfix and forces Chinese output for one
    field. The English translation drops that hint so the locale postfix
    decides the country language. The rule-based fallback (out of scope)
    has its own (Chinese) defaults and is not affected.
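
A minimal sketch of the steering path described above, not the project's code. `set_locale`, `get_locale`, and `get_language_instruction` are stand-ins for the functions in `backend/app/utils/locale.py` (which read `llmInstruction` from `locales/languages.json`); `build_prompt`, `generate_all`, and the `zh` default are hypothetical and exist only for illustration.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import List

# Stand-in for the llmInstruction entries in locales/languages.json.
_LLM_INSTRUCTIONS = {
    "en": "Please respond in English.",
    "zh": "请使用中文回答。",
}
_state = threading.local()


def set_locale(locale: str) -> None:
    _state.locale = locale


def get_locale() -> str:
    # Assumption: the default locale here is chosen for the sketch only.
    return getattr(_state, "locale", "zh")


def get_language_instruction() -> str:
    return _LLM_INSTRUCTIONS[get_locale()]


def build_prompt(entity_name: str) -> str:
    # English base prompt; the postfix alone decides the output language.
    return f"Generate a persona for {entity_name}.\n\n{get_language_instruction()}"


def generate_all(entity_names: List[str]) -> List[str]:
    current_locale = get_locale()  # captured in the calling thread

    def worker(name: str) -> str:
        # Thread-local state does not cross threads, so restore it here.
        set_locale(current_locale)
        return build_prompt(name)

    with ThreadPoolExecutor(max_workers=2) as pool:
        return list(pool.map(worker, entity_names))
```

Without the `set_locale(current_locale)` call, each worker thread would fall back to the sketch's default locale, which is why the capture-and-restore plumbing has to stay untouched.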

### Test contract

- **Context**: Verify that `backend/scripts/test_profile_format.py`
  remains green after a prompt-only translation.
- **Sources Consulted**: `backend/scripts/test_profile_format.py`,
  `oasis_profile_generator.py:_save_twitter_csv`,
  `oasis_profile_generator.py:_save_reddit_json`,
  `oasis_profile_generator.py:to_reddit_format`,
  `oasis_profile_generator.py:to_twitter_format`.
- **Findings**:
  - The pytest function `test_profile_formats` constructs
    `OasisAgentProfile` instances directly without invoking the LLM.
  - It calls `_save_twitter_csv` and `_save_reddit_json` to verify CSV
    and JSON shape. Required CSV header: `user_id, user_name, name, bio,
    friend_count, follower_count, statuses_count, created_at`. Required
    JSON keys: `realname, username, bio, persona`.
- **Implications**:
  - Translating prompts cannot regress this test. The validation
    requirement (Requirement 7) is satisfied automatically as long as
    serializer code is not edited.
  - No new tests are required for this change.
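
The shape check this test relies on can be sketched as follows. The column and key names come from the findings above; the helper names are hypothetical, and treating the Reddit JSON file as a list of objects is an assumption, not something confirmed from `test_profile_format.py`.

```python
import csv
import io
import json

REQUIRED_CSV_COLUMNS = [
    "user_id", "user_name", "name", "bio",
    "friend_count", "follower_count", "statuses_count", "created_at",
]
REQUIRED_JSON_KEYS = {"realname", "username", "bio", "persona"}


def missing_csv_columns(csv_text: str) -> list:
    """Return required columns absent from the CSV header, in contract order."""
    header = next(csv.reader(io.StringIO(csv_text)), [])
    present = {col.strip() for col in header}
    return [col for col in REQUIRED_CSV_COLUMNS if col not in present]


def missing_json_keys(json_text: str) -> list:
    """Return required keys missing from at least one record (assumes a list of objects)."""
    missing = set()
    for record in json.loads(json_text):
        missing |= REQUIRED_JSON_KEYS - set(record)
    return sorted(missing)
```

Neither helper looks at prompt text, which mirrors why a prompt-only translation leaves this contract intact.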

### Sibling specs already shipped

- **Context**: Confirm there is an established project pattern this work
  must mirror.
- **Sources Consulted**:
  - `.kiro/specs/i18n-ontology-generator-prompts/{design,tasks,requirements}.md`
  - `.kiro/specs/i18n-report-agent-prompts/`
  - `.kiro/specs/i18n-simulation-config-generator-prompts/`
  - Recent merged commits referencing #2, #4, #5.
- **Findings**:
  - All three siblings used a single-file in-place translation diff.
  - All three preserved every `get_language_instruction()` call site.
  - All three left logger calls and docstrings to companion issues
    (#6 / #7).
  - None externalized prompts to `/locales/*.json`.
- **Implications**:
  - The same approach is correct here. Reviewer expectations are set by
    the sibling diffs.

### OASIS profile schema

- **Context**: Verify that translated prompts continue to satisfy the
  OASIS subprocess's expected schema (especially `gender` enum and
  `age` integer).
- **Sources Consulted**: `OasisAgentProfile` dataclass,
  `to_reddit_format`, `to_twitter_format`, sibling `_generate_profile_rule_based`.
- **Findings**:
  - OASIS-required fields are produced by serializers, not by the
    prompt: `user_id`, `username`, `name`, `bio`, `karma`/`friend_count`/`follower_count`/`statuses_count`, `created_at`.
  - The prompt-defined fields land in optional positions: `age`,
    `gender`, `mbti`, `country`, `profession`, `interested_topics`.
  - The `gender` enum constraint (`"male"`/`"female"` for individuals,
    `"other"` for groups) is locale-independent and must remain in
    English text inside the translated prompt.
- **Implications**:
  - The English prompt must explicitly call out `gender ∈ {male, female}`
    (individual) and `gender == "other"` (group), independent of the
    `get_language_instruction()` postfix.
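
As an illustration of the locale-independent constraints, a validation-phase check might look like the sketch below. `check_profile_fields` is a hypothetical helper, and treating `age` as optional but integer-typed when present is an assumption based on the "optional positions" finding above.

```python
INDIVIDUAL_GENDERS = {"male", "female"}
GROUP_GENDER = "other"


def check_profile_fields(fields: dict, is_group: bool) -> list:
    """Return a list of human-readable problems; an empty list means the fields pass."""
    problems = []

    gender = fields.get("gender")
    if is_group:
        if gender != GROUP_GENDER:
            problems.append(f"group gender must be 'other', got {gender!r}")
    elif gender not in INDIVIDUAL_GENDERS:
        problems.append(f"individual gender must be 'male' or 'female', got {gender!r}")

    age = fields.get("age")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if age is not None and (not isinstance(age, int) or isinstance(age, bool)):
        problems.append(f"age must be an integer, got {age!r}")

    return problems
```

Running this against LLM output under both `en` and `zh` would surface a translated enum value such as `"男"` immediately.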

## Architecture Pattern Evaluation

| Option | Description | Strengths | Risks / Limitations | Notes |
|--------|-------------|-----------|---------------------|-------|
| **A — In-place builder edit** | Translate three method bodies + four fallback literals directly | Smallest diff; matches sibling pattern; zero API change | None of note | **Selected** |
| B — Module-level constants | Hoist prompts to `INDIVIDUAL_PERSONA_PROMPT_TEMPLATE` etc. | Easier `git grep` | Larger diff; the inline `{get_language_instruction()}` call would need to become a `.format()` kwarg, which is a behavioural change beyond translation | Diverges from #4 / #5 |
| C — Externalize to `locales/*.json` | Move every prompt sentence into `t(...)` keys | Most i18n-pure | Three-file diff; diverges from project rationale (prompts use postfix mechanism, not key files) | Rejected |
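
To make the Option B limitation concrete, the sketch below contrasts an inline f-string with a hoisted template. All names are hypothetical except `get_language_instruction`, which is stubbed here so the example is self-contained.

```python
def get_language_instruction() -> str:
    # Stub for the real function in backend/app/utils/locale.py.
    return "Please respond in English."


# Option A style: the f-string evaluates the call each time the builder runs.
def build_inline(entity_name: str) -> str:
    return f"Persona for {entity_name}.\n{get_language_instruction()}"


# Option B style: a plain string constant cannot call functions, so the
# instruction has to be passed in as a .format() keyword argument.
TEMPLATE = "Persona for {entity_name}.\n{language_instruction}"


def build_from_constant(entity_name: str) -> str:
    return TEMPLATE.format(
        entity_name=entity_name,
        language_instruction=get_language_instruction(),
    )
```

Both builders return the same text, but Option B moves the call site out of the prompt body and introduces a new failure mode: omitting a keyword argument raises `KeyError` from `.format()`.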

## Design Decisions

### Decision: In-place edit of the three prompt builders (Option A)

- **Context**: Three methods build prompt strings; one of them is a
  one-line system prompt, the other two are large f-string templates
  with embedded `{variable}` interpolations and an inline
  `{get_language_instruction()}` call.
- **Alternatives Considered**:
  1. Option B — module-level constants.
  2. Option C — externalize to `/locales/*.json` keys.
- **Selected Approach**: Translate each method body in place. Replace
  the four `"无"` / `"无额外上下文"` fallbacks with English equivalents
  (`"None"` and `"No additional context"`). Preserve all `{...}`
  interpolations and the inline `{get_language_instruction()}` call.
- **Rationale**: Matches merged sibling specs verbatim. Smallest review
  surface. Zero API change. Out-of-scope surfaces (logger, docstrings,
  rule-based fallback) cleanly avoided.
- **Trade-offs**: Leaves the file mixed-language in non-prompt parts
  (docstrings, rule fallback) until #7 lands. Acceptable per scope
  split.
- **Follow-up**: During implementation, run a regex audit for any
  Chinese codepoints inside the three method bodies after the edit and
  confirm the diff stays within
  `backend/app/services/oasis_profile_generator.py`.
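
One way to run that audit is sketched below. The character ranges are an assumption (common CJK ideographs, punctuation, and full-width forms) and may differ from the project's own guard regex; `cjk_hits` is a hypothetical helper.

```python
import ast
import re

# Assumed ranges: CJK punctuation, Extension A, unified ideographs, full-width forms.
CJK = re.compile(r"[\u3000-\u303f\u3400-\u4dbf\u4e00-\u9fff\uff00-\uffef]")

TARGETS = {
    "_get_system_prompt",
    "_build_individual_persona_prompt",
    "_build_group_persona_prompt",
}


def cjk_hits(source: str, targets=TARGETS) -> list:
    """Return (function_name, line_number) for each line with CJK inside a target function."""
    lines = source.splitlines()
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef) and node.name in targets:
            for lineno in range(node.lineno, node.end_lineno + 1):
                if CJK.search(lines[lineno - 1]):
                    hits.append((node.name, lineno))
    return sorted(hits, key=lambda hit: hit[1])
```

Because it scans whole function spans, this also flags docstrings and comments inside the three builders; those hits would need to be triaged against the #7 scope split rather than treated as failures.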

### Decision: Drop the "use Chinese country names" hint

- **Context**: The current prompt contains the same line,
  `country: 国家（使用中文，如"中国"）`, at both line 704 and line 753.
  This forces Chinese for the `country` field even under
  `Accept-Language: en`.
- **Alternatives Considered**:
  1. Translate to English literally:
     `country: country (use English, e.g. "China")`.
  2. Drop the language hint entirely:
     `country: country name string`.
- **Selected Approach**: Drop the language hint. Let
  `get_language_instruction()` steer the country language alongside
  every other free-text field.
- **Rationale**: Hard-coding a language in the prompt defeats the
  locale-steering mechanism. The rule-based fallback (out of scope)
  carries its own Chinese defaults; under the LLM path, locale should
  decide.
- **Trade-offs**: Under `Accept-Language: zh`, the LLM may produce a
  Chinese country name (e.g. `中国`); this is the desired behaviour.
  Under `Accept-Language: en`, the LLM is expected to produce English
  (`China`), matching `COUNTRIES = ["China", "US", ...]` already in the
  file.
- **Follow-up**: Verify in the validation phase that a sample run under
  locale `en` produces an English country name.

### Decision: Keep `gender` enum constraint in English inside the prompt

- **Context**: `gender` must be one of `"male"`/`"female"`/`"other"`
  regardless of locale, because OASIS consumers and the
  `_generate_profile_rule_based` fallback assume English values.
- **Alternatives Considered**: None — the constraint is a contract.
- **Selected Approach**: The translated prompt explicitly states the
  enum in English, even when the locale postfix asks for Chinese
  output: `gender MUST be one of "male" or "female" (English literal)`
  for individuals, with `"other"` pinned the same way for groups.
- **Rationale**: Same as the existing Chinese prompt (which already
  states `必须是英文: "male" 或 "female"`). The translation preserves
  the same lock-in.
- **Trade-offs**: None.
- **Follow-up**: Validation phase will check that under both locales
  the produced `gender` is one of the three English literals.

## Risks & Mitigations

- **Risk**: Mistranslation drops a locale-independent constraint
  (e.g. `gender` enum, `age` integer rule, `persona` no-newline rule).
  - **Mitigation**: The implementation task list will enumerate every
    constraint inline so reviewers can check by diff.
- **Risk**: A variable-name typo inside an f-string raises a `NameError`
  the first time the builder runs.
  - **Mitigation**: Implementation task verifies that the set of
    `{variable}` interpolations in each translated block matches the
    pre-change set 1:1; a `python -c "import ..."` smoke import and a
    `pytest backend/scripts/test_profile_format.py` run are mandatory.
- **Risk**: Accidentally leaving a CJK codepoint inside the three
  builders.
  - **Mitigation**: Final implementation step runs the project's
    repo-level CJK guard regex (added by #26) constrained to the three
    builders' line ranges.
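
The 1:1 interpolation check in the second mitigation can be sketched with the standard `ast` module, which parses f-strings without executing them. `fstring_expressions` is a hypothetical helper, and `ast.unparse` requires Python 3.9 or later.

```python
import ast


def fstring_expressions(source: str) -> set:
    """Return the source text of every {...} expression in f-strings found in `source`."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        # Each {...} placeholder in an f-string is a FormattedValue node.
        if isinstance(node, ast.FormattedValue):
            found.add(ast.unparse(node.value))
    return found


def interpolation_diff(before_source: str, after_source: str) -> tuple:
    """Return (dropped, added) expression sets between two versions of a builder."""
    before = fstring_expressions(before_source)
    after = fstring_expressions(after_source)
    return before - after, after - before
```

An empty `(set(), set())` result means the translated block interpolates exactly the same expressions as the original, so a typo such as `{entity_nmae}` shows up as one dropped and one added name before anything is executed.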

## References

- `backend/app/services/oasis_profile_generator.py` — target file.
- `backend/app/utils/locale.py` — locale infrastructure.
- `locales/languages.json`, `locales/en.json`, `locales/zh.json` —
  locale registries.
- `.kiro/specs/i18n-ontology-generator-prompts/` — sibling spec #2.
- `.kiro/specs/i18n-simulation-config-generator-prompts/` — sibling
  spec #4.
- `.kiro/specs/i18n-report-agent-prompts/` — sibling spec #5.
- GitHub issue
  [#3](https://github.com/salestech-group/MiroFish/issues/3).