# Research & Design Decisions — i18n-oasis-profile-generator-prompts

## Summary

- **Feature**: `i18n-oasis-profile-generator-prompts`
- **Discovery Scope**: **Extension** (single-file translation in an existing
  brownfield service; sibling pattern already merged in #2, #4, #5)
- **Key Findings**:
  - The existing `get_language_instruction()` postfix mechanism (defined in
    `backend/app/utils/locale.py`) is the project-canonical way to steer LLM
    output language. Translating the base prompt does not interfere with it
    and is the same approach taken in the already-merged sibling specs.
  - The only Chinese surfaces inside the prompt-rendering path are
    `_get_system_prompt`, `_build_individual_persona_prompt`,
    `_build_group_persona_prompt`, and the four `attrs_str`/`context_str`
    fallback literals (`"无"`, `"无额外上下文"`). All other Chinese in the
    file is logger keys (already handled by #6), docstrings/comments
    (out of scope, #7), or rule-based fallback data (out of scope).
  - `backend/scripts/test_profile_format.py` does not exercise prompts; it
    only constructs `OasisAgentProfile` and round-trips through
    `_save_twitter_csv` / `_save_reddit_json`. A pure-translation diff
    cannot break it.

## Research Log

### Locale steering mechanism

- **Context**: Confirm that translating the base prompt does not regress
  Chinese output under `Accept-Language: zh`.
- **Sources Consulted**:
  - `backend/app/utils/locale.py` (lines 50-96).
  - `locales/languages.json` (entries for `en` and `zh` with the
    `llmInstruction` field).
  - Sibling spec `i18n-ontology-generator-prompts/design.md` and the
    merged commits referenced by it.
- **Findings**:
  - `get_language_instruction()` returns `Please respond in English.`
    for locale `en` and `请使用中文回答。` for locale `zh`.
  - The function is called as an inline f-string interpolation in the
    individual-persona and group-persona prompt bodies, and explicitly
    appended in `_get_system_prompt`. All three sites must be preserved
    byte-for-byte.
  - The thread-local locale is captured in
    `generate_profiles_for_entities` (line ~910) and restored inside the
    worker via `set_locale(current_locale)` (line ~914). This plumbing is
    untouched by the change.
- **Implications**:
  - Design lock-in: the `get_language_instruction()` call must remain in
    each of the three builders (inline in the two persona builders,
    appended in `_get_system_prompt`). Removing or renaming it would
    silently regress non-English locales.
  - The Chinese hint `country: 国家(使用中文,如"中国")` (roughly
    "country: use Chinese, e.g. 中国") in the original prompt overrides
    the locale postfix and forces Chinese output for one field. The
    English translation drops that hint so the locale postfix decides the
    country language. The rule-based fallback (out of scope) has its own
    (Chinese) defaults and is not affected.
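
The capture-and-restore plumbing and the inline postfix described above can be sketched as follows. This is a minimal illustration, not the project's code: the helper bodies, the `"en"` default, `build_prompt`, and `generate` are hypothetical stand-ins for what lives in `backend/app/utils/locale.py` and the profile generator, and only the two instruction strings come from the findings above.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for backend/app/utils/locale.py. The real module
# is described as reading these strings from locales/languages.json.
_state = threading.local()
_INSTRUCTIONS = {
    "en": "Please respond in English.",
    "zh": "请使用中文回答。",
}


def set_locale(locale: str) -> None:
    _state.locale = locale


def get_locale() -> str:
    # Falling back to "en" is an assumption made for this sketch.
    return getattr(_state, "locale", "en")


def get_language_instruction() -> str:
    return _INSTRUCTIONS[get_locale()]


def build_prompt(entity_name: str) -> str:
    # English base prompt; the inline call alone decides the output language.
    return (
        f"Generate a persona for {entity_name}.\n"
        f"{get_language_instruction()}"
    )


def generate(entity_names: list[str]) -> list[str]:
    current_locale = get_locale()  # captured on the calling thread

    def worker(name: str) -> str:
        # Thread-local state is not inherited, so restore it in the worker.
        set_locale(current_locale)
        return build_prompt(name)

    with ThreadPoolExecutor(max_workers=2) as pool:
        return list(pool.map(worker, entity_names))


set_locale("zh")
prompts = generate(["Alice", "Bob"])
```

Because the base prompt carries no language of its own, the same English text serves both locales; only the final line changes with the request locale.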

### Test contract

- **Context**: Verify that `backend/scripts/test_profile_format.py`
  remains green after a prompt-only translation.
- **Sources Consulted**: `backend/scripts/test_profile_format.py`,
  `oasis_profile_generator.py:_save_twitter_csv`,
  `oasis_profile_generator.py:_save_reddit_json`,
  `oasis_profile_generator.py:to_reddit_format`,
  `oasis_profile_generator.py:to_twitter_format`.
- **Findings**:
  - The pytest function `test_profile_formats` constructs
    `OasisAgentProfile` instances directly without invoking the LLM.
  - It calls `_save_twitter_csv` and `_save_reddit_json` to verify CSV
    and JSON shape. Required CSV header: `user_id, user_name, name, bio,
    friend_count, follower_count, statuses_count, created_at`. Required
    JSON keys: `realname, username, bio, persona`.
- **Implications**:
  - Translating prompts cannot regress this test. The validation
    requirement (Requirement 7) is satisfied automatically as long as
    the serializer code is not edited.
  - No new tests are required for this change.
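
To make the contract concrete, the shape checks listed in the findings can be approximated with the standard library alone. This is a sketch under assumptions: `csv_header` and `missing_reddit_keys` are hypothetical helpers, the sample rows are invented for illustration, and the real test exercises `_save_twitter_csv` and `_save_reddit_json` rather than in-memory strings.

```python
import csv
import io
import json

# Header and keys as listed in the findings above.
TWITTER_HEADER = [
    "user_id", "user_name", "name", "bio",
    "friend_count", "follower_count", "statuses_count", "created_at",
]
REDDIT_KEYS = {"realname", "username", "bio", "persona"}


def csv_header(csv_text: str) -> list[str]:
    """Return the first row of a CSV document."""
    return next(csv.reader(io.StringIO(csv_text)))


def missing_reddit_keys(json_text: str) -> list[list[str]]:
    """For each record, list the required keys it lacks."""
    records = json.loads(json_text)
    return [sorted(REDDIT_KEYS - set(record)) for record in records]


# Invented sample output, standing in for what the serializers write.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(TWITTER_HEADER)
writer.writerow([0, "alice_01", "Alice", "Bio text", 10, 20, 30, "2024-01-01"])
sample_csv = buffer.getvalue()

sample_json = json.dumps(
    [{"realname": "Alice", "username": "alice_01", "bio": "Bio text", "persona": "Persona text"}]
)

header_ok = csv_header(sample_csv) == TWITTER_HEADER
gaps = missing_reddit_keys(sample_json)
```

Nothing in these checks reads prompt text, which is why a translation-only diff leaves them unaffected.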

### Sibling specs already shipped

- **Context**: Confirm there is an established project pattern this work
  must mirror.
- **Sources Consulted**:
  - `.kiro/specs/i18n-ontology-generator-prompts/{design,tasks,requirements}.md`
  - `.kiro/specs/i18n-report-agent-prompts/`
  - `.kiro/specs/i18n-simulation-config-generator-prompts/`
  - Recent merged commits referencing #2, #4, #5.
- **Findings**:
  - All three siblings used a single-file in-place translation diff.
  - All three preserved every `get_language_instruction()` call site.
  - All three left logger calls and docstrings to companion issues
    (#6 / #7).
  - None externalized prompts to `/locales/*.json`.
- **Implications**:
  - The same approach is correct here. Reviewer expectations are set by
    the sibling diffs.

### OASIS profile schema

- **Context**: Verify that translated prompts continue to satisfy the
  OASIS subprocess's expected schema (especially the `gender` enum and
  the `age` integer).
- **Sources Consulted**: `OasisAgentProfile` dataclass,
  `to_reddit_format`, `to_twitter_format`, sibling `_generate_profile_rule_based`.
- **Findings**:
  - OASIS-required fields are produced by serializers, not by the
    prompt: `user_id`, `username`, `name`, `bio`,
    `karma`/`friend_count`/`follower_count`/`statuses_count`, `created_at`.
  - The prompt-defined fields land in optional positions: `age`,
    `gender`, `mbti`, `country`, `profession`, `interested_topics`.
  - The `gender` enum constraint (`"male"`/`"female"` for individuals,
    `"other"` for groups) is locale-independent and must remain in
    English text inside the translated prompt.
- **Implications**:
  - The English prompt must explicitly call out `gender ∈ {male, female}`
    (individual) and `gender == "other"` (group), independent of the
    `get_language_instruction()` postfix.
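
The locale-independent constraints above lend themselves to a small post-generation check. The following is a sketch only: `check_profile_fields` is a hypothetical helper that is not part of the generator, and it assumes the LLM response has already been parsed into a dictionary.

```python
from typing import Any

INDIVIDUAL_GENDERS = {"male", "female"}
GROUP_GENDER = "other"


def check_profile_fields(fields: dict[str, Any], is_group: bool) -> list[str]:
    """Return a list of violated locale-independent constraints."""
    problems: list[str] = []

    gender = fields.get("gender")
    if is_group:
        if gender != GROUP_GENDER:
            problems.append(f"group gender must be 'other', got {gender!r}")
    elif gender not in INDIVIDUAL_GENDERS:
        problems.append(f"individual gender must be male/female, got {gender!r}")

    age = fields.get("age")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if age is not None and (isinstance(age, bool) or not isinstance(age, int)):
        problems.append(f"age must be an integer, got {age!r}")

    return problems


ok = check_profile_fields({"gender": "other", "age": 30}, is_group=True)
bad = check_profile_fields({"gender": "男", "age": "30"}, is_group=False)
```

A check like this would flag a response in which the locale postfix leaked into an enum field, for example a Chinese `gender` value under locale `zh`.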
## Architecture Pattern Evaluation
| Option | Description | Strengths | Risks / Limitations | Notes |
|--------|-------------|-----------|---------------------|-------|
| **A — In-place builder edit** | Translate three method bodies + four fallback literals directly | Smallest diff; matches sibling pattern; zero API change | None of note | **Selected** |
| B — Module-level constants | Hoist prompts to `INDIVIDUAL_PERSONA_PROMPT_TEMPLATE` etc. | Easier `git grep` | Larger diff; the inline `{get_language_instruction()}` call would need to become a `.format()` kwarg, which is a behavioural change beyond translation | Diverges from #4 / #5 |
| C — Externalize to `locales/*.json` | Move every prompt sentence into `t(...)` keys | Most i18n-pure | Three-file diff; diverges from project rationale (prompts use postfix mechanism, not key files) | Rejected |
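
The behavioural difference noted for Option B can be illustrated with a toy prompt. The names `build_inline`, `build_from_template`, and `PROMPT_TEMPLATE` are hypothetical, and `get_language_instruction` is stubbed here rather than imported from the project.

```python
def get_language_instruction() -> str:
    # Stub standing in for the helper in backend/app/utils/locale.py.
    return "Please respond in English."


# Option A: the call sits inside the f-string and is evaluated each time
# the builder runs.
def build_inline(entity_name: str) -> str:
    return f"Describe {entity_name}.\n{get_language_instruction()}"


# Option B: a module-level template is a plain string, so it cannot
# contain a call. The instruction has to arrive as a named field.
PROMPT_TEMPLATE = "Describe {entity_name}.\n{language_instruction}"


def build_from_template(entity_name: str) -> str:
    return PROMPT_TEMPLATE.format(
        entity_name=entity_name,
        language_instruction=get_language_instruction(),
    )


same_output = build_inline("Alice") == build_from_template("Alice")
```

Both variants render the same text, but Option B moves the call site out of the prompt body, so a reviewer can no longer confirm by diff alone that the inline call was preserved.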

## Design Decisions

### Decision: In-place edit of the three prompt builders (Option A)

- **Context**: Three methods build prompt strings; one of them is a
  one-line system prompt, the other two are large f-string templates
  with embedded `{variable}` interpolations and an inline
  `{get_language_instruction()}` call.
- **Alternatives Considered**:
  1. Option B: module-level constants.
  2. Option C: externalize to `/locales/*.json` keys.
- **Selected Approach**: Translate each method body in place. Replace
  the four `"无"` / `"无额外上下文"` fallbacks with English equivalents
  (`"None"` and `"No additional context"`). Preserve all `{...}`
  interpolations and the inline `{get_language_instruction()}` call.
- **Rationale**: Mirrors the merged sibling specs. Smallest review
  surface. Zero API change. Out-of-scope surfaces (logger, docstrings,
  rule-based fallback) are cleanly avoided.
- **Trade-offs**: Leaves the file mixed-language in non-prompt parts
  (docstrings, rule-based fallback) until #7 lands. Acceptable per the
  scope split.
- **Follow-up**: During implementation, run a regex audit for any
  Chinese codepoints inside the three method bodies after the edit and
  confirm the diff stays within
  `backend/app/services/oasis_profile_generator.py`.
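
One way to run the follow-up audit is to limit a regex scan to the line ranges of the three builders, using `ast` to find those ranges. This is a sketch, not the project's guard: `cjk_lines_in_functions` is a hypothetical helper, the pattern covers only the CJK Unified Ideographs block, and the sample source is invented. It needs Python 3.8+ for `end_lineno`.

```python
import ast
import re

# Assumption: the basic CJK Unified Ideographs block is enough for this file.
CJK = re.compile(r"[\u4e00-\u9fff]")

BUILDERS = {
    "_get_system_prompt",
    "_build_individual_persona_prompt",
    "_build_group_persona_prompt",
}


def cjk_lines_in_functions(source: str, names: set[str]) -> list[tuple[str, int, str]]:
    """Return (function, line number, text) for CJK hits inside named functions."""
    lines = source.split("\n")
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)) and node.name in names:
            for lineno in range(node.lineno, node.end_lineno + 1):
                text = lines[lineno - 1]
                if CJK.search(text):
                    hits.append((node.name, lineno, text.strip()))
    return hits


# Invented sample: one builder still carries a Chinese fallback literal,
# while an unrelated method is ignored by the audit.
sample = '''
class Generator:
    def _get_system_prompt(self):
        return "You are an expert."

    def _build_group_persona_prompt(self):
        attrs_str = "无"
        return attrs_str

    def unrelated(self):
        return "中文"
'''

hits = cjk_lines_in_functions(sample, BUILDERS)
```

Note that the scan covers the whole function span, so Chinese docstrings inside the builders (deferred to #7) would also be reported and need to be filtered or accepted during review.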

### Decision: Drop the "use Chinese country names" hint

- **Context**: The current prompt reads
  `country: 国家(使用中文,如"中国")` at line 704 and again at
  line 753. This forces Chinese for the `country` field even under
  `Accept-Language: en`.
- **Alternatives Considered**:
  1. Translate to English literally:
     `country: country (use English, e.g. "China")`.
  2. Drop the language hint entirely:
     `country: country name string`.
- **Selected Approach**: Drop the language hint. Let
  `get_language_instruction()` steer the country language alongside
  every other free-text field.
- **Rationale**: Hard-coding a language in the prompt defeats the
  locale-steering mechanism. The rule-based fallback (out of scope)
  carries its own Chinese defaults; under the LLM path, locale should
  decide.
- **Trade-offs**: Under `Accept-Language: zh`, the LLM may produce a
  Chinese country name (e.g. `中国`); this is the desired behaviour.
  Under `Accept-Language: en`, the LLM is expected to produce English
  (`China`), matching `COUNTRIES = ["China", "US", ...]` already in the
  file.
- **Follow-up**: Verify in the validation phase that a sample run under
  locale `en` produces an English country name.

### Decision: Keep `gender` enum constraint in English inside the prompt

- **Context**: `gender` must be one of `"male"`/`"female"`/`"other"`
  regardless of locale, because OASIS consumers and the
  `_generate_profile_rule_based` fallback assume English values.
- **Alternatives Considered**: None; the constraint is a contract.
- **Selected Approach**: The translated prompt explicitly states the
  enum in English, even when the locale postfix asks for Chinese
  output: `gender MUST be one of "male" or "female" (English literal)`.
- **Rationale**: Same as the existing Chinese prompt, which already
  states `必须是英文: "male" 或 "female"` ("must be English: "male" or
  "female""). The translation preserves the same lock-in.
- **Trade-offs**: None.
- **Follow-up**: Validation phase will check that under both locales
  the produced `gender` is one of the three English literals.

## Risks & Mitigations

- **Risk**: Mistranslation drops a locale-independent constraint
  (e.g. `gender` enum, `age` integer rule, `persona` no-newline rule).
  - **Mitigation**: The implementation task list will enumerate every
    constraint inline so reviewers can check by diff.
- **Risk**: A variable-name typo inside an f-string raises a `NameError`
  at runtime.
  - **Mitigation**: Implementation task verifies that the set of
    `{variable}` interpolations in each translated block matches the
    pre-change set 1:1; a `python -c "import ..."` smoke import and a
    `pytest backend/scripts/test_profile_format.py` run are mandatory.
- **Risk**: Accidentally leaving a CJK codepoint inside the three
  builders.
  - **Mitigation**: Final implementation step runs the project's
    repo-level CJK guard regex (added by #26) constrained to the three
    builders' line ranges.
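
The 1:1 interpolation check in the second mitigation can be done statically. That matters because a smoke import does not execute the builder bodies, and the format test does not call them either, so a misspelled name would only fail when a builder is actually invoked. The sketch below is one possible approach, not the project's tooling: `fstring_expressions` is a hypothetical helper, the two one-line snippets are invented, and `ast.unparse` requires Python 3.9+.

```python
import ast


def fstring_expressions(source: str) -> set[str]:
    """Return the source text of every {expression} inside f-strings."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FormattedValue):
            found.add(ast.unparse(node.value))
    return found


# Invented before/after snippets; the translated one has a typo.
before = r'prompt = f"实体名称: {entity_name}\n{get_language_instruction()}"'
after = r'prompt = f"Entity name: {entity_nmae}\n{get_language_instruction()}"'

missing = fstring_expressions(before) - fstring_expressions(after)
unexpected = fstring_expressions(after) - fstring_expressions(before)
```

Empty `missing` and `unexpected` sets indicate that the translated block interpolates exactly the same expressions as the original, including the `get_language_instruction()` call.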
## References
- `backend/app/services/oasis_profile_generator.py` — target file.
- `backend/app/utils/locale.py` — locale infrastructure.
- `locales/languages.json`, `locales/en.json`, `locales/zh.json`
(locale registries).
- `.kiro/specs/i18n-ontology-generator-prompts/` — sibling spec #2.
- `.kiro/specs/i18n-simulation-config-generator-prompts/` — sibling
spec #4.
- `.kiro/specs/i18n-report-agent-prompts/` — sibling spec #5.
- GitHub issue
[#3](https://github.com/salestech-group/MiroFish/issues/3).