# Research & Design Decisions — i18n-oasis-profile-generator-prompts

## Summary

- **Feature**: `i18n-oasis-profile-generator-prompts`
- **Discovery Scope**: **Extension** (single-file translation in an existing
  brownfield service; sibling pattern already merged in #2, #4, #5)
- **Key Findings**:
  - The existing `get_language_instruction()` postfix mechanism (defined in
    `backend/app/utils/locale.py`) is the project-canonical way to steer LLM
    output language. Translating the base prompt does not interfere with it
    and is the same approach taken in the already-merged sibling specs.
  - The only Chinese surfaces inside the prompt-rendering path are
    `_get_system_prompt`, `_build_individual_persona_prompt`,
    `_build_group_persona_prompt`, and the four `attrs_str`/`context_str`
    fallback literals (`"无"`, meaning "none", and `"无额外上下文"`, meaning
    "no additional context"). All other Chinese in the file is logger keys
    (already handled by #6), docstrings/comments (out of scope, #7), or
    rule-based fallback data (out of scope).
  - `backend/scripts/test_profile_format.py` does not exercise prompts; it
    only constructs `OasisAgentProfile` and round-trips through
    `_save_twitter_csv` / `_save_reddit_json`. A pure-translation diff
    cannot break it.

## Research Log

### Locale steering mechanism

- **Context**: Confirm that translating the base prompt does not regress
  Chinese output under `Accept-Language: zh`.
- **Sources Consulted**:
  - `backend/app/utils/locale.py` (lines 50–96).
  - `locales/languages.json` (entries for `en` and `zh` with
    `llmInstruction` field).
  - Sibling spec `i18n-ontology-generator-prompts/design.md` and the
    merged commits referenced by it.
- **Findings**:
  - `get_language_instruction()` returns `Please respond in English.`
    for locale `en` and `请使用中文回答。` ("Please answer in Chinese.")
    for locale `zh`.
  - The function is called as an inline f-string interpolation in the
    individual-persona and group-persona prompt bodies, and explicitly
    appended in `_get_system_prompt`. All three sites must be preserved
    byte-for-byte.
  - The thread-local locale is captured in
    `generate_profiles_for_entities` (line ~910) and restored inside the
    worker via `set_locale(current_locale)` (line ~914). This plumbing is
    untouched by the change.
- **Implications**:
  - Design lock-in: the `get_language_instruction()` call must remain in
    each of the three builders (inline in the two persona prompts,
    appended in `_get_system_prompt`). Removing or renaming it would
    silently regress non-English locales.
  - The Chinese hint `country: 国家(使用中文,如"中国")` (roughly
    "country: country (use Chinese, e.g. 中国)") in the original prompt
    overrides the locale postfix and forces Chinese output for one
    field. The English translation drops that hint so the locale postfix
    decides the country language. The rule-based fallback (out of scope)
    has its own (Chinese) defaults and is not affected.
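
The capture-and-restore plumbing described in the findings can be illustrated with a minimal sketch. It assumes a thread-local store behind `set_locale` and a hard-coded instruction table. `get_locale`, `build_prompt`, `generate_all`, the `"zh"` default, and the use of `ThreadPoolExecutor` are hypothetical stand-ins for illustration; they are not the contents of `backend/app/utils/locale.py` or of `generate_profiles_for_entities`.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the locale helpers; the real module is assumed
# to take its instruction text from locales/languages.json.
_state = threading.local()
_INSTRUCTIONS = {
    "en": "Please respond in English.",
    "zh": "请使用中文回答。",
}


def set_locale(locale: str) -> None:
    _state.locale = locale


def get_locale() -> str:
    # Assumed default for threads that never called set_locale().
    return getattr(_state, "locale", "zh")


def get_language_instruction() -> str:
    return _INSTRUCTIONS[get_locale()]


def build_prompt(entity_name: str) -> str:
    # The inline call is evaluated on whichever thread renders the f-string.
    return f"Generate a persona for {entity_name}.\n{get_language_instruction()}"


def generate_all(entity_names: list[str]) -> list[str]:
    current_locale = get_locale()  # captured on the calling thread

    def worker(name: str) -> str:
        set_locale(current_locale)  # restored inside the worker thread
        return build_prompt(name)

    with ThreadPoolExecutor(max_workers=2) as pool:
        return list(pool.map(worker, entity_names))
```

In this sketch, dropping the `set_locale(current_locale)` line in `worker` would make every worker fall back to the assumed default locale, which is why the translation leaves that plumbing alone.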

### Test contract

- **Context**: Verify that `backend/scripts/test_profile_format.py`
  remains green after a prompt-only translation.
- **Sources Consulted**: `backend/scripts/test_profile_format.py`,
  `oasis_profile_generator.py:_save_twitter_csv`,
  `oasis_profile_generator.py:_save_reddit_json`,
  `oasis_profile_generator.py:to_reddit_format`,
  `oasis_profile_generator.py:to_twitter_format`.
- **Findings**:
  - The pytest function `test_profile_formats` constructs
    `OasisAgentProfile` instances directly without invoking the LLM.
  - It calls `_save_twitter_csv` and `_save_reddit_json` to verify CSV
    and JSON shape. Required CSV header: `user_id, user_name, name, bio,
    friend_count, follower_count, statuses_count, created_at`. Required
    JSON keys: `realname, username, bio, persona`.
- **Implications**:
  - Translating prompts cannot regress this test. The validation
    requirement (Requirement 7) is satisfied automatically as long as
    serializer code is not edited.
  - No new tests are required for this change.
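
To make the shape contract above concrete, the following is a rough sketch of the kind of check the test performs. It is not the code in `test_profile_format.py`: the helper names are hypothetical, and it assumes the Reddit JSON file is a list of objects and that extra CSV columns or JSON keys are tolerated.

```python
import csv
import io
import json

REQUIRED_CSV_COLUMNS = [
    "user_id", "user_name", "name", "bio",
    "friend_count", "follower_count", "statuses_count", "created_at",
]
REQUIRED_JSON_KEYS = {"realname", "username", "bio", "persona"}


def missing_csv_columns(csv_text: str) -> list[str]:
    """Return required columns that are absent from the CSV header row."""
    header = next(csv.reader(io.StringIO(csv_text)), [])
    return [col for col in REQUIRED_CSV_COLUMNS if col not in header]


def missing_json_keys(json_text: str) -> list[set[str]]:
    """Return, per profile, the set of required keys that are absent."""
    profiles = json.loads(json_text)  # assumed: a JSON list of objects
    return [REQUIRED_JSON_KEYS - set(profile) for profile in profiles]
```

Neither helper looks at prompt text, which mirrors why a prompt-only translation leaves the test outcome unchanged.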

### Sibling specs already shipped

- **Context**: Confirm there is an established project pattern this work
  must mirror.
- **Sources Consulted**:
  - `.kiro/specs/i18n-ontology-generator-prompts/{design,tasks,requirements}.md`
  - `.kiro/specs/i18n-report-agent-prompts/`
  - `.kiro/specs/i18n-simulation-config-generator-prompts/`
  - Recent merged commits referencing #2, #4, #5.
- **Findings**:
  - All three siblings used a single-file in-place translation diff.
  - All three preserved every `get_language_instruction()` call site.
  - All three left logger calls and docstrings to companion issues
    (#6 / #7).
  - None externalized prompts to `/locales/*.json`.
- **Implications**:
  - The same approach is correct here. Reviewer expectations are set by
    the sibling diffs.

### OASIS profile schema

- **Context**: Verify that translated prompts continue to satisfy the
  OASIS subprocess's expected schema (especially `gender` enum and
  `age` integer).
- **Sources Consulted**: `OasisAgentProfile` dataclass,
  `to_reddit_format`, `to_twitter_format`, sibling
  `_generate_profile_rule_based`.
- **Findings**:
  - OASIS-required fields are produced by serializers, not by the
    prompt: `user_id`, `username`, `name`, `bio`,
    `karma`/`friend_count`/`follower_count`/`statuses_count`,
    `created_at`.
  - The prompt-defined fields land in optional positions: `age`,
    `gender`, `mbti`, `country`, `profession`, `interested_topics`.
  - The `gender` enum constraint (`"male"`/`"female"` for individuals,
    `"other"` for groups) is locale-independent and must remain in
    English text inside the translated prompt.
- **Implications**:
  - The English prompt must explicitly call out `gender ∈ {male, female}`
    (individual) and `gender == "other"` (group), independent of the
    `get_language_instruction()` postfix.

## Architecture Pattern Evaluation

| Option | Description | Strengths | Risks / Limitations | Notes |
|--------|-------------|-----------|---------------------|-------|
| **A: In-place builder edit** | Translate three method bodies + four fallback literals directly | Smallest diff; matches sibling pattern; zero API change | None of note | **Selected** |
| B: Module-level constants | Hoist prompts to `INDIVIDUAL_PERSONA_PROMPT_TEMPLATE` etc. | Easier `git grep` | Larger diff; the inline `{get_language_instruction()}` call would need to become a `.format()` kwarg, which is a behavioural change beyond translation | Diverges from #4 / #5 |
| C: Externalize to `locales/*.json` | Move every prompt sentence into `t(...)` keys | Most i18n-pure | Three-file diff; diverges from project rationale (prompts use postfix mechanism, not key files) | Rejected |
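
The limitation noted for Option B follows from how `str.format()` works: replacement fields can look up names, attributes, and indices, but they cannot call functions. The sketch below contrasts the two shapes; the prompt text and builder names are hypothetical, and `get_language_instruction()` is a stub standing in for the real helper.

```python
def get_language_instruction() -> str:
    # Stub standing in for the helper in backend/app/utils/locale.py.
    return "Please respond in English."


# Option A shape: the f-string stays inside the builder, so the helper
# is called inline each time the prompt is rendered.
def build_prompt_inline(entity_name: str) -> str:
    return f"Describe {entity_name}.\n{get_language_instruction()}"


# Option B shape: a module-level template cannot contain a call, so the
# instruction has to be passed in as an explicit keyword argument.
PROMPT_TEMPLATE = "Describe {entity_name}.\n{language_instruction}"


def build_prompt_from_template(entity_name: str) -> str:
    return PROMPT_TEMPLATE.format(
        entity_name=entity_name,
        language_instruction=get_language_instruction(),
    )
```

Both builders render the same text, but Option B changes every call site's wiring. Leaving `{get_language_instruction()}` in a plain template and calling `.format()` on it raises `KeyError`, because the whole field is treated as a keyword name.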

## Design Decisions

### Decision: In-place edit of the three prompt builders (Option A)

- **Context**: Three methods build prompt strings; one of them is a
  one-line system prompt, the other two are large f-string templates
  with embedded `{variable}` interpolations and an inline
  `{get_language_instruction()}` call.
- **Alternatives Considered**:
  1. Option B: module-level constants.
  2. Option C: externalize to `/locales/*.json` keys.
- **Selected Approach**: Translate each method body in place. Replace
  the four `"无"` / `"无额外上下文"` fallbacks with English equivalents
  (`"None"` and `"No additional context"`). Preserve all `{...}`
  interpolations and the inline `{get_language_instruction()}` call.
- **Rationale**: Matches merged sibling specs verbatim. Smallest review
  surface. Zero API change. Out-of-scope surfaces (logger, docstrings,
  rule-based fallback) cleanly avoided.
- **Trade-offs**: Leaves the file mixed-language in non-prompt parts
  (docstrings, rule fallback) until #7 lands. Acceptable per scope
  split.
- **Follow-up**: During implementation, run a regex audit for any
  Chinese codepoints inside the three method bodies after the edit and
  confirm the diff stays within
  `backend/app/services/oasis_profile_generator.py`.
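
One way to run the regex audit from the follow-up is sketched below. It uses `ast` to find the line range of each named function and scans only those lines. The character ranges are an assumption (CJK punctuation, CJK Unified Ideographs, and fullwidth forms), and the helper name is hypothetical; the project's own guard regex may differ.

```python
import ast
import re

# Assumed ranges: CJK punctuation, CJK Unified Ideographs, fullwidth forms.
CJK_PATTERN = re.compile(r"[\u3000-\u303f\u4e00-\u9fff\uff00-\uffef]")


def cjk_lines_in_functions(
    source: str, function_names: set[str]
) -> list[tuple[str, int, str]]:
    """Return (function, line number, text) for CJK lines in the named functions."""
    lines = source.splitlines()
    hits = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        if node.name not in function_names:
            continue
        # lineno and end_lineno are 1-based and inclusive.
        for lineno in range(node.lineno, node.end_lineno + 1):
            text = lines[lineno - 1]
            if CJK_PATTERN.search(text):
                hits.append((node.name, lineno, text.strip()))
    return hits
```

Called with the file's source and the names `_get_system_prompt`, `_build_individual_persona_prompt`, and `_build_group_persona_prompt`, an empty result would indicate a clean edit. The scan also reports docstrings and comments inside those methods, which belong to #7, so such hits need to be read rather than treated as failures.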

### Decision: Drop the "use Chinese country names" hint

- **Context**: The current prompt reads
  `country: 国家(使用中文,如"中国")` (roughly "country: country (use
  Chinese, e.g. 中国)") at both line 704 and line 753. This forces
  Chinese for the `country` field even under `Accept-Language: en`.
- **Alternatives Considered**:
  1. Translate to English literally:
     `country: country (use English, e.g. "China")`.
  2. Drop the language hint entirely:
     `country: country name string`.
- **Selected Approach**: Drop the language hint. Let
  `get_language_instruction()` steer the country language alongside
  every other free-text field.
- **Rationale**: Hard-coding a language in the prompt defeats the
  locale-steering mechanism. The rule-based fallback (out of scope)
  carries its own Chinese defaults; under the LLM path, locale should
  decide.
- **Trade-offs**: Under `Accept-Language: zh`, the LLM may produce a
  Chinese country name (e.g. `中国`); this is the desired behaviour.
  Under `Accept-Language: en`, the LLM is expected to produce English
  (`China`), matching `COUNTRIES = ["China", "US", ...]` already in the
  file.
- **Follow-up**: Verify in the validation phase that a sample run under
  locale `en` produces an English country name.

### Decision: Keep `gender` enum constraint in English inside the prompt

- **Context**: `gender` must be one of `"male"`/`"female"`/`"other"`
  regardless of locale, because OASIS consumers and the
  `_generate_profile_rule_based` fallback assume English values.
- **Alternatives Considered**: None; the constraint is a contract.
- **Selected Approach**: The translated prompt explicitly states the
  enum in English, even when the locale postfix asks for Chinese
  output. For individuals:
  `gender MUST be one of "male" or "female" (English literal)`; the
  group prompt pins `"other"` in the same way.
- **Rationale**: Same as the existing Chinese prompt, which already
  states `必须是英文: "male" 或 "female"` ("must be English: "male" or
  "female""). The translation preserves the same lock-in.
- **Trade-offs**: None.
- **Follow-up**: Validation phase will check that under both locales
  the produced `gender` is one of the three English literals.
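
The validation-phase check could look roughly like the sketch below. It assumes the LLM response has already been parsed into a `dict`; the function name and the `kind` labels are hypothetical. It also covers the `age` integer rule from the schema notes, applied only when `age` is present since that field is optional.

```python
ALLOWED_GENDERS = {
    "individual": {"male", "female"},
    "group": {"other"},
}


def profile_problems(profile: dict, kind: str) -> list[str]:
    """Return human-readable violations of the locale-independent rules."""
    problems = []
    gender = profile.get("gender")
    if gender not in ALLOWED_GENDERS[kind]:
        problems.append(f"gender {gender!r} is not allowed for {kind}")
    if "age" in profile:
        age = profile["age"]
        # bool is a subclass of int, so it is excluded explicitly.
        if isinstance(age, bool) or not isinstance(age, int):
            problems.append(f"age {age!r} is not an integer")
    return problems
```

Running the same check on responses produced under `en` and `zh` would show whether the English enum survives the Chinese locale postfix.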

## Risks & Mitigations

- **Risk**: Mistranslation drops a locale-independent constraint
  (e.g. `gender` enum, `age` integer rule, `persona` no-newline rule).
  - **Mitigation**: The implementation task list will enumerate every
    constraint inline so reviewers can check by diff.
- **Risk**: Variable-name typo inside an f-string causes a `NameError`
  at runtime.
  - **Mitigation**: Implementation task verifies that the set of
    `{variable}` interpolations in each translated block matches the
    pre-change set 1:1; a `python -c "import ..."` smoke import and a
    `pytest backend/scripts/test_profile_format.py` run are mandatory.
- **Risk**: Accidentally leaving a CJK codepoint inside the three
  builders.
  - **Mitigation**: Final implementation step runs the project's
    repo-level CJK guard regex (added by #26) constrained to the three
    builders' line ranges.
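
The 1:1 interpolation check from the second mitigation can be mechanised. The sketch below is one possible approach, not the project's tooling: it parses an f-string literal with `ast` and collects the source text of every interpolated expression, so the pre-change and post-change sets can be compared. The function names are hypothetical, the input must be the literal's source text (including the `f` prefix and quotes), and `ast.unparse` requires Python 3.9 or later.

```python
import ast


def fstring_expressions(fstring_source: str) -> set[str]:
    """Collect the source text of every {expression} in an f-string literal."""
    tree = ast.parse(fstring_source, mode="eval")
    return {
        ast.unparse(node.value)
        for node in ast.walk(tree)
        if isinstance(node, ast.FormattedValue)
    }


def interpolation_diff(before: str, after: str) -> tuple[set[str], set[str]]:
    """Return (expressions lost, expressions introduced) by the edit."""
    old = fstring_expressions(before)
    new = fstring_expressions(after)
    return old - new, new - old
```

An empty pair means the translated block interpolates exactly the same expressions as the original. A typo such as `{entiy_name}` shows up as one lost and one introduced expression before the code is ever executed.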

## References

- `backend/app/services/oasis_profile_generator.py`: target file.
- `backend/app/utils/locale.py`: locale infrastructure.
- `locales/languages.json`, `locales/en.json`, `locales/zh.json`:
  locale registries.
- `.kiro/specs/i18n-ontology-generator-prompts/`: sibling spec #2.
- `.kiro/specs/i18n-simulation-config-generator-prompts/`: sibling
  spec #4.
- `.kiro/specs/i18n-report-agent-prompts/`: sibling spec #5.
- GitHub issue
  [#3](https://github.com/salestech-group/MiroFish/issues/3).