Requirements Document
Introduction
This specification covers the English translation of the prompt strings in backend/app/services/oasis_profile_generator.py. The file converts Graphiti graph entities into OASIS agent persona dictionaries that drive Step 2 (Environment Setup) of the MiroFish pipeline. Today, the system prompt and the two _build_*_persona_prompt user-message templates are written in Chinese; the response language is steered at runtime by appending get_language_instruction() to the system prompt and embedding it in the user prompt body. That suffix tells the model which language to respond in, but the base-prompt language still biases the model's structural and lexical output, so persona prose (bio, persona, profession, interested_topics) skews Chinese under Accept-Language: en. Translating the base prompts to English removes that bias while preserving the existing locale-switching mechanism for non-English locales (get_language_instruction() returns 请使用中文回答。 when locale is zh, so a Chinese model response remains achievable from an English base prompt).
This work tracks GitHub issue #3 and is a sibling of the already-merged ontology-generator (#2), simulation-config-generator (#4), and report-agent (#5) prompt translation specs.
Boundary Context
- In scope:
  - Translating the system-prompt base string in OasisProfileGenerator._get_system_prompt (currently "你是社交媒体用户画像生成专家。…" at line ~664) from Chinese to English.
  - Translating the individual-persona user-message template in OasisProfileGenerator._build_individual_persona_prompt (currently lines ~680–714) from Chinese to English.
  - Translating the group/institution-persona user-message template in OasisProfileGenerator._build_group_persona_prompt (currently lines ~729–762) from Chinese to English.
  - Translating the small attrs_str and context_str fallback default literals ("无", "无额外上下文") to English equivalents.
  - Preserving all functional contracts: every get_language_instruction() call site, all variable interpolations, all JSON output keys, the gender enum constraint, the age integer constraint, and the institutional age=30 / gender="other" rule.
- Out of scope:
  - Logger calls (logger.info, logger.warning, logger.error) and the printed banner text inside oasis_profile_generator.py: covered by issue #6.
  - Module docstring, class docstrings, method docstrings, and inline comments: covered by issue #7.
  - The fallback Chinese string literals embedded in non-prompt code paths (e.g. f"{entity_name}是一个{entity_type}。" inside _try_fix_json and the rule-based fallback). Those are runtime data fallbacks, not LLM prompts, and are out of scope for this issue (they are part of the fallback flow covered when comments/docstrings #7 lands or in a future cleanup; they are not user-visible while the LLM path succeeds).
  - Refactoring the OASIS profile JSON schema, the OasisAgentProfile dataclass, the MBTI list, the COMMON_COUNTRIES list, the entity-type taxonomy splits (PERSONAL_ENTITY_TYPES vs GROUP_ENTITY_TYPES), or persona-generation flow control.
  - Changing OASIS profile-format compatibility, which is verified by backend/scripts/test_profile_format.py.
  - Editing the locale plumbing block (currently the current_locale = get_locale() capture and the set_locale(current_locale) call inside generate_single_profile around lines ~910–916).
- Adjacent expectations:
  - The Step 2 environment-setup pipeline must continue to consume the OASIS profile output unchanged. The Reddit (to_reddit_format) and Twitter (to_twitter_format) serializers are not coupled to prompt language; this is verified via the JSON schema contract preservation.
  - The locale resolution chain (Accept-Language header → get_locale() → get_language_instruction()) is owned by backend/app/utils/locale.py and is unchanged by this work.
  - Companion i18n issues (#6 logs, #7 comments/docstrings, #9 frontend comments, #10 e2e verification, #12 README) operate on different files or scopes and must not be touched here.
Requirements
Requirement 1: English Translation of the System Prompt
Objective: As a MiroFish operator running the pipeline under Accept-Language: en, I want the persona-generation system prompt to be authored in English, so that the LLM's persona prose is not biased toward Chinese structure or word choice.
Acceptance Criteria
- The OASIS Profile Generator shall set the base_prompt constant inside _get_system_prompt to an English string containing zero Chinese characters.
- The OASIS Profile Generator shall preserve the system-prompt assembly contract verbatim: the format f"{base_prompt}\n\n{get_language_instruction()}" and the call to get_language_instruction() at exactly that site.
- The OASIS Profile Generator shall preserve the role and intent semantics of the original prompt: identifying the model as an expert in social-media user-persona generation, requesting detailed and realistic personas for opinion simulation that reflect existing real-world conditions, and mandating valid JSON output where string values must not contain unescaped newlines.
- The OASIS Profile Generator shall preserve the function signature _get_system_prompt(self, is_individual: bool) -> str.
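The assembly contract in the criteria above can be sketched as follows. This is a minimal illustration, not the shipped code: the English base_prompt wording, the English return value of get_language_instruction(), the module-level set_locale() stub, and the CJK_PATTERN check are all assumptions made for the example; only the f-string format and the zh instruction string come from this document.

```python
import re

# Hypothetical stand-ins for the helpers in backend/app/utils/locale.py.
_CURRENT_LOCALE = "en"


def set_locale(locale: str) -> None:
    global _CURRENT_LOCALE
    _CURRENT_LOCALE = locale


def get_language_instruction() -> str:
    # The zh string is quoted in the introduction; the en string is an assumption.
    if _CURRENT_LOCALE == "zh":
        return "请使用中文回答。"
    return "Please respond in English."


# Rough check covering the basic CJK Unified Ideographs block only.
CJK_PATTERN = re.compile(r"[\u4e00-\u9fff]")


def get_system_prompt(is_individual: bool) -> str:
    # is_individual is kept only to mirror the preserved signature.
    # Illustrative wording; the actual translation is the implementer's choice.
    base_prompt = (
        "You are an expert in generating social-media user personas. "
        "Produce detailed, realistic personas for opinion simulation that "
        "reflect existing real-world conditions. Return valid JSON only; "
        "string values must not contain unescaped newlines."
    )
    # Assembly format preserved verbatim from the requirement.
    return f"{base_prompt}\n\n{get_language_instruction()}"
```

Splitting the result on the first blank line separates the English base from the locale instruction, which makes the "zero Chinese characters" criterion checkable without constraining the zh instruction itself.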
Requirement 2: English Translation of the Individual-Persona User-Message Template
Objective: As a MiroFish operator generating personas for individual entities under Accept-Language: en, I want the user-message template constructed by _build_individual_persona_prompt to be authored in English, so that the rendered prompt does not interleave English get_language_instruction() directives with Chinese section headings.
Acceptance Criteria
- The OASIS Profile Generator shall render the individual-persona user message with English section headings and prose in place of the current Chinese (entity name, entity type, entity summary, entity attributes, context section, JSON-fields enumeration, "important" trailing block).
- The OASIS Profile Generator shall preserve all variable interpolations verbatim by name: {entity_name}, {entity_type}, {entity_summary}, {attrs_str}, {context_str}, and the inline {get_language_instruction()} call inside the trailing rules block.
- The OASIS Profile Generator shall preserve the JSON output contract enumerated in the prompt: the keys bio, persona, age, gender, mbti, country, profession, interested_topics (verbatim, English).
- The OASIS Profile Generator shall preserve the field-level constraints in the prompt:
  - bio ≈ 200 characters, social-media biography.
  - persona ≈ 2000 characters, single coherent text covering: basic information (age, profession, education, location), background (notable experience, event association, social ties), personality (MBTI, core traits, emotional expression), social-media behavior (posting frequency, content preferences, interaction style, language traits), stance (attitudes toward the topic, emotional triggers), unique features (catchphrases, special experiences, hobbies), and personal memory (the entity's relation to the event and prior actions/reactions in it).
  - age MUST be an integer.
  - gender MUST be one of "male" or "female" (English enum value, locale-independent).
  - mbti MUST be an MBTI four-letter type (e.g. INTJ, ENFP).
  - country MUST be a country name string.
  - profession MUST be a profession string.
  - interested_topics MUST be an array.
- The OASIS Profile Generator shall preserve the trailing-block rules verbatim in spirit: every value is a string or number, no newlines inside string values, persona is a single coherent text, gender must be the English male/female enum even when locale is zh, content must stay consistent with the source entity, age must be a valid integer.
- The OASIS Profile Generator shall preserve the function signature _build_individual_persona_prompt(self, entity_name: str, entity_type: str, entity_summary: str, entity_attributes: Dict[str, Any], context: str) -> str.
- The OASIS Profile Generator shall preserve the context[:3000] truncation behaviour and the conditional fallback ("无额外上下文" translated to "No additional context") when context is empty/falsy. Likewise, attrs_str shall fall back to an English placeholder ("None") when entity_attributes is empty/falsy, replacing the current "无" literal.
- The OASIS Profile Generator shall return zero Chinese characters across all string literals contributed to the assembled individual-persona prompt body.
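The fallback and truncation rules above could look roughly like this. The helper name render_prompt_inputs and the use of json.dumps to serialize attributes are assumptions made for illustration; the document fixes only the context[:3000] slice and the two English placeholder strings.

```python
import json
from typing import Any, Dict, Tuple

MAX_CONTEXT_CHARS = 3000  # from the context[:3000] rule in the criteria


def render_prompt_inputs(
    entity_attributes: Dict[str, Any], context: str
) -> Tuple[str, str]:
    """Return (attrs_str, context_str) with English fallbacks applied."""
    # How the real code serializes attributes is not stated in this spec;
    # JSON is used here purely as a placeholder serialization.
    if entity_attributes:
        attrs_str = json.dumps(entity_attributes, ensure_ascii=False)
    else:
        attrs_str = "None"

    # Truncate non-empty context; fall back to the English placeholder otherwise.
    context_str = context[:MAX_CONTEXT_CHARS] if context else "No additional context"
    return attrs_str, context_str
```

Keeping both fallbacks in one place makes it easy to confirm that neither "无" nor "无额外上下文" can reach the rendered prompt.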
Requirement 3: English Translation of the Group/Institution-Persona User-Message Template
Objective: As a MiroFish operator generating personas for institutional/group entities under Accept-Language: en, I want the user-message template constructed by _build_group_persona_prompt to be authored in English, so that the rendered prompt does not interleave English get_language_instruction() directives with Chinese section headings.
Acceptance Criteria
- The OASIS Profile Generator shall render the group-persona user message with English section headings and prose in place of the current Chinese.
- The OASIS Profile Generator shall preserve all variable interpolations verbatim by name: {entity_name}, {entity_type}, {entity_summary}, {attrs_str}, {context_str}, and the inline {get_language_instruction()} call inside the trailing rules block.
- The OASIS Profile Generator shall preserve the JSON output contract enumerated in the prompt: the keys bio, persona, age, gender, mbti, country, profession, interested_topics (verbatim, English).
- The OASIS Profile Generator shall preserve the field-level constraints in the prompt:
  - bio ≈ 200 characters, an official-account biography that reads as professionally appropriate.
  - persona ≈ 2000 characters, single coherent text covering: institutional basics (formal name, type, founding background, primary functions), account positioning (account type, target audience, core function), voice (language traits, common phrasing, taboo topics), publishing pattern (content types, publishing frequency, active hours), stance (official position on the core topic, controversy-handling style), special notes (group portrait represented, operational habits), and institutional memory (the institution's relation to the event and prior actions/reactions in it).
  - age MUST be the integer 30 (the institutional virtual-age sentinel).
  - gender MUST be the literal "other" (English enum value, locale-independent), indicating non-individual.
  - mbti MUST be an MBTI four-letter type used to characterize account voice (e.g. ISTJ for strict/conservative).
  - country MUST be a country name string.
  - profession MUST describe institutional function.
  - interested_topics MUST be an array of focus areas.
- The OASIS Profile Generator shall preserve the trailing-block rules verbatim in spirit: every value is a string or number, no null values, no newlines in string values, persona is a single coherent text, gender must be the English "other" enum even when locale is zh, the institutional account voice must match its identity positioning, and age must be the integer 30.
- The OASIS Profile Generator shall preserve the function signature _build_group_persona_prompt(self, entity_name: str, entity_type: str, entity_summary: str, entity_attributes: Dict[str, Any], context: str) -> str.
- The OASIS Profile Generator shall preserve the context[:3000] truncation behaviour and the conditional English-equivalent fallback for empty context and empty entity_attributes, mirroring Requirement 2.
- The OASIS Profile Generator shall return zero Chinese characters across all string literals contributed to the assembled group-persona prompt body.
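A reviewer who wants to spot-check model output against the field-level constraints of Requirements 2 and 3 might use something like the sketch below. The function constraint_violations and its message strings are hypothetical and not part of oasis_profile_generator.py; the rules it encodes (integer age, the 30 / "other" sentinels for groups, four-letter MBTI, array topics, no newlines in bio or persona) are taken from the criteria above.

```python
import re
from typing import Any, Dict, List

MBTI_PATTERN = re.compile(r"^[EI][SN][TF][JP]$")


def constraint_violations(profile: Dict[str, Any], is_individual: bool) -> List[str]:
    """Return human-readable violations; an empty list means the profile passes."""
    problems: List[str] = []

    age = profile.get("age")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not isinstance(age, int) or isinstance(age, bool):
        problems.append("age must be an integer")
    elif not is_individual and age != 30:
        problems.append("group age must be 30")

    # The gender enum stays English regardless of locale.
    allowed = {"male", "female"} if is_individual else {"other"}
    gender = profile.get("gender")
    if not isinstance(gender, str) or gender not in allowed:
        problems.append(f"gender must be one of {sorted(allowed)}")

    if not MBTI_PATTERN.match(str(profile.get("mbti", ""))):
        problems.append("mbti must be a four-letter MBTI type")

    if not isinstance(profile.get("interested_topics"), list):
        problems.append("interested_topics must be an array")

    for key in ("bio", "persona"):
        value = profile.get(key)
        if not isinstance(value, str) or "\n" in value:
            problems.append(f"{key} must be a string without newlines")

    return problems
```

Such a check is a review aid only; this specification does not ask for new validation to be added to the production path.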
Requirement 4: Locale Switching Continues to Work via get_language_instruction()
Objective: As a MiroFish operator running the pipeline under Accept-Language: zh (or any other configured non-English locale), I want generated personas to remain in the requested locale at equivalent quality, so that translating the base prompt does not regress non-English support.
Acceptance Criteria
- The OASIS Profile Generator shall preserve every existing get_language_instruction() call site exactly: the system-prompt site in _get_system_prompt, the inline call inside the trailing rules block of _build_individual_persona_prompt, and the inline call inside the trailing rules block of _build_group_persona_prompt.
- The OASIS Profile Generator shall preserve the locale-capture/restore plumbing inside generate_profiles_for_entities (currently the current_locale = get_locale() capture and the set_locale(current_locale) call inside generate_single_profile); this code is not modified by the change.
- While the locale is zh, the OASIS Profile Generator shall produce profiles whose bio, persona, profession, and interested_topics content is in Chinese, equivalent in quality to the pre-change behaviour.
- While the locale is en, the OASIS Profile Generator shall produce profiles whose bio, persona, profession, and interested_topics content is in English.
- While the locale is en or zh, the OASIS Profile Generator shall produce profiles whose gender field is one of the literal English values "male", "female" (individual entities) or "other" (group entities), regardless of locale.
- The OASIS Profile Generator shall not alter backend/app/utils/locale.py, the _languages and _translations registries, or the locales under /locales/.
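The capture/restore plumbing named above is easiest to understand with a toy model. The sketch below assumes the locale is stored per thread and that profiles are generated on worker threads; both are inferences from the capture/restore pattern, not facts stated in this document. The get_locale/set_locale stubs, the "zh" default, and generate_all are hypothetical.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import List

# Hypothetical thread-local locale store standing in for backend/app/utils/locale.py.
_state = threading.local()


def set_locale(locale: str) -> None:
    _state.locale = locale


def get_locale() -> str:
    # The default value here is an assumption for the example.
    return getattr(_state, "locale", "zh")


def generate_all(names: List[str]) -> List[str]:
    # Capture the locale on the calling (request) thread...
    current_locale = get_locale()

    def generate_single_profile(name: str) -> str:
        # ...and restore it inside each worker, which would otherwise
        # see only the thread-local default.
        set_locale(current_locale)
        return f"{name}:{get_locale()}"

    with ThreadPoolExecutor(max_workers=2) as pool:
        return list(pool.map(generate_single_profile, names))
```

Under these assumptions, removing the set_locale(current_locale) line would make workers fall back to the default locale, which is why the requirement keeps the block untouched.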
Requirement 5: Public API and Call-Site Stability
Objective: As a developer maintaining the rest of the MiroFish backend pipeline, I want the public surface of OasisProfileGenerator and OasisAgentProfile to remain unchanged, so that the Step 2 environment-setup flow and existing callers continue to work without modification.
Acceptance Criteria
- The OASIS Profile Generator shall preserve the dataclass OasisAgentProfile, including its field set (user_id, user_name, name, bio, persona, karma, friend_count, follower_count, statuses_count, age, gender, mbti, country, profession, interested_topics, source_entity_uuid, source_entity_type, created_at), default values, and the to_reddit_format, to_twitter_format, to_full_dict serializers.
- The OASIS Profile Generator shall preserve the signatures and call semantics of OasisProfileGenerator.__init__, generate_profile_from_entity, generate_profiles_for_entities, _call_llm_with_retry, _generate_profile_rule_based, _get_system_prompt, _build_individual_persona_prompt, _build_group_persona_prompt, _print_generated_profile, _fix_truncated_json, _try_fix_json, and _generate_username.
- The OASIS Profile Generator shall preserve the LLM invocation parameters (temperature, max_tokens, model selection, retry behaviour) at the call sites that consume the prompts produced by the translated builders.
- The OASIS Profile Generator shall preserve the PERSONAL_ENTITY_TYPES and GROUP_ENTITY_TYPES taxonomies, the MBTI_TYPES list, and the COMMON_COUNTRIES list verbatim.
Requirement 6: Reasoning-Model Output Compatibility
Objective: As a MiroFish operator using a reasoning-model provider (e.g. MiniMax, GLM with <think> tags or markdown code fences), I want JSON parsing of the persona response to continue working, so that translating the base prompt does not regress provider compatibility.
Acceptance Criteria
- The OASIS Profile Generator shall preserve the existing _fix_truncated_json and _try_fix_json resilience helpers exactly, including their regex-based extraction of bio and persona from partial output.
- If a reasoning-model provider returns truncated, <think>-tagged, or markdown-fenced output, then the existing parsing/recovery flow shall continue to apply unchanged.
- The OASIS Profile Generator shall not introduce any new pre-processing of the LLM response that depends on prompt language.
- After translation, the OASIS Profile Generator shall continue to round-trip a representative entity through generate_profile_from_entity and produce a JSON object with at minimum a non-empty bio and a non-empty persona, matching the pre-change behaviour.
Requirement 7: Step 2 Environment-Setup Parity (OASIS Format Compatibility)
Objective: As a MiroFish operator validating the change, I want the OASIS subprocess to accept the generated profiles unchanged, so that the translation does not silently break Step 2 → Step 3 hand-off.
Acceptance Criteria
- While uv run python -m pytest backend/scripts/test_profile_format.py runs against the changed code, the test suite shall pass with zero regressions versus the pre-change baseline.
- While a representative Reddit-format profile dictionary is produced under locale en, every field name shall match the existing OASIS-required schema: user_id, username, name, bio, persona, karma, created_at, plus optional age, gender, mbti, country, profession, interested_topics.
- While a representative Twitter-format profile dictionary is produced under locale en, every field name shall match the existing OASIS-required schema: user_id, username, name, bio, persona, friend_count, follower_count, statuses_count, created_at, plus optional age, gender, mbti, country, profession, interested_topics.
- The OASIS Profile Generator shall produce gender values that are exactly one of "male", "female", "other" regardless of locale, satisfying the OASIS subprocess's expected enum.
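The field-name criteria above reduce to a set comparison. The following sketch is a hypothetical review helper, separate from backend/scripts/test_profile_format.py (whose contents are not reproduced here); the constant and function names are assumptions, while the field lists are copied from the criteria.

```python
from typing import Any, Dict, List, Set

# Field names copied from the Reddit and Twitter criteria above.
REDDIT_REQUIRED: Set[str] = {
    "user_id", "username", "name", "bio", "persona", "karma", "created_at",
}
TWITTER_REQUIRED: Set[str] = {
    "user_id", "username", "name", "bio", "persona",
    "friend_count", "follower_count", "statuses_count", "created_at",
}
OPTIONAL_FIELDS: Set[str] = {
    "age", "gender", "mbti", "country", "profession", "interested_topics",
}


def key_violations(profile: Dict[str, Any], required: Set[str]) -> List[str]:
    """List missing required keys and keys outside required + optional."""
    keys = set(profile)
    problems = [f"missing: {k}" for k in sorted(required - keys)]
    problems += [
        f"unexpected: {k}" for k in sorted(keys - required - OPTIONAL_FIELDS)
    ]
    return problems
```

For example, passing the output of to_reddit_format() with REDDIT_REQUIRED should yield an empty list both before and after the translation, since the serializers are not coupled to prompt language.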
Requirement 8: Out-of-Scope Surfaces Remain Untouched
Objective: As a reviewer of this PR, I want the change to remain narrowly scoped to prompt strings, so that translation responsibilities for adjacent surfaces (issues #6, #7, and the rule-based fallback) are not absorbed into this change.
Acceptance Criteria
- The change shall not modify any logger.warning(...), logger.info(...), logger.error(...), or logger.debug(...) call in oasis_profile_generator.py (covered by issue #6).
- The change shall not modify the module docstring, class docstrings, method docstrings, or inline comments in oasis_profile_generator.py (covered by issue #7).
- The change shall not modify the rule-based fallback Chinese fragments inside _try_fix_json (e.g. f"{entity_name}是一个{entity_type}。") and the rule-based path inside _generate_profile_rule_based; those are runtime data fallbacks, not LLM prompts, and remain out of scope here.
- The change shall not edit any file outside backend/app/services/oasis_profile_generator.py for production code.
- The change shall not introduce a new dependency or modify backend/pyproject.toml / backend/uv.lock.
- The change shall not modify backend/scripts/test_profile_format.py (the test is the contract; the implementation must match it).