MicroFish/.kiro/specs/i18n-oasis-profile-generato.../tasks.md

Implementation Plan

  • 1. Translate the system-prompt builder to English

    • Replace the Chinese base_prompt literal inside _get_system_prompt (currently "你是社交媒体用户画像生成专家。…" at line ~664) with an English rendering that conveys the same role and intent: identifies the model as an expert in social-media user-persona generation, asks for detailed and realistic personas suitable for opinion-simulation that faithfully reflect existing real-world conditions, mandates valid JSON output, and forbids unescaped newlines inside string values
    • Preserve the assembled return shape f"{base_prompt}\n\n{get_language_instruction()}" exactly — the call to get_language_instruction() is unchanged in name and position
    • Preserve the method signature _get_system_prompt(self, is_individual: bool) -> str; do not branch on is_individual (current behaviour preserved)
    • Observable completion: _get_system_prompt(True) and _get_system_prompt(False) both return non-empty English strings ending with the per-locale postfix from get_language_instruction(); the base_prompt body contains zero CJK characters
    • Requirements: 1.1, 1.2, 1.3, 1.4
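A minimal sketch of the shape Task 1 describes, under stated assumptions: the English wording of base_prompt is illustrative only (the final text is up to the implementer), SystemPromptSketch is a hypothetical stand-in class, and get_language_instruction() is stubbed here rather than imported from the project.

```python
import re


def get_language_instruction() -> str:
    """Stub for the project's locale helper (assumed to return the locale postfix)."""
    return "Please respond in English."


class SystemPromptSketch:
    """Hypothetical stand-in for the class that owns _get_system_prompt."""

    def _get_system_prompt(self, is_individual: bool) -> str:
        # Illustrative English wording; not the author's final text.
        base_prompt = (
            "You are an expert in generating social-media user personas. "
            "Generate detailed, realistic personas for opinion simulation that "
            "faithfully reflect existing real-world conditions. "
            "You must return valid JSON, and string values must not contain "
            "unescaped newlines."
        )
        # is_individual is deliberately not branched on (current behaviour).
        return f"{base_prompt}\n\n{get_language_instruction()}"


CJK = re.compile(r"[\u4e00-\u9fff]")
prompt_individual = SystemPromptSketch()._get_system_prompt(True)
prompt_group = SystemPromptSketch()._get_system_prompt(False)
```

Both calls return the same string because the flag is ignored, which matches the "do not branch on is_individual" constraint above.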
  • 2. Translate the individual-persona user-message builder to English

    • Replace the Chinese f-string body inside _build_individual_persona_prompt (currently lines ~680–714) with an English rendering structured as: a lead sentence requesting a detailed social-media persona faithful to existing reality; an entity-context block with English labels for entity_name, entity_type, entity_summary, entity_attributes; a Context information: block; a Generate JSON with the following fields: enumeration of the eight output keys (bio, persona, age, gender, mbti, country, profession, interested_topics); and a trailing Important: rules block
    • Translate the field-level descriptions verbatim in spirit: bio ≈ 200 chars; persona ≈ 2000 chars covering basic info (age, profession, education, location), background (notable experience, event association, social ties), personality (MBTI, core traits, emotional expression), social-media behaviour (posting frequency, content preferences, interaction style, language traits), stance (attitudes toward the topic, emotional triggers), unique features (catchphrases, special experiences, hobbies), and personal memory (the entity's relation to the event and prior actions/reactions); age integer; gender MUST be the literal "male" or "female"; mbti four-letter type; country country name; profession; interested_topics array
    • Translate the trailing rules block to English while keeping every locale-independent constraint intact: all values are strings or numbers; persona is a single coherent text without unescaped newlines; the inline {get_language_instruction()} call remains followed by the parenthetical reminder that gender MUST use the English values "male" / "female"; content stays consistent with the entity; age MUST be a valid integer
    • Replace the attrs_str and context_str Chinese fallback defaults with English: "无" → "None" (used when entity_attributes is empty/falsy) and "无额外上下文" → "No additional context" (used when context is empty/falsy)
    • Drop the country-language hint (使用中文,如"中国", i.e. "use Chinese, e.g. 'China'") so that get_language_instruction() steers the country language; preserve the country line as a neutral country: country name entry
    • Preserve every f-string interpolation by name and position: {entity_name}, {entity_type}, {entity_summary}, {attrs_str}, {context_str}, {get_language_instruction()}
    • Preserve the context[:3000] truncation behaviour and the method signature _build_individual_persona_prompt(self, entity_name: str, entity_type: str, entity_summary: str, entity_attributes: Dict[str, Any], context: str) -> str
    • Observable completion: calling _build_individual_persona_prompt("Alice", "Student", "summary", {"k": "v"}, "ctx") returns a non-empty English string with all six interpolations resolved, with zero CJK characters in any literal contributed by this method, and the string contains the gender enum lock-in "male" / "female" exactly once
    • Requirements: 2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 2.7, 2.8, 4.1, 4.5
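The fallback and truncation rules in Task 2 can be sketched as below. This is an illustration, not the file's actual code: render_fallbacks is a hypothetical helper, and serializing entity_attributes with json.dumps is an assumption about how attrs_str is built.

```python
import json
from typing import Any, Dict, Tuple


def render_fallbacks(entity_attributes: Dict[str, Any], context: str) -> Tuple[str, str]:
    """Hypothetical helper showing the English defaults and the 3000-char cap."""
    # Assumption: attributes are serialized as JSON; only the "None" default
    # is taken from the plan above.
    attrs_str = (
        json.dumps(entity_attributes, ensure_ascii=False)
        if entity_attributes
        else "None"
    )
    # The context[:3000] truncation must be preserved; the default is used
    # when context is empty or falsy.
    context_str = context[:3000] if context else "No additional context"
    return attrs_str, context_str


empty_case = render_fallbacks({}, "")
long_case = render_fallbacks({"k": "v"}, "x" * 5000)
```

Keeping both defaults in one place makes it easy to confirm that Task 3 reuses the same two English strings.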
  • 3. Translate the group/institution-persona user-message builder to English

    • Replace the Chinese f-string body inside _build_group_persona_prompt (currently lines ~729–762) with an English rendering structured the same way as Task 2 but adapted for institutional voice: lead sentence requesting a detailed social-media account profile for an institution/group faithful to existing reality; entity-context block; Context information: block; Generate JSON with the following fields: enumeration of the eight output keys; trailing Important: rules block
    • Translate the field-level descriptions verbatim in spirit: bio ≈ 200 chars in an official-account voice; persona ≈ 2000 chars covering institutional basics (formal name, type, founding background, primary functions), account positioning (account type, target audience, core function), voice (language traits, common phrasing, taboo topics), publishing pattern (content types, publishing frequency, active hours), stance (official position on the core topic, controversy-handling style), special notes (group portrait represented, operational habits), and institutional memory (the institution's relation to the event and prior actions/reactions); age MUST be the integer 30; gender MUST be the literal "other"; mbti four-letter type characterizing account voice; country; profession describes institutional function; interested_topics array
    • Translate the trailing rules block to English while keeping every locale-independent constraint intact: all values are strings or numbers, no null allowed; persona is a single coherent text without unescaped newlines; the inline {get_language_instruction()} call remains followed by the parenthetical reminder that gender MUST use the English value "other"; age MUST be the integer 30 and gender MUST be the string "other"; account voice must match identity positioning
    • Replace the attrs_str and context_str Chinese fallback defaults with the same English replacements applied in Task 2 ("None" and "No additional context")
    • Drop the country-language hint as in Task 2
    • Preserve every f-string interpolation by name and position: {entity_name}, {entity_type}, {entity_summary}, {attrs_str}, {context_str}, {get_language_instruction()}
    • Preserve the context[:3000] truncation behaviour and the method signature _build_group_persona_prompt(self, entity_name: str, entity_type: str, entity_summary: str, entity_attributes: Dict[str, Any], context: str) -> str
    • Observable completion: calling _build_group_persona_prompt("ACME Corp", "Organization", "summary", {"k": "v"}, "ctx") returns a non-empty English string with all six interpolations resolved, with zero CJK characters in any literal contributed by this method, and the string contains both the age == 30 lock-in and the gender == "other" lock-in
    • Requirements: 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7, 3.8, 4.1, 4.5
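One way to check the "all six interpolations resolved" criterion used in Tasks 2 and 3 is to look for placeholder text that survived rendering, which typically happens when a translated literal loses its f prefix. The helper name unresolved_placeholders and the sample strings are hypothetical.

```python
from typing import List

# The six interpolations the plan requires both builders to preserve.
PLACEHOLDERS = [
    "entity_name",
    "entity_type",
    "entity_summary",
    "attrs_str",
    "context_str",
    "get_language_instruction()",
]


def unresolved_placeholders(rendered: str) -> List[str]:
    """Return placeholder names that still appear literally as {name}."""
    return [name for name in PLACEHOLDERS if "{" + name + "}" in rendered]


entity_name = "ACME Corp"
good = f"Entity name: {entity_name}"   # interpolated as intended
bad = "Entity name: {entity_name}"     # f prefix dropped during translation

good_result = unresolved_placeholders(good)
bad_result = unresolved_placeholders(bad)
```

A check like this complements the smoke import in Task 5, which catches syntax errors but not a silently missing f prefix.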
  • 4. Confirm boundary commitments around the translation

    • Confirm every existing get_language_instruction() call site is preserved verbatim: the system-prompt assembly inside _get_system_prompt, the inline call inside the trailing rules block of _build_individual_persona_prompt, and the inline call inside the trailing rules block of _build_group_persona_prompt
    • Confirm the locale-thread plumbing in generate_profiles_for_entities (capture current_locale = get_locale() at line ~910 and set_locale(current_locale) inside the worker at line ~914) is byte-identical
    • Confirm the public signatures of OasisProfileGenerator.__init__, generate_profile_from_entity, generate_profiles_for_entities, set_graph_id, and the private helpers _call_llm_with_retry, _generate_profile_rule_based, _print_generated_profile, _fix_truncated_json, _try_fix_json, _save_twitter_csv, _save_reddit_json, _generate_username are unchanged
    • Confirm the OasisAgentProfile dataclass field set, default values, and the to_reddit_format, to_twitter_format, to_full_dict serializers are unchanged
    • Confirm class constants MBTI_TYPES, COUNTRIES, INDIVIDUAL_ENTITY_TYPES, GROUP_ENTITY_TYPES are unchanged
    • Confirm the LLM invocation parameters at the call site that consumes the translated prompts (response_format={"type": "json_object"}, temperature=0.7 - (attempt * 0.1), max_attempts=3) are unchanged
    • Confirm _fix_truncated_json and _try_fix_json (including their Chinese persona fragments such as f"{entity_name}是一个{entity_type}。") are not modified — these are runtime data fallbacks, not prompts, and are out of scope
    • Confirm _generate_profile_rule_based is not modified — including its Chinese country defaults "中国" at lines ~807 and ~819
    • Confirm backend/app/utils/locale.py, /locales/languages.json, /locales/en.json, and /locales/zh.json are not modified
    • Confirm logger.warning(...), logger.info(...), logger.error(...), the print banner at line ~945, module / class / method docstrings, and inline comments in oasis_profile_generator.py are not modified (owned by issues #6 and #7)
    • Confirm backend/scripts/test_profile_format.py, backend/pyproject.toml, backend/uv.lock, and any file outside backend/app/services/oasis_profile_generator.py are not modified
    • Observable completion: a git diff review against main shows changes only inside backend/app/services/oasis_profile_generator.py and, within that file, only inside _get_system_prompt, _build_individual_persona_prompt, and _build_group_persona_prompt; the surrounding lines (method headers, neighbouring methods) are byte-identical
    • Requirements: 1.4, 2.6, 3.6, 4.1, 4.2, 4.6, 5.1, 5.2, 5.3, 5.4, 6.1, 6.3, 8.1, 8.2, 8.3, 8.4, 8.5, 8.6
  • 5. Verify smoke import and OASIS profile-format pytest

    • Run cd backend && uv run python -c "from app.services.oasis_profile_generator import OasisProfileGenerator, OasisAgentProfile" and confirm it exits 0 (catches f-string syntax errors)
    • Run cd backend && uv run python -m pytest scripts/test_profile_format.py (or equivalent invocation per project convention) and confirm it passes; the test does not exercise prompts, so a pure-translation diff must keep it green
    • Construct an instance of OasisProfileGenerator (using OasisProfileGenerator.__new__(OasisProfileGenerator) to skip __init__ if the LLM key is unavailable, mirroring the pattern in test_profile_format.py) and confirm _get_system_prompt(True), _build_individual_persona_prompt("Alice", "Student", "summary", {"k": "v"}, "ctx"), and _build_group_persona_prompt("ACME", "Organization", "summary", {"k": "v"}, "ctx") each return a string whose prompt body (everything other than the text contributed by get_language_instruction()) has zero CJK matches against the regex [一-鿿]
    • Observable completion: smoke import exits 0; pytest passes with zero regressions; the three prompt-builder calls each produce English-only output under the default zh locale (the get_language_instruction() postfix at the end is the only place where Chinese is allowed to appear, and only when locale is zh)
    • Requirements: 6.4, 7.1, 7.2, 7.3, 7.4
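The __new__ pattern and the body-only CJK check from Task 5 might look like the sketch below. GeneratorStandIn is a hypothetical stand-in for OasisProfileGenerator so the example runs without the project or an LLM key, and body_without_postfix is an illustrative helper, not an existing function.

```python
import re

# Same range as the guard regex quoted above (U+4E00 to U+9FFF).
CJK = re.compile(r"[\u4e00-\u9fff]")


class GeneratorStandIn:
    """Hypothetical stand-in whose __init__ fails like a missing LLM key would."""

    def __init__(self) -> None:
        raise RuntimeError("LLM key unavailable")

    def _get_system_prompt(self, is_individual: bool) -> str:
        # English body plus the zh locale postfix, as under the default locale.
        return "You are an expert in persona generation.\n\n请使用中文回答。"


def body_without_postfix(prompt: str, postfix: str) -> str:
    """Strip a trailing locale postfix so only the prompt body is audited."""
    if postfix and prompt.endswith(postfix):
        return prompt[: -len(postfix)]
    return prompt


# __new__ allocates the instance without running __init__.
gen = GeneratorStandIn.__new__(GeneratorStandIn)
prompt = gen._get_system_prompt(True)
body = body_without_postfix(prompt, "请使用中文回答。")
```

Auditing the body separately from the postfix keeps the check meaningful under zh, where the postfix is expected to contain Chinese.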
  • 6. Verify locale-driven output language under both en and zh

    • With the thread-local locale forced via set_locale("en"), render each of the three builders against representative inputs and confirm: each output contains zero CJK characters; each ends with the English locale postfix "Please respond in English."; the gender enum constraint appears as English "male" / "female" (individual) or "other" (group)
    • With set_locale("zh"), render the same three builders and confirm: the per-prompt body remains English-only (the translated base prompt does not depend on locale); each ends with the Chinese locale postfix "请使用中文回答。"; the gender enum constraint still appears as the English literal values
    • Optionally, with a configured LLM key, run OasisProfileGenerator().generate_profile_from_entity(...) end-to-end under each locale against a synthetic EntityNode and spot-check that the produced bio, persona, profession are English under en and Chinese under zh, while gender is one of the three English enum literals under both
    • Observable completion: the locale-en rendering is CJK-free in the prompt body and ends with the English locale postfix; the locale-zh rendering preserves the prompt body in English and ends with the Chinese locale postfix; if the LLM round-trip is exercised, results are recorded in the PR description
    • Requirements: 4.3, 4.4, 4.5
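The locale-forcing checks in Task 6 rely on a thread-local locale. The sketch below is a self-contained model of that behaviour, not the code in backend/app/utils/locale.py: the three functions are re-implemented here as assumptions, with the two postfix strings taken from the plan and zh assumed as the default.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

_state = threading.local()
_POSTFIX = {"en": "Please respond in English.", "zh": "请使用中文回答。"}


def set_locale(locale: str) -> None:
    _state.locale = locale


def get_locale() -> str:
    # Assumed default: zh when the current thread never set a locale.
    return getattr(_state, "locale", "zh")


def get_language_instruction() -> str:
    return _POSTFIX[get_locale()]


def render_in_worker(locale: str) -> str:
    # Worker threads do not inherit thread-local state, so the captured
    # locale is set again inside the worker.
    set_locale(locale)
    return f"Generate a persona.\n\n{get_language_instruction()}"


set_locale("en")
current_locale = get_locale()  # captured in the calling thread

with ThreadPoolExecutor(max_workers=1) as pool:
    propagated = pool.submit(render_in_worker, current_locale).result()

# A fresh worker thread without propagation falls back to the default.
with ThreadPoolExecutor(max_workers=1) as pool:
    unpropagated = pool.submit(get_language_instruction).result()
```

The second pool shows why the capture-and-set plumbing in generate_profiles_for_entities has to stay intact: without it, a worker would render the default-locale postfix.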
  • 7. Final CJK regression sweep on the three builders

    • Run a regex audit limited to the three method bodies (_get_system_prompt, _build_individual_persona_prompt, _build_group_persona_prompt) using the project-level CJK guard regex ([一-鿿]) and confirm zero matches inside their string literals
    • Run a CJK audit on the rendered output of the three builders for representative inputs and confirm zero matches in the prompt body (the locale postfix is excluded — its Chinese form is a deliberate kept use under zh)
    • Confirm the file-level git grep -nE '[\\x{4e00}-\\x{9fff}]' -- backend/app/services/oasis_profile_generator.py output still flags only known out-of-scope locations: docstrings, comments, logger keys, rule-based fallback country "中国" defaults, and resilience-helper Chinese fragments — and does not flag any line inside the three translated method bodies
    • Observable completion: the targeted regex audit returns zero matches inside the three method bodies; the file-level audit's residual CJK lines all fall outside the three method bodies and match the out-of-scope inventory in design.md § Boundary Commitments → Out of Boundary
    • Requirements: 1.1, 2.8, 3.8, 8.1, 8.2, 8.3
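A targeted audit limited to the three method bodies could be sketched with Python's ast module, as below. This is one possible approach rather than the project's tooling: cjk_literals and SAMPLE are hypothetical, and docstrings are skipped because the plan treats them as out of scope.

```python
import ast
import re
from typing import List, Tuple

CJK = re.compile(r"[\u4e00-\u9fff]")
TARGETS = {
    "_get_system_prompt",
    "_build_individual_persona_prompt",
    "_build_group_persona_prompt",
}


def cjk_literals(source: str) -> List[Tuple[str, int]]:
    """Return (method name, line) for CJK string literals in the target methods."""
    hits: List[Tuple[str, int]] = []
    for node in ast.walk(ast.parse(source)):
        if not (isinstance(node, ast.FunctionDef) and node.name in TARGETS):
            continue
        body = node.body
        # Skip a leading docstring; docstrings are out of scope for this audit.
        if ast.get_docstring(node, clean=False) is not None:
            body = body[1:]
        for stmt in body:
            for sub in ast.walk(stmt):
                # f-string literal parts also appear as ast.Constant nodes.
                if (
                    isinstance(sub, ast.Constant)
                    and isinstance(sub.value, str)
                    and CJK.search(sub.value)
                ):
                    hits.append((node.name, getattr(sub, "lineno", node.lineno)))
    return hits


SAMPLE = '''
class G:
    def _get_system_prompt(self, is_individual):
        """系统提示词"""
        base_prompt = "你是专家"
        return f"{base_prompt} ok"

    def _try_fix_json(self, entity_name):
        return f"{entity_name}是一个"
'''

hits = cjk_literals(SAMPLE)
flagged_methods = [name for name, _ in hits]
```

In the sample, the untranslated literal in _get_system_prompt is flagged, while the docstring and the out-of-scope _try_fix_json fragment are ignored.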