MicroFish/.kiro/specs/i18n-oasis-profile-generato.../tasks.md

67 lines
14 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Implementation Plan
- [x] 1. Translate the system-prompt builder to English
- Replace the Chinese `base_prompt` literal inside `_get_system_prompt` (currently `"你是社交媒体用户画像生成专家。…"` at line ~664) with an English rendering that conveys the same role and intent: identifies the model as an expert in social-media user-persona generation, asks for detailed and realistic personas suitable for opinion-simulation that faithfully reflect existing real-world conditions, mandates valid JSON output, and forbids unescaped newlines inside string values
- Preserve the assembled return shape `f"{base_prompt}\n\n{get_language_instruction()}"` exactly — the call to `get_language_instruction()` is unchanged in name and position
- Preserve the method signature `_get_system_prompt(self, is_individual: bool) -> str`; do not branch on `is_individual` (current behaviour preserved)
- Observable completion: `_get_system_prompt(True)` and `_get_system_prompt(False)` both return non-empty English strings ending with the per-locale postfix from `get_language_instruction()`; the `base_prompt` body contains zero CJK characters
- _Requirements: 1.1, 1.2, 1.3, 1.4_
- [x] 2. Translate the individual-persona user-message builder to English
- Replace the Chinese f-string body inside `_build_individual_persona_prompt` (currently lines ~680714) with an English rendering structured as: a lead sentence requesting a detailed social-media persona faithful to existing reality; an entity-context block with English labels for `entity_name`, `entity_type`, `entity_summary`, `entity_attributes`; a `Context information:` block; a `Generate JSON with the following fields:` enumeration of the eight output keys (`bio`, `persona`, `age`, `gender`, `mbti`, `country`, `profession`, `interested_topics`); and a trailing `Important:` rules block
- Translate the field-level descriptions verbatim in spirit: `bio` ≈ 200 chars; `persona` ≈ 2000 chars covering basic info (age, profession, education, location), background (notable experience, event association, social ties), personality (MBTI, core traits, emotional expression), social-media behaviour (posting frequency, content preferences, interaction style, language traits), stance (attitudes toward the topic, emotional triggers), unique features (catchphrases, special experiences, hobbies), and personal memory (the entity's relation to the event and prior actions/reactions); `age` integer; `gender` MUST be the literal `"male"` or `"female"`; `mbti` four-letter type; `country` country name; `profession`; `interested_topics` array
- Translate the trailing rules block to English while keeping every locale-independent constraint intact: all values are strings or numbers; `persona` is a single coherent text without unescaped newlines; the inline `{get_language_instruction()}` call remains followed by the parenthetical reminder that `gender` MUST use the English values `"male"` / `"female"`; content stays consistent with the entity; `age` MUST be a valid integer
- Replace the `attrs_str` and `context_str` Chinese fallback defaults with English: `"无"``"None"` (used when `entity_attributes` is empty/falsy) and `"无额外上下文"``"No additional context"` (used when `context` is empty/falsy)
- Drop the country-language hint `(使用中文,如"中国"` so `get_language_instruction()` steers the country language; preserve the country line as a neutral `country: country name` entry
- Preserve every f-string interpolation by name and position: `{entity_name}`, `{entity_type}`, `{entity_summary}`, `{attrs_str}`, `{context_str}`, `{get_language_instruction()}`
- Preserve the `context[:3000]` truncation behaviour and the method signature `_build_individual_persona_prompt(self, entity_name: str, entity_type: str, entity_summary: str, entity_attributes: Dict[str, Any], context: str) -> str`
- Observable completion: calling `_build_individual_persona_prompt("Alice", "Student", "summary", {"k": "v"}, "ctx")` returns a non-empty English string with all six interpolations resolved, with zero CJK characters in any literal contributed by this method, and the string contains the `gender` enum lock-in `"male"` / `"female"` exactly once
- _Requirements: 2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 2.7, 2.8, 4.1, 4.5_
- [x] 3. Translate the group/institution-persona user-message builder to English
- Replace the Chinese f-string body inside `_build_group_persona_prompt` (currently lines ~729762) with an English rendering structured the same way as Task 2 but adapted for institutional voice: lead sentence requesting a detailed social-media account profile for an institution/group faithful to existing reality; entity-context block; `Context information:` block; `Generate JSON with the following fields:` enumeration of the eight output keys; trailing `Important:` rules block
- Translate the field-level descriptions verbatim in spirit: `bio` ≈ 200 chars in an official-account voice; `persona` ≈ 2000 chars covering institutional basics (formal name, type, founding background, primary functions), account positioning (account type, target audience, core function), voice (language traits, common phrasing, taboo topics), publishing pattern (content types, publishing frequency, active hours), stance (official position on the core topic, controversy-handling style), special notes (group portrait represented, operational habits), and institutional memory (the institution's relation to the event and prior actions/reactions); `age` MUST be the integer `30`; `gender` MUST be the literal `"other"`; `mbti` four-letter type characterizing account voice; `country`; `profession` describes institutional function; `interested_topics` array
- Translate the trailing rules block to English while keeping every locale-independent constraint intact: all values are strings or numbers, no `null` allowed; `persona` is a single coherent text without unescaped newlines; the inline `{get_language_instruction()}` call remains followed by the parenthetical reminder that `gender` MUST use the English value `"other"`; `age` MUST be the integer `30` and `gender` MUST be the string `"other"`; account voice must match identity positioning
- Replace the `attrs_str` and `context_str` Chinese fallback defaults with the same English replacements applied in Task 2 (`"None"` and `"No additional context"`)
- Drop the country-language hint as in Task 2
- Preserve every f-string interpolation by name and position: `{entity_name}`, `{entity_type}`, `{entity_summary}`, `{attrs_str}`, `{context_str}`, `{get_language_instruction()}`
- Preserve the `context[:3000]` truncation behaviour and the method signature `_build_group_persona_prompt(self, entity_name: str, entity_type: str, entity_summary: str, entity_attributes: Dict[str, Any], context: str) -> str`
- Observable completion: calling `_build_group_persona_prompt("ACME Corp", "Organization", "summary", {"k": "v"}, "ctx")` returns a non-empty English string with all six interpolations resolved, with zero CJK characters in any literal contributed by this method, and the string contains both the `age == 30` lock-in and the `gender == "other"` lock-in
- _Requirements: 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7, 3.8, 4.1, 4.5_
- [x] 4. Confirm boundary commitments around the translation
- Confirm every existing `get_language_instruction()` call site is preserved verbatim: the system-prompt assembly inside `_get_system_prompt`, the inline call inside the trailing rules block of `_build_individual_persona_prompt`, and the inline call inside the trailing rules block of `_build_group_persona_prompt`
- Confirm the locale-thread plumbing in `generate_profiles_for_entities` (capture `current_locale = get_locale()` at line ~910 and `set_locale(current_locale)` inside the worker at line ~914) is byte-identical
- Confirm the public signatures of `OasisProfileGenerator.__init__`, `generate_profile_from_entity`, `generate_profiles_for_entities`, `set_graph_id`, and the private helpers `_call_llm_with_retry`, `_generate_profile_rule_based`, `_print_generated_profile`, `_fix_truncated_json`, `_try_fix_json`, `_save_twitter_csv`, `_save_reddit_json`, `_generate_username` are unchanged
- Confirm the `OasisAgentProfile` dataclass field set, default values, and the `to_reddit_format`, `to_twitter_format`, `to_full_dict` serializers are unchanged
- Confirm class constants `MBTI_TYPES`, `COUNTRIES`, `INDIVIDUAL_ENTITY_TYPES`, `GROUP_ENTITY_TYPES` are unchanged
- Confirm the LLM invocation parameters at the call site that consumes the translated prompts (`response_format={"type": "json_object"}`, `temperature=0.7 - (attempt * 0.1)`, `max_attempts=3`) are unchanged
- Confirm `_fix_truncated_json` and `_try_fix_json` (including their Chinese persona fragments such as `f"{entity_name}是一个{entity_type}。"`) are not modified — these are runtime data fallbacks, not prompts, and are out of scope
- Confirm `_generate_profile_rule_based` is not modified — including its Chinese country defaults `"中国"` at lines ~807 and ~819
- Confirm `backend/app/utils/locale.py`, `/locales/languages.json`, `/locales/en.json`, and `/locales/zh.json` are not modified
- Confirm `logger.warning(...)`, `logger.info(...)`, `logger.error(...)`, the print banner at line ~945, module / class / method docstrings, and inline comments in `oasis_profile_generator.py` are not modified (owned by issues #6 and #7)
- Confirm `backend/scripts/test_profile_format.py`, `backend/pyproject.toml`, `backend/uv.lock`, and any file outside `backend/app/services/oasis_profile_generator.py` are not modified
- Observable completion: a `git diff` review against `main` shows changes only inside `backend/app/services/oasis_profile_generator.py`, only inside `_get_system_prompt`, `_build_individual_persona_prompt`, `_build_group_persona_prompt`, and the surrounding lines (method headers, neighbouring methods) are byte-identical
- _Requirements: 1.4, 2.6, 3.6, 4.1, 4.2, 4.6, 5.1, 5.2, 5.3, 5.4, 6.1, 6.3, 8.1, 8.2, 8.3, 8.4, 8.5, 8.6_
- [x] 5. Verify smoke import and OASIS profile-format pytest
- Run `cd backend && uv run python -c "from app.services.oasis_profile_generator import OasisProfileGenerator, OasisAgentProfile"` and confirm it exits 0 (catches f-string syntax errors)
- Run `cd backend && uv run python -m pytest backend/scripts/test_profile_format.py` (or equivalent invocation per project convention) and confirm it passes — the test does not exercise prompts, so a pure-translation diff must keep it green
- Construct an instance of `OasisProfileGenerator` (using `OasisProfileGenerator.__new__(OasisProfileGenerator)` to skip `__init__` if the LLM key is unavailable, mirroring the pattern in `test_profile_format.py`) and confirm `_get_system_prompt(True)`, `_build_individual_persona_prompt("Alice", "Student", "summary", {"k": "v"}, "ctx")`, and `_build_group_persona_prompt("ACME", "Organization", "summary", {"k": "v"}, "ctx")` each return a string with zero CJK matches against the regex `[一-鿿]`
- Observable completion: smoke import exits 0; pytest passes with zero regressions; the three prompt-builder calls each produce English-only output under the default `zh` locale (the `get_language_instruction()` postfix at the end is the only place where Chinese is allowed to appear, and only when locale is `zh`)
- _Requirements: 6.4, 7.1, 7.2, 7.3, 7.4_
- [x] 6. Verify locale-driven output language under both `en` and `zh`
- With the thread-local locale forced via `set_locale("en")`, render each of the three builders against representative inputs and confirm: each output contains zero CJK characters; each ends with the English locale postfix `"Please respond in English."`; the `gender` enum constraint appears as English `"male"` / `"female"` (individual) or `"other"` (group)
- With `set_locale("zh")`, render the same three builders and confirm: the per-prompt body remains English-only (the translated base prompt does not depend on locale); each ends with the Chinese locale postfix `"请使用中文回答。"`; the `gender` enum constraint still appears as the English literal values
- Optionally, with a configured LLM key, run `OasisProfileGenerator().generate_profile_from_entity(...)` end-to-end under each locale against a synthetic `EntityNode` and spot-check that the produced `bio`, `persona`, `profession` are English under `en` and Chinese under `zh`, while `gender` is one of the three English enum literals under both
- Observable completion: the locale-`en` rendering is CJK-free in the prompt body and ends with the English locale postfix; the locale-`zh` rendering preserves the prompt body in English and ends with the Chinese locale postfix; if the LLM round-trip is exercised, results are recorded in the PR description
- _Requirements: 4.3, 4.4, 4.5_
- [x] 7. Final CJK regression sweep on the three builders
- Run a regex audit limited to the three method bodies (`_get_system_prompt`, `_build_individual_persona_prompt`, `_build_group_persona_prompt`) using the project-level CJK guard regex (`[一-鿿]`) and confirm zero matches inside their string literals
- Run a CJK audit on the rendered output of the three builders for representative inputs and confirm zero matches in the prompt body (the locale postfix is excluded — its Chinese form is a deliberate kept use under `zh`)
- Confirm the file-level `git grep -nE '[\\x{4e00}-\\x{9fff}]' -- backend/app/services/oasis_profile_generator.py` output still flags only known out-of-scope locations: docstrings, comments, logger keys, rule-based fallback country `"中国"` defaults, and resilience-helper Chinese fragments — and does not flag any line inside the three translated method bodies
- Observable completion: the targeted regex audit returns zero matches inside the three method bodies; the file-level audit's residual CJK lines all fall outside the three method bodies and match the out-of-scope inventory in `design.md` § Boundary Commitments → Out of Boundary
- _Requirements: 1.1, 2.8, 3.8, 8.1, 8.2, 8.3_