# Design Document — i18n-oasis-profile-generator-prompts
## Overview
**Purpose**: Translate the Chinese prompt strings, context-builder section labels, fallback persona templates, and console-output formatting in `backend/app/services/oasis_profile_generator.py` to English while preserving every functional contract — LLM JSON output schema, the `_normalize_gender` mapping that must continue to accept Chinese gender values, the `_generate_profile_rule_based` default `country: "中国"` data value, all f-string interpolations, and the `get_language_instruction()` locale-postfix mechanism. The goal is to remove the Chinese-language base-prompt and context-label bias that currently leaks Chinese structure and word choice into OASIS profile output even when `Accept-Language: en`.
**Users**: MiroFish operators running the Step 2 OASIS profile generation under any locale; downstream OASIS / CAMEL-OASIS consumers of the agent JSON / CSV produced by `OasisProfileGenerator`.
**Impact**: Replaces approximately one base-prompt string, two large user-message templates, four context-builder section labels, three fallback persona templates, and ten console-output strings with English equivalents inside one file. No API surface change. No new dependencies. No new files. Callers (`backend/app/api/simulation.py`, etc.) and OASIS consumers are unaffected.
### Goals
- Zero CJK characters in any prompt string literal contributed by `oasis_profile_generator.py` to the system prompt, the user message, or the context block.
- Zero CJK characters in any console-output literal in `_print_generated_profile` and the surrounding banners.
- English `bio` / `persona` output under `Accept-Language: en`.
- Continued Chinese `bio` / `persona` output under `Accept-Language: zh`, of equivalent quality to the pre-change behaviour.
- No diff to public signatures, dataclass schema, LLM-call parameters, or call sites.
### Non-Goals
- Externalizing prompts to `/locales/*.json` (out of scope per ticket and consistent with `i18n-ontology-generator-prompts`).
- Translating logger calls in this file (covered by issue #6).
- Translating module/class/method docstrings or inline comments in this file (covered by issue #7).
- Refactoring the OASIS profile JSON schema, the OASIS adapter, or the simulation flow.
- Modifying the `_normalize_gender` mapping table (it must keep accepting Chinese gender keys).
- Modifying the `_generate_profile_rule_based` default `"中国"` country value (data, not prompt).
- Modifying the `ValueError("LLM_API_KEY 未配置")` raise (covered by issue #6).
- Modifying `backend/app/utils/locale.py`, the locale registries, or any non-target file.
## Boundary Commitments
### This Spec Owns
- The English content of the `base_prompt` string in `OasisProfileGenerator._get_system_prompt` (line 664).
- The English content of every string literal in `OasisProfileGenerator._build_individual_persona_prompt` (lines 677-714).
- The English content of every string literal in `OasisProfileGenerator._build_group_persona_prompt` (lines 726-762).
- The English content of the section-label literals embedded in `OasisProfileGenerator._search_zep_for_entity` (lines 384, 390, 392) and `OasisProfileGenerator._build_entity_context` (lines 422, 438, 440, 443, 463, 472, 475).
- The English content of the fallback persona templates in `OasisProfileGenerator._generate_profile_with_llm` (line 547) and `OasisProfileGenerator._try_fix_json` (lines 644, 659).
- The English content of the no-attributes / no-context placeholder literals (`"无"`, `"无额外上下文"`) at lines 677, 678, 726, 727.
- The English content of every string literal in `OasisProfileGenerator._print_generated_profile` (lines 1011, 1017, 1019, 1022, 1025, 1026, 1027, 1028) and the surrounding banners in `OasisProfileGenerator.generate_profiles_from_entities` (lines 945, 1001).
### Out of Boundary
- Locale resolution machinery (`backend/app/utils/locale.py`).
- Per-locale `llmInstruction` definitions (`/locales/languages.json`).
- Reasoning-model output stripping (`backend/app/utils/llm_client.py`).
- All `logger.*` calls (already keyed via `t("log.profile_generator.*")`; covered by issue #6).
- Module / class / method docstrings and inline comments (covered by issue #7), including the inline comments at lines 65, 93, 641, 804-807, 816-819.
- The `_normalize_gender` mapping table (lines 1123-1132), which must continue to accept Chinese gender keys from upstream.
- The hard-coded `country: "中国"` default in `_generate_profile_rule_based` (lines 807, 819) — this is a data value, not a prompt.
- The `ValueError("LLM_API_KEY 未配置")` raise (line 194) — covered by issue #6.
- All callers of `OasisProfileGenerator`, including `backend/app/api/simulation.py`.
- Tests, scripts, and frontend code.
### Allowed Dependencies
- Existing `get_language_instruction`, `get_locale`, `set_locale`, `t` imports from `..utils.locale` (already imported; unchanged).
- Existing `OpenAI` SDK invocation (unchanged).
- No new imports.
### Revalidation Triggers
The following changes elsewhere would invalidate this design and require revisiting the prompt:
- A change to the JSON contract emitted by the LLM (`bio`, `persona`, `age`, `gender`, `mbti`, `country`, `profession`, `interested_topics`).
- A change to `OasisAgentProfile` field semantics.
- A change to `get_language_instruction()` semantics or the per-locale `llmInstruction` strings.
- A change to OASIS / CAMEL-OASIS profile field expectations (e.g. if `gender` accepts more than `male` / `female` / `other`).
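The JSON contract named in the first trigger can be checked mechanically. The following is a minimal sketch, assuming the LLM response arrives as a JSON string; `missing_profile_keys` is a hypothetical helper for illustration and does not exist in the target file.

```python
import json

# The eight keys named in the JSON contract above.
REQUIRED_KEYS = (
    "bio", "persona", "age", "gender",
    "mbti", "country", "profession", "interested_topics",
)


def missing_profile_keys(raw_json: str) -> list[str]:
    """Return the contract keys absent from an LLM response (hypothetical helper)."""
    data = json.loads(raw_json)
    return [key for key in REQUIRED_KEYS if key not in data]


# Illustrative payload; the values are made up for the example.
sample = json.dumps({
    "bio": "Short bio", "persona": "Longer persona text", "age": 30,
    "gender": "other", "mbti": "ISTJ", "country": "China",
    "profession": "Journalist", "interested_topics": ["media"],
})
```

A check like this could run against one response per locale after the translation lands, since the key set must be identical under `en` and `zh`.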
## Architecture
### Existing Architecture Analysis
`OasisProfileGenerator` lives in `backend/app/services/`, follows the in-process service pattern with bounded thread-pool fan-out for batched profile generation, and is invoked from `backend/app/api/simulation.py` inside a background `Task`. It depends on:
- `OpenAI` SDK for the LLM call.
- `GraphitiAdapter` (legacy `zep_client` field name) for the Zep / Graphiti graph search.
- `get_language_instruction()` for locale steering.
- `t()` for already-keyed log strings.
The relevant flow is:
1. The Flask handler resolves the request locale via `Accept-Language`; the locale is propagated to thread-pool workers via the `set_locale(current_locale)` capture in `generate_profiles_from_entities` (line 914).
2. For each entity, `_build_entity_context()` is called: it composes a context block by concatenating headed sub-sections (entity attributes, related facts/edges, related node summaries, Graphiti-search facts, Graphiti-search nodes). Some of these labels are currently in Chinese.
3. The context string is interpolated into the user-message template by either `_build_individual_persona_prompt` or `_build_group_persona_prompt`. Both templates are currently in Chinese, with English `gender` token directives interleaved.
4. The system prompt is built by `_get_system_prompt`: a Chinese base prompt followed by the locale-appropriate `get_language_instruction()`.
5. The two messages are sent to `chat.completions.create` with `response_format={"type": "json_object"}`. The result flows through the `json.loads` → `_try_fix_json` → `_fix_truncated_json` fallback chain. Synthesized fallback personas use the Chinese template `f"{entity_name}是一个{entity_type}。"` if the LLM result is unusable.
6. After per-profile completion, `_print_generated_profile` writes a Chinese-headed banner to stdout, and `generate_profiles_from_entities` writes Chinese batch banners.
This design preserves all of the above structurally. The change is purely lexical inside the seven regions of one file.
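Step 4 of the flow (base prompt followed by the locale postfix) can be sketched as below. This is an illustration under stated assumptions: the instruction strings and the English base-prompt wording are invented for the example, and the stand-in `get_language_instruction` takes an explicit `locale` argument, whereas the real function in `backend/app/utils/locale.py` is called without arguments.

```python
# Hypothetical per-locale instructions; the real llmInstruction strings live in
# /locales/languages.json and their wording is not shown in this document.
_HYPOTHETICAL_INSTRUCTIONS = {
    "en": "Please respond in English.",
    "zh": "请使用中文回答。",
}


def get_language_instruction(locale: str) -> str:
    """Stand-in for the real locale helper; falls back to English."""
    return _HYPOTHETICAL_INSTRUCTIONS.get(locale, _HYPOTHETICAL_INSTRUCTIONS["en"])


def build_system_prompt(locale: str) -> str:
    # English base prompt (illustrative wording), then the locale postfix.
    base_prompt = (
        "You are an expert at writing social media user personas. "
        "Return valid JSON only."
    )
    return f"{base_prompt}\n\n{get_language_instruction(locale)}"
```

The point of the sketch is that the base prompt is locale-neutral English while the postfix alone steers the output language, which is why the call site of `get_language_instruction()` has to stay where it is.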
### Architecture Pattern & Boundary Map
```mermaid
graph TB
Caller[simulation.py handler]
Generator[OasisProfileGenerator]
Locale[locale.get_language_instruction]
Graph[GraphitiAdapter graph.search]
LLM[OpenAI chat.completions]
Caller -->|generate_profiles_from_entities| Generator
Generator -->|build context block| Generator
Generator -->|read locale postfix| Locale
Generator -->|search facts/nodes| Graph
Generator -->|JSON request| LLM
LLM -->|raw JSON| Generator
Generator -->|OasisAgentProfile| Caller
```
**Architecture Integration**:
- Selected pattern: **In-place lexical translation** of seven regions of an existing service. No structural change.
- Domain/feature boundaries: locale machinery vs. prompt assembly vs. LLM transport remain cleanly separated.
- Existing patterns preserved: prompt-as-f-string user-message construction; Chinese-keyed `_normalize_gender` mapping; `t(...)` for log strings; `get_language_instruction()` postfix concatenation.
- New components rationale: none — no new components.
- Steering compliance: matches the established `i18n-*-prompts` family pattern (issues #2, #3, #4, #5) of in-place translation rather than `t()` keying for prompt bodies. Respects the steering note that "existing files mix English and Chinese in comments/docstrings — preserve both; do not translate one into the other unless asked." This ticket is the explicit ask for prompt strings, scoped to exclude comments/docstrings.
### Technology Stack
| Layer | Choice / Version | Role in Feature | Notes |
|-------|------------------|-----------------|-------|
| Backend / Services | Python 3.11+ | Hosts `OasisProfileGenerator` | Existing, unchanged. |
| Backend / Services | `openai` SDK | Issues the prompt; returns JSON | Existing, unchanged. |
| Backend / Services | `backend/app/utils/locale.py` | Resolves `Accept-Language` → `llmInstruction` postfix | Existing, unchanged. |
| Backend / Services | `GraphitiAdapter` | Provides Graphiti graph search facts/nodes | Existing, unchanged. |
No new dependencies. No version changes.
## File Structure Plan
### Modified Files
- `backend/app/services/oasis_profile_generator.py` — Replace the body of `_get_system_prompt` `base_prompt`; replace every Chinese string literal in `_build_individual_persona_prompt` and `_build_group_persona_prompt` with English equivalents; replace the four section labels in `_search_zep_for_entity` and the six section labels in `_build_entity_context`; replace the three fallback persona templates; replace the two `"无"` / `"无额外上下文"` placeholders; replace the console-output literals in `_print_generated_profile` and the two `print(...)` banners in `generate_profiles_from_entities`. Preserve every other character of the file.
No new files. No deletions. No moves.
## System Flows
The control-flow diagram in *Architecture Pattern & Boundary Map* covers the relevant flow; no additional diagrams are needed for this string-literal change.
## Requirements Traceability
| Requirement | Summary | Components | Interfaces | Flows |
|-------------|---------|------------|------------|-------|
| 1.1-1.4 | English `_get_system_prompt` `base_prompt`; preserve `get_language_instruction()` site | OasisProfileGenerator → `_get_system_prompt` | None changed | Architecture diagram |
| 2.1-2.9 | English `_build_individual_persona_prompt`; preserve interpolations and JSON keys | OasisProfileGenerator → `_build_individual_persona_prompt` | f-string interpolation | n/a |
| 3.1-3.9 | English `_build_group_persona_prompt`; preserve fixed-value rules and interpolations | OasisProfileGenerator → `_build_group_persona_prompt` | f-string interpolation | n/a |
| 4.1-4.10 | English context-builder section labels | OasisProfileGenerator → `_search_zep_for_entity`, `_build_entity_context` | Prompt-only | n/a |
| 5.1-5.3 | English fallback persona templates | OasisProfileGenerator → `_generate_profile_with_llm`, `_try_fix_json` | None changed | n/a |
| 6.1-6.7 | English console-output formatting | OasisProfileGenerator → `_print_generated_profile`, `generate_profiles_from_entities` | None changed | n/a |
| 7.1-7.4 | Locale switching preserved via `get_language_instruction()` | OasisProfileGenerator + Locale | `get_language_instruction()` | Architecture diagram |
| 8.1-8.6 | Public API and call-site stability; preserve `_normalize_gender` and `country: "中国"` data default | OasisProfileGenerator (signatures, dataclass) | Public surface | n/a |
| 9.1-9.3 | Reasoning-model compatibility | OasisProfileGenerator → `chat.completions.create` + `_try_fix_json` | OpenAI SDK | Architecture diagram |
| 10.1-10.7 | Out-of-scope surfaces untouched | OasisProfileGenerator (boundary commitment) | n/a | n/a |
## Components and Interfaces
| Component | Domain/Layer | Intent | Req Coverage | Key Dependencies (P0/P1) | Contracts |
|-----------|--------------|--------|--------------|--------------------------|-----------|
| OasisProfileGenerator (modified) | Backend / Service | Render English profile-generation prompts and context labels; preserve all behaviour | 1.1-10.7 | `OpenAI.chat.completions.create` (P0), `get_language_instruction` (P0), `GraphitiAdapter.graph.search` (P1), `_normalize_gender` (P0) | Service |
### Backend / Service
#### OasisProfileGenerator (modified)
| Field | Detail |
|-------|--------|
| Intent | Translate prompt strings, context labels, fallback persona templates, and console output to English while preserving every functional contract. |
| Requirements | 1.1, 1.2, 1.3, 1.4, 2.1-2.9, 3.1-3.9, 4.1-4.10, 5.1-5.3, 6.1-6.7, 7.1-7.4, 8.1-8.6, 9.1-9.3, 10.1-10.7 |
**Responsibilities & Constraints**
- Owns: the English wording of the system prompt body, the two user-message templates, the context-builder section labels, the fallback persona templates, the no-attributes / no-context placeholders, and the console-output formatting.
- Domain boundary: prompt content and proximate console output only. Does not own locale resolution, transport, validation, or data values like the OASIS `country` default.
- Invariants:
- All seven owned regions after translation MUST contain zero CJK characters.
- The translated user-message templates MUST present the same eight required JSON keys: `bio`, `persona`, `age`, `gender`, `mbti`, `country`, `profession`, `interested_topics`.
- The translated individual-persona template MUST require `gender ∈ {"male", "female"}` and `age` to be a valid integer.
- The translated group-persona template MUST require `age == 30` and `gender == "other"`.
- The translated user-message templates MUST preserve the f-string interpolations: `{entity_name}`, `{entity_type}`, `{entity_summary}`, `{attrs_str}`, `{context_str}`, `{get_language_instruction()}`.
- The translated context-builder labels MUST preserve the section structure (heading + bulleted body).
- The translated fallback persona templates MUST preserve the `entity_summary or template` priority order.
- The call to `get_language_instruction()` MUST remain at its current locations.
- The call to `self.client.chat.completions.create(...)` MUST remain unchanged.
- All public signatures, dataclass schema, and the private helper signatures MUST remain unchanged.
- All `logger.*` calls (already keyed) and inline comments and docstrings in this file MUST remain unchanged (out of scope per #6 and #7).
- The `_normalize_gender` mapping table MUST remain unchanged.
- The rule-based `country: "中国"` default MUST remain unchanged.
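To illustrate why the `_normalize_gender` table stays untouched even though the prompts become English, here is a minimal sketch of a Chinese-keyed normalizer. The table contents and the `"other"` default are assumptions for illustration; the actual mapping in the target file is not reproduced in this document.

```python
# Hypothetical mapping: Chinese and English inputs both resolve to the three
# English tokens the OASIS profile expects.
_GENDER_MAP = {
    "男": "male",
    "女": "female",
    "其他": "other",
    "male": "male",
    "female": "female",
    "other": "other",
}


def normalize_gender(value) -> str:
    """Map an upstream gender value to male / female / other (assumed default: other)."""
    if not value:
        return "other"
    return _GENDER_MAP.get(str(value).strip().lower(), "other")
```

Under the `zh` locale the LLM may still emit Chinese gender values despite the English-token directive, so the Chinese keys are data handling rather than prompt text.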
**Dependencies**
- Inbound: `backend/app/api/simulation.py` — production caller (P0).
- Outbound: `backend/app/utils/locale.get_language_instruction` — locale postfix (P0); `backend/app/utils/locale.t` — already-keyed log strings (P0); `backend/app/services/graphiti_adapter.GraphitiAdapter.graph.search` — facts/nodes retrieval (P1); `OpenAI.chat.completions.create` — JSON LLM transport (P0).
- External: none.
**Contracts**: Service [x] / API [ ] / Event [ ] / Batch [ ] / State [ ]
##### Service Interface
The public Python interface is unchanged. Representative signatures:
```python
class OasisProfileGenerator:
    def __init__(
        self,
        api_key: Optional[str] = None,
        base_url: Optional[str] = None,
        model_name: Optional[str] = None,
        zep_api_key: Optional[str] = None,
        graph_id: Optional[str] = None,
    ) -> None: ...

    def generate_profile_from_entity(
        self,
        entity: EntityNode,
        user_id: int,
        use_llm: bool = True,
    ) -> OasisAgentProfile: ...

    def generate_profiles_from_entities(
        self,
        entities: List[EntityNode],
        use_llm: bool = True,
        progress_callback: Optional[callable] = None,
        graph_id: Optional[str] = None,
        parallel_count: int = 5,
        realtime_output_path: Optional[str] = None,
        output_platform: str = "reddit",
    ) -> List[OasisAgentProfile]: ...

    def save_profiles(
        self,
        profiles: List[OasisAgentProfile],
        file_path: str,
        platform: str = "reddit",
    ) -> None: ...
```
- Preconditions: a configured LLM provider; a configured Graphiti / Neo4j graph; a non-empty `entities` list when batching.
- Postconditions: `OasisAgentProfile` instances with English `bio` and `persona` under locale `en`, Chinese under locale `zh`, and structurally equivalent across locales.
- Invariants: see *Responsibilities & Constraints*.
**Implementation Notes**
- **Integration**: No new imports. No call-site changes. The diff is confined to seven regions of one file.
- **Validation**: After implementation, run a targeted regex check (`[\u4e00-\u9fff]`) over the seven owned regions to confirm zero CJK; smoke-test `_build_individual_persona_prompt(...)` and `_build_group_persona_prompt(...)` with representative inputs to confirm interpolations still work; round-trip a single profile end-to-end under both `en` and `zh` locales.
- **Risks**: English-base bias on Chinese-locale output (mitigated by the `llmInstruction` postfix already present in both system and user messages). Reduced LLM compliance with `gender ∈ {male, female}` for individual entities (mitigated by retaining the explicit English-token directive verbatim in the rules block).
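The regex check from the validation note could look roughly like the sketch below. It scans whole lines inside caller-supplied line ranges, so it will also flag out-of-scope inline comments that happen to sit on an owned line; `cjk_lines` and the example ranges are hypothetical, not part of the codebase.

```python
import re

# CJK Unified Ideographs block, U+4E00 through U+9FFF.
CJK = re.compile(r"[\u4e00-\u9fff]")


def cjk_lines(source: str, owned_ranges: list[tuple[int, int]]) -> list[int]:
    """Return 1-based line numbers inside owned_ranges that still contain CJK."""
    hits = []
    for number, line in enumerate(source.splitlines(), start=1):
        in_scope = any(start <= number <= end for start, end in owned_ranges)
        if in_scope and CJK.search(line):
            hits.append(number)
    return hits


# Tiny synthetic module: line 4 stands in for the out-of-scope country default.
example_source = 'a = "无"\nb = "none"\n# 注释\nc = "中国"\n'
```

Passing only the owned ranges keeps the check from tripping on `_normalize_gender` and the `country: "中国"` default, which must keep their Chinese content.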
## Data Models
No data-model changes. The `OasisAgentProfile` dataclass is preserved verbatim.
## Error Handling
### Error Strategy
Error handling is unchanged from the existing implementation:
- LLM transport errors propagate from `chat.completions.create`.
- Truncation (`finish_reason == "length"`) is repaired by `_fix_truncated_json`.
- Invalid JSON falls through to `_try_fix_json`, then to a synthesized fallback profile (now with English persona text).
- Per-entity exceptions are caught and a fallback `OasisAgentProfile` is constructed with English fallback strings.
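The parse, repair, and fallback order described above can be sketched as follows. The repair step here is a deliberately naive stand-in (extract the first `{...}` span) and is not the logic of `_try_fix_json` or `_fix_truncated_json`; the shape of the fallback dict and the English template wording are likewise assumptions.

```python
import json
import re


def parse_or_fallback(
    raw: str, entity_name: str, entity_type: str, entity_summary: str = ""
) -> dict:
    """Parse LLM output; on failure, synthesize a minimal English fallback profile."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        pass
    # Naive repair stand-in: pull the first {...} span out of surrounding text.
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if match:
        try:
            return json.loads(match.group(0))
        except json.JSONDecodeError:
            pass
    # entity_summary takes priority over the template, as the invariants require.
    persona = entity_summary or f"{entity_name} is a {entity_type}."
    return {"bio": persona, "persona": persona}
```

The last two lines carry the translated part: the `entity_summary or template` priority is unchanged, and only the template literal moves from Chinese to English.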
### Error Categories and Responses
- **User errors (4xx)**: not applicable at this layer; surfaced by the API handler.
- **System errors (5xx)**: LLM/network failures propagate to the API handler, which converts them to JSON error responses.
- **Business logic errors**: malformed JSON is auto-repaired or replaced with a fallback profile.
### Monitoring
Existing `logger.*` calls (keyed via `t("log.profile_generator.*")`) cover progress and warnings; no new monitoring is added.
## Testing Strategy
### Unit Tests
Given the project's intentionally minimal test harness (`backend/scripts/test_profile_format.py` only), the change is verified via:
- **Static check**: a one-shot regex assertion against the patched module ensuring zero CJK characters in the seven owned regions. This can be a quick `python -c` invocation during PR review.
- **Round-trip smoke test**: instantiate `OasisProfileGenerator()`, call `_build_individual_persona_prompt(...)` and `_build_group_persona_prompt(...)` with representative inputs, and verify all required interpolations appear in the output and no CJK characters remain.
- **Fallback rendering**: simulate a JSON parse failure and verify the English fallback persona template is produced.
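A round-trip smoke test along these lines might use sentinel inputs so that each interpolation is easy to find in the rendered prompt. The template below is a hypothetical English rendering written as a free function; the real `_build_individual_persona_prompt` is a method and calls `get_language_instruction()` itself rather than receiving the instruction as an argument.

```python
import re


def build_individual_persona_prompt(
    entity_name: str,
    entity_type: str,
    entity_summary: str,
    attrs_str: str,
    context_str: str,
    language_instruction: str,
) -> str:
    """Hypothetical English template with the same interpolation slots."""
    return (
        "Generate a detailed social media persona for the entity below.\n"
        f"Entity name: {entity_name}\n"
        f"Entity type: {entity_type}\n"
        f"Entity summary: {entity_summary}\n"
        f"Entity attributes: {attrs_str}\n"
        f"Context:\n{context_str}\n"
        'gender must be "male" or "female"; age must be an integer.\n'
        f"{language_instruction}"
    )


def smoke_check(prompt: str, expected_fragments: list[str]) -> list[str]:
    """Return a list of problems: missing interpolations or leftover CJK."""
    problems = [f"missing: {frag}" for frag in expected_fragments if frag not in prompt]
    if re.search(r"[\u4e00-\u9fff]", prompt):
        problems.append("CJK characters present")
    return problems


SENTINELS = ["NAME_X", "TYPE_X", "SUMMARY_X", "ATTRS_X", "CONTEXT_X", "LANG_X"]
rendered = build_individual_persona_prompt(*SENTINELS)
```

An empty result from `smoke_check(rendered, SENTINELS)` indicates that every slot survived translation; the same check applies to the group-persona template.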
### Integration Tests
- **Step 2 profile generation under EN locale**: run a small batched profile generation against a real Graphiti graph with locale `en`. Verify produced profiles have English `bio` / `persona` and pass the existing OASIS profile-format check.
### E2E/UI Tests
Not applicable — change does not affect frontend.
### Performance/Load
Not applicable — token counts may differ slightly between Chinese and English renderings, but the LLM call has no `max_tokens` cap and remains within provider-acceptable limits.
## Optional Sections
### Security Considerations
Not applicable. Translation does not introduce new authentication, authorization, data-handling, or input-validation paths.
### Performance & Scalability
Not applicable.
### Migration Strategy
Not applicable. The change is a single in-place edit; no data migration. Rollback is `git revert`.
## Supporting References
- `backend/app/services/oasis_profile_generator.py` — current Chinese prompt content (the source of translation).
- `backend/app/utils/locale.py` — locale resolver.
- `backend/app/api/simulation.py` — call site.
- `.kiro/specs/i18n-ontology-generator-prompts/design.md` — adjacent reference design for in-place prompt translation.
- `.ticket/25.md` — ticket snapshot.