MicroFish/.kiro/specs/i18n-oasis-profile-generato.../design.md

21 KiB
Raw Blame History

Design Document — i18n-oasis-profile-generator-prompts

Overview

Purpose: Translate the Chinese prompt strings, context-builder section labels, fallback persona templates, and console-output formatting in backend/app/services/oasis_profile_generator.py to English while preserving every functional contract — LLM JSON output schema, the _normalize_gender mapping that must continue to accept Chinese gender values, the _generate_profile_rule_based default country: "中国" data value, all f-string interpolations, and the get_language_instruction() locale-postfix mechanism. The goal is to remove the Chinese-language base-prompt and context-label bias that currently leaks Chinese structure and word choice into OASIS profile output even when Accept-Language: en.

Users: MiroFish operators running the Step 2 OASIS profile generation under any locale; downstream OASIS / CAMEL-OASIS consumers of the agent JSON / CSV produced by OasisProfileGenerator.

Impact: Replaces approximately one base-prompt string, two large user-message templates, four context-builder section labels, three fallback persona templates, and ten console-output strings with English equivalents inside one file. No API surface change. No new dependencies. No new files. Callers (backend/app/api/simulation.py, etc.) and OASIS consumers are unaffected.

Goals

  • Zero CJK characters in any prompt string literal contributed by oasis_profile_generator.py to the system prompt, the user message, or the context block.
  • Zero CJK characters in any console-output literal in _print_generated_profile and the surrounding banners.
  • English bio / persona output under Accept-Language: en.
  • Continued Chinese bio / persona output under Accept-Language: zh, of equivalent quality to the pre-change behaviour.
  • No diff to public signatures, dataclass schema, LLM-call parameters, or call sites.

Non-Goals

  • Externalizing prompts to /locales/*.json (out of scope per ticket and consistent with i18n-ontology-generator-prompts).
  • Translating logger calls in this file (covered by issue #6).
  • Translating module/class/method docstrings or inline comments in this file (covered by issue #7).
  • Refactoring the OASIS profile JSON schema, the OASIS adapter, or the simulation flow.
  • Modifying the _normalize_gender mapping table (it must keep accepting Chinese gender keys).
  • Modifying the _generate_profile_rule_based default "中国" country value (data, not prompt).
  • Modifying the ValueError("LLM_API_KEY 未配置") raise (covered by issue #6).
  • Modifying backend/app/utils/locale.py, the locale registries, or any non-target file.

Boundary Commitments

This Spec Owns

  • The English content of the base_prompt string in OasisProfileGenerator._get_system_prompt (line 664).
  • The English content of every string literal in OasisProfileGenerator._build_individual_persona_prompt (lines 677714).
  • The English content of every string literal in OasisProfileGenerator._build_group_persona_prompt (lines 726762).
  • The English content of the section-label literals embedded in OasisProfileGenerator._search_zep_for_entity (lines 384, 390, 392) and OasisProfileGenerator._build_entity_context (lines 422, 438, 440, 443, 463, 472, 475).
  • The English content of the fallback persona templates in OasisProfileGenerator._generate_profile_with_llm (line 547) and OasisProfileGenerator._try_fix_json (lines 644, 659).
  • The English content of the no-attributes / no-context placeholder literals ("无", "无额外上下文") at lines 677, 678, 726, 727.
  • The English content of every string literal in OasisProfileGenerator._print_generated_profile (lines 1011, 1017, 1019, 1022, 1025, 1026, 1027, 1028) and the surrounding banners in OasisProfileGenerator.generate_profiles_from_entities (lines 945, 1001).

Out of Boundary

  • Locale resolution machinery (backend/app/utils/locale.py).
  • Per-locale llmInstruction definitions (/locales/languages.json).
  • Reasoning-model output stripping (backend/app/utils/llm_client.py).
  • All logger.* calls (already keyed via t("log.profile_generator.*"); covered by issue #6).
  • Module / class / method docstrings and inline comments (covered by issue #7), including the inline comments at lines 65, 93, 641, 804807, 816819.
  • The _normalize_gender mapping table (lines 11231132) — must continue to accept Chinese gender keys from upstream.
  • The hard-coded country: "中国" default in _generate_profile_rule_based (lines 807, 819) — this is a data value, not a prompt.
  • The ValueError("LLM_API_KEY 未配置") raise (line 194) — covered by issue #6.
  • All callers of OasisProfileGenerator, including backend/app/api/simulation.py.
  • Tests, scripts, and frontend code.

Allowed Dependencies

  • Existing get_language_instruction, get_locale, set_locale, t imports from ..utils.locale (already imported; unchanged).
  • Existing OpenAI SDK invocation (unchanged).
  • No new imports.

Revalidation Triggers

The following changes elsewhere would invalidate this design and require revisiting the prompt:

  • A change to the JSON contract emitted by the LLM (bio, persona, age, gender, mbti, country, profession, interested_topics).
  • A change to OasisAgentProfile field semantics.
  • A change to get_language_instruction() semantics or the per-locale llmInstruction strings.
  • A change to OASIS / CAMEL-OASIS profile field expectations (e.g. if gender accepts more than male / female / other).

Architecture

Existing Architecture Analysis

OasisProfileGenerator lives in backend/app/services/, follows the in-process service pattern with bounded thread-pool fan-out for batched profile generation, and is invoked from backend/app/api/simulation.py inside a background Task. It depends on:

  • OpenAI SDK for the LLM call.
  • GraphitiAdapter (legacy zep_client field name) for the Zep / Graphiti graph search.
  • get_language_instruction() for locale steering.
  • t() for already-keyed log strings.

The relevant flow is:

  1. The Flask handler resolves the request locale via Accept-Language; the locale is propagated to thread-pool workers via the set_locale(current_locale) capture in generate_profiles_from_entities (line 914).
  2. For each entity, _build_entity_context() is called: it composes a context block by concatenating headed sub-sections (entity attributes, related facts/edges, related node summaries, Graphiti-search facts, Graphiti-search nodes). Some of these labels are currently in Chinese.
  3. The context string is interpolated into the user-message template by either _build_individual_persona_prompt or _build_group_persona_prompt. Both templates are currently in Chinese, with English gender token directives interleaved.
  4. The system prompt is built by _get_system_prompt: a Chinese base prompt followed by the locale-appropriate get_language_instruction().
  5. The two messages are sent to chat.completions.create with response_format={"type": "json_object"}. The result flows through json.loads_try_fix_json_fix_truncated_json fallback chain. Synthesized fallback personas use the Chinese template f"{entity_name}是一个{entity_type}。" if the LLM result is unusable.
  6. After per-profile completion, _print_generated_profile writes a Chinese-headed banner to stdout, and generate_profiles_from_entities writes Chinese batch banners.

This design preserves all of the above structurally. The change is purely lexical inside the seven regions of one file.

Architecture Pattern & Boundary Map

graph TB
    Caller[simulation.py handler]
    Generator[OasisProfileGenerator]
    Locale[locale.get_language_instruction]
    Graph[GraphitiAdapter graph.search]
    LLM[OpenAI chat.completions]

    Caller -->|generate_profiles_from_entities| Generator
    Generator -->|build context block| Generator
    Generator -->|read locale postfix| Locale
    Generator -->|search facts/nodes| Graph
    Generator -->|JSON request| LLM
    LLM -->|raw JSON| Generator
    Generator -->|OasisAgentProfile| Caller

Architecture Integration:

  • Selected pattern: In-place lexical translation of seven regions of an existing service. No structural change.
  • Domain/feature boundaries: locale machinery vs. prompt assembly vs. LLM transport remain cleanly separated.
  • Existing patterns preserved: prompt-as-f-string user-message construction; Chinese-keyed _normalize_gender mapping; t(...) for log strings; get_language_instruction() postfix concatenation.
  • New components rationale: none — no new components.
  • Steering compliance: matches the established i18n-*-prompts family pattern (issues #2, #3, #4, #5) of in-place translation rather than t() keying for prompt bodies. Respects the steering note that "existing files mix English and Chinese in comments/docstrings — preserve both; do not translate one into the other unless asked." This ticket is the explicit ask for prompt strings, scoped to exclude comments/docstrings.

Technology Stack

Layer Choice / Version Role in Feature Notes
Backend / Services Python 3.11+ Hosts OasisProfileGenerator Existing — unchanged.
Backend / Services openai SDK Issues the prompt; returns JSON Existing — unchanged.
Backend / Services backend/app/utils/locale.py Resolves Accept-LanguagellmInstruction postfix Existing — unchanged.
Backend / Services GraphitiAdapter Provides Graphiti graph search facts/nodes Existing — unchanged.

No new dependencies. No version changes.

File Structure Plan

Modified Files

  • backend/app/services/oasis_profile_generator.py — Replace the body of _get_system_prompt base_prompt; replace every Chinese string literal in _build_individual_persona_prompt and _build_group_persona_prompt with English equivalents; replace the four section labels in _search_zep_for_entity and the six section labels in _build_entity_context; replace the three fallback persona templates; replace the two "无" / "无额外上下文" placeholders; replace the console-output literals in _print_generated_profile and the two print(...) banners in generate_profiles_from_entities. Preserve every other character of the file.

No new files. No deletions. No moves.

System Flows

The control-flow diagram in Architecture Pattern & Boundary Map covers the relevant flow; no additional diagrams are needed for this string-literal change.

Requirements Traceability

Requirement Summary Components Interfaces Flows
1.11.4 English _get_system_prompt base_prompt; preserve get_language_instruction() site OasisProfileGenerator → _get_system_prompt None changed Architecture diagram
2.12.9 English _build_individual_persona_prompt; preserve interpolations and JSON keys OasisProfileGenerator → _build_individual_persona_prompt f-string interpolation n/a
3.13.9 English _build_group_persona_prompt; preserve fixed-value rules and interpolations OasisProfileGenerator → _build_group_persona_prompt f-string interpolation n/a
4.14.10 English context-builder section labels OasisProfileGenerator → _search_zep_for_entity, _build_entity_context Prompt-only n/a
5.15.3 English fallback persona templates OasisProfileGenerator → _generate_profile_with_llm, _try_fix_json None changed n/a
6.16.7 English console-output formatting OasisProfileGenerator → _print_generated_profile, generate_profiles_from_entities None changed n/a
7.17.4 Locale switching preserved via get_language_instruction() OasisProfileGenerator + Locale get_language_instruction() Architecture diagram
8.18.6 Public API and call-site stability; preserve _normalize_gender and country: "中国" data default OasisProfileGenerator (signatures, dataclass) Public surface n/a
9.19.3 Reasoning-model compatibility OasisProfileGenerator → chat.completions.create + _try_fix_json OpenAI SDK Architecture diagram
10.110.7 Out-of-scope surfaces untouched OasisProfileGenerator (boundary commitment) n/a n/a

Components and Interfaces

Component Domain/Layer Intent Req Coverage Key Dependencies (P0/P1) Contracts
OasisProfileGenerator (modified) Backend / Service Render English profile-generation prompts and context labels; preserve all behaviour 1.110.7 OpenAI.chat.completions.create (P0), get_language_instruction (P0), GraphitiAdapter.graph.search (P1), _normalize_gender (P0) Service

Backend / Service

OasisProfileGenerator (modified)

Field Detail
Intent Translate prompt strings, context labels, fallback persona templates, and console output to English while preserving every functional contract.
Requirements 1.1, 1.2, 1.3, 1.4, 2.12.9, 3.13.9, 4.14.10, 5.15.3, 6.16.7, 7.17.4, 8.18.6, 9.19.3, 10.110.7

Responsibilities & Constraints

  • Owns: the English wording of the system prompt body, the two user-message templates, the context-builder section labels, the fallback persona templates, the no-attributes / no-context placeholders, and the console-output formatting.
  • Domain boundary: prompt content and proximate console output only. Does not own locale resolution, transport, validation, or data values like the OASIS country default.
  • Invariants:
    • All seven owned regions after translation MUST contain zero CJK characters.
    • The translated user-message templates MUST present the same eight required JSON keys: bio, persona, age, gender, mbti, country, profession, interested_topics.
    • The translated individual-persona template MUST require gender ∈ {"male", "female"} and age to be a valid integer.
    • The translated group-persona template MUST require age == 30 and gender == "other".
    • The translated user-message templates MUST preserve the f-string interpolations: {entity_name}, {entity_type}, {entity_summary}, {attrs_str}, {context_str}, {get_language_instruction()}.
    • The translated context-builder labels MUST preserve the section structure (heading + bulleted body).
    • The translated fallback persona templates MUST preserve the entity_summary or template priority order.
    • The call to get_language_instruction() MUST remain at its current locations.
    • The call to self.client.chat.completions.create(...) MUST remain unchanged.
    • All public signatures, dataclass schema, and the private helper signatures MUST remain unchanged.
    • All logger.* calls (already keyed) and inline comments and docstrings in this file MUST remain unchanged (out of scope per #6 and #7).
    • The _normalize_gender mapping table MUST remain unchanged.
    • The rule-based country: "中国" default MUST remain unchanged.

Dependencies

  • Inbound: backend/app/api/simulation.py — production caller (P0).
  • Outbound: backend/app/utils/locale.get_language_instruction — locale postfix (P0); backend/app/utils/locale.t — already-keyed log strings (P0); backend/app/services/graphiti_adapter.GraphitiAdapter.graph.search — facts/nodes retrieval (P1); OpenAI.chat.completions.create — JSON LLM transport (P0).
  • External: none.

Contracts: Service [x] / API [ ] / Event [ ] / Batch [ ] / State [ ]

Service Interface

The public Python interface is unchanged. Representative signatures:

class OasisProfileGenerator:
    def __init__(
        self,
        api_key: Optional[str] = None,
        base_url: Optional[str] = None,
        model_name: Optional[str] = None,
        zep_api_key: Optional[str] = None,
        graph_id: Optional[str] = None,
    ) -> None: ...

    def generate_profile_from_entity(
        self,
        entity: EntityNode,
        user_id: int,
        use_llm: bool = True,
    ) -> OasisAgentProfile: ...

    def generate_profiles_from_entities(
        self,
        entities: List[EntityNode],
        use_llm: bool = True,
        progress_callback: Optional[callable] = None,
        graph_id: Optional[str] = None,
        parallel_count: int = 5,
        realtime_output_path: Optional[str] = None,
        output_platform: str = "reddit",
    ) -> List[OasisAgentProfile]: ...

    def save_profiles(
        self,
        profiles: List[OasisAgentProfile],
        file_path: str,
        platform: str = "reddit",
    ) -> None: ...
  • Preconditions: a configured LLM provider; a configured Graphiti / Neo4j graph; a non-empty entities list when batching.
  • Postconditions: OasisAgentProfile instances with English bio and persona under locale en, Chinese under locale zh, and structurally equivalent across locales.
  • Invariants: see Responsibilities & Constraints.

Implementation Notes

  • Integration: No new imports. No call-site changes. The diff is confined to seven regions of one file.
  • Validation: After implementation, run a targeted regex check ([一-鿿]) over the seven owned regions to confirm zero CJK; smoke-test _build_individual_persona_prompt(...) and _build_group_persona_prompt(...) with representative inputs to confirm interpolations still work; round-trip a single profile end-to-end under both en and zh locales.
  • Risks: English-base bias on Chinese-locale output (mitigated by the llmInstruction postfix already present in both system and user messages). Reduced LLM compliance with gender ∈ {male, female} for individual entities (mitigated by retaining the explicit English-token directive verbatim in the rules block).

Data Models

No data-model changes. The OasisAgentProfile dataclass is preserved verbatim.

Error Handling

Error Strategy

Error handling is unchanged from the existing implementation:

  • LLM transport errors propagate from chat.completions.create.
  • Truncation (finish_reason == "length") is repaired by _fix_truncated_json.
  • Invalid JSON falls through to _try_fix_json, then to a synthesized fallback profile (now with English persona text).
  • Per-entity exceptions are caught and a fallback OasisAgentProfile is constructed with English fallback strings.

Error Categories and Responses

  • User errors (4xx): not applicable at this layer; surfaced by the API handler.
  • System errors (5xx): LLM/network failures propagate to the API handler, which converts them to JSON error responses.
  • Business logic errors: malformed JSON is auto-repaired or replaced with a fallback profile.

Monitoring

Existing logger.* calls (keyed via t("log.profile_generator.*")) cover progress and warnings; no new monitoring is added.

Testing Strategy

Unit Tests

Given the project's intentionally minimal test harness (backend/scripts/test_profile_format.py only), the change is verified via:

  • Static check: a one-shot regex assertion against the patched module ensuring zero CJK characters in the seven owned regions. This can be a quick python -c invocation during PR review.
  • Round-trip smoke test: instantiate OasisProfileGenerator(), call _build_individual_persona_prompt(...) and _build_group_persona_prompt(...) with representative inputs, and verify all required interpolations appear in the output and no CJK characters remain.
  • Fallback rendering: simulate a JSON parse failure and verify the English fallback persona template is produced.

Integration Tests

  • Step 2 profile generation under EN locale: run a small batched profile generation against a real Graphiti graph with locale en. Verify produced profiles have English bio / persona and pass the existing OASIS profile-format check.

E2E/UI Tests

Not applicable — change does not affect frontend.

Performance/Load

Not applicable — token counts may differ slightly between Chinese and English renderings, but the LLM call has no max_tokens cap and remains within provider-acceptable limits.

Optional Sections

Security Considerations

Not applicable. Translation does not introduce new authentication, authorization, data-handling, or input-validation paths.

Performance & Scalability

Not applicable.

Migration Strategy

Not applicable. The change is a single in-place edit; no data migration. Rollback is git revert.

Supporting References

  • backend/app/services/oasis_profile_generator.py — current Chinese prompt content (the source of translation).
  • backend/app/utils/locale.py — locale resolver.
  • backend/app/api/simulation.py — call site.
  • .kiro/specs/i18n-ontology-generator-prompts/design.md — adjacent reference design for in-place prompt translation.
  • .ticket/25.md — ticket snapshot.