# Requirements Document
## Introduction
This specification covers the English translation of the three LLM prompt blocks in `backend/app/services/simulation_config_generator.py`. The file produces the simulation parameters consumed by the OASIS subprocess (Step 3 of the MiroFish pipeline): time/event/agent/platform configuration, hot-topic extraction, narrative direction, and stance assignment. Today, all three prompts are written in Chinese; the language is steered at runtime by appending `get_language_instruction()` to each system prompt. While that postfix instructs the model *which* language to respond in, the base-prompt language biases the model's structural and lexical output. As a result, the natural-language output fields (`content`, `narrative_direction`, `hot_topics`, `reasoning`) skew Chinese under `Accept-Language: en`. Translating the base prompts to English removes that bias while preserving the existing locale-switching mechanism for non-English locales (verified: `get_language_instruction()` returns the Chinese postfix `请使用中文回答。` when locale is `zh`).
This work tracks GitHub issue [#4](https://github.com/salestech-group/MiroFish/issues/4).
## Boundary Context
- **In scope**:
- Translating the time-configuration prompt and its system prompt in `_generate_time_config` (block 1, lines ~543–588).
- Translating the event-configuration prompt and its system prompt in `_generate_event_config` (block 2, lines ~676–705).
- Translating the per-batch agent-configuration prompt and its system prompt in `_generate_agent_configs_batch` (block 3, lines ~833–869).
- Preserving every `get_language_instruction()` call site exactly as today (lines 589, 706, 870 — the three postfix injections that follow each system prompt).
- Preserving the existing English-only constraint directives that already follow `get_language_instruction()`: `poster_type` PascalCase English (block 2), `stance` ∈ {`supportive`, `opposing`, `neutral`, `observer`} (block 3).
- Preserving every variable interpolation (`{context_truncated}`, `{simulation_requirement}`, `{type_info}`, `{max_agents_allowed}`, `{json.dumps(entity_list, ...)}`, etc.) verbatim by name and position.
- Preserving the JSON output contract of each prompt (key names, value types, required fields).
- **Out of scope**:
- Logger messages (`logger.info`, `logger.warning`, `logger.error`) inside the same file — covered by issue #6.
- Module docstring, class docstrings, method docstrings, and inline comments — covered by issue #7.
- Refactoring the prompt structure, JSON output schema, retry/repair logic in `_call_llm_with_retry`, or any data-class definitions.
- Changing default simulation parameters (rounds count, action lists, etc. — owned by `app/config.py`).
- The fallback string in `_get_default_time_config` (`"使用默认中国人作息配置每轮1小时"`) and the fallback `"使用默认配置"` in the `_generate_event_config` exception handler. These are returned as `reasoning` values, not prompt content, so translating them is closer to log/comment scope (#6/#7). For symmetry with the prompt-translation goal they SHOULD be translated to locale-agnostic English, but only if doing so introduces no behavioural side effects (see Requirement 6).
- Exception (in scope): the `_build_context` Chinese section headings (`## 模拟需求`, `## 实体信息`, `## 原始文档内容`, `...(文档已截断)`) and the `_summarize_entities` headings (`### {entity_type} ({len(type_entities)}个)`, `... 还有 {n} 个`). Although they sit outside the three prompt blocks, these headings are interpolated into the prompts as part of `{context_truncated}` and contribute to the same output-language bias, so translating them is in scope (see Requirement 7).
- **Adjacent expectations**:
- The OASIS simulation subprocess and IPC layer (`services/simulation_ipc.py`) consume the resulting `SimulationParameters` payload. No coupling to prompt language exists in that consumer; the JSON shape of `SimulationParameters.to_dict()` is unchanged by this work.
- The locale resolution chain (`Accept-Language` header → `get_locale()` → `get_language_instruction()`) lives in `backend/app/utils/locale.py` and is unchanged.
- Companion i18n issues (#2 closed, #3 closed, #5, #6, #7) operate on different files or scopes and must not be touched here.
## Requirements
### Requirement 1: English Translation of the Time-Configuration Prompt (Block 1)
**Objective:** As a MiroFish operator running the pipeline under `Accept-Language: en`, I want the time-configuration prompt and system prompt to be authored in English, so that the LLM's `reasoning` field for time configuration is not biased toward Chinese structure or word choice.
#### Acceptance Criteria
1. The Simulation Config Generator shall render the user prompt inside `_generate_time_config` containing zero Chinese characters in any string-literal content.
2. The Simulation Config Generator shall render the system prompt inside `_generate_time_config` containing zero Chinese characters in any string-literal content.
3. The Simulation Config Generator shall preserve the JSON output contract of the time-config prompt verbatim by key name: `total_simulation_hours`, `minutes_per_round`, `agents_per_hour_min`, `agents_per_hour_max`, `peak_hours`, `off_peak_hours`, `morning_hours`, `work_hours`, `reasoning`.
4. The Simulation Config Generator shall preserve the field-level numeric constraints currently described in the prompt: `total_simulation_hours` ∈ 24–168, `minutes_per_round` ∈ 30–120 (recommend 60), `agents_per_hour_min`/`agents_per_hour_max` ∈ 1–`max_agents_allowed`.
5. The Simulation Config Generator shall preserve the variable interpolations `{context_truncated}` and `{max_agents_allowed}` verbatim by name and position.
6. The Simulation Config Generator shall preserve the prompt's guidance that the model should infer the target user group's timezone and circadian habits from the simulation scenario, with the UTC+8 reference example retained as illustrative guidance.
7. The Simulation Config Generator shall preserve the call to `get_language_instruction()` exactly at line ~589, appended after the translated system prompt.
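
Criteria 1 to 4 can be checked mechanically. The sketch below is illustrative only and is not part of the generator: `contains_cjk` and `check_time_config` are hypothetical helper names, the CJK ranges are a non-exhaustive assumption, and the numeric bounds are assumed to be inclusive.

```python
import re

# Assumed CJK coverage: CJK punctuation, Unified Ideographs Extension A,
# Unified Ideographs, and full-width forms. Not exhaustive.
_CJK_RE = re.compile(r"[\u3000-\u303f\u3400-\u4dbf\u4e00-\u9fff\uff00-\uffef]")

# Key names copied from criterion 3.
TIME_CONFIG_KEYS = frozenset({
    "total_simulation_hours", "minutes_per_round",
    "agents_per_hour_min", "agents_per_hour_max",
    "peak_hours", "off_peak_hours", "morning_hours", "work_hours",
    "reasoning",
})


def contains_cjk(text: str) -> bool:
    """Return True if the text contains any character in the assumed CJK ranges."""
    return _CJK_RE.search(text) is not None


def check_time_config(cfg: dict, max_agents_allowed: int) -> list[str]:
    """Return a list of contract violations (empty when the payload conforms)."""
    missing = TIME_CONFIG_KEYS - set(cfg)
    if missing:
        return [f"missing keys: {sorted(missing)}"]
    problems = []
    # Bounds from criterion 4, assumed inclusive at both ends.
    if not 24 <= cfg["total_simulation_hours"] <= 168:
        problems.append("total_simulation_hours outside 24-168")
    if not 30 <= cfg["minutes_per_round"] <= 120:
        problems.append("minutes_per_round outside 30-120")
    for key in ("agents_per_hour_min", "agents_per_hour_max"):
        if not 1 <= cfg[key] <= max_agents_allowed:
            problems.append(f"{key} outside 1-{max_agents_allowed}")
    return problems
```

A verification harness could run `contains_cjk` over the rendered user and system prompts and `check_time_config` over the parsed LLM response.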
### Requirement 2: English Translation of the Event-Configuration Prompt (Block 2)
**Objective:** As a MiroFish operator running the pipeline under `Accept-Language: en`, I want the event-configuration prompt and system prompt to be authored in English, so that generated `hot_topics`, `narrative_direction`, initial-post `content`, and `reasoning` fields are not biased toward Chinese structure or word choice.
#### Acceptance Criteria
1. The Simulation Config Generator shall render the user prompt inside `_generate_event_config` containing zero Chinese characters in any string-literal content.
2. The Simulation Config Generator shall render the system prompt inside `_generate_event_config` containing zero Chinese characters in any string-literal content.
3. The Simulation Config Generator shall preserve the JSON output contract of the event-config prompt verbatim by key name: `hot_topics` (list of strings), `narrative_direction` (string), `initial_posts` (list of objects with keys `content` and `poster_type`), `reasoning` (string).
4. The Simulation Config Generator shall preserve the variable interpolations `{simulation_requirement}`, `{context_truncated}`, and `{type_info}` verbatim by name and position.
5. The Simulation Config Generator shall preserve the call to `get_language_instruction()` exactly at line ~706, appended after the translated system prompt.
6. The Simulation Config Generator shall preserve verbatim the trailing English-only directive on `poster_type` formatting (currently: `IMPORTANT: The 'poster_type' field value MUST be in English PascalCase exactly matching the available entity types. Only 'content', 'narrative_direction', 'hot_topics' and 'reasoning' fields should use the specified language.`). The wording may be lightly normalized so it reads cleanly after a now-English system prompt, but the constraint semantics shall not change.
7. The Simulation Config Generator shall preserve the prompt's example list mapping entity types to expected post authors (Official/University → official statements, MediaOutlet → news, Student → student opinions) — translated to English while keeping each pairing intact.
8. When the locale is `zh`, the Simulation Config Generator shall produce `hot_topics`, `narrative_direction`, initial-post `content`, and `reasoning` fields in Chinese, equivalent in quality to the pre-change behaviour.
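
The output contract in criteria 3 and 6 can be expressed as a small validator. This is a minimal sketch under stated assumptions: `check_event_config` is a hypothetical helper, and the set of available entity types is assumed to be passed in by the caller.

```python
def check_event_config(cfg: dict, entity_types: set[str]) -> list[str]:
    """Return a list of violations of the event-config contract (empty if none)."""
    problems = []
    hot_topics = cfg.get("hot_topics")
    if not (isinstance(hot_topics, list)
            and all(isinstance(topic, str) for topic in hot_topics)):
        problems.append("hot_topics must be a list of strings")
    for key in ("narrative_direction", "reasoning"):
        if not isinstance(cfg.get(key), str):
            problems.append(f"{key} must be a string")
    posts = cfg.get("initial_posts")
    if not isinstance(posts, list):
        problems.append("initial_posts must be a list")
        return problems
    for index, post in enumerate(posts):
        if not isinstance(post, dict) or not {"content", "poster_type"} <= set(post):
            problems.append(f"initial_posts[{index}] needs content and poster_type")
            continue
        # poster_type must match an available entity type exactly (case-sensitive).
        if post["poster_type"] not in entity_types:
            problems.append(f"initial_posts[{index}].poster_type is not an entity type")
    return problems
```

Because the match is case-sensitive, a lower-cased or translated `poster_type` is reported, which is the failure mode the trailing directive in criterion 6 guards against.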
### Requirement 3: English Translation of the Agent-Config Batch Prompt (Block 3)
**Objective:** As a MiroFish operator running the pipeline under `Accept-Language: en`, I want the agent-config batch prompt and system prompt to be authored in English, so that the LLM's per-agent configuration emission is not biased by Chinese-specific behavioural priors when the seed scenario is non-Chinese.
#### Acceptance Criteria
1. The Simulation Config Generator shall render the user prompt inside `_generate_agent_configs_batch` containing zero Chinese characters in any string-literal content.
2. The Simulation Config Generator shall render the system prompt inside `_generate_agent_configs_batch` containing zero Chinese characters in any string-literal content.
3. The Simulation Config Generator shall preserve the JSON output contract of the agent-config batch prompt verbatim by key name: `agent_configs` (list) with sub-keys `agent_id`, `activity_level`, `posts_per_hour`, `comments_per_hour`, `active_hours`, `response_delay_min`, `response_delay_max`, `sentiment_bias`, `stance`, `influence_weight`.
4. The Simulation Config Generator shall preserve the variable interpolations `{simulation_requirement}` and the embedded `json.dumps(entity_list, ensure_ascii=False, indent=2)` rendering of the entity list verbatim.
5. The Simulation Config Generator shall preserve the per-entity-type heuristic ranges currently embedded in the prompt: officials (University/GovernmentAgency): low activity 0.1–0.3, work hours, slow response 60–240 min, high influence 2.5–3.0; media (MediaOutlet): mid activity 0.4–0.6, all-day 8–23, fast response 5–30 min, high influence 2.0–2.5; individuals (Student/Person/Alumni): high activity 0.6–0.9, evening 18–23, fast response 1–15 min, low influence 0.8–1.2; public figures/experts: mid activity 0.4–0.6, mid-high influence 1.5–2.0.
6. The Simulation Config Generator shall preserve the call to `get_language_instruction()` exactly at line ~870, appended after the translated system prompt.
7. The Simulation Config Generator shall preserve verbatim the trailing English-only directive on `stance` and JSON-key formatting (currently: `IMPORTANT: The 'stance' field value MUST be one of the English strings: 'supportive', 'opposing', 'neutral', 'observer'. All JSON field names and numeric values must remain unchanged. Only natural language text fields should use the specified language.`). The wording may be lightly normalized so it reads cleanly after a now-English system prompt, but the constraint semantics shall not change.
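
One way to keep criteria 3, 5, and 7 from drifting during translation is to hold the key names, stance values, and heuristic ranges as data that a test can compare against the translated prompt. The sketch below assumes that approach; `HEURISTIC_RANGES`, `group_for`, and `check_agent_config` are hypothetical names, and the public figures/experts group is omitted because criterion 5 names no entity types for it.

```python
from typing import Optional

# Sub-keys of each agent_configs entry, copied from criterion 3.
AGENT_CONFIG_KEYS = frozenset({
    "agent_id", "activity_level", "posts_per_hour", "comments_per_hour",
    "active_hours", "response_delay_min", "response_delay_max",
    "sentiment_bias", "stance", "influence_weight",
})
ALLOWED_STANCES = frozenset({"supportive", "opposing", "neutral", "observer"})

# Ranges copied from criterion 5; response_delay is in minutes.
HEURISTIC_RANGES = {
    "official": {
        "entity_types": {"University", "GovernmentAgency"},
        "activity_level": (0.1, 0.3),
        "response_delay": (60, 240),
        "influence_weight": (2.5, 3.0),
    },
    "media": {
        "entity_types": {"MediaOutlet"},
        "activity_level": (0.4, 0.6),
        "response_delay": (5, 30),
        "influence_weight": (2.0, 2.5),
    },
    "individual": {
        "entity_types": {"Student", "Person", "Alumni"},
        "activity_level": (0.6, 0.9),
        "response_delay": (1, 15),
        "influence_weight": (0.8, 1.2),
    },
}


def group_for(entity_type: str) -> Optional[str]:
    """Map an entity type to its heuristic group, or None if it has none."""
    for group, spec in HEURISTIC_RANGES.items():
        if entity_type in spec["entity_types"]:
            return group
    return None


def check_agent_config(cfg: dict) -> list[str]:
    """Check required keys and the English-only stance value."""
    problems = []
    missing = AGENT_CONFIG_KEYS - set(cfg)
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    if cfg.get("stance") not in ALLOWED_STANCES:
        problems.append(f"invalid stance: {cfg.get('stance')!r}")
    return problems
```

The ranges are guidance given to the model rather than hard limits, so this sketch validates only the keys and `stance`; range checks on the values would be a separate, stricter choice.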
### Requirement 4: Locale Switching Continues to Work via `get_language_instruction()`
**Objective:** As a MiroFish operator running the pipeline under `Accept-Language: zh` (or any other configured non-English locale), I want the simulation-config output to remain in the requested locale of equivalent quality, so that translating the base prompts does not regress non-English support.
#### Acceptance Criteria
1. The Simulation Config Generator shall preserve the three call sites of `get_language_instruction()` at the same line positions (relative to each prompt block) and in the same syntactic form: `system_prompt = f"{system_prompt}\n\n{get_language_instruction()}..."`.
2. When the locale is `zh`, the Simulation Config Generator shall produce a `time_config.reasoning`, `event_config.narrative_direction`, `event_config.hot_topics`, `event_config.initial_posts[*].content`, and a final `generation_reasoning` whose natural-language portions are in Chinese.
3. When the locale is `en`, the Simulation Config Generator shall produce the same set of natural-language fields in English.
4. The Simulation Config Generator shall not alter `backend/app/utils/locale.py`, the `_languages` registry, the `_translations` registry, or any file under `/locales/`.
5. Where a locale produces JSON output that is structurally invalid (e.g. a reasoning model emits `<think>` tags), the existing JSON repair logic in `_fix_truncated_json` and `_try_fix_config_json` shall continue to apply unchanged, regardless of prompt language.
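
The composition pattern in criterion 1 can be sketched as follows. The `get_language_instruction` below is a stand-in with an explicit `locale` parameter (the real function takes no arguments and resolves the locale itself), the English instruction string is an assumption, and `compose_system_prompt` is a hypothetical name.

```python
def get_language_instruction(locale: str) -> str:
    """Stand-in for the real helper in backend/app/utils/locale.py."""
    if locale == "zh":
        return "请使用中文回答。"  # Chinese postfix quoted in the Introduction
    return "Please respond in English."  # assumed wording, for illustration only


def compose_system_prompt(base_prompt: str, locale: str, directive: str = "") -> str:
    """Append the language postfix, then any English-only constraint directive."""
    return f"{base_prompt}\n\n{get_language_instruction(locale)}{directive}"


prompt_zh = compose_system_prompt(
    "You are a social media simulation expert.",  # hypothetical base prompt
    "zh",
    "\nIMPORTANT: The 'stance' field value MUST be one of the English strings.",
)
```

Because the base prompt and the postfix are concatenated independently, translating the base prompt to English leaves the `zh` postfix, and therefore the locale switch, untouched.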
### Requirement 5: Public API and Call-Site Stability
**Objective:** As a developer maintaining the rest of the MiroFish backend pipeline, I want the public surface of `SimulationConfigGenerator` to remain unchanged, so that the simulation pipeline (Step 3) continues to work without modification.
#### Acceptance Criteria
1. The Simulation Config Generator shall preserve the signature of `SimulationConfigGenerator.__init__(self, api_key: Optional[str] = None, base_url: Optional[str] = None, model_name: Optional[str] = None)`.
2. The Simulation Config Generator shall preserve the signature of `SimulationConfigGenerator.generate_config(...)` including all parameters and return type.
3. The Simulation Config Generator shall preserve the signatures of the private methods `_generate_time_config`, `_generate_event_config`, `_generate_agent_configs_batch`, `_parse_time_config`, `_parse_event_config`, `_assign_initial_post_agents`, `_generate_agent_config_by_rule`, `_call_llm_with_retry`, `_fix_truncated_json`, `_try_fix_config_json`, `_get_default_time_config`, `_build_context`, `_summarize_entities`.
4. The Simulation Config Generator shall preserve the dataclass definitions `AgentActivityConfig`, `TimeSimulationConfig`, `EventConfig`, `PlatformConfig`, `SimulationParameters` exactly (no field additions, removals, renames, or default-value changes).
5. The Simulation Config Generator shall preserve the class-level constants `MAX_CONTEXT_LENGTH = 50000`, `AGENTS_PER_BATCH = 15`, `TIME_CONFIG_CONTEXT_LENGTH = 10000`, `EVENT_CONFIG_CONTEXT_LENGTH = 8000`, `ENTITY_SUMMARY_LENGTH = 300`, `AGENT_SUMMARY_LENGTH = 300`, `ENTITIES_PER_TYPE_DISPLAY = 20`.
6. The Simulation Config Generator shall preserve the LLM invocation parameters in `_call_llm_with_retry`: `response_format={"type": "json_object"}`, `temperature=0.7 - (attempt * 0.1)`, `max_attempts = 3`, no `max_tokens` setting.
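
Signature stability (criteria 1 to 3) can be guarded with a snapshot test built on `inspect.signature`. The class below is a stand-in that only mirrors the constructor quoted in criterion 1; `parameter_snapshot` and `EXPECTED_INIT` are hypothetical names, and a real test would import the actual class instead.

```python
import inspect
from typing import Optional


class SimulationConfigGenerator:
    """Stand-in with the constructor signature quoted in criterion 1."""

    def __init__(self, api_key: Optional[str] = None,
                 base_url: Optional[str] = None,
                 model_name: Optional[str] = None):
        self.api_key = api_key
        self.base_url = base_url
        self.model_name = model_name


def parameter_snapshot(func) -> list[tuple[str, object]]:
    """Return (name, default) pairs; required parameters map to Parameter.empty."""
    signature = inspect.signature(func)
    return [(p.name, p.default) for p in signature.parameters.values()]


# Snapshot recorded before the translation change.
EXPECTED_INIT = [
    ("self", inspect.Parameter.empty),
    ("api_key", None),
    ("base_url", None),
    ("model_name", None),
]

init_unchanged = parameter_snapshot(SimulationConfigGenerator.__init__) == EXPECTED_INIT
```

The snapshot compares names and defaults only; annotations and return types would need a separate comparison if they are to be covered as well.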
### Requirement 6: Default-Path Output Compatibility
**Objective:** As a MiroFish operator hitting an LLM-failure fallback path, I want the default `reasoning` strings to remain compatible with downstream consumers, so that translating prompts does not silently break the `generation_reasoning` join or any downstream display.
#### Acceptance Criteria
1. The Simulation Config Generator shall continue to produce a non-empty `reasoning` field on the default path returned by `_get_default_time_config` and the exception path of `_generate_event_config`.
2. The Simulation Config Generator may translate the two literal default-path `reasoning` strings (`"使用默认中国人作息配置每轮1小时"` and `"使用默认配置"`) to English. If translated, both translations shall be locale-agnostic English (no Chinese characters), and both shall remain non-empty.
3. The Simulation Config Generator shall preserve the join semantics of `generation_reasoning = " | ".join(reasoning_parts)` — a `" | "` separator with the existing label prefixes contributed by `t('progress.timeConfigLabel')`, `t('progress.eventConfigLabel')`, etc.
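
As a small illustration of criteria 2 and 3, the sketch below joins labelled reasoning parts and checks a candidate default string. The labels and the English default strings are hypothetical (the real labels come from `t(...)`), and `is_valid_default_reasoning` uses an ASCII-only test, which is stricter than "no Chinese characters".

```python
def join_reasoning(reasoning_parts: list[str]) -> str:
    """Join labelled reasoning parts with the ' | ' separator from criterion 3."""
    return " | ".join(reasoning_parts)


def is_valid_default_reasoning(text: str) -> bool:
    """Non-empty and ASCII-only; an assumed proxy for locale-agnostic English."""
    return bool(text.strip()) and text.isascii()


# Hypothetical labels and translated default strings, for illustration only.
parts = [
    "Time config: Using default configuration",
    "Event config: Using default configuration",
]
generation_reasoning = join_reasoning(parts)
```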
### Requirement 7: Context-Builder Section Headings Translated
**Objective:** As a MiroFish operator running the pipeline under `Accept-Language: en`, I want the section headings injected into prompts via `_build_context` and `_summarize_entities` to be authored in English, so that the assembled prompt does not interleave English instruction blocks with Chinese section markers, which would otherwise re-introduce the same model-output language bias the prompt translations seek to eliminate.
#### Acceptance Criteria
1. The Simulation Config Generator shall render the section headings emitted by `_build_context` in English: replacing `## 模拟需求` with an English equivalent (e.g. `## Simulation Requirement`), `## 实体信息 ({n}个)` with `## Entities ({n})`, `## 原始文档内容` with `## Source Document Content`, and the truncation marker `(文档已截断)` with an English equivalent (e.g. `(document truncated)`).
2. The Simulation Config Generator shall render the per-entity-type breakdown in `_summarize_entities` in English: replacing `### {entity_type} ({n}个)` with `### {entity_type} ({n})` and the trailing overflow marker `... 还有 {n} 个` with an English equivalent (e.g. `... and {n} more`).
3. The Simulation Config Generator shall preserve `entity.name` and `entity.summary` data verbatim in the rendered context (no translation of user-provided content).
4. The change to context-builder headings shall not modify the public signatures of `_build_context` or `_summarize_entities`.
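
The heading changes in criterion 2 might look like the sketch below. It is not the existing `_summarize_entities` implementation: the function name, the dict-based entity shape, and the per-entity line format are assumptions made for illustration; only the heading and overflow-marker formats come from the criterion.

```python
from collections import defaultdict


def summarize_entities(entities: list[dict], per_type_display: int = 20) -> str:
    """Group entities by type and render English headings and overflow markers."""
    by_type = defaultdict(list)
    for entity in entities:
        by_type[entity["entity_type"]].append(entity)

    lines = []
    for entity_type, type_entities in by_type.items():
        lines.append(f"### {entity_type} ({len(type_entities)})")
        for entity in type_entities[:per_type_display]:
            # name and summary are user data and are passed through verbatim
            lines.append(f"- {entity['name']}: {entity['summary']}")
        overflow = len(type_entities) - per_type_display
        if overflow > 0:
            lines.append(f"... and {overflow} more")
    return "\n".join(lines)


sample = [
    {"entity_type": "Student", "name": "张伟", "summary": "大三学生"},
    {"entity_type": "Student", "name": "Ana", "summary": "Exchange student"},
    {"entity_type": "Student", "name": "Li", "summary": "Graduate student"},
]
rendered = summarize_entities(sample, per_type_display=2)
```

Note that user-provided names and summaries may still contain Chinese; per criterion 3 they are preserved as-is, so a zero-Chinese-characters check applies to the headings, not to the rendered context as a whole.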
### Requirement 8: End-to-End Step 3 Parity
**Objective:** As a MiroFish operator validating the change, I want the OASIS subprocess to start cleanly and run at least one round under the English-prompt configuration, so that the translation does not silently degrade the simulation pipeline.
#### Acceptance Criteria
1. When a representative seed simulation requirement is processed end-to-end with locale `en`, `SimulationConfigGenerator.generate_config(...)` shall return a fully-populated `SimulationParameters` object (non-empty `agent_configs`, populated `time_config`, populated `event_config`).
2. When the resulting `SimulationParameters` is handed to the OASIS subprocess via `simulation_ipc.py`, the subprocess shall start without raising a schema or validation error attributable to the translated prompts.
3. When the resulting `SimulationParameters` is handed to the OASIS subprocess, the subprocess shall execute at least one simulation round without erroring on a `stance` not being one of `supportive`/`opposing`/`neutral`/`observer`, or a `poster_type` not matching an available entity type.
4. The Simulation Config Generator shall not change the `SimulationParameters.to_dict()` payload shape consumed by the IPC layer (verified via Requirement 5).
### Requirement 9: Out-of-Scope Surfaces Remain Untouched
**Objective:** As a reviewer of this PR, I want the change to remain narrowly scoped to prompt-content strings (and the directly related context-builder headings of Requirement 7), so that translation responsibilities for adjacent surfaces (issues #6 and #7) are not absorbed into this change.
#### Acceptance Criteria
1. The change shall not modify any `logger.info(...)`, `logger.warning(...)`, `logger.error(...)`, or `logger.debug(...)` call in `simulation_config_generator.py` (covered by issue #6).
2. The change shall not modify the module docstring at lines 1–11, the class docstring on `SimulationConfigGenerator`, the dataclass docstrings (`AgentActivityConfig`, `TimeSimulationConfig`, `EventConfig`, `PlatformConfig`, `SimulationParameters`), or any inline `#` comment in `simulation_config_generator.py` (covered by issue #7).
3. The change shall not modify any file outside `backend/app/services/simulation_config_generator.py` for production code, except for adding test fixtures or scripts under a clearly-isolated directory if a verification harness is needed.
4. The change shall not introduce a new dependency or modify `backend/pyproject.toml` / `backend/uv.lock`.
5. The change shall not edit `backend/app/config.py`, `backend/app/services/simulation_ipc.py`, `backend/app/services/simulation_runner.py`, `backend/app/utils/locale.py`, or any file under `/locales/`.
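
Criteria 3 to 5 amount to an allowlist over the changed paths. A minimal sketch, assuming the list of changed files is obtained separately (for example from `git diff --name-only`) and that verification fixtures live under a hypothetical `backend/tests/` directory:

```python
# The only production file this change may touch (criterion 3).
ALLOWED_PRODUCTION_FILE = "backend/app/services/simulation_config_generator.py"

# Hypothetical location for an isolated verification harness.
ISOLATED_TEST_PREFIX = "backend/tests/"


def scope_violations(changed_paths: list[str]) -> list[str]:
    """Return every changed path that falls outside the allowlist."""
    violations = []
    for path in changed_paths:
        if path == ALLOWED_PRODUCTION_FILE or path.startswith(ISOLATED_TEST_PREFIX):
            continue
        violations.append(path)
    return violations
```

Because anything outside the allowlist is reported, the files named in criteria 4 and 5 (`backend/pyproject.toml`, `backend/app/utils/locale.py`, and so on) are covered without being listed individually.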