Gap Analysis — `i18n-translate-backend-comments`

Scope Recap

Ticket: salestech-group/MiroFish#7
Goal: Translate Chinese docstrings and # comments in backend/ to English without behavior changes.
Blast radius: Comments and docstrings only; runtime semantics preserved.

A scan with the regex [一-鿿] across backend/**/*.py (excluding .venv) returns 37 in-app files plus 2 test files:

Area	Count	Files
`backend/app/__init__.py`	1	`__init__.py`
`backend/app/config.py`	1	`config.py`
`backend/app/api/`	4	`__init__.py`, `graph.py`, `report.py`, `simulation.py`
`backend/app/models/`	3	`__init__.py`, `project.py`, `task.py`
`backend/app/services/`	13	`__init__.py`, `graph_builder.py`, `oasis_profile_generator.py`, `ontology_generator.py`, `report_agent.py`, `simulation_config_generator.py`, `simulation_ipc.py`, `simulation_manager.py`, `simulation_runner.py`, `text_processor.py`, `zep_entity_reader.py`, `zep_graph_memory_updater.py`, `zep_tools.py`
`backend/app/utils/`	7	`__init__.py`, `file_parser.py`, `llm_client.py`, `locale.py`, `logger.py`, `retry.py`, `zep_paging.py`
`backend/run.py`	1	`run.py`
`backend/scripts/`	5	`action_logger.py`, `run_parallel_simulation.py`, `run_reddit_simulation.py`, `run_twitter_simulation.py`, `test_profile_format.py`
`backend/tests/` (extra, not in ticket file list)	2	`test_locale.py`, `test_locale_request_resolution.py`
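The scan can be approximated in pure Python, which avoids shell-quoting differences between platforms. This is a minimal sketch, not the command actually used for the counts above; the function name `files_with_cjk` and the `exclude_dirs` parameter are hypothetical, and the character class is assumed to be the U+4E00 to U+9FFF range that the regex spells out.

```python
import re
from pathlib import Path

# CJK Unified Ideographs, U+4E00 to U+9FFF (assumed equivalent to the regex in the text).
CJK = re.compile(r"[\u4e00-\u9fff]")


def files_with_cjk(root, exclude_dirs=(".venv",)):
    """Return sorted, root-relative paths of .py files that contain CJK characters."""
    root = Path(root)
    hits = []
    for path in root.rglob("*.py"):
        rel = path.relative_to(root)
        # Skip anything that lives under an excluded directory such as .venv.
        if any(part in exclude_dirs for part in rel.parts):
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        if CJK.search(text):
            hits.append(rel.as_posix())
    return sorted(hits)
```

Calling `files_with_cjk("backend")` from the repository root would list the files that still need attention; re-running it after each batch gives the residual set.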

Spot checks (models/task.py, models/project.py, services/text_processor.py, utils/locale.py):

Module-level docstrings in Chinese (e.g. """任务状态管理""").
Class/method docstrings in Chinese, often Google-shaped but with translated section keys (e.g. 参数: in place of Args:).
Inline # comments tagging fields, sections, or restating obvious code (e.g. # 标准化换行 above an \n normalization call).
Status-enum trailing comments (e.g. PENDING = "pending" # 等待中).

Project guideline: 4-space indent, max 120 char/line, double-quoted strings (Python).
Docstring style: Google-style per dev-guidelines.md. Existing files mix English-shape Args:/Returns: keys with Chinese descriptions, or use Chinese keys (参数:, 返回:). Translate both to canonical Google-style English.
File-level convention: snake_case filenames, Python __init__.py modules typically have a one-line module docstring.
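Because English translations tend to run longer than the Chinese they replace, a translated comment may need rewrapping to stay within the 120-character limit. The sketch below shows one way to do that with the standard library; `wrap_comment` and its parameters are hypothetical names used only for illustration.

```python
import textwrap


def wrap_comment(text, indent=4, limit=120):
    """Wrap a translated comment into `# ` lines that fit within `limit` columns."""
    prefix = " " * indent + "# "
    # Reserve room for the indentation and the comment marker on every line.
    chunks = textwrap.wrap(text, width=limit - len(prefix))
    return [prefix + chunk for chunk in chunks]
```

Only the comment lines change; the executable line that follows keeps its original text and indentation.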

No integration surfaces are affected: this work touches only commentary, so no API contracts, schemas, or imports change.

Requirement	Status	Notes
R1 (coverage)	Feasible — straightforward	Files identified by `grep` rule.
R2 (behavior preservation)	Feasible	Achieved by limiting diffs to comment and docstring lines. Multi-line triple-quoted docstrings need care because they are syntactically identical to string literals; the disambiguation rule is that a docstring is the first statement of a module, class, or function body.
R3 (comment hygiene)	Feasible	Some judgment required. Heuristic: drop a comment when its translated form would be a single verb-phrase paraphrase of the next executable line.
R4 (style compliance)	Feasible	Watch line-length when translating dense Chinese to English (English is typically longer); rewrap as needed without changing executable code.
R5 (verification)	Feasible	The `grep -rln '[一-鿿]'` rule is reliable. Residual hits should land only in: prompt template strings (#2/#3/#4/#5), logger/API string literals (#6), and the `tests/test_locale*` files (intentional Chinese test data).
R6 (tracking/branching)	Feasible	Branch + commit conventions are standard for this repo; `/done` skill enforces them.
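For R2, one way to check mechanically that a file's executable code is unchanged is to compare the parsed syntax trees before and after translation, with docstrings blanked out. Comments never appear in the tree, so they drop out automatically. This is a sketch under the assumption that docstrings are the only string constants allowed to differ; `same_behavior` is a hypothetical helper, not an existing project utility.

```python
import ast

_DOC_OWNERS = (ast.Module, ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)


def _strip_docstrings(tree):
    """Blank every docstring in place so only executable structure remains."""
    for node in ast.walk(tree):
        if isinstance(node, _DOC_OWNERS) and node.body:
            first = node.body[0]
            if (isinstance(first, ast.Expr)
                    and isinstance(first.value, ast.Constant)
                    and isinstance(first.value.value, str)):
                # Replace instead of delete so a docstring-only body stays non-empty.
                first.value = ast.Constant(value="")
    return tree


def same_behavior(before_src, after_src):
    """Return True if the two sources differ only in comments and docstrings."""
    before = ast.dump(_strip_docstrings(ast.parse(before_src)))
    after = ast.dump(_strip_docstrings(ast.parse(after_src)))
    return before == after
```

A triple-quoted string assigned to a name is an `Assign` node rather than a leading `Expr`, so editing it makes the comparison fail, which is the intended signal. Note that `__doc__` values do change at runtime; that is the purpose of the ticket and is deliberately ignored here.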

Constraint: Triple-quoted strings used as values (not as docstrings) must NOT be edited if their content is in scope of issues #2–#6 (prompts/log messages/error messages). Disambiguation matters.
Constraint: Chinese characters appearing inside f-string literal segments must remain. They are out of scope.
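The two constraints above amount to classifying every Chinese occurrence as a comment, a docstring, or something else that must stay untouched. The sketch below combines `tokenize` (which sees comments) with `ast` (which identifies docstrings by position). The function names and the three labels are assumptions made for illustration.

```python
import ast
import io
import re
import tokenize

CJK = re.compile(r"[\u4e00-\u9fff]")
_DOC_OWNERS = (ast.Module, ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)


def _docstring_starts(source):
    """Return {(line, utf8_byte_col)} for every docstring expression."""
    starts = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, _DOC_OWNERS) and node.body:
            first = node.body[0]
            if (isinstance(first, ast.Expr)
                    and isinstance(first.value, ast.Constant)
                    and isinstance(first.value.value, str)):
                starts.add((first.value.lineno, first.value.col_offset))
    return starts


def classify_cjk(source):
    """Return [(line, kind)] for tokens containing CJK text.

    kind is "comment" or "docstring" (in scope), "literal" for other string
    tokens, or "other" for anything else (out of scope, must not be edited).
    """
    doc_starts = _docstring_starts(source)
    results = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if not CJK.search(tok.string):
            continue
        row, col = tok.start
        if tok.type == tokenize.COMMENT:
            kind = "comment"
        elif tok.type == tokenize.STRING:
            # ast reports UTF-8 byte offsets; tokenize reports character offsets.
            byte_col = len(tok.line[:col].encode("utf-8"))
            kind = "docstring" if (row, byte_col) in doc_starts else "literal"
        else:
            # For example f-string segments on Python 3.12+, which are not STRING tokens.
            kind = "other"
        results.append((row, kind))
    return results
```

Only entries labeled `comment` or `docstring` would be candidates for translation; everything else belongs to the string-literal tickets.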
Unknown / Research Needed: None — task is mechanical and well-bounded.

i18n-externalize-backend-logs (#6) owns translating logger.{info,warning,error} Chinese arguments and API response strings.
i18n-report-agent-prompts (#5) and tickets #2/#3/#4 own the prompt template strings.
We must NOT touch any string literal that those tickets own. After this PR, the residual line-level matches should drop by exactly the number of comment and docstring lines translated or removed, and by nothing else.
The two backend/tests/test_locale*.py files are not in the ticket's listed file scope, and inspection shows their Chinese is exclusively in string literals (test data and a Unicode range check). They are out of scope by R1's enumerated paths and remain untouched.

Option A: Walk the 37 in-scope files in a deterministic order (alphabetical within each package), translating docstrings/comments per file and running the residual grep after each batch.
Group commit by area (models, utils, services, api, scripts, root) to keep PR diff readable.
✅ Simple, low risk, easy to revert per-area.
✅ Maps directly to the requirements; easy to verify.
❌ Larger PR than option B, but ticket explicitly allows a single PR.

Option B: Split into one PR per package (models/, utils/, …). The ticket allows this.
✅ Smaller diffs to review.
❌ More overhead (multiple branches/PRs); not necessary for a mechanical change of this size.

Option C: Build a one-shot translation script (LLM-driven) that rewrites docstrings/comments.
✅ Could scale to other repos.
❌ Out of proportion for a single-ticket task; risk of errant edits to string literals; tooling itself becomes a deliverable to test and maintain.

Effort: M (3–7 days of focused work) — 37 files, hundreds of comments. In an interactive AI-assisted run, this collapses to a few hours.
Risk: Low — comments-only diff; covered by mechanical verification (grep + pytest); easy to rollback per file/area.

Preferred approach: Option A (single-pass file-by-file, package-grouped commits, single PR).
Key decisions to capture in design:
- Order of traversal (proposed: models/ → utils/ → services/ → api/ → scripts/ → root files __init__.py, config.py, run.py).
- Heuristic for dropping obvious comments (a one-line rule).
- How to handle Google-style docstring keys: always translate 参数: → Args:, 返回: → Returns:, 异常: → Raises:.
- Verification cadence: re-run the grep after each package batch.
Research items to carry forward: None.
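
The docstring-key decision above is small enough to state as code. The sketch below only illustrates the mapping and is not proposed tooling: the descriptions under each key still need human translation. `normalize_section_keys` is a hypothetical name, and accepting the full-width colon (：) alongside the ASCII one is an assumption about what the existing files contain.

```python
import re

# Mapping taken from the key decisions: Chinese section keys to Google-style keys.
SECTION_KEYS = {"参数": "Args", "返回": "Returns", "异常": "Raises"}
# A key line is indentation, one known key, then an ASCII or full-width colon.
_KEY_LINE = re.compile(r"^(\s*)(参数|返回|异常)\s*[:：]\s*$")


def normalize_section_keys(docstring):
    """Rewrite Chinese section-key lines to canonical English keys, keeping indentation."""
    out = []
    for line in docstring.splitlines():
        m = _KEY_LINE.match(line)
        out.append(f"{m.group(1)}{SECTION_KEYS[m.group(2)]}:" if m else line)
    return "\n".join(out)
```

Lines that merely mention one of these words inside a sentence are left alone, because the pattern requires the key to stand alone on its line.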