MicroFish/.kiro/specs/i18n-translate-backend-comm.../gap-analysis.md

# Gap Analysis — `i18n-translate-backend-comments`

## Scope Recap
- **Ticket**: salestech-group/MiroFish#7
- **Goal**: Translate Chinese docstrings and `#` comments in `backend/` to English without behavior changes.
- **Blast radius**: Comments and docstrings only; runtime semantics preserved.

## Current State Investigation

### Discovered files
A scan with the regex `[一-鿿]` across `backend/**/*.py` (excluding `.venv`) returns **37 in-app files** plus 2 test files:

| Area | Count | Files |
| --- | --- | --- |
| `backend/app/__init__.py` | 1 | `__init__.py` |
| `backend/app/config.py` | 1 | `config.py` |
| `backend/app/api/` | 4 | `__init__.py`, `graph.py`, `report.py`, `simulation.py` |
| `backend/app/models/` | 3 | `__init__.py`, `project.py`, `task.py` |
| `backend/app/services/` | 12 | `__init__.py`, `graph_builder.py`, `oasis_profile_generator.py`, `ontology_generator.py`, `report_agent.py`, `simulation_config_generator.py`, `simulation_ipc.py`, `simulation_manager.py`, `simulation_runner.py`, `text_processor.py`, `zep_entity_reader.py`, `zep_graph_memory_updater.py`, `zep_tools.py` |
| `backend/app/utils/` | 7 | `__init__.py`, `file_parser.py`, `llm_client.py`, `locale.py`, `logger.py`, `retry.py`, `zep_paging.py` |
| `backend/run.py` | 1 | `run.py` |
| `backend/scripts/` | 5 | `action_logger.py`, `run_parallel_simulation.py`, `run_reddit_simulation.py`, `run_twitter_simulation.py`, `test_profile_format.py` |
| `backend/tests/` (extra, not in ticket file list) | 2 | `test_locale.py`, `test_locale_request_resolution.py` |

Spot checks (`models/task.py`, `models/project.py`, `services/text_processor.py`, `utils/locale.py`):
- Module-level docstrings in Chinese (e.g. `"""任务状态管理"""`).
- Class/method docstrings in Chinese, often Google-shaped (`Args:` translated as `参数:`).
- Inline `#` comments tagging fields, sections, or restating obvious code (e.g. `# 标准化换行` above an `\n` normalization call).
- Status-enum trailing comments (e.g. `PENDING = "pending"  # 等待中`).

### Conventions to preserve
- Project guideline: 4-space indent, max 120 char/line, double-quoted strings (Python).
- Docstring style: Google-style per `dev-guidelines.md`. Existing files mix English-shape `Args:`/`Returns:` keys with Chinese descriptions, or use Chinese keys (`参数:`, `返回:`). Translate both to canonical Google-style English.
- File-level convention: `snake_case` filenames, Python `__init__.py` modules typically have a one-line module docstring.

### Integration surfaces
None. This work touches only commentary; no API contracts, schemas, or imports change.

## Requirements Feasibility

| Requirement | Status | Notes |
| --- | --- | --- |
| R1 (coverage) | Feasible — straightforward | Files identified by `grep` rule. |
| R2 (behavior preservation) | Feasible | Achieved by limiting diffs to comment/docstring lines. Need to be careful with multi-line triple-quoted docstrings vs string literals (they are syntactically identical to strings — disambiguation: docstring is the *first* statement of a module/class/function body). |
| R3 (comment hygiene) | Feasible | Some judgment required; will adopt heuristic: drop comments whose translated form would be a single verb-phrase paraphrase of the next executable line. |
| R4 (style compliance) | Feasible | Watch line-length when translating dense Chinese to English (English is typically longer); rewrap as needed without changing executable code. |
| R5 (verification) | Feasible | The `grep -rln '[一-鿿]'` rule is reliable. Residual hits should land only in: prompt template strings (#2/#3/#4/#5), logger/API string literals (#6), and the `tests/test_locale*` files (intentional Chinese test data). |
| R6 (tracking/branching) | Feasible | Branch + commit conventions are standard for this repo; `/done` skill enforces them. |

### Gaps and constraints
- **Constraint**: Triple-quoted strings used as values (not as docstrings) must NOT be edited if their content is in scope of issues #2–#6 (prompts/log messages/error messages). Disambiguation matters.
- **Constraint**: Chinese characters appearing inside f-string literal segments must remain. They are out of scope.
- **Unknown / Research Needed**: None — task is mechanical and well-bounded.

### Adjacent specs / overlap with other tickets
- `i18n-externalize-backend-logs` (#6) owns translating `logger.{info,warning,error}` Chinese arguments and API response strings.
- `i18n-report-agent-prompts` (#5), and tickets #2/#3/#4 own prompt template strings.
- We must NOT touch any string literal that those tickets own. After this PR, residual `grep` hits should reduce by exactly the count of comments and docstrings translated and nothing else.
- The two `backend/tests/test_locale*.py` files are **not in the ticket's listed file scope**, and inspection shows their Chinese is exclusively in string literals (test data and a Unicode range check). They are out of scope by R1's enumerated paths and remain untouched.

## Implementation Approach Options

### Option A — Single-pass file-by-file translation (recommended)
- Walk the 37 in-scope files in a deterministic order (alphabetical), translating docstrings/comments per file, running the residual grep after each batch.
- Group commit by area (models, utils, services, api, scripts, root) to keep PR diff readable.
- ✅ Simple, low risk, easy to revert per-area.
- ✅ Maps directly to the requirements; easy to verify.
- ❌ Larger PR than option B, but ticket explicitly allows a single PR.

### Option B — Multi-PR per package
- Split into one PR per package (`models/`, `utils/`, …). The ticket allows this.
- ✅ Smaller diffs to review.
- ❌ More overhead (multiple branches/PRs); not necessary for a mechanical change of this size.

### Option C — Tooling-assisted bulk script
- Build a one-shot translation script (LLM-driven) that rewrites docstrings/comments.
- ✅ Could scale to other repos.
- ❌ Out of proportion for a single-ticket task; risk of errant edits to string literals; tooling itself becomes a deliverable to test and maintain.

## Effort and Risk
- **Effort**: **M (3–7 days of focused work)** — 37 files, hundreds of comments. In an interactive AI-assisted run, this collapses to a few hours.
- **Risk**: **Low** — comments-only diff; covered by mechanical verification (grep + pytest); easy to rollback per file/area.

## Recommendations for Design Phase

- **Preferred approach**: Option A (single-pass file-by-file, package-grouped commits, single PR).
- **Key decisions to capture in design**:
  - Order of traversal (proposed: `models/` → `utils/` → `services/` → `api/` → `scripts/` → root files `__init__.py`, `config.py`, `run.py`).
  - Heuristic for "drops the obvious comment" (one-line rule).
  - How to handle Google-style docstring keys: always translate `参数:` → `Args:`, `返回:` → `Returns:`, `异常:` → `Raises:`.
  - Verification cadence: re-run the grep after each package batch.
- **Research items to carry forward**: None.