MicroFish/.kiro/specs/i18n-translate-backend-comm.../gap-analysis.md

93 lines
6.9 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Gap Analysis — `i18n-translate-backend-comments`
## Scope Recap
- **Ticket**: salestech-group/MiroFish#7
- **Goal**: Translate Chinese docstrings and `#` comments in `backend/` to English without behavior changes.
- **Blast radius**: Comments and docstrings only; runtime semantics preserved.
## Current State Investigation
### Discovered files
A scan with the regex `[一-鿿]` across `backend/**/*.py` (excluding `.venv`) returns **37 in-app files** plus 2 test files:
| Area | Count | Files |
| --- | --- | --- |
| `backend/app/__init__.py` | 1 | `__init__.py` |
| `backend/app/config.py` | 1 | `config.py` |
| `backend/app/api/` | 4 | `__init__.py`, `graph.py`, `report.py`, `simulation.py` |
| `backend/app/models/` | 3 | `__init__.py`, `project.py`, `task.py` |
| `backend/app/services/` | 12 | `__init__.py`, `graph_builder.py`, `oasis_profile_generator.py`, `ontology_generator.py`, `report_agent.py`, `simulation_config_generator.py`, `simulation_ipc.py`, `simulation_manager.py`, `simulation_runner.py`, `text_processor.py`, `zep_entity_reader.py`, `zep_graph_memory_updater.py`, `zep_tools.py` |
| `backend/app/utils/` | 7 | `__init__.py`, `file_parser.py`, `llm_client.py`, `locale.py`, `logger.py`, `retry.py`, `zep_paging.py` |
| `backend/run.py` | 1 | `run.py` |
| `backend/scripts/` | 5 | `action_logger.py`, `run_parallel_simulation.py`, `run_reddit_simulation.py`, `run_twitter_simulation.py`, `test_profile_format.py` |
| `backend/tests/` (extra, not in ticket file list) | 2 | `test_locale.py`, `test_locale_request_resolution.py` |
Spot checks (`models/task.py`, `models/project.py`, `services/text_processor.py`, `utils/locale.py`):
- Module-level docstrings in Chinese (e.g. `"""任务状态管理"""`).
- Class/method docstrings in Chinese, often Google-shaped (`Args:` translated as `参数:`).
- Inline `#` comments tagging fields, sections, or restating obvious code (e.g. `# 标准化换行` above an `\n` normalization call).
- Status-enum trailing comments (e.g. `PENDING = "pending" # 等待中`).
### Conventions to preserve
- Project guideline: 4-space indent, max 120 char/line, double-quoted strings (Python).
- Docstring style: Google-style per `dev-guidelines.md`. Existing files mix English-shape `Args:`/`Returns:` keys with Chinese descriptions, or use Chinese keys (`参数:`, `返回:`). Translate both to canonical Google-style English.
- File-level convention: `snake_case` filenames, Python `__init__.py` modules typically have a one-line module docstring.
### Integration surfaces
None. This work touches only commentary; no API contracts, schemas, or imports change.
## Requirements Feasibility
| Requirement | Status | Notes |
| --- | --- | --- |
| R1 (coverage) | Feasible — straightforward | Files identified by `grep` rule. |
| R2 (behavior preservation) | Feasible | Achieved by limiting diffs to comment/docstring lines. Need to be careful with multi-line triple-quoted docstrings vs string literals (they are syntactically identical to strings — disambiguation: docstring is the *first* statement of a module/class/function body). |
| R3 (comment hygiene) | Feasible | Some judgment required; will adopt heuristic: drop comments whose translated form would be a single verb-phrase paraphrase of the next executable line. |
| R4 (style compliance) | Feasible | Watch line-length when translating dense Chinese to English (English is typically longer); rewrap as needed without changing executable code. |
| R5 (verification) | Feasible | The `grep -rln '[一-鿿]'` rule is reliable. Residual hits should land only in: prompt template strings (#2/#3/#4/#5), logger/API string literals (#6), and the `tests/test_locale*` files (intentional Chinese test data). |
| R6 (tracking/branching) | Feasible | Branch + commit conventions are standard for this repo; `/done` skill enforces them. |
### Gaps and constraints
- **Constraint**: Triple-quoted strings used as values (not as docstrings) must NOT be edited if their content is in scope of issues #2#6 (prompts/log messages/error messages). Disambiguation matters.
- **Constraint**: Chinese characters appearing inside f-string literal segments must remain. They are out of scope.
- **Unknown / Research Needed**: None — task is mechanical and well-bounded.
### Adjacent specs / overlap with other tickets
- `i18n-externalize-backend-logs` (#6) owns translating `logger.{info,warning,error}` Chinese arguments and API response strings.
- `i18n-report-agent-prompts` (#5), and tickets #2/#3/#4 own prompt template strings.
- We must NOT touch any string literal that those tickets own. After this PR, residual `grep` hits should reduce by exactly the count of comments and docstrings translated and nothing else.
- The two `backend/tests/test_locale*.py` files are **not in the ticket's listed file scope**, and inspection shows their Chinese is exclusively in string literals (test data and a Unicode range check). They are out of scope by R1's enumerated paths and remain untouched.
## Implementation Approach Options
### Option A — Single-pass file-by-file translation (recommended)
- Walk the 37 in-scope files in a deterministic order (alphabetical), translating docstrings/comments per file, running the residual grep after each batch.
- Group commit by area (models, utils, services, api, scripts, root) to keep PR diff readable.
- ✅ Simple, low risk, easy to revert per-area.
- ✅ Maps directly to the requirements; easy to verify.
- ❌ Larger PR than option B, but ticket explicitly allows a single PR.
### Option B — Multi-PR per package
- Split into one PR per package (`models/`, `utils/`, …). The ticket allows this.
- ✅ Smaller diffs to review.
- ❌ More overhead (multiple branches/PRs); not necessary for a mechanical change of this size.
### Option C — Tooling-assisted bulk script
- Build a one-shot translation script (LLM-driven) that rewrites docstrings/comments.
- ✅ Could scale to other repos.
- ❌ Out of proportion for a single-ticket task; risk of errant edits to string literals; tooling itself becomes a deliverable to test and maintain.
## Effort and Risk
- **Effort**: **M (37 days of focused work)** — 37 files, hundreds of comments. In an interactive AI-assisted run, this collapses to a few hours.
- **Risk**: **Low** — comments-only diff; covered by mechanical verification (grep + pytest); easy to rollback per file/area.
## Recommendations for Design Phase
- **Preferred approach**: Option A (single-pass file-by-file, package-grouped commits, single PR).
- **Key decisions to capture in design**:
- Order of traversal (proposed: `models/``utils/``services/``api/``scripts/` → root files `__init__.py`, `config.py`, `run.py`).
- Heuristic for "drops the obvious comment" (one-line rule).
- How to handle Google-style docstring keys: always translate `参数:``Args:`, `返回:``Returns:`, `异常:``Raises:`.
- Verification cadence: re-run the grep after each package batch.
- **Research items to carry forward**: None.