Gap Analysis — i18n-translate-backend-comments
Scope Recap
- Ticket: salestech-group/MiroFish#7
- Goal: Translate Chinese docstrings and `#` comments in `backend/` to English without behavior changes.
- Blast radius: Comments and docstrings only; runtime semantics preserved.
Current State Investigation
Discovered files
A scan with the regex [一-鿿] across backend/**/*.py (excluding .venv) returns 37 in-app files plus 2 test files:
| Area | Count | Files |
|---|---|---|
| `backend/app/__init__.py` | 1 | `__init__.py` |
| `backend/app/config.py` | 1 | `config.py` |
| `backend/app/api/` | 4 | `__init__.py`, `graph.py`, `report.py`, `simulation.py` |
| `backend/app/models/` | 3 | `__init__.py`, `project.py`, `task.py` |
| `backend/app/services/` | 12 | `__init__.py`, `graph_builder.py`, `oasis_profile_generator.py`, `ontology_generator.py`, `report_agent.py`, `simulation_config_generator.py`, `simulation_ipc.py`, `simulation_manager.py`, `simulation_runner.py`, `text_processor.py`, `zep_entity_reader.py`, `zep_graph_memory_updater.py`, `zep_tools.py` |
| `backend/app/utils/` | 7 | `__init__.py`, `file_parser.py`, `llm_client.py`, `locale.py`, `logger.py`, `retry.py`, `zep_paging.py` |
| `backend/run.py` | 1 | `run.py` |
| `backend/scripts/` | 5 | `action_logger.py`, `run_parallel_simulation.py`, `run_reddit_simulation.py`, `run_twitter_simulation.py`, `test_profile_format.py` |
| `backend/tests/` (extra, not in ticket file list) | 2 | `test_locale.py`, `test_locale_request_resolution.py` |
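The scan described above could be reproduced without shelling out to grep. The following is a minimal sketch, not the tooling used for the numbers in the table: the function name `files_with_cjk` and its `exclude_dirs` parameter are hypothetical, and the character class is the same U+4E00 to U+9FFF range as the regex `[一-鿿]`.

```python
import re
from pathlib import Path

# CJK Unified Ideographs block, U+4E00..U+9FFF (same range as the grep regex).
CJK = re.compile(r"[\u4e00-\u9fff]")


def files_with_cjk(root, exclude_dirs=(".venv",)):
    """Return sorted, root-relative paths of .py files containing a CJK character."""
    hits = []
    for path in Path(root).rglob("*.py"):
        rel = path.relative_to(root)
        # Skip anything that lives under an excluded directory (hypothetical default: .venv).
        if any(part in exclude_dirs for part in rel.parts):
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        if CJK.search(text):
            hits.append(rel.as_posix())
    return sorted(hits)
```

Calling `files_with_cjk("backend")` from the repository root would list the files that still contain Chinese characters, which makes it easy to diff the result against the table.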
Spot checks (models/task.py, models/project.py, services/text_processor.py, utils/locale.py):
- Module-level docstrings in Chinese (e.g. `"""任务状态管理"""`, "task status management").
- Class/method docstrings in Chinese, often Google-shaped (`Args:` translated as `参数:`).
- Inline `#` comments tagging fields, sections, or restating obvious code (e.g. `# 标准化换行`, "normalize line breaks", above an `\n` normalization call).
- Status-enum trailing comments (e.g. `PENDING = "pending"  # 等待中`, "waiting").
Conventions to preserve
- Project guideline: 4-space indent, max 120 char/line, double-quoted strings (Python).
- Docstring style: Google-style per `dev-guidelines.md`. Existing files mix English-shape `Args:`/`Returns:` keys with Chinese descriptions, or use Chinese keys (`参数:`, `返回:`). Translate both to canonical Google-style English.
- File-level convention: `snake_case` filenames; Python `__init__.py` modules typically have a one-line module docstring.
Integration surfaces
None. This work touches only commentary; no API contracts, schemas, or imports change.
Requirements Feasibility
| Requirement | Status | Notes |
|---|---|---|
| R1 (coverage) | Feasible — straightforward | Files identified by grep rule. |
| R2 (behavior preservation) | Feasible | Achieved by limiting diffs to comment/docstring lines. Need to be careful with multi-line triple-quoted docstrings vs string literals (they are syntactically identical to strings — disambiguation: docstring is the first statement of a module/class/function body). |
| R3 (comment hygiene) | Feasible | Some judgment required; will adopt heuristic: drop comments whose translated form would be a single verb-phrase paraphrase of the next executable line. |
| R4 (style compliance) | Feasible | Watch line-length when translating dense Chinese to English (English is typically longer); rewrap as needed without changing executable code. |
| R5 (verification) | Feasible | The grep -rln '[一-鿿]' rule is reliable. Residual hits should land only in: prompt template strings (#2/#3/#4/#5), logger/API string literals (#6), and the tests/test_locale* files (intentional Chinese test data). |
| R6 (tracking/branching) | Feasible | Branch + commit conventions are standard for this repo; /done skill enforces them. |
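R2 could be checked mechanically rather than by eye. The sketch below is one possible approach, not a step the ticket prescribes: it parses a file before and after translation, drops docstrings from both trees, and compares the dumps. Comments never reach the AST, so any remaining difference means executable code or a string literal changed. The names `same_behavior` and `_strip_docstrings` are hypothetical, and the check assumes no code reads `__doc__` at runtime.

```python
import ast

_DOC_OWNERS = (ast.Module, ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)


def _strip_docstrings(tree):
    """Remove the leading docstring statement from every module/class/function body."""
    for node in ast.walk(tree):
        if isinstance(node, _DOC_OWNERS) and node.body:
            first = node.body[0]
            if (
                isinstance(first, ast.Expr)
                and isinstance(first.value, ast.Constant)
                and isinstance(first.value.value, str)
            ):
                node.body = node.body[1:]
    return tree


def same_behavior(before_src, after_src):
    """True when both sources have identical ASTs once docstrings are ignored."""
    before = _strip_docstrings(ast.parse(before_src))
    after = _strip_docstrings(ast.parse(after_src))
    # ast.dump omits line/column attributes by default, so rewrapped comments do not matter.
    return ast.dump(before) == ast.dump(after)
```

Running this on each file pair (for example the `git show HEAD:path` content versus the working copy) would complement pytest: a `False` result points at an edit that strayed outside comments and docstrings.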
Gaps and constraints
- Constraint: Triple-quoted strings used as values (not as docstrings) must NOT be edited if their content is in scope of issues #2–#6 (prompts/log messages/error messages). Disambiguation matters.
- Constraint: Chinese characters appearing inside f-string literal segments must remain. They are out of scope.
- Unknown / Research Needed: None — task is mechanical and well-bounded.
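The disambiguation these constraints depend on can be sketched with the standard `ast` and `tokenize` modules. This is an illustrative helper under stated assumptions, not part of the ticket: `classify_cjk` is a hypothetical name, a docstring is identified as a string token starting where the first statement of a module, class, or function body starts, and everything that is neither a comment nor a docstring (string values, f-string segments) is reported as `out_of_scope`.

```python
import ast
import io
import re
import tokenize

CJK = re.compile(r"[\u4e00-\u9fff]")
_DOC_OWNERS = (ast.Module, ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)


def _docstring_starts(src):
    """Return the (line, column) position where each docstring begins."""
    starts = set()
    for node in ast.walk(ast.parse(src)):
        if isinstance(node, _DOC_OWNERS) and node.body:
            first = node.body[0]
            if (
                isinstance(first, ast.Expr)
                and isinstance(first.value, ast.Constant)
                and isinstance(first.value.value, str)
            ):
                # Assumes only ASCII indentation precedes the docstring on its line.
                starts.add((first.lineno, first.col_offset))
    return starts


def classify_cjk(src):
    """Yield (line, kind) for every token containing CJK text."""
    doc_starts = _docstring_starts(src)
    for tok in tokenize.generate_tokens(io.StringIO(src).readline):
        if not CJK.search(tok.string):
            continue
        if tok.type == tokenize.COMMENT:
            kind = "comment"
        elif tok.type == tokenize.STRING and tok.start in doc_starts:
            kind = "docstring"
        else:
            kind = "out_of_scope"  # string values, f-string segments, etc.
        yield tok.start[0], kind
```

Only lines reported as `comment` or `docstring` would be candidates for translation in this ticket; `out_of_scope` hits are the ones the prompt and log tickets own.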
Adjacent specs / overlap with other tickets
- `i18n-externalize-backend-logs` (#6) owns translating `logger.{info,warning,error}` Chinese arguments and API response strings.
- `i18n-report-agent-prompts` (#5) and tickets #2/#3/#4 own prompt template strings.
- We must NOT touch any string literal that those tickets own. After this PR, residual `grep` hits should reduce by exactly the count of comments and docstrings translated and nothing else.
- The two `backend/tests/test_locale*.py` files are not in the ticket's listed file scope, and inspection shows their Chinese is exclusively in string literals (test data and a Unicode range check). They are out of scope by R1's enumerated paths and remain untouched.
Implementation Approach Options
Option A — Single-pass file-by-file translation (recommended)
- Walk the 37 in-scope files in a deterministic order (alphabetical), translating docstrings/comments per file, running the residual grep after each batch.
- Group commit by area (models, utils, services, api, scripts, root) to keep PR diff readable.
- ✅ Simple, low risk, easy to revert per-area.
- ✅ Maps directly to the requirements; easy to verify.
- ❌ Larger PR than option B, but ticket explicitly allows a single PR.
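The per-area batching in Option A could be derived from the file list rather than maintained by hand. The sketch below is only an illustration: `area_of`, `batches`, and the `AREA_ORDER` constant are hypothetical names, and the path layout (`backend/app/<area>/...`, `backend/scripts/...`, everything else treated as root) is an assumption drawn from the table of discovered files.

```python
# Commit grouping order named in Option A.
AREA_ORDER = ["models", "utils", "services", "api", "scripts", "root"]


def area_of(path):
    """Map a repo-relative path to its commit area."""
    parts = path.split("/")
    if parts[:2] == ["backend", "app"] and len(parts) > 3:
        return parts[2]  # e.g. backend/app/models/task.py -> "models"
    if parts[:2] == ["backend", "scripts"]:
        return "scripts"
    return "root"  # backend/run.py, backend/app/config.py, backend/app/__init__.py


def batches(paths):
    """Return [(area, alphabetically sorted files)] in AREA_ORDER, skipping empty areas."""
    grouped = {area: [] for area in AREA_ORDER}
    for path in sorted(paths):
        grouped.setdefault(area_of(path), []).append(path)
    return [(area, files) for area, files in grouped.items() if files]
```

Each tuple returned by `batches` would correspond to one commit, with the residual grep re-run between tuples.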
Option B — Multi-PR per package
- Split into one PR per package (`models/`, `utils/`, …). The ticket allows this.
- ✅ Smaller diffs to review.
- ❌ More overhead (multiple branches/PRs); not necessary for a mechanical change of this size.
Option C — Tooling-assisted bulk script
- Build a one-shot translation script (LLM-driven) that rewrites docstrings/comments.
- ✅ Could scale to other repos.
- ❌ Out of proportion for a single-ticket task; risk of errant edits to string literals; tooling itself becomes a deliverable to test and maintain.
Effort and Risk
- Effort: M (3–7 days of focused work) — 37 files, hundreds of comments. In an interactive AI-assisted run, this collapses to a few hours.
- Risk: Low — comments-only diff; covered by mechanical verification (grep + pytest); easy to rollback per file/area.
Recommendations for Design Phase
- Preferred approach: Option A (single-pass file-by-file, package-grouped commits, single PR).
- Key decisions to capture in design:
  - Order of traversal (proposed: `models/` → `utils/` → `services/` → `api/` → `scripts/` → root files `__init__.py`, `config.py`, `run.py`).
  - Heuristic for "drops the obvious comment" (one-line rule).
  - How to handle Google-style docstring keys: always translate `参数:` → `Args:`, `返回:` → `Returns:`, `异常:` → `Raises:`.
  - Verification cadence: re-run the grep after each package batch.
- Research items to carry forward: None.
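The docstring-key decision is the one part of the translation that is fully mechanical, so it could be sketched as a small helper. This is an assumption-laden illustration, not the agreed implementation: `normalize_section_keys` is a hypothetical name, and accepting a full-width colon (`：`) after the key is an assumption about how the existing docstrings are written. The descriptions under each key still need human translation.

```python
import re

# Mapping taken from the key decisions above.
SECTION_KEYS = {"参数": "Args", "返回": "Returns", "异常": "Raises"}

# A section header is a line holding only the key and a colon (ASCII or full-width).
_HEADER = re.compile(r"^([ \t]*)(参数|返回|异常)[ \t]*[:：][ \t]*$", re.MULTILINE)


def normalize_section_keys(docstring):
    """Rewrite Chinese Google-style section headers to their English equivalents."""
    return _HEADER.sub(
        lambda m: f"{m.group(1)}{SECTION_KEYS[m.group(2)]}:",
        docstring,
    )
```

Because the pattern only matches a key standing alone on its line, Chinese text inside descriptions or string values is left untouched.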