Gap Analysis — i18n-translate-backend-comments

Scope Recap

  • Ticket: salestech-group/MiroFish#7
  • Goal: Translate Chinese docstrings and # comments in backend/ to English without behavior changes.
  • Blast radius: Comments and docstrings only; runtime semantics preserved.

Current State Investigation

Discovered files

A scan with the regex [一-鿿] across backend/**/*.py (excluding .venv) returns 37 in-app files plus 2 test files:

| Area | Count | Files |
| --- | --- | --- |
| backend/app/__init__.py | 1 | __init__.py |
| backend/app/config.py | 1 | config.py |
| backend/app/api/ | 4 | __init__.py, graph.py, report.py, simulation.py |
| backend/app/models/ | 3 | __init__.py, project.py, task.py |
| backend/app/services/ | 13 | __init__.py, graph_builder.py, oasis_profile_generator.py, ontology_generator.py, report_agent.py, simulation_config_generator.py, simulation_ipc.py, simulation_manager.py, simulation_runner.py, text_processor.py, zep_entity_reader.py, zep_graph_memory_updater.py, zep_tools.py |
| backend/app/utils/ | 7 | __init__.py, file_parser.py, llm_client.py, locale.py, logger.py, retry.py, zep_paging.py |
| backend/run.py | 1 | run.py |
| backend/scripts/ | 5 | action_logger.py, run_parallel_simulation.py, run_reddit_simulation.py, run_twitter_simulation.py, test_profile_format.py |
| backend/tests/ (extra, not in ticket file list) | 2 | test_locale.py, test_locale_request_resolution.py |
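
The scan can also be reproduced without grep. The following is a minimal Python sketch, assuming the same character range as the regex above and that `.venv` is the only directory to skip; `find_cjk_files` is a hypothetical helper name, not something that exists in the repository.

```python
import re
from pathlib import Path

# Same character range as the grep rule quoted in the text.
CJK_RE = re.compile("[一-鿿]")


def find_cjk_files(root, exclude_dirs=(".venv",)):
    """Return sorted paths of .py files under root that contain CJK characters."""
    root = Path(root)
    hits = []
    for path in root.rglob("*.py"):
        # Only look at path components below root when applying exclusions.
        if any(part in exclude_dirs for part in path.relative_to(root).parts):
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        if CJK_RE.search(text):
            hits.append(path)
    return sorted(hits)
```

Calling `find_cjk_files("backend")` before and after each batch gives the same file list the grep rule would, in a form that is easy to diff.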

Spot checks (models/task.py, models/project.py, services/text_processor.py, utils/locale.py):

  • Module-level docstrings in Chinese (e.g. """任务状态管理""").
  • Class/method docstrings in Chinese, often Google-shaped (Args: translated as 参数:).
  • Inline # comments tagging fields, sections, or restating obvious code (e.g. # 标准化换行 above an \n normalization call).
  • Status-enum trailing comments (e.g. PENDING = "pending" # 等待中).

Conventions to preserve

  • Project guideline: 4-space indent, max 120 char/line, double-quoted strings (Python).
  • Docstring style: Google-style per dev-guidelines.md. Existing files mix English-shape Args:/Returns: keys with Chinese descriptions, or use Chinese keys (参数:, 返回:). Translate both to canonical Google-style English.
  • File-level convention: snake_case filenames, Python __init__.py modules typically have a one-line module docstring.
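
Because English translations tend to run longer than the Chinese originals, the 120-character limit is the convention most likely to be broken by accident. A minimal sketch of a check, assuming the limit is counted in characters per line; `overlong_lines` is a hypothetical helper name.

```python
def overlong_lines(source, limit=120):
    """Return (line_number, length) for each line longer than limit characters."""
    return [
        (number, len(line))
        for number, line in enumerate(source.splitlines(), start=1)
        if len(line) > limit
    ]
```

Running this on each translated file flags the comment lines that need rewrapping; the project's own linter, if one is configured, remains the authority.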

Integration surfaces

None. This work touches only commentary; no API contracts, schemas, or imports change.

Requirements Feasibility

| Requirement | Status | Notes |
| --- | --- | --- |
| R1 (coverage) | Feasible (straightforward) | Files identified by grep rule. |
| R2 (behavior preservation) | Feasible | Achieved by limiting diffs to comment/docstring lines. Need to be careful with multi-line triple-quoted docstrings vs string literals (they are syntactically identical to strings; disambiguation: a docstring is the first statement of a module/class/function body). |
| R3 (comment hygiene) | Feasible | Some judgment required; will adopt heuristic: drop comments whose translated form would be a single verb-phrase paraphrase of the next executable line. |
| R4 (style compliance) | Feasible | Watch line-length when translating dense Chinese to English (English is typically longer); rewrap as needed without changing executable code. |
| R5 (verification) | Feasible | The grep -rln '[一-鿿]' rule is reliable. Residual hits should land only in: prompt template strings (#2/#3/#4/#5), logger/API string literals (#6), and the tests/test_locale* files (intentional Chinese test data). |
| R6 (tracking/branching) | Feasible | Branch + commit conventions are standard for this repo; /done skill enforces them. |

Gaps and constraints

  • Constraint: Triple-quoted strings used as values (not as docstrings) must NOT be edited if their content is in scope of issues #2 to #6 (prompts/log messages/error messages). Disambiguation matters.
  • Constraint: Chinese characters appearing inside f-string literal segments must remain. They are out of scope.
  • Unknown / Research Needed: None — task is mechanical and well-bounded.
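
To separate translatable `#` comments from string and f-string content that must stay untouched, the standard `tokenize` module can be used. This is a minimal sketch; `cjk_comments` is a hypothetical helper name. It reports only `COMMENT` tokens, so Chinese text inside any kind of string literal is never listed.

```python
import io
import re
import tokenize

# Same character range as the grep rule quoted in the text.
CJK_RE = re.compile("[一-鿿]")


def cjk_comments(source):
    """Return (line_number, comment_text) for each # comment containing CJK."""
    found = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.COMMENT and CJK_RE.search(tok.string):
            found.append((tok.start[0], tok.string))
    return found
```

Docstrings are string tokens and are therefore not reported here; they still need the first-statement rule to be told apart from ordinary string values.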

Adjacent specs / overlap with other tickets

  • i18n-externalize-backend-logs (#6) owns translating logger.{info,warning,error} Chinese arguments and API response strings.
  • i18n-report-agent-prompts (#5), and tickets #2/#3/#4 own prompt template strings.
  • We must NOT touch any string literal that those tickets own. After this PR, residual grep hits should reduce by exactly the count of comments and docstrings translated and nothing else.
  • The two backend/tests/test_locale*.py files are not in the ticket's listed file scope, and inspection shows their Chinese is exclusively in string literals (test data and a Unicode range check). They are out of scope by R1's enumerated paths and remain untouched.

Implementation Approach Options

Option A: Single pass, file by file (single PR)

  • Walk the 37 in-scope files in a deterministic order (alphabetical within each area), translating docstrings/comments per file, running the residual grep after each batch.
  • Group commit by area (models, utils, services, api, scripts, root) to keep PR diff readable.
  • Simple, low risk, easy to revert per-area.
  • Maps directly to the requirements; easy to verify.
  • Larger PR than Option B, but the ticket explicitly allows a single PR.
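
The area-grouped walk can be sketched as follows. The area order is the one listed for commit grouping above; the path shapes (`backend/app/<area>/...`, `backend/scripts/...`, everything else treated as root) are assumptions read off the file table, and `area_of` and `batches` are hypothetical names.

```python
from pathlib import PurePosixPath

# Commit areas in the order used for grouping; root files come last.
AREA_ORDER = ["models", "utils", "services", "api", "scripts", "root"]


def area_of(path):
    """Map a repository-relative path to its commit area."""
    parts = PurePosixPath(path).parts
    # backend/app/<area>/<file>.py
    if len(parts) >= 4 and parts[:2] == ("backend", "app"):
        return parts[2]
    # backend/scripts/<file>.py
    if len(parts) >= 3 and parts[1] == "scripts":
        return "scripts"
    # backend/run.py, backend/app/__init__.py, backend/app/config.py
    return "root"


def batches(paths):
    """Return (area, files) pairs in AREA_ORDER, files sorted alphabetically."""
    grouped = {area: [] for area in AREA_ORDER}
    for path in sorted(paths):
        # Areas not listed in AREA_ORDER are appended after the known ones.
        grouped.setdefault(area_of(path), []).append(path)
    return [(area, files) for area, files in grouped.items() if files]
```

Each returned pair corresponds to one commit, with the residual scan re-run between pairs.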

Option B — Multi-PR per package

  • Split into one PR per package (models/, utils/, …). The ticket allows this.
  • Smaller diffs to review.
  • More overhead (multiple branches/PRs); not necessary for a mechanical change of this size.

Option C — Tooling-assisted bulk script

  • Build a one-shot translation script (LLM-driven) that rewrites docstrings/comments.
  • Could scale to other repos.
  • Out of proportion for a single-ticket task; risk of errant edits to string literals; tooling itself becomes a deliverable to test and maintain.

Effort and Risk

  • Effort: M (3 to 7 days of focused work) for 37 files and hundreds of comments. In an interactive AI-assisted run, this collapses to a few hours.
  • Risk: Low. The diff is comments-only, is covered by mechanical verification (grep + pytest), and is easy to roll back per file or area.

Recommendations for Design Phase

  • Preferred approach: Option A (single-pass file-by-file, package-grouped commits, single PR).
  • Key decisions to capture in design:
    • Order of traversal (proposed: models/ → utils/ → services/ → api/ → scripts/ → root files __init__.py, config.py, run.py).
    • Heuristic for dropping obvious comments (one-line rule).
    • How to handle Google-style docstring keys: always translate 参数: → Args:, 返回: → Returns:, 异常: → Raises:.
    • Verification cadence: re-run the grep after each package batch.
  • Research items to carry forward: None.
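
The docstring-key decision can be illustrated with a small sketch. It rewrites only section keys that stand alone on a line and leaves descriptions for manual translation. Accepting the full-width colon as well as the ASCII one is an assumption about how the existing files are written, and `normalize_section_keys` is a hypothetical helper name.

```python
import re

# Chinese section keys and their canonical Google-style equivalents.
SECTION_KEYS = {"参数": "Args", "返回": "Returns", "异常": "Raises"}

# A key alone on its line, followed by an ASCII or full-width colon.
_KEY_RE = re.compile(r"^([ \t]*)(参数|返回|异常)[ \t]*[::][ \t]*$", re.MULTILINE)


def normalize_section_keys(docstring):
    """Replace Chinese section keys with Args:/Returns:/Raises:, keeping indentation."""
    return _KEY_RE.sub(
        lambda m: f"{m.group(1)}{SECTION_KEYS[m.group(2)]}:",
        docstring,
    )
```

Applied to a docstring body containing `参数:` and `返回:` lines, this yields `Args:` and `Returns:` at the same indentation, which keeps the diff limited to the docstring itself.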