MicroFish/.kiro/specs/i18n-translate-backend-comm.../research.md

5.9 KiB

Research & Design Decisions — i18n-translate-backend-comments

Summary

  • Feature: i18n-translate-backend-comments
  • Discovery Scope: Simple Addition (mechanical translation, no architectural change)
  • Key Findings:
    • 37 in-scope backend/ Python files contain Chinese characters in docstrings or # comments. The full list is in gap-analysis.md.
    • Existing docstrings mix English-shape Google-style keys (Args:/Returns:) with Chinese descriptions, and a smaller subset uses Chinese keys (参数:/返回:/异常:). Both patterns must converge to canonical English Google-style.
    • Several tests/test_locale*.py files contain Chinese only inside string literals (intentional test data) and are out of scope by the ticket's enumerated paths.

Research Log

Discovery scan: where is Chinese in backend/?

  • Context: Need a deterministic enumeration of files to translate.
  • Sources Consulted: grep/Python-driven scan against backend/**/*.py.
  • Findings:
    • 37 in-app files (under backend/app/, backend/run.py, backend/scripts/).
    • 2 additional test files in backend/tests/ whose Chinese is only in string literals; not in ticket scope.
    • .venv/ matches are noise and excluded.
  • Implications: The ticket-listed paths are exhaustive; no unexpected location. Order of traversal can be alphabetical within package groups.

Disambiguation: docstring vs string literal

  • Context: A triple-quoted string is a docstring iff it is the first statement of a module, class, or function body. Otherwise it is a value (e.g. a prompt template) owned by adjacent tickets.
  • Sources Consulted: Python language reference; spot inspection of services/ontology_generator.py, services/report_agent.py.
  • Findings:
    • In-scope files contain both kinds of triple-quoted strings.
    • Translating only the first-statement triple-quoted string per scope keeps the change comments-and-docstrings-only.
  • Implications: Translation pass must visually verify each triple-quoted string is the first statement before rewriting; otherwise leave it alone.

Google-style docstring conversions

  • Context: dev-guidelines.md requires Google-style docstrings; existing Chinese docstrings sometimes use Chinese keys.
  • Findings: The following key map applies:
    • 参数:Args:
    • 返回:Returns:
    • 异常:Raises:
    • 产生: / 生成:Yields:
    • 示例:Example: (or Examples:)
    • 注意: / 备注:Note: (or Notes:)
  • Implications: Document this mapping in design.md so the implementation pass is mechanical.

Architecture Pattern Evaluation

Option Description Strengths Risks / Limitations Notes
Manual file-by-file pass Walk in alphabetical order, package-grouped commits Predictable, easy to review per package Human time required Selected approach
Multi-PR per package One PR per backend package Smaller diffs to review Higher overhead, more PR churn Allowed by ticket but not required
Tooling-assisted bulk script LLM-driven find-and-replace tool Reusable Risk of touching string literals; tool itself becomes a deliverable Out of proportion

Design Decisions

Decision: Single-pass, package-grouped commits, single PR

  • Context: 37 files, mechanical change, ticket allows either single or split PRs.
  • Alternatives Considered:
    1. Multi-PR per package — more granular review but higher overhead.
    2. Tooling-assisted bulk script — overkill for one ticket.
  • Selected Approach: Single PR with one or more commits, grouped by package (models/, utils/, services/, api/, scripts/, root) so reviewers can read the diff one package at a time.
  • Rationale: Mechanical change with low risk; ticket explicitly allows it; reduces PR overhead; /done produces one PR per branch by default.
  • Trade-offs: One large PR, but partitioned by commit. Reviewer can use commit history to navigate.
  • Follow-up: After each package commit, re-run residual grep and pytest to maintain the invariant.

Decision: First-statement disambiguation rule

  • Context: Distinguish docstrings (in scope) from value strings (out of scope).
  • Selected Approach: A triple-quoted string is treated as a docstring (in scope) only if it is the first statement of a module / class / function body. All other triple-quoted strings are values (out of scope).
  • Rationale: Matches Python's own definition; keeps boundary with adjacent tickets unambiguous.

Decision: Drop comments that restate code

  • Context: R3 requires deletion of comments whose translated form would merely paraphrase the next line.
  • Selected Approach: Apply a one-line heuristic: if the translated comment would be a verb phrase that mirrors the immediately following executable line, delete the comment instead of writing it.
  • Rationale: Aligns with project rule "comment the why, not the what".

Risks & Mitigations

  • Risk: Accidental edit to a string literal (would belong to ticket #2/#3/#4/#5/#6) — Mitigation: After each package commit, run git diff --stat and a per-file diff sanity check; verify only # lines and docstring lines change.
  • Risk: Tests failing because a string-shape changed — Mitigation: Run uv run python -m pytest backend/scripts/test_profile_format.py after each commit.
  • Risk: Line length violations after English expansion — Mitigation: Reflow long English at <= 120 chars within the docstring/comment only; never reflow code.

References

  • dev-guidelines.md — repo-level coding standards, Google-style docstring requirement.
  • .claude/rules/commits.md — Conventional Commits standard for the commit message.
  • Issue #7 — salestech-group/MiroFish: source ticket.
  • Issues #2/#3/#4/#5/#6 — adjacent i18n tickets that own the string-literal Chinese.