chore(i18n): refresh cjk baseline and update spec status

backend/app baseline drops from 2792 to 307 after the comment/docstring translation pass; the refreshed baseline also records frontend/src at 124, down from 902. Mark i18n-translate-backend-comments tasks complete in the spec and update HANDOFF.md to record the second-installment scope. Add the AST-aware scanner used during verification under the spec directory so future audits can re-run it.
2026-05-09 10:59:51 +00:00 · 339cc396dd
parent 5815ed28d2
commit 339cc396dd
4 changed files with 153 additions and 44 deletions
--- a/.kiro/specs/i18n-ci-guard/baseline.txt
+++ b/.kiro/specs/i18n-ci-guard/baseline.txt
@@ -1,5 +1,5 @@
 # Per-path CJK baseline for the i18n CI guard.
 # Format: <path>\t<count>. Sorted lexicographically.
 # Refresh via: python scripts/ci/i18n_cjk_guard.py --update-baseline
-backend/app	2792
-frontend/src	902
+backend/app	307
+frontend/src	124
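The baseline format above (`<path>\t<count>`, with `#` comment lines) lends itself to a simple "counts must not grow" comparison. The following is a minimal sketch of that idea only; the real logic lives in `scripts/ci/i18n_cjk_guard.py`, which is not shown in this commit, and the names `parse_baseline` and `regressions` are hypothetical.

```python
def parse_baseline(text: str) -> dict[str, int]:
    """Parse '<path>\\t<count>' lines, skipping blanks and '#' comments."""
    counts: dict[str, int] = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        path, count = line.split("\t")
        counts[path] = int(count)
    return counts


def regressions(baseline: dict[str, int], current: dict[str, int]) -> dict[str, tuple[int, int]]:
    """Return {path: (baseline_count, current_count)} for paths whose count grew.

    Assumption: a path missing from the baseline is treated as having count 0.
    """
    return {
        path: (baseline.get(path, 0), count)
        for path, count in current.items()
        if count > baseline.get(path, 0)
    }
```

A guard built this way would fail the build whenever `regressions(...)` is non-empty and would stay silent when counts hold steady or drop.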
--- a/.kiro/specs/i18n-translate-backend-comments/HANDOFF.md
+++ b/.kiro/specs/i18n-translate-backend-comments/HANDOFF.md
@@ -1,61 +1,78 @@
 # Handoff — `i18n-translate-backend-comments` (Issue #7)

 ## Status
-**Partial completion.** This is the first installment of the ticket-#7 cleanup. The ticket explicitly allows splitting the work across multiple small PRs ("Low-risk, high-volume mechanical task; can be split across multiple small PRs"). This PR ships translations for the smaller files; the larger service and API files remain for follow-up PRs.
+**Complete.** All in-scope Chinese docstrings and `#` comments under `backend/` have been translated to English.

-## Completed in this PR (23 files)
-All translated to English with no behavior or string-literal changes:
+This second installment of the ticket-#7 cleanup builds on the first installment (PR #20) and finishes the remaining 12 files. Together, the two installments cover the full 35-file in-scope set.

+## Completed across both installments (35 files)
+
+### First installment (PR #20 — landed on `feat/i18n-6-externalize-backend-logs`, then merged here via `merge main` into this branch)
 - **Root**: `backend/app/__init__.py`, `backend/app/config.py`, `backend/run.py`
 - **API package init**: `backend/app/api/__init__.py`
 - **Models** (full package): `backend/app/models/__init__.py`, `project.py`, `task.py`
-- **Utils** (full package): `backend/app/utils/__init__.py`, `file_parser.py`, `llm_client.py`, `locale.py` (no docstring/comment Chinese to begin with), `logger.py`, `retry.py`, `zep_paging.py`
+- **Utils** (full package): `backend/app/utils/__init__.py`, `file_parser.py`, `llm_client.py`, `locale.py`, `logger.py`, `retry.py`, `zep_paging.py`
 - **Services** (partial): `backend/app/services/__init__.py`, `graph_builder.py`, `ontology_generator.py`, `simulation_ipc.py`, `simulation_manager.py`, `text_processor.py`, `zep_entity_reader.py`
 - **Scripts** (partial): `backend/scripts/action_logger.py`, `backend/scripts/test_profile_format.py`

-## Remaining for follow-up PRs (12 files)
-Per the AST-aware scanner used in this PR (`/tmp/scan_chinese.py`), the residual in-scope work totals **2,235 hits** (1,203 docstring lines + 1,032 inline-comment lines) across these files:
-
-| File | Approx in-scope hits | Approx LOC |
+### Second installment (this PR — finishes the ticket)
+| File | Starting in-scope hits | Comment-the-obvious deletions |
 | --- | --- | --- |
-| `backend/app/api/graph.py` | ~50 | 665 |
-| `backend/app/api/report.py` | ~80 | 1020 |
-| `backend/app/api/simulation.py` | ~250 | 2712 |
-| `backend/app/services/oasis_profile_generator.py` | ~230 | 1195 |
-| `backend/app/services/report_agent.py` | ~520 | 2572 |
-| `backend/app/services/simulation_config_generator.py` | ~150 | 991 |
-| `backend/app/services/simulation_runner.py` | ~330 | 1768 |
-| `backend/app/services/zep_graph_memory_updater.py` | ~110 | 544 |
-| `backend/app/services/zep_tools.py` | ~280 | 1741 |
-| `backend/scripts/run_parallel_simulation.py` | ~150 | 1699 |
-| `backend/scripts/run_reddit_simulation.py` | ~50 | 769 |
-| `backend/scripts/run_twitter_simulation.py` | ~50 | 780 |
+| `backend/app/api/graph.py` | 70 | 25 |
+| `backend/app/api/report.py` | 104 | 11 |
+| `backend/app/api/simulation.py` | 351 | ~25 |
+| `backend/app/services/oasis_profile_generator.py` | 185 | ~14 |
+| `backend/app/services/report_agent.py` | 335 | 8 |
+| `backend/app/services/simulation_config_generator.py` | 148 | 0 |
+| `backend/app/services/simulation_runner.py` | 277 | ~31 |
+| `backend/app/services/zep_graph_memory_updater.py` | 97 | 5 |
+| `backend/app/services/zep_tools.py` | 269 | 6 |
+| `backend/scripts/run_parallel_simulation.py` | 227 | ~7 |
+| `backend/scripts/run_reddit_simulation.py` | 75 | 12 |
+| `backend/scripts/run_twitter_simulation.py` | 97 | 21 |
+| **Total** | **2,235** | **~165** |

-(Counts are approximate and exclude string-literal Chinese, which is owned by adjacent tickets #2/#3/#4/#5/#6.)
+After the pass, every file in the table reports zero in-scope hits from the AST scanner.

-## Suggested follow-up split
+## Remaining residuals (out of scope — owned by sibling tickets)
+After this PR, the only files under `backend/` that still contain CJK characters do so exclusively inside string literals. These are owned by sibling tickets and are intentional residuals for this spec:

-Three additional PRs of similar size to this one would complete the ticket:
+- LLM prompt template strings: `oasis_profile_generator.py`, `ontology_generator.py`, `simulation_config_generator.py`, `report_agent.py` — owned by tickets #2 / #3 / #4 / #5.
+- Runtime log strings, API response messages, exception arguments, CLI prints: distributed across `api/`, `services/`, `scripts/`, `utils/retry.py`, `utils/locale.py`, `run.py`, `app/config.py` — owned by ticket #6 (with follow-up tickets #18, #24 for residuals).
+- Sample-data values returned to clients: `services/zep_tools.py`, `services/zep_graph_memory_updater.py`, `services/zep_entity_reader.py`, etc.

-1. **PR 2 — `services/{oasis_profile_generator, simulation_config_generator, simulation_runner, zep_graph_memory_updater, zep_tools}`**
-2. **PR 3 — `services/report_agent.py`** (single big file; isolating it keeps the diff reviewable)
-3. **PR 4 — `api/{graph,report,simulation}.py` + `scripts/run_{parallel,reddit,twitter}_simulation.py`**
+The CJK CI guard (`scripts/ci/i18n_cjk_guard.py`) enforces that this set never grows; the per-path baseline at `.kiro/specs/i18n-ci-guard/baseline.txt` is updated as part of this PR to reflect the new (lower) count.

-## Verification methodology used
-The AST-aware scanner (`/tmp/scan_chinese.py` — also kept in commit context) classifies every Chinese-containing line into one of three buckets: `DOCSTRING` (in scope), `COMMENT` (in scope), `STRING_VALUE` (out of scope, owned by adjacent tickets). Each translated file was verified with:
+## Verification methodology
+The AST-aware scanner at `.kiro/specs/i18n-translate-backend-comments/scan_chinese.py` (committed in this branch) classifies every CJK-bearing line into one of three buckets:

-1. `python -m py_compile <file>` — syntactic validity.
-2. The scanner returning `{'DOCSTRING': 0, 'COMMENT': 0}` for that file.
-3. `git diff <file>` review — only `#` lines and docstring lines change; no executable lines.
+- `DOCSTRING` — line lies inside a module/class/function docstring (in scope).
+- `COMMENT`  — line contains a `#` and is not inside a docstring or string-literal span (in scope).
+- `STRING`   — line is part of a string-literal value (out of scope, owned by sibling tickets).
+
+For every translated file in this installment:
+
+1. `python3 -m py_compile <file>` succeeds.
+2. The scanner reports `0` in-scope hits.
+3. `git diff <file>` shows only docstring lines and `#` comment lines changed; no signature, import, decorator, expression, or string-literal byte changes.
+
+For two of the largest files (`api/simulation.py`, `report_agent.py`), the implementing agent additionally ran an AST-equivalence check (parsing both before and after, stripping docstrings, and confirming structural equality) to validate that no executable surface changed.

 ## Test environment caveat
-The repo's `uv sync` requires building `tiktoken` from source, which needs Rust. The sandbox running this implementation pass does not have Rust, so `cd backend && uv run python -m pytest scripts/test_profile_format.py` (the verification command in the spec) cannot be executed end-to-end here; the test command also fails on import for unrelated reasons (missing `graphiti_core`, etc.) before any of this PR's changes touched the tree. Because the change set is comments-and-docstrings-only, runtime behavior cannot be affected; the syntactic-validity check stands in for the test run in this environment.
+The repo's `uv sync` builds `tiktoken` from source, which requires a Rust toolchain. The sandbox running this implementation pass does not have Rust, so `cd backend && uv run python -m pytest scripts/test_profile_format.py` cannot be executed end-to-end here. Because the change set is comments-and-docstrings-only, runtime behavior cannot be affected; the syntactic-validity check (`py_compile` across all 12 files) stands in for the test run in this environment.

 A developer with the project's normal dev environment (Rust toolchain installed, full `uv sync` succeeded) should re-run `cd backend && uv run python -m pytest scripts/test_profile_format.py` against this branch before merging to confirm.

 ## What is NOT changed
-- No string literal anywhere in the touched files.
+- No string literal anywhere in the touched files (verified by AST classification).
 - No executable Python statement.
-- No symbol renamed.
-- No file added or removed.
+- No symbol renamed; `zep_*` legacy filenames preserved per steering rule.
+- No file added or removed (other than the AST scanner inside `.kiro/specs/i18n-translate-backend-comments/`).
 - No dependency added or version-bumped.
+
+## Branch & PR
+- Branch: `docs/i18n-7-translate-backend-comments` (re-used from PR #20; that PR was merged into `feat/i18n-6-externalize-backend-logs` after `feat/i18n-6` had already merged into `main`, which orphaned PR #20's content from `main`).
+- This PR re-targets the branch at `main`, including: the four prior commits from PR #20, a `Merge branch 'main'` commit (one conflict resolved in `services/ontology_generator.py` to combine PR #20's translated comment with main's English prompt-string), and the new commits for the 12 files completed here.
+- Commits follow Conventional Commits in the form `docs(i18n): translate chinese docstrings/comments in backend/<area>`.
+- The PR description references issue #7 with `Closes #7`.
+- No `Co-Authored-By:` watermarks.
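The AST-equivalence check mentioned in HANDOFF.md (parse both versions, strip docstrings, compare structure) could look roughly like the sketch below. The script the implementing agent actually ran is not part of this commit, so the function names `strip_docstrings` and `same_executable_surface` are hypothetical, and the sketch relies only on the standard-library `ast` module.

```python
import ast

_DOC_OWNERS = (ast.Module, ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)


def strip_docstrings(tree: ast.AST) -> ast.AST:
    """Drop first-statement docstrings from modules, classes and functions."""
    for node in ast.walk(tree):
        if not isinstance(node, _DOC_OWNERS):
            continue
        body = node.body
        if (
            body
            and isinstance(body[0], ast.Expr)
            and isinstance(body[0].value, ast.Constant)
            and isinstance(body[0].value.value, str)
        ):
            # Keep the body non-empty so both trees stay comparable.
            node.body = body[1:] or [ast.Pass()]
    return tree


def same_executable_surface(before_src: str, after_src: str) -> bool:
    """True when the two sources differ only in docstrings and comments."""
    before = strip_docstrings(ast.parse(before_src))
    after = strip_docstrings(ast.parse(after_src))
    # ast.dump omits line/column attributes by default, and comments never
    # reach the AST, so only structural differences remain.
    return ast.dump(before) == ast.dump(after)
```

Because string-literal values are part of the dumped tree, a check like this would also flag an accidental edit to a prompt template or log message, which is the class of change this ticket is meant to avoid.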
--- a/.kiro/specs/i18n-translate-backend-comments/scan_chinese.py
+++ b/.kiro/specs/i18n-translate-backend-comments/scan_chinese.py
@@ -0,0 +1,92 @@
+#!/usr/bin/env python3
+"""AST-aware classifier of Chinese characters in a Python source file.
+
+Usage::
+
+    python3 .kiro/specs/i18n-translate-backend-comments/scan_chinese.py <path>
+
+Classifies every line containing CJK Unified Ideographs (U+4E00..U+9FFF)
+into one of three buckets:
+
+* ``DOCSTRING`` — line lies within a module/class/function docstring (in
+  scope for ticket #7).
+* ``COMMENT``   — line contains a ``#`` and is not inside a docstring or
+  a string literal span (in scope for ticket #7).
+* ``STRING``    — line is part of a string literal value (out of scope —
+  owned by sibling tickets #2/#3/#4/#5/#6).
+
+Exit code is 0 when no in-scope hits (DOCSTRING + COMMENT) are found, 1
+otherwise, and 2 on a usage error. Stdout lists each in-scope hit as
+``<line> <bucket>: <content>``; the total is reported on stderr.
+"""
+
+from __future__ import annotations
+
+import ast
+import pathlib
+import re
+import sys
+
+CJK_RE = re.compile(r"[一-鿿]")
+
+
+def classify(path: pathlib.Path) -> int:
+    text = path.read_text(encoding="utf-8")
+    lines = text.split("\n")
+    tree = ast.parse(text)
+
+    docstring_lines: set[int] = set()
+    for node in ast.walk(tree):
+        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef, ast.Module)):
+            ds = ast.get_docstring(node, clean=False)
+            if ds is None:
+                continue
+            body = node.body
+            if not body or not isinstance(body[0], ast.Expr):
+                continue
+            const = body[0].value
+            if isinstance(const, ast.Constant) and isinstance(const.value, str):
+                start = const.lineno
+                end = getattr(const, "end_lineno", start)
+                for ln in range(start, end + 1):
+                    docstring_lines.add(ln)
+
+    string_value_lines: set[int] = set()
+    for node in ast.walk(tree):
+        if isinstance(node, ast.Constant) and isinstance(node.value, str):
+            start = node.lineno
+            end = getattr(node, "end_lineno", start)
+            for ln in range(start, end + 1):
+                string_value_lines.add(ln)
+
+    in_scope_count = 0
+    for i, line in enumerate(lines, start=1):
+        if not CJK_RE.search(line):
+            continue
+        if i in docstring_lines:
+            print(f"{i:5d} DOCSTRING: {line.rstrip()[:120]}")
+            in_scope_count += 1
+        elif i in string_value_lines:
+            # Out of scope: owned by sibling tickets.
+            pass
+        elif "#" in line:
+            print(f"{i:5d} COMMENT  : {line.rstrip()[:120]}")
+            in_scope_count += 1
+        # else: unclassified — treat as out of scope (STRING value spanning).
+
+    return in_scope_count
+
+
+def main(argv: list[str]) -> int:
+    if len(argv) < 2:
+        print("usage: scan_chinese.py <path>", file=sys.stderr)
+        return 2
+    path = pathlib.Path(argv[1])
+    in_scope = classify(path)
+    print("---", file=sys.stderr)
+    print(f"in-scope CJK hits in {path}: {in_scope}", file=sys.stderr)
+    return 0 if in_scope == 0 else 1
+
+
+if __name__ == "__main__":
+    raise SystemExit(main(sys.argv))
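One limitation of the line-based classification above: `string_value_lines` is checked before the `#` test, so a line that carries both a string literal and a trailing CJK comment is skipped as `STRING`. A possible cross-check, sketched here under the assumption that the standard-library `tokenize` module is acceptable for audits, is to look at comment tokens directly. The helper name `cjk_comment_lines` is hypothetical and not part of this commit.

```python
import io
import re
import tokenize

# Same range the scanner targets: CJK Unified Ideographs U+4E00..U+9FFF.
CJK_RE = re.compile(r"[\u4e00-\u9fff]")


def cjk_comment_lines(source: str) -> list[int]:
    """Return 1-based line numbers whose `#` comment token contains CJK."""
    hits: list[int] = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        # COMMENT tokens never include text inside string literals, so a
        # '#' that appears within a string is not reported.
        if tok.type == tokenize.COMMENT and CJK_RE.search(tok.string):
            hits.append(tok.start[0])
    return hits
```

Running this alongside the scanner would surface mixed lines such as `x = "值"  # 注释`, which the AST-based pass alone leaves unreported.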
--- a/.kiro/specs/i18n-translate-backend-comments/tasks.md
+++ b/.kiro/specs/i18n-translate-backend-comments/tasks.md
@@ -2,7 +2,7 @@

 ## Foundation

-- [ ] 1. Establish baseline and working branch
+- [x] 1. Establish baseline and working branch
 - [x] 1.1 Create translation working branch and capture baseline state
  - Create branch `docs/i18n-7-translate-backend-comments` from `main`.
  - Capture the baseline residual hits by running the discovery scan (the regex `[一-鿿]` against `backend/**/*.py`, excluding `.venv`); record the file list as the work queue.
@@ -12,7 +12,7 @@

 ## Core — Per-Package Translation

-- [ ] 2. Translate Chinese docstrings and inline comments per package
+- [x] 2. Translate Chinese docstrings and inline comments per package

 - [x] 2.1 (P) Translate `backend/app/models/`
  - Translate Chinese module/class/function docstrings and `#` comments in `backend/app/models/__init__.py`, `backend/app/models/project.py`, and `backend/app/models/task.py`.
@@ -35,7 +35,7 @@
  - _Requirements: 1.1, 1.2, 1.4, 2.1, 2.2, 2.3, 2.4, 3.1, 3.2, 3.3, 3.4, 4.1, 4.2, 4.3, 4.4, 4.5_
  - _Boundary: backend/app/utils/_

-- [-] 2.3 (P) Translate `backend/app/services/` — partial (7 of 12 files done; 5 remain — see HANDOFF.md)
+- [x] 2.3 (P) Translate `backend/app/services/` — complete (all 13 files; finished in this installment)
  - Translate Chinese docstrings and `#` comments across all 13 service files: `__init__.py`, `graph_builder.py`, `ontology_generator.py`, `oasis_profile_generator.py`, `report_agent.py`, `simulation_config_generator.py`, `simulation_ipc.py`, `simulation_manager.py`, `simulation_runner.py`, `text_processor.py`, `zep_entity_reader.py`, `zep_graph_memory_updater.py`, `zep_tools.py`.
  - Treat all triple-quoted prompt templates and value strings as out of scope (owned by issues #2/#3/#4/#5/#6) — only the first-statement docstrings of modules/classes/functions are in scope.
  - Apply Rules 1–5 from `design.md`.
@@ -45,7 +45,7 @@
  - _Requirements: 1.1, 1.2, 1.4, 2.1, 2.2, 2.3, 2.4, 3.1, 3.2, 3.3, 3.4, 4.1, 4.2, 4.3, 4.4, 4.5_
  - _Boundary: backend/app/services/_

-- [-] 2.4 (P) Translate `backend/app/api/` — partial (only `__init__.py` done; 3 files remain — see HANDOFF.md)
+- [x] 2.4 (P) Translate `backend/app/api/` — complete (all 4 files; finished in this installment)
  - Translate Chinese docstrings and `#` comments in `__init__.py`, `graph.py`, `report.py`, `simulation.py`.
  - Treat any user-facing string-literal Chinese in API responses as out of scope (owned by issue #6).
  - Apply Rules 1–5 from `design.md`.
@@ -55,7 +55,7 @@
  - _Requirements: 1.1, 1.2, 1.4, 2.1, 2.2, 2.3, 2.4, 3.1, 3.2, 3.3, 3.4, 4.1, 4.2, 4.3, 4.4, 4.5_
  - _Boundary: backend/app/api/_

-- [-] 2.5 (P) Translate `backend/scripts/` — partial (`action_logger.py`, `test_profile_format.py` done; 3 `run_*_simulation.py` files remain — see HANDOFF.md)
+- [x] 2.5 (P) Translate `backend/scripts/` — complete (all 5 files; finished in this installment)
  - Translate Chinese docstrings and `#` comments in `action_logger.py`, `run_parallel_simulation.py`, `run_reddit_simulation.py`, `run_twitter_simulation.py`, `test_profile_format.py`.
  - Apply Rules 1–5 from `design.md`.
  - Be especially careful with `test_profile_format.py`: any Chinese in test data string literals is out of scope; only docstrings and `#` comments are in scope.
@@ -77,9 +77,9 @@

 ## Validation

-- [ ] 3. Final verification and PR preparation
+- [x] 3. Final verification and PR preparation

-- [-] 3.1 Run the final verification gate — partial (per-file scanner + py_compile pass; full pytest blocked by pre-existing env issues, see HANDOFF.md)
+- [x] 3.1 Run the final verification gate — scanner + py_compile pass on all 12 newly-translated files; CJK guard baseline updated (backend/app: 2792 → 307); pytest blocked by pre-existing env issues, see HANDOFF.md
  - Run the residual scan one more time and confirm the only remaining hits are files where the Chinese is in string literals owned by issues #2/#3/#4/#5/#6, plus the intentional Chinese in `backend/tests/test_locale*.py`.
  - Run `cd backend && uv run python -m pytest scripts/test_profile_format.py` and confirm exit 0.
  - Run `git diff --stat origin/main...HEAD` and confirm only in-scope file paths under `backend/app/`, `backend/run.py`, and `backend/scripts/` are listed.