9.1 KiB
Gap Analysis — i18n-ci-guard
Comparison of the approved requirements against the current MiroFish codebase, focused on what already exists, what is missing, and what options the design phase should choose between.
1. Current State Investigation
Domain assets already in the repo
scripts/check_i18n_logs.py— Python-stdlib-only, exit-code-based i18n verification script. Uses the same canonical CJK regex[一-鿿](U+4E00..U+9FFF) the new guard needs, prints findings as<file>:<line>: <reason>: <snippet>, and was written for ticket #6. Strong precedent for the new guard's CLI surface and output format.scripts/_apply_translations.py,scripts/_codemod_i18n.py,scripts/_merge_locale_keys.py— i18n tooling sibling scripts. Convention is to keep auxiliary i18n scripts underscripts/at the repo root..github/workflows/docker-image.yml— only existing GH Actions workflow; triggers on tag pushes andworkflow_dispatch. No PR-time workflow exists yet, so the new guard introduces the project's first PR-blocking CI check.- PR #27 / branch
chore/i18n-10-e2e-english-verification— defines the audit methodology referenced by the ticket. Itsaudit_cjk.shusesgit grep -nIP '[\x{4e00}-\x{9fff}]' -- backend/app frontend/src locales/en.json— the canonical scoped scan command. PR #27 is open; the new guard must work with or without it merged. .kiro/specs/<feature>/— established home for spec artefacts.i18n-externalize-backend-logs/is the closest precedent for an i18n-flavoured spec.locales/en.json,locales/zh.json,locales/languages.json— shared i18n source consumed by both runtimes.
Conventions extracted
- Auxiliary scripts:
scripts/<purpose>.py, Python ≥3.11 stdlib only, shebang#!/usr/bin/env python3, double-quoted strings, snake_case, Google-style docstrings on the module and public functions. - Output format:
<file>:<line>: <reason>: <snippet>, summary lineOKorN issues, exit0/1. - Reuse the canonical regex
[一-鿿]rather than re-deriving range literals. - 4-space indent, ≤120 cols, no trailing whitespace, single trailing
newline (
.claude/rules/dev-guidelines.md).
Integration surfaces
- CI: GitHub Actions,
.github/workflows/.ubuntu-latestrunner, Python 3.11+ viaactions/setup-python@v5(use the same version pin already present in the docker-image workflow ecosystem if any). - Repo layout boundaries scoped by the audit:
backend/app/,frontend/src/,locales/en.json— all live at repo root or two levels deep. - Git working tree: the guard relies on
git grep -Ifor tracked, text-only matches; this binds the guard to a runner that hasgitavailable (true onubuntu-latestand on developer machines).
2. Requirement-to-Asset Map
| Req | Need | Existing asset | Gap |
|---|---|---|---|
| 1 | CJK scan of locales/en.json |
scripts/check_i18n_logs.py already loads locales/*.json and runs the canonical regex. |
Missing — new guard must scan en.json specifically and emit key:line per offender. |
| 2 | CJK count under backend/app/ and frontend/src/ against baseline |
Audit audit_cjk.sh (PR #27) demonstrates git grep -nIP is the canonical scan; no baseline file exists yet on main. |
Missing — no per-path counter, no baseline file. |
| 3 | Actionable failure messaging | check_i18n_logs.py output format reusable. |
Missing — need refresh-baseline command in failure text. |
| 4 | Baseline file lifecycle | None. | Missing — file format and refresh subcommand to design. |
| 5 | GH Actions PR integration | .github/workflows/ directory exists; one tag-only workflow. |
Missing — new pull_request workflow. |
| 6 | Local reproducibility | Existing scripts run locally with stdlib; same pattern reusable. | None — covered by following the existing pattern. |
3. Implementation Approach Options
Option A — Extend scripts/check_i18n_logs.py
Add a new --cjk-guard mode (catalogue scan + per-path baseline diff)
to the existing script, then call it from the new workflow.
- ✅ One file to maintain; reuses the regex constant and CLI.
- ❌ The existing script is tightly scoped to the in-scope backend modules and the parity check. Mixing a PR-gating regression check into it dilutes its intent and grows it past the SRP line that the surrounding scripts respect.
- ❌ The existing script targets a fixed list of backend modules; the new guard scans whole subtrees. The two scopes don't fit one CLI.
Option B — New, focused script scripts/ci/i18n_cjk_guard.py + new workflow (recommended)
A new directory scripts/ci/ holds CI-only scripts; the guard is a
single file that performs both checks and supports a --refresh-baseline
flag. New workflow .github/workflows/i18n-cjk-guard.yml runs it on
every PR to main.
- ✅ Clean separation: production-i18n script (
check_i18n_logs.py) and CI-gating script (i18n_cjk_guard.py) live side by side without overlapping responsibilities. - ✅ Mirrors the established convention of one script per
responsibility under
scripts/. - ✅ The baseline file lives under the spec dir
(
.kiro/specs/i18n-ci-guard/baseline.txt), matching the ticket's "baseline must be committed and reviewable" requirement. - ❌ One more file in the repo, but the file is small (~150 LoC).
Option C — Hybrid: shared cjk_scan.py helper + thin guard script
Factor the regex + git-grep logic into a tiny shared helper consumed by
both check_i18n_logs.py and the new guard.
- ✅ DRY for the regex constant.
- ❌ Premature abstraction: today the only shared element is one one-line regex. The two scripts have different scopes, output formats, and consumers. Pulling a helper out now satisfies consistency without paying for itself; defer until a third caller appears.
Recommendation
Option B. It matches the project's established "one focused script per responsibility" convention, isolates the new CI surface from existing i18n scripts, and keeps the baseline file collocated with spec metadata where reviewers expect to find it.
4. Research Items for Design Phase
- Baseline file format: prefer a stable, line-oriented text format
over JSON to minimize diff churn (e.g.,
path<TAB>countper line, trailing newline). Confirm in design. git grepinvocation portability:git grep -nIPworks on all modern git builds (≥2.4 ships PCRE2).ubuntu-latestships ≥2.40. No portability concern; record the assumption explicitly.fetch-depthfor theactions/checkout@v4step:git grepscans the working tree, not history, so a shallow clone (fetch-depth: 1) is sufficient.- Workflow timeout budget: capture the empirical runtime of the
full scan locally (already measured: a single
git grepover the scoped paths runs in <2 seconds with ~3.6k matches). The 60-second ceiling in Req 5 is comfortable. - Failure-message refresh command wording: the design should pin the exact command shown to contributors so it stays one stable string developers can copy.
- Initial baseline values: with
git grep -nIP '[\x{4e00}-\x{9fff}]'on the current branch —backend/app= 2707,frontend/src= 902,locales/en.json= 0. The committed baseline must be regenerated againstmainat implementation time so it reflects the merge target.
5. Effort & Risk
- Effort: S (1–3 days). Small, self-contained additions (one Python script, one workflow file, one baseline file, plus the spec). All patterns already exist in the repo.
- Risk: Low. No production-source changes, no new dependencies, no architectural shifts. The only failure mode is a noisy guard blocking unrelated PRs — mitigated by the per-path baseline ratchet.
6. Recommendations for Design Phase
- Adopt Option B (new focused script + new workflow + baseline file under spec dir).
- Lock in the canonical regex
[一-鿿]and the canonical scan commandgit grep -nIP '[\x{4e00}-\x{9fff}]' -- <path>to keep this guard bytewise-aligned with the audit pipeline. - Use a line-oriented baseline format keyed by scoped path; explicit
--refresh-baseline(or equivalent) subcommand updates it; no implicit overwrite. - Output: machine-friendly findings on stderr, summary on stdout,
exit
0/1. - The workflow should run only on
pull_requesttomain(Req 5.1) withfetch-depth: 1andactions/setup-python@v5. No third-party packages. - Baseline counts must be recomputed against
mainbefore the PR ships; do not commit baselines from a feature branch's working tree.