# Gap Analysis: i18n-ci-guard

Comparison of the approved requirements against the current MiroFish
codebase, focused on what already exists, what is missing, and which
options the design phase should choose between.

## 1. Current State Investigation

### Domain assets already in the repo

- **`scripts/check_i18n_logs.py`**: Python-stdlib-only, exit-code-based
  i18n verification script. Uses the same canonical CJK regex
  `[一-鿿]` (`U+4E00..U+9FFF`) the new guard needs, prints findings as
  `<file>:<line>: <reason>: <snippet>`, and was written for ticket #6.
  Strong precedent for the new guard's CLI surface and output format.
- **`scripts/_apply_translations.py`, `scripts/_codemod_i18n.py`,
  `scripts/_merge_locale_keys.py`**: sibling i18n tooling scripts.
  Convention is to keep auxiliary i18n scripts under `scripts/` at the
  repo root.
- **`.github/workflows/docker-image.yml`**: the only existing GH Actions
  workflow; triggers on tag pushes and `workflow_dispatch`. No PR-time
  workflow exists yet, so the new guard introduces the project's first
  PR-blocking CI check.
- **PR #27 / branch `chore/i18n-10-e2e-english-verification`**: defines
  the audit methodology referenced by the ticket. Its `audit_cjk.sh`
  uses `git grep -nIP '[\x{4e00}-\x{9fff}]' -- backend/app frontend/src
  locales/en.json`, the canonical scoped scan command. PR #27 is open;
  the new guard must work with or without it merged.
- **`.kiro/specs/<feature>/`**: established home for spec artefacts.
  `i18n-externalize-backend-logs/` is the closest precedent for an
  i18n-flavoured spec.
- **`locales/en.json`, `locales/zh.json`, `locales/languages.json`**:
  shared i18n source consumed by both runtimes.

### Conventions extracted

- Auxiliary scripts: `scripts/<purpose>.py`, Python ≥3.11 stdlib only,
  shebang `#!/usr/bin/env python3`, double-quoted strings, snake_case,
  Google-style docstrings on the module and public functions.
- Output format: `<file>:<line>: <reason>: <snippet>`, summary line
  `OK` or `N issues`, exit `0`/`1`.
- Reuse the canonical regex `[一-鿿]` rather than re-deriving range
  literals.
- 4-space indent, ≤120 cols, no trailing whitespace, single trailing
  newline (`.claude/rules/dev-guidelines.md`).

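The output convention above can be illustrated with a minimal sketch. The function names `format_findings` and `summarize` and the reason string are hypothetical, chosen only for illustration; the regex is written with `\u` escapes so the file stays ASCII-safe, and it covers the same `U+4E00..U+9FFF` range as the canonical literal.

```python
import re

# Same range as the canonical regex (U+4E00..U+9FFF), spelled with escapes.
CJK_RE = re.compile("[\u4e00-\u9fff]")


def format_findings(path: str, text: str, reason: str = "CJK character") -> list[str]:
    """Return one `<file>:<line>: <reason>: <snippet>` string per offending line.

    The default reason text is an assumption, not the wording used by the
    existing script.
    """
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if CJK_RE.search(line):
            findings.append(f"{path}:{lineno}: {reason}: {line.strip()}")
    return findings


def summarize(findings: list[str]) -> tuple[str, int]:
    """Return the summary line (`OK` or `N issues`) and the exit code (0 or 1)."""
    if not findings:
        return "OK", 0
    return f"{len(findings)} issues", 1
```

A caller would print each finding, then the summary line, and pass the returned code to `sys.exit`.
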
### Integration surfaces

- **CI**: GitHub Actions, `.github/workflows/`. `ubuntu-latest` runner,
  Python 3.11+ via `actions/setup-python@v5` (reuse the version pin
  from the existing docker-image workflow, if it has one).
- **Repo layout boundaries** scoped by the audit: `backend/app/`,
  `frontend/src/`, `locales/en.json`. All sit at most two levels below
  the repo root.
- **Git working tree**: the guard relies on `git grep -I` for tracked,
  text-only matches; this binds the guard to a runner that has `git`
  available (true on `ubuntu-latest` and on developer machines).

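As a rough sketch of the git dependency described above, the scoped scan could be wrapped with `subprocess`. The helper names `scan_path` and `count_match_lines` are hypothetical, and the sketch assumes the script runs inside a git working tree whose `git` build supports `-P`.

```python
import subprocess

# PCRE spelling of the canonical range, as used by the audit's scan command.
CJK_PCRE = r"[\x{4e00}-\x{9fff}]"


def count_match_lines(grep_output: str) -> int:
    """Count non-empty `path:line:content` records in `git grep -n` output."""
    return sum(1 for line in grep_output.splitlines() if line)


def scan_path(path: str, repo_root: str = ".") -> int:
    """Run the scoped scan for one path and return the number of matching lines."""
    proc = subprocess.run(
        ["git", "grep", "-nIP", CJK_PCRE, "--", path],
        cwd=repo_root,
        capture_output=True,
        encoding="utf-8",
        errors="replace",
        check=False,
    )
    # git grep exits 0 when it finds matches and 1 when it finds none;
    # any other status is treated as a real failure.
    if proc.returncode not in (0, 1):
        raise RuntimeError(f"git grep failed for {path}: {proc.stderr.strip()}")
    return count_match_lines(proc.stdout)
```

Calling `scan_path("backend/app")` from the repo root would return the per-path count the guard compares against its baseline.
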
## 2. Requirement-to-Asset Map

| Req | Need | Existing asset | Gap |
| --- | ---- | -------------- | --- |
| 1 | CJK scan of `locales/en.json` | `scripts/check_i18n_logs.py` already loads `locales/*.json` and runs the canonical regex. | Missing: new guard must scan `en.json` specifically and emit `key:line` per offender. |
| 2 | CJK count under `backend/app/` and `frontend/src/` against baseline | `audit_cjk.sh` (PR #27) demonstrates that `git grep -nIP` is the canonical scan; no baseline file exists yet on `main`. | Missing: no per-path counter, no baseline file. |
| 3 | Actionable failure messaging | `check_i18n_logs.py` output format is reusable. | Missing: need refresh-baseline command in failure text. |
| 4 | Baseline file lifecycle | None. | Missing: file format and refresh subcommand to design. |
| 5 | GH Actions PR integration | `.github/workflows/` directory exists; one tag-only workflow. | Missing: new `pull_request` workflow. |
| 6 | Local reproducibility | Existing scripts run locally with stdlib; same pattern reusable. | None: covered by following the existing pattern. |

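One possible way to produce the `key:line` output named in Req 1 is to scan the raw catalogue text instead of the parsed JSON, since the stdlib `json` module does not expose line numbers. The sketch below is hypothetical and assumes a pretty-printed catalogue with one `"key": "value"` pair per line; `scan_catalogue` is an illustrative name.

```python
import re

CJK_RE = re.compile("[\u4e00-\u9fff]")
# Matches the leading `"key":` of a pretty-printed JSON line (assumed layout).
KEY_RE = re.compile(r'^\s*"((?:[^"\\]|\\.)*)"\s*:')


def scan_catalogue(text: str) -> list[str]:
    """Return `key:line` for every catalogue line that contains a CJK character."""
    offenders = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not CJK_RE.search(line):
            continue
        match = KEY_RE.match(line)
        # Fall back to a placeholder when the line does not start with a key.
        key = match.group(1) if match else "<unknown>"
        offenders.append(f"{key}:{lineno}")
    return offenders
```

This reports only the innermost key on the offending line, and it does not see CJK characters stored as `\uXXXX` escapes; whether either matters is a design-phase decision.
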
## 3. Implementation Approach Options

### Option A: Extend `scripts/check_i18n_logs.py`

Add a new `--cjk-guard` mode (catalogue scan + per-path baseline diff)
to the existing script, then call it from the new workflow.

- ✅ One file to maintain; reuses the regex constant and CLI.
- ❌ The existing script is tightly scoped to the in-scope backend
  modules and the parity check. Mixing a PR-gating regression check into
  it dilutes its intent and pushes it past the single-responsibility
  boundary that the surrounding scripts respect.
- ❌ The existing script targets a fixed list of backend modules; the
  new guard scans whole subtrees. The two scopes don't fit one CLI.

### Option B: New, focused script `scripts/ci/i18n_cjk_guard.py` + new workflow (recommended)

A new directory `scripts/ci/` holds CI-only scripts; the guard is a
single file that performs both checks and supports a `--refresh-baseline`
flag. New workflow `.github/workflows/i18n-cjk-guard.yml` runs it on
every PR to `main`.

- ✅ Clean separation: production-i18n script (`check_i18n_logs.py`)
  and CI-gating script (`i18n_cjk_guard.py`) live side by side without
  overlapping responsibilities.
- ✅ Mirrors the established convention of one script per
  responsibility under `scripts/`.
- ✅ The baseline file lives under the spec dir
  (`.kiro/specs/i18n-ci-guard/baseline.txt`), matching the ticket's
  "baseline must be committed and reviewable" requirement.
- ❌ One more file in the repo, but the file is small (~150 LoC).

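To make the proposed CLI surface concrete, here is a minimal `argparse` sketch. Only `--refresh-baseline` and the baseline location come from the option described above; the `--baseline` override flag, the help texts, and `build_parser` are assumptions added for illustration.

```python
import argparse

# Baseline location proposed for Option B.
DEFAULT_BASELINE = ".kiro/specs/i18n-ci-guard/baseline.txt"


def build_parser() -> argparse.ArgumentParser:
    """Build the guard's argument parser (sketch, not the final CLI)."""
    parser = argparse.ArgumentParser(
        prog="i18n_cjk_guard.py",
        description="Fail when CJK counts exceed the committed baseline.",
    )
    parser.add_argument(
        "--refresh-baseline",
        action="store_true",
        help="Rewrite the baseline file from the current working tree.",
    )
    # Hypothetical convenience flag; not part of the approved requirements.
    parser.add_argument(
        "--baseline",
        default=DEFAULT_BASELINE,
        help="Path to the baseline file.",
    )
    return parser
```

Keeping the refresh behind an explicit flag means a plain invocation never writes to the baseline file.
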
### Option C: Hybrid (shared `cjk_scan.py` helper + thin guard script)

Factor the regex + git-grep logic into a tiny shared helper consumed by
both `check_i18n_logs.py` and the new guard.

- ✅ DRY for the regex constant.
- ❌ Premature abstraction: today the only shared element is a single
  one-line regex. The two scripts have different scopes, output
  formats, and consumers. Extracting a helper now buys consistency
  but does not pay for itself; defer until a third caller appears.

### Recommendation

**Option B**. It matches the project's established "one focused script
per responsibility" convention, isolates the new CI surface from
existing i18n scripts, and keeps the baseline file collocated with
spec metadata where reviewers expect to find it.

## 4. Research Items for Design Phase

- **Baseline file format**: prefer a stable, line-oriented text format
  over JSON to minimize diff churn (e.g., `path<TAB>count` per line,
  trailing newline). Confirm in design.
- **`git grep` invocation portability**: `git grep -nIP` needs a git
  build with PCRE support (`-P` is a compile-time option; PCRE2 backing
  arrived in git 2.14). `ubuntu-latest` ships ≥2.40. No portability
  concern is expected on the runner; record the assumption explicitly.
- **`fetch-depth`** for the `actions/checkout@v4` step: `git grep`
  scans the working tree, not history, so a shallow clone
  (`fetch-depth: 1`) is sufficient.
- **Workflow timeout budget**: capture the empirical runtime of the
  full scan locally (already measured: a single `git grep` over the
  scoped paths runs in <2 seconds with ~3.6k matches). The 60-second
  ceiling in Req 5 is comfortable.
- **Failure-message refresh command** wording: the design should pin
  the exact command shown to contributors so it stays one stable
  string developers can copy.
- **Initial baseline values**: with `git grep -nIP '[\x{4e00}-\x{9fff}]'`
  on the current branch, `backend/app` = 2707, `frontend/src` = 902,
  `locales/en.json` = 0. The committed baseline must be regenerated
  against `main` at implementation time so it reflects the merge target.

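The `path<TAB>count` baseline format suggested above could be read and written as in the following sketch. The function names are hypothetical, and sorting by path is an assumption made to keep diffs stable; the design phase may choose differently.

```python
def parse_baseline(text: str) -> dict[str, int]:
    """Parse `path<TAB>count` lines into a mapping; blank lines are ignored."""
    counts = {}
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue
        path, sep, count = line.partition("\t")
        if not sep or not count.strip().isdecimal():
            raise ValueError(f"baseline line {lineno} is malformed: {line!r}")
        counts[path] = int(count)
    return counts


def render_baseline(counts: dict[str, int]) -> str:
    """Render the mapping as sorted `path<TAB>count` lines with a trailing newline."""
    return "".join(f"{path}\t{count}\n" for path, count in sorted(counts.items()))
```

With the counts measured on the current branch, `render_baseline({"backend/app": 2707, "frontend/src": 902})` yields two lines, so a later change to one path touches exactly one line in the diff.
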
## 5. Effort & Risk

- **Effort**: **S** (1–3 days). Small, self-contained additions
  (one Python script, one workflow file, one baseline file, plus the
  spec). All patterns already exist in the repo.
- **Risk**: **Low**. No production-source changes, no new dependencies,
  no architectural shifts. The only failure mode is a noisy guard
  blocking unrelated PRs, which the per-path baseline ratchet mitigates.

## 6. Recommendations for Design Phase

- Adopt **Option B** (new focused script + new workflow + baseline file
  under spec dir).
- Lock in the canonical regex `[一-鿿]` and the canonical scan command
  `git grep -nIP '[\x{4e00}-\x{9fff}]' -- <path>` to keep this guard
  byte-for-byte aligned with the audit pipeline.
- Use a line-oriented baseline format keyed by scoped path; an explicit
  `--refresh-baseline` (or equivalent) subcommand updates it; no
  implicit overwrite.
- Output: machine-friendly findings on stderr, summary on stdout,
  exit `0`/`1`.
- The workflow should run only on `pull_request` to `main` (Req 5.1)
  with `fetch-depth: 1` and `actions/setup-python@v5`. No third-party
  packages.
- Baseline counts must be recomputed against `main` before the PR
  ships; do not commit baselines from a feature branch's working tree.
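
The ratchet and output split recommended above might look like the sketch below. It assumes a path missing from the baseline is allowed zero matches, and it leaves out the refresh command text, since the exact wording is still to be pinned in design. `compare_to_baseline` and `report` are hypothetical names.

```python
import sys


def compare_to_baseline(current: dict[str, int], baseline: dict[str, int]) -> list[str]:
    """Return one message per scoped path whose CJK count exceeds its baseline."""
    problems = []
    for path in sorted(current):
        # Assumption: a path without a baseline entry is allowed zero matches.
        allowed = baseline.get(path, 0)
        if current[path] > allowed:
            problems.append(
                f"{path}: CJK count rose from {allowed} to {current[path]} "
                f"(+{current[path] - allowed})"
            )
    return problems


def report(problems: list[str]) -> int:
    """Write findings to stderr and the summary to stdout; return the exit code."""
    for problem in problems:
        print(problem, file=sys.stderr)
    print("OK" if not problems else f"{len(problems)} issues")
    return 0 if not problems else 1
```

A count that drops below its baseline passes silently here; whether the guard should also prompt contributors to tighten the baseline in that case is left open.
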