MiroFish/.kiro/specs/i18n-ci-guard/gap-analysis.md

# Gap Analysis — i18n-ci-guard

Comparison of the approved requirements against the current MiroFish
codebase, focused on what already exists, what is missing, and what
options the design phase should choose between.

## 1. Current State Investigation

### Domain assets already in the repo

- **`scripts/check_i18n_logs.py`** — Python-stdlib-only, exit-code-based
  i18n verification script. Uses the same canonical CJK regex
  `[一-鿿]` (`U+4E00..U+9FFF`) the new guard needs, prints findings as
  `<file>:<line>: <reason>: <snippet>`, and was written for ticket #6.
  Strong precedent for the new guard's CLI surface and output format.
- **`scripts/_apply_translations.py`, `scripts/_codemod_i18n.py`,
  `scripts/_merge_locale_keys.py`** — i18n tooling sibling scripts.
  Convention is to keep auxiliary i18n scripts under `scripts/` at the
  repo root.
- **`.github/workflows/docker-image.yml`** — only existing GH Actions
  workflow; triggers on tag pushes and `workflow_dispatch`. No PR-time
  workflow exists yet, so the new guard introduces the project's first
  PR-blocking CI check.
- **PR #27 / branch `chore/i18n-10-e2e-english-verification`** — defines
  the audit methodology referenced by the ticket. Its `audit_cjk.sh`
  uses `git grep -nIP '[\x{4e00}-\x{9fff}]' -- backend/app frontend/src
  locales/en.json` — the canonical scoped scan command. PR #27 is open;
  the new guard must work with or without it merged.
- **`.kiro/specs/<feature>/`** — established home for spec artefacts.
  `i18n-externalize-backend-logs/` is the closest precedent for an
  i18n-flavoured spec.
- **`locales/en.json`, `locales/zh.json`, `locales/languages.json`** —
  shared i18n source consumed by both runtimes.

### Conventions extracted

- Auxiliary scripts: `scripts/<purpose>.py`, Python ≥3.11 stdlib only,
  shebang `#!/usr/bin/env python3`, double-quoted strings, snake_case,
  Google-style docstrings on the module and public functions.
- Output format: `<file>:<line>: <reason>: <snippet>`, summary line
  `OK` or `N issues`, exit `0`/`1`.
- Reuse the canonical regex `[一-鿿]` rather than re-deriving range
  literals.
- 4-space indent, ≤120 cols, no trailing whitespace, single trailing
  newline (`.claude/rules/dev-guidelines.md`).
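
The conventions above can be illustrated with a minimal skeleton. This is a sketch only, not the contents of `check_i18n_logs.py`; the function names (`format_finding`, `scan_text`, `summarize`) and the `cjk` reason label are assumptions made for illustration.

```python
#!/usr/bin/env python3
"""Hypothetical skeleton showing the script conventions (not the real guard)."""

import re

# Canonical CJK range (U+4E00..U+9FFF), written the way the document writes it.
CJK_RE = re.compile("[一-鿿]")


def format_finding(path: str, line_no: int, reason: str, snippet: str) -> str:
    """Render one finding as ``<file>:<line>: <reason>: <snippet>``."""
    return f"{path}:{line_no}: {reason}: {snippet.strip()}"


def scan_text(path: str, text: str) -> list[str]:
    """Return one formatted finding per line of ``text`` that contains CJK."""
    findings = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if CJK_RE.search(line):
            # "cjk" is an assumed reason label, chosen for this sketch.
            findings.append(format_finding(path, line_no, "cjk", line))
    return findings


def summarize(findings: list[str]) -> tuple[str, int]:
    """Return the summary line (``OK`` or ``N issues``) and the exit code."""
    if not findings:
        return "OK", 0
    return f"{len(findings)} issues", 1
```

A real script would read files from disk and pass the second element of `summarize` to `sys.exit`.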

### Integration surfaces

- **CI**: GitHub Actions, `.github/workflows/`. `ubuntu-latest` runner,
  Python 3.11+ via `actions/setup-python@v5` (reuse the Python version
  pin from the existing workflow if it sets one).
- **Repo layout boundaries** scoped by the audit: `backend/app/`,
  `frontend/src/`, `locales/en.json`. Each sits at most two path
  segments below the repo root.
- **Git working tree**: the guard relies on `git grep -I` for tracked,
  text-only matches; this binds the guard to a runner that has `git`
  available (true on `ubuntu-latest` and on developer machines).
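
As a rough sketch of the git-based surface, the scan could be wrapped with `subprocess`. The helper names (`build_command`, `count_lines`, `count_matches`) are hypothetical; the command itself is the scan command quoted from the audit. The sketch assumes that `git grep` exits `0` when matches are found, `1` when none are found, and a higher code on error.

```python
#!/usr/bin/env python3
"""Hypothetical wrapper around the audit's scoped ``git grep`` scan."""

import subprocess

# PCRE form of the canonical range, as used by the audit command.
CJK_PATTERN = r"[\x{4e00}-\x{9fff}]"


def build_command(path: str) -> list[str]:
    """Return the argv for scanning one scoped path."""
    return ["git", "grep", "-nIP", CJK_PATTERN, "--", path]


def count_lines(stdout: str) -> int:
    """Count matching lines; ``git grep -n`` prints one line per match line."""
    return len(stdout.splitlines())


def count_matches(path: str) -> int:
    """Run the scan for ``path`` inside a git working tree and count matches."""
    result = subprocess.run(
        build_command(path),
        capture_output=True,
        encoding="utf-8",
        errors="replace",
        check=False,
    )
    # Exit code 1 means "no matches"; anything above that is treated as an error.
    if result.returncode > 1:
        raise RuntimeError(f"git grep failed for {path}: {result.stderr.strip()}")
    return count_lines(result.stdout)
```

Keeping `build_command` and `count_lines` free of side effects makes them easy to unit-test without a repository checkout.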

## 2. Requirement-to-Asset Map

| Req | Need                              | Existing asset                                                                                  | Gap         |
| --- | --------------------------------- | ----------------------------------------------------------------------------------------------- | ----------- |
| 1   | CJK scan of `locales/en.json`     | `scripts/check_i18n_logs.py` already loads `locales/*.json` and runs the canonical regex.       | Missing — new guard must scan en.json specifically and emit `key:line` per offender. |
| 2   | CJK count under `backend/app/` and `frontend/src/` against baseline | Audit `audit_cjk.sh` (PR #27) demonstrates `git grep -nIP` is the canonical scan; no baseline file exists yet on main. | Missing — no per-path counter, no baseline file. |
| 3   | Actionable failure messaging      | `check_i18n_logs.py` output format reusable.                                                    | Missing — need refresh-baseline command in failure text. |
| 4   | Baseline file lifecycle           | None.                                                                                            | Missing — file format and refresh subcommand to design. |
| 5   | GH Actions PR integration         | `.github/workflows/` directory exists; one tag-only workflow.                                   | Missing — new `pull_request` workflow. |
| 6   | Local reproducibility             | Existing scripts run locally with stdlib; same pattern reusable.                                | None — covered by following the existing pattern. |
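
For Req 1, one possible shape of the `key:line` output is sketched below. It assumes the catalogue is pretty-printed with one `"key": "value"` pair per line, which is not confirmed here; `find_cjk_entries` and the `<unknown>` placeholder are hypothetical. For nested catalogues this reports only the leaf key on the offending line.

```python
#!/usr/bin/env python3
"""Hypothetical line-based CJK scan of a pretty-printed JSON catalogue."""

import re

CJK_RE = re.compile("[一-鿿]")
# Matches the leading ``"key":`` of a pretty-printed JSON line.
KEY_RE = re.compile(r'^\s*"((?:[^"\\]|\\.)*)"\s*:')


def find_cjk_entries(catalogue_text: str) -> list[str]:
    """Return ``key:line`` for every catalogue line that contains CJK."""
    offenders = []
    for line_no, line in enumerate(catalogue_text.splitlines(), start=1):
        if not CJK_RE.search(line):
            continue
        match = KEY_RE.match(line)
        # Fall back to a placeholder when the line does not start with a key.
        key = match.group(1) if match else "<unknown>"
        offenders.append(f"{key}:{line_no}")
    return offenders
```

A line-based scan keeps line numbers exact, which a `json.load` round trip would lose.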

## 3. Implementation Approach Options

### Option A — Extend `scripts/check_i18n_logs.py`

Add a new `--cjk-guard` mode (catalogue scan + per-path baseline diff)
to the existing script, then call it from the new workflow.

- ✅ One file to maintain; reuses the regex constant and CLI.
- ❌ The existing script is scoped to a fixed list of backend modules and
  the parity check. Mixing a PR-gating regression check into it dilutes
  its intent and pushes it past the single-responsibility line that the
  surrounding scripts respect.
- ❌ The new guard scans whole subtrees rather than a fixed module list,
  so the two scopes do not fit one CLI.

### Option B — New, focused script `scripts/ci/i18n_cjk_guard.py` + new workflow (recommended)

A new directory `scripts/ci/` holds CI-only scripts; the guard is a
single file that performs both checks and supports a `--refresh-baseline`
flag. New workflow `.github/workflows/i18n-cjk-guard.yml` runs it on
every PR to `main`.

- ✅ Clean separation: production-i18n script (`check_i18n_logs.py`)
  and CI-gating script (`i18n_cjk_guard.py`) live side by side without
  overlapping responsibilities.
- ✅ Mirrors the established convention of one script per
  responsibility under `scripts/`.
- ✅ The baseline file lives under the spec dir
  (`.kiro/specs/i18n-ci-guard/baseline.txt`), matching the ticket's
  "baseline must be committed and reviewable" requirement.
- ❌ One more file in the repo, but the file is small (~150 LoC).
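
The per-path baseline check in Option B could look roughly like the following. This is a sketch under assumptions: the guard is taken to fail only when a path's count rises above its baseline, and both `find_regressions` and the wording of `REFRESH_HINT` are hypothetical (the design phase is meant to pin the real command text).

```python
#!/usr/bin/env python3
"""Hypothetical per-path baseline comparison for the CJK guard."""

# Assumed wording; combines the script path and flag proposed in Option B.
REFRESH_HINT = "To accept the new counts, run: python3 scripts/ci/i18n_cjk_guard.py --refresh-baseline"


def find_regressions(baseline: dict[str, int], current: dict[str, int]) -> list[str]:
    """Return one message per path whose count exceeds its baseline.

    Paths missing from the baseline are treated as having a baseline of 0.
    """
    messages = []
    for path in sorted(current):
        allowed = baseline.get(path, 0)
        found = current[path]
        if found > allowed:
            messages.append(f"{path}: CJK line count rose from {allowed} to {found} (+{found - allowed})")
    return messages
```

Under this reading, a PR that lowers a count passes without touching the baseline, and tightening the baseline remains an explicit refresh.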

### Option C — Hybrid: shared `cjk_scan.py` helper + thin guard script

Factor the regex + git-grep logic into a tiny shared helper consumed by
both `check_i18n_logs.py` and the new guard.

- ✅ DRY for the regex constant.
- ❌ Premature abstraction: today the only shared element is a single
  one-line regex. The two scripts have different scopes, output
  formats, and consumers. Extracting a helper now buys consistency but
  does not pay for itself; defer until a third caller appears.

### Recommendation

**Option B**. It matches the project's established "one focused script
per responsibility" convention, isolates the new CI surface from
existing i18n scripts, and keeps the baseline file collocated with
spec metadata where reviewers expect to find it.

## 4. Research Items for Design Phase

- **Baseline file format**: prefer a stable, line-oriented text format
  over JSON to minimize diff churn (e.g., `path<TAB>count` per line,
  trailing newline). Confirm in design.
- **`git grep` invocation portability**: `git grep -nIP` needs a git
  build linked against PCRE (PCRE2 support arrived in Git 2.14).
  `ubuntu-latest` ships ≥2.40, so no portability concern is expected on
  the runner; record the assumption explicitly.
- **`fetch-depth`** for the `actions/checkout@v4` step: `git grep`
  scans the working tree, not history, so a shallow clone (`fetch-depth:
  1`) is sufficient.
- **Workflow timeout budget**: record the empirical runtime of the full
  scan. Measured locally, a single `git grep` over the scoped paths runs
  in <2 seconds with ~3.6k matches, so the 60-second ceiling in Req 5 is
  comfortable.
- **Failure-message refresh command** wording: the design should pin
  the exact command shown to contributors so it stays one stable
  string developers can copy.
- **Initial baseline values**: with `git grep -nIP '[\x{4e00}-\x{9fff}]'`
  on the current branch — `backend/app` = 2707, `frontend/src` = 902,
  `locales/en.json` = 0. The committed baseline must be regenerated
  against `main` at implementation time so it reflects the merge target.
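
To make the proposed `path<TAB>count` baseline format concrete, here is a minimal round-trip sketch. The function names are hypothetical, and sorting by path is an assumption made to keep diffs stable; the counts in the usage note are the ones measured above.

```python
#!/usr/bin/env python3
"""Hypothetical reader and writer for a ``path<TAB>count`` baseline file."""


def dump_baseline(counts: dict[str, int]) -> str:
    """Serialize counts as sorted ``path<TAB>count`` lines with a trailing newline."""
    return "".join(f"{path}\t{count}\n" for path, count in sorted(counts.items()))


def load_baseline(text: str) -> dict[str, int]:
    """Parse baseline text, raising ``ValueError`` on a malformed line."""
    counts = {}
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # Tolerate blank lines.
        path, sep, count = line.partition("\t")
        if not sep or not count.strip().isdigit():
            raise ValueError(f"baseline line {line_no} is malformed: {line!r}")
        counts[path] = int(count)
    return counts
```

With the counts listed above, `dump_baseline` would emit three lines (`backend/app`, `frontend/src`, `locales/en.json`), so a change to one path touches exactly one line of the diff.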

## 5. Effort & Risk

- **Effort**: **S** (1–3 days). Small, self-contained additions
  (one Python script, one workflow file, one baseline file, plus the
  spec). All patterns already exist in the repo.
- **Risk**: **Low**. No production-source changes, no new dependencies,
  no architectural shifts. The only failure mode is a noisy guard
  blocking unrelated PRs — mitigated by the per-path baseline ratchet.

## 6. Recommendations for Design Phase

- Adopt **Option B** (new focused script + new workflow + baseline file
  under spec dir).
- Lock in the canonical regex `[一-鿿]` and the canonical scan command
  `git grep -nIP '[\x{4e00}-\x{9fff}]' -- <path>` to keep this guard
  bytewise-aligned with the audit pipeline.
- Use a line-oriented baseline format keyed by scoped path. Only an
  explicit `--refresh-baseline` flag (or equivalent) updates it; there is
  no implicit overwrite.
- Output: machine-friendly findings on stderr, summary on stdout,
  exit `0`/`1`.
- The workflow should run only on `pull_request` to `main` (Req 5.1)
  with `fetch-depth: 1` and `actions/setup-python@v5`. No third-party
  packages.
- Baseline counts must be recomputed against `main` before the PR
  ships; do not commit baselines from a feature branch's working tree.
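
The CLI and output-routing recommendations can be sketched as follows. This is illustrative only: `build_parser` and `report` are hypothetical names, and the help text is an assumption.

```python
#!/usr/bin/env python3
"""Hypothetical CLI surface: explicit refresh flag, split output streams."""

import argparse
import sys
from typing import TextIO


def build_parser() -> argparse.ArgumentParser:
    """Build a parser where the baseline is only rewritten on explicit request."""
    parser = argparse.ArgumentParser(description="CJK regression guard (sketch).")
    parser.add_argument(
        "--refresh-baseline",
        action="store_true",
        help="Rewrite the committed baseline from the current working tree.",
    )
    return parser


def report(findings: list[str], out: TextIO = sys.stdout, err: TextIO = sys.stderr) -> int:
    """Write findings to ``err``, the summary to ``out``, and return the exit code."""
    for finding in findings:
        print(finding, file=err)
    print("OK" if not findings else f"{len(findings)} issues", file=out)
    return 1 if findings else 0
```

Because the flag defaults to `False`, a plain invocation can never overwrite the baseline, which matches the "no implicit overwrite" recommendation.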