# Design — i18n-ci-guard ## Overview This feature installs a permanent, PR-time CI guard that blocks regressions of the project's English-by-default state. It performs two checks: `locales/en.json` must contain zero CJK characters, and the total CJK match count under `backend/app/` and `frontend/src/` must not exceed a committed per-path baseline. The guard is a single Python script invoked by a single GitHub Actions workflow. **Purpose**: This feature delivers an automatic regression gate to the i18n initiative so reviewers do not have to spot CJK reintroductions by eye. **Users**: Project maintainers and PR authors. Maintainers gain a hard regression gate; PR authors gain a script they can run locally to catch regressions before pushing. **Impact**: Adds the project's first `pull_request`-triggered CI workflow. No production source under `backend/app/`, `frontend/src/`, or `locales/` is modified by this spec — only new files are added. ### Goals - Fail any PR that introduces a CJK character into `locales/en.json`. - Fail any PR whose CJK match count under `backend/app/` or `frontend/src/` exceeds the committed baseline. - Print a single actionable failure message that includes the exact command a contributor must run if the regression is intentional. - Run end-to-end under sixty seconds on `ubuntu-latest`. - Be reproducible verbatim on a developer machine with Python ≥3.11 and `git`. ### Non-Goals - Re-implementing the full classification pipeline from `.kiro/specs/i18n-e2e-english-verification/` (that work belongs to PR #27). - Auto-updating the baseline on `main`. - Translating any production source to satisfy a higher baseline. The initial baseline is recorded against `main` and only ratchets down over time. - Gating commits at pre-commit time. The guard is CI-only; a future spec may wrap it in a hook. ## Boundary Commitments ### This Spec Owns - The guard script `scripts/ci/i18n_cjk_guard.py` and its CLI contract. - The workflow `.github/workflows/i18n-cjk-guard.yml` and its trigger configuration. - The baseline file `.kiro/specs/i18n-ci-guard/baseline.txt` and its format. - The pass/fail semantics of both checks. ### Out of Boundary - Any change to files under `backend/app/`, `frontend/src/`, or `locales/` — except `locales/en.json` if it is found to contain CJK during initial baseline calibration (a remediation translation would be a separate spec/PR). - The classification heuristics in PR #27's `classify.py`. - Pre-commit hooks; IDE integrations; alternative scoped paths beyond `backend/app/` and `frontend/src/`. ### Allowed Dependencies - Python ≥3.11 standard library. - `git` (for `git grep -nIP` invocation). - `actions/checkout@v4` and `actions/setup-python@v5` from the GitHub Actions Marketplace. ### Revalidation Triggers - Adding a third scoped path → baseline file format changes; consumers (none today) re-check. - Changing the regex range → audit pipeline alignment must be re-confirmed. - Switching from `pull_request` to `merge_group` or other event → required-status-check rules in branch protection must be re-checked. ## Architecture ### Existing Architecture Analysis - **Repo layout**: monorepo split by runtime (`backend/`, `frontend/`) with shared `locales/` at root. The guard scopes its scan to `backend/app/`, `frontend/src/`, and `locales/en.json`, matching the audit pipeline's canonical scope. - **Existing scripts pattern**: `scripts/.py` for developer tools. The new `scripts/ci/` subdirectory introduces a clear, CI-only home without disturbing the existing developer scripts. - **Existing CI**: `.github/workflows/docker-image.yml` is tag-only. No `pull_request` workflow exists. The new workflow is additive and does not affect the docker-image workflow. ### Architecture Pattern & Boundary Map ```mermaid flowchart LR PR[Pull Request to main] -->|trigger| WF[.github/workflows/i18n-cjk-guard.yml] WF -->|setup-python + checkout| RUN[python scripts/ci/i18n_cjk_guard.py] RUN -->|read| EN[locales/en.json] RUN -->|git grep -nIP| BAPP[backend/app/] RUN -->|git grep -nIP| FSRC[frontend/src/] RUN -->|read| BL[.kiro/specs/i18n-ci-guard/baseline.txt] RUN -->|exit 0 or 1| WF WF -->|status| PR DEV[Developer terminal] -->|python scripts/ci/i18n_cjk_guard.py| RUN DEV -->|--update-baseline| RUN RUN -.->|writes| BL ``` **Architecture Integration**: - **Selected pattern**: single-purpose script + thin workflow. Matches the project's existing `scripts/.py` convention. - **Domain boundaries**: the guard is a pure verification tool with no side effects on production code. Its only writeable surface is the baseline file, and only when explicitly invoked with `--update-baseline`. - **Existing patterns preserved**: stdlib-only Python tooling (precedent: `scripts/check_i18n_logs.py`); single-file workflows in `.github/workflows/`. - **New components rationale**: a new file rather than an extension of an existing script — the existing script is scoped to a fixed module list and is not a regression gate. - **Steering compliance**: respects layer-based structure (script lives at repo root in `scripts/ci/`, not under `backend/` or `frontend/`), no new heavy dependencies, no `os.getenv` calls outside `backend/app/config.py`. ### Technology Stack | Layer | Choice / Version | Role in Feature | Notes | |-------|------------------|-----------------|-------| | Frontend / CLI | Python 3.11 stdlib (`argparse`, `json`, `re`, `subprocess`, `pathlib`, `sys`) | Guard CLI | Stdlib only — Req 5.5 | | Backend / Services | n/a | — | Guard does not touch backend services | | Data / Storage | Plain-text baseline file under `.kiro/specs/` | Per-path count store | One line per path, `\t` | | Messaging / Events | n/a | — | — | | Infrastructure / Runtime | GitHub Actions `ubuntu-latest`, `actions/checkout@v4`, `actions/setup-python@v5` | PR-time runner | `fetch-depth: 1` is sufficient | ## File Structure Plan ### Directory Structure ``` scripts/ └── ci/ └── i18n_cjk_guard.py # Guard CLI (new) .github/ └── workflows/ └── i18n-cjk-guard.yml # PR-time workflow (new) .kiro/specs/i18n-ci-guard/ ├── spec.json # (existing, updated) ├── requirements.md # (existing) ├── gap-analysis.md # (existing) ├── research.md # (existing) ├── design.md # (this file) ├── tasks.md # (created in next phase) └── baseline.txt # Per-path CJK match counts (new) ``` ### Modified Files - `.kiro/specs/i18n-ci-guard/spec.json` — phase / approval fields updated by Kiro flow only. - No production source files are modified by this spec. ## System Flows ### Guard execution (default mode) ```mermaid sequenceDiagram participant CI as GitHub Actions participant Script as i18n_cjk_guard.py participant Repo as Working tree participant BL as baseline.txt CI->>Script: python scripts/ci/i18n_cjk_guard.py Script->>Repo: read locales/en.json Script->>Script: scan for CJK chars alt en.json has CJK Script-->>CI: exit 1 + per-key findings else en.json clean Script->>Repo: git grep -nIP backend/app/ Script->>Repo: git grep -nIP frontend/src/ Script->>BL: read baseline counts alt any current count > baseline Script-->>CI: exit 1 + per-path delta + refresh hint else within baseline Script-->>CI: exit 0 + summary end end ``` ### Baseline refresh ```mermaid sequenceDiagram participant Dev as Developer participant Script as i18n_cjk_guard.py participant Repo as Working tree participant BL as baseline.txt Dev->>Script: python scripts/ci/i18n_cjk_guard.py --update-baseline Script->>Repo: git grep -nIP backend/app/ Script->>Repo: git grep -nIP frontend/src/ Script->>BL: write per-path counts (sorted) Script-->>Dev: exit 0 + new counts ``` The two checks run in fixed order: en.json first (cheap, decisive), then per-path counts. Both run under all conditions; the script does not short-circuit after the first failure so the contributor sees the complete diagnostic in one CI log. ## Requirements Traceability | Requirement | Summary | Components | Interfaces | Flows | |-------------|---------|------------|------------|-------| | 1.1 | Scan en.json for CJK | `i18n_cjk_guard.py` | CLI default mode | Guard execution | | 1.2 | Fail with key:line per offender | `i18n_cjk_guard.py` | CLI stderr output | Guard execution | | 1.3 | Report clean state | `i18n_cjk_guard.py` | CLI stdout summary | Guard execution | | 1.4 | Hard error if file missing | `i18n_cjk_guard.py` | CLI stderr + exit 1 | Guard execution | | 2.1 | Count CJK matches per scoped path | `i18n_cjk_guard.py` | `git grep -nIP` invocation | Guard execution | | 2.2 | Read baseline counts | `i18n_cjk_guard.py`, `baseline.txt` | File read | Guard execution | | 2.3 | Fail on regression | `i18n_cjk_guard.py` | Exit 1 | Guard execution | | 2.4 | Pass when within baseline | `i18n_cjk_guard.py` | Exit 0 | Guard execution | | 2.5 | Skip binary files | `git grep -I` | — | Guard execution | | 2.6 | Tracked-only scope | `git grep` default | — | Guard execution | | 3.1 | Per-key locale failure detail | `i18n_cjk_guard.py` | CLI stderr lines | Guard execution | | 3.2 | Per-path regression detail | `i18n_cjk_guard.py` | CLI stderr lines | Guard execution | | 3.3 | Print refresh command | `i18n_cjk_guard.py` | CLI stderr footer | Guard execution | | 3.4 | Success summary lines | `i18n_cjk_guard.py` | CLI stdout | Guard execution | | 4.1 | Baseline under spec dir | `baseline.txt` | File path | — | | 4.2 | Diff-friendly text format | `baseline.txt` | File format | — | | 4.3 | Refresh via flag | `i18n_cjk_guard.py` | `--update-baseline` | Baseline refresh | | 4.4 | No implicit baseline writes | `i18n_cjk_guard.py` | CLI default mode | Guard execution | | 4.5 | Hard error if baseline missing | `i18n_cjk_guard.py` | Exit 1 + message | Guard execution | | 5.1 | PR-only trigger to main | `i18n-cjk-guard.yml` | `on.pull_request.branches` | — | | 5.2 | Checkout PR head | `i18n-cjk-guard.yml` | `actions/checkout@v4` | — | | 5.3 | Surface output on failure | `i18n-cjk-guard.yml` | Default GH log | — | | 5.4 | Pass on exit 0 | `i18n-cjk-guard.yml` | Default | — | | 5.5 | Stdlib-only, no third-party | `i18n_cjk_guard.py`, `i18n-cjk-guard.yml` | — | — | | 5.6 | ≤60s runtime | `i18n-cjk-guard.yml` | `timeout-minutes: 1` | — | | 6.1 | Same result locally | `i18n_cjk_guard.py` | CLI | — | | 6.2 | Single stable entry point | `scripts/ci/i18n_cjk_guard.py` | Path | — | | 6.3 | No env vars / secrets | `i18n_cjk_guard.py` | CLI | — | ## Components and Interfaces | Component | Domain/Layer | Intent | Req Coverage | Key Dependencies | Contracts | |-----------|--------------|--------|--------------|------------------|-----------| | `i18n_cjk_guard.py` | CI script | Two-check guard CLI | 1.1–6.3 | `git`, Python stdlib | Service (CLI) | | `i18n-cjk-guard.yml` | CI workflow | Run guard on every PR to main | 5.1–5.6 | `actions/checkout@v4`, `actions/setup-python@v5` | Batch / Job | | `baseline.txt` | Data | Per-path baseline counts | 4.1, 4.2, 2.2 | — | State (file) | ### CI Script #### `i18n_cjk_guard.py` | Field | Detail | |-------|--------| | Intent | Run two CJK-regression checks; optionally refresh the baseline | | Requirements | 1.1, 1.2, 1.3, 1.4, 2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 3.1, 3.2, 3.3, 3.4, 4.1, 4.3, 4.4, 4.5, 5.5, 6.1, 6.2, 6.3 | | Owner / Reviewers | i18n maintainers | **Responsibilities & Constraints** - Owns the canonical guard semantics: which paths are scoped, which regex is canonical, what counts as a regression. - Runs in pure Python 3.11 stdlib + a single `git` subprocess per scoped path. - Never modifies any file other than the baseline file, and only when invoked with `--update-baseline`. - Always runs both checks (does not short-circuit), so a single CI log shows every failure mode at once. **Dependencies** - Inbound: `i18n-cjk-guard.yml` workflow; developers running locally. - Outbound: `git` subprocess (`git grep`, `git rev-parse`). - External: none. **Contracts**: Service [x] / API [ ] / Event [ ] / Batch [ ] / State [x] ##### Service Interface (CLI) ```text i18n_cjk_guard.py [--update-baseline] [--baseline PATH] [--repo-root PATH] ``` Type-annotated module signature (Python type hints, public functions only): ```python def main(argv: list[str]) -> int: ... def run_check(repo_root: pathlib.Path, baseline_path: pathlib.Path) -> int: """Run both checks; return 0 on success, 1 on any failure.""" def update_baseline(repo_root: pathlib.Path, baseline_path: pathlib.Path) -> int: """Refresh the baseline file with current per-path counts; return 0.""" def scan_locale_cjk(en_json_path: pathlib.Path) -> list[LocaleFinding]: """Return a list of (key, line_number, snippet) tuples for every CJK occurrence in locales/en.json. Empty list when clean.""" def count_path_cjk(repo_root: pathlib.Path, scoped_path: str) -> int: """Return the number of CJK match lines under scoped_path, using `git grep -nIP '[\\x{4e00}-\\x{9fff}]' -- `.""" def read_baseline(baseline_path: pathlib.Path) -> dict[str, int]: """Parse the baseline file. Each non-empty, non-comment line is '\\t'. Raise BaselineError on any malformed input or missing file.""" def write_baseline(baseline_path: pathlib.Path, counts: dict[str, int]) -> None: """Atomically overwrite the baseline file with sorted entries and a single trailing newline.""" ``` Where: ```python LocaleFinding = tuple[str, int, str] # (dotted_key, line_number, snippet) SCOPED_PATHS: tuple[str, ...] = ("backend/app", "frontend/src") EN_JSON_REL_PATH: str = "locales/en.json" CJK_PATTERN: str = "[\\x{4e00}-\\x{9fff}]" # passed to git grep -P CJK_RE: re.Pattern[str] = re.compile(r"[一-鿿]") SNIPPET_MAX_LEN: int = 80 ``` - **Preconditions**: invoked with CWD at the repo root or `--repo-root` set; `git` is on `$PATH`; the working tree is the intended scan target. - **Postconditions** (default mode): exit 0 iff both checks pass; exit 1 otherwise. Stdout receives the success summary; stderr receives findings on failure. The baseline file is unchanged. - **Postconditions** (`--update-baseline`): the baseline file is rewritten to current per-path counts and exit 0 is returned. - **Invariants**: regex range, scoped paths, and baseline file path are constants — no env-var override. ##### State Management - **State model**: a dict `{: }` parsed from the baseline file. - **Persistence**: plain-text file at `.kiro/specs/i18n-ci-guard/baseline.txt`. Atomic write via `tmp + os.replace`. - **Concurrency**: single-writer (developer running `--update-baseline`); CI workers only read. **Implementation Notes** - Output format mirrors `scripts/check_i18n_logs.py`: `:: : ` on stderr, summary on stdout, trailing `OK` or `N issues`. - The exact refresh command printed on regression failure is: `python scripts/ci/i18n_cjk_guard.py --update-baseline`. - `count_path_cjk` invokes `git grep` via `subprocess.run` with `check=False`; `git grep` exits 1 when there are zero matches, so the function treats exit codes 0 and 1 as success and any other code as a hard error. - Localised key extraction for `en.json` walks the parsed JSON dict; line numbers are obtained by re-reading the file as text and matching the value's first textual occurrence. - Risks: see `research.md` § Risks & Mitigations. ### CI Workflow #### `i18n-cjk-guard.yml` | Field | Detail | |-------|--------| | Intent | Run the guard on every PR to `main` | | Requirements | 5.1, 5.2, 5.3, 5.4, 5.5, 5.6 | | Owner / Reviewers | i18n maintainers | **Contracts**: Batch / Job [x] ##### Batch / Job Contract - **Trigger**: `on: pull_request: branches: [main]`. - **Input / validation**: PR head ref checkout via `actions/checkout@v4` with `fetch-depth: 1`. Python set up via `actions/setup-python@v5` with `python-version: '3.11'`. - **Output / destination**: pass/fail status surfaced as a GitHub Actions check on the PR. Script stdout/stderr appears in the workflow log. - **Idempotency & recovery**: re-running the workflow re-evaluates the same working tree; no persistent side effects on the runner. ##### Workflow shape (sketch) ```yaml name: i18n CJK Guard on: pull_request: branches: [main] jobs: guard: runs-on: ubuntu-latest timeout-minutes: 1 steps: - uses: actions/checkout@v4 with: fetch-depth: 1 - uses: actions/setup-python@v5 with: python-version: '3.11' - run: python scripts/ci/i18n_cjk_guard.py ``` ### Baseline Data File #### `baseline.txt` | Field | Detail | |-------|--------| | Intent | Persist the per-path CJK match-count baseline | | Requirements | 2.2, 4.1, 4.2 | **Contracts**: State [x] ##### Format ```text # Per-path CJK baseline for the i18n CI guard. # Format: \t. Sorted lexicographically. # Refresh via: python scripts/ci/i18n_cjk_guard.py --update-baseline backend/app frontend/src ``` - One header block of `#`-prefixed comments (parser ignores). - Blank lines ignored. - Lines must match `^(?P[^\t\n]+)\t(?P\d+)$`. - Trailing newline mandatory. ## Data Models ### Domain Model - `LocaleFinding` — value object `(dotted_key: str, line_number: int, snippet: str)`. - `PathCount` — pair `(scoped_path: str, count: int)`. The full baseline is a `dict[str, int]` keyed by scoped path. Invariants: - `count` is a non-negative integer. - `scoped_path` is one of `SCOPED_PATHS`. - `LocaleFinding.snippet` is at most `SNIPPET_MAX_LEN` characters, truncated with an ellipsis when needed. ## Error Handling ### Error Strategy - All non-zero exits are accompanied by a stderr message identifying the failing check, the offending file or path, and (for regressions) the refresh command. The script never raises uncaught exceptions past `main()` in normal flow; unexpected I/O errors propagate as `OSError` with a clear traceback so CI logs surface them clearly. ### Error Categories and Responses - **Locale failure** (Req 1.2): one stderr line per offending key (`locales/en.json:: cjk-in-en: = `), then a trailing `N issues` summary. - **Regression failure** (Req 3.2): one stderr line per regressed path (`: cjk-regression: baseline= current= delta=+`) followed by a one-line refresh hint: `# refresh via: python scripts/ci/i18n_cjk_guard.py --update-baseline`. - **Missing en.json** (Req 1.4): stderr `locales/en.json: missing catalogue file`, exit 1. - **Missing or malformed baseline** (Req 4.5): stderr `: missing or malformed; refresh via …`, exit 1. - **`git grep` unavailable / non-PCRE**: stderr `git grep failed: `, exit 1. ### Monitoring - The guard is a single short-lived script. All observability is delegated to GitHub Actions logs (stdout/stderr, run duration). No external telemetry. ## Testing Strategy ### Unit Tests (Python) Place tests under `scripts/ci/tests/test_i18n_cjk_guard.py` (or invoke the script directly via subprocess in a tmp git repo). The project's test runner is `pytest` (already used by `backend/`), but the new tests must be runnable with `python -m pytest` from the repo root without backend dependencies. Tests are scoped to: 1. `scan_locale_cjk` — clean catalogue returns empty list; planted CJK value returns a single `LocaleFinding` with the correct key and line number. 2. `count_path_cjk` — given a tmp git repo with N planted CJK lines, returns N; binary file matches are excluded; untracked file matches are excluded. 3. `read_baseline` / `write_baseline` round-trip — write counts, re-read, equal. 4. `read_baseline` malformed input — non-tab line → `BaselineError`. 5. `run_check` end-to-end — passing baseline → exit 0; regressed baseline → exit 1 and stderr contains the refresh command. ### Integration Tests 1. Workflow shape — `actionlint` (optional, if installed locally) on `i18n-cjk-guard.yml`. At minimum, `python -c "import yaml; yaml.safe_load(open('.github/workflows/i18n-cjk-guard.yml'))"` for YAML validity. 2. Local end-to-end — run `python scripts/ci/i18n_cjk_guard.py` from the repo root with the committed baseline; expect exit 0 on a clean checkout of `main`. 3. Refresh end-to-end — run with `--update-baseline`; verify baseline file is rewritten and a second default run is exit 0. ### Performance / Load - Single-pass `git grep` over the scoped paths runs in <2 s on the current repo. The workflow's `timeout-minutes: 1` is a hard ceiling per Req 5.6. ## Optional Sections ### Security Considerations - The guard reads only tracked text files; no secrets are accessed. - The workflow uses `GITHUB_TOKEN` only implicitly via `actions/checkout`; no additional permissions are requested (`permissions:` block omitted relies on the repo default of `contents: read`, which is sufficient).