# Research & Design Decisions: i18n-ci-guard

## Summary

- **Feature**: `i18n-ci-guard`
- **Discovery Scope**: Simple Addition (one Python script + one GH Actions workflow + one baseline file). Extension-flavoured because it builds on established `scripts/` conventions and the canonical CJK regex used by the larger audit pipeline.
- **Key Findings**:
  - The canonical CJK match command `git grep -nIP '[\x{4e00}-\x{9fff}]' -- ` is already used by the unmerged audit pipeline (PR #27) and is portable on every git ≥2.4 (`ubuntu-latest` ships ≥2.40).
  - `scripts/check_i18n_logs.py` is a strong CLI/style precedent: Python-stdlib-only, exit `0`/`1`, output as `:: : `, canonical regex `[一-鿿]`.
  - The repository has no existing `pull_request`-triggered GH Actions workflow; this guard introduces the first one. The only existing workflow (`.github/workflows/docker-image.yml`) runs on tag pushes only.
  - Current per-path counts on this branch: `backend/app=2707, frontend/src=902, locales/en.json=0`. These are sample counts; the committed baseline must be regenerated against `main` at implementation time.

## Research Log

### Canonical scan command

- **Context**: Requirement 2 needs a stable per-path CJK count and Requirement 5.5 forbids third-party packages.
- **Sources Consulted**:
  - `audit_cjk.sh` from PR #27 commit `3481408`.
  - `git grep` man page.
- **Findings**:
  - `git grep -nIP '[\x{4e00}-\x{9fff}]' -- ` returns one match per matching line in tracked, text-only files. `-I` excludes binary files; `-P` enables PCRE2 so the `\x{...}` Unicode range works.
  - This matches the input format consumed by the existing audit classifier, so the guard's match counts are directly comparable across pipelines.
- **Implications**:
  - The guard re-uses this exact command; no new dependencies.
  - Because `-I` skips binary files and tracked-only is the default, Requirements 2.5 and 2.6 are satisfied by the command itself rather than by additional script logic.

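
The scan described above could be wrapped in a few lines of stdlib Python. The following is a minimal sketch, not the guard's actual implementation: the names `CJK_PATTERN`, `count_lines`, and `scan_path` are hypothetical, and the exit-status handling assumes the usual `git grep` convention (0 when something matched, 1 when nothing matched, anything else an error).

```python
import subprocess

# Pattern copied from the canonical command quoted in the findings above.
CJK_PATTERN = r"[\x{4e00}-\x{9fff}]"


def count_lines(grep_output: str) -> int:
    """Count matches: `git grep -n` prints one output line per matching source line."""
    return sum(1 for line in grep_output.splitlines() if line)


def scan_path(path: str) -> int:
    """Return the number of CJK-matching lines under `path` (hypothetical helper)."""
    proc = subprocess.run(
        ["git", "grep", "-nIP", CJK_PATTERN, "--", path],
        capture_output=True,
        text=True,
        encoding="utf-8",
        errors="replace",  # tolerate stray non-UTF-8 bytes in matched lines
    )
    if proc.returncode == 1:
        # Status 1 means "no matches", which is a valid result, not a failure.
        return 0
    if proc.returncode != 0:
        # Any other status is treated as a real error (e.g. git built without PCRE).
        raise RuntimeError(
            f"git grep failed ({proc.returncode}): {proc.stderr.strip()}"
        )
    return count_lines(proc.stdout)
```

Calling `scan_path("backend/app")` from the repository root would then yield the per-path count that the baseline records. Keeping the counting logic in a pure function makes it testable without a git checkout.
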
### Baseline file format

- **Context**: Requirement 4 needs a diff-friendly committed baseline.
- **Sources Consulted**:
  - Diff churn behaviour of JSON vs. line-oriented text in this repo's history (e.g. `locales/*.json` PR diffs frequently re-key, while plain-text `parity.txt` from PR #27 reads cleanly).
- **Findings**:
  - Line-oriented `\t` files produce minimal diffs and require no JSON parser.
  - A two-line file (one per scoped path) is large enough to be self-explanatory and small enough to never line-shuffle.
- **Implications**:
  - Use plain text, sorted by path, single trailing newline. Reject the file as malformed if the script cannot parse it (Req 4.5).

### Locale-catalogue scan path

- **Context**: Requirement 1 wants `key:line` per CJK offender in `locales/en.json`.
- **Sources Consulted**:
  - `scripts/check_i18n_logs.py` (`flatten_keys` reuse pattern).
  - `check_parity.py` from PR #27 (`flatten`, `[cjk-in-en]` block).
- **Findings**:
  - Both precedents flatten the locale dict and run the canonical regex against each leaf string value. Line numbers are derivable by re-reading the file as text and matching the value's first occurrence (good enough for an actionable error message).
  - Empty-string values and non-string leaf values (booleans, null) are skipped.
- **Implications**:
  - Implement a tiny flatten-then-scan helper inside the guard script; do not add a new shared utility module.

### GH Actions trigger and budget

- **Context**: Requirements 5.1, 5.5, 5.6.
- **Sources Consulted**:
  - GitHub-hosted runners reference (`ubuntu-latest`).
  - `actions/setup-python@v5` README.
- **Findings**:
  - `ubuntu-latest` has Python 3.10+ pre-installed; `actions/setup-python@v5` pins to 3.11 in <5 s.
  - A single `git grep` over the scoped paths runs in <2 s on this repo (~3.6k matches). End-to-end the workflow comfortably fits inside the 60 s ceiling.
- **Implications**:
  - Use `actions/checkout@v4` with `fetch-depth: 1`, `actions/setup-python@v5` with `python-version: '3.11'`, and run the script directly. No caching layer needed.

## Architecture Pattern Evaluation

| Option | Description | Strengths | Risks / Limitations | Notes |
|--------|-------------|-----------|---------------------|-------|
| A. Extend `check_i18n_logs.py` | Add `--cjk-guard` mode to existing script | Reuses one file | Conflates two scopes; existing script is module-scoped, guard is subtree-scoped | Rejected |
| B. New `scripts/ci/i18n_cjk_guard.py` + new workflow | Single-purpose script + workflow + baseline file | Clean SRP; matches "one script per responsibility" precedent | One additional file | **Selected** |
| C. Shared `cjk_scan.py` helper + thin guard | Factor regex/git-grep into helper | DRY for regex constant | Premature abstraction; only one shared symbol today | Rejected |

## Design Decisions

### Decision: Single-purpose CI script + GH Actions workflow (Option B)

- **Context**: Requirements 1–6 demand a small, self-contained guard.
- **Alternatives Considered**: A (extend), C (shared helper).
- **Selected Approach**: New script `scripts/ci/i18n_cjk_guard.py`, new workflow `.github/workflows/i18n-cjk-guard.yml`, baseline file `.kiro/specs/i18n-ci-guard/baseline.txt`.
- **Rationale**: Matches the project's "one focused script per responsibility" convention; isolates a CI-blocking surface from the existing i18n developer scripts; keeps the baseline collocated with the spec for review traceability.
- **Trade-offs**: One more file in `scripts/` vs. tighter cohesion.
- **Follow-up**: When a third caller wants the canonical regex, factor it out then.

### Decision: Plain-text baseline format

- **Context**: Requirement 4.2 demands a stable, diff-friendly format.
- **Alternatives Considered**: JSON, YAML.
- **Selected Approach**: One line per scoped path: `\t`, sorted lexicographically by path, single trailing newline.
- **Rationale**: Zero parser dependency; predictable diffs; trivial to refresh atomically.
- **Trade-offs**: Less expressive than JSON (no nested structure), but the data model is two integers, so nesting is unnecessary.

### Decision: Refresh via `--update-baseline` subcommand-style flag

- **Context**: Requirement 4.3 needs an explicit refresh path.
- **Alternatives Considered**: Separate `update_baseline.py` script; Makefile target.
- **Selected Approach**: Single script with two modes: default (check + exit 0/1) and `--update-baseline` (overwrite baseline + exit 0).
- **Rationale**: One CLI surface to remember; the failure message prints the exact command to run.
- **Trade-offs**: Slightly more conditional logic in one script; acceptable given the small total LoC.

### Decision: Workflow runs only on `pull_request` to `main`

- **Context**: Requirement 5.1.
- **Alternatives Considered**: Run on `push` to all branches as well; run on `pull_request` to any base branch.
- **Selected Approach**: `on.pull_request.branches: [main]` only.
- **Rationale**: Aligns with how the existing project uses `main` as the protected branch (see `gh pr list` history; every feature PR targets `main`). Avoids redundant runs on intra-branch chains.
- **Trade-offs**: A direct push to `main` would not be guarded, but branch protection already discourages that path (per `dev-guidelines.md`).

## Risks & Mitigations

- **Risk**: Baseline drifts upward unintentionally during `--update-baseline` runs, hiding real regressions.
  - *Mitigation*: Failure message instructs contributors to refresh *only when intentional*; the baseline file is reviewed in the same PR diff. Acceptance Criterion 3.3 makes this explicit.
- **Risk**: `git grep -P` not built with PCRE on a developer's local git build (rare on Linux/macOS, possible on minimal Windows builds).
  - *Mitigation*: The guard prints a clear error if `git grep` fails in PCRE mode, i.e. exits with a status other than 0 or 1 (status 1 only means no matches); documents Python ≥3.11 + git ≥2.20 as prerequisites.
- **Risk**: Baseline counts captured on a feature branch include changes not yet on `main`, mis-anchoring the ratchet.
  - *Mitigation*: The implementation task explicitly recomputes baseline against `origin/main` before committing; documented in `tasks.md`.

## References

- PR #27 audit pipeline (`audit_cjk.sh`, `check_parity.py`, `classify.py`): methodology source of truth.
- `scripts/check_i18n_logs.py`: CLI/style precedent.
- `git grep` man page: `-n`, `-I`, `-P` flag semantics.
- GitHub Actions `actions/setup-python@v5` and `actions/checkout@v4` README pages.