ci(i18n): add cjk regression guard for every pull request
Adds a stdlib-only Python script and a new GitHub Actions workflow that fail any pull request which reintroduces CJK characters into locales/en.json or which raises the total CJK match count under backend/app or frontend/src above a committed per-path baseline. The guard captures the two highest-signal checks of the larger i18n-e2e-english-verification audit so it can run on every PR with a sub-second budget and without depending on that pipeline being on main. The committed baseline lets the codebase ratchet down toward English-only without blocking unrelated PRs on pre-existing CJK content; refresh it intentionally via the documented flag. Closes #26
parent 063b7fb17d
commit 081de636f1
@ -0,0 +1,26 @@
name: i18n CJK Guard

on:
  pull_request:
    branches: [main]

permissions:
  contents: read

jobs:
  guard:
    runs-on: ubuntu-latest
    timeout-minutes: 1
    steps:
      - name: Checkout
        uses: actions/checkout@v4
        with:
          fetch-depth: 1

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'

      - name: Run i18n CJK guard
        run: python scripts/ci/i18n_cjk_guard.py
@ -0,0 +1,5 @@
# Per-path CJK baseline for the i18n CI guard.
# Format: <path>\t<count>. Sorted lexicographically.
# Refresh via: python scripts/ci/i18n_cjk_guard.py --update-baseline
backend/app	2792
frontend/src	902
@ -0,0 +1,544 @@
|
||||||
|
# Design — i18n-ci-guard
|
||||||
|
|
||||||
|
## Overview
|
||||||
|
|
||||||
|
This feature installs a permanent, PR-time CI guard that blocks
|
||||||
|
regressions of the project's English-by-default state. It performs two
|
||||||
|
checks: `locales/en.json` must contain zero CJK characters, and the
|
||||||
|
total CJK match count under `backend/app/` and `frontend/src/` must not
|
||||||
|
exceed a committed per-path baseline. The guard is a single Python
|
||||||
|
script invoked by a single GitHub Actions workflow.
|
||||||
|
|
||||||
|
**Purpose**: This feature delivers an automatic regression gate to the
|
||||||
|
i18n initiative so reviewers do not have to spot CJK reintroductions
|
||||||
|
by eye.
|
||||||
|
**Users**: Project maintainers and PR authors. Maintainers gain a
|
||||||
|
hard regression gate; PR authors gain a script they can run locally to
|
||||||
|
catch regressions before pushing.
|
||||||
|
**Impact**: Adds the project's first `pull_request`-triggered CI
|
||||||
|
workflow. No production source under `backend/app/`, `frontend/src/`,
|
||||||
|
or `locales/` is modified by this spec — only new files are added.
|
||||||
|
|
||||||
|
### Goals
|
||||||
|
|
||||||
|
- Fail any PR that introduces a CJK character into `locales/en.json`.
|
||||||
|
- Fail any PR whose CJK match count under `backend/app/` or
|
||||||
|
`frontend/src/` exceeds the committed baseline.
|
||||||
|
- Print a single actionable failure message that includes the exact
|
||||||
|
command a contributor must run if the regression is intentional.
|
||||||
|
- Run end-to-end under sixty seconds on `ubuntu-latest`.
|
||||||
|
- Be reproducible verbatim on a developer machine with Python ≥3.11
|
||||||
|
and `git`.
|
||||||
|
|
||||||
|
### Non-Goals
|
||||||
|
|
||||||
|
- Re-implementing the full classification pipeline from
|
||||||
|
`.kiro/specs/i18n-e2e-english-verification/` (that work belongs to
|
||||||
|
PR #27).
|
||||||
|
- Auto-updating the baseline on `main`.
|
||||||
|
- Translating any production source to satisfy a higher baseline. The
|
||||||
|
initial baseline is recorded against `main` and only ratchets down
|
||||||
|
over time.
|
||||||
|
- Gating commits at pre-commit time. The guard is CI-only; a future
|
||||||
|
spec may wrap it in a hook.
|
||||||
|
|
||||||
|
## Boundary Commitments
|
||||||
|
|
||||||
|
### This Spec Owns
|
||||||
|
|
||||||
|
- The guard script `scripts/ci/i18n_cjk_guard.py` and its CLI
|
||||||
|
contract.
|
||||||
|
- The workflow `.github/workflows/i18n-cjk-guard.yml` and its
|
||||||
|
trigger configuration.
|
||||||
|
- The baseline file `.kiro/specs/i18n-ci-guard/baseline.txt` and its
|
||||||
|
format.
|
||||||
|
- The pass/fail semantics of both checks.
|
||||||
|
|
||||||
|
### Out of Boundary
|
||||||
|
|
||||||
|
- Any change to files under `backend/app/`, `frontend/src/`, or
|
||||||
|
`locales/` — except `locales/en.json` if it is found to contain CJK
|
||||||
|
during initial baseline calibration (a remediation translation would
|
||||||
|
be a separate spec/PR).
|
||||||
|
- The classification heuristics in PR #27's `classify.py`.
|
||||||
|
- Pre-commit hooks; IDE integrations; alternative scoped paths beyond
|
||||||
|
`backend/app/` and `frontend/src/`.
|
||||||
|
|
||||||
|
### Allowed Dependencies
|
||||||
|
|
||||||
|
- Python ≥3.11 standard library.
|
||||||
|
- `git` (for `git grep -nIP` invocation).
|
||||||
|
- `actions/checkout@v4` and `actions/setup-python@v5` from the
|
||||||
|
GitHub Actions Marketplace.
|
||||||
|
|
||||||
|
### Revalidation Triggers
|
||||||
|
|
||||||
|
- Adding a third scoped path → baseline file format changes; consumers
|
||||||
|
(none today) re-check.
|
||||||
|
- Changing the regex range → audit pipeline alignment must be
|
||||||
|
re-confirmed.
|
||||||
|
- Switching from `pull_request` to `merge_group` or other event →
|
||||||
|
required-status-check rules in branch protection must be re-checked.
|
||||||
|
|
||||||
|
## Architecture
|
||||||
|
|
||||||
|
### Existing Architecture Analysis
|
||||||
|
|
||||||
|
- **Repo layout**: monorepo split by runtime (`backend/`, `frontend/`)
|
||||||
|
with shared `locales/` at root. The guard scopes its scan to
|
||||||
|
`backend/app/`, `frontend/src/`, and `locales/en.json`, matching the
|
||||||
|
audit pipeline's canonical scope.
|
||||||
|
- **Existing scripts pattern**: `scripts/<purpose>.py` for developer
|
||||||
|
tools. The new `scripts/ci/` subdirectory introduces a clear,
|
||||||
|
CI-only home without disturbing the existing developer scripts.
|
||||||
|
- **Existing CI**: `.github/workflows/docker-image.yml` is tag-only.
|
||||||
|
No `pull_request` workflow exists. The new workflow is additive and
|
||||||
|
does not affect the docker-image workflow.
|
||||||
|
|
||||||
|
### Architecture Pattern & Boundary Map
|
||||||
|
|
||||||
|
```mermaid
flowchart LR
    PR[Pull Request to main] -->|trigger| WF[.github/workflows/i18n-cjk-guard.yml]
    WF -->|setup-python + checkout| RUN[python scripts/ci/i18n_cjk_guard.py]
    RUN -->|read| EN[locales/en.json]
    RUN -->|git grep -nIP| BAPP[backend/app/]
    RUN -->|git grep -nIP| FSRC[frontend/src/]
    RUN -->|read| BL[.kiro/specs/i18n-ci-guard/baseline.txt]
    RUN -->|exit 0 or 1| WF
    WF -->|status| PR

    DEV[Developer terminal] -->|python scripts/ci/i18n_cjk_guard.py| RUN
    DEV -->|--update-baseline| RUN
    RUN -.->|writes| BL
```
|
||||||
|
|
||||||
|
**Architecture Integration**:
|
||||||
|
|
||||||
|
- **Selected pattern**: single-purpose script + thin workflow.
|
||||||
|
Matches the project's existing `scripts/<purpose>.py` convention.
|
||||||
|
- **Domain boundaries**: the guard is a pure verification tool with no
|
||||||
|
side effects on production code. Its only writeable surface is the
|
||||||
|
baseline file, and only when explicitly invoked with
|
||||||
|
`--update-baseline`.
|
||||||
|
- **Existing patterns preserved**: stdlib-only Python tooling
|
||||||
|
(precedent: `scripts/check_i18n_logs.py`); single-file workflows in
|
||||||
|
`.github/workflows/`.
|
||||||
|
- **New components rationale**: a new file rather than an extension of
|
||||||
|
an existing script — the existing script is scoped to a fixed
|
||||||
|
module list and is not a regression gate.
|
||||||
|
- **Steering compliance**: respects layer-based structure (script
|
||||||
|
lives at repo root in `scripts/ci/`, not under `backend/` or
|
||||||
|
`frontend/`), no new heavy dependencies, no `os.getenv` calls
|
||||||
|
outside `backend/app/config.py`.
|
||||||
|
|
||||||
|
### Technology Stack
|
||||||
|
|
||||||
|
| Layer | Choice / Version | Role in Feature | Notes |
|-------|------------------|-----------------|-------|
| Frontend / CLI | Python 3.11 stdlib (`argparse`, `json`, `re`, `subprocess`, `pathlib`, `sys`) | Guard CLI | Stdlib only — Req 5.5 |
| Backend / Services | n/a | — | Guard does not touch backend services |
| Data / Storage | Plain-text baseline file under `.kiro/specs/` | Per-path count store | One line per path, `<path>\t<count>` |
| Messaging / Events | n/a | — | — |
| Infrastructure / Runtime | GitHub Actions `ubuntu-latest`, `actions/checkout@v4`, `actions/setup-python@v5` | PR-time runner | `fetch-depth: 1` is sufficient |
|
||||||
|
|
||||||
|
## File Structure Plan
|
||||||
|
|
||||||
|
### Directory Structure
|
||||||
|
|
||||||
|
```
scripts/
└── ci/
    └── i18n_cjk_guard.py          # Guard CLI (new)

.github/
└── workflows/
    └── i18n-cjk-guard.yml         # PR-time workflow (new)

.kiro/specs/i18n-ci-guard/
├── spec.json                      # (existing, updated)
├── requirements.md                # (existing)
├── gap-analysis.md                # (existing)
├── research.md                    # (existing)
├── design.md                      # (this file)
├── tasks.md                       # (created in next phase)
└── baseline.txt                   # Per-path CJK match counts (new)
```
|
||||||
|
|
||||||
|
### Modified Files
|
||||||
|
|
||||||
|
- `.kiro/specs/i18n-ci-guard/spec.json` — phase / approval fields
|
||||||
|
updated by Kiro flow only.
|
||||||
|
- No production source files are modified by this spec.
|
||||||
|
|
||||||
|
## System Flows
|
||||||
|
|
||||||
|
### Guard execution (default mode)
|
||||||
|
|
||||||
|
```mermaid
sequenceDiagram
    participant CI as GitHub Actions
    participant Script as i18n_cjk_guard.py
    participant Repo as Working tree
    participant BL as baseline.txt

    CI->>Script: python scripts/ci/i18n_cjk_guard.py
    Script->>Repo: read locales/en.json
    Script->>Script: scan for CJK chars
    Script->>Repo: git grep -nIP backend/app/
    Script->>Repo: git grep -nIP frontend/src/
    Script->>BL: read baseline counts
    alt en.json has CJK or any current count > baseline
        Script-->>CI: exit 1 + per-key findings + per-path delta + refresh hint
    else en.json clean and within baseline
        Script-->>CI: exit 0 + summary
    end
```
|
||||||
|
|
||||||
|
### Baseline refresh
|
||||||
|
|
||||||
|
```mermaid
sequenceDiagram
    participant Dev as Developer
    participant Script as i18n_cjk_guard.py
    participant Repo as Working tree
    participant BL as baseline.txt

    Dev->>Script: python scripts/ci/i18n_cjk_guard.py --update-baseline
    Script->>Repo: git grep -nIP backend/app/
    Script->>Repo: git grep -nIP frontend/src/
    Script->>BL: write per-path counts (sorted)
    Script-->>Dev: exit 0 + new counts
```
|
||||||
|
|
||||||
|
The two checks run in fixed order: en.json first (cheap, decisive),
|
||||||
|
then per-path counts. Both run under all conditions; the script does
|
||||||
|
not short-circuit after the first failure so the contributor sees the
|
||||||
|
complete diagnostic in one CI log.
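
The no-short-circuit rule can be illustrated with a minimal sketch. The `Check` alias and `run_all_checks` helper are hypothetical names used only for illustration; the design does not prescribe how `run_check` aggregates its results internally.

```python
from typing import Callable, Iterable

# Hypothetical shape: each check returns the stderr lines it wants to report.
Check = Callable[[], list[str]]


def run_all_checks(checks: Iterable[Check]) -> tuple[int, list[str]]:
    """Run every check in order and collect all failures before choosing the exit code."""
    failures: list[str] = []
    for check in checks:
        # No early return: a failing en.json check must not hide a count regression.
        failures.extend(check())
    return (1 if failures else 0), failures
```

Collecting the messages first and deciding the exit code last is what lets one CI log show both the locale findings and the per-path regressions.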
|
||||||
|
|
||||||
|
## Requirements Traceability
|
||||||
|
|
||||||
|
| Requirement | Summary | Components | Interfaces | Flows |
|-------------|---------|------------|------------|-------|
| 1.1 | Scan en.json for CJK | `i18n_cjk_guard.py` | CLI default mode | Guard execution |
| 1.2 | Fail with key:line per offender | `i18n_cjk_guard.py` | CLI stderr output | Guard execution |
| 1.3 | Report clean state | `i18n_cjk_guard.py` | CLI stdout summary | Guard execution |
| 1.4 | Hard error if file missing | `i18n_cjk_guard.py` | CLI stderr + exit 1 | Guard execution |
| 2.1 | Count CJK matches per scoped path | `i18n_cjk_guard.py` | `git grep -nIP` invocation | Guard execution |
| 2.2 | Read baseline counts | `i18n_cjk_guard.py`, `baseline.txt` | File read | Guard execution |
| 2.3 | Fail on regression | `i18n_cjk_guard.py` | Exit 1 | Guard execution |
| 2.4 | Pass when within baseline | `i18n_cjk_guard.py` | Exit 0 | Guard execution |
| 2.5 | Skip binary files | `git grep -I` | — | Guard execution |
| 2.6 | Tracked-only scope | `git grep` default | — | Guard execution |
| 3.1 | Per-key locale failure detail | `i18n_cjk_guard.py` | CLI stderr lines | Guard execution |
| 3.2 | Per-path regression detail | `i18n_cjk_guard.py` | CLI stderr lines | Guard execution |
| 3.3 | Print refresh command | `i18n_cjk_guard.py` | CLI stderr footer | Guard execution |
| 3.4 | Success summary lines | `i18n_cjk_guard.py` | CLI stdout | Guard execution |
| 4.1 | Baseline under spec dir | `baseline.txt` | File path | — |
| 4.2 | Diff-friendly text format | `baseline.txt` | File format | — |
| 4.3 | Refresh via flag | `i18n_cjk_guard.py` | `--update-baseline` | Baseline refresh |
| 4.4 | No implicit baseline writes | `i18n_cjk_guard.py` | CLI default mode | Guard execution |
| 4.5 | Hard error if baseline missing | `i18n_cjk_guard.py` | Exit 1 + message | Guard execution |
| 5.1 | PR-only trigger to main | `i18n-cjk-guard.yml` | `on.pull_request.branches` | — |
| 5.2 | Checkout PR head | `i18n-cjk-guard.yml` | `actions/checkout@v4` | — |
| 5.3 | Surface output on failure | `i18n-cjk-guard.yml` | Default GH log | — |
| 5.4 | Pass on exit 0 | `i18n-cjk-guard.yml` | Default | — |
| 5.5 | Stdlib-only, no third-party | `i18n_cjk_guard.py`, `i18n-cjk-guard.yml` | — | — |
| 5.6 | ≤60s runtime | `i18n-cjk-guard.yml` | `timeout-minutes: 1` | — |
| 6.1 | Same result locally | `i18n_cjk_guard.py` | CLI | — |
| 6.2 | Single stable entry point | `scripts/ci/i18n_cjk_guard.py` | Path | — |
| 6.3 | No env vars / secrets | `i18n_cjk_guard.py` | CLI | — |
|
||||||
|
|
||||||
|
## Components and Interfaces
|
||||||
|
|
||||||
|
| Component | Domain/Layer | Intent | Req Coverage | Key Dependencies | Contracts |
|-----------|--------------|--------|--------------|------------------|-----------|
| `i18n_cjk_guard.py` | CI script | Two-check guard CLI | 1.1–6.3 | `git`, Python stdlib | Service (CLI) |
| `i18n-cjk-guard.yml` | CI workflow | Run guard on every PR to main | 5.1–5.6 | `actions/checkout@v4`, `actions/setup-python@v5` | Batch / Job |
| `baseline.txt` | Data | Per-path baseline counts | 4.1, 4.2, 2.2 | — | State (file) |
|
||||||
|
|
||||||
|
### CI Script
|
||||||
|
|
||||||
|
#### `i18n_cjk_guard.py`
|
||||||
|
|
||||||
|
| Field | Detail |
|
||||||
|
|-------|--------|
|
||||||
|
| Intent | Run two CJK-regression checks; optionally refresh the baseline |
|
||||||
|
| Requirements | 1.1, 1.2, 1.3, 1.4, 2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 3.1, 3.2, 3.3, 3.4, 4.1, 4.3, 4.4, 4.5, 5.5, 6.1, 6.2, 6.3 |
|
||||||
|
| Owner / Reviewers | i18n maintainers |
|
||||||
|
|
||||||
|
**Responsibilities & Constraints**
|
||||||
|
|
||||||
|
- Owns the canonical guard semantics: which paths are scoped, which
|
||||||
|
regex is canonical, what counts as a regression.
|
||||||
|
- Runs in pure Python 3.11 stdlib + a single `git` subprocess per
|
||||||
|
scoped path.
|
||||||
|
- Never modifies any file other than the baseline file, and only when
|
||||||
|
invoked with `--update-baseline`.
|
||||||
|
- Always runs both checks (does not short-circuit), so a single CI log
|
||||||
|
shows every failure mode at once.
|
||||||
|
|
||||||
|
**Dependencies**
|
||||||
|
|
||||||
|
- Inbound: `i18n-cjk-guard.yml` workflow; developers running locally.
|
||||||
|
- Outbound: `git` subprocess (`git grep`, `git rev-parse`).
|
||||||
|
- External: none.
|
||||||
|
|
||||||
|
**Contracts**: Service [x] / API [ ] / Event [ ] / Batch [ ] / State [x]
|
||||||
|
|
||||||
|
##### Service Interface (CLI)
|
||||||
|
|
||||||
|
```text
|
||||||
|
i18n_cjk_guard.py [--update-baseline] [--baseline PATH] [--repo-root PATH]
|
||||||
|
```
|
||||||
|
|
||||||
|
Type-annotated module signature (Python type hints, public functions
|
||||||
|
only):
|
||||||
|
|
||||||
|
```python
from __future__ import annotations  # LocaleFinding is defined below

import pathlib

def main(argv: list[str]) -> int: ...

def run_check(repo_root: pathlib.Path, baseline_path: pathlib.Path) -> int:
    """Run both checks; return 0 on success, 1 on any failure."""

def update_baseline(repo_root: pathlib.Path, baseline_path: pathlib.Path) -> int:
    """Refresh the baseline file with current per-path counts; return 0."""

def scan_locale_cjk(en_json_path: pathlib.Path) -> list[LocaleFinding]:
    """Return a list of (key, line_number, snippet) tuples for every
    CJK occurrence in locales/en.json. Empty list when clean."""

def count_path_cjk(repo_root: pathlib.Path, scoped_path: str) -> int:
    """Return the number of CJK match lines under scoped_path,
    using `git grep -nIP '[\\x{4e00}-\\x{9fff}]' -- <scoped_path>`."""

def read_baseline(baseline_path: pathlib.Path) -> dict[str, int]:
    """Parse the baseline file. Each non-empty, non-comment line is
    '<path>\\t<count>'. Raise BaselineError on any malformed input
    or missing file."""

def write_baseline(baseline_path: pathlib.Path, counts: dict[str, int]) -> None:
    """Atomically overwrite the baseline file with sorted entries
    and a single trailing newline."""
```
|
||||||
|
|
||||||
|
Where:
|
||||||
|
|
||||||
|
```python
import re

LocaleFinding = tuple[str, int, str]  # (dotted_key, line_number, snippet)
SCOPED_PATHS: tuple[str, ...] = ("backend/app", "frontend/src")
EN_JSON_REL_PATH: str = "locales/en.json"
CJK_PATTERN: str = "[\\x{4e00}-\\x{9fff}]"  # passed to git grep -P
CJK_RE: re.Pattern[str] = re.compile(r"[一-鿿]")  # same range, U+4E00..U+9FFF
SNIPPET_MAX_LEN: int = 80
```
|
||||||
|
|
||||||
|
- **Preconditions**: invoked with CWD at the repo root or
|
||||||
|
`--repo-root` set; `git` is on `$PATH`; the working tree is the
|
||||||
|
intended scan target.
|
||||||
|
- **Postconditions** (default mode): exit 0 iff both checks pass;
|
||||||
|
exit 1 otherwise. Stdout receives the success summary; stderr
|
||||||
|
receives findings on failure. The baseline file is unchanged.
|
||||||
|
- **Postconditions** (`--update-baseline`): the baseline file is
|
||||||
|
rewritten to current per-path counts and exit 0 is returned.
|
||||||
|
- **Invariants**: regex range, scoped paths, and baseline file path
|
||||||
|
are constants — no env-var override.
|
||||||
|
|
||||||
|
##### State Management
|
||||||
|
|
||||||
|
- **State model**: a dict `{<scoped_path>: <count>}` parsed from
|
||||||
|
the baseline file.
|
||||||
|
- **Persistence**: plain-text file at
|
||||||
|
`.kiro/specs/i18n-ci-guard/baseline.txt`. Atomic write via
|
||||||
|
`tmp + os.replace`.
|
||||||
|
- **Concurrency**: single-writer (developer running
|
||||||
|
`--update-baseline`); CI workers only read.
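
The `tmp + os.replace` persistence rule could look roughly like the sketch below. It is one possible reading, not the committed implementation: the `HEADER` constant and the use of `tempfile.mkstemp` are assumptions, and the header text is copied from the committed baseline file.

```python
import os
import pathlib
import tempfile

# Assumed header, copied from the committed baseline file.
HEADER = (
    "# Per-path CJK baseline for the i18n CI guard.\n"
    "# Format: <path>\\t<count>. Sorted lexicographically.\n"
    "# Refresh via: python scripts/ci/i18n_cjk_guard.py --update-baseline\n"
)


def write_baseline(baseline_path: pathlib.Path, counts: dict[str, int]) -> None:
    """Write sorted '<path>\\t<count>' lines to a temp file, then swap it into place."""
    body = "".join(f"{path}\t{count}\n" for path, count in sorted(counts.items()))
    # Create the temp file next to the target so os.replace stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=str(baseline_path.parent), suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8", newline="\n") as handle:
            handle.write(HEADER + body)
        os.replace(tmp_name, baseline_path)
    except BaseException:
        # Best-effort cleanup; the previous baseline is left untouched on failure.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

Because the swap is a single `os.replace`, a reader never sees a half-written baseline: it sees either the old file or the new one.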
|
||||||
|
|
||||||
|
**Implementation Notes**
|
||||||
|
|
||||||
|
- Output format mirrors `scripts/check_i18n_logs.py`:
|
||||||
|
`<file>:<line>: <reason>: <snippet>` on stderr, summary on stdout,
|
||||||
|
trailing `OK` or `N issues`.
|
||||||
|
- The exact refresh command printed on regression failure is:
|
||||||
|
`python scripts/ci/i18n_cjk_guard.py --update-baseline`.
|
||||||
|
- `count_path_cjk` invokes `git grep` via `subprocess.run` with
|
||||||
|
`check=False`; `git grep` exits 1 when there are zero matches, so
|
||||||
|
the function treats exit codes 0 and 1 as success and any other
|
||||||
|
code as a hard error.
|
||||||
|
- Localised key extraction for `en.json` walks the parsed JSON dict;
|
||||||
|
line numbers are obtained by re-reading the file as text and
|
||||||
|
matching the value's first textual occurrence.
|
||||||
|
- Risks: see `research.md` § Risks & Mitigations.
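
The exit-code handling described in the notes above can be sketched as follows. The helper `count_from_git_grep` is a hypothetical name introduced here so that the exit-code logic can be exercised without a git repository; raising `RuntimeError` on hard errors is likewise an assumption.

```python
import pathlib
import subprocess

CJK_PATTERN = "[\\x{4e00}-\\x{9fff}]"  # same pattern the design passes to git grep -P


def count_from_git_grep(returncode: int, stdout: str, stderr: str) -> int:
    """Translate a finished `git grep` run into a match-line count."""
    if returncode == 1:
        return 0  # git grep exits 1 when nothing matched; that is not an error
    if returncode != 0:
        raise RuntimeError(f"git grep failed: {stderr.strip()}")
    # One output line per matching source line.
    return len([line for line in stdout.split("\n") if line])


def count_path_cjk(repo_root: pathlib.Path, scoped_path: str) -> int:
    """Count tracked, non-binary lines under scoped_path that contain CJK."""
    result = subprocess.run(
        ["git", "grep", "-nIP", CJK_PATTERN, "--", scoped_path],
        cwd=repo_root,
        capture_output=True,
        text=True,
        encoding="utf-8",
        errors="replace",
        check=False,  # exit codes are interpreted by count_from_git_grep
    )
    return count_from_git_grep(result.returncode, result.stdout, result.stderr)
```

Keeping the interpretation separate from the subprocess call means the unit tests for exit codes 0, 1, and anything else do not need a temporary repository.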
|
||||||
|
|
||||||
|
### CI Workflow
|
||||||
|
|
||||||
|
#### `i18n-cjk-guard.yml`
|
||||||
|
|
||||||
|
| Field | Detail |
|
||||||
|
|-------|--------|
|
||||||
|
| Intent | Run the guard on every PR to `main` |
|
||||||
|
| Requirements | 5.1, 5.2, 5.3, 5.4, 5.5, 5.6 |
|
||||||
|
| Owner / Reviewers | i18n maintainers |
|
||||||
|
|
||||||
|
**Contracts**: Batch / Job [x]
|
||||||
|
|
||||||
|
##### Batch / Job Contract
|
||||||
|
|
||||||
|
- **Trigger**: `on: pull_request: branches: [main]`.
|
||||||
|
- **Input / validation**: PR head ref checkout via
|
||||||
|
`actions/checkout@v4` with `fetch-depth: 1`. Python set up via
|
||||||
|
`actions/setup-python@v5` with `python-version: '3.11'`.
|
||||||
|
- **Output / destination**: pass/fail status surfaced as a GitHub
|
||||||
|
Actions check on the PR. Script stdout/stderr appears in the
|
||||||
|
workflow log.
|
||||||
|
- **Idempotency & recovery**: re-running the workflow re-evaluates the
|
||||||
|
same working tree; no persistent side effects on the runner.
|
||||||
|
|
||||||
|
##### Workflow shape (sketch)
|
||||||
|
|
||||||
|
```yaml
name: i18n CJK Guard
on:
  pull_request:
    branches: [main]
jobs:
  guard:
    runs-on: ubuntu-latest
    timeout-minutes: 1
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 1
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - run: python scripts/ci/i18n_cjk_guard.py
```
|
||||||
|
|
||||||
|
### Baseline Data File
|
||||||
|
|
||||||
|
#### `baseline.txt`
|
||||||
|
|
||||||
|
| Field | Detail |
|
||||||
|
|-------|--------|
|
||||||
|
| Intent | Persist the per-path CJK match-count baseline |
|
||||||
|
| Requirements | 2.2, 4.1, 4.2 |
|
||||||
|
|
||||||
|
**Contracts**: State [x]
|
||||||
|
|
||||||
|
##### Format
|
||||||
|
|
||||||
|
```text
|
||||||
|
# Per-path CJK baseline for the i18n CI guard.
|
||||||
|
# Format: <path>\t<count>. Sorted lexicographically.
|
||||||
|
# Refresh via: python scripts/ci/i18n_cjk_guard.py --update-baseline
|
||||||
|
backend/app <int>
|
||||||
|
frontend/src <int>
|
||||||
|
```
|
||||||
|
|
||||||
|
- One header block of `#`-prefixed comments (parser ignores).
|
||||||
|
- Blank lines ignored.
|
||||||
|
- Lines must match `^(?P<path>[^\t\n]+)\t(?P<count>\d+)$`.
|
||||||
|
- Trailing newline mandatory.
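
A parser for the format rules above might look like this sketch. The line regex is taken from the rule above; the split into `parse_baseline` and `read_baseline` and the exact error messages are assumptions made for illustration.

```python
import pathlib
import re

LINE_RE = re.compile(r"^(?P<path>[^\t\n]+)\t(?P<count>\d+)$")


class BaselineError(Exception):
    """Raised when the baseline file is missing or malformed."""


def parse_baseline(text: str) -> dict[str, int]:
    """Parse baseline text, skipping blank lines and '#' comments."""
    if not text.endswith("\n"):
        raise BaselineError("missing trailing newline")
    counts: dict[str, int] = {}
    # Drop the empty string that follows the final newline.
    for number, line in enumerate(text.split("\n")[:-1], start=1):
        if not line.strip() or line.startswith("#"):
            continue
        match = LINE_RE.match(line)
        if match is None:
            raise BaselineError(f"line {number}: expected '<path>\\t<count>'")
        counts[match["path"]] = int(match["count"])
    return counts


def read_baseline(baseline_path: pathlib.Path) -> dict[str, int]:
    """Read and parse the baseline file; a missing file is a BaselineError."""
    try:
        text = baseline_path.read_text(encoding="utf-8")
    except FileNotFoundError as exc:
        raise BaselineError(f"{baseline_path}: missing") from exc
    return parse_baseline(text)
```

A line that separates path and count with a space instead of a tab is rejected, which matches the malformed-input case listed in the testing strategy.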
|
||||||
|
|
||||||
|
## Data Models
|
||||||
|
|
||||||
|
### Domain Model
|
||||||
|
|
||||||
|
- `LocaleFinding` — value object
|
||||||
|
`(dotted_key: str, line_number: int, snippet: str)`.
|
||||||
|
- `PathCount` — pair `(scoped_path: str, count: int)`. The full
|
||||||
|
baseline is a `dict[str, int]` keyed by scoped path.
|
||||||
|
|
||||||
|
Invariants:
|
||||||
|
|
||||||
|
- `count` is a non-negative integer.
|
||||||
|
- `scoped_path` is one of `SCOPED_PATHS`.
|
||||||
|
- `LocaleFinding.snippet` is at most `SNIPPET_MAX_LEN` characters,
|
||||||
|
truncated with an ellipsis when needed.
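
The `LocaleFinding` invariants can be illustrated with a small sketch that works on the catalogue text directly. It assumes the catalogue is nested objects with string leaves and that values are stored unescaped (as `json.dumps(..., ensure_ascii=False)` would write them); `truncate`, `iter_strings`, and `scan_locale_text` are hypothetical helper names.

```python
import json
import re

CJK_RE = re.compile("[\u4e00-\u9fff]")
SNIPPET_MAX_LEN = 80
LocaleFinding = tuple[str, int, str]  # (dotted_key, line_number, snippet)


def truncate(value: str, limit: int = SNIPPET_MAX_LEN) -> str:
    """Cap a snippet at `limit` characters, ending with an ellipsis when cut."""
    return value if len(value) <= limit else value[: limit - 1] + "…"


def iter_strings(node: object, prefix: str = ""):
    """Yield (dotted_key, value) for every string leaf in nested dicts."""
    if isinstance(node, dict):
        for key, child in node.items():
            yield from iter_strings(child, f"{prefix}.{key}" if prefix else key)
    elif isinstance(node, str):
        yield prefix, node


def scan_locale_text(text: str) -> list[LocaleFinding]:
    """Return one finding per string value that contains a CJK character."""
    lines = text.split("\n")
    findings: list[LocaleFinding] = []
    for key, value in iter_strings(json.loads(text)):
        if not CJK_RE.search(value):
            continue
        # First textual occurrence of the encoded value; 0 if it cannot be located.
        needle = json.dumps(value, ensure_ascii=False)
        line_number = next((i for i, line in enumerate(lines, 1) if needle in line), 0)
        findings.append((key, line_number, truncate(value)))
    return findings
```

Matching on the first textual occurrence means two keys with the same offending value report the same line number, which is a known limitation of this approach.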
|
||||||
|
|
||||||
|
## Error Handling
|
||||||
|
|
||||||
|
### Error Strategy
|
||||||
|
|
||||||
|
- Every non-zero exit is accompanied by a stderr message identifying
  the failing check, the offending file or path, and (for regressions)
  the refresh command. In normal flow the script never lets an
  exception escape `main()`; unexpected I/O errors propagate as
  `OSError`, and the resulting traceback surfaces them in the CI log.
|
||||||
|
|
||||||
|
### Error Categories and Responses
|
||||||
|
|
||||||
|
- **Locale failure** (Req 1.2): one stderr line per offending key
|
||||||
|
(`locales/en.json:<line>: cjk-in-en: <key> = <snippet>`), then a
|
||||||
|
trailing `N issues` summary.
|
||||||
|
- **Regression failure** (Req 3.2): one stderr line per regressed
|
||||||
|
path (`<path>: cjk-regression: baseline=<b> current=<c> delta=+<d>`)
|
||||||
|
followed by a one-line refresh hint:
|
||||||
|
`# refresh via: python scripts/ci/i18n_cjk_guard.py --update-baseline`.
|
||||||
|
- **Missing en.json** (Req 1.4): stderr `locales/en.json: missing
|
||||||
|
catalogue file`, exit 1.
|
||||||
|
- **Missing or malformed baseline** (Req 4.5): stderr
|
||||||
|
`<baseline-path>: missing or malformed; refresh via …`, exit 1.
|
||||||
|
- **`git grep` unavailable / non-PCRE**: stderr
|
||||||
|
`git grep failed: <stderr>`, exit 1.
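
The regression-failure format above can be sketched as a small pure function. The name `regression_lines` is hypothetical, and the sketch assumes the baseline already contains an entry for every scoped path (a missing entry would be reported earlier as a malformed baseline).

```python
REFRESH_HINT = "# refresh via: python scripts/ci/i18n_cjk_guard.py --update-baseline"


def regression_lines(baseline: dict[str, int], current: dict[str, int]) -> list[str]:
    """Return stderr lines for every path whose count exceeds its baseline."""
    lines = [
        f"{path}: cjk-regression: baseline={baseline[path]} "
        f"current={count} delta=+{count - baseline[path]}"
        for path, count in sorted(current.items())
        if count > baseline[path]
    ]
    # The refresh hint is appended once, and only when something regressed.
    return lines + [REFRESH_HINT] if lines else []
```

A count that drops below the baseline produces no output, so the guard passes and the baseline can be ratcheted down with an intentional refresh.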
|
||||||
|
|
||||||
|
### Monitoring
|
||||||
|
|
||||||
|
- The guard is a single short-lived script. All observability is
|
||||||
|
delegated to GitHub Actions logs (stdout/stderr, run duration).
|
||||||
|
No external telemetry.
|
||||||
|
|
||||||
|
## Testing Strategy
|
||||||
|
|
||||||
|
### Unit Tests (Python)
|
||||||
|
|
||||||
|
Place tests under `scripts/ci/tests/test_i18n_cjk_guard.py` (or invoke
|
||||||
|
the script directly via subprocess in a tmp git repo). The project's
|
||||||
|
test runner is `pytest` (already used by `backend/`), but the new
|
||||||
|
tests must be runnable with `python -m pytest` from the repo root
|
||||||
|
without backend dependencies. Tests are scoped to:
|
||||||
|
|
||||||
|
1. `scan_locale_cjk` — clean catalogue returns empty list; planted CJK
|
||||||
|
value returns a single `LocaleFinding` with the correct key and
|
||||||
|
line number.
|
||||||
|
2. `count_path_cjk` — given a tmp git repo with N planted CJK lines,
|
||||||
|
returns N; binary file matches are excluded; untracked file
|
||||||
|
matches are excluded.
|
||||||
|
3. `read_baseline` / `write_baseline` round-trip — write counts,
|
||||||
|
re-read, equal.
|
||||||
|
4. `read_baseline` malformed input — non-tab line → `BaselineError`.
|
||||||
|
5. `run_check` end-to-end — passing baseline → exit 0; regressed
|
||||||
|
baseline → exit 1 and stderr contains the refresh command.
|
||||||
|
|
||||||
|
### Integration Tests
|
||||||
|
|
||||||
|
1. Workflow shape — `actionlint` (optional, if installed locally) on
|
||||||
|
`i18n-cjk-guard.yml`. At minimum, `python -c "import yaml;
|
||||||
|
yaml.safe_load(open('.github/workflows/i18n-cjk-guard.yml'))"` for
|
||||||
|
YAML validity.
|
||||||
|
2. Local end-to-end — run
|
||||||
|
`python scripts/ci/i18n_cjk_guard.py` from the repo root with the
|
||||||
|
committed baseline; expect exit 0 on a clean checkout of `main`.
|
||||||
|
3. Refresh end-to-end — run with `--update-baseline`; verify
|
||||||
|
baseline file is rewritten and a second default run is exit 0.
|
||||||
|
|
||||||
|
### Performance / Load
|
||||||
|
|
||||||
|
- Single-pass `git grep` over the scoped paths runs in <2 s on the
|
||||||
|
current repo. The workflow's `timeout-minutes: 1` is a hard ceiling
|
||||||
|
per Req 5.6.
|
||||||
|
|
||||||
|
## Optional Sections
|
||||||
|
|
||||||
|
### Security Considerations
|
||||||
|
|
||||||
|
- The guard reads only tracked text files; no secrets are accessed.
|
||||||
|
- The workflow uses `GITHUB_TOKEN` only implicitly via
  `actions/checkout`; no additional permissions are requested. The
  committed workflow declares `permissions: contents: read`, which is
  sufficient.
|
||||||
|
|
@ -0,0 +1,169 @@
|
||||||
|
# Gap Analysis — i18n-ci-guard
|
||||||
|
|
||||||
|
Comparison of the approved requirements against the current MiroFish
|
||||||
|
codebase, focused on what already exists, what is missing, and what
|
||||||
|
options the design phase should choose between.
|
||||||
|
|
||||||
|
## 1. Current State Investigation
|
||||||
|
|
||||||
|
### Domain assets already in the repo
|
||||||
|
|
||||||
|
- **`scripts/check_i18n_logs.py`** — Python-stdlib-only, exit-code-based
|
||||||
|
i18n verification script. Uses the same canonical CJK regex
|
||||||
|
`[一-鿿]` (`U+4E00..U+9FFF`) the new guard needs, prints findings as
|
||||||
|
`<file>:<line>: <reason>: <snippet>`, and was written for ticket #6.
|
||||||
|
Strong precedent for the new guard's CLI surface and output format.
|
||||||
|
- **`scripts/_apply_translations.py`, `scripts/_codemod_i18n.py`,
|
||||||
|
`scripts/_merge_locale_keys.py`** — i18n tooling sibling scripts.
|
||||||
|
Convention is to keep auxiliary i18n scripts under `scripts/` at the
|
||||||
|
repo root.
|
||||||
|
- **`.github/workflows/docker-image.yml`** — only existing GH Actions
|
||||||
|
workflow; triggers on tag pushes and `workflow_dispatch`. No PR-time
|
||||||
|
workflow exists yet, so the new guard introduces the project's first
|
||||||
|
PR-blocking CI check.
|
||||||
|
- **PR #27 / branch `chore/i18n-10-e2e-english-verification`** — defines
|
||||||
|
the audit methodology referenced by the ticket. Its `audit_cjk.sh`
|
||||||
|
uses `git grep -nIP '[\x{4e00}-\x{9fff}]' -- backend/app frontend/src
|
||||||
|
locales/en.json` — the canonical scoped scan command. PR #27 is open;
|
||||||
|
the new guard must work with or without it merged.
|
||||||
|
- **`.kiro/specs/<feature>/`** — established home for spec artefacts.
|
||||||
|
`i18n-externalize-backend-logs/` is the closest precedent for an
|
||||||
|
i18n-flavoured spec.
|
||||||
|
- **`locales/en.json`, `locales/zh.json`, `locales/languages.json`** —
|
||||||
|
shared i18n source consumed by both runtimes.
|
||||||
|
|
||||||
|
### Conventions extracted
|
||||||
|
|
||||||
|
- Auxiliary scripts: `scripts/<purpose>.py`, Python ≥3.11 stdlib only,
|
||||||
|
shebang `#!/usr/bin/env python3`, double-quoted strings, snake_case,
|
||||||
|
Google-style docstrings on the module and public functions.
|
||||||
|
- Output format: `<file>:<line>: <reason>: <snippet>`, summary line
|
||||||
|
`OK` or `N issues`, exit `0`/`1`.
|
||||||
|
- Reuse the canonical regex `[一-鿿]` rather than re-deriving range
|
||||||
|
literals.
|
||||||
|
- 4-space indent, ≤120 cols, no trailing whitespace, single trailing
|
||||||
|
newline (`.claude/rules/dev-guidelines.md`).
|
||||||
|
|
||||||
|
### Integration surfaces
|
||||||
|
|
||||||
|
- **CI**: GitHub Actions, `.github/workflows/`. `ubuntu-latest` runner,
|
||||||
|
Python 3.11+ via `actions/setup-python@v5` (use the same version
|
||||||
|
pin already present in the docker-image workflow ecosystem if any).
|
||||||
|
- **Repo layout boundaries** scoped by the audit: `backend/app/`,
|
||||||
|
`frontend/src/`, `locales/en.json` — all live at repo root or two
|
||||||
|
levels deep.
|
||||||
|
- **Git working tree**: the guard relies on `git grep -I` for tracked,
|
||||||
|
text-only matches; this binds the guard to a runner that has `git`
|
||||||
|
available (true on `ubuntu-latest` and on developer machines).
|
||||||
|
|
||||||
|
## 2. Requirement-to-Asset Map
|
||||||
|
|
||||||
|
| Req | Need | Existing asset | Gap |
| --- | ---- | -------------- | --- |
| 1 | CJK scan of `locales/en.json` | `scripts/check_i18n_logs.py` already loads `locales/*.json` and runs the canonical regex. | Missing — new guard must scan en.json specifically and emit `key:line` per offender. |
| 2 | CJK count under `backend/app/` and `frontend/src/` against baseline | Audit `audit_cjk.sh` (PR #27) demonstrates `git grep -nIP` is the canonical scan; no baseline file exists yet on main. | Missing — no per-path counter, no baseline file. |
| 3 | Actionable failure messaging | `check_i18n_logs.py` output format reusable. | Missing — need refresh-baseline command in failure text. |
| 4 | Baseline file lifecycle | None. | Missing — file format and refresh subcommand to design. |
| 5 | GH Actions PR integration | `.github/workflows/` directory exists; one tag-only workflow. | Missing — new `pull_request` workflow. |
| 6 | Local reproducibility | Existing scripts run locally with stdlib; same pattern reusable. | None — covered by following the existing pattern. |
|
||||||
|
|
||||||
|
## 3. Implementation Approach Options
|
||||||
|
|
||||||
|
### Option A — Extend `scripts/check_i18n_logs.py`
|
||||||
|
|
||||||
|
Add a new `--cjk-guard` mode (catalogue scan + per-path baseline diff)
|
||||||
|
to the existing script, then call it from the new workflow.
|
||||||
|
|
||||||
|
- ✅ One file to maintain; reuses the regex constant and CLI.
|
||||||
|
- ❌ The existing script is tightly scoped to the in-scope backend
|
||||||
|
modules and the parity check. Mixing a PR-gating regression check into
|
||||||
|
it dilutes its intent and grows it past the SRP line that the
|
||||||
|
surrounding scripts respect.
|
||||||
|
- ❌ The existing script targets a fixed list of backend modules; the
|
||||||
|
new guard scans whole subtrees. The two scopes don't fit one CLI.
|
||||||
|
|
||||||
|
### Option B — New, focused script `scripts/ci/i18n_cjk_guard.py` + new workflow (recommended)
|
||||||
|
|
||||||
|
A new directory `scripts/ci/` holds CI-only scripts; the guard is a
|
||||||
|
single file that performs both checks and supports a `--update-baseline`
|
||||||
|
flag. New workflow `.github/workflows/i18n-cjk-guard.yml` runs it on
|
||||||
|
every PR to `main`.
|
||||||
|
|
||||||
|
- ✅ Clean separation: production-i18n script (`check_i18n_logs.py`)
|
||||||
|
and CI-gating script (`i18n_cjk_guard.py`) live side by side without
|
||||||
|
overlapping responsibilities.
|
||||||
|
- ✅ Mirrors the established convention of one script per
|
||||||
|
responsibility under `scripts/`.
|
||||||
|
- ✅ The baseline file lives under the spec dir
|
||||||
|
(`.kiro/specs/i18n-ci-guard/baseline.txt`), matching the ticket's
|
||||||
|
"baseline must be committed and reviewable" requirement.
|
||||||
|
- ❌ One more file in the repo, but the file is small (~150 LoC).
|
||||||
|
|
||||||
|
### Option C — Hybrid: shared `cjk_scan.py` helper + thin guard script
|
||||||
|
|
||||||
|
Factor the regex + git-grep logic into a tiny shared helper consumed by
|
||||||
|
both `check_i18n_logs.py` and the new guard.
|
||||||
|
|
||||||
|
- ✅ DRY for the regex constant.
|
||||||
|
- ❌ Premature abstraction: today the only shared element is a single
|
||||||
|
one-line regex. The two scripts have different scopes, output
|
||||||
|
formats, and consumers. Pulling a helper out now satisfies
|
||||||
|
consistency without paying for itself; defer until a third caller
|
||||||
|
appears.
|
||||||
|
|
||||||
|
### Recommendation
|
||||||
|
|
||||||
|
**Option B**. It matches the project's established "one focused script
|
||||||
|
per responsibility" convention, isolates the new CI surface from
|
||||||
|
existing i18n scripts, and keeps the baseline file collocated with
|
||||||
|
spec metadata where reviewers expect to find it.
|
||||||
|
|
||||||
|
## 4. Research Items for Design Phase
|
||||||
|
|
||||||
|
- **Baseline file format**: prefer a stable, line-oriented text format
|
||||||
|
over JSON to minimize diff churn (e.g., `path<TAB>count` per line,
|
||||||
|
trailing newline). Confirm in design.
|
||||||
|
- **`git grep` invocation portability**: `git grep -nIP` works on all
|
||||||
|
git builds compiled with PCRE support. `ubuntu-latest` ships ≥2.40.
|
||||||
|
No portability concern; record the assumption explicitly.
|
||||||
|
- **`fetch-depth`** for the `actions/checkout@v4` step: `git grep`
|
||||||
|
scans the working tree, not history, so a shallow clone (`fetch-depth:
|
||||||
|
1`) is sufficient.
|
||||||
|
- **Workflow timeout budget**: capture the empirical runtime of the
|
||||||
|
full scan locally (already measured: a single `git grep` over the
|
||||||
|
scoped paths runs in <2 seconds with ~3.6k matches). The 60-second
|
||||||
|
ceiling in Req 5 is comfortable.
|
||||||
|
- **Failure-message refresh command** wording: the design should pin
|
||||||
|
the exact command shown to contributors so it stays one stable
|
||||||
|
string developers can copy.
|
||||||
|
- **Initial baseline values**: with `git grep -nIP '[\x{4e00}-\x{9fff}]'`
|
||||||
|
on the current branch — `backend/app` = 2707, `frontend/src` = 902,
|
||||||
|
`locales/en.json` = 0. The committed baseline must be regenerated
|
||||||
|
against `main` at implementation time so it reflects the merge target.
|
||||||
|
|
||||||
|
## 5. Effort & Risk
|
||||||
|
|
||||||
|
- **Effort**: **S** (1–3 days). Small, self-contained additions
|
||||||
|
(one Python script, one workflow file, one baseline file, plus the
|
||||||
|
spec). All patterns already exist in the repo.
|
||||||
|
- **Risk**: **Low**. No production-source changes, no new dependencies,
|
||||||
|
no architectural shifts. The only failure mode is a noisy guard
|
||||||
|
blocking unrelated PRs — mitigated by the per-path baseline ratchet.
|
||||||
|
|
||||||
|
## 6. Recommendations for Design Phase
|
||||||
|
|
||||||
|
- Adopt **Option B** (new focused script + new workflow + baseline file
|
||||||
|
under spec dir).
|
||||||
|
- Lock in the canonical regex `[一-鿿]` and the canonical scan command
|
||||||
|
`git grep -nIP '[\x{4e00}-\x{9fff}]' -- <path>` to keep this guard
|
||||||
|
bytewise-aligned with the audit pipeline.
|
||||||
|
- Use a line-oriented baseline format keyed by scoped path; explicit
|
||||||
|
`--update-baseline` flag (or equivalent subcommand) updates it; no
|
||||||
|
implicit overwrite.
|
||||||
|
- Output: machine-friendly findings on stderr, summary on stdout,
|
||||||
|
exit `0`/`1`.
|
||||||
|
- The workflow should run only on `pull_request` to `main` (Req 5.1)
|
||||||
|
with `fetch-depth: 1` and `actions/setup-python@v5`. No third-party
|
||||||
|
packages.
|
||||||
|
- Baseline counts must be recomputed against `main` before the PR
|
||||||
|
ships; do not commit baselines from a feature branch's working tree.
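The output contract recommended above (findings on stderr, summary on stdout, exit `0`/`1`) could be composed roughly as follows. This is a minimal sketch, not the guard's actual code: the name `run_checks` and the idea of passing each check as a callable that returns its failure lines are assumptions made for illustration. Every check runs even after an earlier one fails, so a single CI log shows all failures.

```python
import sys
from typing import Callable, Iterable

# Hypothetical shape: a check is a callable returning its failure lines.
Check = Callable[[], list[str]]


def run_checks(checks: Iterable[tuple[str, Check]], out=sys.stdout, err=sys.stderr) -> int:
    """Run every check (no short-circuit) and return the process exit code."""
    exit_code = 0
    for name, check in checks:
        failures = check()
        if failures:
            exit_code = 1
            for line in failures:
                print(line, file=err)  # findings go to stderr
            print(f"{name}: {len(failures)} issues", file=err)
        else:
            print(f"{name}: OK", file=out)  # one-line summary on stdout
    return exit_code
```

A script entry point would typically end with `sys.exit(run_checks([...]))`.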
|
||||||
|
|
@ -0,0 +1,189 @@
|
||||||
|
# Requirements Document
|
||||||
|
|
||||||
|
## Project Description (Input)
|
||||||
|
Add a permanent CI guard that runs an i18n CJK audit on every pull request.
|
||||||
|
|
||||||
|
Linked GitHub issue: #26 (.ticket/26.md).
|
||||||
|
|
||||||
|
The guard must fail a PR build when:
|
||||||
|
1. locales/en.json contains any CJK character (range U+4E00..U+9FFF), or
|
||||||
|
2. The total count of CJK matches across backend/app/ and frontend/src/ regresses (i.e. exceeds) a committed baseline value.
|
||||||
|
|
||||||
|
## Introduction
|
||||||
|
|
||||||
|
The i18n initiative has driven the project toward English-by-default UI, logs,
|
||||||
|
prompts, and documentation. Manual audits (see PR #27, the
|
||||||
|
`i18n-e2e-english-verification` spec) have repeatedly surfaced regressions
|
||||||
|
where Chinese strings re-enter the codebase. This spec installs a permanent,
|
||||||
|
self-contained CI guard that runs on every pull request and fails the build
|
||||||
|
when (a) `locales/en.json` is no longer CJK-clean, or (b) the total CJK match
|
||||||
|
count under `backend/app/` and `frontend/src/` regresses against a committed
|
||||||
|
baseline.
|
||||||
|
|
||||||
|
The guard is intentionally minimal: it captures the two highest-signal checks
|
||||||
|
from the larger audit pipeline so it can run on every PR with a sub-minute
|
||||||
|
budget and without depending on the (currently unmerged) verification spec.
|
||||||
|
The committed baseline lets the project ratchet down gaps over time without
|
||||||
|
blocking unrelated PRs on pre-existing CJK content.
|
||||||
|
|
||||||
|
## Boundary Context
|
||||||
|
|
||||||
|
- **In scope**:
|
||||||
|
- A locally runnable Python script that performs both guard checks on the
|
||||||
|
current working tree.
|
||||||
|
- A baseline file committed under the spec directory recording the
|
||||||
|
accepted CJK match counts per scoped path.
|
||||||
|
- A GitHub Actions workflow that runs the script on every pull request
|
||||||
|
targeting `main` and fails the build when either check fails.
|
||||||
|
- A clear, actionable failure message (which path regressed, baseline
|
||||||
|
value, current value, command to update the baseline).
|
||||||
|
- **Out of scope**:
|
||||||
|
- The full classification pipeline (`classify.py`, `render_report.py`,
|
||||||
|
`post_comment.sh`) from the unmerged `i18n-e2e-english-verification`
|
||||||
|
spec — those scripts perform deeper audit work and are not required
|
||||||
|
for the PR-time guard.
|
||||||
|
- Auto-updating the baseline on `main` (the baseline is a normal
|
||||||
|
reviewable file).
|
||||||
|
- Translation work itself; this spec only enforces a regression gate.
|
||||||
|
- Any change to production source under `backend/app/`, `frontend/src/`,
|
||||||
|
or `locales/` apart from translations needed to satisfy the guard
|
||||||
|
against its own initial baseline.
|
||||||
|
- **Adjacent expectations**:
|
||||||
|
- PR #27 (`chore/i18n-10-e2e-english-verification`) provides the
|
||||||
|
methodology referenced here. This spec must remain functional whether
|
||||||
|
PR #27 has been merged or not.
|
||||||
|
- The guard reuses the canonical CJK regex range
|
||||||
|
`[一-鿿]` already established by that audit.
|
||||||
|
|
||||||
|
## Requirements
|
||||||
|
|
||||||
|
### Requirement 1: Locale-catalogue CJK cleanliness check
|
||||||
|
|
||||||
|
**Objective:** As a maintainer of the English locale catalogue, I want every
|
||||||
|
PR to fail when `locales/en.json` reintroduces any CJK character, so that the
|
||||||
|
English catalogue stays CJK-free.
|
||||||
|
|
||||||
|
#### Acceptance Criteria
|
||||||
|
|
||||||
|
1. When the guard script is run from the repository root, the i18n CI Guard
|
||||||
|
shall scan the contents of `locales/en.json` for any character in the
|
||||||
|
range `U+4E00..U+9FFF`.
|
||||||
|
2. If `locales/en.json` contains at least one such character, the i18n CI
|
||||||
|
Guard shall exit with a non-zero status and report each offending
|
||||||
|
`key:line` pair on standard output.
|
||||||
|
3. While `locales/en.json` contains zero such characters, the i18n CI Guard
|
||||||
|
shall report the catalogue as CJK-clean.
|
||||||
|
4. If `locales/en.json` is missing or unreadable, the i18n CI Guard shall
|
||||||
|
exit with a non-zero status and emit an explicit error message naming
|
||||||
|
the missing file.
|
||||||
|
|
||||||
|
### Requirement 2: Backend/frontend CJK regression check against committed baseline
|
||||||
|
|
||||||
|
**Objective:** As a maintainer of English support across the codebase, I
|
||||||
|
want every PR to fail when the total CJK match count under `backend/app/`
|
||||||
|
or `frontend/src/` exceeds a committed baseline, so that the codebase
|
||||||
|
ratchets monotonically toward English-only without blocking PRs on
|
||||||
|
pre-existing CJK content.
|
||||||
|
|
||||||
|
#### Acceptance Criteria
|
||||||
|
|
||||||
|
1. When the guard script is run, the i18n CI Guard shall count the total
|
||||||
|
number of CJK matches (range `U+4E00..U+9FFF`, line-level, text files
|
||||||
|
only) under each of the scoped paths `backend/app/` and `frontend/src/`.
|
||||||
|
2. The i18n CI Guard shall read the baseline counts from a single
|
||||||
|
committed baseline file under the spec directory.
|
||||||
|
3. If the current count for any scoped path exceeds the baseline count for
|
||||||
|
that path, the i18n CI Guard shall exit with a non-zero status.
|
||||||
|
4. While the current count for every scoped path is less than or equal to
|
||||||
|
the baseline, the i18n CI Guard shall exit with status zero for this
|
||||||
|
check.
|
||||||
|
5. The i18n CI Guard shall ignore matches inside binary files
|
||||||
|
(image, font, archive, or other non-text formats) by relying
|
||||||
|
on `git grep -I` semantics.
|
||||||
|
6. The i18n CI Guard shall scope its scan to tracked files only (matches
|
||||||
|
in untracked or ignored files shall not contribute to the count).
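Criteria 3 and 4 reduce to a per-path comparison. A minimal sketch, assuming the current and baseline counts are already available as plain dicts; the function name `find_regressions` is illustrative, and treating a path that is absent from the baseline as a baseline of `0` is an assumption rather than something this spec states.

```python
def find_regressions(current: dict[str, int], baseline: dict[str, int]) -> list[tuple[str, int, int, int]]:
    """Return (path, baseline, current, delta) for every path above its baseline."""
    regressions = []
    for path in sorted(current):
        allowed = baseline.get(path, 0)  # assumption: unknown path means baseline 0
        count = current[path]
        if count > allowed:  # equal or lower counts pass (criterion 4)
            regressions.append((path, allowed, count, count - allowed))
    return regressions
```

An empty result means the regression check passes; a lower count never fails the check, which is what lets the baseline ratchet downward.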
|
||||||
|
|
||||||
|
### Requirement 3: Actionable failure messaging
|
||||||
|
|
||||||
|
**Objective:** As a contributor whose PR was rejected by the guard, I want
|
||||||
|
the failure message to tell me exactly what regressed and how to fix it,
|
||||||
|
so that I can either translate the offending content or — when intentional —
|
||||||
|
update the baseline through normal review.
|
||||||
|
|
||||||
|
#### Acceptance Criteria
|
||||||
|
|
||||||
|
1. If the locale-catalogue check fails, the i18n CI Guard shall print, for
|
||||||
|
each offending entry: the dotted catalogue key, the line number in
|
||||||
|
`locales/en.json`, and a truncated snippet of the value.
|
||||||
|
2. If the regression check fails, the i18n CI Guard shall print, for each
|
||||||
|
regressed scoped path: the path name, the baseline count, the current
|
||||||
|
count, and the delta.
|
||||||
|
3. If the regression check fails, the i18n CI Guard shall print the exact
|
||||||
|
shell command a contributor must run locally to refresh the baseline
|
||||||
|
file so the PR can be re-reviewed against the new value.
|
||||||
|
4. The i18n CI Guard shall print, on success, a one-line summary per check
|
||||||
|
confirming the catalogue is CJK-clean and the per-path counts are at or
|
||||||
|
below baseline.
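The regression message described in criteria 2 and 3 might be rendered as sketched below. The per-path line layout follows the `cjk-regression` format used in the implementation tasks and the refresh command is the one recorded in the committed baseline header; the helper name `render_regressions` and the wording of the hint sentence are assumptions.

```python
REFRESH_COMMAND = "python scripts/ci/i18n_cjk_guard.py --update-baseline"


def render_regressions(regressions: list[tuple[str, int, int]]) -> list[str]:
    """Turn (path, baseline, current) triples into failure lines plus a refresh hint."""
    lines = [
        f"{path}: cjk-regression: baseline={base} current={cur} delta=+{cur - base}"
        for path, base, cur in regressions
    ]
    # Hypothetical wording; only the command itself comes from the baseline header.
    lines.append(f"If the increase is intentional, refresh the baseline with: {REFRESH_COMMAND}")
    return lines
```

Keeping the command in one constant gives contributors a single stable string to copy.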
|
||||||
|
|
||||||
|
### Requirement 4: Baseline file lifecycle
|
||||||
|
|
||||||
|
**Objective:** As a reviewer enforcing English support, I want the baseline
|
||||||
|
to live in the repository as a small, human-readable file that only changes
|
||||||
|
through code review, so that downward ratcheting is intentional and
|
||||||
|
auditable.
|
||||||
|
|
||||||
|
#### Acceptance Criteria
|
||||||
|
|
||||||
|
1. The i18n CI Guard shall store the baseline as a single committed file
|
||||||
|
under `.kiro/specs/i18n-ci-guard/`.
|
||||||
|
2. The baseline file shall record one count per scoped path, in a stable,
|
||||||
|
diff-friendly text format (no JSON line shuffling, no trailing
|
||||||
|
whitespace).
|
||||||
|
3. When the guard script is invoked with an explicit "refresh baseline"
|
||||||
|
subcommand or flag, the i18n CI Guard shall overwrite the baseline file
|
||||||
|
with the current per-path counts and exit with status zero.
|
||||||
|
4. While no refresh flag is supplied, the i18n CI Guard shall never modify
|
||||||
|
the baseline file.
|
||||||
|
5. If the baseline file is missing at check time, the i18n CI Guard shall
|
||||||
|
exit with a non-zero status and instruct the contributor to refresh it.
|
||||||
|
|
||||||
|
### Requirement 5: GitHub Actions PR integration
|
||||||
|
|
||||||
|
**Objective:** As a project maintainer, I want every pull request targeting
|
||||||
|
`main` to be gated by the guard, so that no merge silently regresses the
|
||||||
|
English-only state of the catalogue or codebase.
|
||||||
|
|
||||||
|
#### Acceptance Criteria
|
||||||
|
|
||||||
|
1. The i18n CI Guard workflow shall trigger on every `pull_request` event
|
||||||
|
whose base ref is `main`.
|
||||||
|
2. While the workflow runs, the i18n CI Guard shall check out the PR head
|
||||||
|
commit with enough history (a shallow clone suffices) for `git grep` to scan tracked
|
||||||
|
files.
|
||||||
|
3. When the guard script exits with non-zero status, the workflow shall
|
||||||
|
fail and surface the script's standard output and standard error in the
|
||||||
|
GitHub Actions log.
|
||||||
|
4. When the guard script exits with status zero, the workflow shall pass.
|
||||||
|
5. The workflow shall use only Python from the standard
|
||||||
|
`actions/setup-python` distribution and tools already available on the
|
||||||
|
GitHub-hosted `ubuntu-latest` runner (`bash`, `git`); it shall not
|
||||||
|
install third-party Python packages.
|
||||||
|
6. The workflow shall complete within sixty seconds of wall-clock time on
|
||||||
|
a clean `ubuntu-latest` runner.
|
||||||
|
|
||||||
|
### Requirement 6: Local reproducibility
|
||||||
|
|
||||||
|
**Objective:** As a developer preparing a PR, I want to run the same guard
|
||||||
|
locally before pushing, so that I can catch regressions before CI does.
|
||||||
|
|
||||||
|
#### Acceptance Criteria
|
||||||
|
|
||||||
|
1. When the guard script is invoked from a developer machine that has
|
||||||
|
Python 3.11 or newer and `git` available, the i18n CI Guard shall
|
||||||
|
produce the same pass/fail result and the same per-path counts that
|
||||||
|
it would produce in CI for the same working tree.
|
||||||
|
2. The i18n CI Guard shall expose a single, stable invocation entry point
|
||||||
|
(a script under `scripts/ci/`) documented in the spec's design and
|
||||||
|
README touchpoints.
|
||||||
|
3. The i18n CI Guard shall require zero environment variables or secrets
|
||||||
|
to run locally.
|
||||||
|
|
@ -0,0 +1,175 @@
|
||||||
|
# Research & Design Decisions — i18n-ci-guard
|
||||||
|
|
||||||
|
## Summary
|
||||||
|
- **Feature**: `i18n-ci-guard`
|
||||||
|
- **Discovery Scope**: Simple Addition (one Python script + one GH Actions
|
||||||
|
workflow + one baseline file). Extension-flavoured because it builds on
|
||||||
|
established `scripts/` conventions and the canonical CJK regex used by
|
||||||
|
the larger audit pipeline.
|
||||||
|
- **Key Findings**:
|
||||||
|
- The canonical CJK match command `git grep -nIP '[\x{4e00}-\x{9fff}]'
|
||||||
|
-- <path>` is already used by the unmerged audit pipeline (PR #27)
|
||||||
|
and is portable on any git build compiled with PCRE support (`ubuntu-latest` ships ≥2.40).
|
||||||
|
- `scripts/check_i18n_logs.py` is a strong CLI/style precedent:
|
||||||
|
Python-stdlib-only, exit `0`/`1`, output as `<file>:<line>:
|
||||||
|
<reason>: <snippet>`, canonical regex `[一-鿿]`.
|
||||||
|
- The repository has no existing `pull_request`-triggered GH Actions
|
||||||
|
workflow; this guard introduces the first one. The only existing
|
||||||
|
workflow (`.github/workflows/docker-image.yml`) runs on tag pushes
|
||||||
|
only.
|
||||||
|
- Current per-path counts on this branch:
|
||||||
|
`backend/app=2707, frontend/src=902, locales/en.json=0`. These are
|
||||||
|
sample counts; the committed baseline must be regenerated against
|
||||||
|
`main` at implementation time.
|
||||||
|
|
||||||
|
## Research Log
|
||||||
|
|
||||||
|
### Canonical scan command
|
||||||
|
- **Context**: Requirement 2 needs a stable per-path CJK count and
|
||||||
|
Requirement 5.5 forbids third-party packages.
|
||||||
|
- **Sources Consulted**:
|
||||||
|
- `audit_cjk.sh` from PR #27 commit `3481408`.
|
||||||
|
- `git grep` man page.
|
||||||
|
- **Findings**:
|
||||||
|
- `git grep -nIP '[\x{4e00}-\x{9fff}]' -- <path>` returns one match
|
||||||
|
per matching line in tracked, text-only files. `-I` excludes binary
|
||||||
|
files; `-P` enables PCRE2 so the `\x{...}` Unicode range works.
|
||||||
|
- This matches the input format consumed by the existing audit
|
||||||
|
classifier, so the guard's match counts are directly comparable
|
||||||
|
across pipelines.
|
||||||
|
- **Implications**:
|
||||||
|
- The guard re-uses this exact command; no new dependencies.
|
||||||
|
- Because `-I` skips binary files and tracked-only is the default,
|
||||||
|
Requirements 2.5 and 2.6 are satisfied by the command itself
|
||||||
|
rather than by additional script logic.
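The findings above translate into a small wrapper around `git grep`. The sketch below is one possible shape, not the committed script: `count_cjk_lines` and the injectable `run` parameter (added so the function can be exercised without a real repository) are assumptions. Exit code `1` from `git grep` means "no matches" and is mapped to `0`; any other non-zero code is treated as a hard error.

```python
import subprocess

# Same PCRE range as the canonical scan command.
CJK_PATTERN = r"[\x{4e00}-\x{9fff}]"


def count_cjk_lines(path: str, run=subprocess.run) -> int:
    """Count tracked, text-only lines under `path` that contain a CJK character."""
    result = run(
        ["git", "grep", "-nIP", CJK_PATTERN, "--", path],
        capture_output=True,
        check=False,
    )
    if result.returncode == 1:  # git grep: no matches
        return 0
    if result.returncode != 0:  # e.g. git missing PCRE support, or not a repository
        detail = result.stderr.decode("utf-8", errors="replace").strip()
        raise RuntimeError(f"git grep failed for {path}: {detail}")
    # One output line per matching source line; count raw newlines in bytes.
    return result.stdout.count(b"\n")
```

Counting newline bytes rather than decoded lines avoids miscounts if a matched line contains unusual Unicode line separators.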
|
||||||
|
|
||||||
|
### Baseline file format
|
||||||
|
- **Context**: Requirement 4 needs a diff-friendly committed baseline.
|
||||||
|
- **Sources Consulted**:
|
||||||
|
- Diff churn behaviour of JSON vs. line-oriented text in this repo's
|
||||||
|
history (e.g. `locales/*.json` PR diffs frequently re-key, while
|
||||||
|
plain-text `parity.txt` from PR #27 reads cleanly).
|
||||||
|
- **Findings**:
|
||||||
|
- Line-oriented `<path>\t<count>` files produce minimal diffs and
|
||||||
|
require no JSON parser.
|
||||||
|
- A two-line file (one per scoped path) is large enough to be
|
||||||
|
self-explanatory and small enough to never line-shuffle.
|
||||||
|
- **Implications**:
|
||||||
|
- Use plain text, sorted by path, single trailing newline. Reject
|
||||||
|
the file as malformed if the script cannot parse it, treating it like the missing-file case in Req 4.5.
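A parser for this format fits in a few lines. The following is a sketch under stated assumptions: the name `parse_baseline` is illustrative, `ValueError` stands in for whatever error type the guard actually raises, and skipping `#` comment lines and blank lines mirrors the header block of the committed baseline file.

```python
def parse_baseline(text: str) -> dict[str, int]:
    """Parse `<path>\\t<count>` lines, skipping comments and blank lines."""
    counts: dict[str, int] = {}
    for number, raw in enumerate(text.split("\n"), start=1):
        if not raw.strip() or raw.startswith("#"):
            continue
        path, sep, count = raw.partition("\t")
        # Require a tab separator, a non-empty path, and a plain ASCII integer.
        if not sep or not path or not (count.isascii() and count.isdigit()):
            raise ValueError(f"baseline line {number} is malformed: {raw!r}")
        counts[path] = int(count)
    return counts
```

Rejecting space-separated or otherwise irregular lines keeps hand edits from silently changing the ratchet.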
|
||||||
|
|
||||||
|
### Locale-catalogue scan path
|
||||||
|
- **Context**: Requirement 1 wants `key:line` per CJK offender in
|
||||||
|
`locales/en.json`.
|
||||||
|
- **Sources Consulted**:
|
||||||
|
- `scripts/check_i18n_logs.py` (`flatten_keys` reuse pattern).
|
||||||
|
- `check_parity.py` from PR #27 (`flatten`, `[cjk-in-en]` block).
|
||||||
|
- **Findings**:
|
||||||
|
- Both precedents flatten the locale dict and run the canonical
|
||||||
|
regex against each leaf string value. Line numbers are derivable
|
||||||
|
by re-reading the file as text and matching the value's first
|
||||||
|
occurrence (good enough for an actionable error message).
|
||||||
|
- Empty-string values and non-string leaf values (booleans, null)
|
||||||
|
are skipped.
|
||||||
|
- **Implications**:
|
||||||
|
- Implement a tiny flatten-then-scan helper inside the guard
|
||||||
|
script; do not add a new shared utility module.
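The flatten-then-scan helper could look like the sketch below. It is illustrative only: `flatten`, `scan_catalogue`, and the 40-character snippet limit are assumptions, nested objects are walked while lists are ignored for brevity, and the line lookup assumes the catalogue stores CJK characters literally rather than as `\uXXXX` escapes (a value that cannot be located reports line `0`).

```python
import json
import re

CJK_RE = re.compile("[\u4e00-\u9fff]")
SNIPPET_LEN = 40  # hypothetical truncation length


def flatten(node, prefix=""):
    """Yield (dotted_key, value) for every string leaf of a nested dict."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from flatten(value, f"{prefix}.{key}" if prefix else key)
    elif isinstance(node, str):
        yield prefix, node  # non-string leaves (bool, null, numbers) are skipped


def scan_catalogue(text: str) -> list[tuple[str, int, str]]:
    """Return (dotted_key, line_number, snippet) for each value containing CJK."""
    lines = text.split("\n")
    findings = []
    for key, value in flatten(json.loads(text)):
        if not CJK_RE.search(value):
            continue
        # Locate the value's first textual occurrence in its JSON-encoded form.
        needle = json.dumps(value, ensure_ascii=False)
        line_no = next((i for i, line in enumerate(lines, 1) if needle in line), 0)
        findings.append((key, line_no, value[:SNIPPET_LEN]))
    return findings
```

First-occurrence matching can point at the wrong line when two keys share an identical value, which is acceptable for an error message whose job is to get the contributor close.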
|
||||||
|
|
||||||
|
### GH Actions trigger and budget
|
||||||
|
- **Context**: Requirements 5.1, 5.5, 5.6.
|
||||||
|
- **Sources Consulted**:
|
||||||
|
- GitHub-hosted runners reference (`ubuntu-latest`).
|
||||||
|
- `actions/setup-python@v5` README.
|
||||||
|
- **Findings**:
|
||||||
|
- `ubuntu-latest` has Python 3.10+ pre-installed; `actions/setup-python@v5`
|
||||||
|
pins to 3.11 in <5 s.
|
||||||
|
- A single `git grep` over the scoped paths runs in <2 s on this
|
||||||
|
repo (~3.6k matches). End-to-end the workflow comfortably fits
|
||||||
|
inside the 60 s ceiling.
|
||||||
|
- **Implications**:
|
||||||
|
- Use `actions/checkout@v4` with `fetch-depth: 1`,
|
||||||
|
`actions/setup-python@v5` with `python-version: '3.11'`, and run
|
||||||
|
the script directly. No caching layer needed.
|
||||||
|
|
||||||
|
## Architecture Pattern Evaluation
|
||||||
|
|
||||||
|
| Option | Description | Strengths | Risks / Limitations | Notes |
|
||||||
|
|--------|-------------|-----------|---------------------|-------|
|
||||||
|
| A. Extend `check_i18n_logs.py` | Add `--cjk-guard` mode to existing script | Reuses one file | Conflates two scopes; existing script is module-scoped, guard is subtree-scoped | Rejected |
|
||||||
|
| B. New `scripts/ci/i18n_cjk_guard.py` + new workflow | Single-purpose script + workflow + baseline file | Clean SRP; matches "one script per responsibility" precedent | One additional file | **Selected** |
|
||||||
|
| C. Shared `cjk_scan.py` helper + thin guard | Factor regex/git-grep into helper | DRY for regex constant | Premature abstraction; only one shared symbol today | Rejected |
|
||||||
|
|
||||||
|
## Design Decisions
|
||||||
|
|
||||||
|
### Decision: Single-purpose CI script + GH Actions workflow (Option B)
|
||||||
|
- **Context**: Requirements 1–6 demand a small, self-contained guard.
|
||||||
|
- **Alternatives Considered**: A (extend), C (shared helper).
|
||||||
|
- **Selected Approach**: New script `scripts/ci/i18n_cjk_guard.py`,
|
||||||
|
new workflow `.github/workflows/i18n-cjk-guard.yml`, baseline file
|
||||||
|
`.kiro/specs/i18n-ci-guard/baseline.txt`.
|
||||||
|
- **Rationale**: Matches the project's "one focused script per
|
||||||
|
responsibility" convention; isolates a CI-blocking surface from the
|
||||||
|
existing i18n developer scripts; keeps the baseline collocated with
|
||||||
|
the spec for review traceability.
|
||||||
|
- **Trade-offs**: One more file in `scripts/` vs. tighter cohesion.
|
||||||
|
- **Follow-up**: When a third caller wants the canonical regex, factor
|
||||||
|
it out then.
|
||||||
|
|
||||||
|
### Decision: Plain-text baseline format
|
||||||
|
- **Context**: Requirement 4.2 demands stable, diff-friendly format.
|
||||||
|
- **Alternatives Considered**: JSON, YAML.
|
||||||
|
- **Selected Approach**: One line per scoped path: `<path>\t<count>`,
|
||||||
|
sorted lexicographically by path, single trailing newline.
|
||||||
|
- **Rationale**: Zero parser dependency; predictable diffs; trivial
|
||||||
|
to refresh atomically.
|
||||||
|
- **Trade-offs**: Less expressive than JSON (no nested structure), but
|
||||||
|
the data model is two integers — nesting is unnecessary.
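An atomic refresh of this format can be sketched as follows. The header text is copied from the committed baseline file; the function name `write_baseline` and the use of `tempfile.mkstemp` to create the temporary file are assumptions about one reasonable way to implement a "write to a temp file, then `os.replace`" refresh.

```python
import os
import tempfile

HEADER = (
    "# Per-path CJK baseline for the i18n CI guard.\n"
    "# Format: <path>\\t<count>. Sorted lexicographically.\n"
    "# Refresh via: python scripts/ci/i18n_cjk_guard.py --update-baseline\n"
)


def write_baseline(path: str, counts: dict[str, int]) -> None:
    """Write sorted `<path>\\t<count>` lines atomically, ending in one newline."""
    body = "".join(f"{name}\t{counts[name]}\n" for name in sorted(counts))
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives next to the target so os.replace stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8", newline="\n") as handle:
            handle.write(HEADER + body)
        os.replace(tmp_path, path)  # atomic swap; readers never see a partial file
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

Sorting the paths on every write keeps refresh diffs limited to the counts that actually changed.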
|
||||||
|
|
||||||
|
### Decision: Refresh via `--update-baseline` subcommand-style flag
|
||||||
|
- **Context**: Requirement 4.3 needs an explicit refresh path.
|
||||||
|
- **Alternatives Considered**: Separate `update_baseline.py` script;
|
||||||
|
Makefile target.
|
||||||
|
- **Selected Approach**: Single script with two modes: default (check
|
||||||
|
+ exit 0/1) and `--update-baseline` (overwrite baseline + exit 0).
|
||||||
|
- **Rationale**: One CLI surface to remember; the failure message
|
||||||
|
prints the exact command to run.
|
||||||
|
- **Trade-offs**: Slightly more conditional logic in one script;
|
||||||
|
acceptable given the small total LoC.
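The two-mode CLI surface can be expressed with `argparse` alone. This is a sketch, not the shipped parser: `build_parser`, the description, and the help strings are assumptions, while the `--update-baseline` flag, the `--baseline` override, and the default baseline path come from this spec.

```python
import argparse

DEFAULT_BASELINE = ".kiro/specs/i18n-ci-guard/baseline.txt"


def build_parser() -> argparse.ArgumentParser:
    """Build the guard CLI: default check mode plus an explicit refresh flag."""
    parser = argparse.ArgumentParser(
        description="Fail when CJK content regresses against the committed baseline."
    )
    parser.add_argument(
        "--update-baseline",
        action="store_true",
        help="overwrite the baseline with the current per-path counts and exit 0",
    )
    parser.add_argument(
        "--baseline",
        default=DEFAULT_BASELINE,
        help="path to the baseline file (default: %(default)s)",
    )
    return parser
```

Because the refresh mode is a `store_true` flag that defaults to off, a plain invocation can never modify the baseline file.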
|
||||||
|
|
||||||
|
### Decision: Workflow runs only on `pull_request` to `main`
|
||||||
|
- **Context**: Requirement 5.1.
|
||||||
|
- **Alternatives Considered**: Run on `push` to all branches as well;
|
||||||
|
run on `pull_request` to any base branch.
|
||||||
|
- **Selected Approach**: `on.pull_request.branches: [main]` only.
|
||||||
|
- **Rationale**: Aligns with how the existing project uses `main` as
|
||||||
|
the protected branch (see `gh pr list` history; every feature PR
|
||||||
|
targets `main`). Avoids redundant runs on intra-branch chains.
|
||||||
|
- **Trade-offs**: A direct push to `main` would not be guarded — but
|
||||||
|
branch protection already discourages that path (per
|
||||||
|
`dev-guidelines.md`).
|
||||||
|
|
||||||
|
## Risks & Mitigations
|
||||||
|
|
||||||
|
- **Risk**: Baseline drifts upward unintentionally during
|
||||||
|
`--update-baseline` runs, hiding real regressions.
|
||||||
|
- *Mitigation*: Failure message instructs contributors to refresh
|
||||||
|
*only when intentional*; the baseline file is reviewed in the same
|
||||||
|
PR diff. Acceptance Criterion 3.3 makes this explicit.
|
||||||
|
- **Risk**: `git grep -P` not built with PCRE on a developer's local
|
||||||
|
git build (rare on Linux/macOS, possible on minimal Windows builds).
|
||||||
|
- *Mitigation*: The guard prints a clear error if `git grep` exits
|
||||||
|
non-zero with PCRE mode; documents Python ≥3.11 + git ≥2.20 as
|
||||||
|
prerequisites.
|
||||||
|
- **Risk**: Baseline counts captured on a feature branch include
|
||||||
|
changes not yet on `main`, mis-anchoring the ratchet.
|
||||||
|
- *Mitigation*: The implementation task explicitly recomputes
|
||||||
|
baseline against `origin/main` before committing; documented in
|
||||||
|
`tasks.md`.
|
||||||
|
|
||||||
|
## References
|
||||||
|
- PR #27 audit pipeline (`audit_cjk.sh`, `check_parity.py`,
|
||||||
|
`classify.py`) — methodology source of truth.
|
||||||
|
- `scripts/check_i18n_logs.py` — CLI/style precedent.
|
||||||
|
- `git grep` man page — `-n`, `-I`, `-P` flag semantics.
|
||||||
|
- GitHub Actions `actions/setup-python@v5` and `actions/checkout@v4`
|
||||||
|
README pages.
|
||||||
|
|
@ -0,0 +1,24 @@
|
||||||
|
{
|
||||||
|
"feature_name": "i18n-ci-guard",
|
||||||
|
"created_at": "2026-05-08T00:25:37Z",
|
||||||
|
"updated_at": "2026-05-08T00:40:00Z",
|
||||||
|
"language": "en",
|
||||||
|
"phase": "tasks-generated",
|
||||||
|
"approvals": {
|
||||||
|
"requirements": {
|
||||||
|
"generated": true,
|
||||||
|
"approved": true
|
||||||
|
},
|
||||||
|
"design": {
|
||||||
|
"generated": true,
|
||||||
|
"approved": true
|
||||||
|
},
|
||||||
|
"tasks": {
|
||||||
|
"generated": true,
|
||||||
|
"approved": true
|
||||||
|
}
|
||||||
|
},
|
||||||
|
"ready_for_implementation": true,
|
||||||
|
"ticket": "26",
|
||||||
|
"ticket_url": "https://github.com/salestech-group/MiroFish/issues/26"
|
||||||
|
}
|
||||||
|
|
@ -0,0 +1,157 @@
|
||||||
|
# Implementation Tasks — i18n-ci-guard
|
||||||
|
|
||||||
|
> Approved spec: see `requirements.md`, `design.md`, `research.md`,
|
||||||
|
> `gap-analysis.md` in this directory.
|
||||||
|
|
||||||
|
## Tasks
|
||||||
|
|
||||||
|
- [x] 1. Foundation: scaffold the CI guard script with stable CLI surface and stdlib-only dependencies
|
||||||
|
- [x] 1.1 Create the empty guard script and CLI skeleton
|
||||||
|
- Place the new script at the path designated by the design (`scripts/ci/`).
|
||||||
|
- Establish the module docstring, the canonical CJK regex constant, the
|
||||||
|
scoped-paths constant tuple, and the `argparse` parser exposing default
|
||||||
|
check mode plus an explicit `--update-baseline` flag and a
|
||||||
|
`--baseline` path override.
|
||||||
|
- Confirm the script exits 0 on a smoke `--help` invocation and rejects
|
||||||
|
unknown flags with non-zero exit.
|
||||||
|
- Observable: running `python scripts/ci/i18n_cjk_guard.py --help` from
|
||||||
|
the repo root prints usage text containing every documented flag and
|
||||||
|
exits 0; running with an unknown flag exits non-zero.
|
||||||
|
- _Requirements: 5.5, 6.2, 6.3_
|
||||||
|
- _Boundary: i18n_cjk_guard.py_
|
||||||
|
|
||||||
|
- [x] 2. Core: implement the two CJK checks
|
||||||
|
- [x] 2.1 Implement the locale-catalogue scan
|
||||||
|
- Recursively walk the parsed `locales/en.json` dict, applying the
|
||||||
|
canonical regex to every string leaf to gather offending entries.
|
||||||
|
- Compute the source line number by re-reading the file as text and
|
||||||
|
matching the value's first textual occurrence; truncate snippets to
|
||||||
|
the documented snippet length.
|
||||||
|
- On a missing or unreadable catalogue file, emit a clear stderr
|
||||||
|
message and exit non-zero.
|
||||||
|
- Observable: against a synthetic clean catalogue, the function returns
|
||||||
|
an empty list; against a synthetic catalogue with one CJK value, it
|
||||||
|
returns exactly one finding tuple with the correct dotted key and
|
||||||
|
line number.
|
||||||
|
- _Requirements: 1.1, 1.2, 1.3, 1.4, 3.1_
|
||||||
|
- _Boundary: i18n_cjk_guard.py_
|
||||||
|
|
||||||
|
- [x] 2.2 (P) Implement the per-path CJK count via `git grep`
  - Invoke `git grep -nIP '[\x{4e00}-\x{9fff}]' -- <scoped_path>` for each
    scoped path; treat exit codes 0 (matches found) and 1 (no matches) as
    success, any other exit code as a hard error reported on stderr.
  - Count lines of stdout; the result for a zero-match path must be the
    integer `0`, never an exception.
  - Reject working-tree states where `git` is not available or PCRE is
    not enabled, with a clear stderr message.
  - Observable: against a tmp git repository with N planted CJK lines
    under a scoped path, the function returns N; with zero CJK content,
    it returns 0; binary files and untracked files do not contribute.
  - _Requirements: 2.1, 2.4, 2.5, 2.6_
  - _Boundary: i18n_cjk_guard.py_

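The exit-code rule in 2.2 can be separated from the subprocess call and checked on its own. The sketch below assumes the caller has already run `git grep` and passes in the return code and captured output; `count_matches` is a hypothetical name, and the value `128` in the usage note is only a sample of "any other exit code".

```python
def count_matches(returncode: int, stdout: str, stderr: str = "") -> int:
    """Map a finished `git grep` run to a match count.

    Hypothetical helper: exit 0 means matches were printed, exit 1 means
    no matches, and anything else is treated as a hard error.
    """
    if returncode == 1:
        return 0
    if returncode != 0:
        raise RuntimeError(
            f"git grep failed (exit {returncode}): {stderr.strip()}"
        )
    # One output line per matching source line; ignore empty lines.
    return sum(1 for line in stdout.splitlines() if line)


two_hits = count_matches(0, "a.py:1:# 一\na.py:3:# 二三\n")
no_hits = count_matches(1, "")
```

Calling `count_matches(128, "", "fatal: ...")` raises `RuntimeError`, so a broken `git` setup cannot be mistaken for a clean tree.
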
- [x] 2.3 Implement baseline file read/write with strict format
  - Parse the baseline file as `<path>\t<count>` lines, ignoring `#`
    comments and blank lines, raising a typed error on malformed input
    or missing file.
  - Write atomically (`tmp + os.replace`) with sorted entries, a single
    header comment block, and a single trailing newline.
  - Observable: a round-trip write/read of a deterministic counts dict
    yields the same dict; a baseline file containing a non-tab line is
    rejected with a clear error; the baseline file ends with exactly one
    `\n`.
  - _Requirements: 4.2, 4.3_
  - _Boundary: i18n_cjk_guard.py_

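A compact way to picture the 2.3 contract is a write-then-read round trip in a temporary directory. The sketch below is an illustration under assumptions: `write_counts` and `read_counts` are hypothetical names, the header is reduced to one placeholder comment, and `ValueError` stands in for the typed error the task calls for.

```python
import os
import tempfile
from pathlib import Path


def write_counts(path: Path, counts: dict[str, int]) -> None:
    """Write sorted `<path>\\t<count>` lines via a temp file plus os.replace."""
    body = "".join(f"{key}\t{counts[key]}\n" for key in sorted(counts))
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_text("# header\n" + body, encoding="utf-8")
    os.replace(tmp, path)  # atomic swap on the same filesystem


def read_counts(path: Path) -> dict[str, int]:
    """Parse the file back, rejecting any data line without a tab."""
    counts: dict[str, int] = {}
    for line in path.read_text(encoding="utf-8").splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        key, sep, value = line.partition("\t")
        if not sep or not key or not value.isdigit():
            raise ValueError(f"malformed baseline line: {line!r}")
        counts[key] = int(value)
    return counts


with tempfile.TemporaryDirectory() as tmp_dir:
    target = Path(tmp_dir) / "baseline.txt"
    original = {"frontend/src": 902, "backend/app": 2792}
    write_counts(target, original)
    text = target.read_text(encoding="utf-8")
    round_tripped = read_counts(target)
```

Writing to a sibling temp file first means a crash mid-write leaves the previous baseline intact rather than a half-written one.
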
- [x] 3. Integration: wire the two checks into the default and refresh modes
- [x] 3.1 Compose the default check mode
  - Run both checks under all conditions (do not short-circuit), so a
    single CI log shows every failure in one pass.
  - Print a one-line success summary per check on stdout when both pass.
  - On locale failure, print `<file>:<line>: <reason>: <snippet>` lines
    on stderr and a trailing `N issues` summary; on regression failure,
    print `<path>: cjk-regression: baseline=<b> current=<c> delta=+<d>`
    lines plus the exact verbatim refresh command.
  - Surface a non-zero exit when either check fails and exit 0 only when
    both pass.
  - Observable: against a working tree with the committed baseline at or
    above the current count and a CJK-clean en.json, exit code is 0 and
    stdout contains the success summary; planting one CJK char in
    en.json or planting enough new CJK lines to break the baseline
    yields exit 1 and the documented stderr text.
  - _Requirements: 1.2, 1.3, 1.4, 2.2, 2.3, 2.4, 3.1, 3.2, 3.3, 3.4, 4.4, 4.5_
  - _Boundary: i18n_cjk_guard.py_

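The ratchet comparison in 3.1 comes down to "fail only when a path's current count exceeds its baseline". The sketch below shows that rule with the documented `cjk-regression` line format; `regression_lines` is a hypothetical name, the baseline numbers are the committed ones, and the `current` counts are invented for the example.

```python
def regression_lines(
    baseline: dict[str, int], current: dict[str, int]
) -> list[str]:
    """Return one report line per path whose count rose above its baseline.

    Hypothetical helper; a path missing from the baseline is treated as 0.
    """
    lines = []
    for path in sorted(current):
        allowed = baseline.get(path, 0)
        found = current[path]
        if found > allowed:
            lines.append(
                f"{path}: cjk-regression: baseline={allowed} "
                f"current={found} delta=+{found - allowed}"
            )
    return lines


baseline = {"backend/app": 2792, "frontend/src": 902}
current = {"backend/app": 2795, "frontend/src": 900}  # invented counts
report = regression_lines(baseline, current)
exit_code = 1 if report else 0
```

A drop in `frontend/src` does not offset the rise in `backend/app`: each path is compared on its own, which is what lets the codebase ratchet down path by path.
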
- [x] 3.2 Compose the `--update-baseline` mode
  - When the flag is provided, recompute current per-path counts and
    overwrite the baseline file via the atomic writer; print the new
    counts on stdout; exit 0.
  - When the flag is absent, never write the baseline file under any
    code path.
  - Observable: invoking with `--update-baseline` rewrites the baseline
    file's contents to match current counts and exits 0; running the
    default mode immediately afterward exits 0.
  - _Requirements: 4.3, 4.4_
  - _Boundary: i18n_cjk_guard.py_

- [x] 4. Establish the committed baseline anchored to `main`
- [x] 4.1 Capture initial baseline counts against `main`
  - Operate from a tree that reflects `origin/main`'s state for the
    scoped paths (e.g., a fresh checkout, a worktree at `origin/main`,
    or `git checkout origin/main -- backend/app frontend/src` followed
    by a clean revert) so the committed baseline does not over- or
    under-count relative to the merge target.
  - Run `--update-baseline` to materialize the counts; confirm the
    resulting file is exactly two non-comment data lines (one per
    scoped path) sorted lexicographically.
  - Observable: the baseline file is committed to
    `.kiro/specs/i18n-ci-guard/baseline.txt` and `python scripts/ci/i18n_cjk_guard.py`
    against the same `main`-aligned tree exits 0.
  - _Requirements: 4.1, 4.2_
  - _Boundary: baseline.txt_

- [x] 5. Wire the guard into GitHub Actions on every PR to `main`
- [x] 5.1 Add the PR-time workflow
  - Create the workflow file at the path designated by the design,
    triggered on `pull_request` whose base ref is `main`.
  - Set explicit minimal permissions (`contents: read`), a one-minute
    job timeout, `actions/checkout@v4` with `fetch-depth: 1`, and
    `actions/setup-python@v5` pinned to Python 3.11.
  - The single executable step invokes the guard script with no
    arguments; the workflow surfaces the script's stdout and stderr in
    the GitHub Actions log without filtering.
  - Observable: the workflow YAML parses cleanly; on a PR with no CJK
    regression, the job passes; on a PR that introduces a CJK regression
    or CJK in en.json, the job fails and the log shows the documented
    failure messages.
  - _Requirements: 5.1, 5.2, 5.3, 5.4, 5.5, 5.6_
  - _Boundary: i18n-cjk-guard.yml_

- [x] 6. Validation: tests and end-to-end checks
- [x] 6.1 Add unit and integration tests for the guard script
  - Cover the locale scan against a synthetic clean catalogue and a
    synthetic CJK-tainted catalogue, asserting findings tuples match.
  - Cover the per-path counter against a tmp git repo with both N>0
    and N=0 planted CJK lines, asserting the zero-match path exits
    cleanly with a count of 0.
  - Cover the baseline read/write round-trip and the malformed-input
    rejection path.
  - Cover the default mode end-to-end (pass and fail paths) with the
    expected exit codes and stderr fragments, including the verbatim
    refresh command on regression failure.
  - Observable: `python -m unittest scripts/ci/tests/test_i18n_cjk_guard.py`
    from the repo root passes locally with stdlib-only Python.
  - _Requirements: 1.1, 1.2, 1.3, 1.4, 2.1, 2.4, 2.5, 2.6, 3.3, 4.3, 4.5, 6.1, 6.3_
  - _Boundary: scripts/ci/tests/_

- [x] 6.2 Run the guard locally to confirm reproducibility against the committed baseline
  - From a clean working tree at `main` (or a worktree at `origin/main`
    with this branch's new files merged on top), invoke the guard with no
    arguments and confirm exit code 0 and the success summary.
  - Confirm the same command is the documented developer entry point
    referenced from the failure-message refresh hint.
  - Observable: terminal session shows exit code 0 and the documented
    one-line per-check success summary; the same script path (`scripts/ci/i18n_cjk_guard.py`)
    appears verbatim in the regression-failure refresh hint.
  - _Requirements: 6.1, 6.2, 6.3_
  - _Boundary: i18n_cjk_guard.py, baseline.txt_
@ -0,0 +1,393 @@
#!/usr/bin/env python3
"""i18n CJK guard for pull-request CI.

Run from the repository root::

    python scripts/ci/i18n_cjk_guard.py
    python scripts/ci/i18n_cjk_guard.py --update-baseline

Two checks always run (no short-circuit):

* ``locales/en.json`` must contain zero CJK characters
  (range ``U+4E00..U+9FFF``).
* CJK match counts under ``backend/app/`` and ``frontend/src/`` must not
  exceed the committed per-path baseline at
  ``.kiro/specs/i18n-ci-guard/baseline.txt``.

Both checks use the same canonical range. The catalogue scan applies it
through Python's ``re`` module; the per-path count runs
``git grep -nIP '[\\x{4e00}-\\x{9fff}]' -- <scoped_path>`` so the guard
stays aligned with the broader audit pipeline.

Stdlib only. Exit code is 0 on success and 1 on any failure or hard
error (``argparse`` exits with 2 on a usage error).
"""
from __future__ import annotations

import argparse
import json
import os
import re
import subprocess
import sys
from pathlib import Path

# U+4E00..U+9FFF, written with escapes so the range is readable in any editor.
CJK_RE: re.Pattern[str] = re.compile("[\u4e00-\u9fff]")
# The same range in PCRE syntax, passed to ``git grep -P``.
CJK_PATTERN: str = r"[\x{4e00}-\x{9fff}]"
SCOPED_PATHS: tuple[str, ...] = ("backend/app", "frontend/src")
EN_JSON_REL_PATH: str = "locales/en.json"
DEFAULT_BASELINE_REL_PATH: str = ".kiro/specs/i18n-ci-guard/baseline.txt"
SNIPPET_MAX_LEN: int = 80
REFRESH_COMMAND: str = "python scripts/ci/i18n_cjk_guard.py --update-baseline"
REFRESH_HINT: str = f"# refresh via: {REFRESH_COMMAND}"

# (dotted_key, line_number, snippet)
LocaleFinding = tuple[str, int, str]


class BaselineError(Exception):
    """Raised when the baseline file is missing or malformed."""


def _truncate(text: str, limit: int = SNIPPET_MAX_LEN) -> str:
    if len(text) <= limit:
        return text
    return text[: limit - 3] + "..."


def _flatten(prefix: str, value: object, out: list[tuple[str, object]]) -> None:
    if isinstance(value, dict):
        for key, child in value.items():
            child_prefix = f"{prefix}.{key}" if prefix else str(key)
            _flatten(child_prefix, child, out)
    else:
        out.append((prefix, value))


def _value_line_number(text_lines: list[str], value: str) -> int:
    """Best-effort line number for ``value`` in the original JSON text.

    Tries the raw value first (matches when the JSON file was written with
    ``ensure_ascii=False``), then the JSON-escaped form, then falls back to
    line 1 so callers always have a usable integer.
    """
    candidates: list[str] = [value]
    escaped = json.dumps(value)[1:-1]
    if escaped not in candidates:
        candidates.append(escaped)
    for candidate in candidates:
        if not candidate:
            continue
        for index, line in enumerate(text_lines, start=1):
            if candidate in line:
                return index
    return 1


def scan_locale_cjk(en_json_path: Path) -> list[LocaleFinding]:
    """Return ``(dotted_key, line_number, snippet)`` for every CJK leaf.

    Args:
        en_json_path: Path to ``locales/en.json``.

    Returns:
        A list of findings in document order. Empty when the catalogue is
        CJK-clean. Non-string leaves and empty strings are skipped.

    Raises:
        FileNotFoundError: If ``en_json_path`` does not exist.
        json.JSONDecodeError: If the file is not valid JSON.
    """
    raw = en_json_path.read_text(encoding="utf-8")
    data = json.loads(raw)
    flat: list[tuple[str, object]] = []
    _flatten("", data, flat)
    text_lines = raw.splitlines()
    findings: list[LocaleFinding] = []
    for key, value in flat:
        if not isinstance(value, str) or not value:
            continue
        if not CJK_RE.search(value):
            continue
        line_no = _value_line_number(text_lines, value)
        findings.append((key, line_no, _truncate(value)))
    return findings


def count_path_cjk(repo_root: Path, scoped_path: str) -> int:
    """Count CJK match lines under ``scoped_path`` via ``git grep -nIP``.

    Args:
        repo_root: Working-tree root used as ``git`` CWD.
        scoped_path: Repo-relative path to scan (e.g. ``backend/app``).

    Returns:
        The number of matching tracked-text lines. ``-I`` excludes binary
        files; untracked files are excluded by default.

    Raises:
        RuntimeError: If ``git`` cannot be started, or if ``git grep``
            fails for any reason other than "no matches" (exit code 1,
            which is treated as zero matches).
    """
    cmd = ["git", "grep", "-nIP", CJK_PATTERN, "--", scoped_path]
    try:
        proc = subprocess.run(
            cmd,
            cwd=repo_root,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            text=True,
        )
    except OSError as exc:
        # E.g. ``git`` is not on PATH; report a hard error, not a traceback.
        raise RuntimeError(
            f"unable to run git grep for {scoped_path}: {exc}"
        ) from exc
    if proc.returncode not in (0, 1):
        raise RuntimeError(
            f"git grep failed (exit {proc.returncode}) for {scoped_path}: "
            f"{proc.stderr.strip()}"
        )
    if not proc.stdout:
        return 0
    return sum(1 for line in proc.stdout.splitlines() if line)


def read_baseline(baseline_path: Path) -> dict[str, int]:
    """Parse the baseline file and return ``{scoped_path: count}``.

    Args:
        baseline_path: Absolute path to the baseline file.

    Returns:
        A dict keyed by scoped path with non-negative integer counts.

    Raises:
        BaselineError: If the file is missing or contains a malformed line.
    """
    if not baseline_path.exists():
        raise BaselineError(
            f"{baseline_path}: missing or malformed; "
            f"refresh via: {REFRESH_COMMAND}"
        )
    counts: dict[str, int] = {}
    for raw_line in baseline_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.rstrip()
        if not line or line.startswith("#"):
            continue
        if "\t" not in line:
            raise BaselineError(
                f"{baseline_path}: malformed line {raw_line!r}; "
                f"expected '<path>\\t<count>'"
            )
        path, _, count_str = line.partition("\t")
        if not path or not count_str.isdigit():
            raise BaselineError(
                f"{baseline_path}: malformed line {raw_line!r}; "
                f"expected '<path>\\t<count>'"
            )
        counts[path] = int(count_str)
    return counts


def write_baseline(baseline_path: Path, counts: dict[str, int]) -> None:
    """Atomically write the baseline file with sorted entries.

    Args:
        baseline_path: Target file path.
        counts: Per-path baseline counts; keys are written in lexicographic
            order with a single trailing newline.
    """
    header = (
        "# Per-path CJK baseline for the i18n CI guard.\n"
        "# Format: <path>\\t<count>. Sorted lexicographically.\n"
        f"# Refresh via: {REFRESH_COMMAND}\n"
    )
    body_lines = [f"{path}\t{counts[path]}" for path in sorted(counts)]
    body = "\n".join(body_lines) + "\n"
    contents = header + body
    baseline_path.parent.mkdir(parents=True, exist_ok=True)
    tmp = baseline_path.with_suffix(baseline_path.suffix + ".tmp")
    tmp.write_text(contents, encoding="utf-8")
    os.replace(tmp, baseline_path)


def _format_locale_finding(key: str, line_no: int, snippet: str) -> str:
    return f"{EN_JSON_REL_PATH}:{line_no}: cjk-in-en: {key} = {snippet}"


def _format_regression_line(path: str, baseline: int, current: int) -> str:
    delta = current - baseline
    sign = "+" if delta > 0 else ""
    return (
        f"{path}: cjk-regression: baseline={baseline} "
        f"current={current} delta={sign}{delta}"
    )


def run_check(repo_root: Path, baseline_path: Path) -> int:
    """Run both guard checks and return the script exit code.

    Args:
        repo_root: Working-tree root passed to ``git grep``.
        baseline_path: Path to the baseline file.

    Returns:
        ``0`` when both checks pass, ``1`` otherwise.
    """
    failed = False
    success_summary: list[str] = []

    # Check 1: the English catalogue must be CJK-clean.
    en_json_path = repo_root / EN_JSON_REL_PATH
    if not en_json_path.exists():
        print(f"{EN_JSON_REL_PATH}: missing catalogue file", file=sys.stderr)
        failed = True
    else:
        try:
            findings = scan_locale_cjk(en_json_path)
        except json.JSONDecodeError as exc:
            print(
                f"{EN_JSON_REL_PATH}: invalid JSON: {exc.msg}",
                file=sys.stderr,
            )
            findings = []
            failed = True
        except (OSError, UnicodeDecodeError) as exc:
            # Unreadable or non-UTF-8 catalogue: fail with a clear message.
            print(
                f"{EN_JSON_REL_PATH}: unreadable catalogue file: {exc}",
                file=sys.stderr,
            )
            findings = []
            failed = True
        if findings:
            for key, line_no, snippet in findings:
                print(
                    _format_locale_finding(key, line_no, snippet),
                    file=sys.stderr,
                )
            print(f"{len(findings)} issues", file=sys.stderr)
            failed = True
        elif not failed:
            success_summary.append("OK locales/en.json is CJK-clean")

    # Check 2: per-path counts must not exceed the committed baseline.
    try:
        baseline = read_baseline(baseline_path)
    except BaselineError as exc:
        print(str(exc), file=sys.stderr)
        return 1

    current_counts: dict[str, int] = {}
    try:
        for path in SCOPED_PATHS:
            current_counts[path] = count_path_cjk(repo_root, path)
    except RuntimeError as exc:
        # The exception message already names the failing git grep call.
        print(str(exc), file=sys.stderr)
        return 1

    regressions: list[str] = []
    for path in SCOPED_PATHS:
        baseline_value = baseline.get(path, 0)
        current_value = current_counts[path]
        if current_value > baseline_value:
            regressions.append(
                _format_regression_line(path, baseline_value, current_value)
            )

    if regressions:
        for line in regressions:
            print(line, file=sys.stderr)
        print(REFRESH_HINT, file=sys.stderr)
        failed = True
    else:
        per_path = ", ".join(
            f"{path}={current_counts[path]}<={baseline.get(path, 0)}"
            for path in SCOPED_PATHS
        )
        success_summary.append(
            f"OK per-path counts within baseline ({per_path})"
        )

    if not failed:
        for line in success_summary:
            print(line)

    return 1 if failed else 0


def update_baseline(repo_root: Path, baseline_path: Path) -> int:
    """Refresh ``baseline_path`` with current per-path counts.

    Args:
        repo_root: Working-tree root passed to ``git grep``.
        baseline_path: Target baseline file path; created if missing.

    Returns:
        ``0`` on success, ``1`` if ``git grep`` fails.
    """
    counts: dict[str, int] = {}
    try:
        for path in SCOPED_PATHS:
            counts[path] = count_path_cjk(repo_root, path)
    except RuntimeError as exc:
        # Leave the existing baseline untouched when counting fails.
        print(str(exc), file=sys.stderr)
        return 1
    write_baseline(baseline_path, counts)
    print(f"baseline updated: {baseline_path}")
    for path in sorted(counts):
        print(f"  {path}\t{counts[path]}")
    return 0


def _build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        prog="i18n_cjk_guard",
        description=(
            "PR-time guard: fail when locales/en.json contains CJK or when "
            "backend/app + frontend/src CJK match counts exceed the "
            "committed baseline."
        ),
    )
    parser.add_argument(
        "--update-baseline",
        action="store_true",
        help=(
            "overwrite the baseline file with current counts and exit 0"
        ),
    )
    parser.add_argument(
        "--baseline",
        type=Path,
        default=None,
        help=(
            f"path to the baseline file (default: {DEFAULT_BASELINE_REL_PATH})"
        ),
    )
    parser.add_argument(
        "--repo-root",
        type=Path,
        default=None,
        help=(
            "repository root (default: detected via "
            "`git rev-parse --show-toplevel`)"
        ),
    )
    return parser


def _detect_repo_root(explicit: Path | None) -> Path:
    if explicit is not None:
        return explicit.resolve()
    try:
        proc = subprocess.run(
            ["git", "rev-parse", "--show-toplevel"],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            text=True,
        )
    except OSError as exc:
        # E.g. ``git`` is not on PATH.
        raise RuntimeError(f"unable to run git: {exc}") from exc
    if proc.returncode != 0:
        raise RuntimeError(
            f"unable to detect repository root: {proc.stderr.strip()}"
        )
    return Path(proc.stdout.strip())


def main(argv: list[str] | None = None) -> int:
    """CLI entry point. Returns the script exit code."""
    parser = _build_parser()
    args = parser.parse_args(argv)
    try:
        repo_root = _detect_repo_root(args.repo_root)
    except RuntimeError as exc:
        print(str(exc), file=sys.stderr)
        return 1
    if args.baseline is not None:
        baseline_path = args.baseline.resolve()
    else:
        baseline_path = (repo_root / DEFAULT_BASELINE_REL_PATH).resolve()
    if args.update_baseline:
        return update_baseline(repo_root, baseline_path)
    return run_check(repo_root, baseline_path)


if __name__ == "__main__":
    sys.exit(main())
@ -0,0 +1,358 @@
"""Unit and integration tests for ``scripts/ci/i18n_cjk_guard.py``.

Stdlib-only tests using ``unittest``. Run from the repository root with::

    python -m unittest scripts/ci/tests/test_i18n_cjk_guard.py

or as a script::

    python scripts/ci/tests/test_i18n_cjk_guard.py
"""
from __future__ import annotations

import json
import os
import subprocess
import sys
import tempfile
import unittest
from pathlib import Path

_HERE = Path(__file__).resolve().parent
_GUARD_DIR = _HERE.parent
sys.path.insert(0, str(_GUARD_DIR))

import i18n_cjk_guard as guard  # noqa: E402


def _git(repo: Path, *args: str) -> subprocess.CompletedProcess[str]:
    """Run a git command in ``repo`` and return the completed process."""
    return subprocess.run(
        ["git", *args],
        cwd=repo,
        check=True,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        text=True,
    )


def _make_repo(tmp: Path) -> Path:
    """Initialize an isolated git repository at ``tmp`` and return the path."""
    _git(tmp, "init", "-q", "-b", "main")
    _git(tmp, "config", "user.email", "test@example.com")
    _git(tmp, "config", "user.name", "Test")
    return tmp


def _commit_file(repo: Path, rel: str, content: str | bytes) -> None:
    """Write a file under ``repo`` and commit it."""
    target = repo / rel
    target.parent.mkdir(parents=True, exist_ok=True)
    if isinstance(content, str):
        target.write_text(content, encoding="utf-8")
    else:
        target.write_bytes(content)
    _git(repo, "add", "--", rel)
    _git(repo, "commit", "-q", "-m", f"add {rel}")


class ScanLocaleCjkTests(unittest.TestCase):
    """``scan_locale_cjk`` returns one ``LocaleFinding`` per CJK leaf string."""

    def test_clean_catalogue_returns_empty_list(self) -> None:
        with tempfile.TemporaryDirectory() as tmp:
            en_path = Path(tmp) / "en.json"
            en_path.write_text(
                json.dumps(
                    {"common": {"confirm": "Confirm", "cancel": "Cancel"}},
                    indent=2,
                ),
                encoding="utf-8",
            )
            self.assertEqual(guard.scan_locale_cjk(en_path), [])

    def test_planted_cjk_returns_one_finding(self) -> None:
        with tempfile.TemporaryDirectory() as tmp:
            en_path = Path(tmp) / "en.json"
            data = {
                "common": {
                    "confirm": "Confirm",
                    "cancel": "取消",
                }
            }
            en_path.write_text(
                json.dumps(data, indent=2, ensure_ascii=False),
                encoding="utf-8",
            )
            findings = guard.scan_locale_cjk(en_path)
            self.assertEqual(len(findings), 1)
            key, line_no, snippet = findings[0]
            self.assertEqual(key, "common.cancel")
            self.assertGreaterEqual(line_no, 1)
            self.assertIn("取消", snippet)

    def test_long_value_is_truncated(self) -> None:
        with tempfile.TemporaryDirectory() as tmp:
            en_path = Path(tmp) / "en.json"
            value = "前置" + ("x" * 200)
            en_path.write_text(
                json.dumps({"k": value}, ensure_ascii=False),
                encoding="utf-8",
            )
            findings = guard.scan_locale_cjk(en_path)
            self.assertEqual(len(findings), 1)
            self.assertLessEqual(len(findings[0][2]), guard.SNIPPET_MAX_LEN)


class CountPathCjkTests(unittest.TestCase):
    """``count_path_cjk`` shells out to ``git grep -nIP``."""

    def test_returns_zero_for_empty_match(self) -> None:
        with tempfile.TemporaryDirectory() as tmp:
            repo = _make_repo(Path(tmp))
            _commit_file(repo, "src/a.txt", "hello world\n")
            self.assertEqual(guard.count_path_cjk(repo, "src"), 0)

    def test_counts_planted_cjk_lines(self) -> None:
        with tempfile.TemporaryDirectory() as tmp:
            repo = _make_repo(Path(tmp))
            _commit_file(
                repo,
                "src/a.py",
                "# 一\nprint('hi')\n# 二三\nx = '四'\n",
            )
            # Three lines contain CJK: # 一 ; # 二三 ; x = '四'.
            self.assertEqual(guard.count_path_cjk(repo, "src"), 3)

    def test_skips_binary_files(self) -> None:
        with tempfile.TemporaryDirectory() as tmp:
            repo = _make_repo(Path(tmp))
            # A "binary" blob containing CJK bytes; -I should exclude it.
            _commit_file(
                repo,
                "src/blob.bin",
                b"\x00\x01\x02\xe4\xb8\x80\x00\xff",
            )
            self.assertEqual(guard.count_path_cjk(repo, "src"), 0)

    def test_skips_untracked_files(self) -> None:
        with tempfile.TemporaryDirectory() as tmp:
            repo = _make_repo(Path(tmp))
            _commit_file(repo, "src/.gitkeep", "")
            (repo / "src" / "untracked.py").write_text(
                "x = '中'\n", encoding="utf-8"
            )
            self.assertEqual(guard.count_path_cjk(repo, "src"), 0)


class BaselineRoundTripTests(unittest.TestCase):
    """``read_baseline`` and ``write_baseline`` round-trip cleanly."""

    def test_round_trip(self) -> None:
        with tempfile.TemporaryDirectory() as tmp:
            path = Path(tmp) / "baseline.txt"
            counts = {"backend/app": 2792, "frontend/src": 902}
            guard.write_baseline(path, counts)
            self.assertTrue(path.read_text().endswith("\n"))
            self.assertEqual(guard.read_baseline(path), counts)

    def test_sorted_lexicographically_and_single_trailing_newline(self) -> None:
        with tempfile.TemporaryDirectory() as tmp:
            path = Path(tmp) / "baseline.txt"
            guard.write_baseline(path, {"frontend/src": 1, "backend/app": 2})
            text = path.read_text(encoding="utf-8")
            data_lines = [
                line for line in text.splitlines() if not line.startswith("#")
            ]
            self.assertEqual(
                data_lines,
                ["backend/app\t2", "frontend/src\t1"],
            )
            self.assertTrue(text.endswith("\n"))
            self.assertFalse(text.endswith("\n\n"))

    def test_missing_file_raises_baseline_error(self) -> None:
        with tempfile.TemporaryDirectory() as tmp:
            path = Path(tmp) / "missing.txt"
            with self.assertRaises(guard.BaselineError):
                guard.read_baseline(path)

    def test_malformed_line_raises_baseline_error(self) -> None:
        with tempfile.TemporaryDirectory() as tmp:
            path = Path(tmp) / "baseline.txt"
            path.write_text(
                "# header\nbackend/app 100\n", encoding="utf-8"
            )
            with self.assertRaises(guard.BaselineError):
                guard.read_baseline(path)


class RunCheckEndToEndTests(unittest.TestCase):
    """End-to-end test of ``run_check`` against a synthetic repo."""

    def _make_full_repo(
        self,
        tmp: Path,
        *,
        en_json: dict,
        backend_lines: int,
        frontend_lines: int,
    ) -> tuple[Path, Path]:
        repo = _make_repo(tmp)
        _commit_file(
            repo,
            "locales/en.json",
            json.dumps(en_json, indent=2, ensure_ascii=False),
        )
        if backend_lines:
            content = "\n".join(f"# 中{i}" for i in range(backend_lines)) + "\n"
            _commit_file(repo, "backend/app/x.py", content)
        else:
            _commit_file(repo, "backend/app/.gitkeep", "")
        if frontend_lines:
            content = "\n".join(f"// 中{i}" for i in range(frontend_lines)) + "\n"
            _commit_file(repo, "frontend/src/x.js", content)
        else:
            _commit_file(repo, "frontend/src/.gitkeep", "")
        baseline_path = repo / "baseline.txt"
        return repo, baseline_path

    def test_pass_within_baseline(self) -> None:
        with tempfile.TemporaryDirectory() as tmp:
            repo, baseline_path = self._make_full_repo(
                Path(tmp),
                en_json={"k": "Confirm"},
                backend_lines=3,
                frontend_lines=2,
            )
            guard.write_baseline(
                baseline_path,
                {"backend/app": 5, "frontend/src": 5},
            )
            rc = guard.run_check(repo, baseline_path)
            self.assertEqual(rc, 0)

    def test_fail_on_locale_cjk(self) -> None:
        with tempfile.TemporaryDirectory() as tmp:
            repo, baseline_path = self._make_full_repo(
                Path(tmp),
                en_json={"k": "中文"},
                backend_lines=0,
                frontend_lines=0,
            )
            guard.write_baseline(
                baseline_path,
                {"backend/app": 0, "frontend/src": 0},
            )
            rc = guard.run_check(repo, baseline_path)
            self.assertEqual(rc, 1)

    def test_fail_on_regression_with_refresh_hint(self) -> None:
        with tempfile.TemporaryDirectory() as tmp:
            repo, baseline_path = self._make_full_repo(
                Path(tmp),
                en_json={"k": "Confirm"},
                backend_lines=10,
                frontend_lines=0,
            )
            guard.write_baseline(
                baseline_path,
                {"backend/app": 5, "frontend/src": 0},
            )
            # Capture stderr.
            from io import StringIO

            captured_err = StringIO()
            old_err = sys.stderr
            sys.stderr = captured_err
            try:
                rc = guard.run_check(repo, baseline_path)
            finally:
                sys.stderr = old_err
            self.assertEqual(rc, 1)
            err_text = captured_err.getvalue()
            self.assertIn("cjk-regression", err_text)
            self.assertIn(
                "python scripts/ci/i18n_cjk_guard.py --update-baseline",
                err_text,
            )

    def test_missing_en_json_fails(self) -> None:
        with tempfile.TemporaryDirectory() as tmp:
            repo = _make_repo(Path(tmp))
            _commit_file(repo, "backend/app/.gitkeep", "")
            _commit_file(repo, "frontend/src/.gitkeep", "")
            baseline_path = repo / "baseline.txt"
            guard.write_baseline(
                baseline_path,
                {"backend/app": 0, "frontend/src": 0},
            )
            rc = guard.run_check(repo, baseline_path)
            self.assertEqual(rc, 1)

    def test_missing_baseline_fails(self) -> None:
        with tempfile.TemporaryDirectory() as tmp:
            repo, baseline_path = self._make_full_repo(
                Path(tmp),
                en_json={"k": "Confirm"},
                backend_lines=0,
                frontend_lines=0,
            )
            # Do not write the baseline.
            self.assertFalse(baseline_path.exists())
            rc = guard.run_check(repo, baseline_path)
            self.assertEqual(rc, 1)


class UpdateBaselineTests(unittest.TestCase):
|
||||||
|
"""``update_baseline`` writes current counts and exits 0."""
|
||||||
|
|
||||||
|
def test_update_then_check_passes(self) -> None:
|
||||||
|
with tempfile.TemporaryDirectory() as tmp:
|
||||||
|
repo = _make_repo(Path(tmp))
|
||||||
|
_commit_file(
|
||||||
|
repo,
|
||||||
|
"locales/en.json",
|
||||||
|
json.dumps({"k": "Confirm"}, indent=2),
|
||||||
|
)
|
||||||
|
_commit_file(repo, "backend/app/x.py", "# 一\n# 二\n")
|
||||||
|
_commit_file(repo, "frontend/src/.gitkeep", "")
|
||||||
|
baseline_path = repo / "baseline.txt"
|
||||||
|
self.assertEqual(
|
||||||
|
guard.update_baseline(repo, baseline_path), 0
|
||||||
|
)
|
||||||
|
counts = guard.read_baseline(baseline_path)
|
||||||
|
self.assertEqual(counts["backend/app"], 2)
|
||||||
|
self.assertEqual(counts["frontend/src"], 0)
|
||||||
|
self.assertEqual(guard.run_check(repo, baseline_path), 0)
|
||||||
|
|
||||||
|
|
||||||
|
class CliSmokeTests(unittest.TestCase):
|
||||||
|
"""``main`` exposes the documented CLI surface."""
|
||||||
|
|
||||||
|
def test_help_flag_exits_zero(self) -> None:
|
||||||
|
guard_script = _GUARD_DIR / "i18n_cjk_guard.py"
|
||||||
|
proc = subprocess.run(
|
||||||
|
[sys.executable, str(guard_script), "--help"],
|
||||||
|
stdout=subprocess.PIPE,
|
||||||
|
stderr=subprocess.PIPE,
|
||||||
|
text=True,
|
||||||
|
)
|
||||||
|
self.assertEqual(proc.returncode, 0)
|
||||||
|
for flag in ("--update-baseline", "--baseline", "--repo-root"):
|
||||||
|
self.assertIn(flag, proc.stdout)
|
||||||
|
|
||||||
|
def test_unknown_flag_exits_nonzero(self) -> None:
|
||||||
|
guard_script = _GUARD_DIR / "i18n_cjk_guard.py"
|
||||||
|
proc = subprocess.run(
|
||||||
|
[sys.executable, str(guard_script), "--no-such-flag"],
|
||||||
|
stdout=subprocess.PIPE,
|
||||||
|
stderr=subprocess.PIPE,
|
||||||
|
text=True,
|
||||||
|
)
|
||||||
|
self.assertNotEqual(proc.returncode, 0)
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
unittest.main()
|
||||||