# Research & Design Decisions: i18n-ci-guard

## Summary

- **Feature**: `i18n-ci-guard`
- **Discovery Scope**: Simple Addition (one Python script + one GH Actions
  workflow + one baseline file). Extension-flavoured because it builds on
  established `scripts/` conventions and the canonical CJK regex used by
  the larger audit pipeline.
- **Key Findings**:
  - The canonical CJK match command `git grep -nIP '[\x{4e00}-\x{9fff}]'
    -- <path>` is already used by the unmerged audit pipeline (PR #27)
    and is portable on every git ≥2.4 (`ubuntu-latest` ships ≥2.40).
  - `scripts/check_i18n_logs.py` is a strong CLI/style precedent:
    Python-stdlib-only, exit `0`/`1`, output as `<file>:<line>:
    <reason>: <snippet>`, canonical regex `[一-鿿]` (the same
    U+4E00 to U+9FFF range as the `git grep` pattern).
  - The repository has no existing `pull_request`-triggered GH Actions
    workflow; this guard introduces the first one. The only existing
    workflow (`.github/workflows/docker-image.yml`) runs on tag pushes
    only.
  - Current per-path counts on this branch:
    `backend/app=2707, frontend/src=902, locales/en.json=0`. These are
    sample counts; the committed baseline must be regenerated against
    `main` at implementation time.

## Research Log

### Canonical scan command

- **Context**: Requirement 2 needs a stable per-path CJK count and
  Requirement 5.5 forbids third-party packages.
- **Sources Consulted**:
  - `audit_cjk.sh` from PR #27 commit `3481408`.
  - `git grep` man page.
- **Findings**:
  - `git grep -nIP '[\x{4e00}-\x{9fff}]' -- <path>` returns one match
    per matching line in tracked, text-only files. `-I` excludes binary
    files; `-P` enables PCRE2 so the `\x{...}` Unicode range works.
  - This matches the input format consumed by the existing audit
    classifier, so the guard's match counts are directly comparable
    across pipelines.
- **Implications**:
  - The guard re-uses this exact command; no new dependencies.
  - Because `-I` skips binary files and tracked-only is the default,
    Requirements 2.5 and 2.6 are satisfied by the command itself
    rather than by additional script logic.

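
As a rough illustration of the scan described above, the following sketch wraps the canonical command with the standard library only. The function names (`build_command`, `count_matching_lines`, `scan_path`) are hypothetical and not taken from the actual guard script. It also assumes that `git grep` exits with status 1 when nothing matches and with a higher status on real errors.

```python
import subprocess

# Canonical pattern from the findings above; interpreted by git's PCRE engine.
CJK_PATTERN = r"[\x{4e00}-\x{9fff}]"


def build_command(path: str) -> list[str]:
    """Return the argv for the canonical scan of one scoped path."""
    return ["git", "grep", "-nIP", CJK_PATTERN, "--", path]


def count_matching_lines(stdout: str) -> int:
    """Count non-empty output lines; each one is a matching source line."""
    return sum(1 for line in stdout.splitlines() if line)


def scan_path(path: str) -> int:
    """Run the scan in the current repository and return the match count."""
    proc = subprocess.run(
        build_command(path),
        capture_output=True,
        encoding="utf-8",
        errors="replace",
        check=False,
    )
    if proc.returncode == 1:
        return 0  # assumption: status 1 means "no matches", not an error
    if proc.returncode != 0:
        raise RuntimeError(
            f"git grep failed ({proc.returncode}): {proc.stderr.strip()}"
        )
    return count_matching_lines(proc.stdout)
```

Calling `scan_path("backend/app")` from the repository root would return the per-path count that is compared against the baseline.
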
### Baseline file format

- **Context**: Requirement 4 needs a diff-friendly committed baseline.
- **Sources Consulted**:
  - Diff churn behaviour of JSON vs. line-oriented text in this repo's
    history (e.g. `locales/*.json` PR diffs frequently re-key, while
    plain-text `parity.txt` from PR #27 reads cleanly).
- **Findings**:
  - Line-oriented `<path>\t<count>` files produce minimal diffs and
    require no JSON parser.
  - A two-line file (one per scoped path) is large enough to be
    self-explanatory and small enough to never line-shuffle.
- **Implications**:
  - Use plain text, sorted by path, single trailing newline. Reject
    the file as malformed if the script cannot parse it (Req 4.5).

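
A minimal sketch of reading and writing this format, assuming a tab separator and non-negative integer counts as described above. The helper names `format_baseline` and `parse_baseline` are hypothetical, and treating duplicate paths as malformed is an assumption rather than something the requirements state.

```python
def format_baseline(counts: dict[str, int]) -> str:
    """Serialize as <path>TAB<count> lines, sorted by path, one trailing newline."""
    return "".join(f"{path}\t{count}\n" for path, count in sorted(counts.items()))


def parse_baseline(text: str) -> dict[str, int]:
    """Parse baseline text; raise ValueError on any malformed line."""
    counts: dict[str, int] = {}
    for lineno, line in enumerate(text.splitlines(), start=1):
        path, sep, raw = line.partition("\t")
        # A valid line has a non-empty path, a tab, and ASCII digits only.
        if not sep or not path or not (raw.isascii() and raw.isdigit()):
            raise ValueError(f"malformed baseline line {lineno}: {line!r}")
        if path in counts:  # assumption: duplicates count as malformed
            raise ValueError(f"duplicate path on line {lineno}: {path!r}")
        counts[path] = int(raw)
    return counts
```

Round-tripping through both helpers keeps the file byte-stable, which is what makes the diffs predictable.
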
### Locale-catalogue scan path

- **Context**: Requirement 1 wants `key:line` per CJK offender in
  `locales/en.json`.
- **Sources Consulted**:
  - `scripts/check_i18n_logs.py` (`flatten_keys` reuse pattern).
  - `check_parity.py` from PR #27 (`flatten`, `[cjk-in-en]` block).
- **Findings**:
  - Both precedents flatten the locale dict and run the canonical
    regex against each leaf string value. Line numbers are derivable
    by re-reading the file as text and matching the value's first
    occurrence (good enough for an actionable error message).
  - Empty-string values and non-string leaf values (booleans, null)
    are skipped.
- **Implications**:
  - Implement a tiny flatten-then-scan helper inside the guard
    script; do not add a new shared utility module.

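
The flatten-then-scan idea above could look roughly like this. It is only a sketch: `flatten` and `find_cjk_offenders` are illustrative names, not the helpers from `check_i18n_logs.py` or `check_parity.py`, and dotted key paths are an assumed convention. The line lookup assumes the catalogue stores CJK text literally (not as `\uXXXX` escapes); when the value cannot be located, the sketch reports line `0`.

```python
import json
import re

# Python-side equivalent of the canonical range U+4E00 to U+9FFF.
CJK_RE = re.compile(r"[\u4e00-\u9fff]")


def flatten(node: dict, prefix: str = ""):
    """Yield (dotted_key, leaf_value) pairs from nested dicts."""
    for key, value in node.items():
        dotted = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            yield from flatten(value, dotted)
        else:
            yield dotted, value


def find_cjk_offenders(text: str) -> list[tuple[str, int]]:
    """Return (key, line) for every non-empty string leaf containing CJK."""
    lines = text.splitlines()
    offenders = []
    for key, value in flatten(json.loads(text)):
        # Skip empty strings and non-string leaves (booleans, null, numbers).
        if not isinstance(value, str) or not value or not CJK_RE.search(value):
            continue
        # Locate the first line containing the quoted value as written in JSON.
        needle = json.dumps(value, ensure_ascii=False)
        line_no = next(
            (i for i, line in enumerate(lines, start=1) if needle in line), 0
        )
        offenders.append((key, line_no))
    return offenders
```

Each returned pair maps directly onto the `key:line` output that Requirement 1 asks for.
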
### GH Actions trigger and budget

- **Context**: Requirements 5.1, 5.5, 5.6.
- **Sources Consulted**:
  - GitHub-hosted runners reference (`ubuntu-latest`).
  - `actions/setup-python@v5` README.
- **Findings**:
  - `ubuntu-latest` has Python 3.10+ pre-installed; `actions/setup-python@v5`
    pins to 3.11 in <5 s.
  - A single `git grep` over the scoped paths runs in <2 s on this
    repo (~3.6k matches). End-to-end the workflow comfortably fits
    inside the 60 s ceiling.
- **Implications**:
  - Use `actions/checkout@v4` with `fetch-depth: 1`,
    `actions/setup-python@v5` with `python-version: '3.11'`, and run
    the script directly. No caching layer needed.

## Architecture Pattern Evaluation

| Option | Description | Strengths | Risks / Limitations | Notes |
|--------|-------------|-----------|---------------------|-------|
| A. Extend `check_i18n_logs.py` | Add `--cjk-guard` mode to existing script | Reuses one file | Conflates two scopes; existing script is module-scoped, guard is subtree-scoped | Rejected |
| B. New `scripts/ci/i18n_cjk_guard.py` + new workflow | Single-purpose script + workflow + baseline file | Clean SRP; matches "one script per responsibility" precedent | One additional file | **Selected** |
| C. Shared `cjk_scan.py` helper + thin guard | Factor regex/git-grep into helper | DRY for regex constant | Premature abstraction; only one shared symbol today | Rejected |

## Design Decisions

### Decision: Single-purpose CI script + GH Actions workflow (Option B)

- **Context**: Requirements 1–6 demand a small, self-contained guard.
- **Alternatives Considered**: A (extend), C (shared helper).
- **Selected Approach**: New script `scripts/ci/i18n_cjk_guard.py`,
  new workflow `.github/workflows/i18n-cjk-guard.yml`, baseline file
  `.kiro/specs/i18n-ci-guard/baseline.txt`.
- **Rationale**: Matches the project's "one focused script per
  responsibility" convention; isolates a CI-blocking surface from the
  existing i18n developer scripts; keeps the baseline collocated with
  the spec for review traceability.
- **Trade-offs**: One more file in `scripts/` vs. tighter cohesion.
- **Follow-up**: When a third caller wants the canonical regex, factor
  it out then.

### Decision: Plain-text baseline format

- **Context**: Requirement 4.2 demands a stable, diff-friendly format.
- **Alternatives Considered**: JSON, YAML.
- **Selected Approach**: One line per scoped path: `<path>\t<count>`,
  sorted lexicographically by path, single trailing newline.
- **Rationale**: Zero parser dependency; predictable diffs; trivial
  to refresh atomically.
- **Trade-offs**: Less expressive than JSON (no nested structure), but
  the data model is two integers, so nesting is unnecessary.

### Decision: Refresh via `--update-baseline` subcommand-style flag

- **Context**: Requirement 4.3 needs an explicit refresh path.
- **Alternatives Considered**: Separate `update_baseline.py` script;
  Makefile target.
- **Selected Approach**: Single script with two modes: default (check
  + exit 0/1) and `--update-baseline` (overwrite baseline + exit 0).
- **Rationale**: One CLI surface to remember; the failure message
  prints the exact command to run.
- **Trade-offs**: Slightly more conditional logic in one script;
  acceptable given the small total LoC.

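
To make the two modes concrete, here is a hedged sketch of the control flow using `argparse`. The helpers `find_regressions` and `run` are hypothetical, as is the exact wording of the messages. The ratchet rule shown (fail only when a count exceeds its baseline, with a missing path treated as baseline 0) is an assumption about how the check is meant to behave. File I/O is left out so that the logic stays easy to test.

```python
import argparse


def find_regressions(current: dict[str, int], baseline: dict[str, int]) -> list[str]:
    """Return one message per path whose count exceeds its baseline."""
    messages = []
    for path in sorted(current):
        allowed = baseline.get(path, 0)  # assumption: unknown path means 0
        if current[path] > allowed:
            messages.append(f"{path}: {current[path]} CJK lines > baseline {allowed}")
    return messages


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="CJK ratchet guard (sketch)")
    parser.add_argument(
        "--update-baseline",
        action="store_true",
        help="overwrite the baseline with current counts and exit 0",
    )
    return parser


def run(
    argv: list[str], current: dict[str, int], baseline: dict[str, int]
) -> tuple[int, dict[str, int]]:
    """Return (exit_code, baseline_to_persist) for the selected mode."""
    args = build_parser().parse_args(argv)
    if args.update_baseline:
        return 0, dict(current)  # refresh mode: current counts become baseline
    problems = find_regressions(current, baseline)
    for message in problems:
        print(message)
    if problems:
        # The failure message names the exact refresh command.
        print("If intentional, run: python scripts/ci/i18n_cjk_guard.py --update-baseline")
    return (1 if problems else 0), baseline
```

A real entry point would compute `current` from the scan, load `baseline` from the committed file, and pass `sys.argv[1:]` as `argv`.
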
### Decision: Workflow runs only on `pull_request` to `main`

- **Context**: Requirement 5.1.
- **Alternatives Considered**: Run on `push` to all branches as well;
  run on `pull_request` to any base branch.
- **Selected Approach**: `on.pull_request.branches: [main]` only.
- **Rationale**: Aligns with how the existing project uses `main` as
  the protected branch (see `gh pr list` history; every feature PR
  targets `main`). Avoids redundant runs on intra-branch chains.
- **Trade-offs**: A direct push to `main` would not be guarded, but
  branch protection already discourages that path (per
  `dev-guidelines.md`).

## Risks & Mitigations

- **Risk**: Baseline drifts upward unintentionally during
  `--update-baseline` runs, hiding real regressions.
  - *Mitigation*: Failure message instructs contributors to refresh
    *only when intentional*; the baseline file is reviewed in the same
    PR diff. Acceptance Criteria 3.3 makes this explicit.
- **Risk**: `git grep -P` not built with PCRE on a developer's local
  git build (rare on Linux/macOS, possible on minimal Windows builds).
  - *Mitigation*: The guard prints a clear error if `git grep -P`
    fails with an error status (a status of 1 only means "no
    matches"); Python ≥3.11 + git ≥2.20 are documented as
    prerequisites.
- **Risk**: Baseline counts captured on a feature branch include
  changes not yet on `main`, mis-anchoring the ratchet.
  - *Mitigation*: The implementation task explicitly recomputes the
    baseline against `origin/main` before committing; documented in
    `tasks.md`.

## References

- PR #27 audit pipeline (`audit_cjk.sh`, `check_parity.py`,
  `classify.py`): methodology source of truth.
- `scripts/check_i18n_logs.py`: CLI/style precedent.
- `git grep` man page: `-n`, `-I`, `-P` flag semantics.
- GitHub Actions `actions/setup-python@v5` and `actions/checkout@v4`
  README pages.