# Research & Design Decisions — i18n-ci-guard
## Summary
- **Feature**: `i18n-ci-guard`
- **Discovery Scope**: Simple Addition (one Python script + one GH Actions
  workflow + one baseline file). Extension-flavoured because it builds on
  established `scripts/` conventions and the canonical CJK regex used by
  the larger audit pipeline.
- **Key Findings**:
  - The canonical CJK match command
    `git grep -nIP '[\x{4e00}-\x{9fff}]' -- <path>` is already used by
    the unmerged audit pipeline (PR #27) and is portable on every git
    ≥2.4 (`ubuntu-latest` ships ≥2.40).
  - `scripts/check_i18n_logs.py` is a strong CLI/style precedent:
    Python-stdlib-only, exit `0`/`1`, output as
    `<file>:<line>: <reason>: <snippet>`, canonical regex `[一-鿿]`.
  - The repository has no existing `pull_request`-triggered GH Actions
    workflow; this guard introduces the first one. The only existing
    workflow (`.github/workflows/docker-image.yml`) runs on tag pushes
    only.
  - Current per-path counts on this branch:
    `backend/app=2707, frontend/src=902, locales/en.json=0`. These are
    sample counts; the committed baseline must be regenerated against
    `main` at implementation time.
## Research Log
### Canonical scan command
- **Context**: Requirement 2 needs a stable per-path CJK count and
  Requirement 5.5 forbids third-party packages.
- **Sources Consulted**:
  - `audit_cjk.sh` from PR #27 commit `3481408`.
  - `git grep` man page.
- **Findings**:
  - `git grep -nIP '[\x{4e00}-\x{9fff}]' -- <path>` returns one match
    per matching line in tracked, text-only files. `-I` excludes binary
    files; `-P` enables PCRE2 so the `\x{...}` Unicode range works.
  - This matches the input format consumed by the existing audit
    classifier, so the guard's match counts are directly comparable
    across pipelines.
- **Implications**:
  - The guard re-uses this exact command; no new dependencies.
  - Because `-I` skips binary files and tracked-only is the default,
    Requirements 2.5 and 2.6 are satisfied by the command itself
    rather than by additional script logic.
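
A minimal sketch of how the guard might wrap the canonical command with the standard library only. The function names `count_match_lines` and `count_cjk` are hypothetical, not taken from the spec. The sketch assumes the usual grep convention that `git grep` exits with status 1 when nothing matches, and it counts raw newline bytes so that unusual characters inside a matched line cannot inflate the count.

```python
import subprocess

# Canonical PCRE range from the scan command above.
CJK_PATTERN = r"[\x{4e00}-\x{9fff}]"


def count_match_lines(output: bytes) -> int:
    """Count `git grep -n` result lines; each one ends with a newline byte."""
    return output.count(b"\n")


def count_cjk(path: str) -> int:
    """Run the canonical scan for one scoped path and return its match count."""
    proc = subprocess.run(
        ["git", "grep", "-nIP", CJK_PATTERN, "--", path],
        capture_output=True,  # keep stdout/stderr as bytes
    )
    if proc.returncode == 1:
        # Status 1 means "no matches", which is a valid count of zero.
        return 0
    if proc.returncode != 0:
        # Anything else is a real failure, e.g. a git build without PCRE.
        detail = proc.stderr.decode("utf-8", errors="replace").strip()
        raise RuntimeError(f"git grep failed ({proc.returncode}): {detail}")
    return count_match_lines(proc.stdout)
```

Calling `count_cjk("backend/app")` from the repository root would then yield the per-path count that the baseline records.
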
### Baseline file format
- **Context**: Requirement 4 needs a diff-friendly committed baseline.
- **Sources Consulted**:
  - Diff churn behaviour of JSON vs. line-oriented text in this repo's
    history (e.g. `locales/*.json` PR diffs frequently re-key, while
    plain-text `parity.txt` from PR #27 reads cleanly).
- **Findings**:
  - Line-oriented `<path>\t<count>` files produce minimal diffs and
    require no JSON parser.
  - A two-line file (one line per scoped path) is large enough to be
    self-explanatory and small enough to never line-shuffle.
- **Implications**:
  - Use plain text, sorted by path, single trailing newline. Reject
    the file as malformed if the script cannot parse it (Req 4.5).
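
The format above is small enough to parse by hand. The following is a sketch under stated assumptions: `parse_baseline` and `format_baseline` are hypothetical helper names, and the strictness rules (exactly one trailing newline, no duplicate paths, ASCII digits only) are one possible reading of "malformed", not something the requirements spell out.

```python
def parse_baseline(text: str) -> dict[str, int]:
    """Parse `<path>\\t<count>` lines; raise ValueError on malformed input."""
    if not text.endswith("\n") or text.endswith("\n\n"):
        raise ValueError("baseline must end with exactly one newline")
    counts: dict[str, int] = {}
    for number, line in enumerate(text[:-1].split("\n"), start=1):
        path, sep, raw = line.partition("\t")
        # Require a tab separator, a non-empty path and a plain integer.
        if not sep or not path or not (raw.isascii() and raw.isdigit()):
            raise ValueError(f"baseline line {number} is malformed: {line!r}")
        if path in counts:
            raise ValueError(f"baseline line {number} repeats path {path!r}")
        counts[path] = int(raw)
    return counts


def format_baseline(counts: dict[str, int]) -> str:
    """Serialise counts sorted by path, one per line, single trailing newline."""
    return "".join(f"{path}\t{count}\n" for path, count in sorted(counts.items()))
```

With the sample counts from the Summary, `format_baseline({"frontend/src": 902, "backend/app": 2707})` produces the two-line file, and `parse_baseline` reads it back unchanged.
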
### Locale-catalogue scan path
- **Context**: Requirement 1 wants `key:line` per CJK offender in
  `locales/en.json`.
- **Sources Consulted**:
  - `scripts/check_i18n_logs.py` (`flatten_keys` reuse pattern).
  - `check_parity.py` from PR #27 (`flatten`, `[cjk-in-en]` block).
- **Findings**:
  - Both precedents flatten the locale dict and run the canonical
    regex against each leaf string value. Line numbers are derivable
    by re-reading the file as text and matching the value's first
    occurrence (good enough for an actionable error message).
  - Empty-string values and non-string leaf values (booleans, null)
    are skipped.
- **Implications**:
  - Implement a tiny flatten-then-scan helper inside the guard
    script; do not add a new shared utility module.
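
The flatten-then-scan approach described above could look roughly like this. It is an illustrative sketch, not the code from either precedent: `flatten` and `scan_catalogue` are hypothetical names, dotted key paths are an assumed convention, and lists are treated as non-string leaves and skipped. A line number of `0` means the value's first occurrence could not be located in the raw text.

```python
import json
import re

# Python equivalent of the canonical range U+4E00..U+9FFF.
CJK_RE = re.compile("[\u4e00-\u9fff]")


def flatten(node, prefix=""):
    """Yield (dotted_key, leaf_value) pairs from a nested dict."""
    if isinstance(node, dict):
        for key, child in node.items():
            yield from flatten(child, f"{prefix}.{key}" if prefix else key)
    else:
        yield prefix, node


def scan_catalogue(raw: str) -> list[tuple[str, int]]:
    """Return (key, line) for every string leaf that contains CJK text."""
    lines = raw.splitlines()
    offenders = []
    for key, value in flatten(json.loads(raw)):
        # Skip empty strings and non-string leaves (booleans, null, ...).
        if not isinstance(value, str) or not value:
            continue
        if not CJK_RE.search(value):
            continue
        # Locate the first line holding the value as it is written in JSON.
        needle = json.dumps(value, ensure_ascii=False)
        line_no = next(
            (i for i, line in enumerate(lines, start=1) if needle in line), 0
        )
        offenders.append((key, line_no))
    return offenders
```

Each returned pair maps directly onto the `key:line` report that Requirement 1 asks for.
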
### GH Actions trigger and budget
- **Context**: Requirements 5.1, 5.5, 5.6.
- **Sources Consulted**:
  - GitHub-hosted runners reference (`ubuntu-latest`).
  - `actions/setup-python@v5` README.
- **Findings**:
  - `ubuntu-latest` has Python 3.10+ pre-installed; `actions/setup-python@v5`
    pins to 3.11 in <5 s.
  - A single `git grep` over the scoped paths runs in <2 s on this
    repo (~3.6k matches). End-to-end the workflow comfortably fits
    inside the 60 s ceiling.
- **Implications**:
  - Use `actions/checkout@v4` with `fetch-depth: 1`,
    `actions/setup-python@v5` with `python-version: '3.11'`, and run
    the script directly. No caching layer needed.
## Architecture Pattern Evaluation
| Option | Description | Strengths | Risks / Limitations | Notes |
|--------|-------------|-----------|---------------------|-------|
| A. Extend `check_i18n_logs.py` | Add `--cjk-guard` mode to existing script | Reuses one file | Conflates two scopes; existing script is module-scoped, guard is subtree-scoped | Rejected |
| B. New `scripts/ci/i18n_cjk_guard.py` + new workflow | Single-purpose script + workflow + baseline file | Clean SRP; matches "one script per responsibility" precedent | One additional file | **Selected** |
| C. Shared `cjk_scan.py` helper + thin guard | Factor regex/git-grep into helper | DRY for regex constant | Premature abstraction; only one shared symbol today | Rejected |
## Design Decisions
### Decision: Single-purpose CI script + GH Actions workflow (Option B)
- **Context**: Requirements 1–6 demand a small, self-contained guard.
- **Alternatives Considered**: A (extend), C (shared helper).
- **Selected Approach**: New script `scripts/ci/i18n_cjk_guard.py`,
new workflow `.github/workflows/i18n-cjk-guard.yml`, baseline file
`.kiro/specs/i18n-ci-guard/baseline.txt`.
- **Rationale**: Matches the project's "one focused script per
responsibility" convention; isolates a CI-blocking surface from the
existing i18n developer scripts; keeps the baseline collocated with
the spec for review traceability.
- **Trade-offs**: One more file in `scripts/` vs. tighter cohesion.
- **Follow-up**: When a third caller wants the canonical regex, factor
it out then.
### Decision: Plain-text baseline format
- **Context**: Requirement 4.2 demands stable, diff-friendly format.
- **Alternatives Considered**: JSON, YAML.
- **Selected Approach**: One line per scoped path: `<path>\t<count>`,
sorted lexicographically by path, single trailing newline.
- **Rationale**: Zero parser dependency; predictable diffs; trivial
to refresh atomically.
- **Trade-offs**: Less expressive than JSON (no nested structure), but
the data model is two integers, so nesting is unnecessary.
### Decision: Refresh via `--update-baseline` subcommand-style flag
- **Context**: Requirement 4.3 needs an explicit refresh path.
- **Alternatives Considered**: Separate `update_baseline.py` script;
Makefile target.
- **Selected Approach**: Single script with two modes: default (check +
exit 0/1) and `--update-baseline` (overwrite baseline + exit 0).
- **Rationale**: One CLI surface to remember; the failure message
prints the exact command to run.
- **Trade-offs**: Slightly more conditional logic in one script;
acceptable given the small total LoC.
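
The two modes can be sketched with `argparse` as follows. Several details here are assumptions made for illustration: the `main` signature (current counts and the baseline path are passed in so the sketch stays self-contained), the ratchet rule that only an increase over the baseline fails the check, and the exact wording of the refresh hint.

```python
import argparse
from pathlib import Path

# Assumed invocation; the source only fixes the script path and the flag name.
REFRESH_HINT = "python scripts/ci/i18n_cjk_guard.py --update-baseline"


def regressions(current: dict[str, int], baseline: dict[str, int]) -> list[str]:
    """List paths whose current count exceeds the recorded baseline."""
    return [
        f"{path}: {count} matches, baseline allows {baseline.get(path, 0)}"
        for path, count in sorted(current.items())
        if count > baseline.get(path, 0)
    ]


def main(argv: list[str], current: dict[str, int], baseline_file: Path) -> int:
    parser = argparse.ArgumentParser(prog="i18n_cjk_guard.py")
    parser.add_argument("--update-baseline", action="store_true",
                        help="overwrite the baseline with the current counts")
    args = parser.parse_args(argv)

    if args.update_baseline:
        # Refresh mode: rewrite the file in the sorted plain-text format.
        body = "".join(f"{p}\t{n}\n" for p, n in sorted(current.items()))
        baseline_file.write_text(body, encoding="utf-8")
        return 0

    # Check mode: load the baseline and compare path by path.
    baseline = {}
    for line in baseline_file.read_text(encoding="utf-8").splitlines():
        path, _, count = line.partition("\t")
        baseline[path] = int(count)

    problems = regressions(current, baseline)
    for problem in problems:
        print(problem)
    if problems:
        print(f"If the increase is intentional, run: {REFRESH_HINT}")
    return 1 if problems else 0
```

Keeping both modes behind one entry point means the failure message can name a single command, which is the rationale given above.
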
### Decision: Workflow runs only on `pull_request` to `main`
- **Context**: Requirement 5.1.
- **Alternatives Considered**: Run on `push` to all branches as well;
run on `pull_request` to any base branch.
- **Selected Approach**: `on.pull_request.branches: [main]` only.
- **Rationale**: Aligns with how the existing project uses `main` as
the protected branch (see `gh pr list` history; every feature PR
targets `main`). Avoids redundant runs on intra-branch chains.
- **Trade-offs**: A direct push to `main` would not be guarded, but
branch protection already discourages that path (per
`dev-guidelines.md`).
## Risks & Mitigations
- **Risk**: Baseline drifts upward unintentionally during
  `--update-baseline` runs, hiding real regressions.
  - *Mitigation*: Failure message instructs contributors to refresh
    *only when intentional*; the baseline file is reviewed in the same
    PR diff. Acceptance Criteria 3.3 makes this explicit.
- **Risk**: `git grep -P` not built with PCRE on a developer's local
  git build (rare on Linux/macOS, possible on minimal Windows builds).
  - *Mitigation*: The guard prints a clear error if `git grep -P` fails
    (an exit status above 1; status 1 only means no matches) and
    documents Python 3.11 + git 2.20 as prerequisites.
- **Risk**: Baseline counts captured on a feature branch include
  changes not yet on `main`, mis-anchoring the ratchet.
  - *Mitigation*: The implementation task explicitly recomputes the
    baseline against `origin/main` before committing; documented in
    `tasks.md`.
## References
- PR #27 audit pipeline (`audit_cjk.sh`, `check_parity.py`,
  `classify.py`): methodology source of truth.
- `scripts/check_i18n_logs.py`: CLI/style precedent.
- `git grep` man page: `-n`, `-I`, `-P` flag semantics.
- GitHub Actions `actions/setup-python@v5` and `actions/checkout@v4`
  README pages.