Research & Design Decisions — i18n-ci-guard

Summary

  • Feature: i18n-ci-guard
  • Discovery Scope: Simple Addition (one Python script + one GH Actions workflow + one baseline file). Extension-flavoured because it builds on established scripts/ conventions and the canonical CJK regex used by the larger audit pipeline.
  • Key Findings:
    • The canonical CJK match command git grep -nIP '[\x{4e00}-\x{9fff}]' -- <path> is already used by the unmerged audit pipeline (PR #27) and is portable on every git ≥2.4 (ubuntu-latest ships ≥2.40).
    • scripts/check_i18n_logs.py is a strong CLI/style precedent: Python-stdlib-only, exit 0/1, output as <file>:<line>: <reason>: <snippet>, canonical regex [一-鿿].
    • The repository has no existing pull_request-triggered GH Actions workflow; this guard introduces the first one. The only existing workflow (.github/workflows/docker-image.yml) runs on tag pushes only.
    • Current per-path counts on this branch: backend/app=2707, frontend/src=902, locales/en.json=0. These are sample counts; the committed baseline must be regenerated against main at implementation time.

Research Log

Canonical scan command

  • Context: Requirement 2 needs a stable per-path CJK count and Requirement 5.5 forbids third-party packages.
  • Sources Consulted:
    • audit_cjk.sh from PR #27 commit 3481408.
    • git grep man page.
  • Findings:
    • git grep -nIP '[\x{4e00}-\x{9fff}]' -- <path> returns one match per matching line in tracked, text-only files. -I excludes binary files; -P enables PCRE2 so the \x{...} Unicode range works.
    • This matches the input format consumed by the existing audit classifier, so the guard's match counts are directly comparable across pipelines.
  • Implications:
    • The guard re-uses this exact command; no new dependencies.
    • Because -I skips binary files and tracked-only is the default, Requirements 2.5 and 2.6 are satisfied by the command itself rather than by additional script logic.
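
The scan described above could be wrapped along these lines. This is a minimal sketch, not the guard's actual code: the names build_command, count_matches, and scan are hypothetical, and the handling of exit statuses relies on git grep returning 0 when it finds matches and 1 when it finds none.

```python
import subprocess

# Canonical pattern from the audit pipeline: CJK Unified Ideographs U+4E00..U+9FFF.
CJK_PATTERN = r"[\x{4e00}-\x{9fff}]"


def build_command(path: str) -> list[str]:
    """Return the argv for the canonical scan of one scoped path."""
    return ["git", "grep", "-nIP", CJK_PATTERN, "--", path]


def count_matches(stdout: str) -> int:
    """git grep prints one output line per matching line, so count non-empty lines."""
    # Split on "\n" only; str.splitlines() would also split on other Unicode separators.
    return sum(1 for line in stdout.split("\n") if line)


def scan(path: str) -> int:
    """Run the scan for one path (hypothetical helper, assumed for illustration)."""
    proc = subprocess.run(
        build_command(path),
        capture_output=True,
        encoding="utf-8",
        errors="replace",
    )
    # Exit 0 means matches were found and exit 1 means none were found;
    # any other status is treated here as a failure of the command itself.
    if proc.returncode not in (0, 1):
        raise RuntimeError(f"git grep failed ({proc.returncode}): {proc.stderr.strip()}")
    return count_matches(proc.stdout)
```

Keeping the command builder and the counter separate from the subprocess call lets both be unit-tested without a git checkout.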

Baseline file format

  • Context: Requirement 4 needs a diff-friendly committed baseline.
  • Sources Consulted:
    • Diff churn behaviour of JSON vs. line-oriented text in this repo's history (e.g. locales/*.json PR diffs frequently re-key, while plain-text parity.txt from PR #27 reads cleanly).
  • Findings:
    • Line-oriented <path>\t<count> files produce minimal diffs and require no JSON parser.
    • A two-line file (one per scoped path) is large enough to be self-explanatory and small enough to never line-shuffle.
  • Implications:
    • Use plain text, sorted by path, single trailing newline. Reject the file as malformed if the script cannot parse it (Req 4.5).
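
A minimal sketch of reading and writing that format, assuming hypothetical helper names format_baseline and parse_baseline. The strictness rules (no duplicate paths, ASCII digits only) are assumptions added for illustration, not requirements quoted from the spec.

```python
def format_baseline(counts: dict[str, int]) -> str:
    """Serialise counts as <path>\t<count> lines, sorted by path, one trailing newline."""
    lines = [f"{path}\t{count}" for path, count in sorted(counts.items())]
    return "\n".join(lines) + "\n"


def parse_baseline(text: str) -> dict[str, int]:
    """Parse the baseline text; raise ValueError if any line is malformed."""
    if not text.endswith("\n"):
        raise ValueError("baseline must end with a newline")
    counts: dict[str, int] = {}
    for number, line in enumerate(text[:-1].split("\n"), start=1):
        path, sep, raw = line.partition("\t")
        # Require a non-empty path, a tab separator, and a plain decimal count.
        if not path or not sep or not (raw.isascii() and raw.isdigit()):
            raise ValueError(f"baseline line {number} is malformed: {line!r}")
        if path in counts:
            raise ValueError(f"baseline line {number} repeats path {path!r}")
        counts[path] = int(raw)
    return counts
```

With the sample counts from the Summary, format_baseline({"frontend/src": 902, "backend/app": 2707}) yields the two lines backend/app<TAB>2707 and frontend/src<TAB>902, and parse_baseline round-trips them.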

Locale-catalogue scan path

  • Context: Requirement 1 wants key:line per CJK offender in locales/en.json.
  • Sources Consulted:
    • scripts/check_i18n_logs.py (flatten_keys reuse pattern).
    • check_parity.py from PR #27 (flatten, [cjk-in-en] block).
  • Findings:
    • Both precedents flatten the locale dict and run the canonical regex against each leaf string value. Line numbers are derivable by re-reading the file as text and matching the value's first occurrence (good enough for an actionable error message).
    • Empty-string values and non-string leaf values (booleans, null) are skipped.
  • Implications:
    • Implement a tiny flatten-then-scan helper inside the guard script; do not add a new shared utility module.
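
The flatten-then-scan approach could look roughly like this. It is a sketch under stated assumptions: flatten and find_cjk_offenders are hypothetical names, keys are joined with dots, and the line lookup searches for the decoded value in the raw text, so a value stored with JSON escapes would report line 0.

```python
import json
import re

# Same range as the canonical regex: U+4E00..U+9FFF.
CJK_RE = re.compile("[\u4e00-\u9fff]")


def flatten(node: dict, prefix: str = ""):
    """Yield (dotted_key, value) for every leaf of a nested dict."""
    for key, value in node.items():
        dotted = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            yield from flatten(value, dotted)
        else:
            yield dotted, value


def find_cjk_offenders(text: str) -> list[str]:
    """Return key:line entries for string leaves that contain CJK characters."""
    catalogue = json.loads(text)
    lines = text.split("\n")
    offenders = []
    for key, value in flatten(catalogue):
        # Skip empty strings and non-string leaves (booleans, null, numbers, lists).
        if not isinstance(value, str) or not value:
            continue
        if not CJK_RE.search(value):
            continue
        # First line containing the value; 0 if it cannot be located verbatim.
        line_no = next((i for i, line in enumerate(lines, 1) if value in line), 0)
        offenders.append(f"{key}:{line_no}")
    return offenders
```

The first-occurrence lookup can point at the wrong line when two keys share the same value, which matches the "good enough for an actionable error message" trade-off noted in the findings.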

GH Actions trigger and budget

  • Context: Requirements 5.1, 5.5, 5.6.
  • Sources Consulted:
    • GitHub-hosted runners reference (ubuntu-latest).
    • actions/setup-python@v5 README.
  • Findings:
    • ubuntu-latest has Python 3.10+ pre-installed; actions/setup-python@v5 pins to 3.11 in <5 s.
    • A single git grep over the scoped paths runs in <2 s on this repo (~3.6k matches). End-to-end the workflow comfortably fits inside the 60 s ceiling.
  • Implications:
    • Use actions/checkout@v4 with fetch-depth: 1, actions/setup-python@v5 with python-version: '3.11', and run the script directly. No caching layer needed.

Architecture Pattern Evaluation

| Option | Description | Strengths | Risks / Limitations | Notes |
|---|---|---|---|---|
| A. Extend check_i18n_logs.py | Add --cjk-guard mode to existing script | Reuses one file | Conflates two scopes; existing script is module-scoped, guard is subtree-scoped | Rejected |
| B. New scripts/ci/i18n_cjk_guard.py + new workflow | Single-purpose script + workflow + baseline file | Clean SRP; matches "one script per responsibility" precedent | One additional file | Selected |
| C. Shared cjk_scan.py helper + thin guard | Factor regex/git-grep into helper | DRY for regex constant | Premature abstraction; only one shared symbol today | Rejected |

Design Decisions

Decision: Single-purpose CI script + GH Actions workflow (Option B)

  • Context: Requirements 1–6 demand a small, self-contained guard.
  • Alternatives Considered: A (extend), C (shared helper).
  • Selected Approach: New script scripts/ci/i18n_cjk_guard.py, new workflow .github/workflows/i18n-cjk-guard.yml, baseline file .kiro/specs/i18n-ci-guard/baseline.txt.
  • Rationale: Matches the project's "one focused script per responsibility" convention; isolates a CI-blocking surface from the existing i18n developer scripts; keeps the baseline collocated with the spec for review traceability.
  • Trade-offs: One more file in scripts/ vs. tighter cohesion.
  • Follow-up: When a third caller wants the canonical regex, factor it out then.

Decision: Plain-text baseline format

  • Context: Requirement 4.2 demands stable, diff-friendly format.
  • Alternatives Considered: JSON, YAML.
  • Selected Approach: One line per scoped path: <path>\t<count>, sorted lexicographically by path, single trailing newline.
  • Rationale: Zero parser dependency; predictable diffs; trivial to refresh atomically.
  • Trade-offs: Less expressive than JSON (no nested structure), but the data model is two integers — nesting is unnecessary.

Decision: Refresh via --update-baseline subcommand-style flag

  • Context: Requirement 4.3 needs an explicit refresh path.
  • Alternatives Considered: Separate update_baseline.py script; Makefile target.
  • Selected Approach: Single script with two modes: default (check + exit 0/1) and --update-baseline (overwrite baseline + exit 0).
  • Rationale: One CLI surface to remember; the failure message prints the exact command to run.
  • Trade-offs: Slightly more conditional logic in one script; acceptable given the small total LoC.
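
A minimal sketch of the two-mode CLI, assuming the guard fails when a path's current count exceeds its baseline count (the exact comparison rule and message wording are assumptions, not quoted from the requirements). The names build_parser, check, and run are hypothetical; counts and the baseline writer are passed in so the mode logic stays independent of git and the filesystem.

```python
import argparse
from typing import Callable

REFRESH_HINT = (
    "If the increase is intentional, run: "
    "python scripts/ci/i18n_cjk_guard.py --update-baseline"
)


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="i18n_cjk_guard.py")
    parser.add_argument(
        "--update-baseline",
        action="store_true",
        help="overwrite the committed baseline with the current counts",
    )
    return parser


def check(current: dict[str, int], baseline: dict[str, int]) -> list[str]:
    """Return one message per path whose count exceeds its baseline (assumed rule)."""
    return [
        f"{path}: CJK count {count} exceeds baseline {baseline.get(path, 0)}"
        for path, count in sorted(current.items())
        if count > baseline.get(path, 0)
    ]


def run(
    argv: list[str],
    current: dict[str, int],
    baseline: dict[str, int],
    write_baseline: Callable[[dict[str, int]], None],
) -> int:
    """Return the process exit code for the selected mode."""
    args = build_parser().parse_args(argv)
    if args.update_baseline:
        # Refresh mode: overwrite the baseline and always exit 0.
        write_baseline(current)
        return 0
    failures = check(current, baseline)
    for message in failures:
        print(message)
    if failures:
        # The failure output names the exact command to run.
        print(REFRESH_HINT)
    return 1 if failures else 0
```

In a real entry point, the result would typically be passed to sys.exit, for example sys.exit(run(sys.argv[1:], current, baseline, writer)).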

Decision: Workflow runs only on pull_request to main

  • Context: Requirement 5.1.
  • Alternatives Considered: Run on push to all branches as well; run on pull_request to any base branch.
  • Selected Approach: on.pull_request.branches: [main] only.
  • Rationale: Aligns with how the existing project uses main as the protected branch (see gh pr list history; every feature PR targets main). Avoids redundant runs on intra-branch chains.
  • Trade-offs: A direct push to main would not be guarded — but branch protection already discourages that path (per dev-guidelines.md).

Risks & Mitigations

  • Risk: Baseline drifts upward unintentionally during --update-baseline runs, hiding real regressions.
    • Mitigation: Failure message instructs contributors to refresh only when intentional; the baseline file is reviewed in the same PR diff. Acceptance Criteria 3.3 makes this explicit.
  • Risk: git grep -P not built with PCRE on a developer's local git build (rare on Linux/macOS, possible on minimal Windows builds).
    • Mitigation: The guard prints a clear error if git grep -P fails, meaning an exit status above 1 (status 1 only means no matches were found); the documentation lists Python ≥3.11 and git ≥2.20 as prerequisites.
  • Risk: Baseline counts captured on a feature branch include changes not yet on main, mis-anchoring the ratchet.
    • Mitigation: The implementation task explicitly recomputes baseline against origin/main before committing; documented in tasks.md.

References

  • PR #27 audit pipeline (audit_cjk.sh, check_parity.py, classify.py) — methodology source of truth.
  • scripts/check_i18n_logs.py — CLI/style precedent.
  • git grep man page — -n, -I, -P flag semantics.
  • GitHub Actions actions/setup-python@v5 and actions/checkout@v4 README pages.