MicroFish/.kiro/specs/i18n-ci-guard/requirements.md

190 lines
8.9 KiB
Markdown

# Requirements Document
## Project Description (Input)
Add a permanent CI guard that runs an i18n CJK audit on every pull request.
Linked GitHub issue: #26 (.ticket/26.md).
The guard must fail a PR build when:
1. locales/en.json contains any CJK character (range U+4E00..U+9FFF), or
2. The total count of CJK matches across backend/app/ and frontend/src/ regresses (i.e. exceeds) a committed baseline value.
## Introduction
The i18n initiative has driven the project toward English-by-default UI, logs,
prompts, and documentation. Manual audits (see PR #27, the
`i18n-e2e-english-verification` spec) have repeatedly surfaced regressions
where Chinese strings re-enter the codebase. This spec installs a permanent,
self-contained CI guard that runs on every pull request and fails the build
when (a) `locales/en.json` is no longer CJK-clean, or (b) the total CJK match
count under `backend/app/` and `frontend/src/` regresses against a committed
baseline.
The guard is intentionally minimal: it captures the two highest-signal checks
from the larger audit pipeline so it can run on every PR with a sub-minute
budget and without depending on the (currently unmerged) verification spec.
The committed baseline lets the project ratchet down gaps over time without
blocking unrelated PRs on pre-existing CJK content.
## Boundary Context
- **In scope**:
- A locally runnable Python script that performs both guard checks on the
current working tree.
- A baseline file committed under the spec directory recording the
accepted CJK match counts per scoped path.
- A GitHub Actions workflow that runs the script on every pull request
targeting `main` and fails the build when either check fails.
- A clear, actionable failure message (which path regressed, baseline
value, current value, command to update the baseline).
- **Out of scope**:
- The full classification pipeline (`classify.py`, `render_report.py`,
`post_comment.sh`) from the unmerged `i18n-e2e-english-verification`
spec — those scripts perform deeper audit work and are not required
for the PR-time guard.
- Auto-updating the baseline on `main` (the baseline is a normal
reviewable file).
- Translation work itself; this spec only enforces a regression gate.
- Any change to production source under `backend/app/`, `frontend/src/`,
or `locales/` apart from translations needed to satisfy the guard
against its own initial baseline.
- **Adjacent expectations**:
- PR #27 (`chore/i18n-10-e2e-english-verification`) provides the
methodology referenced here. This spec must remain functional whether
PR #27 has been merged or not.
- The guard reuses the canonical CJK regex range
`[一-鿿]` already established by that audit.
## Requirements
### Requirement 1: Locale-catalogue CJK cleanliness check
**Objective:** As a maintainer of the English locale catalogue, I want every
PR to fail when `locales/en.json` reintroduces any CJK character, so that the
English catalogue stays CJK-free.
#### Acceptance Criteria
1. When the guard script is run from the repository root, the i18n CI Guard
shall scan the contents of `locales/en.json` for any character in the
range `U+4E00..U+9FFF`.
2. If `locales/en.json` contains at least one such character, the i18n CI
Guard shall exit with a non-zero status and report each offending
`key:line` pair on standard output.
3. While `locales/en.json` contains zero such characters, the i18n CI Guard
shall report the catalogue as CJK-clean.
4. If `locales/en.json` is missing or unreadable, the i18n CI Guard shall
exit with a non-zero status and emit an explicit error message naming
the missing file.
### Requirement 2: Backend/frontend CJK regression check against committed baseline
**Objective:** As a maintainer of English support across the codebase, I
want every PR to fail when the total CJK match count under `backend/app/`
or `frontend/src/` exceeds a committed baseline, so that the codebase
ratchets monotonically toward English-only without blocking PRs on
pre-existing CJK content.
#### Acceptance Criteria
1. When the guard script is run, the i18n CI Guard shall count the total
number of CJK matches (range `U+4E00..U+9FFF`, line-level, text files
only) under each of the scoped paths `backend/app/` and `frontend/src/`.
2. The i18n CI Guard shall read the baseline counts from a single
committed baseline file under the spec directory.
3. If the current count for any scoped path exceeds the baseline count for
that path, the i18n CI Guard shall exit with a non-zero status.
4. While the current count for every scoped path is less than or equal to
the baseline, the i18n CI Guard shall exit with status zero for this
check.
5. The i18n CI Guard shall ignore matches inside binary files
(image, font, archive, lockfile, or other non-text formats) by relying
on `git grep -I` semantics.
6. The i18n CI Guard shall scope its scan to tracked files only (matches
in untracked or ignored files shall not contribute to the count).
### Requirement 3: Actionable failure messaging
**Objective:** As a contributor whose PR was rejected by the guard, I want
the failure message to tell me exactly what regressed and how to fix it,
so that I can either translate the offending content or — when intentional —
update the baseline through normal review.
#### Acceptance Criteria
1. If the locale-catalogue check fails, the i18n CI Guard shall print, for
each offending entry: the dotted catalogue key, the line number in
`locales/en.json`, and a truncated snippet of the value.
2. If the regression check fails, the i18n CI Guard shall print, for each
regressed scoped path: the path name, the baseline count, the current
count, and the delta.
3. If the regression check fails, the i18n CI Guard shall print the exact
shell command a contributor must run locally to refresh the baseline
file so the PR can be re-reviewed against the new value.
4. The i18n CI Guard shall print, on success, a one-line summary per check
confirming the catalogue is CJK-clean and the per-path counts are at or
below baseline.
### Requirement 4: Baseline file lifecycle
**Objective:** As a reviewer enforcing English support, I want the baseline
to live in the repository as a small, human-readable file that only changes
through code review, so that downward ratcheting is intentional and
auditable.
#### Acceptance Criteria
1. The i18n CI Guard shall store the baseline as a single committed file
under `.kiro/specs/i18n-ci-guard/`.
2. The baseline file shall record one count per scoped path, in a stable,
diff-friendly text format (no JSON line shuffling, no trailing
whitespace).
3. When the guard script is invoked with an explicit "refresh baseline"
subcommand or flag, the i18n CI Guard shall overwrite the baseline file
with the current per-path counts and exit with status zero.
4. While no refresh flag is supplied, the i18n CI Guard shall never modify
the baseline file.
5. If the baseline file is missing at check time, the i18n CI Guard shall
exit with a non-zero status and instruct the contributor to refresh it.
### Requirement 5: GitHub Actions PR integration
**Objective:** As a project maintainer, I want every pull request targeting
`main` to be gated by the guard, so that no merge silently regresses the
English-only state of the catalogue or codebase.
#### Acceptance Criteria
1. The i18n CI Guard workflow shall trigger on every `pull_request` event
whose base ref is `main`.
2. While the workflow runs, the i18n CI Guard shall check out the PR head
commit with full history sufficient for `git grep` to scan tracked
files.
3. When the guard script exits with non-zero status, the workflow shall
fail and surface the script's standard output and standard error in the
GitHub Actions log.
4. When the guard script exits with status zero, the workflow shall pass.
5. The workflow shall use only Python from the standard
`actions/setup-python` distribution and tools already available on the
GitHub-hosted `ubuntu-latest` runner (`bash`, `git`); it shall not
install third-party Python packages.
6. The workflow shall complete within sixty seconds of wall-clock time on
a clean `ubuntu-latest` runner.
### Requirement 6: Local reproducibility
**Objective:** As a developer preparing a PR, I want to run the same guard
locally before pushing, so that I can catch regressions before CI does.
#### Acceptance Criteria
1. When the guard script is invoked from a developer machine that has
Python 3.11 or newer and `git` available, the i18n CI Guard shall
produce the same pass/fail result and the same per-path counts that
it would produce in CI for the same working tree.
2. The i18n CI Guard shall expose a single, stable invocation entry point
(a script under `scripts/ci/`) documented in the spec's design and
README touchpoints.
3. The i18n CI Guard shall require zero environment variables or secrets
to run locally.