mirror of https://github.com/garrytan/gstack.git
## Step 4: Test Framework Bootstrap

{{TEST_BOOTSTRAP}}

---

## Step 5: Run tests (on merged code)

**Do NOT run `RAILS_ENV=test bin/rails db:migrate`.** `bin/test-lane` already calls
`db:test:prepare` internally, which loads the schema into the correct lane database.
Running bare test migrations without `INSTANCE` set hits an orphan database and corrupts `structure.sql`.

Run both test suites in parallel:

```bash
bin/test-lane 2>&1 | tee /tmp/ship_tests.txt &
npm run test 2>&1 | tee /tmp/ship_vitest.txt &
wait
```

After both commands finish, read both output files and check each suite for pass/fail.
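
One caveat with the commands above: each pipeline's exit status is that of `tee`, and a bare `wait` returns 0, so a failure is only visible in the logs. A minimal sketch of one way to keep each command's own exit status is shown below. `run_logged`, `run_pair`, and the `.rc` sidecar files are hypothetical names introduced for this sketch; they are not part of `bin/test-lane`.

```shell
# run_logged LOG CMD...: run CMD, tee its combined output to LOG, and save
# CMD's own exit status in LOG.rc (the pipeline itself reports tee's status).
run_logged() {
  log_file="$1"
  shift
  { "$@" 2>&1; echo "$?" > "$log_file.rc"; } | tee "$log_file"
}

# run_pair CMD_A CMD_B LOG_A LOG_B: run two command strings in parallel,
# wait for both, then succeed only if both commands exited 0.
run_pair() {
  run_logged "$3" sh -c "$1" &
  run_logged "$4" sh -c "$2" &
  wait
  rc_a=$(cat "$3.rc")
  rc_b=$(cat "$4.rc")
  echo "first=$rc_a second=$rc_b"
  [ "$rc_a" -eq 0 ] && [ "$rc_b" -eq 0 ]
}

# Example (not executed here):
#   run_pair 'bin/test-lane' 'npm run test' /tmp/ship_tests.txt /tmp/ship_vitest.txt
```

The recorded status only says whether a suite failed; the triage that follows still needs the log contents to decide which failures are in-branch.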

**If any test fails:** Do NOT immediately stop. Apply the Test Failure Ownership Triage:

{{TEST_FAILURE_TRIAGE}}

**After triage:** If any in-branch failures remain unfixed, **STOP**. Do not proceed. If all failures were pre-existing and handled (fixed, TODOed, assigned, or skipped), continue to Step 6.

**If all pass:** Continue silently and briefly note the pass counts.

---

## Step 6: Eval Suites (conditional)

Evals are mandatory when prompt-related files change. Skip this step entirely if no prompt files are in the diff.

**1. Check if the diff touches prompt-related files:**

```bash
git diff origin/<base> --name-only
```

Match against these patterns (from CLAUDE.md):
- `app/services/*_prompt_builder.rb`
- `app/services/*_generation_service.rb`, `*_writer_service.rb`, `*_designer_service.rb`
- `app/services/*_evaluator.rb`, `*_scorer.rb`, `*_classifier_service.rb`, `*_analyzer.rb`
- `app/services/concerns/*voice*.rb`, `*writing*.rb`, `*prompt*.rb`, `*token*.rb`
- `app/services/chat_tools/*.rb`, `app/services/x_thread_tools/*.rb`
- `config/system_prompts/*.txt`
- `test/evals/**/*` (eval infrastructure changes affect all suites)

**If no matches:** Print "No prompt-related files changed — skipping evals." and continue to Step 9.
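
The pattern check can be sketched as a shell `case` statement. This is an illustration under two assumptions: the bare patterns on each bullet (for example `*_writer_service.rb`) live in the same directory as the first pattern on that bullet, and a slightly broader match is acceptable, since `*` in a `case` pattern also matches `/`. `is_prompt_file` and `filter_prompt_files` are hypothetical helper names.

```shell
# is_prompt_file PATH: succeed if PATH matches one of the prompt-related
# patterns listed above.
is_prompt_file() {
  case "$1" in
    app/services/*_prompt_builder.rb) return 0 ;;
    app/services/*_generation_service.rb|app/services/*_writer_service.rb|app/services/*_designer_service.rb) return 0 ;;
    app/services/*_evaluator.rb|app/services/*_scorer.rb|app/services/*_classifier_service.rb|app/services/*_analyzer.rb) return 0 ;;
    app/services/concerns/*voice*.rb|app/services/concerns/*writing*.rb|app/services/concerns/*prompt*.rb|app/services/concerns/*token*.rb) return 0 ;;
    app/services/chat_tools/*.rb|app/services/x_thread_tools/*.rb) return 0 ;;
    config/system_prompts/*.txt) return 0 ;;
    test/evals/*) return 0 ;;
  esac
  return 1
}

# filter_prompt_files: read paths on stdin, print only the prompt-related ones.
filter_prompt_files() {
  while IFS= read -r changed_path; do
    if is_prompt_file "$changed_path"; then
      printf '%s\n' "$changed_path"
    fi
  done
}

# Example (not executed here):
#   git diff origin/<base> --name-only | filter_prompt_files
```

Empty output from the filter corresponds to the "no matches" case, and erring toward a broader match is in line with the guidance to over-test rather than miss a regression.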

**2. Identify affected eval suites:**

Each eval runner (`test/evals/*_eval_runner.rb`) declares `PROMPT_SOURCE_FILES` listing which source files affect it. Grep these to find which suites match the changed files:

```bash
grep -l "changed_file_basename" test/evals/*_eval_runner.rb
```

Map runner → test file: `post_generation_eval_runner.rb` → `post_generation_eval_test.rb`.
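
The lookup and the runner-to-test mapping can be combined in a short sketch. It assumes `PROMPT_SOURCE_FILES` entries contain the changed file's basename including its extension, which the source does not spell out; `runner_to_test` and `suites_for_file` are hypothetical helper names, and the optional second argument exists only so the runner directory can be overridden.

```shell
# runner_to_test RUNNER_PATH: swap the _eval_runner.rb suffix for _eval_test.rb.
runner_to_test() {
  printf '%s\n' "${1%_eval_runner.rb}_eval_test.rb"
}

# suites_for_file CHANGED_PATH [RUNNER_DIR]: print the eval test file for
# every runner in RUNNER_DIR (default: test/evals) that mentions the
# changed file's basename.
suites_for_file() {
  changed_name=$(basename "$1")
  grep -lF -- "$changed_name" "${2:-test/evals}"/*_eval_runner.rb 2>/dev/null |
    while IFS= read -r runner_path; do
      runner_to_test "$runner_path"
    done
}

# Example (not executed here):
#   suites_for_file app/services/post_prompt_builder.rb
```

Running it once per changed file and de-duplicating the results (for instance with `sort -u`) would give the list of suites to run.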

**Special cases:**
- Changes to `test/evals/judges/*.rb`, `test/evals/support/*.rb`, or `test/evals/fixtures/` affect ALL suites that use those judges/support files. Check imports in the eval test files to determine which.
- For changes to `config/system_prompts/*.txt`, grep the eval runners for the prompt filename to find affected suites.
- If unsure which suites are affected, run ALL suites that could plausibly be impacted. Over-testing is better than missing a regression.

**3. Run affected suites at `EVAL_JUDGE_TIER=full`:**

`/ship` is a pre-merge gate, so always use the `full` tier (Sonnet structural + Opus persona judges).

```bash
EVAL_JUDGE_TIER=full EVAL_VERBOSE=1 bin/test-lane --eval test/evals/<suite>_eval_test.rb 2>&1 | tee /tmp/ship_evals.txt
```

If multiple suites need to run, run them sequentially (each needs a test lane). If a suite fails, stop immediately rather than spending API cost on the remaining suites.
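
A fail-fast loop for the sequential run might look like the sketch below. `run_suites_sequentially` is a hypothetical helper, and its first argument is assumed to be the name of a wrapper command or function that runs one suite (for example the `bin/test-lane --eval` invocation above) and returns that suite's exit status.

```shell
# run_suites_sequentially RUNNER SUITE...: call RUNNER once per suite, in
# order, and stop at the first failure so later suites incur no cost.
run_suites_sequentially() {
  suite_runner="$1"
  shift
  for suite_file in "$@"; do
    echo "Running $suite_file"
    if ! "$suite_runner" "$suite_file"; then
      echo "FAILED: $suite_file (skipping remaining suites)"
      return 1
    fi
  done
  echo "All suites passed"
}
```

If the wrapper pipes through `tee` as the command above does, it has to propagate the suite's own exit status, because a plain pipeline reports the status of `tee`.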

**4. Check results:**

- **If any eval fails:** Show the failures, the cost dashboard, and **STOP**. Do not proceed.
- **If all pass:** Note pass counts and cost. Continue to Step 9.

**5. Save eval output:** Include the eval results and cost dashboard in the PR body (Step 19).

**Tier reference (for context; `/ship` always uses `full`):**

| Tier | When | Speed (cached) | Cost |
|------|------|----------------|------|
| `fast` (Haiku) | Dev iteration, smoke tests | ~5s (14x faster) | ~$0.07/run |
| `standard` (Sonnet) | Default dev, `bin/test-lane --eval` | ~17s (4x faster) | ~$0.37/run |
| `full` (Opus persona) | **`/ship` and pre-merge** | ~72s (baseline) | ~$1.27/run |

---