feat: add CI eval workflow on Ubicloud runners

Single-job GitHub Actions workflow that runs E2E evals on every PR using
Ubicloud runners ($0.006/run — 10x cheaper than GitHub standard). Uses
EVALS_CONCURRENCY=40 with the new within-file concurrency for ~6min
wall clock. Downloads previous eval artifact from main for comparison,
uploads results, and posts a PR comment with pass/fail + cost.

Ubicloud setup required: connect GitHub repo via ubicloud.com dashboard,
add ANTHROPIC_API_KEY, OPENAI_API_KEY, GEMINI_API_KEY as repo secrets.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
Garry Tan 2026-03-22 23:06:13 -07:00
parent 7ab00588c3
commit e589a95116
No known key found for this signature in database
GPG Key ID: C1F69E85C74EFE1D
3 changed files with 121 additions and 32 deletions

93
.github/workflows/evals.yml vendored Normal file
View File

@ -0,0 +1,93 @@
name: E2E Evals
on:
pull_request:
branches: [main]
concurrency:
group: evals-${{ github.head_ref }}
cancel-in-progress: true
jobs:
evals:
runs-on: ubicloud-standard-2
timeout-minutes: 30
steps:
- uses: actions/checkout@v4
with:
fetch-depth: 0
- uses: oven-sh/setup-bun@v2
- run: bun install
- run: bun run build
- name: Verify browse binary
run: test -f browse/dist/browse || (echo "Browse binary missing after build" && exit 1)
- name: Install Claude CLI
run: npm i -g @anthropic-ai/claude-code
- name: Download previous eval baseline
uses: dawidd6/action-download-artifact@v6
with:
name: eval-results
branch: main
path: /tmp/eval-baseline
if_no_artifact_found: warn
continue-on-error: true
- name: Copy baseline for comparison
run: |
if [ -d /tmp/eval-baseline ]; then
mkdir -p ~/.gstack-dev/evals
cp /tmp/eval-baseline/*.json ~/.gstack-dev/evals/ 2>/dev/null || true
fi
- name: Run E2E evals
env:
ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
GEMINI_API_KEY: ${{ secrets.GEMINI_API_KEY }}
EVALS_CONCURRENCY: "40"
run: bun run test:evals
- name: Upload eval results
if: always()
uses: actions/upload-artifact@v4
with:
name: eval-results
path: ~/.gstack-dev/evals/*.json
retention-days: 90
- name: Post PR comment
if: always() && github.event_name == 'pull_request'
env:
GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
run: |
RESULT=$(ls -t ~/.gstack-dev/evals/*.json 2>/dev/null | grep -v _partial | head -1)
if [ -z "$RESULT" ]; then
echo "No eval results found"
exit 0
fi
TOTAL=$(jq .total_tests "$RESULT")
PASSED=$(jq .passed "$RESULT")
FAILED=$(jq .failed "$RESULT")
COST=$(jq .total_cost_usd "$RESULT")
WALL=$(jq '.wall_clock_ms // 0 | . / 1000 | floor' "$RESULT")
STATUS="pass"
[ "$FAILED" -gt 0 ] && STATUS="FAIL"
BODY="**E2E Evals:** ${STATUS} ${PASSED}/${TOTAL} passed | \$${COST} | ${WALL}s wall clock"
if [ "$FAILED" -gt 0 ]; then
FAILURES=$(jq -r '.tests[] | select(.passed == false) | "- FAIL \(.name): \(.exit_reason // "unknown")"' "$RESULT")
BODY="${BODY}
Failures:
${FAILURES}"
fi
gh pr comment ${{ github.event.pull_request.number }} --body "$BODY"

View File

@ -1,6 +1,5 @@
--- ---
name: gstack name: gstack
version: 1.1.0
description: | description: |
Fast headless browser for QA testing and site dogfooding. Navigate pages, interact with Fast headless browser for QA testing and site dogfooding. Navigate pages, interact with
elements, verify state, diff before/after, take annotated screenshots, test responsive elements, verify state, diff before/after, take annotated screenshots, test responsive
@ -14,11 +13,6 @@ description: |
/unfreeze; gstack upgrades /gstack-upgrade. If the user opts out of suggestions, stop /unfreeze; gstack upgrades /gstack-upgrade. If the user opts out of suggestions, stop
and run gstack-config set proactive false; if they opt back in, run gstack-config set and run gstack-config set proactive false; if they opt back in, run gstack-config set
proactive true. proactive true.
allowed-tools:
- Bash
- Read
- AskUserQuestion
--- ---
<!-- AUTO-GENERATED from SKILL.md.tmpl — do not edit directly --> <!-- AUTO-GENERATED from SKILL.md.tmpl — do not edit directly -->
<!-- Regenerate: bun run gen:skill-docs --> <!-- Regenerate: bun run gen:skill-docs -->
@ -26,23 +20,28 @@ allowed-tools:
## Preamble (run first) ## Preamble (run first)
```bash ```bash
_UPD=$(~/.claude/skills/gstack/bin/gstack-update-check 2>/dev/null || .claude/skills/gstack/bin/gstack-update-check 2>/dev/null || true) _ROOT=$(git rev-parse --show-toplevel 2>/dev/null)
GSTACK_ROOT="$HOME/.codex/skills/gstack"
[ -n "$_ROOT" ] && [ -d "$_ROOT/.agents/skills/gstack" ] && GSTACK_ROOT="$_ROOT/.agents/skills/gstack"
GSTACK_BIN="$GSTACK_ROOT/bin"
GSTACK_BROWSE="$GSTACK_ROOT/browse/dist"
_UPD=$($GSTACK_BIN/gstack-update-check 2>/dev/null || .agents/skills/gstack/bin/gstack-update-check 2>/dev/null || true)
[ -n "$_UPD" ] && echo "$_UPD" || true [ -n "$_UPD" ] && echo "$_UPD" || true
mkdir -p ~/.gstack/sessions mkdir -p ~/.gstack/sessions
touch ~/.gstack/sessions/"$PPID" touch ~/.gstack/sessions/"$PPID"
_SESSIONS=$(find ~/.gstack/sessions -mmin -120 -type f 2>/dev/null | wc -l | tr -d ' ') _SESSIONS=$(find ~/.gstack/sessions -mmin -120 -type f 2>/dev/null | wc -l | tr -d ' ')
find ~/.gstack/sessions -mmin +120 -type f -delete 2>/dev/null || true find ~/.gstack/sessions -mmin +120 -type f -delete 2>/dev/null || true
_CONTRIB=$(~/.claude/skills/gstack/bin/gstack-config get gstack_contributor 2>/dev/null || true) _CONTRIB=$($GSTACK_BIN/gstack-config get gstack_contributor 2>/dev/null || true)
_PROACTIVE=$(~/.claude/skills/gstack/bin/gstack-config get proactive 2>/dev/null || echo "true") _PROACTIVE=$($GSTACK_BIN/gstack-config get proactive 2>/dev/null || echo "true")
_BRANCH=$(git branch --show-current 2>/dev/null || echo "unknown") _BRANCH=$(git branch --show-current 2>/dev/null || echo "unknown")
echo "BRANCH: $_BRANCH" echo "BRANCH: $_BRANCH"
echo "PROACTIVE: $_PROACTIVE" echo "PROACTIVE: $_PROACTIVE"
source <(~/.claude/skills/gstack/bin/gstack-repo-mode 2>/dev/null) || true source <($GSTACK_BIN/gstack-repo-mode 2>/dev/null) || true
REPO_MODE=${REPO_MODE:-unknown} REPO_MODE=${REPO_MODE:-unknown}
echo "REPO_MODE: $REPO_MODE" echo "REPO_MODE: $REPO_MODE"
_LAKE_SEEN=$([ -f ~/.gstack/.completeness-intro-seen ] && echo "yes" || echo "no") _LAKE_SEEN=$([ -f ~/.gstack/.completeness-intro-seen ] && echo "yes" || echo "no")
echo "LAKE_INTRO: $_LAKE_SEEN" echo "LAKE_INTRO: $_LAKE_SEEN"
_TEL=$(~/.claude/skills/gstack/bin/gstack-config get telemetry 2>/dev/null || true) _TEL=$($GSTACK_BIN/gstack-config get telemetry 2>/dev/null || true)
_TEL_PROMPTED=$([ -f ~/.gstack/.telemetry-prompted ] && echo "yes" || echo "no") _TEL_PROMPTED=$([ -f ~/.gstack/.telemetry-prompted ] && echo "yes" || echo "no")
_TEL_START=$(date +%s) _TEL_START=$(date +%s)
_SESSION_ID="$$-$(date +%s)" _SESSION_ID="$$-$(date +%s)"
@ -50,13 +49,13 @@ echo "TELEMETRY: ${_TEL:-off}"
echo "TEL_PROMPTED: $_TEL_PROMPTED" echo "TEL_PROMPTED: $_TEL_PROMPTED"
mkdir -p ~/.gstack/analytics mkdir -p ~/.gstack/analytics
echo '{"skill":"gstack","ts":"'$(date -u +%Y-%m-%dT%H:%M:%SZ)'","repo":"'$(basename "$(git rev-parse --show-toplevel 2>/dev/null)" 2>/dev/null || echo "unknown")'"}' >> ~/.gstack/analytics/skill-usage.jsonl 2>/dev/null || true echo '{"skill":"gstack","ts":"'$(date -u +%Y-%m-%dT%H:%M:%SZ)'","repo":"'$(basename "$(git rev-parse --show-toplevel 2>/dev/null)" 2>/dev/null || echo "unknown")'"}' >> ~/.gstack/analytics/skill-usage.jsonl 2>/dev/null || true
for _PF in ~/.gstack/analytics/.pending-*; do [ -f "$_PF" ] && ~/.claude/skills/gstack/bin/gstack-telemetry-log --event-type skill_run --skill _pending_finalize --outcome unknown --session-id "$_SESSION_ID" 2>/dev/null || true; break; done for _PF in ~/.gstack/analytics/.pending-*; do [ -f "$_PF" ] && $GSTACK_BIN/gstack-telemetry-log --event-type skill_run --skill _pending_finalize --outcome unknown --session-id "$_SESSION_ID" 2>/dev/null || true; break; done
``` ```
If `PROACTIVE` is `"false"`, do not proactively suggest gstack skills — only invoke If `PROACTIVE` is `"false"`, do not proactively suggest gstack skills — only invoke
them when the user explicitly asks. The user opted out of proactive suggestions. them when the user explicitly asks. The user opted out of proactive suggestions.
If output shows `UPGRADE_AVAILABLE <old> <new>`: read `~/.claude/skills/gstack/gstack-upgrade/SKILL.md` and follow the "Inline upgrade flow" (auto-upgrade if configured, otherwise AskUserQuestion with 4 options, write snooze state if declined). If `JUST_UPGRADED <from> <to>`: tell user "Running gstack v{to} (just updated!)" and continue. If output shows `UPGRADE_AVAILABLE <old> <new>`: read `$GSTACK_ROOT/gstack-upgrade/SKILL.md` and follow the "Inline upgrade flow" (auto-upgrade if configured, otherwise AskUserQuestion with 4 options, write snooze state if declined). If `JUST_UPGRADED <from> <to>`: tell user "Running gstack v{to} (just updated!)" and continue.
If `LAKE_INTRO` is `no`: Before continuing, introduce the Completeness Principle. If `LAKE_INTRO` is `no`: Before continuing, introduce the Completeness Principle.
Tell the user: "gstack follows the **Boil the Lake** principle — always do the complete Tell the user: "gstack follows the **Boil the Lake** principle — always do the complete
@ -82,7 +81,7 @@ Options:
- A) Help gstack get better! (recommended) - A) Help gstack get better! (recommended)
- B) No thanks - B) No thanks
If A: run `~/.claude/skills/gstack/bin/gstack-config set telemetry community` If A: run `$GSTACK_BIN/gstack-config set telemetry community`
If B: ask a follow-up AskUserQuestion: If B: ask a follow-up AskUserQuestion:
@ -93,8 +92,8 @@ Options:
- A) Sure, anonymous is fine - A) Sure, anonymous is fine
- B) No thanks, fully off - B) No thanks, fully off
If B→A: run `~/.claude/skills/gstack/bin/gstack-config set telemetry anonymous` If B→A: run `$GSTACK_BIN/gstack-config set telemetry anonymous`
If B→B: run `~/.claude/skills/gstack/bin/gstack-config set telemetry off` If B→B: run `$GSTACK_BIN/gstack-config set telemetry off`
Always run: Always run:
```bash ```bash
@ -154,7 +153,7 @@ Never let a noticed issue silently pass. The whole point is proactive communicat
## Search Before Building ## Search Before Building
Before building infrastructure, unfamiliar patterns, or anything the runtime might have a built-in — **search first.** Read `~/.claude/skills/gstack/ETHOS.md` for the full philosophy. Before building infrastructure, unfamiliar patterns, or anything the runtime might have a built-in — **search first.** Read `$GSTACK_ROOT/ETHOS.md` for the full philosophy.
**Three layers of knowledge:** **Three layers of knowledge:**
- **Layer 1** (tried and true — in distribution). Don't reinvent the wheel. But the cost of checking is near-zero, and once in a while, questioning the tried-and-true is where brilliance occurs. - **Layer 1** (tried and true — in distribution). Don't reinvent the wheel. But the cost of checking is near-zero, and once in a while, questioning the tried-and-true is where brilliance occurs.
@ -252,7 +251,7 @@ Run this bash:
_TEL_END=$(date +%s) _TEL_END=$(date +%s)
_TEL_DUR=$(( _TEL_END - _TEL_START )) _TEL_DUR=$(( _TEL_END - _TEL_START ))
rm -f ~/.gstack/analytics/.pending-"$_SESSION_ID" 2>/dev/null || true rm -f ~/.gstack/analytics/.pending-"$_SESSION_ID" 2>/dev/null || true
~/.claude/skills/gstack/bin/gstack-telemetry-log \ $GSTACK_ROOT/bin/gstack-telemetry-log \
--skill "SKILL_NAME" --duration "$_TEL_DUR" --outcome "OUTCOME" \ --skill "SKILL_NAME" --duration "$_TEL_DUR" --outcome "OUTCOME" \
--used-browse "USED_BROWSE" --session-id "$_SESSION_ID" 2>/dev/null & --used-browse "USED_BROWSE" --session-id "$_SESSION_ID" 2>/dev/null &
``` ```
@ -271,7 +270,7 @@ When you are in plan mode and about to call ExitPlanMode:
3. If it does NOT — run this command: 3. If it does NOT — run this command:
\`\`\`bash \`\`\`bash
~/.claude/skills/gstack/bin/gstack-review-read $GSTACK_ROOT/bin/gstack-review-read
\`\`\` \`\`\`
Then write a `## GSTACK REVIEW REPORT` section to the end of the plan file: Then write a `## GSTACK REVIEW REPORT` section to the end of the plan file:
@ -312,8 +311,8 @@ Auto-shuts down after 30 min idle. State persists between calls (cookies, tabs,
```bash ```bash
_ROOT=$(git rev-parse --show-toplevel 2>/dev/null) _ROOT=$(git rev-parse --show-toplevel 2>/dev/null)
B="" B=""
[ -n "$_ROOT" ] && [ -x "$_ROOT/.claude/skills/gstack/browse/dist/browse" ] && B="$_ROOT/.claude/skills/gstack/browse/dist/browse" [ -n "$_ROOT" ] && [ -x "$_ROOT/.agents/skills/gstack/browse/dist/browse" ] && B="$_ROOT/.agents/skills/gstack/browse/dist/browse"
[ -z "$B" ] && B=~/.claude/skills/gstack/browse/dist/browse [ -z "$B" ] && B=$GSTACK_BROWSE/browse
if [ -x "$B" ]; then if [ -x "$B" ]; then
echo "READY: $B" echo "READY: $B"
else else

View File

@ -338,17 +338,6 @@
**Depends on:** Video recording **Depends on:** Video recording
### GitHub Actions eval upload
**What:** Run eval suite in CI, upload result JSON as artifact, post summary comment on PR.
**Why:** CI integration catches quality regressions before merge and provides persistent eval records per PR.
**Context:** Requires `ANTHROPIC_API_KEY` in CI secrets. Cost is ~$4/run. Eval persistence system (v0.3.6) writes JSON to `~/.gstack-dev/evals/` — CI would upload as GitHub Actions artifacts and use `eval:compare` to post delta comment.
**Effort:** M
**Priority:** P2
**Depends on:** Eval persistence (shipped in v0.3.6)
### E2E model pinning — SHIPPED ### E2E model pinning — SHIPPED
@ -539,6 +528,14 @@ Shipped in v0.6.5. TemplateContext in gen-skill-docs.ts bakes skill name into pr
## Completed ## Completed
### CI eval pipeline (v0.9.9.0)
- GitHub Actions eval upload on Ubicloud runners ($0.006/run)
- Within-file test concurrency (test() → testConcurrentIfSelected())
- Eval artifact upload + PR comment with pass/fail + cost
- Baseline comparison via artifact download from main
- EVALS_CONCURRENCY=40 for ~6min wall clock (was ~18min)
**Completed:** v0.9.9.0
### Deploy pipeline (v0.9.8.0) ### Deploy pipeline (v0.9.8.0)
- /land-and-deploy — merge PR, wait for CI/deploy, canary verification - /land-and-deploy — merge PR, wait for CI/deploy, canary verification
- /canary — post-deploy monitoring loop with anomaly detection - /canary — post-deploy monitoring loop with anomaly detection