triage-ci-flake
Skill by payloadcms
Use when CI tests fail on the `main` branch after a PR merge, or when investigating flaky test failures in CI environments.
Install: `npx ai-builder add skill payloadcms/triage-ci-flake` (installs to `.claude/skills/triage-ci-flake/`)
# Triage CI Failure
## Overview
Systematic workflow for triaging and fixing test failures in CI, especially flaky tests that pass locally but fail in CI. A test that made it to `main` has already passed CI at least once, so a later failure is usually flakiness caused by timing, bundling, or environment differences.
**CRITICAL RULE: You MUST run the reproduction workflow before proposing any fixes. No exceptions.**
## When to Use
- CI test fails on `main` branch after PR was merged
- Test passes locally but fails in CI
- Test failure labeled as "flaky" or intermittent
- E2E or integration test timing out in CI only
## MANDATORY First Steps
**YOU MUST EXECUTE THESE COMMANDS. Reading code or analyzing logs does NOT count as reproduction.**
1. **Extract** suite name, test name, and error from CI logs
2. **EXECUTE**: Kill any processes on ports 3000 and 3001 to avoid conflicts
3. **EXECUTE**: `pnpm dev $SUITE_NAME` (use run_in_background=true)
4. **EXECUTE**: Wait for server to be ready (check with curl or sleep)
5. **EXECUTE**: Run the specific failing test with Playwright directly: `npx playwright test test/$SUITE_NAME/e2e.spec.ts --headed -g "EXACT_TEST_DESCRIPTION"`
6. **If test passes**, **EXECUTE**: `pnpm prepare-run-test-against-prod`
7. **EXECUTE**: `pnpm dev:prod $SUITE_NAME` and run test again
**Only after EXECUTING these commands and seeing their output** can you proceed to analysis and fixes.
**"Analysis from logs" is NOT reproduction. You must RUN the commands.**
## Core Workflow
```dot
digraph triage_ci {
"CI failure reported" [shape=box];
"Extract details from CI logs" [shape=box];
"Identify suite and test name" [shape=box];
"Run dev server: pnpm dev $SUITE" [shape=box];
"Run specific test by name" [shape=box];
"Did test fail?" [shape=diamond];
"Debug with dev code" [shape=box];
"Run prepare-run-test-against-prod" [shape=box];
"Run: pnpm dev:prod $SUITE" [shape=box];
"Run specific test again" [shape=box];
"Did test fail now?" [shape=diamond];
"Debug bundling issue" [shape=box];
"Unable to reproduce - check logs" [shape=box];
"Fix and verify" [shape=box];
"CI failure reported" -> "Extract details from CI logs";
"Extract details from CI logs" -> "Identify suite and test name";
"Identify suite and test name" -> "Run dev server: pnpm dev $SUITE";
"Run dev server: pnpm dev $SUITE" -> "Run specific test by name";
"Run specific test by name" -> "Did test fail?";
"Did test fail?" -> "Debug with dev code" [label="yes"];
"Did test fail?" -> "Run prepare-run-test-against-prod" [label="no"];
"Run prepare-run-test-against-prod" -> "Run: pnpm dev:prod $SUITE";
"Run: pnpm dev:prod $SUITE" -> "Run specific test again";
"Run specific test again" -> "Did test fail now?";
"Did test fail now?" -> "Debug bundling issue" [label="yes"];
"Did test fail now?" -> "Unable to reproduce - check logs" [label="no"];
"Debug with dev code" -> "Fix and verify";
"Debug bundling issue" -> "Fix and verify";
}
```
## Step-by-Step Process
### 1. Extract CI Details
From CI logs or GitHub Actions URL, identify:
- **Suite name**: Directory name (e.g., `i18n`, `fields`, `lexical`)
- **Test file**: Full path (e.g., `test/i18n/e2e.spec.ts`)
- **Test name**: Exact test description
- **Error message**: Full stack trace
- **Test type**: E2E (Playwright) or integration (Vitest)
### 2. Reproduce with Dev Code
**CRITICAL: Always run the specific test by name, not the full suite.**
**SERVER MANAGEMENT RULES:**
1. **ALWAYS kill all servers before starting a new one**
2. **NEVER assume ports are free**
3. **ALWAYS wait for server ready confirmation before running tests**
```bash
# ========================================
# STEP 2A: STOP ALL SERVERS
# ========================================
lsof -ti:3000 | xargs kill -9 2>/dev/null || echo "Port 3000 clear"
lsof -ti:3001 | xargs kill -9 2>/dev/null || echo "Port 3001 clear"
# ========================================
# STEP 2B: START DEV SERVER
# ========================================
# Start dev server with the suite (in background with run_in_background=true)
pnpm dev $SUITE_NAME
# ========================================
# STEP 2C: WAIT FOR SERVER READY
# ========================================
# Wait for server to be ready (REQUIRED - do not skip)
until curl -s http://localhost:3000/admin > /dev/null 2>&1; do sleep 1; done && echo "Server ready"
# ========================================
# STEP 2D: RUN SPECIFIC TEST
# ========================================
# Run ONLY the specific failing test using Playwright directly
# For E2E tests (DO NOT use pnpm test:e2e as it spawns its own server):
pnpm exec playwright test test/$SUITE_NAME/e2e.spec.ts -g "exact test name"
# For integration tests:
pnpm test:int $SUITE_NAME -t "exact test name"
```
**Did the test fail?**
- ✅ **YES**: You reproduced it! Proceed to debug with dev code.
- ❌ **NO**: Continue to step 3 (bundled code test).
### 3. Reproduce with Bundled Code
If test passed with dev code, the issue is likely in bundled/production code.
**IMPORTANT: You MUST stop the dev server before starting prod server.**
```bash
# ========================================
# STEP 3A: STOP ALL SERVERS (INCLUDING DEV SERVER FROM STEP 2)
# ========================================
lsof -ti:3000 | xargs kill -9 2>/dev/null || echo "Port 3000 clear"
lsof -ti:3001 | xargs kill -9 2>/dev/null || echo "Port 3001 clear"
# ========================================
# STEP 3B: BUILD AND PACK FOR PROD
# ========================================
# Build all packages and pack them (this takes time - be patient)
pnpm prepare-run-test-against-prod
# ========================================
# STEP 3C: START PROD SERVER
# ========================================
# Start prod dev server (in background with run_in_background=true)
pnpm dev:prod $SUITE_NAME
# ========================================
# STEP 3D: WAIT FOR SERVER READY
# ========================================
# Wait for server to be ready (REQUIRED - do not skip)
until curl -s http://localhost:3000/admin > /dev/null 2>&1; do sleep 1; done && echo "Server ready"
# ========================================
# STEP 3E: RUN SPECIFIC TEST
# ========================================
# Run the specific test again using Playwright directly
pnpm exec playwright test test/$SUITE_NAME/e2e.spec.ts -g "exact test name"
# OR for integration tests:
pnpm test:int $SUITE_NAME -t "exact test name"
```
**Did the test fail now?**
- ✅ **YES**: Bundling or production build issue. Look for:
- Missing exports in package.json
- Build configuration problems
- Code that behaves differently when bundled
- ❌ **NO**: Unable to reproduce locally. Proceed to step 4.
### 4. Unable to Reproduce
If you cannot reproduce locally after both attempts:
- Review CI logs more carefully for environment differences
- Check for race conditions (run the test multiple times: `for i in {1..10}; do pnpm exec playwright test test/$SUITE_NAME/e2e.spec.ts -g "exact test name"; done`)
- Look for CI-specific constraints (memory, CPU, timing)
- Consider if it's a true race condition that's highly timing-dependent
## Common Flaky Test Patterns
### Race Conditions
- Page navigating while assertions run
- Network requests not settled before assertions
- State updates not completed
**Fix patterns:**
- Use Playwright's web-first assertions (`toBeVisible()`, `toHaveText()`)
- Wait for specific conditions, not arbitrary timeouts
- Use `waitForFunction()` with condition checks
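A minimal sketch of the web-first pattern; the admin route and selectors below are hypothetical placeholders, not names from the Payload test suite:
```ts
import { expect, test } from '@playwright/test'

test('title updates after save', async ({ page }) => {
  // Hypothetical route and selectors for illustration only
  await page.goto('http://localhost:3000/admin/collections/posts/create')
  await page.locator('#field-title').fill('Hello world')
  await page.locator('button[type="submit"]').click()

  // Flaky: reads the DOM exactly once, possibly while the save is still in flight
  // expect(await page.locator('.doc-title').textContent()).toBe('Hello world')

  // Web-first: Playwright retries this assertion until it passes or times out
  await expect(page.locator('.doc-title')).toHaveText('Hello world')
})
```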
### Test Pollution
- Tests leaving data in database
- Shared state between tests
- Missing cleanup in `afterEach`
**Fix patterns:**
- Track created IDs and clean up in `afterEach`
- Use isolated test data
- Don't use `deleteAll`-style cleanup that wipes data other tests rely on
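A sketch of ID-tracked cleanup, assuming a hypothetical REST endpoint at `/api/posts` and a create response that exposes the new document under `doc`; the real suite's helpers may differ:
```ts
import { test } from '@playwright/test'

// Track every document this file creates so cleanup removes exactly that data,
// rather than a deleteAll that can wipe documents other tests rely on.
const createdIds: string[] = []

test.afterEach(async ({ request }) => {
  for (const id of createdIds) {
    // Hypothetical endpoint - prefer the suite's own cleanup helper if one exists
    await request.delete(`http://localhost:3000/api/posts/${id}`)
  }
  createdIds.length = 0
})

test('creates an isolated document', async ({ request }) => {
  const res = await request.post('http://localhost:3000/api/posts', {
    data: { title: 'isolated test doc' },
  })
  const body = await res.json()
  createdIds.push(body.doc.id) // response shape is an assumption
})
```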
### Timing Issues
- `setTimeout`/`sleep` instead of condition-based waiting
- Not waiting for page stability
- Animations/transitions not complete
**Fix patterns:**
- Use `waitForPageStability()` helper
- Wait for specific DOM states
- Use Playwright's built-in waiting mechanisms
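A sketch contrasting a fixed delay with condition-based waiting; the `.dashboard` selector is a placeholder for whatever state the real test needs:
```ts
import { expect, test } from '@playwright/test'

test('waits on a condition, not a fixed delay', async ({ page }) => {
  await page.goto('http://localhost:3000/admin')

  // Flaky: a fixed delay is too short on a slow CI runner and wasted time locally
  // await page.waitForTimeout(2000)

  // Condition-based: resolves as soon as the page reaches the state the test needs
  await page.waitForFunction(() => document.querySelector('.dashboard') !== null)

  // Often the retrying assertion alone is enough and reads better
  await expect(page.locator('.dashboard')).toBeVisible()
})
```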
## Linting Considerations
When fixing e2e tests, be aware of these eslint rules:
- `playwright/no-networkidle` - Avoid `waitForLoadState('networkidle')` (use condition-based waiting instead)
- `payload/no-wait-function` - Avoid custom `wait()` functions (use Playwright's built-in waits)
- `payload/no-flaky-assertions` - Avoid non-retryable assertions
- `playwright/prefer-web-first-assertions` - Use built-in Playwright assertions
**Existing code may violate these rules** - when adding new code, follow the rules even if existing code doesn't.
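For example, a `networkidle` wait can usually be replaced by waiting on the specific response or UI state the test depends on; the URL pattern and heading name below are placeholders, not values from the real suite:
```ts
import { expect, test } from '@playwright/test'

test('navigates without networkidle', async ({ page }) => {
  // Violates playwright/no-networkidle and is racy on pages with background polling:
  // await page.goto('http://localhost:3000/admin', { waitUntil: 'networkidle' })

  // Wait for the request the test actually cares about...
  const apiSettled = page.waitForResponse((res) => res.url().includes('/api/') && res.ok())
  await page.goto('http://localhost:3000/admin')
  await apiSettled

  // ...then assert on the UI state that depends on it
  await expect(page.getByRole('heading', { name: 'Dashboard' })).toBeVisible()
})
```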
## Verification
After fixing:
```bash
# Ensure dev server is running on port 3000
# Run test multiple times to confirm stability
for i in {1..10}; do
pnpm exec playwright test test/$SUITE_NAME/e2e.spec.ts -g "exact test name" || break
done
# Run full suite
pnpm exec playwright test test/$SUITE_NAME/e2e.spec.ts
# If you modified bundled code, test with prod build
lsof -ti:3000 | xargs kill -9 2>/dev/null
pnpm prepare-run-test-against-prod
pnpm dev:prod $SUITE_NAME
until curl -s http://localhost:3000/admin > /dev/null; do sleep 1; done
pnpm exec playwright test test/$SUITE_NAME/e2e.spec.ts
```
## The Iron Law
**NO FIX WITHOUT REPRODUCTION FIRST**
If you propose a fix before completing steps 1-3 of the workflow, you've violated this skill.
**This applies even when:**
- The fix seems obvious from the logs
- You've seen this error before
- There's time pressure from the team
- You're confident about the root cause
- The logs show clear stack traces
**No exceptions. Run the reproduction workflow first.**
## Rationalization Table
Every excuse for skipping reproduction, and why it's wrong:
| Rationalization | Reality |
| ------------------------------------ | ---------------------------------------------- |
| "The logs show the exact error" | Logs show symptoms, not root cause. Reproduce. |
| "I can see the problem in the code" | You're guessing. Reproduce to confirm. |
| "This is obviously a race condition" | Maybe. Reproduce to be sure. |
| "I've seen this error before" | This might be different. Reproduce. |
| "The stack trace is clear" | Stack trace shows where, not why. Reproduce. |
| "Time pressure - need to fix fast" | Reproducing IS fast. Guessing wastes time. |
| "The test file shows the issue" | Reading ≠ running. Execute the commands. |
| "I'll analyze the code first" | Code analysis comes AFTER reproduction. |
| "Let me investigate the root cause" | Reproduction IS the investigation. |
| "I need to understand the error" | Understanding comes from seeing it fail. |
## Red Flags - STOP
These thoughts mean you're about to violate the skill:
- "Let me just fix the obvious issue" → STOP. Execute reproduction first.
- "I'll increase the timeout" → STOP. Find root cause via reproduction.
- "This looks like X from the logs" → STOP. Run the test.
- "I'll run the full suite" → STOP. Run specific test first.
- "Let me check the code" → STOP. Run reproduction first.
- "I can see the problem in the stack trace" → STOP. Reproduce to confirm.
- "The logs are clear enough" → STOP. Execute the commands.
- "Let me analyze..." → STOP. Analysis comes after reproduction.
**All of these mean: Stop immediately. Execute the reproduction commands. No analysis without execution.**
## Example Session
**CORRECT approach - executes commands first:**
```
User: CI failed on i18n suite - https://github.com/.../actions/runs/XXX
Agent:
1. Extracted details from CI logs:
- Suite: i18n
- Test: "ensure translations update correctly when switching language"
- File: test/i18n/e2e.spec.ts:124
- Error: "Error loading language: en" TypeError: network error
2. Executing reproduction workflow:
[Uses Bash tool to kill ports]
$ lsof -ti:3000 | xargs kill -9 2>/dev/null || echo "Port 3000 clear"
$ lsof -ti:3001 | xargs kill -9 2>/dev/null || echo "Port 3001 clear"
[Uses Bash tool with run_in_background=true]
$ pnpm dev i18n
[Waits for server to start with curl check]
$ until curl -s http://localhost:3000/admin > /dev/null; do sleep 1; done && echo "Server ready"
[Uses Bash tool to run Playwright directly]
$ pnpm exec playwright test test/i18n/e2e.spec.ts -g "ensure translations update correctly when switching language"
Output shows: Test PASSED locally
3. Test passed with dev code, trying bundled code:
[Kills ports again]
$ lsof -ti:3000 | xargs kill -9 2>/dev/null
[Uses Bash tool]
$ pnpm prepare-run-test-against-prod
[Uses Bash tool with run_in_background=true]
$ pnpm dev:prod i18n
[Waits for server]
$ until curl -s http://localhost:3000/admin > /dev/null; do sleep 1; done
[Uses Bash tool]
$ pnpm exec playwright test test/i18n/e2e.spec.ts -g "ensure translations update correctly when switching language"
Output shows: Test FAILED - reproduced!
4. Now that I've reproduced it, analyzing the failure...
[Code analysis, root cause identification]
5. Implementing fix: [specific changes]
6. Verification:
[Uses Bash tool]
$ for i in {1..10}; do pnpm exec playwright test test/i18n/e2e.spec.ts -g "exact test"; done
All 10 runs passed!
```
**INCORRECT approach - analyzes without executing:**
```
User: CI failed on i18n suite
Agent: Looking at the error logs, I can see this is a race condition...
[Analyzes code, proposes fix without running any commands]
❌ WRONG - This violates the skill. Must execute reproduction commands first.
```
## Common Mistakes
| Mistake | Fix |
| --------------------------------- | -------------------------------------------------------- |
| Running full test suite first | Run specific test by name |
| Skipping dev code reproduction | Always try dev code first |
| Not testing with bundled code | If dev passes, test with `prepare-run-test-against-prod` |
| Proposing fix without reproducing | Follow the workflow - reproduce first |
| Using `networkidle` in new code | Use condition-based waiting with `waitForFunction()` |
| Adding arbitrary `wait()` calls | Use Playwright's built-in assertions and waits |
## Key Principles
1. **Reproduce before fixing**: Never propose a fix without reproducing the issue
2. **Test specifically**: Run the exact failing test, not the full suite
3. **Dev first, prod second**: Check dev code before bundled code
4. **Follow the workflow**: No shortcuts - the steps exist to save time
5. **Verify stability**: Run tests multiple times to confirm fix
## Completion: Creating a PR
**After you have:**
1. ✅ Reproduced the issue
2. ✅ Implemented a fix
3. ✅ Verified the fix passes locally (multiple runs)
4. ✅ Tested with prod build (if applicable)
**You MUST prompt the user to create a PR:**
```
The fix has been verified and is ready for review. Would you like me to create a PR with these changes?
Summary of changes:
- [List files modified]
- [Brief description of the fix]
- [Verification results]
```
**IMPORTANT:**
- **DO NOT automatically create a PR** - always ask the user first
- Provide a clear summary of what was changed and why
- Include verification results (number of test runs, pass rate)
- Let the user decide whether to create the PR immediately or make additional changes first
This ensures the user has visibility and control over what gets submitted for review.