Breaking points identified (stress testing)
Stress testing pushes a system beyond its normal capacity to reveal how it fails. The goal is not just to observe that "it slows down" but to identify the specific failure modes and how they cascade.
Question to ask
"What breaks first under overload: the database or the API?"
Pass criteria
- ✓ Stress tests have been run to find breaking points
- ✓ Failure modes documented (what breaks first, how it manifests)
- ✓ Team understands the cascade (DB fails → API queues → timeouts → user errors)
Fail criteria
- ✗ Never stress tested ("afraid to break things")
- ✗ Only discovered breaking points during real incidents
- ✗ Breaking points unknown
Verification guide
Severity: Recommended
Stress testing pushes a system beyond its normal capacity to reveal how it fails. The finding should be specific: not "it slows down" but "at 2000 RPS the database connection pool is exhausted and requests start failing with X error."
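One way to arrive at a statement like "at 2000 RPS ..." is a step ramp: raise the load in fixed increments and stop at the first step whose error rate crosses a threshold. The sketch below is a minimal illustration, not a prescribed tool. `find_breaking_point`, `StepResult`, and `simulated_step` are hypothetical names, and `simulated_step` is a stand-in for a real load generator that assumes a capacity of 2000 RPS, borrowed from the example above.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class StepResult:
    rps: int
    error_rate: float  # fraction of failed requests, 0.0 to 1.0


def find_breaking_point(
    run_step: Callable[[int], float],
    start_rps: int,
    step_rps: int,
    max_rps: int,
    max_error_rate: float = 0.01,
) -> tuple[Optional[int], list[StepResult]]:
    """Raise load step by step until the error rate exceeds the threshold.

    Returns (first failing RPS, or None if no step failed; per-step history).
    """
    history: list[StepResult] = []
    rps = start_rps
    while rps <= max_rps:
        error_rate = run_step(rps)
        history.append(StepResult(rps, error_rate))
        if error_rate > max_error_rate:
            return rps, history
        rps += step_rps
    return None, history


def simulated_step(rps: int) -> float:
    """Hypothetical stand-in for a load generator (assumed capacity: 2000 RPS)."""
    capacity = 2000
    if rps <= capacity:
        return 0.0
    return (rps - capacity) / rps  # requests beyond capacity fail


if __name__ == "__main__":
    breaking_rps, history = find_breaking_point(simulated_step, 500, 500, 5000)
    for step in history:
        print(f"{step.rps} RPS: {step.error_rate:.1%} errors")
    print("first failing step:", breaking_rps)
```

The ramp only reports the first failing step, so the true breaking point lies somewhere between the last passing step and that one. A smaller `step_rps` narrows the range at the cost of a longer test. In a real run, `run_step` would drive the chosen load tool against a non-production environment and return the observed error rate.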
Check automatically:
- Search the repository for stress test documentation, scripts, failure injection tooling, and documented failure modes:
# Look for stress test documentation
grep -riE "stress.*test|breaking.*point|failure.*mode|max.*load|overload" docs/ README.md CLAUDE.md --include="*.md" 2>/dev/null
# Look for stress test scripts (often separate from load tests)
find . -name "*stress*" -type f 2>/dev/null | grep -v node_modules
# Check for chaos engineering / failure injection
grep -riE "chaos|gremlin|litmus|failure.*inject" docs/ .github/ package.json --include="*.md" --include="*.yml" --include="*.json" 2>/dev/null
# Look for documented failure modes
grep -riE "failure.*mode|what.*happens.*when|cascad|circuit.*break" docs/ runbooks/ --include="*.md" 2>/dev/null
Ask user:
- "Have you ever stress tested to find where things break?"
- "What happens when you hit 5x or 10x normal traffic?"
- "Do you know your failure modes? (timeout, OOM, connection pool exhaustion, etc.)"
Common failure modes to document:
| Failure Mode | Symptoms |
|---|---|
| Connection pool exhaustion | Requests queue, then timeout |
| Memory exhaustion (OOM) | Process killed, restarts |
| CPU saturation | Response times spike, eventual timeouts |
| Database locks | Queries queue, deadlocks possible |
| External API rate limits | 429 errors from dependencies |
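When reviewing the output of a stress run, it can help to tag error lines with the failure modes from the table. The following is a rough sketch under stated assumptions: the regular expressions and the sample log lines are illustrative only and will need adjusting to the messages your stack actually emits. `FAILURE_SIGNATURES`, `classify`, and `tally` are hypothetical names.

```python
import re
from collections import Counter

# Illustrative patterns only; real messages vary by database, runtime, and
# dependency. CPU saturation is left out because it usually shows up in
# metrics (CPU usage, latency percentiles) rather than in log lines.
FAILURE_SIGNATURES = [
    ("Connection pool exhaustion",
     re.compile(r"pool.*(exhaust|timeout)|too many connections", re.I)),
    ("Memory exhaustion (OOM)",
     re.compile(r"out of memory|oom[- ]?kill", re.I)),
    ("Database locks",
     re.compile(r"deadlock|lock wait timeout", re.I)),
    ("External API rate limits",
     re.compile(r"\b429\b|rate limit", re.I)),
]


def classify(line: str) -> str:
    """Return the first failure mode whose pattern matches the log line."""
    for name, pattern in FAILURE_SIGNATURES:
        if pattern.search(line):
            return name
    return "Unclassified"


def tally(lines: list[str]) -> Counter:
    """Count how often each failure mode appears in a list of log lines."""
    return Counter(classify(line) for line in lines)


# Made-up sample lines for demonstration.
SAMPLE_LINES = [
    "ERROR: connection pool exhausted after 30s",
    "kernel: Out of memory: Killed process 4242 (node)",
    "ERROR 1213: Deadlock found when trying to get lock",
    "upstream responded 429 Too Many Requests",
    "GET /health 200",
]

if __name__ == "__main__":
    for mode, count in tally(SAMPLE_LINES).most_common():
        print(f"{mode}: {count}")
```

The failure mode that appears earliest in the run, not the one with the highest count, is usually the best candidate for "what breaks first".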
Cross-reference with:
- LST-005 (capacity limits are the boundary before breaking)
- LST-007 (graceful degradation is the response to stress)
- Section 26 (high availability) - understanding failure modes informs HA design
Pass criteria:
- Stress tests have been run to find breaking points
- Failure modes documented (what breaks first, how it manifests)
- Team understands the cascade (DB fails → API queues → timeouts → user errors)
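The cascade in the last criterion can be made concrete with a small queue model. This is a deliberately simplified sketch, not a model of any particular system. It assumes a constant arrival rate, a per-second service capacity that drops when the database degrades, and a fixed client timeout. `simulate_cascade` and `Tick` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Tick:
    second: int
    served: int     # requests completed this second
    queued: int     # requests still waiting at the end of the second
    timed_out: int  # requests dropped because they would exceed the timeout


def simulate_cascade(
    arrival_rps: int,
    capacity_by_second: list[int],
    timeout_s: int,
) -> list[Tick]:
    """Simulate DB slowdown -> API queueing -> timeouts -> user errors."""
    backlog = 0
    timeline: list[Tick] = []
    for second, capacity in enumerate(capacity_by_second):
        backlog += arrival_rps
        served = min(backlog, capacity)
        backlog -= served
        # At the current capacity, only capacity * timeout_s queued requests
        # can be served before their timeout; the rest surface as user errors.
        max_backlog = capacity * timeout_s
        timed_out = max(0, backlog - max_backlog)
        backlog -= timed_out
        timeline.append(Tick(second, served, backlog, timed_out))
    return timeline


if __name__ == "__main__":
    # 1000 RPS arriving; capacity falls from 1200 to 400 RPS at second 2.
    capacity = [1200, 1200, 400, 400, 400, 400]
    for tick in simulate_cascade(1000, capacity, timeout_s=2):
        print(tick)
```

In this run the queue starts to build in the second where capacity drops, and timeouts appear one second later once the backlog exceeds what can be served within the timeout. That gap between queue growth and user-visible errors is the kind of detail worth recording in a failure mode document.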
Fail criteria:
- Never stress tested ("afraid to break things")
- Only discovered breaking points during real incidents
- Breaking points unknown
Notes: Stress testing is different from load testing. Load testing validates expected capacity. Stress testing intentionally exceeds capacity to understand failure behavior. Both are valuable.
Evidence to capture:
- Whether stress testing has been performed
- Known breaking points and failure modes
- Which component fails first under extreme load