LST-006 recommended stress-testing

Breaking points identified (stress testing)

Stress testing pushes the system beyond its normal capacity to learn how it fails. The goal is not just to observe that "it slows down" but to identify the specific failure modes and how they cascade.

Question to ask

"What breaks first under overload — database or API?"

Pass criteria

  • Stress tests have been run to find breaking points
  • Failure modes documented (what breaks first, how it manifests)
  • Team understands the cascade (DB fails → API queues → timeouts → user errors)

Fail criteria

  • Never stress tested ("afraid to break things")
  • Only discovered breaking points during real incidents
  • Breaking points unknown

Verification guide

Severity: Recommended

Stress testing pushes the system beyond its normal capacity to learn how it fails. A useful finding is not "it slows down" but something specific, such as "at 2000 RPS the database connection pool is exhausted and requests start failing with X error."
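
One way to arrive at a statement like "at 2000 RPS ..." is a stepped ramp: raise the request rate in increments and record the first rate at which the error rate crosses a threshold. The following is a minimal sketch under stated assumptions, not a replacement for a dedicated load tool. `TARGET_URL`, `run_step`, and `find_breaking_point` are hypothetical names introduced here, and the naive curl burst only approximates a sustained rate.

```shell
# Sketch of a stepped stress ramp. All names here are illustrative.

# run_step RPS: print the integer error percentage observed at that rate.
# Assumption: TARGET_URL points at a non-production endpoint.
run_step() {
  local rps="$1" i=0 tmp fail
  tmp=$(mktemp)
  while [ "$i" -lt "$rps" ]; do
    # -m 5 caps each request at 5 seconds; failed connections report code 000.
    curl -s -o /dev/null -m 5 -w '%{http_code}\n' "$TARGET_URL" >>"$tmp" &
    i=$((i + 1))
  done
  wait
  # Anything that is not a 2xx or 3xx status counts as a failure.
  fail=$(grep -cv '^[23]' "$tmp" || true)
  rm -f "$tmp"
  echo $(( fail * 100 / rps ))
}

# find_breaking_point START STEP MAX THRESHOLD_PCT
# Prints the first rate whose error percentage exceeds the threshold,
# or "none" (exit status 1) if the ramp reaches MAX without failing.
find_breaking_point() {
  local rps="$1" step="$2" max="$3" threshold="$4" err
  while [ "$rps" -le "$max" ]; do
    err=$(run_step "$rps")
    echo "rps=$rps error_pct=$err" >&2
    if [ "$err" -gt "$threshold" ]; then
      echo "$rps"
      return 0
    fi
    rps=$((rps + step))
  done
  echo "none"
  return 1
}
```

A run might look like `TARGET_URL=https://staging.example.test/health find_breaking_point 100 100 3000 5` (the URL is a placeholder). The per-step log on stderr records where errors begin; which component failed at that rate still has to be read from metrics and logs.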

Check automatically:

  1. Search the repository for evidence of stress testing (documentation, scripts, chaos tooling, documented failure modes):
# Look for stress test documentation
grep -riE "stress.*test|breaking.*point|failure.*mode|max.*load|overload" docs/ README.md CLAUDE.md --include="*.md" 2>/dev/null

# Look for stress test scripts (often separate from load tests);
# match names case-insensitively and skip node_modules entirely
find . -name node_modules -prune -o -type f -iname "*stress*" -print 2>/dev/null

# Check for chaos engineering / failure injection
grep -riE "chaos|gremlin|litmus|failure.*inject" docs/ .github/ package.json --include="*.md" --include="*.yml" --include="*.json" 2>/dev/null

# Look for documented failure modes
grep -riE "failure.*mode|what.*happens.*when|cascad|circuit.*break" docs/ runbooks/ --include="*.md" 2>/dev/null
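
The searches above can be folded into one summary line for a quick first pass. This is a sketch only: `stress_evidence` is a hypothetical helper name, the patterns are a subset of those used above, and a count of zero means "nothing found by keyword", not "never stress tested".

```shell
# stress_evidence ROOT: count Markdown files that mention stress-testing
# terms and files whose names contain "stress" (node_modules is skipped).
stress_evidence() {
  local root="${1:-.}" docs scripts
  docs=$(grep -rilE --include="*.md" \
    "stress.*test|breaking.*point|failure.*mode" "$root" 2>/dev/null |
    wc -l | tr -d ' ')
  scripts=$(find "$root" -name node_modules -prune -o \
    -type f -iname "*stress*" -print 2>/dev/null |
    wc -l | tr -d ' ')
  echo "docs=$docs scripts=$scripts"
}
```

Calling `stress_evidence .` from the repository root prints something like `docs=2 scripts=1`. Any non-zero count is a pointer to files worth reading, and the questions under "Ask user" still apply either way.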

Ask user:

  • "Have you ever stress tested to find where things break?"
  • "What happens when you hit 5x or 10x normal traffic?"
  • "Do you know your failure modes? (timeout, OOM, connection pool exhaustion, etc.)"

Common failure modes to document:

Failure Mode                 | Symptoms
Connection pool exhaustion   | Requests queue, then timeout
Memory exhaustion (OOM)      | Process killed, restarts
CPU saturation               | Response times spike, eventual timeouts
Database locks               | Queries queue, deadlocks possible
External API rate limits     | 429 errors from dependencies
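
When reviewing logs from a stress run, a rough first-pass classifier can map error lines onto the failure modes listed above. The sketch below is illustrative only: `classify_failure` is a hypothetical helper, and the matched phrases are assumptions about typical log wording that should be replaced with the messages your own stack emits. CPU saturation is left out because it shows up in latency metrics rather than in a recognizable log line.

```shell
# classify_failure LINE: print a failure-mode label for one log line.
# The matched phrases are assumed examples, not an exhaustive list.
classify_failure() {
  local line
  line=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$line" in
    *"too many connections"*|*"connection pool"*|*"pool exhausted"*)
      echo "connection_pool_exhaustion" ;;
    *"out of memory"*|*"oomkilled"*|*"oom-kill"*)
      echo "memory_exhaustion" ;;
    # Checked before the generic timeout case so lock timeouts are not lost.
    *"deadlock"*|*"lock wait timeout"*)
      echo "database_locks" ;;
    *"429"*|*"rate limit"*)
      echo "external_rate_limit" ;;
    *"timeout"*|*"timed out"*)
      echo "timeout_unclassified" ;;
    *)
      echo "unknown" ;;
  esac
}
```

Feeding each error line through the function and tallying the labels (for example with `sort | uniq -c`) gives a quick view of which failure mode dominates at each load step.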

Cross-reference with:

  • LST-005 (capacity limits are the boundary before breaking)
  • LST-007 (graceful degradation is the response to stress)
  • Section 26 (high availability) - understanding failure modes informs HA design

Notes: Stress testing is different from load testing. Load testing validates expected capacity. Stress testing intentionally exceeds capacity to understand failure behavior. Both are valuable.
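
The difference shows up in the rate schedule each test uses. As a sketch, with the step percentages and multipliers chosen here purely for illustration (the 5x and 10x steps echo the question under "Ask user"), a load plan stays at or below the expected peak while a stress plan deliberately goes past it. `load_plan` and `stress_plan` are hypothetical helper names.

```shell
# load_plan BASELINE_RPS: rates up to, but not beyond, the expected peak.
load_plan() {
  local base="$1" pct
  for pct in 50 75 100; do
    echo $(( base * pct / 100 ))
  done
}

# stress_plan BASELINE_RPS: rates that intentionally exceed the expected peak.
stress_plan() {
  local base="$1" mult
  for mult in 1 2 5 10; do
    echo $(( base * mult ))
  done
}
```

With a baseline of 200 RPS, `load_plan 200` yields 100, 150, and 200, while `stress_plan 200` yields 200, 400, 1000, and 2000.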

Evidence to capture:

  • Whether stress testing has been performed
  • Known breaking points and failure modes
  • What component fails first under extreme load
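
The three items above could be recorded in a small fixed-format note so that results are comparable between audits. This is only a sketch: `record_evidence` and its field names are hypothetical, not a format this checklist prescribes.

```shell
# record_evidence TESTED [BREAKING_POINT] [FIRST_FAILURE]
# Prints one evidence record; missing values are recorded as "unknown".
record_evidence() {
  local tested="$1" breaking_point="$2" first_failure="$3"
  printf 'check: LST-006\n'
  printf 'stress_tested: %s\n' "$tested"
  printf 'breaking_point: %s\n' "${breaking_point:-unknown}"
  printf 'first_component_to_fail: %s\n' "${first_failure:-unknown}"
}
```

For example, `record_evidence yes "2000 RPS" "database connection pool"` fills in every field, while `record_evidence no` leaves the breaking point and first failing component as unknown.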

Section: 36. Load & Stress Testing (Operations & Incident Management)