LST-007 recommended stress-testing

Graceful degradation under load

When load exceeds capacity, good systems degrade gracefully instead of crashing completely. They shed load, return cached responses, or disable non-critical features.
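As an illustration of "return cached responses", here is a minimal Python sketch of a stale-cache fallback. The class name `StaleCache`, the `fetch` callable, and the 300-second staleness limit are hypothetical choices made for this example and are not taken from any particular library.

```python
import time
from typing import Callable, Dict, Tuple


class StaleCache:
    """Serve the last good value when the live fetch fails (illustrative only)."""

    def __init__(self, fetch: Callable[[str], str], max_stale_seconds: float = 300.0):
        self._fetch = fetch                  # hypothetical call to a dependency
        self._max_stale = max_stale_seconds  # assumed limit on how old a fallback may be
        self._store: Dict[str, Tuple[float, str]] = {}

    def get(self, key: str) -> Tuple[str, bool]:
        """Return (value, is_stale)."""
        try:
            value = self._fetch(key)
        except Exception:
            cached = self._store.get(key)
            # Fall back only if we have a copy that is still fresh enough.
            if cached is not None and time.monotonic() - cached[0] <= self._max_stale:
                return cached[1], True
            raise  # nothing usable cached: surface the error instead of hanging
        self._store[key] = (time.monotonic(), value)
        return value, False
```

Returning the `is_stale` marker lets the caller tell users (or metrics) that the response is degraded rather than silently serving old data.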

Question to ask

"When a dependency dies, does it take everything down?"

Pass criteria

  • Degradation strategy documented and implemented
  • Circuit breakers protect against cascading failures
  • Non-critical features can be disabled (feature flags, config)
  • System returns errors gracefully rather than hanging/crashing

Fail criteria

  • System crashes or hangs completely under overload
  • No circuit breakers (one slow dependency takes down everything)
  • "We just hope it doesn't happen"
  • Degradation is uncontrolled (random failures)

Verification guide

Severity: Recommended

Check automatically:

  1. Look for degradation patterns in code:
# Look for circuit breakers, load shedding, degradation patterns
grep -riE "circuit.*breaker|load.*shed|graceful.*degrad|fallback|bulkhead" src/ lib/ app/ --include="*.ts" --include="*.js" --include="*.py" --include="*.go" 2>/dev/null

# Check for libraries that implement these patterns
grep -E "opossum|cockatiel|hystrix|resilience4j|polly|circuitbreaker|pybreaker" package.json requirements.txt go.mod Gemfile 2>/dev/null

# Look for rate limiting / throttling at app level
grep -riE "rate.*limit|throttl|too.*many.*request|429" src/ lib/ app/ --include="*.ts" --include="*.js" --include="*.py" --include="*.go" 2>/dev/null

# Check for feature flags that could disable features under load
grep -riE "feature.*flag|launchdarkly|flagsmith|unleash|growthbook" package.json src/ --include="*.json" --include="*.ts" 2>/dev/null

# Look for queue/backpressure patterns
grep -riE "backpressure|queue.*full|reject.*request|shed" src/ lib/ app/ --include="*.ts" --include="*.js" --include="*.py" --include="*.go" 2>/dev/null

Ask user:

  • "What happens when your system is overloaded?"
  • "Do you have circuit breakers for external dependencies?"
  • "Can you disable non-critical features under load?"
  • "Is there a 'degraded mode' the system can operate in?"
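When discussing a "degraded mode" with the user, it may help to show what a kill switch can look like. The Python sketch below assumes flags are read from environment variables named `FEATURE_<NAME>`; that naming scheme, the `flag_enabled` helper, and the product-page example are all hypothetical. A real system would more likely query its feature-flag service.

```python
import os
from typing import Callable, List


def flag_enabled(name: str, default: bool = True) -> bool:
    """Read a kill switch from the environment, e.g. FEATURE_RECOMMENDATIONS=off."""
    raw = os.environ.get(f"FEATURE_{name.upper()}")
    if raw is None:
        return default
    return raw.strip().lower() not in {"0", "off", "false", "no"}


def render_product_page(
    product_id: str,
    fetch_recommendations: Callable[[str], List[str]],
) -> dict:
    """Build the page; skip the non-critical recommendations call when it is switched off."""
    page = {"product_id": product_id, "recommendations": [], "degraded": False}
    if flag_enabled("recommendations"):
        page["recommendations"] = fetch_recommendations(product_id)
    else:
        # Core content is still served; only the optional feature is dropped.
        page["degraded"] = True
    return page
```

The point to verify with the user is that the critical path (here, the product itself) does not depend on the feature being switched off.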

Graceful degradation strategies:

Strategy          Description
Circuit breakers  Stop calling failing services, return fallback
Load shedding     Reject excess requests early (429)
Feature flags     Disable non-critical features
Cached fallbacks  Return stale data instead of failing
Queue limits      Cap queue depth, reject when full
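To make the first strategy concrete, here is a minimal circuit breaker sketch in Python. It is a simplified illustration and not the implementation used by opossum, pybreaker, or any other library. The class name, the threshold of 3 failures, and the 30-second reset timeout are assumptions chosen for the example.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Closed -> open after N consecutive failures -> half-open after a timeout."""

    def __init__(
        self,
        failure_threshold: int = 3,      # assumed value for illustration
        reset_timeout: float = 30.0,     # assumed value for illustration
        clock: Callable[[], float] = time.monotonic,
    ):
        self._threshold = failure_threshold
        self._reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self._reset_timeout:
            return "half-open"  # allow one trial call through
        return "open"

    def call(self, func: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self.state == "open":
            return fallback()  # do not touch the failing dependency at all
        try:
            result = func()
        except Exception:
            self._failures += 1
            # A failed half-open trial, or reaching the threshold, (re)opens the circuit.
            if self._opened_at is not None or self._failures >= self._threshold:
                self._opened_at = self._clock()
            return fallback()
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

While the circuit is open, callers get the fallback immediately, which is what keeps one slow dependency from tying up every request thread.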

Cross-reference with:

  • LST-006 (stress testing reveals what needs degradation handling)
  • LST-008 (auto-scaling is one response, degradation is another)
  • Section 30 (rate limiting) - rate limiting is a form of load shedding
  • Section 19 (error handling) - graceful errors under load
  • Section 33 (feature flags) - kill switches for degradation
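Load shedding, of which rate limiting is one form, can be sketched as a cap on in-flight requests. The Python example below is a minimal illustration under the assumption that each request is handled by a callable. `LoadShedder` and `max_in_flight` are hypothetical names, and the `(status, body)` tuple stands in for whatever response type the framework in use provides.

```python
import threading
from typing import Callable, Tuple


class LoadShedder:
    """Cap the number of in-flight requests and reject the excess immediately."""

    def __init__(self, max_in_flight: int):
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def handle(self, handler: Callable[[], str]) -> Tuple[int, str]:
        # Non-blocking acquire: when no slot is free, fail fast instead of queueing.
        if not self._slots.acquire(blocking=False):
            return 429, "Too Many Requests"
        try:
            return 200, handler()
        finally:
            self._slots.release()
```

Rejecting early with a 429 keeps latency bounded for the requests that are accepted, rather than letting every request slow down together.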

Evidence to capture:

  • Degradation strategies in place (circuit breakers, load shedding, feature flags)
  • Libraries/patterns used
  • What features can be disabled under load

Section

36. Load & Stress Testing

Operations & Incident Management