LST-008 recommended stress-testing

Auto-scaling triggers tested

Auto-scaling that's never been triggered is theoretical. It might not scale fast enough, might have permission issues, or might hit account limits. If you rely on auto-scaling, test it.

Question to ask

"Auto-scaling configured but never triggered — sure it works?"

Pass criteria

  • Auto-scaling configured with appropriate triggers
  • Scaling has been tested under load (not just configured and hoped)
  • Scale-up time is known and acceptable
  • Account/quota limits checked (won't hit "max instances" unexpectedly)

Fail criteria

  • Auto-scaling configured but never tested
  • Scaling triggers misconfigured (scales on wrong metric, too slow)
  • Hits account limits during scale-up
  • No auto-scaling when traffic is unpredictable

Verification guide

Severity: Recommended

Check automatically:

  1. Check for auto-scaling configs:
# Check for Kubernetes HPA
find . \( -name "*.yaml" -o -name "*.yml" \) -not -path "*/node_modules/*" -print0 | xargs -0 grep -l "HorizontalPodAutoscaler" 2>/dev/null

# Check Terraform for auto-scaling
grep -riE "autoscal|aws_autoscaling|google_compute_autoscaler" terraform/ infrastructure/ --include="*.tf" 2>/dev/null

# Check for AWS auto-scaling
grep -riE "AutoScalingGroup|TargetTrackingScaling|ScalingPolicy" terraform/ cloudformation/ --include="*.tf" --include="*.yml" --include="*.yaml" --include="*.json" 2>/dev/null

# Check for GCP auto-scaling (Terraform uses min_replicas/max_replicas)
grep -riE "autoscaler|min_?replicas|max_?replicas" terraform/ --include="*.tf" 2>/dev/null

# Check for scaling documentation
grep -riE "auto.*scal|scale.*up|scale.*down|scaling.*trigger|hpa" docs/ README.md CLAUDE.md --include="*.md" 2>/dev/null

# Check Kubernetes manifests for resource requests (needed for HPA)
grep -riE "resources:|requests:|limits:|cpu:|memory:" k8s/ kubernetes/ helm/ --include="*.yaml" --include="*.yml" 2>/dev/null
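
The greps above only find configuration in the repository; they cannot tell whether scaling has ever fired. A minimal sketch for Kubernetes, assuming kubectl is installed and pointed at the right cluster: it lists each HPA and flags those whose status has no lastScaleTime. The function name flag_never_scaled is made up for this example, and a missing lastScaleTime is only a hint that scaling has never happened, not proof.

```shell
# Reads rows of "NAMESPACE NAME MIN MAX CURRENT LAST_SCALE" on stdin and
# prints one line for each HPA with no recorded scale event.
flag_never_scaled() {
  awk '$6 == "<none>" { print $1 "/" $2 ": no recorded scale event" }'
}

# Only query the cluster when kubectl is available.
if command -v kubectl >/dev/null 2>&1; then
  kubectl get hpa --all-namespaces --no-headers --request-timeout=10s \
    -o custom-columns='NS:.metadata.namespace,NAME:.metadata.name,MIN:.spec.minReplicas,MAX:.spec.maxReplicas,CURRENT:.status.currentReplicas,LAST:.status.lastScaleTime' \
    2>/dev/null | flag_never_scaled
fi
```

Any HPA this prints is a candidate for the "configured but never tested" fail criterion and is worth raising with the user.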

Ask user:

  • "Do you use auto-scaling?"
  • "Has it ever actually triggered? (intentionally or during an incident)"
  • "What are the scaling triggers? (CPU, memory, request count, custom metrics)"
  • "How long does it take to scale up?"
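
If nobody knows how long scale-up takes, it can be measured during a load test. A rough sketch, assuming some command exists that prints the current ready replica count: measure_scale_up is a hypothetical helper, and the deployment name "web" in the usage comment is a placeholder.

```shell
# measure_scale_up TARGET TIMEOUT_SECONDS COMMAND...
# Runs COMMAND (which must print the current ready replica count) every
# POLL_INTERVAL seconds (default 5). Prints the elapsed seconds once the
# count reaches TARGET, or returns 1 if TIMEOUT_SECONDS passes first.
measure_scale_up() {
  target=$1
  timeout=$2
  shift 2
  interval=${POLL_INTERVAL:-5}
  start=$(date +%s)
  while :; do
    current=$("$@" 2>/dev/null)
    elapsed=$(( $(date +%s) - start ))
    # An empty or non-numeric count is treated as "not there yet".
    if [ "${current:-0}" -ge "$target" ] 2>/dev/null; then
      echo "$elapsed"
      return 0
    fi
    if [ "$elapsed" -ge "$timeout" ]; then
      echo "timed out after ${elapsed}s at ${current:-0} replicas" >&2
      return 1
    fi
    sleep "$interval"
  done
}

# Usage (start the load test first, then run):
#   measure_scale_up 6 600 kubectl get deployment web \
#     -o jsonpath='{.status.readyReplicas}'
```

The result is coarse (limited by the poll interval) and includes provisioning plus cold start, which is the number that matters when judging whether scale-up is fast enough.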

Auto-scaling validation checklist:

  • Scaling triggers defined and appropriate
  • Min/max instances configured
  • Scale-up time measured (cold start + provisioning)
  • Account quotas checked (won't hit limits during scale)
  • Scaling has been tested under real load
  • Cooldown periods appropriate
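
The quota item comes down to simple arithmetic: the configured maximum plus whatever else already counts against the same quota must fit under the limit. A small sketch of that check, where check_quota_headroom is a hypothetical helper and all three numbers must be looked up by hand and be in the same unit (some providers count vCPUs rather than instances):

```shell
# check_quota_headroom MAX_INSTANCES OTHER_USAGE QUOTA
#   MAX_INSTANCES  configured auto-scaling maximum
#   OTHER_USAGE    usage by other workloads counted against the same quota
#   QUOTA          account or region limit
check_quota_headroom() {
  needed=$(( $1 + $2 ))
  if [ "$needed" -gt "$3" ]; then
    echo "FAIL: scaling to max needs $needed, quota is $3"
    return 1
  fi
  echo "OK: scaling to max needs $needed of $3 ($(( $3 - needed )) spare)"
}

check_quota_headroom 20 10 64
```

With these example numbers the check passes with 34 units spare; a max of 50 against the same quota would fail.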

Cross-reference with:

  • LST-006 (stress testing can validate auto-scaling)
  • LST-007 (graceful degradation while waiting for scale-up)
  • Section 26 (high availability) - auto-scaling is part of HA strategy
  • Section 38 (cost monitoring) - auto-scaling affects costs

Notes: Not all systems need auto-scaling. Fixed capacity is fine if traffic is predictable and you've sized appropriately. The key is: if you're relying on auto-scaling, test it.

Evidence to capture:

  • Auto-scaling configuration (triggers, min/max instances)
  • Whether scaling has been tested
  • Scale-up time under load
  • Any account limits that could block scaling
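
One possible way to collect this evidence in a single file, sketched for a Kubernetes setup. The function name capture_scaling_evidence and the SCALE_UP_SECONDS and QUOTA_RESULT variables are invented for this example; fill them in from your own measurements.

```shell
# capture_scaling_evidence OUTFILE
# Writes a plain-text evidence record. Cluster details are included only
# when kubectl is available; measured values come from the environment.
capture_scaling_evidence() {
  out=$1
  {
    echo "LST-008 auto-scaling evidence"
    echo "captured: $(date -u +%Y-%m-%dT%H:%M:%SZ)"
    if command -v kubectl >/dev/null 2>&1; then
      echo "--- HPA configuration (triggers, min/max) ---"
      kubectl get hpa --all-namespaces -o wide --request-timeout=10s 2>&1
      echo "--- HPA scaling events ---"
      kubectl get events --all-namespaces --request-timeout=10s \
        --field-selector involvedObject.kind=HorizontalPodAutoscaler 2>&1
    else
      echo "kubectl not found; record configuration manually"
    fi
    echo "scale-up time under load: ${SCALE_UP_SECONDS:-not measured}"
    echo "quota check: ${QUOTA_RESULT:-not checked}"
  } > "$out"
}

# Usage:
#   SCALE_UP_SECONDS=95 QUOTA_RESULT="OK" capture_scaling_evidence lst-008.txt
```

Kubernetes keeps events for a limited time by default, so an empty events section does not by itself prove that scaling never happened.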

Section

36. Load & Stress Testing

Operations & Incident Management