Section 36 · Operations & Incident Management
Load & Stress Testing
Performance baselines, capacity planning, load testing practices, and resilience under extreme conditions
This guide walks you through auditing a project's load and stress testing practices: performance baselines, capacity planning, and resilience under extreme conditions.
The Goal: Known Limits, Graceful Failures
Your system should never surprise you under load. Know your capacity, understand your breaking points, and fail gracefully when pushed beyond limits.
- Measured — Baselines and capacity limits derived from actual testing, not guesswork
- Predictable — Breaking points identified before they occur in production
- Resilient — System degrades gracefully under overload with circuit breakers and load shedding
- Validated — Auto-scaling and recovery mechanisms tested, not just configured
Before You Start
- Identify the technology stack (affects which load testing tools are appropriate)
- Understand traffic patterns (steady, spiky, seasonal)
- Check for existing load test scripts (k6, Artillery, Locust, Gatling, etc.)
- Review recent incidents (any caused by traffic spikes or capacity issues?)
Load Testing Setup
Load testing requires tooling. Modern options like k6, Artillery, or Locust let you script realistic traffic patterns and measure how the system behaves under them.
“When was your last load test?”
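To make the idea of scripted traffic concrete, here is a minimal sketch of a load generator using only the Python standard library. It is not a substitute for k6, Artillery, or Locust; the function name `run_load` and its parameters are hypothetical, and `send_request` stands in for whatever call exercises your system.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple


def run_load(send_request: Callable[[], object],
             total_requests: int,
             concurrency: int) -> Tuple[List[float], int]:
    """Call send_request total_requests times using `concurrency` threads.

    Returns (latencies in seconds of successful calls, count of failed calls).
    A call counts as failed when send_request raises an exception.
    """
    def timed_call(_: int) -> Tuple[float, bool]:
        start = time.perf_counter()
        try:
            send_request()
            ok = True
        except Exception:
            ok = False
        return time.perf_counter() - start, ok

    # The pool caps how many requests are in flight at the same time.
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(timed_call, range(total_requests)))

    latencies = [elapsed for elapsed, ok in results if ok]
    failures = sum(1 for _, ok in results if not ok)
    return latencies, failures
```

In practice `send_request` would wrap an HTTP call to a test environment. A dedicated tool adds ramp-up profiles, think time, and reporting that this sketch leaves out.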
You can't tell whether performance has degraded unless you know what "normal" looks like. Baselines are the reference point for all performance work.
“What's your p95 response time right now?”
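A baseline is usually recorded as a handful of latency percentiles. The sketch below computes them from raw samples with the nearest-rank method; the function names and the choice of p50/p95/p99 are assumptions for illustration, and your load testing tool may use a different interpolation.

```python
import math
from typing import Dict, List


def percentile(samples: List[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def baseline_summary(latencies_ms: List[float]) -> Dict[str, float]:
    """Summarize a run as the percentiles typically tracked in a baseline."""
    return {
        "p50": percentile(latencies_ms, 50),
        "p95": percentile(latencies_ms, 95),
        "p99": percentile(latencies_ms, 99),
    }
```

Storing this summary alongside the commit and the test conditions (load level, environment) gives later runs something concrete to compare against.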
Pre-Release Testing
Performance regressions should be caught before they reach production. Load testing before significant releases validates that new code doesn't break under realistic traffic.
“Last release — did you load test before or after shipping?”
Quick, lightweight load tests running in CI catch obvious regressions automatically. These are not full production-scale tests, just sanity checks.
“Would a 10x response time regression slip past your CI?”
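One way such a CI sanity check could work is to compare the measured p95 against a stored baseline and fail the build when it exceeds an allowed margin. The function name and the 20% default tolerance below are assumptions, not recommendations from this guide; pick a margin that matches the noise in your CI environment.

```python
def within_latency_budget(baseline_p95_ms: float,
                          current_p95_ms: float,
                          max_increase_ratio: float = 0.20) -> bool:
    """Return True when the current p95 is no more than max_increase_ratio
    above the baseline p95 (0.20 means up to 20% slower is tolerated)."""
    if baseline_p95_ms <= 0:
        raise ValueError("baseline_p95_ms must be positive")
    if max_increase_ratio < 0:
        raise ValueError("max_increase_ratio must not be negative")
    allowed_ms = baseline_p95_ms * (1 + max_increase_ratio)
    return current_p95_ms <= allowed_ms
```

A CI step would run a short load test, compute the current p95, and exit with a non-zero status when this function returns False. A 10x regression fails any reasonable tolerance.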
Stress Testing
Stress testing pushes beyond normal capacity to understand how the system fails. The goal is not just to observe that "it slows down" but to identify the specific failure modes and how they cascade.
“What breaks first under overload — database or API?”
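A stress test is often run as a series of increasing load steps. Assuming each step records a target request rate, a p95 latency, and an error rate, the sketch below finds the first step that breaches your limits. The `StepResult` type and the threshold parameters are hypothetical names for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StepResult:
    target_rps: int      # requests per second attempted in this step
    p95_ms: float        # observed 95th percentile latency
    error_rate: float    # fraction of failed requests, 0.0 to 1.0


def find_breaking_point(steps: List[StepResult],
                        max_p95_ms: float,
                        max_error_rate: float) -> Optional[StepResult]:
    """Return the lowest-load step that exceeds either limit,
    or None when every step stayed within both limits."""
    for step in sorted(steps, key=lambda s: s.target_rps):
        if step.p95_ms > max_p95_ms or step.error_rate > max_error_rate:
            return step
    return None
```

This identifies where the system breaks, not what breaks. Answering the second question means correlating that step with database, API, and dependency metrics from the same time window.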
When load exceeds capacity, good systems degrade gracefully instead of crashing completely. They shed load, return cached responses, or disable non-critical features.
“When a dependency dies, does it take everything down?”
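The circuit breaker pattern mentioned in the goals is one way to keep a dead dependency from taking everything down. The class below is a minimal, single-threaded sketch under assumed defaults (three consecutive failures, 30 second reset); the class and parameter names are hypothetical, and production code would more likely use an established library and add locking.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Skips a failing dependency and serves a fallback until a timeout passes."""

    def __init__(self,
                 failure_threshold: int = 3,
                 reset_timeout_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._failure_threshold = failure_threshold
        self._reset_timeout_s = reset_timeout_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means closed

    def call(self, primary: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_timeout_s:
                # Open: do not touch the dependency at all.
                return fallback()
            # Timeout elapsed: fall through and allow one trial call.
        try:
            result = primary()
        except Exception:
            self._failures += 1
            trial_failed = self._opened_at is not None
            if trial_failed or self._failures >= self._failure_threshold:
                # Open (or re-open) the circuit and restart the timeout.
                self._opened_at = self._clock()
            return fallback()
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A typical fallback returns a cached response or a reduced feature set, which matches the degradation options described above.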
Auto-scaling that's never been triggered is theoretical. It might not scale fast enough, might have permission issues, or might hit account limits. If you rely on auto-scaling, test it.
“Auto-scaling configured but never triggered — sure it works?”
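When you do trigger auto-scaling in a test, it helps to measure how long it took to reach the capacity you needed. The sketch below assumes you sampled the instance count during the test as (timestamp in seconds, instance count) pairs; how you collect those samples depends on your platform and is not shown.

```python
from typing import List, Optional, Tuple


def time_to_scale(samples: List[Tuple[float, int]],
                  load_start_s: float,
                  target_instances: int) -> Optional[float]:
    """Return seconds from load start until the instance count first
    reached target_instances, or None if it never did in the samples."""
    for timestamp_s, instance_count in sorted(samples):
        if timestamp_s >= load_start_s and instance_count >= target_instances:
            return timestamp_s - load_start_s
    return None
```

A result of None means scaling never reached the target during the test window, which is worth investigating for the causes named above: slow scale-up, permission issues, or account limits.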