Section 36 · Operations & Incident Management

Load & Stress Testing

Performance baselines, capacity planning, load testing practices, and resilience under extreme conditions

8 items · 8 recommended

This guide walks you through auditing a project's load and stress testing practices: whether baselines exist, whether capacity limits are known and documented, and whether the system stays resilient under extreme conditions.

The Goal: Known Limits, Graceful Failures

Your system should never surprise you under load. Know your capacity, understand your breaking points, and fail gracefully when pushed beyond limits.

Measured — Baselines and capacity limits derived from actual testing, not guesswork
Predictable — Breaking points identified before they occur in production
Resilient — System degrades gracefully under overload with circuit breakers and load shedding
Validated — Auto-scaling and recovery mechanisms tested, not just configured

Before You Start

Identify the technology stack (affects which load testing tools are appropriate)
Understand traffic patterns (steady, spiky, seasonal)
Check for existing load test scripts (k6, Artillery, Locust, Gatling, etc.)
Review recent incidents (any caused by traffic spikes or capacity issues?)

load-testing-setup

LST-001

Load testing tool configured · recommended

You can't do load testing without a tool. Modern options like k6, Artillery, or Locust make it easy to script realistic traffic patterns and measure system behavior.

“When was your last load test?”
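To make concrete what a load testing tool does at its core, here is a minimal sketch using only the Python standard library. It is not a substitute for k6, Artillery, or Locust, which add ramping, reporting, and distributed execution. The names `run_load`, `LoadResult`, and `request_fn` are hypothetical; `request_fn` stands in for whatever issues one request against your system.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class LoadResult:
    latencies_s: List[float]  # one entry per successful call, in seconds
    errors: int               # number of calls that raised an exception


def run_load(request_fn: Callable[[], object],
             total_requests: int,
             concurrency: int) -> LoadResult:
    """Call request_fn total_requests times across `concurrency` threads."""

    def timed_call(_: int) -> Optional[float]:
        start = time.perf_counter()
        try:
            request_fn()
        except Exception:
            return None  # count as an error, record no latency
        return time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        outcomes = list(pool.map(timed_call, range(total_requests)))

    latencies = [o for o in outcomes if o is not None]
    return LoadResult(latencies_s=latencies,
                      errors=len(outcomes) - len(latencies))
```

A call such as `run_load(my_request, total_requests=200, concurrency=20)` returns the raw latencies and error count that a real tool would then summarize.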

LST-002

Baseline performance metrics established · recommended

You can't tell whether performance has degraded unless you know what "normal" looks like. Baselines are the reference point for all performance work.

“What's your p95 response time right now?”
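If you have raw latency samples but no tooling that reports percentiles, a baseline can be computed directly. The sketch below uses the nearest-rank method, which is one of several percentile definitions; your monitoring system may interpolate and give slightly different numbers. The function names are hypothetical.

```python
import math
from typing import Dict, Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    # Multiply before dividing to limit floating-point drift in the rank.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def baseline(samples_ms: Sequence[float]) -> Dict[str, float]:
    """Summarize latency samples into the percentiles most baselines track."""
    return {
        "p50": percentile(samples_ms, 50),
        "p95": percentile(samples_ms, 95),
        "p99": percentile(samples_ms, 99),
    }
```

Record the output alongside the date and the load level it was measured at; a p95 without its traffic level is hard to compare later.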

pre-release-testing

LST-003

Load testing before major releases · recommended

Performance regressions should be caught before they reach production. Load testing before significant releases validates that new code doesn't break under realistic traffic.

“Last release — did you load test before or after shipping?”

LST-004

Automated smoke load tests in CI · recommended

Quick, lightweight load tests running in CI catch obvious regressions automatically. These are not full production-scale tests, just sanity checks that fail the build when response times move far outside the baseline.

“Would a 10x response time regression slip past your CI?”
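A CI gate for this can be very small. The sketch below compares a measured p95 against a stored baseline and reports a regression when the ratio exceeds a limit. The function name and the default limit of 1.5x are assumptions for illustration; choose a limit that tolerates the normal noise of your CI runners.

```python
from typing import Tuple


def check_p95_regression(baseline_ms: float,
                         current_ms: float,
                         max_ratio: float = 1.5) -> Tuple[bool, str]:
    """Return (ok, message); ok is False when current p95 exceeds baseline * max_ratio."""
    if baseline_ms <= 0:
        raise ValueError("baseline_ms must be positive")
    ratio = current_ms / baseline_ms
    ok = ratio <= max_ratio
    verdict = "OK" if ok else "REGRESSION"
    message = (f"{verdict}: p95 {current_ms:.0f} ms vs baseline "
               f"{baseline_ms:.0f} ms ({ratio:.2f}x, limit {max_ratio:.2f}x)")
    return ok, message
```

In a pipeline step, print the message and exit non-zero when `ok` is False (for example `sys.exit(0 if ok else 1)`) so the build fails visibly.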

capacity-planning

LST-005

Capacity limits documented · recommended

"How much traffic can we handle?" is a question every team should be able to answer. Documented capacity limits inform scaling decisions, incident response, and business planning.

“How much traffic can you handle before things break?”
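One way to keep a documented limit honest is to store it as structured data next to the test that produced it. The sketch below is a hypothetical record format, not a standard; the field names and the example numbers are assumptions. It also computes headroom, the fraction of tested capacity still unused at the production peak.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CapacityLimit:
    service: str
    max_sustained_rps: float  # highest rate that met targets in the last load test
    peak_observed_rps: float  # highest rate seen in production
    tested_on: str            # test date, so readers can judge staleness

    def headroom(self) -> float:
        """Fraction of tested capacity unused at peak; negative means peak exceeds the tested limit."""
        if self.max_sustained_rps <= 0:
            raise ValueError("max_sustained_rps must be positive")
        return 1.0 - self.peak_observed_rps / self.max_sustained_rps
```

For example, a service tested to 1000 requests per second with a production peak of 600 has a headroom of 0.4, meaning 40% of tested capacity is unused at peak.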

stress-testing

LST-006

Breaking points identified (stress testing) · recommended

Stress testing pushes beyond normal capacity to understand how the system fails. The goal is not just to learn that "it slows down" but to understand the specific failure modes and how they cascade.

“What breaks first under overload — database or API?”
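A common way to locate a breaking point is a stepped ramp: raise the request rate in fixed increments and stop at the first stage that violates an error-rate or latency limit. The sketch below shows that loop. `measure` is a hypothetical callable that runs one load stage at the given rate and returns `(error_rate, p95_ms)`; the default thresholds are placeholders, not recommendations.

```python
from typing import Callable, Optional, Tuple


def find_breaking_point(measure: Callable[[int], Tuple[float, float]],
                        start_rps: int,
                        step_rps: int,
                        max_rps: int,
                        max_error_rate: float = 0.01,
                        max_p95_ms: float = 500.0) -> Tuple[Optional[int], Optional[int]]:
    """Step the load upward; return (last passing rate, first failing rate).

    The second value is None if no stage failed up to max_rps.
    """
    if step_rps <= 0:
        raise ValueError("step_rps must be positive")
    last_ok: Optional[int] = None
    rps = start_rps
    while rps <= max_rps:
        error_rate, p95_ms = measure(rps)
        if error_rate > max_error_rate or p95_ms > max_p95_ms:
            return last_ok, rps
        last_ok = rps
        rps += step_rps
    return last_ok, None
```

The rate alone does not tell you what failed. At the first failing stage, check which component saturated (connection pool, CPU, queue depth) so the result names a failure mode and not only a number.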

LST-007

Graceful degradation under load · recommended

When load exceeds capacity, good systems degrade gracefully instead of crashing completely. They shed load, return cached responses, or disable non-critical features.

“When a dependency dies, does it take everything down?”
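A circuit breaker is one of the mechanisms that keeps a dead dependency from dragging everything down. The sketch below is a deliberately simplified version, assuming a single-threaded caller and a consecutive-failure threshold; production libraries add thread safety, metrics, and finer half-open handling. The class and parameter names are hypothetical.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Stop calling a failing dependency for a cool-down period and serve a fallback instead."""

    def __init__(self,
                 failure_threshold: int = 3,
                 reset_after_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, fn: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after_s:
                return fallback()  # open: skip the dependency entirely
            # Cool-down elapsed: fall through and allow one trial call.
        try:
            result = fn()
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            return fallback()
        # Success closes the breaker and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Typical use wraps a non-critical call, for example `breaker.call(fetch_recommendations, lambda: cached_recommendations)`, where both names are placeholders for your own functions.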

LST-008

Auto-scaling triggers tested · recommended

Auto-scaling that's never been triggered is theoretical. It might not scale fast enough, might have permission issues, or might hit account limits. If you rely on auto-scaling, test it.

“Auto-scaling configured but never triggered — sure it works?”
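When you do trigger auto-scaling in a test, one useful number is the scale-out lag: how long after load started the system reached the replica count it needed. The sketch below computes that from samples you collect during the test. How you collect `(time, replica_count)` pairs depends on your platform and is not shown; the function name and sample format are assumptions.

```python
from typing import Iterable, Optional, Tuple


def scale_out_lag(samples: Iterable[Tuple[float, int]],
                  load_start_s: float,
                  target_replicas: int) -> Optional[float]:
    """Seconds from load start until replicas first reach the target.

    samples are (timestamp_s, replica_count) pairs; returns None if the
    target was never reached at or after load_start_s.
    """
    for timestamp_s, replicas in sorted(samples):
        if timestamp_s >= load_start_s and replicas >= target_replicas:
            return timestamp_s - load_start_s
    return None
```

A result of None is the finding that matters most: it means scaling never reached the target during the test, which points at the slow-scaling, permission, or account-limit problems described above.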