Section 35 · API & Security
Incident Response
On-call coverage, escalation procedures, runbooks, and post-mortem practices
This guide walks you through auditing a project's incident response capabilities: on-call coverage, escalation procedures, runbooks, and post-mortem practices.
The Goal: Calm Under Pressure
Incidents are stressful. Good preparation turns chaos into a checklist, so that when production breaks at 3am, everyone knows what to do.
- Owned — On-call coverage and escalation paths are defined so incidents have clear ownership at any time
- Reachable — Emergency contact information is documented and accessible even during outages
- Runbooked — Common incident types have step-by-step playbooks any team member can follow
- Learning — Post-mortem practices capture learnings and generate tracked action items
- Improving — Incident response matures from ad-hoc handling to continuous improvement
Before You Start
- Identify team size and coverage needs (24/7 vs business hours only)
- Identify incident management tools (PagerDuty, Opsgenie, incident.io, etc.)
- Check for existing runbooks/playbooks (docs/, runbooks/, wiki)
- Review recent incidents (if any) to understand current practices
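As a starting point for the runbook check above, a short script can list files whose names suggest incident-response documentation. This is a minimal sketch: the keyword list and both function names are assumptions made for illustration, and it only sees Markdown files in the repository, so a wiki hosted elsewhere still needs a manual look.

```python
from pathlib import Path

# Hypothetical keyword list; tune it to your team's naming conventions.
KEYWORDS = (
    "runbook", "playbook", "incident",
    "on-call", "oncall", "postmortem", "post-mortem",
)


def is_runbook_path(rel_path: str) -> bool:
    """Return True if the path name suggests incident-response documentation."""
    lowered = rel_path.lower()
    return any(keyword in lowered for keyword in KEYWORDS)


def find_runbook_candidates(repo_root: str) -> list[str]:
    """List Markdown files under repo_root whose relative path matches a keyword."""
    root = Path(repo_root)
    matches = []
    for path in root.rglob("*.md"):
        rel = path.relative_to(root).as_posix()
        if is_runbook_path(rel):
            matches.append(rel)
    return sorted(matches)
```

Calling `find_runbook_candidates(".")` from the repository root gives a list to review by hand. An empty result does not prove that no runbooks exist, only that none are named in an obvious way.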
On-Call and Escalation
When incidents happen outside business hours, someone needs to be responsible. A defined rotation ensures 24/7 coverage without burning out individuals.
“Who's getting paged at 2am if prod goes down tonight?”
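To make the rotation question concrete, the sketch below computes who holds the pager at a given moment for a simple fixed-order rotation. It assumes equal-length shifts (one week by default) and a known rotation start time. The function name and parameters are illustrative, and real schedules in tools such as PagerDuty or Opsgenie also handle overrides and swaps, which this does not.

```python
from datetime import datetime, timedelta, timezone


def on_call_at(
    rotation: list[str],
    start: datetime,
    when: datetime,
    shift: timedelta = timedelta(weeks=1),
) -> str:
    """Return the person on call at `when` for a fixed-order rotation.

    `rotation` is the ordered list of names, `start` is when the first
    person's first shift began, and `shift` is the length of each shift.
    """
    if not rotation:
        raise ValueError("rotation must contain at least one person")
    if when < start:
        raise ValueError("`when` is before the rotation started")
    # Number of whole shifts completed since the rotation began.
    shifts_elapsed = (when - start) // shift
    return rotation[shifts_elapsed % len(rotation)]


# Example with made-up names and dates.
team = ["alice", "bob", "carol"]
rotation_start = datetime(2024, 1, 1, tzinfo=timezone.utc)
tonight = datetime(2024, 1, 9, 2, 0, tzinfo=timezone.utc)
primary = on_call_at(team, rotation_start, tonight)
```

If nobody on the team can answer the question without running something like this, the rotation is probably not visible enough.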
When the on-call person can't resolve an issue alone, they need to know who to escalate to. Clear paths prevent panic during incidents.
“What happens when the on-call person is stuck and panicking?”
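One way to write an escalation path down is as an ordered list of levels, each with a time limit for acknowledging the page before the next level is notified. The sketch below is an assumption about how such a policy could be modelled. The level names and timeouts are invented, and incident management tools provide their own policy formats.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class EscalationLevel:
    contact: str             # who gets notified at this level
    ack_timeout: timedelta   # how long to wait for an acknowledgement


def current_level(
    policy: list[EscalationLevel], unacked_for: timedelta
) -> EscalationLevel:
    """Return the level that should be paged after `unacked_for` without an ack."""
    if not policy:
        raise ValueError("policy must contain at least one level")
    remaining = unacked_for
    for level in policy[:-1]:
        if remaining < level.ack_timeout:
            return level
        remaining -= level.ack_timeout
    # Past every earlier timeout: stay on the final level.
    return policy[-1]


# Hypothetical three-level policy.
policy = [
    EscalationLevel("primary on-call", timedelta(minutes=15)),
    EscalationLevel("secondary on-call", timedelta(minutes=15)),
    EscalationLevel("engineering manager", timedelta(minutes=30)),
]
```

With this policy, an alert left unacknowledged for 20 minutes has already moved to the secondary on-call. Writing the timeouts down explicitly means the stuck engineer does not have to decide when it is acceptable to ask for help.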
During an incident, you shouldn't be hunting for phone numbers. A readily accessible contact list with multiple reach methods saves critical minutes.
“If Slack is down, how does your team reach each other?”
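The "multiple reach methods" requirement can be checked mechanically if the contact list is kept as structured data. The following sketch flags people with fewer than two reach methods, or whose only methods are chat tools. The data layout, the `CHAT_CHANNELS` set, and the sample entries are all assumptions for illustration.

```python
# Channels assumed to share a failure mode with the team's chat tool.
CHAT_CHANNELS = {"slack", "teams"}


def contact_gaps(contacts: dict[str, dict[str, str]]) -> dict[str, str]:
    """Map each under-documented person to a short description of the gap."""
    gaps = {}
    for name, channels in contacts.items():
        # Ignore channels that are listed but left blank.
        usable = {kind for kind, value in channels.items() if value.strip()}
        if len(usable) < 2:
            gaps[name] = "fewer than two reach methods"
        elif usable <= CHAT_CHANNELS:
            gaps[name] = "no method that works when chat is down"
    return gaps


# Made-up contact list.
contacts = {
    "alice": {"slack": "@alice", "phone": "+1-555-0100"},
    "bob": {"slack": "@bob", "phone": ""},
    "carol": {"slack": "@carol", "teams": "carol@example.com"},
}
gaps = contact_gaps(contacts)
```

Here `gaps` reports bob and carol but not alice. The list itself also needs to live somewhere that stays reachable during an outage, which no script can verify for you.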
Incident management tools handle alerting, on-call scheduling, escalation, and incident tracking in one place.
“How does an alert wake someone up at 3am without Slack?”
Post-Mortems
Post-mortems turn incidents into learning opportunities. Blameless means focusing on systems and processes, not individuals.
“After your last incident, did you fix the system or blame a person?”
Post-mortems are worthless if action items never get done. Action items must be tracked in a system with clear ownership.
“How many post-mortem action items are still open from last year?”
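If action items are exported from an issue tracker, the question above becomes a simple query. The sketch below assumes a hypothetical `ActionItem` record with an owner, an opened date, and an optional closed date. The 90-day threshold is an arbitrary example, not a recommendation from this guide.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ActionItem:
    title: str
    owner: Optional[str]          # None means nobody is assigned
    opened: date
    closed: Optional[date] = None  # None means still open


def stale_items(
    items: list[ActionItem], today: date, max_age_days: int = 90
) -> list[ActionItem]:
    """Open items that have been open for more than `max_age_days`."""
    return [
        item for item in items
        if item.closed is None and (today - item.opened).days > max_age_days
    ]


def unowned_items(items: list[ActionItem]) -> list[ActionItem]:
    """Open items with no assigned owner."""
    return [item for item in items if item.closed is None and not item.owner]


# Made-up sample data.
items = [
    ActionItem("Add alert on queue depth", "alice", date(2024, 1, 10)),
    ActionItem("Document DB failover", None, date(2024, 5, 1)),
    ActionItem("Fix retry storm", "bob", date(2024, 2, 1), closed=date(2024, 2, 20)),
]
today = date(2024, 6, 1)
stale = stale_items(items, today)
unowned = unowned_items(items)
```

In this sample, one item is stale and one is unowned. Either list being non-empty is a signal that post-mortem follow-through needs attention.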