Section 35 · API & Security

Incident Response

On-call coverage, escalation procedures, runbooks, and post-mortem practices

7 items · 2 critical · 5 recommended

This guide walks you through auditing a project's incident response capabilities: on-call coverage, escalation procedures, runbooks, and post-mortem practices.

The Goal: Calm Under Pressure

Incidents are stressful. Good preparation turns chaos into a checklist, so that when production breaks at 3am, everyone knows what to do.

Owned — On-call coverage and escalation paths are defined so incidents have clear ownership at any time
Reachable — Emergency contact information is documented and accessible even during outages
Runbooked — Common incident types have step-by-step playbooks any team member can follow
Learning — Post-mortem practices capture learnings and generate tracked action items
Improving — Incident response matures from ad-hoc handling to continuous improvement

Before You Start

Identify team size and coverage needs (24/7 vs business hours only)
Identify incident management tools (PagerDuty, Opsgenie, incident.io, etc.)
Check for existing runbooks/playbooks (docs/, runbooks/, wiki)
Review recent incidents (if any) to understand current practices
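
The runbook check above can be partly automated. The following is a minimal sketch, assuming runbooks live in the repository as Markdown files under docs/ or runbooks/ and that their path mentions "runbook" or "playbook". The function name, the keyword list, and the Markdown-only filter are assumptions for illustration; a wiki has to be checked by hand.

```python
from pathlib import Path

# Directory names taken from the checklist; extend them for your repo layout.
CANDIDATE_DIRS = ("docs", "runbooks")
# Assumption: a runbook's relative path mentions one of these words.
KEYWORDS = ("runbook", "playbook")


def find_runbooks(repo_root):
    """Return sorted relative paths of Markdown files that look like runbooks."""
    root = Path(repo_root)
    found = set()
    for name in CANDIDATE_DIRS:
        base = root / name
        if not base.is_dir():
            continue
        for path in base.rglob("*.md"):
            rel = path.relative_to(root).as_posix()
            # Match on the whole relative path so runbooks/db.md counts too.
            if any(word in rel.lower() for word in KEYWORDS):
                found.add(rel)
    return sorted(found)
```

An empty result does not prove that no runbooks exist, only that none sit where this sketch looks.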

on-call-escalation

IR-001

On-call rotation defined recommended

When incidents happen outside business hours, someone needs to be responsible. A defined rotation ensures 24/7 coverage without burning out individuals.

“Who's getting paged at 2am if prod goes down tonight?”
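
One way to make the question above answerable is to treat the rotation as data. This is a minimal sketch, assuming a simple round-robin roster with fixed-length shifts; the function name and parameters are hypothetical, and real tools also handle overrides, holidays, and time zones.

```python
from datetime import datetime, timedelta, timezone


def on_call_at(roster, rotation_start, shift_length, when):
    """Return who is on call at `when` for a round-robin rotation."""
    if not roster:
        raise ValueError("roster must not be empty")
    if when < rotation_start:
        raise ValueError("`when` is before the rotation started")
    # Whole shifts completed since the rotation began.
    shifts_elapsed = (when - rotation_start) // shift_length
    return roster[shifts_elapsed % len(roster)]


# Example with a hypothetical three-person weekly rotation.
roster = ["alice", "bob", "carol"]
start = datetime(2024, 1, 1, tzinfo=timezone.utc)
tonight = datetime(2024, 1, 9, 2, 0, tzinfo=timezone.utc)
print(on_call_at(roster, start, timedelta(weeks=1), tonight))  # bob
```

If nobody on the team can produce this answer without asking around, the rotation is not really defined.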

IR-002

Escalation paths documented recommended

When the on-call person can't resolve an issue alone, they need to know who to escalate to. Clear paths prevent panic during incidents.

“What happens when the on-call person is stuck and panicking?”
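
An escalation path can be written down as an ordered list of levels, each with a time limit for acknowledgement. The sketch below is illustrative only: the `EscalationLevel` type, the contact labels, and the wait times are assumptions, not a recommended policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationLevel:
    contact: str
    wait_minutes: int  # minutes to wait for an acknowledgement at this level


def current_escalation_target(levels, minutes_unacknowledged):
    """Return the contact who should be handling the page right now."""
    if not levels:
        raise ValueError("at least one escalation level is required")
    elapsed = 0
    for level in levels:
        elapsed += level.wait_minutes
        if minutes_unacknowledged < elapsed:
            return level.contact
    # Every level has timed out: stay with the last contact in the chain.
    return levels[-1].contact


# Hypothetical three-level policy.
policy = [
    EscalationLevel("primary on-call", 15),
    EscalationLevel("secondary on-call", 15),
    EscalationLevel("engineering manager", 30),
]
print(current_escalation_target(policy, 20))  # secondary on-call
```

Writing the path down in this form makes gaps visible, such as a chain that ends with a single person and no further fallback.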

IR-003

Contact list for emergencies critical

During an incident, you shouldn't be hunting for phone numbers. A readily accessible contact list with multiple reach methods saves critical minutes.

“If Slack is down, how does your team reach each other?”
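
The "multiple reach methods" requirement can be checked mechanically once the contact list is structured. This is a minimal sketch, assuming the list is a mapping from name to channels; the channel names and the sample entries are made up for illustration.

```python
# Assumption: channels that stop working when the chat tool is down.
CHAT_CHANNELS = {"slack"}


def contacts_without_fallback(contacts):
    """Return names of people who cannot be reached if chat is unavailable."""
    missing = []
    for name, channels in contacts.items():
        # Keep only channels that are filled in and do not depend on chat.
        non_chat = {c for c, value in channels.items() if value and c not in CHAT_CHANNELS}
        if not non_chat:
            missing.append(name)
    return sorted(missing)


# Hypothetical contact list with placeholder values.
contacts = {
    "alice": {"slack": "@alice", "phone": "+1-555-0100"},
    "bob": {"slack": "@bob"},
    "carol": {"slack": "@carol", "phone": ""},
}
print(contacts_without_fallback(contacts))  # ['bob', 'carol']
```

The list itself also needs to live somewhere that survives the outage, for example a printed copy or a document outside the primary infrastructure.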

IR-004

PagerDuty/Opsgenie or similar recommended

Incident management tools handle alerting, on-call scheduling, escalation, and incident tracking in one place.

“How does an alert wake someone up at 3am without Slack?”

runbooks

IR-005

Common incidents have runbooks critical

Runbooks turn tribal knowledge into documented steps anyone can follow. The middle of an incident is not the time to figure out procedures.

“Which incidents only one person on your team knows how to fix?”
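
If past incidents are logged with a type and a resolver, the question above can be answered from data. The sketch below assumes a simple list of `(incident_type, resolver)` pairs; that record shape and the sample values are hypothetical.

```python
from collections import defaultdict


def single_resolver_types(incidents):
    """Map each incident type resolved by exactly one person to that person."""
    resolvers = defaultdict(set)
    for incident_type, resolver in incidents:
        resolvers[incident_type].add(resolver)
    return {
        incident_type: next(iter(people))
        for incident_type, people in sorted(resolvers.items())
        if len(people) == 1
    }


# Hypothetical incident history.
history = [
    ("db-failover", "alice"),
    ("db-failover", "alice"),
    ("cert-expiry", "bob"),
    ("cert-expiry", "carol"),
    ("queue-backlog", "dana"),
]
print(single_resolver_types(history))
# {'db-failover': 'alice', 'queue-backlog': 'dana'}
```

Incident types that show up here are good first candidates for a runbook.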

post-mortems

IR-006

Blameless post-mortems after incidents recommended

Post-mortems turn incidents into learning opportunities. Blameless means focusing on systems and processes, not individuals.

“After your last incident, did you fix the system or blame a person?”

IR-007

Action items tracked to completion recommended

Post-mortems are worthless if their action items never get done. Track every action item in a system where it has a clear owner.

“How many post-mortem action items are still open from last year?”
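
A quick audit of action items might look like the sketch below. It assumes items can be exported with a title, owner, open date, and done flag; the `ActionItem` type, the 90-day staleness threshold, and the sample data are all assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class ActionItem:
    title: str
    owner: Optional[str]
    opened: date
    done: bool = False


def audit_action_items(items, today, max_age_days=90):
    """Summarize open items, flagging those without an owner or open too long."""
    open_items = [item for item in items if not item.done]
    return {
        "open": len(open_items),
        "unowned": [i.title for i in open_items if not i.owner],
        "stale": [i.title for i in open_items if (today - i.opened).days > max_age_days],
    }


# Hypothetical export from an issue tracker.
items = [
    ActionItem("Add disk-space alert", "alice", date(2024, 1, 10)),
    ActionItem("Document failover steps", None, date(2024, 5, 1)),
    ActionItem("Rotate API keys", "bob", date(2023, 11, 1), done=True),
]
print(audit_action_items(items, today=date(2024, 6, 1)))
# {'open': 2, 'unowned': ['Document failover steps'], 'stale': ['Add disk-space alert']}
```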