IR-005 critical runbooks

Common incidents have runbooks

Runbooks turn tribal knowledge into documented steps anyone can follow. The middle of an incident is not the time to figure out procedures.

Question to ask

"Which incidents only one person on your team knows how to fix?"

Verification guide

Severity: Critical

Runbooks turn tribal knowledge into documented steps anyone can follow. The middle of an incident is not the time to figure out "how do we restart the database?"

Check automatically:

  1. Look for runbook directories and files:
# Check for runbook directories
ls -la runbooks/ playbooks/ docs/incidents/ docs/runbooks/ 2>/dev/null

# Search for runbook content
grep -riE "runbook|playbook|incident.*response|troubleshoot" docs/ README.md CLAUDE.md --include="*.md" 2>/dev/null

# Look for specific incident types
grep -riE "server.*down|database.*issue|high.*traffic|outage|incident" docs/ runbooks/ --include="*.md" 2>/dev/null
  2. Check for runbook templates:
# Look for templates under runbook or playbook paths
find . -maxdepth 3 -name "*template*" \( -path "*/runbook*" -o -path "*/playbook*" \) 2>/dev/null | grep -v node_modules

Ask user:

  • "Do you have written runbooks for common incidents?"
  • "What incidents have you had before? Are they documented?"
  • "Can a new team member follow the runbook without help?"

Minimum runbooks to have:

Incident Type      What It Covers
Server/app down    How to check status, restart, rollback, who to contact
Database issues    Connection problems, slow queries, failover procedures
High traffic/load  Scaling procedures, what to shed, caching knobs
Security incident  Who to contact, containment steps, communication plan
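
The table above can be turned into a rough coverage check. The sketch below is not part of the original checklist: the function name check_runbook_coverage, the default runbooks/ directory, and the keyword patterns are all illustrative assumptions. A keyword hit only shows that a file mentions the topic, not that it is a usable runbook, so treat the output as a starting point for the questions above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report which of the four minimum incident types are
# mentioned in at least one Markdown file under a directory.
# The keyword patterns are illustrative assumptions; tune them per team.
check_runbook_coverage() {
  local dir="${1:-runbooks}"
  local missing=0
  local entry label pattern

  # Each entry is "label:extended-regex"; the label never contains ':'.
  for entry in \
    "Server/app down:(server|app|service).*down" \
    "Database issues:database|failover|slow quer" \
    "High traffic/load:high.*(traffic|load)|scaling|load shed" \
    "Security incident:security.*incident|breach|containment"; do
    label="${entry%%:*}"
    pattern="${entry#*:}"
    # -r recurse, -q quiet, -i ignore case, -E extended regex
    if grep -rqiE --include="*.md" -e "$pattern" "$dir" 2>/dev/null; then
      echo "COVERED: $label"
    else
      echo "MISSING: $label"
      missing=$((missing + 1))
    fi
  done

  # Exit status 0 only when every incident type was found.
  [ "$missing" -eq 0 ]
}
```

Usage: run check_runbook_coverage docs/runbooks from the repository root and record any MISSING lines as gaps.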

Cross-reference with:

  • Section 34 (rollback/recovery) - rollback procedure is a type of runbook
  • IR-001/IR-002 (on-call/escalation) - runbooks reference who to escalate to
  • Section 12 (monitoring) - runbooks triggered by alerts

Pass criteria:

  • Runbooks exist for the most common/critical incident types
  • Runbooks are step-by-step (not just "fix the database")
  • Runbooks are accessible during outages
  • Team knows where to find them
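
The "step-by-step" criterion can be approximated with a simple heuristic: count the numbered-step lines in each runbook. This is a sketch under assumptions, not a definitive test. The function name flag_thin_runbooks, the runbooks/ default, and the minimum of 3 steps are all hypothetical, and a runbook written with bullets instead of numbers would be flagged incorrectly.

```shell
#!/usr/bin/env bash
# Hypothetical heuristic: count lines that look like numbered steps
# ("1. ..." or "2) ...") in each Markdown file and flag files below a minimum.
flag_thin_runbooks() {
  local dir="${1:-runbooks}"
  local min_steps="${2:-3}"   # assumed threshold, not from the checklist
  local file steps

  find "$dir" -type f -name "*.md" 2>/dev/null | while IFS= read -r file; do
    # grep -c prints 0 and exits non-zero when nothing matches; ignore that.
    steps="$(grep -cE '^[[:space:]]*[0-9]+[.)][[:space:]]' "$file" || true)"
    if [ "$steps" -lt "$min_steps" ]; then
      echo "THIN ($steps steps): $file"
    else
      echo "OK ($steps steps): $file"
    fi
  done
}
```

Files reported as THIN are worth opening by hand to see whether they amount to "fix the database" or simply use a different layout.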

Fail criteria:

  • No runbooks ("we wing it")
  • Runbooks exist but are outdated/wrong
  • Runbooks only in one person's head
  • Runbooks stored only in systems that could be down

Evidence to capture:

  • Location of runbooks
  • Incident types covered
  • Last update date (if visible)
  • Whether they're accessible offline
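
Location and last-update evidence can be collected with a short script. The sketch below assumes Markdown runbooks, a hypothetical function name list_runbook_freshness, and an arbitrary 180-day staleness threshold. It relies on filesystem modification times, which a fresh git checkout resets, so a FRESH result in a newly cloned repository says little; confirm with the team or the commit history.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list Markdown runbooks and mark those whose file
# modification time is older than max_age_days.
list_runbook_freshness() {
  local dir="${1:-runbooks}"
  local max_age_days="${2:-180}"   # assumed threshold, not from the checklist
  local file

  # Files last modified more than max_age_days ago.
  find "$dir" -type f -name "*.md" -mtime +"$max_age_days" 2>/dev/null |
    while IFS= read -r file; do echo "STALE: $file"; done

  # Everything else.
  find "$dir" -type f -name "*.md" ! -mtime +"$max_age_days" 2>/dev/null |
    while IFS= read -r file; do echo "FRESH: $file"; done
}
```

Usage: list_runbook_freshness docs/runbooks 90 lists every runbook path and marks those untouched for more than 90 days.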

Section

35. Incident Response

API & Security