IR-005 critical runbooks
Common incidents have runbooks
Runbooks turn tribal knowledge into documented steps anyone can follow. The middle of an incident is not the time to work out procedures.
Question to ask
"Which incidents only one person on your team knows how to fix?"
Pass criteria
- ✓ Runbooks exist for the most common/critical incident types
- ✓ Runbooks are step-by-step (not just "fix the database")
- ✓ Runbooks are accessible during outages
- ✓ Team knows where to find them
Fail criteria
- ✗ No runbooks
- ✗ Runbooks exist but are outdated/wrong
- ✗ Runbooks only in one person's head
- ✗ Runbooks stored only in systems that could be down
Verification guide
Severity: Critical
Runbooks turn tribal knowledge into documented steps anyone can follow. The middle of an incident is not the time to work out "how do we restart the database?"
Check automatically:
- Look for runbook directories and files:
# Check for runbook directories
ls -la runbooks/ playbooks/ docs/incidents/ docs/runbooks/ 2>/dev/null
# Search for runbook content
grep -riE "runbook|playbook|incident.*response|troubleshoot" docs/ README.md CLAUDE.md --include="*.md" 2>/dev/null
# Look for specific incident types
grep -riE "server.*down|database.*issue|high.*traffic|outage|incident" docs/ runbooks/ --include="*.md" 2>/dev/null
- Check for runbook templates:
# Look for templates
find . -maxdepth 3 -name "*template*" \( -path "*/runbook*" -o -path "*/playbook*" \) ! -path "*/node_modules/*" 2>/dev/null
Ask user:
- "Do you have written runbooks for common incidents?"
- "What incidents have you had before? Are they documented?"
- "Can a new team member follow the runbook without help?"
Minimum runbooks to have:
| Incident Type | What It Covers |
|---|---|
| Server/app down | How to check status, restart, rollback, who to contact |
| Database issues | Connection problems, slow queries, failover procedures |
| High traffic/load | Scaling procedures, what to shed, caching knobs |
| Security incident | Who to contact, containment steps, communication plan |
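The table above can be turned into a rough coverage check. The following is a minimal sketch, not a definitive test: the function name `check_runbook_coverage` and the keyword patterns are assumptions chosen for illustration, and a keyword match only shows that an incident type is mentioned, not that the runbook is usable.

```shell
# Sketch: report which of the four minimum incident types are mentioned
# somewhere under a runbook directory. The keyword patterns are illustrative
# assumptions; a match proves a mention, not a usable runbook.

# check_runbook_coverage [DIR]
# Prints "covered" or "MISSING" per incident type.
# Returns 0 only when every type matched at least one *.md file under DIR.
check_runbook_coverage() {
  local dir="${1:-runbooks}" missing=0 label pattern

  # Each line below is "label|extended regex". read puts everything after
  # the first "|" into $pattern, so the regex may contain "|" itself.
  while IFS='|' read -r label pattern; do
    if grep -rqiE --include='*.md' -e "$pattern" "$dir" </dev/null 2>/dev/null; then
      printf 'covered  %s\n' "$label"
    else
      printf 'MISSING  %s\n' "$label"
      missing=1
    fi
  done <<'EOF'
Server/app down|server.*down|app.*down|restart|rollback
Database issues|database|slow quer|failover
High traffic/load|high.*(traffic|load)|scal(e|ing)|load shed
Security incident|security|breach|containment
EOF

  return "$missing"
}
```

For example, `check_runbook_coverage docs/runbooks || echo "coverage gaps found"` prints one line per incident type and flags any gaps. Treat a "covered" result as a prompt to read the runbook, not as a pass.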
Cross-reference with:
- Section 34 (rollback/recovery) - a rollback procedure is a type of runbook
- IR-001/IR-002 (on-call/escalation) - runbooks should say who to escalate to
- Section 12 (monitoring) - alerts should point to the runbook they trigger
Pass criteria:
- Runbooks exist for the most common/critical incident types
- Runbooks are step-by-step (not just "fix the database")
- Runbooks are accessible during outages
- Team knows where to find them
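The "step-by-step" criterion can be approximated mechanically by counting numbered list items in each runbook. This is a heuristic sketch only: the function names `count_steps` and `flag_thin_runbooks`, the default threshold of 3 steps, and the assumption that steps are written as Markdown numbered lists are all illustrative choices, not part of the checklist.

```shell
# Sketch: flag runbooks that contain few numbered steps. Assumes steps are
# written as Markdown numbered lists ("1. ..." or "1) ..."); runbooks that
# use another layout will be flagged even if they are detailed.

# count_steps FILE
# Prints the number of lines that look like numbered steps.
count_steps() {
  grep -cE '^[[:space:]]*[0-9]+[.)][[:space:]]' "$1"
}

# flag_thin_runbooks [DIR] [MIN]
# Prints "path<TAB>N numbered steps" for each *.md file with fewer than MIN.
flag_thin_runbooks() {
  local dir="${1:-runbooks}" min="${2:-3}" file steps
  find "$dir" -type f -name '*.md' 2>/dev/null | sort | while IFS= read -r file; do
    # grep -c exits non-zero when the count is 0, so ignore its status.
    steps=$(count_steps "$file") || true
    if [ "${steps:-0}" -lt "$min" ]; then
      printf '%s\t%s numbered steps\n' "$file" "${steps:-0}"
    fi
  done
}
```

Running `flag_thin_runbooks docs/runbooks 5` lists candidates for a closer read. The final judgment is still whether a new team member can follow the runbook without help.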
Fail criteria:
- No runbooks ("we wing it")
- Runbooks exist but are outdated/wrong
- Runbooks only in one person's head
- Runbooks stored only in systems that could be down
Evidence to capture:
- Location of runbooks
- Incident types covered
- Last update date (if visible)
- Whether they're accessible offline
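The location and last-update evidence above can be collected in one pass. The sketch below assumes the runbooks are Markdown files tracked in a Git repository; `runbook_evidence` is a hypothetical helper name, and files that Git does not know about are reported as `unknown` instead of guessing a date.

```shell
# Sketch: list each runbook with the date of its last commit.
# Assumes runbooks are *.md files in a Git repository; anything Git cannot
# date (untracked file, no repository, git not installed) shows "unknown".

# runbook_evidence [DIR]
# Prints "YYYY-MM-DD<TAB>path" (or "unknown<TAB>path") per runbook file.
runbook_evidence() {
  local dir="${1:-runbooks}" file updated
  find "$dir" -type f -name '*.md' 2>/dev/null | sort | while IFS= read -r file; do
    # --date=short formats the committer date as YYYY-MM-DD.
    updated=$(git log -1 --format=%cd --date=short -- "$file" 2>/dev/null </dev/null) || true
    printf '%s\t%s\n' "${updated:-unknown}" "$file"
  done
}
```

For example, `runbook_evidence docs/runbooks | sort` puts the oldest entries first, which is a quick way to spot runbooks that may be outdated. Offline accessibility still has to be confirmed by asking the team.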