IR-004 recommended on-call-escalation
PagerDuty/Opsgenie or similar
Incident management tools handle alerting, on-call scheduling, escalation, and incident tracking in one place.
Question to ask
"How does an alert wake someone up at 3am without Slack?"
Pass criteria
- ✓ Using an incident management tool (PagerDuty, Opsgenie, incident.io, etc.)
- ✓ Tool is integrated with monitoring/alerting systems
- ✓ On-call schedules managed in the tool
Fail criteria
- ✗ No incident management tool (relying on Slack mentions or manual calls)
- ✗ Tool exists but not integrated with monitoring
- ✗ Tool exists but nobody uses it properly
Verification guide
Severity: Recommended
Incident management tools handle alerting, on-call scheduling, escalation, and incident tracking in one place. They're the glue between monitoring and humans.
Check automatically:
- Check for incident management tools:
# Search package.json and infrastructure configs for vendor names
grep -riE "pagerduty|opsgenie|incident\.io|rootly|firehydrant|victorops|splunk-on-call" package.json .github/ terraform/ infrastructure/ --include="*.json" --include="*.yml" --include="*.yaml" --include="*.tf" 2>/dev/null
# Look for webhook configs pointing to incident platforms
grep -riE "events\.pagerduty\.com|api\.opsgenie\.com|api\.incident\.io" .github/ terraform/ --include="*.yml" --include="*.yaml" --include="*.tf" 2>/dev/null
# Check for config files (case-insensitive; the parentheses make the node_modules filter apply to both names)
find . -maxdepth 2 \( -iname "*pagerduty*" -o -iname "*opsgenie*" \) ! -path "*/node_modules/*" 2>/dev/null
- Check monitoring tool integrations:
# Datadog, Sentry, etc. often integrate with incident tools
grep -riE "pagerduty|opsgenie" datadog/ sentry/ monitoring/ --include="*.yml" --include="*.yaml" --include="*.json" 2>/dev/null
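The searches above can be folded into one helper that reports which vendors are mentioned anywhere under a directory, which is convenient when recording evidence. This is a minimal sketch: the function name `detect_incident_tools` is hypothetical, and the vendor list and file extensions are simply those used in the commands above.

```shell
# Hypothetical helper: print each incident-tool vendor name that appears
# in a config file under the given directory (defaults to the current one).
detect_incident_tools() {
  dir="${1:-.}"
  for vendor in pagerduty opsgenie incident.io rootly firehydrant victorops splunk-on-call; do
    # -F matches the vendor name as a literal string, so the dot in
    # "incident.io" is not treated as a regex wildcard.
    # -q stops at the first match; we only need to know the name occurs.
    if grep -rqiF \
         --include="*.json" --include="*.yml" --include="*.yaml" --include="*.tf" \
         --exclude-dir=node_modules \
         -- "$vendor" "$dir" 2>/dev/null; then
      echo "$vendor"
    fi
  done
}
```

Running `detect_incident_tools .` from the repository root prints one vendor per line. Empty output means no references were found in the scanned file types, not that no tool is in use, so still ask the user.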
Ask user:
- "What tool do you use for incident management/paging?"
- "Is it integrated with your monitoring/alerting?" (Datadog → PagerDuty, etc.)
- "Does it handle on-call scheduling, or do you manage that separately?"
Cross-reference with:
- IR-001/IR-002 (on-call and escalation) - tool often manages both
- IR-003 (contact list) - tool becomes the source of truth for contacts
- Section 12 (monitoring/alerting) - alerts should trigger the incident tool
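If the team uses PagerDuty, one way to confirm that alerts really reach the incident tool is to send a test event through the PagerDuty Events API v2 endpoint (`events.pagerduty.com/v2/enqueue`). The sketch below is illustrative only: the function names, the `ir-004-check` source string, and the `DRY_RUN` switch are assumptions, and the routing key must come from the team's own service integration. It defaults to a dry run that only prints the request body, because a real event opens an incident and may page whoever is on call.

```shell
# Hypothetical helper: build a minimal Events API v2 "trigger" body.
# Note: the summary is inserted as-is, so it must not contain double quotes
# or backslashes (this sketch does no JSON escaping).
build_test_event() {
  routing_key="$1"
  summary="${2:-Test page from IR-004 check}"
  printf '{"routing_key":"%s","event_action":"trigger","payload":{"summary":"%s","source":"ir-004-check","severity":"info"}}\n' \
    "$routing_key" "$summary"
}

# Hypothetical helper: print the body (default) or POST it when DRY_RUN=0.
send_test_event() {
  body=$(build_test_event "$1" "$2")
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$body"
  else
    curl -sS -X POST \
      -H "Content-Type: application/json" \
      -d "$body" \
      https://events.pagerduty.com/v2/enqueue
  fi
}
```

Call `send_test_event "<routing key>"` first to inspect the body, then repeat with `DRY_RUN=0` after warning the on-call engineer. Whether the event actually pages someone depends on how the service's urgency and notification rules are configured.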
Pass criteria:
- Using an incident management tool (PagerDuty, Opsgenie, incident.io, etc.)
- Tool is integrated with monitoring/alerting systems
- On-call schedules managed in the tool
Fail criteria:
- No incident management tool (relying on Slack mentions or manual calls)
- Tool exists but not integrated with monitoring (alerts don't auto-page)
- Tool exists but the team does not use it properly (e.g. pages go unacknowledged or schedules are out of date)
Notes: For small or early-stage teams, not having PagerDuty is fine if you have a simple contact list and Slack alerts. A dedicated tool becomes more important as the team grows or when 24/7 uptime matters.
Evidence to capture:
- Incident management tool in use (or none)
- Integrations with monitoring tools
- Whether on-call scheduling is managed there