IR-004 recommended on-call-escalation

PagerDuty/Opsgenie or similar

Incident management tools handle alerting, on-call scheduling, escalation, and incident tracking in one place.

Question to ask

"How does an alert wake someone up at 3am without Slack?"

Pass criteria

  • Using an incident management tool (PagerDuty, Opsgenie, incident.io, etc.)
  • Tool is integrated with monitoring/alerting systems
  • On-call schedules managed in the tool

Fail criteria

  • No incident management tool (relying on Slack mentions or manual calls)
  • Tool exists but not integrated with monitoring
  • Tool exists but nobody uses it properly

Verification guide

Severity: Recommended

Incident management tools are the glue between monitoring and humans: monitoring detects the problem, and the tool handles paging, on-call scheduling, escalation, and incident tracking.

Check automatically:

  1. Check for incident management tools:
# Search package.json and configs
grep -riE "pagerduty|opsgenie|incident\.io|rootly|firehydrant|victorops|splunk-on-call" package.json .github/ terraform/ infrastructure/ --include="*.json" --include="*.yml" --include="*.tf" 2>/dev/null

# Look for webhook configs pointing to incident platforms
grep -riE "events\.pagerduty\.com|api\.opsgenie\.com|api\.incident\.io" .github/ terraform/ --include="*.yml" --include="*.tf" 2>/dev/null

# Check for config files (case-insensitive, skipping node_modules)
find . -maxdepth 2 \( -iname "*pagerduty*" -o -iname "*opsgenie*" \) ! -path "*/node_modules/*" 2>/dev/null
  2. Check monitoring tool integrations:
# Datadog, Sentry, etc. often integrate with incident tools
grep -riE "pagerduty|opsgenie" datadog/ sentry/ monitoring/ --include="*.yml" --include="*.json" 2>/dev/null
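
The searches above can be folded into one helper that lists which tool names appear in a repository. This is a minimal sketch, not a definitive scanner: detect_incident_tools is a hypothetical function name, the file extensions are an assumption about where such references usually live, and --include/--exclude-dir assume GNU or BSD grep.

```shell
# Sketch: print each incident-management tool name found under a directory,
# one per line, lowercased and de-duplicated.
# detect_incident_tools is a hypothetical helper, not part of any vendor CLI.
detect_incident_tools() {
  dir="${1:-.}"
  # -o prints only the matched tool name, -h drops the file name prefix
  grep -rihoE "pagerduty|opsgenie|incident\.io|rootly|firehydrant|victorops|splunk-on-call" \
    --include="*.json" --include="*.yml" --include="*.yaml" --include="*.tf" \
    --exclude-dir=node_modules --exclude-dir=.git "$dir" 2>/dev/null \
    | tr '[:upper:]' '[:lower:]' | sort -u
}
```

Run it as detect_incident_tools . from the repository root. Empty output means no reference was found in the scanned file types, which is a prompt to ask the user rather than proof that no tool is in use.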

Ask user:

  • "What tool do you use for incident management/paging?"
  • "Is it integrated with your monitoring/alerting?" (Datadog → PagerDuty, etc.)
  • "Does it handle on-call scheduling, or do you manage that separately?"

Cross-reference with:

  • IR-001/IR-002 (on-call and escalation) - tool often manages both
  • IR-003 (contact list) - tool becomes the source of truth for contacts
  • Section 12 (monitoring/alerting) - alerts should trigger the incident tool
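
To make "alerts should trigger the incident tool" concrete, here is a minimal sketch of the event a monitoring system would send, assuming PagerDuty and its Events API v2. build_trigger_event is a hypothetical helper, PD_ROUTING_KEY is a placeholder for a real integration key, and the severity is hard-coded to "critical" for illustration. Other tools use different endpoints and payloads.

```shell
# Sketch: build a PagerDuty Events API v2 "trigger" event body.
# build_trigger_event is a hypothetical helper. Inputs are not JSON-escaped,
# so keep them free of double quotes and backslashes in this sketch.
build_trigger_event() {
  routing_key="$1"; summary="$2"; src="$3"
  printf '{"routing_key":"%s","event_action":"trigger","payload":{"summary":"%s","source":"%s","severity":"critical"}}\n' \
    "$routing_key" "$summary" "$src"
}

# Sending it (not executed here; PD_ROUTING_KEY is a placeholder):
# build_trigger_event "$PD_ROUTING_KEY" "API error rate above threshold" "api-prod" |
#   curl -sS -X POST -H "Content-Type: application/json" \
#        --data @- https://events.pagerduty.com/v2/enqueue
```

In practice most monitoring tools ship a native integration that sends this event for you, so a hand-rolled call like this is mainly useful for testing that a page actually reaches the on-call person.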

Pass criteria:

  • Using an incident management tool (PagerDuty, Opsgenie, incident.io, etc.)
  • Tool is integrated with monitoring/alerting systems
  • On-call schedules managed in the tool

Fail criteria:

  • No incident management tool (relying on Slack mentions or manual calls)
  • Tool exists but not integrated with monitoring (alerts don't auto-page)
  • Tool exists but nobody uses it properly
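
The criteria above reduce to four yes/no answers, sketched below as a small decision helper. evaluate_ir004 is a hypothetical name. One case is an assumption: the criteria do not say how to score a tool that is integrated and used while schedules are managed elsewhere, so the sketch reports it as PARTIAL rather than FAIL.

```shell
# Sketch: map four yes/no answers onto the pass/fail criteria.
# evaluate_ir004 is a hypothetical helper; each argument is "yes" or "no".
evaluate_ir004() {
  has_tool="$1"; integrated="$2"; schedules_in_tool="$3"; used_properly="$4"
  if [ "$has_tool" != "yes" ]; then
    echo "FAIL: no incident management tool"
  elif [ "$integrated" != "yes" ]; then
    echo "FAIL: tool not integrated with monitoring"
  elif [ "$used_properly" != "yes" ]; then
    echo "FAIL: tool exists but is not used properly"
  elif [ "$schedules_in_tool" != "yes" ]; then
    # Assumption: not listed as a fail criterion, so flagged for review instead
    echo "PARTIAL: on-call schedules managed outside the tool"
  else
    echo "PASS"
  fi
}
```

For example, evaluate_ir004 yes no yes yes covers the case where the tool exists but alerts do not auto-page.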

Notes: For small or early-stage teams, not having PagerDuty is fine if you have a simple contact list and Slack alerts. A dedicated tool becomes more critical as the team grows or when 24/7 uptime matters.

Evidence to capture:

  • Incident management tool in use (or none)
  • Integrations with monitoring tools
  • Whether on-call scheduling is managed there

Section

35. Incident Response

API & Security