MON-004 critical general
HTTP error alerting
Alert on a high volume of 404s (team-defined threshold). Alert on any 500s within a 1-minute window. Alerts route to the appropriate channel. Alert delivery verified.
Question to ask
"Would a spike in 500s wake anyone up tonight?"
Related items
Verification guide
Severity: Critical
Check automatically:
Check for alert configurations:
Prometheus AlertManager:
```shell
# Look for HTTP error alert rules
grep -riE "status.*(4|5)[0-9]{2}|http.*(error|4xx|5xx)|response.*code" . --include="*.yml" --include="*.yaml" --include="*.rules" 2>/dev/null
```
Datadog:
```shell
# Look for Datadog monitor configs
grep -riE "datadog.*monitor|monitor.*type.*metric" . --include="*.yml" --include="*.yaml" --include="*.tf" 2>/dev/null
```
CloudWatch:
```shell
# List CloudWatch alarms
aws cloudwatch describe-alarms --query 'MetricAlarms[?contains(MetricName, `5xx`) || contains(MetricName, `4xx`) || contains(MetricName, `HTTPCode`)]'
```
PagerDuty/Opsgenie:
```shell
# Look for incident management integration
grep -riE "pagerduty|opsgenie|victorops|incident" . --include="*.yml" --include="*.yaml" --include="*.json" 2>/dev/null
```
Check Sentry or error tracking alerts:
```shell
# Look for Sentry config
grep -riE "sentry.*dsn|SENTRY_DSN|@sentry" . --include="*.js" --include="*.ts" --include="*.py" --include="*.env*" 2>/dev/null
```
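If the team runs Prometheus, the two required alerts can be sketched as rules like the following. This is a minimal example, not a drop-in config: the metric name `http_requests_total`, the `status` label, and the 404 threshold are assumptions to adapt to the team's actual metrics.

```yaml
groups:
  - name: http-errors
    rules:
      # Any 500s within a 1-minute window (metric/label names assumed)
      - alert: Http5xxErrors
        expr: sum(increase(http_requests_total{status=~"5.."}[1m])) > 0
        labels:
          severity: critical
        annotations:
          summary: "5xx responses observed in the last minute"
      # High 404 volume; >100 per 5 minutes is an example team-defined threshold
      - alert: Http404Spike
        expr: sum(increase(http_requests_total{status="404"}[5m])) > 100
        labels:
          severity: warning
        annotations:
          summary: "404 volume above team-defined threshold"
```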
Ask user for alert configuration details: "Please provide details on HTTP error alerting:
404 Alerts:
- Is there an alert for high volume of 404s?
- What is the threshold? (e.g., >100 per 5 minutes)
- Where does the alert go? (Slack, PagerDuty, email)
500 Alerts:
- Is there an alert for 500 errors?
- Is the threshold sensitive enough to catch even a few 500s? (Target: any 500s within 1-minute window)
- Where does the alert go?
Please provide:
- Screenshot of alert rules, OR
- Alert configuration file/export"
Cross-reference with:
- MON-003 (HTTP logging) - alerts need logs/metrics to trigger from
- Section 35 (Incident Response) - alerts should route to on-call
- Section 19 (Error Reporting - Sentry) - Sentry may provide 500 alerting
Pass criteria:
- 404 alerting configured with volume threshold (team-defined)
- 500 alerting configured to catch any 500s within 1-minute window
- Alerts route to appropriate channel (on-call, Slack, etc.)
- Alert delivery verified (has fired and been received)
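The first two pass criteria can be expressed as a small check over a window of (timestamp, status) samples; a minimal sketch, where the function name and the default 404 threshold are illustrative rather than taken from any team's config:

```python
from datetime import datetime, timedelta

def should_alert(samples, now, max_404_per_5m=100):
    """samples: list of (datetime, int status) pairs from access logs/metrics.

    Returns the set of alert names that would fire, mirroring the pass
    criteria: any 5xx within the last 1-minute window, and a team-defined
    404 volume threshold over the last 5 minutes.
    """
    alerts = set()
    one_min_ago = now - timedelta(minutes=1)
    five_min_ago = now - timedelta(minutes=5)
    # "Any 500s within 1-minute window": threshold is zero, not a volume.
    if any(ts >= one_min_ago and 500 <= status < 600 for ts, status in samples):
        alerts.add("5xx")
    # 404s use a volume threshold over a wider window.
    n404 = sum(1 for ts, status in samples if ts >= five_min_ago and status == 404)
    if n404 > max_404_per_5m:
        alerts.add("404-spike")
    return alerts
```

The asymmetry is the point of this check: a single 500 fires an alert, while 404s only alert above a volume threshold.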
Fail criteria:
- No HTTP error alerting configured
- 404 alerting missing
- 500 alerting threshold too high to catch low-volume errors (e.g., alerting only at >100)
- Alerts configured but no one receiving them
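The last fail criterion ("configured but no one receiving") is best ruled out end-to-end: push a synthetic alert and confirm it appears in the routed channel. With Prometheus Alertmanager, its standard `/api/v2/alerts` endpoint accepts a JSON array of alerts; the host and port below are assumptions for a typical local setup.

```shell
# Fire a synthetic test alert at Alertmanager, then watch the notification
# channel (Slack, PagerDuty, etc.) for delivery. Host/port are assumed.
curl -s -X POST http://localhost:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{"labels":{"alertname":"DeliveryTest","severity":"critical"},
        "annotations":{"summary":"Alert delivery verification test"}}]'
```

Record the delivery timestamp as evidence; it doubles as the "date of last alert fired".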
Evidence to capture:
- Alert tool in use
- 404 alert threshold
- 500 alert threshold (should catch any 500s within a 1-minute window)
- Notification channel
- Date of last alert fired (proves it works)