MON-004 critical general

HTTP error alerting

Alert on a high volume of 404s (threshold defined by the team). Alert on any 500s within a 1-minute window. Alerts route to the appropriate channel, and alert delivery has been verified.
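
The 500 criterion can be rehearsed offline before any alerting tool is wired up. A minimal sketch, assuming combined-log-format access logs where the status code is the 9th field (the sample entries and file path are invented for illustration):

```shell
# Sketch: flag any 5xx responses in a log sample, mirroring the
# "any 500s within a 1-minute window" criterion. The sample log and
# the status-code field position ($9, combined log format) are assumptions.
cat > /tmp/access.log <<'EOF'
10.0.0.1 - - [22/May/2025:10:01:03 +0000] "GET /ok HTTP/1.1" 200 512
10.0.0.2 - - [22/May/2025:10:01:17 +0000] "GET /boom HTTP/1.1" 500 87
10.0.0.3 - - [22/May/2025:10:01:41 +0000] "GET /ok HTTP/1.1" 200 512
EOF

# All sample entries fall inside one minute, so a plain count suffices here.
count=$(awk '$9 ~ /^5[0-9][0-9]$/ {n++} END {print n+0}' /tmp/access.log)
if [ "$count" -gt 0 ]; then
  echo "ALERT: $count server error(s) in window"
else
  echo "OK: no server errors in window"
fi
```

In a real pipeline the count would be scoped to the trailing minute of timestamps rather than the whole file.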

Question to ask

"Would a spike in 500s wake anyone up tonight?"

Related items

Verification guide

Severity: Critical

Check automatically:

  1. Check for alert configurations:

    Prometheus AlertManager:

    # Look for HTTP error alert rules
    grep -riE "status.*(4|5)[0-9]{2}|http.*(error|4xx|5xx)|response.*code" . --include="*.yml" --include="*.yaml" --include="*.rules" 2>/dev/null
    

    Datadog:

    # Look for Datadog monitor configs
    grep -riE "datadog.*monitor|monitor.*type.*metric" . --include="*.yml" --include="*.yaml" --include="*.tf" 2>/dev/null
    

    CloudWatch:

    # List CloudWatch alarms
    aws cloudwatch describe-alarms --query 'MetricAlarms[?contains(MetricName, `5xx`) || contains(MetricName, `4xx`) || contains(MetricName, `HTTPCode`)]'
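
    If the query returns nothing, an alarm matching the 500 criterion can be created with put-metric-alarm. A sketch assuming an Application Load Balancer; the namespace, metric name, and SNS topic ARN are assumptions, and the command is only echoed here so it can be reviewed before running:

```shell
# Sketch: CloudWatch alarm firing on any target 5xx in a 1-minute window.
# The ALB namespace/metric and the SNS topic ARN are assumed placeholders.
cmd='aws cloudwatch put-metric-alarm
  --alarm-name any-http-5xx
  --namespace AWS/ApplicationELB
  --metric-name HTTPCode_Target_5XX_Count
  --statistic Sum
  --period 60
  --evaluation-periods 1
  --threshold 0
  --comparison-operator GreaterThanThreshold
  --alarm-actions arn:aws:sns:REGION:ACCOUNT_ID:oncall-topic'
echo "$cmd"
```

    Sum over a 60-second period compared GreaterThanThreshold against 0 means the alarm fires on even a single 5xx response, which matches the pass criterion below.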
    

    PagerDuty/Opsgenie:

    # Look for incident management integration
    grep -riE "pagerduty|opsgenie|victorops|incident" . --include="*.yml" --include="*.yaml" --include="*.json" 2>/dev/null
    
  2. Check Sentry or other error-tracking alerts:

    # Look for Sentry config
    grep -riE "sentry.*dsn|SENTRY_DSN|@sentry" . --include="*.js" --include="*.ts" --include="*.py" --include="*.env*" 2>/dev/null
    

Ask the user for alert configuration details: "Please provide details on HTTP error alerting:

404 Alerts:

  1. Is there an alert for high volume of 404s?
  2. What is the threshold? (e.g., >100 per 5 minutes)
  3. Where does the alert go? (Slack, PagerDuty, email)

500 Alerts:

  1. Is there an alert for 500 errors?
  2. Is the threshold sensitive enough to catch even a few 500s? (Target: any 500s within a 1-minute window)
  3. Where does the alert go?

Please provide:

  • Screenshot of alert rules, OR
  • Alert configuration file/export"
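
Once the team states its 404 threshold, it can be sanity-checked against a log sample the same way. A sketch, assuming the example threshold of 100 per 5 minutes and combined-log-format entries (the sample data and file path are invented):

```shell
# Sketch: compare 404 volume in a 5-minute log sample against a
# team-defined threshold. The file path, field position ($9), and
# threshold value are assumptions.
THRESHOLD=100
printf '%s\n' \
  '10.0.0.1 - - [22/May/2025:10:00:01 +0000] "GET /a HTTP/1.1" 404 0' \
  '10.0.0.1 - - [22/May/2025:10:00:02 +0000] "GET /b HTTP/1.1" 200 9' \
  '10.0.0.1 - - [22/May/2025:10:00:03 +0000] "GET /c HTTP/1.1" 404 0' \
  > /tmp/sample.log

notfound=$(awk '$9 == "404" {n++} END {print n+0}' /tmp/sample.log)
if [ "$notfound" -gt "$THRESHOLD" ]; then
  echo "ALERT: $notfound 404s exceed threshold of $THRESHOLD"
else
  echo "OK: $notfound 404s within threshold of $THRESHOLD"
fi
```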

Cross-reference with:

  • MON-003 (HTTP logging) - alerts need logs/metrics to trigger from
  • Section 35 (Incident Response) - alerts should route to on-call
  • Section 19 (Error Reporting - Sentry) - Sentry may provide 500 alerting

Pass criteria:

  • 404 alerting configured with volume threshold (team-defined)
  • 500 alerting configured to catch any 500s within a 1-minute window
  • Alerts route to appropriate channel (on-call, Slack, etc.)
  • Alert delivery verified (an alert has actually fired and been received)

Fail criteria:

  • No HTTP error alerting configured
  • 404 alerting missing
  • 500 alerting threshold too high (missing low-volume errors)
  • 500 alerting only at high thresholds (e.g., >100)
  • Alerts configured but no one receiving them

Evidence to capture:

  • Alert tool in use
  • 404 alert threshold
  • 500 alert threshold (should catch even a single 500 within a 1-minute window)
  • Notification channel
  • Date of last alert fired (proves it works)

Section

12. Monitoring

Observability