MON-004 critical general

HTTP error alerting

Alert on a high volume of 404s (threshold defined by the team). Alert on any 500s within a 1-minute window. Alerts route to the appropriate channel, and alert delivery has been verified.
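
The 500 criterion can be rehearsed offline before any alerting tool is wired up. A minimal sketch, assuming combined-log-format access logs where the status code is the 9th field (the sample entries and file path are invented for illustration):

```shell
# Sketch: flag any 5xx responses in a log sample, mirroring the
# "any 500s within a 1-minute window" criterion. The sample log and
# the status-code field position ($9, combined log format) are assumptions.
cat > /tmp/access.log <<'EOF'
10.0.0.1 - - [22/May/2025:10:01:03 +0000] "GET /ok HTTP/1.1" 200 512
10.0.0.2 - - [22/May/2025:10:01:17 +0000] "GET /boom HTTP/1.1" 500 87
10.0.0.3 - - [22/May/2025:10:01:41 +0000] "GET /ok HTTP/1.1" 200 512
EOF

# All sample entries fall inside one minute, so a plain count suffices here.
count=$(awk '$9 ~ /^5[0-9][0-9]$/ {n++} END {print n+0}' /tmp/access.log)
if [ "$count" -gt 0 ]; then
  echo "ALERT: $count server error(s) in window"
else
  echo "OK: no server errors in window"
fi
```

In a real pipeline the count would be scoped to the trailing minute of timestamps rather than the whole file.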

Question to ask

"Would a spike in 500s wake anyone up tonight?"

Related items

Verification guide

Severity: Critical

Check automatically:

  1. Check for alert configurations:

    Prometheus AlertManager:

    # Look for HTTP error alert rules
    grep -riE "status.*(4|5)[0-9]{2}|http.*(error|4xx|5xx)|response.*code" . --include="*.yml" --include="*.yaml" --include="*.rules" 2>/dev/null
    

    Datadog:

    # Look for Datadog monitor configs
    grep -riE "datadog.*monitor|monitor.*type.*metric" . --include="*.yml" --include="*.yaml" --include="*.tf" 2>/dev/null
    

    CloudWatch:

    # List CloudWatch alarms
    aws cloudwatch describe-alarms --query 'MetricAlarms[?contains(MetricName, `5xx`) || contains(MetricName, `4xx`) || contains(MetricName, `HTTPCode`)]'
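
    If the query returns nothing, an alarm matching the 500 criterion can be created with put-metric-alarm. A sketch assuming an Application Load Balancer; the namespace, metric name, and SNS topic ARN are assumptions, and the command is only echoed here so it can be reviewed before running:

```shell
# Sketch: CloudWatch alarm firing on any target 5xx in a 1-minute window.
# The ALB namespace/metric and the SNS topic ARN are assumed placeholders.
cmd='aws cloudwatch put-metric-alarm
  --alarm-name any-http-5xx
  --namespace AWS/ApplicationELB
  --metric-name HTTPCode_Target_5XX_Count
  --statistic Sum
  --period 60
  --evaluation-periods 1
  --threshold 0
  --comparison-operator GreaterThanThreshold
  --alarm-actions arn:aws:sns:REGION:ACCOUNT_ID:oncall-topic'
echo "$cmd"
```

    Sum over a 60-second period compared GreaterThanThreshold against 0 means the alarm fires on even a single 5xx response, which matches the pass criterion below.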
    

    PagerDuty/Opsgenie:

    # Look for incident management integration
    grep -riE "pagerduty|opsgenie|victorops|incident" . --include="*.yml" --include="*.yaml" --include="*.json" 2>/dev/null
    
  2. Check Sentry or other error-tracking alerts:

    # Look for Sentry config
    grep -riE "sentry.*dsn|SENTRY_DSN|@sentry" . --include="*.js" --include="*.ts" --include="*.py" --include="*.env*" 2>/dev/null
    

Ask the user for alert configuration details: "Please provide details on HTTP error alerting:

404 Alerts:

  1. Is there an alert for high volume of 404s?
  2. What is the threshold? (e.g., >100 per 5 minutes)
  3. Where does the alert go? (Slack, PagerDuty, email)

500 Alerts:

  1. Is there an alert for 500 errors?
  2. Is the threshold sensitive enough to catch even a few 500s? (Target: any 500s within a 1-minute window)
  3. Where does the alert go?

Please provide:

  • Screenshot of alert rules, OR
  • Alert configuration file/export"
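
Once the team states its 404 threshold, it can be sanity-checked against a log sample the same way. A sketch, assuming the example threshold of 100 per 5 minutes and combined-log-format entries (the sample data and file path are invented):

```shell
# Sketch: compare 404 volume in a 5-minute log sample against a
# team-defined threshold. The file path, field position ($9), and
# threshold value are assumptions.
THRESHOLD=100
printf '%s\n' \
  '10.0.0.1 - - [22/May/2025:10:00:01 +0000] "GET /a HTTP/1.1" 404 0' \
  '10.0.0.1 - - [22/May/2025:10:00:02 +0000] "GET /b HTTP/1.1" 200 9' \
  '10.0.0.1 - - [22/May/2025:10:00:03 +0000] "GET /c HTTP/1.1" 404 0' \
  > /tmp/sample.log

notfound=$(awk '$9 == "404" {n++} END {print n+0}' /tmp/sample.log)
if [ "$notfound" -gt "$THRESHOLD" ]; then
  echo "ALERT: $notfound 404s exceed threshold of $THRESHOLD"
else
  echo "OK: $notfound 404s within threshold of $THRESHOLD"
fi
```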

Cross-reference with:

  • MON-003 (HTTP logging) - alerts need logs/metrics to trigger from
  • Section 35 (Incident Response) - alerts should route to on-call
  • Section 19 (Error Reporting - Sentry) - Sentry may provide 500 alerting

Pass criteria:

  • 404 alerting configured with volume threshold (team-defined)
  • 500 alerting configured to catch any 500s within a 1-minute window
  • Alerts route to appropriate channel (on-call, Slack, etc.)
  • Alert delivery verified (an alert has actually fired and been received)

Fail criteria:

  • No HTTP error alerting configured
  • 404 alerting missing
  • 500 alerting threshold too high (missing low-volume errors)
  • 500 alerting only at high thresholds (e.g., >100)
  • Alerts configured but no one receiving them

Evidence to capture:

  • Alert tool in use
  • 404 alert threshold
  • 500 alert threshold (should catch even a single 500 within a 1-minute window)
  • Notification channel
  • Date of last alert fired (proves it works)

Section

12. Monitoring

Observability