MON-006 critical general

Status pages and downtime alerts

Production status page exists and is accessible. Staging status page exists (recommended). Uptime monitoring configured checking health endpoints. Downtime alerts route to appropriate channel.

Question to ask

"Do your customers know before you do when you're down?"

Verification guide

Severity: Critical (production), Recommended (staging)

Check automatically:

  1. Look for status page references:

    # Search for status page URLs in docs
    grep -riE "status\.(page|io)|statuspage|instatus|cachet|uptime|status\.your" . --include="*.md" --include="*.yml" --include="*.yaml" 2>/dev/null
    
    # Check for status page in README
    grep -iE "status|uptime" README.md 2>/dev/null
    
  2. Check for uptime monitoring configuration:

    # Look for uptime monitoring tools
    grep -riE "pingdom|uptimerobot|better.?uptime|statuscake|checkly|pagerduty.*heartbeat" . --include="*.yml" --include="*.yaml" --include="*.json" --include="*.tf" 2>/dev/null
    
  3. Verify status page URLs (if found):

    # Test status page is accessible
    curl -s -o /dev/null -w "%{http_code}" https://status.example.com
    

Ask user for status page details: "Please provide status page and uptime monitoring details:

Status Pages:

  1. Does a production status page exist? (Required)

    • URL:
    • Provider (Statuspage.io, Instatus, custom, etc.):
  2. Does a staging status page exist? (Recommended)

    • URL:
    • Can be internal-only

Downtime Alerting:

  1. What uptime monitoring is in place? (Pingdom, UptimeRobot, Better Uptime, etc.)
  2. What endpoints are monitored?
  3. Do monitors check health endpoints or just HTTP 200?
  4. Where do downtime alerts go?
  5. When did the last downtime alert fire?"

Cross-reference with:

  • HEALTH-001 (Basic health endpoint) - uptime monitors should check this
  • HEALTH-002 (Deep health endpoint) - status page should reflect dependency status
  • Section 35 (Incident Response) - status page is incident communication tool
  • DEPLOY-002 (Deployment notifications) - deployment status vs uptime status

Pass criteria:

  • Production status page exists and is accessible
  • Uptime monitoring configured for production
  • Monitors check health endpoints (not just any HTTP 200)
  • Downtime alerts route to appropriate channel
  • Staging status page exists (Recommended, not required)

Fail criteria:

  • No production status page
  • No uptime monitoring
  • Monitors only check for HTTP 200 (miss dependency failures)
  • Downtime alerts not configured
  • Status page exists but not maintained/accurate

Evidence to capture:

  • Production status page URL
  • Staging status page URL (if exists)
  • Uptime monitoring tool
  • Endpoints monitored
  • Downtime alert channel
  • Date of last downtime alert

Section

12. Monitoring

Observability