HA-002 recommended High Availability

Multi-region server deployment with failover

Servers in 2+ regions/data centers OR single region with quick-failover capability; traffic can route away from failed region

Question to ask

"One region goes dark — how long until users notice?"

Verification guide

Severity: Recommended (Critical when serious money is involved)

Production servers should be deployed across multiple regions or data centers, with the ability to fail traffic over to a healthy region if one goes down.

Check automatically:

  1. Check for multi-region deployment:
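# AWS - instances across regions (describe-instances only sees one region per call, so loop over all enabled regions)
for region in $(aws ec2 describe-regions --query "Regions[].RegionName" --output text); do
  aws ec2 describe-instances --region "$region" --query "Reservations[].Instances[].{ID:InstanceId,AZ:Placement.AvailabilityZone,State:State.Name}" --output table
done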

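# Check for load balancers (ELBs are regional and this lists the current region only, so a result here does not prove global routing)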
aws elbv2 describe-load-balancers --query "LoadBalancers[].{Name:LoadBalancerName,Type:Type,Scheme:Scheme}" --output table

# GCP - instances across regions
gcloud compute instances list --format="table(name,zone,status)"

# Kubernetes nodes across zones
kubectl get nodes -o custom-columns=NAME:.metadata.name,ZONE:.metadata.labels."topology\.kubernetes\.io/zone"
  2. Check Terraform/IaC for multi-region:
# Look for multiple region definitions
grep -rE "region\s*=|availability_zone|location\s*=" --include="*.tf" . 2>/dev/null | sort -u

# Check for global load balancer resources
grep -rE "aws_globalaccelerator|google_compute_global|azurerm_(cdn_)?frontdoor|cloudflare_load_balancer" --include="*.tf" . 2>/dev/null
  3. Check Kubernetes for multi-zone:
# Node distribution
kubectl get nodes --show-labels | grep -E "zone|region"

# Pod anti-affinity rules (spread across zones)
grep -rE "topologySpreadConstraints|podAntiAffinity" --include="*.yaml" --include="*.yml" . 2>/dev/null
  4. Check for DNS failover:
# Cloudflare load balancing
curl -sX GET "https://api.cloudflare.com/client/v4/zones/{zone_id}/load_balancers" \
  -H "Authorization: Bearer $CLOUDFLARE_API_TOKEN" 2>/dev/null | jq '.result[] | {name, proxied, pools}'

# Route53 health checks (DNS failover)
aws route53 list-health-checks --query "HealthChecks[].{Id:Id,Type:HealthCheckConfig.Type}" --output table
  5. Check for container orchestration multi-region:
# ECS clusters (current region only; repeat with --region for each region in use)
aws ecs list-clusters --query "clusterArns"

# Check Fly.io regions
grep -rE "primary_region|regions\s*=" fly.toml 2>/dev/null

# Check Railway/Render multi-region config
grep -rE "region|replicas" --include="*.toml" --include="railway.json" --include="render.yaml" . 2>/dev/null
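
The zone listings from the checks above can be reduced to a region count with a small helper. This is a minimal sketch, not part of the original checklist: `zones_to_regions` and `region_count` are hypothetical names, and the suffix stripping assumes standard AWS (`us-east-1a`) and GCP (`us-central1-a`) zone naming. AWS Local Zones and Azure zone labels would need different handling.

```shell
#!/bin/sh
# zones_to_regions: read zone names on stdin (one per line), print unique regions.
# Assumption: AWS zones end in a single letter (us-east-1a -> us-east-1) and
# GCP zones end in "-<letter>" (us-central1-a -> us-central1).
zones_to_regions() {
  sed -E 's/-[a-z]$//; s/([0-9])[a-z]$/\1/' | sort -u
}

# region_count: print the number of distinct regions in the zone list on stdin.
region_count() {
  zones_to_regions | grep -c .
}
```

For example, `gcloud compute instances list --format="value(zone.basename())" | region_count` should print the number of regions in use. A result of 1 means every instance sits in a single region, so the multi-AZ and failover-plan alternatives need checking.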

Ask user:

  • "Are your production servers deployed across multiple regions or data centers?"
  • "What happens if one region/data center goes down?"
  • "How quickly can you spin up servers in a different region if needed?"

Acceptable alternatives to multi-region:

  • Single region with documented quick-failover capability (can deploy elsewhere within hours)
  • Multi-AZ within single region (less resilient but acceptable for smaller projects)
  • PaaS with built-in regional failover (Vercel, Cloudflare Workers)

Cross-reference with:

  • HA-001 (database HA - both layers need resilience)
  • MON-006 (status pages - should reflect regional status)
  • Section 34 (Rollback & Recovery - RTO/RPO)

Pass criteria:

  • Servers in 2+ regions/data centers, OR single region with multi-AZ AND documented quick-failover capability
  • Traffic can route away from a failed region (load balancer, DNS failover)
  • Failover tested or documented

Fail criteria:

  • Single-region, single-AZ deployment with no failover plan
  • "We'd figure it out if it happened"
  • Multi-region configured but no way to route traffic away from a failed region
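
The pass and fail criteria can be encoded as a single decision, which helps keep verdicts consistent across audits. The sketch below is one reading of the criteria, not an official rule: it assumes traffic routing and a tested or documented failover are required in every passing case, and the function name `ha002_verdict` and its argument order are hypothetical.

```shell
#!/bin/sh
# ha002_verdict REGIONS ZONES ROUTING FAILOVER_DOC
#   REGIONS       number of regions/data centers running production servers
#   ZONES         number of availability zones in use
#   ROUTING       "yes" if traffic can be routed away from a failed region
#   FAILOVER_DOC  "yes" if failover has been tested or documented
# Prints PASS or FAIL. Names and argument order are illustrative assumptions.
ha002_verdict() {
  regions=$1 zones=$2 routing=$3 failover_doc=$4

  # Placement: 2+ regions, or one region with multi-AZ plus a documented plan.
  placement_ok=no
  if [ "$regions" -ge 2 ]; then
    placement_ok=yes
  elif [ "$zones" -ge 2 ] && [ "$failover_doc" = yes ]; then
    placement_ok=yes
  fi

  if [ "$placement_ok" = yes ] && [ "$routing" = yes ] && [ "$failover_doc" = yes ]; then
    echo PASS
  else
    echo FAIL
  fi
}
```

Under this reading, `ha002_verdict 3 6 no yes` prints `FAIL`, which matches the "multi-region configured but no traffic routing" fail case.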

Evidence to capture:

  • Regions/zones where servers are deployed
  • Failover mechanism (load balancer, DNS, manual)
  • RTO for regional failover
  • Last failover test date (if any)
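
An observed RTO can be captured during a failover test by polling a health endpoint and counting failed probes. This is a rough sketch under stated assumptions: `measure_downtime` is a hypothetical helper, the probe command is supplied by the caller, and the result is only an approximation (failed probes multiplied by the polling interval, ignoring the time each probe takes).

```shell
#!/bin/sh
# measure_downtime PROBE_CMD INTERVAL MAX_CHECKS
# Runs PROBE_CMD every INTERVAL seconds, MAX_CHECKS times in total, and prints
# (failed probes * INTERVAL) as an approximate downtime in seconds.
measure_downtime() {
  probe=$1 interval=$2 max_checks=$3
  failures=0
  i=0
  while [ "$i" -lt "$max_checks" ]; do
    # A non-zero exit status from the probe counts as one failed check.
    if ! sh -c "$probe" >/dev/null 2>&1; then
      failures=$((failures + 1))
    fi
    i=$((i + 1))
    # Sleep between checks, but not after the final one.
    if [ "$i" -lt "$max_checks" ]; then
      sleep "$interval"
    fi
  done
  echo $((failures * interval))
}
```

A possible invocation during a test, with a placeholder URL: `measure_downtime 'curl -fsS --max-time 5 https://example.com/healthz' 5 120`. Record the printed value alongside the test date as evidence.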

Section

26. High Availability & Backups

High Availability & DR